Using Ollama to Run Local LLMs on the Raspberry Pi 5

Ian Wootten

มุมมอง 89 684

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 4 ก.พ. 2025

ความคิดเห็น • 125

@pafnutiytheartist 4 หลายเดือนก่อน ⁺¹⁵
I appreciate you showing the actual inference speed, other videos usually cut around it, making it unrealistically fast.
@sweetbb125 8 หลายเดือนก่อน ⁺¹¹
I've trie drunning OLLAMA on my Raspberry Pi 5, as well as an Intel Celeron based computer, and also an old Intel i7 based computer, and it worked everywhere. It is really behind impressive, thank you for this video to show me how to do it!
@DorukMergan-hq7jo 5 หลายเดือนก่อน
what do youb mean by behind impressive
@kewa_design 4 หลายเดือนก่อน
Extremely slow i guess @@DorukMergan-hq7jo
@HVDynamo 28 วันที่ผ่านมา ⁺¹
@@DorukMergan-hq7jo It's probably a typo and they meant to say beyond impressive.
@DorukMergan-hq7jo 23 วันที่ผ่านมา
@@HVDynamo no,ı dont know what does the typo means,thats why ı asked
@metacob 9 หลายเดือนก่อน ⁺³⁹
I just got a RPi 5 and ran the new Llama 3 (ollama run llama3).
I was not expecting it to be this fast for something that is on the level of GPT-3.5 (or above). On a Raspberry Pi. Wow.
@brando2818 8 หลายเดือนก่อน
I just recieved my pi, and I'm about to do the same thing.. Are you doing anything else on it?
@zayedc7348 4 หลายเดือนก่อน
could you please give more data about how fast it runs? like tokens per second please
I'm trying different models and having a hard time having a natural speed conversation with an LLM on a RPi 5
@flrn84791 2 หลายเดือนก่อน
Tinyllama is not on the level of GPT-3.5, let's be realistic...
@fabiosequeira8844 4 วันที่ผ่านมา
@@flrn84791 he did not ran tinyllama he ran llama3
@KDG860 11 หลายเดือนก่อน ⁺³
Thank u for sharing this. I am blown away.
@Flash136 3 หลายเดือนก่อน ⁺¹
Thank you for making this video. I've always wondered how these LLMs perform on low-end devices. Obviously, not great, but this looks promising!
@whitneydesignlabs8738 ปีที่แล้ว ⁺⁸
Thanks, Ian. Can confirm. It works and is plausible. I am getting about 8-10 minutes for multi-modal image processing with Llava. I find the tiny models to be too dodgy for good responses, and have currently settled on Llama2-uncensored as my go to LLM for the moment. Response times are acceptable, but looking for better performance. (BTW my Pi5 is using an nVME drive and a Hat from Pineberry)
@IanWootten ปีที่แล้ว ⁺²
Nice, I'd like to compare to see how much faster an nVME would run these models.
@whitneydesignlabs8738 ปีที่แล้ว ⁺⁴
If you want to do a test, let me know. I could run the same model and query as you, and we could compare notes. My guess is that processing time has more to do with CPU and RAM. but not 100% sure. Having said that large (1TB+) nvme makes storing models on the Pi convenient. Also boot times are rather expeditious. When the Pi5 was announced, I knew right away that I wanted to to add an nvme via the PCI express connector. Worth the money, IMO. @@IanWootten
@BillYovino 11 หลายเดือนก่อน ⁺³
Thanks for this. So far I've tested TinyLlama, Llama2, and Gemma:2b with the question "Who's on first" ( a baseball reference from a classic Abbott and Costello comedy skit). TinyLlama and Llama2 understood that it was a baseball reference, but had some bizarre ideas on how baseball works. Gemma:2b didn't understand the question but when asked "What is a designated hitter?" came up with an equally incorrect answer.
@IanWootten 11 หลายเดือนก่อน
Nice. I love your Hal replica. Was that done with a Raspberry Pi?
@BillYovino 11 หลายเดือนก่อน
@@IanWoottenYes, a 3B+. I'm working on a JARVIS that uses ChatGPT API and I'm interested in preforming the AI function locally. That's why I'm looking into Ollama.
@SocialNetwooky ปีที่แล้ว ⁺²⁶
As I just said on the discord server : you might be able to squeeze a (very) tiny bit of performance by not loading the WM and just interact with ollama via SSH. But great that it works as well with tinyllama! Phi based models might work well too! Dolphin-Phi is a 2.7B model.
@BradleyPitts666 ปีที่แล้ว
I don't follow? What VM? ssh into what?
@SocialNetwooky ปีที่แล้ว
@@BradleyPitts666 WM ... windows Manager.
@SocialNetwooky ปีที่แล้ว ⁺⁶
@BradleyPitts666 meh ... youtube not showing my previous (phone written) answer again, so I can't edit it, and I can't see/edit my previous answer ... so this might be a near identical answer to another answer, sorry. I blame TH-cam :P
The Edit is that I disabled even more services, and marginally faster answer.
So : WM is the Windows Manager. It uses resources (processor time and memory) while it runs, not a lot, but it's not marginal. So disabling the WM with 'sudo systemctl disable lightdm' and rebooting is beneficial for this particular usecase. Technically ,just calling 'systemctl stop lightdm' would work too, but by disabling and rebooting you make sure any services lightdm started really aren't running in the background. You can then use ollama on the command line.
If you want to use it from your main system without hooking the rpi to a monitor and plug a keyboard in it you can enable sshd (the ssh daemon, which isn't enabled by default in the pi-os image afaik) and then ssh to it, and then use ollama there (THAT uses a marginal amount of memory though). I also disabled bluetooth, sound.target and graphical.target, snapd (though I only stop that one, as I need it for nvim), pipewire and pipewire-pulse (those two are disabled using systemctl --user disable pipewire.socket and systemctl --user disable pipewire-pulse.socket).
Without any models loaded, at idle, I only have 154MB of memory used.
With that configuration tinyllama on the question 'why is the sky blue' I get 13.02 t/s on my rpi5, so nearly 1/3rd faster than with all the unneeded services
@DominequeTaylor 7 หลายเดือนก่อน
What about the new ai attachment that they announced for the pi to do ai stuff. Would this work faster?
@SocialNetwooky 7 หลายเดือนก่อน
@@DominequeTaylor as far as I know it's for visual recognition, not for llms
@nilutpolsrobotlab 9 หลายเดือนก่อน ⁺³
Such a calm tutorial but so informative💙
@isuckatthat ปีที่แล้ว ⁺³
I've been testing llamacpp on it and it works great as well. Although, I've had to use my air purifier as a fan to keep it from overheating even with the aftermarket cooling fan/heatsync on it.
@daveys 10 หลายเดือนก่อน ⁺¹
The Pi 5 is pretty good when you consider the cost, and what you can do with it. I picked one up recently for Python coding, and it runs Jupyter Notebook beautifully on my 4k screen. I might give the GPIO a whirl at some point in the near future.
@vedantdalvi7523 6 หลายเดือนก่อน
I am considering buying a Pi 5 but confused between the 4gb vs 8gb models. Any suggestions?
@daveys 6 หลายเดือนก่อน
@@vedantdalvi7523 - I have the 8GB model, no experience with the 4GB one…sorry I can’t help.
@markr9640 11 หลายเดือนก่อน
Really useful stuff on your videos. Subscribed 👍
@VikPatel01 20 วันที่ผ่านมา
what if you add a egpu to the raspberry pi5 connected via the M2 that has a pcie. I know its bit complicated setup, but im sure for you it should be easy.
@kairysisKrantas 29 วันที่ผ่านมา
I drowned into your beautiful eyes and uour voice made me forget all problems ❤
@MarkSze ปีที่แล้ว ⁺⁴
Might be worth trying the quantised versions of llama2
@BenAulbrook ปีที่แล้ว ⁺³
I finally got my Pi5 yesterday and already have ollama working with a couple of models. But id like to provide a text to speech for the output on the screen having a hard time wrapping my brain around it how it works... like allowing the Ollama functions from the terminal to turn into audible speech.. but so many resources too pick from and also just getting the code/scripts working, i wish it was easy to install an external package and allow the internal functions to just "work" without having to move files and scripts around it becomes confusing sometimes.
@Wolkebuch99 ปีที่แล้ว
Well, how about a pi-cluster where one node runs ollama and one runs a screen reader ssh'd into the ollama node? Could add another layer and have another node running NLP for the screen reader node, or a series of nodes connected to animatronics and sensors.
@davidkisielewski605 11 หลายเดือนก่อน ⁺²
You can run meta whisper alongside your model from what I read. t-t-s and s-t-t
@Vhbaske 10 หลายเดือนก่อน
In the USA Digilent also has many Raspberys5 available!
@allurbase ปีที่แล้ว ⁺¹
How big was the image, maybe that affected the response time? Very cool, although not convinced by tiny-llama or the speed for a 7B model, but still crazy we are getting close. You should try something with more power like a Jetson Nano. THanks!!
@IanWootten ปีที่แล้ว
Less than 400KB. Might try a jetson nano if I get my hands on one.
@Augmented_AI ปีที่แล้ว ⁺⁵
How do we run this in python, so for voice to text and text to speech for a voice assistant
@vishwanathasardeshpande7092 6 หลายเดือนก่อน ⁺²
Will the performance improve by adding AI accelerator like hailo 8
@dinoscheidt ปีที่แล้ว ⁺⁴
Would love to know if the google coral board would provide a substantial improvement. If Ollama can even utilize that. Also, how it would compare to a jetson nano. Nonetheless: Thank you very much for posting this. Chirps to the Birds ❤️
@IanWootten ปีที่แล้ว ⁺¹
That would be great to try out if I could get my hands on one.
@AlwaysCensored-xp1be 11 หลายเดือนก่อน
Been having fun running different LLM. The small ones are fast, the 7B ones are slow. I have Pi5 8G. The small LLMs should run on a Pi4? Tinyllama has trouble adding 2+2. They also seem Monotropic, spiting out random vaguely related answers. I need more Pi5 so I can network a bunch with different LLM on each.
@pattern 28 วันที่ผ่านมา
Not sure what I did wrong or if llama has just gotten too big for PI but when i followed these steps for llama 2 it’s telling me it’s too large (8.2g required) and I only have 7.8g
@101kawsar หลายเดือนก่อน ⁺¹
I wonder how it would do with those AI accelerators
@ozzy-az หลายเดือนก่อน
me too
@davidkisielewski605 11 หลายเดือนก่อน ⁺³
Hi! I have the m.2 VMe hat and I am waiting for my coral accelerator. Does anyone else run with the accelerator and how much does it speed things up? I know what they say it does, but I am interested in real-world figures. I'll post when it arrives from blighty.
@eMgotcha77 7 หลายเดือนก่อน
Sorry to burst the bubble but Coral will not help. It has only 1G DRAM. It's also limited to Tensorflow Lite AND each model needs to be converted with an edge-tpu-compiler which limites the possible layer types even more.
@HifiCentret 2 หลายเดือนก่อน
Can Ollama benefit from an NPU hat? The Hailo 26 TOPS for instance.
@donmitchinson3611 9 หลายเดือนก่อน
Thanks for video and testing. I was wondering if you have tried setting num_threads =3. I can't find video of where I saw this but I think they set it before calling ollama. Like environment variable. It's supposed to run faster. I'm just building a rpi5 test station now
@1091tube 11 หลายเดือนก่อน ⁺¹
could the compute process be distributed, like a grid compute? 4 raspberry pi?
@IanWootten 11 หลายเดือนก่อน
Not really - a model file is downloaded to the machine using Ollama and brought into memory.
@m41ek ปีที่แล้ว ⁺¹
Thanks for the video! What's your camera please ?
@ray-charc3131 2 หลายเดือนก่อน
Ollama2 no longer runs on my raspberry pi5 8G, there is error message that models requiring system memory (8.4G) than is availabel(7.4G), can it be resolved or anything I got wrong?
@alexmcd378 4 หลายเดือนก่อน
I wonder if this is compatible with the AI kit you can add to the pi5 for many times faster AI in other tasks
@Lp-ze1tg 11 หลายเดือนก่อน
Was this pi 5 consisted of microsd card or external storage?
How big the storage size is suitable?
@IanWootten 11 หลายเดือนก่อน
Just using the microsd. I'd imagine speeds would be a fair bit better from USB or nvme.
@technocorpus1 9 หลายเดือนก่อน
Awesome! I want to try this now! Can someone tell me if it necessary to install the model on an exterior SSD?
@IanWootten 9 หลายเดือนก่อน
Not necessary, but may be faster. All the experiments here I was just using a microsd.
@technocorpus1 9 หลายเดือนก่อน
@@IanWootten That's just amazing to me. I have a Pi3, but am planning on upgrading to a pi5. After I saw your video, I downloaded ollama onto my windows pc. It only has 4 GB RAM, but I will still able to run several models!
@razvanstroe7044 6 หลายเดือนก่อน
If you get a tpu could you run a bigger llm and even more efficiently?
@IanWootten 6 หลายเดือนก่อน
I don't think so, since Ollama isn't built to leverage it.
@nmstoker ปีที่แล้ว ⁺²
Great video but it's not a good idea to encourage use of those all-in-one curl commands. Best to download the shell script, ideally look over it before you run it, but even if you don't check it first at least you have the file if something goes wrong
@IanWootten ปีที่แล้ว
Yes, I've mentioned this in my other videos and have in my blog on this too.
@nmstoker ปีที่แล้ว
@@IanWootten ah, sorry hadn't seen that. Anyway thanks again for the video! I've subscribed to your channel as looks great 🙂
@dibu28 10 หลายเดือนก่อน
also try MS Phi2 for Python and Gemma-2b
@Marco916SacTown ปีที่แล้ว ⁺²
This is a good start - I bet the Raspberry Pi makers have a Pi 6 in the works with a better GPU to really drive these LLM's.
@IanWootten ปีที่แล้ว ⁺³
No doubt they will do. But, the Pi 4 was released 4 years ago, so you might have to wait a while.
@madmax2069 11 หลายเดือนก่อน
That's wishful thinking.
You might as well try to figure out how to run an ADLINK Pocket AI on a Pi 5.
@ioaircraft 7 หลายเดือนก่อน
tryin, 10 turing pi's with 40 nvidia jetson orin's on them, with primary linux computer, so 41 systems total, clustered using archer and kubermites. it will def rocket then... 4,000-6,000 tops, 1.2 terabytes of ram.
@IanWootten 7 หลายเดือนก่อน ⁺¹
That sounds crazy
@user-vl4vo2vz4f ปีที่แล้ว ⁺⁴
please try adding a coral module to the pi and see the difference
@madmax2069 11 หลายเดือนก่อน
A Coral module is not suited for this. It lacks the available Ram to really partake in helping an LLM run.
What you really need is something like an external GPU, something like one of those ADLINK Pocket AI GPUs to hook up to the system, BUT it only has 4GB Vram.
@Bigjuergo ปีที่แล้ว
can you connect it with speach recognition and make tts output with pretrained voicemodel (*.index and *.pth) file?
@IanWootten ปีที่แล้ว
You probably could, but it wouldn't give a quick enough response for something like a conversation.
@whitneydesignlabs8738 ปีที่แล้ว
I am working on something similar, but using a Pi4 for STT & TTS (and animatronics) and a dedicated Pi5 for running the LLM with Ollama like Ian demonstrates. They are on the same network and use MQTT for communication protocol. This is for robotics project.@@IanWootten
@isuckatthat ปีที่แล้ว
I've been trying to do this, but its impossibly hard to get tts setup.
@donniealfonso7100 ปีที่แล้ว
@@isuckatthat Yes not easy. I was trying to implement speech with Google Wavenet using TH-cam Data Slayer example. I put the key reference in pi's user.profile as export. Script runs okay now creating the mp3 files but no speech so pretty much gave up as other fish to fry.
@nmstoker ปีที่แล้ว ⁺¹
@@isuckatthathave you tried espeak? It would give robotic quality output but uses very little processing and works fine on a Pi
@fontenbleau ปีที่แล้ว
maybe better try genius Mozilla LLM container in one file project LLAMAFILE. I was able to run it on my 2011 laptop(some ancient Gpu) with Windows 8 a LLAVA in llamafile, which is also an image scanner llm. Ollama i've tested can't run on win 8.
@galdakaMusic 7 หลายเดือนก่อน
What about renew this video with the new Rpi Hat AI? Thanks
@IanWootten 7 หลายเดือนก่อน ⁺¹
Could do, but I don't think Ollama would be able to leverage it, plus it's not out yet.
@anonymously-rex-cole 11 หลายเดือนก่อน
is that realtime? is that how fast it replies?
@IanWootten 11 หลายเดือนก่อน ⁺²
All the text model responses are in realtime. I've only made edits when using llava since there was a 5 min delay between hitting enter and it responding...
@sanjaydamodaran3219 14 วันที่ผ่านมา
Does having the AI hat help ?
@chetana9802 ปีที่แล้ว
now lets try it on a cluster or ampere altra?
@IanWootten ปีที่แล้ว
Happy to give it a try if there's one going spare!
@theguywhoasked0.0 3 หลายเดือนก่อน
wow that's even faster than my pc idk how
@GuillermoTs 11 หลายเดือนก่อน
Is possible to run in a Raspberry Pi 3?
@IanWootten 11 หลายเดือนก่อน
Maybe one of the smaller models, but it'll run a lot slower than here
@NicolasSilvaVasault 9 หลายเดือนก่อน
that's super impressive even if it takes quite a while to respond, is a RASPBERRY PI
@IanWootten 9 หลายเดือนก่อน
EXACTLY!
@AlexanderGriaznov 9 หลายเดือนก่อน ⁺¹
Am I the only one who noticed tiny llama response to “why sky is blue?” was shitty? What the heck rust causing blue color of the sky?
@IanWootten 9 หลายเดือนก่อน
Others have mentioned it in the comments too. It is a much smaller model, but there are many others to choose from (albeit possibly slower).
@lifeofanitguy 17 วันที่ผ่านมา
Good one so probably Raspberry Pi is not worth to run local LLM
@IanWootten 17 วันที่ผ่านมา
Not if you have something more powerful, but the fact that it runs at all is pretty impressive.
@TreeLuvBurdpu ปีที่แล้ว
What if you put a compute module on it or something?
@IanWootten ปีที่แล้ว ⁺¹
A compute module is a RPi in a slightly different form. So I think it would behave the same.
@marsrocket 11 หลายเดือนก่อน
What’s the point of running a LLM locally if the responses are going to be nonsense? That blue sky response was ridiculous.
@IanWootten 11 หลายเดือนก่อน ⁺¹
The response for that one model/prompt may have been, but there are plenty of others to choose from.
@zachhoy ปีที่แล้ว ⁺¹
I'm curious why run it on a Pi instead of a proper PC?
@IanWootten ปีที่แล้ว ⁺⁶
To satisfy my curiosity - to see whether it's technically possible on such a low powered, cheap machine.
@zachhoy ปีที่แล้ว ⁺¹
thanks for the genuine response :D Yes I can see that drive now. @@IanWootten
@TreeLuvBurdpu ปีที่แล้ว
There's lot of videos of people running it on their PC, but if you use it all the time it will hog your PC all the time. There's several reasons you might want a dedicated host.
@justADeni 12 วันที่ผ่านมา
It's great that the model runs, but it's super wrong :D
@realSkyfr 6 หลายเดือนก่อน ⁺¹
the model i have found works fastest is qwen 2 0.5b. the only downside is it thinks trump is a night and pope
@Tarbard ปีที่แล้ว ⁺¹
I liked tinydolphin better than tinyllama.
@IanWootten ปีที่แล้ว
Not tried it out yet.
@pengain4 10 หลายเดือนก่อน
I dunno. It seems cheaper for buy actual second-hand GPU to run Ollama on it than to buy RPi. [Partially] a joke. :)
@IanWootten 10 หลายเดือนก่อน ⁺¹
Possibly if you already have a machine. This might work out if you don't. Power consumption is next to nothing on the Pi too.
@flrn84791 2 หลายเดือนก่อน
This is LLMs running on CPU... yikes
@therealwhite 5 หลายเดือนก่อน
Anyone else read the title as Obama
@blender_wiki ปีที่แล้ว
Too expensive for what it is. Interesting proof of concept but absolutely useless and inefficient in a production context
@c-LAW 4 หลายเดือนก่อน
I can't understand language. Speak English like a real American.
:)
@IanWootten 4 หลายเดือนก่อน ⁺¹
I'll have you know I speak the Queen's English.
@arkaprovobhattacharjee8691 ปีที่แล้ว ⁺³
This is so exciting! Can you pair this with a Coral TPU ? and then check the inference speed ? I was wondering if that's possible
@madmax2069 11 หลายเดือนก่อน ⁺¹
The coral TPU isn't suited for this, it lacks the available Ram to do any good with an LLM. What you'd need is one of those ADLINK Pocket AI GPUs but it only has 4GB Vram.
@arkaprovobhattacharjee8691 11 หลายเดือนก่อน
@@madmax2069 makes sense.
@BradleyPitts666 ปีที่แล้ว ⁺³
I have cpu usage at 380% when ollama2 responding. Anyone else tested?

ต่อไป

เล่นอัตโนมัติ

This solves the Raspberry Pi’s BIGGEST problem - Pineboards POE+