Viewers should probably note that the actual text generation is much slower and the video is sped up massively (look at the timestamps). This is particularly true for the multimodal models like LLaVA, which can take a couple of minutes to produce that output. These outputs are also quite cherry-picked; a lot of the time, these quantized models give garbage outputs.
Not to mention most of the script of this video is AI generated...
Sneaky, given that the blog post with the details is a PDF that costs 7 bucks.
Thank you! I installed it on a faster SBC and it's slower than in the video xD I was already wondering whether it was sped up, but there is no info that it's sped up?!
@@lonesome_rouleur5305 I mean, it's not hidden; you can easily see that it's sped up. The time is going up like crazy in htop.
@@lonesome_rouleur5305 there are loads of tutorials on how to do it for free. Let him make a living. Otherwise you'll be complaining that he doesn't upload because he has to get a job lol
but what if, instead of running it on a Raspberry Pi, we run it on a high-end computer, like an M3 Mac?
Clickbait I came here to check the display lol
Same 😂
Yup
Yup
Thought this was a Rabbit R1 competitor, looks dope
Yup
The fact that these can run on Raspberry Pi is crazy. I always assumed you needed a pretty beefy GPU to do any of this
Google even made an accelerator device specifically for the Pi to run AI models on; it's like 99 dollars or something
If you're ready to wait a few hours or days, models can run on pretty much anything
Is it just me, or is bro recording this while a little baked?
Is there something wrong with being a little baked?
Baking Raspberry Pie 🥧 😋
He’s focused and not getting in his own way; works for me!
Got about 14 LLMs running on my Pi 5. This is the vid that started my dive down the AI rabbit hole. You can have multiple Ollama/LLMs running at once as long as only one is answering a prompt.
Hey, have you seen the Microsoft paper that changes how the binary runs, to make them run super fast? Any way to train a bottom-up AI with you as the human reinforcement learning?
What's the most effective LLM for custom data sets? Can you train it to stick to guidelines?
That means either they aren't running or they're not loaded into RAM, which means... they're basically not running?
At least that's my assumption. When it loads, it loads into VRAM or RAM. If you had enough RAM you theoretically could, but then why wouldn't you just run a much larger, more accurate model?
Llama 2 got the 1952 POTUS question wrong. Harry S. Truman was POTUS in 1952. Eisenhower won the 1952 election, but wasn’t inaugurated until 1953. Small, but an important detail to note.
Good catch. When he typed the question, I answered Truman, but when Eisenhower came up, I thought my age was catching up to me.
r/presidents moment
nice video!
a little correction at 10:41, privateGPT doesn't train a model on your documents but does something called RAG - basically, smartly searches through your docs to find the context relevant to your query and pass it on to the LLM for more factually correct answers!
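A minimal sketch of that retrieve-then-prompt flow, assuming a local Ollama server on its default port and the sentence-transformers package; the document chunks, the answer_with_rag helper and the model name are made-up examples, not PrivateGPT's actual code:

```python
# Minimal RAG sketch (illustrative only): embed doc chunks, retrieve the most
# relevant ones for a question, and pass them to a local model as context.
# Assumes `ollama serve` is running locally and sentence-transformers is installed.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small CPU-friendly embedder

chunks = [
    "The Pi 5 has a quad-core Cortex-A76 CPU.",
    "Ollama serves models over a local HTTP API on port 11434.",
    "LLaVA is a multimodal model that can describe images.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def answer_with_rag(question: str, top_k: int = 2) -> str:
    # Retrieve: cosine similarity is just a dot product on normalized vectors.
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(chunk_vecs @ q_vec)[::-1][:top_k]
    context = "\n".join(chunks[i] for i in best)

    # Generate: inject the retrieved context into the prompt; no training involved.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=600,
    )
    return resp.json()["response"]

print(answer_with_rag("How does Ollama expose models?"))
```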
Thanks, nerd.
I really like your explanation...
Could I also ask you, what does it do to the LLM? Are LLMs teachable, or are they supposed to get trained on the information over and over until it's mastered?
For example, if I tell an LLM that 1+1 = 2,
will it remember it forever or do I need to repeat it many times?
@@issair-man2449 I mean, in theory they just generate one probable word after another, following the given prompt. The probability of the next word depends on the training database and on the "training setup" in general; once the model is trained, the "weights" that decide the probability of the word generated in a given context are fixed. So if you tell a model "from now on you live in a world where 1+1 = 3", it will probably keep saying (in that conversation) that 1+1=3, because that's the most probable thing to say after you made that assertion. Btw, if you want to start a new conversation you will need to specify it again in the new prompt, because the databases used to train LLMs usually contain data that says "1+1=2". Alternatively, you could fine-tune the model (basically adding new data to train it to respond in a certain way to a specific stimulus); that way the "weights" will be modified and you'll end up with a (slightly) different model.
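A toy illustration of that fixed-weights / next-word-probability idea; the four-word vocabulary and the logits below are invented numbers, not real model outputs:

```python
# Toy next-token sampling (illustrative only): once trained, the model's weights
# produce a fixed probability distribution over the next token for a given context;
# sampling from that distribution is why answers can vary from run to run.
import numpy as np

vocab = ["2", "3", "fish", "banana"]
# Pretend these are the logits a frozen model assigns after the prompt "1 + 1 =".
logits = np.array([4.0, 1.5, -2.0, -3.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```

Changing the context (e.g. telling it "1+1=3 from now on") shifts those probabilities for that conversation, but the underlying weights stay unchanged unless you fine-tune.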
Welcome, Douche.@@ArrtusMusic
@@issair-man2449 hey, sorry, just saw your question!
So LLMs themselves do not actually remember any stuff (i.e. no long-term memory), but there might be applications built on top of them leveraging techniques similar to RAG.
That being said, you can actually "fine-tune" a model with certain information or a replying style to customize an LLM to your requirements, which is usually a much more complex process.
You can run the 13B models with 8 GB of RAM. Just add a swap file in Linux of, e.g., 10 GB. It's slower, but will still run with Ollama and other variants.
that's a killer for your SSD/SD card.
@@_JustBeingCasual Digitally destroy a microSD card any% speedrun
or use something that has a GPU, like a Jetson Nano
you should mention they are quantized and are pretty bad; not only that, but they take several minutes to reply versus less than 10 seconds on a mid-range GPU
Time to run is the biggest issue.
What if someone attached a Coral USB Accelerator to their Pi?
Sorry for necro posting on your 8 month old comment.
This is absolutely fascinating! Thank you so much for sharing. It was just 1 year ago we were blown away by this multi billion dollar tech that now can run on a small raspberry pi. It's an amazing exploration you did here. Please continue for good.
Fantastic Tutorial! Looking forward to more from you.
It would be fascinating to work out a way to cause multiple small edge computers hosting LLMs to work in synchrony. A cluster of Pi 5 SBC's could narrow the memory gap required to run larger models, providing more accurate responses if not measurably better performance. There would be a lot of tradeoffs for sure, since the bulk of these currently seem to be created to run within a monolithic structure (composed of massively parallel hardware GPUs) which does not lend itself as well to "node-based" distributed computing on consumer-grade processing and networking hardware, so I wonder if the traffic running across the network meshing multiple processors would create bottlenecks, and if these could operate on a common data store to eliminate attempts to "parse" and distribute training data among nodes?
I have the feeling that the next step toward AGI will involve using generative models in "reflective layers" anyway, using adversarial models to temper and cross-check responses before they are submitted for output, and perhaps others "tuned to hallucinate" to form a primitive "imagination", which perhaps could form the foundation for "synthesizing" new "ideas", for deep analysis and cross-checking of assumed "inferences", and potentially for providing "insights" toward problem solving where current models fall short.
As one of my favorite YouTube white-paper PhDs always says, "What a time to be alive!"
Thanks for a great production!
Problem is the maximum bandwidth, these models basically need a crap ton of ram, and sharing a model across multiple pis is very difficult though not impossible
@@peterdagrape Got it; diminishing returns. With so many Pi cluster configurations out there, I figured there was a reason the Pi people weren't all over this already.
Thanks for the video. I have also been experimenting with various LLMs on the Pi5, locally. Have best results with Ollama so far. I am also running these pis on battery power for robotic, mobile use. I am pretty close to successfully integrating local speech to text, LLM & text to speech using 2 pi5s, including animatronics. Fun stuff.
which STT model are you using? whisper?
I am literally dreaming about doing this right now. I have a pi5 on the way. Let me know how it goes!
Most guys running LLMs with Home Assistant handle all the voice recognition and text-to-speech on the Pi and run the LLM off a local API, so I'd assume two Pis would run fine. Not sure I'd ever play with a base model or even a restrained model though. There's plenty of dope 7B models available, including unaligned models.
I actually run a ping every 60 seconds, and when Internet is available, I run some APIs, but when it is not available, it falls back to local. So for STT, I'm using Google's free service when Internet is available and will use Whisper when there's no Internet. Whisper is actually one of the steps I have not installed yet, but will soon. The Google STT is working. Also using the Eleven Labs API and pyttsx3 the same way for TTS (Internet/no Internet). This part is working and tested. Same with the LLM, locally working and tested. A Pi 5 handles the local LLM (its only job), a Pi 4 handles speech in and out, plus simple animatronics. Another Pi 5 manages overall operations and runs the MQTT server. All communicate the message data over MQTT messages on the robot's internal wifi. @@raplapla9329
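A rough sketch of that online/offline fallback pattern, using a plain TCP reachability check; google_stt, whisper_stt, elevenlabs_tts and pyttsx3_tts are hypothetical placeholders for whatever the robot actually calls:

```python
# Sketch of the "use cloud APIs when online, fall back to local when offline"
# pattern described above. The STT/TTS helpers referenced below are placeholders,
# not real project code.
import socket
import time

def internet_up(host="8.8.8.8", port=53, timeout=3) -> bool:
    # Cheap reachability check: try to open a TCP connection to a public DNS server.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def transcribe(audio):
    # Online: Google's free STT; offline: a local Whisper install (both hypothetical here).
    return google_stt(audio) if internet_up() else whisper_stt(audio)

def speak(text):
    # Online: Eleven Labs API; offline: pyttsx3 (both hypothetical here).
    return elevenlabs_tts(text) if internet_up() else pyttsx3_tts(text)

while True:
    print("online" if internet_up() else "offline - using local models")
    time.sleep(60)  # re-check once a minute, like the ping loop described above
```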
Interesting! I also have a Pi3 running Home Assistant, with plans to integrate it into the robot architecture. My current issue with Home Assistant is I can't seem to get Node Red installed onto it. The robot uses Node Red, and I would love to make use of the GPIO function in Node Red on Home Assistant with all this. But stuck on Node Red...@@ChrisS-oo6fl
Great vid, been wanting to know this for ages.
00:03 Exploring LLM capabilities on the Raspberry Pi 5.
01:44 Setting up performance storage for advanced LLMs on Raspberry Pi 5.
03:29 Running advanced LLMs on Raspberry Pi demonstrates impressive image analysis and flexible recipe generation.
05:28 Advanced LLMs on Raspberry Pi 5 offer decent speeds and practical usability.
07:17 Running AI models on Raspberry Pi shows impressive coding assistance capabilities.
09:37 Exploring LLM requirements and use of private models on Raspberry Pi.
11:20 The advanced LLMs can answer queries based on source documents.
13:14 Semiconductors drive technological advancement and local LLMs retain vast knowledge.
the memory doesn't line up with the models you are loading; I'm not seeing any changes in your memory when swapping models... I assume these are GGUF models? And they appear to be running faster than what an RPi 5 is capable of...
but if you buy the $6 guide, all will be explained
@@AustinMark I am guessing you are saying it's a bullshit demo? lol
It would be slower but I'm curious if setting up ZRAM or increasing the cache size with an SSD or NVME drive might be what's needed to run the larger language models.
You can run them that way, the issue is that for each request you would have to wait for tens of minutes. I tried really big models and you can’t call it chatting anymore.
use an Nvidia Jetson Nano or something similar. The GPU is waaay better for running LLMs
Dude, you are doing gods work 🤘
Wow! Much appreciated, thanks for the video, subbed!! Keep it up.
It would be useful if we can add more ram to the pi5's m.2 slot so we can run the 13B models
I've been waiting for this video for months! 😆Thanks for putting it together!
Oh, that's just awesome. Edge AI. Just confirm if you would… the Google voice was not generated in real time with a webhook or API, right?
No, generation time was very slow. That had to have been put together in post production
rmdir is not recursive and requires the directory to be empty.
I want to know about the latency. Can it be fast enough for a real-time conversation?
What was the speed(tokens/sec) for these models have you recorded it somewhere?
Can you add the coral AI m.2 accelerator to the pi 5 and test it yet?
This presents a very interesting use case.
Is it possible to feed technical manuals into one of these models, and then ask them specific questions about the content of the manuals?
It would be really neat if you could take a picture of an error code from a machine, send that pic to the AI model and then have it provide information about the errors or faults
Yes, local gpt if you want to actually try it, but hallucinations are a problem.
What is the case with the monochrome screen with text displayed? How do you do that??
Thx bro, I was looking for something like that
just came across your video, great insights! just gained a sub
Awesome video! Thank you ;-) My question is... how could we run these LLMs locally and at the same time have them access the internet to search for stuff that they don't have?
I think your description of how PrivateGPT works is wrong. I think it stores the texts in a vector DB and then uses a different model to query the DB with your prompt; the DB returns some text that is injected into the context along with your prompt, using the model that you have chosen. Please correct me if I am wrong, I just had a quick look at the sources.
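Roughly what that store-then-retrieve step can look like, sketched here with ChromaDB as a stand-in vector store; PrivateGPT's real components differ, and the error-code documents below are invented:

```python
# Illustrative vector-DB flow (not PrivateGPT's actual code): store document chunks,
# then retrieve the chunks most similar to the user's question and build a prompt.
# Assumes the chromadb package; it uses a default local embedding model.
import chromadb

client = chromadb.Client()  # in-memory instance, fine for an example
docs = client.create_collection(name="docs")

docs.add(
    ids=["1", "2"],
    documents=[
        "The pump shows error E42 when the inlet filter is clogged.",
        "Error E17 means the motor thermistor reads out of range.",
    ],
)

question = "What does error E42 mean?"
hits = docs.query(query_texts=[question], n_results=1)
context = "\n".join(hits["documents"][0])

# The retrieved text is injected into the prompt; the LLM itself is never retrained.
prompt = f"Use this context to answer:\n{context}\n\nQuestion: {question}"
print(prompt)
```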
Looks like this was running fully on CPU. Can this workload not run on the Pi GPU?
Possible with clBLAS, but won't be faster. Can offload the CPU at best
I watched this three times lol. I love this. Thank you for this, and do you think the Raspberry Pi 5 is the best single board for the job, or would the ZimaBoard compare just as well, if not better?
Also, since you had repurposed a wifi adapter, would you have an idea how to tear down old PCs and laptops and combine the hardware to create the VRAM needed for an upgrade like this? Probably more complex than it needs to be, but I've got a whole bunch of old computers with junk processors and low RAM etc. by today's standards, but feel like you could repurpose a lot of the stuff with a different board, or if we just flashed Windows off 😂 and used the giant motherboard and maybe even part of the laptop screen or something, idk lol. A way to combine multiple processors or something to create a Frankenstein that works well lol
Or another side project: make a control box for a golf simulator. Basically just buttons to map to a keyboard and a decent housing for the thing. Maybe your box is for an arcade emulator box, or controls your smart home or sound setup, idk 🤷♂️
Wow, never thought a Pi could perform. I was thinking of trying this with a Jetson
So you are saying you invented an offline encyclopedia, we’ve come full circle.
what is the output token speed? Tokens per second on the RPi 5?
Yeah, my question also. If I use some of these 7B models on my M1, the tokens/sec is just not fast enough, so I end up resorting to a model behind an API (which is faster) for things like coding. Still excited for other, more data- and privacy-sensitive use cases, or where latency permits running them on my Pi. 8GB versions were sold out, last I checked...
Awesome work! It will be interesting to see how Mistral AI's new ~GPT-4-equivalent performs on Pi edge compute.
Funnily enough, I was looking for this today. Nice!
ollama looks like a very helpful pull ty for that. I’ve been looking for a couple weeks on training with coral tpu. Coral having so many dated dependencies breaks pip every time (for me, a dude who isn’t the smartest.) Next run at it will be w conda and optimum[exporters-tf].
I was thinking for the past month of trying this, but the Edge TPU's bandwidth plus my scepticism about any successful conversion to TFLite held me back. Never knew the Pi's CPU was that capable. Anyway, what inference speed (in tokens per second, roughly) did you get with Mistral 7B?
Thanks! very useful!
Thank you for sharing this information, it is great to have a local LLM and it was quite easy to set up after all.
I did not know that there are so many models available.
especially having an uncensored LLM. Those might even be illegal one day because of their power
Thanks for the video. It was interesting to see what the pi5 can do.
I do think, however, it's a huge mistake and misinformation to say that LLMs contain any of the information they were trained on. The models are trained to finish a sentence, to guess what the next word is, and do not contain any of the actual training data. I feel like this is important, so that we know how to trust LLMs properly.
LLMs, including GPT, do memorise things, though they're not built directly for that purpose. Try entering this into ChatGPT:
Finish this sentence: "As Mike Tyson once said, "
You can guess what it responds. If you have strong weights for words in a sequence that match a given article or quote - that appears thousands of times across the web/training data - it's effectively memorised. Look into the nytimes lawsuit.
You can absolutely get models to spit out training data with text completion and the right parameters. In fact, most "censored" models will even give up the "censored" bad-think ideas that they're not supposed to give you when you know how to prompt them to do so, and you already kinda touched on the reason WHY you can do it
@@bakedbeings You're right that you can extract knowledge (sort of the whole point of an LLM). I only mean to highlight the differences in how a model "remembers" things. It's closer to how humans remember than to actual computer memory. There is also a random seed for most models that can change the output.
This looks like a project worth exploring. Although the limitation of AI is that it sources data accumulated on internet and so is subject to biases which leads to inaccuracies. I'm sure however that there would possibly be a way to clean up data for accuracy if another unbiased reference was easily available.
I actually was thinking abt putting a model on a Raspberry Pi, looks like you beat me to it, but what abt putting the Raspberry Pi on a drone and getting the AI to fly it???
I was thinking the same!😂
I know imagine having like this drone AI army that you can command its kinda like jarvis when Tony told him to send all the iron man suits in iron man 3
I love how he delivers punchlines
Where can I buy the tiny display for the Raspberry Pi 5?
Great video but huge shame you didn't show how long they each take to process, before responding....
great question
That would ruin a surprise...
What model screen is that that you are using for your raspberry pi?
The scientific revolution in the area of advanced mathematics and algorithms is just amazing these days. ❤❤❤
How could we use a bunch of Raspberry Pi clusters with fast memory, parallelized, to run Mixtral 8x7B? Is that even possible?
Llama2, you missed one question. 7:21 the US president in 1952 was NOT Dwight David Eisenhower; it was Harry S. Truman. Eisenhower won the election in November 1952, and was then inaugurated on January 20,1953.
And Pérez wasn't president in 1980...
The clear lesson here is that this is software about credibility, not accuracy. It's just as smart as the not so smart sources on which it was trained, garbage in, garbage out. At least with Wikipedia, there are checks and balances of people with differing opinions having access to make corrections. Not so with LLMs. Fact checking costs extra.
0:33 Is the thing in your hand what's on the thumbnail? And does it have a screen like in the thumbnail, or was it an edit?
Was the tiny Llama not out yet? Thinking about doing it with that
The speed is *perfect*. Now run it on a green CRT and give those little sound effects as the words come out at reading speed and it'll be just like being in a Hollywood movie.
Which recording tool you are using?
rmdir won't delete directories recursively that way btw. You also don't need root if you own the empty dir. At least you can try it easily enough
you should mention the time it takes to give the response
Hi, I've got some questions about the content in the course. Where can we contact you?
Dude, you should design an end to end doomsday/prepper raspberry pi LLM machine! This is honestly such amazing work. It looks like you’ve already developed the scaffolding and initial prototype for this type of device.
I wonder if you could build an all in one package with a reasonable cost that does local LLM inference with a larger model instead? That would be so awesome. One of the most useful features would probably be creating an easier method to input queries. I wonder if it’s feasible to use a speech to text model.
What is an LLM?
Thank you :) I love your content, Data Slayer :)
How long did it take for LLaVA to return results from the selfie? Lotta use cases there alone. Imagine you're a spy looking for a particular person. You're walking around in public with your LLaVA LoRA model taking a pic a second. Neat
You'd also have to think about power draw... how long could a spy walk around town taking as many pics as possible, running on a couple of cheap powerbanks? Just spitballing. Subscribed.
At the start of the video you were using a terminal that showed ports and other details of the Pi 5. How did you do that? I am new to Raspberry Pi and want to know if it is software or something else.
Enable SSH on your Raspberry Pi, then SSH over port 22 from your laptop to the Raspberry Pi. The terminal you're seeing is him controlling the RPi from his laptop.
Dude, LLMs are great to see on small boards. Any possibility of running AI image generation using Stable Diffusion, at least with base models?
there are a bunch of nvme hats out there, but a lot of people are having problems getting them to work. issues with booting, recognizing, boot order, compatibility problems etc.
Hey, bro! Could you please make a video on how to install this? I'd really appreciate it!
What's a good approach to using LLaVA if we're doing surveillance video processing instead of still images?
What display is that in the thumbnail?
The only reason I clicked on the video 😂
I wonder how much more work it would take to add a microphone to it so we could talk to it. I'm fairly new to software development so I'm sure someone probably figured that out already.
Probably not much. When GPT first came out I had it walk me thru building an app in Python that did just that, I pressed a button and spoke, it processed it into text using a google API and then sent it to GPT to get an answer. For self containment it would need some other voice to text library, but again GPT could probably talk you thru building it. I'm not a dev, I just tinker and was able to make it work.
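A sketch of that push-to-talk flow, assuming the SpeechRecognition package (plus a working microphone) and a local Ollama server; the listen_and_ask helper and model name are illustrative, not the app described above:

```python
# Rough push-to-talk sketch: record speech, transcribe it with Google's free web API
# (via the SpeechRecognition package), and send the text to a local Ollama model.
# This mirrors the flow described above but is not the commenter's actual code.
import requests
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_and_ask(model="mistral"):
    with sr.Microphone() as source:
        print("Speak now...")
        audio = recognizer.listen(source)
    question = recognizer.recognize_google(audio)  # online STT; swap in Whisper for offline
    print("You said:", question)

    # Send the transcribed question to a local Ollama server (assumed default port).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": question, "stream": False},
        timeout=600,
    )
    print("LLM:", resp.json()["response"])

if __name__ == "__main__":
    listen_and_ask()
```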
@@onewheelpeelproductions470 That's really awesome. Looking forward to seeing more of your tinker videos. I've been wanting to dip my toes into Python but wanted to build something that's useful for me. Thanks! Keep up the great work.
Hm, in Ollama there seems to be a tendency to use way less RAM than the model should actually use. Or at least htop did not seem to pick up on the substantial increase in memory use one would expect from loading a 7B model.
Can anybody explain why?
I saw the same for Mixtral on my laptop; it just ran even though the RAM was only occupied with about 3.7 GB instead of the ~30 GB that would be expected.
I've noticed this. I can run 14B models with near instant responses on an ancient ryzen 5 and a 6gb 2060. Its weird.
Tested it with Google Coral USB Accelerator and camera?
Isn't privategpt doing RAG rather than actually doing any training?
What’s the terminal software?
What's the case with the screen? Is that for RB5?
At the end, you mention 25% of Wikipedia, but that is bytes of text, not a model. Remember the model deals in tokens, which are roughly a word but can also be an entire phrase. It is likely that the models contain most or all of Wikipedia, not a small fraction.
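Some back-of-the-envelope arithmetic for the bytes-vs-tokens point; the corpus size below is an assumed placeholder and 4 bytes per token is only a rough rule of thumb for English text:

```python
# Back-of-the-envelope: compare a quantized 7B model's weight size with a text
# corpus measured in tokens. The corpus size is an assumed placeholder, NOT a
# measured figure, and 4 bytes/token is only a rough heuristic for English.
corpus_bytes = 20e9            # assumed plain-text corpus size (placeholder)
bytes_per_token = 4            # rough heuristic
corpus_tokens = corpus_bytes / bytes_per_token

params = 7e9                   # 7B-parameter model
bits_per_weight = 4            # typical 4-bit quantization
model_bytes = params * bits_per_weight / 8

print(f"corpus ~= {corpus_tokens:.1e} tokens")
print(f"quantized 7B model ~= {model_bytes / 1e9:.1f} GB of weights")
# The point: comparing model bytes to corpus bytes says little about how much of
# the corpus the model has effectively absorbed during training.
```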
What power detector is that ?
Can I still use RBPi 4?
Can this do any kind of image generation stuff like stable diffusion?
simply brilliant 😮
Now, imagine these programs running on cellphones! I think we're not very far out from it!
A 7B model that is probably compressed/quantized up the @55 doesn't qualify as an "Advanced LLM" like the video's title says. Still impressive that you can run anything like this on a Raspberry Pi though.
How do I dislike only the Egyptian cotton joke but like the rest of the video
I don’t think privategpt is training or finetuning. It is creating a vector database with your documents and retrieving information based on context.
That was my idea also... too bad, I would love to see how to do the training
Super inspiring! I had no idea it was so simple to get this running locally. Amazing!
Does it need to be on a Raspberry Pi or a Linux-based system? I'm interested in running these models on my Windows system or even over WSL 2; if it is possible, I'd like some feedback on the possibility of you making a video on it
LM Studio? A way to run tons of LLMs on Windows.
What’s the shell name? Where can I buy it?🎉🎉 it’s so fascinating.
Is there any "How to" or maybe a "Step-by-step"? I have a Raspberry Pi 3B+ and a useless OrangePi A20... Is it possible to use them in any way?
Congrats for the great job!!
So, I've installed privateGPT on a gaming laptop with an RTX 4060, and it worked. Speed was so-so, even after enabling the LLM to use the GPU instead of the CPU. I'd be interested in knowing which configuration yields the fastest response. I've seen PCIe-to-M.2 adapters, which enable the use of an external GPU, because GPUs process AI data faster than CPUs, I've heard. What hardware combination would you recommend for speed and portability?
same here. i7 13700KF, 4060 Ti 16GB, 160GB of RAM. privateGPT on CPU is pretty slow; with CUDA enabled it runs well, but not as fast as on this Pi 5? So what's the magic here?
Simple answer: the video has been accelerated.
👾 Morning coffee tastes great while learning useful things. Thank you for the valuable video.
Will this work on a CM4 8GB board?
After install, when I try to run any of the models I just get “No such file or directory”
Running certain models locally is extremely slow on my laptop. I wondered if the Pi could do it. I figured someone had already tried with LocalAI or oobabooga.
But I'm very confused why you didn't try any really good 7B uncensored models. If you're gonna run a local LLM, why would anyone want a censored/aligned or base model? Can you list the ones you tried with success?
How can I use that LLM and the RasPi to run my own LLM in my IDE? Is there already something that can read my code and help me program, based on the code?
How does this compare to something like an Nvidia Orin?
that english-to-spanish translation was shit
great video
Hi sir, I am making a similar project where I'm using a Raspberry Pi to answer questions in a .py file using Python, but with a voice response to the question. I'm having problems making it work because it's giving errors about the ALSA thing not being located or something. Could we get in contact so you can help me with it, please? Thanks.
Yo Data Slayer, make the screen from the thumbnail real (+mic) and do a Rabbit R1 competitor. Looks dope
I want to do this using a Coral USB accelerator on top, with webcam object and face recognition and voice interaction, and link it to be able to control my home by accessing my existing Home Assistant Raspberry Pi 5 device.