I saw your previous video. It made me want to make my system dual boot. I followed your first video and was able to run the LLM you suggested within VirtualBox. It worked just fine and I was grateful. So I installed Linux Mint in a dual boot, and your FIRST video was inspiring enough for me to figure out how to get Ollama on Linux and then pick out any LLM I wanted and install it from there. I am grateful for this video, but to be fair, your first video shouldn't have garnered any hate. If people are your viewers at all, they should be savvy enough to figure things out on their own and use your videos as a guide. Otherwise, those viewers wouldn't be your subscribers if they were that afraid of their own computers.
This is amazing. I just installed it on my home PC. ZorinOS / Ryzen 5 3600 / AMD 5700XT / 16GB ... It runs great (running the 3.2:latest). I have been trying to learn how to make my first game in Unity and I've been struggling with some basic ideas on the interface to code a basic shader to apply to a material and get it into the scene. The format this thing uses is perfect! ChatGPT couldn't tell me in a way I understand, couldn't find a tutorial that was what I wanted... this thing spit it out in 3 questions. I can actually understand exactly what it means, not just some vague concept I'm going to have to stumble through! I don't understand how this is even possible with such a small data set, but I will take it. THANK YOU!!!!
I appreciate this vid on using "affordable" hardware. I'm already on a Mac, and I'm researching Ubuntu and Windows as options for some old video cards.
The 7940HS CPU in your mini PC has dedicated AI hardware acceleration dubbed "Ryzen AI". Hopefully the project enables it and starts optimizing for it (in addition to the iGPU) in the future. Looks promising for cheap devices.
You are always entertaining, Dave! And considering your niche topic, this is true talent! I'm not even that much of a nerd, nor am I interested in programming or computer hardware, but I really enjoy your channel. Keep up the great work!
Llama 3.2 3B is the clear winner for general chat tasks on local machines. I just love it! Thanks for testing the 405B - I was wondering how fast it would go and how much RAM it needs. Now I know it's not worth it. I'm looking forward to a Llama 3.2 7B, which I think will be the sweet spot.
Yet another fab video Dave. (It's amazing how many people who have never produced anything in their lives feel compelled to criticize the heck out of other people's work)...
Even a 8G RAM Pi 5B is still under 100 Dollars US, thus it would be a reasonable entry level platform. Beyond the learning experience of setting up AI and LLM, there might be utility in having a Pi as an offline server which could e-mail answers to questions which don’t need to be answered within a few seconds real time.
@Dave's Garage: Thanks for the video! That LLM on Raspberry Pi looks painful, ouch. I am testing some new beta releases of Windows Server and other Windows OSes, and I've got my rig over here running on a Corsair Origin Neuron AMD 79503dfx and NVIDIA 4090 GPU. I was not impressed with the last LLM software I used, but I am going to check out your recommendations in the video. Thanks! I usually go to ChatGPT on my subscription plan, but there are many use cases where I prefer working offline. Thanks again for all the awesome videos!
I'm surprised how smart an offline LLM is. I asked the question "I have a Ryzen X670E motherboard with a Ryzen 9700X CPU which idles at 45 W from the wall; how much is from the chipset?", and the answer was correct and relevant, with pages of it. I tried words with multiple meanings, spelling mistakes, etc., and the answers were correct. Do LTO drives need drivers, what is the difference between LTO 5 and 6: all the world's knowledge in a few gigabytes.
@@ArthurFlimbimlinson-x1r Likely one with half a dozen to a dozen billion parameters. I get around 20-30 tokens/s on my RTX 3060 12 GB when using LLMs with those sizes. Intel i5-12400F, 32GB DDR4 and Windows 11 if you want the other details but I'm pretty sure the rest of your PC can be a potato as long as the entire model plus context window cache fits in the GPU. I can also load a 70 billion parameter model that's been cut down to a smaller size (quantized to 2-bits) but it uses all my RAM+VRAM and runs at a glorious 1 token/s.
The fact that you got llama-3.1:405B running at all at home is just impressive, even if it's mostly running on CPU. My Ryzen 7 is hardware capped at 128 GB of system RAM; I really should have waited for the AM5 socket.
I have a 7950x with 32GB RAM and a 3090. No probs running 405B if I can wait for the result. Also have a 64 core Threadripper, 256GB RAM and a 3090. Both machines are level pegging. The more GPU VRAM you have, the bigger your model can be
@@darksushi9000 Which quant of the 405B model are you using in your 32GB RAM machine? I can barely fit a 2-bit quant of the 70B model in 32GB RAM plus 12GB VRAM.
@@darksushi9000 Hmm, that doesn't fit in 32GB of RAM unless you have 10 RTX 3090. Didn't you mean to say you're running the 70b on your 32GB RAM machine and the 405b on your 256GB RAM machine?
I'm running full fat 405b on a 7-year-old Xeon Gold system with 192GB of RAM and a Vega 56 GPU. I mean, I'm cheating because I'm using 4 NVMe drives in RAID 0 as swap space and HBCC to pull it off, but hey... It works, sorta.
I don’t know why anyone would give you heat , that video was OUTSTANDING !! I was up and running on my HP Gen 9 with an old Nvidia P2000 in no time at all ! The thing ran GREAT ! The replies were smooth and fast … The thing I don’t understand is the three variants or size options in 3.1 ? I want the most powerful model available. My GPU seems to be doing just fine and I have a ton of CPU and memory .
Bigger models are (usually) smarter. But to run them fast enough, you need to fit the entire thing in VRAM or else your GPU has to pull data from the RAM, which is slow as fuck. Try loading a model that's bigger than your 5GB of VRAM and see how it goes for you, I bet you'll be disappointed.
I think it's worth mentioning that the quality of a word is also important not just the speed of an idea. something well thought out has more value and I personally could see the value of your expensive machine as a host-body for the language model in the quality of the sentence that it came up with. Maybe it's nice to think of something for a bit, but I didn't see the word 'delightful' in the other examples. Thanks for making this video
Nice episode. I have been playing with a local AI in Win 11 (using LM Studio) on a 7950X / RTX 3070 Ti. I also have an RPi 4, an Orange Pi 5+ and an old 4790K that I am loading Linux on. This video helps me decide what's fast enough.
And..., the $50,000 Dell said, "I'm sorry, Dave. I can't do that". Excellent video. Much better than the previous one on LLM. I actually have it working now. Thanks!
🎯 Key points for quick navigation:
00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000.
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM.
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU.
- Faster performance compared to Raspberry Pi, but model could not fit in GPU memory, relying on CPU instead.
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2.
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory.
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra's suitability for LLMs.
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada.
- Performance lagged significantly, highlighting that larger models can struggle even on high-end consumer hardware.
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup.
- Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end.
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.
Made with HARPA AI
My next-door neighbour has an autistic son aged 10. I am reading as much as I can find to understand the condition. Your book is my latest purchase. I'm not sure if it will help the lad as he has very complex needs, but the knowledge will be useful.
Yeah, I installed Ollama after your video. Had to comment some stuff out of the install script because it didn't notice that on my Fedora machine the CUDA drivers were installed from RPMFusion. But yeah, after the install script went through, it works crazy fast on my office machine (i7-7800X/RTX 4070 Ti). And even on my old living room machine (i5-4670K/Quadro P2000) it works faster than I can read, so it is enough ;)
Correct me if I am wrong, but the reason the Herk box is using the CPU is because its GPU is an AMD. Pretty much every ML framework today expects to use the CUDA library for GPU acceleration. CUDA is a proprietary library developed by Nvidia. AMD has been fighting tooth and nail to gain wider adoption for their own alternatives, but they are simply not there yet.
When you did the intro into the last video, I knew this would be a followup kind of video. It made no sense to just leave the demo out of youtube watcher reach :D
The real problem I find is context tends to eat lots of memory, beyond just loading the model itself. Sure, I can maybe load a 70b model with the memory I have, but I'm gonna hit the ceiling pretty fast with 128k context. I don't have the budget for 512gb of video memory, or a high end mac, so unless I load it into system memory, which is just insane, even with some smaller models I'm going to struggle once the context is full up. Of course, I can manually reduce the context length, but it's a shame because I'd like it to be able to handle large amounts of text or long discussions. Great video as always!
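A rough way to put numbers on the context-memory point above: the key/value cache grows linearly with context length, on top of the model weights. The sketch below is only an estimate. The architecture figures (80 layers, 8 KV heads, head dimension 128) are assumptions chosen to resemble a 70B-class model with grouped-query attention, and 16-bit cache values are assumed; check the actual model config before relying on the result.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV-cache size in GiB for a given context length.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


if __name__ == "__main__":
    # Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    for ctx in (8_192, 32_768, 131_072):
        size = kv_cache_gib(80, 8, 128, ctx)
        print(f"{ctx:>7} tokens of context -> {size:.1f} GiB of KV cache")
```

Under those assumptions a full 128k context costs about as much memory as a 4-bit 70B model itself, which is why reducing the context length frees so much room.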
Tested the 70B Q4 (42gb) on a 5950x and 128gb ram with RAG and 40K context. was about 80GB ram usage and the inferencing was around 0.56/s. (usually gets 30-50 on GPU using 11B). Then tried the IQ1_S which was 15GB on the 4060TI 16GB +30K context and got the same speed. (obviously offloading to the ram). The good thing is that the 70B generates long and detailed answer unlike the 3.2 1-3B models which sometimes say that it did not find the query in the document attached. (2H 30K words YT interview)
I think the GUI of Jan makes the installation and user experience of models to try things more convenient. It also has the capability for you to put instructions for it per what it calls threads, which are basically what ChatGPT calls a new chat. It also has a nifty thing where you can tweak settings on the models and have different models per thread. For example, I have one model that's been trained a lot on code/documentation, that can be useful for searching when I remember the concept of some language feature I need, but don't remember the specific keywords in the language I'm doing it in, most relevant when I'm doing something in a language that I either haven't touched in a while or not often. Whereas I have a separate model that's been trained on a lot of fictional writing that I use to help proofread things that I wrote. Even if it doesn't give me the fix that I want, it at least demonstrates where certain errors are that need looking at. Another nice thing about Jan is that if you wanted to, you can hook it up to online services as well, if you wanted. You can keep all your LLM stuff in one place with it. I'm predominantly doing things on it locally only, but I know at least one person that does ChatGPT stuff through it
I haven't read the story of Little Red Robin Hood yet. :) I'm glad you did this video on a variety of hardware that includes today's computer enthusiasts.
@DavesGarage @6:00 you are talking about the "fixed" RAM allocated to the GPU. The BIOS/UEFI "should" have an option to set the memory as "shared" or (similar meaning), where the amount of RAM is dynamically allocated between the CPU and the GPU. This is one of the reasons why people are interested in the upcoming "Strix Halo" that has a beefy GPU (and CPU), but also quad channel RAM and can be fitted with 256GB, which can be dynamically adjusted, and then eaten up by the GPU.! Please find this setting in the BIOS, change it to "dynamic" and post a video about your findings, many would be I am sure interested in such a thing. Thanks.
I just watched a video on the limits of LLM error rate as relates to parameters, performance, etc. basically the relationship is asymptotic. More is better but the relationship decreases logarithmically. I think most people won't understand how AI models are being designed for levels of complexity and ambiguity that are difficult to grasp. They do this by having a massive number of parameters and ability to discriminate finer and finer details. These are use cases for AI to interact with humans in a visual and audio world that is absurdly complex, all while hoping to have the ability to interact with millions or billions of humans.
I seriously think your show is great. It's interesting and it's entertaining. I wasn't born in the age of the computers you grew up using, but you explain it in a very good and interesting way. I think there are YouTubers who could benefit from presenting the material as well as you do. You're not just staring at a screen while we watch you do stuff.
Dave... First off, thanks for this and the many other videos you have done. I am thinking my pushing the like button is going to wear out the button soon :). I am trying to wrap my brain around many things in this and have had the locally running ChatGPT-style model that you showed us try to teach me about each of the parts. I am working on understanding each piece. The one question I might have is: what is the difference between the 8 billion, 70 billion, and 405 billion parameter models as far as reliable answers go? I understand the larger ones take more horsepower, but I'm not sure "exactly" what the benefit of more parameters is. Maybe a future video explaining the intricacies of more parameters, or maybe one of the other co-patrons here, would help out and clue me in. Either way, thanks for now, as I am not only jealous of your information quality but also that you are retired and I am not. :)
I'll second that. It would be interesting to see what it takes to turn a database of help desk ticket problems and resolutions into an LLM which could try to answer technical questions.
I'm here for the moment when the Pi says: "I can't do that, Dave"
it has to wait for dave to forget his space helmet
Open the pod bay doors!!
The irony being that the Pi could do that
1:17 on this part it would actually be I CAN DO THAT, Dave
😆😆😆🤣
Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.
The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math on it. An 8 GB model on a 12 GB RTX 3060 Ti with 6 channels (of 2 GB each) gets 448 GB/s, for about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of DDR4-3200 memory, you have 51.2 GB/s, so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70B model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600 and upgrade to a 5600X and get a 22% boost by pushing your DDR4 to 3600 MHz.)
Ok, to be fair: if you're running Llama on an old ThinkPad X260, you actually do get twice the performance by running the model on *both* cores. Having true AVX256 or better and more than two cores really helps with doing the math.
"A bit of math" is.... an interesting way of putting it. I'm aware that training is several orders of magnitude more compute intensive than inferencing, but whether I run in CPU or GPU mode, both are taxed pretty heavily. Never to 100%, which does indeed confirm that memory bandwidth/latency is the bottleneck, but still, taxing an 8 core CPU to 45% on LPDDR5-6400 is hardly "a bit of math".
@@andersjjensen it really isn't that much math. The only reason it even registers as 45% is because we're talking about models that use all the input tokens and the output tokens as active bi-lstm nodes.
So it's more like it's constantly rechecking its work.
Just consider how fast the mac pro pumps the tokens out when any other benchmark doesn't make the GPU look all that impressive. Mac pro is more similar to an rtx 2060 with loads of fast ram strapped onto it.
This is a case where the way usage data is monitored isn't representative of really how the hardware is taxed. usage monitoring is more an indicator of how full the wait queue is.
Ah i just realized you specifically mentioned cpu for the 45% figure. But either way, my point is that you can't actually extrapolate down from that number what the ideal hardware configuration would be. Same amount & bandwidth of ram but half the raw compute is still much faster than it really takes. Even if the usage seems to say it's the spot.
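The rule of thumb from the top comment of this thread (tokens per second is roughly memory bandwidth divided by model size) can be written down in a few lines. This is a back-of-the-envelope sketch, not a benchmark: it assumes every generated token streams the whole model from memory once, and the `efficiency` parameter is a made-up knob for the "some overhead" the comment mentions.

```python
def ddr_bandwidth_gb_s(transfer_rate_mt_s: float, channels: int,
                       bus_bytes: int = 8) -> float:
    """Peak DDR bandwidth in GB/s: transfers/s x 8-byte bus x channels."""
    return transfer_rate_mt_s * bus_bytes * channels / 1000


def tokens_per_second(bandwidth_gb_s: float, model_gb: float,
                      efficiency: float = 1.0) -> float:
    """Upper-bound estimate, assuming each token reads the full model once."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return efficiency * bandwidth_gb_s / model_gb


if __name__ == "__main__":
    ddr4 = ddr_bandwidth_gb_s(3200, channels=2)  # dual-channel DDR4-3200
    print(f"DDR4-3200 x2: {ddr4:.1f} GB/s")
    print(f"  8 GB model:  {tokens_per_second(ddr4, 8):.1f} tokens/s")
    print(f"  48 GB model: {tokens_per_second(ddr4, 48):.1f} tokens/s")
    print(f"448 GB/s GPU, 8 GB model: {tokens_per_second(448, 8):.1f} tokens/s")
    # The 2933 -> 3600 MT/s upgrade scales bandwidth by 3600 / 2933.
    print(f"DDR4 2933 -> 3600: +{(3600 / 2933 - 1) * 100:.0f}% bandwidth")
```

The figures in the comment (about 6 tokens/s and about 1 token/s on DDR4, about 50 tokens/s on the GPU after overhead) line up with this estimate.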
Use a Vega 20 GPU (excluding radeon VII) and you can pool VRAM with RAM to run whatever models you want. You can even add swap space on NVMEs. I got LLAMA 405b running on a system with Vega 56 which supports HBCC (although it's worse) and I used 4 NVME drives raid 0 for swap. PCIE Gen 3 is part of the problem, but The system prioritized VRAM, then ram, then Swap, as I expected so about 192GB of real RAM was used and only 600GB of Swap.
Vega 20 (MI60 for example) has PCIE 4.0, and Optane DIMMs or Optane U.2s would work better though.
@@JonVB-t8l you can basically always do this. It's not Vega specific. Computers just work that way.
What you're doing is changing how it's reported to the system so the basic flag checking that the software does before sending the model clears without complaining.
But you could also just remove the flags or use wrappers that don't check.
The reason they do try to prevent it is because you lose 90% of the speed when you do this. And it can be unstable on some systems.
I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always Dave!
I smiled too, but got the impression that Dave cares for his viewers.
He is quite precise when he talks which rather suits me.
Definitely learned something there. 😀
having failed to get the webserver running on your previous WSL demo, i removed everything in frustration. Great to see it works from the command line equally well under Windows. I now have AI on my laptop (8G RAM no GPU), something i never thought possible! Thanks for showing something for everyone.
Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.
Do you mean using your own reference documents? If so, take a look at AnythingLLM, it might meet your requirements
Check out N8N or Dify
LM Studio, AnythingLLM, or similar.
As someone who gave you "heat" in the last video, thank you for the follow-up!
You bet!
Thanks for updating and including budget friendly options.
I rather liked your having demonstrated with WSL, as I was able to follow along on my Ubuntu server
I very much believe that local LLMs are an answer to privacy in the future. As long as a large group of open testers materialize, we can also try and remove bias as best we can.
Superb content. Not many channels with this amount of quality in terms of delivery.
The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ 8-core RK3588 processor with 8G RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe additional cores make a big difference. I also want to test these models on the Radxa X4 8G, N100 processor.
What's the cost of such a home "server"
Thanks Dave, I really appreciate the time you spend to make these videos for us. Really enjoy these geeky rabbitholes.
11:00 I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to 8B flavor.
correct, llama3.1:latest =llama3.1:8B
With a 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy.
Came here to say the same. The 70B might be a great fit for the faster machines.
I haven't played with llama yet, mostly mistral, so I was also surprised when the 70b param model was only 5gb 🥲
@@sharpenednoodles 70b llama3.1 is more like 40gb 😅
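The size argument in this thread is simple arithmetic: parameters times bits per weight, divided by 8. A minimal sketch follows. The 4.5 bits per weight used for Q4_0 is an assumption (4-bit weights plus a per-block scale), and real files add some metadata on top, so treat the results as approximate.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params x bits / 8."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    Q4_0_BITS = 4.5  # assumed effective bits per weight for Q4_0
    for name, params in (("8B", 8), ("70B", 70), ("405B", 405)):
        q4 = quantized_size_gb(params, Q4_0_BITS)
        fp16 = quantized_size_gb(params, 16)
        print(f"{name:>4}: ~{q4:.1f} GB at Q4_0, ~{fp16:.0f} GB at 16-bit")
```

A download of about 5 GB therefore matches the 8B model at roughly 4 bits, while 70B at the same quantization lands near 40 GB.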
Thanks Dave! Really appreciate your time, and energy on this topic. I was playing with the former video yesterday and thought, "man I hope he does a little more on this".... and BAM, you did. THANK YOU!
I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.
That windows method is even more straightforward than the wsl from the last video. Thanks for sharing!
I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.
I'd prefer it directly on Linux, but ofc I'm sure I can figure that out myself, I'm just here to watch 😂
Dave, thank you for running those tests for us. While I am currently working with GPT through web browser and looking forward to switching to API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner than later and running a local model will be my only option in the future. Seeing that it is feasible, even today is very reassuring!
Hey Dave - 11:00 With a sub 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.
I'm running 405b on an 8-year-old server with a Vega 56. Abusing the F outta HBCC to add RAM and swap into the pool of "VRAM". Yes, I have 600GB of the 810GB model running from swap spread across 4 NVMe drives.
@@JonVB-t8l That's quite the setup. I'd be very curious how that performs.
@@Steamrick I am pretty sure not well enough to be acceptable. Even with the NVME I think the read write speeds are like quarter-ish compared to a DDR4 RAM stick.
Came here to say this. I think 70b is like 40GB model
@@thecompanioncube4211 Oh, even the fastest NVMe SSD is far less performant than a quarter of DRAM. It's not just the speed, it's also the latency that's much worse.
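To get a feel for how slow the swap-backed setup discussed in this thread would be, one can add up the time needed to read each memory tier once per token. This sketch is hypothetical throughout: the 8 GB of VRAM and all three bandwidth figures are assumed values, not measurements from the commenter's machine, and it ignores latency, which the reply above points out makes things worse.

```python
def seconds_per_token(tiers: list[tuple[float, float]]) -> float:
    """Lower bound on time per token when the model spans several tiers.

    tiers: (gigabytes_resident, bandwidth_gb_s) pairs; each tier is assumed
    to be read in full, one after another, for every generated token.
    """
    return sum(gigabytes / bandwidth for gigabytes, bandwidth in tiers)


if __name__ == "__main__":
    # All bandwidths below are assumptions for illustration only.
    tiers = [
        (8, 400),    # VRAM: assumed 8 GB at 400 GB/s
        (192, 100),  # system RAM: 192 GB (from the thread) at assumed 100 GB/s
        (600, 12),   # swap: 600 GB (from the thread) at assumed 12 GB/s
    ]
    total = seconds_per_token(tiers)
    print(f"~{total:.0f} s per token, i.e. ~{60 / total:.1f} tokens per minute")
```

Even with generous bandwidth assumptions, the swap tier dominates the total.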
Thanks so much for these great opportunities. We're really loving your online classes.
I am freaking amazed to run this locally on my laptop (13900HX plus 4070 mobile) and it is only 2gb and performs amazing. Thanks for sharing this Dave, great content piece! thx!
Good luck with the longevity of your laptop.!!! If you have any random problems, crashes, things just not working, make notes of what and when (time, date) and contact the laptop company and have them officially note this as a warranty issue (if you have a warranty), and otherwise make preparations for a replacement laptop. Good luck and best wishes.
and how do you use this 2gb (8B?) model in daily use?
Thanks, Dave. You've given me a lot more confidence in my beat-up 2015 MacBook Pro. Off to Ollama now!
Since some people (predictably) like to complain in your videos because you're not catering to their exact needs, here's my demand for a followup with you running it on your PDP-11.
Video to come out in 200 years
Do you want it done in real time?
@@20chocsaday What's the max allowed length for a YouTube video, 10 hours?
Watch it turn out to be faster than the 50K Dell.
I know, no chance of that. Yet a PDP-11 used to power a Xerox 9700 printer. It could read from network or tape, merge data with a form at 300 DPI, print at 2 pages a second duplex and do that hour after hour.
This testing is right up the alley of the sort of video that I've been looking for and I really appreciate it. Going through a wide range of machines is much more useful than just testing like a 20k machine. That being said, there's something I am super confused about. Before you start the Threadripper test, you said up till now we've been using the 70 billion parameter model. The download sizes were showing around 5GB and the 70 billion parameter model would be much larger than that on the order of over 10 times, even for a quantized version. And there's just absolutely no way a 70 billion parameter model would run on anything remotely close to as wimpy as a Raspberry Pi. I assume you misspoke, which does lead me into a request. I would actually really, really appreciate seeing this sort of range testing across a variety of machines, specifically for larger models around ~30 billion or ~70 billion parameters, because I assume that most of the early tests were for some quant of the 8 billion parameter model. Most of the results available online are for the 8 billion parameter models, which is really a shame because higher end consumer machines like a gaming PC or an M2 Ultra really should be able to handle larger models around 30-70 billion parameters.
Pretty awesome the pi even ran. Super cool Dave thanks as always man!
I saw your previous video. It made me want to make my system dual boot. I followed your first video and was able to run the LLM you suggested within VirtualBox. It worked just fine and I was grateful.
And so I installed Linux Mint in a dual boot, and your FIRST video was inspiring enough for me to figure out how to get Ollama on Linux and then pick out any LLM I wanted and install it from there.
I am grateful for this video, but to be fair, your first video shouldn't have garnered any hate. Because, if people are even your viewers they should be savvy enough to figure things out on their own, and use your videos as a guide. Otherwise, those viewers wouldn't be your subscribers if they were that afraid of their own computers.
I built a system with 4x P102-100's which total 40GB of GPU ram. Now I can use the 70b quantized models and it is awesome! Best bang for your $$$.
Wonderful!! Actually very useful. I plan on upgrading my own PC to do AI stuff, and now I can see roughly how well it'll do it! Thank you so much!
I've run Windows on my RPi4, tutorial videos are out there. Not too complicated.
You needed to run Minesweeper on the $50k Dell to really push it ;) Another great video Dave, thanks.
This is amazing. I just installed it on my home PC. ZorinOS / Ryzen 5 3600 / AMD 5700XT / 16GB ... It runs great (running the 3.2:latest). I have been trying to learn how to make my first game in Unity and I've been struggling with some basic ideas on the interface to code a basic shader to apply to a material and get it into the scene. The format this thing uses is perfect! ChatGPT couldn't tell me in a way I understand, couldn't find a tutorial that was what I wanted... this thing spit it out in 3 questions. I can actually understand exactly what it means, not just some vague concept I'm going to have to stumble through! I don't understand how this is even possible with such a small data set, but I will take it. THANK YOU!!!!
Nice content, i like that you seem completely agnostic between, mac, linux and windows and even the different hardware.
Turns out, 3.1 runs reasonably well on 4080. Thanks for the tip! Until this video I didn't know I could run an LLM on my PC.
Thanks for making this video. I'm building a new PC and wanted to play with running local LLMs. To see just how fast a 4080 is...holy crap!
I appreciate this vid of using “affordable” or affordable hardware.
I’m already on a Mac, I’m researching Ubuntu and windows as an option for some old vid cards
The 7940HS CPU in your mini PC has dedicated AI hardware acceleration dubbed "Ryzen AI". Hopefully the project enables and starts optimizing for it (in addition to the iGPU) in the future. Looks promising for cheap devices.
Only at 10 TOPS according to their website. For comparison, the Copilot+-PCs need at least 40 TOPS. So questionable if it's accelerating anything.
There are projects working on incorporating ROCm which I believe can leverage the TOPS AI processor. Similar to MLX based Apple Silicon models.
You are always entertaining, Dave! And considering your niche topic, this is true talent! I'm not even that much of a nerd, nor am I interested in programming or computer hardware, but I really enjoy your channel. Keep up the great work!
Llama 3.2 3B is the clear winner for general chat tasks on local machines. I just love it! Thanks for testing the 405B - I was wondering how fast it would go and how much RAM it needs. Now I know it's not worth it. I'm looking forward to llama 3.2 7B which I think will be the sweet spot.
Yet another fab video Dave. (It's amazing how many people who have never produced anything in their lives feel compelled to criticize the heck out of other people's work)...
Nice pivot and delivery, sir. Respect. I can't wait to follow along.
These vids are exactly what I need right now. Good to know that the pi can actually run it in some capacity.
Even a 8G RAM Pi 5B is still under 100 Dollars US, thus it would be a reasonable entry level platform. Beyond the learning experience of setting up AI and LLM, there might be utility in having a Pi as an offline server which could e-mail answers to questions which don’t need to be answered within a few seconds real time.
So kewl. Was just about to look for resources regarding this topic and this video got recommended. Amazing, thank you!
Wow, educational, interesting and inspiring! Thanks for showing us what is possible, in detail. I'd not even heard of ollama!
This video should save me a lot of time when I get around to running an LLM, many thanks.
That was best of the internet right there. Thanks, Dave.
Best I can do is like and say "thank you" since I've already subscribed. How about a heart? ❤
@Dave's Garage: Thanks for the video! That LLM on Raspberry Pi looks painful, ouch.
I am testing some new beta releases of Windows Server and other Windows OS, and I got my rig over here running on Corsair Origin Neuron AMD 79503dfx and NVIDIA 4090 GPU. I was not impressed with the last LLM software I used, but I am going to check out your recommendations in the video. Thanks! I usually go to ChatGPT for my subscription plan, but there are many use cases where I prefer working offline. Thanks again for all the awesome videos!
Awesome video Dave. I was playing with Stable Diffusion. Will try to explore Llama in WSL
I'm surprised how smart an offline LLM is. I asked the question "I have a Ryzen X670E motherboard with a Ryzen 9700X CPU which idles at 45W from the wall, how much is from the chipset?" and the answer was correct and relevant, with pages of it.
I tried words with multiple meanings, spelling mistakes, etc. and the answers were correct.
Do LTO drives need drivers? What is the difference between LTO 5 and 6? All the world's knowledge in a few gigabytes.
Love it. Would also like to see a chart showing tokens per second on the same model across the hardware. Good ollama benchmarks are hard to come by
I used this on my machine, an i5 14500 with 16GB DDR5 and an Nvidia RTX 4060 GPU running Linux Mint, and the speed is good enough for me
What LLM?
@@ArthurFlimbimlinson-x1r Likely one with half a dozen to a dozen billion parameters. I get around 20-30 tokens/s on my RTX 3060 12 GB when using LLMs with those sizes. Intel i5-12400F, 32GB DDR4 and Windows 11 if you want the other details but I'm pretty sure the rest of your PC can be a potato as long as the entire model plus context window cache fits in the GPU.
I can also load a 70 billion parameter model that's been cut down to a smaller size (quantized to 2-bits) but it uses all my RAM+VRAM and runs at a glorious 1 token/s.
@@ArthurFlimbimlinson-x1r Dolphin
The fact that you got llama-3.1:405B running at all at home is just impressive even if its mostly running on CPU.
My Ryzen 7 is hardware capped at 128gb of system ram, I really should have waited for the AM5 socket.
I have a 7950x with 32GB RAM and a 3090. No probs running 405B if I can wait for the result. Also have a 64 core Threadripper, 256GB RAM and a 3090. Both machines are level pegging. The more GPU VRAM you have, the bigger your model can be
@@darksushi9000 Which quant of the 405B model are you using in your 32GB RAM machine? I can barely fit a 2-bit quant of the 70B model in 32GB RAM plus 12GB VRAM.
@@firecat6666 I am running the Q4
@@darksushi9000 Hmm, that doesn't fit in 32GB of RAM unless you have 10 RTX 3090. Didn't you mean to say you're running the 70b on your 32GB RAM machine and the 405b on your 256GB RAM machine?
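The "doesn't fit in 32GB" point can be sanity-checked with a back-of-the-envelope sketch. This assumes about 4.5 bits per weight for a Q4-style quant and 24GB of VRAM per RTX 3090, and it ignores the OS, the context cache, and any other overhead, so real requirements are higher. `gpus_needed` is a hypothetical helper, not part of any tool.

```python
import math


def gpus_needed(params_billion: float, bits_per_weight: float,
                ram_gb: float, vram_per_gpu_gb: float) -> int:
    """GPUs required to hold whatever part of the weights does not fit in RAM."""
    model_gb = params_billion * bits_per_weight / 8
    overflow_gb = max(0.0, model_gb - ram_gb)
    return math.ceil(overflow_gb / vram_per_gpu_gb)


# 405B at an assumed 4.5 bits per weight, 32 GB RAM, 24 GB cards
print(gpus_needed(405, 4.5, ram_gb=32, vram_per_gpu_gb=24))

# 70B at the same quant on the same machine
print(gpus_needed(70, 4.5, ram_gb=32, vram_per_gpu_gb=24))
```

Even with these optimistic assumptions the 405B case needs a stack of 24GB cards on top of 32GB of RAM, in the same ballpark as the "10 RTX 3090" remark, while the 70B case needs only one.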
I'm running full fat 405b on a 7 year old Xeon Gold system with 192GB of ram and a Vega 56 GPU.
I mean I'm cheating because I'm using 4 NVME drives raid 0 as swap space and HBCC to pull it off, but hey... It works sorta.
Good info… answers many questions I had without me having to do the experiments myself, so thanks.
I don’t know why anyone would give you heat , that video was OUTSTANDING !! I was up and running on my HP Gen 9 with an old Nvidia P2000 in no time at all ! The thing ran GREAT ! The replies were smooth and fast … The thing I don’t understand is the three variants or size options in 3.1 ? I want the most powerful model available. My GPU seems to be doing just fine and I have a ton of CPU and memory .
Bigger models are (usually) smarter. But to run them fast enough, you need to fit the entire thing in VRAM or else your GPU has to pull data from the RAM, which is slow as fuck. Try loading a model that's bigger than your 5GB of VRAM and see how it goes for you, I bet you'll be disappointed.
I think it's worth mentioning that the quality of a word is also important not just the speed of an idea. something well thought out has more value and I personally could see the value of your expensive machine as a host-body for the language model in the quality of the sentence that it came up with. Maybe it's nice to think of something for a bit, but I didn't see the word 'delightful' in the other examples. Thanks for making this video
Ok, thanks Dave. Got it running. Any interest in setting it up to web scrape and analyze results based on a local query?
Nice episode. I have been playing with a local AI in Win 11 (using LM Studio) on a 7950X / RTX 3070 Ti. I also have an RPi 4, Orange Pi 5+ and an old 4790K that I am loading Linux on. This video helps me decide what's fast enough.
And..., the $50,000 Dell said, "I'm sorry, Dave. I can't do that". Excellent video. Much better than the previous one on LLM. I actually have it working now. Thanks!
🎯 Key points for quick navigation:
00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000,
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM,
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU,
- Faster performance compared to Raspberry Pi, but model could not fit in GPU memory, relying on CPU instead.
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2,
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory,
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra’s suitability for LLMs.
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada,
- Performance lagged significantly, highlighting that larger models can struggle even on high-end consumer hardware.
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup,
- Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end,
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.
Made with HARPA AI
Even though you brought the 50K machine to its knees, and we're somewhat saddened, I'm guessing there was a well hidden smirk as well ..😅
WSL2 Linux on Windows is a perfectly cromulent decision. That WSL2 tech is magical.
I also came here for the dog playing the piano. You're the best, Dave!!
The salute gives me goosebumps. Makes me think I am a war hero that served in a war zone when I didn't.
Perfect! Just in time for me to install Ollama on my new Lenovo Yoga Slim 7x Copilot+ PC with the Snapdragon X Elite processor and NPU!
Well-made, full of information for the public.
Thanks for listening to the comments. Great video!
It was nice to see the canals of the city of Brugge in the background of the windows machine.
Great video, thanks for sharing 👍
Top notch work Dave!!! Thank you!
win10 i7-13700k with no video card pegs at 100%, and llama3.2 generates about 80% as fast as normal reading speed.
With a 10600k it's at least 2-3x faster than normal reading speed. But I am on Linux
You should use the --verbose flag when running the examples as it will give the tokens/sec
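Besides `ollama run <model> --verbose`, the same number can be pulled out of Ollama's local HTTP API. Below is a minimal sketch, assuming a default install listening on `http://localhost:11434` and a non-streaming `/api/generate` response that includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The helper names and the default model name are only for illustration.

```python
import json
import urllib.request


def tokens_per_second(stats: dict) -> float:
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage, with the Ollama server running and the model already pulled:
#   result = generate("Tell me a short story.")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Running the same prompt and model on each machine this way would give directly comparable tokens/s figures.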
Nice one Dave, bravo.
My next-door neighbour has an autistic son aged 10. I am reading as much as I can find to understand the condition. Your book is my latest purchase. I'm not sure if it will help the lad as he has very complex needs, but the knowledge will be useful.
There's a lot of overlap even between mild and severe cases, so hopefully the info is still useful!
@DavesGarage Thanks, I'm sure it will help. I love your work on the channel. Keep it up.
Yeah, I installed ollama after your video. Had to comment some stuff out of the install script because it didn't notice that on my Fedora machine the CUDA drivers were installed from RPMFusion. But yeah, after the install script went through it works crazy fast on my office machine, i7-7800X/RTX4070Ti. And even on my old living room machine it works faster than I can read, so it is enough ;) i5-4670k/Quadro P2000
Thanks for tickling my fancy with the "Do it Len" animations! 😂
I love your channel! The OGs of Tech Samurai!
You're the developer who created Task Manager! Awesome
Great episode! I loved this one.
Correct me if I am wrong, but the reason the Herk box is using the CPU is because its GPU is an AMD. Pretty much every ML framework today expects to use the CUDA library for GPU acceleration. CUDA is a proprietary library developed by Nvidia. AMD has been fighting tooth and nail to gain wider adoption for their own alternatives, but they are simply not there yet.
It does support some AMD dedicated video cards as you saw in the video. Not sure how effective it will be vs CUDA.
When you did the intro into the last video, I knew this would be a followup kind of video. It made no sense to just leave the demo out of youtube watcher reach :D
There's a 3.2 11b that will be out soon. That's probably the sweet spot for most people. Especially for 12Gb and up GPUs. It also adds image support.
The real problem I find is context tends to eat lots of memory, beyond just loading the model itself. Sure, I can maybe load a 70b model with the memory I have, but I'm gonna hit the ceiling pretty fast with 128k context. I don't have the budget for 512gb of video memory, or a high end mac, so unless I load it into system memory, which is just insane, even with some smaller models I'm going to struggle once the context is full up. Of course, I can manually reduce the context length, but it's a shame because I'd like it to be able to handle large amounts of text or long discussions. Great video as always!
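To put a rough number on how much memory context eats, here is a sketch of the usual key/value cache estimate: two tensors (keys and values) per layer, each with one vector per KV head per token. The Llama 3.1 70B shape used below (80 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions for illustration; check them against the model card, and note that runtimes can quantize the cache to shrink this.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a full context window."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed 70B-class shape: 80 layers, 8 KV heads, head dimension 128.
for ctx in (8_192, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(80, 8, 128, ctx):.1f} GiB")
```

Under those assumptions a full 128k window costs about as much memory as a 4-bit 70B model itself, which matches the experience of hitting the ceiling once the context fills up.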
Tested the 70B Q4 (42GB) on a 5950X with 128GB RAM, RAG, and 40K context. RAM usage was about 80GB and inference ran at around 0.56 tokens/s (I usually get 30-50 on GPU using the 11B). Then I tried the IQ1_S, which was 15GB, on the 4060 Ti 16GB with 30K context and got the same speed (obviously offloading to RAM).
The good thing is that the 70B generates long and detailed answer unlike the 3.2 1-3B models which sometimes say that it did not find the query in the document attached. (2H 30K words YT interview)
"Nothing but the 2nd best, for dave.... " Classic hahahaha
I think the GUI of Jan makes the installation and user experience of models to try things more convenient. It also has the capability for you to put instructions for it per what it calls threads, which are basically what ChatGPT calls a new chat. It also has a nifty thing where you can tweak settings on the models and have different models per thread. For example, I have one model that's been trained a lot on code/documentation, that can be useful for searching when I remember the concept of some language feature I need, but don't remember the specific keywords in the language I'm doing it in, most relevant when I'm doing something in a language that I either haven't touched in a while or not often. Whereas I have a separate model that's been trained on a lot of fictional writing that I use to help proofread things that I wrote. Even if it doesn't give me the fix that I want, it at least demonstrates where certain errors are that need looking at.
Another nice thing about Jan is that you can hook it up to online services as well, if you want. You can keep all your LLM stuff in one place with it. I'm predominantly doing things on it locally only, but I know at least one person that does ChatGPT stuff through it
I haven't read the story of Little Red Robin Hood yet. :) I'm glad you did this video on a variety of hardware that includes today's computer enthusiasts.
You should be using llama3.2 on the PI, which is designed specifically for edge devices like SBCs or smartphones
Great content. As succint and complete as one could hope
the final story about jeff bezos generated by llama 3.2 2B model was actually funny ngl
@DavesGarage @6:00 you are talking about the "fixed" RAM allocated to the GPU. The BIOS/UEFI "should" have an option to set the memory as "shared" or (similar meaning), where the amount of RAM is dynamically allocated between the CPU and the GPU. This is one of the reasons why people are interested in the upcoming "Strix Halo" that has a beefy GPU (and CPU), but also quad channel RAM and can be fitted with 256GB, which can be dynamically adjusted, and then eaten up by the GPU.! Please find this setting in the BIOS, change it to "dynamic" and post a video about your findings, many would be I am sure interested in such a thing. Thanks.
As always, great video Dave.
Run quantized Llama 405B on a 192GB Mac Studio. The $6.5k Mac will run circles around that $50k beast.
I just watched a video on the limits of LLM error rate as relates to parameters, performance, etc. basically the relationship is asymptotic. More is better but the relationship decreases logarithmically. I think most people won't understand how AI models are being designed for levels of complexity and ambiguity that are difficult to grasp.
They do this by having a massive number of parameters and ability to discriminate finer and finer details. These are use cases for AI to interact with humans in a visual and audio world that is absurdly complex, all while hoping to have the ability to interact with millions or billions of humans.
I have an M2 Mac and run LLM's locally using Msty locally with very good results.
I appreciate this video thanks. I don’t know how in the world wsl is considered shenanigans though.
I seriously think your show is great. It's interesting and it's entertaining. I wasn't born in the age of the computers you grew up using, but you explain it in a very good and interesting way. I think there are YouTubers that could benefit from presenting the material as well as you do. You're not just staring at a screen while we watch you do stuff.
Thank you for this, Dave!
Dave... First off, thanks for this and many other videos you have done. I am thinking my pushing the like button is going to wear out the button soon :). I am trying to wrap my brain around many things in this and have had the locally running chat GPT that you showed us try to teach me about each of the parts. I am working on understanding each piece. The one question I might have is: what is the difference between the 8 billion, 70 billion, and 405 billion parameter models as far as reliable answers go? I understand the larger ones take more horsepower, but I'm not sure "exactly" what the benefit of more parameters is. Maybe a future video explaining the intricacies of more parameters, or maybe one of the other co-patrons here would help out and try to clue me in. Either way, thanks for now, as I'm not only jealous of your information quality but also that you are retired and I am not. :)
Hey, as for RPI4 and RPI5 there are tons of models of 1B-3B size, which are pretty fast even on Raspberry PI
Think the next good video should be on how to train it on your own data. Let's say a simple MS Access local DB?
I’ll second that. It would be interesting to see what it takes to turn a database of help desk ticket problems and resolutions into an LLM which could try to answer technical questions.
Definitely! Or a collection of things, such as a bunch of emails or source code files.
This