I generally like your channel as it is super informative and concise, but this is not particularly good information. There are thousands of tasks that can be performed without needing a 70b model, and the idea that a 70b or similar is the only useful model is hampering people from actually hosting locally and improving these smaller models. Creating a hybrid where you have smaller local models and then, when required, use a large model in the cloud or use Claude, ChatGPT, etc. is a totally viable approach.
First of all, I appreciate you prefacing this in the way you did, thank you!
Secondly, your idea here of a hybrid approach with local models for some parts of your app/workloads and then a closed source model when you need the speed/power/scaling is great!
I didn't mean to portray in the video that you NEED a 70b model, I just chose to use Llama 3.1 70b as a good example because for most use cases I have encountered, the smaller models don't perform well enough for me. But that's with more advanced use cases with RAG/agentic workloads. I agree that there are 1000s of use cases that could use a smaller model like Llama 3.1 8b or even Microsoft's Phi3 Mini!
@ColeMedin Thank you for understanding my critique. You're correct, you need to know the specific use case, because some of the small models are terrible. I tried StarCoder 12b with Continue and, in comparison to Codestral etc., it's not great.
Small local models should then be compared with the cheap models from OpenAI, Google etc
I definitely agree!
@ColeMedin Correct.
Mini models are for edge devices.
Think about a small device that parses your input, trained on your personal information, and passes on anything it can't handle to a hosted larger model like Llama 3.2 90b.
Thanks for what you do man! Learning a lot.
I'm glad, it's my pleasure! :)
Wait no. I can run Llama 3.1, the 70b version, on a laptop that's 4 years old without a GPU...
Sure, it's slow. But it works. So yes, you can run them locally for not a huge price. It just takes much longer to get answers if you use the large ones.
Yeah that is fair! Depends on the use case and how long you are able to wait for responses.
Good video, but still, one Pod can't handle 3k prompts per day because you would need to consider user concurrency. If you have 10 users requesting LLM inference at the same time, you would have a long time to first token and a very slow service. So even in that edge case I think it's still better to go with Groq or another service.
Fair point! Honestly I agree that I didn't fully respect the nuances that come with the need for concurrency here... there are ways to handle it with things like queuing (as long as you can get fast enough responses so it doesn't hurt UX), and you can always scale the hardware to handle concurrency. But yeah it still makes hosting locally more complicated!
I was excited to pull this video up, and then quickly realized that some of the info presented is incorrect. I've been running my own LLM now for a few months and I even started with a basic AMD GPU. Last month I upgraded to a second Nvidia GPU and I have absolutely no problem running any of the latest LLMs. Lastly, it didn't cost me $1700 per GPU in order to do this. I think you may have gotten some bad data as far as price-to-performance goes.
Wow interesting... you gotta tell me what hardware you are running! haha
Because all I ever hear is that GPUs under $2,000 can't even run Llama 3.1 8b. You're able to run even Llama 3.1 70b with not insanely expensive graphics cards?
@ColeMedin I'm running Llama 3.1 8B with a Ryzen 5 2600X 6-core processor, 32 GB RAM, and an RTX 2080 Ti with 11GB VRAM. Works fine.
@keithhunt8 Good to know! That's still a pretty beefy setup but yeah not incredibly expensive!
I think $1000 on the GPU front is probably the least expensive way to run 8b models very smoothly. $2000+ would be more for small/mid-size businesses. I am trying to sort out what makes the most sense for our company now using the GPUs I have on hand. Great info though, thanks for your content
@jteds771 Yeah that sounds about right! And my pleasure :)
It looks like he does not know how to set it up correctly. I'm running the 8B models and they run super fast, and not even with an NVIDIA 4090: I'm running them with an AMD 7900 XTX 24GB ($900), an AMD 5950X CPU with 16 cores / 32 threads ($300), and 64GB of RAM. Let me know if you need help. Not only that, but I can send you the link to my WebUI so you can try it yourself and let me know. I think your problem is that you are not setting the models to be loaded and ready. I remember that was happening to me at the beginning of my setup, but my IT guy figured it out... Furthermore, I don't need the 70b with all those trained models out there for productivity and business. By the way, nice content!
Thank you and I appreciate the feedback here! So are you using a 4090 for your Graphics card? Or are you saying you don't even need a 4090?
8B works fast and fine with an RTX 3050, an AMD 5800X, and 64GB RAM
@@Bigjuergo Nice! Thanks for sharing your specs!
@ColeMedin You don't need a 4090, I have the 7900 XTX. It writes so fast that you can't even read. Way faster than regular ChatGPT. You have to set the model to stay in the graphics card's memory all the time.
Yeah for the 8B models that makes sense! Thanks again for sharing!
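For anyone wondering what "keeping the model in the graphics card's memory" looks like in practice, here is a minimal sketch against Ollama's REST API, which accepts a keep_alive value per request. The endpoint URL and model tag below are assumptions; adjust them to your own setup.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask(prompt: str) -> str:
    payload = {
        "model": "llama3.1:8b",  # assumed model tag; use whatever you have pulled
        "prompt": prompt,
        "stream": False,
        # keep_alive=-1 asks Ollama to keep the weights resident in (V)RAM
        # instead of unloading them after the default 5 minutes of inactivity.
        "keep_alive": -1,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask("In one sentence, why does keeping a model loaded reduce latency?"))
```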
I've been looking for a machine for LLM inference, and I'm beginning to feel like, for personal use, a Mac is, for once, a wild value. I really want to see some more testing of ~70b models on a Mac Studio M2 Ultra or M1 Ultra. You can configure them up to 192GB and 128GB RAM respectively, and because of the unified memory design, the memory speed is almost as fast as NVIDIA GPU VRAM. They're also super efficient, so the power usage is way lower than wiring up a bunch of NVIDIA GPUs. Yes, you've still got to spend at least $4k, but considering how smart the medium-size open-source models are becoming, it might really be worth it if you're also weaving the models into a bunch of automations you've set up. It really could be like having a somewhat stupid but reasonably helpful part-time employee. For a one-time payment of $4k.
This is super interesting, thank you for sharing!
A part time employee for a one time payment of $4k is definitely a steal... and with what you can do with AI agents that's certainly possible!
I probably won't get my hands on one of these Macs soon but it is tempting...
For me, the issue with local LLMs is not so much whether or not it is cost effective in terms of dollars... but an issue of quality of responses. I think this is under-emphasized as a consideration. Models like Sonnet are simply much more robust than the models that can be run locally. And in the big picture, the quality difference makes a huge difference to how well the application functions.
Yeah that's completely fair! For many use cases, local LLMs are "good enough" (whatever that means for the use case) so it's worth using them for privacy, cost, etc. But for many other use cases like you are saying, you need Claude or at least can benefit from a more powerful model like Claude Sonnet making less mistakes.
I couldn't agree with you more on this video. Congrats!
The second question that businesses always put on the table is: what about privacy?
It seems that outside of AWS, Azure, and GCP, it's hard to get a client to trust the hosting.
Thank you very much!
So are you saying that most businesses would not trust using a service like RunPod to host their LLMs locally? In my experience most businesses are fine with where the model is hosted as long as it is actually self hosted.
But if a company is only comfortable hosting with the big players like AWS you could certainly spin up an EC2 instance with a GPU in place of a service like RunPod! You'll probably just have to pay more for it.
@ColeMedin Yep, that's right, if they don't trust the hosting they don't, and working with medium-to-big companies that's what happens every day. I trust small ones like RunPod. I'm also trying crypto GPU-sharing hosting. You should try it.
Yeah that makes sense, thanks for sharing! I'll look into crypto GPU, sounds interesting! I believe I've actually heard of it already
You should try Llama 3.2 yourself on an RTX 3060. Decent and super fast.
Interesting! Which version of Llama 3.2 are you referring to?
You can already run 8B Q4 models on AMD Ryzen 7 desktop hardware without a GPU
Wow that's incredible! Thanks for sharing 🔥
God Bless you Cole, thank you for your information
Thank you very much! My pleasure :)
What is your opinion about Cerebras? According to Artificial Analysis, it's the best price-to-output-speed choice, better than Groq. Have you tried it?
I see it's $0.60 per 1M tokens, so it seems more expensive
Right it is more expensive. But from what I have seen Cerebras is pretty phenomenal too!
A used RTX 3090 24GB costs $700.
Try it.
I appreciate the suggestion! I actually am considering getting 2x 3090 GPUs for my build. Do you have a 3090 (or 2) yourself? I did indeed see they are pretty easy to get for $700.
Really useful video man - just signed up to Groq!
@@themarksmith Awesome!! Thank you!
Not true. I have an MSI laptop with 64GB RAM and one RTX 4060 and get sub-2-second responses using Llama 3.1.
The unit only cost $1700 USD.
Impressive! Which version of Llama 3.1 are you using (8b or 70b)?
Out of interest...
What about those of us who have mining rigs?
I have like 15 3080 Tis lying around, would those work pretty well for local hosting?
With these models, could we then use them to, say, integrate into a company website to help clients with technical questions etc.?
I have a client that does construction and engineering (at scale), and I'm wondering if we could set something up like this for that website idea I mentioned, to help their clients with critical tech specs and give options for issues with a different product etc...
Yes, 3080 Tis would be fantastic for local hosting! Not quite as good as 3090s though, since those have extra VRAM, but they can certainly still run a good number of local LLMs, especially if you are running 2 or more together.
Also yes - you can host these LLMs yourself on machines with your 3080TIs and then create API endpoints hosted on your machine that leverage these models to create agents for tech support or things like that. I would look into Ollama as a way to run LLMs yourself!
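To make that concrete, here is a minimal sketch of such an endpoint: a small FastAPI app running on your own machine that forwards support questions to a model served by Ollama. The port, model tag, and system prompt are placeholders for illustration, not a production setup.

```python
# Minimal sketch: expose a /support endpoint that forwards questions to a
# locally hosted model via Ollama's REST API.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "llama3.1:8b"  # assumed model tag; swap for whatever your rig runs well

class Question(BaseModel):
    text: str

@app.post("/support")
def support(question: Question) -> dict:
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a technical support assistant for construction and engineering products."},
            {"role": "user", "content": question.text},
        ],
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return {"answer": resp.json()["message"]["content"]}

# Run with: uvicorn support_api:app --host 0.0.0.0 --port 8000
```

The company website would then call this endpoint instead of a cloud API, so the traffic never leaves your own hardware.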
I tried it. The problem is that mining rigs don't care about PCIe bandwidth, and that's the bottleneck.
Llama 3.1 is running blazingly fast on my midrange PC. Not the 405B and not the 70B. But the 8B is pretty nice and lightning fast.
Edit: I am right now building an AI server from old spare parts running a 1080TI - and Llama 3.1 and Llama 3.2 run very well.
I therefore do not quite understand the point of this video if you can get hold of a decent legacy GPU with sufficient vram.
Fair point! The 8b models are great for a lot of things, but for many use cases, especially agentic ones, I've found that I need at least a 70b model for the LLM to handle function calling well. For the use cases that can use 8b models then this video doesn't apply well, you are definitely right there. And that is a critique that I appreciate a lot and wish I had discussed in this video!
One of the things I've looked at from very early on was LLM size and the constraints around local resources. We all need to keep in mind that we are already connected. We could use products or tech like PETALS (a torrent-based LLM), or things like clustering groups, or for that matter we could start collectively creating DAOs that handle exactly these types of things. Think about a new form of library; hell, even our existing libraries in the US could be used this way. Digitized LLMs served from each community. Open and free for public access. There are so many ways we can approach this, as opposed to trusting large corps to control this tech.
We have to acknowledge this tech is not new, just new to us.
Very interesting thoughts, thank you for sharing! I've never heard of the idea of creating a DAO to share resources for an LLM but I like it!
@ColeMedin I will keep reminding you that I'm old and crazy. Let the things that make sense stick and the rest, ehh... ;)
Haha fair enough! 😂 I appreciate you sharing!
I'm running an 8B local LLM without any issues... But I do have a super fast MacBook Pro M3 😊
Yeah that's awesome Jason! Have you tried Llama 3.2 11b as well? I know it's "only" 3b more but I'm curious what your performance for that would be too!
Small ones like 7b and below are great for simple things. They run fast on my laptop's 1660 Ti 6GB. But for coding, the quality is not great at 7b and below.
Yes I agree! You could maybe try the smaller models of Qwen 2.5 Coder, but yeah, generally only the bigger models do well with coding.
github.com/QwenLM/Qwen2.5-Coder
Your videos are fantastic
Thank you very much MK, I appreciate it a lot!
Just choose a suitable one based on your current system specs. Not every company gives free internet access to every employee, so a local LLM is the better solution.
I agree! With a lot of companies you have to go local because they block the closed source models. The main trouble, though, is that with your current system specs you might not be able to run a model powerful enough for your use case. Llama 3.1 8b and similar models can't really do function calling well.
Hi, my question is how many prompt requests an A40 can handle. Groq can scale indefinitely so there will be no lag, but an A40 will start to slow down at some point. What do you think?
@ApoorvKhandelwal Yes, great question! It all depends on the model size. For 8b models, for example, an A40 could handle many requests at once. But for a 70b model it would start to get slower when you have more than a couple of requests being handled at once. Groq scaling infinitely is definitely a plus so you don't have to worry about that!
Imagine sharing a server with multiple people, so the load is constant and everyone has lower costs. To share the cost, we could use the amount of input and output the model generates. To make it accessible, we could put that server on the internet. You could even combine it on demand with other models to use what is best for each use case not just locking into one model or vendor architecture.
That would be so smart…. the ultimate local LLM. Oh wait.
Haha touché... 😂
A lot of businesses I work with/hear about need to have all their data private for compliance reasons hence they really do need an LLM hosted locally. But for a majority of use cases, the "ultimate local LLM" is best!
@ColeMedin Yes, and they discuss that stuff in Teams, and collect the data themselves under often weaker data protection terms of service compared to the cloud vendors. I know, I know. I have these conversations too.
@dinoscheidt Bravo! It's ridiculously hard to properly do security, privacy, and encryption at scale! Anyone who professes to want better security or data privacy by running it themselves is making a complete joke.
Of course things could be simpler (more vulnerable), cheaper (less resilient), and appear faster (fewer compliance processes in place) in an "arm's reach data center", but it's not an apples-to-apples comparison.
Running a local Nvidia setup is great, until your business loses money at a rate of millions per unit of time if the system is down, or is liable for data security lawsuits. Basically: don't run anything important and mission-critical, like an actual airline booking system, an actual flight controller, a real AdWords agency, a bank, or a real stock trading backend.
I'm running Llama 3.1 8B with Ollama on my AI system with the following hardware: an i7 7700 with 64GB RAM, an RTX 3060 GPU with 12GB of VRAM, and a 2TB NVMe drive. Responses usually come within 3 seconds. It works like a charm, doing RAG and running several agents.
That's awesome, thank you for sharing! Do you mind sharing a specific use case you are using the 8b parameter version for? That's impressive you're able to use 8b for agents! I've found I need the 70b version for most agentic tasks, so I'm also curious if you've done any fine tuning for function calling?
Thank you for building my knowledge on LLMs! I have found each and every one of your videos to be extremely valuable and I subscribed to stay updated, keep up the good job! One question: what is the mental model to calculate the necessary RAM per LLM parameter count? Mine was "1 byte for 1 param", so 70B params would translate to 70 GB of RAM needed to run the model. But you say 48GB is enough for a 70B model? What am I missing here? Is it regular RAM or GPU RAM? Can someone explain?
Parameters-to-GB RAM is a somewhat helpful rule-of-thumb that puts you in the vicinity, but you tend to need a fair percent more GB RAM than the parameters count to run an unquantized model, and quite a bit less to run quantized models. High quality quants are nearly as good as the original model but will save you a lot of RAM. That being said, being able to run a model and being able to run a model at high speeds are two different things.
I'm having a bit of trouble, as I consider an inference machine, determining exactly how important the actual processing power of the GPU is. It's important, but maybe less so than I used to think. RAM really is one of the very biggest concerns for LLM inference, and, as far as I can tell, the primary reason that we typically depend on a GPU's VRAM for AI inference basically boils down to the fact that it has a much higher bandwidth than system RAM, and is therefore a lot faster than system RAM.
But Apple Silicon, because of its unique unified memory design, has bandwidth almost as high as a consumer NVIDIA GPU, so I'm starting to realize they may actually be a bargain for inference. You can get an M1 Ultra Mac Studio with 128GB RAM for ~$4k, for example. While the RAM speed and GPU processing power won't be *quite* as high as an NVIDIA GPU, and while the Mac RAM that's available to the GPU might be closer to like ~96GB, you'd have to spend ~$2800 minimum JUST on the GPUs (used 3090s) to get equivalent VRAM in consumer NVIDIA GPUs. The latter setup would also be incredibly bulky, incredibly hot, and *extremely* power-hungry at a rate like 14 or 15x the Mac Studio. And you'd need kind of a wild PC setup to run all those GPUs. If you wanted to get more power-efficient NVIDIA server GPUs, you'd have to spend like $10k, and you'd *still* be using 5 or 6 times the electricity.
The best alternative to a Mac Studio that I've been able to determine is to get a Threadripper 7000-series CPU and a mobo with 8 RDIMM RAM slots. The Threadripper supports DDR5, so you could be running at RAM speeds a good bit higher than most systems and you could get a ton of it with that many slots, but you're still limited to 5200 MT/s, which as far as I can determine would still put you at like 40-something% of the Mac Studio bandwidth and about 1/3 of an NVIDIA GPU.
I wish I could find more tests of people running ~70b LLMs on M-Ultra hardware to confirm what I've been learning. There's surprisingly not much of anything I can find; most of the tests are on M3 Max or pretty pointless tests of useless 8b models.
Anyway, tl;dr, I'm no expert, but I think slow speeds for local LLM models typically have more to do with available high-bandwidth memory than with GPU processing power, and Apple's M-chips have some very high-speed memory on them at very efficient power usage and at an affordable price, in comparison to GPUs.
BTW, as a more straightforward answer to your question, you can go look at the model cards on Hugging Face to see how much RAM is required for different quant sizes. The amount of RAM you need is basically the model size (in GB of storage, not parameters) plus another 1 or 2 gigs for overhead, as far as I've been able to determine
Thank you Sergey - I appreciate it a lot!
I liked @justtiredthings' thoughts a lot, and generally the "1 byte per param" idea is good if you want really good speeds. 48GB of VRAM is more of a good starting point with a 70B parameter model that will work for a lot of agentic use cases that don't need insanely fast speeds. But if you are trying to build an application that needs high tokens/second then yeah, you'd want at least 70GB of VRAM for a model like Llama 3.1 70B. I hope that makes sense!
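To put rough numbers on that, here is a small back-of-the-envelope estimator. The bytes-per-parameter values and the overhead constant are common rules of thumb, not exact figures for any particular runtime, so treat the output as a ballpark only.

```python
# Rough (V)RAM estimate: parameters * bytes-per-parameter, plus overhead for
# the KV cache and runtime. Real usage varies with context length and backend.
BYTES_PER_PARAM = {
    "fp16": 2.0,  # unquantized half precision
    "q8": 1.0,    # ~8-bit quantization
    "q4": 0.5,    # ~4-bit quantization (e.g. Q4_K_M GGUF)
}

def estimate_gb(params_billions: float, quant: str, overhead_gb: float = 2.0) -> float:
    return params_billions * BYTES_PER_PARAM[quant] + overhead_gb

for quant in ("fp16", "q8", "q4"):
    print(f"70B @ {quant}: ~{estimate_gb(70, quant):.0f} GB")
# fp16 -> ~142 GB, q8 -> ~72 GB, q4 -> ~37 GB, which is why a quantized 70B
# model can fit in roughly 48GB of VRAM while the unquantized one cannot.
```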
@justtiredthings Thanks for your detailed explanation! You can't imagine how valuable this info is for me!
@SergeyNeskhodovskiy One thing I do want to correct, because I think I perpetuated a bit of misinfo: I think memory transfer for the Threadripper or any DDR5 setup would actually be much, much slower than the M-Ultra or an Nvidia GPU, like maybe 5-7% the speed, something like that. I was getting gigabytes per second and gigabits per second mixed up while trying to figure out the bandwidth. Sorry about that.
That means when we remove the Llama service from the n8n self-hosting kit and use Groq instead, we will save a lot of time installing and running the n8n AI agent locally. If you agree, I hope you make a video about this.
Yes that is true! Though the advantage of using Ollama is it is fully local. With Groq you can use open source models, but it's still not running them locally. There certainly is a time and place to use Groq though, don't get me wrong!
Buy refurbished or build your own. I built an LLM PC for $500. It has a 1080 in it, which runs 22b models at a decent speed, not lightning fast, but good enough. At smaller than 22b there's only a slight difference in speed from OpenAI or Anthropic.
That's super impressive! What kinds of speeds do you get for something like Llama 3.1 8b?
@ColeMedin I'm trying to figure out a way to give you an accurate answer but am coming up short. Do you know of a way? Might just record a GIF or something.
To be honest I'm not sure how exactly to do this, but I know people have a way to measure the tokens per second they get when running models locally.
@ColeMedin Found it: pass `--verbose` to `ollama run <model>` and it prints the timing stats.
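If you'd rather script the measurement, Ollama's HTTP API also reports eval_count and eval_duration (in nanoseconds) in the final response, so you can compute tokens per second yourself. A minimal sketch, assuming Ollama on its default port and a model tag you have already pulled:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Explain GPU offloading in one paragraph.", "stream": False},
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/s")
```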
? ? ?
This take is actually insane, I literally see regular people run AI models locally constantly in communities I'm a part of.
- Timeouts are not a problem of the LLM; that's a problem of your setup, and is related to web infrastructure. There's any number of solutions to this. Literally just ask Claude or GPT4 how to fix it, lol.
- 7B, 8B, 10.3B and even ~22B models are not expensive to run. If you try to run them natively at FP16, they might be prohibitively expensive, in the sense that you're looking at 14GB, 16GB, 20.6GB, and 44GB of RAM to load them respectively. That's crazy, because that's at 0 context, so before the conversation starts, and the attention mechanism can actually use more memory than the weights at ultra high context.
- But that's not how people actually run AI models locally. With a 16GB GPU, you could load this in Ollama or llama.cpp with as many layers on the GPU as will fit, and then offload the rest to the CPU, which incurs a slight speed penalty but lets you run the model reasonably fast, and potentially works out favorably for short conversations because you don't have cloud latency (see the sketch after this list).
- Quantization. Did you try running models quantized? That's a huge game changer. They essentially lop off the latter half of the numbers on each weight so they take up less RAM. GGUF is probably the easiest to use, and supports variable bit widths, so at Q6_K I can comfortably run Mistral Small (really good model, btw, it's about 22B parameters) on my setup with some CPU offloading, and the speed isn't amazing, but is fine, and "free" in the sense that I already bought the computer. A lot of people have had great success with 10.3-13B models quantized to 8bit quants, like int8, GGUF Q8, exl, etc, which are really only around 14GB of VRAM to run, usually, which is totally achievable with reasonably priced GPUs (at least on Linux, I'm not sure how much VRAM Windows uses), or can be run with an 8GB GPU and some light offloading onto CPU, with extremely affordable GPUs.
- Agents. I think the biggest possible use case for local LLMs is AI agents (outside of entertainment purposes like LLMs as a roleplay partner); once you get a sufficiently robust architecture for them built up, they really don't need you to provide input at every stage of their work, so you can run them passively in the background while you're doing other stuff. The raw speed doesn't really matter, so these can honestly run on CPU for all you care, or a spare PC, or a Raspberry Pi cluster, etc. If you don't really need to check in with your model for an hour or so, the actual token speed probably doesn't matter that much. On the other hand, the price per token may actually matter quite a bit, and even running a huge model on a consumer CPU (i.e. Qwen 2.5 72B) at extremely low speeds could still add up to really quite a lot of tokens that you would need to pay for normally (~0.3-2 T/s works out to around 1,080-7,200 tokens per hour, depending on your CPU, quantization, etc), and I can definitely say that if you're looking at a week of operation of an agent there comes a point where you don't really want to pay for those tokens.
- Customization. There's reasons you might want to customize a model for your use case, or use an off-the-shelf customized LLM for something really specific you need. You can get great results with 7B models fine tuned specifically for handling SQL, or you can train a LoRA to encourage your coding style to be used, and any other number of considerations. Plus, the quantity of models available on Huggingface is absolutely insane, and there's almost certainly at least one you will find interesting hosted on there for some purpose. You miss out on a lot of this using the ultra fast/cheap hosts like Groq, or Cerebras, or Sambanova.
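Here is a minimal sketch of the partial-offload idea mentioned above, using the llama-cpp-python bindings with a quantized GGUF file. The file name and layer count are assumptions you would tune to your own GPU's VRAM.

```python
from llama_cpp import Llama

# Load a quantized GGUF model, putting a chosen number of transformer layers
# on the GPU and leaving the remainder on the CPU. Lower n_gpu_layers if you
# hit out-of-memory errors; -1 tries to offload every layer.
llm = Llama(
    model_path="./mistral-small-q6_k.gguf",  # hypothetical local file
    n_gpu_layers=30,
    n_ctx=4096,  # context window; the KV cache grows with this
)

out = llm("Q: What does partial GPU offloading trade away?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```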
I think the only people who need to hear the "hard" truth on hosting LLMs are corporate business types trying to cash in on the AI hype without any technical experience because they want to use the "AI thing", too. There's any number of solutions to any of the problems that were listed in this video, and it's a little awkward to know that there are people out there who might be swayed from a lot of the really interesting benefits of running AI locally by this opinion.
I am running 9-11B models on a 2022 laptop with an NVIDIA 3070 and an Intel i9 12900H with 40 GB of RAM, and it is fast.
It's laptop hardware you can get for under $1k today, all up.
That's awesome! What kinds of use cases are you able to use the smaller 8b-11b models for?
@ColeMedin I am using codeqwen:1.5-chat as a coding assistant. I am also using llama3.2:3b as a quick writing assistant.
That's awesome, thanks for sharing!
You are smart with all these tutorials, but why not tell everyone you can offload everything to RAM? No need for 8x $30k H100 GPUs these days.
Thanks for the compliment, and could you tell me more? From what I know you can't offload to ram but I'm curious what you're thinking!
Interesting that OpenAI has this kind of competition when it comes to enterprise-level use; they might be cooked, lowkey.
@@adomakins haha at some point they will definitely be 😂
Are you sure about Groq being the fastest and most efficient? I thought Cerebras has been dominating for a while now.
There are definitely some competitors out there but Groq is typically considered to be on top! Not saying it's entirely true but that seems to be the general consensus from what I have seen.
I am seeing benchmarks though that show Cerebras is better in terms of speed. It seems the pricing is a bit worse but not by much. Super interesting! I'll have to check out Cerebras more, I've only really dug into it a bit in the past.
With a Mac M3 Max's memory you can run pretty big LLMs.
Nice!! How big you talking?
@ColeMedin You can run the 70B models, which is mindblowing to have on a laptop. A jailbroken 70b model pretty much gives you all of human knowledge to access with zero need for the internet. You probably know this already. It blows my mind just how powerful these things can be if you custom train them. ❤️❤️❤️ What a time to be alive.
Wow, that is insane!! It sure is quite the time to be alive haha
@ColeMedin I actually saw someone running a 400B model on two parallel-connected, maxed-out M3 laptops yesterday. Yes, you read that correctly. You can run a four-hundred-billion-parameter model locally, but it's gonna cost you two maxed-out M3 MacBook Pros. It is not cheap.
Pretty wild!
Wow I actually can't believe it haha, that is incredible!
You forget something: if you are running a proper app you will likely need a higher rate limit. Local hosting or a rented GPU gets you one install that can run 1 request at a time, whereas on Groq or OpenRouter I get a 200-per-second rate limit, so local does not work for anything but a simple personal app. You can't use it for a multi-user app or anything powerful. Not so? This is the damn problem. Or am I wrong on the rate limit for a local install? I really want to be wrong.
Yes this is a very fair point! I agree that I didn't fully respect the need for concurrency in this video... something I am still learning how to handle well myself!
With RunPod, you can certainly handle more than one request at once! But it does require better hardware or going with a model with fewer parameters, so it's not easy. You can also implement queuing for your LLM calls as long as you are getting fast enough responses that it doesn't hurt UX. Lots of nuances here, but overall Groq does make this MUCH easier as long as you are okay not having the LLM hosted yourself.
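As a concrete example of the queuing idea, here is a minimal sketch that caps how many requests hit a self-hosted model at once, with the rest waiting their turn. The endpoint URL, model tag, and concurrency limit are placeholders, and a real app would add retries and timeouts tuned to its own UX budget.

```python
import asyncio
import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint
MAX_CONCURRENT = 2  # how many requests the GPU can comfortably serve at once

semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def ask(client: httpx.AsyncClient, prompt: str) -> str:
    # Requests beyond MAX_CONCURRENT wait here instead of overloading the GPU.
    async with semaphore:
        resp = await client.post(
            OLLAMA_URL,
            json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]

async def main() -> None:
    prompts = [f"User {i}: summarize queuing in one sentence." for i in range(10)]
    async with httpx.AsyncClient() as client:
        answers = await asyncio.gather(*(ask(client, p) for p in prompts))
    print(f"{len(answers)} answers received")

asyncio.run(main())
```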
Great vid - many thanks! Love seeing a fellow groqster. I'm on this very journey right now!
Thank you!! :)
Good luck on your LLM journey, I'd love to hear how it goes for you!
Llama 3.2 1B and 3B models just dropped... the larger ones have vision now.
I know it's exciting! :D
GPT-4o mini is way cheaper and better, and OpenAI doesn't use your API data, or at least the risk is exactly the same as with Groq; both are external companies, so you can't be sure.
Yeah gpt-4o-mini is a good option if you never have the need to host an LLM yourself and want something cheap! The main reason you would use Groq and a model like Llama 3.1 8/70b is because you want to eventually switch to hosting a model yourself and don't want to entirely switch your LLM when you do so.
As far as the risk of sending data to OpenAI vs Groq, a lot of it comes down to the level of mistrust that people have with closed source models, especially GPT. But I totally respect that that is up for debate.
You run it on a rented server with a GPU... it's cheaper by a long way, just tunnel in.
Maybe more details
But what's the rate limit? You can only do 1 run at a time, not 250 per second; any real app is going to need a high rate limit.
Something like a RunPod instance is a rented server with a GPU, or are you thinking of something different? I would appreciate more details as well!
Dude, Llama 3.1 function calling sucks. If any part of your app uses functions, expect it to fail a fair bit more than GPT.
I wish I could switch over to llama, it just doesn't make sense at the moment.
Let's hope llama 3.2 is better
Can you explain a little more? Have not used function calling with llama 3.1 but am interested in it. Is there something particular that just doesn‘t work? Thank you
Man, I feel like the Qwen 2.5 drop is being egregiously ignored. It's smashed Llama in benchmarks, and it's doing pretty damn impressively in my own prodding. It should have been the biggest news of the week; we need more content putting it to the test.
All they did with Llama 3.2 90b is add vision. The text benchmarks are exactly the same. You could have a better text model in Qwen 2.5 and a comparable vision model in Pixtral running simultaneously at a lower RAM requirement overall
Yeah totally fair! I've also not had the best experience with Llama function calling. I'm going to be playing around a lot with Llama 3.2 function calling on Groq, I hope it's much better too!
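For anyone curious what is actually being tested here, this is roughly what a function-calling request looks like against Groq's OpenAI-compatible endpoint. A minimal sketch: the tool definition is hypothetical, the model ID is a placeholder (Groq's available model names change), and it assumes a GROQ_API_KEY environment variable.

```python
import json
import os
from openai import OpenAI

# Groq exposes an OpenAI-compatible API, so the standard client works.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool for illustration
        "description": "Look up the status of a customer order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # placeholder model ID
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments come back as JSON text;
# smaller models are exactly where this step tends to fail or come back malformed.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```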
Thank you for mentioning Qwen 2.5, I agree it's being missed out on! I am definitely considering making a video on it very soon.
No, an RTX 3080 gives awesome performance.
Tell me more! What models are you running and what speeds are you getting?
AirLLM can run anything on a 3090, even 405b; you are outdated.
Air LLM seems awesome! I will certainly have to look into it. Have you tried it before?
waste of time, you can never achieve the results of the big boys.
It certainly depends on the use case! There are a ton of use cases where a local LLM works great, and sometimes even better with fine tuning!
Pesto for president
In your dreams
Seems pretty wrong.
Could you clarify what you think is wrong?