The HARD Truth About Hosting Your Own LLMs

  • Published Nov 17, 2024

Comments • 149

  • @TheChidmas
    @TheChidmas a month ago +21

    I generally like your channel as it is super informative and concise, but this is not particularly good information. There are 1000s of tasks that can be performed without needing a 70b model; the idea that a 70b etc. is the only useful model is hampering people from actually hosting locally and improving these smaller models. Creating a hybrid where you have smaller local models and then, when required, use a large model in the cloud or Claude, ChatGPT etc. is a totally viable model.

    • @ColeMedin
      @ColeMedin a month ago +6

      First of all, I appreciate you prefacing this in the way you did, thank you!
      Secondly, your idea here of a hybrid approach with local models for some parts of your app/workloads and then a closed source model when you need the speed/power/scaling is great!
      I didn't mean to portray in the video that you NEED a 70b model, I just chose to use Llama 3.1 70b as a good example because for most use cases I have encountered, the smaller models don't perform well enough for me. But that's with more advanced use cases with RAG/agentic workloads. I agree that there are 1000s of use cases that could use a smaller model like Llama 3.1 8b or even Microsoft's Phi3 Mini!

    • @TheChidmas
      @TheChidmas a month ago +1

      @@ColeMedin thank you for understanding my critique. You're correct, you need to know the specific use case, as some of the small models are terrible. I tried StarCoder 12b with Continue and in comparison to Codestral etc. it's not great

    • @cristiandarie
      @cristiandarie a month ago +2

      Small local models should then be compared with the cheap models from OpenAI, Google, etc.

    • @ColeMedin
      @ColeMedin a month ago

      I definitely agree!

    • @intrestingness
      @intrestingness a month ago +1

      @@ColeMedin Correct.
      Mini models are for edge devices.
      Think about a small device that parses your input, trained on your personal information, and passes on anything it can't handle to a hosted larger model like Llama 3.2 90b

  • @stormkingdigital
    @stormkingdigital 23 days ago +2

    Thanks for what you do man! Learning a lot.

    • @ColeMedin
      @ColeMedin 20 days ago

      I'm glad, it's my pleasure! :)

  • @serikazero128
    @serikazero128 a month ago +2

    Wait, no. I can run Llama 3.1, the 70b version, on a laptop that's 4 years old without a GPU...
    Sure, it's slow. But it works. So yes, you can run them locally for not a huge price. It just takes much longer to answer if you use the large ones

    • @ColeMedin
      @ColeMedin a month ago

      Yeah that is fair! Depends on the use case with how long you are able to wait for responses.

  • @reynoldoramas3138
    @reynoldoramas3138 a month ago +4

    Good video, but still, only one Pod can't handle 3k prompts per day because you would need to consider user concurrency: if you have 10 users requesting LLM inference at the same time, you would have a long time to first token and a very slow service. So even in that edge situation I think it is still better to go with Groq or another service.

    • @ColeMedin
      @ColeMedin a month ago +2

      Fair point! Honestly I agree that I didn't fully respect the nuances that come with the need for concurrency here... there are ways to handle it with things like queuing (as long as you can get fast enough responses so it doesn't hurt UX), and you can always scale the hardware to handle concurrency. But yeah it still makes hosting locally more complicated!
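
      For anyone curious, here's a minimal sketch of that queuing idea (assuming an OpenAI-compatible endpoint exposed by your self-hosted model, e.g. Ollama's; the URL, model name, and concurrency limit are just illustrative, not from the video):

        import asyncio
        import httpx

        LLM_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible route (assumed setup)
        MAX_CONCURRENT = 2  # how many requests your GPU can comfortably serve at once

        semaphore = asyncio.Semaphore(MAX_CONCURRENT)

        async def ask(client: httpx.AsyncClient, prompt: str) -> str:
            # The semaphore queues excess requests instead of overloading the GPU.
            async with semaphore:
                resp = await client.post(LLM_URL, json={
                    "model": "llama3.1:70b",
                    "messages": [{"role": "user", "content": prompt}],
                }, timeout=120)
                resp.raise_for_status()
                return resp.json()["choices"][0]["message"]["content"]

        async def main():
            prompts = [f"Summarize ticket #{i}" for i in range(10)]
            async with httpx.AsyncClient() as client:
                answers = await asyncio.gather(*(ask(client, p) for p in prompts))
            for answer in answers:
                print(answer[:80])

        asyncio.run(main())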

  • @crypto_que
    @crypto_que a month ago +4

    I was excited to see this video pop up, and then quickly realized that some of the info presented is incorrect. I've been running my own LLM now for a few months and I even started with a basic AMD GPU. Last month I upgraded to a second Nvidia GPU and I have absolutely no problem running any of the latest LLMs. Lastly, it didn't cost me $1700 per GPU to do this. I think you may have gotten some bad data as far as prices go for price to performance.

    • @ColeMedin
      @ColeMedin a month ago

      Wow interesting... you gotta tell me what hardware you are running! haha
      Because all I ever hear is that GPUs under $2,000 can't even run Llama 3.1 8b. You're able to run even Llama 3.1 70b with not insanely expensive graphics cards?

    • @keithhunt8
      @keithhunt8 a month ago

      @@ColeMedin I'm running Llama 3.1 8B with a Ryzen 5 2600X 6-core processor, 32 GB RAM, and an RTX 2080 Ti with 11GB VRAM. Works fine.

    • @ColeMedin
      @ColeMedin a month ago

      @keithhunt8 Good to know! That's still a pretty beefy setup but yeah not incredibly expensive!

    • @jteds711
      @jteds711 a month ago

      I think $1000 on the GPU front is probably the least you can spend to run 8b models very smoothly. $2000+ would be more for small/mid-size businesses. I am trying to sort out what makes the most sense for our company now using what GPUs I have on hand. Great info though, thanks for your content

    • @ColeMedin
      @ColeMedin a month ago

      @jteds711 yeah that sounds about right! And my pleasure :)

  • @erasmo84
    @erasmo84 a month ago +7

    It looks like he does not know how to set it up correctly. I'm running the 8B models and they run super fast, and not even with an NVIDIA 4090 - I'm running it with an AMD 7900 XTX 24GB ($900), a CPU AMD 5950X 16 cores / 32 threads ($300), and 64GB RAM. Let me know if you need help. Not only that, but I can send you the link to my WebUI so you can try it yourself and let me know. I think your problem is you are not setting the models to be loaded and ready. I remember that was happening to me at the beginning of my setup, but my IT guy figured it out... Furthermore, I don't need the 70b with all those trained models out there for productivity and business. By the way, nice content!

    • @ColeMedin
      @ColeMedin a month ago

      Thank you and I appreciate the feedback here! So are you using a 4090 for your Graphics card? Or are you saying you don't even need a 4090?

    • @Bigjuergo
      @Bigjuergo a month ago

      8B works fast and fine with an RTX 3050 and AMD 5800X, 64GB RAM

    • @ColeMedin
      @ColeMedin a month ago

      @@Bigjuergo Nice! Thanks for sharing your specs!

    • @erasmo84
      @erasmo84 a month ago

      @@ColeMedin You don't need a 4090, I have the 7900 XTX. It writes so fast that you can't even read. Way faster than regular ChatGPT. You have to set the model to stay in the graphics card's memory all the time.

    • @ColeMedin
      @ColeMedin a month ago

      Yeah for the 8B models that makes sense! Thanks again for sharing!

  • @justtiredthings
    @justtiredthings a month ago +1

    I've been looking at machines for LLM inference, and I'm beginning to feel like, for personal use, Mac is, for once, a wild value. I really want to see some more testing of ~70b models on a Mac Studio M2 Ultra or M1 Ultra. You can configure them up to 192GB and 128GB RAM respectively, and because of the unified memory design, the memory speed is almost as fast as NVIDIA GPU VRAM. They're also super efficient, so the power usage is way lower than wiring up a bunch of NVIDIA GPUs. Yes, you've still got to spend at least $4k, but considering how smart the medium-size open-source models are becoming, it might really be worth it if you're also weaving the models into a bunch of automations you've set up. It really could be like having a somewhat stupid but reasonably helpful part-time employee. For a one-time payment of $4k.

    • @ColeMedin
      @ColeMedin a month ago +1

      This is super interesting, thank you for sharing!
      A part time employee for a one time payment of $4k is definitely a steal... and with what you can do with AI agents that's certainly possible!
      I probably won't get my hands on one of these Macs soon but it is tempting...

  • @marka5215
    @marka5215 a month ago

    For me, the issue with local LLMs is not so much whether or not it is cost effective in terms of dollars... but an issue of quality of responses. I think this is under-emphasized as a consideration. Models like Sonnet are simply much more robust than the models that can be run locally. And in the big picture, the quality difference makes a huge difference to how well the application functions.

    • @ColeMedin
      @ColeMedin a month ago

      Yeah that's completely fair! For many use cases, local LLMs are "good enough" (whatever that means for the use case) so it's worth using them for privacy, cost, etc. But for many other use cases like you are saying, you need Claude or at least can benefit from a more powerful model like Claude Sonnet making less mistakes.

  • @tecnopadre
    @tecnopadre a month ago +1

    I couldn't agree more with you on this video. Congrats.
    The second question that businesses always put on the table is: what about privacy?
    It looks like outside of AWS, Azure and GCP, it's hard to get a client to trust it.

    • @ColeMedin
      @ColeMedin a month ago +2

      Thank you very much!
      So are you saying that most businesses would not trust using a service like RunPod to host their LLMs locally? In my experience most businesses are fine with where the model is hosted as long as it is actually self hosted.
      But if a company is only comfortable hosting with the big players like AWS you could certainly spin up an EC2 instance with a GPU in place of a service like RunPod! You'll probably just have to pay more for it.

    • @tecnopadre
      @tecnopadre a month ago +1

      @@ColeMedin Yep, that's right: if they don't trust the hosting, they don't, and working with medium-to-big companies that's what happens every day. I trust small ones like RunPod. Also trying crypto GPU-sharing hosting. You should try it.

    • @ColeMedin
      @ColeMedin a month ago +2

      Yeah that makes sense, thanks for sharing! I'll look into crypto GPU, sounds interesting! I believe I've actually heard of it already

  • @RagdollRocket
    @RagdollRocket a month ago +2

    You should try 3.2 yourself on an RTX 3060. Decent and super fast

    • @ColeMedin
      @ColeMedin a month ago +1

      Interesting! Which version of Llama 3.2 are you referring to?

  • @Techonsapevole
    @Techonsapevole a month ago

    You can run 8B Q4 models on AMD Ryzen 7 desktop hardware right now without a GPU

    • @ColeMedin
      @ColeMedin a month ago

      Wow that's incredible! Thanks for sharing 🔥

  • @Jpfabulosity
    @Jpfabulosity a month ago +1

    God Bless you Cole, thank you for your information

    • @ColeMedin
      @ColeMedin a month ago

      Thank you very much! My pleasure :)

  • @McAko
    @McAko 29 days ago

    What is your opinion about Cerebras? According to Artificial Analysis, it's the best price-to-output-speed choice, better than Groq. Have you tried it?

    • @McAko
      @McAko 29 days ago

      I see it's $0.60 per 1M tokens, so it seems more expensive

    • @ColeMedin
      @ColeMedin 27 days ago +1

      Right it is more expensive. But from what I have seen Cerebras is pretty phenomenal too!

  • @888felipe
    @888felipe a month ago +2

    An RTX 3090 24GB used costs $700.
    Try it.

    • @ColeMedin
      @ColeMedin a month ago

      I appreciate the suggestion! I actually am considering getting 2x 3090 GPUs for my build. Do you have a 3090 (or 2) yourself? I did indeed see they are pretty easy to get for $700.

  • @themarksmith
    @themarksmith a month ago +1

    Really useful video man - just signed up to Groq!

    • @ColeMedin
      @ColeMedin a month ago

      @@themarksmith Awesome!! Thank you!

  • @mlg4035
    @mlg4035 a month ago

    Not true. I have an MSI laptop with 64GB RAM and one RTX 4060 and get sub-2-second responses using Llama 3.1.
    The unit only cost $1700 USD.

    • @ColeMedin
      @ColeMedin a month ago

      Impressive! Which version of Llama 3.1 are you using (8b or 70b)?

  • @TheOriginalMigz
    @TheOriginalMigz a month ago +1

    Out of interest...
    What about us guys who have mining rigs?
    I have like 15 3080 Tis lying around; would that work pretty well for local hosting?
    With these models, could we then, say, integrate them into a company website to help clients with technical help etc.?
    I have a client that does construction and engineering (at scale), and I'm wondering if we could set something up like this for that website idea I mentioned, to help their clients with critical tech specs and give options for issues with a different product etc...

    • @ColeMedin
      @ColeMedin a month ago

      Yes, 3080 Tis would be fantastic for local hosting! Not nearly as good as the 3090s though, because of their extra VRAM, but they can certainly still run a good number of local LLMs, especially if you are running 2 or more together.
      Also yes - you can host these LLMs yourself on machines with your 3080 Tis and then create API endpoints hosted on your machine that leverage these models to create agents for tech support or things like that. I would look into Ollama as a way to run LLMs yourself!
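
      For anyone wanting to try that, here's a rough sketch of such an endpoint (assuming Ollama is running on its default port; the model name, route, and prompt are placeholders, not something from the video):

        # Minimal tech-support endpoint backed by a locally hosted model via Ollama
        import requests
        from fastapi import FastAPI
        from pydantic import BaseModel

        app = FastAPI()

        class Question(BaseModel):
            text: str

        @app.post("/support")
        def support(q: Question):
            # Ollama's native generate endpoint on its default port 11434
            r = requests.post("http://localhost:11434/api/generate", json={
                "model": "llama3.1:70b",  # placeholder - use whatever model your GPUs can hold
                "prompt": f"You are a product support assistant for a construction firm.\n\nQuestion: {q.text}",
                "stream": False,
            }, timeout=300)
            r.raise_for_status()
            return {"answer": r.json()["response"]}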

    • @Kartratte
      @Kartratte a month ago

      I tried it. Problem is: miners don't care about PCIe bandwidth, and that's the bottleneck.

  • @themorethemerrier281
    @themorethemerrier281 a month ago

    Llama 3.1 is running blazingly fast on my midrange PC. Not the 405B and not the 70B. But the 8B is pretty nice and lightning fast.
    Edit: I am right now building an AI server from old spare parts running a 1080 Ti - and Llama 3.1 and Llama 3.2 run very well.
    I therefore do not quite understand the point of this video if you can get hold of a decent legacy GPU with sufficient VRAM.

    • @ColeMedin
      @ColeMedin a month ago

      Fair point! The 8b models are great for a lot of things, but for many use cases, especially agentic ones, I've found that I need at least a 70b model for the LLM to handle function calling well. For the use cases that can use 8b models then this video doesn't apply well, you are definitely right there. And that is a critique that I appreciate a lot and wish I had discussed in this video!

  • @redneq
    @redneq a month ago

    One of the things I've looked at from a very early day was LLM size and the constraints around local resources. We all need to keep in mind that we are already connected. Using products or tech like PETALS (torrent-based LLM serving), or clustering groups, or for that matter we could start collectively creating DAOs that handle exactly these types of things. Think about a new form of a library - hell, even our existing libraries in the US could be used this way. Digitized LLMs served from each community. Open and free for public access. So many ways we can approach this, opposite of trusting large corps to control this tech.
    We have to acknowledge this tech is not new, just new to us.

    • @ColeMedin
      @ColeMedin a month ago

      Very interesting thoughts, thank you for sharing! I've never heard of the idea of creating a DAO to share resources for an LLM but I like it!

    • @redneq
      @redneq a month ago

      @@ColeMedin I will keep reminding you that I'm old and crazy. Let those things that make sense stick and the rest ehh... ;)

    • @ColeMedin
      @ColeMedin a month ago

      Haha fair enough! 😂 I appreciate you sharing!

  • @jasonfnorth
    @jasonfnorth a month ago

    I'm running an 8B local llm without any issues... But I do have a super fast MacBook pro M3 😊

    • @ColeMedin
      @ColeMedin a month ago

      Yeah that's awesome Jason! Have you tried Llama 3.2 11b as well? I know it's "only" 3b more but I'm curious what your performance for that would be too!

  • @kuyajon
    @kuyajon a month ago

    Small ones like 7b and below are great for simple things. They run fast on my laptop's 1660 Ti 6GB. But for coding the quality is not great at 7b and below

    • @ColeMedin
      @ColeMedin a month ago

      Yes I agree! You could maybe try the smaller models of Qwen 2.5 Coder, but yeah, in general only the bigger models do well with coding.
      github.com/QwenLM/Qwen2.5-Coder

  • @MK-jn9uu
    @MK-jn9uu a month ago +1

    Your videos are fantastic

    • @ColeMedin
      @ColeMedin a month ago +1

      Thank you very much MK, I appreciate it a lot!

  • @syifaifamily840
    @syifaifamily840 a month ago

    Just choose a suitable one based on your current system specs. Not every company gives free internet access to all employees, so a local LLM is the better solution

    • @ColeMedin
      @ColeMedin a month ago

      I agree! With a lot of companies you have to go local because they block the closed source models. The main trouble, though, is that with your current system specs you might not be able to run a model powerful enough for your use case. Llama 3.1 8b and similar models can't really do function calling well.

  • @ApoorvKhandelwal
    @ApoorvKhandelwal a month ago

    Hi, my question is how many prompt requests an A40 can handle. Groq can scale indefinitely so there will be no lag. But an A40 will start to slow down at some point. What do you think?

    • @ColeMedin
      @ColeMedin a month ago

      @@ApoorvKhandelwal Yes great question! All depends on the model size. For 8b models, for example, an A40 could handle many requests at once. But for a 70b model it would start to get slower when you have more than a couple requests being handled at once. Groq scaling infinitely is definitely a plus so you don't have to worry about that!

  • @dinoscheidt
    @dinoscheidt a month ago +4

    Imagine sharing a server with multiple people, so the load is constant and everyone has lower costs. To share the cost, we could use the amount of input and output the model generates. To make it accessible, we could put that server on the internet. You could even combine it on demand with other models to use what is best for each use case not just locking into one model or vendor architecture.
    That would be so smart…. the ultimate local LLM. Oh wait.

    • @ColeMedin
      @ColeMedin a month ago

      Haha touché... 😂
      A lot of businesses I work with/hear about need to have all their data private for compliance reasons hence they really do need an LLM hosted locally. But for a majority of use cases, the "ultimate local LLM" is best!

    • @dinoscheidt
      @dinoscheidt a month ago +2

      @@ColeMedin yes, and they discuss that stuff in teams. And collect the data themselves under - often weaker - data protection terms of service compared to the cloud vendors. I know I know. I have these conversations too

    • @hchang007
      @hchang007 a month ago

      @@dinoscheidt Bravo! It's ridiculously hard to do security, privacy and encryption at scale correctly! Anyone who professes to get better security or data privacy by running it themselves is making a complete joke.
      Of course things could be simpler (more vulnerable), cheaper (less resilient) and appear faster (fewer compliance processes in place) in an "arm's reach" data center, but it's not an apples-to-apples comparison.
      Running a local Nvidia box is great, until your business loses money at a rate of millions per unit of time when the system is down, or is liable for data security lawsuits. Basically: don't run anything important and mission-critical, like an actual airline booking system, an actual flight controller, a real AdWords agency, a bank or a real stock trading backend.

  • @menruletheworld
    @menruletheworld a month ago

    I'm running Llama 3.1 8B with Ollama on my AI system with the following hardware: an i7 7700 with 64GB RAM, an RTX 3060 GPU with 12GB of VRAM, and a 2TB NVMe drive. Responses usually come within 3 seconds; it works like a charm, doing RAG and running several agents

    • @ColeMedin
      @ColeMedin a month ago

      That's awesome, thank you for sharing! Do you mind sharing a specific use case you are using the 8b parameter version for? That's impressive you're able to use 8b for agents! I've found I need the 70b version for most agentic tasks, so I'm also curious if you've done any fine tuning for function calling?

  • @SergeyNeskhodovskiy
    @SergeyNeskhodovskiy a month ago +1

    Thank you for building my knowledge on LLMs! I have found each and every one of your videos to be extremely valuable and I subscribed to stay updated, keep up the good job! One question: what is the mental model to calculate the necessary RAM per LLM parameters? Mine was "1 byte for 1 param", so 70B params would translate to 70 GB RAM needed to run the model. But you say 48GB is enough for a 70B model? What am I missing here? Is it the regular RAM or the GPU RAM? Can someone explain?

    • @justtiredthings
      @justtiredthings a month ago +3

      Parameters-to-GB RAM is a somewhat helpful rule-of-thumb that puts you in the vicinity, but you tend to need a fair percent more GB RAM than the parameters count to run an unquantized model, and quite a bit less to run quantized models. High quality quants are nearly as good as the original model but will save you a lot of RAM. That being said, being able to run a model and being able to run a model at high speeds are two different things.
      I'm having a bit of trouble, as I consider an inference machine, determining exactly how important the actual processing power of the GPU is. It's important, but maybe less so than I used to think. RAM really is one of the very biggest concerns for LLM inference, and, as far as I can tell, the primary reason that we typically depend on a GPU's VRAM for AI inference basically boils down to the fact that it has a much higher bandwidth than system RAM, and is therefore a lot faster than system RAM.
      But Apple Silicon, because of its unique unified memory design, has bandwidth almost as high as a consumer NVIDIA GPU, so I'm starting to realize they may actually be a bargain for inference. You can get an M1 Ultra Mac Studio with 128GB RAM for ~$4k, for example. While the RAM speed and GPU processing power won't be *quite* as high as an NVIDIA GPU, and while the Mac RAM that's available to the GPU might be closer to ~96GB, you'd have to spend ~$2800 minimum JUST on the GPUs (used 3090s) to get equivalent VRAM in consumer NVIDIA GPUs. The latter setup would also be incredibly bulky, incredibly hot, and *extremely* power-hungry, at a rate like 14 or 15x the Mac Studio. And you'd need kind of a wild PC setup to run all those GPUs. If you wanted to get more power-efficient NVIDIA server GPUs, you'd have to spend like $10k, and you'd *still* be using 5 or 6 times the electricity.
      The best alternative to a Mac Studio that I've been able to determine is to get a Threadripper 7000-series CPU and a mobo with 8 RDIMM RAM slots. The Threadripper supports DDR5, so you could be running at RAM speeds a good bit higher than most systems and you could get a ton of it with that many slots, but you're still limited to 5200 MT/s, which as far as I can determine would still put you at like 40-something% of the Mac Studio bandwidth and about 1/3 of an NVIDIA GPU.
      I wish I could find more tests of people running ~70b LLMs on M-Ultra hardware to confirm what I've been learning. There's surprisingly not much of anything I can find-most of the tests are on M3 Max or pretty pointless tests of useless 8b models.
      Anyway, tl;dr, I'm no expert, but I think slow speeds for local LLMs typically have more to do with available high-bandwidth memory than with GPU processing power, and Apple's M-chips have some very high-speed memory on them, at very efficient power usage and at an affordable price in comparison to GPUs

    • @justtiredthings
      @justtiredthings a month ago +2

      BTW, as a more straightforward answer to your question, you can go look at the model cards on Hugging Face to see how much RAM is required for different quant sizes. The amount of RAM you need is basically the model size (in GB of storage, not parameters) plus another 1 or 2 gigs for overhead, as far as I've been able to determine
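
      To make that rule of thumb concrete, here's a quick back-of-the-envelope sketch (the bytes-per-parameter figures and the 2 GB overhead are rough assumptions; KV cache grows with context length):

        # Rough memory estimate per quantization level (approximate bytes per parameter)
        BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.0, "q6_k": 0.75, "q4_k_m": 0.56}

        def estimate_gb(params_billions: float, quant: str, overhead_gb: float = 2.0) -> float:
            weights_gb = params_billions * BYTES_PER_PARAM[quant]
            return weights_gb + overhead_gb  # overhead for KV cache etc.; grows with context

        for quant in BYTES_PER_PARAM:
            print(f"Llama 3.1 70B @ {quant}: ~{estimate_gb(70, quant):.0f} GB")
        # Roughly: fp16 ~142 GB, q8_0 ~72 GB, q6_k ~54 GB, q4_k_m ~41 GB (ballpark only)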

    • @ColeMedin
      @ColeMedin a month ago +1

      Thank you Sergey - I appreciate it a lot!
      I liked @justtiredthings' thoughts a lot, and generally the "1 byte for 1 param" idea is good if you want really good speeds. 48GB of VRAM is more of a good starting point with a 70B parameter model that will work for a lot of agentic use cases that don't need insanely fast speeds. But if you are trying to build an application that needs high tokens/second then yeah, you'd want at least 70GB of VRAM for a model like Llama 3.1 70B. I hope that makes sense!

    • @SergeyNeskhodovskiy
      @SergeyNeskhodovskiy a month ago +1

      @@justtiredthings Thanks for your detailed explanation! You can't imagine how valuable this info is for me!

    • @justtiredthings
      @justtiredthings a month ago +1

      @@SergeyNeskhodovskiy One thing I do want to correct, because I think I perpetuated a bit of misinfo: I think memory transfer for the Threadripper or any DDR5 setup would actually be much, much slower than the M-Ultra or an NVIDIA GPU - like maybe 5-7% of the speed, something like that. I was getting gigabytes per second and gigabits per second mixed up while trying to figure out the bandwidth. Sorry about that

  • @alycheikhouldsmail7576
    @alycheikhouldsmail7576 a month ago

    That means if we remove the Llama service from the n8n self-hosting kit and use Groq instead, we will save a lot of time installing and running the n8n AI agent locally. If you agree, I hope you make a video about this.

    • @ColeMedin
      @ColeMedin a month ago

      Yes that is true! Though the advantage of using Ollama is it is fully local. With Groq you can use open source models, but it's still not running them locally. There certainly is a time and place to use Groq though, don't get me wrong!

  • @j0hnc0nn0r-sec
    @j0hnc0nn0r-sec a month ago

    Buy refurbished or build your own. I built an LLM PC for $500. It has a 1080 in it, which runs 22b models at a decent speed; not lightning, but good enough. Smaller than 22b and there's only a slight difference in speed from OpenAI or Anthropic

    • @ColeMedin
      @ColeMedin a month ago +1

      That's super impressive! What kinds of speeds do you get for something like Llama 3.1 8b?

    • @j0hnc0nn0r-sec
      @j0hnc0nn0r-sec a month ago

      @@ColeMedin I'm trying to figure out a way to give you an accurate answer but am coming up short. Do you know of a way? Might just record a gif or something

    • @ColeMedin
      @ColeMedin a month ago

      To be honest I'm not sure how exactly to do this, but I know people have a way to measure the tokens per second they get when running models locally.

    • @j0hnc0nn0r-sec
      @j0hnc0nn0r-sec 15 days ago

      @@ColeMedin Found it: run `ollama run <model>` with the `--verbose` flag and it prints the eval rate.
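
      If you want a number without eyeballing the console, another option is to read the timing fields Ollama reports in its API response; a rough sketch (assuming a local Ollama install and its documented eval_count / eval_duration fields):

        import requests

        r = requests.post("http://localhost:11434/api/generate", json={
            "model": "llama3.1:8b",          # whichever model you're benchmarking
            "prompt": "Explain RAG in one paragraph.",
            "stream": False,
        }, timeout=600).json()

        # eval_count = generated tokens, eval_duration = generation time in nanoseconds
        tokens_per_second = r["eval_count"] / (r["eval_duration"] / 1e9)
        print(f"~{tokens_per_second:.1f} tokens/s")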

  • @novantha1
    @novantha1 a month ago

    ? ? ?
    This take is actually insane, I literally see regular people run AI models locally constantly in communities I'm a part of.
    - Timeouts are not a problem of the LLM; that's a problem of your setup, and is related to web infrastructure. There's any number of solutions to this. Literally just ask Claude or GPT4 how to fix it, lol.
    - 7B, 8B, 10.3B and even ~22B models are not expensive to run. If you try to run them natively at FP16, they might be prohibitively expensive, in the sense that you're looking at 14GB, 16GB, 20.6GB, and 44GB of RAM to load them respectively. That's crazy, because that's at 0 context, so before the conversation starts, and the attention mechanism can actually use more memory than the weights at ultra high context.
    - But that's not how people actually run AI models locally. With a 16GB GPU, you could load this in Ollama or LlamaCPP with as many layers on the GPU as you can, and then offload the rest to CPU, which incurs a slight speed penalty, but lets you run the model reasonably fast, and potentially works out favorably for short conversations because you don't have cloud latency.
    - Quantization. Did you try running models quantized? That's a huge game changer. They essentially lop off the latter half of the numbers on each weight so they take up less RAM. GGUF is probably the easiest to use, and supports variable bit widths, so at Q6_K I can comfortably run Mistral Small (really good model, btw, it's about 22B parameters) on my setup with some CPU offloading, and the speed isn't amazing, but is fine, and "free" in the sense that I already bought the computer. A lot of people have had great success with 10.3-13B models quantized to 8bit quants, like int8, GGUF Q8, exl, etc, which are really only around 14GB of VRAM to run, usually, which is totally achievable with reasonably priced GPUs (at least on Linux, I'm not sure how much VRAM Windows uses), or can be run with an 8GB GPU and some light offloading onto CPU, with extremely affordable GPUs.
    - Agents. I think the biggest possible use case for local LLMs is AI agents (outside of entertainment purposes like LLMs as a roleplay partner); once you get a sufficiently robust architecture for them built up, they really don't need you to provide input at every stage of their work, so you can run them passively in the background while you're doing other stuff. The raw speed doesn't really matter, so these can honestly run on CPU for all you care, or a spare PC, or a Raspberry Pi cluster, etc. If you don't really need to check in with your model for an hour or so, the actual token speed probably doesn't matter that much. On the other hand, the price per token may actually matter quite a bit, and even running a huge model on a consumer CPU (e.g. Qwen 2.5 72B) at extremely low speeds could still add up to really quite a lot of tokens that you would need to pay for normally (~0.3-2 T/s works out to around 1,080-7,200 tokens per hour, depending on your CPU, quantization, etc), and I can definitely say that if you're looking at a week of operation of an agent there comes a point where you don't really want to pay for those tokens.
    - Customization. There's reasons you might want to customize a model for your use case, or use an off-the-shelf customized LLM for something really specific you need. You can get great results with 7B models fine tuned specifically for handling SQL, or you can train a LoRA to encourage your coding style to be used, and any other number of considerations. Plus, the quantity of models available on Huggingface is absolutely insane, and there's almost certainly at least one you will find interesting hosted on there for some purpose. You miss out on a lot of this using the ultra fast/cheap hosts like Groq, or Cerebras, or Sambanova.
    I think the only people who need to hear the "hard" truth on hosting LLMs are corporate business types trying to cash in on the AI hype without any technical experience because they want to use the "AI thing", too. There's any number of solutions to any of the problems that were listed in this video, and it's a little awkward to know that there are people out there who might be swayed from a lot of the really interesting benefits of running AI locally by this opinion.
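
    As an illustration of the partial GPU offload and GGUF quantization described above, here's a minimal llama-cpp-python sketch (the file path, layer count, and context size are placeholders you'd tune to your own GPU):

      from llama_cpp import Llama

      # Load a Q6_K GGUF quant, putting as many layers as fit on the GPU
      # and leaving the rest on the CPU.
      llm = Llama(
          model_path="./mistral-small-q6_k.gguf",  # placeholder path to a downloaded GGUF file
          n_gpu_layers=35,   # tune to your VRAM; -1 offloads everything
          n_ctx=8192,        # context window; attention memory grows with this
      )

      out = llm.create_chat_completion(messages=[
          {"role": "user", "content": "Write a SQL query that lists overdue invoices."}
      ])
      print(out["choices"][0]["message"]["content"])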

  • @oncrei
    @oncrei a month ago

    I am running 9-11B models on a 2022 laptop with an Nvidia 3070 and an Intel i9 12900H with 40 GB of RAM, and it is fast.
    It's laptop hardware you can get for under 1k today, all up.

    • @ColeMedin
      @ColeMedin a month ago

      That's awesome! What kinds of use cases are you able to use the smaller 8b-11b models for?

    • @oncrei
      @oncrei a month ago

      @@ColeMedin I am using codeqwen:1.5-chat as a coding assistant. I am also using llama3.2:3b as a quick writing assistant

    • @ColeMedin
      @ColeMedin a month ago

      That's awesome, thanks for sharing!

  • @akierum
    @akierum 26 days ago

    You are smart with all these tutorials, but why not tell everyone you can offload everything to RAM? No need for 8x $30k H100 GPUs these days

    • @ColeMedin
      @ColeMedin 26 days ago

      Thanks for the compliment, and could you tell me more? From what I know you can't offload to RAM, but I'm curious what you're thinking!

  • @adomakins
    @adomakins a month ago +1

    Interesting that OpenAI has this kind of competition when it comes to enterprise-level use; they might be cooked, lowkey

    • @ColeMedin
      @ColeMedin a month ago

      @@adomakins haha at some point they will definitely be 😂

  • @ghaith2580
    @ghaith2580 a month ago

    Are you sure about Groq being the fastest and most efficient? I thought Cerebras has been dominating for a while now

    • @ColeMedin
      @ColeMedin a month ago

      There are definitely some competitors out there but Groq is typically considered to be on top! Not saying it's entirely true but that seems to be the general consensus from what I have seen.
      I am seeing benchmarks though that show Cerebras is better in terms of speed. It seems the pricing is a bit worse but not by much. Super interesting! I'll have to check out Cerebras more, I've only really dug into it a bit in the past.

  • @ThomasConover
    @ThomasConover a month ago

    With a Mac M3 Max's memory you can run pretty big LLMs.

    • @ColeMedin
      @ColeMedin a month ago

      Nice!! How big you talking?

    • @ThomasConover
      @ThomasConover a month ago

      @@ColeMedin you can run the 70B models, which is mind-blowing to have on a laptop. A jailbroken 70b model pretty much gives you the entire human knowledge to access with zero need for internet. You probably know this already. It blows my mind just how powerful these things can be if you custom train them. ❤️❤️❤️ What a time to be alive.

    • @ColeMedin
      @ColeMedin a month ago

      Wow that is insane!! It sure is quite the time to be alive haha

    • @ThomasConover
      @ThomasConover a month ago

      @@ColeMedin I actually saw someone running a 400B model on two parallel-connected, maxed-out M3 laptops yesterday. Yes, you read that correctly. You can run a four-hundred-billion-parameter model locally, but it's gonna cost you two maxed-out M3 MacBook Pros. It is not cheap.
      Pretty wild!

    • @ColeMedin
      @ColeMedin a month ago +1

      Wow I actually can't believe it haha, that is incredible!

  • @jarad4621
    @jarad4621 a month ago

    You forget something: if you are running a proper app you will likely need a higher rate limit. Local hosting or a rented GPU gets you one install that can run 1 request at a time, whereas on Groq or OpenRouter I get a 200-per-second rate limit, so local does not work for anything but a simple personal app - you can't use it for a multi-user app or anything powerful. Not so? This is the damn problem. Or am I wrong on the rate limit for a local install? I really want to be wrong

    • @ColeMedin
      @ColeMedin a month ago

      Yes this is a very fair point! I agree that I didn't fully respect the need for concurrency in this video... something I am still learning how to handle well myself!
      With RunPod, you can certainly handle more than one request at once! But it does require better hardware or going with a model with fewer parameters, so it's not easy. You can also implement queuing for your LLM calls as long as you are getting fast enough responses that it doesn't hurt UX. Lots of nuances here, but overall Groq does make this MUCH easier as long as you are okay not having the LLM hosted yourself.

  • @gr8tbigtreehugger
    @gr8tbigtreehugger a month ago +2

    Great vid - many thanks! Love seeing a fellow groqster. I'm on this very journey right now!

    • @ColeMedin
      @ColeMedin a month ago +1

      Thank you!! :)
      Good luck on your LLM journey, I'd love to hear how it goes for you!

  • @jasonrhodes5034
    @jasonrhodes5034 a month ago

    Llama 3.2 1B and 3B models just dropped... and the larger ones have vision now

    • @ColeMedin
      @ColeMedin a month ago

      I know it's exciting! :D

  • @jarad4621
    @jarad4621 a month ago

    GPT-4o mini is way cheaper and better, and OpenAI doesn't use your API data - or at least the risk is exactly the same as with Groq; both are external companies so you can't be sure

    • @ColeMedin
      @ColeMedin a month ago +1

      Yeah gpt-4o-mini is a good option if you never have the need to host an LLM yourself and want something cheap! The main reason you would use Groq and a model like Llama 3.1 8/70b is because you want to eventually switch to hosting a model yourself and don't want to entirely switch your LLM when you do so.
      As far as the risk of sending data to OpenAI vs Groq, a lot of it comes down to the level of mistrust that people have with closed source models, especially GPT. But I totally respect that that is up for debate.
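
      In practice that switch can be a one-line config change, since Groq and Ollama both expose OpenAI-compatible endpoints; a rough sketch (the model names and environment variables here are just examples):

        import os
        from openai import OpenAI

        # Point the same client at Groq today and at your own Ollama box later.
        if os.getenv("USE_LOCAL"):
            client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama accepts any key
            model = "llama3.1:70b"
        else:
            client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.environ["GROQ_API_KEY"])
            model = "llama-3.1-70b-versatile"  # example Groq model id

        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Draft a polite follow-up email."}],
        )
        print(resp.choices[0].message.content)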

  • @jasonrhodes5034
    @jasonrhodes5034 a month ago

    You run it on a rented server with a GPU... it's cheaper by a long way, just tunnel

    • @marcusmayer1055
      @marcusmayer1055 a month ago

      Maybe more details?

    • @jarad4621
      @jarad4621 a month ago

      But what's the rate limit? You can only do 1 run at a time, not 250 per second; any real app is going to need a high rate limit

    • @ColeMedin
      @ColeMedin a month ago

      Something like a RunPod instance is a rented server with a GPU, or are you thinking of something different? I would appreciate more details as well!

  • @digitalsoultech
    @digitalsoultech a month ago

    Dude, Llama 3.1 function calling sucks. If any part of your app uses functions, expect it to fail a fair lot more than GPT.
    I wish I could switch over to Llama; it just doesn't make sense at the moment.
    Let's hope Llama 3.2 is better

    • @dinoscheidt
      @dinoscheidt a month ago

      Can you explain a little more? I have not used function calling with Llama 3.1 but am interested in it. Is there something particular that just doesn't work? Thank you

    • @justtiredthings
      @justtiredthings a month ago

      Man, I feel like the Qwen 2.5 drop is being egregiously ignored. It's smashed Llama in benchmarks, and it's doing pretty damn impressively in my own prodding. It should have been the biggest news of the week--we need more content putting it to the test

    • @justtiredthings
      @justtiredthings a month ago

      All they did with Llama 3.2 90b is add vision. The text benchmarks are exactly the same. You could have a better text model in Qwen 2.5 and a comparable vision model in Pixtral running simultaneously at a lower RAM requirement overall

    • @ColeMedin
      @ColeMedin a month ago

      Yeah totally fair! I've also had not the best experience with Llama function calling. I'm going to be playing around a lot with Llama 3.2 function calling on Groq, I hope it's much better too!

    • @ColeMedin
      @ColeMedin a month ago +1

      Thank you for mentioning Qwen 2.5, I agree it's being missed out on! I am definitely considering making a video on it very soon.

  • @RagdollRocket
    @RagdollRocket a month ago

    No, an RTX 3080 gives awesome performance.

    • @ColeMedin
      @ColeMedin a month ago

      Tell me more! What models are you running and what speeds are you getting?

  • @akierum
    @akierum 26 days ago

    Air LLM can run anything on a 3090, even 405b; you are outdated.

    • @ColeMedin
      @ColeMedin 26 days ago

      Air LLM seems awesome! I will certainly have to look into it. Have you tried it before?

  • @venuev
    @venuev 6 days ago

    Waste of time; you can never achieve the results of the big boys.

    • @ColeMedin
      @ColeMedin 4 days ago

      It certainly depends on the use case! There are a ton of use cases where a local LLM works great, and sometimes even better with fine tuning!

  • @NorthCountrySportChiro
    @NorthCountrySportChiro 11 days ago

    Pesto for president

    • @ColeMedin
      @ColeMedin 10 days ago

      In your dreams

  • @jayd8935
    @jayd8935 a month ago

    Seems pretty wrong.

    • @ColeMedin
      @ColeMedin a month ago

      Could you clarify what you think is wrong?