LLAMA 3.1 70b GPU Requirements (FP32, FP16, INT8 and INT4)

  • Published Jan 15, 2025

Comments • 84

  • @einstien2409 · 4 months ago · +111

    Let's all take a moment to appreciate how much Nvidia kneecaps its GPUs with low VRAM to screw customers into buying more $50K GPUs. For $25K, the H100 should not come with less than 256 GB of VRAM. At all.

    • @alyia9618 · 4 months ago · +19

      Shame on the competitors that can't or won't offer better solutions! And there are many startups out there with innovative approaches (no GPUs, but real "neural" processors). These startups need money, but the likes of AMD, Intel, etc. instead carry on with their bollocks and put out "CPUs with NPUs" that are clearly not enough to run "real" LLMs. And that's because they're playing the same game as Nvidia, trying to squeeze as much money as possible from the gullible. Sooner or later we will have machines with Graphcore IPUs or Groq LPUs, but not before the usual culprits get rich squeezing everyone.

    • @kineticraft6977 · 3 months ago · +5

      I just want someone to slap an NVMe slot on an affordable Tesla GPU.

    • @einstien2409 · 3 months ago · +3

      @@alyia9618 Nvidia will be the first to get those out and charge an insane price for each, while Intel and AMD suck their thumbs and make the next-gen AI HZ XX SX AI max pro Ultra Intelligence Hyper Max XX AI CPU with an NPU that sips 2 milliwatts to produce a total of 60 TOPS of performance, so you can have a blurry background on your Zoom calls and generate naked-women images.

    • @bjarne431 · 3 months ago · +2

      M1/2/3/4 with high memory configurations look like a steal if you want to run larger models…

    • @coleisman · 2 months ago

      Basic marketing; every company does it with every product. From iPhones to game consoles to cars, they remove basic features that cost very little in order to incentivize you to step up to a higher model, even if you don't need the other features.

  • @lrrr · 4 months ago · +16

    Thanks man, I was trying to find a video like this for a long time. You saved my day!

    • @AIFusion-official · 4 months ago · +2

      Glad I could help

    • @AaronBlox-h2t · 4 months ago

      Same here... although I only needed this info yesterday, so lucky to have found it now, haha. New sub.

  • @serikazero128 · 4 months ago · +23

    I think your video is pretty solid, but it's also missing something.
    I can currently run Llama 3.1 with 0 GB of video RAM. Yes, you heard that right, 0 GB of VRAM.
    How is this possible? With low quantization types, similar to INT4 and INT8; in my case, more exactly: llama3.1:70b-instruct-q3_K_L.
    I can run it with around 50-64 GB of RAM, on my CPU.
    It takes, however, roughly 2 minutes to answer: "Hey, my name is Jack, what's yours?"
    What's the deal?
    AI needs RAM, not specifically VRAM. VRAM is much faster of course, but I'm using a laptop CPU (weaker than a desktop one), and one that is 3-4 years old.
    After I load the model, my RAM usage jumps to around 48 GB, while without the model it sits at around 10 GB.
    My point is: you don't need insane resources to run AI. As long as speed isn't the issue, you can even run it on the CPU; it just takes longer. The GPU isn't what makes AI go, it only makes it go much faster.
    I have no doubt that with 40 GB of VRAM, my Llama 3.1 would answer in 20-30 seconds instead of 2 or 2.5 minutes.
    However, you can still run it on an outdated LAPTOP CPU, as long as you have enough memory. And that's the key thing here: memory.
    And it doesn't have to be VRAM!!
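    For reference, a minimal sketch of this CPU-only setup, assuming the ollama Python client is installed and the q3_K_L tag mentioned above has already been pulled:

    ```python
    # CPU-only chat against the quantized tag named in the comment above.
    # Assumes: pip install ollama, and: ollama pull llama3.1:70b-instruct-q3_K_L
    # With no GPU present, ollama serves the model entirely from system RAM.
    import ollama

    response = ollama.chat(
        model="llama3.1:70b-instruct-q3_K_L",
        messages=[{"role": "user", "content": "Hey, my name is Jack, what's yours?"}],
    )
    print(response["message"]["content"])
    ```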

    • @AIFusion-official · 4 months ago · +3

      Thank you for sharing your experience! You're absolutely right that running LLaMA 3.1 70B on a CPU with low quantization like Q3_K_L is possible with enough RAM, but it comes with trade-offs. While CPUs can handle the load, they tend to overheat more than GPUs when running large language models, which can slow down the generation even further due to throttling. So, while it's feasible, the long response times (e.g., 2 minutes for a simple query) and the potential for overheating make it impractical for real-life usage. For faster and more stable performance, GPUs with sufficient VRAM are much better suited for these tasks. Thanks again for bringing up this important discussion!

    • @alyia9618 · 4 months ago · +3

      Yeah, OK, but the loss of precision from 3-bit quantization is colossal!!! There is a reason why FP16 (or BF16) is the sweet spot, with INT8 as a "good enough" stopgap...

    • @serikazero128 · 4 months ago · +3

      @@alyia9618 I could run FP16 if I added more RAM; that's my point.
      And if a laptop processor can do this, A LAPTOP CPU, you can run even FP16 on a computer with 256 GB of RAM. And getting 256 GB of RAM is a looooot cheaper than getting 256 GB of VRAM.

    • @alyia9618 · 4 months ago · +2

      @@serikazero128 Yes, you can run FP16 no problem, especially with AVX-512-equipped CPUs! The problem is that as the number of parameters goes up, memory bandwidth becomes a huge bottleneck. That's the real issue, because the CPUs can cope with the load, especially the latest ones with integrated NPUs, and it's a no-brainer if we run the computation on the iGPUs too. Feeding all those computational units is the problem, because two memory channels and a theoretical max of 100 GB/s of bandwidth aren't enough. The solution the likes of Nvidia and AMD have found for now is to add HBM memory to their chips, and it's an empirically verified solution too, because Apple M3 chips are going strong exactly because they have high-bandwidth memory on the SoCs.
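      As a rough illustration of why bandwidth is the ceiling (a back-of-envelope sketch only; real throughput also depends on caching, batching, and compute):

      ```python
      # Back-of-envelope decode speed: each generated token has to stream
      # (roughly) the entire set of quantized weights through memory once,
      # so memory bandwidth divided by model size bounds tokens/second.
      def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
          return bandwidth_gb_s / model_size_gb

      MODEL_Q4_GB = 40.0  # ~40 GB for a 70B model at 4-bit, as quoted in this thread

      print(max_tokens_per_second(100.0, MODEL_Q4_GB))   # dual-channel DDR (~100 GB/s): ~2.5 tok/s
      print(max_tokens_per_second(400.0, MODEL_Q4_GB))   # M3 Max-class unified memory: ~10 tok/s
      print(max_tokens_per_second(3350.0, MODEL_Q4_GB))  # H100 HBM3 (~3.35 TB/s): ~84 tok/s
      ```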

    • @ДмитрийКарпич · 4 months ago · +2

      "I have no doubt that with 40 GB of VRAM, my Llama 3.1 would answer in 20-30 seconds instead of 2 or 2.5 minutes." - No, it would answer in 2-3 seconds, maybe 5 in the worst scenario. It's a little bit tricky, but the basic idea isn't just to place the model in some memory; you need dozens of cores to deal with it. And with a desktop CPU you get 6-8-10-20 cores, versus 5,888 CUDA cores in a 4070.

  • @nithinbhandari3075 · 4 months ago · +6

    Nice video.
    Thanks for the info.
    We are sooo gpu poor.

  • @sleepyelk5955 · a month ago

    Very cool overview, thanks a lot... (always on the search for more RAM^^)

  • @SimonNgai-d3u · 20 days ago

    LOL, I can't imagine that one day we'll run AI agents on spacecraft and they'll manage the missions to set up stuff on Mars.

  • @____________________________.x · 3 months ago · +1

    The GPU tool would be easier to use if it listed the tools you could run with a specific GPU? Still, it's nice to have something, so thanks 👍

    • @AIFusion-official · 3 months ago

      Thank you for your comment! I’m glad you find the tool helpful. Could you clarify what you mean by 'tools'? Our tool is specifically designed to show GPU requirements for large language models. I’d love to hear more about your thoughts!

  • @gazzalifahim · 4 months ago · +3

    Man, this is the tool I was wishing for over the last 3 months! Thanks, thanks, thanks!
    Just got a question: I was planning to buy an RTX 4060 Ti for my new build to run some thesis work. My work is mostly on open-source small LLMs like Llama 3.1 8B, Phi-3-Medium 128K, etc. Will I be able to run those with great inference speed?

    • @AIFusion-official · 4 months ago · +1

      Thank you, I’m glad it’s useful! As for the RTX 4060 Ti, if you’re looking at the 8GB version, I’d actually recommend considering the RTX 3060 with 12GB instead. It’s usually cheaper and gives you more room to run models at higher quantization levels. For example, with LLAMA 3.1 8B, the RTX 3060 can run it in INT8, whereas the 4060 Ti with 8GB would only handle INT4. Just to give you some perspective, I personally use an RTX 4060 with 8GB of VRAM, and I can run LLaMA 3.1 8B in INT4 with around 41 tokens per second at the start of a conversation. So while the 4060 Ti will work, the 3060 might give you more flexibility for your thesis work with LLMs
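      A rough way to sanity-check those fits (weights only; the KV cache and runtime overhead add a few more GB, which is why 8 GB cards end up at INT4):

      ```python
      # Approximate weight footprint: parameter count times bytes per parameter.
      def weight_gb(params_billion: float, bits: int) -> float:
          return params_billion * 1e9 * bits / 8 / 1e9  # result in GB

      for bits in (16, 8, 4):
          print(f"Llama 3.1 8B at {bits}-bit: ~{weight_gb(8, bits):.0f} GB of weights")
      # ~16 GB at FP16, ~8 GB at INT8 (fits a 12 GB RTX 3060 with room left for the
      # KV cache), ~4 GB at INT4 (what an 8 GB card can hold comfortably).
      ```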

    • @AaronBlox-h2t · 4 months ago

      I recommend the Intel Arc A770 16GB with IPEX-LLM and Intel's Python distribution, both optimized for Arc; it beats the 4060 by 70%, according to Intel.

    • @guytech7310 · 2 months ago

      @@AaronBlox-h2t Does Llama 3.1 support Intel ARC?

    • @fuzzydunlop7154 · 2 months ago

      That's the question you'll be asking for every novel application if you buy an Intel GPU

  • @io9021 · 4 months ago · +3

    When running Llama 3.1 70B with Ollama, by default it selects a version using 40 GB of memory. That's 70b-instruct-q4_0 (c0df3564cfe8), so that has to be INT4. I guess in this case all parameters (key/value/query and feedforward weights) are INT4?
    Then there are intermediate sizes where presumably different parameters are quantized differently?
    70b-instruct-q8_0 (5dd991fa92a4) needs 75 GB; presumably that's all INT8?
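    Those sizes line up with a simple bits-per-weight estimate (a rough sketch; exact GGUF file sizes vary slightly by quant variant):

    ```python
    # GGUF block quants store per-block scales on top of the raw weights:
    # q4_0 is ~4.5 effective bits/weight, q8_0 is ~8.5 effective bits/weight.
    PARAMS = 70.6e9  # Llama 3.1 70B

    q4_0_gb = PARAMS * 4.5 / 8 / 1e9
    q8_0_gb = PARAMS * 8.5 / 8 / 1e9
    print(f"q4_0 ≈ {q4_0_gb:.0f} GB, q8_0 ≈ {q8_0_gb:.0f} GB")  # ~40 GB and ~75 GB
    ```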

    • @AIFusion-official · 4 months ago · +2

      Thank you for your insightful comment! Yes, when running LLaMA 3.1 70B with Ollama, the 70b-instruct-q4_0 version likely uses INT4 quantization, which would apply to all parameters, including key, value, query, and feedforward weights. As for intermediate quantization levels, you're correct: different parameters may be quantized to varying degrees, depending on the model version. The 70b-instruct-q8_0, needing 75 GB, would indeed suggest that it's fully quantized to INT8. Each quantization level strikes a balance between memory usage and model performance.

    • @akierum · 2 months ago · +1

      Just use AirLLM to put everything into RAM, not SSD, then use 2x 3090s for speed.

  • @maxxflyer · 4 months ago · +3

    great tool

  • @malloott · a month ago

    Cool tool, but could you add the RTX 4070S Ti, 4090, 4080S, etc.? It's kinda silly they aren't there already.

  • @ngroy8636 · a month ago · +1

    What about using CPU offloading for inference?

  • @bestof467 · a month ago

    In China they've increased the VRAM on 4090 cards by soldering, similar to upgrading Apple storage chips.

  • @robertoguerra5375 · 2 months ago

    Nobody knows how much precision they need

  • @Felix-st2ue · 4 months ago · +2

    How does the 70B Q4 version compare to, let's say, the 8B version at FP32? Basically, what's more important, the number of parameters or the quantization?

    • @AIFusion-official · 4 months ago · +4

      Thank you for your question! The 70B model at Q4 has many more parameters, allowing it to capture more complex patterns, but the lower precision from quantization can reduce its accuracy. On the other hand, the 8B model at FP32 has fewer parameters but higher precision, making it more accurate in certain tasks. Essentially, it’s a trade-off: the 70B Q4 model is better for tasks requiring more knowledge, while the 8B FP32 model may perform better in tasks needing precision.

    • @alyia9618 · 4 months ago · +1

      If you must do "serious" things, always prefer a bigger number of parameters (with 33B and 70B being the sweet spots), but try not to go under INT8 if you want your LLM to not spit out "bullshit". Loss of precision can drive accuracy down very fast and make the network hallucinate a lot; it loses cognitive power (a big problem if you are reasoning about math problems, logic problems, etc.), becomes incapable of understanding and producing nuanced text, and spells disaster for non-Latin languages (yes, the effects are magnified for non-Latin scripts). Dequantization (during inference you must go back to floating point and back again to the desired quant level) also increases the overhead.

    • @guytech7310 · 2 months ago · +1

      @@AIFusion-official Can you do a video showing the differences between the precision levels? I am curious about the error rates at the different precisions. Thanks!

  • @jasonly003 · a month ago

    Can we do reverse search? From GPU to LLM?

  • @chuanjiang6931 · 2 months ago

    According to Meta Blog, "We performed training runs on two custom-built 24K GPU clusters." How come only 13 H100 GPUs are required for a 70B model for full training on your webpage? Do you mean "at least"?

    • @vojtechkment2956 · 2 months ago · +1

      It is the minimal count of GPUs that allows you to keep all 70B parameters of the model, plus all the Adam optimizer state (necessary during training), in VRAM at the same time, i.e. the minimal precondition for any computing efficiency. The untold part is that this way you would be training it for about 61 years. :/
      Provided that you train it on the same corpus and with the same approach Meta used. Good luck.
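      For intuition, the usual mixed-precision-plus-Adam bookkeeping (a rough rule of thumb that ignores activations and sharding overhead, which is why it lands near, not exactly on, the 13 quoted by the tool):

      ```python
      # Common rule of thumb for mixed-precision Adam training state, per parameter:
      # 2 B (FP16 weights) + 2 B (FP16 grads) + 4 B (FP32 master weights) + 8 B (Adam m, v) = 16 B.
      PARAMS = 70.6e9
      BYTES_PER_PARAM = 16

      state_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~1,130 GB of training state
      h100s = state_gb / 80                       # 80 GB of HBM per H100
      print(f"~{state_gb:.0f} GB -> ~{h100s:.0f} H100s just to hold the training state")
      ```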

  • @AhmadQ.81 · 3 months ago · +3

    Is the AMD MI325X with 288 GB of VRAM sufficient for Llama 3.1 70B inference at FP32?

    • @AIFusion-official · 3 months ago · +2

      Thank you for your question! The AMD MI325X with 288GB of VRAM is not sufficient for running LLaMA 3.1 70B in FP32, as that would require more memory. However, FP16 is recommended for inference and would fit well within the 288GB limit, allowing for efficient performance.
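      The arithmetic behind that (weights only; the KV cache and activation buffers come on top, which is what pushes FP32 past 288 GB):

      ```python
      PARAMS = 70.6e9
      fp32_gb = PARAMS * 4 / 1e9   # ~282 GB of weights alone at FP32
      fp16_gb = PARAMS * 2 / 1e9   # ~141 GB at FP16, leaving ample room for the KV cache
      print(f"FP32 weights ≈ {fp32_gb:.0f} GB, FP16 weights ≈ {fp16_gb:.0f} GB")
      ```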

  • @chuanjiang6931 · 2 months ago

    When you say 'full Adam training', is it full parameter fine-tuning or training an LLM from scratch?

  • @px43 · 4 months ago · +2

    This app you made is awesome, but I've also heard people got 405B running on a MacBook, which your app says should only be possible with $100k of GPUs, even at the lowest quantization. I'd love for your site to be my go-to for ML builds, but it seems to be overestimating the requirements.
    Maybe there should be a field for speed benchmarks, and you could give people tokens per second when using various swap and external RAM options?

    • @AIFusion-official · 4 months ago · +2

      Thank you for your feedback! Running a 405 billion parameter model on a MacBook is highly unrealistic due to hardware constraints, even with extreme quantization, which can severely degrade performance. In practice, very low quantization levels like Q2 would significantly reduce precision, making the model's output much poorer compared to a smaller model running at full or half precision. Additionally, tokens per second can vary based on the length of the input and output, as well as the context window size, so providing a fixed benchmark isn't feasible. We’re considering ways to better address performance metrics and appreciate your suggestions to help improve the app!
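      Rough numbers that back this up (weights only, assuming GGUF-style block quants at approximate bits-per-weight):

      ```python
      PARAMS_405B = 405e9
      for bits in (4.5, 3.0, 2.1):  # roughly q4_0, q3-class, and iq2-class quants
          print(f"{bits} bits/weight -> ~{PARAMS_405B * bits / 8 / 1e9:.0f} GB of weights")
      # ~228 GB, ~152 GB, ~106 GB: even the most aggressive quants need disk offload
      # or a very large unified-memory machine, with the quality loss described above.
      ```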

  • @AIbutterFlay · a month ago

    The site isn't updated for the M4 Max and Ultra.

  • @harivenkat1021 · a month ago

    Is there any tool to find the time taken to run on a given GPU for x tokens at y-bit quantization?

  • @dinoscheidt · 3 months ago · +1

    Why doesn't the tool list Apple M3 chips? Inference and LoRA run better than on most of the GPUs you listed.

    • @AIFusion-official · 3 months ago

      Thanks for your question! The M3 chips are definitely included in the tool, but if you selected a model and quantization level that require more memory than the M3 chips can provide, they won’t appear in the results. Try choosing a different model or quantization level, and you should see the M3 listed. Let me know if you have any other questions!

    • @dinoscheidt · 3 months ago

      @@AIFusion-official Hm, I selected LLAMA 3.1 70B. I have an M3 with 128 GB of RAM… no problem, but I did not see it in the list, just smaller-VRAM GPUs. EDIT: I tried it again and now it appears. From a UX perspective it would make sense to always show all GPUs, so the list doesn't jump around, and grey out what isn't supported.

    • @AIFusion-official · 3 months ago

      You mean LLAMA 3.1 70B?
      You won't see it because FP32 is selected. If you select INT8, you will see it listed.

  • @K.F.L · 3 months ago

    After getting into Linux, I really want to learn more about coding in general. Struggling for help via forums, I installed Llama 3.1 8B, but it gets a lot wrong. I was going to install the 70B version, but it seems with my current memory specs the ideal RAM usage would be up my own arse, considering it's a GTX 1060.

  • @SahlEbrahim · 2 months ago

    Is there a way to fine-tune this model via the cloud?

  • @mohamadbazmara601 · 4 months ago · +1

    Great, but what if we want to run it for more than just a single request? What if we have 1,000 requests per second?

    • @AIFusion-official · 4 months ago · +2

      Handling 1,000 requests per second is a massive task that would require much more than just a few GPUs. You'd be looking at a full-scale data center with racks of GPUs working together, along with the necessary infrastructure for cooling, power, and security. It’s a significant investment, and you’d need to carefully optimize the setup to ensure everything runs smoothly at that scale. In most cases, relying on cloud services or specialized AI infrastructure providers might be more practical for such heavy workloads.

  • @seryoga6308 · 2 months ago

    Thank you for your video. Please tell me, what models can you advise for an i9-9900, 32 GB RAM, and an RTX 3090?

  • @treniotajuodvarnis · 4 months ago · +2

    70B runs on two or even one 1080 Ti and an old Xeon v2 with 128 GB of RAM. Yes, it takes a bit to generate, from 10 seconds up to a minute, but not that bad! It only needs RAM; I guess 64 GB would be enough.

    • @chiminhtran7534 · 4 months ago

      Which motherboard are you using, if I may ask?

    • @treniotajuodvarnis · 3 months ago

      @@chiminhtran7534 A Huananzhi X79 Deluxe; it has 4 RAM slots that support 64 GB LRDIMM modules and 2 PCIe x16 slots. I also ran it on another system with an 18-core Xeon v3 CPU, 128 GB of 1866 DDR3, and two 3090s; it runs flawlessly without hiccups and at acceptable speeds (mb: Huananzhi X99-T8, 8 DDR3 slots, max 512 GB RAM).

  • @loktevra · 2 months ago · +1

    Is it possible to use an AMD EPYC 9965 (192 cores, 576 GB/s memory bandwidth) for inference and training? Maybe it's not as fast as GPUs, but I could use much cheaper RAM modules and only one processor, and it would be cheap enough.

    • @guytech7310 · 2 months ago

      No. Consider that Nvidia GPUs have from 1,024 to over 16,384 cores.

    • @loktevra · 2 months ago

      @@guytech7310 But for LLMs, as far as I know, the bottleneck is memory bandwidth, not the number of cores. And my question is how many cores are enough to reach the memory bandwidth bottleneck.

    • @loktevra · 2 months ago

      @@guytech7310 And don't forget that AVX-512 instructions allow computing on numbers in parallel within just one core.

    • @guytech7310 · 2 months ago

      @@loktevra If that were true, then LLMs would not be heavily dependent on GPUs for processing. It's that the larger LLM models require more VRAM to load; otherwise, with low VRAM, the LLM has to swap parts in and out of the DRAM on the motherboard. PCIe supports DMA (Direct Memory Access), so the GPU already has full access to the memory on the motherboard.

    • @loktevra · 2 months ago

      @@guytech7310 Yes, but on GPUs the memory bandwidth is bigger than on CPUs. The AMD EPYC 9965, the latest CPU from AMD, has just 576 GB/s. So for commercial usage, GPUs with their faster VRAM will no doubt be the better choice. But for a home lab, maybe an EPYC CPU is just enough?
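      Applying the same bandwidth-bound estimate from earlier in the thread to those EPYC numbers (decode speed only, ignoring prompt processing and NUMA effects; the ceiling is set by the memory system, not the core count):

      ```python
      EPYC_BW_GB_S = 576.0   # quoted above for the EPYC 9965 (12-channel DDR5)
      MODEL_Q4_GB = 40.0     # ~40 GB for 70B at 4-bit
      MODEL_Q8_GB = 75.0     # ~75 GB at 8-bit

      print(EPYC_BW_GB_S / MODEL_Q4_GB)  # ~14 tokens/s upper bound
      print(EPYC_BW_GB_S / MODEL_Q8_GB)  # ~7.7 tokens/s upper bound
      ```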

  • @dipereira0123 · 4 months ago · +1

    Nice =D

  • @shadowhacker27 · 3 months ago

    Imagine being the one paying, in some cases, over 60,000 USD for a card with 80 GB [EIGHTY] of VRAM... twice.

  • @sinayagubi8805 · 4 months ago · +2

    Wow. Can you add tokens per second to that tool?

    • @AIFusion-official · 4 months ago · +1

      Thank you for your comment! Regarding the tokens per second metric, it’s tricky because the speed varies greatly based on the input length, the number of tokens in the context window, and how far along you are in a conversation (since more tokens slow things down). Giving a fixed tokens-per-second value would be unrealistic, as it depends on these factors. I’ll consider ways to offer more detailed performance metrics in the future to make the tool even more helpful. Your feedback is greatly appreciated!

  • @Xavileiro · 4 months ago · +1

    And let's be honest: Llama 8B sucks really bad.

    • @AIFusion-official · 4 months ago · +2

      @Xavileiro I respect your opinion, but I don't agree. Maybe you've been using a heavily quantized version; some quantization levels reduce the model's accuracy and the quality of the output significantly. You should try the FP16 version. It is really good for a lot of use cases.

    • @mr.gk5 · 3 months ago · +1

      Llama 8B Instruct FP16 is great, much better actually.