Run LLAMA 3.1 405B on 8GB VRAM

  • Published Nov 15, 2024

Comments • 32

  • @ArekMateusiak 23 days ago +25

    Well, if it works, I wonder whether this should be incorporated into LM Studio and similar projects...

    • @BeethovenHD 23 days ago

      Please try it out xD, then I'll switch to Nvidia for that one.

    • @TriconDigital 22 days ago

      @@BeethovenHD You switch to Nemo?

  • @elwii04 23 days ago +32

    And how long does the inference take? 40h/token? Just because you can run it does not mean it's useful.

    • @cupotko 23 days ago +5

      I can only wish the author of the video a similarly "revolutionary" rate of view count and subscriber growth.

    • @funkytaco1358 22 days ago +4

      Apparently, you're right. I saw a discussion saying that on an A100 GPU with a 16-core CPU, inference for one sentence takes 20+ minutes... for one sentence.

    • @smakfu1375 22 days ago +3

      @@funkytaco1358 What AirLLM claims to do (and I haven't tested it yet) is selective layer loading/activation: it loads and unloads layers as needed, following the model's execution sequence, so the required memory is roughly the parameter size of a single layer. Besides this selective loading, they also claim to do block-wise quantization. How much overhead all this incurs is something I'll have to poke at. That said, while it might sound like voodoo, these are memory-optimization strategies people have been talking about for a while.
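
      A minimal sketch of the layer-streaming idea described above, in plain PyTorch (not AirLLM's actual code); the per-layer checkpoint directory and file names are hypothetical:

          import os
          import torch

          # Hypothetical setup: each transformer block was pickled to its own file
          # ahead of time, e.g. layers/layer_000.pt ... layers/layer_125.pt.
          LAYER_DIR = "layers"
          layer_files = sorted(os.listdir(LAYER_DIR))

          def run_layer_by_layer(hidden_states: torch.Tensor) -> torch.Tensor:
              """Push activations through the model one block at a time, keeping
              only a single block's weights in GPU memory at any moment."""
              for fname in layer_files:
                  block = torch.load(os.path.join(LAYER_DIR, fname),
                                     map_location="cpu", weights_only=False)
                  block = block.to("cuda")   # load roughly one layer's params into VRAM
                  with torch.no_grad():
                      # real blocks also need attention masks / KV cache; omitted here
                      hidden_states = block(hidden_states)
                  del block                  # drop the layer again...
                  torch.cuda.empty_cache()   # ...so the next one fits in 8 GB
              return hidden_states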

  • @landryplacid4065 22 days ago +2

    Note: it won't run on an AMD RX 6700 XT, because bitsandbytes has no implementation for AMD HIP on ROCm, and bitsandbytes is key to the quantization of the models.
    If you use any consumer AMD GPU below the 7700, get an Nvidia GPU or move up to a higher-tier card.
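
    A quick preflight check along these lines (generic, not from the video's script) tells you whether the 4-bit path is even available before you sit through a long model download:

        import importlib.util
        import torch

        # bitsandbytes ships CUDA kernels; on ROCm/HIP consumer cards such as the
        # RX 6700 XT the import may fail or the quantization ops may be missing.
        print("GPU visible to PyTorch:", torch.cuda.is_available())

        if importlib.util.find_spec("bitsandbytes") is None:
            print("bitsandbytes is not installed")
        else:
            try:
                import bitsandbytes as bnb   # raises/warns if no usable backend
                print("bitsandbytes imported OK:", bnb.__version__)
            except Exception as exc:
                print("bitsandbytes present but unusable on this GPU:", exc)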

  • @harshkamdar6509 21 days ago +4

    You are loading a 4-bit bitsandbytes-quantized LLM, which is already compromised on precision, and then applying AirLLM's block-wise quantization on top; the accuracy of the model will take a major hit.
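
    A tiny self-contained demo of why stacking quantizers hurts: this is a generic absmax block-wise 4-bit round-trip (illustrative only, not bitsandbytes' NF4 or AirLLM's exact scheme); re-quantizing with a different block layout typically adds a second rounding error on top of the first:

        import torch

        def blockwise_quant_dequant(x, block=64, bits=4):
            """Absmax-quantize x per block to `bits` bits, then dequantize."""
            qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit
            flat = x.flatten()
            pad = (-flat.numel()) % block
            flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, block)
            scale = flat.abs().amax(dim=1, keepdim=True) / qmax
            scale = torch.where(scale == 0, torch.ones_like(scale), scale)
            q = torch.clamp(torch.round(flat / scale), -qmax, qmax)
            return (q * scale).flatten()[: x.numel()].view_as(x)

        w = torch.randn(4096, 256)                       # stand-in for a weight matrix
        once = blockwise_quant_dequant(w, block=64)      # first quantizer
        twice = blockwise_quant_dequant(once, block=48)  # second, differently-blocked quantizer

        print("mean abs error, one pass: ", (w - once).abs().mean().item())
        print("mean abs error, two passes:", (w - twice).abs().mean().item())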

  • @testales 23 days ago +11

    That's probably software that processes one layer after another, and yes, in theory you can run a very large model that way. But nobody said you can expect a result before the next ice age...

    • @yakmage8085 22 days ago +3

      Yes, very misleading video.
      I did the same thing a few years ago with LLaMA 1 when it was leaked. Ran it on a 3090, layer by layer, and it took something like 5 minutes per token. It did work, but it was not useful at all.

    • @prakaashsukhwal1984 21 days ago

      @@yakmage8085
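
      Back-of-the-envelope numbers make the "minutes per token" reports above plausible: with pure layer streaming, every layer is re-read from disk for each generated token, so drive bandwidth alone sets a hard floor. The figures below are assumptions for illustration, not measurements:

          # Lower bound on seconds/token for layer-streaming inference of a 405B model.
          # Assumptions: 4-bit weights (0.5 bytes/param) and the whole model re-read
          # from NVMe once per generated token, at 3 GB/s sustained read.
          params = 405e9
          model_bytes = params * 0.5               # ~202.5 GB on disk
          nvme_bandwidth = 3e9                     # bytes/second (assumed)

          seconds_per_token = model_bytes / nvme_bandwidth
          print(f"model size on disk: {model_bytes / 1e9:.1f} GB")
          print(f"lower bound: {seconds_per_token:.0f} s/token "
                f"(~{seconds_per_token / 60:.1f} min), ignoring compute entirely")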

  • @AK-ox3mv 22 days ago +5

    Given infinite time, even a calculator can run GPT-5.

  • @AaronBlox-h2t 14 days ago +2

    I run Qwen2.5 72B at BF16 (which is what you can download off HF) with an Intel CPU, an Intel Arc A770 16GB GPU, 64GB of DDR4 RAM, and a 5TB NVMe SSD at 7,400 MB/s, and it runs well. OK, the initial load takes a few minutes, but inference is fine. I only swap about 8GB to the SSD, but if you have much less than 64GB of RAM you had better have a fast SSD, because it will be hammered a lot. For my setup Intel's IPEX-LLM is crucial; it's what makes this all possible on the Intel CPU and GPU. Yes, you need to know how to code, at least basic Python.
    I first got Qwen2.5 Coder 7B running, so if you have fewer resources than my Windows PC, you should stick to the 7B model.
    On the bright side, register with HF and you can call Qwen2.5 72B via their API where they host the model. You get 1000.....
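
    For the Intel setup described above, the loading pattern with IPEX-LLM looks roughly like this sketch; the exact arguments are an assumption based on that project's transformers-style wrapper (the commenter runs BF16, while this shows the more common low-bit path), and the model path is only an example:

        import torch
        from transformers import AutoTokenizer
        from ipex_llm.transformers import AutoModelForCausalLM

        model_path = "Qwen/Qwen2.5-7B-Instruct"      # start small before trying 72B

        # load_in_4bit / trust_remote_code follow the IPEX-LLM examples; flags may
        # differ between versions, so treat them as an assumption.
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            load_in_4bit=True,
            trust_remote_code=True,
        ).to("xpu")                                  # "xpu" = Intel GPU device

        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        inputs = tokenizer("Why is layer streaming slow?", return_tensors="pt").to("xpu")

        with torch.inference_mode():
            out = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(out[0], skip_special_tokens=True))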

  • @dsfsgsgxx 23 days ago +15

    You did not run it on 8GB of VRAM.

  • @ksreedharamurthy 22 days ago +2

    Please share the script mentioned in the video

  • @R055LE 23 days ago +4

    Considering this is Python we're talking about here, how long does it take to run... 1000 years?

    • @orlandovftw 22 days ago +3

      Almost all of the Python you see in ML/AI projects is a thin wrapper around C, C++, or CUDA.

    • @R055LE 22 days ago +1

      @@orlandovftw so... 800?
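
      The "thin wrappers" point above is easy to see with a toy measurement: Python only dispatches the call, and the compiled kernel does essentially all of the work (numbers are machine-dependent):

          import time
          import torch

          n = 10_000_000
          a, b = torch.rand(n), torch.rand(n)

          t0 = time.perf_counter()
          _ = torch.dot(a, b)                     # dispatches to an optimized native kernel
          native_ms = (time.perf_counter() - t0) * 1e3

          al, bl = a.tolist(), b.tolist()
          t0 = time.perf_counter()
          _ = sum(x * y for x, y in zip(al, bl))  # the same math in pure Python
          pure_ms = (time.perf_counter() - t0) * 1e3

          print(f"native kernel: {native_ms:.1f} ms   pure Python: {pure_ms:.1f} ms")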

  • @michabbb 22 days ago +1

    Sure, why put any links in the description? Makes no sense. People should have to search; that's so much more fun 🙄

  • @kittengray9232 23 days ago +1

    Better to wait for a BitNet version.
    Edit: or TriLM

  • @DeepThinker193 23 days ago +4

    ...is this real? I've never heard of this o_O

  • @awwaey_tw9414 23 days ago

    Crazyyyyyyy😮

  • @avalagum7957 22 days ago +1

    Why version 3.1 when 3.2 has been available for months?

    • @Q1nt0 22 days ago +1

      Because there's still no 405B model published for 3.2.

    • @alby13 22 days ago +1

      Testing with the 3.1 model that has been out is the appropriate and expected path. Would you want the release to be delayed so they can start over again with 3.2?

  • @HikaruGCT 22 days ago

    If this were true, it would have been in Ollama and Open WebUI long ago, and it would mean IPEX could smash through it. But alas, I don't think it's as good as it's said to be.

  • @CucumberHK 24 days ago +4

    WTF!!!!! 🫨

  • @LorenaMartínez-r1s 24 days ago

    script ???????