Well, if it works, I wonder whether this should be incorporated into LM Studio and similar projects...
Please try it out xD, then I'll switch to Nvidia for that one
@@BeethovenHD You switch to Nemo?
And how long does the inference take? 40h/token? Just because you can run it does not mean it's useful.
I can only wish the author of the video a similarly "revolutionary" rate of view count and subscriber growth.
Apparently, you're right. I saw a discussion saying that on an A100 GPU with a 16-core CPU, inference for one sentence takes 20+ minutes... for one sentence.
@@funkytaco1358 So what AirLLM is claiming to do (and I haven't tested it yet) is selective layer loading/activation (i.e. it loads and unloads the layers as needed in the execution sequence of the model), so the required memory is roughly the parameter size of a single layer. Besides this selective activation/loading, they also claim to be doing block-wise quantization. How much overhead all this incurs is something I'll have to poke at. That said, while it might sound like voodoo, these are actually memory optimization strategies people have been talking about for a while.
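To illustrate the layer-by-layer idea (this is just a toy PyTorch sketch of the concept, not AirLLM's actual API; the layer type, sizes, and filenames are made up):

import torch
import torch.nn as nn

# Toy stand-in: pretend each "layer" is a small Linear block saved to its own file.
# A real 405B layer is a full transformer decoder block, but the memory trick is the same.
HIDDEN = 4096
NUM_LAYERS = 8

def save_fake_checkpoints():
    for i in range(NUM_LAYERS):
        torch.save(nn.Linear(HIDDEN, HIDDEN).state_dict(), f"layer_{i}.pt")

@torch.no_grad()
def forward_layer_by_layer(hidden):
    for i in range(NUM_LAYERS):
        layer = nn.Linear(HIDDEN, HIDDEN)                    # allocate only one layer at a time
        layer.load_state_dict(torch.load(f"layer_{i}.pt"))   # pull its weights from disk
        hidden = layer(hidden)                               # run just this layer
        del layer                                            # free the memory before the next one
    return hidden

save_fake_checkpoints()
out = forward_layer_by_layer(torch.randn(1, HIDDEN))

The catch is that every token has to stream every layer's weights off disk again, which is where the huge per-token latencies mentioned elsewhere in this thread come from.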
Note it won't run on an AMD RX 6700 XT, due to the absence of a bitsandbytes implementation for AMD HIP on ROCm. The bitsandbytes module is key to the quantization of these models.
If you use any consumer AMD GPU below the 7700, get an Nvidia GPU or move up to a newer card.
You are loading a 4-bit bitsandbytes-quantized LLM, which is already compromised on precision, and then applying AirLLM's block-wise quantization on top; the accuracy of the model will take a major hit.
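For context, this is roughly what a 4-bit bitsandbytes load looks like with HF transformers (the model id is just an example); the weights are already squeezed down to ~4 bits at this point, before anything AirLLM does:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

# example checkpoint; swap in whatever model you actually use
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_cfg,
    device_map="auto",
)

Stacking a second quantization scheme on top of that compounds the rounding error instead of starting from the full-precision weights.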
That's probably software that processes the model layer after layer, and yes, in theory you can run a very large model that way. But nobody said you can expect a result before the next ice age...
Yes, very misleading video.
I did the same thing a few years ago with LLaMA 1 when it was leaked. Ran it on a 3090, layer by layer; it took something like 5 minutes per token. It did work, but it wasn't useful at all.
@@yakmage8085
Given infinite time, even a calculator can run GPT-5.
I run Qwen2.5 72B at BF16, which is what you can download off HF, with an Intel CPU, an Intel Arc A770 16GB GPU, 64GB of DDR4 RAM, and a 5TB 7400MB/s NVMe SSD, and it runs well! OK, the initial load takes a few minutes, but inference is fine. I only swap about 8GB to the SSD, but if you have much less than 64GB of RAM, you'd better have a fast SSD because it will be hammered a lot. For my setup Intel IPEX-LLM is crucial; it's what makes it all possible on the Intel CPU and GPU (rough loading sketch below). Yes, you need to know how to code, basic Python at least.
I also first got a Qwen2.5 Coder 7B running, so if you have fewer resources than my Windows PC, you should stick to the 7B model.
On the bright side, register with HF and you can call Qwen2.5 72B via their API, where they host the model. You get 1000.....
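Rough sketch of the load, from memory, so treat the exact argument names as assumptions and check the IPEX-LLM docs (the model id is the standard HF one):

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # IPEX-LLM's HF-style wrapper

model_id = "Qwen/Qwen2.5-72B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_low_bit="bf16" keeps the weights in BF16 rather than quantizing them
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_low_bit="bf16",
    trust_remote_code=True,
)
model = model.to("xpu")  # "xpu" is the Intel Arc device under PyTorch + IPEX

inputs = tokenizer("Write a haiku about SSD swap.", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))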
You did not run it on 8GB of VRAM.
Please share the script mentioned in the video
Considering this is Python we're talking about here, how long does it take to run.. 1000 years?
Almost all of the Python you see in ML/AI projects is a thin wrapper around C, C++, or CUDA.
@@orlandovftw so... 800?
Sure, why put any links in the description? Makes no sense. People should search, it's so much more fun 🙄
Better to wait for a BitNet version
add: or TriLM
...is this real? I've never heard of this o_O
Crazyyyyyyy😮
Why version 3.1 when 3.2 has been available for months?
Because a 405B model still hasn't been published for 3.2.
Testing with the 3.1 model that has been out is the appropriate and expected path. Would you want the release to be delayed so they could start over again with 3.2?
If this were true, it would have been in Ollama and Open WebUI looong ago, and it would mean IPEX could smash through it. But alas, I don't think it's as good as it's said to be.
WTF!!!!! 🫨
script ???????
www.patreon.com/posts/114566125