I wouldn't necessarily say that this PC can "run" a 70B model.
It can walk one for sure...
Still replying faster than all my people trying not to reply right away to not look desperate 🤣
I didn't quite get what the limiting factor is for running the model faster. Processor speed?
@@quinxx12 concurrency and memory bandwidth.
With a rollator walker 😁
@@quinxx12 AFAIK to run LLMs effectively the data needs to be held in VRAM, as graphics cards can process data significantly faster than CPUs. I didn't fully understand this video btw, so I assume it's some kind of hack to run the 70B model in system memory and have the CPU process it.
I wonder if it would be capable of running Mixtral 8x22B. Does anybody have experience with it? How fast would it be if it can run it?
This is my favorite subsubsubgenre, because figuring out how to run LLMs on consumer equipment with fast & smart models is hard today. Gaming GPUs (too small) and Mac Studios (too expensive) are stopgap solutions. I think these will have huge applications in business when Groq-like chips are available and we don't have to send most LLM requests to frontier models.
Excellent explanation
Been running a local LLM (Mistral 7B and gemma-2-2B) on my iPhone 15 Pro for about a year. Output is instant.
@@haganlife and virtually useless, I presume
@@justtiredthings you didnt have to kill'em like that. lolololol
I'll take a smaller model running faster on VRAM over a larger model at 2 tok/s because it's running on CPU and RAM anyday
I don't want to disappoint you, but I am quite sure you would get the same 1.4 t/s running a 70B-parameter model purely on the CPU, and it would use half of the memory. So theoretically you would be able to run 180B models on the CPU (q4_K_M version). The thing is that on current PCs the limiting factor is not compute power but memory bandwidth, and since both the iGPU and the CPU use the same memory, you get very similar speeds. Make a follow-up video; maybe I am wrong, and if so I will be happy to learn that.
Strix Halo will have a 256-bit memory controller. DDR6 will be 2x DDR5 speed. Potentially we will 4x memory bandwidth in 2-3 years. The expensive Mac Studio Ultra has 800 GB/s of memory bandwidth right now.
@@Fordance100 I was talking about the Ultra 5 125H. The Strix Halo iGPU will be in its own league and I am impatient to see its test results. The Mac Studio has unified memory, which is basically soldered onto the chip package. As I understand it, Intel is also going to employ this approach for its ultra-thin series. Let's see, let's see.
@@perelmanych I ran models on a 13900K and on a mobile Quadro RTX 5000. They both ran about the same, about 20-30 seconds for responses. With an RTX 4080, though, it was way faster, responses in about a second or two. This was on a self-hosted website with a FastAPI backend using GPT4All, with LangChain for local docs, memory, and context awareness, and TorchServe for fast model loading and to help with concurrency.
@@criostasis May I ask which models and what API you managed to get working?
You can "run" it like that, but you are not going to get any kind of decent speed. That's the issue with running these models on CPU. Yes, the 4090 only has a space of 24GB, but that 24GB is super fast. The more layers you give to your GPU, the faster it will be. So I doubt it will be faster purely on CPU.
In 10 years, videos like this will be nostalgic
It will be like watching people spinning up 56k modems and getting amazed at the internet.
@@fusseldieb haha yes
How old are you guys? I assume 20 or younger.
Because 10 years ago 16 GB of RAM was okay, and today MacBooks still have 8 GB of RAM.
10 years in the future this will still hold up.
@@lockin222 You forget that most technological advancements aren't linear, but logarithmic.
Every video recorded 10 years ago is nostalgic today, isn't it? 🤔
Thank you. I learned a lot. With respect, we have a truly vastly different idea of what cheap means. U.S. $700-$800 total for this unit (after tax ~$900.00) is a whole lot to me. I get it that it's cheaper than other new stuff by comparison.
@@tyanite1 right now, they sell them (barebone) for USD 400. Still not nothing, and I could imagine better uses for my money too, but getting the barebone, I'll buy one 48GB RAM stick first for ~100 soon and play with it, and whenever I feel like upgrading, I'll buy another 48GB stick. This should cost a bit over USD 600 tops.
Yes, that's also still a chunk of money, and I'm not saying this to prove you wrong, but since you said money is a limiting factor, who knows... Maybe it helps.
I'm curious AF. Can't wait. There are situations where I can't bring myself to send certain info (mostly work / code related) over to some overseas company, so whatever brings me closer to a usable local LLM is greatly appreciated. Even if I'll have to wait around 30 seconds for a reply. Everything is better than waiting 10 minutes or getting faster replies which are of no use whatsoever.
The alternative is USD 6,000-8,000+ for just a GPU with enough memory.
Cheap is relative to the product itself. A cheap car can be a few thousand, a cheap house 50-500k depending on the area. $800 to run this model is 'cheap'
Most hardware is still not designed for running AI. The average Joe won't buy a 192GB Mac to run an LLM. A 4090 doesn't have enough VRAM to run most LLMs.
Maybe 5090 will have 48 gb VRAM
My average Joe uni classmate bought a max out MacBook pro with near hundred GBs of ram to run LLM and he is happy with it 😂
Apple is the ONE company to actually push local LLM's. Surely they'll upsell you to a 40GB language model if it makes any sense.
It's not just the computation speed of the 4090; its VRAM has extremely high bandwidth (and is therefore so expensive, but also crazily power-hungry). Apple Silicon doesn't have "just more RAM": Pro/Max/Ultra each double the base M-series memory width/bandwidth. So the M2 Ultra gets closer to a 4090's >1TB/s with its 800GB/s bandwidth. LLM token generation is mainly dependent on memory bandwidth. THIS (and power consumption) is why many buy a Mac Studio instead of multiple 4090s, if they do just LLM inference and not machine learning. But NVIDIA is nearly without peers for ML because of its raw compute power.
@@ThePgR777 Nah it's 28gb
Also, an AMD Ryzen 7 7700 can run 70B without a GPU, but at 3 tokens/s.
I run a 103B on 4 slots of RAM and also get about 3 T/s, and this is almost exactly half that with 2 slots.
Whether LLMs run at 1~20 T/s before they get a decent GPU is entirely dependent on memory bandwidth. The best machine for a 70B is actually a 256GB, 12-slot, dual-CPU circa-2015 Xeon, which runs about $2000 total with eBay parts (90% of the cost is the motherboard and CPUs).
In other words, no GPU is required at all, just as many RAM slots as you can find.
Uhm... I need to test my HP DL380 G9 -> 768GB RAM. Cost 200€. Thanks for the idea 😁
Please keep this series going!
Mini PCs are amazing!!! I got a SER8 last week with 96GB of memory and a 4TB NVMe. It matches my old Threadripper 1950X in multi-core, but has more memory and storage, BLOWS it away in single-core, and fits in the palm of my hand. I literally am in love with it now o.o
I might get another, connect them via USB4 at 40Gbps, and cluster them, if that's possible o.o
Thanks for testing this out. Thought about testing this for myself using the Minisforum version of this mini PC.
There seems to be another way of running LLMs, using the actual NPU of the Ultra CPUs instead of the Arc GPU, by running it via OpenVINO.
I would be very much interested in some more testing on Linux + OpenVINO.
It's a software issue. Ultra processors are optimized to use the NPU for artificial intelligence, NOT the GPU. You're using the wrong part (a very slow GPU instead of a very fast NPU), but I understand it's because the software you're using doesn't allow you to use the NPU.
You should also SHOW how many tokens the CPU alone can process (NOT using the GPU) to compare the performance. I insist, you’re using a very slow GPU, maybe even the CPU is better. In science, you always have to check all the factors experimentally and not take anything for granted.
Good luck 🍀
thank you 🙏
@@AZisk NO, thank YOU for all your hard work in making such an interesting and useful video.
Some models won't run on NPU. We still need some time for all the software and hardware to align
Yes, keep exploring these alternatives to running expensive GPU cards or Apple silicon
That's why I like this channel.
This is not an exploration into alternatives to GPUs, more so getting a model to even fit on such a tiny machine. We know that the more RAM you slap into a machine, the larger the models you can "run", but as you can see it slowed to a snail's pace. VRAM will always be king over RAM. That 24GB in the 4090 is incredibly fast. If you can fit a model solely on the 4090, there's no bigger model you could really need. 70B models are quite underwhelming for their weight.
They need to start making GPUs with DDR slots. It would be slower for gaming but great for LLM and image generation
they could make a new socket to pop vram chips directly on the GPU without the need for soldering
Running Ollama with Phi-3.5 and multimodal models like MiniCPM-V on an Amazon DeepLens, basically a camera that Amazon sold to developers that is actually an Intel PC with 8GB of RAM and some Intel-optimized AI frameworks built in. Amazon discontinued the cloud-based parts of the DeepLens program, so these perfectly functional mini PCs are as cheap as $20 on eBay. I have 10. :)
Well, but Phi-3.5 has such low-quality output it's basically useless.
@@kiloabnehmen2592 No, compared to prior >3GB LLMs, the fact that it wasn't rambling with incomplete sentences, repeating sentences, and inventing new questions to answer was beyond a fsking miracle -- and it does often produce high-quality output. And now there are even smaller LLMs like IBM's Granite MoE 1B, only a freaking 862MB, and it's a __mixture of experts__ model; it was able to output functional VHDL even, and being a mixture-of-experts model, it's perfect for embedded devices.
The point of tiny LLMs is not their ability to recall esoteric facts but to provide a way to do menial tasks by way of voice conversations. Function calling is a nifty way to give LLMs knowledge that may not be within their training data, as well as up-to-date info, but to be able to ask an LLM running on Home Assistant to turn on a light for 10 minutes, or add breakfast cereal to a shopping list -- an LLM that can do that conversationally, with a semblance of humor & cultural awareness, in realtime, on an embedded chip, is a freakin' game changer.
did you try running a 4 bit quantized larger model on those? what's the best tok/s you got?
those are using an intel atom cpu with only 100+ GFLOPS of power.
@@danielselaru7247 I'll check, been using an RK3588 lately
Why not use the new processors with huge TOPS perf instead...?
bro, these are the fucking videos we need. why is everyone just talking and never doing videos like this?
@@maxxflyer watch your mouth
@@Y0UTUBEADMIN no
10:48 Not impossible, llama.cpp can do partial acceleration, running some layers on the GPU and the remaining layers on the CPU.
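For anyone who wants to try that split, here's a minimal sketch using the llama-cpp-python bindings (the package choice, model filename, and layer count are illustrative assumptions, not what the video used):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF path
    n_gpu_layers=40,  # offload as many layers as fit in VRAM; the rest stay on the CPU
    n_ctx=4096,       # keep the context modest so the KV cache fits alongside the weights
)

out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The idea is simply to raise n_gpu_layers until VRAM runs out and let llama.cpp keep the remaining layers on the CPU.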
Now I'm curious how well your mac can run the 70b model
what do you think would happen if we connect multiple k9's with exolabs? would we get an linear increase of the tok/s? how does the bandwidth affect it if we would connect it by usb thunderbolt4 with a 40Gbs speed?
Could you make a video comparing the iGPU vs the NPU?
1.43 t/s is kinda OK, but realistically it's not very useful.
I think a bang-for-buck option would be to use a couple of Tesla P40s to get like 5 t/s. It won't look pretty, but if you chuck it in the garage or something it's not a problem.
What's the deal with the integrated NPU in Intel Ultra CPUs? Do you take advantage of it in this setup? I couldn't find detailed information about it; various articles usually just say "designed to accelerate artificial intelligence (AI) tasks", which is pretty vague.
7:05 Always try to power up before screwing the device's cover closed. On rare occasions you need to reseat the DIMMs.
Hmmm, besides RAM/VRAM size, it's mostly RAM bandwidth for token generation that determines llama.cpp's speed (the 4090 has >1TB/s, the M2 Ultra has 800GB/s). GPU horsepower is mainly useful for (batched) prompt processing and learning.
And for RAM size, it's not just the model! With large-context models like Llama 3.1, RAM requirements totally explode if you try to start the model with its default 128k token limit.
But cool video, thanks!!!!
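To put rough numbers on that explosion, here's a back-of-the-envelope KV-cache estimate; the layer/head/width figures are assumptions loosely matching a Llama-3.1-70B-style model with grouped-query attention, not exact specs:

```python
# Back-of-the-envelope KV-cache size. Shape numbers are assumptions roughly
# matching a Llama-3.1-70B-style model with grouped-query attention.
def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                   context_len=128_000, bytes_per_elem=2):
    # keys + values (2x), stored per layer, per KV head, per token position
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

print(f"128k context: ~{kv_cache_bytes() / 1e9:.0f} GB of cache")          # ~42 GB
print(f"8k context:   ~{kv_cache_bytes(context_len=8_192) / 1e9:.1f} GB")  # ~2.7 GB
```

So at fp16 the cache alone can rival the quantized weights at 128k context, which is why shrinking the context window (or quantizing the cache) matters so much on boxes like this.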
i definitely need to include bandwidth in my next vid in the series
Is there a bottleneck for discrete cards moving memory, though, versus a shared memory bus that can load it faster? As in, the 4090's bandwidth is only within itself, and the regular RAM used to feed it is 3-5x slower? Unfortunate that Nvidia will most likely never make something in the middle of the 4090 - A4400 range for ML and AI people.
@@xpowerchord12088 It's simpler, yet complicated: a transformer has to go through its entire compute graph for each and every token it generates. So it has to pump ALL of the billions of parameters, as well as the transformer's KV cache (which can add many additional GBs at 128k context sizes), from memory (RAM or VRAM) via the (multi-level but small) on-chip caches to the ALUs (in the GPU or CPU or NPU). Token generation (unlike prompt processing) is not batched, so this has to be repeated for EACH and every token it generates. Modern CPUs (with their matrix instructions), GPUs, and NPUs have very many ALUs calculating/working in parallel. Because of this, it's not the calculating but the pumping of the parameters/KV cache from memory to these ALUs that becomes the bottleneck. Current NVIDIA (e.g. the 4090) is able to pump more than 1TB/s with ultra-fast and wide RAM. Apple Silicon uses 128-bit wide RAM in the M, 256-bit in the M Pro (except for the M3 Pro, which is crippled), 512-bit in the M Max, and 1024-bit for the M Ultra. Combined with the RAM's transactions/s this yields 120-133GB/s for the M4 with its LPDDR5X (new Intel/AMD and Snapdragon X do similarly), more for the Pro, and up to 800GB/s for the M2 Ultra (with its older LPDDR5).
Hope this clarifies, sorry for the long-winded explanation.
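If you want to sanity-check that picture with arithmetic, a crude ceiling is tokens/s ≈ memory bandwidth ÷ bytes streamed per token. The bandwidth and model-size figures below are illustrative assumptions, not measurements:

```python
# Crude bandwidth-bound ceiling: each generated token streams roughly the whole
# quantized model through the memory bus once. All numbers are illustrative
# assumptions, not measurements.
def tokens_per_sec_ceiling(model_gb, bandwidth_gb_s):
    return bandwidth_gb_s / model_gb

MODEL_GB = 40  # ~70B at q4_K_M

for name, bw in [("dual-channel DDR5-5600", 90),
                 ("M2 Ultra", 800),
                 ("RTX 4090", 1008)]:
    print(f"{name}: ~{tokens_per_sec_ceiling(MODEL_GB, bw):.1f} tok/s ceiling")
```

A ceiling of roughly 2 tok/s for dual-channel DDR5 lines up with the ~1.4 t/s reported in the video.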
@@andikunar7183 Thanks for the concise and educational response! Appreciate your time.
It only explodes if you are not using context quantisation
Would this PC support two 128GB modules? Maybe those modules are too large.
Can you run the model in something with a better UI like LM studio? Or serve it up as an endpoint? How much does that quant reduce quality?
Could AMD APUs be even better?
I wish you had tried a 32B LLM like Qwen 2.5 to show it in action. I'd like to know what parameter level produces a reasonable token output.
Nice chair, but why was the HM lumbar support part removed?
Seriously though, I think monitor size affects me more than my chair. I've never used a curved screen, and looking at them in stores never impressed me, so I'd like to get a sense of whether or not switching is worthwhile.
soooo excited to see you testing the new lunar lake intel cpus
me too. coming soon hopefully
A bit disappointing that we can't use the NPU ?
Thank you very much! I have the newest 70B model on both an MSI laptop AND an MSI desktop, each with 64 GB of DDR5. They run somewhat slow, but usable, and FASTER than your demo! 😮
@@davidtindell950 so, what GPUs are used in there?
Same CPU and GPU?
Anyway, please give token rates, quantization, and relevant machine specifications.
I would assume better performance in the desktop at least because of better cooling versus a mini PC.
Which CPU and GPU
@@paultparker Sorry for the delayed response to replies. Have multi-project DEADLINES! MSI CodR2B14NUD7092: Intel Core i7-14700F, NVIDIA GeForce RTX 4060 Ti, 64GB DDR5 5600, and 2TB M.2 NVMe Gen3!
@@vaibhavbv3409 Sorry for the delayed response to replies. Have multi-project DEADLINES! MSI CodR2B14NUD7092: Intel Core i7-14700F, NVIDIA GeForce RTX 4060 Ti, 64GB DDR5 5600, and 2TB M.2 NVMe Gen3!
Technically, Nvidia could allow the user to use normal system RAM as VRAM, similar to swapping. They could also use memory compression for VRAM; it's pretty usual for system RAM. Maybe they already do this for VRAM too, I'm not sure. Yes, it'd be slower if they used these techniques, but it'd be better than not being able to run the task at all.
Technically you could just mount X amount of RAM as a volume and use that as a swap disk as well, though, no need for Nvidia to do anything. If they allow swapping, any volume could be swapped to.
@@TommieHansen swap won’t help if nvidia doesn’t support offloading video memory pressure to system memory pressure.
I am a huge fan of the Minisforum PCs. Extremely similar in form factor. Sounds like soon we will be having an AMD/ARM/Intel AI benchmark race. :-)
I've got one of them too, video coming soon :)
@@AZisk Cool, especially since you do some development on Windows but for some reason use WSL2 even when not needed (it can be quite a bit slower for some stuff than "native"...)
I knew Windows was going to win at this LLM thing. I knew that Intel Arc would also win. Nice one. I want you to know that some of us might never have access to a Mac machine due to location. And the fact that you can just buy a RAM upgrade is amazing.
Minisforum are unreliable and have BIOS bugs.
@@TH-camGlobalAdminstrator sounds like a little update and good to go
While the 4090 can't run the whole model, it can still speed up the process significantly, as you will offload some layers to the GPU.
BTW, a year ago I was able to run Llama 2 70B on my laptop with a 6900HS 8-core CPU, and I only have 24GB of RAM, so it was using swap memory (virtual memory), aka the internal SSD. I was getting one token of output every 10 seconds. I only had a 3060 6GB, so I couldn't offload much to the GPU.
Totally not worth, but thanks for the information.
It can load the whole model (Google exllama2). And it can do so much more. In particular, use q4_0 on the KV cache, bringing it down from 40GB to 10GB at 128k context.
The same library seems to support multi-GPU setups. Hence an 8x Intel Arc Pro A60 setup totalling 96 GB of VRAM could in theory be attempted and still be more competitive than a Mac Studio from a TOPS-per-dollar perspective. Don't expect the same size, silence, and power efficiency though…
For a small lab, maybe. Functionality would be way down, though. An M3 Max with 96GB would be a better all-around deal for an individual. You should see the pro-level Nvidia cards that can be linked, each with 48GB of RAM. Too bad Nvidia will not jump into this market when it has the gaming and pro sectors nailed.
My mini PC, for general usage, is a Ryzen 7 APU with integrated AMD graphics and 64GB of DDR4 RAM, 56GB of which has been set as dedicated to graphics in the BIOS.
It's slow, it's AMD, but it runs stuff on the GPU and is still a lot faster than CPU-only (it still sucks at running Cyberpunk 2077).
.. I keep wondering how a modern AMD desktop *G-model CPU, with a load of CPU cores, a decent integrated GPU, and 128 or 256GB of fast DDR memory available, would handle things.
It's certainly the cheapest way I can think of to get (close to) 256GB of memory on a GPU - you could have a rack of them for the cost of the Nvidia GPUs you would need to get to that 256GB.
@@jelliott3604 More cores don't help; you'll actually get better performance with hyperthreading off. Single-core benchmarks are a better indicator for LLM/AI, as it's about clock speed/turbo boost x RAM bandwidth throughput.
@@jelliott3604 AMD keeping AVX-512 in their consumer line is gonna make the competition really interesting for CPU-centric builds though, maybe as soon as this next-gen refresh. Intel is making all the wrong moves.
@whodis5438 My gaming box, the one in the nice case with all the ARGB lighting, is another Haswell-E CPU (i7-5960X) on an LGA 2011 board with hyperthreading turned off and that octa-core 3GHz processor clocked up to just under 4.6 GHz on all cores.
Serious question: how's the chair holding up? That springy lumbar, the armrest adjustment, and the mesh seat for summer are what I'm looking for.
aside from the ability to run the 70B LLM, is there a practical use case here when the tokens/second is pretty slow?
absolutely no practical use case with that low tps
You can always pick up a second-hand Tesla K80 and run it side by side with your 3090/4090 or other GPU. I have a 3090 and a Tesla K80; sure it's old, but heck, I have 48 gigs of VRAM to play with and things just run smoothly. Sure, I'm not going to break the 100-meter sprint record, but coming last in an Olympics out of 8 billion people is good enough for me.
Lots of alternative ways to leverage big company clean outs of servers which no longer have value to them but are of value to us consumers running AI on the smell of an oily rag.
Love the videos.
I’ve been using an Intel i7-1255u in a mini pc to run GPT4All with some pretty good results, as long as you stick with smaller highly quantized models.
GPT-4? I thought GPTs were all proprietary and not released to the public…
Alex, try to check it out with minipc or laptop together with eGPU like 4080 or 4090. Thank you!
What about the GMKtec M3 Plus Mini PC with an Intel Core i9-12900HK?
Version? It supports 96gb ram also? Same consumption?
I have a core i7 9750H and am running llama3.1 model pretty well. I'm just now getting into AI models and learning about this stuff and it's pretty crazy. I want to scale up and mess with this stuff but finances are the limit lol.
It's crazy to think that in 8 or so years, we'll likely have something far better running on our phones without a problem. It doesn't have to be perfect, just "good enough" to help people with their work.
Wow, we’re having a Hack Week, and I was thinking of this-nice timing!
In order to view GPU utilization, set Task Manager to Compute (right-click one of the diagrams).
Why the NPU is not used?
Isn't this an actual use case?
No. NPUs only get used when a particular piece of software uses their API for it, i.e. Photoshop, Apple Intelligence, Copilot. NPUs are proprietary, underutilized, and stuck on the hype train. GPUs are much better in this respect.
@@xpowerchord12088 I bet there is an NPU card module you can plug into the PC.
From my understanding, NPUs are for apps that Apple gives the API to. It's proprietary. They are mostly for hype right now and not used for running AIs, mostly in apps that do AI and video imaging.
@@univera1111 NPUs are basically GPUs from my understanding, but they're proprietary and not used for this. Not much actually uses the NPU aside from the OS and companies who get the API, like Adobe.
@@xpowerchord12088 Somewhat close; it's basically a single part of a GPU (no memory, and no specialized blocks for video encoding/output/RT/etc.).
Can you run the same on Intel Core Ultra Series 2 machines?
I bought this setup and 96GB of memory after seeing this so I'm hoping you do more in the future.
Let me get this straight: spend nearly USD 700 on the K6 mini, which has a lobotomized Intel Arc Graphics (2.2 GHz, 8 Xe cores, 112 EUs)... then spend another USD 200 on 100GB of DDR5 RAM. So about USD 1000 with tax to run a QUANTIZED 70B model = reduced accuracy and precision.
I didn't have USD 1000 to throw away, but I am a gamer and coder, and I have a Windows system with an Intel CPU, an Intel Arc A770 16GB GPU, 64GB of DDR4 RAM, and a 5TB 7400MB/s NVMe SSD with DirectStorage, all bought for gaming by the way. So I set about coding and set up my system to run inference on a Qwen2.5 72B model, and it runs fine. Ok, it takes minutes to warm up and load the first time, but after that it runs well, and it runs BF16, not quantized, as it is on HuggingFace, so no reduced accuracy and precision.
In contrast, running a Q8 Qwen2.5 32B model (moderately reduced accuracy and precision via the Q8) through LM Studio and doing inference was SLOW... I mean, I could count the letters being printed out if I wanted to. haha
Yes, LM Studio on the same system.
Hey Alex, is there a direct correlation between the amount of RAM and the number of parameters a system would support? I’m just thinking of getting an M4 Mac mini and wondering what a difference it would make to get 16 or 32GB of RAM. What kind of LLM would I be able to run on this small system?
Yes, there is a direct correlation. For a non-quantized model I generally assume I need about twice as much ram as parameters, but I could be totally off base.
You can get an estimate by looking at the size of the model download file on Hugging Face.
On a Windows machine, I believe only 75% of the ram is available to the internal GPU, which is why he only had 55 GB of ram available and not 96. You can see that he still used the whole 96 though.
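A minimal sketch of that rule of thumb (the bytes-per-parameter figures are rough assumptions for common formats, not exact values):

```python
# Rule-of-thumb weight footprint, ignoring KV cache and runtime overhead.
# Bytes-per-parameter values are rough assumptions for common formats.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.07, "q4_K_M": 0.6}

def weights_gb(params_billion, fmt="q4_K_M"):
    return params_billion * BYTES_PER_PARAM[fmt]

for p in (8, 33, 70):
    print(f"{p}B params: fp16 ~{weights_gb(p, 'fp16'):.0f} GB, "
          f"q4_K_M ~{weights_gb(p):.0f} GB")
```

The 70B q4_K_M estimate lands around 42 GB, roughly the size of the quantized download, before adding KV cache and OS overhead.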
On a 32 GB machine you should be able to run a 7 or 8 billion parameter model. Some people say that they can run a heavily quantized 33 billion model. I even saw one claim about running a 70 billion model. However, even the 33 billion model that was heavily quantized was only 7-8 tokens per second.
I think you can run the same model on a 16 GB machine if it was heavily quantized, but my guess is it would take up most of the system so you couldn’t do much else on the machine.
I would buy as much RAM as you easily can: that is my plan. These models are very RAM hungry. Plus, you may be running multiple models at the same time, or at least keeping them loaded at the same time, once the baked-in Apple models come out.
I can feel mobile phones with 128GB+ of RAM approaching already.
Is there a better setup to run70B?
Inspired me to try it out with the 135H; works like a charm with up to 13B-ish params.
Doubtful, but I wonder what I could run on my homelab. I have 3 Dell R620s running Proxmox, clustered. Each node has 2 Xeon E5-2690s, 10c/20t (20c/40t total per node), and 128GB of RAM.
It's all old hardware so I doubt it, but it would be nice to try.
Alex, Intel Lunar Lake CPU laptops are out. Please review and share your experience in a development environment.
I don't think they're for sale right now. What we've seen so far are reviews thanks to brands like Asus and Acer partnering with some YouTubers.
thank you for your experiment, great job
9:12 Is it possible the model is loaded into the RAM twice? First it loads it into the CPU side RAM and then copies it to the GPU side RAM effectively using twice the memory.
This may be the first time ever that integrated GPU's are preferred over dedicated cards as they rarely have more than 24GB.
Heck even AMD's old Server cards were only 32GB.
Sure the speed of those Cards is better, but the fact you could put some of these AMD CPU's with an iGPU on board and max it out to 1TB of RAM on some boards is pretty amazing.
The old DDR4 servers, IMO, are going to spend their lives in home labs packed with RAM, doing LLMs and other work for a long time.
Using a system like this, built the way China does in their CCTV systems, sure seems like something that is going to be leveraged in the future for home CCTV setups that don't want to use the cloud for such operations.
I wonder, are those NPUs any use for video segmentation or object detection, or is it still better to run those models on a GPU?
In context of where we're at with current LLMs, are these the equivalent of 1980's dial-up modem 1,200 bits/sec speeds?
Probably good form to put a link in the description to the post you based this video off of. 👍
While at it, install LM Studio. It now supports Vulkan.
How well do these LLMs run without unified memory? Seems like even with 96gb here it's not quite like apple silicon.
how many chrome tabs can it open at a time?
I wonder if it would work on the GPUs of some older low-power Intel CPUs, like the J4125 or N100, that don't have an Arc iGPU, and what the performance would be.
I was running Llama 3.1 70B on an old server with 2x Xeon chips and 128GB of RAM running at 1333MHz... total cost for the server = $125 off Facebook Marketplace (PowerEdge R710). Responses took a while but it ran.
؟ 00:44 is that a thunderbolt dock's cable ?
I have a 7W Intel Core N300 mini PC with just 8GB of LPDDR5 RAM; it runs Q8-Q4 quants of the Qwen2.5 and Llama 3.2 models very well. I mostly use GPT4All and Ollama, but the newer 7B Q4 models with embeddings knock the socks off all of Google's models, and aren't slow. I usually use Qwen2.5 7B Instruct Q4, and the new WhiteRabbitNeo 3B at 8-bit quant. The machine cost $180.
Hey that's cheap for what it can do. Cool.
What's the size of the smallest LLM? I don't know if it's perplexity, gemini nano or mistral ai
How many tokens per second does this CPU get without the integrated GPU?
If your only unit of measure for success is that it can run it, regardless of how quickly, I made Llama 3 7B run on a Khadas VIM4 Pro using Ollama. Every CPU core spikes and pins at 100% for the majority of the output, but that's expected of an IoT SBC.
Sorry for the delayed response to replies. Have multi-project DEADLINES! MSI CodR2B14NUD7092: Intel Core i7-14700F, NVIDIA GeForce RTX 4060 Ti, 64GB DDR5 5600, and 2TB M.2 NVMe Gen3!
"but what's impressive is that this tiny little bo can run a 70B LLM... like a snail. So if you're really REAAALLY patient, this is possibly a solution for you" Lol
GEM12
AMD Ryzen 7 8845HS 32GB DDR5 5600
Radeon 780M (Fixed 8GB for iGPU)
::LMStudio::
Llama 3.1 8B Q5 => 7.4 tok/sec
Llama 3.1 8B Q6 => 6.78 tok/sec
Llama 3.1 8B Q8 => 5.548 tok/sec
Yeah, with 2022 hardware you have a problem with VRAM on Nvidia cards... it costs like gold and is always out of stock.
about the "GPU spike" change the track parameter to CUDA
I just found a motherboard for the Intel Core Ultra 200 series that can use 192GB of RAM. How much could we assign to the GPU as VRAM?
VRAM is part of the GPU and that 192GB is system RAM, so two separate things. However, you can buy an Intel Arc A770 16GB cheap on sale, or for more money an Nvidia GPU with 24GB. By MORE money I mean at least USD 1000 vs 250 for the Arc. You need custom code to split the load between system RAM, VRAM, and a fast NVMe SSD. The SSD is optional cuz with your massive RAM you won't need it for a 70B model.
That's a great MB by the way. I aim to upgrade to that or one with 256GB max system RAM if I can find it. Maybe next year, I'm good for now.
You should close the browsers, as they use significant RAM.
And instead of the GPU, the NPU should be faster in that Intel APU.
For more iGPU RAM, you can try a desktop AMD 8700G with 4x full-size DIMMs.
The CPU can go up to 256GB, but the max UDIMM on the market is 48GB, so it will be 192GB max.
The integrated GPU spec is around 17 TFLOPS FP16/BF16, and 16 TOPS NPU INT8.
Can you run Stable Diffusion or Flux or text-to-speech models using this library? Can I request a video on that?
Make the context size 32768 and see how it runs. I noticed that if I use a few fewer GPU layers and let my 3970X threads do the rest, it gives me 8 tokens a second. Then if I offload all the GPU layers, it is actually slower, using Llama 3.3 70B, Q4_K_M, a 4090 GPU... 256GB RAM running at 3600.
Are there any similar tests on the Ryzen AI 9? It should have a more powerful GPU and NPU.
If you want to use those external GPUs, I think an MoE model is better, with the KTransformers library.
Would you say that to run big LLMs locally, a Mac Studio with huge unified memory is the best economical option? Especially for 70B non-quantized?
mac studio is $$$$$. this box is $.
@@AZisk he also said not quantized. Also, I would guess he wants more than 1.5 tokens per second.
Can you share your thoughts on why you'd go for this config and not opt for a config with Nvidia in it?
Can it handle SD? But what about Unreal rendering?
but what are the advantages of downloading an AI vs using a web browser version? Is using it offline the only advantage?
I mean it seems way more expensive considering the hardware you need. Way more complicated and even less powerful...So what is the point?
@@tonyman187ask your local AI…ohh wait you can’t
I'm guessing AMD doesn't have the right CUDA equivalent or something, which is why it's not tested?
Alex doing a commercial made me laugh- plus never knew he was in bare feet 😂
Why didn't you just use ollama?
Actually, installing a model is much easier now. You can even have UI for free. Msty, for example. Or use ollama directly, if you prefer CLI.
Cool... Is it really practical though??
Not really sure what's happening, but I've been running models of 40B and higher on a 2060. How come I'm able to do this? I thought you needed huge power.
It ends with electric cords galore. I bought old PCs and they were not up to date. I had to supply Bluetooth dongles and WiFi dongles. I tried to take them apart because it was supposedly "so easy", but it was not, and they malfunctioned. Now they've been replaced with a Lenovo laptop.
Would like to see you try putting the 96GB in the 4090 machine so the 4090 can process the whole model.
you can’t add ram to the 4090
@@AZisk No, I mean adding it to the regular RAM slots of the PC that has the 4090 in it. With modern Nvidia drivers it increases the shared VRAM by 96GB, but with a text generator like oobabooga it makes even more efficient use of it for offloading, and DeepSpeed will make further use of RAM with special optimization techniques that only PCs with >64GB of RAM can handle.
Not much different from what you're currently doing by just holding the model in RAM and processing with the CPU, although there is some overhead swapping between RAM and VRAM, so experimentation is necessary.
Why is the iGPU only able to use ~50GB? Shouldn't it be able to use almost all of the 96GB, like Mac Studios?
macs can’t use the full ram for their gpus either even with unified ram
I’m interested if my 2019 intel with 128gb of ram will finally have the opportunity to use all of it. Most I ever really needed was like 55 or so. My 16gb m1 hates me. Haha.
Although it's cool to see that it works at all, I can't think of how it would be usable with such low output speed. Besides maybe confirming that the model runs?
I'm comparing it with my M1 Max MacBook as a reference for 70b, which provides usable generation speeds (reading speed between 5-8 token/s depending on quantization)