the kid at the end is my spirit animal
Bad Dad, using up all the emergency tape!
But really, thanks for another GREAT video that simplifies something super useful for so many. We appreciate you and your family's tape!!
Great info, thanks! Also, very glad you put that clip in at the end
I absolutely love that you included the child's reprimand at the end; I'm a new fanboy of yours. Great video!
Stella's always pointing out to me what I get wrong.
@ similar admonishments here… great work
Amazing explanation, thanks!
Thank you, Matt! 🙌 This was the topic I was going to ask you to cover. Great explanation and props! 👏👍
Your videos are the best. Really useful and well thought through, about the actually important concepts in AI.
Absolute champion! Really appreciate you Matt. Thank you ...
you may be a bad dad, but you're a great teacher!
With a shirt like that he has got to be the best dad on the block.
Thank you so much! This has helped me a lot! Please keep going. I also enjoy videos that aren’t just about Ollama (but of course, I like the ones that are about Ollama too!). Thank you!
Thank you Matt for this amazing explanation. I had a brief understanding so I thought of this but you really helped me fully grasp how this works. Also, your videos are an emergency
This is just the info I was looking for. My goal is to get useful coarse AI running on the GPU of my gaming laptop, leaving the APU free to give me its full attention.
Thanks!
I just tossed money at a new laptop with a 4070 JUST for Ollama, and with this video I was also able to throw smarts at it too to get it to do more with the 8GB of VRAM on laptop 4070s. Thanks so much!
I'd been spending a lot of time building models with various context widths and benchmarking the VRAM consumption. Deleted a bunch of them because they ended up getting me a CPU/GPU split. Time to create them again because they will now fit in VRAM!
Thanks again!
Unfortunately, Macs are better for local LLMs.
A $5K one with 128GB of shared RAM can potentially run a 70B model at 8-bit quantization.
The memory in Nvidia GPUs is just too low.
My 1070 Ti also has 8GB of VRAM and I can fit Mistral 7B Q6_K with a gig to spare. So far it's the best I've found that fits 100% on my GPU. What have you been having good luck with?
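For anyone trying to guess what will fit before downloading: a rough rule of thumb is weight size ≈ parameters × bits per weight ÷ 8, plus a couple of GB for context and overhead. A minimal back-of-envelope sketch (approximations, not exact Ollama download sizes):

```sh
# Rough weight-size estimate: params (in billions) * bits per weight / 8 ≈ GB
est() { echo "scale=1; $1 * $2 / 8" | bc; }

est 70 8     # 70B at 8-bit  -> ~70.0 GB: plausible on a 128GB unified-memory Mac
est 7  6.5   # 7B at ~Q6_K   -> ~5.6 GB: leaves headroom on an 8GB card
```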
It would be nice to have a video on downloading a model and modifying it for, say, a Mac mini with 16GB or 24GB as a real-world case. Awesome as usual. Thank you!
I am using my personal machine, an M1 Max with 64GB. Pretty real case.
I think what the viewer meant was specifically those memory availabilities. That said, it's also not that realistic, because everyone has different available memory depending on what else they have running (VS Code, Cline, Docker / Podman + various containers, browser windows, n8n / Langflow / ..., ...) - it all depends on one's specific setup and use case. People keep forgetting that it's all apples and oranges.
Some have 8 or 16 or 24 or 32GB. But the actual amount of memory isn't all that important. Knowing what model fits in the space available is the important part.
Very interesting video, thank you
LOL, opening with a Mac.... That's like a cheat code for running AI locally. I can run Llama 3.3 70B (Q4_K_M, 43GB) on my two-generation-old MacBook Pro (M2 Max) and it works pretty darn well, more T/s than I can read at least. It doesn't really change the point of your video, but just saying: if someone thinks quantization is going to get them to run a 70B model on most other laptops, they are going to have a bad day.
Nice, way to end with a smile :)
Thank you for your awesome AI videos!
Is there an easy way to always set the environment variables by default when starting Ollama? I sometimes forget to set them after a restart.
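Depending on the install, there are a couple of common ways to make these stick across restarts. A sketch, assuming the standard Linux install's ollama.service and the macOS app; OLLAMA_FLASH_ATTENTION appears in another comment here, and OLLAMA_KV_CACHE_TYPE is the related KV-cache setting, so double-check both names against your Ollama version:

```sh
# Linux (systemd-managed install): put the variables in a service override
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl daemon-reload && sudo systemctl restart ollama

# macOS: set them for apps launched by launchd, then restart the Ollama app
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
```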
Matt, you blew my mind
Flash attention is precisely what I needed.
This is a good one. Nice topic.
Wow! This made my favorite model much faster! 🤯 I couldn't run `OLLAMA_FLASH_ATTENTION=true ollama serve` for some reason, so I set the environment variable instead. Now, if only Open WebUI used those settings...
Thank you! And what is the name of the tool on macOS that you are using to see those memory graphs?
thank you very much 👍👍😎😎
Hi Matt, what about the quality of responses with flash attention enabled?
Nice information
Super helpful: S, M, L … I didn’t realize that was the scheme, duh.
I'm trying to use Ollama in production. Could you please explain how to handle multiple requests?
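Not what the video covers, but for what it's worth: the server exposes variables for concurrency, so something like the sketch below is the usual starting point. Check the docs for your Ollama version, since defaults have changed between releases.

```sh
# Serve several requests against one loaded model concurrently, and allow a
# second model to stay resident. Each parallel slot reserves its own context,
# so memory use grows with OLLAMA_NUM_PARALLEL.
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```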
Thanks, this was very helpful!
Question - I have a Mac Studio M2 Ultra with 192GB of unified RAM. What do you think is the largest model I could run on it? Llama 3.1 has a 405B model that at q4 is 243GB. Do you think I could run it with flash attention and KV/context quantizing?
I doubt it, but it's easy to find out. I can't think of a good reason to want to, though.
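A quick sanity check on the numbers, assuming the 243GB figure is the q4 weight file: flash attention and KV-cache quantization shrink the context memory, not the weights, so they can't close a gap where the weights alone already exceed the 192GB of unified memory.

```sh
# 405B parameters at ~4.8 bits/weight (the q4 download) vs 192GB of unified memory
echo "405 * 4.8 / 8" | bc -l   # = 243 GB of weights alone, before any KV cache
```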
Good info.
What about the IQ quantizations, such as IQ3_M?
Yes, but where can I buy that rubber duck shirt? That is the ultimate programming shirt.
Ahhh, the purveyor of all things good and bad: Amazon
@@technovangelist That moment of realization that amazon has *pages* of results with "men rubber duck button down shirt".
I don't understand how to activate flash attention.
🎉🎉🎉
How do you run the Flash Attention commands on Windows? Are there alternatives to Flash Attention for Windows? Can the commands be run in WSL?
It's more about whether it's supported by your hardware.
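For the Windows question above, a sketch of the usual pattern, with the caveat that whether flash attention actually kicks in still depends on hardware support as noted, and that the variable name should be checked against your Ollama version:

```sh
# Inside WSL it works the same as on Linux:
OLLAMA_FLASH_ATTENTION=1 ollama serve

# On native Windows, set OLLAMA_FLASH_ATTENTION as a user environment variable
# (System Properties, or `setx OLLAMA_FLASH_ATTENTION 1`), then restart Ollama.
```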
Thanks
While this video brilliantly explains quantization and offers valuable technical insights for running LLMs locally, it's worth considering whether the trade-offs are truly worth it for most users. Running a heavily quantized model on consumer hardware, while impressive, means accepting significant compromises in model quality, processing power, and reliability compared to data center-hosted solutions like Claude or GPT. The video's techniques are fascinating from an educational standpoint and useful for specific privacy-focused use cases, but for everyday users seeking consistent, high-quality AI interactions, cloud-based solutions might still be the more practical choice - offering access to full-scale models without the complexity of hardware management or the uncertainty of quantization's impact on output quality.
Considering that you can get results very comparable to hosted models even when using q4 and q3, I'd say it certainly is worth it.
GPT is a tech and not a (cloud) product
In this context it is absolutely a cloud product
How about writing these instructions right in the descriptions? How about putting them on the main page of Ollama?
I had tried to get it added, to make all the descriptions consistent. It was a conscious decision to make them inconsistent.
this guy is dope
Which model are you using exactly on your laptop?
70B even at Q2 should be 18GB or so.
On my laptop with a 3080 with 8GB GPU RAM, I'd still need to offload to CPU to get that to work.
Unless... You have one of those maxed out Macs with 128GB shared RAM or something...
I tend to use various 7-30ish B models. Going to 70B rarely has enough benefit most of the time. I have an M1 Max MBP with 64GB of RAM.
Combine this with a bigger swap file and you're laughing! You don't need a GPU; a swap file is your friend!
@@JNET_Reloaded I make sure to use 5400 RPM spinning disks as well. Seagates, too... Best not to scrimp on the important stuff.
@@OneIdeaTooMany I'd rather break out my 4200 RPM laptop IDE drives haha
❤
She is right! 😂
I'm reporting you to the emergency tape misappropriation department.
What am I going to do with my 300GB dual Xeon server now that I can do it on a laptop? LOL
I can take that off your hands for ya 😉
The child is awesome :D You are wasting tape :D
The audio 😢, no problem, I understand English 😅, that's the dev life 😂, thanks!
Matt, you always, always, always never show the most crucial code at the most crucial time; you're more interested in the prompt coming across. It's crazy - "where's the K_M command..." You were nearly the most helpful, but you always expect people to know what you're talking about, without the code, or with a four-word command like "ollama such and such". It's crazy.
Umm, I can't improve if you don't tell me what's missing...
I hate when viewers say "nice explanation"; I have absolutely no idea about this.
There's a lot to learn; the same channel has a playlist for learning Ollama. Ollama is the open-source platform your AI models run on. If his style of explanation isn't clicking, try someone else who does something similar.
Thank you, Matt.
You might be interested to know that large models are cheapest to run on the Orange Pi 5 Plus, where RAM is used as VRAM. You get up to 32GB of VRAM for $220, with great performance (6 TOPS) and power consumption of 2.5A × 5V. Ollama is in the Arch packages and available for arm64.
Price/performance!