Optimize Your AI - Quantization Explained

  • Published Jan 29, 2025

Comments • 74

  • @lemniscif
    @lemniscif several months ago +13

    the kid at the end is my spirit animal

  • @eyeseethru
    @eyeseethru several months ago +6

    Bad Dad, using up all the emergency tape!
    But really, thanks for another GREAT video that simplifies something super useful for so many. We appreciate you and your family's tape!!

  • @lofiurbex2511
    @lofiurbex2511 several months ago +6

    Great info, thanks! Also, very glad you put that clip in at the end

  • @gearscodeandfire
    @gearscodeandfire 12 days ago +2

    I absolutely love that you kept the child's reprimand at the end; I'm a new fanboy of yours, great video

    • @technovangelist
      @technovangelist 11 days ago +2

      Stella's always pointing out to me what I get wrong.

    • @gearscodeandfire
      @gearscodeandfire 8 days ago

      @ similar admonishments here… great work

  • @andikunar7183
    @andikunar7183 several months ago +1

    Amazing explanation, thanks!

  • @octopusfinds
    @octopusfinds several months ago +2

    Thank you, Matt! 🙌 This was the topic I was going to ask you to cover. Great explanation and props! 👏👍

  • @FlorianImmanuelFischer-di7wb
    @FlorianImmanuelFischer-di7wb 6 days ago

    your videos are the best. really useful and well thought through about the actually important concepts in AI

  • @ShaneHolloman
    @ShaneHolloman several months ago +1

    Absolute champion! Really appreciate you Matt. Thank you ...

  • @romayojr
    @romayojr several months ago +4

    you may be a bad dad, but you're a great teacher!

    • @colinmaharaj
      @colinmaharaj 25 days ago

      With a shirt like that he has got to be the best dad on the block

  • @yuda2207
    @yuda2207 several months ago

    Thank you so much! This has helped me a lot! Please keep going. I also enjoy videos that aren’t just about Ollama (but of course, I like the ones that are about Ollama too!). Thank you!

  • @TheInternalNet
    @TheInternalNet 29 days ago

    Thank you Matt for this amazing explanation. I had a brief understanding, or so I thought, but you really helped me fully grasp how this works. Also, your videos are an emergency

  • @skyak4493
    @skyak4493 several months ago +1

    This is just the info I was looking for. My goal is to get useful coarse AI running on the GPU of my gaming laptop, leaving the APU free to give me its full attention.

  • @vincentnestler1805
    @vincentnestler1805 several months ago +1

    Thanks!

  • @MikeCreuzer
    @MikeCreuzer 29 days ago

    I just tossed money at a new laptop with a 4070 JUST for Ollama, and with this video I was also able to throw smarts at it too to get it to do more with the 8GB of VRAM on laptop 4070s. Thanks so much!
    I'd been spending a lot of time building models with various context widths and benchmarking the VRAM consumption. Deleted a bunch of them because they ended up getting me a CPU/GPU split. Time to create them again because they will now fit in VRAM!
    Thanks again!

    • @Leto2ndAtreides
      @Leto2ndAtreides 6 days ago

      Unfortunately, Macs are better for local LLMs.
      One for $5K with 128GB shared RAM... Can potentially run a 70B model at 8 bit quantization.
      The memory in Nvidia GPUs is just too low.

    • @TheJunky228
      @TheJunky228 a day ago

      my 1070ti also has 8GB vram and I can fit mistral 7B Q6_K with a gig to spare. so far it's the best I've found that fits 100% on my gpu. what have you been having good luck with?
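
      For anyone trying the same thing MikeCreuzer describes above (building variants with different context widths), a minimal sketch assuming the standard Ollama Modelfile syntax; the base model tag, variant name, and context size here are illustrative only:

          # Modelfile - base model plus a fixed context window
          FROM llama3.1:8b
          PARAMETER num_ctx 8192

          # build the variant, then run it
          ollama create llama3.1-8k -f Modelfile
          ollama run llama3.1-8k

      A larger num_ctx grows the KV cache, which is what pushes a model that barely fits into a CPU/GPU split.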

  • @tecnopadre
    @tecnopadre several months ago +1

    It would be nice to have a video on downloading a model and modifying it, for example for a Mac mini with 16GB or 24GB, as a real case. Awesome as usual. Thank you

    • @technovangelist
      @technovangelist several months ago

      I am using my personal machine, an M1 Max with 64GB. Pretty real case

    • @themax2go
      @themax2go several months ago

      I think what the viewer meant was specifically those memory availabilities. that said, it's also not that realistic because everyone has different available memory depending on what else they have running (vscode, cline, docker / podman + various containers, browser windows, n8n / langflow / ..., ...) - it all depends on one's specific setup and use case. people keep forgetting that it's all apples and oranges

    • @technovangelist
      @technovangelist several months ago

      Some have 8 or 16 or 24 or 32 GB. But the actual memory isn't all that important. Knowing what model fits in the space available is the important part.

  • @mohamedmaf
    @mohamedmaf 20 days ago

    Very interesting video, thank you

  • @X85283
    @X85283 6 days ago +1

    LOL opening with a Mac.... That's like a cheat code for running AI locally. I can run Llama3.3 70B (Q4_K_M) (43GB) on my two-generation-old MacBook Pro (M2 Max) and it works pretty darn well, more T/s than I can read at least. Doesn't really change the point of your video, but just saying: if someone thinks quantization is going to get them to run a 70B model on most other laptops, they are going to have a bad day.

  • @CptKosmo
    @CptKosmo several months ago

    Nice, way to end with a smile :)

  • @Noctalin
    @Noctalin several months ago

    Thank you for your awesome AI videos!
    Is there an easy way to always set the environment variables by default when starting ollama, as I sometimes forget to set them after a restart?
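
    A minimal sketch for the question above, assuming a stock Ollama install; these are the mechanisms the Ollama FAQ describes for persistent environment variables, and exact steps may differ by version:

        # macOS (menu-bar app): set the variable for launchd, then restart the Ollama app
        launchctl setenv OLLAMA_FLASH_ATTENTION 1

        # Linux (systemd service): add it to the service unit, then restart
        sudo systemctl edit ollama.service
        #   [Service]
        #   Environment="OLLAMA_FLASH_ATTENTION=1"
        sudo systemctl restart ollama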

  • @dylanelens
    @dylanelens several months ago +1

    Matt, you blew my mind

    • @dylanelens
      @dylanelens several months ago

      Flash attention is precisely what I needed.

  • @TomanswerAi
    @TomanswerAi several months ago

    This is a good one. Nice topic.

  • @BlenderInGame
    @BlenderInGame several months ago

    Wow! This made my favorite model much faster! 🤯 I couldn't run `OLLAMA_FLASH_ATTENTION=true ollama serve` for some reason, so I set the environment variable instead. Now, if only Open WebUI used those settings...
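
    For reference, a sketch of the one-line variant mentioned above. It only takes effect if you start the server yourself from that shell (an already-running Ollama app or service wins), and OLLAMA_KV_CACHE_TYPE is assumed to be available in your Ollama version:

        # stop any running Ollama instance first, then start the server with
        # flash attention on and an 8-bit quantized KV cache
        OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve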

  • @ArtificialIntelligenceSP
    @ArtificialIntelligenceSP several months ago

    Thank you, and what is the name of the tool on macOS that you are using to see those memory graphs?

  • @cloudsystem3740
    @cloudsystem3740 several months ago

    thank you very much 👍👍😎😎

  • @greatermoose
    @greatermoose several months ago

    Hi Matt, what about the quality of responses with flash attention enabled?

  • @styxlegendgaming
    @styxlegendgaming several months ago +1

    Nice information

  • @brentknight9318
    @brentknight9318 several months ago

    Super helpful: S, M, L … I didn’t realize that was the scheme, duh.

  • @adarshaddagatla8782
    @adarshaddagatla8782 14 days ago

    I'm trying to use ollama in production.
    Could you please explain how to handle multiple requests?
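
    A minimal sketch for the production question above, assuming a recent Ollama server; the variables are documented in the Ollama FAQ, and the values here are illustrative:

        # let each loaded model serve several requests concurrently,
        # and keep up to two models resident at once
        OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve

    Each parallel slot allocates its own share of context, so higher parallelism increases the KV-cache memory a model needs.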

  • @vincentnestler1805
    @vincentnestler1805 several months ago

    Thanks, this was very helpful!
    Question - I have a Mac Studio M2 Ultra with 192GB unified RAM. What do you think is the largest model I could run on it? Llama 3.1 has a 405b model that at q4 is 243GB. Do you think I could run it with flash attention and KV/context quantization?

    • @technovangelist
      @technovangelist several months ago

      I doubt it, but it's easy to find out. But I can't think of a good reason to want to.
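
      A quick sanity check using the numbers already quoted above: the q4 weights alone are about 243GB, which already exceeds the 192GB of unified memory, and flash attention plus KV-cache quantization only shrink the context cache, not the weights. So the 405b model cannot fit regardless of those settings; a lower-bit quant or a smaller model is the realistic option.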

  • @leondbleondb
    @leondbleondb several months ago

    Good info.

  • @AhmedAshraf-pw1bn
    @AhmedAshraf-pw1bn several months ago

    what about the IQ quantization such as IQ3M?

  • @eric81766
    @eric81766 several months ago

    Yes, but where can I buy that rubber duck shirt? That is the ultimate programming shirt.

    • @technovangelist
      @technovangelist several months ago +1

      Ahhh, purveyor of all things good and bad: Amazon

    • @eric81766
      @eric81766 several months ago

      @@technovangelist That moment of realization that amazon has *pages* of results with "men rubber duck button down shirt".

  • @TheYuriTS
    @TheYuriTS several months ago +1

    I don't understand how to activate flash attention

  • @60pluscrazy
    @60pluscrazy several months ago

    🎉🎉🎉

  • @QorQar
    @QorQar 12 days ago

    How do I run the flash attention commands on Windows? Are there alternatives to flash attention for Windows? Can the commands be run in WSL?

    • @technovangelist
      @technovangelist 11 days ago

      It's more about whether it's supported by your hardware.
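
      For the Windows question above, a minimal sketch assuming the standard Windows build of Ollama, which picks up user environment variables after a restart (per the Ollama FAQ); exact steps may vary by version:

          # PowerShell, current session only (quit the Ollama tray app first):
          $env:OLLAMA_FLASH_ATTENTION = "1"
          ollama serve

          # or set it persistently for your user account, then restart Ollama:
          setx OLLAMA_FLASH_ATTENTION 1

      The same variable works for the Linux build inside WSL, as long as it is set in the environment where the server actually runs.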

  •  several months ago

    Thanks

  • @wawaldekidsfun4850
    @wawaldekidsfun4850 several months ago +1

    While this video brilliantly explains quantization and offers valuable technical insights for running LLMs locally, it's worth considering whether the trade-offs are truly worth it for most users. Running a heavily quantized model on consumer hardware, while impressive, means accepting significant compromises in model quality, processing power, and reliability compared to data center-hosted solutions like Claude or GPT. The video's techniques are fascinating from an educational standpoint and useful for specific privacy-focused use cases, but for everyday users seeking consistent, high-quality AI interactions, cloud-based solutions might still be the more practical choice - offering access to full-scale models without the complexity of hardware management or the uncertainty of quantization's impact on output quality.

    • @technovangelist
      @technovangelist several months ago +3

      Considering that you can get results very comparable to hosted models even when using q4 and q3, I'd say it certainly is worth it.

    • @themax2go
      @themax2go several months ago

      GPT is a tech and not a (cloud) product

    • @technovangelist
      @technovangelist several months ago +1

      In this context it is absolutely a cloud product

  • @sergey6661313
    @sergey6661313 7 days ago

    how about writing these instructions right in the description? how about putting these instructions on the main page of ollama?

    • @technovangelist
      @technovangelist 7 days ago +1

      I had tried to get it added, to make all the descriptions consistent. It was a conscious decision to make them inconsistent.

  • @imcrazyo
    @imcrazyo a day ago

    this guy is dope

  • @Leto2ndAtreides
    @Leto2ndAtreides 6 days ago

    Which model are you using exactly on your laptop?
    70B even at Q2 should be 18GB or so.
    On my laptop with a 3080 with 8GB GPU RAM, I'd still need to offload to CPU to get that to work.
    Unless... You have one of those maxed out Macs with 128GB shared RAM or something...

    • @technovangelist
      @technovangelist 6 days ago

      I tend to use various 7-30ish B models. Going to 70B rarely has enough benefit. I have an M1 Max MBP with 64GB RAM
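
      As a rough rule of thumb behind these sizing questions (approximate, assuming GGUF-style quantization): file size in GB ≈ parameters in billions × bits per weight ÷ 8. A 70B model at about 4.8 bits per weight (Q4_K_M) lands around 42-43GB, which fits in 64GB of unified memory but not in an 8GB GPU, while a 7-8B model at the same quant is only 4-5GB; the KV cache for the context comes on top of that.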

  • @JNET_Reloaded
    @JNET_Reloaded several months ago +4

    combine this with a bigger swap file and you're laughing! you don't need a gpu, a swap file is your friend!

    • @OneIdeaTooMany
      @OneIdeaTooMany 3 days ago

      @@JNET_Reloaded I make sure to use 5400 rpm spinning disks as well. Seagates too... Best not to scrimp on the important stuff.

    • @TheJunky228
      @TheJunky228 a day ago

      @@OneIdeaTooMany I'd rather break out my 4200rpm laptop ide drives haha

  • @Talaria.School
    @Talaria.School several months ago

  • @TheJonathanLugo
    @TheJonathanLugo a day ago

    She is right! 😂

  • @zerosleep1975
    @zerosleep1975 several months ago +1

    I'm reporting you to the emergency tape misappropriation department.

  • @dave24-73
    @dave24-73 several months ago

    What am I going to do with my 300 GB dual Xeon server now that I can do it on a laptop? LOL

    • @TheJunky228
      @TheJunky228 a day ago

      I can take that off your hands for ya 😉

  • @TomeLokas
    @TomeLokas 16 days ago

    child is awesome :D you are wasting tape :D

  • @pabloescobar2738
    @pabloescobar2738 several months ago

    The audio 😢, no problem, I understand English 😅, the dev life 😂, thanks

  • @RoniMac-b5t
    @RoniMac-b5t 8 days ago

    Matt, you always, always, always never show the most crucial code at the most crucial time; you're more interested in the prompt coming across. It's crazy. "Where's the K_M command?" ... You were nearly the most helpful, but you always expect people to know what you're talking about, without the code or even a four-word "ollama such and such" command. This is crazy.

    • @technovangelist
      @technovangelist 8 days ago +2

      umm, can't improve if you don't tell me what's missing...

  • @reserseAI
    @reserseAI several months ago

    I hate it when viewers say "nice explanation"; I have absolutely no idea about this

    • @bobdole930
      @bobdole930 several months ago

      There's a lot to learn; the same channel has a playlist to learn Ollama. Ollama is the open source platform your AI models run on. If his style of explanation isn't clicking, try someone else that does something similar.

  • @andrei-xe7nu
    @andrei-xe7nu several months ago

    Thank you, Matt.
    You might be interested that large models are cheapest to run on the Orange Pi 5 Plus, where RAM is used as VRAM. You get up to 32GB of VRAM for $220, with great performance (6 TOPS) and power consumption of 2.5A x 5V. Ollama is in the Arch packages and available for arm64.
    price/performance!