Local LLM Challenge | Speed vs Efficiency

Comments • 332

  • @chinesesparrows
    @chinesesparrows 1 month ago +116

    Man, this question was literally on my mind. Not everyone can afford an H100.

    • @adamrak7560
      @adamrak7560 13 days ago +2

      The H100 isn't really an inference chip, it is for training (though it's widely used for inference too, which is super wasteful).
      There is a massive hole in the market for dedicated inference hardware.

  • @AmanPatelPlus
    @AmanPatelPlus 1 month ago +88

    If you run the ollama command with the `--verbose` flag, it will give you the tokens/sec for each prompt, so you don't have to time each machine separately.
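
    For example, a rough sketch of a one-shot run (model name is just an example, and the exact stat labels can vary a bit between ollama versions) - the timing stats are printed after the response:

      ollama run llama3.1:8b "Write a 1000 word essay about volcanoes" --verbose
      # ...response...
      # prompt eval rate:     <n> tokens/s
      # eval rate:            <n> tokens/s   <- generation speed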

  • @stokeseta7107
    @stokeseta7107 1 month ago +46

    Energy efficiency and heat are the reasons I go for a Mac at the moment; I upgraded a few years ago from a 1080 Ti to a 3080 Ti. Basically sold my PC a month later and bought a PS5 and a MacBook.
    So I really appreciate the inclusion of efficiency in your testing.

    • @fallinginthed33p
      @fallinginthed33p 1 month ago +7

      Electricity prices are really expensive in some parts of the world. I'd be happy running a MacBook, Mac Mini or some small NUC for a private LLM.

    • @vadym8713
      @vadym8713 29 days ago +1

      @@fallinginthed33p AND it's good for the environment!

    • @Solarmopp-i7r
      @Solarmopp-i7r 22 days ago

      @@vadym8713 sooolaaaar

  • @alx8439
    @alx8439 29 days ago +31

    For your 4090 machine, give it a try with ExLlamaV2 or TensorRT as the inference engine. They will give you a +50% performance boost compared to ollama / GGUF models. Also, this initial 2-second performance hit is an ollama-specific thing with default settings - it happens because, by default, ollama unloads the model from memory after an idle timeout. You can turn this behavior off.
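
    A minimal sketch of one way to do that (assuming a reasonably recent ollama build; the values below are just examples):

      # keep models resident indefinitely for the whole server
      OLLAMA_KEEP_ALIVE=-1 ollama serve

      # or per request via the API (keep_alive accepts durations like "30m", or -1 for "never unload")
      curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "keep_alive": -1}'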

  • @TheHardcard
    @TheHardcard 1 month ago +26

    You can make LLMs nearly deterministic, if not entirely, with the “Temperature” setting. I haven’t yet had time to experiment with this myself, but I’ve been following this with great interest.
    I’d be joyous to see a similar comparison with the (hopefully just days away) M4 Pro and M4 Max alongside your other unique, insightful, and well-designed testing.

    • @TobyDeshane
      @TobyDeshane 1 month ago +12

      Not to mention setting the seed value. I wonder if there would be any per-machine difference among them with the same seed/temp, or if they would all generate the same output.
      But without them pinned to the same output, a 'speed test' that isn't about tokens/sec is pointless.
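
      A rough sketch of how to pin things down in an interactive `ollama run` session (parameter names as ollama exposes them; even with the same seed and temperature, different GPUs/backends can still diverge slightly due to floating-point differences):

        /set parameter temperature 0
        /set parameter seed 42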

    • @toadlguy
      @toadlguy 29 days ago +1

      @@TheHardcard “Hallucinations” (i.e. incorrect answers) and temperature are two different things. Although increasing the temperature (variability in the probability of the next token) is likely to produce more incorrect responses, it also can create more creative responses (that may be ‘better’) in many situations. Incorrect responses are simply the result of both the training data and the architecture of a next word prediction model.

  • @HaraldEngels
    @HaraldEngels 7 days ago +1

    I highly appreciate your focus on running LLMs locally, especially on affordable mini PCs. Very helpful. This kind of edge computing will grow massively soon, so please continue to create/run such tests.

  • @ferdinand.keller
    @ferdinand.keller 1 month ago +13

    7:42 If you set the temperature of the model to zero, the output should be deterministic, and give the same results across all machines.

  • @djayjp
    @djayjp 1 month ago +42

    Would be interested to test iGPU vs (new) NPU perf.

    • @chinesesparrows
      @chinesesparrows 1 month ago +7

      Yeah, TOPS per watt

    • @andikunar7183
      @andikunar7183 1 month ago +8

      Currently NPUs can't run the AI models in these tests well. llama.cpp (the inference code behind ollama) does not run on the NPUs (yet). It's all marketing only. Qualcomm/QNN, Intel, AMD and Apple all have different NPU architectures and frameworks, which makes it very hard for llama.cpp to support them. Apple does not even support their own NPU (called the ANE) with their own MLX framework.
      llama.cpp does not even support the Qualcomm Snapdragon's GPU (Adreno), but ARM did some very clever CPU tricks, so that the Snapdragon X Elite's CPU is approx. as fast as an M2 10-GPU for llama.cpp via their Q4_0_4_8 (re-)quantization. You can also use this speed-up with ollama (but you need specially re-worked models).

    • @GetJesse
      @GetJesse 1 month ago +1

      @@andikunar7183 good info, thanks

    • @flyingcheesecake3725
      @flyingcheesecake3725 1 month ago

      @@andikunar7183 I heard Apple MLX does use the NPU, but we don't have control to manually target it. Correct me if I am wrong.

    • @andikunar7183
      @andikunar7183 1 month ago

      @@flyingcheesecake3725 You mean Apple's CoreML and not MLX. CoreML supports the GPU and NPU (Apple calls it the ANE), but with limited control. Apple's great open-source machine-learning framework MLX does NOT support the ANE (at least for now), only the CPU+GPU.

  • @robertotomas
    @robertotomas 1 month ago +7

    If you want to make your tests deterministic with ollama, you can use the /set command to set the top_k, top_p and temperature parameters to 0, and set the seed parameter to the same value on every machine. Also, if instead of running a chat you put the prompt in the run command and use --verbose, you'll get the tokens/sec.
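
    A rough sketch of how that could look for non-interactive runs (hypothetical Modelfile and model names; parameter names follow ollama's Modelfile syntax):

      # Modelfile
      FROM llama3.1:8b
      PARAMETER temperature 0
      PARAMETER seed 42
      PARAMETER top_k 1

      # build the pinned-down variant, then do a one-shot timed run
      ollama create llama3.1-det -f Modelfile
      ollama run llama3.1-det "Tell me a story about a robot" --verbose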

  • @AndreAmorim-AA
    @AndreAmorim-AA 17 days ago +2

    It is not only the hardware, but the CUDA (Compute Unified Device Architecture) framework that allows developers to harness massive parallel processing. The question is whether Apple will develop MLX into a framework more suitable for AI development.

  • @danielkemmet2594
    @danielkemmet2594 1 month ago +8

    Yeah definitely please keep making this style of video!

  • @showbizjosh40
    @showbizjosh40 1 month ago +1

    The timing of this video could not have been more perfect. Literally working on figuring out how to get an LLM running locally because I want the freedom of choosing the LLM I want and for privacy reasons. I only have an RTX 4060 Ti w/ 16GB but it should be more than sufficient for my purposes.
    Love this style of format! It makes sense to consider electric cost when running these sorts of setups. Awesome quality as always!

    • @oppy124
      @oppy124 1 month ago

      LM Studio should meet your needs if you're looking for a one-click install system.

  • @technovangelist
    @technovangelist 1 month ago +3

    This was fantastic, thanks. I have done a video or two about Ollama, but haven't been able to do something like this because I haven't bought any of the nvidia cards. It is crazy to think that we have finally hit a use case where the cheapest way to go is with a brand new Mac.

  • @zalllon
    @zalllon 14 days ago +1

    Really interesting to see the Mac mini and how well it performs. As someone who owns a Mac mini M1, its simple design, relatively small footprint, exceptionally quiet operation, and low power consumption are all pluses. Given my experience with the original Apple Silicon, I would definitely go with the M2, especially since, for good chunks of the day, this system would be idle.

  • @timsubscriptions3806
    @timsubscriptions3806 1 month ago +6

    I’m with you! Apple wins by a mile. I have compared my 64GB Mac Studio M2 Ultra to my Windows workstation that has dual Nvidia A4500s using NVLink (20GB for each card), and at half the price the Mac Studio easily competes with the Nvidia cards. I can’t wait to get an M4 Ultra Mac Studio with 192GB RAM - maybe more RAM 😂. I use AI for LOCAL RAG and research, and as a local code assistant for VS Code.

    • @timsubscriptions3806
      @timsubscriptions3806 1 month ago

      BTW, the Mac Studio was half the price of the Windows workstation

  • @nomiscph
    @nomiscph 15 days ago +2

    I love the test; I would love to see the speed of a beefed-up M4 laptop, and then maybe using 1.58-bit inference.

  • @deucebigs9860
    @deucebigs9860 1 month ago +5

    I got the K9 and 96GB after your first video about it, and am happy with the output speed and power consumption for the price. I did get a silent fan for it though, because it is loud. I only really do coding work, so I think the Mac would be great, but the RTX is just overkill. I'm very curious about the next gen of Intel processors, so please do that test. I honestly don't care about the RTX, because if I needed something that good I could put that money towards Claude or chat gippity online, not have to deal with the heat, power consumption and maintenance, and run a huge model. I personally don't see the benefits of running a 7B parameter model at lightning speed. I'd rather run a bigger model slower at home. Great video! So much to think about.

  • @DevsonButani
    @DevsonButani 1 month ago +7

    I'd love to see the same comparison with the Mac Studio with the Ultra chip, since it has a similar cost to a high-spec PC with an RTX 4090. This would be quite useful to know for workstation situations.

    • @t33mtech59
      @t33mtech59 10 days ago

      I was thinking the same thing. And more model size headroom due to unified RAM up to 128GB, whereas the 4090 is locked at 24GB.

  • @fernandonati
    @fernandonati 26 days ago +1

    This video is exactly the answer for me. Thank you!

  • @Himhalpert8
    @Himhalpert8 1 month ago +3

    I'll be totally honest... I don't have the slightest clue what's happening in this video but the little bit that I could understand seems really cool lol.

    • @leatherwiz
      @leatherwiz 2 days ago

      Best comment 😂

  • @samsquamsh78
    @samsquamsh78 1 month ago +2

    Thanks for a great and interesting video! I have a Linux machine with an Nvidia card and a MacBook Pro M3, and personally I really care about the noise and heat from the Linux/Nvidia machine - after a while it gets bothersome. The Mac makes zero noise and no heat and is always super responsive. In my opinion, it's incredible what Apple has built. Thanks again for your fun and cool technical videos!

  • @RichWithTech
    @RichWithTech 1 month ago +6

    I'm not sure how I feel about you reading all our minds but I'm glad you made this

  • @ChrisGVE
    @ChrisGVE 1 month ago +4

    Very cool content thanks Alex, I've always wanted to try one of these models locally but I couldn't find the time, at least I can follow your steps and do it a bit faster :) See you in your next video!

  • @yoberto88
    @yoberto88 11 days ago

    Thanks for making this video! I needed to watch something like this!

  • @christhomas7774
    @christhomas7774 29 days ago +1

    Great video! Thanks for doing not only the comparison, but also the analysis in the second half of the video.
    I can’t wait for you to test the new Intel Core CPU that just came out. Since you use the iGPU on the NUC, please mention the speed of the DRAM in your next comparison video, as it can have an impact on results (just like DRAM can have an impact on the performance of a video game). I hope a desktop machine with a 4090 makes it into your next comparison video. Even a machine with a mini-ITX motherboard and an x16 PCIe slot would be much better.

  • @haha666413
    @haha666413 1 month ago +2

    Love the comparison. Can't wait for the next one; maybe try the 4090 in a PC so it can actually stretch its legs and copy the data to VRAM much quicker.

  • @seeibe
    @seeibe 1 month ago +1

    Great test! This is exactly what I'm interested in. The idle power test especially is important, as you usually won't be running inference 24/7. Amazing to see that the "mini PC" with the eGPU takes more power at idle than my full-blown desktop PC with a 4090 😅

  • @alexbaumgertner
    @alexbaumgertner 29 days ago +3

    Thanks! My small contribution to the electricity cost :)

    • @AZisk
      @AZisk 29 days ago

      Appreciate you!

  • @rayongracer
    @rayongracer 6 days ago

    Living in a cold climate, the 4090 is perfect; in a machine with fast PCIe it is super fast when coding. It sits on my home server and it really rocks. I tested llama 3.1 8b and codegemma on a 4060 Ti 16 gig card, which worked really nicely - not blisteringly fast like the 4090, but fast enough. A 4060 Ti 16 gig is not expensive, so if cost is important that is a great option. It is also way less power hungry than a 4090. The test rig I made was running as a server with several services, and I let others play with the LLM as well; nobody ever complained about speed. But the issue was that if more than one person was using the LLM at the same time, or when you had long sessions with it, it started to slow down. That is when I upgraded the server to a 4090 card and all those issues went away.
    It would be cool if you tested multiple users as well. From what I understand, many people who run a local LLM share it with family and even friends.

  • @leatherwiz
    @leatherwiz 2 days ago

    Thanks for this content and comparison. I think as this becomes more mainstream, there will be less nerdy setups and software to use this on a PC, Mac or tablet in the near future. 👍🤖

  • @rotors_taker_0h
    @rotors_taker_0h 29 days ago +2

    You don't have to load the model into VRAM every time. It is actually crazy to compare it like that. You load it once and then send requests; then the RTX's latency will be the lowest.

  • @nasko235679
    @nasko235679 1 month ago +2

    I was actually interested in building a dedicated LLM server, but after a lot of looking around for language models, I realized most open-source "coding"-focused LLMs are either extremely outdated or plainly not good. Llama is working with data from 2021, DeepSeek Coder 2022, etc. Unfortunately the best models for coding purposes are still Claude and ChatGPT, and those are closed.

  • @PseudoProphet
    @PseudoProphet 13 days ago +2

    I don't think speed is all that important when you're running it locally. The most important thing is the memory and the bandwidth between the memory and the processor.

  • @EladBarness
    @EladBarness 1 month ago +1

    Great video! Thanks
    In the verdict you forgot to mention
    that the only way to run bigger LLMs is with Macs, if you have enough unified memory.
    Yes, it's going to be slow, but still possible, and faster than on the CPU.
    You mentioned it in the beginning though :)

  • @camerascanfly
    @camerascanfly 1 month ago +4

    0.53 EUR/kWh in Germany? Where did you get these numbers from? About 0.28 EUR/kWh would be an average price. In fact prices dropped considerably during the last year or so.

  • @hajjdawood
    @hajjdawood 21 days ago +2

    People are always saying Apple is overpriced, but if you actually use their machines as a work tool, they are by far the best bang for the buck.

    • @SlavaBagmut
      @SlavaBagmut 8 days ago

      Storage pricing is not the best bang for the buck, for sure.

  • @davidtindell950
    @davidtindell950 1 month ago +3

    thanks yet again !

  • @AlexanderRay92
    @AlexanderRay92 24 days ago +1

    You can actually measure the heat VERY EASILY, because every watt consumed is converted to heat with EFFECTIVELY 100% efficiency! It's just resistive heating with a few extra steps, and all those extra steps only produce heat as a loss anyway. Technically a very small amount is lost to things like vibrations and maybe UV radiation leaving through an open window or whatever, but you can basically ignore that for all practical purposes.
    1 W of sustained power consumption is around 3.4 BTU/h, and 1000 J is around 1 BTU. So in your case, a single run consumes (rounding for ease) let's say 5 BTU for the Intel, 2.5 for the M2 and 4 for the RTX.
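
    As a rough worked example (using the approximate wattage and duration quoted elsewhere in this thread for the RTX run, so treat the inputs as estimates):

      energy ≈ 320 W × 12 s = 3840 J
      3840 J ÷ 1055 J/BTU ≈ 3.6 BTU   (≈ the "4 for the RTX" above)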

  • @comrade_rahul_1
    @comrade_rahul_1 1 month ago +2

    This is the kind of tech content I want to watch. Amazing, man! ❤

    • @AZisk
      @AZisk 1 month ago +1

      Glad you enjoyed it.

    • @comrade_rahul_1
      @comrade_rahul_1 1 month ago +1

      @@AZisk Great work sir. No one else does these things except you. 👌🏻

  • @ALYTV13
    @ALYTV13 29 days ago

    Just what I needed. I have been toying with the idea of a multi Mac mini setup for a while now, especially with the use of exo labs.

  • @TheRealTommyR
    @TheRealTommyR 26 days ago

    I like this style of video and 100% want to see more, especially with continuously updated tests as hardware is updated, like an M4 device, or if you get a Mac Studio or Ultra, Intel, etc.

  • @abysmal5422
    @abysmal5422 28 days ago

    Interesting analysis! Two things:
    1) I believe heat is pretty much just the energy consumed. There's no chemistry involved, just physics, so aside from negligible light and sound and a small amount of work from spinning fans, a computer using 100 watts is essentially a 100W heater.
    2) The Intel mini PC loses in speed and efficiency, but its real strength is the amount of RAM. You could run a 70B model with minimal quantization and get better output than either of the other machines is capable of (though quantifying that and putting it on a chart would be difficult).

  • @danieljueleiby914
    @danieljueleiby914 29 days ago

    Hey Alex. Thank you for this kind of comparison. Often I only see speed matrices; this one, with average energy consumption, speed, initial cost and heat generated, was exactly what I wanted to see. A full array with multiple tests. I will definitely watch more if you make more power/speed/cost/Wcost/... comparisons! Thank you.

  • @tomat4135
    @tomat4135 29 days ago +1

    Would be really interesting to see a more comprehensive benchmark in price tiers. Like:
    @500-750 USD (NUC, maybe an M1 MacBook Air 8GB (often on sale) or Mac Mini M2 8GB)
    @1000-1500 USD (Mac Mini M2 Max, ARM box, ???)
    @2000-3000 USD (NUC+RTX4090, Mac Mini M2 Max, base Mac Studio)
    @>3000 USD (built PC with dual 3090 24GB, H100, Mac Studio Ultra)

  • @lordkacke6220
    @lordkacke6220 29 days ago

    I really enjoyed this kind of video. More stats and statistics would also be nice, but I also know that it's time-consuming.

  • @siddu494
    @siddu494 1 month ago +2

    Great comparison, Alex. It just shows that if a model is tuned to run on specific hardware, it will outperform in terms of efficiency. However, I saw an article today that says Microsoft open-sourced bitnet.cpp, a blazing-fast 1-bit LLM inference framework that runs directly on CPUs. It says that you can now run 100B parameter models on local devices.
    Will be waiting for your video on how this changes everything.

    • @andikunar7183
      @andikunar7183 1 month ago

      LLM inference is largely determined by RAM bandwidth. The newest SoCs (Qualcomm/AMD/Intel/Apple) almost all have 100-130 GB/s, while the M2/M3 Max has 400 GB/s, the Ultra 800 GB/s, and the 4090 has >1 TB/s. And all the new CPUs have very fast matrix instructions, rivalling the GPUs for performance. 1.58-bit quantization might be some future thing (but not in any usable model); currently 4-bit is the sweet spot. Snapdragon X Elite CPU Q4_0_4_8 quantized inference is already similar in speed to M2 10-GPU Q4_0 inference with the same accuracy.

  • @johannesmariomeissner7262
    @johannesmariomeissner7262 1 month ago +1

    The test is also flawed by the fact that an LLM can't really "count" its own tokens, so when you ask for a 1000-word essay, it's just going to do a best effort, and the results might be wildly different in length. Setting a token cutoff and having all machines reach the cutoff would be much more accurate.
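
    A rough sketch of one way to do that with ollama (num_predict is the generation cap ollama exposes; the value here is just an example):

      /set parameter num_predict 512   # stop every machine at the same number of generated tokens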

  • @Solstice42
    @Solstice42 1 month ago

    Great assessment- so glad you're including power usage. Power and CO2 cost of AI is critical for people to consider going forward. (for the Earth and our grandchildren)

  • @kdl0
    @kdl0 14 days ago +1

    Well, yes, of course we want to know how the Core Ultra 200H series will perform. Have you heard any news about when Asus is releasing the NUC 14 AI? I haven't seen any public information yet.

  • @mshark2205
    @mshark2205 1 month ago

    This is the top-quality video I was looking for. Surprised that Apple Silicon runs LLMs quite well…

  • @Ruby_Witch
    @Ruby_Witch 1 month ago +3

    How about on the Snapdragon CPUs? Do they have Ollama running natively on those yet? I'm guessing not, but it would be interesting to see how they match up against the Mac hardware.

    • @Barandur
      @Barandur 1 month ago

      Asking the real question!

    • @sambojinbojin-sam6550
      @sambojinbojin-sam6550 13 days ago

      There are plenty of frontends on Android (Layla, Pocket Pal, chatterAI, MLCChat, etc), or you can run ollama through Termux if you want. Snapdragons do pretty well on phones, considering they're also phones.
      About 2.5-3 tokens/sec on a Snapdragon 695 with the Llama 3.2 8B parameter model (~$200 USD phone, a la the Motorola g84. Slow memory, slow CPU, 12GB RAM. About 5 watts of usage).
      About 12-22+ tokens/sec on a Snapdragon 8 Gen 3 (~$600-700 USD phone, a la the OnePlus Ace 3 Pro. Good memory, good CPU, 24GB RAM, so it can do bigger models. About 17.5 watts of usage).
      So they're not bad, but they're nowhere near a decent desktop CPU, GPU, or Mac integrated-RAM thingy. But they can run them at vaguely "usable" speeds (smaller models go quite a bit quicker, and are getting pretty good too. There are Llama 4B Magnum and Qwen 2.5 3B, which will more than double those speeds, especially using ARM-optimized versions. They're not "super smart/knowledgeable", but they're good enough for entertainment purposes).

  • @massimodileo7169
    @massimodileo7169 1 month ago +3

    --verbose
    Please use this option, it’s easier than Schwarzenegger 2.0

    • @AZisk
      @AZisk 1 month ago +5

      but way less fun

  • @mynameisZhenyaArt_
    @mynameisZhenyaArt_ 29 days ago

    Yep, we are waiting for 20A process node CPUs to be released! Please share the performance review as soon as you get your hands on it! Thank you Alex!

  • @atom6_
    @atom6_ 1 month ago +4

    A Mac with the MLX backend can go even faster.

  • @newman429
    @newman429 1 month ago +14

    I am pretty sure you've heard about Asahi Linux, so will you make a video on it?

    • @ALCE-h7b
      @ALCE-h7b 1 month ago +4

      It seems like he already did, 2 years ago. Did something change?

    • @newman429
      @newman429 1 month ago +2

      @@ALCE-h7b
      No, they are still doing what they were doing before, but I suppose they took a big leap now with the release of their 'Asahi game playing toolkit', which in short makes playing AAA games on a Mac very viable.
      Maybe if you want to learn more, why not read their blog.

    • @AZisk
      @AZisk 1 month ago +7

      Maybe I need to revisit it. Cheers

  • @AmosKito
    @AmosKito 12 days ago +1

    You assumed that the wattage was constant, which it likely isn't. Your outlet meter can measure the total energy usage.

  • @dave_kimura
    @dave_kimura 1 month ago +3

    I'm surprised that the OCuLink doesn't create more of a performance hit than it did. My 4090 plugged directly into the mobo gets about 145 tokens/sec on llama3.1:8b versus the ~130 tokens/sec that you got. Kind of makes sense, since the model is first loaded into memory on the GPU.

    • @andikunar7183
      @andikunar7183 1 month ago

      Token generation during inference is largely memory-bandwidth bound (OK, with some minimal impact from quantization calculations), and the LLM runs entirely on the 4090 with its >1 TB/s of memory bandwidth. The 4090 really shines during (batched) prompt processing, blowing the Intel/Apple machines away - probably >20x faster than the M2 Pro, and way, way more so than the Intel CPU/iGPU.

  • @pe6649
    @pe6649 1 month ago +1

    @Alex For me it would be even more interesting to see how much longer the Mac Mini takes on a fine-tuning session than a 4090 - inference only is a bit boring.

  • @theunknown21329
    @theunknown21329 29 days ago +1

    This is such cool content!

  • @parshwa_1
    @parshwa_1 24 days ago

    I like it, interesting to watch mate ✌

  • @luisrortega
    @luisrortega 1 month ago +2

    I wonder if the benchmark uses the latest Apple MLX library; when I switched to it (in LM Studio), it was an incredible difference. I can't wait until Ollama adds it!

    • @andikunar7183
      @andikunar7183 1 month ago

      Ollama uses the llama.cpp code base for inference, and llama.cpp doesn't use MLX.
      I admit that I have not benchmarked MLX vs. llama.cpp recently; I have to look into it. MLX, in my understanding, can use GGUF files, but only limited quantization variants. Q4_0 seems a good compromise, but llama.cpp can do better.

  • @gaiustacitus4242
    @gaiustacitus4242 1 month ago +3

    Let's be honest, once the 24GB RAM on the RTX 4090 is exhausted, the performance is dismal for LLMs that get pushed out to the 128GB RAM on your motherboard. That's why I'm looking forward to the new Mac Studio M4 Ultra with 256GB (or greater) integrated memory, 24+ CPU cores, 60+ GPU cores, and 32+ NPU cores.
    Many of the LLMs are developed on Mac hardware because it is presently the best option.

  • @Sumbuddysumwhere
    @Sumbuddysumwhere 16 days ago

    Speed and accuracy are what matter. Screw the extra power usage, because you aren't running an LLM 24/7; you are querying it for a combined 5 minutes per day at most, even if you are using it for hours during your workday. Your power usage only spikes during actual generations, which are seconds long on the 4090. The only real exception is running super big, slow models for long periods of time, or using AI agents, which currently give crap results.

  • @mk500
    @mk500 1 month ago +1

    I ended up with a used Studio M1 Ultra with 128GB RAM. I can run any model up to 96GB RAM, and often do. It could be faster, but the important thing is I have few limitations for large-ish models and context. What really competes with this?

  • @donjaime_ett
    @donjaime_ett 1 month ago

    For an AI server, once the model is loaded into VRAM, you probably want it to remain there for repeated inferences. So it depends on how you set things up.
    Also, if you want apples-to-apples determinism, reduce the temperature at inference time.

  • @dudaoutloud
    @dudaoutloud 29 days ago

    Excellent comparison! For me, an interesting future comparison would be between the RTX setup and an equally priced Mac Studio. And then re-compare after the new M4 Pro/Max is available (hopefully next week?).

  • @cobratom666
    @cobratom666 1 month ago +11

    I think that you calculated the costs for the RTX 4090 wrongly. You assume that the average wattage for the RTX is ~320W, but you should use your electricity usage monitor - there is an option to show total usage. Let me explain: for the first 1-2 seconds (loading the model into memory) the RTX consumes much less than 320W, let's say 90W, so the total consumption will be 2s x 90W + 10s x 320W = 3380 J or even less - not 3840 J. For the M2 Pro it doesn't matter that much, because 2s x 15W + 45s x 48W = 2190 J instead of 2256 J. So the result is off by roughly 12% for the RTX.

    • @laszlo6501
      @laszlo6501 1 month ago

      The 2 seconds would probably only be there the first time the model is loaded.

    • @GraveUypo
      @GraveUypo 1 month ago +3

      @@laszlo6501 Yup, the 2-second delay only happens the first time you load the model, and he put such emphasis on it.

  • @julesnopomar
    @julesnopomar 8 days ago

    What do you think about the new Mini Pro? Could be pretty good, no?

  • @DavidFlenaugh
    @DavidFlenaugh 29 days ago

    Thank you for doing this. I wonder about this stuff, and wish that there was a bit more content on GPUs and LLMs. Like, is it better for LLMs to get 2 AMD cards or 1 4090? Or even 2 A770s?

  • @jumbleblue
    @jumbleblue 5 days ago

    Now that would be interesting with the entry-level M4 Mac mini.

  • @arkangel7330
    @arkangel7330 1 month ago +4

    I'm curious how the Mac Studio with the M2 Ultra would compare with the RTX 4090.

    • @AZisk
      @AZisk 1 month ago +6

      Depending on the task, I think it would do pretty well. It won't be as fast on smaller models, but it will destroy the RTX on larger models.

    • @brulsmurf
      @brulsmurf 1 month ago +1

      @@AZisk It can run pretty large models, but the low speed makes it close to unusable for real-time interaction.

    • @mk500
      @mk500 1 month ago +1

      @@brulsmurf I use 123B models on my M1 Ultra 128GB. It can be as slow as 7 tokens per second, but I find that still usable for interactive chat. I’m more into quality than speed.

    • @MarcusHast
      @MarcusHast 29 days ago

      There is quite a lot of talk about this in r/localllama. For LLM work a 4090 is faster than an M2 Ultra. You can get more memory on an M2 Ultra, but you can also get multiple 4090s (or, perhaps even better, 3090s) and put them in a computer for typically less money (and better performance).
      If your requirements go over 90 GB (4 xx90 cards), then neither setup is really fast enough to be useful. (When you get to about 1 token per second, you're probably better off just renting a machine online or buying tokens from e.g. OpenAI.)

  • @yudtpb
    @yudtpb 1 month ago +2

    Would like to see how Lunar Lake performs

  • @wnicora
    @wnicora 1 month ago

    Really nice video, thx
    What about performing the test on the Qualcomm dev kit machine? It would be interesting to see how the Snapdragon performs.

  • @a-di-mo
    @a-di-mo 1 month ago

    Very interesting, just a question: can the Mac NPU improve the performance?

  • @daillengineer
    @daillengineer 20 days ago

    Can't wait to see how the new M4 variety fares in a few weeks!

  • @aaayakou
    @aaayakou 24 days ago

    I think the 4060 Ti 16GB should have been included in the comparison.
    It seems like the most valuable option.
    It combines a low price, good performance, low power consumption and a fairly compact size. I think for local LLMs it should be the best option from Nvidia for the average user.

  • @boltez6507
    @boltez6507 1 month ago +2

    Man, waiting for Strix Halo.

  • @pixelbat
    @pixelbat 24 days ago

    Have you looked into BitNet at all? It's Microsoft's new inference framework for 1-bit LLMs. People are running like 70 billion parameter models at home. Fun stuff.

  • @ALCE-h7b
    @ALCE-h7b 1 month ago +1

    Have you thought about running such tests on the new AMD 8700 in tiny PCs? I heard that they have a pretty good iGPU.

  • @vinz3301
    @vinz3301 1 month ago

    Can we talk about this colorful keyboard on the right? Gave me goosebumps!

  • @pascalmariany
    @pascalmariany 15 days ago

    Great! I myself test a lot. Could you make a comparison of cloud-based vs. local LLM energy consumption and the effect on our planet?

  • @Moyano__
    @Moyano__ 18 days ago

    I love this channel.

  • @mrmerm
    @mrmerm 1 month ago +1

    Would be great to see AMD in the benchmarks, both with the CPU and an external GPU.

  • @loicdupond7550
    @loicdupond7550 10 days ago

    Now getting the M4 mini and M4 mini Pros into this review is what I need 😂
    Waiting impatiently to see if a mini Pro with 64GB is worth it for LLMs!

  • @damienong1462
    @damienong1462 19 days ago +1

    I wonder how the new Mac Mini with the M4 chip will perform. If the M2 is already fast, the M4? 🤔

  • @robertoguerra5375
    @robertoguerra5375 15 days ago +1

    Cool video :)
    What about the results? Did you like those AI-generated stories?

  • @t33mtech59
    @t33mtech59 10 days ago

    I'm considering an M4 Max because of the 128GB unified RAM. I have a 4090, but it's locked to ~11B models due to the low VRAM. Thoughts?

  • @xsiviso4835
    @xsiviso4835 1 month ago +1

    0,53$ per kWh is really high for Germany. At the moment it is at 0,30$ per kWh.

  • @conradohernanvillagil2764
    @conradohernanvillagil2764 29 days ago

    Alex, excellent video. Thank you. Yes, it would be very interesting to know the performance on the Intel Core Ultra (Lunar Lake).

  • @kevinwestmor
    @kevinwestmor 1 month ago

    Alex, thank you, and please comment on the NUC's shared video memory management (in general - because the smaller test probably fit in either/both), especially when switching back and forth between a CPU test and a GPU test - would this be Windows-managed, or would you be changing parameters? Thank you again.

  • @ewm5487
    @ewm5487 29 days ago

    I'm curious to see the first Strix Halo APUs running this test with ROCm + Ollama. I'm dreaming of 96GB VRAM allocated and 128GB total for the system. What do you think?

  • @DanFrederiksen
    @DanFrederiksen 16 days ago

    BTW, for watt efficiency, you might be able to clock down the 4090 and get close to Apple Silicon.
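
    One way to try that is power-limiting rather than underclocking (a rough sketch; the allowed range depends on the card and driver, and setting the limit needs admin rights):

      nvidia-smi -q -d POWER      # check the min/max power limits the card accepts
      sudo nvidia-smi -pl 250     # cap board power at e.g. 250 W, then re-run the benchmark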

  • @ShinyTechThings
    @ShinyTechThings 1 month ago +2

    Agreed

  • @Youtubeuseritis
    @Youtubeuseritis 2 days ago

    An Intel Mac trashcan with a 12-core Xeon and D300 AMD graphics cards with 64GB RAM - is it any good for LLMs? Better on macOS? Or Windows 11? Windows 10?

  • @kiloabnehmen2592
    @kiloabnehmen2592 29 days ago

    Hello Alex, could you try Mixtral 8x22B on the 96GB RAM mini PC? I am really curious to see the speed and results with that setup.

  • @Merlinvn82
    @Merlinvn82 29 days ago

    How about a used RTX 3060 with 12GB VRAM or an RTX 4060 Ti with 16GB VRAM in place of the RTX 4090? Would it beat the Mac mini setup in terms of performance?

  • @Techonsapevole
    @Techonsapevole 1 month ago +1

    Cool, but where is the AMD Ryzen 7?

  • @aysbg
    @aysbg 1 month ago

    A Mac Studio with the M2 Ultra would be a good comparison against the 4090. I would love to see new M4s with something like 64GB of RAM so that we can run bigger models locally without paying obscene amounts of money for either hardware or electricity.

  • @waynelau3256
    @waynelau3256 16 days ago

    With the 3.2 models, it should be quite fun

  • @didierpuzenat7280
    @didierpuzenat7280 29 days ago

    5:38 If you want to compare energy consumption, watts are not a good measure; watt-hours are.