M3 Max 128GB for AI running Llama2 7b, 13b and 70b

  • Published Sep 30, 2024

Comments • 341

  • @tamsaiming2003
    @tamsaiming2003 10 months ago +562

    Yes, this is how YouTubers should present the 128GB model, not with video editing or benchmarks

    • @Sudip_Sarkar_Charles_Edwards
      @Sudip_Sarkar_Charles_Edwards 10 months ago +16

      I agree.

    • @jimss596840
      @jimss596840 10 months ago +10

      Yes. And not with gaming either

    • @m.goedeker7381
      @m.goedeker7381 10 months ago +1

      What was your setup on the Mac? Did you try gpt4all too?

    • @tamsaiming2003
      @tamsaiming2003 10 months ago +3

      @@m.goedeker7381 Receiving my MacBook Pro 16" M3 Max 128GB tomorrow. My usage is a bit different: SDXL (a lot of machine learning though)

    • @PKperformanceEU
      @PKperformanceEU 10 months ago

      This is the kind of video we want to see, not fucking imbecilic Cinebench runs!! By the way, Cinebench is the least fair benchmark for Apple silicon!!
      And yet most people, out of ignorance, believe Cinebench is the gold standard.
      The M3 Max is faster than an overclocked 14700K and rivals a 14900K, beating everything in more memory-bound algorithms (for example prime numbers or water-wave simulation)

  • @VikramMulukutla
    @VikramMulukutla 10 months ago +76

    Sold on the M3 Max. That 70B test. Damn.

    • @ammarahmad6079
      @ammarahmad6079 5 months ago +3

      @test12382 But then that's a desktop setup; the M3 Max is in a laptop, which is much more impressive

    • @ammarahmad6079
      @ammarahmad6079 5 months ago +3

      @test12382 Nothing can compete with the M2 Ultra, and we can't compare desktop vs laptop. But for a laptop the M3 Max is really impressive.

    • @TsunamicBug
      @TsunamicBug 3 months ago

      You could look into Tesla P40 servers; two P40s should be able to run Llama 3 70B at a reasonable quantization

  • @daves.software
    @daves.software 9 months ago +31

    You didn't pin them to the same random seed, so they're generating different text, which makes the elapsed times hard to compare because they're not generating the same number of tokens.
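
    For anyone who wants to redo the comparison with pinned outputs: ollama's HTTP API accepts a seed and a temperature in its options, so both machines can be asked for the same deterministic generation. A rough sketch (assuming a local ollama server and the default llama2:70b tag; option names may differ between ollama versions):

      # Request a deterministic run: temperature 0 and a fixed seed
      curl http://localhost:11434/api/generate -d '{
        "model": "llama2:70b",
        "prompt": "Write a 300-word essay about GPUs",
        "options": { "seed": 42, "temperature": 0 }
      }'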

    • @nowandrew4442
      @nowandrew4442 7 months ago +2

      The M1 & the M3 are producing ***almost*** identical text. While from a benchmarking perspective I guess consistency would be ideal, surely the nature of ML is that a process should do what it needs to do to get a workable result; you shouldn't make machines follow the exact same path if one can jump high fences but another can burrow under them. What's important is how fast they complete the task asked of them, not whether they did so in an identical manner.

  • @axotical8682
    @axotical8682 10 months ago +63

    I was hoping someone would post this kind of comparison. It seems the unified memory is a huge advantage for running larger LLMs. One thing I did not understand in the final 70B test: how much memory did the M3 use? Could you have gotten away with only 64GB instead of 128GB? Thank you for the effort you put into creating and sharing this test. Subscribed!

    • @rubencabrera8519
      @rubencabrera8519 10 months ago +8

      The 70B uses around 35GB of memory, so 64GB will be more than enough, at least for Llama 2 70B
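
      (That ~35GB lines up with a back-of-envelope estimate: q4_0 stores roughly 4.5 bits per weight, so 70e9 parameters × ~0.56 bytes ≈ 39GB of weights, which is also about the size of ollama's default llama2:70b download; resident memory during inference lands in the same ballpark once the context is added.)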

    • @Joe_Brig
      @Joe_Brig 10 months ago +9

      I just ran llama2:70b on an M3 64GB 14" and it works well. However, the fans did ramp up. No fans on smaller models like the 34B Code Llama though.

    • @Anderson-dy8ml
      @Anderson-dy8ml 10 months ago +4

      @@rubencabrera8519 do you think 64GB is enough to run the Falcon 180B model?

    • @battlehunterofficial4586
      @battlehunterofficial4586 10 months ago +7

      @@Anderson-dy8ml I doubt it, because Falcon 180B is quoted as requiring 400GB of VRAM to infer on, which means you'll need at least 5 A100s before it gets off the ground in its original precision

    • @axotical8682
      @axotical8682 10 months ago +3

      @@Joe_Brig I'm thinking an M2 Mac Studio with 64GB would be better suited for this in the long term; I've heard other people commenting that the fans get louder than normal on M3 MacBook Pros when running inference.

  • @matumatux
    @matumatux 10 months ago +46

    I've been searching for EXACTLY this! Thank you. Subscribed and looking forward to those next videos on Stable Diffusion and your grandma's clone (if I understood correctly). Thanks bro!

    • @zt9233
      @zt9233 9 months ago

      same

  • @anguss2228
    @anguss2228 10 months ago +27

    Would be good to see fine-tuning (AutoTrain / PEFT) limits on Llama2 models for the M3 Max 128GB

    • @yogiwp_
      @yogiwp_ 10 months ago +4

      Would love to see this too!

  • @VictorVedmich
    @VictorVedmich 10 months ago +13

    What do you think, will 64GB be enough or is 128GB still better?

    • @vigreux8
      @vigreux8 6 months ago +1

      If the AI consumes 35GB of RAM with 70 billion parameters, you are left with 29GB of available RAM, and in my opinion models will become more performant and specialized, similar to Mistral AI's 7B models that are very performant relative to their size. So 128GB is a safety margin if you can afford it, but 64GB is sufficient

    • @romainchanas
      @romainchanas 4 months ago

      128GB or regrets

  • @lhxperimental
    @lhxperimental 10 months ago +2

    Nvidia cards are so expensive now. They are blinded by the demand and feel they can get away with anything. I wish they stay in their lalaland for longer while competition develops and pulls the rug from under their feet. The reason Nvidia can get away with this is that most AI/ML software is optimized to run on Nvidia. AMD/Intel/Apple/Qualcomm or a consortium should build a killer GPU / machine that can slay Nvidia. They need to work with one or two AI projects - say Stable Diffusion and Llama - and optimize these two to run crazy fast on their hardware. Some ridiculous out-performance over Nvidia is needed to shake things up. Only then will AI/ML libraries and the ecosystem consider optimizing for non-Nvidia hardware.

  • @andikunar7183
    @andikunar7183 10 months ago +16

    Great idea, but I hoped this would be a better-executed test, with actual benchmark numbers, and that you would show which model you used (which quantization). E.g. llama.cpp prints out the tokens/s numbers at the end, and you can switch between CPU and GPU with parameters on start.
    With llama-2 single-user inference, response-token generation performance depends largely on memory bandwidth, not very much on GPU speed. The 4090 has 2.5x the memory bandwidth of the M3 Max chip (even though its compute is much faster), and 5x vs. the M1/M2 Pro (the M3 Pro is slower here due to lower bandwidth than the M1/M2). This should largely determine their performance results.
    I totally agree that Apple silicon is great a) because of comparatively cheap, large GPU-accessible memory and b) because of approx. 1/10th the power draw (heat/noise).
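
    As a concrete illustration of the llama.cpp point above (a sketch only: the binary is called main in older builds and llama-cli in newer ones, the model file name is illustrative, and flag spellings can vary by version):

      # CPU-only run of a q4_0 70B GGUF model, with a fixed seed for repeatability
      ./llama-cli -m llama-2-70b-chat.Q4_0.gguf -p "Write an essay about GPUs" -n 256 -ngl 0 --seed 42
      # Same run with all layers offloaded to the GPU (Metal on the Mac, CUDA on the 4090)
      ./llama-cli -m llama-2-70b-chat.Q4_0.gguf -p "Write an essay about GPUs" -n 256 -ngl 99 --seed 42
      # llama.cpp prints timings at the end, including prompt-eval and eval tokens per second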

    • @YuuriPenas
      @YuuriPenas 10 months ago +1

      +1 for this comment. I noticed that the results are not really the same.

    • @Teluric2
      @Teluric2 10 months ago +2

      Nobody prefers low power over performance. Users who need power don't care about the power bill.

    • @andikunar7183
      @andikunar7183 10 months ago +3

      @@Teluric2 "nobody" is wrong, quite a few people care about power consumption during inference. AI inference is moving towards the edge. If you want raw power for training, etc., most use a cloud datacenter for AI.

    • @davout5775
      @davout5775 10 months ago

      @@aziz9488 Meanwhile living in his S-hole in the Middle East or North Africa 🤣🤣🤣

    • @davout5775
      @davout5775 10 months ago +1

      @@Teluric2 This is a laptop and everybody should care about power consumption. This machine is not even meant to compete against a machine with a 4090, but apparently there are tasks where it comes out ahead.

  • @uwepuneet
    @uwepuneet 10 months ago +16

    Thank you for this comparison! I've been searching for this for a long time and couldn't find anything concrete anywhere.

  • @klaymoon1
    @klaymoon1 4 months ago +2

    Great comparison! The parameter counts always get bigger. Based on your results, I'm thinking the upcoming 5090 won't be my next purchase. More likely an M4 with 128GB or 256GB of RAM will be my next stop.

  • @oterotube13
    @oterotube13 10 months ago +8

    Hope the M3 Ultra makes a huge shift in performance.

  • @markclayton8977
    @markclayton8977 10 months ago +10

    I have this same CPU/RAM combination. I've been able to run up to the 120B Goliath models, q4 quantization. Very fast inference.

    • @-_.DI2BA._-
      @-_.DI2BA._- 5 months ago

      Could you share a link to the model version you use?

  • @wagnerribeiro8036
    @wagnerribeiro8036 4 months ago +1

    Can you repeat this video for Llama 3? That would be awesome!

  • @acqua_exp6420
    @acqua_exp6420 10 months ago +5

    Could you test the M3 Max with the OpenChat 3.5 model and the Falcon 180B model?
    Amazing video & comparison, thank you! Subscribed! :)

    • @technopremium91
      @technopremium91 10 months ago +3

      I will try!

    • @acqua_exp6420
      @acqua_exp6420 10 months ago

      @@technopremium91 thank you! :)

    • @yogiwp_
      @yogiwp_ 10 months ago

      @@technopremium91 awesome. subbed!

  • @stephe92
    @stephe92 10 months ago +9

    Outstanding video - thank you for taking the time to do this. It’s exactly the comparison I was looking for.

  • @Duckstalker1340
    @Duckstalker1340 10 months ago +2

    Hello, would the MacBook Pro with 64GB of RAM be able to run the 70B model, or do I absolutely need 128GB of RAM?

  • @dimeloloco
    @dimeloloco 2 months ago +1

    You're comparing an Apple machine with 128GB of RAM, more than enough to run 70B, against a PC with a 32GB RAM setup that's far below the recommended spec for running 70B, which is around 40GB of RAM minimum. Compare the M3 to a PC with 128GB of RAM. This comparison is silly. The 128GB PC will be cheaper to build and you'd probably still be able to get a 4090 with the cash left over.

    • @dimeloloco
      @dimeloloco 2 months ago

      If you can’t fit the model in the VRAM then you’re no longer comparing the Max to a 4090, you’re comparing it to the inadequate CPU setup you have

  • @netify6582
    @netify6582 3 months ago +1

    Not an Apple fan here, but the M3 Max's performance with 70B was impressive indeed.

  • @stephenthumb2912
    @stephenthumb2912 10 months ago +8

    Great test, thank you. For all of those working with LLM models who are considering M3s, this is what we're looking for. People should understand these are q4 GGUF models, but it's still a very relevant test and it's good to see the Mac's unified memory working. I'd love to see how it would go running an interface like Streamlit along with RAG on Instructor-XL and ollama or textgen, and see if the Max can handle all of them together.

    • @nigratruo
      @nigratruo 10 months ago

      It was interesting to see that the Mac struggled with using that much memory via the GPU; you could easily see that there is a bottleneck that slowed down the system big time.

  • @donjaime_ett
    @donjaime_ett 10 months ago +6

    Would love to see someone test and benchmark MPS (Metal Performance Shaders) for PyTorch for training ML/AI models with a focus on transformers and LLMs. Some of us do more than just inference, and the availability of GPUs for doing ML/AI development is just insanely bad.

  • @Fordance100
    @Fordance100 10 months ago +1

    Your 5950x should have 128GB RAM instead of 32GB. Just not much RAM left after loading the 70B model.

  • @iganmak
    @iganmak 10 months ago +29

    Potential problems with this comparison:
    1. Did you use the GGUF format for all models on the 4090, instead of GPTQ for the smaller models?
    2. Did you use AutoGPTQ instead of the ExLlamaV2 model loader (much faster inference) for the 4090?
    3. Did you offload part (up to 51 layers) of the GGUF 70B model to the 4090 for acceleration? (See the sketch after this list.)
    Also, the comparison would be much cleaner if you added tokens/second for all your runs.
    Other than that, the comparison is pretty useful.
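
    On point 3, a partial-offload run would look something like this with llama.cpp (illustrative only; the file name and layer split are assumptions, and ollama exposes the same idea through its num_gpu option):

      # Put 51 of Llama 2 70B's 80 layers into the 4090's 24GB of VRAM, keep the rest on the CPU
      ./llama-cli -m llama-2-70b-chat.Q4_0.gguf -p "Write an essay about GPUs" -n 256 -ngl 51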

    • @nathanbanks2354
      @nathanbanks2354 10 months ago +1

      I think ollama defaults to GGUF for everything. Not sure how well it deals with models that don't fit on the GPU, but it still uses some video RAM. I've only played with changing the number of layers given to the GPU using text-generation-webui.

    • @PythonPlusPlus
      @PythonPlusPlus 9 months ago

      The point is comparing Apples to Apples. It wouldn't be a useful comparison if the 4090 were using a different setup.

    • @MrGarkin
      @MrGarkin 8 months ago +1

      @@PythonPlusPlus Apples to apples means using the same models. Optimizing the setup and partial GPU offloading is common sense.

  • @stevenharms9072
    @stevenharms9072 10 months ago +7

    A big difference that would make a great next video: prompt processing time. I found that using 4000-character strings as prompts was much, much slower on the M1 Max vs the 4090, for example.

    • @augustogalindo8687
      @augustogalindo8687 9 months ago +1

      Comparing a 4090 to an M1 Max is kind of unfair, and it depends a lot on the RAM as you can see in the video.

    • @stanchan
      @stanchan 9 months ago +1

      I doubt most people here are looking for a comparison based on fairness. We are looking to see what is currently available in the market, likely for use in our home labs.

  • @julle4083
    @julle4083 10 months ago +5

    Thanks for that! Could you compare the speeds for Stable Diffusion? There's nothing out there comparing the SD performance of the M3 Max and the 4090.

    • @technopremium91
      @technopremium91 10 months ago +13

      That's actually my next video, working on it now. I will be uploading tomorrow.

    • @nguyenminh7780
      @nguyenminh7780 10 months ago +1

      @@technopremium91 Here's a like and a sub, thanks to the minority out there who actually test what matters and not video editing

    • @julle4083
      @julle4083 10 months ago

      @@technopremium91 Oh that’s really great. Thanks a lot for your reply and your time and effort!

  • @MikaMoupondo
    @MikaMoupondo 10 months ago +6

    Hey man! I was here for self-gratification for having bought an M3 Max, but your grandma experiment got me subscribed. Can't wait!

    • @technopremium91
      @technopremium91 10 months ago +3

      Thanks for the sub. That's a project that took me quite some time, but I am ready to share; I'm working on the video and will upload early next week.

  • @Maariyyaa-i8f
    @Maariyyaa-i8f 2 months ago +1

    Any update 7 months later?

  • @broimnotyourbro
    @broimnotyourbro 5 months ago +1

    Yeah, I have an M2 Ultra (only 64GB RAM) and it performs similarly. Shared memory FTW

  • @ericpmoss
    @ericpmoss 10 months ago +4

    I read that the larger SSDs were faster than the smaller ones, peaking with the 4TB chips. If one regularly hits deep into swap space, I wonder how much it helps sustain performance.

  • @noesaenz1
    @noesaenz1 8 months ago

    I think in this case the amount of RAM is the real bottleneck. Could you repeat the same test but with at least 64GB of RAM alongside the RTX 4090?

  • @DerekDavis213
    @DerekDavis213 10 months ago +3

    An M3 Max with 128GB will cost more than 5000 USD.
    For that kind of money, a Windows workstation will run Llama2 7b/13b/70b *MUCH FASTER*.

    • @zihechen3111
      @zihechen3111 10 months ago +1

      😅 No it won't. Only Nvidia GPUs run AI, but a GPU with 128GB of VRAM costs as much as your entire house, baby.

  • @alexeycherkashin6251
    @alexeycherkashin6251 10 months ago +2

    Great content quality: a like and subscribe from me :)
    For me the big question after watching all this is: is 70B worth it? I know that comparing essays could be boring, but... are they any better? Or does the 70B model generally give more relevant results? Asking because at the moment it seems like GPT-3.5 performs better than GPT-4: it follows instructions more carefully, hallucinates less, etc.

  • @Integr8d
    @Integr8d a month ago

    This guy: “So you can see the 4090 finishes a little faster than the M3 Max. Just a few seconds. Not a big deal.”
    Everyone else: “4090 >SLAMS< M3 Max!!!”

  • @hossromani
    @hossromani 10 months ago +8

    OMG, finally a channel using these machines for proper accessible AI instead of YouTube content creation. Keep up the great job, and a video on how to get models up and running would be awesome 👏

    • @TheWallReports
      @TheWallReports 10 months ago

      💯I could not agree more.

  • @sto2779
    @sto2779 a month ago

    Apple is crazy 🤣. This laptop is running a 70-billion-parameter LLM with ease. Imagine buying two of these laptops and doing a cluster... So is the M3 more cost-effective for training LLMs than used A6000s?

  • @yagoa
    @yagoa 10 months ago +1

    Use QuickTime recording and it will use 90% fewer resources

  • @Stewz66
    @Stewz66 10 months ago +4

    You just helped me so much. Thank you.

  • @edwincloudusa
    @edwincloudusa 5 months ago +1

    I was about to buy a PC with 128GB or 256GB of RAM and an RTX 4090, but after watching this video it seems that may not be the right choice for running 80B locally. What would you say?

  • @MuhammadUsman-ix6jo
    @MuhammadUsman-ix6jo 18 days ago

    Damn, while watching your video my computer started lagging 😂😂

  • @justindressler5992
    @justindressler5992 5 months ago

    By the way, GPTQ is faster than GGUF. But GPTQ obviously only runs on the GPU, so it's no good if you want CPU and system-RAM offloading. If I want speed I use GPTQ; if I want large models I use GGUF. GGUF makes sense in this test since you're using CPUs. But you do raise an interesting point: the Max seems to have comparable inference speed to the 4090, even if performance is a little restricted using GGUF. Maybe the 4090 is about 30-40% faster, but only if the model fits.

  • @robertotomas
    @robertotomas 10 months ago +2

    If you are interested in inference, why avoid quantization instead of just using the blazing-fast Nvidia hardware (rather than buying 4 times as many GPUs, buy one and quantize the model)? GPU to GPU, the M3 Max 40-core is about the same as a 4060 Ti, if I understand correctly, and quantization is mostly only bad for training.
    Edit: I think I see the answer; for the 70B model, even quantized to 3 bits it is still just over 26GB

    • @stephenthumb2912
      @stephenthumb2912 9 months ago

      Ollama defaults to the q4_0 quant. That said, I would have liked to see the limit for the M3 Max. It should be able to handle a full-precision 70B, but I wonder if it'll generate fast enough to be usable.
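
      (Rough numbers, not from the video: an fp16 70B model is about 140GB of weights, so even 128GB of unified memory would not hold it unquantized; a q8_0 build at roughly 70-75GB would fit, and the default q4_0 is around 39GB.)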

  • @vit.c.195
    @vit.c.195 8 months ago

    Tests on Shitbuntu with fat generic binary code, with less memory than the srapple, running on a khAMD "brilliant" CPU... very good test...

  • @Matlockization
    @Matlockization 5 months ago

    Well, who's not impressed with the M3 Max 128GB? But Apple's price at $7,000 is unrealistic, and as an added bonus the OS runs slightly more apps than Linux, yikes! Waiting a few seconds more for a result will save anyone thousands in purchase costs. A very interesting channel.

  • @kaleidoscope_records_
    @kaleidoscope_records_ 10 months ago +1

    Please show us the UNQUANTIZED memory usage for the 70B Llama on the M3 Max. Thank you very much!

  • @Litoof
    @Litoof 20 days ago

    Damn, I didn't know OpenAI and Microsoft used M3 MacBooks for their AI services… ah no, they use Nvidia datacenter cards, which give the best performance per watt

  • @romansukach
    @romansukach 3 months ago

    The M3 Max outperformed the 450W 4090 10 times even though it lost in other tests. Good job there)

  • @darthvader4899
    @darthvader4899 10 months ago +2

    Can you run 70B on an M3 Max with 64GB of RAM?

  • @大支爺
    @大支爺 5 months ago

    32GB of RAM is too small, not even enough to run Windows 10/11; my PC has 192GB of DDR5 + a 4090 to run larger models as well.

  • @jean-marctrappier4436
    @jean-marctrappier4436 6 months ago

    As an owner of an M3 Ultra with all the features maxed out, I believe this comparison is extremely biased. The price of a single machine like mine equals a machine with 4 x RTX 4090s, so the comparison should be based on price. Limiting ourselves to power consumption is undoubtedly interesting if we need to integrate our solution into a system with energy limits, but in that case we are not comparing performance by loading a model that will require a lot of power and therefore energy. The only reason that seriously justifies buying the M3 is to be able to easily carry a high-performance model around for demonstrations at clients'; for daily use it really makes no sense, the smaller models have almost the same performance, so the size of the memory is not of immense interest.

  • @alexandervega3463
    @alexandervega3463 22 days ago

    It is impossible to find that Mac now; I hope next year's M4 Pro model provides me an alternative.

  • @devilalwayscry
    @devilalwayscry 8 months ago

    So the bottleneck on the PC with the RTX 4090 is the RAM; you could have had 64GB of RAM on the custom PC, no?

  • @appleman7791
    @appleman7791 7 months ago

    This is the best real-world test on YouTube. I want to buy the 128GB RAM M3 Max 14-inch with 2TB SSD, only it is expensive: €4,779 here in Europe. Is this laptop good for the next 10 years of programming, video editing, rendering?

  • @tobi6758
    @tobi6758 10 months ago +2

    So annoying that the iPhone Continuity Camera only works with Center Stage when recording on the main sensor...

    • @technopremium91
      @technopremium91 10 months ago

      It was annoying for me too, but that was the best way I had to record, because of the quality.

    • @tobi6758
      @tobi6758 10 months ago

      @@technopremium91 Yeah, the quality looks stellar. I also use my iPhone to record, but it's only since the last update of macOS that they force you to use that Center Stage

  • @videofrat3115
    @videofrat3115 9 months ago +1

    The big question is, do you notice any difference in the quality of responses with the 70B parameters, or with Mixtral vs 7B Mistral? I am wondering if it's worth upgrading my 16GB M1 Mac.

  • @GursimarSinghMiglani
    @GursimarSinghMiglani a month ago

    Can't wait for the M4 Max MacBook Pro! Or even the M4 Ultra/Extreme Mac Pro!

  • @lb5928
    @lb5928 5 months ago

    Lies, 70B runs just fine on the 4090 using LM Studio. Ollama is unnecessary and inferior.

  • @nat.serrano
    @nat.serrano 5 months ago

    So basically if I buy an M3 Max I get something better than an RTX 4090?? :)

  • @DIYDEGEN
    @DIYDEGEN 3 months ago

    Why not use a Mac Studio with an M2 Ultra and 192GB of RAM to run 140B?

  • @plbfrost
    @plbfrost 10 months ago +1

    What's your suggestion on SSD capacity when choosing the 128GB M3 Max? It is said that different SSD sizes have different transfer speeds, which would influence performance. Thanks~

  • @stavsap
    @stavsap 6 months ago

    Nice trick with the terminal, but why not run the models with the --verbose flag to get statistics to compare? All your comparison charts are inaccurate since the finish times are not for the same generated text.
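
    For reference, that flag makes the runs directly comparable in tokens per second. A sketch (assuming a reasonably recent ollama build; the exact fields printed vary by version):

      ollama run llama2:70b --verbose "Write a 300-word essay about GPUs"
      # After the response, ollama reports load duration, prompt eval rate and eval rate (tokens/s)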

  • @sergey_serebro
    @sergey_serebro 10 months ago +1

    Any ideas about which model I'll be able to comfortably run on an MBP M3 Max with 30 GPU cores and 36GB of memory?
    I'm trying to figure out which version of the M3 to buy.
    The 40-GPU-core with 128GB is 400GB/s as far as I can see, and the 30-GPU-core with 36GB is 300GB/s, so I'm a bit unsure whether I should really spend extra money on the 40 GPU cores or whether there won't be a big difference.

  • @riccardoatwork5291
    @riccardoatwork5291 10 months ago +1

    Can you give a few examples of interesting ML projects that would require more than 64GB of RAM?

  • @MobileSpace
    @MobileSpace 6 months ago +1

    I know others already said this a few times, but I just wanted to parrot it. This actual-usage comparison for real-world workloads in large-memory environments is truly what many people look for, instead of all of the repeated bull$hit bare-minimum usage scenarios. A hearty thank you for showing the true limits between a 4090 and a large-memory M3 Max. Cheers!

  • @andrewlee6917
    @andrewlee6917 3 months ago

    I tested with my M1 Max 64gb and it works quite well.

  • @carlosjesuscaro8274
    @carlosjesuscaro8274 8 months ago +1

    Thank you for the videos, very well done and informative. Have you tried doing LLM fine-tuning with the M3 Max's 128GB of unified memory? I have seen people running LLM models but using NVIDIA cards for fine-tuning them. I'd be very interested to learn your thoughts about it

  • @javibaltierrez
    @javibaltierrez 10 months ago +1

    Thank you for this very interesting video. I'm looking for exactly this kind of benchmark because I'm interested in buying an Apple MacBook Pro M3 Max 128GB for LLM models. Great video!

  • @realharo
    @realharo 10 months ago +1

    What about two 3090s? With a second-hand price of about $700 each, they could be an interesting option to run the larger models.

  • @FlylenseQ
    @FlylenseQ 22 days ago

    What if you bump up the RAM to 128GB on the PC?

  • @alsoeris
    @alsoeris 5 months ago

    Would it be better to set the AI model to the same seed at a temperature of 0 on each computer, so they get the same output?

  • @arthurlin5029
    @arthurlin5029 10 months ago +1

    I don't quite understand why it's running on the CPU instead of the GPU on the Mac?

    • @dszmaj
      @dszmaj 10 months ago +1

      It's the GPU; the CPU wouldn't be as fast

  • @R1L1.
    @R1L1. 4 months ago

    "Next video we are gonna be cloning my grandma"
    🗿

  • @AdamAI777
    @AdamAI777 4 days ago

    Really, thank you very much, this comparison is helping me with my choice

  • @priontific
    @priontific 10 months ago +1

    Would you be able to test Capybara-Tess-Yi 200K context across these devices? I was really impressed with the M3 Max's performance with the 70B model here, and I think it'll also really steal the show for making use of the whole 200K context window of the 34B model I mentioned. That 128GB of RAM is gonna be needed to make use of the whole window size at actually usable speeds

  • @-_.DI2BA._-
    @-_.DI2BA._- 5 months ago

    What model did you use? Can you share the link?

  • @simonhill6267
    @simonhill6267 3 days ago

    I wanna see 3090s in SLI versus the M3 Max

  • @JouleDoc
    @JouleDoc 2 months ago

    You get a new subscriber,
    thank you

  • @BroaderPerspectiveLLC
    @BroaderPerspectiveLLC 4 months ago +1

    The M3 Max did beast out on that 70B model.

  • @MisterAndreSafari
    @MisterAndreSafari 5 months ago +1

    Thx man, I searched for this benchmark comparison for so long. Greetz!

  • @martinomburajr.5905
    @martinomburajr.5905 8 months ago +2

    Well done! Straight to the point.

  • @BowenChen-sh3sz
    @BowenChen-sh3sz 4 months ago

    If only the M3 Max weren't so expensive :(

  • @ehenningsen
    @ehenningsen 10 months ago

    Paired the 4090 laptop with a crap CPU. That's too bad

  • @maximodakila2873
    @maximodakila2873 3 months ago

    Shut up and take my money, Apple! 😁😁

  • @kaojaicam
    @kaojaicam a month ago

    Idk what YouTube is doing with your video, bro, but this is the EXACT video I was looking for in search and it didn't come up. It didn't appear till much later on my For You page. Glad I found it though, you did an excellent job

  • @felipeperrotta4677
    @felipeperrotta4677 8 months ago

    So everyone can run it, it just changes the speed?

  • @pogimestiso
    @pogimestiso 7 months ago +1

    You just sold me on getting the M3 Max MacBook Pro for my project. Thank you!

  • @confounded_feline
    @confounded_feline 10 months ago +1

    It would be interesting if you could limit the power envelope of the discrete GPU to match the Apple SoC

  • @mamaleone1
    @mamaleone1 7 months ago +1

    This is exactly what I was looking for.
    Great video, very informative.
    Subscribed

  • @Noname-iq1gz
    @Noname-iq1gz 5 months ago

    I got the 70B model running at 22 t/s with two 3090s and the Aphrodite engine; can you do a test with the M3 Max?

    • @Noname-iq1gz
      @Noname-iq1gz 5 months ago

      Try the 400b model as well

  • @DigiDriftZone
    @DigiDriftZone 10 months ago +1

    What parameters would you use with an M3 Max with 38GB of RAM?

    • @technopremium91
      @technopremium91 10 months ago +1

      I was running it with ollama, which uses the GPU by default.

  • @coolwzl
    @coolwzl 9 months ago +1

    Very useful, thanks! I wish we had more benchmarking videos for AI models of various sizes, so that people can set their expectations of what they can get at different budgets.

  • @JonathanPaz-zz6nu
    @JonathanPaz-zz6nu 3 months ago

    Please do the same test on Llama 3 70B

  • @aberobwohl
    @aberobwohl 10 months ago +4

    Your test results concerning timing do not really matter, because you would have to count the tokens. Obviously they each created a different text, so a different number of tokens.

  • @andysPARK
    @andysPARK 9 months ago +1

    Could you enable Resizable BAR on the RTX 4090 so that the GPU can use system memory directly, and rerun the tests?

    • @大支爺
      @大支爺 5 months ago

      He has only 32GB of RAM.

  • @sshivam6955
    @sshivam6955 10 months ago

    Can't wait for the M3 Ultra even though I won't buy it.

  • @ssoka-m5n
    @ssoka-m5n 10 months ago +1

    Thank you for sharing the results,
    nice video

  • @testales
    @testales 10 months ago +1

    Wow, it's possible to run 70B (even though quantized) that fast on a CPU?! I thought 70B was out of the question, because when I tried 33B on CPU it was painfully slow and there's no way a 4090 will load 70B. I was considering dual 4090s or an A6000, with the latter having only half the computing power of a 4090. Obviously the prices for these GPUs are insane, and the power requirements for the 4090 are also insane. But if I could get away with a high-memory CPU system I'd be totally willing to do that instead! How many tokens per second was it actually with that Mac?

    • @Pyriold
      @Pyriold 10 months ago +1

      The M3 Max is not only a CPU, it has a lot of GPU cores and dedicated neural hardware as well.

    • @testales
      @testales 10 หลายเดือนก่อน

      @@Pyriold Well, I've read up on this topic now a little. Seems the test is not really accurate. According to some posts I read, this more or less CPU based configuration will take minutes to before even starting to respond when the context has already been populated to say 50% or like a few thousands tokens. Also benchmarks of CPU based setups where like 10-15x slower than any GPU based setup provided there was enough total VRAM to load the model (like in a RTX 4090 + RTX 3090 dual GPU setup). All numbers I found so far were around 1-3 tokens/s which also matches my experiences so far for CPU based setups. Given how powerful some 7b models recently became though, I more and more doubt that's worth it to invest a lot of money to run 70b models. Especially as these AXXXX GPUs are total rip-off no matter how you put this compared to the already very expensive high end gaming GPUs. Though of course it's probably a different story if you are building an actual data center.

    • @whitecrowuk575
      @whitecrowuk575 10 months ago +1

      Simply get two 3090s - faster than a single 4090, cheaper, 48GB of memory

  • @DIYDEGEN
    @DIYDEGEN 3 months ago

    How much wattage does it use?

  • @danielgall55
    @danielgall55 10 months ago +1

    Nice channel, pretty professional, got subscribed

  • @johnwilson7680
    @johnwilson7680 4 months ago

    How is the 128GB M3 Max on Llama 3 70B 4bit? I have the 48GB M3 Max and it is very slow, probably one token every 10 seconds.

    • @mariantocana8472
      @mariantocana8472 4 months ago

      In this video, when the author runs "ollama run llama3:70b", that means he runs the default Q4_0 quant model.

    • @johnwilson7680
      @johnwilson7680 4 months ago

      @@mariantocana8472 Thank you. I'm pretty sure he only runs Llama 2. There was no Llama 3 when this video came out. I'm curious if Llama 3 70B runs similarly.

  • @velo1337
    @velo1337 10 months ago +1

    Finally a YT channel that actually does real testing