This video is outstandingly helpful. Thank you for the clarity of the tests and your insightful closing thoughts. 💯
Thanks so much for doing this test. It's basically what I wanted to know to decide if I should get the M3 Max at 128GB. Although everything you said is true and the inference speed is slow, the idea for me is that I can check the quality of the bigger models locally, even 70Bs, to see if the quality difference is enough to justify a deployment. I can check full precision and the quants pretty easily, and maybe even some tuning. All with the assumption that in a production deployment I can back it with 2 RTX 6000s or an A100.
The main issue with ChatGPT is of course privacy. Although the API is supposedly private, I don't think most enterprises are comfortable with that, and GPT-4 is still very expensive. Anyways, thanks!
I wonder what's the maximum parameter count of a non-MoE unquantized model that can run on 128GB.
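A back-of-the-envelope answer: fp16 needs about 2 bytes per parameter, and only part of the 128GB is available to the GPU. The 96GB usable figure below is an assumption, not an Apple spec, and the KV cache eats into it further.

```python
# Rough ceiling for an unquantized (fp16) model: ~2 bytes per parameter.
# Assumes ~96 GB of a 128 GB machine is usable by the GPU (an assumption,
# not a documented limit), and ignores KV-cache and runtime overhead.
def max_fp16_params_billion(usable_gb: float, bytes_per_param: int = 2) -> float:
    return usable_gb / bytes_per_param

print(max_fp16_params_billion(96))  # 48.0 -> roughly 48B params, before overhead
```

So something in the 40B range is a realistic practical ceiling once the KV cache and the OS take their share.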
Hey, really great content, thanks 🙏 I am still looking forward to the Grandma modelling video.
I am working on it. The code is done, but I need time to record and edit the video. Planning to release it on Tuesday; it's going to be a cool one.
I bought myself an M2 Max 32GB notebook recently and am very happy with small 7B models. Thanks to your advice, I'm just using the small LLMs for minor stuff, and whenever I need quality responses, ChatGPT's API is really useful to me.
For that RAM I recommend some 13B full models, or lightly quantized ones if you don't mind about a 1-2% quality loss.
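A quick sanity check of what fits in a 32GB machine. The ~0.56 bytes-per-weight figure for a q4_K_M-style quant and the 4GB overhead allowance are approximations, not exact ggml constants:

```python
# Does a model of a given size fit in a unified-memory budget?
# Bytes per weight: fp16 = 2.0, 8-bit = 1.0, ~4.5-bit quant = ~0.56 (approx).
def fits(params_b: float, bytes_per_weight: float, budget_gb: float,
         overhead_gb: float = 4.0) -> bool:
    # overhead_gb leaves assumed headroom for the OS and the KV cache
    return params_b * bytes_per_weight + overhead_gb <= budget_gb

print(fits(13, 2.0, 32))   # 13B fp16: 26 + 4 = 30 GB -> fits, barely
print(fits(34, 0.56, 32))  # 34B ~q4: ~19 + 4 GB -> fits
print(fits(70, 2.0, 32))   # 70B fp16: 144 GB -> does not fit
```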
llama 3 8b has been great for programming for me
Your recent videos are a perfect fit for my needs, and I'm considering whether to buy a top-of-the-line M3 Max for one of my hobbies: cloning myself with an LLM. But to be honest, your videos make me hesitate, because there doesn't seem to be a way to balance portability, fine-tuning, and output speed. Although the Mac is a good choice at this point, it hasn't reached the state I want.
That's exactly what I'm thinking of: making a mini version of myself.
@nooranshdhaka Yes, the rapid development of LLMs has made our childhood wish more and more feasible. In fact, apart from deploying an LLM locally, I don't need a laptop with 128GB of RAM at all. But I've had this little wish for more than ten years, since childhood, and I'm willing to spend some extra money on it.
Yes, I agree with that. My advice: if you have an M1 Max, don't do the switch; if you have an M1 Pro like I had, then yes, but go to the M3 Max with either 36GB or 48GB, which will be enough, and then for serious work get a PC with at least one 4090. The reason is that most AI software is optimized for Linux and for running on NVIDIA GPUs; you can do a few things with the Mac, but for real-world work you need an NVIDIA GPU and Linux. I was working on a personal project to clone my grandma, and I have everything done, from her memories to the voice to the 3D avatar, because I didn't want to use a picture. I'm planning to release that video next week, so stay tuned, and I will show my entire workflow.
Great, I'm looking forward to that video, and I sincerely ask you to make it detailed; people like me will watch it several times. Thank you very much for your advice. Apple must also be aware of NVIDIA's leading position in the field of AI, which is why it launched a notebook with high memory and long battery life this year. Although I'm considering buying a MacBook with a lot of memory, when it comes to a desktop, the Mac Studio is not very cost-effective compared to the 4090. @@technopremium91
@@user-ob7fd8hv4t I don't know where to rent Mac hardware, but I had a good experience renting a 3090 through RunPod last year to play with Stable Diffusion. It convinced me to buy a new laptop with a P5000 and 16GB of VRAM. It's a lot of effort to set up a container the way you like it each time you need to make a picture or use an LLM, but it's a great way to know how well a used 3090 or new 4090 will perform before buying it.
For a web UI, if you run the model in Ollama, the Ollama web UI is pretty slick and very simple to set up.
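For reference, a minimal sketch of one common setup, assuming Docker and the Open WebUI image (the exact image name and flags here are from memory, so verify them against the project's README before relying on them):

```shell
# Pull a quantized Mixtral and start the Ollama server
ollama pull mixtral
ollama serve &

# Run the web UI in Docker, pointing it at the local Ollama instance.
# Flags are a sketch; check the Open WebUI docs for the current invocation.
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 in a browser and select the pulled model.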
Hi, yes, I have used it; it's pretty great, and I like that the interface is similar to ChatGPT.
So wait, you said it's too expensive to run on your own machine, so you'll just continue to use ChatGPT-4. What's the point of your video? You do understand why people want to run these models locally, right? They want to experiment. They want control. They like the idea that they aren't beholden to a central authority. It's about the democratization of technology. What is that worth to you? It's worth a lot to a lot of people, so they will spend money on their own rig to experiment and use these models on their terms.
@@demoman7201 He understands.
He states two things:
- from the "time is money" standpoint, GPT-4 is the better value
- for these models to be usable from a normal user's perspective, you need at least GPT-3.5 performance and behaviour. That is not true yet when using a maxed-out MacBook Pro, since the model is slow and you cannot scale it.
Both made sense to me even though I would rather control my own data as well.
Just from the perspective of LLM operation (not considering portability), will the M2 Ultra 192GB have better performance?
I agree. I was going to purchase an old Precision 7810 Xeon Silver with 256GB of RAM to run open-source models, with the objective of showing my organization how to run open-source models privately hosted on our own data and hardware. However, using the OpenAI API or ChatGPT is far faster and more reasonable right now.
Thanks for the video. Did you try fine-tuning, e.g. LoRA, with your M3 Mac, and if so, how fast is it?
I am planning to do a video on fine-tuning and compare it with an RTX 4090, for example, to see how good these chips are for training. Let me know if you are interested in this.
@@technopremium91 I would love to see that. I'm still unsure how well an M3 Max 128GB machine would handle an 8x7B model for fine-tuning. So far, an NVIDIA chipset still blows the competition out of the water.
@@technopremium91 I've done very little fine tuning, and I'd love to see a comparison between a 4090 and an M3 for Mistral & larger Mixtral/Llama2 models.
@@technopremium91 Thank you, yes I would love to see a fine-tuning comparison.
I have just the 48GB M3 Max and can run a Dolphin q5_K_M version of it (via Ollama), and I was not thrilled with it for programming. I feel like DeepSeek Coder 33B q4 is already much better. Is the fp16 much better? I almost felt like something was wrong, the version I can run performed so poorly.
I am in the same boat. I am a DevOps engineer currently evaluating LLM options for companies, and I cannot find any open-source model that performs as well at coding as GPT-4; even GPT-4 Turbo is not as good as the old version. VS Code Copilot is, I think, the best option out there, doing an incredible job. On a simple test like a snake game, GPT-4 can do it perfectly fine and no other model can, and that's not even real programming, that's pretty basic stuff. I've found that the combination of VS Code and ChatGPT gives me very good results for my projects, but it takes time iterating.
Does the model run completely offline if you download it?
Can you make a video on how to run the model embedded in an app? I mean that the Xcode project would contain the 7B model in the bundle, so the app runs the LLM locally, not remotely. Is that even possible, or do we need to wait for Apple to release it as a Core ML model?
You are comparing too many things simultaneously: a local model vs a cloud model, free vs paid. This is not a good strategy.
Thanks for sharing. I want to buy an M3 Max MBP, and the only question is how much memory to choose. What is your recommendation? I want to do some development with LLMs, like building my own knowledge system.
That's a great question, and it's the reason I am making these videos. I think an M3 Max with 48GB will be enough to run all the models, especially the quantized versions.
@@technopremium91 thanks for your reply, it helps a lot.
@@technopremium91 After watching your videos, I was planning to buy the M3 Max 128GB spec. Will a 48GB M3 Max be enough for 70B or larger models as well? Considering fine-tuning too? I'm a bit new to LLMs, tbh.
This is great info, what I was looking for on MLX. But realistically, if your budget is $5k, you can buy a bunch of Tesla cards and run it on an old Xeon workstation for way less than that, probably less than half the price. You can use M10 GPUs, which don't even need a motherboard with above-4G decoding, and stick three of them in one ancient system for 32*3 = 96GB. Such a system would be much less than $1k and is one type of system I may end up building. However, I'm not really sure that large models are nearly as useful as a RAG system with a mixture of experts, and in that case you can use almost any GPU and get nice results.
A single A100 80GB could run this model fine with 8-bit quantization, which doesn't take much of an accuracy hit; a single A100 40GB in 6-bit, also without much accuracy difference; and a single 24GB GPU in 4-bit. 8-bit takes a bit over 42GB, 6-bit a bit over 32GB, and 4-bit a bit over 21GB. You can see the accuracy difference between them in the model repository: turboderp/Mixtral-8x7B-exl2
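Those sizes follow, roughly, from parameter count times average bits per weight. A sketch, assuming ~46.7B total parameters for Mixtral 8x7B; real exl2 files come out a little smaller or larger than this estimate because the effective bits per weight differ from the nominal figure:

```python
# Rough model size: parameter count * average bits per weight.
# Treat this as a back-of-the-envelope estimate, not the exact file size.
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

MIXTRAL_PARAMS = 46.7e9  # approximate total parameters of Mixtral 8x7B

for bpw in (4.0, 6.0, 8.0):
    print(f"{bpw}-bit: ~{quant_size_gb(MIXTRAL_PARAMS, bpw):.1f} GB")
```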
Thanks for the video, but I think you're completely missing the point here. First, you don't run models locally to save money but to avoid exposing your data to OpenAI, Microsoft, or any other third party; most commercial companies prefer to pay more to keep their privacy. Second, in my experience GPT does much worse on some prompts, e.g. writing code; Mistral-7B wins in that field, IMO. So in most cases I'd prefer to wait longer but get a better response.
Great! Please continue talking truth. Your channel is underrated! Merry Christmas, btw! 🎄🎅
Thank you so much for the support. I am trying to focus on reality and not just hype everything for views.
Well said!
With how cheap GDDR6 is now, they really should make cards with a lot more RAM to catch up to this advantage. We can't have puny Macs outperforming the most expensive PC hardware.
The model spelled Tokyo with an "i", though. Could be a fluke.
NVIDIA will just make them more expensive or put some restriction on consumer cards to prevent LLM usage, as they did after realising people were using them for crypto mining.
@@Tushar.Sharma They can't just do that if Apple keeps making products with superior specs. I would really hate it if Apple took market dominance because the Apple sheep paid for financial dominance through decades of sheepdom. Apple's logo is the original sin; the first Mac was priced at $666. So NVIDIA, AMD, and Intel had better get their act together.
If an integrated RISC architecture has considerable advantages, then we need such products.
Both AMD and NVIDIA are selling really expensive server chips that everyone is willing to buy; they don't want to cannibalize their own market. Though I am curious how well a model like Mixtral runs on two 4090s, since with a mixture of experts, the active experts might fit on just one of the two cards.
Can it run on the 14-inch Mac, or does it have to be the 16-inch? Thank you.
It will run on both; the specs are what matter: 128GB of RAM on the M3 Max.
In terms of heat dissipation, the 16-inch version will have a physical advantage, but other than that there is not much difference. Besides, I think it is unacceptable to sacrifice portability for heat dissipation.
@@technopremium91 Could it run on an M2 Max with 96GB of RAM?
Considering how hyped "AI" is, these LLMs can do close to nothing. In older news, Apple has limited the amount of RAM you can use on the GPU. I can't find the source right now, but I think it might be about 15% "reserved for the system". Apple being Apple, I don't think they've shared their exact algorithm for this anyway.
That's because you still need resources for the rest of the system; it's unified memory, so it's shared.
Someone on Reddit said you can manually convert to an MLPackage and use 100% of the GPU; I haven't tried his way, though.
This "industry" is so young; they will make access to all resources easier to manage in no time.
Well, exactly 100% would be asking for trouble! @@OmarDaily
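To put rough numbers on the "reserved for the system" discussion: the fractions below are a heuristic pieced together from community reports about Metal's `recommendedMaxWorkingSetSize`, not an Apple-documented formula, so treat the output as a ballpark only:

```python
# Assumed heuristic for the default GPU memory budget on Apple Silicon:
# roughly 75% of unified memory on larger configs, roughly two thirds on
# smaller ones. These fractions are community lore, not a documented spec.
def default_gpu_budget_gb(ram_gb: float) -> float:
    fraction = 0.75 if ram_gb > 36 else 2 / 3
    return ram_gb * fraction

print(default_gpu_budget_gb(128))  # 96.0 -> ~96 GB of a 128 GB machine
```

On recent macOS versions the limit can reportedly be raised with the `iogpu.wired_limit_mb` sysctl, though that is also community lore rather than documented behavior.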
Can't open the page you link to...
That's fixed, please try again.
Make more videos and talk to more people in this forum.
Just use Ollama's Mixtral q4. MLX will be the best eventually, but it's not its prime time yet.
I agree. I have used the quantized versions on Ollama and LM Studio and they run way better, but I wanted to test the full model to see if it could run on the M3 Max. I saw a video of it running on the M2 Max and it was crazy slow; I think the GPU architecture on the M3 Max is more advanced for AI.
@@technopremium91 Do you anticipate a big increase from M2 Ultra to M3 Ultra, based on M3 Max performance?
Mixtral-Dolphin FTW; it beats ClosedAI's GPT-3.5 and you get real privacy...
Great video!!! Those French models are overrated!
When it comes to the capital of Japan, it got the pronunciation right but the spelling wrong, just like a real person!
I have the volume on max and you're still a whisper. Please increase the gain in the mixdown.
Apple hardware is usually really poorly built. We tried to deploy six of them; after every two weeks of 24/7 running we would lose one, or it would throttle to the point of uselessness. There's also around a 20% performance difference between units. It's just glued-together garbage. The cloud is probably the only way to go if you don't work with sensitive stuff.
Are you using a Mac Studio or Mac Pro?
Try adding a fan. Apple stuff is usually designed to run quiet and hot. The M3 is one of the best ways to get moderate performance for medium-sized LLMs with 128GB of RAM, but the cloud would be faster if you don't need something deployed 24/7.
@@nathanbanks2354 24/7, or close to it, was the point. Apple hardware is cheap but also kind of garbage, and their production tends to be quite inconsistent. We considered building a custom cooling solution and making it fit into a proper rack, but we were getting close to the cost of proper hardware, so it was pointless.
Let's just wait for the new M300 and see...
@@Joe_Brig pro
Very good common sense for the average user that I am. I will stick with $20/month for now; the space is evolving so fast anyway. I will reconsider investing in hardware in two years, when things clear up.
In latency, they will.
Open-source models are inherently superior to centralized corporate models, which have to include a political-correctness layer that conflicts with basic logic. This was demonstrated by Google's latest AI embarrassment. Open-source models have a future; closed models will get worse and worse by comparison.