This video is outstandingly helpful. Thank you for the clarity of the tests and your insightful closing thoughts. 💯
Thanks so much for doing this test. It's basically what I wanted to know to decide if I should get the M3 Max at 128GB. Although everything you said is true and the inference speed is slow, the idea for me is that I can check the quality of the bigger models locally, even 70Bs, to see if the quality difference is enough to justify a deployment. I can check full precision and the quants pretty easily, and maybe even some tuning. All with the assumption that in a production deployment I can back it with 2 RTX 6000s or an A100.
The main issue with ChatGPT is of course privacy. Although the API is supposedly private, I don't think most enterprises are comfortable with that, and GPT-4 is still very expensive. Anyways, thanks!
I wonder what's the maximum parameter count of a non-MoE unquantized model that can run on 128GB.
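A back-of-the-envelope answer: fp16 needs about 2 bytes per parameter, and only part of the 128GB is available to the GPU. The 96GB usable figure below is an assumption, not an Apple spec, and the KV cache eats into it further.

```python
# Rough ceiling for an unquantized (fp16) model: ~2 bytes per parameter.
# Assumes ~96 GB of a 128 GB machine is usable by the GPU (an assumption,
# not a documented limit), and ignores KV-cache and runtime overhead.
def max_fp16_params_billion(usable_gb: float, bytes_per_param: int = 2) -> float:
    return usable_gb / bytes_per_param

print(max_fp16_params_billion(96))  # 48.0 -> roughly 48B params, before overhead
```

So something in the 40B range is a realistic practical ceiling once the KV cache and the OS take their share.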
Hey, really great content, thanks 🙏 I am still looking forward to the Grandma modelling video.
I am working on it. The code is done, but I need time to record and edit the video. Planning to release it on Tuesday; it's going to be a cool one.
I bought myself an M2 Max 32GB notebook recently and am very happy with small 7B models. Thanks to your advice, I'm just using the small LLMs for minor stuff, and whenever I need quality responses, ChatGPT's API is really useful to me.
For that RAM I recommend some 13B full models, or lightly quantized ones if you don't mind about a 1-2% quality loss.
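A quick sanity check of what fits in a 32GB machine. The ~0.56 bytes-per-weight figure for a q4_K_M-style quant and the 4GB overhead allowance are approximations, not exact ggml constants:

```python
# Does a model of a given size fit in a unified-memory budget?
# Bytes per weight: fp16 = 2.0, 8-bit = 1.0, ~4.5-bit quant = ~0.56 (approx).
def fits(params_b: float, bytes_per_weight: float, budget_gb: float,
         overhead_gb: float = 4.0) -> bool:
    # overhead_gb leaves assumed headroom for the OS and the KV cache
    return params_b * bytes_per_weight + overhead_gb <= budget_gb

print(fits(13, 2.0, 32))   # 13B fp16: 26 + 4 = 30 GB -> fits, barely
print(fits(34, 0.56, 32))  # 34B ~q4: ~19 + 4 GB -> fits
print(fits(70, 2.0, 32))   # 70B fp16: 144 GB -> does not fit
```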
llama 3 8b has been great for programming for me
Your recent videos are a perfect fit for my needs, and I'm considering whether to buy a top-of-the-line M3 Max for one of my hobbies: cloning myself with an LLM. But to be honest, your videos make me hesitate, because there doesn't seem to be a way to balance portability, fine-tuning, and output speed. Although the Mac is a good choice at this point, it hasn't reached the state I want.
That's exactly what I'm thinking of: making a mini version of myself.
@nooranshdhaka Yes, the rapid development of LLMs has made our childhood wish more and more feasible. In fact, apart from deploying an LLM locally, I don't need a laptop with 128GB of RAM at all. But I've had this little wish for more than ten years, since childhood, and I'm willing to spend some extra money on it.
Yes, I agree with that. My advice: if you have an M1 Max, don't do the switch; if you have an M1 Pro like I had, then yes, but go to the M3 Max with either 36GB or 48GB, which will be enough, and then for serious work get a PC with at least one 4090. The reason is that most AI software is optimized for Linux and for running on NVIDIA GPUs; you can do a few things with the Mac, but for real-world work you need an NVIDIA GPU and Linux. I was working on a personal project to clone my grandma, and I have everything done, from her memories to the voice to the 3D avatar, because I didn't want to use a picture. I'm planning to release that video next week, so stay tuned, and I will show my entire workflow.
Great, I'm looking forward to that video, and I sincerely ask you to make it detailed; people like me will watch it several times. Thank you very much for your advice. Apple must also be aware of NVIDIA's leading position in the field of AI, which is why it launched a notebook with high memory and long battery life this year. Although I'm considering buying a MacBook with a lot of memory, when it comes to a desktop, the Mac Studio is not very cost-effective compared to the 4090. @@technopremium91
@@user-ob7fd8hv4t I don't know where to rent Mac hardware, but I had a good experience renting a 3090 through RunPod last year to play with Stable Diffusion. It convinced me to buy a new laptop with a P5000 and 16GB of VRAM. It's a lot of effort to set up a container the way you like it each time you need to make a picture or use an LLM, but it's a great way to know how well a used 3090 or new 4090 will perform before buying it.
For a web UI, if you run the model in Ollama, the Ollama web UI is pretty slick and very simple to set up.
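For reference, a minimal sketch of one common setup, assuming Docker and the Open WebUI image (the exact image name and flags here are from memory, so verify them against the project's README before relying on them):

```shell
# Pull a quantized Mixtral and start the Ollama server
ollama pull mixtral
ollama serve &

# Run the web UI in Docker, pointing it at the local Ollama instance.
# Flags are a sketch; check the Open WebUI docs for the current invocation.
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 in a browser and select the pulled model.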
Hi, yes, I have used it; it's pretty great, and I like that the interface is similar to ChatGPT.
So wait, you said it's too expensive to run on your own machine, so you'll just continue to use ChatGPT-4. What's the point of your video? You do understand why people want to run these models locally, right? They want to experiment. They want control. They like the idea that they aren't beholden to a central authority. It's about the democratization of technology. What is that worth to you? It's worth a lot to a lot of people, so they will spend money on their own rig to experiment and use these models on their terms.
@@demoman7201 He understands.
He states two things:
- from the "time is money" standpoint, GPT-4 is the better value
- for these models to be usable from a normal user's perspective, you need at least GPT-3.5 performance and behaviour. That is not true yet when using a maxed-out MacBook Pro, since the model is slow and you cannot scale it.
Both made sense to me even though I would rather control my own data as well.
Just from the perspective of LLM operation (not considering portability), will the M2 Ultra 192GB have better performance?
I agree. I was going to purchase an old Precision 7810 Xeon Silver with 256GB of RAM to run open-source models, with the objective of showing my organization how to run open-source models privately hosted on our own data and hardware. However, using the OpenAI API or ChatGPT is far faster and more reasonable right now.
Thanks for the video. Did you try fine-tuning, e.g. LoRA, with your M3 Mac, and if so, how fast is it?
I am planning to do a video on fine-tuning and compare it with an RTX 4090, for example, to see how good these chips are for training. Let me know if you are interested in this.
@@technopremium91 I would love to see that. I'm still unsure how well an M3 Max 128GB machine would handle an 8x7B model for fine-tuning. So far, an NVIDIA chipset still blows the competition out of the water.
@@technopremium91 I've done very little fine tuning, and I'd love to see a comparison between a 4090 and an M3 for Mistral & larger Mixtral/Llama2 models.
@@technopremium91 Thank you, yes I would love to see a fine-tuning comparison.
I have just the 48GB M3 Max and can run a Dolphin q5_K_M version of it (via Ollama), and I was not thrilled with it for programming. I feel like DeepSeek Coder 33B q4 is already much better. Is the fp16 much better? I almost felt like something was wrong, the version I can run performed so poorly.
I am in the same boat. I am a DevOps engineer currently evaluating LLM options for companies, and I cannot find any open-source model that performs as well at coding as GPT-4; even GPT-4 Turbo is not as good as the old version. VS Code Copilot is, I think, the best option out there, doing an incredible job. On a simple test like a snake game, GPT-4 can do it perfectly fine and no other model can, and that's not even real programming, that's pretty basic stuff. I've found that the combination of VS Code and ChatGPT gives me very good results for my projects, but it takes time iterating.
Does the model run completely offline if you download it?
Can you make a video on how to run the model embedded in an app? I mean that the Xcode project would contain the 7B model in the bundle, so the app runs the LLM locally, not remotely. Is that even possible, or do we need to wait for Apple to release it as a Core ML model?
You are comparing too many things simultaneously: a local model vs a cloud model, free vs paid. This is not a good strategy.
Thanks for sharing. I want to buy an M3 Max MBP, and the only question is how much memory to choose. What is your recommendation? I want to do some development with LLMs, like building my own knowledge system.
That's a great question, and it's the reason I am making these videos. I think an M3 Max with 48GB will be enough to run all the models, especially the quantized versions.
@@technopremium91 thanks for your reply, it helps a lot.
@@technopremium91 After watching your videos, I was planning to buy the M3 Max 128GB spec. Will a 48GB M3 Max be enough for 70B or larger models as well? Considering fine-tuning too? I'm a bit new to LLMs, tbh.
This is great info, what I was looking for on MLX. But realistically, if your budget is $5k, you can buy a bunch of Tesla cards and run it on an old Xeon workstation for way less than that, probably less than half the price. You can use M10 GPUs, which don't even need a motherboard with above-4G decoding, and stick three of them in one ancient system for 32*3 = 96GB. Such a system would be much less than $1k and is one type of system I may end up building. However, I'm not really sure that large models are nearly as useful as a RAG system with a mixture of experts, and in that case you can use almost any GPU and get nice results.
A single A100 80GB could run this model fine with 8-bit quantization, which doesn't take much of an accuracy hit; a single A100 40GB in 6-bit, also without much accuracy difference; and a single 24GB GPU in 4-bit. 8-bit takes a bit over 42GB, 6-bit a bit over 32GB, and 4-bit a bit over 21GB. You can see the accuracy difference between them in the model repository: turboderp/Mixtral-8x7B-exl2
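Those sizes follow, roughly, from parameter count times average bits per weight. A sketch, assuming ~46.7B total parameters for Mixtral 8x7B; real exl2 files come out a little smaller or larger than this estimate because the effective bits per weight differ from the nominal figure:

```python
# Rough model size: parameter count * average bits per weight.
# Treat this as a back-of-the-envelope estimate, not the exact file size.
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

MIXTRAL_PARAMS = 46.7e9  # approximate total parameters of Mixtral 8x7B

for bpw in (4.0, 6.0, 8.0):
    print(f"{bpw}-bit: ~{quant_size_gb(MIXTRAL_PARAMS, bpw):.1f} GB")
```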
Thanks for the video, but I think you're completely missing the point here. First, you don't run models locally to save money but to avoid exposing your data to OpenAI, Microsoft, or any other third party; most commercial companies prefer to pay more to keep their privacy. Second, in my experience GPT does much worse on some prompts, e.g. writing code; Mistral-7B wins in that field, IMO. So in most cases I'd prefer to wait longer but get a better response.
Great! Please continue talking truth. Your channel is underrated! Merry Christmas, btw! 🎄🎅
Thank you so much for the support. I am trying to focus on reality and not just hype everything for views.
Well said!
With how cheap GDDR6 is now, they really should make cards with a lot more RAM to catch up to this advantage. We can't have puny Macs outperforming the most expensive PC hardware.
The model spelled Tokyo with an "i", though. Could be a fluke.
NVIDIA will just make them more expensive or put some restriction on consumer cards to prevent LLM usage, as they did after realising people were using them for crypto mining.
@@Tushar.Sharma They can't just do that if Apple keeps making products with superior specs. I would really hate it if Apple took market dominance because the Apple sheep paid for financial dominance through decades of sheepdom. Apple's logo is the original sin; the first Mac was priced at $666. So NVIDIA, AMD, and Intel had better get their act together.
If an integrated RISC architecture has considerable advantages, then we need such products.
Both AMD and NVIDIA are selling really expensive server chips that everyone is willing to buy; they don't want to cannibalize their own market. Though I am curious how well a model like Mixtral runs on two 4090s, since with a mixture of experts, the active experts might fit on just one of the two cards.
Can it run on the 14-inch Mac, or does it have to be the 16-inch? Thank you.
It will run on both; the specs are what matter: 128GB of RAM on the M3 Max.
In terms of heat dissipation, the 16-inch version will have a physical advantage, but other than that there is not much difference. Besides, I think it is unacceptable to sacrifice portability for heat dissipation.
@@technopremium91 Could it run on an M2 Max with 96GB of RAM?
Considering how hyped "AI" is, these LLMs can do close to nothing. In older news, Apple has limited the amount of RAM you can use on the GPU. I can't find the source right now, but I think it might be about 15% "reserved for the system". Apple being Apple, I don't think they've shared their exact algorithm for this anyway.
That's because you still need resources for the rest of the system; it's unified memory, so it's shared.
Someone on Reddit said you can manually convert to an MLPackage and use 100% of the GPU; I haven't tried his way, though.
This "industry" is so young; they will make access to all resources easier to manage in no time.
Well, exactly 100% would be asking for trouble! @@OmarDaily
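To put rough numbers on the "reserved for the system" discussion: the fractions below are a heuristic pieced together from community reports about Metal's `recommendedMaxWorkingSetSize`, not an Apple-documented formula, so treat the output as a ballpark only:

```python
# Assumed heuristic for the default GPU memory budget on Apple Silicon:
# roughly 75% of unified memory on larger configs, roughly two thirds on
# smaller ones. These fractions are community lore, not a documented spec.
def default_gpu_budget_gb(ram_gb: float) -> float:
    fraction = 0.75 if ram_gb > 36 else 2 / 3
    return ram_gb * fraction

print(default_gpu_budget_gb(128))  # 96.0 -> ~96 GB of a 128 GB machine
```

On recent macOS versions the limit can reportedly be raised with the `iogpu.wired_limit_mb` sysctl, though that is also community lore rather than documented behavior.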
Can't open the page you link to...
That's fixed, please try again.
Make more videos and talk to more people in this forum.
Just use Ollama's Mixtral q4. MLX will be the best eventually, but it's not its prime time yet.
I agree. I have used the quantized versions on Ollama and LM Studio and they run way better, but I wanted to test the full model to see if it could run on the M3 Max. I saw a video of it running on the M2 Max and it was crazy slow; I think the GPU architecture on the M3 Max is more advanced for AI.
@@technopremium91 Do you anticipate a big increase from M2 Ultra to M3 Ultra, based on M3 Max performance?
Mixtral-Dolphin FTW; it beats ClosedAI's GPT-3.5 and you get real privacy...
Great video!!! Those French models are overrated!
When it comes to the capital of Japan, it got the pronunciation right but the spelling wrong, just like a real person!
I have the volume on max and you're still a whisper. Please increase the gain in the mixdown.
Apple hardware is usually really poorly built. We tried to deploy six of them; after every two weeks of 24/7 running we would lose one, or it would throttle to the point of uselessness. There's also around a 20% performance difference between units. It's just glued-together garbage. The cloud is probably the only way to go if you don't work with sensitive stuff.
Are you using a Mac Studio or Mac Pro?
Try adding a fan. Apple stuff is usually designed to run quiet and hot. The M3 is one of the best ways to get moderate performance for medium-sized LLMs with 128GB of RAM, but the cloud would be faster if you don't need something deployed 24/7.
@@nathanbanks2354 24/7, or close to it, was the point. Apple hardware is cheap but also kind of garbage, and their production tends to be quite inconsistent. We considered building a custom cooling solution and making it fit into a proper rack, but we were getting close to the cost of proper hardware, so it was pointless.
Let's just wait for the new M300 and see...
@@Joe_Brig pro
Very good common sense for the average user that I am. I will stick with $20/month for now; the space is evolving so fast anyway. I will reconsider investing in hardware in two years, when things clear up.
In latency, they will.
Open-source models are inherently superior to centralized corporate models, which have to include a political-correctness layer that conflicts with basic logic. This was demonstrated by Google's latest AI embarrassment. Open-source models have a future; closed models will get worse and worse by comparison.