The M4 Mac Mini is discounted NOW: ⚡ amzn.to/3CJnjds
Costco has them for $500
@@vinfasi don’t have a Costco membership. How much is that?
@@AZisk $65 in the US. BTW, Costco has the M4/24/512 model for $889.
@@AZisk Go for the Executive tier at $130. If you don't earn enough from the 2% cash back to make up for the extra cost over the $65 Gold tier, Costco will refund the difference at the end of the year.
Also factor in the extra year of warranty Costco gives...
Not true, backordered is way different than discontinued. They’re just backordered now due to popularity.
MacOS servers should be a thing again with M model chips honestly
for sure
Unrealistic
just give me a good linux integration
@@razor-b2d unrealistic? Some companies are already doing it; that's why some people had an issue with the power button being moved, since some companies run server racks full of Mac minis.
Not a bad idea. I think macOS Server failed because of timing: Apple was not as popular with the dev community as it is today. The Mac mini is probably the most popular device for DevOps. Even AWS bought thousands and racks them. (Yep, you can run macOS in AWS.)
It should not be too complicated for them to take the Mac Pro and make it rackable: a 1U unit, shallow depth, with the necessary ports to create clusters.
I think Alex is onto something. Clusters make sense; they provide redundancy.
I hope Apple is watching and paying attention. Alex, what would be interesting is the application side.
What could these rackable Mac minis be used for apart from running LLM tests?
The footprint and low power consumption are great. We just need to find the market for it. Any ideas?
Thank you Alex, I was thinking of doing this exact project! One note: by using a hub you are creating a star topology with a 40Gbps bottleneck shared between all machines. If you used a partially meshed ring topology, you could connect each mini to 3 other minis with a connectivity set of 1:{5,2,3} 2:{1,3,4} 3:{2,4,1} 4:{3,5,2} 5:{4,1,2}. I'd be interested in seeing if this improved performance. Another potential advantage of the mini cluster vs a single M4 Max is that all M4-series chips have the same 16 ANE cores; you might be able to run distributed inference on the neural engine to benefit from that scaling.
Thanks a lot! Definitely something to consider. However, the 40Gbps is per Thunderbolt controller; I thought each port had a separate controller, but they do not. Very good observation!!
@@AZisk Ahh, I had been wondering why it wasn't just daisy chained, bc that was one of the things Thunderbolt was sold on originally. It would only use 1 port each, but yeah, if they would all share the same bandwidth, that might not work out like we wanted.
I'd be super interested in seeing this
Hehehehe this brought me back to the old times where I come from... old 1 mbit network setups... damn at that time it was amazing...
The EXO cluster can only expand available memory, not improve inference performance, because each token needs to be processed sequentially on every Mac mini. For a single request, parallel processing is not possible. For example, if you have one Mac mini with 64GB of RAM, its processing performance would be better than two Mac minis with 32GB each.
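To put rough numbers on why that is (all figures below are made up, just to show the shape of the math):

```python
# Back-of-envelope per-token latency for a model pipelined across Mac minis.
# All numbers are hypothetical, purely to illustrate why a single request
# doesn't get faster by adding machines.

compute_ms_per_device = [40, 40]   # time each mini spends on its slice of the layers
link_ms_per_hop = 2                # assumed cost of shipping activations over the Thunderbolt bridge

# Every generated token has to pass through every device in order,
# so per-token time is a sum, not a max.
per_token_ms = sum(compute_ms_per_device) + link_ms_per_hop * (len(compute_ms_per_device) - 1)
print(f"~{per_token_ms} ms/token -> ~{1000 / per_token_ms:.1f} tokens/s")

# A single 64GB mini running all the layers pays roughly the same total compute
# but zero link hops, which is why it wins whenever the model fits in its RAM.
```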
For the bottom Mac being hotter: try giving it space underneath like the others above have. It might be heat soaking because it can't dissipate the heat like the others can.
According to the Apple video, the air flows in and out of the bottom, so this makes sense.
Or just lay the rack on its side?
You could really see the joy on you face in this video. Like a kid in a candy store. :)
it was a fun one for sure
Exolabs actually processes the parts of the model segmented onto each Mac sequentially, not in parallel, which means it's slower than running it on one machine with a lot of RAM due to connection delay. However, if Exolabs supported Mixture of Experts models and allowed the experts to be split between devices, that would give insane performance when using all experts compared to doing it on one device, because all the experts could run in parallel.
So essentially, if you have a 16 expert model, you can get speed ups until you hit 16 computers?
@blisphul8084 Yes, but even if you only have 8 computers, you can still use the model with two experts per computer (if you have enough ram). You could also just elect to use fewer experts, and the result wouldn't be that much lower quality. You could also have all the experts loaded onto separate computers and still only actively use a couple of experts per token if you want to use less electricity. To clarify, "using fewer experts" only refers to how many experts are activated at a time for each individual token, MoE models still generally use all the experts across multiple tokens, so you need them all loaded somewhere.
Except for training, aren’t most tasks sequence dependent?
@@blisphul8084 Not as much. MoE models are already fast because they are sparse and do not activate all their weights in every run, so they run pretty fast on a single computer already. The problem is fitting them into memory, because all the weights have to be in memory (you do not know in advance which will be used). For example, an 8x7B Mixtral has the speed of a 14B model but needs as much RAM as a 56B model. Maybe you could bring it down to 7B speed if it all worked like that, but that would be a 2x increase in speed, not 8x. Such a setup would, however, allow running bigger models that would not normally fit into RAM.
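A quick worked version of that RAM-vs-speed gap, using the naive 8x7B numbers from above (real Mixtral shares attention weights, so its true total is closer to ~47B):

```python
# Why an "8x7B" MoE needs big-model RAM but runs at small-model speed.
n_experts = 8
active_experts_per_token = 2
params_per_expert_b = 7                  # billions, naive per-expert count

total_params_b = n_experts * params_per_expert_b                    # all must sit in memory
active_params_b = active_experts_per_token * params_per_expert_b    # actually read per token

bytes_per_param = 2                      # fp16/bf16; halve for 8-bit quantisation, etc.
print(f"weights in memory : ~{total_params_b}B params -> ~{total_params_b * bytes_per_param} GB at fp16")
print(f"read per token    : ~{active_params_b}B params (roughly the speed of a 14B dense model)")
```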
To add onto this: Evaluating the performance of distributed systems is hard.
Pipeline Parallel Inference: This involves processing one layer on one device, the next layer on the next device, and so on. It's probably the most intuitive form of parallelism, and it means your system's RAM is essentially the sum of all constituent components. Note that your total tokens/s will be limited by either the slowest machine or the network bandwidth, whichever binds first.
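A toy sketch of that layer split, with plain Python functions standing in for transformer blocks on two machines (the layer count and split are arbitrary):

```python
# Toy pipeline-parallel inference: consecutive layers live on different "devices",
# and a token's activations travel through every stage in order.

def make_layer(i):
    return lambda x: x + i            # stand-in for a transformer block

layers = [make_layer(i) for i in range(8)]
stages = [layers[:4], layers[4:]]     # device 0 holds layers 0-3, device 1 holds 4-7

def forward(x):
    for stage in stages:              # each stage boundary would be a network transfer
        for layer in stage:
            x = layer(x)
    return x

print(forward(0))                     # 28, same answer as running all 8 layers on one box
```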
Asynchronous (Batched) Parallel Inference (Exo goes here): This is like PPI, but you are able to execute asynchronously, which lets you add together the bandwidth of your systems, potentially using batches, which lets you increase your token/s by essentially adding together your device’s compute, but it does incur a latency penalty, and each additional device makes the latency worse. Pretty good if you have a huge batch of prompts to go through or something. Lower bandwidth between devices doesn’t decrease token/s but does worsen latency, to my knowledge.
Tensor Parallel: The output rows of a matrix multiply aren't interdependent. What this means is that if I calculate the product of the first, third, fifth, and seventh rows on one device, I can calculate the second, fourth, sixth, and eighth on another device, and then just synchronize the product at the end. There's a bit more to it (you can optimize communications by not synchronizing and instead sending the products directly to the next layer), but that's the basics. This requires less bandwidth than pipeline parallelism (for LLMs at least, once the weights are loaded), and it does let you add together the bandwidth, capacity, and compute of the constituent devices, but it has two issues: 1) you need a power-of-2 number of devices for most tensor parallel implementations, and 2) you run into issues with certain calculations, like softmax, which require synchronization, and that limits how wide a tensor parallel setup can go. It does improve latency, and improves throughput, though, so it's pretty based. If you have two GPUs, you could run a model literally just twice as fast as not doing tensor parallel, and the same goes for CPUs. I think beyond 4 devices you have to think pretty carefully about how fast the interconnect is, though. Linear Transformers scale way better, so hypothetically if you had, like, 64 CPUs for some reason, you could probably get pretty close to linear performance improvements for each CPU in the network.
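Here is that row split in miniature with NumPy: two "devices" each own half of the output rows of one weight matrix, multiply independently, and the halves are stitched back together at the end (matrix sizes are arbitrary):

```python
import numpy as np

# Tensor parallelism in miniature: split a weight matrix's output rows
# across two "devices", multiply independently, then recombine.

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))    # weight: 8 output rows, 4 input features
x = rng.standard_normal(4)         # activation vector

W_dev0, W_dev1 = W[0::2], W[1::2]  # even rows on device 0, odd rows on device 1

y_dev0 = W_dev0 @ x                # no communication needed during the matmul
y_dev1 = W_dev1 @ x

# "Synchronize" by interleaving the partial results back into the full output.
y = np.empty(8)
y[0::2], y[1::2] = y_dev0, y_dev1

assert np.allclose(y, W @ x)       # identical to doing it on one device
```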
Asymmetrical tensor parallel: I’ve never seen this, but in theory it’s possible. You don’t necessarily have to split the tensors evenly between devices. If you have a slower device (ie: CPU) you should still be able to offload, say, 10-20% of the model to that device, while doing the rest on GPU, which doesn’t sound great, but keep in mind it gives you a 10-20% speed up compared to just GPU, and it also reduces the VRAM required by that GPU. This means that you could probably run a 28GB model on a 24GB GPU, for instance, and run it a little bit faster than if you had the same GPU with 28GB of VRAM. Neat.
Hybrid: If you have a tensor parallel setup configured as one device and then a second step in the pipeline going to another device, this would be a hybrid setup. To my knowledge no frameworks support this out of the box and would require some careful setup that would probably be very custom to your specific topology. You would expect the latency characteristics to be roughly equal to the ratio of tensor parallel and pipeline parallel components in the network. I guess it would make the most sense if you’re trying to pair several small devices (ie: 4 Raspberry Pi’s) with a more powerful device like a GPU.
I’m guessing there’s probably other types of parallelism out there, and there’s also probably some learned parallelism types that might substitute elements of the communication with learned neural networks, or such, but to my knowledge those are the main types for Transformers.
CNNs are where parallelism gets **really** fun, though.
A Mac Mini cluster with MLX and ExoLabs makes sense if you like extending context for the models.
If you just need the model for a "Hello" or "Write a story" query, a single machine would be sufficient for the task.
@@MrBlogbar but…. Speeeeeeeeeeeeeed
I think the argument is that a mini cluster with MLX and ExoLabs would not increase inference speed, because inference is still done sequentially? But it definitely makes running super big models at least possible, thanks to the scalability of memory resources.
I like the narrative style without getting too deep into the technical weeds and letting the screen do the talking.
Correct me if I am wrong. According to your video:
1. If the LLM model can fit in just one Mac mini Pro, we will get maximum tokens per second. Adding more Mac minis actually decreases TPS rather than increasing it.
2. If the LLM model is too big for a single Mac mini, the only way to run the model is by using an exo cluster. However, in this case, the TPS will be very low.
So I am wondering: wouldn't it be better to buy a MacBook Pro M4 Max with 128GB RAM instead of 5 Mac minis? It might be cheaper, and the performance would be much better than an exo cluster of Mac minis (maybe 2 or 3 times the TPS?)
no offense, just want to understand
Isn't that what he says at the end of the video?
Its exactly what he says at the end of the video.
For batch size 1 (= one request at a time), yes. However, adding more Mac minis will improve TPS if you are dealing with higher batch size.
@@philippe.139 I see, parallel requests will be processed simultaneously, thanks
@philippe.139 a MacBook has better performance than a desktop MacPro? Or is it about performance per dollar?
Finally we get to see the results! You've hinted and showed the racked minis in many prior videos, I was going mad!
Booyy ooo boyyy ooo boyyyyy!! This is one heck of a video about M4 mac mini! Such a creative video man! Loved it!
thx!
I loved it too!
Can you run more tests and show a chart with the results per $ spent? We are trying to figure out if it makes more sense to buy 4 base models in a cluster vs spending the same amount on a maxed out m4 pro or max.
If you’d like a quick overview that will be roughly correct but without a nice looking chart:
Apple products are pretty well balanced between compute and memory bandwidth for the purposes of LLMs. In other words, you can basically take the memory bandwidth, divide it by the cost, and that gets you your price to performance. You can test it, but it will match up quite closely with tokens per second.
If your goal is to get the most tokens per second, that’s fine, but there’s other considerations.
If you need to run larger models and don’t really care as much about the speed (ie: You need to ask an oracle a really important question every now and then), your best price to performance will be consumer PCs with maxed out RAM configs, running on CPUs.
If you need to run at the most tokens per second, then usually GPUs or specialized accelerators (Tenstorrent) will give you orders of magnitude better performance. I will note that a hybrid approach is possible where the CPU runs the oracle that does the overall planning and analysis while the smaller fast models essentially take the “monkeys and Shakespeare” approach and just generate quickly until they get it. This works surprisingly well, don’t knock it until you’ve tried it.
The final case is where you need reasonably fast tokens per second on the largest possible model. This is where Apple Silicon makes sense, but do keep in mind it won't be cheap no matter how you slice it. The exact best deal will depend heavily on how much RAM you need for your model specifically, and there are multiple points of overlap. Again, your best shot is to take the memory capacity you need for your model, list every possible config that can run it, and then compare the memory-bandwidth-to-cost ratio of each valid configuration.
Rather awkwardly, you’ll run into a slightly different answer for every model size.
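That heuristic is easy to script. The configs below are placeholder numbers, not real Apple specs or prices, so swap in the machines you are actually comparing:

```python
# Rank configs that can hold your model by memory bandwidth per dollar.
# Every entry here is a made-up example; substitute real RAM/bandwidth/prices.

configs = [
    # (name,               RAM GB, bandwidth GB/s, price USD)
    ("base mini (example)",    16, 120,  600),
    ("pro mini (example)",     64, 273, 2000),
    ("max laptop (example)",  128, 546, 4700),
]

model_ram_needed_gb = 40   # e.g. a ~70B model at 4-bit plus some headroom

viable = [c for c in configs if c[1] >= model_ram_needed_gb]
for name, ram, bw, price in sorted(viable, key=lambda c: c[2] / c[3], reverse=True):
    tok_s_ceiling = bw / model_ram_needed_gb   # bandwidth-bound upper bound on tokens/s
    print(f"{name:22s} {bw / price:.3f} GB/s per $   ~{tok_s_ceiling:.0f} tok/s ceiling")
```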
This kind of investigation is great and truly valuable. It’s a significant contribution to the community. This could become a rabbit hole once you start running tests. Thanks for sharing!
CUDA clusters have 'unified' memory between GPUs thanks to BlueField DPUs, so one GPU could be connected to 1TB of memory easily.
It's important to have all GPUs working ON THE SAME MEMORY during training.
Not all models are compatible with clusters & separate memory spaces.
Yeah, pretty sure multiple NVIDIA cards connected with NVLink will be a lot better than this solution.
Obviously but not all of us are millionaires
@ It doesn't change the fact that most downloaded models can't work in clusters.
Very cool Alex! It would be very interesting to see the same topology with a Mac Studio or Mac Pro, an M2 Ultra as the hub of the cluster, with its 800 GB/s bandwidth and eight Thunderbolt 4 ports. Great content!
Can you daisy chain the thunderbolt connections? This would eliminate the hub.
I thought that way to love to know the answer
Should totally be possible; the config might require some clever hackishness though, but then you might even get link aggregation to work and include WiFi and BT just for the halibut.
@@mpsii surprised this wasn’t his solution. There’s precedent for this.
@@abb0tt good to know, I assumed you could daisy chain via Thunderbolt but did not know for sure
@@JeremyAndersonBoise google "High-speed 10Gbps full-mesh network"
I was waiting for this video for so long, thanks Alex. ❤
me tooo
Me 3 😂
The main issue I have right now is that most libraries, papers, and production models (like SAM, generative 3D geometry, etc.) are all built for CUDA. At work we try to explore every model, but when we try to use a Mac, the answer most of the time is to "fall back to CPU", which makes it unusable and slow. LLM stuff is like hello world at this point lol. Really cool to see your experiments :)
the thunderbolt dive would actually be really interesting
i hope you make the video at some point
Accidental Tech Podcast has some great deep dives on Thunderbolt and other Mac stuff starting from many years back
That was cool! But wouldn't it be cheaper to run on an RTX 4090 based PC?
I’ve always imagined trying LLM clustering using a Mac, and you’ve turned that imagination into reality-I’m so thrilled! Additionally, if there are any Thunderbolt network issues, it would be great to create a video showcasing how to use a 10G network instead. This video is incredibly valuable and practical. Thank you so much for your hard work in making it. I truly appreciate it! ❤❤❤
I love your videos dude. They are very interesting. Keep up with the good work.
Glad you like them!
You could have created a connection using Thunderbolt in a mesh:
a -> b, c
b -> c, d
c -> d, e
d -> e, a
e -> a, b
That's 10 Thunderbolt cables, and with that listing every mini ends up directly linked to every other one (a full mesh using 4 ports per machine). Set it up as a bridge and just give each device its own IP address.
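If you want to sanity-check a wiring plan before buying cables, a few lines of Python will count the bridge hops for you; the sketch below compares a plain 5-node ring (2 ports per machine) against the full mesh listed above:

```python
from collections import deque

# Count Thunderbolt-bridge hops between two minis for a given wiring plan.
def hops(adj, src, dst):
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

ring = {"a": "be", "b": "ac", "c": "bd", "d": "ce", "e": "da"}            # 5 cables, 2 ports each
mesh = {"a": "bcde", "b": "acde", "c": "abde", "d": "abce", "e": "abcd"}  # 10 cables, 4 ports each

print("ring a->d:", hops(ring, "a", "d"))   # 2 hops
print("mesh a->d:", hops(mesh, "a", "d"))   # 1 hop (direct link)
```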
I was waiting for this one for weeks!!
Took me long enough, right?
I’m just happy it’s here 🎉
thanks for this Alex. finally someone tested this, thank you
Hope you like it!
Yay, you did what I suggested with the multiple cameras, and it's truly that cool as expected! 😁 🤘🏼
Alex you are doing a great job. The best Mac-related channel for devs.
Fun fact: WiFi ethernet packets are up to 64kB by default with my 2020 WiFi 6 router. When streaming videos over the DLNA protocol it uses maximum-length WiFi ethernet packets.
I love your creativity and humour, I love your channel man!
Your videos are getting better and better Alex. This was very interesting and of high quality! Thx!
This was really cool to see, I always wanted to see the M series chips pushed like this!
Hi Alex, nice job! It's so cool to see those setups. I was wondering if it would be viable to use a LattePanda Sigma as a larger model server. Could that be a potential next project for you?
If it's Thunderbolt Bridge, can you try daisy-chaining Thunderbolt connections to get rid of the bridge?
@@Alexthesurfer this 👍🏻
I think this will be slower. Let's say machine A wants to communicate with machine E; then the information has to go through B, C, and D. Correct me if I'm wrong.
@@zeeshanrabbani8125 you're not wrong
But star connect is possible with 4 macs. I think he mentioned that and I think that would get rid of the bridge.
I was thinking of the same thing
Best video in regards to M chips and machine learning/LLMs. Keep it up. I love those kinds of videos!!!
Brilliant Alex!!! Thank you for doing this! To be expected that the Thunderbolt hub slows things down... too much contention. I think we will have to wait for the M4 Ultra with 128/192GB. Maybe worthwhile to run 2x M4 Ultra using EXO.
M4 Ultra will have 256GB. There's also supposed to be a chip coming that'll supersede the M4 Ultra with 512GB of memory.
A few years ago I believed in this idea myself. I dreamed of the concept of a mini PC with very powerful APUs that would (ideally) come in the form of cells that connect to each other like a honeycomb and thus automatically create clusters; collect a whole cabinet of these and you can organize a home computing lab. I think the M4 is the first step towards such a future because of its balance of performance and very small size.
Thank you for the video
(I've got a question: if you compare this computing setup to its GPU equivalent, how much would it cost?)
Great beginning! I don‘t think this setup is useless, but yes, a beginning.
It's interesting that Exo stunts performance over MLX so much but is more convenient to use. If you were going to get a mini to supplement your 128GB M4 Max MacBook, would you just go for the base model mini at 24GB? At $800 it's 60% cheaper than the 4090.
Best video so far. Can you try setting up the same cluster horizontally and see if that keeps the bottom Mac mini from becoming a furnace? Also try rearranging the order, to see if the bottom one always heats up or if that one in particular is a hotboi 🥵😂
The good thing about exo, though, is that you can plug in just about anything from the new ecosystem. So if you have the M4 128GB and need more, you can provide that through some n-tuple of Mac minis. Or you have multiple workstations for an office and your need for intensive local inference is rare. Or in the future when we all unnecessarily buy M5 devices 😢
The thumbnail got me instantly, I used to run a grip of Intel Mac Minis at home.
I watched this video and said to myself how the hell am I not subbed to this guy? Great Content!
I loved this video! The fun experiments that tech nerds love!
I feel like you have all the hardware here you would need for a killer agentic workflow setup. I'll be waiting on that video.
Thank you for sharing the cluster idea Alex. I was excited just watching you try the different configurations! Definitely interested in any further networking experiments. Thanks again!
Nice video Alex, did you try daisy-chaining those thunderbolt connections? I'm not sure if it would work, but it might....
The jumbo packets are more about using less bandwidth for protocol overhead: each packet carries a TCP/IP header (20 - 40 bytes) before the payload, so you're saving about 84 bytes per packet if you are filling every jumbo packet. Great for sending a lot of data quickly. At 40Gb/s, and assuming we get all of the pipe, you are saving something like ~40MB/s in TCP packet overhead. Nothing to sneeze at.
If the packets are large enough, otherwise it's just a lot of empty packets to wait for
@@Makronauta The MTU is the max transmission size. If the frame isn't full, it doesn't actually transmit a bunch of whitespace nulls; it only sends what it needs to. Having a consistent MTU size is more about not requiring the edge router/switch to convert from a 9000 MTU, splitting packets up into 1500-MTU packets and costing CPU time. If you send a 9000-MTU packet whose payload is smaller than a 1500-sized packet, it will not need to split the packet.
@@Dygear Thanks for the explanation!
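For anyone who wants to redo that math with their own assumptions, it is a two-minute script. The sketch below counts only the 40-byte TCP/IP header and ignores Ethernet framing, so the exact figures will differ from the rough ones above depending on what you count:

```python
# Back-of-envelope header overhead at line rate for standard vs jumbo frames.
line_rate_bytes_per_sec = 40e9 / 8        # 40 Gb/s Thunderbolt bridge, best case
header_bytes = 40                         # 20 B IP + 20 B TCP (Ethernet framing ignored)

def overhead(mtu):
    packets_per_sec = line_rate_bytes_per_sec / mtu
    return packets_per_sec * header_bytes, packets_per_sec * (mtu - header_bytes)

std_hdr, std_payload = overhead(1500)
jumbo_hdr, jumbo_payload = overhead(9000)

print(f"1500 MTU: ~{std_hdr / 1e6:.0f} MB/s of headers")
print(f"9000 MTU: ~{jumbo_hdr / 1e6:.0f} MB/s of headers")
print(f"extra payload with jumbo frames: ~{(jumbo_payload - std_payload) / 1e6:.0f} MB/s")
```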
What about doing a cluster of two m4 base model Mac mini's as a home computer?
Been waiting for this one since I saw the YT short of this cluster 👍
Thanks Alex. You're asking - and answering - the big questions that are at the edge of everyone's worry & interest. These are the things that SHOULD be obvious, or obvious questions. The elephant in the room! The current and future Macs might be incredible game changers or *just* incredible consumer desktops, but either answer feels super important to answer.
The editing in this video is insane!
It would be interesting to see how it would work if each machine had internal 10Gb ethernet, connected through a 10Gb switch.
love your videos and the ideas behind....keep doing what you do!!!
Great video! Likely this was shown in other videos but: M4 Macs Mini Pro here are the binned (12-c CPU and 16-c GPU) or the unbinned (14-c CPU and 20-c GPU) version? Cheers
Now I see why the Mac Mini M4 is not available in my country yet! Subscribed!
off topic but do you recommend the m4 12-core pro MacBook Pro instead of the 14-core pro? better battery life? can do video editing, music production and coding? keep for 5 years at least, what are your thoughts? I will get the 48gb ram and 2tb ssd
What’s the performance of the mini pro 64GB on qwen 2.5 32b ? That’s the combo I planned on buying. Thanks !
You will need to quantise it; unquantised it needs more than 64GB of RAM. At q8 quantisation it needs around 32GB of RAM, so you do not need 64GB in the first place, and with 48GB you should be good. I bet it is still gonna be slow unless you quantise it further, in which case 64GB is even less needed. But in general, even if it were 60GB and it fit, it would not work well: the M4 Pro does not have the TFLOPS nor the memory bandwidth to really support running such large models on one machine. It should be simple math if anybody took the time to do it. Wait for a Mac Studio Max, which is probably gonna be slightly more expensive. Or buy a bunch of base Mac minis once someone has figured out how to solve the network congestion issues; 3-4 M4s should end up being faster than 1 M4 Pro (in the video, 2 M4s were around the performance of one M4 Pro, which makes sense in terms of TFLOPS and memory bandwidth).
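The simple math in question, as a quick sketch (weights only; KV cache and runtime overhead come on top):

```python
# Rough weight footprint for a 32B-parameter model at different quantisations.
params_b = 32   # e.g. Qwen 2.5 32B

for label, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{label}: ~{params_b * bits / 8:.0f} GB of weights")

# fp16 ~64 GB (too tight once macOS takes its share), q8 ~32 GB, q4 ~16 GB,
# which is why 48GB is comfortable at q8.
```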
Great proof of principle! I anticipate many more TH-cam videos on this topic. Would be interesting to see Mac Studio clusters also.
It will be interesting when refurbs and 'parts only' M4 MacBook Pros start hitting the market... Imagine picking up 4 "broken screen" M4 MBPs with 128GB of RAM each and building a cluster; that would be wild.
Oh man this is cool! I'm really curious now about a cluster of mini PCs with maxed out ram like the one you tested a little while ago
Impressive - wasn’t sure how performance would scale when clustered. Continue to be impressed with Apple Silicon.
For the cost of all these M4s you could have purchased a fk-off good GPU cluster server 😅
Thanks for testing this out
Try setting up the cluster in a hierarchical scheme. I've had experience with distributed computing, and while this adds CPU overhead, model workloads are primarily GPU-bound, so it doesn't affect running the models. One master node, 3 sub-master nodes, then up to 3 worker nodes per sub-master node. I want to see what kind of workloads it can accomplish.
Great job! What software do you use to mirror the MacMini displays on your MacBook?
You are doing a great channel Alex thank you so much👍
And did you find out why the bottom mac mini was hotter than the other four? Was there something wrong or what was the reason🙏
Very interesting video and setup. Maybe the bottom Mac runs hotter because it doesn't have the same vertical space to push out air as the suspended Macs above it have.
A Mac mini M4 Pro with 64GB of RAM also costs $2,000.
keep up the great work. this is exactly the content I was and am interested in.
🎉 Thanks for trying 70B, honestly i think minis are ok just up to 32B
Your bottom Mac mini is hot because it is supplying power through its Thunderbolt connections to the other Mac minis. Test the temperature of each Mac mini and you may see a difference. If each Mac mini is powered separately, that may not be the case. Also, the Mac mini with the Pro chip (and more RAM and SSD storage) may run slightly hotter due to the extra hardware that stays active.
This is a phenomenal video!! I’ve been playing with the idea of selling some of my current hardware for a cluster of Mac Mini’s… this video is only pushing me further towards that lol.
Yes to running a jumbo test; suggest a maximum test (just for giggles) and an "average" jumbo test, if there is such a thing.
Love all the testing that you are doing - exciting.😅
amazing video. this is the type of content i come for. no one is doing videos about mini clusters. this is gonna be HUGE for llm. can the same be done with maxed out studios? theoretically it should work but would it be cost effective?
Great video Alex! You mention setting up for Jumbo MTU sizes...did that make any difference? In theory you could get 9K packets...but I have my doubts.
Thank you as always for a great video 😊 have you tried running parallels / a windows arm VM on the new M4 Mac mini? I’m on the fence about upgrading and wondering how it handles VMs or Containers (including Rosetta) - not sure if you’ve tried this on m4?
Only saw this post in my feed, not my subscriptions 😭 damn youtube algo wasn't showing me ur video 😭
Amazing work
It'd be very interesting to see the evolution of "time to first token" between these different topologies.
In real world use for dev work we have massive prompts, it's both the time to first token and tokens/s that matters ^^
Great informative video nonetheless.
AI should write a story with the prompt "one day a group of friends with 128GB M4 Max Macs come together for a LAN party..."
FINALLY!!! What we’ve all been waiting for.
It's interesting to me that despite the relatively big leaps in max specifications each generation, there's already demand for more power. I hope Apple will consider some sort of monster server setup with multiple 128/192GB M4 Ultras in it, although they'd need to work out how to up the bandwidth considerably. But of course, it's possible.
Alex, the only interesting use case is running models that you can’t run on one machine. And you are right: with NVIDIA consumer cards this moment comes early.
So, please, try to run a 70B BF16 model on your cluster. It should fit just barely. Llama models need BF16, so please no smaller quantization.
Also, try Daisy chaining as the current setup limits the Thunderbolt performance which is critical in this scenario.
Godspeed 🫡
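For reference, the weights alone for that run (quick check, ignoring KV cache and runtime overhead):

```python
# 70B parameters at BF16 (2 bytes per weight), weights only.
params = 70e9
bytes_per_param = 2
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")   # ~140 GB before KV cache
```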
Great video! And great comment section also! @azisk you attract a good community! What is the quantisation of the models you use? Asking in order to try to replicate/add some test on other configurations and compare with your results. Kindest regards!
Very interesting project. I would be interested in seeing something where the minis are connected via the LAN port.
16:44 lol - Elmo doesn’t pay his bills 😂
Facts!
Thank you for this video Alex, it's what I needed to get it. Coles notes for other folks.
If you want the top perf on the highest param models, you need to use CUDA capable cards, aka NVIDIA.
If you want to save energy and you have some mac mini's hanging around they do an OK job and will be most cost effective.
Cheers.!
"GET TO THE CHOPPA...!!! ERR... CLUSTER...!!!" (best Schwarzenegger accent)
I think this video would have benefited from some graphs to easily capture all of the scenarios' outcomes. Thanks for an interesting video!!
You might want to look into a high-pass filter for your audio. A lot of humming in the lower frequencies.
Have you ever tried running LINPACK, or whatever TOP500 is running for benchmarking supercomputers? I've been away from this for a decade, but I see LLNL again has the top supercomputer, with the next two also being DOE facilities. It would be interesting to see what a base M4, M4 Pro, and M4 Max Mac would do. I saw one of your screens showing >17 TFLOPS, and I'm wondering how fast a single M4 can run, along with how fast a cluster of M4s can run.
Actually, you were the reason I settled on the M4 Pro base model: you introduced me to Exo, and I immediately had in mind what you are demonstrating here. This base model with an additional 4TB Thunderbolt SSD is great value and handles all everyday tasks very well, and I am going to make an Exo cluster out of the cheapest M4 base models, the 256GB/16GB ones, each with an external SSD.
Now the only thing that is still missing is multi-node Kubernetes support with Docker Desktop for Mac and the Mac mini will be the best value home lab cluster node you can currently buy.
btw: Nice keyboard selection. I bought the Keychron Q1 as well with my Mini. I originally wanted one with blue (click) switches but the only model left was the one with purple hall switches but it feels nice. It reminds me a lot of the feeling of the Amiga 500 keyboard.
btw, is it possible to connect the nodes in a daisy-chained manner?
Excellent video, exactly what I was looking for. Thanks! Great content. Followed.
I think another part of the reason the lowest unit is the hottest is that it is closest to the table. It has the least airflow underneath and I also suspect the metal chassis is absorbing more heat here.
Compared to the others, bottom M4 lacks an open floor, which likely affects air ventilation.
I had this exact project in mind, well done Alex!
Hello Alex! Thanks for the info! Was wondering which program are you using to remote into each of your mac minis?
screen sharing. it’s built in
thanks for showing such a great concept!
Thanks. The performance was 😂. It helped me decide not to try it myself, thanks a lot.
I think the future (if) Mac Pro should be some type of cluster.
Very good, thanks for your demonstration! This points towards an M4 Max Studio with 128 GB of RAM, for me! Thanks again Alex! :)