REALITY vs Apple’s Memory Claims | vs RTX4090m
- Published Nov 20, 2024
- I put Apple Silicon memory bandwidth claims to the test against the NVIDIA RTX 4090 powerhouse.
Run Windows on a Mac: prf.hn/click/c... (affiliate)
Use COUPON: ZISKIND10
🛒 Gear Links 🛒
🍏💥 New MacBook Air M1 Deal: amzn.to/3S59ID8
💻🔄 Renewed MacBook Air M1 Deal: amzn.to/45K1Gmk
🎧⚡ Great 40Gbps T4 enclosure: amzn.to/3JNwBGW
🛠️🚀 My NVMe SSD: amzn.to/3YLEySo
📦🎮 My gear: www.amazon.com...
🎥 Related Videos 🎥
💰 MacBook Machine Learning | M3 Max - • Cheap vs Expensive Mac...
🤖 INSANE Machine Learning on Neural Engine - • INSANE Machine Learnin...
👨💻 M1 DESTROYS a RTX card for ML - • When M1 DESTROYS a RTX...
🌗 RAM torture test on Mac - • TRUTH about RAM vs SSD...
👨💻 M1 Max VS RTX3070 - • M1 Max VS RTX3070 (Ten...
🛠️ Developer productivity Playlist - • Developer Productivity
- - - - - - - - -
❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: www.youtube.co...
- - - - - - - - -
📱LET'S CONNECT ON SOCIAL MEDIA
ALEX ON TWITTER: / digitalix
#m3max #m2max #machinelearning
JOIN: youtube.com/@azisk/join
So for photo and video editing, which would do a better job: the M3 Max MacBook Pro, or a Microsoft Surface Laptop Studio 2 with 64GB RAM, an i7 processor, and an RTX 4060 with 8GB of discrete VRAM?
I've never seen such a casual unboxing. People earn money just from unboxings, and here you are providing truly valuable information, keeping the shopping spectacle aside. Thx ❤
The RTX 4090 in the laptop is a TDP-limited RTX 4080.
That shouldn’t impact memory right?
@@gliderman9302 somewhat, but not due to the TDP; the laptop version uses GDDR6 memory chips instead of GDDR6X (bandwidth limit of 576.0 GB/s vs 716.8 GB/s)
@@gliderman9302 what's limiting the memory is the terrible-ass laptop
I was so confused as to why it only had 16GB of VRAM
IT IS a 4080, it's the same AD103 chip. The desktop 4090 uses the AD102.
In "non unified memory land" aka PC world there is a huge difference between the CPUs memory bandwith, the GPUs memory bandwith and the link in between.
50-70 GiB/s seems reasonable for dual channel DDR5 at limited clock speeds (4800-5600 MT/s), so the CPU numbers are correct, but ~16GiB/s is atrocious in terms of GPU memory bandwith. This is not the actual GPU memory bandwith but rather the PCIe transfer bandwith between the CPU and GPU. It's probably only running a PCIe 4.0 x8 link with 8 * 16 GT/s - overhead. That test is using the CPUs memory to perform GPU operations and not even utilizing the GPUs dedicated RAM. That's madness and in no way representative of the GPUs capabilitys. The 4090 mobile has a theoretical memory bandwith of 576 GiB/s and should be able to reach around 400-500 GiB/s in those "memory access" microbenchmarks (if they were actually testing GPU memory). I am running a 3080 Ti mobile (512 GiB/s theoretical) and get around 400-450 GiB/s depending on the test. CPU to GPU bandwith is still important, but basically all real world workloads (including AI training and inference) either copy the working set to the GPUs memory upfront or stream relevant sections in and out. For the first method the interface bandwith is neglegible as it could only affect startup time (loading the model) and that's basically always bottlenecked by the storage performance (it does not matter if the CPU GPU link is 16 GiB/s or 200 GiB/s if your drive only reads at 4 GiB/s). For the second method it's a bit more relevant, but even in that case the bandwith required to move sections of, for example training data is orders of magnitude smaller than the bandwith required to actively perform calculations on that data. This is due to the nature of GPU workloads where a lot of parallel operations are performed repeatedly on a bounded dataset. For each of those operations that data has to move in and the result out of the GPU core, but it does not have to move back and forth between CPU and GPU every time. The communication is limited to instructions about what to do with the data and occasionally new pieces of data that are transferred once, insted of over and over again. The results of those calculations might also be streamed back, but thats usually equal to or smaller than the Input and does not compete for bandwith, as PCIe is full duplex. If souch a high processor to processor bandwith is actually required, Nvidias NVLink exists and can do up to 1.2 TiB/s per link. The main benefit of Apples unified memory is a more flexible and efficient allocation of RAM as data does not have to be duplicated between CPU and GPU and the amount of RAM available to each is not fixed but dynamic. You simply can not get a PC laptop with more that 24 GiB of VRAM right now.
The 1 TiB/s number is due to the AD103's 64 MiB L2 cache. If the dataset of the test is small enough, it just sits in the GPU's cache.
Thank you so much for the explanation!
dayum, that's one hell of an explanation. It's really interesting to hear that the max bandwidth the 4090 achieved was due to its large cache relative to the size of the specific dataset being tested. Your explanation made a lot of sense and I understood most of it, but I still don't understand how you get such info from watching a video with limited details; truly fascinating. I was trying to figure something out on my own, and the best I came up with was that the Windows laptop wasn't plugged in (saw someone comment about that already). Does that make any difference in bandwidth compared to it being plugged into the wall?
@@aquss33 honestly, I almost completely forgot that the laptop was actually not plugged in. It might even have used the iGPU for some tests, those
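A minimal sketch of the distinction described in this thread, assuming PyTorch with a CUDA build; the buffer size and iteration count are illustrative:

import time
import torch

N = 256 * 1024 * 1024  # 256M float32 elements = 1 GiB

host = torch.empty(N, dtype=torch.float32, pin_memory=True)  # pinned system RAM
dev = torch.empty(N, dtype=torch.float32, device="cuda")     # VRAM

def gib_per_s(fn, iters=20):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return iters * N * 4 / 2**30 / (time.perf_counter() - t0)

# Host -> device copy: bounded by the PCIe link (tens of GiB/s).
print(f"H2D over PCIe: {gib_per_s(lambda: dev.copy_(host, non_blocking=True)):.0f} GiB/s")

# Device -> device copy: bounded by VRAM (hundreds of GiB/s on a 4090;
# the copy both reads and writes VRAM, so raw traffic is ~2x this figure).
dev2 = torch.empty_like(dev)
print(f"D2D in VRAM: {gib_per_s(lambda: dev2.copy_(dev)):.0f} GiB/s")

The first number is what a "peak read/write GFX RAM" style test over the bus sees; the second is what the GPU's own memory can do.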
IDK, didn't DirectX 12 eliminate the whole need to copy memory around in the first place?
you can't get current-gen PC laptops with more than 24GB of VRAM, but there are ones with last-gen Quadros with 48GB
My bro the dedication of just "casually" buying a brand new laptop with a 4090 for the tests, my wallet could never
Most places in the US have a comfortable return window where you get 100% of your money back.
He can just return it after a few days for this. Apple has a 2-week return window.
Also, resale value on these machines is very high within short windows
because it's not an actual 4090
@@petersuvara it's cool you guys can do it; in Europe once you open the box you cannot return it unless it's faulty
4:08 when you've been rocking Mac for so long you forget you need to plug in gaming laptops to get full power
Oh righttttttt!
Any Windows laptop is like this
@@CHURCHISAWESUM *gaming laptops with a discrete GPU. Any AMD mobile chip from 6th gen and up and any Intel Core Ultra chip have great battery life.
too bad you can't play any fucking games on a Mac.
@@rafewheadon1963 You actually can lmao. Stop throwing blanket and inaccurate comments around.
The read & write gfx ram is purely on the card executing opencl, while the peak write and peak read GFX ram measure the pci express bus speed. Transfering data over pci-e is a lot slower than just reading and writing to the memory on the gpu itself
Thank you for putting my exacts thoughts into words that are understandable lmao
so theoretically if we could have a direct GPU-CPU connection in a PC it would be 1 TB/s, as opposed to 800 GB/s for the M2 Ultra?
@@oloidhexasphericon5349 that's correct
If Nvidia's VRAM were used like unified memory is in a Mac, it would be 1 TB/s
But that extra latency is an issue apparently
Which is mildly disappointing as a gaming pc owner
@@himynameisryan I think this is more of a problem for laptops. Desktops usually have much better bandwidth, more PCIe lanes, etc...
@@Debilinside no the delay still occurs in my gaming PC. It's a real issue for some workloads but not mine
The title is a bit misleading, since the RTX 4090 and RTX 4090M (a nerfed RTX 4080?) are two different GPUs with different memory bandwidths and internal cache layouts.
This was a comparison of mobile machines. I haven’t done a desktop RTX 4090 test yet.
@@AZisk have you seen the size of a 4090? It’s not gonna fit in a laptop
@@user-jk9zr3sc5h I'll stuff it in.
@@AZisk After that please fit a Bugatti(VW) W16 into/onto a Vespa.
Just food for future videos ;)
@@user-jk9zr3sc5h sure it will, but it will have to sit on a dry ice block.
Video : Nvidia vs Apple
Ad : Samsung
Yeah, I think that was pretty clear. Nvidia's GPUs tend to hit 1 TB/s of VRAM bandwidth very easily, so if whatever you're trying to run is loaded into VRAM, then Nvidia would squash Apple any day of the week; bandwidth to system RAM, however, is much slower since it's running through PCIe. Games and the like tend to load data into VRAM so the GPU doesn't sit idle waiting for meshes and textures to load from system RAM.
The desktop 4090 and the Radeon VII are both at around 1 TB/s peak bandwidth.
I stopped this video halfway because he spent $3 grand plus buying an RTX 4090 laptop, then used integrated graphics and battery power for his testing. It should be a given for anyone technical: assign the GPU manually on both machines (even if it appears to be working on one), and always test on wall power, again to remove variables. Especially when testing a PC, because by default Windows dials things down to save battery when unplugged. With such a high-performance GPU, often the only real way to get decent battery life is to turn off the dedicated card, or else you'll have like half an hour of battery.
@@Honeypot-x9s The bandwidth test was done on the dedicated GPU on the PC laptop, not the iGPU, and battery is irrelevant here because the bandwidth doesn't go down while on battery; it's a simple wide interface that transfers data, with no power budget of its own to throttle on battery. Basically, the bandwidth is 1 TB/s no matter what you do.
@@jihadrouani5525 Everything is dynamically clocked these days for various optimizations. Even before where we are today, we had power states (still used today, but differently) with different clocks and power limits making up a curve. But these days GPUs can clock themselves dynamically and decouple their core clocks from their memory clocks based on temperature, power, power availability, thermal headroom, thermal saturation (more of a Radeon thing with STAPM/skin-temperature awareness), lack of utilization (specifically how often memory is being accessed), etc.
Plus, yes, Windows, in settings deeper than just the power options in Control Panel, will set a lower power state on the GPU when you unplug from the wall, and if you have a hybrid system it will almost always suspend the dGPU in favor of the iGPU. Either way it will noticeably reduce performance. Also, at the point I stopped it was 48 GB/s, which is perfectly in line with what I'd expect out of dual-channel system memory bandwidth….
@@jihadrouani5525 The 1 TB/s of bandwidth is not there all the time; the L2 cache is inflating the number. As soon as the data sizes get too big, the speed decreases to ~500 GB/s, which is still good but not 1 TB/s anymore
@@Syping The bandwidth actually stays the same as long as needed; it is not limited by the L2 cache. The L2 cache is less than 40MB; if that were the limitation, the bandwidth would be crippled within milliseconds. 1 TB/s can be sustained as long as the GPU itself can crunch that data in real time, and in gaming and many other use cases 1 TB/s can be sustained throughout the entire play session.
AppleGPUInfo isn't measuring bandwidth, it's just doing 2 * clock * (bus bits / 8)
That's what it felt like to me too. Very dodgy.
So in short: VRAM on Nvidia graphics cards is faster than the unified memory and GPU bandwidth on Macs, but the bandwidth between RAM and GPU on PC machines makes it 10 times worse than what the 4090 is actually capable of
best summary of the situation
FYI, a 4090 laptop chip is not an AD102 chip (used in a desktop 4090); it's really an AD103 (desktop 4080) chip. Try it on a desktop 4090.
it says RTX 4090 on the laptop. we all know the desktop 4090 is a heck of a lot more powerful, but this was a laptop test
I was definitely not expecting my local Micro Center to be featured in one of your videos XD
hey neighbor
The fastest memory bandwidth on a consumer GPU that I remember was AMD's Radeon VII, with HBM2 memory mounted directly on the GPU package. The memory bus is 4096-bit with a bandwidth of 1,024 GB/s
Sad we don't have HBM on consumer cards anymore. Anyway, the 4090 spec is 1008 GB/s bandwidth.
@@xeridea seems like the memory chip technology is the bottleneck here, since I doubt a 4096-bit bus can only do 1,024 GB/s.
Apple gets their numbers simply from the LPDDR5 spec. Each of those LPDDR5 packages does 51.2 GB/s, using 96 pins.
Agreed. Unified Memory Architecture is such a dumbed-down step back from the modular work Intel did with the south bridge, north bridge, and PCI. Sure, they can claim high speeds, but there is no modular connectivity between the CPU and the L1, L2, L3 caches. Wait until GPUs are running native PCIe 5 with HBM memory, aka AMD Vega from a decade ago. That's why Elon Musk wants to acquire every new AMD Instinct™ MI300X Platform card with an HBM3 memory interface. Apple Silicon is a joke meant to put an iPhone Pro Max in a laptop enclosure.
If you want to build out a multiuser AI workstation cluster, you need AMD.
@@chebrubin no, with the IMC integrated into the SoC these days, it's the most efficient way of handling memory.
@iDoWayTooMuchAcid efficient is the word. These days?
Why don't you go fanboy your TSLA and AAPL trades somewhere else. Integrated memory does not scale to more than 1 CPU. All these "cores" are meaningless when the bus is saturated with network IO. Apple's claims will not scale either.
There is a reason why Intel worked on bus, CPU, and GPU management architecture for 25 years before you woke up. Take your single-CPU/GPU code tests on 1 machine and take off your tinted Vision Pro headset.
A Mac Pro from 10 years ago will scale better for network IO under multiple connections.
@@chebrubin LOL, man, put down your handbook from two decades ago. Yes, it is far more efficient; if you understood it at a low level, like the silicon level, you would agree. There's a lot less running in circles being done: fewer cycles are needed, fewer misses, less latency, etc., etc., etc. Now, where did you see me say that it's the fastest? I didn't, because outright it isn't the fastest way, but a 4090 running at max bandwidth, while capable of running rings around this M chip, will also suck down a 💩-ton more power doing so. Also, fun fact: Macs and PCs have both been using unified memory wherever they can for a while now…. 😂
I said nothing about scaling across multiple CPUs, GPUs, etc., and why would I, when the platforms being discussed use SoCs, which are highly integrated? However, if you decoupled the entire memory subsystem and put it on its own internal bus, like AMD is starting to do with Infinity Fabric as the bus in their mobile APUs (and seeing impressive gains too), I could see it scaling further. I can't speak for what Apple has done; maybe they already have a similar memory subsystem setup, maybe not, I don't know.
@iDoWayTooMuchAcid precisely. Check with Apple Services what tech they are procuring for running their Apple AI cloud; it is probably AMD Instinct and Supermicro racks and cages. Alex is benching client laptop compute: 1 man, 1 machine, 1 C compilation runtime.
Let's discuss the new Apple Mac Pro with NO GPU bus lanes. This SoC was cobbled together last minute to ditch IA. No Thunderbolt external GPU. It is an iPhone Pro Max with all the RAM and SSDs soldered. Steve Jobs is alive and kicking. The Woz needs to help find a bus for Apple's SoC.
Next Alex goes undercover into NVIDIA HQ to buy a laptop with an A100 under the hood! Excellent content Alex. This is what separates your channel from all other review channels: the developer-focused reviews of the machines, and not only Macs. I have a 3090, and had an AMD R7 in the past with 1024-bit HBM2 memory. Man, the R7 cards were really great but couldn't do CUDA. A100 next, Alex? :):)
I think for the A100, I was considering renting one in the cloud to do some tests. Definitely not worth it for me to buy one yet.
I didn't know what unified memory did before this video. I thought they just meant it was mounted on the silicon. Great to know that the GPU can access it too.
It's not just the GPU having access to the RAM instead of having to go through the slower PCIe interface; the memory bandwidth is also so much higher than it is in traditional PCs.
I have seen *some* limited math / AI / etc. workloads, just on the CPU, that had something like a 100-1000x speed increase because of the unified memory, and it was the biggest speed increase they ever had in like 10-20 years of algorithm development. So it's not just the GPU. Yes, for normal workloads it doesn't necessarily give you a speed increase, but for algorithms that are memory-bound, not CPU-bound, and aren't streaming data off your storage device, you can have a radically huge speedup. Even without the GPU / AI engines.
The downside, of course, is that you can never upgrade it without upgrading the CPU, and Apple has also made that impossible; they are intentionally making their products as hard to repair as possible so you have to buy a whole new machine every time. Though nothing's stopping, say, AMD or Intel from making an all-in-one SoC that is socketed & upgradable....
I think every block in the package can access it. I remember reading on AnandTech that the bandwidth was not the same, but the encoders/decoders, CPU, GPU, and NPU all had direct access to the unified memory.
@@lamelama22 and now imagine: you have only 8 GB for everything, and Apple considers it enough, claiming it is like 16 GB lol )))
I mean, if you can't do it as a developer, you can always hang this MSI laptop on a silver chain and go be part of the 'hood.
buys a 3000 dollar machine to run a 5 minute test to debunk one question.
Returned it the next day tho
@NoOne-uh9vu in that case it's all good
I clocked my M1 Max at 296 GB/s real-world with asitop, before they removed the possibility in powermetrics. Even though it only has 24 GPU cores.
The key is actually using the GPU/CPU/Media Engine/Neural Engine together. Any one of them alone will not manage to saturate the memory bandwidth.
In my case it's when exporting from Davinci Resolve. I'm super happy with the machine.
no u don't need an A100, just get the desktop 4090...
the laptop 4090 is basically a 4080 specs-wise and also has a way lower power limit.
Something to keep in mind: the mobile 4090 uses the same die as the desktop 4080, with the same narrower bus. The desktop 4090 has 24GB of VRAM and a 50% wider bus. There is also the A6000, an 18k-"core" (vs 16k on the 4090) model with 48GB of memory on the same bus. The 4090 is $1600-$2000 and the A6000 is around $6000.
So if needs exceed a mobile 4090, there are other options if willing to go to a desktop. I have a 4090 and it’s a great card for hobbyists like myself, plus I play games on it sometimes.
Great video!
3:18 thanks for including your phone number in this video 😂
Noo it’s blurred now 😂
please send me🙂
1. The CPU can't use the whole memory bandwidth simply because the CPU cores aren't 'fast' enough to do it; more specifically, the cores don't have enough load/store units to do transfers fast enough to saturate the memory. They don't need them, since CPU workloads aren't that memory-bandwidth-intensive.
2. The 400GB/s Apple markets is the amount of data the SoC as a whole can pull from memory, on either combined workloads or very memory-intensive GPU workloads.
3. applegpuinfo doesn't actually benchmark anything; it just shows the theoretical bandwidth (6.4 GT/s * 512 bit / 8 = 409.6 GB/s).
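For reference, that arithmetic written out as a minimal sketch (Python; the transfer rates and bus widths are the figures quoted in this thread, not measurements):

# Theoretical bandwidth = transfer rate x bus width / 8; no measurement involved.
def theoretical_gb_s(transfer_gt_s: float, bus_bits: int) -> float:
    return transfer_gt_s * bus_bits / 8

print(theoretical_gb_s(6.4, 512))    # M-series Max LPDDR5: 409.6 GB/s (the "400GB/s")
print(theoretical_gb_s(6.4, 1024))   # M2 Ultra: 819.2 GB/s (marketed as 800GB/s)
print(theoretical_gb_s(18.0, 256))   # 4090 Laptop GDDR6 at 18 GT/s: 576.0 GB/s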
apple is trash.
Jesus!!! Finally! Thank you very much!!! I've been waiting for someone to do so for two years! No, seriously, truly, thank you!!!!
I’m a computer engineering and informatics student atm and I watch many tech videos every day for different things. You are by far the best channel I’ve come across, from setting up my pcs for programming to comparing the best choices for new hardware I want to purchase. What I like the most is that you are professional, you present real stats and you give advices from your personal experience as a programmer! That’s a real tech guy right there. Keep up the great work u deserve more followers!
Wow, thanks!
I tried to train the same model on my M2 Max 32GB and the swap went up to 200GB before it terminated. Even if it could access the 128GB it wouldn't be enough; I think it's going to need at least 1 TB of RAM. This is a very sad no-go on any Mac in existence as far as I can determine.
Why did someone decide that a Mac is good for training models? Some puny model for fun, probably, for some hipsters, but real work is done on Nvidia's specialized GPUs, which can be clustered to however many TBs you need. Also, you need to understand that there is a limit to how much memory you can place in a CPU package
@@s.i.m.c.a I don't know about "puny models for hipsters", but the main concern isn't so much hardware capabilities with models like Mistral 7B (which is very capable indeed), but time constraints. Foundation models of the LLM variety simply cannot be trained or even reasonably finetuned (if Mixtral 7Bx8 or bigger) on consumer hardware, full stop.
Just checking the model cards of even "just" smaller generative models like SDXL reveals that they take 10s of days to train from scratch on 256 GPU clusters...
Just for fun I calculated that it'd take about half a year of constantly running on a single GPU to train SDXL. Consumer hardware (especially laptops) likely wouldn't even survive that. Finetuning on the other hand should be perfectly doable even for smaller LLMs, provided a reasonably quantized checkpoint is used. It'd still take several days or even weeks, though.
Yep, I get it. I have had some pretty good models going on my Mac though. But yea, having a 4090 plugged into the wall with massive amounts of ram is the way to go for now. @@s.i.m.c.a
I love how he casually goes and buys a $3200 laptop. Damn, kind of a life goal for me :P
Well done Alex :D
Really only worth it if you need it to be extremely portable. Even a 1000 Wh portable battery (with standard AC or laptop DC output) plus an RTX 4070 Ti Super, a 4K 240Hz OLED, a decent base platform, and portable peripherals is around US$800 cheaper and much faster.
This laptop only replaces some US$2k worth of components plus peripherals when plugged in. Sadly the RTX 4070 mobile is only 8 GB (just like the desktop variant), so it's really not future-proof, even though it's the sweet spot for actually being usable almost to the fullest while on battery.
My bro found a really wide BUS image literally 😂
so basically the Mac is slower?
Yup, obviously. But it's improved so much, ngl; still not there
The Windows machine needs to be run while plugged in?
yes it does, just like every x86 Windows laptop
@@lesleyhaan116 Just like 99.5% of the Macs in the real world that would be doing this workload. I mean, it's a cool party trick and all, but who runs these kinds of things at a coffee shop?
@@TheRealMafoo Actually, Apple Silicon performs the same with and without being plugged in. That's one of the major benefits: you can still have the same performance without being plugged in, while still getting the same battery life.
Great video, just few things, if you don't mind:
* The unified memory trick that Apple uses in their silicon, AMD has had since the PlayStation 4 days, although AMD used GDDR and not LPDDR memory; still, it was unified memory
* No matter what Apple tells you about their bandwidth, GDDR6/GDDR6X is way, way faster. The RTX 4090 has 1.008 TB/s of bandwidth, more than twice what the Mac offers. The issue is that, as you explained, you need to copy the data to the VRAM.
* Other solutions are coming. AMD has a server solution with their MI300A APU, which has unified memory and 5.3 TB/s, and I'm pretty sure they'll think about a workstation solution.
All in all, Apple is giving you something nice, but unfortunately it's a dead end when it comes to upgradability. Just like buying the best TV you can afford: you've got it, and you cannot upgrade.
you are right, unified memory was not invented by apple, but i see lots of benefits of having it in a pc. if and when nvidia decides to make their own SoCs for computers, they might come up with something that really kills. otherwise we're just waiting for the likes of Intel and AMD to come up with their own, but they are far behind.
i like my tv, use it, it serves me well, and I never think about upgrading it. I like your analogy, but i think times are changing. I don’t particularly have the time capacity to tinker with upgrades. id rather just sell my old mac and buy a new one, like i do with my phones.
@@AZisk i think you are talking about the laptop market, cause in anything else Apple is way, way behind.
unified memory has been a thing since the SGI days.
Memory bandwidth doesn't mean anything without the Tensor Cores that Nvidia cards have! This is the reason why people are literally fighting each other to buy 3090s/4090s, not Apple products at all. Sure, you can run large models on a MacBook, but it will be painfully slow despite the large VRAM capacity and bandwidth...
Wow, great video! Thumbnail and editing are awesome 🤩😄
Thank you so much 😁
Apple advertises what's convenient for them, obviously... BUT the new Macs are still incredibly powerful compared to Windows laptops. If we also consider that Windows machines need to be plugged in to deliver full power, and only last a few hours on battery... there is no competition. I'm not an Apple fan, as I always used, and still have, powerful Windows laptops. But this year I decided to also buy a MacBook Pro with an M3 Max and OMG, this thing is a completely different monster. It completes heavy tasks in a few minutes while also playing music, working on heavy files, running the browser, and other stuff simultaneously like it's nothing. The speakers are awesome and the screen is the best I have ever seen on a laptop. And I can go a full working day unplugged from the power source. My maxed-out MSI GE67 Raider looks prehistoric in comparison.
I can't wait for the M4 Max laptops to release.
My base M1 MacBook Air was the best laptop I ever had.
I was even able to create mobile applications using Unreal Engine without too much trouble, but this definitely was the limit of that machine.
Whoa, that was an AWESOME video, and one that we really needed; this memory bandwidth topic is very undercovered on YouTube. Nice to see some actual numbers. Good job!
I've been whining (in comments on YT, because I'm a nobody) that Intel and AMD need to wake the F up and provide more memory bandwidth. The pitiful 128 bits (aka 2 channels) vs the 512 bits (aka 8 channels) that the M Max chips provide, not to mention the MASSIVE 1024 bits (aka 16 channels lol) that the M Ultra provides, is just weak. OK, for casual stuff it totally doesn't matter. But for $2000+ laptops and desktops (because it's the same limit on desktops too) there are people who get into workloads where this matters. And on mobile there are many chips with iGPUs too, which need it even more. AMD's Strix Halo with its 40 CUs (really a desktop-class number of GPU cores) will be limited as hell if it only has a 128-bit bus. I guess it has a saving grace: if the system has at least 32 GB of RAM, it can at least keep everything in RAM and only has to read from it from an early point.
From what Coreteks said (well, I'm not so sure of his predictions, buut) he did say, and argued, that Intel's Lunar Lake will have DRAM on the chip, like the M chips have. FINALLY, something to compete at really low power.
Nice test and a very informative video. Thanks for sharing.
Honestly, yours is one of the best channels for these comparisons! Love your AI and machine learning content, and I've personally had good experiences with my M2 Max running those benchmarks too
Apple is quoting the memory bandwidth available to the whole SoC. The SoC is a “system on chip”, and a critical part of that system is the interconnect fabric that everything communicates through. Each part of the chip has its own connection, and that connection doesn’t necessarily provide bandwidth equal to the full bandwidth capability of the memory interface. The GPU is likely getting the biggest pipe to memory because it is typically doing the most memory-intensive work. The CPU complexes don’t (well… haven’t previously… need data on the M4s) get as much, because the CPUs can’t typically take full advantage of it. The same applies to other components. With multiple components working at once, I would expect to see the memory subsystem hitting the quoted numbers. It’s just not possible to measure that without some advanced tools that Apple may not even release.
It’s not a lie, it’s just more complicated than can be expressed in one number. A lot more complicated. But then this is true on PCs as well, with on-GPU VRAM bandwidth, PCIe bottlenecks, etc. Be wary of simplistic claims about complex systems… they’re likely at best over-generalizations.
Thumbs up for the amount of research done.
The test on the laptop was done while the laptop was not plugged in. Laptops will limit themselves to barely usable without being plugged in. Found that out the hard way.
Haha I love the editing of the unboxing of that gaming laptop
Also, Apple's problem is that its chip has the theoretical speed, but the software doesn't let you use it, whereas with Nvidia you can use the GPU to its fullest, given you put in the effort.
did you test the MSI laptop while plugged in?
I am not sure about the setup you used with the 4090 laptop, but was the testing done with WSL on Windows 11, or in a VM? If so, don't you think it's worth properly testing it on native Linux and comparing these cases?
Good point!
Also, the latest versions of the Linux kernel do some memory magic. Worth looking into that vs Windows.
Also, a laptop 4090 is not a real 4090... It would only be comparable to a low-wattage 4080.
@@jksoftware1 the comparison is between laptops. Not a desktop and a laptop
@@game_time1633 Then the title should be changed to "REALITY vs Apple’s Memory Claims | vs a laptop RTX4090", because it's deceptive otherwise: the laptop 4090 uses the AD103 chip while the desktop 4090 uses the AD102 chip. They are completely different.
I have 8x H100 servers here, and 128GB M2 Ultras. Other than transformer-optimized training that blows past 100GB memory windows, the Macs are amazingly capable and are far beyond any 40-series, for about $400K less than the H100 servers.
I recently got a maxed out M3 Max and to test it I ran the Transformer LM example from the MLX repo - the machine was casually sitting at 123 GB memory usage with no swap being used at all. I changed the batch size to 32 and iterations to 100 in the sample code. It ran at 0.128 iterations per second over the 153.883 M parameters. The MLX framework is looking unbelievably promising
i wish i had that setup :)
There's a risk-free way to change the max memory available to the GPU.
It's a simple terminal command, and it resets at reboot.
It sets the new maximum memory you want your GPU to have access to. (Leave 8GB for the system and it's all stable.)
sudo sysctl iogpu.wired_limit_mb=57344
Example for a 64GB model (57344 MB = 56GB, leaving 8GB for the system).
In the Mac memory model, the CPU, the GPU, and all the IP blocks can hit memory simultaneously, without moving data, and get some pretty high memory bandwidth.
In the Win memory model, the CPU formats a GPU request and data in main memory, compresses it, and transmits it over PCIe, where the GPU receives the request and data, decompresses it into VRAM, and runs the request. If it's a compute request, the results are compressed in VRAM, transmitted via PCIe, received by the CPU from PCIe into main memory, and decompressed. If we're talking about an iterative request, the data flows back and forth over PCIe as many times as necessary.
So … on the graphics card the GPU can hit the VRAM at tremendous speed, but that speed is bottlenecked by the PCIe transfer speed, which is around 50 GB/s.
This is the marketing model employed by PC designers because it keeps CPU, GPU, and motherboard vendors happy and separate and distinct, allowing each to play in their own sandbox and sell their wares to consumers independently, but the reality is the _“secret”_ Win overhead every x86 user pays to keep everything separate.
Wintel graphics cards have insane speed once the request and data have been set up in VRAM, but there are a _lot_ of steps they have to go through to get the data into VRAM, and a lot of steps required to return graphics-card results to the CPU's main memory.
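A minimal sketch of the two paths described above, assuming PyTorch with either a CUDA build (discrete GPU) or an MPS build (Apple Silicon); the matrix size is arbitrary:

import torch

x = torch.randn(4096, 4096)  # starts in CPU/system RAM

if torch.cuda.is_available():
    # Discrete-GPU model: data is staged into VRAM over PCIe first,
    # and the result crosses PCIe again on the way back.
    xg = x.to("cuda")
    y = (xg @ xg).cpu()
elif torch.backends.mps.is_available():
    # Unified-memory model: CPU and GPU share the same physical RAM,
    # so there is no PCIe hop (the framework may still copy at the
    # API level, but never across an external bus).
    xg = x.to("mps")
    y = (xg @ xg).cpu()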
I must express my admiration for the extensive and informative data and analyses that you have presented to us. However, I am intrigued to learn whether MLX could be used to further maximize the performance of the already exceptional M series?
video on the way
Here’s my theory: the M2 Ultra DOES have the 800GB/s; however, one chunk of it is used by the OS kernel, while the rest is what we see.
Apple never stated that the whole bandwidth can be taken advantage of solely by the GPU. The bandwidth number is for the whole SOC.
Workloads that stress CPU, GPU, and media engines all at the same time will take full advantage of the full system memory bandwidth.
Nice and interesting video, as always! Thank you Alex
You might also determine whether you are using the onboard GPU vs the 4090. You can force certain applications to use one or the other.
Just wanted to add some info here. To my knowledge, Nvidia uses compression on the memory bus to help save bandwidth. The memory controller does this in real time and transparently. This is why you see higher effective bandwidth than advertised. Obviously real-world numbers will vary tho.
Wasn't there any way of running the tests natively on Windows and not through WSL?
I am not sure I understand the comparison. The RTX 4090 in a laptop is a mobile version with a low TDP. For a real test, shouldn't one use the desktop version?
The RTX 5090 is said to be on GDDR7 with 10 TB/s+ memory bandwidth, and all of that speed at half the power consumption
The VRAM bandwidth is likely calculated. The 4090 has a 384-bit bus, which is much wider than the 64 bits per channel on DDR5. These are GDDR6X memory chips
You can add more system memory to that laptop.
You can upgrade to 48GB DIMMs, and if it's the quad-SODIMM machine I suspect it might be, you can go up to 192GB of system memory.
Running 96GB in my Asus Scar 18 (2023)
but that's not VRAM
I feel like Micro Centers are always next to a discount clothing store; my local one is next to a Burlington Coat Factory
The 1 TB/s on the 4090 is in line with the specs. This is essentially the bandwidth of the RAM on the card. The other numbers are all going across the PCIe bus and are also affected by system RAM and CPU speeds. Apple is lying a bit saying their bandwidth is 10x that of the fastest desktop card, when in reality it is less.
Curious how it affects actual model-training speed. As far as I know, software like TensorFlow is still optimized for Nvidia (something I don't like). Is the 4090 still faster than the M3 Max during training?
Intel kinda does this: "this new CPU is 30% faster!!" but they never say what the comparison is against
bro looks like Neo while casually buying a top of line 4090 laptop
I am wondering if there would be meaningful differences between the RTX 4090 (the original, not the laptop version) and the M2 Ultra
Nvidia's laptop variants are and were always kinda mislabeled (apart from Pascal GTX 1000, where the laptop 1070 even had more CUDA cores), having fewer CUDA cores, lower clocks, and often less VRAM.
This "RTX 4090" laptop has less performance than a desktop RTX 4070 Ti Super in most cases (or even always, when on battery).
@@whohan779 and it's severely limited in TDP.
Yes, the 4090 is much faster, I think 2.5x to 4x faster (depending on the model quantization), but only as long as you don't go above the VRAM. People usually use 2x or 3x 4090s for inference, and that's how you run the biggest models really fast. Also, you can use the 4090 for training.
The 4090's memory bandwidth is actually supposed to be 1 TB/s, but not the mobile version...
That beast is 3k. In 5 years you will need to spend another 3k to replace it
It's also nonsense for mobile usage unless you severely underclock it (below regular spec), as it's essentially a more efficient RTX 4070 Ti Super; even the largest airplane-permissible batteries could only power it for about half an hour under load (that's why most of these laptops underclock it below what the battery can deliver in wattage).
You're likely better off buying a cheaper model with an RTX 4070 or below and having a real desktop 4090 with a 5800X3D or smth for gaming.
Oh sweet! You shop at the Rockville micro center that I shop at frequently! Hope I get to run into you sometime. 😊
What would be good to see, due to the massive difference in architecture, is actually running something useful on both, and seeing how they compare.
Would be curious how it stacks up against a desktop 3090 / 4080 / 4090. Obviously you're testing laptop to laptop rather than desktop parts but it would be neat for conjecture.
Desktop parts will be much faster.
The 1.05 TB/s number is "effective bandwidth". Only really reachable when whatever you're doing fits into the GPU's cache (which is rather large on Ada Lovelace compared to other generations).
What Nvidia advertises is the typical bandwidth you could expect in most applications.
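A rough sketch of how one could see that cache effect, assuming PyTorch with CUDA; working sets small enough to stay in the AD103's 64 MiB L2 should report far higher copy throughput than VRAM-sized ones:

import time
import torch

for mib in (16, 64, 256, 1024):
    n = mib * 2**20 // 4                  # buffer size in float32 elements
    a = torch.empty(n, dtype=torch.float32, device="cuda")
    b = torch.empty_like(a)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(50):
        b.copy_(a)                        # small buffers stay in L2 cache
    torch.cuda.synchronize()
    gib_s = 50 * mib / 1024 / (time.perf_counter() - t0)
    print(f"{mib:>4} MiB working set: {gib_s:,.0f} GiB/s copy throughput")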
Alex, the GPU can use ALL the unified memory, all it takes is a shell command. ~75% is just the default setting.
Maybe the default is set that way to reserve some RAM for macOS? I can't imagine the OS running well when it's memory-starved.
3:45 Now that was a smooth Unboxing 😎
A 128GB M3 Max is capable of running the unquantized version of Mixtral 8x7B.
For the dGPU you showed us RAM-to-GPU-memory operation speed, which could be 15GB/s, even though it seems a bit slow. Maybe your laptop was on battery? For VRAM-to-VRAM dGPU tests, check the clinfo CLI tool and the GPU-Z utility.
I was curious why Apple made their own chip and didn't use an Nvidia GPU like the others. Then I remembered this is just Apple's MO: since I used a Mac LC II in the '90s, Apple has refused to outsource and/or let their machines be upgradable/customizable. Just like their "Apple Intelligence" instead of calling it AI like the rest of the world. It's what they do: everything is in-house, and consistent marketing is a top priority.
You can't expect the advertised memory bus bandwidth (which is simply bit width x frequency) to match the bandwidth you measure in real usage, as any measurement is simply a combination of multiple bottlenecks; a good measurement there probably requires a deeper dive into the platform architecture.
what are your expectations for the Snapdragon X Elite?
The new RTX 40 series has a massive increase in L2 cache. This is primarily how Nvidia is boosting their GPUs' memory bandwidth; they just don't advertise it like AMD does with their L3 cache. AMD calls it Infinity Cache and claims it offers a massive memory bandwidth increase while keeping the memory bus width as small as possible. The RTX 4090 laptop GPU has 64 MB of L2 cache, so that 1 TB/s bandwidth isn't surprising. Over on the AMD side, the RX 7900 XTX also has 96 MB of L3 cache, and AMD claims that gives it a bandwidth of 2.7 TB/s. Yes, terabytes. Their current fastest mobile GPU, the RX 7900M, also has 64 MB. Also a request: I am looking into buying a laptop with an AMD GPU; could you check out AMD GPUs for AI and see how they compare against Nvidia and Apple GPUs?
i was thinking of putting up a comment to try and get you the 40-series mobile cards for comparison, but here you are. actually, people usually compare benchmarks which are made for x86 PCs, and the Mac has to go through lots of translation layers... but ML isn't that way. so for comparison (i know it will be in the making already) we want raw GPU perf and ML perf based on model execution, training, max batch sizes, etc. compared between the two GPUs... thank you for exploring this niche genre in tech.
Sounds like they are advertising memory bandwidth for the GPU. The 4090 laptop version has 576.0 GB/s unlike the desktop version with 1,008 GB/s.
is the M1 Max still good in 2024, for the next 5 years?
800 GB/s is great and all, but the real problem with Apple Silicon is the memory clock speed and the processor clock speed. The GPU's clock speed is 1400 MHz, and there's a lack of shader cores. Apple's CPU is at the top of the game. However:
The M3 Max has 5120 shader cores with a GPU clock running at up to
Does WSL not add a lot of overhead?
Windows laptops often have a dynamic PCIe link, which might only use one lane while idle. This could change the bandwidth as well.
The 4090 (mobile) memory speed shown is mostly PCIe speed; it even states that explicitly in the logs.
However, things go stellar when the operations are limited to the GPU memory itself. It is absurdly fast, and it's designed that way to be able to render an image using the entire memory capacity at once at interactive framerates; that means being able to read the contents of the VRAM many tens of times every second.
So Apple's claim that those 400 GB/s are 10 to 20 times faster than the latest PC (and even then, at a single and very specific task) is laughable.
Marketing claims are always subject to specific definitions and configurations, especially when it comes to benchmarks. Apple isn’t doing anything different from every other manufacturer out there.
For some reason the quality setting for this video is grayed out in the mobile app (Samsung S5e here), leaving it fixed at an abysmal value; it looks like 240p. First time I'm facing such an issue. BTW: great content, as usual.
sorry to hear that, I hope it was just a one-off for you. I checked the quality on this side and it seemed ok before i published
2:11 where's the link?
I wonder if you could make use of the DirectStorage API in machine learning applications; that way, theoretically, you could bypass the CPU entirely.
if you want to transfer data from or to the RAM, the CPU cannot be bypassed, end of story; stop repeating that BS. DirectStorage just relieves the CPU from doing the transfer itself, same as RDMA or any DMA transfer.
@@giornikitop5373 Well, I made a mistake; most news explains it as bypassing the CPU and sending data directly to VRAM. Next time try being nicer when explaining to others when they make a mistake.
@@Fenrasulfr you don't get to tell me what to do.
@@giornikitop5373 You don't get to be an ass to others just because you know a little more information on one specific topic that the vast majority of people don't give a sh*t about.
@@Fenrasulfr WHAT DID I SAY?
The 1 TB/s is the bus speed. If you open up GPU-Z and overclock the memory, you'll see the GPU's bandwidth change proportionally to the memory's clock speed. MacBooks have a governor limiting their maximum memory clock, which is why it'll stay around 400GB/s +/- 10GB/s unless it thermally throttles
The latest Nvidia 40-series cards have increased L2 cache. The mobile 4090 has 64MB, compared to the previous generation, which only had 5-6MB. This may explain why the system measures 1 TB/s of bandwidth out of a 576 GB/s card.
It's common practice to advertise the theoretical bandwidth. The actual bandwidth achieved is too dependent on the application, run conditions, and machine architecture. Unified memory also means that memory bandwidth is shared between CPU and GPU actors. It is very naive to try to infer bandwidth from benchmark results under these premises.
Bro was searching for a reason to buy a new Windows laptop with a 4090 GPU, and he found one... Found my reason too, but my pocket is so compelling...
Would be good to see if Resizable BAR changes the performance of the 4090m and by how much
A couple of questions:
1. How did you test STREAM across multiprocessors? If it's using MPI, then MPI doesn't support shared-memory nodes.
2. A lot of these 3rd-party tools can have huge bugs; it would be great if you could mention that as well.
If you're planning to test ML models, use MLX for Apple Silicon Macs and CUDA-optimized PyTorch for Nvidia. The reason is that, although PyTorch works with MPS, it's not really using the shared-memory concept properly. (My guess is it's due to how PyTorch tensors operate: data is copied between the GPU and the CPU based on the device attribute.)
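A minimal sketch of that backend split, assuming PyTorch is installed on both machines (an MLX script would replace the MPS branch on the Mac); the tensor shapes are placeholders:

import torch

# Pick the native accelerator backend per platform.
if torch.cuda.is_available():
    device = torch.device("cuda")   # Nvidia: CUDA-optimized kernels
elif torch.backends.mps.is_available():
    device = torch.device("mps")    # Apple: Metal backend; note that
                                    # .to(device) still copies tensors
else:
    device = torch.device("cpu")

x = torch.randn(8, 1024, device=device)     # placeholder workload
w = torch.randn(1024, 1024, device=device)
y = x @ w                                   # runs on the chosen backend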
Why did you buy a laptop? The desktop is where the power is.
Mac users am I right hahaha
Will Apple allow eGPUs again? That would help greatly with ML. Or possibly the Mac Pro will get ML Afterburners?
No. Best bet is to expand memory access. M3 came before ML, expect Apple to create something nifty.
@@tonyburzio4107 they can come up with something new, but that won't get us the computational power needed for LLM training or inference. The M4 won't get us close to a couple of 4090s, and 4090s won't get us close to A100s.
If Apple wants to play in this space and get those fat hardware $$$$, then it will need something like Afterburners for ML.
An ML Afterburner could easily just be their current-gen GPU in mass quantity, with gobs of RAM, interconnected to other Afterburners with an insane fabric.
That would probably only work in Linux or a heavily modded macOS (if that wouldn't set off an integrity violation of sorts). Apple likely doesn't even have support for a Radeon GPU that actually shipped with an Intel-based Mac (of any kind, including cheesegrater & trashcan).
I'd be curious to know if that kind of system RAM bandwidth has any benefits. My guess is maybe, it depends on the use case, but not really.
Thank you for this video. Obviously this is an apples-versus-oranges comparison, pun intended. Apple has done an excellent job, seemingly increasing performance with less hardware, if we can consider it less. But it is certainly a simpler, more efficient way to get the job done. AMD and Intel are trying to catch up, in a way, with their integrated graphics. But still, well done Apple.