I watched the entire 20 minutes. As a long retired chip designer, this 90B active device world is incomprehensible, but I enjoy following anyway.
this channel is so good, idk why it doesn't have more subs; all the videos are excellent
Yeah it was an interesting video. I am looking forward to a deep dive on the CPU architecture! The P and E cores got 25% more transistors; what are those used for?
Fascinating!!!
Same.
@silverc4s146 How do I start a career as a chip designer? Guide me, bro.
Easily the most informative M3 breakdown. Kudos.
Sorry for the long wait, the video got longer and longer the more I worked on it... Let me know if you enjoy these (very) deep-dives, or if it's too long/detailed for you.
PS: the dynamic caching doesn't have anything to do with the system memory; it's about the on-chip GPU memory. The whole GPU seems to be a complete game changer, something a lot of people seem to have missed. This might very well be the most advanced GPU architecture right now, and it will take a while until we see its full potential.
thanks for your hard work!! 🙏
Love these deep dive videos!
I would love long, detailed deep-dive videos. Great job!
I enjoyed it, thank you! There are many YouTube videos reviewing the M3, but no other video I found is like this one.
But how come it's still not as good as an RTX 4090, or even a 4080, power draw aside? Plus, I thought ray tracing by Nvidia was a major architectural innovation.
And if Nvidia already did this, minus the register implementation, it just means Nvidia had this before Apple.
As an embedded and FPGA engineer, CPU design like this has always felt like the major leagues. Watching this video must feel to me like watching a sports game with good color commentary feels to a typical American. Thank you for producing this
As a self-taught hardware designer, I can relate. Why does the transistor count on the cores increase in the next models? What new things are implemented? How does everything work in harmony? How does it schedule tasks to all parts of the computer without struggling?
Neither I nor anyone else is capable of completely understanding how this works. An engineer at Apple knows way better than us, but even then it's almost impossible for one person to know all of this; it's insane. That's why I think these machines are miracles, almost magical! But actually it's the hard work of many incredibly smart people.
So almost all of it is out of my league, despite the 4 years of research I've dedicated to the topic. 🤯
@yoomy_gums You basically have to start with the history of the architectures, and the ways they improved on the previous generation, like say the 486 vs. Pentium vs. Pentium Pro. Yes, these are ARM, but these days instruction sets make far less difference than the architecture fundamentals.
Same. I love all the intricacies of GPUs and CPUs, even ones with integrated graphics on the same die, as Intel and AMD still do in some chips.
But massive dies with everything embedded into them (REALLY making "System on a Chip" mean what it says) are just incredible. Of course there is more wasted material (making such massive single dies also means more dies having defects), but it is still such a respectable "balls to the wall" approach!
This is why the Snapdragon X Elite is so exciting to me!
@yoomy_gums I guess the transistor count increased in the M3 with the ray tracing hardware
@@yoomy_gums "why the transistor count increases on the cores on next models"
For starters, they went from ARM-v8.5 to ARM-v8.6.
It also depends on which extensions they actually implement.
Still watching and man, these deep dives are so fascinating to learn more about silicon design and engineering in our current era. Absolutely amazing work!
Here with you. Very fascinating.
I'm watching all the way through :p
Who wouldn't watch your entire breakdown of Apple's silicon? Personally, I enjoy how this channel focuses on the less-talked-about features of hardware design; it really makes you understand how much a company does or doesn't care about a product they are launching into the market. Keep up the great work, I cannot wait to watch more of these breakdowns in the future!
Great video, I was really looking forward to this one! On the "Dynamic Caching" in the new shader core (aka register file + imageblock + threadgroup memory = L1): you've watched Apple's video already, so I'll try to add some additional practical context on why it's important:
It doesn't require new shaders to be written; old shaders can take advantage of this feature as-is. However, most shaders were indeed written with the limitations that came before it, so the big advantage will only be felt by shaders that had low occupancy previously and can now maybe have higher occupancy.
A lot of shaders are written to read a bunch of buffers and a bunch of textures at some point, typically early, and at that point they greatly benefit from high occupancy to hide latency and avoid stalling. But typically, later in the shader, you do a bunch of math that requires a lot of registers for a short time, and in the old method this spike in register count forced the whole shader to demand many registers the whole time, even though for fetching buffers and textures it only needs enough registers to store the read results just then.
So the benefit here is that you get to have low register pressure when you need high occupancy early in a shader to hide memory latency, and later, during the "just math" part where you don't need occupancy to saturate the math units, you can go nuts with registers. Having the freedom to use many registers allows better algorithms that take advantage of large register counts without worrying about hurting memory latency in another part of the shader.
It also provides freedom: you don't have to spend a lot of optimization time hunting for a magical register count, the shader core does it for you (almost; you still need to make sure you don't need many registers at the time of doing these memory reads). And most importantly, you can now make dynamically branching uber shaders that don't trash your register file usage! Previously we've always had to make many shader variants for specialized cases and compile them either at build or run time, because a huge shader with tons of branches would have register pressure as bad as the worst-case "everything is on" scenario. Now the register pressure is dynamic based on what's enabled!
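To make that concrete, here's a rough Swift sketch of the two-phase pattern, with the kernel compiled from an MSL source string (kernel name, buffer layout, and loop count are all made up, not tested):

import Metal

let source = """
#include <metal_stdlib>
using namespace metal;

kernel void shade(device const float4 *albedo [[buffer(0)]],
                  device const float4 *lights [[buffer(1)]],
                  device float4       *out    [[buffer(2)]],
                  uint gid [[thread_position_in_grid]])
{
    // Phase 1: memory-bound, few live registers. High occupancy is what
    // hides the latency of these loads.
    float4 a = albedo[gid];
    float4 l = lights[gid];

    // Phase 2: math-bound, register-heavy. With static allocation this
    // spike would cap occupancy for the whole kernel; with dynamic
    // caching each phase only pays for what it actually uses.
    float4 acc = a;
    for (int i = 0; i < 64; ++i) {
        acc = fma(acc, l, a);
    }
    out[gid] = acc;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)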
I probably got some parts wrong but I think it's really interesting how much having an L1 cache changes for shaders.
I believe the new shaders he's talking about are mesh shaders, which do have to be completely rewritten. It's why in the PC space there was an uproar when Alan Wake 2 was released: it's the first major game to use mesh shaders, and it made older GPUs obsolete since they can't run them.
@BurritoKingdom Ah yeah, my bad. Just want to add that mesh shaders and amplification, as well as ray tracing, have been part of Metal for some years, although internally running as a software implementation, so developers have been able to write tech that took advantage of these pre-emptively. I know Octane used the ray tracing API for a while before the AGX9 came out, but I don't know of anything that has taken advantage of the mesh shading API, so that indeed would be novel to see used now.
While it does not require shaders to be rewritten (which is very nice), you can get a good bit more perf by making changes.
It is common to break up long-running shaders into smaller shaders where each of these smaller shaders has a more constant register/threadgroup usage. This adds some overhead as you dispatch extra shaders, but on older GPUs it results in better average occupancy, as the parts of your application with lower pressure can run at higher occupancy than if you just dispatch a single longer-running shader that has some very high peak local memory or register usage.
This new dynamic register/local/cache system means you can now just stitch all these shaders together (reducing the dispatch overhead), so you can have much longer-running single-dispatch shaders without the occupancy hit that this has on most other GPUs.
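Roughly, the host-side difference looks like this (Swift sketch; the pipeline states are hypothetical stand-ins for the per-phase kernels):

import Metal

// Pre-dynamic-caching pattern: one dispatch per phase, so each phase's
// occupancy matches its own register/threadgroup footprint.
func encodeSplit(_ enc: MTLComputeCommandEncoder,
                 phases: [MTLComputePipelineState],  // e.g. gather, math, write
                 grid: MTLSize, group: MTLSize) {
    for phase in phases {
        enc.setComputePipelineState(phase)
        enc.dispatchThreads(grid, threadsPerThreadgroup: group)
    }
}

// Dynamic-caching pattern: all phases stitched into one kernel, one
// dispatch, no worst-case register penalty across the whole run.
func encodeFused(_ enc: MTLComputeCommandEncoder,
                 fused: MTLComputePipelineState,
                 grid: MTLSize, group: MTLSize) {
    enc.setComputePipelineState(fused)
    enc.dispatchThreads(grid, threadsPerThreadgroup: group)
}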
I’m a high school computer teacher and I played it for my students. My students love to keep up with the latest chip news. Thanks for sharing!
Legend
Love these long deep dive videos. When executed well they provide extraordinary value. Time is valuable and this video did not disappoint. Keep up the great work!
I'm certainly watching your every video till the end! Just recently discovered your channel and it's a godsend in terms of amazing in-depth explanations of how exactly all those performances and features are achieved and realized on the silicon level!
I've always wanted for someone to explain things like that, like on a truly low level - in terms of hardware - literally talking about transistor counts and how it's all allocated on a chip, designed, interconnected, etc.
Thank you so so much for what you're doing on this channel! Keep these amazing videos coming!
I watched the whole thing and subscribed. This was a very nice level of analysis for me, and I think you did a great job of overviewing the changes.
It seems to me that this gen takes to heart one of the original RISC tenets: spending transistors on caches (vs. CPU logic, etc.) is a huge win. The tricky part, also from the RISC heritage, is that you need compilers that can take advantage of the opportunities for caching (and the exposed opportunities for parallelism).
I enjoyed your video a lot. Thanks.
You have a gift for articulating these subjects. I have zero chip background but was easily able to follow through to the end.
I'm in a somewhat similar industry, trying to rebalance our product line portfolio and create distinct segmentation, and I know how many meetings it takes and how difficult it is. I'm sure there was a ton of stress among folks at Apple (and thus a ton of meetings) when they re-planned the M3 Pro and people called it a "downgrade". I can see the product planners and engineers arguing in my head. Watched the whole thing and subscribed. Thanks for doing this.
It doesn't matter how the transistors change; I just know that the product-segmentation knife work is superb, and they are getting expensive again.
I watched the entire video and I don't think 20 min is particularly long for this kind of content, you did a great job 👌👌
11:00 I just paused to comment… I’m watching every second because I haven’t found this level of detail about these chips said in such a succinct way. Thank you for keeping it entertaining and informative.
I really wish Apple would give more details about the dynamic caching stuff. I read the patent filing and it looks interesting. I was hoping the new GPU design is optimized for training ML. Watched the whole video. Hopefully as more people analyze the chip, you can update and identify where the dynamic caching logic sits on the Max chip.
If you do ML, just stick to CUDA...
@pham3383 Or wait for the day Apple stops being Metal-exclusive and adopts something like OpenGL at a hardware level
@Demopans5990 I think you meant Vulkan ;)
Thanks for such a detailed analysis. I find these deep dives really interesting and I'm pretty sure many others would agree too. Would surely love to see more of these in future. Cheers!
More coming for sure. Thanks for your support, much appreciated!
Just completed watching this video. As a current chip designer, absolutely love your content and this video in particular was very well done. Would like to see more deep dives like this video.
More to come!
I watched the entire 20:12 video, and it was very informative. I personally appreciate these deep-dive technical analysis videos; I learn a lot more about semiconductor engineering and about the hardware we all take for granted. I am deeply fascinated by where the industry is headed with these process nodes and their optimizations.
I always watch your videos from the beginning to the end, since your content is excellent. Thank you again this time.
This is an extremely technical and amazing video. I'm a physicist who has worked in nanotechnology, and a computer scientist, and I was lost at times during this video. Amazing job.
Don't worry, 20 minutes for such a subject is definitely not too long. It would be great to go even deeper. Anyway, great and informative video! 👍
It’s remarkable how much effort you’ve put into producing and researching this, keep it up! 👏
Superb deep dive, incredible detail you're covering here. Silicon has come a long way since my early days in semiconductors in the '90s.
I love your channel. It's more or less unbiased, praising where praise is deserved and criticizing when it's due. I also love the level of detail in your content.
This is exactly the type of breakdown/content I was looking for. Really loved watching it; I need to deep-dive into each topic and learn more 😀
The number of transistors on a single chip is breaking my brain. But then, the last chip I was involved with was a long time ago (I ended up going into more software work, as we were building software to test and validate our chip designs, and that ended up being where I found my love of software engineering).
I watched the whole thing. You're, at the moment, the closest thing we have to AnandTech that I can think of. I'm a big fan of Chips & Cheese as well, but those folks sometimes take a long while to get around to the latest chips the way AT used to. Thanks for confirming the A9 family GPU is entirely new; some did not believe it. As well as having the common sense to see the M3 Pro is not a downgrade: it's its own custom-designed chip aiming for something different than the M3 and the M3 Max.
This is the most intensive yet easily digestible piece of video. I'd say it's not long; it's so full of info that it never felt long. Let's see what Qualcomm does with their designs after their new acquisition.
Always watching everything. I am an electrical engineer and learned chip design in the past. Today I develop car safety functions at a large engineering company.
Still watching
I'm in LOVE with these deep-dive videos, and I don't even feel like I just watched a 20-min video. Please keep this video style as long as you can 😭
This was the best analysis I've seen so far on the Apple M3 chips, especially on the M3 Pro chip and its design purpose. It makes so much more sense and is more constructive, objective analysis than the 99% of other reviewers simply bashing Apple without a clear explanation.
Wow, this is the most in-depth look at this chip I've seen. And I can understand it. Thanks man!!
Watched 'till the end :) I'd very much like to see a price estimate for these chips on N3B. Everybody keeps complaining about the SSD and RAM prices that Apple is charging, but my guess would be that the high-spec models are actually *heavily* subsidizing the price of the low-end configurations. A complete laptop with 92B transistors for $3500 vs. a 4090 GPU with 76B for $2000-3000 is very interesting.
This is it exactly! They're not selling parts but performance envelopes.
Thank you for embedding English subtitles.
As a Korean, I'm not good at listening to English, so this is very helpful. The video is also very insightful and easy to understand. Thank you.
I've watched this video like 3 times; this stuff is so interesting. I appreciate the level of effort you put into it.
same! I hope his next video is an hour long 😅
Still watching. I'm no chip designer, nor an embedded engineer, but this is such high-quality content that I can't stop, and I don't want to.
Excellent video. I watch a lot of processor analysis videos and product teardowns, and yours is one of the best I've seen. Interestingly, there's a bunch of initial M3 product comparison videos that are reporting M3 as a failure because their benchmark software doesn't take advantage of the improved architecture that Apple has delivered. I would love to see your analysis on the Apple audio chip improvements and the Closed Loop Controller used in the camera system...
You asked us to comment if we watched all the way through; I'm only halfway through the video.
Your video has a very interesting level of architectural detail. I am an ASIC designer still working at 12 nm who will someday make the case that it is worth my company taking the next step to 7 nm. I can get access to the standard libraries TSMC provides for 5 nm and 3 nm, but the really interesting differences are in what Apple (and Tesla) are doing with their architectures.
Moving to a finer process node is the brute-force approach to bringing more performance to a digital ASIC. What matters much more, even if they stayed with older nodes, is what bleeding-edge companies are doing with their designs.
Thanks so much for your videos.
I’ve been waiting for this deep dive! Thank you!
I don't remember the last time I watched a 20-minute video in one stretch. Great job making it information-rich and right on the money. I enjoyed the video and how you compiled it. Please carry on.
Came to hear your thoughts on the M3 Pro - great breakdown of the key components! Although I've read many complaints about the M3 Pro being a "downgrade", I don't think it's a big deal. I think your analysis is sound, although another site speculated the new Pro design may have been driven by difficulty achieving an adequate supply of M2 Pro chips through binning of the M2 Max alone, which created cost and supply problems. Nicely done, thanks!
Thorough, clean, comprehensive deep dives (without music blaring in every few minutes for entertainment) are what I am looking for, not the lalala and noise that some others do. I don't know another channel with such a high density and concentration of facts. Thanks a lot for your work!
Fantastic analysis. The M3's N3B node entered mass production late last year, which implies that the chip was designed about 1.5 years ago or longer. This was all before ChatGPT and AI changed the world, so I suspect the Neural Engine will grow in size for the M4.
yesssssss!!!!
I love your breakdowns, it’s so fascinating! Watched to 11 min now and aiming for the full 20 min!
I always watch your videos from start to finish, as they are high quality content!
I’m only at 11:00, but boy is this a condensed deep dive - super interesting, thanks!
Fantastic work! It definitely looks like TSMC's troubles with the new node may have forced Apple to make some tradeoffs with the M3 that they didn't want. I wonder if the advances in the GPU cores are also changing the needs of the NPU compute types... can more AI workloads move to the GPU with the new architecture?
TSMC is launching a new 3 nm process in February, incorporating lessons learned.
Don't worry about 10+ minute videos, i could watch an hour of this!
This was super insightful! Though perhaps it wasn't just sales/profit targets driving the differentiation between the M3 Pro and Max. The Max is really over-the-top for a laptop CPU/GPU and runs quite hot and loud (almost Intel-era-loud) especially in the 14 inch MBP, much more so than the M2 Max did. Battery life is also quite modest relative to other Apple M machines. So the M3 Pro serves as a new middle ground for people who want more performance and features than what the base M3 offers, but don't want the battery life and noise compromises that come with the Max, as insanely powerful as that chip may be.
Dude, I watch the entire video every time! You're getting us nerds together about this. LOL.
Also, I like the simplicity of your setup. The background is basic af, but what I appreciate is that the focus stays on the subject, with picture examples and no nonsense.
Thanks. The BG is just the white wall of my office, with some plants and lights. If I had more space I might be more creative, but so far I don't feel the need to change anything.
I’ve kind of been thinking that in some ways the base M2 was underpowered (specifically lacking the extra display controller), yet the M2 Pro was maybe overkill.
I'd prefer the M3 also have an extra display controller, but making a smaller M3 Pro is also one way to do it. The M3 Pro seems like a pretty good solution if you want a MacBook that runs cool.
And yeah, watched the whole video. Very fascinating. Thanks for making it!
I've been looking for a video that explains the chips in full. Thank you for this information.
There have been rumors of the iPad Pro getting a huge price increase to $1500-$1800.
So I think maybe the M3 Pro might be a "downgrade" so they can put it in the iPad, to push more gaming on the iPad and to further segment the iPad Pro from the iPad Air.
And a few months ago, someone said Apple was working on a 14.1-inch iPad Pro with the M3 Pro, and when I saw the announcement of fewer CPU and GPU cores, it made that leak even more believable.
As a computer architecture student, this was the first video I have watched from this channel, and I was blown away by the detail you provided. Amazing, dude.
Recommending this channel to my girlfriend also
Great video. It's the first one I saw on your channel, and you've earned a new subscriber.
As an avid gamer, I've been interested in seeing Apple push the GPU side since the M1 came out and they started using the same underlying hardware across Mac and iPad. Curious to see both Apple's support for this segment and adoption by studios.
Nice! I watched the entire video and even returned to watch some parts. Thank you for the comparison.
If I had to take a guess at why the NPU hasn't been taking up much more die area, I would say it's likely because they haven't found an architecture they love that doesn't take a lot of power. Instead they're keeping their current solution and upgrading with space and power as the driving forces, especially since that's Apple's silicon architecture design default. It's also hard to tell how much more powerful it actually is between generations, since TOPS doesn't tell you much about an NPU. For example, is this combined int8 and float32 TOPS? Did they add support for smaller int4 TOPS? How is the NPU caching affecting this number?
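To put rough numbers on that ambiguity, a toy Swift sketch (every figure here is invented):

// A bare "TOPS" claim hides the data type. The same die area can quote
// roughly double the number by counting narrower int8 MACs instead of fp16.
func peakTOPS(macUnits: Double, clockGHz: Double) -> Double {
    macUnits * clockGHz * 2 / 1000   // 1 MAC = 2 ops/cycle; result in 10^12 ops/s
}
let fp16TOPS = peakTOPS(macUnits: 4096, clockGHz: 1.0)   // ~8.2 "TOPS"
let int8TOPS = peakTOPS(macUnits: 8192, clockGHz: 1.0)   // ~16.4 "TOPS", same silicon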
This is from Apple's ML website:
“The first generation of the Apple Neural Engine (ANE) was released as part of the A11 chip found in iPhone X, our flagship model from 2017. It had a peak throughput of 0.6 teraflops (TFlops) in half-precision floating-point data format (float16 or FP16), and it efficiently powered on-device ML features such as Face ID and Memoji.
Fast-forward to 2021, and the fifth-generation of the 16-core ANE is capable of 26 times the processing power, or 15.8 TFlops, of the original.”
With your power consumption theory duly noted, even if it were true that Apple wasn't happy with its NPU design, there's no reason why they couldn't have increased the NPU core count (apart from power consumption, if true). I surmise Apple wasn't ready for LLMs & AI - OR - GPUs are more suited to these tasks than NPUs after all. That's certainly what Nvidia found out…
Having tested this, the same models on Apple's NE are significantly faster compared to the GPU while consuming a negligible amount of power (it's almost as if nothing is running at all). The caveat is that not every model can run on the Neural Engine, since it's very specialized by nature (in which case Core ML automatically falls back to the GPU), but there's no power-efficiency issue even if they decided to scale it up. I suspect that Apple simply believes its performance is sufficient for the current tasks they have in mind.
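For reference, the knob that controls this is Core ML's compute-units preference; a minimal Swift sketch (the model path is a placeholder):

import Foundation
import CoreML

let modelURL = URL(fileURLWithPath: "/path/to/MyModel.mlmodelc")  // placeholder
let config = MLModelConfiguration()
config.computeUnits = .all           // let Core ML use the ANE where the ops allow
// config.computeUnits = .cpuAndGPU  // keeps it off the ANE, for comparison
let model = try MLModel(contentsOf: modelURL, configuration: config)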
Fascinating, and the depth of review is much appreciated. The lack of bias in particular makes this.
Love these deep-dive videos. I think these chips are a modern wonder of mankind; it's mind-blowing how we are able to design and produce something with 92 billion switches.
I watched the whole video. Nice format. The only thing I'm missing (?) is: why did the memory get slower?
Apple silicon has been such a fantastic leap that I have to remind myself that my M1 Pro is _so_ good that I shouldn't have any need to upgrade. Despite the amazing performance gains of later chips I'm sticking to my 5-8 year upgrade cycle
Same. That gives me time in the down years to upgrade my iPad!
Loved the video especially the visualisation of where everything is on the chips. Watched every second of it.
These deep dives are great - I did watch the whole thing (though at 2x speed). A lot of the interesting architectural details just aren't captured in specs like core and transistor counts.
Yup I noticed that immediately regarding the NPU subassembly. It looks like they built bigger arrays in each core vs just more cores. Given their function, it would be a smart move as it would allow for more efficient process optimization for larger and more complex routines. ❤
I just tried the 2x speed; idk how you understand anything.
@@jonanddy You get used to it, eventually you get an extension and watch stuff at 3x or 4x speed and still take it in. It's crazy how much time you can save by doing that.
I watched to the end. The content matters, of course, but unlike other YouTubers you speak in a very calm way, so it's pleasant to listen to. Thank you.
Exceptional content
I have no idea how chip manufacturing works. Yet I was so captivated by this video and watched it all the way through. Hope to see more on the benefits of different chip designs.
The M3 NPU may be multiplexing the processing units, time-sharing 8 NPU cores to present as a (slower) group of 16. Like switched-capacitor circuits allow multiple poles per op-amp.
Would that not take a hit to the TOPS? Just curious what you think about that. Greater fan-in and fan-out of signals on the silicon does affect throughput performance, right?
@jasonjames2778 Of course it would. But TOPS and MIPS (Meaningless Information Propagated by Salesmen) do not measure actual performance. Since "operations" and "instructions" have no semantic meaning, the numbers have no semantic meaning; they are just a measure of clock speed, not performance. I was lucky enough to take a course in quantitative performance evaluation. The examples in the text were not imaginary problems but case studies of real problems in actual computer centers.
One example was a system with a fast and a slow disk drive. The vendor suggested doubling the CPU speed; that resulted in a 3% increase in throughput. Moving some files from the fast disk to the slow disk increased the throughput by 100%.
Compare the 68000 and the Novix microprocessor. The 68000 is clocked at 10 MHz so it has a high MIPS. But the Novix's throughput is 3 times that of the 68000.
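The bottleneck logic behind that disk example is easy to sketch (Swift, invented numbers):

// Throughput is capped by the busiest resource per job, so doubling a
// non-bottleneck CPU barely moves the needle, while rebalancing disks can.
let perJobSeconds = ["cpu": 0.02, "fastDisk": 0.10, "slowDisk": 0.03]
let bottleneck = perJobSeconds.values.max()!          // 0.10 s on the fast disk
print("max throughput ≈ \(1.0 / bottleneck) jobs/s")  // 10 jobs/s
// Halving CPU time leaves it at 10 jobs/s; balancing the disks to ~0.065 s
// each lifts it to ~15 jobs/s.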
Still watching and for the 2nd time. Thank you.
Keep ‘em coming!
As a general tech enthusiast, it's so interesting to learn about these kinds of specific things in such a well-made and well-laid-out way. Kudos to you, I enjoyed the whole video!
Kind of impressive how the M3 Pro is slightly faster than the M2 Pro (CPU-wise) despite using fewer transistors and keeping the same total core count.
They are literally using different settings and a newer chip node.
Watched it all, both an overview and a deep dive and I think you nailed the pacing.
This is a great video!
I made it to the end! And you got me twitching all over to upgrade my M1 Ultra Studio to an M3 Ultra Studio. Thank you. It is sometimes shocking what we humans can actually come up with! Incredible.
I'm still puzzled why Apple has not introduced a server/AI-focused chip yet, seeing how Nvidia, AMD, and Intel make massive amounts of cash with that. Something like an M3 Server edition with either a ton of CPU, GPU, or NPU cores for their Mac Pro.
The energy efficiency would certainly make them a great competitor to those already mentioned.
It’s because Apple isn’t in the server market and isn’t interested in entering that market
Not really. AMD's newest Threadrippers somehow have better performance/watt than the M3 in some cases, despite gulping more than 300 watts of power at max speed. I guess that's what three-figure thread counts will do. Also, Apple would effectively be required to support open-source software (and open hardware standards), which doesn't sound like an Apple thing to do.
The deep dive was very interesting. Longer videos are not a problem. :-) Heck, I watch Perun's new video every week; usually an over-one-hour-long PowerPoint presentation on military and defense-economics matters.
Thank you. 🙏
I'm so hungry for these types of videos. I wish it had another 20 minutes comparing the GPU of Apple silicon to other GPUs like the RTX 4090. Since you calculated the estimated size percentage of the GPU vs. the entire die, you could have easily compared it to other GPUs, for example:
+ You said the M3 Max's GPU takes roughly 35% of the 92B chip; that's roughly 32B transistors. For comparison, the RTX 4070 is a 35.8B chip.
+ You said the M3's GPU takes about 23% of 25B; that's 5.75B transistors. For comparison, a GTX 1650 is a 4.7B chip, and a 1660 is a 6.6B chip.
It kind of puts things into perspective, and shows how far Apple needs to go to get competitive with, say, a 4090 (76B)...
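Quick sanity check on that arithmetic (Swift; counts in billions, and it assumes uniform transistor density across the die, which is a rough approximation):

let m3MaxGPU = 0.35 * 92.0   // ≈ 32.2B, vs ~35.8B for an RTX 4070
let m3GPU    = 0.23 * 25.0   // = 5.75B, between a GTX 1650 (~4.7B) and a 1660 (~6.6B)
print(m3MaxGPU, m3GPU)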
I was actually going to compare the GPU sizes, but even if I know the transistor count of the entire chip and the GPU area, it's hard to be sure, since not all parts of the chip have the same transistor density. That's what stopped me, because I don't want to make claims I can't fully back up.
Yeah it's definitely in the realm of severe speculation...
Well, I'd like to see what the M3 Ultra will scale up to... That'll be interesting.
It's very clear that Apple is throwing money around to try and be top dog. Their transistor budget is way higher than Intel's and even higher than many dGPUs'. It's not a good situation for Apple to be in, as they are too reliant on TSMC not stumbling (like it just did with N3).
This is the in depth analysis I have been waiting for!!!! You must be doing something right as I had never heard of your channel before this video. Keep up the great work!
*I wonder if we will finally get the true Apple silicon Mac Pro this generation, with up to and maybe beyond 128 performance CPU cores.*
These deep dives are my favorite content on YouTube; I always watch all the way through, but I'm leaving you a comment this time halfway through, when you asked.
always, every segment, all the way through. mb
I honestly love this video: deep-diving inside chips like the M3 family on video and comparing them to the previous generation.
Apple's processors aren't competing with anyone because they function under a completely different business model. Apple makes its own devices, doesn't sell individual SoCs, and camouflages the costs within its devices. That's why they need to sell 8 GB of RAM for $200, etc. If they sold on the open market, the cost of the SoC would be uneconomical. TSMC probably gives them the sweetest production deals too.
Regardless of YT audience, your work is immensely valuable.
And here you are, giving it away for free. Much appreciated!
Your presentation is so clear and focused that even a novice like me gets the gist and it really helps to ease some of the frustrations users feel when they understand the context more. The 20 minutes flew by!
I watched the entire 20 minutes and you earned a new subscriber.
I really enjoyed your 20-minute video! Thank God I found your channel 😂
As a junior implementation engineer, implementing dynamic cache logic for a GPU seems challenging! And sending bits from a 40-core GPU to the bus and other parts of the chip is just 🤯. Apple engineers deserve credit for making great products for our lives.
Thanks for sharing your architecture perspective on chip manufacturing!
Fantastic video. Thanks for the detailed review. The length of the video is fine; the detail was exceptional and delivered with enough speed that it wasn't at all boring. Some drag things out so much I fall asleep listening to them slowly waffle. Yours was perfect for me.
Watched the whole thing, maybe 3x now. I know almost nothing about silicon design, but this is the best "review" of these chips I've seen on YouTube so far. Knowing now that the M3 Pro has performance cores with more transistors tells me that it's not quite the downgrade the average YouTube "reviewer" likes to go on about. Also, your comments on the GPU's potential have me certain that an M3 Pro is for me.
Will still explore, but as a graphic designer who wants to toy with ML image generation... guessing that's a step in the right direction. Plus, M2 Pros aren't really selling at much of a discount. So why not go for a new machine, especially coming from an Intel MBP.
This is the best video I’ve watched explaining the differences in the M chips, very nice!
As an engineer working in the semiconductor industry I very much enjoy learning about other aspects of the supply chain. 10/10
This was the best review of the M3 design I've seen for understanding the internals of the M line. Great video; watched it till the end, and it could even have been longer.
Still watching ! I believe there are only 2 types of viewers :
Those who click and realize under 20s that this video isn’t for them, and those who will watch the 20 minutes attentively until the final recap.
Well done, thank you for the insight !
I'm still watching at 11 minutes. I'm no one really, but I just enjoy the ins and outs of silicon. Apple silicon seems like a modern-day miracle that is still underappreciated by most people.
Really great explanations. You speak super clearly, and it's easy to comprehend even for someone like me who doesn't know much about chips.
Watched the whole thing! I don't design chips but work for a semiconductor manufacturer and your content is great for learning the main side of the business