Interesting in-depth analysis of the GPU. Dissipating the heat generated by the processor is quite challenging given the size of the GPU and the mix of different materials. This also raises questions about reliability and the product's fault-free performance (durability, useful life, maintenance, etc.).
Thermal issues are a huge problem, especially when moving to multiple types of materials that have to work together. They have done well, but close doesn't count in mass production.
Excellent explanation, Anastasia! Thank you. I am following the developments in this space closely. Silicon-based chip technology seems to be rapidly reaching its limits. I know that SMIC, in close cooperation with Huawei and several universities, is working feverishly on the development of photonic chips for AI training and inferencing. Size is not a limiting factor here. My assumption is that the world will be presented with a fully functional system out of China within the next 24 months that allows for the development and operation of LLMs at a fraction of the cost and power consumption of current Nvidia products like the H100 or B200. Jensen Huang is certainly aware of this fact, and so are many investors.
True. IBM has been leading research on all-optical chips whose transistors switch on/off using only photons (no electric current), promising nearly 1000x performance improvement and a significant reduction in power consumption. IBM contributed significantly to the growth of the Chinese tech space.
Assembling packages at an elevated temperature midway between "room temperature" and peak operating temperature might both improve yield and reduce failure rates.
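A rough sketch of why a midpoint assembly temperature helps, with illustrative numbers (the span, CTE gap, and temperature range below are assumptions, not Blackwell specifics):

```python
def peak_mismatch_um(assembly_t_c, t_min_c, t_max_c, span_mm=50, delta_cte_ppm=14):
    """Worst-case differential expansion (micrometres) relative to the
    stress-free state set at the assembly temperature."""
    worst_dt = max(abs(t_max_c - assembly_t_c), abs(t_min_c - assembly_t_c))
    return span_mm * 1000 * delta_cte_ppm * 1e-6 * worst_dt

room = peak_mismatch_um(25.0, 20.0, 95.0)   # assembled at room temperature
mid = peak_mismatch_um(57.5, 20.0, 95.0)    # assembled midway through the range
print(room, mid)  # the midpoint assembly roughly halves the peak mismatch
```

Under these assumed numbers the peak mismatch drops from about 49 µm to about 26 µm; real packages also involve cure shrinkage, creep, and warpage that a one-line model ignores.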
A new chip manufacturing machine that is around the size of a shipping crate. Build a warehouse and pack them into the available space, then copy and paste the factory a few times and voilà: chips at scale.
Even with liquid cooling, just the thermal resistance of the package itself is going to give you some temperature rise at 1700 W TDP per chip. Maybe it'd keep it low enough to not cause problems, but I'd worry about continued thermal cycling over time. I guess the trick would be to never power down a chip once it's been fired up; thermal cycles = 1 😁
@@DaveEtchells I hear some heavy consumers of GPUs are requesting 5 years of support to be included by vendors (beyond the 3-year warranty that NVIDIA includes). My presumption is that these customers are also wary of the potential for a high failure rate over time and want to put the risk onto someone else.
They probably need to use carbon nanotubes to connect chips to each other. But that would take a lot of development. When working with wood, you have to plan for seasonal expansion and contraction. I'm surprised chip engineers thought they could just slap some chips on a substrate without considering heat expansion and contraction. (I'm sure I must have misunderstood something.)
The double-die architecture of the Blackwell GPU really shows how far we’ve come in chip design, but it also raises new challenges like thermal management. Exciting to think about where this will take AI workloads!
They should hook them together with "zebra strip" Yeah... that's the ticket! No, really... carbon nanotubes on a flexible film might remain attached above and below despite thermal shifts. They could mask and etch the nanotubes to be only where they want them to be. But they would act more like flexible wires than any firm mount would. The top and bottom remain connected while the thermals flex the film in the gap. So, maybe it only does 7 or 8 Tb/s instead of ten. What do you want, good grammar or good taste?
Thank you for your dedication to reporting on and analyzing advancing electronics technology. Gigawatt-scale electrical power consumption is predicted for super-large data centers. Since almost all of that consumption ends up as undesirable resistive heat, plus the cooling systems needed to remove it, what is the possibility that a technology a decade or more from now eliminates this heat byproduct, or reduces it to a millionth of today's level, greatly cutting large-scale data center power consumption?
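There is a known thermodynamic floor on this question: Landauer's principle puts the minimum heat per irreversibly erased bit at kT·ln 2, and today's hardware sits many orders of magnitude above it, so an enormous (though not unlimited) reduction is physically possible. A quick sketch (the "typical" per-operation figure is an assumed rough order of magnitude, not a measured value):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K (exact in SI)

def landauer_j_per_bit(temp_k=300.0):
    """Minimum heat dissipated per irreversibly erased bit (Landauer limit)."""
    return K_B * temp_k * math.log(2)

floor = landauer_j_per_bit()   # ~2.9e-21 J at room temperature
typical_j_per_op = 1e-12       # assumed ballpark for a modern logic operation,
                               # including interconnect charging
print(f"headroom: ~{typical_j_per_op / floor:.0e}x")
```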
There will always be those trying to push the envelope to get more out of existing tech. Generally, I like the trend towards lower-temperature computers. There seems to be a lot of slop in large-scale integration, which leaves much to be desired where accuracy is needed. Regards
Nvidia should change their chip design to manufacture both GPUs and the inter-GPU interconnect together on a single die. This would greatly reduce yield, but at least the die would work. This is the approach Apple took with the M2 Ultra chip.
Hi Anastasi, thanks for another great video. A quick question, do they anneal the wafers post-fab? I do understand the stresses between the different materials. Deformation, delamination, etc.... Surely, annealing could solve these problems, whether done post-fab or during each stage of fabrication. It doesn't matter whether it's a hammer or a photon hitting the material, it's going to bend. Also, where do you live? I want to steal your Cerebras Chip!😉 I want one, just to hang on the wall! It looks gorgeous!!! Love ya work! Take care! ❤
We need to develop a solid-state converter of excess heat into electricity to offset most of the chip's power load... leading to a solution to the chip overheating problem...
Excess thermal buildup is indeed a challenge but that can be resolved. Do you remember the topic that you discussed earlier, the in-chip liquid cooling?
Cray could have told them you can't leave the mechanical engineering for thermal management as a secondary consideration. Scaling that thinking up to the environment, it's obvious OTEC and space solar power have the cooling and capital-utilization rates you want.
I hadn’t thought about all the other types of elements used in a die, but I figured it was likely a thermal mechanical expansion issue. But now they’ve got materials with different coefficients of expansion stacked on each other, with critical tolerances. Congratulations, nVidia, you’ve designed the world’s most complex and expensive bimetallic thermostat! Heats up, it likely opens, until it cools back down. Hopefully it starts working again. Their expected reach exceeded their actual grasp, it sounds like.
To get around the thermal issues, they need to determine what the operating temp range is to avoid any permanent damage.. then design the water cooling technology to support it..
Thanks Anastasi. Great Nvidia engineering, as is usually the case. They just failed to give Mother Nature enough credit, and she threw them a curve. I have confidence that they will find a way around her. It may be painful and could be suboptimal.
Clear, useful, interesting, and all presented by someone practically skilled in and passionate about these exciting technologies. Thank you! The issue I struggle to understand is whether the buyers of Nvidia chips are producing enough sellable product to support ongoing purchases from Nvidia at the current rate. There may be some new breakthrough like Transformers that suddenly makes AI so useful that everyone must buy it, but as of now AI has become commodity-like, with much of the difference between the various offerings being alignment with the philosophy of the designers rather than technical competence. A somewhat more extreme diversification than with web browsers at the beginning of the web, and we know that many, like Netscape, did not survive. If we see a consolidation, the intense demand that has driven Nvidia sales may wane. Thank you for sharing!
For solving thermal troubles: more copper and more silver, less silicon. No gold, because it is very expensive now. I believe the interconnecting substrates may be unreliable when there is a micro-earthquake or vibration from external sources.
This reminds me of the packaging issues Nvidia chips had in the Xbox and PS3 that caused the YLOD and a whole host of Nvidia GPU issues in other devices back in the day. There's a documentary on YouTube that discusses the Nvidia chips in the PS3 at great length. Manufacturing chips is a multi-country effort. I wonder how much this has to do with the current chip war and the havoc it's bringing.
If they do push this out, it will be interesting to see how robust these products are against thermal damage. With Intel having problems with some of their CPUs, are we getting to the point where the longevity of a chip becomes as important as raw speed?
I guess you either want to go wafer-scale, as you mentioned, or make much smaller chiplets to minimize the effect of the temperature-related stress. If going with a chiplet design, maybe cooling the substrate better could help, either with dummy copper lanes for cooling purposes only or by changing the substrate material and its thermal properties. These are just guesses; it would be interesting to get some insights from people in this world on what the engineering solutions might look like.
Actually, Nvidia dropped because California public retirement fund sold Nvidia stocks to buy another stock. Nvidia has at least a four year backlog for its A100.
There are other reasons too. AMD's AI solutions are looking very promising in terms of matching Nvidia's performance. If Nvidia's forward plans experience a hiccup, that is a big problem, since it may well roll over into its future AI products. The most alarming consideration is that Nvidia may end up experiencing the same chip degradation that Intel is seeing in high-end 13th- and 14th-gen CPUs.
@@davidgapp1457 AMD provides mediocre AI GPU solutions. Intel, with its open-source oneAPI framework, has a better shot at overtaking the proprietary CUDA monopoly that Nvidia has illegally implemented.
@@xlr555usa The word "illegal" refers to an action or set of actions that are found to be in contravention of a particular law. Which law/laws are you referring to?
@@xlr555usa I'm interested in your 'illegal' comment. I'm a CUDA-dependent user myself and am frustrated that Nvidia's prices are so excessive. That no one seems able to provide an alternative does seem 'anti-competitive' in some way.
Let me know what you think and share this video with your friends!
AMD has been making the worlds most powerful GPUs and CPUs with many tiles and chiplets.
Their latest GPU has 12 tiles and Nvidia struggles to figure out just a 2 tile design.
AMD has much superior engineering.
I'm going into chip design. You were my inspiration. I'm also considering monolithic designs, though I'm focused more on the gaming side of technology.
You need to watch out for, and remove these scammer comment threads talking about “stocks” and “financial advisors”. These are posted by bots, and are run by investment scammers. You have one below right now.
Don’t allow your fans to be preyed on.
@@fullstackcrackerjack I agree wholeheartedly! But on top of those easy-to-spot threads there are so many other comments that are suspicious. It's all engagement as far as the channel is concerned, so I doubt they will spend much time weeding out these BS comments. Who knows these days who is a real human and what is a bot? With so much orientation toward marketing mindsets, it all drives the system of algorithms, so no one does anything. It's frustrating and worrying, but who cares, right? Just the tip of the 'extinction event' rising into view?
THIS is why I liked Intel’s idea of replacing organic substrates with glass. The thermal coefficient is closer to pure silicon and the manufacturing process gets easier for TSVs
Maybe because glass is literally silicon dioxide
Kinda back to the future.
Ceramic is so 90s.
The question is when server and power-hungry solutions will move away from silicon... there are a few prospective solutions on the horizon 🤔 15 years?
It will happen at some point
🤣it would be a world shattering computer advancement tho
I was part of a startup that built a multi-chip package with a silicon interposer containing pS transmission line interconnect. We had working prototypes but ran out of money before we could convince a packaging partner it could scale - in 1999.
Yeah there are even books from the 90s about it. It's nothing new as a concept and design, just manufacturing.
@@badass6300 yes, the devil is in the details of thermal mismatch with increasing power and shrinking dimensions.
Best videos, informative and in detail for non technical people!
...not only for non technical people!😉
Glass or Glass ceramic substrate is expensive but can come close to the TCE of silicon while providing good electrical interconnect performance. We investigated that 25 years ago when designing Itanium MCM substrate in Intel.
I wanted to say that there are many people on YouTube who talk about the big processor manufacturing companies, but few go into your level of detail with such high technical knowledge. Thank you very much for your channel 👍
What is so interesting about this is that when inventing the light bulb they had the same issues around different expansion rates of the glass and metal…. Some things never change.
In making HV power supplies for some years, we found that "Stycast" potting material had very good electrical and thermal characteristics for the applications we were considering, but we soon found out that the stuff has a much higher thermal expansion rate than, say, circuitry. So it was snapping components right off the board during thermal cycling. On the other end of the spectrum, RTV was what we ended up using. But it is soft and can detach from a surface, and that means failure in an HV supply. So we had to prime those surfaces to ensure adhesion. We did use the Stycast on some things, but we enhanced its thermal properties by mixing fiberglass fragments into it.
In this case it's not argon, it's graphene's production cost. Graphene's high thermal conductivity can help electronics cool more efficiently, with less temperature rise during operation, but it's still too bloody expensive.
The fact that concrete and steel have very similar rates of thermal expansion is why reinforced concrete is possible.
Wow very clearly presented - I understood this complex process with your very well done presentation.
These connections between both sides of the GPU remind me of the corpus callosum that holds the two hemispheres of the brain together.
Great breakdown of what makes the 10 TB/s link between Blackwell dies so challenging. I wonder if there'll be a better packaging method for this link in the future or if the Rubin GPUs will go back to a 1-die design.
Wow!
Explained better than many so-called tech channels.
Thank you.
This explanation is superb! Keep it up and with love from the Netherlands!
I'm sure FEA can model heat flows and thermal expansion very well - but everything has a tolerance. Maybe the micro connects are just too small. It seems like a solvable problem if the chips are slightly less ambitious in the sizing of the various elements. Thanks for explaining what's going on.
On point, technically accurate and informative. Thank you for your quality work.
This girl is hypnotic. And on top of that, her videos are very well made =)
In my 66yrs, I've noticed that smart, attractive women can be very 'enchanting'...especially if they have something in common like Computer Science.
Maybe the single photomask changed the pads for attaching the silicon bridges to improve packaging yield?
There is a saying in precision machining:
On a small enough scale, everything becomes a thermal problem.
Or maybe a chemical problem.
At some point it becomes a quantum tunneling problem!
Even Guilloche?
AMD is years ahead of Nvidia when it comes to chiplets. Nvidia is just now starting to use chiplets, while AMD has been using them for years.
AMD has many patents on doing this. Nvidia might need to buy from AMD.
Cerebras: first time?
I mean, that's literally the big thing Cerebras solved with their wafer-scale approach.
You're late.
Having no defects at all on a wafer is quite unlikely. The larger any single chip gets, the more likely it is that it contains a defect. Hence, large chips have a worse yield and become more expensive per piece. A solution is dividing the design into smaller chips and mounting them to a common interposer.
Cerebras did things differently: Their Wafer Scale Engine consists of many small processors and can tolerate the failure of a few processors. The WSE sort of routes around the damage.
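The yield argument above can be made concrete with the simplest (Poisson) defect model; the defect density here is an assumed round number, purely for illustration:

```python
import math

def die_yield(area_cm2, defect_density_per_cm2):
    """Poisson yield model: probability a die of the given area has zero defects."""
    return math.exp(-area_cm2 * defect_density_per_cm2)

D0 = 0.1  # defects per cm^2, assumed for illustration

monolithic = die_yield(8.0, D0)  # one big ~800 mm^2 die
chiplet = die_yield(2.0, D0)     # one ~200 mm^2 chiplet

# Because chiplets can be tested before assembly ("known good die"),
# the usable fraction of silicon is the per-chiplet yield, not chiplet**4.
print(f"monolithic: {monolithic:.2f}, per-chiplet: {chiplet:.2f}")
```

The same model explains the Cerebras approach: instead of discarding defective area, the wafer-scale design routes around it, so effective yield no longer collapses exponentially with area.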
Thank you for this explanation!
I’m no packaging engineer, but as soon as I heard the word “organic” for the interposer I started wondering about problems with differing thermal coefficients.
What I’m curious about is why would Nvidia and TSMC think they could make it work in the first place?
Differences in thermal expansion rates are so fundamental that they must have thought they had some way of coping with them, either by coming up with a material for the interposer that magically has the same thermal coefficient as silicon, or by somehow limiting the thermal excursion with amazing heat-sinking capability. But 1,700 watts/chip TDP is going to get pretty warm almost no matter what you do. Even if you had some kind of active phase-change cooling, just the thermal resistance to get the heat out of the package is going to result in a good bit of temperature rise.
Does anyone in the comments have any ideas about or knowledge of advanced techniques or materials that would lead Nvidia and TSMC to think they could actually do this? It seems like a fool’s errand to me, to go away from a silicon interposer, but IANAPE (I am not a packaging engineer), so there may very well be things I’m not aware of.
(Great vid as usual Anastasi, you did a great job of tracing the evolution and explaining the likely cause of the problems. Great thumbnail too 😂)
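The temperature-rise worry in the comment above is easy to quantify: the steady-state rise is just power times thermal resistance. The resistance value below is an assumed ballpark, not a published Blackwell figure:

```python
def junction_rise_c(power_w, r_theta_c_per_w):
    """Steady-state temperature rise across a thermal resistance (dT = P * R)."""
    return power_w * r_theta_c_per_w

# Even an aggressive 0.02 C/W junction-to-coolant resistance (assumed)
# leaves a sizeable rise at Blackwell-class power:
rise = junction_rise_c(1700, 0.02)
print(f"~{rise:.0f} C above coolant temperature")
```

So even with ideal coolant, the die runs tens of degrees above it, which is exactly the thermal excursion the packaging has to survive.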
My reaction is the same. What were they thinking? It's not just the coefficient of thermal expansion; the different materials must also have different thermal conductivities.
It works on a smaller scale, but with a larger chip the expansion is larger, so the misalignment becomes a bigger problem. The chip designers failed to factor expansion into their design, and the fabricator failed to inform them that it would be an issue. These separate engineering teams work in different companies, so miscommunication is also an issue.
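The "larger chip, larger misalignment" point scales linearly with span, which a one-liner shows (the CTE values and temperature swing are illustrative assumptions, not vendor numbers):

```python
def thermal_mismatch_um(span_mm, delta_t_c, cte_a_ppm, cte_b_ppm):
    """Differential expansion, in micrometres, between two materials over a span."""
    delta_cte = abs(cte_a_ppm - cte_b_ppm) * 1e-6  # per degree C
    return span_mm * 1000 * delta_cte * delta_t_c

# Silicon ~2.6 ppm/C vs organic laminate ~17 ppm/C (assumed typical values),
# 70 C swing; microbump pitches are on the order of tens of micrometres:
for span in (10, 25, 50):
    print(span, "mm ->", round(thermal_mismatch_um(span, 70, 2.6, 17.0), 1), "um")
```

At a 50 mm span the relative shift reaches tens of micrometres, comparable to a microbump pitch, which is why a design that works on a small package can fail on a big one.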
@@kazedcat That may be true, but TSMC has whole teams of engineers just working on packaging; thermal expansion is fundamental to everything they do.
I guess it’s possible TSMC wasn’t involved in the multi chip packaging using the interposer, maybe it was just a PC board guy that designed it. Still, thermal expansion is such a _basic_ fact of engineering life, it’s hard to understand how they could have overlooked it.
@@DaveEtchells TSMC provides design rules, but these design rules are based on some assumptions, like the size of the package. If this size limitation is not communicated properly, then the layout engineers at Nvidia could have followed the design rules without knowing that the rules are not valid for the package size they are designing.
Other than altering the materials to respond to heat the same way, the only idea I have is to encase the chips in a rigid structure to prevent expansion, and/or keep them under some amount of compressive stress to counteract deformation. But I'm not sure to what degree the expansion and contraction happens under maximum thermal stress, so it would most likely just make the chips fail faster. Imagine it were that simple...
Thank you for these videos by the way always enjoy them
Explained this way, I'm surprised they ever built a working Blackwell GPU. 😓
Very interesting and well researched
Wonderfull!
You make it so easy to understand
Keep going👍👍
Congratulations on approaching the 200-level milestone for subscribers. With your growly voice and sharp insight into the tech world (especially chip development), you deserve the attention. Thanks for your efforts to keep us informed and thoughtful about the direction of this field.
Very interesting. Chip design up until now has always seemed to proceed without much concern for geography. Distance seemed to relate only to speed but now we see that it has inherent qualities that cannot be ignored. I ran across similar problems years ago working in design for fused glass. Compatibility took on many forms. Cheers.
I absolutely love your videos. Thank you so much for continuing to make them. I find them fascinating and love the way you explain it to us 🥰
1 kW for a single chip? Our poor planet!!!
😂 They are making nuclear reactors
Current AI, per Sam Altman, is mostly brute force: bigger and bigger models. It's a beta. The science is not ready. Ilya knows this. It's difficult to size the load. The current AI race to the cliff is a bonanza for Nvidia and others; Nvidia is a company specialized in seizing future market opportunities.
GPU cum water kettle. Produce boiling water and steam as you play video games. Make tea and dinner as you play.
We cannot sustain this flippant pursuit of the ASI boondoggle and these proposals for super clusters. It will end badly in a water, food, or energy crisis, or perhaps all three simultaneously, i.e. a polycrisis, if humans don't come to their senses.
It could easily lead to higher prices for electrical generation and distribution.
They need to preheat the entire thing to a set temperature slightly above the expected normal operating temperature and keep it there, instead of allowing it to heat up on its own. This will most likely require immersion in a liquid of some sort that can maintain higher temperatures. They may also need to design it at those temperatures.
I just had the same idea ;-)
Like pretensioning concrete bridge sections. They might be able to get away with building it at some intermediate temperature, so it can tolerate shipping and the occasional cooldowns, but really do well if left running constantly.
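A quick back-of-the-envelope sketch of why assembling at an intermediate temperature helps, in the spirit of the pretensioning idea above. The CTE values, span, and temperatures below are illustrative assumptions, not Nvidia's actual figures:

```python
# Differential thermal expansion between a silicon die and an organic
# substrate over a given span. All numbers are illustrative assumptions.

CTE_SILICON = 2.6e-6   # 1/K, typical bulk silicon
CTE_ORGANIC = 17e-6    # 1/K, typical organic package substrate
SPAN_MM = 40.0         # lateral span of a large die/package, assumed

def mismatch_um(assembly_temp_c, operating_temp_c):
    """Differential expansion (micrometers) between die and substrate
    when the package moves from assembly to operating temperature."""
    delta_t = operating_temp_c - assembly_temp_c
    delta_cte = CTE_ORGANIC - CTE_SILICON
    return abs(delta_cte * delta_t * SPAN_MM * 1000)  # mm -> um

# Assembled at room temperature, running at 100 C:
room = mismatch_um(25, 100)
# Assembled midway (~60 C), the worst excursion is only ~40 K either way:
mid = max(mismatch_um(60, 100), mismatch_um(60, 25))

print(f"room-temp assembly: {room:.1f} um of mismatch")
print(f"mid-temp assembly:  {mid:.1f} um of mismatch")
```

Assembling midway roughly halves the worst-case die-to-substrate mismatch, at the cost of some residual stress whenever the package sits cold.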
great explanation, thanks for your work!
Next step is to use microfluidics-based heat dissipation. Impregnate the substrate with thousands of capillaries and pump a steady flow of some refrigerant through them.
Man I just got a 4060 and it pushes everything extremely well at like 115W max. The card is tiny. It just amazes me.
Thank you Anastasi for your professionalism on this AI technology. 🤖🖖🤖🇮🇹🇺🇸❤️
How was I not subscribed..... I am now, Anastasi.
Go with superconducting materials for connectors. Cryogenic cooling will eliminate the heat problem.
"More and more people are adopting AI."
Correction: more and more corporations are adopting AI because it's the current trendy hype.
Most people are pretty much fatigued with AI by now, and as soon as the corporate investors find another hype, the whole AI craze will be forgotten almost instantly.
Good video. Very well explained and understood. Thanks Anastasi.!
Nice, easy-to-understand video! 👍
I think you're spot on, Ms. A. The infamous coefficient of thermal expansion (CTE) mismatch is a pain in the a--. Fine analysis as usual. Concurrent-engineer with your process guys.
@AnastasiInTech's video left me wondering two things; anyone have answers? 1) Even though different coefficients of thermal expansion are almost certainly present on these huge chips, is there any evidence that they are a primary (or even significant) contributor to the problems with Nvidia's GPU? 2) Even if Nvidia increases its yield with a new die, won't damage from heat-induced flexing still build up over time, beyond the problems observed initially due to misalignment (if that is the problem; see question 1)? What do you think?
My idea would be pre-designing the assembly to work at a specific temperature, and making sure that this temperature is held constant during operation.
If an acronym doesn't actually save any syllables it's not real
Interesting in-depth analysis of the GPU. Dissipating the heat generated by the processor is quite challenging given the size of the GPU and the use of different materials. This also raises questions about reliability and the product's fault-free performance (durability, useful life, maintenance, etc.).
Thermal issues are a huge problem, especially when moving to multiple types of materials that have to work together. They have done well, but close doesn't count in mass production.
Excellent explanation, Anastasia! Thank you. I am following the developments in this space closely. Silicon-based chip technology seems to be rapidly reaching its limits. I know that SMIC, in close cooperation with Huawei and several universities, is working feverishly on the development of photonic chips for AI training and inferencing. Size is not a limiting factor here. My assumption is that the world will be presented with a fully functional system out of China within the next 24 months that allows for the development and operation of LLMs at a fraction of the cost and power consumption of current Nvidia products like the H100 or B200. Jensen Huang is certainly aware of this fact, and so are many investors.
True. IBM has been leading research on all-optical chips built from transistors that switch on/off using only photons, not electric current, promising nearly 1000x performance improvement and a significant reduction in power consumption. IBM also contributed significantly to the growth of the Chinese tech space.
waiting for Cerebras to IPO in October.
"OpenAI Researcher BREAKS SILENCE "Agi Is NOT SAFE""
Interesting! I watched the launch event for Blackwell. Hopefully this manufacturing problem gets resolved.
Assembling packages at an elevated temperature midway between "room temperature" and peak operating temperature might both improve yield and reduce failure rates.
A new chip manufacturing machine that is around the size of a shipping crate. Build a warehouse and spam them into the available space, then copy and paste the factory a few times and voilà, chips at scale.
Might just have to make liquid cooling mandatory
The server GPUs are already being liquid-cooled, if I recall correctly.
It's not the norm by a long way. Most tier-1s and ODMs will be releasing DLC (direct liquid cooling) servers within the next 6 months, though.
Even with liquid cooling, just the thermal resistance of the package itself is going to give you some temperature rise at 1700 W TDP per chip. Maybe it'd keep it low enough to not cause problems, but I'd worry about continued thermal cycling over time. I guess the trick would be to never power down a chip once it's been fired up; thermal cycles = 1 😁
The problem is the internal structure. Components can get too hot before the heat reaches the liquid-cooled surface.
@@DaveEtchells I hear some heavy consumers of GPUs are requesting 5 years of support from vendors (beyond the 3-year warranty NVIDIA includes). My presumption is that these customers are also wary of a potentially high failure rate over time and want to put the risk onto someone else.
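To put a rough number on the temperature-rise concern in this thread: steady-state junction rise is approximately power times the junction-to-coolant thermal resistance. A minimal sketch, where the 0.02 K/W resistance is an assumed illustrative value, not a published spec:

```python
# Steady-state junction temperature rise = power x thermal resistance.
# The resistance figure below is an assumption for illustration only.

TDP_W = 1700.0          # per-package TDP mentioned in the thread
THETA_K_PER_W = 0.02    # assumed junction-to-coolant thermal resistance

def junction_temp_c(coolant_temp_c, power_w=TDP_W, theta=THETA_K_PER_W):
    """Steady-state junction temperature for a given coolant temperature."""
    return coolant_temp_c + power_w * theta

# Even with 30 C coolant, the die sits about 34 K hotter:
print(f"junction temperature: {junction_temp_c(30):.0f} C")
```

So even very good liquid cooling leaves the die tens of kelvin above the coolant, and every power cycle swings it through that range.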
They probably need to use carbon nanotubes to connect chips to each other, but that would take a lot of development. When working with wood, you have to plan for seasonal expansion and contraction. I'm surprised chip engineers thought they could just slap some chips on a substrate without considering thermal expansion and contraction. (I'm sure I must have misunderstood something.)
Can you do a video on Intel and how they are failing? They were just mentioned as being close to removal from the Dow Jones stock index.
They didn't pay me for my wafer-sized chip idea. That's how they went.
Been subscribed, love the content and have a crush.
The double-die architecture of the Blackwell GPU really shows how far we’ve come in chip design, but it also raises new challenges like thermal management. Exciting to think about where this will take AI workloads!
They should hook them together with "zebra strip" Yeah... that's the ticket! No, really... carbon nanotubes on a flexible film might remain attached above and below despite thermal shifts. They could mask and etch the nanotubes to be only where they want them to be. But they would act more like flexible wires than any firm mount would. The top and bottom remain connected while the thermals flex the film in the gap. So, maybe it only does 7 or 8 Tb/s instead of ten. What do you want, good grammar or good taste?
Keep it cool. Greatly enjoy the vids.
Superb. I was doing research on this, trying to get a solid understanding of the issue, when this video popped up.
Thank you for your dedication to reporting on and analyzing advancing electronics technology. Gigawatt-scale electrical power consumption is predicted for super-large-scale data centers. Since almost all of that electrical consumption ends up as undesirable resistive heat, plus the cooling systems to remove it, what is the possibility that new technology, a decade or more from now, could eliminate this heat byproduct or reduce it to a millionth of what it is today, largely eliminating large-scale data center power consumption?
There will always be those trying to push the envelope to get more out of existing tech. Generally, I like the trend toward lower-temperature computers. There seems to be a lot of slop in large-scale integration, which leaves much to be desired if accuracy is needed. Regards.
Please do a video on latest news on Intel 18A fabrication
Nvidia should change their chip design to manufacture both the GPUs and the inter-GPU interconnect together on a single die. This would greatly reduce yield, but at least the dies that pass would work. This is the approach Apple took with the M2 Ultra chip.
Hi Anastasi, thanks for another great video. A quick question: do they anneal the wafers post-fab? I understand the stresses between the different materials: deformation, delamination, etc. Surely annealing could solve these problems, whether done post-fab or during each stage of fabrication. It doesn't matter whether it's a hammer or a photon hitting the material; it's going to bend.
Also, where do you live? I want to steal your Cerebras Chip!😉 I want one, just to hang on the wall! It looks gorgeous!!!
Love ya work! Take care! ❤
Very good. Not a lot of people can break down technology and explain it like this.
Anastasia, what do you think about Sohu? How realistic is this project from the technological standpoint?
They need to add a heater to maintain minimum temperature, and dynamically move the workload around to lower the temperature on hot spots. Or not.
Need to develop a solid-state converter of excess heat into electricity to offset most of the chip power load... leading to a solution to the chip overheating problem...
Gosh, I love your young voice. Thanks for all your coverage, but especially this one because I am invested in NVIDIA.
The only thing that seems to be somewhat working at Intel is EMIB. Maybe Nvidia should package Blackwell there lol.
I understand TSMC thinks they have solved the problem, and new batches are being tested by Nvidia and by its main customers.
Excess thermal buildup is indeed a challenge but that can be resolved. Do you remember the topic that you discussed earlier, the in-chip liquid cooling?
Cray could have told them you can't treat the mechanical engineering for thermal management as a secondary consideration. Scaling that thinking up to the environment, it's obvious OTEC and space solar power have the cooling and capital-utilization rate you want.
I hadn’t thought about all the other types of elements used in a die, but I figured it was likely a thermal mechanical expansion issue.
But now they’ve got materials with different coefficients of expansion stacked on each other, with critical tolerances.
Congratulations, nVidia, you’ve designed the world’s most complex and expensive bimetallic thermostat! Heats up, it likely opens, until it cools back down. Hopefully it starts working again.
Their expected reach exceeded their actual grasp, it sounds like.
They tried to cut costs, because of inflation pressures, and it backfired.
To get around the thermal issues, they need to determine what operating temperature range avoids any permanent damage, then design the water cooling technology to support it.
Wow! I understood everything you said. And it's on substrates of computer chip manufacturing. Never thought I'd listen.
Can you make a video about the Intel microcode 0x129 problem of the 13th and 14th generation processors?
Your channel is a gem. Thank you
Thanks Anastasi. Great Nvidia engineering, as is usually the case. They just failed to give Mother Nature enough credit, and she threw them a curve. I have confidence that they will find a way around her. It may be painful and could be suboptimal.
Clear, useful, interesting, and all presented by someone practically skilled in and passionate about these exciting technologies. Thank you! The issue I struggle to understand is whether the buyers of Nvidia chips are producing enough sellable product to support ongoing purchases from Nvidia at the current rate. There may be some new breakthrough, like Transformers, that suddenly makes AI so useful that everyone must buy it, but for now AI has become commodity-like, with much of the difference between the various offerings being alignment with the designers' philosophy rather than technical competence. A somewhat more extreme diversification than with web browsers at the beginning of the web, and we know that many, like Netscape, did not survive. If we see a consolidation, the intense pressure that has driven Nvidia sales may wane. Thank you for sharing!
For solving thermal troubles: more copper and more silver, less silicon. No gold, because it is very expensive now. I believe the interconnecting substrates may be unreliable under micro-earthquakes, i.e. vibration from external sources.
If you look at profit versus market capitalization, TSMC is a much better deal.
This reminds me of the packaging issues they had with nvidia chips in the xbox and PS3 that caused YLOD and a whole host of NVIDIA GPU issues in other devices back in the day.
There's a documentary on YouTube about the Nvidia chips in the PS3 that discusses it at great length.
Manufacturing chips is a multi-country effort.
I wonder how much it has to do with the current chip war and the havoc it's bringing.
Thanks
Cerebras seems set to become the top IPO story soon.
Wonder what a 450mm-wafer Cerebras would be...
Bigger chips result in lower yield and higher costs for the consumer.
Well, they should leave channels in the chips for water to run through. That's how the chips would be cooled.
If they do push this out, it will be interesting to see how robust these products are against thermal damage. With Intel having problems with some of their CPUs, are we getting to the point where the longevity of a chip becomes as important as raw speed?
A chip cooling system can be useful if heat is a problem.
My team is working on optical interconnection platform as well as 3x4 optical logic gate. Coming soon.
You explain it very well for the layman to understand.
Subscribed. Love these videos.
They should use optical bus between chiplets and stacks.
We should all go lada , best soviet tech and the future
I guess you either want to go wafer-scale, as you mentioned, or make much smaller chiplets to minimise the effect of the temperature-related stress.
If going with a chiplet design, maybe cooling the substrate better could help: either adding dummy copper lanes for cooling purposes only, or changing the substrate material and its thermal properties.
These are just guesses; it would be interesting to get some insight from people in this world into what the engineering solutions might look like.
Wouldn't that create more latency though?
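The intuition above that smaller chiplets reduce thermal stress can be quantified: differential expansion grows linearly with lateral span, so halving the chiplet size halves the stress-driving displacement. A minimal sketch with assumed, illustrative CTE and temperature values:

```python
# Die-to-substrate differential expansion scales linearly with span,
# which is why smaller chiplets see less thermally induced stress.
# The CTE difference and temperature swing below are assumptions.

DELTA_CTE = 14.4e-6   # 1/K, assumed silicon-vs-organic CTE difference
DELTA_T = 75.0        # K, assumed swing from idle to full load

def mismatch_um(span_mm):
    """Absolute differential expansion (micrometers) across a span."""
    return DELTA_CTE * DELTA_T * span_mm * 1000  # mm -> um

for span_mm in (40, 20, 10):  # large die vs. progressively smaller chiplets
    print(f"{span_mm:2d} mm span -> {mismatch_um(span_mm):4.1f} um mismatch")
```

The linear scaling is the whole argument: a 10 mm chiplet sees a quarter of the displacement a 40 mm die does, for the same materials and temperature swing.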
They are developing a glass substrate to reduce the thermal expansion issue. You can also mitigate the problem by designing fewer, larger vias.
Actually, Nvidia dropped because the California public retirement fund sold Nvidia stock to buy another stock. Nvidia has at least a four-year backlog for its A100.
There are other reasons too. AMD's AI solutions are looking very promising in terms of matching Nvidia's performance. If Nvidia's forward plans experience a hiccup, that's a big problem, since it may well roll over into its future AI products. The most alarming possibility is that Nvidia ends up experiencing the same chip degradation Intel is seeing in high-end 13th- and 14th-gen CPUs.
@@davidgapp1457 AMD provides mediocre AI GPU solutions. Intel, with its open-source oneAPI framework, has a better shot at overtaking the proprietary CUDA monopoly that Nvidia has illegally implemented.
@@xlr555usa Not illegal.
@@xlr555usa The word "illegal" refers to an action or set of actions that are found to be in contravention of a particular law. Which law/laws are you referring to?
@@xlr555usa I'm interested in your 'illegal' comment. I'm a CUDA-dependent user myself and am frustrated that Nvidia's prices are so excessive. That no one seems able to provide an alternative does seem 'anti-competitive' in some way.
Thank you.
Thank you
Rule of Acquisition 010: Greed is eternal.
A multi-SoC design is truer for economics: it can be scaled from small embedded systems up to high-end supercomputers.