AMD has been making the world's most powerful GPUs and CPUs with many tiles and chiplets. Their latest GPU has 12 tiles, while Nvidia struggles to figure out just a 2-tile design. AMD has much superior engineering.
I'm going into chip design. You were my inspiration. I'm also considering monolithic designs, though I'm focused more on the gaming side of technology.
You need to watch out for and remove these scammer comment threads talking about “stocks” and “financial advisors”. These are posted by bots and run by investment scammers. You have one below right now. Don’t allow your fans to be preyed on.
@@fullstackcrackerjack I agree wholeheartedly! But on top of those easy to spot threads there are so many other comments that are suspicious. It's all engagement as far as the channel is concerned so I doubt they will spend much time weeding out these BS comments. Who knows these days who is a real human and what is a bot and with so much orientation to marketing mindsets it all drives the system of algorithms, so no one does anything. It's frustrating and worrying, but who cares right? Just the tip of the 'extinction event' rising into view?
THIS is why I liked Intel’s idea of replacing organic substrates with glass. The thermal coefficient is closer to pure silicon and the manufacturing process gets easier for TSVs
@@grxwpr20725 Not really. At the beginning it will be just an advancement, and only with time will it become more refined, etc. It will be like it was with silicon... You probably don't remember the times when gates were measured in micrometers (with a poor microscope or even the naked eye you could see the transistors), then hundreds, then tens of nm, etc. - I do remember those times 🤣
The market trend can turn around very quickly. In fact, the indexes often switch from a bear market to a bull market when the news is at its worst and the mood of investors is at its lowest point. I read an article of people that grossed profits up to $150k during this crash, what are the best stocks to buy now or put on a watchlist?
In particular, amid inflation, investors should exercise caution when it comes to their exposure and new purchases. It is only feasible to get such high yields during a recession with the guidance of a qualified specialist or reliable counsel.
True, initially I wasn't quite impressed with my gains compared to my previous performance. I was doing so badly that I figured I needed to diversify into better assets. I touched base with a portfolio advisor and that same year I pulled a net gain of 550k... that's like 7 times more than I average on my own.
NICOLE ANASTASIA PLUMLEE’ is the licensed fiduciary I use. Just research the name. You’d find necessary details to work with a correspondence to set up an appointment.
I was part of a startup that built a multi-chip package with a silicon interposer containing pS transmission line interconnect. We had working prototypes but ran out of money before we could convince a packaging partner it could scale - in 1999.
Glass or Glass ceramic substrate is expensive but can come close to the TCE of silicon while providing good electrical interconnect performance. We investigated that 25 years ago when designing Itanium MCM substrate in Intel.
I wanted to say that there are many people on YouTube who talk about the big processor manufacturing companies, but few people go into it with your level of detail and technical knowledge. Thank you very much for your channel 👍
Congratulations on approaching the 200-level milestone for subscribers. With your growly voice and sharp insight into the tech world (especially chip development), you deserve the attention. Thanks for your efforts to keep us informed and thoughtful about the direction of this field.
What is so interesting about this is that when inventing the light bulb they had the same issues around different expansion rates of the glass and metal…. Some things never change.
In making HV power supplies for some years, we found that "Stycast" potting material had very good electrical and thermal characteristics for the applications we were considering, but we soon found out that the stuff has a much higher thermal expansion rate than, say, circuitry. So it was snapping components right off the board during thermal cycling. On the other end of the spectrum, RTV was what we ended up using. But it is soft and can detach from a surface, and that means failure in an HV supply. So we had to prime those surfaces to ensure adhesion. We did use the Stycast on some things, but we enhanced its thermal properties by mixing fiberglass fragments into it.
In this case it's not argon, it's graphene production cost. Graphene's high thermal conductivity can help electronics cool more efficiently, with less temperature rise during operation, but it's still too bloody expensive.
I'm sure FEA can model heat flows and thermal expansion very well - but everything has a tolerance. Maybe the micro connects are just too small. It seems like a solvable problem if the chips are slightly less ambitious in the sizing of the various elements. Thanks for explaining what's going on.
Great breakdown of what makes the 10 TB/s link between Blackwell dies so challenging. I wonder if there'll be a better packaging method for this link in the future or if the Rubin GPUs will go back to a 1-die design.
Very interesting. Chip design up until now has always seemed to proceed without much concern for geography. Distance seemed to relate only to speed but now we see that it has inherent qualities that cannot be ignored. I ran across similar problems years ago working in design for fused glass. Compatibility took on many forms. Cheers.
I was a noise & thermal tech at Intel. I've always thought these bigger chips might just split themselves. I actually got a chip to split under a heatspreader with clever code. These were older smaller chips too.
Interesting in-depth analysis of the GPU. Dissipating the heat generated by the processor is quite challenging given the size of the GPU and the use of different materials. This also raises the question of reliability and this product's fault-free performance (durability, useful life, maintenance, etc.).
I’m no packaging engineer, but as soon as I heard the word “organic” for the interposer I started wondering about problems with differing thermal coefficients. What I’m curious about is why would Nvidia and TSMC think they could make it work in the first place? Differences in thermal expansion rates are so fundamental that they must have thought they had some way of coping with them, either by coming up with a material for the interposer that magically has the same thermal coefficient as silicon, or by somehow limiting the thermal excursion with amazing heat sinking capability. - But 1,700 watts/chip TDP is going to get pretty warm almost no matter what you do. Even if you had some kind of active phase-change cooling, just the thermal resistance to get the heat out of the package is going to result in a good bit of temperature rise. Does anyone in the comments have any ideas about or knowledge of advanced techniques or materials that would lead Nvidia and TSMC to think they could actually do this? It seems like a fool’s errand to me, to go away from a silicon interposer, but IANAPE (I am not a packaging engineer), so there may very well be things I’m not aware of. (Great vid as usual Anastasi, you did a great job of tracing the evolution and explaining the likely cause of the problems. Great thumbnail too 😂)
My reaction is the same. What were they thinking? It's not just the coefficient of thermal expansion; the different materials must also have different thermal conductivity.
It works on a smaller scale but with a larger chip the expansion is larger so the misalignment becomes a larger problem. The chip designer failed to factor expansion in their design and the fabricator failed to inform them that it will be an issue. These separate engineering teams are working in different companies so miscommunication is also an issue.
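As a rough illustration of the scaling argument above, here is a minimal sketch of the free linear expansion mismatch between a silicon die and an organic substrate. The CTE values (about 2.6 ppm/K for silicon, about 15 ppm/K for an organic laminate), the 60 K temperature swing, and the function name are illustrative assumptions, not figures from the video or from Nvidia/TSMC.

```python
def expansion_mismatch_um(length_mm, cte_a_per_k, cte_b_per_k, delta_t_k):
    """Relative displacement in micrometers between two bonded materials
    over a span of length_mm, if each were free to expand on its own."""
    return length_mm * 1000.0 * abs(cte_a_per_k - cte_b_per_k) * delta_t_k

# Assumed, order-of-magnitude material values (not from the source)
SILICON_CTE = 2.6e-6   # 1/K
ORGANIC_CTE = 15e-6    # 1/K
DELTA_T = 60.0         # K, assumed swing between idle and full load

for length_mm in (10, 25, 50):
    shift = expansion_mismatch_um(length_mm, SILICON_CTE, ORGANIC_CTE, DELTA_T)
    print(f"{length_mm} mm span: ~{shift:.1f} um of mismatch")
```

The mismatch grows linearly with the span, which is why a layout that works at chiplet scale can misalign on a much larger package. Real packages warp and build up stress rather than expanding freely, so this is only a first-order estimate.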
@@kazedcat That may be true, but TSMC has whole teams of engineers just working on packaging; thermal expansion is fundamental to everything they do. I guess it’s possible TSMC wasn’t involved in the multi chip packaging using the interposer, maybe it was just a PC board guy that designed it. Still, thermal expansion is such a _basic_ fact of engineering life, it’s hard to understand how they could have overlooked it.
@@DaveEtchells TSMC provides design rules, but these design rules are based on some assumptions, like the size of the package. If this size limitation is not communicated properly, then the layout engineers at Nvidia could have followed the design rules without knowing that the rules are not valid for the package size they are designing.
Other than altering the materials to react the same to heat, the only idea I have is to encase the chips in a rigid structure to prevent expansion and/or have them under some amount of compressive stress to counteract deformation. But I'm not sure to what degree the expansion and contraction happens under max thermal stress, so it most likely will just make it fail faster. Imagine it was that simple...
They need to preheat the entire thing to a set temperature slightly above what they expect the normal operating temperature will be and keep it there instead of allowing it to heat up on its own. This most likely will require it to be immersed in a liquid of some sort that can maintain higher temperatures. They may need to design it at those temperatures.
Like pretensioning concrete bridge sections. They might be able to get away with building it at some intermediate temperature, so it can tolerate shipping and the occasional cooldowns, but really do well if left running constantly.
Having no defects at all on a wafer is quite unlikely. The larger any single chip gets, the more likely it is that it contains a defect. Hence, large chips have a worse yield and become more expensive per piece. A solution is dividing the design into smaller chips and mounting them to a common interposer. Cerebras did things differently: Their Wafer Scale Engine consists of many small processors and can tolerate the failure of a few processors. The WSE sort of routes around the damage.
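The yield argument above can be sketched with the simple Poisson yield model, Y = exp(-A * D0). The defect density of 0.1 defects/cm² and the die areas are made-up illustrative values, and the helper names are hypothetical; real foundries use more refined models.

```python
import math

def die_yield(area_cm2, defects_per_cm2):
    """Poisson model: probability that a die of this area has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_per_good_unit(die_area_cm2, dies_per_unit, defects_per_cm2):
    """Wafer area (cm^2) consumed per working product, assuming bad dies
    are screened out before packaging (known-good-die testing)."""
    return dies_per_unit * die_area_cm2 / die_yield(die_area_cm2, defects_per_cm2)

D0 = 0.1  # defects per cm^2, assumed for illustration

monolithic = silicon_per_good_unit(8.0, 1, D0)  # one large 8 cm^2 die
chiplets = silicon_per_good_unit(4.0, 2, D0)    # two 4 cm^2 dies per product
print(f"monolithic: {monolithic:.1f} cm^2 per good unit")
print(f"two chiplets: {chiplets:.1f} cm^2 per good unit")
```

Under these assumptions the split design wastes less silicon per working product. The sketch ignores packaging yield, which is exactly where multi-die designs pay their price.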
Excellent explanation, Anastasia! Thank you. I am following the developments in this space closely. Silicon-based chip technology seems to be rapidly reaching its limits. I know that SMIC, in close cooperation with Huawei and several universities, is working feverishly on the development of photonic chips for AI training and inferencing. Size is not a limiting factor here. My assumption is that the world will be presented with a fully functional system out of China within the next 24 months that allows for the development and operation of LLMs at a fraction of the cost and power consumption of current Nvidia products like the H100 or B200. Jensen Huang is certainly aware of this fact, and so are many investors.
True. IBM has been leading research on all-optical chips made of transistors which only use photons to switch on/off (not electric current). Promising nearly 1000x performance improvement and significant reduction in power consumption. IBM contributed significantly to the growth of the Chinese tech space.
Next step is to use microfluidics based heat dissipation. Impregnate the substrate with thousands of capillaries and pump a steady current of some refrigerant through them.
Current AI by Sam Altman is mostly brute force. Bigger and bigger models. It's a beta. The science is not ready. Ilya knows this. It's difficult to size the load. The current AI race to the cliff is a bonanza for Nvidia and others. Nvidia is a company specialized in seizing future marketing opportunities.
We cannot sustain this flippant pursuit of this ASI boondoggle and these proposals for super clusters. It will end badly from a water, food, or energy crisis, or perhaps all 3 simultaneously, i.e. a polycrisis, if humans don’t come to their senses.
Thermal issues, especially when combining multiple types of materials that must work together, are a huge problem. They have done well, but close doesn't count in mass production.
@@rasmusnorberg13 Nvidia is the incumbent; they started on third base. But their hardware efforts are being stunted by the lack of a solid chiplet strategy. Blackwell has already hit one delay because of this. And next year the MI350X will be on a 3nm node, a year before Nvidia will have their 3nm solution.
Computer chips stacked and housed inside of pressurized metal, heat conductive gas cylinders with external fan cooled fins and external data bus connections through the cylinders may help cooling efficiency e.g. similar to heat pipes.
To get around the thermal issues, they need to determine what the operating temp range is to avoid any permanent damage.. then design the water cooling technology to support it..
Clear, useful, interesting & all presented by someone practically skilled & passionate in these exciting technologies. Thank you! The issue I struggle to understand is whether there is enough sellable product being produced by the buyers of Nvidia chips to support ongoing purchases from Nvidia at the current rate. There may be some new breakthrough like Transformers that suddenly makes AI so useful that everyone must buy it, but as of now AI has become commodity-like, with much of the difference between the various offerings being the alignment with the philosophy of the designers rather than technical competence. A somewhat more extreme diversification than with web browsers at the beginning of the web & we know that many, like Netscape, did not survive. If we see a consolidation, the intense pressure that has driven Nvidia sales may wane. Thank you for sharing!
Hi Anastasi, thanks for another great video. A quick question, do they anneal the wafers post-fab? I do understand the stresses between the different materials. Deformation, delamination, etc.... Surely, annealing could solve these problems, whether done post-fab or during each stage of fabrication. It doesn't matter whether it's a hammer or a photon hitting the material, it's going to bend. Also, where do you live? I want to steal your Cerebras Chip!😉 I want one, just to hang on the wall! It looks gorgeous!!! Love ya work! Take care! ❤
I hadn’t thought about all the other types of elements used in a die, but I figured it was likely a thermal mechanical expansion issue. But now they’ve got materials with different coefficients of expansion stacked on each other, with critical tolerances. Congratulations, nVidia, you’ve designed the world’s most complex and expensive bimetallic thermostat! Heats up, it likely opens, until it cools back down. Hopefully it starts working again. Their expected reach exceeded their actual grasp, it sounds like.
Assembling packages at an elevated temperature midway between "room temperature" and peak operating temperature might both improve yield and reduce failure rates.
The most incredible thing about all of this to me is, the level of precision they are able to achieve! It boggles the mind that anything could be soldered in place to within a few microns on all planes, on this planet... Even with robotics! The earth constantly vibrating, along with the expansion and contraction of things, it's seemingly the impossible... Then to do it consistently, what a feat! Fascinating... The technical data this incredible woman serves you, you can take it to the bank! What a beautiful brain!!! 👍🏻
Just wait until robotics come online full scale. This will be the technology that powers the first interconnected robotic hive mind... The processing power will continue to progress until some new barrier is broken by ai. Which should lead to what it takes a building of GPUs to process now, to be handled by a processor that will fit in the palm of your hand... Fun stuff... Who knows beyond that?
The double-die architecture of the Blackwell GPU really shows how far we’ve come in chip design, but it also raises new challenges like thermal management. Exciting to think about where this will take AI workloads!
For solving thermal troubles: more copper and more silver for less silicon. No gold because it is very expensive now. I believe that interconnecting substrates may be unreliable when there is a micro-earthquake or vibration from external sources.
Need to develop a solid-state converter of excess heat/dissipation into electricity to offset most of the chip power load... leading to solution to chip overheating problem...
There will be always those trying to push the envelope to get more and using existing tech. Generally. I like the trend towards lower temperature computers. There seems to be a lot of slop in large scale integration. This leaves much to be desired if accuracy is needed. Regards
This reminds me of the packaging issues they had with Nvidia chips in the Xbox and PS3 that caused YLOD and a whole host of Nvidia GPU issues in other devices back in the day. There's a documentary on YouTube about the Nvidia chips in the PS3 that discusses it at great length. Manufacturing chips is a multi-country effort. I wonder how much it has to do with the current chip war and the havoc it's bringing.
Apple was rightfully praised for their efficient and innovative Ultra chip design. They figured out how to fuse multiple M chips together to be more powerful while requiring less energy.
Attempting to put more logic on the same wafer to package components closer together to increase communication speed also increases thermals and reduces yield because a single glitch ruins the entire wafer. The only solution may be slower modular inter-chip interconnects, which nvidia has used in the past, and perhaps on separate wafers. ...Or we can jettison the electrical pathway design methodology altogether and switch to optical adoption for use in the deep inner core logic which does not suffer from these kinds of issues.
@AnastasiInTech's video left me wondering two things--anyone have answers? (1) Even though different component coefficients of heating and expansion are almost certainly present on these huge chips, is there any evidence that they are a primary (or even significant) contributor to the problems with NVIDIA's GPU? (2) Even if NVIDIA increases its yield with a new die, won't the damage from heat-induced flexing take time to build up past the problems observed initially due to misalignment (if that is the problem, see question 1)? What do you think?
I guess that you either want to go wafer scale as you mentioned or make much smaller chiplets to minimise the effect of the temperature-related stress. If going with the chiplet design, maybe cooling the substrate better could help, either by having dummy copper lanes for cooling purposes only or by changing the substrate material and its thermal properties. These are just some guesses; it would be interesting to get some insights from this world into what the engineering solutions might look like.
They probably need to use carbon nanotubes to connect chips to each other. But that would take a lot of development. When working with wood, you have to plan for seasonal expansion and contraction. I'm surprised chip engineers thought they could just slap some chips on a substrate without considering heat expansion and contraction. (I'm sure I must have misunderstood something.)
If they do push this out, it will be interesting to see how robust these products are against thermal damage. With Intel having problems with some of their CPUs, are we getting to the point where the longevity of a chip will become as important as raw speed?
So is this why AMD's MCM design for RDNA3 had performance issues? It seems like AMD was projecting a much larger performance uplift from RDNA2 but the Radeon 7000 series at the high end only had half of the promised performance. It is very hard to get information regarding the interconnect technology used in RDNA3 MCM design.
I think you're spot on, Ms. A. The infamous coefficient of thermal expansion (CTE) mismatch is a pain in the a--. Fine analysis as usual. Concurrent engineer your process, guys.
Why don't they move to optical interconnects, hollow channels for lasers to go thru, between anything that's big/far enough that thermal expansion starts getting into play? Does the transducing process add too much lag? Can't make tiny enough laser emitters/sensors? Too costly?
Oh, so now it makes sense to me why the 5080 is essentially half the performance of the 5090: it's just a single chip, while the 5090 is more comparable to a dual-GPU card, but now with both chips on one package and without the use of an internal SLI bridge. That is kinda funny since this is what the 90 series once stood for; they were often cards with dual GPUs, like the 690. Back in the day it was quite common for Nvidia or AMD/ATI to offer a dual-GPU card as their halo product. They still required the use of SLI/CrossFire methods to work, so performance was hit or miss. That 5090 is going to be mighty expensive... like 3000+, I wouldn't be surprised.
Excess thermal buildup is indeed a challenge but that can be resolved. Do you remember the topic that you discussed earlier, the in-chip liquid cooling?
Thanks Anastasi. Great Nvidia engineering, as is usually the case. They just failed to give Mother Nature enough credit and she threw them a curve. I have confidence that they will find a way around her. It may be painful and could be suboptimal.
They should hook them together with "zebra strip" Yeah... that's the ticket! No, really... carbon nanotubes on a flexible film might remain attached above and below despite thermal shifts. They could mask and etch the nanotubes to be only where they want them to be. But they would act more like flexible wires than any firm mount would. The top and bottom remain connected while the thermals flex the film in the gap. So, maybe it only does 7 or 8 Tb/s instead of ten. What do you want, good grammar or good taste?
NVidia should change their chip design to manufacture both GPUs and the inter GPU interconnect together on a single die. This will greatly reduce yield but at least then the die would work. This is the approach with the M2 Ultra chip from Apple.
New chip manufacturing machine that is around the size of a shipping crate. You can build a warehouse and spam them in the available space. Then copy and paste the factory a few times and voilà, chips at scale.
That means that only datacenter gpus are affected? Based on that "A" variant Chip and lacking that chiplet bridge... consumer chips could be realised on time?
Well, what happens if this was used in a submerged system? Since the stress from heat sink mounts can accelerate this issue, a submerged system solution can help a lot, no?
Thank you for your dedication to reporting and analyzing advancing electronics technology. Gigawatt-scale electrical power consumption is predicted for very large data centers. Since almost all of that consumption ends up as heat, an undesirable resistive byproduct, plus the cooling systems needed to abate it, what is the possibility that, a decade or more from now, a new technology is developed that does not generate this heat, or reduces it to a millionth of what it is today, largely eliminating large-scale data center power consumption?
xoxoxooxoxo
Maybe because glass is literally silicon dioxide
Kinda back to the future.
Ceramic is so 90s.
The question is, when server & power-hungry solutions will move out from silicon... there are few prospective solutions on the horizon 🤔 15y?
It will happen at some point
Best videos, informative and in detail for non technical people!
...not only for non technical people!😉
This aligns perfectly with my desire to organize my finances prior to retirement. Could you provide me with access to your advisor?
She appears to be well-educated and well-read. I ran an online search on her name and came across her website; thank you for sharing.
Yeah there are even books from the 90s about it. It's nothing new as a concept and design, just manufacturing.
@@rattlehead999 yes, the devil is in the details of thermal mismatch with increasing power and shrinking dimensions.
Wow!
Explained better than many so-called tech channels.
Thank you.
On point, technically accurate and informative. Thank you for your quality work.
Wow very clearly presented - I understood this complex process with your very well done presentation.
There is a saying in precision machining:
On a small enough scale, everything becomes a thermal problem.
Or maybe a chemical problem.
At some point it becomes a quantum tunneling problem!
Even Guilloche?
This explanation is superb! Keep it up and with love from the Netherlands!
These connections between both sides of the GPU remind me of the corpus callosum, which holds the two hemispheres of the brain together.
The fact that concrete and steel have very similar rates of thermal expansion is why reinforced concrete is possible.
This girl is hypnotic. And on top of that her videos are very well made =)
In my 66yrs, I've noticed that smart, attractive women can be very 'enchanting'...especially if they have something in common like Computer Science.
Simp
Thank you again, Anastasi, for delivering great tech news! You never skip a beat.
I like the way you simplify complex topics. Also you are very easy on the eyes.
WOW. Anastasi is so good at explaining!
Thank you for this explanation!
How was I not subscribed..... am now, Anastasi.
Love your videos...I was an ASIC Engineer through the 80s and 90s. It's absolutely fascinating to witness the evolution of semiconductor technology.
Maybe the single photomask changed the pads for attaching the silicon bridges to improve packaging yield?
It's a great channel about technology.
I subscribed and I appreciate the Portuguese subtitles.
I'm from Brazil.
Wonderful!
You make it so easy to understand
Keep going👍👍
Thank you for these videos by the way always enjoy them
I absolutely love your videos. Thank you so much for continuing to make them. I find them fascinating and love the way you explain it to us 🥰
Nice, easy-to-understand video! 👍
Interesting in-depth analysis of the GPU. Heat dissipation of the heat generated by the processor is quite challenging given the size of the GPU and the use of different materials. This also raises the question of reliability and this product's fault-free performance (durability, useful life, maintenance, etc.).
Very interesting and well researched
great explanation, thanks for your work!
I’m no packaging engineer, but as soon as I heard the word “organic” for the interposer I started wondering about problems with differing thermal coefficients.
What I’m curious about is why would Nvidia and TSMC think they could make it work in the first place?
Differences in thermal expansion rates are so fundamental that they must have thought they had some way of coping with them, either by coming up with a material for the interposer that magically has the same thermal coefficient as silicon, or by somehow limiting the thermal excursion with amazing heat sinking capability. - But 1,700 watts/chip TDP is going to get pretty warm almost no matter what you do. Even if you had some kind of active phase-change cooling, just the thermal resistance to get the heat out of the package is going to result in a good bit of temperature rise.
Does anyone in the comments have any ideas about or knowledge of advanced techniques or materials that would lead Nvidia and TSMC to think they could actually do this? It seems like a fool’s errand to me, to go away from a silicon interposer, but IANAPE (I am not a packaging engineer), so there may very well be things I’m not aware of.
(Great vid as usual Anastasi, you did a great job of tracing the evolution and explaining the likely cause of the problems. Great thumbnail too 😂)
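A back-of-the-envelope version of the temperature-rise point above, assuming the simple steady-state relation ΔT = P × R_th. The 1,700 W figure is the one quoted in the comment; the thermal resistance values are made-up placeholders for illustration, not measured Blackwell numbers.

```python
def temperature_rise_k(power_w: float, r_th_k_per_w: float) -> float:
    """Steady-state temperature rise across a thermal path: dT = P * R_th."""
    if power_w < 0 or r_th_k_per_w < 0:
        raise ValueError("power and thermal resistance must be non-negative")
    return power_w * r_th_k_per_w


if __name__ == "__main__":
    power_w = 1700.0  # figure quoted in the comment, not a verified spec
    # Hypothetical junction-to-coolant resistances in K/W (assumed values).
    for r_th in (0.01, 0.03, 0.05):
        rise = temperature_rise_k(power_w, r_th)
        print(f"R_th = {r_th:.2f} K/W -> rise of about {rise:.0f} K")
```

Even with a very low assumed resistance, the rise is tens of kelvin, which is the commenter's point: the package will see a sizable temperature swing whatever the cooler does.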
My reaction is the same. What were they thinking? It's not just the coefficient of thermal expansion, but the different material must have different thermal conductivity.
It works on a smaller scale but with a larger chip the expansion is larger so the misalignment becomes a larger problem. The chip designer failed to factor expansion in their design and the fabricator failed to inform them that it will be an issue. These separate engineering teams are working in different companies so miscommunication is also an issue.
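A rough sketch of the scaling argument above, using the free-expansion formula ΔL = Δα × L × ΔT. Silicon's CTE is about 2.6 ppm/K; the 15 ppm/K value for an organic substrate, the 60 K swing, and the spans are assumptions picked for illustration. Real packages are mechanically constrained, so the mismatch shows up as stress and warpage rather than free sliding.

```python
def mismatch_um(span_mm: float, cte_a_ppm: float, cte_b_ppm: float,
                delta_t_k: float) -> float:
    """Unconstrained differential expansion between two materials, in micrometres.

    ppm/K * mm * K gives 1e-6 mm, which is 1e-3 micrometres.
    """
    return abs(cte_a_ppm - cte_b_ppm) * span_mm * delta_t_k * 1e-3


if __name__ == "__main__":
    SILICON_PPM = 2.6    # approximate CTE of silicon
    ORGANIC_PPM = 15.0   # assumed value for an organic substrate
    DELTA_T_K = 60.0     # assumed temperature swing
    for span_mm in (10.0, 20.0, 40.0):  # hypothetical spans across the package
        d = mismatch_um(span_mm, SILICON_PPM, ORGANIC_PPM, DELTA_T_K)
        print(f"{span_mm:.0f} mm span -> about {d:.1f} um of mismatch")
```

The mismatch grows linearly with the span, so quadrupling the distance between the farthest bumps quadruples the misalignment those bumps have to absorb.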
@@kazedcat That may be true, but TSMC has whole teams of engineers just working on packaging; thermal expansion is fundamental to everything they do.
I guess it’s possible TSMC wasn’t involved in the multi chip packaging using the interposer, maybe it was just a PC board guy that designed it. Still, thermal expansion is such a _basic_ fact of engineering life, it’s hard to understand how they could have overlooked it.
@@DaveEtchells TSMC provides design rules, but these design rules are based on some assumptions, like the size of the package. If this size limitation is not communicated properly, then the layout engineers at Nvidia could have followed the design rules not knowing that the rules are not valid for the packaging size they are designing.
Other than altering the materials to react the same to heat, the only idea I have is to encase the chips in a rigid structure to prevent expansion and or have them under some amount of compressive stress to counteract deformation. But I'm not sure to what degree the expansion and contraction happens under max thermal stress so it most likely will just make it fail faster. Imagine it was that simple...
Explained this way, I'm surprised they ever build a working Blackwell GPU.😓
Good video. Very well explained and understood. Thanks Anastasi.!
My idea would be pre-designing the assembly to work at a specific temperature, and making sure that during operation this temperature is held constant.
They need to preheat the entire thing to a set temperature slightly above what they expect the normal operating temperature will be and keep it there instead of allowing it to heat up on its own. This most likely will require being immersed in a liquid of some sort that can maintain higher temperatures. They may need to design it at those temperatures.
I just had the same idea ;-)
Like pretensioning concrete bridge sections. They might be able to get away with building it at some intermediate temperature, so it can tolerate shipping and the occasional cooldowns, but really do well if left running constantly.
Thank you Anastasi for your professionalism on this AI technology. 🤖🖖🤖🇮🇹🇺🇸❤️
Superb, I was doing research on it with a significant level of understanding the issue till this video popped up .
Cerebras: first time?
I mean that's literally the big thing Cerebras solved with their wafer scale approach.
You're late.
Having no defects at all on a wafer is quite unlikely. The larger any single chip gets, the more likely it is that it contains a defect. Hence, large chips have a worse yield and become more expensive per piece. A solution is dividing the design into smaller chips and mounting them to a common interposer.
Cerebras did things differently: Their Wafer Scale Engine consists of many small processors and can tolerate the failure of a few processors. The WSE sort of routes around the damage.
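The yield argument in this thread can be sketched with the simple Poisson model, Y = exp(-A × D0), which is one common textbook approximation and not necessarily what any fab uses. The defect density and die areas below are assumed values. The second function is a crude stand-in for the "route around the damage" idea: it assumes each defect disables one small processor and that the design can tolerate up to a fixed number of them.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def yield_with_spares(area_cm2: float, defects_per_cm2: float,
                      tolerated_defects: int) -> float:
    """Probability of at most `tolerated_defects` defects (Poisson CDF)."""
    lam = area_cm2 * defects_per_cm2
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(tolerated_defects + 1))


if __name__ == "__main__":
    D0 = 0.1  # assumed defect density in defects per cm^2
    for area in (2.0, 4.0, 8.0):  # hypothetical die areas in cm^2
        print(f"{area:.0f} cm^2 die: yield ~ {poisson_yield(area, D0):.2f}")
    # Same large area, but tolerating a handful of dead processors.
    print(f"8 cm^2 with 5 spares: ~ {yield_with_spares(8.0, D0, 5):.3f}")
```

Under these assumptions the large monolithic die loses yield quickly, while the same area with a few tolerated defects stays close to fully usable, which is the trade-off both replies describe.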
Excellent explanation, Anastasia! Thank you. I am following the developments in this space closely. Silicon-based chip technology seems to be rapidly reaching its limits. I know that SMIC, in close cooperation with Huawei and several universities, is working feverishly on the development of photonic chips for AI training and inferencing. Size is not a limiting factor here. My assumption is that the world will be presented with a fully functional system out of China within the next 24 months that allows for the development and operation of LLMs at a fraction of the cost and power consumption of current Nvidia products like the H100 or B200. Jensen Huang is certainly aware of this fact, and so are many investors.
True. IBM has been leading research on all-optical chips made of transistors which only use photons to switch on/off (not electric current). Promising nearly 1000x performance improvement and significant reduction in power consumption. IBM contributed significantly to the growth of the Chinese tech space.
Next step is to use microfluidics based heat dissipation. Impregnate the substrate with thousands of capillaries and pump a steady current of some refrigerant through them.
very good, not a lot of people can break down technology and explain it like this.
Wow! I understood everything you said. And it's on substrates of computer chip manufacturing. Never thought I'd listen.
1KW for a single chip? Our poor planet!!!
😂 They are making nuclear reactors
Current AI by Sam Altman is mostly brute force. Bigger and bigger models. It's a beta. The science is not ready. Ilya knows this. It's difficult to size the load. The current AI race to the cliff is a bonanza for Nvidia and others. Nvidia is a company specialized in seizing future marketing opportunities.
GPU cum water kettle. Produce boiling water and steam as you play video games. Make tea and dinner as you play.
We cannot sustain this flippant pursuit of this ASI boondoggle and these proposals for super clusters. It will end badly from a water, food, or energy crisis, or perhaps all 3 simultaneously, i.e. a polycrisis, if humans don’t come to their senses.
It could easily lead to higher prices for electrical generation and distribution.
@@jrwilliams4029
Thermal issues especially as going to multiple types of materials that work together is a huge issue. They have done well, but close doesn't count in mass production.
AMD is years ahead of Nvidia when it comes to chiplets. Nvidia is just now starting to use chiplets, while AMD has been using them for years.
AMD has many patents for doing this. Nvidia might need to buy from AMD.
So you're saying that Nvidia, who's "just starting" with chiplets are beating AMD at their own game? Sounds very good for Nvidia's future.
@@rasmusnorberg13 Nvidia is the incumbent, they started on the 3rd base. But their hardware efforts are being stunted by lack of a solid chiplet strategy. Blackwell has already hit one delay because of this. And next year mi350x will be on 3nm node, a year before Nvidia will have their 3nm solution.
Go with super conducting materials for connectors. Cryogenic will eliminate heat.
Interesting! I had watched the launch event for Blackwell. Hopefully this manufacturing problem gets resolved.
Computer chips stacked and housed inside of pressurized metal, heat conductive gas cylinders with external fan cooled fins and external data bus connections through the cylinders may help cooling efficiency e.g. similar to heat pipes.
Gosh, I love your young voice. Thanks for all your coverage, but especially this one because I am invested in NVIDIA.
To get around the thermal issues, they need to determine what the operating temp range is to avoid any permanent damage.. then design the water cooling technology to support it..
Your channel is a gem. Thank you
Clear, useful, interesting & all presented by someone practically skilled & passionate in these exciting technologies. Thank you! The issue I struggle to understand is whether there is enough sellable product being produced by the buyers of Nvidia chips to support ongoing purchase from Nvidia at the current rate. There may be some new breakthrough like Transformers that suddenly makes AI so useful that everyone must buy it, but as of now AI has become commodity-like, with much of the difference between the various offerings being the alignment with the philosophy of the designers rather than technical competence. A somewhat more extreme diversification than with web browsers at the beginning of the web & we know that many, like Netscape, did not survive. If we see a consolidation, the intense pressure that has driven Nvidia sales may wane. Thank you for sharing!
Man I just got a 4060 and it pushes everything extremely well at like 115W max. The card is tiny. It just amazes me.
People simply undervolt and underclock 3090 and such to get 30% or more lower energy use
I always feel smarter after listening to you.
Hi Anastasi, thanks for another great video. A quick question, do they anneal the wafers post-fab? I do understand the stresses between the different materials. Deformation, delamination, etc.... Surely, annealing could solve these problems, whether done post-fab or during each stage of fabrication. It doesn't matter whether it's a hammer or a photon hitting the material, it's going to bend.
Also, where do you live? I want to steal your Cerebras Chip!😉 I want one, just to hang on the wall! It looks gorgeous!!!
Love ya work! Take care! ❤
Anastasia, what do you think about Sohu? How realistic is this project from the technological standpoint?
I hadn’t thought about all the other types of elements used in a die, but I figured it was likely a thermal mechanical expansion issue.
But now they’ve got materials with different coefficients of expansion stacked on each other, with critical tolerances.
Congratulations, nVidia, you’ve designed the world’s most complex and expensive bimetallic thermostat! Heats up, it likely opens, until it cools back down. Hopefully it starts working again.
Their expected reach exceeded their actual grasp, it sounds like.
Assembling packages at an elevated temperature midway between "room temperature" and peak operating temperature might both improve yield and reduce failure rates.
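The arithmetic behind this suggestion is simple enough to sketch. Assuming the package is stress-free at its assembly temperature, the worst-case swing it ever sees is the larger of the distances to the coldest and hottest temperatures in service. The 25 °C and 95 °C endpoints are assumed round numbers, not Blackwell specifications.

```python
def worst_case_excursion_k(assembly_c: float, coldest_c: float,
                           hottest_c: float) -> float:
    """Largest temperature distance from the stress-free assembly point."""
    if coldest_c > hottest_c:
        raise ValueError("coldest_c must not exceed hottest_c")
    return max(abs(hottest_c - assembly_c), abs(assembly_c - coldest_c))


if __name__ == "__main__":
    ROOM_C, PEAK_C = 25.0, 95.0  # assumed service temperature range
    midpoint_c = (ROOM_C + PEAK_C) / 2
    for assembly_c in (ROOM_C, midpoint_c, PEAK_C):
        swing = worst_case_excursion_k(assembly_c, ROOM_C, PEAK_C)
        print(f"assembled at {assembly_c:.0f} C -> worst-case swing {swing:.0f} K")
```

Assembling at the midpoint halves the worst-case swing compared with assembling at either end, at the cost of the package being slightly stressed both when cold and when hot.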
You explain it very well for the layman to understand.
The most incredible thing about all of this to me is, the level of precision they are able to achieve! It boggles the mind that anything could be soldered in place to within a few microns on all planes, on this planet... Even with robotics! The earth constantly vibrating, along with the expansion and contraction of things, it's seemingly the impossible... Then to do it consistently, what a feat! Fascinating...
The technical data this incredible woman serves you, you can take it to the bank! What a beautiful brain!!! 👍🏻
Just wait until robotics come online full scale. This will be the technology that powers the first interconnected robotic hive mind... The processing power will continue to progress until some new barrier is broken by ai. Which should lead to what it takes a building of GPUs to process now, to be handled by a processor that will fit in the palm of your hand...
Fun stuff...
Who knows beyond that?
The double-die architecture of the Blackwell GPU really shows how far we’ve come in chip design, but it also raises new challenges like thermal management. Exciting to think about where this will take AI workloads!
For solving thermal troubles: more copper and more silver for less silicon. No gold because it is very expensive now. I believe that interconnecting substrates may be unreliable when there is a micro-earthquake, such as vibration from external sources.
Subscribed. Love these videos.
Need to develop a solid-state converter of excess heat/dissipation into electricity to offset most of the chip power load... leading to solution to chip overheating problem...
There will be always those trying to push the envelope to get more and using existing tech. Generally. I like the trend towards lower temperature computers. There seems to be a lot of slop in large scale integration. This leaves much to be desired if accuracy is needed. Regards
This reminds me of the packaging issues they had with nvidia chips in the xbox and PS3 that caused YLOD and a whole host of NVIDIA GPU issues in other devices back in the day.
There's a documentary on the Nvidia chips in the PS3 on YouTube that discusses it at great length.
Manufacturing chips is a multi-country effort.
I wonder how much it has to do with the current chip war and the havoc it's bringing.
Apple was rightfully praised for their efficient and innovative Ultra chip design. They figured out how to fuse multiple M chips together to be more powerful while requiring less energy.
Attempting to put more logic on the same wafer to package components closer together to increase communication speed also increases thermals and reduces yield because a single glitch ruins the entire wafer. The only solution may be slower modular inter-chip interconnects, which nvidia has used in the past, and perhaps on separate wafers. ...Or we can jettison the electrical pathway design methodology altogether and switch to optical adoption for use in the deep inner core logic which does not suffer from these kinds of issues.
@AnastasiInTech's video left me wondering two things--anyone have answers? (1) Even though different component coefficients of heating and expansion are almost certainly present on these huge chips, is there any evidence that they are a primary (or even significant) contributor to the problems with NVIDIA's GPU? (2) Even if NVIDIA increases its yield with a new die, won't the damage from heat-induced flexing take time to build up past the problems observed initially due to misalignment (if that is the problem, see question 1)? What do you think?
I guess that you either want to go wafer scale as you mentioned or make much smaller chiplets to minimise the effect of the temperature-related stress.
If going with the chiplets design maybe having the substrate being cooled better could help, either by having dummy copper lanes for cooling purposes only or change the substrate material and its thermal properties.
These are just some guesses, would be interesting to get some insights from this world, how the engineering solutions might look like.
Wouldn't that create more latency though?
They are developing a glass substrate to reduce the thermal expansion issue. Also, you can mitigate the problem by designing fewer, larger vias.
Keep it cool. Greatly Enjoy the vids.
They probably need to use carbon nanotubes to connect chips to each other. But that would take a lot of development. When working with wood, you have to plan for seasonal expansion and contraction. I'm surprised chip engineers thought they could just slap some chips on a substrate without considering heat expansion and contraction. (I'm sure I must have misunderstood something.)
The only thing that seems to be somewhat working at Intel is EMIB. Maybe Nvidia should package Blackwell there lol.
If they do push this out, it will be interesting to see how robust these products are against thermal damage. With Intel having problems with some of their CPUs, are we getting to the point where longevity of a chip will become as important as raw speed?
So is this why AMD's MCM design for RDNA3 had performance issues?
It seems like AMD was projecting a much larger performance uplift from RDNA2 but the Radeon 7000 series at the high end only had half of the promised performance.
It is very hard to get information regarding the interconnect technology used in RDNA3 MCM design.
They need to add a heater to maintain minimum temperature, and dynamically move the workload around to lower the temperature on hot spots. Or not.
I think you're spot on, Ms. A. The infamous coefficient of thermal expansion (CTE) mismatch is a pain in the a--. Fine analysis as usual. Concurrent engineer your process, guys.
I'm optimistic about the future of Nvidia too :) Understatement.
this mismatch of CTE is well known and should be addressed with uniform cooling but they will have tremendous difficulty to overcome this..
Been subscribed, love the content and have a crush.
Why don't they move to optical interconnects, hollow channels for lasers to go thru, between anything that's big/far enough that thermal expansion starts getting into play? Does the transducing process add too much lag? Can't make tiny enough laser emitters/sensors? Too costly?
Oh, so now it makes sense to me why the 5080 is essentially half the performance of the 5090: it's just a single chip, while the 5090 is more comparable to a dual GPU card, but now with both chips on one package and without the use of an internal SLI bridge. That is kinda funny since this is what the 90 series once stood for; they were often cards with dual GPUs, like the 690. Back in the day it was quite common for nVidia or AMD/ATI to offer a dual GPU card as their halo product. They still required use of SLI/CrossFire methods to work, so performance was hit or miss. That 5090 is going to be mighty expensive... like 3000+, I wouldn't be surprised.
Excess thermal buildup is indeed a challenge but that can be resolved. Do you remember the topic that you discussed earlier, the in-chip liquid cooling?
Thanks Anastasi. Great NVidia engineering as is usually the case. They just failed to give Mother Nature enough credit and she threw them a curve. I have confidence that they will find a way around her. It may be painful and could be suboptimal.
They should hook them together with "zebra strip" Yeah... that's the ticket! No, really... carbon nanotubes on a flexible film might remain attached above and below despite thermal shifts. They could mask and etch the nanotubes to be only where they want them to be. But they would act more like flexible wires than any firm mount would. The top and bottom remain connected while the thermals flex the film in the gap. So, maybe it only does 7 or 8 Tb/s instead of ten. What do you want, good grammar or good taste?
NVidia should change their chip design to manufacture both GPUs and the inter GPU interconnect together on a single die. This will greatly reduce yield but at least then the die would work. This is the approach with the M2 Ultra chip from Apple.
Just subbed - great video!
New chip manufacturing machine that is around the size of a shipping crate. Can build a warehouse and spam them in the available space. Then copy and paste the factory a few times and voilà, chips at scale.
That means that only datacenter gpus are affected? Based on that "A" variant Chip and lacking that chiplet bridge... consumer chips could be realised on time?
Can you make a video about the Intel microcode 0x129 problem of the 13th and 14th generation processors?
yup - this is why all the finance bros were like "why is the stock down?" - they had no idea what was going on. Nvidia has basically already canceled the H100
If an acronym doesn't actually save any syllables it's not real
So Wise , Thank You
Well, what happens if this was used in a submerged system? Since the stress from heat sink mounts can accelerate this issue, a submerged system solution can help a lot, no?
Thank you for your dedication to reporting and analyzing advancing electronics technology. Large gigawatt electrical power consumption is predicted for super large scale data centers. Since the electrical consumption is almost all due to generating heat as a resistive undesirable byproduct and the cooling system to abate it, what is the possibility of new technology being developed a decade or more from now that does not have this heat-generated byproduct, or has it reduced to a millionth of what it is today, largely eliminating large-scale data center electric power consumption?