@@der8auer-en , I have seen a few cases where a user loaded factory-default BIOS settings, the pump began to run in PWM mode and, for example, a 5950X shut down at 115. As far as I know, Intel and AMD both have this limit at 115 degrees for safety. If the processor gets this hot, the MB must turn off the system for safety.
95 degrees is the maximum temperature at which Ryzen 7000 CPUs can run 24×7 without any damage. When you reach 95 the CPU begins throttling to keep the temps in check, but it won't shut down. The thermal shutdown temp for the Ryzen 7000 series is 115 degrees. Only when the CPU reaches this temperature even after throttling will it shut down to protect itself. A great video as always, but I would also really like to know how this CPU died with all the protection mechanisms in place.
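A rough sketch of the two thresholds described in the comment above, in Python. The function name and the exact comparison (at-or-above) are assumptions for illustration only; this is not AMD's firmware logic, just the 95/115 numbers from the thread written out.

```python
THROTTLE_C = 95.0   # throttle point quoted in the comment above
SHUTDOWN_C = 115.0  # thermal shutdown point quoted in the comment above


def thermal_state(temp_c: float) -> str:
    """Classify a die temperature (hypothetical helper, not real firmware)."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"   # last-resort protection: power is cut
    if temp_c >= THROTTLE_C:
        return "throttle"   # clocks are reduced to hold the limit
    return "normal"


for t in (70, 95, 110, 115):
    print(t, thermal_state(t))
```

The point of the thread is that a dead CPU should never get past the "shutdown" branch, which is why a melted IHS solder is so surprising.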
What about heat degradation? While it can probably survive warranty, how long will it last after that? Not to mention that these run hot just for the sake of catching Intel clocks, and there wasn't any need to push them so far. Eco mode limited at 105W, while lowering clocks, only lowers performance a little, but drops temperature quite a lot. It's clear AMD pushed heavily into the realm of diminishing returns.
@@Morpheus-pt3wq what's your point? 95 is a whole 5 from 100, which many Intel processors have hit for years already; it's more OK than you think. And even then it's most likely to hit these temps only under sustained all-core loads. Even with PB, as most users use it, it won't even get close.
@@Morpheus-pt3wq Degradation happens due to voltages, not temperature alone. It's just that higher voltages are associated with higher temperatures. The hot parts are a direct result of the thermal density, which is bound to only get worse.
@@Morpheus-pt3wq a very important point that you've brought up. The recent AMD and Intel CPUs run extremely hot, sacrificing efficiency for performance. No matter how much AMD and Intel say that 90 degrees is a safe temperature, I will always be skeptical because a CPU that runs so hot may have a shorter lifespan. Let's wait for 3 years and see how many Ryzen 7000 X CPUs we see failing. By contrast, the 7600 and 7700 non-X run way cooler even with PBO enabled if provided with sufficient cooling.
Could the metal be contaminated being an alloy with a lower melting point? You could check the melting point of the blob. BTW: you did an amazing work!
Even if the solder melted at operating temperature, the CPU has so many temperature sensors that it should have noticed the local overheating from missing STIM. The melting seems like a symptom rather than a cause.
I'm pretty sure that he would have gotten a replacement from the retailer or AMD if he had RMA'd it. As was demonstrated, it is supposed to throttle at 95C and shut down at 115C. Even if you take off the cooler, the CPU should simply shut down instead of melting. So they did what they should have done if they took his CPU and made it impossible to RMA.
Well, they likely have AMD connections and this is essentially a roundabout way of doing a warranty claim. This does benefit AMD, after all: they get to see the components without having to do the work.
When I started my IT career in the late 90s, I accidentally overclocked a Cyrix processor by putting it in a board with its jumpers set for a much faster Intel CPU. It got so hot, so quickly, the ceramic package cracked. This is the first time I've seen anything similar.
In the old days processors didn't have thermal protection. I had some crack, and when experimenting by just removing the cooler they got up to over 350c really quickly. This one I suspect shorted; could it have been from over voltage? The hotspots kind of point in that direction.
you know they watch his channel anyways tho right? even if he doesnt inform them, my guess would be someone is worried about losing their job cuz o the blob as we speak
The AMD engineer would likely say "customer likely disabled thermal shutdown." As for the CPU reaching 115C without a cooler, the engineer would say that is the normal thermal shutdown temp for Ryzen CPUs. 95C is when the chip begins throttling, 115C is the thermal shutdown temp.
@@K31TH3R Aren't there some ways to find out if that's truly the case? Like a certain area of fuses burned within the chip, or some flags that could be read out in some manner? I mean, that's a pretty stark contrast.
@@TheTekknician The problem with both thermal fuses and temperature sensors at this scale of process node is that their presence alone will impact the accuracy of local junction temperature readings. So in a lower-cost consumer CPU where determining liability for damage is not all that important, I wouldn't see a thermal fuse being beneficial to include if the CPU's management engine is proven and reliable at appropriately shutting the CPU down, as long as it has not been disabled. There's usually at least a 10-15% temperature overhead between thermal shutdown and actual die damage, so I'd expect the point where CPU damage begins occurring would be in the 125-140C range, which is around 10-20C lower than the melting point of the IHS-to-die solder. This means if the IHS solder has migrated, you have a strong indicator that the chip had a thermal event and failed to shut down, and that's really all you could conclude from either a destroyed thermal fuse or melted IHS solder. From the manufacturer's perspective, it would likely not be difficult to prove with some degree of certainty that the customer disabled the thermal limit, but if you want to start pointing that finger you'd be very petty and burning bridges in doing so. If you're just trying to determine liability from the consumer's perspective, we have plausible deniability on our side, simply because AM5 is still a relatively new platform and it's well known that there are BIOS bugs coming out of every AIB.
@@OsX86H3AvY No one would lose their job. This was a manufacturing error, which is really rare but is to be expected when you make so many. The engineers themselves don't manually inspect the CPUs, so they weren't at fault.
Perhaps some if not all of the thermal probes inside the CPU were malfunctioning? And then PBO just kept increasing the voltage as it saw the good temperatures being reported by the CPU, leading to it killing itself, with the cooler unable to cool the hotspots.
the vrm will NEVER raise the voltage above set value regardless of temperature. following your logic LN2 overclocking wouldn't be possible because at those temperatures all sensors are providing bogus values.
It seems a lot more likely it suffered from a latchup, and there's nothing the thermal management could do to stop the current from overheating the chip and melting the solder over the fault.
@@mycosys yeah that's what I think happened too. Basically the chip failed in some way (thermally or otherwise) that specifically held an electrical connection open, it thermally ran away burning away transistors on the die and just replacing them with a solid circuit, the cpu was already dead at this point so it's neither going to tell anyone about its temp nor control itself but power was still flowing, and for the microseconds it reached temps to melt the indium and the pressure of the lid-sandwich squeezed it out the side. Given the craters the cooler was in working order and cooled the cpu back down before the remaining indium could flow back in evenly when the mobo killed the power to the cpu which was no longer signaling correctly
@@mycosys you still have the "cpu hot" signal for that case. it's an open-drain line, so any component connected to it can assert it (cpu, pch, sio...), and when asserted the VRM shuts down. so there must have been a fault on the MB as well.
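To illustrate the open-drain idea from the comment above, here is a minimal Python sketch. The device names come from the comment; the active-low polarity, the pull-up behaviour and both function names are assumptions for illustration, not a description of any specific motherboard.

```python
from typing import Mapping


def line_level(pulling_low: Mapping[str, bool]) -> int:
    """Level of a shared open-drain line (assumed active-low).

    Each device can only pull the line low or release it. A pull-up
    resistor keeps the line high (1) when nobody is pulling.
    """
    return 0 if any(pulling_low.values()) else 1


def vrm_enabled(pulling_low: Mapping[str, bool]) -> bool:
    """Hypothetical rule: the VRM stays on only while the line is high."""
    return line_level(pulling_low) == 1


# Even with a dead CPU that no longer signals, another device can still trip it.
print(vrm_enabled({"cpu": False, "pch": False, "sio": False}))
print(vrm_enabled({"cpu": False, "pch": False, "sio": True}))
```

The design choice being described is a wired-OR: no single chip has to be alive for the shutdown to fire, which is why the comment suspects a board fault too.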
A more plausible guess would be that the CPU actually failed short, which, as always, caused an uncontrolled/runaway temperature increase, and in this scenario the CPU's internal temp protection can do absolutely nothing to prevent it (a short is a short, temperatures will rise up to hundreds of degrees, sometimes even thousands).
@@Schwuuuuup the main issue w/ that theory is that two separate dots were shown, it's very unlikely to fail at two distinct locations at the same time.
A simple thing like dust, or even a dead skin cell run amok that somehow got in through the vents and found its way to any part of the CPU during the manufacturing process, is possible, and the laser could even have broken a particle from the mirror and lasered it in. So many possibilities...
Based on experience with laptops, the processor may have failed first. It short circuited, but the protections on the motherboard did not trigger properly. Therefore, an indium melt occurred. On a faulty gaming laptop board, a shorted CPU or GPU can reach temperatures of 200-250 degrees C under very favourable conditions. It can literally solder itself out.
So interesting. Also the absolute thickness of that IHS! I haven't delidded yet(waiting for delidder restock in the US) so I knew that it was thicker than normal but by cutting away a part of it you can truly get a feel for it.
Your failure analysis and investigation videos are definitely some of the most interesting and unique content in this space. Also, I did not realize how thick the new Ryzen IHSs were until you milled away half of it. That's a ton of material between the cooler and the heat source.
I'm still leaning towards a manufacturing defect alongside normal Ryzen 7000 behavior. I'm not entirely convinced that the embedded thermal protection is monitoring every core or component; maybe it's not looking at individual core values. My theory is that it was in fact a manufacturing defect where a small part of the CCD, one not covered by thermal protection, had a void of flux and therefore of indium, causing that tiny specific spot to exceed 160C and melt the indium surrounding it, which in turn transferred heat to more surrounding indium until it flowed out, shorted surface mount components and killed the chip. Voids in cooling are very bad, as they also represent an area of air (any gas, honestly). This tiny spot of air will increase in pressure as it heats, pushing in all directions looking for a way to escape, and if that void is allowing the indium to melt surrounding indium, it seems feasible that it would push the liquid indium until it melted to an edge and you'd see a blowout of indium as we saw here. My best guess is that this was a flux application failure, evidenced by the strip of indium between the CCD and I/O die that we know shouldn't be there. That misapplication of flux may have drawn solder away from the CCD, or just caused poor adhesion in some areas. I believe the flux is "printed" or silkscreened on before the chip passes inverted through an indium solder bath, so any areas with poor flux coverage could become a void, while any areas of the IHS that got flux and should not have would draw solder from other areas during the high temperature IHS application completion step. Or, at least, that's my theory, I don't manufacture CPUs 😮
Maybe there were holes in the indium before (defective application) and only a small part of the die got waaay too hot, which would have built pressure in the air in the pocket, melted the indium and maybe ultimately catapulted some of the indium off the die, both further increasing the temperature and maybe shorting something. The thing is, as the die has more than one temperature sensor, but doesn't have an unlimited amount, a small part of the die can easily get hot, while the CPU never notices.
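For a feel of how much pressure a trapped air pocket could build, here is a back-of-the-envelope Python sketch using the ideal gas law at constant volume. The starting conditions (1 atm, 25°C) and the function name are assumptions for illustration; a real void formed during soldering would start from different conditions.

```python
def trapped_gas_pressure_atm(p1_atm: float, t1_c: float, t2_c: float) -> float:
    """Pressure of a fixed-volume ideal gas after heating (Gay-Lussac's law).

    P2 = P1 * T2 / T1, with temperatures converted to kelvin.
    """
    t1_k = t1_c + 273.15
    t2_k = t2_c + 273.15
    return p1_atm * t2_k / t1_k


# Assumed example: a pocket sealed at 1 atm and 25°C, heated to 160°C.
print(round(trapped_gas_pressure_atm(1.0, 25.0, 160.0), 2))  # about 1.45 atm
```

Under these assumptions the pocket gains less than half an atmosphere, so on its own the gas pressure looks modest; it would matter mostly once the surrounding indium is already soft or liquid.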
@@shaunjennings4609 You would have seen those on the spreader though as he disassembled it. If you mean the smears of indium on the cuts of the spreader, that's most likely just from the CNC spreading the soft metal everywhere while cutting.
I can say something is off with the protection mechanism on the 7000 series, or the motherboard. It happened to me once that a 7950X let out the magic smoke during the first boot (without heatsink, no protection kicked in, but it also should not exceed 50-ish Celsius). It has been a while since I saw this kind of flaw, since 2010 I think.
G'day Roman, 😲WOW! thank you for this video, it was really Awesome to get a look at the cut away of the Heat Spreader, Also very interesting Temperature Content information, very strange situation 🤷‍♂️
Would be interesting to reach out to AMD if they have any way to still check the CPU on what exactly happened, what failed and what caused it to reach such high temps in the first place. The damage on the CPU isn't something that can't be fixed in terms of resoldering SMDs, but would still be great to check if anything can be "extracted" by AMD to determine what could have happened.
@@randomzocker8956 Intel uses their IME, Intel Management Engine. AMD has something similar, don't know the name, but every processor these days has a tiny RISC microprocessor that handles quite a few things. There's an article on that you can check out. But yes, Intel has had it for a long time, AMD probably as well, since it, well, manages the CPU itself. I'd find it hard to believe it has no way to log specific things the CPU does.
@@NecroFlex I doubt that it would have any sort of memory that would store a log. Doing so wouldn't really serve much of a useful purpose for 99% of use cases, so I don't think they would bother making something like that.
jesus, everyone has been complaining about how thick ryzen 7000's IHS is for the longest time, and I couldn't really get what they mean until I watched this video. God dang that's some thick copper on top of the die right there.
It's possible the user had some third party tool installed for fan control/temp control and possibly tried to send a kill command to the system but the system already hung/froze so it just kept pumping voltage into it as the power supply didn't get a shutdown signal.
There's a few of these CPU meltdown stories about, apparently. Makes me feel not so annoyed that I built my PC a while before these AM5 Zen 4 CPUs came out, so I have a Ryzen 9 5900X and RX 6800 build, which I am very happy with. And maybe I dodged a bullet not going AM5.
125 that is about what I reached on an old K7 CPU as well, it was misbehaving at 115 and eventually gave up at 125 with just a piece of aluminium strapped to it! And this is about what I see here in your video as well, this is why there is a heatspreader these days as the poor K7 wouldn't even boot without a "cooler" attached! If you remove the cooler, even in the "lowest" energy mode it will dissipate more heat than that little IHS can absorb. I have no clue what the original owner might have done to it, but it could still be a defective die with a failed thermistor?
If I had to guess what happened, you can still control the SMU on these chips manually and if your utility is based on ZenStates there is basically NO limit to what you can do with the CPU as it just calls the specific SMU command with the value you gave it.
My 2c worth, with the way the solder squeezed out of the IHS, maybe there was too much mounting pressure, also potentially causing some of the pins that read/react to the temperature values not contacting correctly. This could cause both the failure of the failsafe, and the squeezing of the solder...
I think it could be contamination of both the TIM and the surface of the dies. If the TIM was contaminated, it could have a lower melting point than expected, and if the surface of the dies is contaminated, whether by things preventing surface adhesion or maybe even the surface being /too/ smooth and not having texture for the TIM to grip onto for adhesion, that could result in those little gaps in the TIM you see.
Nice exploration, really enjoy these. Just a theory.. I'm aware you don't think it's a manufacturing defect, but the only scenario I can come up with would be one. Either a cold solder joint or contaminated chip surface not allowing full heat transfer to the IHS. Once an individual spot gets hot enough to melt the solder the heat runs up too quickly for the chip or bios to compensate and the chip dies. I've seen similar failures in electrical solder joints, not sure solder for the purpose of heat transfer would react the same way. I understand your experiment shows that the board shuts down at approx. 114c, but without good contact to the IHS to absorb heat is it possible parts of the chip would heat faster than the area immediately around the thermocouples?
Very good post-mortem work on the dead CPU. That chip is running uber hot! I have an MSI X670E motherboard with the 7900 (non X or 3D) - with an Cooler Master 212 air cooler during normal use my temps are @idle around 49C, @moderate load on system mid 50C to 75C. Running Cinebench it goes to tjmax 95C and throttles down a bit but never over 95-96C after running several minutes. I am going to try a liquid AIO cooler to see what results I'm getting but I don't typically stress my system to the limits.
I had an AIO fail on my R5 3600. Think the block clogged. Was working well enough for light use, but then it hit emergency shutdown at about 113 like yours when I smashed it. CPU survived this fine and still serves me well under a Noctua U9S.
I had a very similar issue happen to me once with Precision Boost Overdrive by tweaking the curve manually. in current AGESA's for lots of BIOS, there is a bug in the curve editor that has issues with power draw. For whatever reason my 5800X i watched rocket off to 130C in under two seconds before it finally shut down, even though i only told the curve to increase maximum by 10MHz. Because of the delay in this temperature control, it is entirely possible this 7000 series rocketed off to 160 whilst trying to boost all cores when loading an application or benchmarking an OC profile, meaning the temperature spike was so fast your own tests were nowhere close in comparison. and yes, my cpu had its watercooler on at the time as well. i was terrified. luckily nothing was damaged even at the temperature it hit, and after 10 minutes it was alive. i still use that chip to this day, without any PBO offsets
Very interesting video. As in previous comments, I can only think that the part and/or solder was contaminated causing the voids, then area after area of the chip failing. Or it watched the news and had it. Look forward to seeing an update if anything is found.
I have seen something similar before on Ivy and Sandy bridge platforms with 3 asus boards, all of them have voltage reg failed in bridged mode and pumped 12v directly into the cpu as soon as psu went online, one cpu was exploded with both carrier board and heat spreader bent, the 2nd one was burned, and 3rd one was dead but it didn't show any signs of physical damage.
The larger die has the problem. The short has generated heat in the single large die. The indium melts carrying the heat to the double dies which then melt by contact and flows out the heat sink. With no more contact between heat spreader and die a catastrophic meltdown occurs in the cpu.
I know a thing or two about soldering and I can tell you that solder does not bubble and boil without contamination(flux or water). From your photos I can still see gold plating on the cpu die and ihs. That means this is a bad ihs application from factory and those air pockets allowed the die to partially overheat and short starting a thermal runaway that allowed the cpu to get to 160c+ and melt away. That is only my opinion. Nice video!
I remember the old days when you had XP/Duron with no protection, and we had a computer where the fan stopped. The system ended up heating up, the socket got soft, and then the cooler mount snapped at the bottom (tower case) and then it really started to heat up. So the PCB got damaged and the socket itself got halfway desoldered and pulled out from the PCB. The "glue" that holds the die to the ceramic body of the CPU was gone and the die fell off. To melt the indium you do need to get to 160c. To melt the solder in the old computers you need over 200c.
I had a similar case once, only those were probably Athlon 64 times, dual-core AMD on the AM2 base. One day the computer wouldn't start, started with the CPU fan at 100% and nothing else happened. So I put it on the desk with its side off and turned it on, I tried to diagnose it, even by measuring the voltages, or waiting, because maybe it will turn on after some time. After two or three minutes, my nose told me that something was really heating up and then I saw that you can get burnt on the radiator, even though its fan was running at 100% of the possible speed all the time. The processor then consumed much more power than was acceptable because the installed cooler easily managed during the stress test, and here it warmed up so much that you could make visible burns on your skin despite the roaring fan. The 500W power supply didn't see any problem with that, so it certainly didn't exceed that power. After inserting the new processor, the computer continued to work for years and was fine. I have this damaged processor for sure to this day somewhere in a box with computer components.
Someone left the new guy unsupervised. That, or it was a bad IHS so the solder didn't get a chance to adhere properly, which is probably why we see some "shiny" indium: it was never pressed against the IHS. Or one of their machines failed in a way that was hard to detect and that CPU was already out the door. I personally don't see how this isn't a manufacturing mistake, but I'm just a guy with no qualifications.
The water would not have to evaporate from the block for the cpu to get that high. The cpu may have gotten hot faster than the IHS and cold plate could transfer the energy to the water. Also, in a sealed system it will build pressure and increase the boiling point of the water.
Roman, there are, I think, a few other possibilities: 1/ contamination in the indium alloy which made it liquid at 100°C; you can check what's left of it for its melting point. 2/ bad application during manufacturing; like you said, if an area is not covered during manufacturing it will do this. 3/ the temperatures are estimated (refer to the Intel engineer video), and thus the temperature on top of the die may be much higher than what is displayed by the sensors; you could attach a thermocouple directly to the die and measure the "real" surface temperature.
The temperatures are estimated, yes. But that's done by measuring the temperature dependent voltage differentials of multiple, integrated thermocouples. These are calibrated via correction factors that are set at manufacturing time via eFuses and also chemically very pure, due to being made with lithography. I think the reported temp being off more than 5°C from the real value would already be a huge stretch and very unlikely.
I had a large surface mount IGBT module slide on a PCB in some medical equipment. It was installed with 60/40 solder! It took out some fuses and nothing was burnt. It just had too much resistance when ON; replaced it and all was fine.
I'd guess some water contamination got trapped in when it was soldered, which would seal in the water and turn it to steam. That steam can superheat as the pressure from being sealed in increases, and being a gas its thermal conductivity drops way off, so it collects more heat. The CPU would still operate a little warmer than most, but the heated steam would be way hotter and could reach the solder melting point.
TL;DR - Whatever AMD rates as their CPUs' maximum temperature (Ryzen 4100 = 95°C, 5600 = 90°C, 5600X = 95°C), the AM4 platform thermal protection *shut down* temperature appears to be 115°C. I believe the temperature limit in BIOS only affects the PBO target temperature, or the limit beyond which PB2 will no longer boost clock speed above base. Presumably, looking at this video, AM5 is the same. I have not tested the lower 85°C limit of the Ryzen 5800X3D. I don't want to kill the gold-dust pride-and-joy CPU in my personal gaming PC. I am a Master's Computer Science student writing my dissertation on CPU degradation (yes, *actually*). I am using AM4 CPUs to generate experimental data (4100 because they are cheap...). The ASUS X570 Prime motherboard I am using sets the "Platform Thermal Throttle Limit", under PBO settings, to 115°C by default. For my research I am using static overclocks, so I do not know what effect this temperature has on PBO behaviour. What I can say is that 115°C is the thermal protection shut-down point for AM4 CPUs. No matter what that value is set to (higher or lower), the test system will always turn off whenever "CPU Core" hits 115°C as reported by HWiNFO. Not "CPU (Tctl/Tdie)", which fluctuates more and can read over 115°C until "CPU Core" catches up. I found this when conducting pretesting for a high operating temperature stress-test. By keeping "CPU Core" as reported by HWiNFO at 110±2.5°C, I was able to run my test system for a full 2 weeks. But in the pretesting, if the temperature ever hit 115°C, it would turn off pretty much instantly. In unscientific (and unrecorded) personal overclocking and tinkering, I have seen "CPU Core" temperature hit ~100°C on MSI B450 Mortar Max and MSI X470 Gaming Plus motherboards that I have used in personal systems.
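The observation in the comment above (shutdown follows "CPU Core", not "CPU (Tctl/Tdie)") can be written as a small Python sketch. The sample values below are made up for illustration and are not from the commenter's data; the function name is hypothetical too.

```python
from typing import Iterable, Optional, Tuple

SHUTDOWN_C = 115.0  # shut-down point reported in the comment above


def first_shutdown_index(
    samples: Iterable[Tuple[float, float]],
) -> Optional[int]:
    """Index of the first sample where the system would turn off.

    Each sample is (cpu_core_c, tctl_tdie_c). Following the comment,
    only the "CPU Core" reading is compared against the limit; the
    Tctl/Tdie reading may already be above 115 without a shutdown.
    """
    for i, (core_c, _tctl_tdie_c) in enumerate(samples):
        if core_c >= SHUTDOWN_C:
            return i
    return None


# Made-up log: Tctl/Tdie passes 115 first, "CPU Core" catches up later.
log = [(108.0, 112.0), (111.5, 116.5), (113.0, 117.0), (115.0, 118.0)]
print(first_shutdown_index(log))
```

In this made-up log the second and third samples already show Tctl/Tdie above 115 but do not count as a shutdown; only the last one does.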
Because of the glycol in an AIO, the boiling point of the coolant is higher than 100 Celsius, but not much. Regardless, enough thermal mass to give the protection mechanisms time to work.
Such unexpected 'desoldering' blobs inside a complex electrical device can be far more hazardous than just a non-functional pc.. This is a serious safety issue that deserves further investigation!
the indium solder could have been contaminated with another metal with a lower melting point, or there could have been micro air bubbles in the indium solder that expanded inside the solder, causing an increase in pressure and thus lowering the melting point of the indium in contact with the air bubble.
My theory: The chip was working very hard (not just idling, as your chip was in your tests). It was probably already at the 95 degree limit when cooling suddenly failed (bad water block mount let it fall off?) and the tremendous heat already present in the CPU forced its temperature to shoot past 160 degrees. There would have been nothing the internal temperature detection and protection measures could have done to prevent this as the heat was already present and suddenly it had nowhere to go, so the temps went through the roof.
If the person who sent the CPU in contacts you in a few weeks with the same problem that will narrow the problem down a bit. Great video very interesting
I'm more interested to see the event viewer logs from the user/yours. It seems vendor bios didn't enforce a hard limit CPU temp to ramp down clock frequency. Thank you for the great content as always.
Theory; TIM was not evenly spread, and the place on the CPU DIE where the thermistors are located reported 'high, but in spec' temperatures, but on the unevenly covered positions it got too hot and melted.
what if one of the chiplets died and the other one still worked? maybe it still pushed power into the broken one and got that to 160 while the other one was working, and therefore the system didn't crash?
I would do an analysis of the composition of the 'shiny' bits of indium still on the die. That TIM is technically an Alloy and if the material was not properly made, you may have areas with a lower melting point. Regarding 100C due to water evaporation - Since the cooler is closed loop, any initial evaporation would result in increased pressure, raising the boiling point of water.
@der8auer with regard to water temp, Ethylene Glycol, found in many coolants used in custom loops, raises the boiling point significantly above 100° Celsius. Also, putting a liquid under any pressure (which is created by heating the closed system) raises the boiling point too. Interesting nonetheless :-)
I had something similar happen on the Aorus B650I Ultra using a Ryzen 7 7700X on the F2 BIOS. I built an ITX Ryzen system and I wanted to tune it with a smaller 130W little 92mm fan tower cooler. So while I was tuning it down I set the voltage to 1.22 as part of the normal OC process of seeing how far I could undervolt at 5.2GHz. It immediately shot up and I saw it running at 107C for a couple of seconds before it increased to 111 and shut off. I didn't think the motherboard would let that happen.
I have 2 theories. The first is that the CPU's thermal sensor at some point became faulty and stopped reading the actual CPU temperature. For example, the CPU is at 95c but the system reads it as 50c or 60c, so the system thinks the CPU is still in the safe zone, and in the long run that killed the CPU. My 2nd theory is that the solder that melted may have had a lower melting point, which would be a factory defect; if 95c was the melting point of that solder, then in the long run this was the end result.
Hey Derbauer. I also had a similar situation where my CPU (5800X3D) did not turn off after going over 95. The PC was on its side and I stood it up as normal, but the AIO did not like this and there was no flow of water. The CPU reached 120 degrees in Windows before I pulled out the power cable. Setup: Gigabyte B450 I AORUS PRO WIFI (rev. 1.0) with latest BIOS (F64a), ALF II 280, 5800X3D with PBO.
It might have happened on the last POST. An in-chip short, perhaps caused by the internal power delivery circuitry, could create enough heat, and the owner wouldn't be able to react fast enough to stop the solder melting. It's one of those "what are the odds" failures.
the water does not have to evaporate... the boiling point of water is 100c only at 1 atmosphere, sea level, the open air. once you constrain water in a sealed closed loop, it can be under higher pressures. so a little may boil, that small amount of generated vapour will then pressurise the container, and then the remaining water in the loop will be pressurised enough to raise the boiling point above 100c, while still remaining a liquid.
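To put rough numbers on the comment above, here is a Python sketch using the Antoine equation for pure water. The constants are a commonly tabulated set for roughly 99-374°C (pressure in mmHg, temperature in °C) and should be treated as an assumption to double-check; AIO coolant with glycol will behave somewhat differently.

```python
import math

# Antoine equation: log10(P_mmHg) = A - B / (C + T_celsius)
# Commonly tabulated constants for water, roughly 99-374 °C (assumed here).
A, B, C = 8.14019, 1810.94, 244.485
MMHG_PER_ATM = 760.0


def water_boiling_point_c(pressure_atm: float) -> float:
    """Approximate boiling point of pure water at a given absolute pressure."""
    p_mmhg = pressure_atm * MMHG_PER_ATM
    # Antoine equation solved for temperature.
    return B / (A - math.log10(p_mmhg)) - C


for p in (1.0, 1.5, 2.0):
    print(f"{p:.1f} atm -> {water_boiling_point_c(p):.1f} °C")
```

With these constants the estimate at 2 atm lands around 120°C, so even a modest pressure rise in a sealed loop moves the boiling point well past 100°C.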
Obviously I'm rather late to this video, which is great and I'm very jealous of your mill. It occurred to me that the unnamed lab and engineer from Steve's recent videos might be able to shed some light on what happened to this one. The 160°C is nothing compared to the CPU which damaged the m/b, melted copper and, possibly, silicon! I wouldn't be surprised if you've already spoken to him about it, that the original owner of this CPU was using an ASUS m/b did not surprise me. 😒
I wonder how all the different thermal sensors in the dies are ultimately culminated for monitoring. Temp sensing is an analog methodology that must then be converted into a digital value *somewhere.* If this is being done within the CPU itself, then a few atoms out of place in the circuit could skew the analog temp reading or conversion by multiple degrees. I don't know how much testing AMD does on each individual CPU before packaging, but something like this could be missed if it required more "burn-in" to permanently alter an unstable circuit. Another theory might suggest something as purely random as a neutrino hitting a circuit in the CPU and causing it to read a wildly inaccurate value in volts, current, or temp; allowing limits to be exceeded without tripping a shutdown. A bit of a stretch of the imagination, but crazier things have happened when neutrinos hit a processor. I mean, how long would the solder need to be at 160c to begin flowing in a vertical orientation, such as in a PC tower, anyway? A couple seconds at most?
The soldering material may not be pure indium. Some of them are using In-Ag alloy which could melt as low as 128°C. I have a piece left from my STIM experiment back from Haswell era. Since the temp sensors are not 100% accurate and won't cover the whole die, it may just be enough to desolder the CPU.
I'm not sure if this is possible, but I would like to see how his board specifically would handle that test you did with taking the AIO off. Also, could there have been a contaminant in the indium that lowered its melting temperature?
Even if the melting point was lower, it wouldn't have mattered. A catastrophic failure like this would experience a thermal runaway. That is, the chip failed first then melted.
I thought 95c was normal operating temperature and 115c was critical temperature protection. I actually asked Steve from Gamers Nexus in a super chat but he never answered. I wanted to ask him whether it would be safe to run the CPU at 110c, since 115c was the critical point.
@@cromefire_ That's what people said about 95C! I haven't seen information to substantiate your claim. Truth be told, only AMD knows the answer to that question. By the way, Intel processors run at TJMax now, and Intel claims that this will not degrade the processor's longevity.
@@juno1597 AMD explicitly states that it's safe until 95°C . That's why the CPUs try to NOT go over 95°C, thermal shutoff at 115°C means that >115°C your CPU will likely be damaged immediately. Thermal shutoff doesn't mean it's healthy to run it at 115°C, but rather that beyond that your CPU won't die in months, days or hours, but probably seconds. It's the last resort. Now with dies there is a gradient. At 96°C probably as well, with 100°C your CPU might degrade fast, but would probably still run for a long time, but the closer you get to 115°C, the faster your chip will probably degrade. How much faster? Well you'd have to try as you heard Roman say with Intel chip without limits, there will be a point where it will just die outright and given Intel's TJmax of 105°C and the CPU dying at 125°C and AMDs "TJmax" of 95°C, 115°C (same +20K), it's probably reasonable to assume that the CPU might die pretty close to that. Lastly there shouldn't be a reason to run it as high as that. As chips get less efficient with temperature even AMDs move to go to 95°C isn't the most efficient and going beyond that you'll probably not get a lot more in performance. If you're willing to risk your CPU for 1-2% in performance, if even, you're probably also willing to just spend a bit more on a better cooler and just enjoy more performance with less risk involved and an intact warranty.
If the 7900x has 2 cores disabled on each ccd because they didn't pass qc, could they by any chance somehow received power and shorted out? The melted solder was focused on one area on each ccd after all, any way to know what cores were underneath them?
My theory, from looking at the hotspots, is that the CPU could have shorted. If that's the case, a lot of current would flow, as the voltage regulators on those boards are capable of delivering quite a bit of current, leading to thermal runaway. It likely could have exceeded 160c even. If shorted the CPU won't work, but the motherboard VRMs could have still been delivering current? I have seen over 260c when a cooler failed and the board VRMs were still delivering power. The CPU was well over 260c, even reaching over 326c before power cut, measured using a FLIR thermal imaging camera, which is pretty accurate. That was a Ryzen 5 CPU we tested on an MSI motherboard. It sure looks like that's what happened? I would have tested the melting point of that blob.
Keep in mind the CPU was also under pressure from the cooler, this would lower the melting point of the indium i have no idea by how much however and i highly doubt it would drop it below 120c
Higher pressure HIGHER melting point. Why do you think they lower pressures in vapor chambers/heat pipes. To boil at a lower temp, to use the thermal capacity of phase change, to cool your CPU.
I "heard" on a forum that having PBO enabled disables the temp limit, which sounded crazy to me but apparently is true, you can see in the bios settings that PBO is on auto which probably means that it's active.
Its possible the CPU was damaged internally in some way when it overheated, and direct-shorted inside. Same instance as when electronics fail and burn up because they are shorted internally.
Very strange. The solder seemed well bonded to the chips and the IHS, and I'd have to imagine even as a liquid it would still make a very good thermal connection. And even for the solder to melt out from under the IHS, I'd imagine everything would have to be very hot. If the silicon managed to get incredibly hot very fast, I can't see liquid metal getting very far while touching a big piece of copper unless it was also really hot. If you have a dead AM5 chip that's still lidded, it might be interesting to see how hot you have to get it to make metal drip out the bottom of it.
As someone who once, a long time ago, accidentally ran a Pentium CPU without a cooler and watched it melt itself in seconds, it is wild to me seeing a CPU running without a cooler today and being totally fine afterwards.
Just a theory: could it be that the pin and pad contact between CPU and socket was at fault? It could cause arcing, or the pin and pad that supply data on temperature made poor contact. We have seen the issue with Intel, hence why the contact frame was made.
Aside from this issue here, what if we could run these processors at 110C with direct die cooling for boiling water. The expansion of the boiling water could act as the pump, meaning that the pump would only ever fail, if the CPU was too cold, the cooling loop was overpressured(say 20PSI) or the water evaporated. IIRC you can reduce the boiling point of water by adding ethanol to it but that would probably cause problems with seals and sealants.
Bear in mind that as the water temp rises, pressure in the loop will rise, raising the boiling point of the liquid (there would need to be at least a little bit of air to allow some compression; water on its own would likely burst the loop somewhere just from uncompressed expansion).
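To put rough numbers on the pressure point in the two comments above, here is a minimal sketch using the Antoine equation for pure water. The coefficients are the commonly tabulated set for roughly 99 to 374 °C and should be treated as approximate; the 10 and 20 psi gauge pressures are only illustrative (20 psi is the overpressure figure mentioned in the parent comment), and any ethanol or coolant additive would change the result.

```python
import math

# Antoine coefficients for water (pressure in mmHg, temperature in deg C).
# Commonly tabulated for roughly 99-374 deg C; approximate, pure water only.
A, B, C = 8.14019, 1810.94, 244.485
MMHG_PER_PSI = 51.715
ATMOSPHERE_PSI = 14.696


def boiling_point_c(abs_pressure_psi: float) -> float:
    """Boiling point of water at the given absolute pressure (Antoine equation, inverted)."""
    p_mmhg = abs_pressure_psi * MMHG_PER_PSI
    return B / (A - math.log10(p_mmhg)) - C


# Gauge pressure is pressure above atmospheric, so add one atmosphere.
for gauge_psi in (0, 10, 20):
    t = boiling_point_c(ATMOSPHERE_PSI + gauge_psi)
    print(f"{gauge_psi:>2} psi gauge -> boils at about {t:.0f} C")
```

Under these assumptions a loop holding 20 psi above atmospheric would not boil until well past 115 °C, which is why the reply says the pressure rise works against the idea.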
As mentioned, verify the melting point of the TIM. If it's lower than you expect perhaps mounting vertically would cause it to drip away instead of on your test bench where it would stay in place.
@der8auer I have a 7700X that might interest you. It was working fine with 2200FCLK for over 2 days, so I tried to lower SoC voltage from Auto (bios reading 1.43) in -10mv steps. When I got to 1.32 SoC the latency in aida64 increased by 2x, all the way to 30x more. Since then the CPU cannot be stable with normal latency over 2000FCLK and only boots up to 2067FCLK, but the latency at this FCLK is about 4x the normal range. I changed motherboard from the Strix x670e-e to the Gene x670e and tested 1 dimm only but no luck, tried a fresh windows install and a different PSU, again no luck. Like I said, if you're interested let me know, because I have searched google for weeks and can't find one person with this phenomenon.
The max. safe-ish voltage for the IOD was 1.2V when it was still made in 12nm. That certainly hasn't increased with Zen4 and the IOD being made in 6nm now. I guess that the auto setting and/or auto auxiliary voltages (like VDDG, VDDP) were way too high (which was also the case with early BIOS versions for Ryzen 3000) and you degraded the IOD. Are you sure the bios was showing 1.43v? It should have been 1.143V!
Saw the same spots on my 7950x after delidding it. CPU is not dead though; used mb rog strix x670e-i, and only minor overclocking (it also started running down between dies). Interesting is the fact that now the gold layer on the die looks completely black/cooked, don't know if it's because of the liquid metal or very high temps after delidding it (I had thermal shutdown after delidding it). My big guess is that the indium (alloy) used starts to melt, and the results can be seen only after some time using the CPU in vertical orientation, unlike the horizontal open test bench used by @Der8auer (I had mine running for about 1 month). My guess is that the temps on the die are averaged, therefore some spots can get hotter than the melting temp.
you won't see such content anywhere else, such an amazing channel!
Thank you!
Yes, I am also fascinated by this man! ;-)
I agree. It's an amazing channel that puts actual measurements and engineering knowledge into useable information.
@@der8auer-en , I have few cases, when user load factory default bios settings, pump began to work with PWM mode and, for example, 5950x shutted down on 115. As i know intel and amd has this limit in 115 degrees for safety. If processor reach this high - MB must turn off system for safety
agreed✌️✌️✌️ amazing channel and content
95 degrees is the max safe temperature at which the Ryzen 7000 CPU's can run without any damage 24×7. When you exceed 95 the CPU begins throttling to keep the temps in check but it won't shut down. The thermal shutdown temp for the Ryzen 7000 series is 115 degrees. It is when the CPU reaches this temperature even after throttling that the CPU will shutdown to protect itself. A great video as always but I also would really like to know how this CPU died with all the protection mechanisms in place.
What about heat degradation? While it can probably survive warranty, how long will it last after that?
Not to mention, that these run hot just for the sake of catching intel clocks and there wasn´t any need to push them so far. Eco mode limited at 105W, while lowering clocks, only lowers performance a little, but drops temperature quite a lot. It´s clear AMD pushed heavily into realm of diminishing returns.
@@Morpheus-pt3wq what's your point, 95 is a whole 5 from 100, which many intel processors hit for years already, it's more ok than you think. And even then its most likely to hit these temps only under sustained all-core loads, even with PB as most users use it, it wont even get close.
@@Morpheus-pt3wq Degradation happens due to voltages, not temperature alone. It's just that higher voltages are associated with higher temperatures The hot parts are a direct result of the thermal density which is bound to get only worse.
@@Morpheus-pt3wq a very important point that you've brought up. The recent AMD and Intel CPU's run extremely hot, sacrificing efficiency for performance. No matter how much AMD and Intel says that 90 degrees is a safe temperature, I will always be skeptical because a CPU that runs so hot may have a shorter lifespan. Let's wait for 3 years and see how many Ryzen 7000 X CPU's we see failing. On the contrary the 7600 and 7700 non X run way cooler even with PBO enabled if provided with sufficient cooling.
@@Morpheus-pt3wq heat degradation takes several years to kill the chip... and thats a brand new cpu..
Like someone else in the comments said, you should check the melting point of that blob, to see if it's actually 160°C.
Could the metal be contaminated being an alloy with a lower melting point? You could check the melting point of the blob.
BTW: you did an amazing work!
A chemical analysis of the t.i.m. would be interesting.
Thats a good idea
Idk solder is usually pretty dead on given how long we have been making the stuff
@@wazaagbreak-head6039 ya, but all it takes is one Homer Simpson
Even if the solder melted at operating temperature, the CPU has so many temperature sensors that it should have noticed the local overheating from missing STIM. The melting seems like a symptom rather than a cause.
It’s surprising to hear that you guys sent him a whole new cpu. Seriously you guys rock for being that kind.
I'm pretty sure that he would have gotten replacement from retailer or AMD if he RMA it. As it was demonstrated it is supposed to throttle at 95C and shutdown at 115C. Even if you take off cooler, CPU should simply shutdown instead of melting. So they did what they should have done if they took his CPU and made it impossible to RMA.
Well, they likely have AMD connections and this is essentially a roundabout way of doing a warranty claim. This does benefit AMD, after all: they get to see the components without having to do the work.
A cpu is cheap compared to the revenue the ad's and sponsorships gave...
When I started my IT career in the late 90s, I accidentally overclocked a Cyrix processor by putting it in a board with its jumpers set for a much faster Intel CPU. It got so hot, so quickly, the ceramic package cracked. This is the first time I've seen anything similar.
I haven't heard anyone reference a Cyrix processor in FOREVER!!!! Interesting story from the old days.
In the old days processors didn't have thermal protection. I had some crack, and when experimenting just by removing the cooler they got up to over 350c really quickly.
This one suspect it shorted, could have been from over voltage likely? The hotspots kind of point to that direction?
der8auer is a rare breed on YouTube these days: great videos and no click baity shorts
It is really nice of you to help someone out in that way, thanks for the great content!
You always make fantastic content, thank you! The way you help out your viewers is great.
Roman, will you be informing AMD or an AMD Engineer about this? The feedback would be so interesting!
you know they watch his channel anyways tho right? even if he doesnt inform them, my guess would be someone is worried about losing their job cuz o the blob as we speak
The AMD engineer would likely say "customer likely disabled thermal shutdown." As for the CPU reaching 115C without a cooler, the engineer would say that is the normal thermal shutdown temp for Ryzen CPUs. 95C is when the chip begins throttling, 115C is the thermal shutdown temp.
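The two thresholds that keep coming up in this thread (throttling from 95C, thermal shutdown at 115C) can be written down as a tiny sketch. The numbers are the ones quoted by commenters here, not values taken from an AMD datasheet, and the function name is made up for illustration.

```python
# Thresholds as quoted in the comments; not taken from an AMD datasheet.
THROTTLE_C = 95.0
SHUTDOWN_C = 115.0


def thermal_state(temp_c: float) -> str:
    """Classify a reported die temperature the way the thread describes it."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"
    if temp_c >= THROTTLE_C:
        return "throttle"
    return "normal"


for t in (70, 95, 114, 115, 160):
    print(f"{t} C -> {thermal_state(t)}")
```

By this logic a healthy chip should never report anything like 160C while still powered, which is why the melted solder points to either a protection failure or a chip that was already dead.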
@@K31TH3R Isn't there some ways to find out if that's truly the case? Like a certain area of fuses burned within the chip or some flags that could be read out in some kind of manner? I mean, that's a pretty stark contrast.
@@TheTekknician The problem with both thermal fuses and temperature sensors at this scale of process node is that their presence alone will impact the accuracy of local junction temperature readings. So in a lower cost consumer CPU where determining liability for damage is not all that important, I wouldn't see a thermal fuse being beneficial to include if the CPU's management engine is proven and reliable at appropriately shutting the CPU down as long as it has not been disabled.
There's usually at least a 10-15% temperature overhead between thermal shutdown and actual die damage, so I'd expect the point where CPU damage begins occurring would be in the 125-140C range, which is around 10-20C lower than the melting point of the IHS to die solder. This means if the IHS solder has migrated, you have a strong indicator that the chip had a thermal event and failed to shut down, and that's really all you could conclude from either a destroyed thermal fuse or melted IHS solder. From the manufacturers perspective, it would likely not be difficult to prove with some degree of certainty that the customer disabled the thermal limit at all, but if you want to start pointing that finger you'd be very petty and burning bridges in doing so.
If you're just trying to determine liability from the consumers perspective, we have plausible deniability on our side, simply because AM5 is still a relatively new platform and it's well known that there are BIOS bugs coming out of every AIB.
@@OsX86H3AvY No one would lose their job. This was a manufacturing error which is really rare, but is to be expected when you make so many. The engineers themselves don't work on manually inspecting the CPUs, so they weren't at fault.
Perhaps some if not all of the thermal probes inside the CPU were malfunctioning? And then PBO just kept increasing the voltage as it saw the good temperatures being reported by the CPU, leading to it killing itself, with the cooler unable to cool the hotspots.
I think it's unlikely that all sensors failed but maybe the control circuit itself which is kind of the same result.
the vrm will NEVER raise the voltage above set value regardless of temperature. following your logic LN2 overclocking wouldn't be possible because at those temperatures all sensors are providing bogus values.
It seems a lot more likely it suffered from a latchup and theres nothing the thermal management could do to stop the current from overheating the chip and melting the solder over the fault
@@mycosys yeah that's what I think happened too. Basically the chip failed in some way (thermally or otherwise) that specifically held an electrical connection open, it thermally ran away burning away transistors on the die and just replacing them with a solid circuit, the cpu was already dead at this point so it's neither going to tell anyone about its temp nor control itself but power was still flowing, and for the microseconds it reached temps to melt the indium and the pressure of the lid-sandwich squeezed it out the side.
Given the craters the cooler was in working order and cooled the cpu back down before the remaining indium could flow back in evenly when the mobo killed the power to the cpu which was no longer signaling correctly
@@mycosys you still have the "cpu hot" signal for that case. it's a open drain so any component connected to it can assert it (cpu, pch, sio...), and when asserted the VRM shuts down. so there must have been a fault on the MB as well.
A more plausible guess would be that the CPU actually failed short, which, as always, caused an uncontrolled/runaway temperature increase, and in this scenario the CPU's internal temp protection can do absolutely nothing to prevent it (a short is a short, temperatures will rise up to hundreds of degrees, sometimes even thousands).
Good thinking... What we see in the solder could be a side effect of a failure that came before
@@Schwuuuuup the main issue w/ that theory is that two separate dots were shown, it's very unlikely to fail at two distinct locations at the same time.
A simple thing like dust or even a dead skin cell amok that somehow got in through the vents and found its way during the manufacturing process to and part of the CPU is possible and the laser could even have broken a particle from the mirror and lasered it in
So many possibilities...
@@stanimir4197 not necessarily, if 12v get shorted to the 1v rail, multiple parts will burn
@@Schwuuuuup yeah but the same motherboard later ran another CPU so the VRM was fine
Based on experience with laptops, the processor may have failed first. It has short circuited, but the protections on the motherboard have not triggered properly. Therefore, a indium melt occurred. On a faulty gaming laptop board, a shorted CPU or GPU can reach temperatures of 200-250 degrees C under very favourable conditions. It can literally solder itself out.
So interesting. Also the absolute thickness of that IHS! I haven't delidded yet(waiting for delidder restock in the US) so I knew that it was thicker than normal but by cutting away a part of it you can truly get a feel for it.
That was a great video, also thanks for using titles that describe what actually happens in the video instead of clickbait.
I love that exchange. He got a new CPU and you got interesting content for your channel.
Your failure analysis and investigation videos are definitely some of the most interesting and unique content in this space. Also, I did not realize how thick the new Ryzen IHSs were until you milled away half of it. That's a ton of material between the cooler and the heat source.
Genius absolutely Genius !!! Enjoyed this very much. Thanks for sharing this kinda EXTREME investigation.
I'm still leaning towards a manufacturing defect alongside Ryzen 7000 normal behavior. I'm not entirely convinced that the embedded thermal protection is monitoring every core or component, maybe it's not looking at individual core value.
My theory is that it was in fact a manufacturing defect where a small part of the CCD, one excluded from thermal protection, had a void of flux and therefore indium, causing that tiny specific spot to exceed 160C, melting the indium surrounding it, which in turn transferred heat to more surrounding indium until it flowed out, shorting surface mount components and killing the chip. Voids in cooling are very bad, as they also represent an area of air (any gas honestly). This tiny spot of air will increase in pressure as it heats, pushing in all directions looking for a way to escape, and if that void is allowing the indium to melt surrounding indium, it seems feasible that it would push the liquid indium until it melted to an edge and you'd see a blowout of indium as we saw here.
My best guess is that this was a flux application failure, evidenced by the strip of indium between the CCD and IO die that we know shouldn't be there. That misapplication of flux may have drawn solder away from the CCD, or just caused poor adhesion in some areas. I believe the flux is "printed" or silkscreened on before the chip passed inverted through an indium solder bath, so any areas with poor flux coverage could become a void, while any areas of the IHS that got flux and should not have would draw solder from other areas during the high temperature IHS application completion step.
Or, at least, that's my theory, I don't manufacture CPU's 😮
Maybe there were holes in the indium before (defective application) and only a small part of the die got waaay too hot, which would have built pressure in the air in the pocket, melted the indium and maybe ultimately catapulted some of the indium off the die, both further increasing the temperature and maybe shorting something.
The thing is, as the die has more than one temperature sensor, but doesn't have an unlimited amount, a small part of the die can easily get hot, while the CPU never notices.
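The point about a limited number of sensors can be illustrated with a toy model. Everything here is hypothetical: a one-dimensional die, a flat 90C baseline, a narrow Gaussian hotspot and three made-up sensor positions. Real heat spreading in silicon would smooth the profile considerably, so this only shows the idea, not real Ryzen behavior.

```python
import math

SHUTDOWN_C = 115.0                      # shutdown threshold quoted in the thread
SENSOR_POSITIONS_MM = [1.0, 6.0, 9.0]   # assumed sensor locations along a 10 mm die


def die_temp(x_mm: float, base_c: float = 90.0, peak_rise_c: float = 80.0,
             hotspot_mm: float = 4.0, width_mm: float = 0.5) -> float:
    """Hypothetical 1-D profile: flat baseline plus a narrow Gaussian hotspot."""
    return base_c + peak_rise_c * math.exp(-((x_mm - hotspot_mm) ** 2) / (2 * width_mm ** 2))


# What the sensors would report versus the hottest point on the die.
sensor_readings = [die_temp(x) for x in SENSOR_POSITIONS_MM]
true_peak = max(die_temp(i / 100) for i in range(0, 1001))
tripped = max(sensor_readings) >= SHUTDOWN_C

print(f"hottest sensor: {max(sensor_readings):.1f} C")
print(f"true peak:      {true_peak:.1f} C")
print(f"shutdown tripped: {tripped}")
```

In this made-up case the hotspot sits between two sensors, so the hottest reading stays near the baseline while the peak is far above the shutdown threshold.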
i was thinking the spreader had holes and sucked the indium up into it
@@shaunjennings4609 You would have seen those on the spreader though as he disassembled it. If you mean the smears of indium on the cuts of the spreader, that's most likely just from the CNC spreading the soft metal everywhere while cutting.
Love your ideas and the way you approach your research 👍
Always informative and interesting to view 😊
Very nice of you to help a viewer out! Interesting video too!
The IHS looks so thick like this. It’s a real cool way to show it
1:33 Great engineering on Hetzner's part. That sleeved cable addition is definitely not blocking the airflow to that CPU
Okay so I wasn't the only one who noticed 🤣
I can say something is off with the protection mechanism on the 7000 series, or the motherboard. It happened to me once that a 7950x let out the magic smoke during the first boot (without heatsink, no protection kicked in, but it also should not have exceeded 50-ish Celsius). It has been a while since I saw this kind of flaw, since 2010 I think.
This CPU is quite puzzling and requires further examination!
What an amazing video!!!
G'day Roman,
😲WOW! thank you for this video, it was really Awesome to get a look at the cut away of the Heat Spreader,
Also very interesting Temperature Content information, very strange situation 🤷‍♂️
Would be interesting to reach out to AMD if they have any way to still check the CPU on what exactly happened, what failed and what caused it to reach such high temps in the first place. The damage on the CPU isn't something that can't be fixed in terms of resoldering SMDs, but would still be great to check if anything can be "extracted" by AMD to determine what could have happened.
honestly would have more hope to see something in the windows crash reports.
Like where should a CPU log that?
@@randomzocker8956 Intel uses their IME, Intel Management Engine. AMD has something similar, don't know the name, but every processor these days has a tiny RISC microprocessor that handles quite a few things. There's an article on that you can check out in regards to that.
But yes, Intel has had it for a long time, AMD probably as well, since it, well, manages the CPU itself. I'd find it hard to believe it has no way to log specific things on what the CPU does.
@@NecroFlex I doubt that it would have any sort of memory that would store a log. Doing so wouldn't really serve much of a useful purpose for 99% of use cases, so I don't think they would bother making something like that.
@@NecroFlex yes but using that for logging would hinder the performance right?
jesus, everyone has been complaining about how thick ryzen 7000's IHS is for the longest time, and I couldn't really get what they mean until I watched this video. God dang that's some thick copper on top of the die right there.
It's possible the user had some third party tool installed for fan control/temp control and possibly tried to send a kill command to the system but the system already hung/froze so it just kept pumping voltage into it as the power supply didn't get a shutdown signal.
There's a few of these CPU meltdown stories about apparently. Makes me feel not so annoyed that i built my pc a while before these AM5 zen 4 CPUS came out so have a ryzen 9 5900x and Rx 6800 build. Which i am very happy with. And maybe i missed a bullet not going AM5.
125 that is about what I reached on an old K7 CPU as well, it was misbehaving at 115 and eventually gave up at 125 with just a piece of aluminium strapped to it! And this is about what I see here in your video as well, this is why there is a heatspreader these days as the poor K7 wouldn't even boot without a "cooler" attached!
If you remove the cooler, even in the "lowest" energy mode it will dissipate more heat than that little IHS can absorb. I have no clue what the original owner might have done to it, but it could still be a defective die with a failed thermistor?
If I had to guess what happened, you can still control the SMU on these chips manually and if your utility is based on ZenStates there is basically NO limit to what you can do with the CPU as it just calls the specific SMU command with the value you gave it.
I actually do this on Linux myself, but there isn't really an utility for it so you have to write values you calculate yourself...
thanks a lot as always for the most interesting content and the excellent logical approach you do in your vids 🤟
very interesting cpu hardware video, awesome content!
My 2c worth, with the way the solder squeezed out of the IHS, maybe there was too much mounting pressure, also potentially causing some of the pins that read/react to the temperature values not contacting correctly.
This could cause both the failure of the failsafe, and the squeezing of the solder...
I think it could be contamination of both the TIM and the surface of the dies. If the TIM was contaminated, it could have a lower melting point than expected, and if the surface of the dies is contaminated, whether by things preventing surface adhesion or maybe even the surface being /too/ smooth and not having texture for the TIM to grip onto for adhesion, that could result in those little gaps in the TIM you see.
Nice exploration, really enjoy these. Just a theory.. I'm aware you don't think it's a manufacturing defect, but the only scenario I can come up with would be one. Either a cold solder joint or contaminated chip surface not allowing full heat transfer to the IHS. Once an individual spot gets hot enough to melt the solder the heat runs up too quickly for the chip or bios to compensate and the chip dies. I've seen similar failures in electrical solder joints, not sure solder for the purpose of heat transfer would react the same way. I understand your experiment shows that the board shuts down at approx. 114c, but without good contact to the IHS to absorb heat is it possible parts of the chip would heat faster than the area immediately around the thermocouples?
Very good post-mortem work on the dead CPU. That chip is running uber hot! I have an MSI X670E motherboard with the 7900 (non X or 3D) - with an Cooler Master 212 air cooler during normal use my temps are @idle around 49C, @moderate load on system mid 50C to 75C. Running Cinebench it goes to tjmax 95C and throttles down a bit but never over 95-96C after running several minutes. I am going to try a liquid AIO cooler to see what results I'm getting but I don't typically stress my system to the limits.
Amazing stuff! You did something many of use always wonder about, but too fearful to try.
I had an AIO fail on my R5 3600. Think the block clogged.
Was working well enough for light use, but then it hit emergency shutdown at about 113 like yours when I smashed it. CPU survived this fine and still serves me well under a Noctua U9S.
I had a very similar issue happen to me once with Precision Boost Overdrive by tweaking the curve manually. in current AGESA's for lots of BIOS, there is a bug in the curve editor that has issues with power draw. For whatever reason my 5800X i watched rocket off to 130C in under two seconds before it finally shut down, even though i only told the curve to increase maximum by 10MHz. Because of the delay in this temperature control, it is entirely possible this 7000 series rocketed off to 160 whilst trying to boost all cores when loading an application or benchmarking an OC profile, meaning the temperature spike was so fast your own tests were nowhere close in comparison.
and yes, my cpu had its watercooler on at the time as well. i was terrified. luckily nothing was damaged even at the temperature it hit, and after 10 minutes it was alive. i still use that chip to this day, without any PBO offsets
Very interesting video. As in previous comments, I can only think that the part and/or solder was contaminated causing the voids, then area after area of the chip failing. Or it watched the news and had it. Look forward to seeing an update if anything is found.
I have seen something similar before on Ivy and Sandy bridge platforms with 3 asus boards, all of them have voltage reg failed in bridged mode and pumped 12v directly into the cpu as soon as psu went online, one cpu was exploded with both carrier board and heat spreader bent, the 2nd one was burned, and 3rd one was dead but it didn't show any signs of physical damage.
The larger die has the problem. The short has generated heat in the single large die. The indium melts carrying the heat to the double dies which then melt by contact and flows out the heat sink. With no more contact between heat spreader and die a catastrophic meltdown occurs in the cpu.
I know a thing or two about soldering and I can tell you that solder does not bubble and boil without contamination(flux or water). From your photos I can still see gold plating on the cpu die and ihs. That means this is a bad ihs application from factory and those air pockets allowed the die to partially overheat and short starting a thermal runaway that allowed the cpu to get to 160c+ and melt away. That is only my opinion. Nice video!
i remember in the old days when you had XP/duron with no protection, AND we had a computer where the FAN stopped. the system ended up heating up and the socket got soft, and then the cooler mount snapped at the bottom (tower case) and then it really started to heat up. so the PCB got damaged and the socket itself got half way desoldered and pulled out from the PCB. the "glue" that held the die to the ceramic body of the CPU was gone and the die fell off.
to melt the indium you do need to get to 160c. to melt the solder in the old computers you need over 200c.
Look at the thickness of that IHS for gods sake
This is the epic "Toms Hardware: CPU Cooling" video situation all over again
Yes, remember the early days of YouTube and his experiments when he did it on an old AMD processor.
I had a similar case once, only that was probably in Athlon 64 times, a dual-core AMD on the AM2 socket. One day the computer wouldn't start: it spun the CPU fan up to 100% and nothing else happened. So I put it on the desk with its side off and turned it on. I tried to diagnose it, even by measuring the voltages, or by waiting, because maybe it would turn on after some time. After two or three minutes, my nose told me that something was really heating up, and then I saw that you could get burnt on the heatsink, even though its fan was running at 100% of its possible speed all the time. The processor was then consuming much more power than was acceptable, because the installed cooler had easily managed during stress tests, and here it warmed up so much that you could get visible burns on your skin despite the roaring fan. The 500W power supply didn't see any problem with that, so it certainly didn't exceed that power. After inserting a new processor, the computer continued to work for years and was fine. I am sure I still have this damaged processor to this day, somewhere in a box with computer components.
Someone left the new guy unsupervised. That, or it was a bad IHS, so the solder didn't get a chance to adhere properly. Which is probably why we see some "shiny" indium, because it was never pressed against the IHS.
Or one of their machines failed in a way that was hard to detect and that CPU was already out the door. I personally don't see how this isn't a manufacturing mistake, but I'm just a guy with no qualifications.
The water would not have to evaporate from the block for the cpu to get that high. The cpu may have gotten hot faster than the IHS and cold plate could transfer the energy to the water. Also, in a sealed system it will build pressure and increase the boiling point of the water.
Roman, there are, I think, a few other possibilities:
1/ contamination in the indium alloy which made it liquid at 100°C; you can check what's left on it for its melting point
2/ bad application during manufacturing; like you said, if an area is not covered during manufacturing it will do this
3/ the temperatures are estimated (refer to the Intel engineer video), and thus the temperature on top of the die may be much higher than what is displayed by the sensors. You could attach a thermocouple directly to the die and measure the "real" surface temperature
The temperatures are estimated, yes. But that's done by measuring the temperature dependent voltage differentials of multiple, integrated thermocouples. These are calibrated via correction factors that are set at manufacturing time via eFuses and also chemically very pure, due to being made with lithography. I think the reported temp being off more than 5°C from the real value would already be a huge stretch and very unlikely.
This is crazy! I love weird, obscure stuff like this. I’d be very interested to hear what AMD has to say about that.
I had a large surface-mount IGBT module slide on a PCB in some medical equipment. It was installed with 60/40 solder! It took out some fuses and nothing was burnt. It just had too much resistance when ON; I replaced it and all was fine.
Such interesting content on here thanks bro. I'm about to do my first delid with the ek direct die kit you consulted with them on
This was a super cool video, using that milling machine to work away at the heatspreader and expose the copper. Awesome!
I'd guess some water contamination got trapped in when it was soldered, which would seal in the water. That water turns to steam, which can superheat as the pressure from being sealed in increases, and because it is a gas its thermal conductivity drops way off, so it collects more heat. The CPU would still operate, a little warmer than most, but the heated steam would be way hotter and could reach the solder's melting point.
Great informative video, thanks for sharing!...
TL;DR - Whatever AMD rates as their CPUs' maximum temperature (Ryzen 4100 = 95°C, 5600 = 90°C, 5600X = 95°C), the AM4 platform thermal protection *shut down* temperature appears to be 115°C.
I believe the temperature limit in the BIOS only affects the PBO target temperature, or the limit beyond which PB2 will no longer boost clock speed above base.
Presumably, looking at this video, AM5 is the same.
I have not tested the lower 85°C limit of the Ryzen 5800X3D. I don't want to kill the gold-dust pride-and-joy CPU in my personal gaming PC.
I am a Master's Computer Science student writing my dissertation on CPU degradation (yes, *actually*).
I am using AM4 CPUs to generate experimental data (4100 because they are cheap...)
The ASUS X570 Prime motherboard I am using sets the "Platform Thermal Throttle Limit", under PBO settings, to 115°C by default. For my research I am using static overclocks so I do not know what effect this temperature has on PBO behaviour.
What I can say is that 115°C is the thermal protection shut-down point for AM4 CPUs. No matter what that value is set to (higher or lower), the test system will always turn off whenever "CPU Core" hits 115°C as reported by HWiNFO. Not "CPU (Tctl/Tdie)", which fluctuates more and can read over 115°C until "CPU Core" catches up.
I found this when conducting pretesting for a high operating temperature stress-test. By keeping "CPU Core" as reported by HWiNFO at 110 ± 2.5°C, I was able to run my test system for a full 2 weeks. But in the pretesting, if the temperature ever hit 115°C, it would turn off pretty much instantly.
In unscientific (and unrecorded) personal overclocking and tinkering, I have seen "CPU Core" temperature hit ~100°C on MSI B450 Mortar Max and MSI X470 Gaming Plus motherboards that I have used in personal systems.
Because of the glycol in an AIO, the boiling point of the coolant is higher than 100 Celsius, but not by much. Regardless, there is enough thermal mass to give the protection mechanisms time to work.
Such unexpected 'desoldering' blobs inside a complex electrical device can be far more hazardous than just a non-functional PC. This is a serious safety issue that deserves further investigation!
The indium solder could have been contaminated with another metal with a lower melting point, or there could have been micro air bubbles in the indium solder that expanded inside the solder, causing an increase in pressure and thus lowering the melting point of the indium in contact with the air bubble.
My theory: The chip was working very hard (not just idling, as your chip was in your tests). It was probably already at the 95 degree limit when cooling suddenly failed (bad water block mount let it fall off?) and the tremendous heat already present in the CPU forced its temperature to shoot past 160 degrees. There would have been nothing the internal temperature detection and protection measures could have done to prevent this as the heat was already present and suddenly it had nowhere to go, so the temps went through the roof.
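A rough way to put numbers on a theory like this is a lumped heat-capacity estimate: if no heat can leave, temperature rises at a rate of power divided by (mass × specific heat). The sketch below assumes all the power goes into the copper IHS alone; the 30 g mass and 150 W power are hypothetical placeholders, not measurements from the video, and the function names are made up for illustration.

```python
# Illustrative values only; none of these numbers come from the video.
COPPER_SPECIFIC_HEAT = 0.385  # J/(g*K), textbook value for copper
IHS_MASS_G = 30.0             # hypothetical heat-spreader mass
POWER_W = 150.0               # hypothetical package power with cooling lost


def heating_rate_k_per_s(power_w: float, mass_g: float, specific_heat: float) -> float:
    """Temperature rise per second if no heat leaves the mass at all."""
    return power_w / (mass_g * specific_heat)


def seconds_to_heat(start_c: float, end_c: float, power_w: float,
                    mass_g: float, specific_heat: float) -> float:
    """Time to go from start_c to end_c at a constant heating rate."""
    rate = heating_rate_k_per_s(power_w, mass_g, specific_heat)
    return (end_c - start_c) / rate


if __name__ == "__main__":
    rate = heating_rate_k_per_s(POWER_W, IHS_MASS_G, COPPER_SPECIFIC_HEAT)
    t = seconds_to_heat(95.0, 160.0, POWER_W, IHS_MASS_G, COPPER_SPECIFIC_HEAT)
    print(f"{rate:.1f} K/s, {t:.1f} s from 95 to 160 degrees C")
```

With these made-up inputs the time scale is a handful of seconds, so in a model like this the outcome depends on how quickly power is actually cut or reduced once cooling is lost.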
Excellent video. Maybe it was a sensor issue.
Never thought that I'd see a CPU on a CNC milling machine!
If the person who sent the CPU in contacts you in a few weeks with the same problem, that will narrow the problem down a bit. Great video, very interesting.
I'm more interested to see the event viewer logs from the user/yours. It seems vendor bios didn't enforce a hard limit CPU temp to ramp down clock frequency. Thank you for the great content as always.
didn’t realise that ihs was that thicc 8:18
Theory: the TIM was not evenly spread, and the place on the CPU die where the thermistors are located reported 'high, but in spec' temperatures, but in the unevenly covered positions it got too hot and melted.
There are a lot of sensors per die, though. What a person sees as 'core temp' isn't the only sensor in that core but just an average, same shit on Intel.
In that case it would still be amazing that the CPU was alive at the same time. Or at least could still pull this current.
There are hundreds of sensors on the die - and normally the highest value would be reported.
A direct short would bypass all protocols, and the CPU could reach that temp easily.
What if one of the chiplets died and the other one still worked? Maybe it still pushed power into the broken one and got that to 160 while the other one was working, and therefore the system didn't crash?
I would do an analysis of the composition of the 'shiny' bits of indium still on the die. That TIM is technically an alloy, and if the material was not properly made, you may have areas with a lower melting point.
Regarding 100C due to water evaporation - Since the cooler is closed loop, any initial evaporation would result in increased pressure, raising the boiling point of water.
@der8auer with regard to water temp, ethylene glycol, found in many coolants used in custom loops, raises the boiling point significantly above 100° Celsius. Also, putting the liquid under any pressure (which is created by heating the closed system) raises the boiling point further. Interesting nonetheless :-)
Yeah but not even close to 160c
I had something similar happen on the Aorus B650I Ultra using a Ryzen 7 7700X on the F2 BIOS. I built an ITX Ryzen system and I wanted to tune it with a smaller 130W, 92mm fan tower cooler. So while I was tuning it down, I set the voltage to 1.22 as just a normal OC process of seeing how far down I could undervolt at 5.2GHz. The temperature immediately shot up, and I saw it running at 107°C for a couple of seconds before it increased to 111 and shut off. I didn't think the motherboard would let that happen.
That is a real head scratcher. Can you replicate indium melt and flow with a short-circuit and electrical arcing?
I have 2 theories. The first one is that the CPU's thermal sensor at some point became faulty and stopped reading the actual CPU temperature. For example, the CPU is at 95°C but the system reads it as 50°C or 60°C, so the system thinks the CPU is still in the safe zone, and in the long run that killed the CPU.
My 2nd theory is that the solder that melted may have had a lower melting point, which is a factory defect. So over the long run at 95°C, which was the melting point of that solder, this was the end result.
Hey Derbauer.
I also had a similar situation where my CPU (5800X3D) did not turn off after going over 95.
The PC was on its side and I stood it up as normal, but the AIO did not like this and there was no flow of water.
The CPU reached 120 degrees in Windows before I pulled out the power cable.
Gigabyte B450 I AORUS PRO WIFI (rev. 1.0) with latest bios (F64a)
ALF II 280
5800X3D with PBO
It might have happened on the last POST. An in-chip short, perhaps caused by the internal power delivery circuitry, could create enough heat, and the owner wouldn't be able to realize it fast enough to stop the solder melting. It's one of those "what are the odds" failures.
The water does not have to evaporate... The boiling point of water is 100°C only at 1 atmosphere, at sea level, in the open air. Once you constrain water in a sealed closed loop, it can be under higher pressure. So a little may boil, and that small amount of generated vapour will then pressurise the container. The remaining water in the loop will then be pressurized enough to raise the boiling point above 100°C, while still remaining a liquid.
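To get a feel for how much pressure changes the boiling point, here is a minimal sketch using the Antoine equation for pure water. The coefficients are the commonly tabulated ones (pressure in mmHg, temperature in °C) and are only fitted for limited ranges; glycol mixtures, which real AIOs use, are not modelled, and the function names are made up for illustration.

```python
import math

# Commonly tabulated Antoine coefficients for pure water (P in mmHg, T in deg C).
# Each set is only a fit for its own temperature range; treat results as estimates.
ANTOINE_LOW = (8.07131, 1730.63, 233.426)    # roughly 1 to 100 deg C
ANTOINE_HIGH = (8.14019, 1810.94, 244.485)   # roughly 100 to 374 deg C
MMHG_PER_ATM = 760.0


def boiling_point_c(pressure_atm: float) -> float:
    """Estimated boiling point of pure water at an absolute pressure in atm."""
    a, b, c = ANTOINE_LOW if pressure_atm <= 1.0 else ANTOINE_HIGH
    p_mmhg = pressure_atm * MMHG_PER_ATM
    return b / (a - math.log10(p_mmhg)) - c


def vapor_pressure_atm(temp_c: float) -> float:
    """Estimated absolute pressure (atm) needed to keep water liquid at temp_c."""
    a, b, c = ANTOINE_LOW if temp_c <= 100.0 else ANTOINE_HIGH
    return 10 ** (a - b / (c + temp_c)) / MMHG_PER_ATM


if __name__ == "__main__":
    for p in (1.0, 1.5, 2.0, 3.0):
        print(f"{p:.1f} atm -> boils at about {boiling_point_c(p):.0f} deg C")
    print(f"160 deg C needs about {vapor_pressure_atm(160.0):.1f} atm")
```

Under these assumptions, doubling the absolute pressure only lifts the boiling point of plain water to roughly 120°C, and keeping it liquid at 160°C would take on the order of 6 atm.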
Obviously I'm rather late to this video, which is great and I'm very jealous of your mill.
It occurred to me that the unnamed lab and engineer from Steve's recent videos might be able to shed some light on what happened to this one. The 160°C is nothing compared to the CPU which damaged the m/b, melted copper and, possibly, silicon!
I wouldn't be surprised if you've already spoken to him about it. That the original owner of this CPU was using an ASUS m/b did not surprise me. 😒
I wonder how all the different thermal sensors in the dies are ultimately culminated for monitoring. Temp sensing is an analog methodology that must then be converted into a digital value *somewhere.* If this is being done within the CPU itself, then a few atoms out of place in the circuit could skew the analog temp reading or conversion by multiple degrees. I don't know how much testing AMD does on each individual CPU before packaging, but something like this could be missed if it required more "burn-in" to permanently alter an unstable circuit.
Another theory might suggest something as purely random as a neutrino hitting a circuit in the CPU and causing it to read a wildly inaccurate value in volts, current, or temp; allowing limits to be exceeded without tripping a shutdown. A bit of a stretch of the imagination, but crazier things have happened when neutrinos hit a processor.
I mean, how long would the solder need to be at 160c to begin flowing in a vertical orientation, such as in a PC tower, anyway? A couple seconds at most?
Great video as always
The soldering material may not be pure indium. Some of them are using In-Ag alloy which could melt as low as 128°C. I have a piece left from my STIM experiment back from Haswell era. Since the temp sensors are not 100% accurate and won't cover the whole die, it may just be enough to desolder the CPU.
There are hundreds of sensors on the die
@@stanimir4197 But people still need to drill a tunnel through IHS and place a sensor in to get the precise reading.
I'm not sure if this is possible but I would like to see how his board specifically would handle that test you did with taking the AIO off.
Also, could there have been a contaminant in the indium that lowered its melting temperature?
Even if the melting point was lower, it wouldn't have mattered. A catastrophic failure like this would experience a thermal runaway. That is, the chip failed first then melted.
I thought 95°C was the normal operating temperature and 115°C was the critical temperature protection. I actually asked Steve from Gamers Nexus in a super chat but he never answered. I wanted to ask him whether it would be safe to run the CPU at 110°C, since 115°C was the critical point.
yes: 95 - throttle, 115C shut off
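Written down as a tiny decision rule, the two thresholds look like this. It is only a sketch of the behaviour as described in this thread (throttle at 95°C, shut off at 115°C); the function name and constants are illustrative assumptions, not AMD firmware logic.

```python
# Thresholds as quoted in this thread (assumed, not taken from AMD documentation).
THROTTLE_C = 95.0
SHUTDOWN_C = 115.0


def thermal_action(temp_c: float) -> str:
    """Return the action the thread describes for a given CPU temperature."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"   # last-resort protection, system powers off
    if temp_c >= THROTTLE_C:
        return "throttle"   # clocks are reduced to hold the limit
    return "normal"


if __name__ == "__main__":
    for t in (80, 95, 110, 115):
        print(t, thermal_action(t))
```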
Running over 95°C will still probably seriously decrease the lifetime of the CPU if run constantly.
@@cromefire_ That's what people said about 95°C! I haven't seen information to substantiate your claim. Truth be told, only AMD knows the answer to that question.
By the way, Intel processors run at TjMax now, and Intel claims that this will not degrade the processor's longevity.
@@juno1597 AMD explicitly states that it's safe up to 95°C. That's why the CPUs try to NOT go over 95°C. Thermal shutoff at 115°C means that above 115°C your CPU will likely be damaged immediately. Thermal shutoff doesn't mean it's healthy to run it at 115°C, but rather that beyond that your CPU won't die in months, days or hours, but probably seconds. It's the last resort.
Now, with dies there is a gradient. 96°C is probably fine as well; at 100°C your CPU might degrade fast but would probably still run for a long time; but the closer you get to 115°C, the faster your chip will probably degrade. How much faster? Well, you'd have to try. As you heard Roman say about the Intel chip without limits, there will be a point where it will just die outright. Given Intel's TJmax of 105°C with the CPU dying at 125°C, and AMD's "TJmax" of 95°C with shutoff at 115°C (the same +20 K), it's probably reasonable to assume that the CPU might die pretty close to that.
Lastly, there shouldn't be a reason to run it as high as that. As chips get less efficient with temperature, even AMD's move to go to 95°C isn't the most efficient, and going beyond that you'll probably not get a lot more performance. If you're willing to risk your CPU for 1-2% in performance, if even that, you're probably also willing to just spend a bit more on a better cooler and enjoy more performance with less risk involved and an intact warranty.
@@juno1597 Intel and AMD have said that, but I don't believe it. They also have no evidence to prove they are safe at 95c for years.
If the 7900X has 2 cores disabled on each CCD because they didn't pass QC, could they by any chance have somehow received power and shorted out? The melted solder was focused on one area on each CCD after all. Is there any way to know which cores were underneath?
My theory, from looking at the hotspots, is that the CPU could have shorted. If that's the case, the voltage regulators on those boards are capable of delivering quite a bit of current, which leads to thermal runaway. It likely could have exceeded 160°C, even. If shorted, the CPU won't work, but the motherboard VRMs could still have been delivering current. I have seen over 260°C when the cooler failed and the board VRMs were still delivering power: the CPU was well over 260°C, even reaching over 326°C before power was cut, measured using a FLIR thermal imaging camera, which is pretty accurate. That was a Ryzen 5 CPU we tested on an MSI motherboard. It sure looks like that's what happened. I would have tested the melting point of that blob.
I believe that when the indium was applied it left some bubbles (air pockets). Or it could be that the CCD layering was improper.
Keep in mind the CPU was also under pressure from the cooler; this would lower the melting point of the indium.
I have no idea by how much, however, and I highly doubt it would drop it below 120°C.
Higher pressure HIGHER melting point. Why do you think they lower pressures in vapor chambers/heat pipes. To boil at a lower temp, to use the thermal capacity of phase change, to cool your CPU.
I "heard" on a forum that having PBO enabled disables the temp limit, which sounded crazy to me but apparently is true, you can see in the bios settings that PBO is on auto which probably means that it's active.
That really shows how thick the IHS is after milling away the side brackets.
It's possible the CPU was damaged internally in some way when it overheated, and direct-shorted inside. Same as when electronics fail and burn up because they are shorted internally.
That's one sharp endmill to have milled away that copper so cleanly, unless they make the heat spreader out of free machining copper.
Very strange. The solder seemed well bonded to the chips and the IHS, and I'd have to imagine even as a liquid it would still make a very good thermal connection. And even for the solder to melt out from under the IHS, I'd imagine everything would have to be very hot. If the silicon managed to get incredibly hot very fast, I can't see liquid metal getting very far while touching a big piece of copper unless it was also really hot.
If you have a dead AM5 chip that's still lidded, it might be interesting to see how hot you have to get it to make metal drip out the bottom of it.
As someone who once, a long time ago, accidentally ran a Pentium CPU without a cooler and watched it melt itself in seconds, it is wild to me seeing a CPU running without a cooler today and being totally fine afterwards.
Just a theory: could it be that the pin and pad contact between CPU and socket was at fault? It could cause arcing, or the contact on the pin and pad that supplies temperature data was poor. We have seen this issue with Intel, hence why the contact frame was made.
Aside from this issue here, what if we could run these processors at 110°C with direct die cooling for boiling water? The expansion of the boiling water could act as the pump, meaning that the pump would only ever fail if the CPU was too cold, the cooling loop was overpressured (say 20 PSI) or the water evaporated.
IIRC you can reduce the boiling point of water by adding ethanol to it but that would probably cause problems with seals and sealants.
Bear in mind that as the water temp rises, pressure in the loop will rise, raising the boiling point of the liquid (there would need to be at least a little bit of air to allow some compression; water on its own would likely burst the loop somewhere just from uncompressed expansion).
As mentioned, verify the melting point of the TIM. If it's lower than you expect perhaps mounting vertically would cause it to drip away instead of on your test bench where it would stay in place.
@der8auer I have a 7700X that might interest you.
It was working fine with 2200 FCLK for over 2 days, so I tried to lower the SoC voltage from Auto (BIOS reading 1.43) in -10 mV steps. When I got to 1.32 SoC, the latency in AIDA64 increased by anywhere from 2x all the way to 30x.
Since then the CPU cannot be stable with normal latency over 2000 FCLK and only boots up to 2067 FCLK, but the latency at this FCLK is about 4x the normal range.
I changed the motherboard from the Strix X670E-E to the X670E Gene and tested with 1 DIMM only, but no luck. I tried a fresh Windows install and a different PSU, again no luck.
Like I said, if you're interested let me know, because I have searched Google for weeks and can't find one person with this phenomenon.
The max. safe-ish voltage for the IOD was 1.2V when it was still made in 12nm. That certainly hasn't increased with Zen 4 and the IOD being made in 6nm now. I guess that the auto setting and/or auto auxiliary voltages (like VDDG, VDDP) were way too high (which was also the case with early BIOSes for Ryzen 3000) and you degraded the IOD. Are you sure the BIOS was showing 1.43V? It should have been 1.143V!
Saw the same spots on my 7950X after delidding it. The CPU is not dead though. I used the ROG Strix X670E-I motherboard and only minor overclocking (it also started running down between the dies).
What's interesting is that the gold layer on the die now looks completely black/cooked. I don't know if it's because of the liquid metal or the very high temps after delidding it (I had a thermal shutdown after delidding it).
My big guess is that the indium (alloy) used starts to melt, and the results can be seen only after some time using the CPU in a vertical orientation, unlike the horizontal open test bench used by @Der8auer (I had mine running for about 1 month).
My guess is that the temps on the die are averaged therefore some spots can get hotter than the melting temp.