My reaction to hearing the cache was on the bottom was one of disbelief. Nobody had talked about it beforehand, and it was presented as if it were no big deal from a complexity and manufacturing standpoint. Having TSVs that carry power and data from the CCD to the substrate running through the cache die, alongside the TSVs that connect the cache to the CCD, is drastically more complex than the way 3D V-Cache used to work. It makes sense why the cache die is the same size as the CCD: it's to fit all those TSVs.

The word is that AMD is moving to silicon interposers for Zen 6 to connect the I/O die to the CCD and V-Cache. There is talk of an increased level of modularity coming with this change, which will lead to cost savings as there will be fewer bespoke designs for CCDs. Rather, they will use a common interposer to connect CCD to IOD, and the IOD will become the bespoke part based on the product, i.e. special IODs for the various client and commercial products. This is an interesting move alongside AMD's push towards a unified graphics architecture, UDNA. Perhaps we will see the next generation of an MI300-type product that is much less costly to manufacture due to the modularity of forthcoming products. Imagine a server product with a common silicon interposer connecting all CCDs with 3D V-Cache and UDNA dies to the I/O. Pretty cool.
Anecdotally, undervolting the 1st-gen 3D V-Cache CPUs had the greatest effect on performance, which makes sense given the many thermal barriers in the chip hierarchy.
Yes, bonding the CCXs to a die/interposer carrying the cache, I/O, and GPU that can be made on a less expensive node, where those components wouldn't benefit from the same scaling as the tiny logic CCXs on top, while the top dies can still disperse the heat that the interposer layer doesn't produce... This has to be the thinking here.
3D-stacked I/O would solve the memory latency problem. But then, we already have monolithic Zen 5 chips (every laptop CPU), and they don't seem to perform massively better than chiplet Zen 5.
It's not about the performance per se, but about being able to use older tech for the I/O and being able to bin multiple processing chiplets for your SKUs, making the whole package significantly cheaper, as well as reducing the massive die-to-die latency we experience on Ryzen desktop today.
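To put the cost argument in concrete (if entirely made-up) terms, here is a minimal sketch in Python of the usual defect-density reasoning behind chiplets. None of the wafer prices or defect rates below come from AMD or TSMC; they are placeholder values chosen only to show why several small leading-edge dies plus an older-node I/O die tend to beat one large monolithic die:

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Rough geometric estimate of candidate dies on a round wafer."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_rate(die_area_mm2, defects_per_mm2):
    """Poisson yield model: small dies are exponentially more likely to be defect-free."""
    return math.exp(-defects_per_mm2 * die_area_mm2)

def cost_per_good_die(die_area_mm2, wafer_cost, defects_per_mm2):
    good_dies = dies_per_wafer(die_area_mm2) * yield_rate(die_area_mm2, defects_per_mm2)
    return wafer_cost / good_dies

# Hypothetical inputs, for illustration only.
LEADING_EDGE_WAFER = 17000   # USD per wafer, assumed
OLDER_NODE_WAFER   = 9000    # USD per wafer, assumed
DEFECT_DENSITY     = 0.001   # defects per mm^2, assumed

monolithic = cost_per_good_die(280, LEADING_EDGE_WAFER, DEFECT_DENSITY)
chiplet    = (2 * cost_per_good_die(70, LEADING_EDGE_WAFER, DEFECT_DENSITY)   # two small CCDs
              + cost_per_good_die(120, OLDER_NODE_WAFER, DEFECT_DENSITY))     # older-node I/O die

print(f"monolithic die: ~${monolithic:.0f}   chiplet package: ~${chiplet:.0f}")
```

With these placeholder numbers the chiplet version comes out at roughly half the silicon cost before packaging is added back in, which is the trade-off the comment above is pointing at.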
Freaking runs cool, especially with the AM5 offset brackets. Got the Ryzen 7 9800X3D under an Arctic Liquid Freezer III 360 Black using the AM5 offset brackets; it idles at 28°C and never goes over 63°C while gaming. Also got CL28 6000 2x16GB memory overclocked to CL28 6400, stable with direct memory cooling, on an MSI MEG X870E GODLIKE. It owns, paired with the Asus RTX 4090.
If the cost difference between wafer-to-wafer and chip-on-wafer bonding is big enough, it could be justifiable to leave empty space around the smaller chips on the wafer, especially if the final ratio of chips needed is 1:1.
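A toy comparison of the two approaches (all yield figures below are assumptions for illustration, not real fab data): wafer-on-wafer bonding pairs dies blindly, so a defect on either wafer kills the whole stack, while chip-on-wafer can test first and bond only known-good dies.

```python
# Assumed yields, purely illustrative.
ccd_yield, cache_yield, bond_yield = 0.90, 0.95, 0.98

# Wafer-on-wafer: dies are paired blindly, so both must happen to be good.
w2w_usable = ccd_yield * cache_yield * bond_yield

# Chip-on-wafer: only known-good dies get bonded, so losses are mostly
# limited to the bonding step (at the cost of extra test/handling work).
c2w_usable = bond_yield

print(f"wafer-on-wafer usable stacks: {w2w_usable:.1%}")   # ~83.8%
print(f"chip-on-wafer  usable stacks: {c2w_usable:.1%}")   # ~98.0%
```

Whether the extra handling of chip-on-wafer is worth that yield gap is exactly the cost trade-off the comment describes.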
I actually expected AMD to move the 3D V-Cache below the CCD when the 5800X3D came out, because it's the only move that made sense to me. Remember, this was a pet/side project that got lucky, so the layout of the CCD did not have any 3D V-Cache in mind at the time.
Excellent explanation, thank you! It will be interesting to see their next move; the I/O die is clearly holding them back in Zen 4 and Zen 5, so Zen 5+ or Zen 6 should bring major changes in that regard, no matter whether it becomes part of the stack or is just renewed.
No, the IOD is NOT holding their CPUs back. The type of interconnect is holding them back. AMD is SUPPOSED to move to direct connects between chiplets, so you end up with what Intel calls tiles, where each chiplet can be pushed up next to another. If you can move to direct connects, you don't need a parallel-to-serial conversion just to send data from the CCD to the IOD or the other way around, although most data on the IOD is probably in serial form already.

As a for instance, say the CCD has to write data to memory. The cores first have to convert the data to serial, then transfer it over the CPU PCB and into the IOD (at BEST a serial transfer clock of 3GHz for Zen 4 and 3.2GHz for Zen 5). There's already a lot of latency just from that. With direct connects there's no need for a data conversion; data moves in parallel between chiplets, or that's the way it should work. The transfer clock will probably bump up to 4+GHz, so about 25% faster, ADDED to the removal of latency from not having to do data conversions. But back to that poor data that got sent to the IOD to be stored in memory: it has to go through the Infinity Fabric multiplexer to be directed to the correct place, and that multiplexer takes up a lot of space on the IOD. AMD should be able to get rid of it by moving to direct connects. Only after the data moves through the multiplexer can it be sent to the memory controller.

AMD is using a cheap way to connect dies right now, which is understandable because they have to compete with Intel, and when Zen 2 came out, which is when AMD moved to MCM, Intel ruled the world of x86-64 and AMD had to price products under Intel. They've kept that same IOD-plus-Infinity-Fabric interconnect through Zen 5 because it costs less. But that's changing, and AMD should be able to move to direct connects between the CCDs and IOD, so the whole notion that the IOD holds anything back is just not true. It's the connection speed of the Infinity Fabric, along with the data conversions, that holds back the CPUs. It didn't matter so much for Zen 3, but it started to pretty clearly for Zen 4, when AMD was able to push core clocks a bit higher.

In fact, moving ANYTHING off the IOD and onto a die that sits below the cores means the entire product line becomes more expensive, as every part would now need stacked dies, and AMD doesn't do well when their products are more expensive than Intel's unless they are CLEARLY better. OEMs won't use those parts to build PCs and laptops because they say they can't price AMD-based products higher than Intel-based ones. So simply changing the interconnect to something slightly more expensive is by far the better option for AMD. They don't exist in a bubble, and most of the market still sees AMD as the budget option, probably even you.
@@johndoh5182 So what you're saying is that the I/O die setup is holding their CPUs back in Zen 4 and 5, got it. (The interconnect is part of that setup; changing the interconnect is changing the I/O die, and so would removing the Infinity Fabric multiplexer.) Still, appreciate the explanation, thank you.

Not sure, though; it seems to me that having the I/O die in the stack with TSVs is cheaper than having a silicon interposer on which you place the I/O die and the chiplets, but it's possible the interposer would be cheaper than the stacking process. Someone else mentioned the Infinity Fabric on High Performance Fanout that they used in RDNA3 as another option; that probably sits somewhere in between the cost and benefits of stacking/interposer and the current setup.

And no, AMD has been the premium gaming option since at least Zen 3 X3D, and depending on workload the premium multi-threaded option too. You know what they say about assumptions, buddy.
Surely a dumb question, but is it possible to mix two different process nodes on the same wafer? I mean, making the first layers (transistors and cache) using the most advanced node (N3, N4) and then using a less advanced one like N7 for the V-Cache and the rest of the layers?
Maybe this is a stupid question, but could you stack 3 active dies using hybrid bonding, so you have a core die, a cache die, and an I/O die at the bottom? And since that cache die is active (of course), do they only run TSVs through it for the other parts, or is there some inactive (connection-only) layer as well? I doubt it, because it would require each core die to have a cache die underneath to reconnect to the PCB, but in theory they could do it, and then it would be a real 3D construction.
You know, the thing that amuses me is that before the 9800X3D launched, _everyone_ scoffed at the rumors that AMD was going to put the cache under the CCD, listing drawbacks that were "much larger" than the benefits. Then the 9800X3D appeared, shocking all the pundits and proving that the trade-offs are not that bad, as the 9800X3D totally annihilated all other "gaming CPUs" with zero exceptions.
Placing IO below the compute dies would be wonderful for latency and reduction in power usage per transferred bit to/from each compute die, however just as IO does not scale with new process nodes, IO power requirements don't scale down either - going off-chip can be very power hungry, which contributes to thermal concerns. Forrest Norrod discussed this in an interview in the past, where he mentioned that the power requirements for IO in Epyc were a real problem. How much of that is just "because it's IO" vs IO requirements for something in the data centre space is an open question of course.
I knew AMD had to change something about how the CPU wafer and V-Cache wafer were going to interact with each other, but as to what that would be or how it would be done, I had no idea. It's surprising that AMD was able to come up with a better option in such a short amount of time. Maybe AMD had both options for how the CPU and V-Cache could be connected, and for how that would affect performance, used Option B the first time around, and then found out that the option not yet used was the better one.
Putting the cache between the cores and RAM seems logical. As far as the data flow goes, the cache needs to connect to the ram, and the cache to the cores. Access to ram bypassing cache isn't common, so this doesn't seem like as much of a routing mess as it could be. That said, if that is how it works, making the same CCDs work without the caches on the non x3d chips is really impressive.
So in short, this means that AMD could drop the IF and allow direct chip-to-chip communication in the future. I didn't watch the full video, but this would mean we could see a single-chip-like package vs the 2-3 chips we see today, along with allowing them to make some of the chips even smaller going forward. Instead of large L3 and L2 (from the base die), we could just move those memory layers into their own layer, leaving a ton more space for just the compute. Or, more than likely: chiplet -> memory -> chiplet -> memory, etc. This way all the compute and memory would be accessible and addressable across all compute and I/O. That way you can have one core completing one task while another core can then access that same memory (without having to reschedule the working thread), even though it's on a totally different CCD.

To be fair though, this wouldn't really improve performance per se, maybe power for sure (as it could lower idle power draw), but you would still be limited in compute. More than likely this would allow a smaller-footprint package followed by a lower cost on the compute chiplets. Everything else would cost the same or more, as the packaging method would have some increased cost. So SKUs that have two chiplets wouldn't have to fight over the "gaming cores" or over whether one has 3D cache or not, as both would have access to that extra memory. If it's all SRAM, I wonder if that layer would just be marked out for the different memory levels (L1-L3) as well.
New materials science could be the next step. Graphene / diamond semiconductors could allow for huge clock increases, structural strength, and great thermal dissipation characteristics... it probably already exists in "the military black budgets"?
Could the 3D V-Cache be used on their GPUs too? I know GPU cores are significantly more limited in computations, but I'd be curious if "big cache" could help. TBF I'd also like to see them dump the AI and ray tracing core nonsense for just straight big pure render/shader cores... Idgaf about ray tracing or AI since 99% of the time I'm not utilizing it...
@@Kemano24 there might be room for IO and cache, but not room for extras like graphics. Maybe the memory controller would go on the cache chiplet, and there's an IO die for USB, PCIe and everything else. I think then you would have NUMA issues with more than 1 CCD. Cache and IO sound perfect for a trailing process node, I just don't know where to divide the pieces.
22:52 No Turin X? That doesn't seem very likely. All the available information shows that V-cache was a server-first initiative that only made its way onto desktop because somebody suggested it would be good for gaming. That AMD would do all that extra engineering to get the cache die on bottom only to restrict it to the consumer desktop space seems almost absurd.
I have 0 practical use of the information provided in this video, but I still enjoyed it very much. The way you explain everything, and your voice paired with incredible visuals and animations, made for a really easy to understand and entertaining watch!
One practical use is the ability to confidently navigate through the marketing confetti when choosing a new system.
I just watched someone freaking put solder on a chip pretty evenly in what looked like a minute or two.
Will I ever have to place the 50 to 100 beads spread across a microchip? Never. Yet how nonchalant it was made it wild.
I am only a minute into the video. Didn't know people did that; I thought those things were burnt into the other side. So basically a minute into the video and I'm learning already. Kkk bye.
Wisdom is built on knowledge, so it's good to know a lot of things, even if they are not directly useful.
The fact that the regular Zen 5 CPUs were mostly received as very disappointing, while the 9800X3D is flying off shelves like crazy, should tell you how valuable this tech can be for the right applications. Being able to cram all that extra cache in there really makes these CPUs shine in games.
Maybe Zen 5 was basically revamped around X3D, but hey, it's still confusing why normal Zen 5 sucks so bad.
@@osopenowsstudio9175 Two things. Windows 11 has issues that affect performance, most of which have been fixed, and if you don't keep up with tech news you'll miss this kind of data and get stuck believing something that isn't true.
Next, a LOT of reviews were done with the 9700X, and this was the typical CPU used to judge Zen 5 gaming performance in general, since that's about the most cores any game can use. But the 9700X is a 65W TDP part. AMD, I believe, has released a new AGESA that allows it to clock a bit faster, and it does so very easily.
So, Zen 5 doesn't suck; in fact it's an excellent CPU generation. If all the CPUs had launched NOW instead of when they did, the day-1 reviews would look much better, other than for the 9700X, unless AMD changed it to the 105W TDP part it should have been; it easily handles running at that power scheme. Oh, except Microsoft STILL would have had issues in Win11 affecting Zen 5, because they weren't going to fix them until a few million consumers started screaming at them.
It's been tested running higher power schemes and it does a bit better than when it's set at 65W.
Having said that, Arrow Lake also didn't look good at launch, and it seems to have the same problem in that Windows doesn't handle it well.
Shocking, huh? You mean a Microsoft OS has issues that affect the performance of these x86-64 CPUs? Gee, I wonder if it's their intention to make those look worse than they are; after all, they're now selling their own devices using ARM processors, and they want people to buy THEIR hardware and get locked into a Microsoft ecosystem very much like Apple's.
For more info, watch the various Level1Techs videos dealing with this issue. It seems Zen 5 runs like you'd expect under Linux.
Zen 6 is GOING to use Zen 5 cores with maybe one or two changes, so apparently AMD thinks their Zen 5 cores are perfectly fine. But I think Intel and AMD should partner and put out a professional version (distro) of Linux.
@@osopenowsstudio9175 what I'm reading is some folks believe that the current IO die (memory bandwidth?) is bottlenecking the CPU and preventing a lot of the real gains in the core from actually showing up. The 3D cache masks a lot of that
@@osopenowsstudio9175 Zen 5 is still better than Zen 4. It doesn't improve every aspect of performance, which is to be expected. Zen 4 was the best and now Zen 5 improves on it. People wanted more than they got, which was unreasonable, but, at the end of the day, Zen 5 is unsurpassed in x86 performance. Also, of course, there was the power usage reduction to take into account, which many didn't care about.
@@multiplyx100 Fair enough, but Zen 5 is kinda tripping when it's barely faster than Zen 4 (on Windows), to the point that even AMD was surprised.
The increase in electrical resistance from having the cache die below the CCDs may not be as bad, or even bad at all. It is true that the distance will be a bit higher, but that can be offset by more TSVs in parallel.
Another consideration is that at very high frequencies, current flows mostly in the outer layer of conductors, increasing the apparent resistance of thicker ones. I would have to do the numbers to see if this is significant here. The skin effect doesn't apply to DC, but even if the input voltage to a CPU is DC, the current flow isn't. Each time a transistor switches on/off there is a change in current, and that happens a lot, at multiples and dividers of the clock. I don't work in the industry, so I don't actually know how bad that current ripple is, and I haven't run the numbers on the skin depth at the relevant frequencies, but I wouldn't be surprised if it plays some role, as transistor switching itself can be much faster than the CPU clock and there are harmonics too from using square waves. (A rough skin-depth estimate is sketched after this thread.)
"Skin effect", Industry terms; "magnetics", "field effects", cohesion, resistance, repulsion there is definitely analog happening here in the glue and the harmonics regulated not to be destructive. Well-articulated and easy to comprehend, thank you for arousing my thinking on the topic. mb
AMD uses an integrated inductor coil in the metal layer. This is a technology they introduced during the bulldozer era. The inductor smooths out current fluctuations in power delivery.
So Bulldozer was not 100% tragedy... 😆
@@doggSMK It was a flop, but a lot of the technology they developed during that time is still being used today. For example, Infinity Fabric uses a physical layer called GMI, which was developed during the Bulldozer era.
Switching frequencies high enough to cause a problem, eg in PCIe, are always accompanied by GND or the inverse frequency... so there's no problem.
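For anyone curious about the skin-depth numbers mentioned in this thread, here is a rough back-of-the-envelope estimate, assuming plain copper and picking a 5 GHz harmonic as the frequency of interest (the real TSV geometry and current spectrum aren't public):

```latex
\[
\delta = \sqrt{\frac{\rho}{\pi f \mu}}
       = \sqrt{\frac{1.7\times10^{-8}\,\Omega\,\mathrm{m}}
                    {\pi \cdot 5\times10^{9}\,\mathrm{Hz} \cdot 4\pi\times10^{-7}\,\mathrm{H/m}}}
       \approx 0.93\,\mu\mathrm{m}
\]
```

So at those frequencies the current crowds into roughly the outer micron of a copper conductor; whether that matters for the power-delivery TSVs depends on how their diameter compares to that figure, which is exactly the number the original comment says would need checking.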
The possibility of moving I/O into the cache chiplet and then hybrid bonding it with the core chiplet might explain why Zen 5 still uses the old I/O die: they still need to cook the new approach before it's ready. Honestly, I think they really need a new I/O die for future Zen CPUs; even the current one seems to be holding things back. Integrating I/O into the cache chiplet could give them benefits like lower idle power consumption, higher memory speed support, lower CCD-to-CCD latency, etc.
In "client" AMD have marketed monolithic chips with good battery life and lowered idle consumption especially when sleeping, the plans are for a Strix Halo using CCD chiplets & a beefy IOD/GPU.
What does Intel have that's better than Zen4? Why would AMD divert resources from Zen6, to make a Zen5+ when they could use the re-use the cheaper EPYC platform edge computing variant for a new prosumer Threadripper range with quad-channel memory and loads of PCIE lanes?
The current cheap IOD & chiplet architecture is being replaced already in Zen6, as was mentioned MI300 and Ryzen x3D are proving the technology which should be ready and scaled up for the performance desktop market launch. Perhaps Zen5 replaces the low end and Zen6 will initially be a high end product only.
As general purpose compute is losing relative importance, we may be seeing core, GPU & NPU chiplets stacked on an IOD/cache interconnect in future.
The mild leak, if you believe it, has a silicon interloper. That would be an easier intermediate step.
That approach would not work for Strix Halo, and with Nvidia entering the mega-APU game, that would be a concern. Making laptop chips chiplet-based with a low-power interloper seems a logical intermediate step.
@@bobo-cc1xw You mean an interposer, but that was a replacement for an organic substrate with solder-bump connections and wires inside, known as Infinity Fabric.
They already have hybrid bonding working in V-Cache and the stacked MI300, but you need ways to connect up the tiles of larger server chips, which are planned to offer more than compute, things like signal processing for telcos.
Without a 2.5D off-die interconnect, the silicon area will be severely limited by the reticle limit.
Unlikely they'll be able to do this due to thermal constraints.
An IOD/SRAM/interposer combo would be interesting, but it would use a lot of die area and may have issues with yields and thermals. The latest processors from Intel are close to this idea though.
A hybrid bonded chip also has better thermal conduction between the layers than solder bumps would.
3:34 I want my cpu to be glued together with tiny burgers
Thanks for giving a good background of hybrid bonding. I am so proud to be working in semiconductor packaging.
Thank you for these amazing videos!
They have gotten me very excited about chip technology. I currently work adjacently in wearables but having learned about all this amazing stuff, I'd love to consider a second career in chip technology.
There's some cool chip tech in wearables.
@@HighYield Ever considered doing a video on something like the Apple Watch S chips (or something similar)? I imagine there have to be some interesting choices and compromises in the interest of efficiency and space.
Yeah thats cool and all but have you tried super glue?
also add cotton wool and baking soda for better adhesion
Reddit leave
Sir, I'm a big fan of your channel and addicted to your content. The way you explain CPU technology and architecture with high-definition visuals in such a detailed and easy-to-understand manner is truly inspiring, even addictive.
I'm a beginner trying to learn and self-study CPU architecture from the basics. Recently, I've been very curious about the MOS 6502 architecture, but I haven't found any video that explains it as well as you explain other architectures and CPU technologies.
I kindly request you to make a video on the 6502 architecture. It would be incredibly helpful for me and many others who are eager to learn.
This generation demonstrates that AMD was working hard to push boundaries over the past 10 years, and at the same time shows that INTEL was WASTING THE LAST 10 YEARS, trying to scam everyone by selling the same chip every year.
AMD was also trash at one point and lacked innovation. That's why Intel wasn't pressured to compete; you can't blame Intel if there was no competition. Now they are working their asses off innovating their fabrication.
Intel has been such a disappointment. I used to be an intel fanboy but after their issues over the last year and questionable business practices.... AMD needs competition.
@@SupraSav AMD has competition. Intel is not that behind. Just wait for Panther Lake with Intel 18A and Xe3.
Yep, I have a 5820K and 10 years later, unless I get server CPUs, nothing much has changed; they put in more cache and that's it.
Actually, it also got worse. That E-core idea they took from ARM isn't good (it's good for battery-powered devices, not workstations).
I want symmetrical huge cores, not smaller cores to save power. It doesn't even solve the dark silicon problem.
Maybe it's for "ESG". Like TVs being stupidly slow because they can't use more than 40W in the Soviet, I mean European, Union. But I digress.
Sort of. Intel packaging is actually a bit more advanced than TSMC. Intel isn't fully taking advantage of it on the CPU side yet.
This video has answered all the questions I had about Zen 5's X3D packaging. Thank you.
Literally forging the chips together. That’s so neat 😊
Been waiting for this video!!! Watching it on my 9800X3D 😁
Can you ask your 9800X3D if it still has a support silicon on top? xD
@@HighYield LOL
Does the video play faster?
@@TheEVEInspiration340 fps😂
I really missed this YouTube channel. Excellent graphics as always! I love it!
A lot! And… ☺️😉
Awesome work as usual! Thank you for the great information, and the reading recommendations =)
I was expecting more layers of cache to be stacked. The flip to the bottom was a nice surprise.
I knew they would probably do that in the future, just not so soon.
Excellent video! From the AMD engineer's comment it does seem like they are still using two reconstituted wafers, but it would be interesting to do a cross-section SEM elemental analysis to try to see whether the oxide layer exists.
An additional thing: on the question of whether AMD will combine cache and I/O, it would make sense. It seems that they did not change the I/O chiplet architecture, and it is the Achilles heel for base Zen 5 (non-3D). The challenge would be that at higher core counts, 24 and up, you would need a couple of I/O dies just to accommodate the numerous CCD chiplets, even with a higher density of cores per CCD. For example, even if you combine the cache (the size of one chiplet) and the I/O die (the size of about one and a half CPU chiplets), it would only accommodate two, perhaps three CPU chiplets.
Maybe they will use this method for consumer CPUs but have a separate 'I/O die for the other I/O dies' in Threadripper and EPYC CPUs. That is cumbersome, but it might be worthwhile for the benefits.
Non-consumer parts have a much larger IOD offering far more memory channels and the IF links for far more chiplets.
It would be desirable to re-unify the L3 cache between dual CCDs, and hybrid bonding appears to make that possible. On moooaaaahhhhhrrrr cores, L3-less chiplets leave room for that, perhaps with a larger L2 cache reducing the frequency of trips off-die.
Zen 6 is AM5; the platform is fixed.
As you observed, V-Cache solved some performance constraints that the 9700X reportedly suffers from regarding memory latency and bandwidth, not fully feeding 2 threads per full-fat core.
Considering a hybrid of full-fat & dense cores: as it stands, little die-area reduction would be gained once L3 is off-die, so while 6+6 or 8+4 might seem attractive, a chiplet with heterogeneous cores creates binning problems for less gain than that seen in the analysis of the mobile CPU, which re-laid out merged function blocks, allowing them to use the empty space freed up by the dense cores.
Initial cache access can be slower by a factor of 2x (L1) to 4x (L2), as the cache is usually in a dump mode. Direct access out to DRAM is still a multi-step (25-30 cycle) affair whose many clock cycles put the whole concept of the multi-gigahertz processor to shame. Fast is on the silicon die itself, pretty fast is on a bonded chiplet... (and so on). (Some rough order-of-magnitude numbers are sketched after this thread.)
@@tsclly2377 Part of the reason for the slower access is the transition from L1, which can work with virtual addresses, to the levels that need translated physical addresses.
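For a sense of scale, here are round, order-of-magnitude figures rather than measurements of any specific Zen part:

```latex
\[
t_{\text{cycle}} = \frac{1}{5\,\mathrm{GHz}} = 0.2\,\mathrm{ns},
\qquad
\frac{t_{\text{DRAM}}}{t_{\text{cycle}}} \approx \frac{80\,\mathrm{ns}}{0.2\,\mathrm{ns}} = 400\ \text{cycles},
\qquad
t_{L3} \approx 50 \times 0.2\,\mathrm{ns} = 10\,\mathrm{ns}.
\]
```

That roughly order-of-magnitude gap between an L3 hit and a DRAM round trip is the basic reason a large stacked L3 helps so much in cache-sensitive workloads like games.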
Ohh laser and plasma dicing!. I knew there was something more sophisticated than sawing.
Loved the video. I was wondering if you were planning a deep dive into the M4 chips like you did for the M3 chips. That was very interesting and would be great to see that. Cheers!
I'm currently still waiting for die shots to appear. As soon as I have something to work with, I'll make a video!
Your videos are the best in this field.
Great video like always!
Called it! In a comment under an earlier video.
I don't understand how most of the microchip acronyms work, but it was interesting to learn more about how this super duper tiny thing is built.
The interesting thing to me is what do the regular Zen5 CCD to package connections look like? For X3D the cache chip has the solder bumps. Where do the solder bumps go on the CCD when it's the only chip?
You know what, that's a great question I completely missed! My guess would be that the upper metal layers have to be different. But that would be a big BEOL change.
You can do copper to copper bonding with an organic redistribution layer.
@kazedcat I didn't think fabrication with organic substrates had the precision needed. Also, it would only be the copper pads bonding, the oxide and organic layers wouldn't meld. The bond would be fragile.
@@davidgunther8428 Use a different bonding substance for silicon to plastic bond. Precision is not necessary for non-cache connection. Just design the IO and power via to have enough spacing and put the cache vias in a separate region.
How about they cluster the TSV's. The density can only be as high as the microbumps on the cache chip anyways. Wonder if a modified microbump process could connect with 4 copper connects.
Now they just need to add a second cache chiplet on top and make a sandwich.
Every time you post a video, I learn something new. Thank you so much for all the hard work you put into this.
Hello good friend @HighYield. It's always awesome to see you do analyses on these chips... and, as I said before, on such a niche subject for most people. Honest question: how are you able to fund all this? I know you have Patreon, but that would not be enough income for you; it would barely pay for the original research!
I reckon it is part patreon and part enthusiasm
There's a reason it sometimes takes weeks for me to release a new video: I work a normal job. YouTube is just my hobby. I'd love to focus more on it, but it's difficult to make the switch.
@@HighYield Well, for a hobby it's really exceptional quality !
Unsure if the added complexity of unifying the I/O die with the chiplet/cache is something we will see in the next consumer Zen, but it seems like a logical approach in the near future.
It used to be the case that creating logic structures in a process optimized for memory required a lot more area than in a process optimized for logic. My knowledge is from about 15 years ago, from a layout engineer who was working on the on-die calibration circuit of an airbag sensor.
16:30 How do they thin the bottom carrier wafer down without damaging the transistors? If the solution is to leave a bit of carrier wafer, then that carrier wafer would also need to have matching power delivery structures in place, right?
got all giddy and started kicking my feet up when i saw there's a new vid
Great video. I'll make sure to use hybrid bonding in the next cpu i design instead of the dumb way i do it now.
so how do you bond two delicate electrical components?
TSMC: we heat them until all of their conductors melt and fuse them together
I wonder if the reason we have seen unexpected variations of the 5800X3D (5700X3D, 5600X3D) is due to the final binning of the resultant hybrid bonded chips?
Perhaps they saw an increased defect rate and that was another reason they chose to go a completely different route for 9800X3D?
Really great content, keep it up!
Insanely informative video. Current Zen 5 X3D chips must be chip-on-wafer; the evidence for that is their success-to-failure rate.
Very educational, thumbs up!
You are so good at breaking down how packaging is done. I don't work in the silicon industry professionally, but I find it interesting; thanks for making it palatable for me.
I also think it would be awesome if they indeed combined the I/O and cache die, dealing with the latency. Noted that the current approach allows them to have two different CCDs, so it will be interesting to see how they deal with 16 cores. Maybe a 16-core chiplet for consumers 😊
For Zen 6, AMD would have a thick cache die on the bottom with TSVs and I/O on it, and thin 12-core CCDs on top. So the cache-and-I/O die would also be the structural support for the thin 12-core CCD. The 12-core CCD may also have more L2 cache and no L3, as it could all be moved to the cache chiplet. This way the cores get optimal cooling.
This guy really enjoys talking about amD's Packag(ing).
Honestly mad interesting. I'm curious what can be done to reduce thermal issues with more stacked layers.
One thing I haven't seen mentioned is that the cores are the main heat generators, so having the package go very hot core / warm cache / cool substrate would mean less thermal/mechanical stress than V-Cache (forced to core temperature) / core / substrate.
I think they are shooting for a 3-layer approach, which makes V-Cache "mandatory". Memory controllers and PCIe lanes scale the worst with node advances, so those go on the bottom layer. The middle layer will have the L3 cache and the various little "household knickknacks" like integrated graphics, voltage regulation, sensors, etc. (and in the case of laptop chips, basically the entire chipset/SoC functionality). The top layer will have the compute chiplets, which only have L1 and L2 cache. If Zen 6 does this, it will probably be N6 -> N4 -> N2 (or N3X).
Won't having the GPU sandwiched between them create thermal problems? I mean, the heat will need to go through the top die, and we all know a GPU is quite a power-hungry device. Intel's Foveros seems like a much better approach since each die can dissipate heat through more or less the same thermal resistance.
@@n.shiina8798 An IGP is generally a fairly tame beast. The one on Zen4/5 pulls 7W maximum. Obviously AMD will need to make other considerations if they want an APU like the 7840/8840 family of laptop chips with integrated graphics powerful enough to do light gaming.
love your videos very informative
Seems to me there's a lot they can theoretically do going forward, but it's hard to guess because of issues with costs and scalability. Basically, the question is what makes practical sense to do for consumer products, without increasing costs too much, that they can also apply to big-core server CPUs while still retaining the scaling and reusability of the chips for both purposes. It's gonna be a balancing act there unless they want to go the route of creating entirely separate dies for these things.
Interesting. I learned a lot of new stuff from your videos, as always. But they are going to use glass interposers in the future, right? What if it's not just for the substrate, and it could be used for the bonding layers as well?
Great explanation video. This is all well above my head, but I like watching videos like this to try and expand my understanding of technology.
If I had to guess about the future of Zen products, I'd say that Zen 6 will be more of the same 2nd-generation 3D packaging, since AMD kept 1st-generation 3D packaging for Zen 3 and Zen 4. If anything, I would think that this is setting AMD up for "Zen 7" (or whatever AMD calls it, or whatever underlying design it uses if it isn't Zen-based) on the next socket after AM5. AMD is going to have to change something, considering how much has to fit on a CPU package and how constrained AM5 feels size-wise. AMD will have to do something like make the socket larger, do more creative stacking to fit more specialized chiplets in, or both.
For me, I'm also excited about what this means for RDNA5/UDNA1 since AMD usually moves its technological successes from product stack to product stack.
Maaan, your videos really are something else!!
I think it may soon be time to see the Butter Donuts in action!
Imagine Zen 7 APU with a single unified CPU chiplet, GPU chiplet, and shared memory chiplet. It would be next level.
I do. My PC will be 13995X3D with 6090ti.
At last, a good explanation of what Lisa Su announced way back when she talked about bumpless & micro-bump interconnects, when nobody had a clue how troublesome X3D would prove to rivals.
How good were sales of Milan-X/Genoa-X? I am wondering about Turin-X; the new IOD there offers much faster memory compatibility, but Zen has always had the issue that two cores could max out a CCD's bandwidth, so for some applications V-Cache was a killer feature, allowing massive scaling within an 8c/16t CCD. Maybe the newer massive-bandwidth AI-aimed accelerators removed the memory limitations that used to keep calculations exceeding VRAM off GPUs, and so Turin-X is not a priority market compared to gaming. OTOH the Turin-X delay could simply be a matter of phasing; after all they have Turin dense, and Turin-X customers are likely to be Genoa-X ones, so a delay may mean CPU upgrades down the line.
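A rough sketch of that bandwidth point, with an assumed link width and fabric clock (not confirmed AMD figures), comparing one CCD's read path against dual-channel DDR5:

```python
# Back-of-envelope: one CCD's read path to the IO die vs. DRAM bandwidth.
# The 32-byte link width and 2.0 GHz fabric clock are assumptions for
# illustration, not confirmed AMD figures.
GB = 1e9

def link_bw_gb_s(width_bytes: int, clock_hz: float) -> float:
    return width_bytes * clock_hz / GB

ccd_read_bw = link_bw_gb_s(32, 2.0e9)   # assumed per-CCD read link
dram_bw = 2 * 8 * 6000e6 / GB           # dual-channel DDR5-6000, 8 bytes/channel

print(f"one CCD read link:  ~{ccd_read_bw:.0f} GB/s")
print(f"DDR5-6000 dual ch.: ~{dram_bw:.0f} GB/s")
# The per-CCD link lands in the same ballpark as total DRAM bandwidth, so a
# couple of bandwidth-hungry cores can already saturate it - which is why a
# big local L3 (V-Cache) helps certain workloads scale within one CCD.
```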
Using 4nm, one would expect the small chiplets and V-Cache dies soon won't need pairing of known-good matches. Given that 32MB of L3 is standard across all Zen CCDs, it must have built-in redundancy (I never saw cheaper 5-core, 30MB models knocking around). Perhaps some screening of the wafer using visual recognition could estimate likely wastage, so both approaches could be used together. But it could simply be an artefact of fab procedures: known good dies were always the input to hybrid bonding, not wafers, and scaling up to a different method is on some long optimisation TODO list.
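For a feel of why known-good-die pairing matters, here is a minimal sketch using a simple Poisson yield model; the defect density, die areas, and bond yield are all assumed values, not AMD/TSMC figures:

```python
# Minimal Poisson yield sketch: wafer-on-wafer vs. known-good-die bonding.
# Defect density, die areas and bond yield are assumed values, not real figures.
import math

DEFECT_DENSITY = 0.07   # defects per cm^2 (assumed)
CCD_AREA = 0.71         # cm^2, roughly a Zen CCD (assumed)
CACHE_AREA = 0.36       # cm^2, roughly a V-Cache die (assumed)
BOND_YIELD = 0.99       # per-stack bonding success rate (assumed)

def die_yield(area_cm2: float, d0: float = DEFECT_DENSITY) -> float:
    # Simple Poisson model: Y = exp(-D0 * A)
    return math.exp(-d0 * area_cm2)

# Wafer-on-wafer: dies are bonded blindly, so a stack only works if BOTH
# dies happen to be defect-free.
w2w = die_yield(CCD_AREA) * die_yield(CACHE_AREA) * BOND_YIELD

# Chip-on-wafer with known-good dies: defective dies are screened out before
# bonding, so the stack is limited mainly by the bond step itself.
kgd = BOND_YIELD

print(f"wafer-on-wafer stack yield: {w2w:.1%}")
print(f"known-good-die stack yield: {kgd:.1%}")
```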
"Base tile" IO with cache + two CCD's + backside power would be nuts
Your theory about future I/O die integration makes sense for consumer/mobile CPUs. And it might explain the stagnation of AM5 chipsets. But PCIe can be power-hungry, so I'm not sure an external chipset can be avoided.
I kind of expected AMD to solve the thermal issues and allow for higher clock speeds, because that was the biggest thing holding X3D back. I did not expect them to put the cache on the bottom though, but it paid off well: the 9800X3D is the current gaming king, and as seen on the GamersNexus stream it is an overclocking beast as well.
Damn, I can't stop watching these super interesting engineering videos. What's the anticipated way of becoming a silicon engineer?
4:47 Cleaned, processed in a vacuum, pressed against each other...
Is it... cold welding?
It's apparent that an optional cache on top is easier logistically. But what replaces the bottom cache die when there's none? The CPU still must meet structural and height requirements, and power still must flow to the CCD. Is there always a dummy layer, or are they able to move the CCD layer to the bottom for non-X3D parts?
Right now, AMD doesn't have a wafer bottleneck, but a packaging bottleneck. The only way I see things going the way you suggest, with I/O and cache put on a massive base die (which would need to have room for a large number of CCD chiplets on top) is if TSMC drastically increases their packaging capacity. That's not impossible, but the limited supply of the rather early X3D release shows it's far from a current reality.
The cache on bottom did surprise me, but mostly because I didn't know it was essentially the same process. The only real issues were engineering at the front end. I thought being on the bottom would require more advanced packaging, which competes (in terms of allotted time) with the MI300 family of products that AMD is pushing hard to cash in on with the neural machine learning bubble.
My reaction to hearing the cache was on the bottom was one of disbelief. Nobody talked about it and made it seem like it was no big deal from a complexity and manufacturing standpoint.
Having TSVs for connecting the CCD to the substrate for power and data that go through the cache die, along with the TSVs to connect the cache to the CCD, is drastically more complex than the way 3D cache used to work. It makes sense why the cache die is the same size as the CCD: it's to fit all those TSVs.
The word is that AMD is moving to silicon interposers for Zen 6 to connect the IO die to the CCD and V-Cache. There is talk of an increased level of modularity coming with this change, which will lead to cost savings as there will be fewer bespoke designs for CCDs. Rather, they will use a common interposer to connect CCD to IOD, and the IOD will become bespoke based on the product, i.e. a special IOD for various client and commercial products.
This is an interesting move alongside AMD's push towards a unified graphics architecture, UDNA. Perhaps we will see the next generation of an MI300-type product that is much less costly to manufacture thanks to the modularity of forthcoming products. Imagine a server product with a common silicon interposer connecting all the CCDs w/3D V-Cache and UDNA dies to the IO. Pretty cool.
I thought this was going to be boring but I was wrong.
At some point during the editing, when I have seen the video too many times, I always start to think it will suck. And I'm happy when it doesn't :)
Anecdotally, undervolting the 1st-gen 3D V-Cache CPUs had the greatest effect on performance, which makes sense given the many thermal barriers in the chip hierarchy.
Yes, bonding the CCXs to a die/interposer/cache/I/O/GPU layer that can be on a less expensive node, where the components on that "interposer" layer/die wouldn't benefit from the same scaling as the tiny logic CCXs on top, while the CCXs can still shed their heat directly since the interposer layer produces little of it... this has to be the thought here.
3D Stack IO would solve the memory latency problem.
But then we do already have monolithic Zen5 chips (every laptop CPU) and they don't seem to perform massively better than chiplet Zen5.
It's not about the performance per se, but about being able to use older tech for the IO and being able to bin multiple processing chiplets for your SKUs, making the whole package significantly cheaper and reducing the massive die-to-die latency we experience on Ryzen desktop today.
Really good info!
Freaking runs cool, especially with the AM5 offset brackets. Got the Ryzen 7 9800X3D with an Arctic Liquid Freezer III 360 Black using the AM5 offset brackets; it idles at 28C and never goes over 63C while gaming. Also got CL28 6000MHz 2x16GB overclocked to CL28 6400MHz stable with direct memory cooling on an MSI MEG X870E GODLIKE. It owns, paired with the Asus RTX 4090.
Great work! I am interested to know how the next advances work: CoWoS-L, CoWoS-S.
If the difference in cost between wafer-on-wafer and chip-on-wafer bonding is big enough, it could be justifiable to leave space around the smaller chips on the wafer, especially if the needed final ratio of chips is 1:1.
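Rough numbers on that trade-off; the die sizes below are assumptions, just to show the scale of the waste from spacing the smaller die out to a 1:1 pitch match:

```python
# Rough dies-per-wafer estimate to see what you'd give up by spacing the
# smaller cache dies out to a 1:1 pitch match with the CCDs for
# wafer-on-wafer bonding. Die sizes are assumptions for illustration.
import math

WAFER_DIAMETER_MM = 300
WAFER_AREA = math.pi * (WAFER_DIAMETER_MM / 2) ** 2

def dies_per_wafer(die_area_mm2: float) -> int:
    # Classic approximation: gross dies minus an edge-loss term.
    return int(WAFER_AREA / die_area_mm2
               - math.pi * WAFER_DIAMETER_MM / math.sqrt(2 * die_area_mm2))

CCD_AREA = 71    # mm^2 (assumed)
CACHE_AREA = 36  # mm^2 (assumed)

print("CCDs per wafer:              ", dies_per_wafer(CCD_AREA))
print("cache dies at natural pitch: ", dies_per_wafer(CACHE_AREA))
print("cache dies at CCD pitch:     ", dies_per_wafer(CCD_AREA))
# At its natural pitch one cache wafer covers roughly two CCD wafers; padding
# the cache die out to the CCD pitch halves that, which is the cost you'd
# weigh against the cheaper wafer-on-wafer bonding flow.
```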
This is genius. Trust AMD to come up with every good idea, and other companies will copy them.
gr8 explanation. thank u
It seems that this technology could also be used when bonding with a glass substrate.
Well, that could be something they will use in next-gen Ryzen to improve the CCD-to-IO connection. Maybe it will show up on consumer GPUs someday as well.
I actually expected AMD to move the 3D V-Cache below the CCD when the 5800X3D came out, because it's the only move that made sense to me. Remember, this was a pet/side project that got lucky, so the layout of the CCD did not have any 3D V-Cache in mind at the time.
Any news on backside Power delivery?
Coming with Intel 18A and probably TSMC A16 iirc.
@@HighYield Great, thanks a lot! So that should be 2027 for consumers.
@@einekleineente1 2025
Excellent explanation, thank you!
It will be interesting to see their next move; the IO Die is clearly holding them back in Zen 4 and Zen 5, so Zen 5+ or Zen 6 should have major changes in that regard, no matter if it becomes part of the stack or is just renewed.
When are Zen 6-7 coming?
@@notaras1985 2 years for each generation is a reasonable assumption so Zen 6 would be 2026, Zen 7 2028.
No, the IOD is NOT holding their CPUs down. The type of interconnect is holding them down. AMD is SUPPOSED to move to direct connects between chiplets, so you end up with what Intel calls tiles, where each chiplet can be pushed up next to another.
If you can move to direct connects, you don't need a parallel-to-serial conversion just to send data from the CCD to the IOD or the other way around (although most data on the IOD is probably in serial form). But as a for instance, say the CCD has to write data to memory. The cores first have to convert the data to serial, then transfer it over the CPU PCB, then into the IOD (at BEST a serial transfer clock rate of 3GHz for Zen 4 and 3.2GHz for Zen 5). There's already a lot of latency just from that. With direct connects there's no need for a data conversion; data moves in parallel between chiplets, or that's the way it should work. The transfer clock speed will probably bump up to 4+GHz, so about 25% faster, ADDED to the latency removed by not having to do data conversions. But back to that poor data sent to the IOD to be stored to memory: it has to go into the Infinity Fabric multiplexer to be directed to the correct place, and that multiplexer takes up a lot of space on the IOD. AMD should be able to get rid of it by moving to direct connects. Only after the data moves through the multiplexer can it THEN be sent to the memory controller.
AMD is using a cheap way to connect dies right now, which is understandable because they have to compete with Intel, and when Zen 2 came out, which is when AMD moved to MCM, Intel ruled the world of x86-64 and AMD had to price products under Intel. They've kept that same interconnect using the IOD and Infinity Fabric through Zen 5 because it costs less. But that's changing, and AMD should be able to move to direct connects between the CCDs and IOD, so the whole notion that the IOD holds anything back is just not true. It's the connection speed of the Infinity Fabric, along with the data conversions, that holds back the CPUs, though it didn't matter so much for Zen 3. It started to pretty clearly for Zen 4, when AMD was able to push core clocks a bit higher.
In fact, moving ANYTHING off the IOD and onto a die that sits below the cores means you now have to make the entire product line more expensive, as they would ALL now have to have stacked dies, and AMD doesn't do well when their products are more expensive than Intel's unless they are CLEARLY better. OEMs won't use those parts to build PCs and laptops because they say they can't price AMD-based products higher than they can Intel ones.
So, simply changing out the type of interconnect for something that's slightly more expensive is by far the better option for AMD. They don't exist in a bubble, and most of the market still sees AMD as a budget option, probably even you.
@@johndoh5182 So what should we expect from the Zen 6 and 7 architectures?
@@johndoh5182 So what you're saying is that the IO Die is holding their CPUs back in Zen 4 and 5, got it.
(The interconnection is part of that setup, changing the interconnection is changing the IO Die, so would removing the Infinity Fabric multiplexer)
Still, appreciate the explanation, thank you.
Not sure, seems to me that having the IO Die in the stack with TSVs is cheaper than having a silicon interposer on which you place the IO Die and the chiplets, but it's possible the interposer would be cheaper than the stacking process.
Someone else mentioned the Infinity Fabric on High Performance Fanout that they used in RDNA3 as another option. That probably sits somewhere in-between the cost and benefits of stacking/interposer and the current setup.
And no, AMD has been the premium gaming option since at least Zen 3 X3D, and depending on workload the premium multi-threaded option too. You know what they say about assumptions buddy.
Now AMD can just make cores without L3 cache and add it at the bottom, so chip yield will be higher and we can get bigger L3.
When will we be getting compute cubes? Cubes of pure processor.
How do they manage to heat two chiplets to 150 or 300 degrees without burning the transistors?
Very informative video, thanks a lot sir
Surely a dumb question, but is it possible to mix two different process nodes on the same wafer? I mean, making the first layers (transistors and cache) using the most advanced node (N3, N4) and then using a less advanced one like N7 for the V-Cache and the rest of the layers?
Maybe this is a stupid question, but could you stack 3 active dies using hybrid bonding? So you'd have a core die, a cache die, and an IO die at the bottom?
And that cache die is active (of course); do they only use TSVs for the other parts, and are those purely TSVs or do they add some inactive (connection) layer? I doubt it, because it would require each core die to have a cache die to connect it back down to the PCB, but in theory they could do it, and then it would be a real 3D construction.
I wasn't surprised by the move because I've heard about AMD engineers complaining about the heat for some years now.
If AMD combines IO and 3D V-Cache, wouldn't that lead to having 3D V-Cache on CPUs lower in the stack, like a hypothetical Ryzen 5 10600X3D?
Great video!
Chip glue go brrrrrrrrtt
Can we expect a chip analysis of the Apple m4 SoC family?
The only reason I see for AMD to put IO under the chiplets and extra cache is to leave space for a GPU, an NPU, and possibly ARM cores in the near future.
I'm so excited for the future from AMD.
So good video 🎉
You know, the thing that amuses me, is that before the 9800X3D launched, _everyone_ scoffed at the rumors that AMD was going to put the cache under the CCD, listing drawbacks that are "much larger" than the benefits.
Then the 9800X3D appeared, shocking all the pundits, proving that the tradeoffs are not that bad as the 9800X3D totally annihilated all other "gaming CPUs" with zero exceptions.
Placing IO below the compute dies would be wonderful for latency and reduction in power usage per transferred bit to/from each compute die, however just as IO does not scale with new process nodes, IO power requirements don't scale down either - going off-chip can be very power hungry, which contributes to thermal concerns.
Forrest Norrod discussed this in an interview in the past, where he mentioned that the power requirements for IO in Epyc were a real problem. How much of that is just "because it's IO" vs IO requirements for something in the data centre space is an open question of course.
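Some very rough energy-per-bit arithmetic to illustrate that point; the pJ/bit values below are generic assumptions from packaging literature, not AMD figures:

```python
# Very rough energy-per-bit arithmetic for moving data between dies.
# The pJ/bit values are generic assumptions from packaging literature,
# not AMD figures; they only show why the style of die-to-die link matters.
def link_power_watts(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    bits_per_second = bandwidth_gb_s * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12

BW = 100  # GB/s of sustained CCD<->IOD traffic (assumed)

for name, pj in [("organic substrate SerDes", 2.0),
                 ("interposer / fanout link", 0.5),
                 ("hybrid-bonded 3D stack", 0.05)]:
    print(f"{name:26s} ~{link_power_watts(BW, pj):.2f} W at {BW} GB/s")
```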
I knew AMD had to change something about how the CPU wafer and V-Cache wafer were going to interact with each other. But as to what that would be or how it would be done, I had no idea. It's surprising that AMD was able to come up with a better option in such a short amount of time. Maybe AMD had both options for how the CPU and V-Cache could be connected, and knew how each would affect performance, used Option B the first time around, and then found out that the unused option was the better one.
Putting the cache between the cores and RAM seems logical. As far as the data flow goes, the cache needs to connect to the RAM, and the cache to the cores. Access to RAM bypassing the cache isn't common, so this doesn't seem like as much of a routing mess as it could be. That said, if that is how it works, making the same CCDs work without the caches on the non-X3D chips is really impressive.
very cool! AMD with the moves
Great video.
So in short, this means that AMD could drop the IF and allow direct chip-to-chip communication in the future. I didn't watch the full video, but this would mean we could see a single-chip-like package vs the 2-3 chips we see today, along with letting them make some of the chips even smaller going forward. Instead of a large L3 and L2 (on the base die), we could just move those memory layers into their own layer, leaving a lot more space for pure compute. Or, more likely, chiplet -> memory -> chiplet -> memory, etc. This way all the compute and memory would be accessible and addressable across all compute and IO. That way one core could complete a task while another core accesses that same memory (without having to reschedule the working thread) even though it's on a totally different CCD. To be fair, this wouldn't really improve performance per se; power maybe (as it could lower idle draw), but you would still be limited in compute. More likely this would allow a smaller-footprint package plus lower cost for the compute chiplets, while everything else costs the same or more, as the packaging method would add some cost. So SKUs that have two chiplets wouldn't have to fight over the "gaming cores" or over which one has 3D cache, as both would have access to that extra memory. If it's all SRAM, I wonder if that layer would just be marked out for the different memory levels (L1-L3) as well.
Can you do Apple Silicon vs AMD Chip Packaging comparison video please ?
As soon as we have M4-series die shots I'll do an Apple video, and I'm sure that Apple will use SoIC at some point in the future.
New materials science could be the next step. Graphene/diamond semiconductors could allow for huge clock increases, structural strength, and great thermal dissipation characteristics... it probably already exists in "the military black budgets"?
Could 3D V-Cache be used on their GPUs too? I know GPU cores are significantly more limited in computations, but I'd be curious if "big cache" could help.
tbf I'd also like to see them dump the AI and ray tracing core nonsense for just straight big pure render/shader cores... Idgaf about ray tracing or AI since 99% of the time I'm not utilizing it...
I thought AMD would eventually add more layers of 3D V-Cache. But placing the cache and IO on the base doesn't point in that direction.
@@Kemano24 there might be room for IO and cache, but not room for extras like graphics. Maybe the memory controller would go on the cache chiplet, and there's an IO die for USB, PCIe and everything else.
I think then you would have NUMA issues with more than 1 CCD.
Cache and IO sound perfect for a trailing process node, I just don't know where to divide the pieces.
Oh, they'll add more layers. Just not in the direction we expected. ;P
22:52 No Turin X? That doesn't seem very likely. All the available information shows that V-cache was a server-first initiative that only made its way onto desktop because somebody suggested it would be good for gaming. That AMD would do all that extra engineering to get the cache die on bottom only to restrict it to the consumer desktop space seems almost absurd.
It's odd, I know. But AMD said there are no plans for Turin-X. Doesn't mean it will never happen, but so far that's all we know.