I actually expected AMD to move the 3D V-Cache below the CCD when the 5800X3D came out, because it's the only move that made sense to me. Remember, this was a pet/side project that got lucky, so the layout of the CCD didn't have any 3D V-Cache in mind at the time.
3D-stacked IO would solve the memory latency problem. But then, we already have monolithic Zen 5 chips (every laptop CPU) and they don't seem to perform massively better than chiplet Zen 5.
Your theory about future I/O die integration makes sense for consumer/mobile CPUs, and it might explain the stagnation on AM5 chipsets. But PCIe can be power-hungry, so I'm not sure an external chipset can be avoided.
You know, the thing that amuses me is that before the 9800X3D launched, _everyone_ scoffed at the rumors that AMD was going to put the cache under the CCD, listing drawbacks that were "much larger" than the benefits. Then the 9800X3D appeared, shocking all the pundits and proving that the tradeoffs are not that bad, as the 9800X3D totally annihilated all other "gaming CPUs" with zero exceptions.
Probably yield and node: if either the cache or the CCD is non-functional, you have to throw the entire thing away, and the process node required for the CCD is likely much more expensive than the one for the cache.
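To make that yield argument concrete, here is a minimal back-of-the-envelope sketch in Python. All yield numbers are made up for illustration; the point is only that bonding untested wafers multiplies the two die yields, while bonding known-good dies mostly risks just the bonding step.

```python
# Hypothetical yields, purely illustrative (not AMD data).
y_ccd = 0.85    # assumed yield of the expensive compute die
y_cache = 0.95  # assumed yield of the cheaper cache die
y_bond = 0.99   # assumed yield of the bonding step itself

# Wafer-to-wafer bonding of untested dies: both dies must be good.
w2w_yield = y_ccd * y_cache * y_bond

# Chip-on-wafer with known-good dies: only the bond can fail.
kgd_yield = y_bond

print(f"untested wafer-to-wafer yield: {w2w_yield:.3f}")
print(f"known-good-die bonding yield:  {kgd_yield:.3f}")
```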
This makes me wonder if they'll put the memory controller on a single chiplet with the L3 cache, under two CCDs without an L3, and then have the PCIe and other IO on a separate chiplet. I would think this would improve performance significantly, because a unified L3 reduces L2 snoops between cores for cache coherency.
Surely a dumb question, but is it possible to mix two different process nodes on the same wafer? I mean, making the first layers (transistors and cache) using the most advanced node (N3, N4) and then using a less advanced one like N7 for the V-Cache and the rest of the layers?
How does alignment work? It feels like an impossible fit, when using a reconstituted wafer, to control individual chip tolerances and then the whole wafer.
Yes, bonding the CCXs to the die/interposer/cache/I/O/GPU, which can be on a less expensive node, where the components on the "interposer" layer/die wouldn't benefit from the same scaling as the tiny logic CCXs on top, and which can disperse the heat that the interposer layer itself doesn't produce... this has to be the thought here.
If the I/O also gets integrated into the chiplet, I think we might expect an increase in core count or much smaller processor sizes. I'm not knowledgeable enough, but I'm imagining doubling the possible number of CPU dies on the same surface, or using the space for something like an L4 cache die, a bigger NPU die (which will surely keep growing in the future), or even a separate GPU die for the APU. They will for sure find a way to use the freed-up space.
22:52 No Turin X? That doesn't seem very likely. All the available information shows that V-cache was a server-first initiative that only made its way onto desktop because somebody suggested it would be good for gaming. That AMD would do all that extra engineering to get the cache die on bottom only to restrict it to the consumer desktop space seems almost absurd.
Can't they bond a glass layer that is made as a prism to separate spectrum channels and use mirror tunnels for photonic transfer? PCIe 7.0 photonic transfer.
Putting the cache between the cores and RAM seems logical. As far as the data flow goes, the cache needs to connect to the RAM, and the cache to the cores. Access to RAM that bypasses the cache isn't common, so this doesn't seem like as much of a routing mess as it could be. That said, if that is how it works, making the same CCDs work without the caches on the non-X3D chips is really impressive.
Would integrating the IOD with the cache be an issue with multi-CCD CPUs? The current bodge, where one CCD has cache and the other doesn't and the solution to performance problems is to effectively disable one or the other of the CCDs depending on the workload, is truly epic. I'd assumed AMD would eventually try to have cache on both CCDs, but if the IOD is integrated that would require disabling it on one of the CCDs. How much faster would the IOD memory controller run if it was integrated with the cache? The current 6000 MT/s limit (without scaling) could be a problem over the next few years, given the availability of faster memory and Intel running faster.
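For context on that 6000 MT/s figure, here is a quick sketch of the theoretical peak bandwidth of a dual-channel DDR5 setup; the transfer rates are just example speed grades, not a claim about what any future IOD would support.

```python
# Theoretical peak bandwidth of dual-channel DDR5 at various transfer rates.
def ddr5_peak_gbps(mega_transfers: int, channels: int = 2, bus_bits: int = 64) -> float:
    """Peak bandwidth in GB/s: transfers/s * bytes per transfer * channels."""
    return mega_transfers * 1e6 * (bus_bits / 8) * channels / 1e9

for mt in (6000, 6400, 8000):  # example DDR5 speed grades
    print(f"DDR5-{mt}, dual channel: {ddr5_peak_gbps(mt):.0f} GB/s peak")
# DDR5-6000 dual channel -> ~96 GB/s theoretical peak
```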
Fill the void with liquid diamond for heat transfer and treat it for spreading the photonic spectrum, offering a transmission channel for each photonic wavelength.
Placing IO below the compute dies would be wonderful for latency and for reducing power usage per transferred bit to/from each compute die; however, just as IO does not scale with new process nodes, IO power requirements don't scale down either. Going off-chip can be very power hungry, which contributes to thermal concerns. Forrest Norrod discussed this in an interview in the past, where he mentioned that the power requirements for IO in Epyc were a real problem. How much of that is just "because it's IO" vs IO requirements for something in the data centre space is an open question of course.
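As a rough feel for why off-chip IO power matters, here is a sketch using ballpark energy-per-bit figures of the kind often quoted in the literature; the numbers are illustrative assumptions, not measurements of any AMD product.

```python
# Rough IO power estimate: power = bandwidth * energy per bit.
# Energy-per-bit values below are ballpark assumptions, not AMD data.
ENERGY_PJ_PER_BIT = {
    "on-die wires": 0.5,         # assumed
    "2.5D/3D stacked link": 1.0,  # assumed
    "off-package DDR PHY": 15.0,  # assumed
}

def io_power_watts(bandwidth_gbytes: float, pj_per_bit: float) -> float:
    """IO power in watts for a given sustained bandwidth."""
    bits_per_second = bandwidth_gbytes * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12

for link, pj in ENERGY_PJ_PER_BIT.items():
    print(f"{link:22s} at 100 GB/s: {io_power_watts(100, pj):5.1f} W")
```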
If 3D V-Cache is now the same size as the die, can AMD fit more cache in it? It is like twice the size it was before, or am I wrong? Second question: for the 99xxX3D there will be two dies and two V-Caches. Can they act as one large unified cache, so that there would be 200mb of cache? AMD said in the past that it is not possible, because the cores are too far from one another. But that doesn't seem true: VRAM in video cards is much further away from the cores than any V-Cache is from any die in a CPU. So technically could AMD make a 400mb cache without much extra work?
So first question: yes, in theory AMD could pack more cache into the new X3D chiplet. But as of now they are not doing that. Second question: there are rumors that the 9950X3D will still only have one CPU chiplet with extra cache. And even if both would have cache below, right now you couldn't connect them, because they are not designed that way. But in theory it could be possible.
VRAM is DRAM. There is a giant gap in performance, bandwidth and reaction speed between them. A few numbers from a 5700X (read bandwidth / latency): L1: 2206 GB/s, 0.9 ns; L2: 1112 GB/s, 2.6 ns; L3: 605 GB/s, 11.9 ns; RAM (DDR4-3600 dual channel): 52 GB/s, 62.2 ns. VRAM is usually built with a wider interface, but latency is a lot longer, usually 200 ns and more. GPUs are optimized to work with large amounts of streamed data and perform a lot of calculations on them, but their control flow is quite slow.
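To put those latencies into core clock cycles, here is a small sketch using the figures quoted above; the 4.6 GHz clock is just an assumed boost frequency for the sake of the arithmetic.

```python
# Convert the latencies quoted above into core clock cycles.
CLOCK_GHZ = 4.6  # assumed boost clock, for illustration only

latencies_ns = {       # figures from the comment above
    "L1": 0.9,
    "L2": 2.6,
    "L3": 11.9,
    "DDR4-3600": 62.2,
    "typical VRAM": 200.0,
}

for level, ns in latencies_ns.items():
    cycles = ns * CLOCK_GHZ  # ns * cycles-per-ns
    print(f"{level:12s}: {ns:6.1f} ns ≈ {cycles:5.0f} cycles at {CLOCK_GHZ} GHz")
```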
So in short, this means AMD could drop the IF and allow direct chip-to-chip communication in the future. I didn't watch the full video, but this would mean we could see a single-chip-like package vs the 2-3 chips we see today, along with allowing them to make some of the chips even smaller going forward. Instead of large L3 and L2 (from the base design), we could just move those memory layers into their own layer, leaving a ton more space for just the compute. Or, more likely, Chiplet -> Memory -> Chiplet -> Memory, etc. This way all the compute and memory would be accessible and addressable across all compute and IO. That way one core could complete a task, while another core could then access that same memory (without having to reschedule the working thread) even though it's on a totally different CCD. To be fair though, this wouldn't really improve performance per se, maybe power for sure (as it could lower idle power draw), but you would still be limited in compute. More than likely this would allow a smaller-footprint package, followed by lower cost on the compute chiplets. Everything else would cost the same or more, as the packaging method would have some increased cost. So SKUs that have two chiplets wouldn't have to fight over the "gaming cores" or over which one has 3D cache, as both would have access to that extra memory. If it's all SRAM, I wonder if that layer would just be marked out for the different memory levels (L1-L3) as well.
I knew AMD had to change something about how the CPU wafer and V-Cache wafer interact with each other, but as to what that would be or how it would be done, I had no idea. It's surprising that AMD was able to come up with a better option in such a short amount of time. Maybe AMD had both options for how the CPU and V-Cache could be connected and how each would affect performance, used option B the first time around, and then found out that the unused option was the better one.
If the difference in cost between wafer-to-wafer and chip-on-wafer is big enough, it could be justifiable to leave empty space around the smaller chips on the wafer, especially if the needed final ratio of chips is 1:1.
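Here is a toy cost comparison of that trade-off; every number (wafer cost, die areas, per-die bonding cost) is a made-up placeholder. The sketch only shows how the wasted-area penalty of wafer-to-wafer bonding trades against a per-die placement cost for chip-on-wafer.

```python
# Toy model: wafer-to-wafer bonding wastes cache-wafer area if the cache die
# is padded out to the CCD footprint; chip-on-wafer avoids the waste but adds
# a per-die placement/bonding cost. All numbers are hypothetical placeholders.
wafer_cost = 8000.0      # $ per cache wafer (assumed)
wafer_area = 70000.0     # usable mm^2 per 300 mm wafer (approx.)
ccd_area = 71.0          # mm^2 footprint the cache die must match (assumed)
cache_area = 41.0        # mm^2 of actual cache silicon needed (assumed)
per_die_bond = 0.40      # $ extra handling/bonding cost per placed die (assumed)

# Wafer-to-wafer: each cache die occupies a full CCD-sized footprint.
w2w_dies = wafer_area // ccd_area
w2w_cost_per_die = wafer_cost / w2w_dies

# Chip-on-wafer: dice at the smaller cache size, pay a placement cost per die.
cow_dies = wafer_area // cache_area
cow_cost_per_die = wafer_cost / cow_dies + per_die_bond

print(f"wafer-to-wafer: {w2w_cost_per_die:.2f} $/cache die")
print(f"chip-on-wafer:  {cow_cost_per_die:.2f} $/cache die")
```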
I wonder if the Radeon Technologies Group can adapt 3D V-Cache on the bottom of its GPUs for its Infinity Cache, and whether this could boost the Infinity Cache's throughput and capacity. GPUs are more often starved by low memory throughput than by high latencies, unlike CPUs, which are starved by high latencies more often than by low throughput.
Having the compute and IO together in a single chip again and doing away with the Infinity Fabric in between would be a huge deal for RAM latency and idle power consumption; however, I don't think consumers would care much. We've seen that high RAM latency is effectively mitigated by a huge L3 cache and the current out-of-order compute pipelines, and idle power consumption is something people only really care about on mobile devices, which is where AMD has been using single dies anyway. So while I would love to see it, as it could still push gaming performance a bit higher and/or reduce the need for a large L3 cache, it could very well be that AMD decides the extra cost isn't worth it.
I expect this to happen in the next two gens. AMD has enough server market share to split the desktop/laptop chips off on their own. Servers can stay with a separate IO die and chiplets, and desktop/laptop can move to more monolithic designs or, as this video is about, to hybrid bonding. I expect Intel to do the same, even though they just split the IO die out with Lunar Lake.
You are thinking too narrowly: AMD is space-constrained with regards to the number of CCDs, and stacking them on top of the IO would enable 32 cores or more on Ryzen, which consumers would 100% care about.
Apple is kind of doing what you are saying already. I have my M3 Max MBP with 128GB unified memory and everything other than gaming it does amazingly well. There are some games I can play but I got this as my desktop replacement. I won't use Windows ever now. I got sidetracked a bit, but Apple introduced high throughput RAM soldered very near the processor (I think it's part of the processor) and the I/O is directly inside the chip too. Everything is ultra smooth and fast. In case you are wondering why I got this, I am a software developer, and I do tinker with local models (which run on GPU with 128GB VRAM, unified memory). And I plan to use this for like 6-7 years lol.
The main mitigations for RAM latency are caches, cachelines, pre-fetching, reordering & SMT, and of course consumers won't care; they'll just benefit from better power efficiency as less energy is used moving bits around on average. In an APU, having general compute, GPU & NPU on a more power-efficient newer process bonded to a caches-and-IO layer improves yields and lowers overall costs through die area reduction. AMD are absolutely going for this in Zen 6; they have had years since the 2021 announcement that the packaging problems were solved and Milan-X and the 5800X3D became feasible. We had Zen 4 with improved V-Cache and Zen 5 re-designed to suit V-Cache, so there's every reason to suppose Zen 6 is going to tackle the perceived weakness of Infinity Fabric linked chiplets. Remember AMD had no way of knowing that Meteor & Arrow Lake would be flops, and Apple & Qualcomm make CPUs & GPUs too.
Not to mention: significantly improved inter-CCD latency (a huge bottleneck for gaming), potential for the base die to act as an interposer for direct CCD-to-CCD connection that allows CCDs to act as one (similar to Intel EMIB), expansion of V-Cache size per CCD, and the possibility of a large L4 V-Cache shared between CCDs. If you think this change only improves memory latency and idle power consumption, you are only thinking about using the new technology to do exactly what we are currently doing, utilizing none of its fundamental advantages...
AM3 = PGA 940
AM4 = PGA 1331
AM5 = LGA 1718
AM6 = LGA 2300??
Do more pins make the CPU die bigger? Will an AM6 CPU be physically bigger than any AMD CPU that has come before it?
With hybrid bonding, there may not be a need to put L3 cache inside the compute die: let that be optimized for the ultra-latency-sensitive u-op, L1 and fast L2 caches. Why waste the expensive leading-edge die space when the L3 can be placed entirely below, using an older process that is nearly as dense (i.e. cheaper)? For Zen 5, this would roughly increase core count by 50% for the compute die. In terms of chip stacking with hybrid bonding, how many wafer-to-wafer layers are needed to reach a good structural support metric? Stack enough of them through this process to get the structural support necessary and a massive amount of L3 cache. Since this massive stack is purely for cache, wafer-to-wafer hybrid bonding can be leveraged. There are manufacturing-time optimizations that can be done here too, with say a 16-layer stack: start with eight double stacks, then four quad stacks, then two octo stacks, until the final bond step resulting in 16 layers is produced. If they can check the quality of the bonding at each step, many more double layers would be discarded for bad yields vs. quad and octo layers. Essentially, as the stack gets higher, only known-good substacks are used going up, improving overall yields. Only when adding the CPU dies on top would a carrier wafer be needed. Another variable for manufacturing is the simultaneous arrival of backside power delivery. This would manifest as a sandwich of repeating data, silicon and power layers. This does bring up the idea of leveraging shared power layers (I would presume that the data layers are too fine to even think of this). There is also the question of whether hybrid bonding can be done with multiple dies on top of a larger base die. In this case, the larger base die would simply be the IO die holding multiple stacks of cache with compute dies on top.
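The "only known-good substacks move up" argument can be made concrete with a toy yield model. The per-bond yield below is an arbitrary assumption; the sketch just compares blindly bonding 15 interfaces in a row against pairwise bonding with a test after each doubling step.

```python
# Toy yield model for a 16-layer stack, assuming each bond succeeds with
# probability y and a failed bond scraps everything bonded so far.
y = 0.98  # assumed per-bond yield, purely illustrative

# Strategy A: bond 15 interfaces blindly, test only the finished stack.
# Expected layers consumed per good stack = 16 / y^15.
blind = 16 / y**15

# Strategy B: pairwise bonding with a test after each doubling (2, 4, 8, 16).
# cost(2k) = 2 * cost(k) / y, so cost(16) = 16 / y^4 layers per good stack.
tested = 16 / y**4

print(f"blind stacking:   {blind:.1f} layers consumed per good 16-layer stack")
print(f"tested substacks: {tested:.1f} layers consumed per good 16-layer stack")
```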
Science channel. Note magnetics: the field effects of attachment are part of hybrid bonding; attraction is good, repulsion is a spike, and/or resistance (as stated in a comment below) may be not so good. Learning here for me, thank you. mb
I would like to see AMD work with DRAM vendors to make hybrid-bonded DRAM. This means memory must be in the CPU package. We could probably agree on standard sizes, say 16, 32 and 64GB, possibly higher: say 6c gets 16GB, 8c either 16 or 32GB, 12c 32GB, 16c 64GB. The 12/16c parts could still have an Infinity link to the IO die plus an additional memory controller optionally connected to additional DRAM. The OS would have to know there are low- and high-latency memory tiers. Or hybrid bond to the IOD, then the IOD hybrid bonds to the DRAM?
We might see this soon for smartphones, where power and space matter a lot more. It's very unlikely in the near future for desktop CPUs, I would say; there is simply no reason to do it.
@@Kemano24 there might be room for IO and cache, but not room for extras like graphics. Maybe the memory controller would go on the cache chiplet, and there's an IO die for USB, PCIe and everything else. I think then you would have NUMA issues with more than 1 CCD. Cache and IO sound perfect for a trailing process node, I just don't know where to divide the pieces.
I was more shocked that Zen 5 didn't change the IO die than I was that the cache chip was on the bottom. For the life of me I don't understand why they didn't update it. Maybe there is a radical IO change on the horizon. I would think they wouldn't integrate the cache and IO into a single layer though; that would mean multi-CCD processors would have duplicate IO. Might be useful, but I'm unsure. I feel it might be more likely that they actually just add another layer beneath the 3D cache for IO. That would be pretty crazy. Then I'd think a third or base-layer IO chip could be centered between chiplets. You could even hybrid bond chiplets to this new third-layer IO die, likely decreasing latency between chiplets.
I have 0 practical use of the information provided in this video, but I still enjoyed it very much. The way you explain everything, and your voice paired with incredible visuals and animations, made for a really easy to understand and entertaining watch!
One practical use is the ability to confidently navigate through the marketing confetti when choosing a new system.
I just watched someone freaking put solder on a chip pretty evenly in what looked like a minute or two.
Will I ever have to place the 50 to 100 beads spread across a microchip? Never. Yet it was so nonchalant it made it wild.
I am only a minute into the video. I didn't know people did that; I thought those things were burnt into the other side. So basically a minute into the video and learning already. Ok, bye.
The possibility of moving I/O into the cache chiplet and then hybrid bonding it with the core chiplet might explain why Zen 5 still uses the old I/O die; they still need to cook it before making sure it's ready. Honestly, I think they really need a new I/O die for future Zen CPUs; even the current one seems to be holding them back. Integrating I/O into the cache chiplet could give them benefits like lower idle power consumption, higher memory support, lower CCD-to-CCD latency, etc.
In "client" AMD have marketed monolithic chips with good battery life and lowered idle consumption especially when sleeping, the plans are for a Strix Halo using CCD chiplets & a beefy IOD/GPU.
What does Intel have that's better than Zen4? Why would AMD divert resources from Zen6, to make a Zen5+ when they could use the re-use the cheaper EPYC platform edge computing variant for a new prosumer Threadripper range with quad-channel memory and loads of PCIE lanes?
The current cheap IOD & chiplet architecture is being replaced already in Zen6, as was mentioned MI300 and Ryzen x3D are proving the technology which should be ready and scaled up for the performance desktop market launch. Perhaps Zen5 replaces the low end and Zen6 will initially be a high end product only.
As general purpose compute is losing relative importance, we may be seeing core, GPU & NPU chiplets stacked on an IOD/cache interconnect in future.
The mild leak, if you believe it, has a silicon interloper. That would be an easier intermediate step.
That approach would not work for Strix Halo, and with Nvidia entering the mega-APU game, that would be a concern. Making laptop chips chiplet-based with a low power interloper seems a logical intermediate step.
@@bobo-cc1xw you mean an interposer, but that was a replacement for an organic substrate with solder bump connections and wires inside, known as Infinity Fabric.
They already have hybrid bonding working in V-Cache and the stacked MI300, but you need ways to connect up the tiles of larger server chips, which are planned to offer more than compute, things like signal processing for telcos.
Without a 2.5D off-die interconnect the silicon area will be severely limited by the reticle limit.
Unlikely they'll be able to do this due to thermal constraints.
An IOD/SRAM/interposer combo would be interesting, but it would use a lot of die area and may have issues with yields and thermals. The latest processors from Intel are close to this idea though.
Yeah, that's cool and all, but have you tried super glue?
also add cotton wool and baking soda for better adhesion
The increase in electrical resistance from having the cache die below the CCDs may not be that bad, or even bad at all. It is true that the distance will be a bit higher, but that can be offset by more TSVs in parallel.
Another potential complication is that at very high frequencies, current flows mostly in the outer layer of conductors, increasing the apparent resistance of thicker ones. I would have to do the numbers to see if this is significant here. The skin effect doesn't apply to DC, but even if the input voltage to a CPU is DC, the current flow isn't: each time a transistor switches on/off there is a change in current, and that happens a lot, at multiples and dividers of the clock. I don't work in the industry, so I don't actually know how bad that current ripple is, and I haven't run the numbers for the skin depth at the required frequencies, but I wouldn't be surprised if it plays some role, as transistor switching itself can be much faster than the CPU clock and there are harmonics too from using square waves.
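For anyone curious about the magnitude: the standard skin-depth formula δ = sqrt(ρ / (π·f·μ)) gives roughly how deep AC current penetrates a copper conductor. The sketch below just evaluates it at a few example frequencies; it says nothing about the actual ripple spectrum inside a CPU, which is the real unknown this comment points at.

```python
import math

# Skin depth in copper: delta = sqrt(rho / (pi * f * mu)).
RHO_CU = 1.68e-8           # resistivity of copper, ohm*m
MU_0 = 4 * math.pi * 1e-7  # vacuum permeability (copper is ~non-magnetic)

def skin_depth_um(freq_hz: float) -> float:
    """Skin depth in micrometres at a given frequency."""
    return math.sqrt(RHO_CU / (math.pi * freq_hz * MU_0)) * 1e6

for f in (100e6, 1e9, 5e9, 20e9):  # example ripple/harmonic frequencies
    print(f"{f/1e9:5.1f} GHz: skin depth ≈ {skin_depth_um(f):5.2f} µm")
# Around 5 GHz the skin depth is on the order of 1 µm, i.e. comparable to
# typical TSV dimensions, so the effect is at least worth checking.
```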
"Skin effect", Industry terms; "magnetics", "field effects", cohesion, resistance, repulsion there is definitely analog happening here in the glue and the harmonics regulated not to be destructive. Well-articulated and easy to comprehend, thank you for arousing my thinking on the topic. mb
AMD uses an integrated inductor coil in the metal layer. This is a technology they introduced during the bulldozer era. The inductor smooths out current fluctuations in power delivery.
So Bulldozer was not a 100% tragedy... 😆
@@doggSMK It was a flop, but a lot of the technology they developed during that time is still being used today. For example, Infinity Fabric uses a physical layer called GMI which was developed during the Bulldozer era.
Every time you post a video, I learn something new. Thank you so much for all the hard work you put into this.
Thank you for these amazing videos!
They have gotten me very excited about chip technology. I currently work adjacently in wearables but having learned about all this amazing stuff, I'd love to consider a second career in chip technology.
There's some cool chip tech in wearables.
You are so good at breaking down how packaging is done. I don't work in the silicon industry professionally but I find it interesting; thanks for making it palatable for me.
I also think it would be awesome if they indeed combined the IO and cache die, dealing with the latency. Note that the current approach allows them to have two different CCDs, so it will be interesting to see how they deal with 16 cores. Maybe a 16-core chiplet for consumers 😊
A hybrid bonded chip also has better thermal conduction between the layers than solder bumps would.
3:34 I want my cpu to be glued together with tiny burgers
Thanks for giving a good background of hybrid bonding. I am so proud to be working in semiconductor packaging.
The interesting thing to me is what do the regular Zen5 CCD to package connections look like? For X3D the cache chip has the solder bumps. Where do the solder bumps go on the CCD when it's the only chip?
You know what, that's a great question I completely missed! My guess would be that the upper metal layers have to be different. But that would be a big BEOL change.
You can do copper to copper bonding with an organic redistribution layer.
@kazedcat I didn't think fabrication with organic substrates had the precision needed. Also, it would only be the copper pads bonding, the oxide and organic layers wouldn't meld. The bond would be fragile.
@@davidgunther8428 Use a different bonding substance for silicon to plastic bond. Precision is not necessary for non-cache connection. Just design the IO and power via to have enough spacing and put the cache vias in a separate region.
Awesome work as usual! Thank you for the great information, and the reading recommendations =)
Ohh, laser and plasma dicing! I knew there was something more sophisticated than sawing.
An additional thing: on the question of whether AMD will combine cache and IO, it would make sense. It seems that they did not change the IO chiplet architecture, and it is the Achilles heel for base Zen 5 (non-3D). The challenge would be that at higher core counts, 24 and up, you would need a couple of IO dies just to accommodate the numerous CCD chiplets, even with a higher density of cores per CCD. For example, even if you combine the cache (the size of one chiplet) and the IO die (the size of about one and a half CPU chiplets), it would only accommodate two, perhaps three CPU chiplets.
Maybe they will use this method for consumer CPUs but have a separate 'IO die for the other IO dies' in Threadripper and EPYC CPUs. This is cumbersome, but it might be worthwhile for the benefits.
Non-consumer parts have a much larger IOD offering far more memory channels and the IF links for far more chiplets.
It would be desirable to re-unify the L3 cache between dual CCDs, and hybrid bonding appears to make that possible. On moooaaaahhhhhrrrr cores: L3-less chiplets leave room for that, perhaps with a larger L2 cache reducing the frequency of trips off-die.
Zen6 is AM5, the platform is fixed.
As you observed, V-Cache solved some performance constraints that the 9700X reportedly suffers from regarding memory latency and bandwidth, not fully feeding 2 threads per full-fat core.
Considering a hybrid of full-fat & dense cores: as it stands, little die-area reduction would be gained once L3 is off-die, so while 6+6 or 8+4 might seem attractive, a chiplet with heterogeneous cores creates binning problems for less gain than that seen in the analysis of the mobile CPU, which re-laid-out merged function blocks, allowing them to use the freed empty space created by the dense cores.
Initial cache access can be slower by a factor of 2x (L1) to 4x (L2) as the cache is usually in a dump mode. Direct access out to DRAM is still a multi-step (25-30 cycle) process that can burn so many clock cycles it shames the whole concept of a multi-gigahertz processor. Fast is on the silicon die itself, pretty fast is on a bonded chiplet... (and so on).
I was expecting more layers of cache to be stacked. The flip to the bottom was a nice surprise.
I knew they would probably do that in the future, just not so soon.
I really missed this YouTube channel. Excellent graphics as always! I love it!
A lot! And… ☺️😉
Your videos are the best in this field.
Great video like always!
Excellent explanation, thank you!
Will be interesting to see their next move, the IO Die is clearly holding them back in Zen 4 and Zen 5, so Zen 5+ or Zen 6 should have major changes in that regard, no matter if it becomes part of the stack or just renewed.
When are Zen 6-7 coming?
@@notaras1985 2 years for each generation is a reasonable assumption so Zen 6 would be 2026, Zen 7 2028.
Loved the video. I was wondering if you were planning a deep dive into the M4 chips like you did for the M3 chips. That was very interesting and would be great to see that. Cheers!
I'm currently still waiting for die shots to appear. As soon as I have something to work with, I'll make a video!
Been waiting for this video!!! Watching it on my 9800X3D 😁
Can you ask your 9800X3D if it still has a support silicon on top? xD
@@HighYield LOL
Does the video play faster?
@@TheEVEInspiration340 fps😂
This generation demonstrates that AMD was working hard to push boundaries over the past 10 years, and at the same time shows that INTEL WASTED THE LAST 10 YEARS trying to scam everyone by selling the same chip every year.
AMD was also trash at one point and lacked innovation. This is why Intel wasn't pressured to compete. You can't blame Intel if there was no competition. Now they are working their asses off, innovating their fabrication.
Intel has been such a disappointment. I used to be an intel fanboy but after their issues over the last year and questionable business practices.... AMD needs competition.
@@SupraSav AMD has competition. Intel is not that behind. Just wait for Panther Lake with Intel 18A and Xe3.
Yep, I have a 5820K and 10 years later, unless I get server CPUs, nothing much has changed; they put in more cache and that's it.
Also, it actually got worse. That E-core idea they took from ARM isn't good (it's good for battery-powered devices, not workstations).
I want symmetrical huge cores, not smaller cores to save power. It doesn't even solve the dark silicon problem.
Maybe it's for "ESG". Like TVs being stupidly slow because they can't use more than 40W in the Soviet, I mean European, Union. But I digress.
Sort of. Intel packaging is actually a bit more advanced than TSMC. Intel isn't fully taking advantage of it on the CPU side yet.
Excellent video! From the AMD engineer's comment it does seem like they are still using two reconstituted wafers, but it would be interesting to do a cross-section SEM elemental analysis to try to see whether the oxide layer exists.
Called it! In a comment under an earlier video.
Unsure if the added complexity of unifying the IO die with the chiplet/cache is something we will see in the next consumer Zen, but it seems like a logical approach in the near future.
Hello good friend @HighYield. It's always awesome to see you do analyses on these chips... and as I said before, on such a niche subject for most people. Honest question: how are you able to fund all this? I know you have Patreon, but that would not be enough income for you; it would barely pay for the original research!
I reckon it is part patreon and part enthusiasm
There's a reason it sometimes takes weeks for me to release a new video: I work a normal job. YouTube is just my hobby. I'd love to focus more on it, but it's difficult to make the switch.
This is genius. Trust AMD to come up with every good idea, and other companies will copy them.
The fact that the regular Zen 5 CPUs were mostly received as very disappointing but the 9800X3D is flying off shelves like crazy should tell you how valuable this tech can be for the right applications. Being able to cram all that extra cache in there really makes these CPUs shine in games.
Maybe Zen 5 is basically a revamped X3D, but hey, it's still confusing why the normal Zen 5 sucks so bad.
My reaction to hearing the cache was on the bottom was one of disbelief. Nobody talked about it and made it seem like it was no big deal from a complexity and manufacturing standpoint.
Having TSVs for connecting the CCD to the substrate for power and data that go through the cache die, along with the TSVs to connect the cache to the CCD, is drastically more complex than the way 3D cache used to work. It makes sense why the cache die is the same size as the CCD: it's to fit all those TSVs.
The word is that AMD is moving to silicon interposers for Zen 6 to connect the IO die to the CCD and V-Cache. There is talk of an increased level of modularity coming with this change, which will lead to cost savings as there will be fewer bespoke designs for CCDs. Rather, they will use a common interposer to connect CCD to IOD, and the IOD will become bespoke based on the product, i.e. special IODs for various client and commercial products.
This is an interesting move alongside AMD's move towards a unified graphics architecture, UDNA. Perhaps we will see the next generation of an MI300-type product that is much less costly to manufacture due to the modularity of forthcoming products. Imagine a server product with a common silicon interposer connecting all CCDs with 3D V-Cache and UDNA dies to the IO. Pretty cool.
4:47 Cleaned, processed in a vacuum, pressed against each other...
Is it... cold welding?
Right now, AMD doesn't have a wafer bottleneck, but a packaging bottleneck. The only way I see things going the way you suggest, with I/O and cache put on a massive base die (which would need to have room for a large number of CCD chiplets on top) is if TSMC drastically increases their packaging capacity. That's not impossible, but the limited supply of the rather early X3D release shows it's far from a current reality.
The cache on bottom did surprise me, but mostly because I didn't know it was essentially the same process. The only real issues were engineering at the front end. I thought being on the bottom would require more advanced packaging, which competes (in terms of allotted time) with the MI300 family of products that AMD is pushing hard to cash in on with the neural machine learning bubble.
Imagine Zen 7 APU with a single unified CPU chiplet, GPU chiplet, and shared memory chiplet. It would be next level.
I do. My PC will be 13995X3D with 6090ti.
Totally agree, having the IO and cache on the bottom layer seems the next logical step. Ironically that would make the Zen 6 die appear monolithic, but I guess the good thing is they could spread the heat over a bigger area with the help of support silicon?
16:30 How do they thin the bottom carrier wafer down without damaging the transistors? If the solution is to leave a bit of carrier wafer, then that carrier wafer would also need to have matching power delivery structures in place, right?
got all giddy and started kicking my feet up when i saw there's a new vid
I think it may soon be time to see the Butter Donuts in action!
Maaan, your videos really are something else!!
21:46 I doubt power delivery is such a big concern since it already has to go through so much routing.
I bet the actual cache is still smaller than the CCD and the margins are used for routing power and Infinity Fabric.
23:50 "Both IO and cache don't scale well the smaller process nodes"
Are you sure about that? IO for obvious reasons doesn't but as far as I'm aware cache is actually what scales the best with smaller process nodes. On the other hand, cache can be packed much closer together than compute, usually by at least 3 times the density with a cache optimised node so there isn't the strict need to use the smallest and most expensive process nodes. One of the reasons cache can be made much denser than compute, even on the same process node even if it's not a specifically cache optimised node, is that cache uses a fraction of the power that compute does which also should mean that cache by very definition should continue to scale with smaller process nodes well beyond the limits of compute.
SRAM doesn’t scale anymore. Logic does.
@@HighYield I'm sorry to burst your bubble, but that is flatly incorrect. In fact it's the opposite: TSMC expects the density scaling of SRAM to actually increase compared to compute with sub-2nm processes.
Admittedly not by much. N2 is projected to have 1.15x scaling for compute vs N3, but N2 cache-optimised will have 1.21x scaling compared to N3 cache-optimised.
It will be revealed at IEDM later this month.
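Taking the scaling factors quoted in this reply at face value (they are the commenter's figures, not verified here), the relative area shrink they imply works out as follows:

```python
# Relative area of a fixed design after a node shrink, using the density
# scaling factors quoted in the comment above (unverified, illustrative).
logic_density_gain = 1.15  # N3 -> N2, compute-optimised (per the comment)
sram_density_gain = 1.21   # N3 -> N2, cache-optimised (per the comment)

print(f"logic area: {1 / logic_density_gain:.2f}x of its N3 size")
print(f"SRAM area:  {1 / sram_density_gain:.2f}x of its N3 size")
# ~0.87x vs ~0.83x — a modest difference, as the comment concedes.
```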
At last a good explanation of what Lisa Su announced way back when she talked about bumpless & micro bump interconnects and nobody had a clue how troublesome x3D would prove to rivals.
How good were sales of Milan-X/Genoa-X? I am wondering about Turin-X: the new IOD there offers much faster memory compatibility, but Zen has always had the issue that 2 cores would max out a CCD's bandwidth, so for some applications V-Cache was a killer, allowing massive scaling within an 8c/16t CCD. Maybe the newer massive-bandwidth AI-aimed accelerators removed the memory limitations which kept calculations that exceeded VRAM off GPUs, and so Turin-X is not a priority market compared to gaming. OTOH the Turin-X delay could simply be a product of phasing; after all, they have Turin dense, and Turin-X customers are likely to be Genoa-X ones, so a delay may mean CPU upgrades down the line.
Using 4nm, one would expect the small chiplets and V-Cache dies soon won't need pairing for known-good matches. Given 32MB of L3 is standard across all Zen CCDs, it must have built-in redundancy (we never saw cheaper 5-core, 30MB models knocking around). Perhaps some screening of the wafer using visual recognition could estimate likely wastage, so both approaches could be used together. But it could simply be an artefact of fab procedures: known-good dies were always the input to hybrid bonding, not wafers, and scaling up to a different method is on some long optimisation TODO list.
This guy really enjoys talking about amD's Packag(ing).
Honestly mad interesting. I'm curious what can be done to reduce thermal issues with more stacked layers.
It's apparent that optional cache on top is easier logistically. But what replaces the bottom cache die when there's none? The CPU still must meet structural and height requirements, and power still must flow to the CCD. Is there always a dummy layer, or are they able to move the CCD layer to the bottom for non-X3D parts?
8:00 So is it better in every single way, or does it have drawbacks compared to the previous approach? You contradict yourself in the same sentence.
Interesting. I learned a lot of new stuff from your videos as always. But they are using glass interposers in the future right? What if it's not just in the substrate and it could be used for bonding layers as well?
Great explanation video. This is all well above my head, but I like watching videos like this to try and expand my understanding of technology.
If I had to guess about the future of Zen products, I'd say that Zen6 will be more of the same 2nd-generation 3D packaging, since AMD kept 1st-generation 3D packaging for Zen3 and Zen4. If anything, I would think that this is setting AMD up for "Zen7" (or whatever AMD calls it, or whatever underlying design it uses if it isn't Zen based) on the next socket after AM5. AMD is going to have to change something considering how much has to fit on a CPU package and how size-constrained AM5 feels. AMD will have to do something like make the socket larger, do more creative stacking to fit more specialized chiplets in, or both.
For me, I'm also excited about what this means for RDNA5/UDNA1 since AMD usually moves its technological successes from product stack to product stack.
Gr8 explanation. Thank u
I'm so excited for the future from AMD.
Are there pillars to still offer some cooling for the 1st-floor V-Cache? A vapor chamber layered between the bumps would carry heat and lock the balls in place. Metal points that channel heat to the spreader.
I think they are shooting for a 3-layer approach, which makes V-Cache "mandatory". Memory controllers and PCIe lanes scale the worst with node advances, so those go on the bottom layer. The middle layer will have the L3 cache and the various little "household knickknacks" like integrated graphics, voltage regulation, sensors, etc. (and in the case of laptop chips, basically the entire chipset/SoC functionality). The top layer will have the compute chiplets, which only have L1 and L2 cache. If Zen 6 does this it will probably be N6 -> N4 -> N2 (or N3X).
Won't having the GPU sandwiched between them create a thermal problem? I mean, the heat will need to go through the top die, and we all know a GPU is quite a power-hungry device. Intel's Foveros seems to be a much better approach since each die can dissipate heat at more or less the same thermal resistance.
@@n.shiina8798 An IGP is generally a fairly tame beast. The one on Zen4/5 pulls 7W maximum. Obviously AMD will need to make other considerations if they want an APU like the 7840/8840 family of laptop chips with integrated graphics powerful enough to do light gaming.
I wonder if they could use the same sort of tech behind Anisotropic Conductive Films and if that could have benefits.
I could see it being simpler to use metal nano-particles instead of directly fusing the chips together.
Perhaps if the particles are small enough it'd allow them to put contact points even closer together, though I don't know what the limiting factor for the spacing currently is.
I wasn't surprised by the move because I've heard about AMD engineers complaining about the heat for some years now.
Really good info!
Any news on backside power delivery?
Coming with Intel 18A and probably TSMC A16 iirc.
@@HighYield Great. Thanks a lot! So that should be 2027 for the customer.
@@einekleineente1 2025
Anecdotally, undervolting the 1st-gen 3D V-Cache CPUs had the greatest effect on performance, which makes sense given the many thermal barriers in the chip hierarchy.
I kind of expected AMD to solve the thermal issues and allow for higher clock speeds, because that was the biggest thing holding X3D back. I did not expect them to put the cache on the bottom though, but it paid off well: the 9800X3D is the current gaming king, and as seen in the GamersNexus stream it is an overclocking beast as well.
I actually expected AMD to move the 3D V-Cache below the CCD when the 5800X3D came out, because it's the only move that made sense to me. Remember, this was a pet/side project that got lucky, so the layout of the CCD did not have any 3D V-Cache in mind at the time.
Chip glue go brrrrrrrrtt
If AMD combines IO and 3D V-Cache, wouldn't that lead to having 3D V-Cache for lower-in-the-stack CPUs, like a hypothetical Ryzen 5 10600X3D?
3D Stack IO would solve the memory latency problem.
But then we do already have monolithic Zen5 chips (every laptop CPU) and they don't seem to perform massively better than chiplet Zen5.
Your theory about future I/O die integration makes sense for consumer/mobile CPUs, and it might explain the stagnation of AM5 chipsets. But PCIe can be power-hungry, so I'm not sure an external chipset can be avoided.
Very informative video, thanks a lot sir.
You know, the thing that amuses me, is that before the 9800X3D launched, _everyone_ scoffed at the rumors that AMD was going to put the cache under the CCD, listing drawbacks that are "much larger" than the benefits.
Then the 9800X3D appeared, shocking all the pundits and proving that the tradeoffs are not that bad, as the 9800X3D totally annihilated all other "gaming CPUs" with zero exceptions.
Here I am wondering why the chips are not built on top of each other during the lithography process?
Probably yield and node: if the cache or the CCD is non-functional, you have to throw the entire thing away, and the process node required for the CCD is likely much more expensive than the one for the cache.
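To put toy numbers on that yield argument (all yields invented, just to show the shape of the tradeoff), the interesting metric is how many expensive CCDs end up in a sellable part.

/* Toy comparison: growing cache and CCD as one monolithic stack vs. testing
 * them separately and only bonding known-good dies. All yields are invented. */
#include <stdio.h>

int main(void) {
    double ccd_yield   = 0.85;  /* hypothetical yield of the expensive CCD layer */
    double cache_yield = 0.95;  /* hypothetical yield of the cache layer */
    double bond_yield  = 0.99;  /* hypothetical yield of the hybrid-bonding step */

    /* Monolithic: a cache defect scraps the good CCD underneath it. */
    double monolithic = ccd_yield * cache_yield;

    /* Known-good-die bonding: bad cache dies are screened out beforehand,
     * so only a failed bond can still scrap a good CCD. */
    double kgd_bonded = ccd_yield * bond_yield;

    printf("Sellable parts per CCD printed, monolithic:  %.3f\n", monolithic);
    printf("Sellable parts per CCD printed, KGD bonding: %.3f\n", kgd_bonded);
    return 0;
}

With these made-up numbers about 5% of otherwise-good CCDs would die under a bad cache on the monolithic route, versus roughly 1% lost to bond failures with known-good dies, on top of the node-cost point above.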
This makes me wonder if they'll put the memory controller on a single chiplet with the L3 cache, under 2 CCDs without an L3, and then have the PCIe and other IO on a separate chiplet. I would think this would improve performance significantly, because a unified L3 would reduce the L2 snoops between cores needed for cache coherency.
Can we expect a chip analysis of the Apple m4 SoC family?
Surely a dumb question, but is it possible to mix two different process nodes on the same wafer? I mean, making the first layers (transistors and cache) using the most advanced node (N3, N4) and then using a less advanced one like N7 for the V-Cache and the rest of the layers?
How does alignment work? It seems like an impossible fit, when using a reconstituted wafer, to control individual chip tolerances and then the whole wafer.
Yes, bonding the CCXs to a die/interposer/cache/I/O/GPU layer that can be on a less expensive node, where the components on that "interposer" layer/die wouldn't benefit from the same scaling as the tiny logic CCXs on top, yet it can still pass along the heat that the interposer layer itself doesn't produce... This has to be the thought here.
If the I/O also gets integrated into the chiplet, I think we can expect an increase in core counts or much smaller processor sizes. I'm not knowledgeable enough, but I'm imagining doubling the possible number of CPU dies on the same surface, or using the space for something like an L4 cache die, a bigger NPU die (which for sure will keep getting bigger in the future), or even a separate GPU die for APUs. They will for sure find a way to use the freed-up space.
So good video 🎉
Cache on top, cache on bottom. Do both ;) We will get there.
22:52 No Turin X? That doesn't seem very likely. All the available information shows that V-cache was a server-first initiative that only made its way onto desktop because somebody suggested it would be good for gaming. That AMD would do all that extra engineering to get the cache die on bottom only to restrict it to the consumer desktop space seems almost absurd.
It's odd - I know. But AMD said there are no plans for Turin-X. That doesn't mean it will never happen, but so far that's all we know.
Can't they bond a glass layer made as a prism to separate spectrum channels and use mirror tunnels for photonic transfer? PCIe 7.0 photonic transfer.
Putting the cache between the cores and RAM seems logical. As far as the data flow goes, the cache needs to connect to the RAM, and the cache to the cores. Access to RAM bypassing the cache isn't common, so this doesn't seem like as much of a routing mess as it could be. That said, if that is how it works, making the same CCDs work without the cache on the non-X3D chips is really impressive.
Well, that could be something they will use in next-gen Ryzen to improve the CCD-to-IO connection. Maybe it will be on consumer GPUs some day as well.
Would integrating the IOD with the cache be an issue with multi-CCD CPUs? The current bodge where one CCD has cache and the other doesn't and the solution to performance problems is to effectively disable one or other of the CCDs depending on the workload is truly epic. I'd assumed AMD would eventually try to have cache on both CCDs but if the IOD is integrated that would require disabling it on one of the CCDs. How much faster would the IOD memory controller run if it was integrated with the cache? The current 6000MT/s limit (without scaling) could be a problem over the next few years given the availability of faster memory and Intel running faster.
Fill the void with liquid diamond for heat transfer and treat it for spreading the photonic spectrum, offering a transmission channel for each photonic wavelength.
Light channels add bandwidth without heat and begin the experiment in going photonic.
Placing IO below the compute dies would be wonderful for latency and reduction in power usage per transferred bit to/from each compute die, however just as IO does not scale with new process nodes, IO power requirements don't scale down either - going off-chip can be very power hungry, which contributes to thermal concerns.
Forrest Norrod discussed this in an interview in the past, where he mentioned that the power requirements for IO in Epyc were a real problem. How much of that is just "because it's IO" vs IO requirements for something in the data centre space is an open question of course.
If 3D V-Cache is now the same size as the die, can AMD fit more cache in it? It is like twice the size of before. Or am I wrong?
Second question: for the 99xxX3D there will be 2 dies and 2 V-Caches. Can they be one large unified cache, so that there would be 200MB of cache? AMD said in the past that it is not possible because they are too far from the other cores. But that doesn't seem true: VRAM in video cards is much farther away from the cores than any V-Cache is from any die in the CPU.
So technically AMD could make a 400MB cache without any work?
So first question: yes, in theory AMD could pack more cache into the new X3D chiplet. But as of now they are not doing that.
Second question: there are rumors that the 9950X3D will still only have one CPU chiplet with extra cache. And even if both had the cache below, right now you couldn't connect them, because they are not designed that way. But in theory it could be possible.
VRAM is DRAM. There is a giant gap in performance, bandwidth and reaction speed between them. A few numbers from a 5700X (read bandwidth / latency):
L1: 2206 GB/s, 0.9 ns
L2: 1112 GB/s, 2.6 ns
L3: 605 GB/s, 11.9 ns
RAM (DDR4-3600 dual channel): 52 GB/s, 62.2 ns
VRAM is usually built with a wider interface, but the latency is a lot longer, usually 200 ns and more. GPUs are optimized to work with large amounts of streamed data and perform a lot of calculations on it, but their control flow is quite slow.
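For anyone wondering where numbers like that come from: they're usually measured with a pointer-chasing loop, where each load depends on the previous one so the prefetcher can't hide the latency. A minimal sketch in C (not the actual tool used for the figures above, just the general technique; the buffer size and hop count are arbitrary):

/* Minimal pointer-chasing latency sketch: walk a randomly shuffled cycle so
 * the prefetcher can't help, then divide elapsed time by the number of hops.
 * Vary BUF_BYTES to land in L1/L2/L3/DRAM. Compile with: cc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stddef.h>

#define BUF_BYTES (64u * 1024u * 1024u)   /* 64 MiB: well past L3, mostly DRAM */
#define HOPS      (50u * 1000u * 1000u)

int main(void) {
    size_t n = BUF_BYTES / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle covering every element. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i) keeps it a single cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < HOPS; i++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler can't optimise the chase away. */
    printf("~%.1f ns per load (final index %zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}

Shrink BUF_BYTES until the buffer fits in L1/L2/L3 and the ns-per-load figure drops accordingly.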
So in short this means that AMD could drop the IF and allow chip-to-chip communication in the future. I didn't watch the full video, but this would mean we could see a single-chip-like package vs the 2-3 chips we see today, along with allowing them to make some of the chips even smaller going forward. Instead of a large L3 and L2 (from the base), we could just move those memory layers into their own layer, leaving a ton more space for just the compute. Or, more likely, Chiplet -> Memory -> Chiplet -> Memory, etc. This way all the compute and memory would be accessible and addressable across all compute and IO. That way you can have one core completing one task while another core accesses that same memory (without having to reschedule the working thread), even though it's on a totally different CCD. To be fair though, this wouldn't really improve performance per se, maybe power for sure (as it could lower idle power draw), but you would still be limited in compute per se. More than likely this would allow a smaller-footprint package followed by lower cost on the compute chiplets. Everything else would cost the same or more, as the packaging method would have some increased cost. So SKUs that have two chiplets wouldn't have to fight over the "gaming cores" or over which one has 3D cache, as they both would have access to that extra memory. If it's all SRAM, I wonder if that layer would just be marked out for the different memory locations as well (L1-3) and marked off as such too.
I knew AMD had to change something about how the CPU wafer and V-Cache wafer were going to interact with each other, but as to what that would be or how it would be done, I had no idea. It's surprising that AMD was able to come up with a better option in such a short amount of time. Maybe AMD had both options for how the CPU and V-Cache were to be connected and how that would affect performance, used Option B the first time around, and then found out that the option not yet used was the better one.
Will Zen 7 have 2nm transistors?
If the difference in cost between wafer-on-wafer and chip-on-wafer bonding is big enough, it could be justifiable to leave empty space around the smaller chips on the wafer, especially if the needed final ratio of chips is 1:1.
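A toy break-even for that idea, with every number invented purely for illustration: padding the cache die out to the CCD's footprint costs cache-wafer area but lets you use cheap wafer-on-wafer bonding, while packing densely forces pricier chip-on-wafer handling.

/* Toy break-even for the idea above: pad the smaller (cache) die out to the
 * CCD's footprint so cheap wafer-on-wafer bonding can be used, vs. packing
 * the cache wafer densely and paying for chip-on-wafer bonding instead.
 * Every number here is invented for illustration. */
#include <stdio.h>

int main(void) {
    double cache_wafer_cost = 8000.0;   /* hypothetical cost of the cache wafer ($) */
    double dies_dense       = 1400.0;   /* cache dies per wafer when packed densely */
    double dies_padded      = 900.0;    /* cache dies per wafer when padded to CCD size */
    double w2w_bond_cost    = 1.0;      /* hypothetical bonding cost per die, wafer-on-wafer */
    double cow_bond_cost    = 4.0;      /* hypothetical bonding cost per die, chip-on-wafer */

    double padded_per_die = cache_wafer_cost / dies_padded + w2w_bond_cost;
    double dense_per_die  = cache_wafer_cost / dies_dense + cow_bond_cost;

    printf("Cost per bonded cache die, padded + W2W: $%.2f\n", padded_per_die);
    printf("Cost per bonded cache die, dense + CoW:  $%.2f\n", dense_per_die);
    return 0;
}

With these particular numbers the two come out nearly equal, which is exactly the point: the answer hinges on how big the per-die cost gap between the two bonding flows really is.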
I wonder if the Radeon Technologies Group can adapt 3D V-Cache on the bottom of its GPUs for the Infinity Cache, and whether this could boost the Infinity Cache's throughput and capacity. GPUs are more often starved by low memory throughput than by high latencies, unlike CPUs, which are starved by high latencies more often than by low throughput.
Can you do an Apple Silicon vs AMD chip packaging comparison video please?
As soon as we have M4-series die shots I'll do an Apple video, and I'm sure that Apple will use SoIC at some point in the future.
Great video.
Having the compute and IO together in a single chip again and doing away with the Infinity Fabric in between would be a huge deal for RAM latency and idle power consumption; however, I don't think consumers would care much. We've seen that high RAM latency is effectively mitigated by a huge L3 cache and the current out-of-order compute pipelines, and idle power consumption is something people only really care about on mobile devices, which is where AMD has been using single dies anyway. So while I would love to see it, as it could still push gaming performance a bit higher and/or reduce the need for a large L3 cache, it could very well be that AMD decides the extra cost isn't worth it.
I expect this to happen in the next two gens. AMD has enough server market share to split the desktop/laptop chips off on their own. Servers can stay with a separate IO die and chiplets, and desktop/laptop can move to more monolithic designs or, as this video is about, to hybrid bonding. I expect Intel to do the same, even though they just split the IO die out with Lunar Lake.
You are thinking too narrowly. AMD is space-constrained with regards to the number of CCDs; stacking them on top of the IO would enable 32C or more on Ryzen, which consumers would 100% care about.
Apple is kind of doing what you are saying already. I have my M3 Max MBP with 128GB unified memory and everything other than gaming it does amazingly well. There are some games I can play but I got this as my desktop replacement. I won't use Windows ever now. I got sidetracked a bit, but Apple introduced high throughput RAM soldered very near the processor (I think it's part of the processor) and the I/O is directly inside the chip too. Everything is ultra smooth and fast. In case you are wondering why I got this, I am a software developer, and I do tinker with local models (which run on GPU with 128GB VRAM, unified memory). And I plan to use this for like 6-7 years lol.
The main mitigation for RAM latency is caches, cachelines, pre-fetching, reordering & SMT, and of course consumers won't care; they'll just benefit from better power efficiency as less energy is used moving bits around on average.
In an APU, having general compute, GPU & NPU on a more power-efficient newer process, bonded to a cache and IO layer, improves yields and lowers overall costs through die area reduction. AMD are absolutely going for this in Zen6; they have had years since the 2021 announcement that the packaging problems were solved and Milan-X and the 5800X3D became feasible. We had Zen4 with improved V-cache and Zen5 re-designed to suit V-cache, so there's every reason to suppose Zen6 is going to tackle the perceived weakness of Infinity Fabric linked chiplets.
Remember AMD had no way of knowing that Meteor & Arrow Lake would be flops and Apple & Qualcomm make CPU & GPU too.
Not to mention: significantly improved inter-CCD latency (a huge bottleneck for gaming), the potential for the base die to act as an interposer for a direct CCD-to-CCD connection that lets the CCDs act as one (similar to Intel EMIB), expansion of the V-Cache size per CCD, and the possibility of a large L4 V-Cache shared between CCDs.
If you think this change only improves memory latency and idle power consumption, you are only thinking about using the new technology to do, well, exactly what we are currently doing, utilizing none of its fundamental advantages...
AM3 = PGA 940
AM4 = PGA 1331
AM5 = LGA 1718
AM6 = LGA 2300??
Do more pins make the CPU die bigger? Will an AM6 CPU be physically bigger than any AMD CPU that has come before it?
My man is asking the same question as "does a bigger bottle store more water".
I thought this was going to be boring but I was wrong.
At some point during the editing, when I have seen the video too many times, I always start to think it will suck. And I'm happy when it doesn't :)
I wonder if AMD will bother with multiple compute chiplets for consumer zen 6, or just use a single 16 core ccd on top of the new IO die.
I wonder if this works for CMOS and CCD image sensors
Not only does it work, iirc CMOS was the first actual application of hybrid bonding.
Sir, when is the Apple M4 deep dive coming?
As soon as I get my hands on some die shots.
With hybrid bonding, there may not be a need to put L3 cache inside the compute die: let that be optimized for the ultra-latency-sensitive u-op, L1 and fast L2 caches. Why waste the expensive leading-edge die space when the L3 can be placed entirely below, using an older process that is nearly as dense (i.e. cheaper)? For Zen 5, this would roughly increase the core count of the compute die by 50%.
In terms of chip stacking with hybrid bonding, how many wafer-to-wafer layers are needed to get to a good structural support metric? Stack enough of them through this process to get the structural support necessary and a massive amount of L3 cache. Since this massive stack is purely for cache, wafer-to-wafer hybrid bonding can be leveraged. There are manufacturing-time optimizations that can be done here too with, say, a 16-layer stack: start with eight double stacks, then four quad stacks, then two octo stacks, until the final bond step producing 16 layers. If they can check the quality of the bonding at each step, many more double layers would be discarded for bad yields vs. quad and octo layers. Essentially, as the stack gets higher, only known-good substacks are used going up, improving overall yields. Only when adding the CPU dies on top would a carrier wafer be needed.
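A toy model of why that pairwise, test-at-every-doubling approach helps, with an invented 98% per-bond yield and a simplified scrap rule (a failed merge loses both halves):

/* Toy yield model for the binary-stacking idea above: compare expected raw
 * cache layers consumed per good 16-layer stack when (a) 16 layers are bonded
 * one after another and only tested at the end, vs (b) known-good substacks
 * are paired and re-tested at every doubling (2 -> 4 -> 8 -> 16).
 * The per-bond yield is an invented number. Compile with: cc stack_yield.c -lm */
#include <stdio.h>
#include <math.h>

int main(void) {
    double p = 0.98;          /* hypothetical yield of a single bonding step */
    int layers = 16;
    int bonds = layers - 1;   /* 15 bonds for a sequential 16-layer stack */
    int doublings = 4;        /* 16 = 2^4 */

    /* (a) Bond all 15 interfaces, test once at the end. */
    double seq_layers_per_good = layers / pow(p, bonds);

    /* (b) Pair known-good substacks; a failed merge scraps both halves, but
     *     never more than the current substack size. Each doubling needs an
     *     expected 2/p good inputs per good output. */
    double bin_layers_per_good = pow(2.0 / p, doublings);

    printf("Sequential, test at end: %.1f layers per good stack\n", seq_layers_per_good);
    printf("Binary with re-testing:  %.1f layers per good stack\n", bin_layers_per_good);
    return 0;
}

So in this sketch the binary approach consumes about 17 raw layers per good 16-layer stack versus roughly 22 for bond-everything-then-test, and the gap widens quickly as per-bond yield drops.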
Another variable for manufacturing is simultaneous arrival of backside power delivery. This would manifest as a sandwich of repeating data, silicon and power layers. This does bring up the idea of leveraging shared power layers (I would presume that the data layers are too fine to even think of this).
There is also the question of whether hybrid bonding can be done with multiple dies on top of a larger base die. In this case, the larger base die would simply be the IO die holding multiple stacks of cache with compute dies on top.
So how do you bond two delicate electrical components?
TSMC: we heat them until all of their conductors melt and fuse them together
Science channel. Note magnetics: the field effects of attachment are part of hybrid bonding; attraction is good, repulsion is a spike, and/or resistance, as stated in a comment below, may be not so good. Learning here for me, thank you. mb
And if you know just how insanely small these things are, you'll be even more impressed.
I would like to see AMD work with DRAM vendors to make hybrid-bonded DRAM. This means the memory must be in the CPU package. We could probably agree on standard sizes, say 16, 32 and 64 GB, possibly higher? Say 6c gets 16 GB, 8c either 16 or 32 GB, 12c 32 GB, 16c 64 GB.
The 12/16c parts could still have an Infinity Fabric link to IO plus an additional memory controller optionally connected to additional DRAM. The OS would have to know there are low- and high-latency memory tiers.
Or hybrid bond to the IOD, then the IOD hybrid bonds to DRAM?
We might see this soon for smartphones, where power and space matter a lot more.
very unlikely in the near future for desktop CPUs I would say, there is simply no reason to do it
@@jakfut2167 I think the mobile people are happy with LPDDRx even though it has a huge latency penalty, and they don't really care about latency.
I thought AMD would eventually add more layers of 3D V-Cache, but placing the cache and IO on the base doesn't point in that direction.
@@Kemano24 there might be room for IO and cache, but not room for extras like graphics. Maybe the memory controller would go on the cache chiplet, and there's an IO die for USB, PCIe and everything else.
I think then you would have NUMA issues with more than 1 CCD.
Cache and IO sound perfect for a trailing process node, I just don't know where to divide the pieces.
Oh, they'll add more layers. Just not in the direction we expected. ;P
I was more shocked that Zen 5 didn't change the IO die than I was that the cache chip was on the bottom. For the life of me I don't understand why they didn't update it. Maybe there is a radical IO change on the horizon. I would think they wouldn't integrate the cache and IO into a single layer though; that would mean multi-CCD processors would have duplicate IO. Might be useful, but I'm unsure. I feel it might be more likely they actually just add another layer beneath the 3D cache for IO. That would be pretty crazy. Then I'd think a 3rd or base-layer IO chip could be centered between chiplets. You could even hybrid bond chiplets to this new 3rd-layer IO die, likely decreasing latency between chiplets.
next: 3 chips linked with hybrid bonding 🤔