As a cloud provider, I find the idea of more cores at lower clock speeds extremely compelling. It’s not uncommon for us to have >50% CPU utilization at a host level, but scheduling overhead and other issues make higher oversubscription impractical, so more cores would allow us to get even denser and offer a lower-cost option for the vast majority of workloads that don’t really need high clock speeds anyway.
@@shepardpolska yea it’s a great fit for those use-cases. E-cores are great for laptops, but the lack of SIMD and other features makes them a non-starter for general purpose virtualization. Cores that work just like the high-perf stuff but a little slower are easy to integrate into a tiered offering.
In 2016 some researchers built a 1000-core CPU that ran at 1.78 GHz and could be powered by a single AA battery, with each core able to shut down individually when not in use. If they did this in 2016, imagine what level of technology we have now that they are hiding from us.
@@ChrisStoneinator It's called "KiloCore". Dead project by now, but it was hyped up a few years ago. Used PowerPC ISA. Couldn't find any actual performance numbers. Overhyped, with almost no practical use in general computing because of how the chip is built.
To me, this shows how much room for change there still is in CPU design. We've been stuck with monolithic quad cores for so long, and suddenly, we got big.LITTLE, chiplets, 3D-stacked cache, and 128-core CPUs. I'm loving it :)
For gaming this sucks. I want a 16-core monolithic. The hybrid Intel CPU has given me nothing but problems. Thread Director scheduling on Windows is still terrible in 2024.
@@impuls60 Well, the fastest gaming CPU is, and will be for a while, an 8-core AMD with 3D V-Cache. If you want 16 cores, get the 7950X3D or similar. This is best for laptops and cloud computing, especially enterprise. More cores in less space is great for enterprise.
@@impuls60 Good luck waiting for 16-core monolithic on a cutting-edge node with high clock speeds and a large cache for gaming. Such a theoretical product would cost thousands of dollars and won't have a viable market if it can be outperformed by an 8-core costing many times less. You have to make a compromise at some point.
Oh, I see, that's honestly pretty clever. I imagine using the same base design makes programming much easier than having 2 entirely separate designs for high performance and high efficiency cores. Really drives home how CPUs are integrated CIRCUITS, and not magic crystals. The same circuit can be laid out multiple ways for different purposes.
It's much easier for the OS, instead of needing special CPU driver and applications not knowing what features the chips support (as they can be moved between P&E cores), it can simply favour the fastest cores, then slower and finally SMT threads when utilisation is high.
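The ordering described here (fastest cores first, then slower ones, and SMT threads last) can be sketched in a few lines. This is only an illustrative model, not how any real OS scheduler is implemented; the `Cpu` class, `placement_order`, `example_topology` and the clock values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    cpu_id: int
    max_mhz: int          # highest clock this logical CPU can reach (made-up values)
    is_smt_sibling: bool  # True for the second hardware thread of a core


def placement_order(cpus):
    """Order in which a simple clock-priority policy would fill logical CPUs.

    Primary threads come first, fastest core first; SMT siblings are only
    used once every physical core already has a thread.
    """
    return sorted(cpus, key=lambda c: (c.is_smt_sibling, -c.max_mhz, c.cpu_id))


def example_topology():
    # Hypothetical part: 2 full cores + 4 compact cores, all with SMT.
    cpus = []
    for core in range(6):
        mhz = 5000 if core < 2 else 3500
        cpus.append(Cpu(cpu_id=2 * core, max_mhz=mhz, is_smt_sibling=False))
        cpus.append(Cpu(cpu_id=2 * core + 1, max_mhz=mhz, is_smt_sibling=True))
    return cpus


for cpu in placement_order(example_topology()):
    print(cpu.cpu_id, cpu.max_mhz, "SMT sibling" if cpu.is_smt_sibling else "primary")
```

Because every core runs the same instructions, the sort key needs nothing beyond clock speed and SMT status; a design with differing instruction sets would also need a capability check per thread.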
@@RobBCactive Another likely advantage is in development cost for AMD. I presume that they do all the architectural development in the P core design (since it's easier as it's broken down into more separate blocks) then, when the architecture is mature, redesign the layout to compact it into a smaller area.
@@spankeyfish Good point! Well if you have developed the RTL, alter the process rules for lower frequency, use compact cells and loosen the placement requirements it is mainly running software tools and design rule checkers to validate a different implementation. It could even involve many of the same engineers who created the faster modular initial design that passed validation and understand the design in depth. So that has to be simpler and cheaper than 2 architectures.
AMD is playing on 'easy mode' in regards to scheduling: with cores where the only difference is max clock speed, all you have to do is prioritize the high-clocking cores, and you're done. If the c cores are more efficient, you can prioritize c cores when all cores are clocked slower than c_max to save power. Intel has it much harder: they have to have instrumentation in the core to sense the instruction mix, in order for the scheduler to know how threads would have performed if run on another core. Past tense, because they then have to hope that the mix stays the same for long enough.
But the OS doesn't know either what the performance demand of a thread will be so it picks the one with most capacity. I could imagine a power saving mode where only Zen4c cores are enabled to save battery, so no boosting is allowed.
@@maynardburger Incorrect, the E-cores can perform badly given certain instructions. Find the Chips and Cheese investigation into E-core inefficiencies; sometimes P-core only is faster than P+E in multi-threading.
@@RobBCactive I don't think they have a big advantage when the CPU is mostly idling. They can just add multi-thread performance very energy- and cost-efficiently, so the main threads can perform better.
Cache is also less in Zen(number)c cores, which might have a slight impact in ipc since the core has less data immediately accessible at any given point, so it has to reach out to L3 and ram more often.
It's amusing to think back to Dr Cutress's 2020 interview with Michael Clark (Zen lead architect) who grinned a lot when talking about exciting challenges when being asked questions designed to get him to give stuff away.
@@HighYield well the interview made clear that they're starting design at least 5 years ahead, it's why they were stuck with Bulldozer for so long. But Zen4c as a tweak for area might have been in response to cloud demands and taken less, at the time of the interview MC said Zen cores were smaller and scaled better so they believed there was no need for a different core. I think he would have liked to share the idea but obviously was bound by confidentiality.
Current rumours are that Zen 6c will go up to 32 cores on a single chiplet (CCD). That would be some serious multi-threading performance in such a small area.
Likely it's all their C cores, but given it's gonna be on either 3 or 2 nm, it's likely gonna be the only Core type by then, since there isn't much point shrinking Cache below 5/4nm due to them not gaining any performance
@@hinter9907 But AMD already has the solution for that with stacked V-cache. In future core designs they could completely off-load the L3 cache to a separate die (or dies) because as they have already proven with V-cache, you can shrink cache considerably if that is the only microarchitecture on the chiplet.
@@hinter9907 Yeah the 32 Core version of Zen 6c will be the Bergamo successor. If they stick to 12 CCDs then that's 384 cores on a single CPU! It's rumoured to be 2nm but I personally don't think TSMC will have this node ready for AMD early enough or Zen 6c will be late, not sure which. Yeah I would leave cache on 6nm pretty much indefinitely now, it's a much cheaper node plus has loads of capacity.
Using the same architecture also makes software (particularly the scheduler) much simpler, which in turn also gives a performance benefit. On an E-Core/P-Core design, as soon as a process e.g. uses AVX instructions it has to be moved to the P-core (which takes time, messes up the cache, etc.) and the scheduler then has to either pin it there (which might not be efficient) or keep track of how often this happens (in order to decide if moving back to an E-core would be beneficial; which causes additional complexity and uses extra cpu cycles). Most of that simply disappears if all cores have the same capabilities (and the only difference is how high they can boost their clock).
AMD's approach also supports another dimension of differentiation by using chiplets: the manufacturing nodes and design rules can be radically different depending upon the goals. This is where the rumors about Zen 5 and Zen 5c are going: that they use two different manufacturing nodes for further optimizations in clock speed, area and power consumption. Phoenix 2 is unique in that both Zen 4 and 4c are together on the same monolithic die (which makes sense given its target mobile market). Desktop hybrid chips would be far easier for AMD to implement if they felt the need to bring them to market: just one Zen 5 die for high clocks/V-cache support and a Zen 5c die for an increase in core count. AMD has some tremendous design flexibility here for end products. On the software side, it cannot be stressed enough that the differences between Zen 4/Zen 4c and Zen 5/Zen 5c are nonexistent in user code space. (The supervisor instructions are mostly the same, but Zen 4c has a few changes to account for the split CCX on a CCD and higher core counts in general. Clock speed/turbo tables reported to the OS are of course different. All the stuff you'd expect.) Intel's split architecture requires some OS scheduler changes to function properly. Lots of odd bugs have appeared due to the nature of heterogeneous compute: legacy code always expected the same type of processor cores when it first adopted a multithreaded paradigm. These are difficult bugs to replicate, as they involve code running in particular ways across the heterogeneous architecture and require a specific setup to encounter. For example, software splits work units evenly, expecting them all to complete at roughly the same time on the presumption that all the cores are the same. Or a workload calibrates its capabilities based on first running on a P core, then gets context switched onto a much slower E core that can't keep up with initial expectations.
Intel is large enough that they got the necessary scheduler changes implemented into OS kernels to get things mostly working, but there are still plenty of edge cases.
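The "split the work evenly" failure mode can be illustrated with a toy model. This is a sketch under made-up assumptions: core speeds are given in work units per second, there is no scheduling overhead, and the function names `static_even_split` and `dynamic_queue` are hypothetical.

```python
import heapq


def static_even_split(num_units, core_speeds):
    """Finish time when every core is handed the same number of work units."""
    per_core = num_units / len(core_speeds)
    # The slowest core decides when the whole job is done.
    return max(per_core / speed for speed in core_speeds)


def dynamic_queue(num_units, core_speeds):
    """Finish time when each core pulls one unit at a time from a shared queue."""
    # Heap of (time at which the core becomes free, core index).
    free_at = [(0.0, i) for i in range(len(core_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_units):
        start, core = heapq.heappop(free_at)
        done = start + 1.0 / core_speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


# Hypothetical hybrid part: 4 fast cores and 4 cores at half the speed.
speeds = [1.0] * 4 + [0.5] * 4
print("even split:", static_even_split(240, speeds))
print("shared queue:", dynamic_queue(240, speeds))
```

In this model the even split waits on the slow cores while the fast ones sit idle, whereas pulling from a shared queue keeps every core busy until the end. Code written when all cores were identical often used the first pattern.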
Now that you mentioned it, guess zen c cores make sense for next gen consoles. Clocks are always much lower anyway and the area efficiency would be helpful for cost. Actually it is possible console zen 2 cores are already optimized for area. Never thought about it. Cool!
Well, it could go in a console, but they're not designed like this at all. Usually the slim model just gets a die shrink, or they integrate multiple chips into one. But in 2023 they're actually saving more money by staying on the same process node instead of trying to buy some of the 3nm wafers that Apple wants to put in the new iPhone. And saving 35% on just the cores, with no other saving anywhere else in the chip, isn't going to make the console cheaper. It would just be a 10% smaller chip in the best case (they're selling at a loss, so maybe they are willing to do a silent version update to lose $10 less). Though, to your last comment, the Zen 2 they're using isn't anything like a Zen 2c. It would have been so valuable for servers that you could guarantee AMD would have made an EPYC CPU with it 3 years ago already...
nah next gen consoles are much more likely to use a single CCD X3D/VCache cpu. You get slight power efficiency and better gaming performance. This would also mean games will start to focus on the cache a lot more and trickle down to desktop as well and that could be the next paradigm.
@@rudrasingh6354 I sure hope so, it would be incredible, however 3d cache is super expensive and consoles try to stay monolithic thus far (maybe next gen this will need to change? I sure hope so so they can include more silicon without charging 700 usd for a console. Also they could afford to maybe make part of it on a newer node. Let's hope!!). I'm also not sure it would be an issue to use zen c cores with 3d cache if they find a cost conscious way to do it.
While the C-cores make sense for mobile, I think the gaming APUs are more limited by the area and power draw of the GPU (and possibly memory bandwidth). AMD needs to find similar optimizations for the GPU. Edit: as a separate note, the video could mention how modern operating system schedulers already focus on the best-rated cores for foreground applications, and indeed multicore workloads push the clocks of all cores down, especially on mobile.
@@yusuffirdaus3900 You should look at Intel cores, they are like twice the size of Zen cores, at roughly the same, or marginally higher IPC. Regular Zen cores are already comparatively efficient for the space. So Zen4c cores aren't that much bigger than a E core, but obviously much faster.
Actually the RTL is part of the design tools intermediate format. It’s a step in the synthesis process, the place and route step will apply the “design rules” for the fab process.
@@HighYield RTL indeed can sometimes be customized for a specific node. There are many iterations of synthesis, placement and RTL updates to achieve the target power/area/speed requirements.
@@HighYield Junior ASIC design engineer here. The RTL design is basically the functional description of the chip written in an HDL (Verilog, VHDL or SystemVerilog). The synthesis tool takes as input the RTL design and the node/process-specific library, i.e. how exactly a logical function in the RTL is translated to a set of logic gates, and the physical characteristics of this set of gates (length/width, electrical resistance, etc.) in the chosen fab process. The output produced is something called a netlist, which is basically the list of all the logic gates and how they are connected to one another in order to have the same logical behavior as the RTL. Then the netlist is placed and routed in the physical space to create the final die layout and the photomasks used by the fab.
@@HighYield Hey, FPGA developer here. RTL is the first step of design entry after the ISA is defined and the high-level block designs and requirements are agreed upon. RTL is the design written in a Hardware Description Language, HDL (VHDL, Verilog or SystemVerilog, or a combination of all three). The HDL is usually behavioral, written in a way that looks like software but with some differences (mostly that the described behavior is highly parallel). After the RTL is written in an HDL, it can be simulated at the behavioral level for a large portion of the verification steps. In an ASIC design flow, RTL goes into synthesis to produce a synthesized netlist. Synopsys Design Compiler is one of the tools available for this. The synthesized design can optionally be simulated at this time. Once you have a synthesized netlist of the "generic" logic gates and flip-flops, then comes technology mapping; this is where the PDK starts to be used. The "generic" logic that has been synthesized is mapped to pre-designed physical Lego blocks that can go onto the ASIC. Once technology mapping is complete, along with constraints (IO, clock, timing, etc.), the chip can be floor-planned before the tools are given the opportunity to do place and route within the floor-planning rules. You can think of it as if a designer wrote down where the walls in a house will go, but the automated tools place all your furniture according to rules like, "I want my TV within 2 meters of my couch, and it must be facing it". You touch on this accurately in the video: a highly floor-planned chip can sometimes lead to a less optimal design because you gave the automated tools less freedom; however, it can sometimes give a more optimal design because you restrict the scope of what the automated algorithms are trying to do.
Note that a segmented and floor-planned design not only creates placement and routing constraints; it usually means the synthesis tool, knowing that later stages have more constraints, might also be forced to perform fewer optimizations, making for a larger synthesized netlist. After placement and routing, there will often be some timing violations. Some of these can be fixed manually, or the design can be iterated on and the tools given another try on a design with some changes. The placed and routed design can also optionally be simulated, but it will be much more time consuming compared to an RTL behavioral simulation. In all the steps, emulation can be substituted for simulation: the cost of the emulation hardware is higher and the setup effort for deploying a design onto emulation hardware is higher, but the execution speed of the emulated design is higher than simulation.
Fuck, you made me remember my Semiconductor design courses....goddamn Cadence tools... absolutely sucked... made me lose my interest in the subject area, knowing that I would have to work with them.
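As a very rough illustration of the technology-mapping step described above, the toy sketch below maps one generic netlist onto two different cell libraries. Everything in it is hypothetical: the half-adder netlist, the library names, the cell names and the area figures are invented for the example, and real synthesis tools do far more than a table lookup.

```python
# Generic (technology-independent) netlist for a half adder.
# Each entry is (generic gate, output net, input nets).
HALF_ADDER = [
    ("XOR2", "sum", ("a", "b")),
    ("AND2", "carry", ("a", "b")),
]

# Two hypothetical standard-cell libraries: generic gate -> (cell name, area in um^2).
LIBRARIES = {
    "high_speed": {"XOR2": ("XOR2_HS", 3.0), "AND2": ("AND2_HS", 2.0)},
    "high_density": {"XOR2": ("XOR2_HD", 2.0), "AND2": ("AND2_HD", 1.25)},
}


def technology_map(netlist, library):
    """Swap each generic gate for a library cell and add up the cell area."""
    mapped = []
    total_area = 0.0
    for gate, out, ins in netlist:
        cell, area = library[gate]
        mapped.append((cell, out, ins))
        total_area += area
    return mapped, total_area


def simulate(netlist, inputs):
    """Evaluate the generic netlist; the logic is the same whichever library is used."""
    ops = {"XOR2": lambda x, y: x ^ y, "AND2": lambda x, y: x & y}
    nets = dict(inputs)
    for gate, out, ins in netlist:
        nets[out] = ops[gate](*(nets[n] for n in ins))
    return nets


for name, lib in LIBRARIES.items():
    cells, area = technology_map(HALF_ADDER, lib)
    print(name, cells, area)
print(simulate(HALF_ADDER, {"a": 1, "b": 1}))
```

The point of the sketch is only that the same logical description can end up as different physical cells with a different total area, which is the general idea behind building two physical implementations from one RTL design.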
It's still more similar to engineering two cores. The advantage is the ease of integration with current software. AMD just have to assign normal cores as priority cores and they are done. Compared to Intel who needed Windows 11 scheduling to take proper advantage of their P and E core CPUs.
For the record, from my initial skimming of the news, I had assumed more or less that Zen 5c are literally just Zen 5 cores with part of the cache cut off with a handsaw. I knew that the benefit came in smaller power consumption, but I had no idea there were actually smart engineering reasons for doing it this way
I've been thinking about Intel's big little implementation a lot lately and have finally just come around to it. I guess wider programs won't necessarily need more fast cores if it can already get a lot of smaller ones. Then AMD goes and Feng Shui's their regular cores to get the same effect while dodging scheduler worries and I'm blown away all over again.
Great video. This strategy seems to make sense for servers. For my laptop, I would personally love to have a hybrid processor design with lower power cores for simple workloads, good support in an intelligent cpu scheduler, and a couple fast cores for occasional cpu-intensive workloads. I haven’t followed cpu design much in the past few years, and this is the first time I am learning about hybrid and asymmetric core designs. Interesting stuff. It’s cool to see how cpu designers are innovating as we reach the limits of increasing clock speeds and shrinking transistors.
AMD is on to something here. It will be very interesting to see how they perform. Area-optimised cores would be potent in handhelds, tablets and laptops: that's a desktop-class core but with power efficiency per IPC, not just lower clocks & voltages.
Nice video. Excellent presentation. I recently obtained a MinisForum Phoenix APU UM790 Pro 7940HS mini PC and am impressed with the small size, great capability and low power envelope. The Phoenix2 will have lots of applications, especially for mobile and notebooks. An all-4c chip could be very useful in the data center for CPU performance per BTU. Many mobile chips are the desktop variant running at much lower clocks to extend battery life and/or reduce heat.
They won't do that... It sounds amazing on paper, but how many people would actually buy that? Also it would perform poorly. There's a memory bandwidth requirement to feed cores. It would probably require a Threadripper socket. There's a lot of discussion of whether it's coming to the chiplet desktop CPUs at all, in a hybrid package with a regular chiplet and a c chiplet... And it seems like it won't at all... That's why we've only seen them in EPYC, because they make so much sense for servers, and in laptops, where size and power efficiency are king.
@@SquintyGears Intel is pushing e-core counts higher already with 14th gen and they're already at 8+16 now, so AMD is going to have to do something if they want to stay in the upper end of multi threaded performance on desktop. I have a 5950X myself (software developer) and in my workloads it makes 1-2% difference whether I run the memory at 3200 or 3600. For really "crunch heavy" stuff like all AVX workloads it makes even less. It's practically only branch heavy code, and code that does very little work to very large amounts of data (like database lookups) where memory bandwidth is a problem. Games, however, tend to fall in this category. Threadripper (non-Pro) went up to 64 cores on a quad channel memory setup on DDR4 3200. I can't see any realistic scenario where 8 full and 16 compact cores can't perfectly well run on dual channel DDR5 6000.
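The comparison in this comment can be put into rough numbers. The sketch below computes theoretical peak bandwidth only, assuming a 64-bit (8-byte) data path per channel; real sustained bandwidth is lower and latency is ignored entirely. The function names are made up, and the 24-core AM5 part is the hypothetical 8+16 configuration from the comment, not a real product.

```python
def peak_bandwidth_gb_s(mt_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak in GB/s: transfers per second * bytes per transfer * channels."""
    return mt_per_s * 1_000_000 * bytes_per_transfer * channels / 1e9


def per_core_gb_s(mt_per_s, channels, cores):
    """Peak bandwidth divided evenly across all cores."""
    return peak_bandwidth_gb_s(mt_per_s, channels) / cores


# 64-core Threadripper on quad-channel DDR4-3200.
threadripper = per_core_gb_s(3200, channels=4, cores=64)
# Hypothetical 8 full + 16 compact cores on dual-channel DDR5-6000.
hypothetical_am5 = per_core_gb_s(6000, channels=2, cores=24)

print(f"Threadripper: {threadripper:.2f} GB/s per core")
print(f"Hypothetical 24-core AM5: {hypothetical_am5:.2f} GB/s per core")
```

On paper the hypothetical 24-core part has more peak bandwidth per core than the 64-core Threadripper had, which is the gist of the argument; whether latency becomes the limit instead is a separate question that this arithmetic does not answer.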
@@andersjjensen We haven't seen the numbers for those Intel chips yet... Yes, obviously quad channel has been proven in the past. The main difference you're seeing from 3200 to 3600 is the Infinity Fabric clock difference, because latency to memory doesn't scale along with the raw bandwidth difference of that clock bump, unless you took the absurd amount of time it takes to tune the timings for it. I don't know anybody who has; I've tried and given up.... Also, importantly, Intel's 8+16 wouldn't take as much bandwidth to feed as a Zen + Zen c part of the same count, because the slow Intel cores are much slower than their counterparts. I'm not saying I know exactly where the breaking point is where the performance gains fall off a cliff. But I'm pretty sure we're not that far from it. With some quick napkin math, with latency that hasn't changed and bus width that hasn't changed but core counts having exploded, you can understand why some people wanting to see 48 threads on AM5 seems a little bit far-fetched... And that's if they would even make such a configuration. I'll remind you that we've only seen leaks for EPYC chiplet parts and fully integrated laptop parts. Both are integrations where it brings a very obvious advantage. What's the point for desktop? (assuming they do finally release an update for Threadripper)
@@SquintyGears You do realize that memory bandwidth has increased since they put 16 cores on AM4, right? 24 cores only needs a third more bandwidth, even if you start from the presumption that the first 16-core chips were bandwidth bottlenecked. On server they do 8 cores per memory channel for Zen 4 and 16 for Zen 4c. A 24-core part needs 2 channels following the server guidelines. AM5 has 2 channels...
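The channel arithmetic in this comment, written out. The cores-per-channel ratios are taken as stated in the comment and are not verified here; the helper name `channels_needed` and the 8+16 configuration are illustrative.

```python
from fractions import Fraction
import math

# Cores per memory channel, as quoted in the comment (assumed, not verified).
CORES_PER_CHANNEL = {"zen4": 8, "zen4c": 16}


def channels_needed(core_counts):
    """Memory channels needed to match the quoted server core-to-channel ratios."""
    total = sum(Fraction(n, CORES_PER_CHANNEL[kind]) for kind, n in core_counts.items())
    return math.ceil(total)


# Hypothetical desktop part with 8 full cores and 16 compact cores.
print(channels_needed({"zen4": 8, "zen4c": 16}))
```

Under those ratios, 8 full cores account for one channel and 16 compact cores for another, which is where the two-channel figure comes from.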
One big bottleneck for the smaller c cores would be the power density. The heat per mm^2 is a huge bottleneck, so clock speeds are also reduced in that regard.
But they don't boost as high, so they require lower voltage, and power scales with the square of voltage but only linearly with frequency. Boosting deliberately pushes the core into an inefficient part of the performance/power curve. At lower TDP per core the c cores become more efficient, while not hitting the higher frequency that standard Zen reaches at higher TDP. In a likely 15W U-series laptop the power density is way lower than on desktop, where a single core can easily exceed 15W. Thermals constrain boosting, as 10C costs about 100MHz of max frequency, so on modern chips it's not really a huge bottleneck in general. The Bergamo servers with a large number of CCDs have a large socket; it's more the overall TDP that makes power-efficient frequency a priority, as high utilisation is the normal case.
@@RobBCactive There's an optimal point of performance per watt for any chip design. Heat per mm^2 is still going to be a limit even when running chips at that optimal point.
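The voltage point can be made concrete with the usual first-order model for dynamic power, P ≈ C · V² · f. The sketch below uses that model only; it ignores leakage, and the voltage and frequency ratios are invented for illustration, not measured values for any Zen core.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f (same C assumed)."""
    return voltage_ratio ** 2 * frequency_ratio


def relative_perf_per_watt(voltage_ratio, frequency_ratio):
    """Performance per watt relative to baseline, assuming performance scales with f."""
    return frequency_ratio / relative_dynamic_power(voltage_ratio, frequency_ratio)


# Hypothetical operating point: 70% of the clock at 80% of the voltage.
power = relative_dynamic_power(0.8, 0.7)
efficiency = relative_perf_per_watt(0.8, 0.7)
print(f"power: {power:.3f}x, perf/W: {efficiency:.3f}x")
```

In this model, giving up 30% of the clock while dropping voltage by 20% cuts dynamic power by more than half, so performance per watt goes up even though peak performance goes down.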
My immediate first thought was of the Steam Deck 2 as well. A single or maybe two big cores for those primarily single-threaded games, with 4 smaller cores for heavily threaded games. I've found that a similar layout works well for desktop use on Intel's designs, with 4-6 cores that you can schedule for gaming being optimal for an enjoyable experience. I do believe that the 2 GPU workgroups in Phoenix 2 are anemic for the Deck, as graphics performance would require an upgrade, but that's something they could adjust in a custom design for Valve. Back to the CPU architecture though: I absolutely love having more cores than I would think I need. My main gripe with Intel's design is the limited use cases I'd have due to how they're scheduled. A "space optimized" core would suit my wishes perfectly, and the full-fat cores being there for when you need them is the icing on the cake.
The problem is the core configuration. My i7-12700K is a mistake by Intel, so they won't make a similar chip for a very long time. Good for me, because the performance per watt is hilarious, it's way too high. Watching 4K YouTube videos while the CPU stays at 8 watts and 28 degrees in summer with air cooling is really something else...
Qualcomm is catching up. Just like how Linux gaming is getting better after steam deck. I hope they bring out a smaller handheld with Qualcomm or MediaTek .
@worthlessguy7477 problem is, ARM doesn't support dx12. So all you're gonna get is phone games on that handheld even if it existed. We want a handheld that plays real games, not android apps.
For an ultra-portable laptop I think Little Phoenix makes perfect sense, provided that the compact cores clock reasonably well (say 60-70% of the full-fat cores) when plugged in. And since the process schedulers in both Linux and Windows are biased towards the highest-clocking cores, single-threaded and lightly threaded apps will always "run right" without the OS and apps needing modification. For standard desktop, compact cores make little sense, but for "the layman's workstation" I think a 9950X3D with 8 full-fat + V-Cache and 16 compact cores would be very desirable to many. Especially now that AMD has discontinued Threadripper non-Pro.
You really need tight coupling for thread scheduling. On the M1/M2 platform, it works extremely well. The GUI is smooth and only pegs the E cores in regular usage.
AMD has the capability with Zen to scale in any direction it wants. That was the whole goal of Zen from the beginning. It needed to be an architecture that, in the long run, can do everything and anything, and use innovative packaging techniques. Zen4c is great proof of what can be done, and in the future there could easily be 3 or 4 different versions of Zen cores, tuned to what is needed. More cores at lower clocks never made sense, at least to me, because as an average consumer, quick bursts of high single-core usage are what I would require.
Zen APUs for high-performance handhelds seem to be a market AMD has all to themselves, and leveraging their existing work seems like what they'd do. 5c probably won't be a big change from 4c beyond what they can do to improve Epyc and Threadripper. Lower power and heat is always good though. I would expect lots of c cores to be used in next-gen console APUs, and by that time we might be up to 6c.
Hey dude one issue with what you said and it needs clarification. About 5:30 you're talking about Zen 4 vs. Zen 4C. You say there is no difference in IPC. That's ONLY true without regard to L3. If you are going to take into account L3 and the different workloads that benefit from having more L3 cache, so large data sets and for PC that's often games, but other scientific or engineering workloads where you might do modeling and need access to large amounts of data, then there is a difference in IPC. As soon as cache values change, you have affected IPC in some workloads. IPC can't be talked about as if it's a fixed thing because it's not. It's workload dependent. In THIS case with the chip you are showing, because there is a shared L3 then there should be no difference in IPC between Zen 4 and 4C cores, but this is not always true. In general Zen 4C cores have half the L3 cache, so if you compared a CPU with Zen 4C cores only vs. a CPU with Zen 4 cores, there would be a difference in IPC.
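The point that IPC depends on cache and workload can be shown with a textbook-style stall model: cycles per instruction = base CPI + (L3 misses per instruction × miss penalty). This is a simplified sketch; the function name and every number in it are illustrative assumptions, not measurements of Zen 4 or Zen 4c.

```python
def effective_ipc(base_cpi, l3_mpki, miss_penalty_cycles):
    """IPC after adding memory stall cycles to the core's base CPI.

    l3_mpki is L3 misses per 1000 instructions; each miss is assumed to
    stall the core for miss_penalty_cycles (no overlap is modelled).
    """
    stall_cpi = l3_mpki / 1000 * miss_penalty_cycles
    return 1 / (base_cpi + stall_cpi)


# Same core design (same base CPI), hypothetical cache-sensitive workload.
full_l3 = effective_ipc(base_cpi=0.5, l3_mpki=2.0, miss_penalty_cycles=250)
half_l3 = effective_ipc(base_cpi=0.5, l3_mpki=4.0, miss_penalty_cycles=250)
# Workload that fits in cache either way: misses are negligible in both cases.
small_set = effective_ipc(base_cpi=0.5, l3_mpki=0.0, miss_penalty_cycles=250)

print(f"full L3: {full_l3:.2f}, half L3: {half_l3:.2f}, cache-resident: {small_set:.2f}")
```

With identical cores, the measured IPC only diverges when the smaller L3 actually raises the miss rate for the workload in question; with a shared L3, as in the chip discussed in the video, the miss rate and therefore the IPC would be the same.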
Zen4c looks like the prelude to AMD producing an entirely new range of massively threaded enterprise CPUs. They saw the ARM-based servers and went "We can do that, but with x86 instruction sets."
I imagine the Zen 4c cores can compute the same instruction sets as Zen4, Unlike Intel E cores which are missing some instruction sets the P cores can execute. AMD with their smaller silicon footprint along with chiplets can likely build CPUs relatively cost effectively now, even with TSMC charging more to produce them. On the flipside Intel has the advantage of creating their own chips in their own fabs.
Yea, the identical instruction sets are an interesting choice. It’s a huge win for the server market and opens up possibilities for all-efficient-core designs, but may make less sense for some consumer applications (a lot of background processes may not actually need stuff like AVX-512, so that may leave big chunks of the core sitting idle). I’m not sure how much Intel running their own fabs is truly an advantage right now, given how much they’ve struggled to get EUV online and move to their next node. Relying on TSMC may mean AMD is sharing some of their profits, but it also means they’re always able to access the latest, fastest process node. Shedding that risk removes a lot of uncertainty.
I really like that they've started building them like this. I have a sneaking suspicion that IPC will increase due to lower latency between the different parts of the core.
I very much see the 'dense' versions of Zen being used in games consoles, handhelds, thin & light laptops, NUC-like devices, etc. Basically anything where 'power to performance' is king rather than just outright performance. Personally I don't want hybrid chips coming to desktop though, or if they do, I want to have all options, i.e. only full large Zen chips, or Zen + Zen C chips, or only Zen C chips. But I've had and am still having a lot of issues with Intel's big/little designs, and so my next system is AMD because I only want big cores for my desktop.
With Zen Dense being the same IPC but just lower clocks it doesn't mess with stuff in the way that intel E cores do. Apps and windows will also know much better which core is faster, since Windows scheduler looks at clocks only to determine the fastest core
@@shepardpolska I just don't know why Windows/Intel is handling things so badly though, like if I don't have that app 'on the top' (ie if I minimise it or have something else open over the top of it) then most of the time things get moved to the E-Cores and leave the P cores empty but these are full on performance requiring applications that were maxing out the P cores but then just almost stop if I don't watch them work!!
By definition Windows handles things badly, that's its entire purpose xD But yeah hopefully we'll see some variation assuming they continue on the same path, already today we are seeing more options with each generation
Gaming needs high-performance cores though. We can see that the Z1 Extreme can clock up to 5.1 GHz. I think AMD will need at least 2 Zen5 cores, then maybe 6 Zen5c cores for the next Steam Deck APU. Or if they want to keep it at lower clocks with all Zen5c cores, they'll probably have to add X3D cache to it to boost the gaming performance back up to an acceptable level. Since X3D cache will inhibit cooling, that might actually work well to complement the slower-clocking cores.
Wonderful breakdown and discussion, I think about this everytime I see Intel's lineup and the whole instruction-set compatibility rework that's underway with AVX10.x. I prefer AMDs design and approach for that reason alone. I respect Big.Little when they go hard for like Apple's designs and just breathe great products that perform and last. Zen4C sounds exciting for all products
Very enlightening video indeed. I’d say what AMD is doing is a halfway house solution towards a ‘little’ core, but one that’s perfect for the sort of workload that it’s targeting. Easy workloads that are highly parallel would work better on those little cores, as they can easily scale by core count. But if the workload has less inherent parallelism, then the Zen4C would be perfect for it. Not to mention the ease of workload scheduling here. Intel’s hybrid design requires dedicated hardware in the form of the Thread Director, while this solution from AMD doesn’t.
AMD has confirmed that Zen 5 will have "Integrated AI and Machine Learning optimizations" which to me says an expansion on AVX512 support. But since they are going hybrid, I wonder if the new AVX10 implementation Intel has come up with might be at play. Since AVX10 allows for different register widths instead of forcing 512 bit it becomes a good fit for little cores without hampering big ones
Actually AVX10 is presented as a common standard not limited to intel. But intel is rolling it out inconsistently. I don't think it's what they where talking about though. I think they meant an soc style approach like apple does with their neural engine. It's an accelerator block that would be part of the IO die like the gpu section. AVX512/10 is meant to accelerate the model training and scientific research tasks. It's vectorizing 512 bits of data through a single instruction instead of the 64bit that it can do regularly. This speeds up processing a whole matrix of doing the same operation on every cell. But for inference and using prebuilt models day to day its not what's most effective.
Great for mobile, but for desktop compute Zen 4c doesn't have the same appeal as more big Zen cores. My suspicion is that at 100% load for longer periods, a big Zen undervolt reaches greater (or at least a wider) range of efficiency targets. AMD must have realized this same flexibility pretty early on, because they literally compressed Zen here until it reached the same thermal limit at a much lower, and presumably the most efficient, point on the power curve. This is great for devices that run a submaximal load, like most office or mobile devices, or GPU-limited devices. Though I'm not sure that Zen 4c won't adversely impact "GPU busy" on even the most constrained devices, like handheld gaming, because today the emphasis is more on high refresh rates than fidelity. Time will tell. Frankly the most amazing thing about this is how consistently disruptive AMD's design choices for Zen in the server or supercompute space are for the consumer market. They keep hitting multiple targets with the same stone.
@HighYield A point I have not seen mentioned is using advanced process nodes. A halved L3$ with 3 cores for 2 allows a doubling of cores in almost the same area, meaning retaining @High Yield ! For AMD that might make 3nm more economic, gaining efficiency in even smaller area without loss of clock speed challenges while power density is not problematic even all core. OTOH the higher performance less dense 4nm might offer higher max clocks and boost potential on desktop. Cost per core in high utilisation cloud server, the effective cost/performance of Zen5c might be compelling.
@bulletpunch9317 Those separate Arm cores have fewer instructions than the big core, which means lower performance, but PCs are different because PCs need more performance, so using the same architecture could increase power, but lowering die size to cram more cores into the same package would give us some extra performance at the same power usage as the non-hybrid CPU.
@@aviatedviewssound4798 I'm talking about the prime and medium cores, not the big and little cores. The prime core on Android flagships is like the standard PC 'big' core. The medium cores are like the 'small' cores. The little cores on Android are too weak for PC, they're like the old Atoms. Qualcomm's latest laptop SoC only has the prime and medium cores, no little core. As for Apple, their little cores perform similar to Android's medium cores.
I still love the simplicity of the Intel 9700K. 8 fast cores with no hyperthreading. A gaming value masterpiece we may never see again. With hyperthreading and different speed cores you can run into situations where the high priority processing takes a back seat on a slower core or competes with another thread for the same core's resources. It doesn't always happen, but with the 9700K you take no chances and get a discount for doing so.
I love random comments like this from one computing perspective or another, I don't game but I do a lot of virtualization and one ancient arch that held its own for many years was Haswell, 10 years old now.. just pulled my last one out of active service this week
Unfortunately everything you described happens on any multicore CPU. You can still have the OS assign threads suboptimally and multiple processes compete for the same resource. Hyperthreading doesn't really make a difference for that, and the new slow and fast cores just add a bit more complexity to how thread migrations between cores get prioritized and handled. The best way to dodge these issues is to manually assign a higher priority to the program you really care about in Task Manager. Because then the Windows scheduler will never kick it behind something else.
@@SquintyGears Well what about a game's own threads competing with each other, thus reducing the performance of the thread that creates the main bottleneck? The rendering thread or main thread for example. This was the kind of case I had in mind when I wrote my comment and it is evidenced by many hyperthreading on vs off benchmarks. In these cases the application priority was identical and hyperthreading was tested on and off and showed a difference. I understand in an ideal world the game engine itself would handle this properly with the scheduler, but real world data shows that it does not always. Setting the application priority addresses only part of the problem.
@@maxfmfdm that's just not really how that works... Within a single program like a game, there are tons of conditions for different parts of the logic to operate in parallel or in sequence. But since it all came from the same program you can generally say they play nice with each other. The problems come when other processes that have to run too get run in between game tasks and end up making the game wait for a bit more than it expected to. BUT this only shows up as a problem when you're running out of computational headroom. If you're not limited, doing anything fancy like changing hidden settings will net you 1% or 2% improvements, and that's often within run to run variance of just background tasks randomly starting or not (because Windows does a lot of things)... It is true that the original Hyper-Threading implementation in the first and second gen Core i series chips showed significant differences in a non-negligible proportion of software. It was both a programmer and Intel problem. And games were one of the most affected types of software. But that's not true anymore... Programmers don't use the same threading libraries, and the scheduler logic for what gets priority and where it gets assigned has gotten more clever. Same thing with the way HT runs the 2 threads on the same core, the architectures now are much better at occupying the empty cycles. Bottom line... The 9700K is a beast because it's the first regular desktop 8 core Intel made, so it was head and shoulders above its predecessor. The reason it doesn't have HT was only market segmentation, because they introduced the i9 (which is a marketing construct they did to increase the potential selling price, i7 is still $300 but you could start spending $500 now). HT generally only brings you 40% extra performance in ideal conditions, but it hasn't been a source of loss in a very long time. People test it again on every new platform.
@@SquintyGears um nope... just because they are all part of the same program does NOT mean they play nice together in regards to these issues. I'm sorry but your first comment was a bit off but this claim is WAY off. I worked as a software engineer for a game developer and your claims are just outrageous at this point.
The more intricate the design is when it comes to big.LITTLE (heterogeneous) design, the slower it will be. Homogeneous design will always be faster because there is no latency/waiting time wasted for the system to decide where to put the workload, and if the workload gets assigned to the lower performing cores you lose perf.
The interesting question is why does AMD include regular Zen 4 cores on P2? With server CPUs it is clear that they do not need the regular cores in the first place. I get that it's probably for ST/MT workload optimization but shouldn't AMD prioritize for a smaller area? This also challenges the narrative that what AMD does is "big little". They don't have to go together.
Zen 4c cannot go much beyond 3 GHz. This makes them great as efficient cores, but peak single threaded speed would be a lot worse without the "big" cores.
@@b127_1 I've heard it can go past 3.2 GHz just fine, though with a higher voltage than Zen 4. I get that you can be up to 25% slower on Zen 4c w/ standard configs. Though, I reckon with proper optimizations the gap can be narrowed down to 15%. For mid range laptops AMD can still get away with not using Zen 4 (or using just one). After all, you are dealing with 15 W or less TDP. It's more flexible than other solutions at the cost of less area savings. So I still don't think that calling this approach "big little" is the right one.
@@b127_1 zen4 cores are already super efficient. I turned off boosting, limited to max 2.7 GHz, and my 6850U is absolutely nailing every task I throw at it, with about 7-8h of normal usage (over 12h in "just reading and writing text" mode). All at 40 °C with no fan, unless playing a video or using an external monitor. So I personally wouldn't mind all cores being 4c.
Ryzen 8950XE with 8 Zen5 cores and 16 Zen5c cores would be rad. Let's say Zen 5 is 17% IPC increase and 3% clock speed increase over Zen 4, and a Zen 5c core is only 15% slower than a Zen 5 core. That would make a Ryzen 8950XE ~60% faster in multi-core workloads than 7950x. Or measured in Cinebench R23 score, that would be ~61,300 points vs ~38,500 points for 7950x.
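For anyone who wants to check that estimate, here is a rough sketch of the arithmetic. It assumes perfectly linear multi-core scaling, and every input (17% IPC, 3% clock, a 15% slower Zen 5c core, ~38,500 R23 points for the 7950X) is the guess from the comment above, not a measurement. The function and constant names are made up for illustration.

```python
# Naive linear-scaling estimate. All inputs are the commenter's guesses,
# not measured results, and the hypothetical 8950XE does not exist.
ZEN5_IPC_GAIN = 1.17      # assumed IPC uplift of Zen 5 over Zen 4
ZEN5_CLOCK_GAIN = 1.03    # assumed clock uplift of Zen 5 over Zen 4
ZEN5C_RELATIVE = 0.85     # assumed Zen 5c throughput relative to Zen 5
R23_7950X = 38_500        # approximate Cinebench R23 score quoted above


def relative_throughput(big_cores: int, dense_cores: int) -> float:
    """Throughput in 'Zen 4 core equivalents', assuming linear scaling."""
    zen5 = ZEN5_IPC_GAIN * ZEN5_CLOCK_GAIN
    zen5c = zen5 * ZEN5C_RELATIVE
    return big_cores * zen5 + dense_cores * zen5c


baseline = 16.0  # the 7950X has 16 Zen 4 cores
speedup = relative_throughput(8, 16) / baseline
estimated_score = R23_7950X * speedup
print(f"{speedup:.2f}x, ~{estimated_score:,.0f} points")
```

Under these assumptions the model lands at roughly 1.63x, a bit above the ~60% in the comment, which makes sense since real chips rarely scale perfectly with core count.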
The idea is the MT cores aren't designed for the boost frequency but can operate at the power efficient frequency limit. So under heavy load those Zen4c cores might be no slower, as the system operates at a base frequency. The Zen4 cores have to dial frequency back to share the power/thermal budget.
But why is this so important? What is the 'consumer platform' type of user doing where simply having 16 full speed big cores wouldn't still be more than enough for multithread? I had the same issue with Raptor Lake. Sure, it's technically added a fair chunk of multithread performance, but few people buying those are actually taking meaningful advantage of that versus just having 8P + 8E.
@@maynardburger Some productivity users use desktop rigs and like Raptor Lake plus the hardware Quick Sync acceleration. Many users think moooaaaaahhhrrrrrrrrrr cores are better, so they pick Intel despite the problems cooling a lot of watts for little performance gain.
Or, iuno, they can just use 16 Zen 5 cores so people aren't getting gimped cores on half of their CPU. The big little thing only works if it leads to *more* threads, the same amount with less power on half the cores would be devastating.
@@joemarais7683 The Zen 5c cores aren't really gimped. They have multi-threading and support all of the instructions that the full Zen 5 cores support. They are just clocked a bit lower. Besides, you can fit double the amount (16 cores) on one chiplet instead of 8. So I don't see how it's going to be devastating. If you are talking about games, are there any that really need more than 8 very fast cores? Either way, with Ryzen gaming is generally done on one chiplet to avoid the cross chiplet communication latency. So it doesn't really matter if the 2nd chiplet has Zen 5 or Zen 5c cores since you'll always want to keep the game running on the 1st chiplet.
Great analysis. Definitely made me more interested in Phoenix 2 and the upcoming Zen5c cores. While I agree that the "Phoenix Little" nomenclature would perhaps have been more accurate than Phoenix 2, it's still a pretty fascinating chip.
Wendell from Level1Techs had an interesting idea of having an 8 core design combining both Zen 4C and Zen 4 cores. Because you save some power consumption on the Zen 4C cores, you can budget the rest of the power towards the Zen 4 cores for much higher clock speeds.
Another advantage is since the IP is the same between the core types the only meaningful difference is the clock speeds. So thread scheduling is not going to be too much of a problem since Windows already assigns most applications via clock speeds anyway.
@@kingkrrrraaaaaaaaaaaaaaaaa4527 hopefully it will work as easily as that, unlike Intel, which still doesn't work and just assigns random cores when idling. I just checked and the only cores that are not in use on my laptop are the E-cores. You would think Windows would run on the E-cores by default by now.
@@besher532 I heavily suspect this is a pure governor setting issue (performance v/s efficiency mode). ARM solved the scheduler part a long time ago on Linux so Intel+Windows probably solved that as well.
This is definitely the superior way to do hybrid arch vs the way Intel is doing it. I've shied away from Intel 12th and 13th gen CPUs with big and little cores because I simply don't trust the scheduler, no matter what they say about "thread director". As far as gaming goes, the scheduler should have no trouble with AMD's hybrid arch because the "world thread", so to speak, would always be scheduled on one of your higher clocked cores, but the lower clocked cores, having the same IPC, would still be useful for gaming. I think I'll hold my 5900X until AMD starts dropping these hybrids on consumer desktop. Also, by the time AMD brings this to consumer desktop, hopefully they've further optimized their Infinity Fabric to work with high end DDR5 in 1:1 mode to reach monolithic latency levels as Intel does.
So curious as to the efficacy and efficiency of high density next gen, such as Zen5 implementations. Hoping that they make high fidelity, high resolution capabilities much more accessible globally.
The same cores are still being used as big/medium cores in multiple Arm designs from Qualcomm or MediaTek. Like the SD855/865 had one high clocked A76/A77 core and three lower clocked A76/A77 cores. The higher clocked core is physically larger too.
I'm curious how this Zen4c design compares to, for example, Intel's E-cores in terms of performance per watt. The whole point of big.LITTLE was that minor and background tasks could be executed on the smaller and less complex core instead of spinning up a whole "proper" core, thus saving power. But if AMD's c-cores are just "normal core but more compact", would that even end up saving any power besides the slight difference due to clock speed differences? So far it just looks great for the datacenter world where you can pack a bunch more c-cores onto a die than normal ones. But what about laptops and other varying power devices that strongly benefit from proper big.LITTLE?
We're still here waiting for software/scheduler to utilize our current cores.. I am not sure how more cores would have any performance gains, although the efficiency is there. Honestly surprised this channel doesn't have 20x the subs, some great content here.
If it's AMD you're referring to, then that's really only a thing for games and dual CCD CPUs. The bulk of how it works now would work even better with C cores. Single and lightly threaded tasks just go to the highest clocking cores. Games just happen to be a special use case for X3D where you typically want to reverse that when you have an X3D chiplet. You'll hardly ever run into any other consumer workload where you'd actually want/need to pick the cache cores. Most of what else could potentially benefit from the extra cache would just as likely be capable of using all threads on the CPU. That's where more threads provide more performance. Either highly threaded tasks that can max the CPU, or multi-tasking with a collection of light to mildly intensive tasks. Wanting to specifically default to the C cores would be mostly a concern for mobile, which is really a matter of having the OS switch behaviors when on battery. There are game engines capable of using 12-16 threads, but they are currently scaling back to not have the game peg the CPU at 100%. The 2.0 CP2077 update, for example, will use 50% of a 16c/32t CPU, but 80% of an 8c/16t one. I've seen Star Citizen using as many as 19 threads though. If it starts becoming more common for games to push past 16 threads, Intel's E cores in current CPUs will be worse off.
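To put those CP2077 percentages side by side, a quick back-of-the-envelope conversion into "threads' worth of work" helps. This is only a sketch: it assumes the utilization figure is an average across all logical processors and that SMT gives 2 threads per core. The helper name is made up for illustration.

```python
def busy_threads(cores: int, utilization: float, smt: int = 2) -> float:
    """Average number of fully busy logical processors.

    Assumes utilization is averaged over every logical processor,
    which is how an overall CPU percentage is usually read.
    """
    return cores * smt * utilization


on_16c = busy_threads(16, 0.50)  # 16c/32t CPU at 50%
on_8c = busy_threads(8, 0.80)    # 8c/16t CPU at 80%
print(on_16c, on_8c)
```

So on those numbers the game keeps about 16 threads busy on the bigger chip versus about 12.8 on the smaller one, which fits the point that the engine scales with the threads it is given without pegging the CPU.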
I still prefer the simplicity of more cores working in parallel with high clock speeds and multithreading over hybrid designs, since I don't have to worry about not having enough cores, or whether DRM may crash my game because the tasks get moved between cores.
Should it not be possible to start optimizing the sub parts of the architecture for vanilla and dense in the future? For example giving the vanilla core a more elaborate branch prediction or a bigger uop cache. So it would still be the same architecture but with a tweak or two besides clockspeed, cache and density. Might be a cost efficient way to optimize even better into two directions
As soon as the architecture is changed, you're working with different cores. Making a design more compact is vastly different than making wholly different functionality.
I think the most difference we'd see in architecture in a single AMD APU chip is using Zen 5 with Zen 4c+ (rescaled on a smaller node)-- basically having a slight size-area advantage while reusing an existing design. So basically Zen N + Zen N-1c+ to be able to launch more quickly unless there's a major architecture difference between generations. I could see this happening if there were problems bringing up Zen 5c in a reasonable time or if they wanted to lead out-of-the-gate with an APU that was on the new process, regular Zen 5 was ready, and wanted to make a powerful lower-cost chip off the bat to secure marketshare. It would still work better out of the box than Intel's P&E cores because that requires a lot of magic on the programming and scheduling side.
Porting Zen Nc to a significantly different process used by Zen N++ would be much the same work as validating Zen N++. APUs & server follow the initial desktop launch, which is used to introduce new designs. Delaying V-cache & compact Zen chiplets is more about managing releases and demand, launching the simpler product first.
Undervolting existing AMD designs gives tons of efficiency. So this should work fine. No need to reinvent your wheel when you can just optimize and shrink the one you already make.
The biggest advantage of the C cores is that while they lack L3, you can just pop V-cache on them and make up the difference and then some. You wouldn't find that in the highest performance chips, but I could see that for all APUs going forward considering die space is so important. For high end desktop (gaming) I could see Ryzen 8000 or maybe 9000 having a block of full sized cores with V-cache and then an even lower clocked block of C cores for all background tasks or anything that the OS thinks wouldn't be affected by the chiplet to chiplet communication penalty. Really AMD should just come out with APUs for the low end, mid range and high end gaming chips with V-cache, and then have their highest core count CPUs just use C cores, as productivity (generally) cares more about cores than it does clock speed. Not sure how much that would eat into Threadripper sales, but considering they haven't come out with those for a few generations I don't think they would care too much.
A more mixed design would be more flexible though. I do both gaming and productivity, sometimes with integrated graphics, sometimes with a dedicated GPU connected to my laptop. 4 Zen5 cores with V-cache and 6 Zen5c cores would be bloody fantastic for a laptop.
@@raute2687 You can say you do productivity work on a laptop but let's be real, any person that does it for work will have a desktop. Which is why they should segment the lineup into entry level APU, mid and high end gaming, and then top of the line productivity where more cores will actually help performance.
Honestly this could revolutionize the flexibility of the chiplet design that AMD does on desktop without incurring as many CCD latency issues as on the X900+ SKUs currently. Instead of needing 2 CCDs to break past 8 cores, you could have a CCD come configured by default with 8 Full and 8 Compact cores with 3DVC. So something like: 8950X = 32 Cores (16 Full, 16 Compact, 2 CCDs); 8900X = 24 Cores (12 Full, 12 Compact, 2 CCDs); 8700X = 16 Cores (8 Full, 8 Compact, 1 CCD); 8600X = 12 Cores (6 Full, 6 Compact, 1 CCD); 8400X = 8 Cores (4 Full, 4 Compact, 1 CCD). Then you can have something like an 8500 with some of the C or Full cores disabled to get 10 cores or something. The scalability becomes far higher with that design. Or heck, why not a pure C core CCD with even more 3DVC than the standard CCD?
Correct me if I'm wrong, but AMD won't be putting 16 full Zen 5 cores in a single CCD with Ryzen 8000. I think this will be coming with Zen 6/Ryzen 9000. 16 compact cores yes, this will be possible from what I've heard.
CCD size/density would need to double in a single generation to have 16 Zen cores in one CCD. The rumours are talking about 4+8c instead of 8+8c for laptops.
@@Alovon ah, when you wrote 8950X= 32 Cores (16 Full, 16 Compact, 2 CCDs) I understood a 16 Zen CCD + 16 ZenC CCD. I see now that you meant 8+8c + 8+8c, Silly me, heterogeneous CCDs are a dumb idea xD the 2 CCDs will be identical , aside from any stacked cache on top... unless...
This design is interesting for sure, and it brings up some interesting ideas for future AMD chips, including how Little Phoenix may be cut down. A quad-core Ryzen 3 "7340" variant could be 2+2, while a lower-end "7240" might be 1+4, and then some Athlon "7140" is 0+4.
The Zen4c cores aren't so much slower than the regular ones that it would make an Athlon. Think about them as if they were maxed at 3.8GHz instead of 5.3GHz. So they'll actually perform closer to a Zen3 going at its full clocks at equal core counts. The Athlons are much more cut down.
@@SquintyGears I'm well aware of how neutered the Athlons are. I only bring that name into it as AMD doesn't have anything else below the Ryzen 3 name in the x1xx range.
I think they are starting to separate the qualitative compute concerns to clamp down on efficiency and leverage electrical properties that are becoming better understood at smaller scales. I find myself shopping for efficiency rather than ultimate performance; this way solar and renewables are even more attractive.
Would be interesting to see some low end Zen 4c only APUs to take on Intel's Celeron N series processors. A dual/quad core Zen 4c APU would wipe the floor with Intel's Celeron N series processors.
Having the C and non-C cores have the exact same instruction set but with different frequencies/thermal budgets is brilliant! That means no AVX-512 bullshit like Intel has, or problems with certain instructions being significantly slower on the C cores vs their alternatives with regards to the non-C cores. This allows compilers, operating systems and applications to optimize for AMD much more thoroughly.
This worked out for AMD the last time they tried it; albeit the goal then was to drive power consumption down on their Excavator cores (my memory says this was for the Carrizo lineup maybe?), rather than to create anything like a hybrid CPU cluster.
More cores at higher clocks has a benefit for mathematical compute but for desktop applications a single threaded CPU that is faster is better. What AMD have done is provide a really good compromise. As more applications become multi-threaded their performance can be offloaded into more cores giving a performance boost. However, this means that applications will need to be compiled correctly for this architecture.
The problem with this, which Intel is seeing now, is scheduling. In most cases, and especially for gaming, there can be massive latency and scheduling problems.
Stop the latency hysteria and look at a core to core latency chart. E-cores talking within the same cluster already have less than two thirds the latency of cross-CCD cores on Zen 4, when it's E-cores from different clusters talking it's even more advantageous for them at nearly half, and when it's a P-core to an E-core and vice versa it's only like 20% worse than only P-cores talking to each other. E-cores aren't being used because games literally don't need more cores, that's it. The AMD fanboys talking about "massive latency problems" are the same clowns that bought a 5900X and 3900X purely for gaming. 😂
High core counts are also constrained by memory bandwidth. However, this can be partly mitigated with a larger cache. I would think a 32 core AM5 desktop could be made practical if: a) it also included 3D V-Cache and b) it used fast DDR5. More RAM channels would be ideal, but even without them, I think 32 cores could be fed this way. Perhaps an all Zen4c or Zen5c desktop part with 32 cores would be a multi thread killer. It would still lag in games and the like due to lower clocks but I would love one on my desk.
We will have to see in benchmarks, across multiple uses, how it goes. I know that the Intel P+E design sounded naff to many, but given the right situation, aka a laptop or small device, they are kinda nice. Lower heat, longer battery life. And yes I know Windows 10 needed some patching to get it right. I know.
Great stuff! I remember seeing some annotations of Arm designs that implied they also use this method of making the chip less blocky, with several parts of the chip sharing chunks of the chip and making it impossible to clearly indicate where specific things are. Is that the case? Is that mostly a result of using high density cells, or do they also use some kind of automation for laying out the floor plan?
They use automated layout on both cores. It is just that the big core is optimized for clock speed, so the critical wires must be very short. While the "C" core is optimized for area, so the wires are allowed to be longer, and this frees up the transistor layout to fill the gaps between functional blocks, bringing the logic blocks closer to each other.
AMD also named the small Zen APU Raven 2. Later it held a name I can't recall, but driver code still refers to it that way. This is actually a great way to go, especially on laptops. 2B + 8L would probably be thermally limited anyway, but less than an all big core equivalent. For desktop 4B + whatever you can fit should also be enough for high performance designs. On server Genoa is already a huge win for higher perf through lower power per computation. The hard part will be fixing the Windows scheduler. On Linux it was already fixed by ARM long ago. This is also a different design from Intel, as here the small cores also have the same connection to the memory controller and L3 cache. On Intel there is some latency penalty from them being clustered in groups of four. I think people will be very surprised by how this will work and will not think of small cores as a bad thing.
Some ARM SoCs have been doing the first half of this for over a decade. The Tegra 3 is the first one I remember, with four cores on the high performance cell library, and one core on the low power cell library. It beat ARM's big.LITTLE to the market by a year or two. And it's somewhat common on low-end SoCs to have a fake big.LITTLE configuration that's actually just two clusters of little cores optimised for different clock speeds. I assume those SoCs used different cell libraries. AMD's main innovation here is taking the extra step to really optimise for area with the more careful layout.
I do actually think that Intel's move back towards single threaded cores makes a lot of sense regardless of architecture, the better your scheduler and prefetch units are, the less extra juice you can squeeze out of the ALUs and FPUs. Add to that the fact that cross thread vulnerabilities are almost impossible on single thread designs as well as higher densities and those E-cores are looking mighty fine. AMD's focus on instruction compatibility also makes a lot of sense given how they haven't been affected by vulnerabilities and node restrictions as much, it wouldn't surprise me if the future will land somewhere in between, single threaded cores using the same ISA but min-maxxed for speed/energy consumption.
@@HighYield Ok, but what about 32 Zen5c cores on Ryzen chip? Will they be faster than 16 Zen5 cores with higher frequency for multithreading tasks like Cinebench?
As a designer for the SBP9900 at TI, I understand the difference between the P and E cores. I also understand the Ridiculous Instruction Set Computer (RISC) concept of ARM. But who said the CPU should be "symmetrical"? I worked on a stack-based direct execution computer. I examined the WD Pascal Engine and the Novix Forth hardware engine. Semantic design means that there is no "symmetry". If they want high performance, they should abandon inherently defective CMOS and switch to resonant tunnel diode threshold logic. "Engineers" at TI had a hard enough time trying to use I²L circuits. I had to create an I²L library of equivalent circuits for the standard TTL ICs. q-nary logic, self-timed circuits, universal logic modules, and semantic design were totally beyond them.
If both Zen4 and Zen4c cores use the same RTL, instruction set and IPC, does that mean it could be easier to do scheduling in the OS? I remember hearing that Intel's big little architecture had some inefficiencies because Windows wasn't able to efficiently schedule tasks like games.
IMO, with Apple and Intel seemingly sticking to their "hybrid" cores performance-wise model for the foreseeable future, AMD >must< retain a mix or become irrelevant on the Desktop. ESPECIALLY if they wish to return to the consumer HEDT market. For the Server/Cloud side, a completely "c" cores CPU would be highly desirable though.
I wonder if AMD's use of the same RTL for both cores stems from the instruction set. x86 is extremely complex. The decoder, accordingly, needs extra complexity, much of which is independent of the throughput needed, and much of which is always-on regardless of what the core is doing. So it might lose efficiency if underutilized, and simplifying the decoder to just provide a lower throughput might not give all of that efficiency back. And if the decoder throughput isn't changing, the execution throughput can't be allowed to change as well. Which eliminates the main reason to change the RTL. Miniaturizing the same logic, also has a lot of advantages. If all the wires are shorter, that means less electrical resistance, and less leakage, which means that the same logic can get higher power efficiency (and thus less heat). This partially compensates for the lost cooling from a smaller physical surface area. It also might allow for thinner wires to carry the same signals, enabling further area savings. Shorter wires also have less chance to affect each other, also partially compensating for the fact that there are more of them. ARM does not have this problem, since it's much easier to decode. As such, it's a lot easier to just cut off half the decoder and get half the throughput (though it's not necessarily trivial). There is also plenty of room for the same exact layout optimization, but it's not necessarily as obvious given that the best point of comparison has different logic. And it sounds to me, like Intel's solution to this problem, is just to remove some instructions. This will drastically simplify the decoder, perhaps making it easier to actually reap the benefits of simplifying the decoder.
When optimizing for size and power efficiency, isn't there more to do than just moving the core elements around? I'm worried that the lower clocked zen4c cores can't compete with E-cores in power efficiency (and thus multithreaded performance within a fixed TDP envelope)
@@mawkzin Yeah sure but the 4c cores need to be more power efficient than the 14th gen E-cores in instructions/watt. Obviously they'll be faster core-to-core but efficiency is important for multithreaded loads and handheld use.
More cores with less clock speed is definitely the wiser path. We are still very un-optimized on SW/ kernel level when it comes to true multithreading, still waiting on the real paradigm shift
A Zen 5 desktop CPU with one CCD of 8 normal Zen 5 cores with stacked L3$ and a second CCD with 16 Zen 5c cores would be an extremely interesting product. As long as they can handle the scheduling aspects to optimize for various workloads it would be a killer.
I'd say 6+6 would be an interesting, cost effective but powerful system for gaming. Most games don't need that many power cores but might still be able to utilize many cores to some extent. 4+12 as the black sheep option 😅
There really isn't as much concern for scheduling in that scenario. Games are the sole outlier to how the OS wants to handle multi-CCD CPUs. By default it wants to put tasks on the highest clocking cores 1st, but while the stacked cores are slower clocked, they are technically superior for most games. So the workaround is a whitelist method. Some future core dense CCD will always be the slower CCD. Pretty much every other workload is benefitted by higher clocks 1st, and then if it's heavily threaded it'll just include the other CCD. There won't really be enough other common workloads that would benefit from the cache and not use the whole CPU, nor would the dense CCD be faster in isolation.
If AMD had the will to do it, they could probably launch that right now with an 8 core Zen4+3DVcache chiplet & a 16 core Zen4C chiplet. There's no obvious technical limitation making it impossible to my knowledge? and I wouldn't be surprised if they had something akin to that running in the labs right now (in the same way that 5950X3D's exist internally at AMD)
I have seen some tools being developed for the issues that having 2 different chiplets can cause. If you happen to have a 7900X3D or 7950X3D, I would highly suggest that you check out a program called Process Lasso. It's something that I believe AMD should have made themselves and released with those CPUs, so you don't end up with the issues some had, for example some games (CS:GO, for instance) running worse than on a 7700X or a 7800X3D because they didn't run on the appropriate chiplet for max performance.
Well, it seems AMD has a mysterious "low power core" design in the works with the Zen5 family. And no, it's not the "dense"/Zen "c" core, it's a distinct new core (maybe in the IOD or stacked on top of it?) So who knows but it seems Zen5 will have 3 core variants.
The previous Zen designs clock the L3 cache at the highest core clock frequency in the CCX. Has this changed, and are they using Z4 or Z4c cache slices?
That is an interesting approach. We are deploying build servers at scale, so efficiency is kinda important for us. With the i9-13900K, having E-cores was essential to reduce build times by 40% compared to only P-cores at the same power consumption. For AMD we are using multiple peripherals to do the job and we need the CPU to be as homogeneous as possible. Also, Hyper-V in WS2022 was not happy with heterogeneous cores, but it could be satisfied by AMD. Time will tell.
A very interesting concept. I think it's the better way than P/E-cores, especially on Windows. The scheduler has enough problems assigning tasks to the right core. The thing I would like to see would be a desktop CPU with 8 Zen5 cores (maybe X3D) and 16 Zen 5c cores. The differences in clock speeds would guarantee that high performance tasks stay on the Zen5 cores. That the ISA is the same on all cores is great, as heavy multithreaded workloads can profit from many cores regardless of type and clock speed.
I agree with @MikeGaruccio: more, smaller cores at lower clocks is the main idea behind RISC architectures, except that here the cores are actually CISC cores. More cores are the way to go for sure, especially when the cores are uncompromised, except for clock speed. AMD is correct in their approach.
The idea of more cores at lower clock speed is extremely compelling as a cloud provider. It’s not uncommon for us to have >50% cpu utilization at a host level, but, scheduling overhead and other issues makes higher oversub impractical, so more cores would allow us to get even more dense and offer a lower cost option for the vast majority of workloads that don’t really need high clock speeds anyway.
Yeah the main reason for AMD developing Zen4c is to sell them in EPYC chips.
@@shepardpolska yea it’s a great fit for those use-cases. E-cores are great for laptops but the lack of simd and other features makes them a non-starter for general purpose virtualization, but cores that work just like the high-perf stuff but a little slower is easy to integrate into a tiered offering.
In 2016 some researchers built a 1000-core CPU that ran at 1.78 GHz and could be powered by a single AA battery, with each core being able to shut down individually when not in use.
If they did this in 2016, imagine what level of technology we have now that they are hiding from us.
@@HamguyBacon "hiding from us" lmao, also source?
@@ChrisStoneinator It's called "KiloCore". Dead project by now, but was hyped up a few years ago. Used PowerPC ISA. Couldn't find any actual performance numbers.
Overhyped with almost no practical use in general computing because of how the chip is built.
To me, this shows how much room for change there still is in CPU design.
We've been stuck with monolithic quad cores for so long, and suddenly, we got big.LITTLE, chiplets, 3D-stacked cache, and 128-core CPUs. I'm loving it :)
For gaming this sucks. I want 16-core monolithic. Hybrid Intel CPU has given me nothing but problems. Microsoft thread director is still terrible in 2024.
@@impuls60 well, the fastest gaming CPU is and will be for a while an 8-core AMD with 3D V-Cache. If you want 16 cores get the 7950X3D or similar.
This is best for laptops and cloud computing. Especially enterprise. More cores in less space is great for enterprise
@@impuls60 Good luck waiting for 16-core monolithic on a cutting-edge node with high clock speeds and a large cache for gaming. Such a theoretical product would cost thousands of dollars and won't have a viable market if it can be outperformed by an 8-core costing many times less. You have to make a compromise at some point.
Oh, I see, that's honestly pretty clever. I imagine using the same base design makes programming much easier than having 2 entirely separate designs for high performance and high efficiency cores. Really nails home how CPUs are integrated CIRCUITS, and not magic crystals. The same circuit can be laid out multiple ways for different purposes.
It's much easier for the OS, instead of needing special CPU driver and applications not knowing what features the chips support (as they can be moved between P&E cores), it can simply favour the fastest cores, then slower and finally SMT threads when utilisation is high.
@@RobBCactive Another likely advantage is in development cost for AMD. I presume that they do all the architectural development in the P core design (since it's easier as it's broken down into more separate blocks) then, when the architecture is mature, redesign the layout to compact it into a smaller area.
@@spankeyfish Good point! Well if you have developed the RTL, alter the process rules for lower frequency, use compact cells and loosen the placement requirements it is mainly running software tools and design rule checkers to validate a different implementation.
It could even involve many of the same engineers who created the faster modular initial design that passed validation and understand the design in depth.
So that has to be simpler and cheaper than 2 architectures.
modern cpus are pretty close to magic crystals 😅
No you're wrong, it's definitely magic crystals. With a lightning and logic enchant.
AMD is playing on 'easy mode' in regards to scheduling: With cores where the only difference is max clock speed, all you have to do is prioritize the high-clocking cores, and you're done. If the c cores are more efficient, you can prioritize c cores when all cores are clocked slower than c_max to save power. Intel has it much harder - they have to have instrumentation in the core to sense the instruction mix, in order for the scheduler to know how threads would have performed if run on another core. Past tense, because they now have to hope that the mix stays the same for long enough.
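A minimal sketch of the "prioritize the high-clocking cores" rule from the comment above, in Python. The `Core` type, the `pick_cores` helper, the 2+4 layout and the clock numbers are all made up for illustration; a real OS scheduler is far more involved than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    core_id: int
    max_mhz: int    # hypothetical advertised maximum clock
    is_dense: bool  # True for a "c" core


def pick_cores(cores, n_threads, demand_mhz=None):
    """Choose which cores should run n_threads runnable threads.

    Default policy: fastest cores first.
    Power-saving policy: if the requested clock (demand_mhz) is at or below
    what every dense core can reach (c_max), fill the dense cores first.
    """
    c_max = min((c.max_mhz for c in cores if c.is_dense), default=0)
    if demand_mhz is not None and demand_mhz <= c_max:
        # Dense cores first, then by clock, then by id for a stable order.
        ranked = sorted(cores, key=lambda c: (not c.is_dense, -c.max_mhz, c.core_id))
    else:
        ranked = sorted(cores, key=lambda c: (-c.max_mhz, c.core_id))
    return ranked[:n_threads]


# Hypothetical layout: 2 full cores plus 4 dense cores (numbers invented).
CORES = [Core(0, 5000, False), Core(1, 5000, False)] + [
    Core(i, 3700, True) for i in range(2, 6)
]

if __name__ == "__main__":
    print([c.core_id for c in pick_cores(CORES, 3)])
    print([c.core_id for c in pick_cores(CORES, 3, demand_mhz=2000)])
```

Note that nothing here needs to know what instructions a thread executes, which is the point the comment makes about identical cores.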
But the OS doesn't know either what the performance demand of a thread will be so it picks the one with most capacity. I could imagine a power saving mode where only Zen4c cores are enabled to save battery, so no boosting is allowed.
Why would it be any different? If it's as easy as identifying which core can clock more, then Intel's situation is the exact same.
@@maynardburger Incorrect, the E-cores can perform badly given certain instructions. Find the Chips and Cheese investigation into E-core inefficiencies; sometimes P-core only is faster than P+E in multi-threading.
@@RobBCactive I don't think they have a big advantage when the CPU is mostly idling. They can just add CPU multi-thread performance very energy- and cost-efficiently, so main threads can perform better.
Cache is also less in Zen(number)c cores, which might have a slight impact in ipc since the core has less data immediately accessible at any given point, so it has to reach out to L3 and ram more often.
It's amusing to think back to Dr Cutress's 2020 interview with Michael Clark (Zen lead architect) who grinned a lot when talking about exciting challenges when being asked questions designed to get him to give stuff away.
right?
Good catch. I wonder how long AMD worked on Zen 4c and when the decision was made to go that route...
@@HighYield well the interview made clear that they're starting design at least 5 years ahead, it's why they were stuck with Bulldozer for so long. But Zen4c as a tweak for area might have been in response to cloud demands and taken less, at the time of the interview MC said Zen cores were smaller and scaled better so they believed there was no need for a different core. I think he would have liked to share the idea but obviously was bound by confidentiality.
Current rumours are that Zen 6c will go up to 32 cores on a single chiplet (CCD); that would be some serious multi-threading performance in such a small area.
Likely it's all their C cores, but given it's gonna be on either 3 or 2 nm, it's likely gonna be the only Core type by then, since there isn't much point shrinking Cache below 5/4nm due to them not gaining any performance
@@hinter9907 But AMD already has the solution for that with stacked V-cache. In future core designs they could completely off-load the L3 cache to a separate die (or dies) because as they have already proven with V-cache, you can shrink cache considerably if that is the only microarchitecture on the chiplet.
@@hinter9907 Yeah the 32 Core version of Zen 6c will be the Bergamo successor. If they stick to 12 CCDs then that's 384 cores on a single CPU!
It's rumoured to be 2nm but I personally don't think TSMC will have this node ready for AMD early enough or Zen 6c will be late, not sure which.
Yeah I would leave cache on 6nm pretty much indefinitely now, it's a much cheaper node plus has loads of capacity.
I doubt it would have enough bandwidth
@@tuckerhiggins4336 It would be at least 12 channels of DDR7, could go up to 16 channels or they might use HBM4.
I wasn’t very interested In Phoenix 2 before watching this video, but your detailed breakdown really makes it fascinating.
Using the same architecture also makes software (particularly the scheduler) much simpler, which in turn also gives a performance benefit.
On an E-Core/P-Core design, as soon as a process e.g. uses AVX instructions it has to be moved to the P-core (which takes time, messes up the cache, etc.) and the scheduler then has to either pin it there (which might not be efficient) or keep track of how often this happens (in order to decide if moving back to an E-core would be beneficial; which causes additional complexity and uses extra cpu cycles). Most of that simply disappears if all cores have the same capabilities (and the only difference is how high they can boost their clock).
AMD's approach also supports another dimension of differentiation by using chiplets: the manufacturing nodes and design rules can be radically different depending upon the goals. There is where rumors of Zen 5 and Zen 5c are going in that they are using two different manufacturing nodes for further optimizations in clock speed, area and power consumption. Phoenix 2 is unique in that both Zen 4 and 4c are together on the same monolithic die (which makes sense given its target mobile market). Desktop hybrid chips would be far easier for AMD to implement if they felt the need to bring them to market: just one Zen 5 die for high clocks/V-cache support and a Zen 5c for an increase in core count. AMD has some tremendous design flexibility here for end products.
The software side cannot be stressed enough: the changes between Zen 4/Zen 4c and Zen 5/Zen 5c are nonexistent in user code space. (The supervisor instructions are mostly the same but Zen 4c has a few changes to account for split CCX on a CCD and higher core counts in general. Clock speed/turbo tables are of course different to report to the OS. All the stuff you'd expect.) Intel's split architecture requires some OS scheduler changes to function properly. Lots of odd bugs have appeared due to the nature of heterogeneous compute: legacy code always expected the same type of processor cores when it first adopted a multithreaded paradigm. These are the difficult type of bugs to replicate as they involve code running in particular ways across the heterogeneous architecture that require specific setup to encounter. For example, splitting up work units evenly on the software side and expecting them all to complete at roughly the same time, presuming all the cores were the same. Or a workload calibrates its capabilities based on first running on a P core, then gets context switched onto a much slower E core that can't keep up with initial expectations. Intel is large enough that they got the necessary scheduler changes implemented into OS kernels to get things mostly working, but there are still plenty of edge cases.
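To illustrate the "split work evenly and expect everything to finish together" failure mode mentioned above, here is a small Python sketch. It is a toy model, not real scheduler code: the function names and the relative core speeds are invented, and every work unit is assumed to cost the same.

```python
import heapq


def static_split_makespan(n_units, core_speeds):
    """Even split up front: every core gets (almost) the same number of units.

    core_speeds are relative speeds in units of work per time unit.
    Returns the time at which the slowest core finishes.
    """
    per_core, extra = divmod(n_units, len(core_speeds))
    finish_times = []
    for i, speed in enumerate(core_speeds):
        units = per_core + (1 if i < extra else 0)
        finish_times.append(units / speed)
    return max(finish_times)


def dynamic_queue_makespan(n_units, core_speeds):
    """Shared work queue: whichever core becomes free first takes the next unit."""
    free_at = [(0.0, i) for i in range(len(core_speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for _ in range(n_units):
        t, i = heapq.heappop(free_at)
        t += 1.0 / core_speeds[i]
        makespan = max(makespan, t)
        heapq.heappush(free_at, (t, i))
    return makespan


if __name__ == "__main__":
    speeds = [2.0, 2.0, 1.0, 1.0]  # two fast cores, two slow cores (made up)
    print(static_split_makespan(12, speeds))   # fast cores sit idle at the end
    print(dynamic_queue_makespan(12, speeds))  # fast cores pick up the slack
```

With the even split the fast cores finish early and wait for the slow ones; with the shared queue they simply take more units. Code written in the first style is the kind that assumed identical cores.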
Now that you mentioned it, guess zen c cores make sense for next gen consoles. Clocks are always much lower anyway and the area efficiency would be helpful for cost. Actually it is possible console zen 2 cores are already optimized for area. Never thought about it. Cool!
Well it could go in console but they're not designed like this at all.
Usually the slim model gets just a die shrink, or they integrate multiple chips into one. But in 2023 they're actually saving more money by staying on the same process node instead of trying to buy some of the 3nm wafers that Apple wants to put in the new iPhone.
And saving 35% on just the core with no other saving anywhere else in the chip isn't going to make the console cheaper. It would just be 10% smaller chip in the best case (they're selling at a loss so maybe they are willing to do a silent version update to lose 10$ less)
Though, to your last comment, the Zen 2 they're using isn't anything like a Zen 2c. It would have been so valuable for servers you could guarantee AMD would have made an EPYC CPU with it 3 years ago already...
nah next gen consoles are much more likely to use a single CCD X3D/VCache cpu. You get slight power efficiency and better gaming performance.
This would also mean games will start to focus on the cache a lot more and trickle down to desktop as well and that could be the next paradigm.
@@rudrasingh6354 I sure hope so, it would be incredible, however 3d cache is super expensive and consoles try to stay monolithic thus far (maybe next gen this will need to change? I sure hope so so they can include more silicon without charging 700 usd for a console. Also they could afford to maybe make part of it on a newer node. Let's hope!!). I'm also not sure it would be an issue to use zen c cores with 3d cache if they find a cost conscious way to do it.
While the C-cores make sense for mobile, I think the gaming APUs are more limited by the area and power draw of the GPU (and possibly memory bandwidth). AMD needs to find similar optimizations for the GPU
edit: as a separate note, could mention in the video how modern operating system schedulers already focus on the best rated cores for foreground applications, and indeed multicore workloads push the clocks of all cores down, especially on mobile
Yes, but area efficiency is the concern, zen4 or zen4c both are very efficient at low power, but classic zen 4 takes so much space.
@@yusuffirdaus3900 You should look at Intel cores, they are like twice the size of Zen cores, at roughly the same, or marginally higher IPC. Regular Zen cores are already comparatively efficient for the space. So Zen4c cores aren't that much bigger than a E core, but obviously much faster.
Actually the RTL is part of the design tools intermediate format. It’s a step in the synthesis process, the place and route step will apply the “design rules” for the fab process.
So the RTL is adapted to the node specific design rules first?
@@HighYield RTL indeed sometimes can be customized for a specific node. There are many iterations of synthesis, placement and RTL updates to achieve target power/area/speed requirements.
@@HighYield Junior ASIC design engineer here. The RTL design is basically the functional description of the chip written in HDL (Verilog, VHDL or SystemVerilog).
The synthesis tool takes as input the RTL design and the node/process-specific library, i.e. how exactly a logical function in the RTL is translated to a set of logic gates and the physical characteristics of this set of gates (length/width, electrical resistance, etc...) in the chosen fab process. The output produced is something called a Netlist, which is basically the list of all the logic gates and how they are connected to one another in order to have the same logical behavior as the RTL. Then the Netlist is placed and routed in the physical space to create the final die layout and the photomasks used by the fab.
@@HighYield
Hey, FPGA developer here. RTL is the first step of design entry after the ISA is defined and high-level block designs and requirements are agreed upon. RTL is the design written in a Hardware Description Language, HDL (VHDL, Verilog or SystemVerilog, or a combination of all three). The HDL is usually behavioral, written in a way that looks like software but with some differences (mostly, the described behavior is highly parallel).
After the RTL is written in an HDL, it can be simulated at the behavioral level for a large portion of the verification steps.
In an ASIC design flow, RTL goes into synthesis to produce a synthesized netlist. Synopsys Design Compiler is one of the tools available for this. The synthesized design can optionally be simulated at this time.
Once you have a synthesized netlist of the "generic" logic gates and flip flops, then comes technology mapping, this is where the start of the PDK is used. The "generic" logic that has been synthesized is mapped to pre-designed physical lego blocks that can go onto the ASIC.
Once technology mapping is complete, along with constraints (IO, clock, timing, etc), the chip can be floor-planned before the tools are given the opportunity to do place and route within the floor-planning rules. You can think of it as if a designer wrote down where the walls in a house will go, but the automated tools will place all your furniture according to rules like, "I want my TV within 2 meters of my couch, and it must be facing it". You touch on this accurately in the video: a highly floor-planned chip can sometimes lead to a less optimal design because you gave the automated tools less freedom, however it can sometimes give a more optimal design because you restrict the scope of what the automated algorithms are trying to do. Note that a segmented and floor-planned design not only creates placement and routing constraints, it usually means the synthesis tool, knowing that later stages have more constraints, might also be forced to perform fewer optimizations, making for a larger synthesized netlist.
After placement and routing, there will often be some timing violations, some of these can be fixed manually, or the design can be iterated on and the tools given another try on a design with some changes.
The placed and routed design can also optionally be simulated, but it will be much more time consuming when compared to an RTL behavioral simulation.
In all the steps, emulation can be substituted for simulation. The cost of the emulation hardware is higher and the setup effort for deploying a design onto emulation hardware is higher, but the execution speed of the emulated design is higher than simulation.
Fuck, you made me remember my Semiconductor design courses....goddamn Cadence tools... absolutely sucked... made me lose my interest in the subject area, knowing that I would have to work with them.
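A toy Python illustration of the technology-mapping step described above: the same generic netlist is mapped onto two different cell libraries and only the area changes. The netlist, the library names and every area number are invented for the example; real PDK libraries and synthesis tools are vastly more detailed.

```python
# Generic gate-level netlist: (instance name, generic gate type).
NETLIST = [
    ("u1", "NAND2"),
    ("u2", "NAND2"),
    ("u3", "INV"),
    ("u4", "DFF"),
]

# Two hypothetical standard-cell libraries, cell area in arbitrary units.
HIGH_PERF_LIB = {"NAND2": 1.4, "INV": 0.9, "DFF": 5.0}
HIGH_DENSITY_LIB = {"NAND2": 1.0, "INV": 0.6, "DFF": 3.6}


def mapped_area(netlist, library):
    """Map every generic gate to its library cell and sum the cell areas."""
    missing = {gate for _, gate in netlist} - set(library)
    if missing:
        raise KeyError(f"library has no cell for: {sorted(missing)}")
    return sum(library[gate] for _, gate in netlist)


if __name__ == "__main__":
    print("high-performance cells:", mapped_area(NETLIST, HIGH_PERF_LIB))
    print("high-density cells:   ", mapped_area(NETLIST, HIGH_DENSITY_LIB))
```

The logic (the netlist) is identical in both cases, which is the sense in which one RTL design can end up as two physically different cores.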
That really feels like a business savvy move - engineer the core once, and get as much use outta it as you can
It's still more similar to engineering two cores.
The advantage is the ease of integration with current software. AMD just have to assign normal cores as priority cores and they are done. Compared to Intel who needed Windows 11 scheduling to take proper advantage of their P and E core CPUs.
That's a much smarter trade-off than I originally thought! Very well done, AMD eng team
For the record, from my initial skimming of the news, I had assumed more or less that Zen 5c are literally just Zen 5 cores with part of the cache cut off with a handsaw. I knew that the benefit came in smaller power consumption, but I had no idea there were actually smart engineering reasons for doing it this way
I've been thinking about Intel's big little implementation a lot lately and have finally just come around to it. I guess wider programs won't necessarily need more fast cores if it can already get a lot of smaller ones.
Then AMD goes and Feng Shui's their regular cores to get the same effect while dodging scheduler worries and I'm blown away all over again.
lol all AMD did is just crippled a few of their existing cores
while Intel actually give you extra E cores on top of the existing P cores
@@loucipher7782
lol at upboating yourself ebin redditor *tips hat*
@@loucipher7782 Only time will tell. We need benchmarks, hard data.
Great video. This strategy seems to make sense for servers.
For my laptop, I would personally love to have a hybrid processor design with lower power cores for simple workloads, good support in an intelligent cpu scheduler, and a couple fast cores for occasional cpu-intensive workloads.
I haven’t followed cpu design much in the past few years, and this is the first time I am learning about hybrid and asymmetric core designs. Interesting stuff. It’s cool to see how cpu designers are innovating as we reach the limits of increasing clock speeds and shrinking transistors.
AMD is on to something here. Will be very interesting how the performance results. Area optimised cores would be potent in handhelds, tablets and laptops - that's desktop class core but with power efficiency per IPC, not just lower clocks & voltages.
Nice video. Excellent presentation.
I recently obtained a MinisForum Phoenix APU UM790 Pro 7940HS mini PC and am impressed with the small size, great capability and low power envelope.
The Phoenix2 will have lots of applications especially for mobile and notebooks.
An all 4c chip could be very useful in the data center for CPU performance per BTU.
Many mobile chips are the desktop variant running at much lower clocks to extend battery life and/ or reduce heat.
Hoping to see a 8 zen 5 x3d + 16 zen 5c cpu in the first half of next year. Gonna be building with that chip if it's released.
Yeah, that's a mad concept for programmers and content creators who also wanna game.
They won't do that... It sounds amazing on paper but how many people would actually buy that?
Also it would perform poorly. There's a memory bandwidth requirement to feed cores. Probably requires threadripper socket.
There's a lot of discussion of whether it's coming to the chiplet desktop CPUs at all in a hybrid package with a regular chiplet and a c chiplet... And it seems like it won't at all... That's why we've only seen them in EPYC, because they make so much sense for servers. And in laptops where the size and power efficiency are king.
@@SquintyGears Intel is pushing e-core counts higher already with 14th gen and they're already at 8+16 now, so AMD is going to have to do something if they want to stay in the upper end of multi threaded performance on desktop. I have a 5950X myself (software developer) and in my workloads it makes 1-2% difference whether I run the memory at 3200 or 3600. For really "crunch heavy" stuff like all AVX workloads it makes even less. It's practically only branch heavy code, and code that does very little work to very large amounts of data (like database lookups) where memory bandwidth is a problem. Games, however, tend to fall in this category.
Threadripper (non-Pro) went up to 64 cores on a quad channel memory setup on DDR4 3200. I can't see any realistic scenario where 8 full and 16 compact cores can't perfectly well run on dual channel DDR5 6000.
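The comparison above can be put into rough numbers with a few lines of Python. This is only theoretical peak bandwidth (transfers per second times 8 bytes per 64-bit channel); it assumes "dual channel" DDR5 means 128 bits in total, and it ignores latency, efficiency and how hungry each core actually is. The function names are made up for the sketch.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s for 64-bit wide channels."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def per_core_gbs(channels, mega_transfers_per_s, cores):
    """Peak bandwidth divided evenly over all cores, in GB/s per core."""
    return peak_bandwidth_gbs(channels, mega_transfers_per_s) / cores


if __name__ == "__main__":
    # 64-core Threadripper, quad-channel DDR4-3200
    print(per_core_gbs(4, 3200, 64))
    # hypothetical 8 + 16 core part, dual-channel DDR5-6000
    print(per_core_gbs(2, 6000, 24))
```

On paper the hypothetical 24-core part gets about 4 GB/s per core against about 1.6 GB/s per core for the 64-core Threadripper, which supports the commenter's point as far as raw bandwidth goes.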
@@andersjjensen we haven't seen the numbers for those intel chips yet...
Yes obviously quad channel has been proven in the past.
The main difference you're seeing from 3200 to 3600 is the Infinity Fabric clock difference, because latency to memory doesn't scale along with the raw bandwidth difference of that clock bump, unless you took the absurd amount of time it takes to tune the timings for it. I don't know anybody who has, I've tried and given up....
Also, importantly, Intels 8+16 wouldn't take as much bandwidth to feed as a zen + zenC of the same count because the slow intel cores are much slower than their counterparts.
I'm not saying I know exactly where the breaking point is where the performance gains fall off a cliff. But I'm pretty sure we're not that far from it. With some quick napkin math, with latency that hasn't changed and bus width that hasn't changed but core counts having exploded, you can understand why some people wanting to see 48 threads on AM5 seems a little bit far-fetched... And that's if they would even make such a configuration. I'll remind you that we've only seen leaks for EPYC chiplet parts and laptop fully integrated parts. Both are integrations where it brings a very obvious advantage. What's the point for desktop? (assuming they do finally release an update for Threadripper)
@@SquintyGears you do realize that the memory bandwidth has increased since they put 16 cores on am4 right? 24 cores is only a third more bandwidth needed if you start from the presumption that the first 16 core chips were bandwidth bottlenecked
On server they do 8 cores per memory channel for zen4 and 16 for zen4c. A 24 core needs 2 channels following server guidelines. Am5 has 2 channels...
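Taking the cores-per-channel ratios quoted in the comment above at face value (they are the commenter's figures and are not verified here), the arithmetic looks like this in Python. `channels_needed` is a made-up helper name.

```python
import math

# Cores per memory channel, as claimed in the comment above (unverified).
CORES_PER_CHANNEL = {"zen4": 8, "zen4c": 16}


def channels_needed(core_counts):
    """Memory channels needed for a mix of core types, rounded up.

    core_counts maps a core type ("zen4" or "zen4c") to a number of cores.
    """
    load = sum(n / CORES_PER_CHANNEL[kind] for kind, n in core_counts.items())
    return math.ceil(load)


if __name__ == "__main__":
    # 8 full cores + 16 dense cores, the desktop part discussed in this thread
    print(channels_needed({"zen4": 8, "zen4c": 16}))
```

Under those ratios, 8 + 16 cores come out at exactly two channels, which is what AM5 provides.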
One big bottleneck for the smaller c cores would be the power density. The heat per mm^2 is a huge bottleneck, so clock speeds are also reduced in that regard.
But they don't boost as high, so they require lower voltage, and power scales with the square of voltage while it is only linear in frequency. Boosting deliberately pushes the core into an inefficient part of the performance/power curve. At lower TDP per core the c cores become more efficient while not hitting the higher frequency that standard Zen hits at higher TDP. In a likely U-series 15W laptop the power density is way lower than on desktop, where a single core can easily exceed 15W.
Thermals constrain boosting as 10C costs about 100MHz max frequency, so on modern chips it's not really a huge bottleneck in general. The Bergamo servers with a large number of CCDs have a large socket; it's more the overall TDP that makes power-efficient frequency a priority, as high utilisation is the normal case.
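The voltage-squared point can be made concrete with the usual first-order model for dynamic power, P ~ C * V^2 * f. The Python sketch below uses that model only; it ignores leakage, and the voltage and frequency ratios in the example are invented rather than measured on any Zen part.

```python
def relative_dynamic_power(v_ratio, f_ratio):
    """Dynamic power relative to a baseline, from P ~ C * V^2 * f.

    v_ratio and f_ratio are voltage and frequency relative to the baseline;
    the switched capacitance C is assumed to be unchanged.
    """
    return v_ratio ** 2 * f_ratio


def relative_perf_per_watt(v_ratio, f_ratio):
    """Performance per watt relative to baseline, assuming perf scales with f."""
    return f_ratio / relative_dynamic_power(v_ratio, f_ratio)


if __name__ == "__main__":
    # Made-up example: 25% lower clock reached at 20% lower voltage.
    print(relative_dynamic_power(0.8, 0.75))
    print(relative_perf_per_watt(0.8, 0.75))
```

In this simple model the frequency term cancels out of performance per watt, so the efficiency gain comes entirely from the lower voltage that the lower clock target allows.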
@@RobBCactive There's an optimal point of performance per watt for any chip design.
Heat per mm² is still going to be a limit even when running chips at that optimal point.
@@crazyelf1 you clearly don't understand this as well as you think. Large socket spreads the TDP over a large area so W/mm² can be easy to cool.
My immediate first thought was to the steam deck 2 as well.
A single or maybe two big cores for those primarily single threaded games, with 4 smaller cores for heavily threaded games. I've found that a similar layout works well for desktop use on intel's designs, with 4-6 cores that you can schedule for gaming being optimal for an enjoyable experience.
I do believe that the 2 GPU workgroups in phoenix 2 are anemic for the deck, as graphics performance would require an upgrade, but that's something they could adjust in a custom design for Valve
Back to the CPU architecture though. I absolutely love having more cores than I would think I need. My main gripe with intel's design is the limited use cases I'd have due to how they're scheduled. A "space optimized" core would suit my wishes perfectly, and the full fat cores being there for when you need them is the icing on the cake.
The Deck 2 def needs more WGPs, fully agree.
The problem is the core configuration.
My i7-12700K is a mistake by intel, so they won't make a similar chip in a very long time. Good for me, because the perfomance per watt is hilarious, it's way too high.
Watching 4K youtube videos while the CPU stays at 8 watts and 28 degrees on summer with air cooling is really something else...
Something with 6 zen 5c cores and 12 RDNA 3 CUs would be legendary for handhelds
Qualcomm is catching up. Just like how Linux gaming is getting better after steam deck. I hope they bring out a smaller handheld with Qualcomm or MediaTek .
@worthlessguy7477 problem is, ARM doesn't support dx12. So all you're gonna get is phone games on that handheld even if it existed. We want a handheld that plays real games, not android apps.
If valve ever decide to make a new handheld pc it's 100% going to happen, but with RDNA 5 instead
Look into strix
For an ultra portable laptop I think Little Phoenix makes perfect sense provided that the compact cores clock reasonably well (say 60-70% of the full fat cores) when plugged in. And since the process scheduler in both Linux and Windows are biased towards the highest clocking cores single threaded and lightly threaded apps will always "run right" without the OS and apps needing modification.
For standard desktop compact cores make little sense, but for "the layman's workstation" I think a 9950X3D with 8 full fat + VCache and 16 compact cores would be very desirable to many. Especially now where AMD has discontinued Threadripper non-Pro.
This seems like a much better compromise to me! Will be interesting to see how it performs :)
Dense cores is such a brilliant idea. Classic cores also are run at 2.5 - 3Ghz in servers.
You really need tight coupling for thread scheduling. On the M1/M2 platform, it works extremely well. The GUI is smooth and only pegs the E cores in regular usage.
AMD has the capability with Zen to scale in any direction it wants. That was the whole goal of Zen from the beginning. It needed to be an architecture that, in the long run, can do everything and anything, and use innovative packaging techniques. Zen4c is great proof of what can be done, and in the future, there could easily be 3 or 4 different versions of Zen cores, tuned to what is needed.
More cores at lower clocks at least to me never made sense because as an average consumer, quick bursts of high single core usage is what I would require.
Zen APUs for high performance handhelds seems to be a market AMD has all to themselves; leveraging their existing work seems like what they'd do. 5c probably won't be a big change from 4c beyond what they can do to improve Epyc and Threadripper. Lower power and heat is always good though. I would expect lots of c cores to be used in next-gen console APUs, and by that time we might be up to 6c
Hey dude one issue with what you said and it needs clarification. About 5:30 you're talking about Zen 4 vs. Zen 4C. You say there is no difference in IPC. That's ONLY true without regard to L3. If you are going to take into account L3 and the different workloads that benefit from having more L3 cache, so large data sets and for PC that's often games, but other scientific or engineering workloads where you might do modeling and need access to large amounts of data, then there is a difference in IPC.
As soon as cache values change, you have affected IPC in some workloads. IPC can't be talked about as if it's a fixed thing because it's not. It's workload dependent.
In THIS case with the chip you are showing, because there is a shared L3 then there should be no difference in IPC between Zen 4 and 4C cores, but this is not always true. In general Zen 4C cores have half the L3 cache, so if you compared a CPU with Zen 4C cores only vs. a CPU with Zen 4 cores, there would be a difference in IPC.
Zen4c looks like the prelude to AMD producing an entirely new range of massively threaded enterprise CPUs. They saw the ARM based servers and went "We can do that, but with x86 instruction sets."
Zen 4c actually makes a lot of sense in the mobile and server sections; smaller, more efficient cores at lower clocks are all you need there so..
I imagine the Zen 4c cores can execute the same instruction sets as Zen 4, unlike Intel E cores, which are missing some instruction sets the P cores can execute. AMD with their smaller silicon footprint along with chiplets can likely build CPUs relatively cost effectively now, even with TSMC charging more to produce them. On the flipside Intel has the advantage of creating their own chips in their own fabs.
Yea the identical instruction sets is an interesting choice. It’s a huge win for the server market and opens up possibilities for all-efficient-core designs, but may make less sense for some consumer applications (a lot of background processes may not actually need stuff like avx512 so that may leave big chunks of the core sitting idle).
I’m not sure how much Intel running their own fabs is truly an advantage right now given how much they’ve struggled to get EUV online and move to their next node. Relying on TSMC may mean AMD is sharing some of their profits but it also means they’re always able to access the latest, fastest process node. Shedding that risk removes a lot of uncertainty.
@@MikeGaruccio If you are using a recent C standard library you may end up having AVX-??? instructions simply by calling the strcmp function.
I really like that they've started building them like this. I have a sneaking suspicion that IPC will increase due to lower latency between the different parts of the core.
I very much see the 'dense' versions of Zen being used in games consoles, handhelds, thin & light laptops, NUC like devices, etc. Basically anything where 'power to performance' is king rather than just outright performance.
Personally I don't want hybrid chips coming to desktop though, or if they do I want to have all options, i.e. only full large Zen chips or Zen + Zen C chips or only Zen C chips. But I've had and am still having a lot of issues with Intel's big/little designs and so my next system is AMD because I only want big cores for my desktop.
With Zen Dense being the same IPC but just lower clocks it doesn't mess with stuff in the way that intel E cores do. Apps and windows will also know much better which core is faster, since Windows scheduler looks at clocks only to determine the fastest core
@@shepardpolska I just don't know why Windows/Intel is handling things so badly though, like if I don't have that app 'on the top' (ie if I minimise it or have something else open over the top of it) then most of the time things get moved to the E-Cores and leave the P cores empty but these are full on performance requiring applications that were maxing out the P cores but then just almost stop if I don't watch them work!!
By definition Windows handles things badly, that's its entire purpose xD
But yeah hopefully we'll see some variation assuming they continue on the same path, already today we are seeing more options with each generation
Gaming needs high performance cores though. We can see that the Z1 Extreme can clock up to 5.1ghz, I think AMD will need at least 2 Zen5 cores, then maybe 6 Zen5c cores for the next Steam Deck APU. Or if they want to keep it at lower clocks with all Zen5c cores, they'll probably have to add X3D cache to it to boost the gaming performance back up to an acceptable level. Since X3D cache will inhibit cooling, that might actually work well to complement the slower clocking cores.
Wonderful breakdown and discussion, I think about this every time I see Intel's lineup and the whole instruction-set compatibility rework that's underway with AVX10.x. I prefer AMD's design and approach for that reason alone. I respect Big.Little when they go hard for like Apple's designs and just breathe great products that perform and last. Zen4C sounds exciting for all products
Very enlightening video indeed. I’d say what AMD is doing is a halfway house solution towards a ‘little’ core, but one that’s perfect for the sort of workload that it's targeting. Easy workloads that are highly parallel would work better on those little cores, as they can easily scale by core count. But if the workload has less inherent parallelism, then the Zen4C would be perfect for it.
Not to mention the ease of workload scheduling here. Intel’s hybrid design requires dedicated hardware in the thread director, while this solution from AMD doesn’t
AMD has confirmed that Zen 5 will have "Integrated AI and Machine Learning optimizations" which to me says an expansion on AVX512 support. But since they are going hybrid, I wonder if the new AVX10 implementation Intel has come up with might be at play. Since AVX10 allows for different register widths instead of forcing 512 bit it becomes a good fit for little cores without hampering big ones
Actually AVX10 is presented as a common standard not limited to intel. But intel is rolling it out inconsistently.
I don't think it's what they were talking about though. I think they meant an SoC style approach like Apple does with their neural engine. It's an accelerator block that would be part of the IO die like the gpu section.
AVX512/10 is meant to accelerate the model training and scientific research tasks. It's vectorizing 512 bits of data through a single instruction instead of the 64bit that it can do regularly. This speeds up processing a whole matrix when the same operation is applied to every cell.
But for inference and using prebuilt models day to day it's not what's most effective.
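A rough way to picture the width argument: the number of elements handled per instruction is just register width divided by element width. The sketch below only counts instructions under that assumption; it ignores memory bandwidth, clock behaviour and everything else that decides real performance, and the helper names are made up for illustration.

```python
import math

def lanes(register_bits, element_bits):
    """How many elements of a given size fit in one register."""
    return register_bits // element_bits

def instructions_needed(n_elements, register_bits, element_bits):
    """Instructions needed to apply one operation to every element."""
    return math.ceil(n_elements / lanes(register_bits, element_bits))

N = 1024  # elements in a hypothetical array of 32-bit floats

for width in (64, 128, 256, 512):
    print(width, "bit:", lanes(width, 32), "lanes,",
          instructions_needed(N, width, 32), "instructions")
```

This is also why variable vector widths are attractive for smaller cores: the same code shape works with fewer lanes, it just takes more instructions.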
Great for mobile, but for desktop compute Zen 4c doesn't have the appeal of more big Zen cores. My suspicion is that at 100% load for longer periods, a big Zen undervolt reaches greater (or at least a wider) range of efficiency targets. AMD must have realized this same particular flexibility pretty early on, because they literally compressed Zen here until it reached the same thermal limit at a much lower, and presumably the most efficient, point on the power curve. This is great for devices that run a submaximal load, like most office or mobile devices, or GPU limited devices. Though I'm not sure if that Zen c won't adversely impact "GPU busy" on even the most constrained devices, like handheld gaming, because today the emphasis is more on high refresh rates than fidelity. Time will tell.
Frankly the most amazing thing about this is how consistently disruptive AMD's design choices for Zen in the server or supercompute space are for the consumer market. They keep hitting multiple targets with the same stone.
@HighYield A point I have not seen mentioned is using advanced process nodes.
A halved L3$ with 3 cores for 2 allows a doubling of cores in almost the same area, meaning retaining @High Yield !
For AMD that might make 3nm more economic, gaining efficiency in even smaller area without loss of clock speed challenges while power density is not problematic even all core.
OTOH the higher performance less dense 4nm might offer higher max clocks and boost potential on desktop.
Cost per core in high utilisation cloud server, the effective cost/performance of Zen5c might be compelling.
Very neat. I can’t wait to see how this works in practice. This is much simpler than trying to implement big.small for x86.
A funny note, Apple did the same sort of thing with the RTL for their A10.
They devolved into different designs completely after, but it is amusing.
Android devices cores too went from bigger prime cores which were the same arch as the medium cores to the different X series.
@bulletpunch9317 Those separate arm cores have less instructions than the big core, which means lower performance, but PCs are different because PCs need more performance, so using the same architecture could increase power, but lowering die size to cram in more cores in the same package would give us some extra performance at the same power usage as the non-hybrid CPU.
@@aviatedviewssound4798 im talking about the prime and medium cores, not the big and little cores. The prime core on android flagships is like the standard pc 'big' core. The medium cores are like the 'small' cores. The little cores on android are too weak for pc, they're like the old atoms. Qualcomms latest laptop soc only has the prime and medium cores, no little core. As for apple their little cores perform similar to androids medium core.
Oh, I didn't know that. Thanks for the info!
I still love the simplicity of the intel 9700k. 8 fast cores with no hyperthreading. A gaming value masterpiece we may never see again. With hyperthreading and different speed cores you can run into situations where the high priority processing takes a back seat on a slower core or competes with another thread for the same core's resources. It doesn't always happen but with the 9700k you take no chances and get a discount for doing so.
I love random comments like this from one computing perspective or another, I don't game but I do a lot of virtualization and one ancient arch that held its own for many years was Haswell, 10 years old now.. just pulled my last one out of active service this week
Unfortunately everything you described happens on any multicore cpu. You can still have the OS assign threads suboptimally and multiple processes compete for the same resource. Hyper threading doesn't really make a difference for that and the new slow fast cores just add a bit more complexity in the priority of how process thread transfers get handled.
The best way to dodge these issues is to manually assign a higher priority to the program you really care about in Task Manager. Because then the Windows scheduling will never kick it behind something else.
@@SquintyGears Well what about a game's own threads competing with each other thus reducing the performance of the thread that creates the main bottleneck? The rendering thread or main thread for example. This was the kind of case I had in mind when I wrote my comment and it is evidenced by many hyperthreading on vs off benchmarks. In these cases the application priority was identical and hyperthreading was tested on and off and showed a difference. I understand in an ideal world the game engine itself would handle this properly with the scheduler but real world data shows that it does not always. Setting the application priority addresses only part of the problem.
@@maxfmfdm that's just not really how that works... Within a single program like a game, there are tons of conditions for different parts of the logic to operate in parallel or in sequence. But since it all came from the same program you can generally say they play nice with each other.
The problems come when other processes that have to run too, get run in between game tasks and end up making the game wait for a bit more than it expected to.
BUT this only shows up as a problem when you're running out of computational headroom.
If you're not limited, doing anything fancy like changing hidden settings will net you 1% or 2% improvements and that's often within run to run variance of just background tasks randomly starting or not (because windows does a lot of things)...
It is true that the original Hyper threading implementation in the first and second gen core i series chips showed significant differences in a non negligible proportion of software. It was both a programmer and intel problem. And games were one of the most affected types of software.
But that's not true anymore... Programmers don't use the same threading libraries and the scheduler logic for what gets priority and where it gets assigned has gotten more clever. Same thing with the way HT runs the 2 threads on the same core, the architectures now are much better at occupying the empty cycles.
Bottom line... The 9700k is a beast because it's the first regular desktop 8 core intel made so it was head and shoulders above its predecessor. The reason it doesn't have HT was only for market segmentation because they introduced the i9 (which is a marketing construct they did to increase the potential selling price, i7 is still 300$ but you could start spending 500$ now) because HT generally only brings you 40% extra performance in ideal conditions but it hasn't been a source of loss in a very long time. People test it again on every new platform.
@@SquintyGears um nope... just because they are all part of the same program does NOT mean they play nice together in regards to these issues. I'm sorry but your first comment was a bit off but this claim is WAY off. I worked as a software engineer for a game developer and your claims are just outrageous at this point.
A "Van Gogh 2" with Zen 5c would be kinda amazing.
The more intricate the design is when it comes to big.little (heterogeneous) design the slower it will be. Homogeneous design will always be faster because there is no latency/waiting time wasted for the system to decide where to put the workload, and if the workload gets assigned to the lower performing cores you lose perf.
The interesting question is why does AMD include regular Zen 4 cores on P2? With server CPUs it is clear that they do not need the regular cores in the first place. I get that it's probably for ST/MT workload optimization but shouldn't AMD prioritize for a smaller area? This also challenges the narrative that what AMD does is "big little". They don't have to go together.
Regular Zen 4 cores for single threaded workloads, as they clock higher, probably.
Zen 4c cannot go much beyond 3ghz. This makes them great as efficient cores, but peak single threaded speed would be a lot worse without the "big" cores.
@@b127_1 I've heard it can go past 3.2 GHz just fine, though with a higher voltage than Zen 4. I get that you can get up to 25% slower on Zen 4c w/ standard configs. Though, I reckon with proper optimizations the gap can be narrowed down to 15%. For mid range laptops AMD can still get away with not using Zen 4 (or using just one). After all, you are dealing with 15 W or less TDP. It's more flexible than other solutions at the cost of less area savings. So I still don't think that calling this approach "big little" is the right one.
@@b127_1 zen4 cores are already super efficient. I turned off boosting, limited to max 2.7GHz, and my 6850u is absolutely nailing every task I throw at it, with about 7-8h of normal usage (over 12h in "just reading and writing text" mode). All with 40ºC with no fan, unless playing a video or using external monitor. So I personally wouldn't mind all cores to be 4c.
@@EntekCoffee They should include at least one P core in anything but the cheapest mobile designs, because of Amdahl's law.
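The Amdahl's law argument can be sketched numerically. The classic formula is speedup = 1 / ((1 - p) + p / n) for a parallel fraction p on n cores. The second helper below is a simplified extension, not part of the classic law: it assumes the serial part runs on the single fastest core and the parallel part is spread over the combined throughput of all cores. The 0.75 relative speed for a dense core is taken from the "up to 25% slower" figure mentioned in this thread and is an assumption, not a benchmark.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Classic Amdahl's law for n identical cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

def relative_runtime(parallel_fraction, fastest_core_speed, total_throughput):
    """Runtime relative to one full-speed core (lower is better).

    Assumes the serial part runs on the fastest core and the parallel
    part scales perfectly across the summed throughput of all cores.
    """
    serial = 1.0 - parallel_fraction
    return serial / fastest_core_speed + parallel_fraction / total_throughput

P = 0.9          # hypothetical workload: 90% parallel
C_SPEED = 0.75   # assumed speed of a dense core relative to a big core

# Eight dense cores versus one big core plus seven dense cores.
all_dense = relative_runtime(P, C_SPEED, 8 * C_SPEED)
one_big = relative_runtime(P, 1.0, 1.0 + 7 * C_SPEED)

print("8 dense cores:        ", round(all_dense, 4))
print("1 big + 7 dense cores:", round(one_big, 4))
```

Under these assumptions the single big core shortens the serial part enough to win overall, which is the point being made about keeping at least one P core.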
Ryzen 8950XE with 8 Zen5 cores and 16 Zen5c cores would be rad. Let's say Zen 5 is 17% IPC increase and 3% clock speed increase over Zen 4, and a Zen 5c core is only 15% slower than a Zen 5 core. That would make a Ryzen 8950XE ~60% faster in multi-core workloads than 7950x. Or measured in Cinebench R23 score, that would be ~61,300 points vs ~38,500 points for 7950x.
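The arithmetic behind that estimate can be checked with a few lines. This is only a sketch using the numbers assumed in the comment above (17% IPC, 3% clock, dense core 15% slower, 16 Zen 4 cores as the baseline). It treats throughput as a plain sum of per-core speeds and ignores power limits, memory bandwidth and all-core clock drops. The "8950XE" part itself is hypothetical.

```python
IPC_GAIN = 1.17        # assumed Zen 5 IPC relative to Zen 4
CLOCK_GAIN = 1.03      # assumed Zen 5 clock relative to Zen 4
DENSE_FACTOR = 0.85    # assumed Zen 5c speed relative to Zen 5

zen4_core = 1.0
zen5_core = zen4_core * IPC_GAIN * CLOCK_GAIN
zen5c_core = zen5_core * DENSE_FACTOR

baseline_16_zen4 = 16 * zen4_core
hybrid_8_plus_16 = 8 * zen5_core + 16 * zen5c_core

ratio = hybrid_8_plus_16 / baseline_16_zen4
print("relative multi-core throughput:", round(ratio, 3))

# Scaling the quoted ~38,500 point baseline by the same ratio.
print("scaled score:", round(38500 * ratio))
```

With these inputs the ratio lands a little above 1.6, so the "~60% faster" figure is in the right range for the stated assumptions.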
The idea is the MT cores aren't designed for the boost frequency but can operate at the power efficient frequency limit.
So under heavy load those Zen4c cores might be no slower, as the system operates at a base frequency.
The Zen4 cores have to dial frequency back to share the power/thermal budget.
But why is this so important? What is the 'consumer platform' type of user doing where simply having 16 full speed big cores wouldn't still be more than enough for multithread? I had the same issue with Raptor Lake. Sure, it's technically added a fair chunk of multithread performance, but few people buying those are actually taking meaningful advantage of that versus just having 8P + 8E.
@@maynardburger Some productivity users, use desktop rigs and like Raptor Lake plus the hardware quicksync acceleration.
Many users think moooaaaaahhhrrrrrrrrrr cores are better, so pick Intel despite the problems cooling a lot of watts for little performance gain.
Or, iuno, they can just use 16 Zen 5 cores so people aren't getting gimped cores on half of their CPU. The big little thing only works if it leads to *more* threads, the same amount with less power on half the cores would be devastating.
@@joemarais7683 The Zen 5c cores aren't really gimped. They have multi-threading and support all of the instructions that the full Zen 5 cores support. They are just clocked a bit lower. Besides, you can fit double the amount (16 cores) on one chiplet instead of 8. So I don't see how it's going to be devastating. If you are talking about games, are there any that really need more than 8 very fast cores? Either way, with Ryzen gaming is generally done on one chiplet to avoid the cross chiplet communication latency. So it doesn't really matter if the 2nd chiplet has Zen 5 or Zen 5c cores since you'll always want to keep the game running on the 1st chiplet.
Great analysis. Definitely made me more interested in Phoenix 2 and the upcoming Zen5c cores. I do agree that perhaps the phoenix little nomenclature would have been more accurate rather than Phoenix 2, it's still a pretty fascinating chip.
Wendell from Level1Techs had an interesting idea of having an 8 core design combining both Zen 4C and Zen 4 cores.
Because you save some power consumption on the Zen 4C cores you can budget the rest of the power towards the Zen 4 cores to much higher clock speeds.
Another advantage is since the IP is the same between the core types the only meaningful difference is the clock speeds.
So thread scheduling is not going to be too much of a problem since Windows already assigns most applications via clock speeds anyway.
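The "just pick the fastest clocking core" idea can be sketched as a toy placement rule. This is not how the Windows scheduler is actually implemented; the Core class, the pick_core helper and the clock values are all hypothetical and only illustrate why same-IPC cores with different clocks are easy to rank.

```python
from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    max_mhz: int
    busy: bool = False

def pick_core(cores):
    """Return the idle core with the highest max clock, or None if all are busy."""
    idle = [c for c in cores if not c.busy]
    if not idle:
        return None
    # With identical IPC, max clock alone ranks the cores by speed.
    return max(idle, key=lambda c: c.max_mhz)

# Hypothetical 2 + 4 layout: two full cores and four dense cores.
cores = [Core(0, 5100), Core(1, 5100)] + [Core(i, 3500) for i in range(2, 6)]

for task in ("game main thread", "render thread", "background updater"):
    chosen = pick_core(cores)
    chosen.busy = True
    print(task, "-> core", chosen.core_id, "at", chosen.max_mhz, "MHz")
```

The first two tasks land on the high clocking cores and later ones spill onto the dense cores, with no need to know about differing instruction sets or IPC.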
@@kingkrrrraaaaaaaaaaaaaaaaa4527 hopefully it will work as easily as that, unlike Intel's, which still doesn't work and just assigns random cores when idling. I just checked and the only cores that are not in use on my laptop are the e-cores, like you would think Windows would run on the e-cores by default by now
@@besher532 I heavily suspect this is a pure governor setting issue (performance v/s efficiency mode). ARM solved the scheduler part a long time ago on Linux so Intel+Windows probably solved that as well.
Area efficient cores will be revolutionary especially for handhelds like steam deck.
This is definitely the superior way to do hybrid arch vs the way Intel is doing it. I've shied away from Intel 12th and 13th gen cpus with big and little cores because I simply don't trust the scheduler no matter what they say about "thread director". As far as gaming goes the scheduler should have no trouble with AMD's hybrid arch because the "world thread" so to speak would always be scheduled on one of your higher clocked cores, but the lower clocked cores due to them being the same IPC would still be useful for gaming. I think I'll hold my 5900x until AMD starts dropping these hybrids on consumer desktop. Also by the time AMD brings this to consumer desktop hopefully they've further optimized their infinity fabric to work with high end ddr5 in 1:1 mode to reach monolithic latency levels as Intel does.
Thank you for the great explanation of the truly revolutionary 4c cores!
So curious as to the efficacy and efficiency of high density next gen, such as zen5 implementations. Hoping that they make high fidelity, high resolution capabilities much more accessible globally.
Zen 5c might be even more impressive if they learned something from Zen 4c.
The same cores are still being used as big/medium cores in multiple arm designs from qualcomm or mediatek. Like the sd855/865 had a high clocked a76/a77 core and 3 lower clocked a76/a77 cores. The higher clocked core is physically larger too.
Is a Cortex-X1 the same as a Cortex-A77 with higher clocks?
@@rightwingsafetysquad9872 no, you can read anandtech for the details.
I'm curious how this Zen4c design compares to for example Intel's E-cores in terms of performance per watt. The whole point of bigLITTLE was that minor and background tasks could be executed on the smaller and less complex core instead of spinning up a whole "proper" core, thus saving power. But if AMD's c-cores are just "normal core but more compact", would that even end up saving any power besides the slight difference due to clockspeed differences? So far it just looks great for the datacenter world where you can pack a bunch more c-cores onto a die than normal ones. But what about laptops and other varying power devices that strongly benefits from proper bigLITTLE?
We're still here waiting for software/scheduler to utilize our current cores.. I am not sure how more cores would have any performance gains, although the efficiency is there.
Honestly surprised this channel doesn't have 20x the subs, some great content here.
What problems are you experiencing with thread scheduling and what CPU are you using?
If it's AMD you're referring to, then that's really only a thing for games and dual CCD CPUs. The bulk of how it works now, would work even better with C cores. Single and lightly threaded tasks just go to the highest clocking cores. Games just happen to be a special use case for X3D where you typically want to reverse that when you have a X3D chiplet.
You'll hardly ever run into any other consumer workload where you'd actually want/need to pick the cache cores. Most of what else could potentially benefit from the extra cache would just as likely be capable of using all threads on the CPU. That's where more threads provide more performance. Either highly threaded tasks that can max the CPU, or multi-tasking with a collection of light to mildly intensive tasks. Wanting to specifically default to the C cores would be mostly a concern for mobile, which is really a matter of having the OS switch behaviors when on battery.
There are game engines capable of using 12-16 threads, but are currently scaling back to not have the game peg the CPU at 100%. The 2.0 CP2077 update, for example will use 50% of a 16c/32t CPU, but 80% of a 8c/16t one. I've seen Star Citizen using as much as 19 threads though. If it starts becoming more common for games to push past 16 threads, Intel's E cores in current CPUs will be worse off.
I still prefer the simplicity of more cores working in parallel with high clock speeds and multi threading over hybrid designs since i don't have to worry about not having enough cores, or if drm may crash my game because the tasks get moved between cores
Should it not be possible to start optimizing the sub parts of the architecture for vanilla and dense in the future? For example giving the vanilla core a more elaborate branch prediction or a bigger uop cache. So it would still be the same architecture but with a tweak or two besides clockspeed, cache and density. Might be a cost efficient way to optimize even better into two directions
As soon as the architecture is changed, you're working with different cores. Making a design more compact is vastly different than making wholly different functionality.
I think the most difference we'd see in architecture in a single AMD APU chip is using Zen 5 with Zen 4c+ (rescaled on a smaller node)-- basically having a slight size-area advantage while reusing an existing design. So basically Zen N + Zen N-1c+ to be able to launch more quickly unless there's a major architecture difference between generations. I could see this happening if there were problems bringing up Zen 5c in a reasonable time or if they wanted to lead out-of-the-gate with an APU that was on the new process, regular Zen 5 was ready, and wanted to make a powerful lower-cost chip off the bat to secure marketshare. It would still work better out of the box than Intel's P&E cores because that requires a lot of magic on the programming and scheduling side.
Porting Zen Nc to a significantly different process used by Zen N++ would be much of the same work of validating Zen N++.
APUs & server follow the initial desktop launch, used to introduce new designs.
Delaying V-cache & compact Zen chiplets is more about managing releases and demand, launching the simpler product first.
Eh zen5 and zen5c should be out around the same time so I don't think that will be an issue
The holy grail is monolithic BIG-little with 3D V-cache.
Undervolting existing AMD designs gives tons of efficiency. So this should work fine.
No need to reinvent your wheel when you can just optimize and shrink the one you already make.
The biggest advantage of the C cores is while it lacks L3 you can just pop v cache on it and make up the difference and then some. You wouldn't find that in the highest performance chips but I could see that for all APUs going forward considering die space is so important.
For high end desktop (gaming) I could see ryzen 8000 or maybe 9000 having a block of full sized cores with v cache and then an even lower clocked block of C cores for all background tasks or anything that the OS thinks wouldn't be affected by the chiplet to chiplet communication penalty.
Really AMD should just come out with APUs for the low end, mid range and high end gaming chips with v cache, and then have their highest core count CPUs just use C cores as productivity (generally) cares more about cores than it does clock speed. Not sure how much that would eat into threadripper sales but considering they haven't come out with those for a few generations I don't think they would care too much.
a more mixed design would be more flexible though. I do both gaming and productivity, sometimes with integrated, sometimes with a dedicated gpu connected to my laptop. 4 zen5 cores with v-cache and 6 zen5c cores would be bloody fantastic for a laptop.
@@raute2687 You can say you do productivity work on a laptop but let's be real, any person that does it for work will have a desktop. Which is why they should segment the lineup for entry level APU, mid and high end gaming, and then top of the line productivity where more cores will actually help performance.
Honestly this could revolutionize the flexibility of Chiplet design that AMD does on Desktop without needing to incur as much CCD Latency issues like on the X900+ SKUs currently.
Instead of needing 2 CCDs to break past 8 Cores. You could have a CCD by default come configured with 8 Full and 8 Compact Cores with 3DVC.
So something like
8950X = 32 Cores (16 Full, 16 Compact, 2 CCDs)
8900X = 24 Cores (12 Full, 12 Compact, 2 CCDs)
8700X = 16 Cores (8 Full, 8 Compact, 1 CCD)
8600X = 12 Cores (6 Full, 6 Compact, 1 CCD)
8400X = 8 Cores (4 Full, 4 Compact, 1 CCD)
Then You can have something like an 8500 with some of the C or Full cores disabled to get 10 Cores or something.
The scalability becomes far higher with that design.
Or heck, why not a pure C core CCD with even more 3DVC than the standard CCD?
Correct me if I'm wrong, but AMD won't be putting 16 full Zen 5 cores in a single CCD with Ryzen 8000. I think this will be coming with Zen 6/Ryzen 9000. 16 compact cores yes, this will be possible from what I've heard.
CCD size/density would need to double in a single generation to have 16 Zen cores in one CCD,
the rumours are talking about 4+8c instead of 8+8c for laptops
@@YounesLayachi This is 16 Zen "Compact" cores as an option for high-end CCDs. Not full-sized cores.
@@Alovon ah, when you wrote
8950X= 32 Cores (16 Full, 16 Compact, 2 CCDs)
I understood a 16 Zen CCD + 16 ZenC CCD.
I see now that you meant 8+8c + 8+8c,
Silly me, heterogeneous CCDs are a dumb idea xD the 2 CCDs will be identical , aside from any stacked cache on top...
unless...
@@YounesLayachi No, I meant 2 ZenC CCDs with extra 3DVC ontop.
This design is interesting for sure, and it brings up some interesting ideas for future AMD chips, including how Little Phoenix may be cut down. A quad-core Ryzen 3 "7340" variant could be 2+2, while a lower-end "7240" might be 1+4, and then some Athlon "7140" is 0+4.
The zen4c cores aren't so much slower than the regular ones that it would make an Athlon.
Think about them as if they were maxed at 3.8GHz instead of 5.3GHz
So they'll actually perform closer to a Zen3 going at its full clocks at equal core counts.
The Athlons are much more cut down.
@@SquintyGears I'm well aware of how neutered the Athlons are. I only bring that name into it as AMD doesn't have anything else below the Ryzen 3 name in the x1xx range.
I think they are starting to separate the qualitative compute concerns to clamp down on efficiency and leverage electrical properties that are becoming better understood at smaller scales. I find myself shopping for efficiency rather than ultimate performance, this way solar and renewables are even more attractive.
Would be interesting to see some low end Zen 4c only APUs to take on Intel's Celeron N series processors. A dual/quad core Zen 4c APU would wipe the floor with Intel's Celeron N series processors
For my application, an asymmetrical design works provided i have at least two threads running 4ghz … the other threads can be much slower 2ghz
Having the C and non-C cores have the exact same instruction set but with different frequencies/thermal budgets is brilliant! That means no AVX-512 bullshit like Intel has, or problems with certain instructions being significantly slower on the C cores vs their alternatives with regards to the non-C cores. This allows compilers, operating systems and applications to optimize for AMD much more thoroughly.
This worked out for AMD the last time they tried it; albeit the goal then was to drive power consumption down on their Excavator cores (my memory says this was for the Carrizo lineup maybe?), rather than to create anything like a hybrid CPU cluster.
More cores at higher clocks has a benefit for mathematical compute but for desktop applications a single threaded CPU that is faster is better. What AMD have done is provide a really good compromise. As more applications become multi-threaded their performance can be offloaded into more cores giving a performance boost. However, this means that applications will need to be compiled correctly for this architecture.
The problem with this, which Intel is seeing now, is scheduling. In most cases, and especially for gaming, there can be massive latency and scheduling problems.
Stop the latency hysteria and look at a core to core latency chart. Ecores talking to the same cluster already have less than two thirds the latency of cross ccd cores on Zen 4, when it's ecores from different clusters talking it's even more advantageous for them at nearly half and when it's a Pcore to an Ecore and vice versa it's only like 20% worse than only Pcores talking to each other. Ecores aren't being used because games literally don't need more cores, that's it. The AMD fanboys talking about "massive latency problems" are the same clowns that bought a 5900x and 3900x for purely gaming. 😂
@@Frozoken 🤓
@@Razzbow really got me there you clown
@@Razzbow 🤡
High core counts are also constrained by memory bandwidth. However, this can be partly mitigated with a larger cache. I would think a 32 core AM5 desktop could be made practical if: a) it also included 3D V-Cache and b) it used fast DDR5. More RAM channels would be ideal, but even without them, I think 32 cores could be fed this way. Perhaps an all Zen4c or Zen5c desktop part with 32 cores would be a multi thread killer. It would still lag in games and the like due to lower clocks but I would love one on my desk.
We're moving to CXL and HBM. I think DDR is seeing its last days.
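For a feel of the bandwidth constraint, here is a back-of-the-envelope sketch. It assumes a 64-bit (8 byte) data path per memory channel and uses DDR5-6000 in a two channel setup as an example configuration. These are theoretical peak figures, and real sustained bandwidth is lower.

```python
BYTES_PER_TRANSFER = 8  # assumes a 64-bit data path per channel

def peak_bandwidth_gbs(mega_transfers_per_s, channels):
    """Theoretical peak memory bandwidth in GB/s."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000

def bandwidth_per_core_gbs(mega_transfers_per_s, channels, n_cores):
    """Peak bandwidth divided evenly across cores."""
    return peak_bandwidth_gbs(mega_transfers_per_s, channels) / n_cores

# Example configuration (assumed): DDR5-6000, two channels.
total = peak_bandwidth_gbs(6000, 2)
print("peak:", total, "GB/s")

for n_cores in (8, 16, 32):
    print(n_cores, "cores:", bandwidth_per_core_gbs(6000, 2, n_cores), "GB/s per core")
```

Doubling the core count on the same two channels halves the share each core gets, which is why a larger cache or more channels matters for such a part.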
Hi, I would really be interested in your opinion on Apple's new A17 Pro microprocessor. Perhaps you could make a video about it?
Geekerwan has a nice one
@@bulletpunch9317 Thanks, I will check it out.
I'm super interested in the A17 Pro, since it's the first N3B chip so far. But I need something tangible to make a video, like a good die shot...
We will have to see in benchmarks, in multiple uses, what it goes.
I know that the Intel P+E design sounded naf to many, but given the right situation, aka a laptop or small device, they are kinda nice. Lower heat, longer battery life.
And yes I know windows 10 needed some patching to get it right. I know.
I tried out more cores with higher clocks. Worked well in games that utilized it but the majority of games were imo unplayable.
Great stuff! I remember seeing some annotations of Arm designs that implied they also use this method of making the chip less blocky, with several parts of the chip sharing chunks of the chip and making it impossible to clearly indicate where specific things are. Is that the case? Is that mostly a result of using high density cells, or do they also use some kind of automation for laying out the floor plan?
They use automated layout on both cores. It is just that the big core is optimized for clock speed, so the critical wires must be very short, while the "c" core is optimized for area, so the wires are allowed to be longer. This frees up the transistor layout to fill the gaps between functional blocks, bringing the logic blocks closer to each other.
AMD also named the small Zen APU as Raven 2. Later it held a name I can't recall but driver code still refers to it that way.
This is actually a great way to go, especially on laptops. 2B + 8l would probably be thermally limited anyway but less than an all core equivalent.
For desktop 4B + whatever you can fit should also be enough for high performance designs. On server Genoa is already a huge win for higher perf through lower power per computation.
The hard part will be fixing the Windows Scheduler. On Linux it is already fixed by ARM long ago.
This is also a different design from Intel, as here the small cores also have the same connection to the memory controller and L3 cache. On Intel there is some latency penalty from them being clustered in groups of four.
I think people will be very surprised by how well this works and will stop thinking of small cores as a bad thing.
Some ARM SoCs have been doing the first half of this for over a decade.
The Tegra 3 is the first one I remember, with four cores on the high-performance cell library and one core on the low-power cell library. It beat ARM's big.LITTLE to the market by a year or two.
And it's somewhat common on low-end SoCs to have a fake big.LITTLE configuration that's actually just two clusters of little cores optimised for different clock speeds. I assume those SoCs used different cell libraries.
AMD's main innovation here is taking the extra step to really optimise for area with the more careful layout.
I'd like to see a CPU with x86, ARM and FPGA integrated on a single CPU. I'm just weird like that.
I do actually think that Intel's move back towards single-threaded cores makes a lot of sense regardless of architecture: the better your scheduler and prefetch units are, the less extra juice you can squeeze out of the ALUs and FPUs. Add to that the fact that cross-thread vulnerabilities are almost impossible on single-thread designs, as well as higher densities, and those E-cores are looking mighty fine. AMD's focus on instruction compatibility also makes a lot of sense given how they haven't been affected by vulnerabilities and node restrictions as much. It wouldn't surprise me if the future lands somewhere in between: single-threaded cores using the same ISA but min-maxed for speed/energy consumption.
I'm still waiting for 32gb HBM3 integrated in the CPU
What do you think about an idea of Ryzen with 32 Zen5c cores + 3D V-Cache?
I’m not sure Zen5c even has the TSVs required for the 3D V-Cache.
@@HighYield Ok, but what about 32 Zen5c cores on Ryzen chip? Will they be faster than 16 Zen5 cores with higher frequency for multithreading tasks like Cinebench?
As a designer for the SBP9900 at TI, I understand the difference between the P and E cores. I also understand the Ridiculous Instruction Set Computer (RISC) concept of ARM. But who said the CPU should be "symmetrical"? I worked on a stack-based direct execution computer. I examined the WD Pascal Engine and the Novix Forth hardware engine. Semantic design means that there is no "symmetry".
If they want high performance, they should abandon inherently defective CMOS and switch to resonant tunnel diode threshold logic. "Engineers" at TI had a hard enough time trying to use I²L circuits. I had to create an I²L library of equivalent circuits for the standard TTL ICs. q-nary logic, self-timed circuits, universal logic modules, and semantic design were totally beyond them.
If both Zen 4 and Zen 4c cores use the same RTL, instruction set and IPC, does that mean scheduling in the OS could be easier? I remember hearing that Intel's big-little architecture had some inefficiencies because Windows wasn't able to schedule tasks like games efficiently.
Exactly. The OS biases workloads towards the highest clocking cores, and fills up from there.
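A toy sketch of that "highest clock first, then fill up" idea, just to make it concrete. The `Core` type, the `pick_cores` helper and the clock numbers are all made up for illustration. This is not how Windows or Linux actually implement it, only the ordering rule described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    core_id: int
    kind: str      # "zen4" or "zen4c" (labels are illustrative)
    max_mhz: int   # hypothetical boost clock, not a measured value


def pick_cores(cores: list[Core], n_threads: int) -> list[Core]:
    """Return the cores a clock-biased scheduler would fill first.

    Cores are ranked by maximum clock (highest first); ties are broken
    by core id so the result is deterministic.
    """
    ranked = sorted(cores, key=lambda c: (-c.max_mhz, c.core_id))
    return ranked[:n_threads]


# Hypothetical 2 + 4 layout: two fast cores, four dense cores.
EXAMPLE_CORES = [
    Core(0, "zen4", 5000),
    Core(1, "zen4", 5000),
    Core(2, "zen4c", 3500),
    Core(3, "zen4c", 3500),
    Core(4, "zen4c", 3500),
    Core(5, "zen4c", 3500),
]

if __name__ == "__main__":
    for core in pick_cores(EXAMPLE_CORES, 3):
        print(core.core_id, core.kind, core.max_mhz)
```

Because both core types run the same instructions at the same IPC, clock speed is the only thing this ranking has to look at, which is why it stays so simple.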
I would love to see the test results of 12 core Zen 4c with unified 36MB memory
IMO, with Apple and Intel seemingly sticking to their "hybrid" cores performance-wise model for the foreseeable future, AMD >must< retain a mix or become irrelevant on the Desktop. ESPECIALLY if they wish to return to the consumer HEDT market. For the Server/Cloud side, a completely "c" cores CPU would be highly desirable though.
I wonder if AMD's use of the same RTL for both cores stems from the instruction set.
x86 is extremely complex. The decoder, accordingly, needs extra complexity, much of which is independent of the throughput needed, and much of which is always on regardless of what the core is doing. So it might lose efficiency if underutilized, and simplifying the decoder to just provide a lower throughput might not give all of that efficiency back. And if the decoder throughput isn't changing, the execution throughput can't be allowed to change either, which eliminates the main reason to change the RTL.
Miniaturizing the same logic also has a lot of advantages. If all the wires are shorter, that means less electrical resistance and less leakage, which means that the same logic can get higher power efficiency (and thus less heat). This partially compensates for the lost cooling from a smaller physical surface area. It also might allow for thinner wires to carry the same signals, enabling further area savings. Shorter wires also have less chance to affect each other, partially compensating for the fact that there are more of them.
ARM does not have this problem, since it's much easier to decode. As such, it's a lot easier to just cut off half the decoder and get half the throughput (though it's not necessarily trivial). There is also plenty of room for the same exact layout optimization, but it's not necessarily as obvious given that the best point of comparison has different logic.
And it sounds to me like Intel's solution to this problem is just to remove some instructions. This will drastically simplify the decoder, perhaps making it easier to actually reap the benefits of a simpler decoder.
why isn't x86 doing the M3 trick to build ram on board?
looking at bandwidth of lpddr4, I wonder HOW is that faster....
Awesome video man.
Cheers, recommended by Moore's Law is Dead
Thank you, I can also recommend Tom!
When optimizing for size and power efficiency, isn't there more to do than just moving the core elements around? I'm worried that the lower clocked zen4c cores can't compete with E-cores in power efficiency (and thus multithreaded performance within a fixed TDP envelope)
Don't worry, standard Zen 4 is already more efficient than Intel's P-core at lower voltage, and 4c just extends this lead.
@@mawkzin Yeah sure but the 4c cores need to be more power efficient than the 14th gen E-cores in instructions/watt. Obviously they'll be faster core-to-core but efficiency is important for multithreaded loads and handheld use.
I don't know how you do it to make a video about a topic where I know a lot and still make it incredibly interesting, I could bingewatch hours of it
More cores with less clock speed is definitely the wiser path. We are still very un-optimized on SW/ kernel level when it comes to true multithreading, still waiting on the real paradigm shift
A Zen 5 desktop CPU with one CCD of 8 normal Zen 5 cores with stacked L3$ and a second CCD with 16 Zen 5c cores would be an extremely interesting product. As long as they can handle the scheduling aspects to optimize for various workloads it would be a killer.
I'd say 6+6 would be an interesting, cost-effective but powerful system for gaming. Most games don't need that many power cores but might still be able to utilize many cores to some extent. 4+12 as the black sheep option 😅
I really hope they do it, otherwise I'll go with Intel. I'm not buying that hyper-expensive Threadripper Pro anytime soon
There really isn't as much concern for scheduling in that scenario. Games are the sole outlier to how the OS wants to handle multi-CCD CPUs. By default it wants to put tasks on the highest clocking cores first, but while the stacked cores are clocked slower, they are technically superior for most games. So the workaround is a whitelist method. Some future core-dense CCD will always be the slower CCD.
Pretty much every other workload is benefitted by higher clocks 1st, and then if it's heavily threaded it'll just include the other CCD. There won't really be enough other common workloads that would benefit from the cache and not use the whole CPU, nor would the dense CCD be faster in isolation.
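Roughly what that whitelist workaround looks like as logic, as a minimal sketch. The core numbering (cores 0-7 on the cache CCD, 8-15 on the higher-clocking CCD), the process names and the `preferred_core_order` helper are all assumptions for illustration. The real driver and OS mechanism is more involved than a set lookup.

```python
# Assumed layout for a two-CCD part: one CCD with stacked cache,
# one CCD that clocks higher. The numbering is hypothetical.
CACHE_CCD = list(range(0, 8))
FREQ_CCD = list(range(8, 16))

# Hypothetical whitelist of processes known to prefer the extra cache.
GAME_WHITELIST = frozenset({"game.exe", "othergame.exe"})


def preferred_core_order(process_name: str,
                         whitelist: frozenset[str] = GAME_WHITELIST) -> list[int]:
    """Return core ids in the order they should be filled.

    Default: highest clocking CCD first, then spill to the other CCD.
    Whitelisted processes (games): cache CCD first instead.
    """
    if process_name.lower() in whitelist:
        return CACHE_CCD + FREQ_CCD
    return FREQ_CCD + CACHE_CCD


if __name__ == "__main__":
    print(preferred_core_order("Game.exe")[:4])      # starts on the cache CCD
    print(preferred_core_order("compiler.exe")[:4])  # starts on the fast CCD
```

With a dense CCD instead of a cache CCD the whitelist goes away, since the dense CCD is slower for everything and the default "fastest first" order is already right.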
If AMD had the will to do it, they could probably launch that right now with an 8 core Zen4+3DVcache chiplet & a 16 core Zen4C chiplet. There's no obvious technical limitation making it impossible to my knowledge? and I wouldn't be surprised if they had something akin to that running in the labs right now (in the same way that 5950X3D's exist internally at AMD)
I have seen some tools being developed for the issues that having 2 different chiplets can cause.
If you happen to have a 7900X3D or 7950X3D, I would highly suggest that you check out a program called Process Lasso. It's something that I believe AMD should have made themselves and released with those CPUs, so you don't end up with the issues some had, for example some games running worse than on a 7700X (e.g. CS:GO) or a 7800X3D because they didn't run on the appropriate chiplet for max performance.
Well, it seems AMD has a mysterious "low power core" design in the works with the Zen5 family. And no, it's not the "dense"/Zen "c" core, it's a distinct new core (maybe in the IOD or stacked on top of it?)
So who knows but it seems Zen5 will have 3 core variants.
The previous Zen designs clock the L3 cache at the highest core clock frequency in the CCX. Has this changed? Are they using Z4 or Z4c cache slices?
Good question, I can't tell you. I think it's the same L3 cache they used in the big Phoenix, but I don't know about the clk speeds.
That is an interesting approach. We are deploying build servers at scale, so efficiency is kinda important for us. With the i9-13900K, having E-cores was essential to reduce build times by 40% compared to only P-cores at the same power consumption. For AMD we are using multiple peripherals to do the job and we need the CPU to be as homogeneous as possible. Also, Hyper-V in WS2022 was not happy with heterogeneous cores, but it could be satisfied by AMD. Time will tell.
When are zen4 DESKTOP APUs coming out?
A very interesting concept. I think it's the better way than P/E-cores, especially on Windows. The scheduler has enough problems assigning tasks to the right core. The thing I would like to see would be a desktop CPU with 8 Zen5 cores (maybe X3D) and 16 Zen 5c cores. The differences in clock speeds would guarantee that high performance tasks stay on the Zen5 cores.
That the ISA is the same on all cores is great, as heavy multithreaded workloads can profit from many cores regardless of type and clock speed.
I agree with @MikeGaruccio: more, smaller cores at lower clocks is the main idea behind RISC architectures, except that here the cores are actually CISC cores. More cores are the way to go for sure, especially when the cores are uncompromised except for clock speed. AMD is correct in their approach.
Great explanation of AMD's new approach to CPU design. Thank you!