respect the heck out of the multiple 'this is the boundary of my expertise' comments. in my experience when someone says that it just reaffirms that everything up to that point is trustworthy, or at least honest. it makes your speculation more interesting. not in the market for a laptop, but this tech space has been so cool to follow! i love where intel (and amd) is going.
Yup. But I am really glad we're seeing competition in this segment, and that companies are back to testing out new ideas and being bold with their designs instead of fearing the unknown and stagnating. Testing out new stuff and improving older stuff is never a bad thing.
Well, with regards to IPC anyway, AMD has been on the money with every iteration of gains they said they would get. Intel on the other hand... well, let's just say they have fallen short each and every time.
Intel still has to prove that they can do efficient compute (though this step closer to Apple's design might help). AMD still has to prove they can do efficient platform power, where Apple and Meteor Lake are miles ahead in true achievable battery life. We'll see when notebooks are out.
@@Sam-jx5zy Yeah, AMD has been making headway in power gating, but still isn't near Intel's level at it. As for Apple, well, that's all based on ARM, which is very efficient, but ARM is by its very nature simpler in how it processes requests and can't handle the long, complex instructions that x86 can. On the other hand they sip power, so pros and cons.
High Yield explaining chip layouts is like music! I just nerd-out. This man is so good at the niche he is in. I wish this would be as financially lucrative as the value of the knowledge he espouses!
It's insane that this channel went from small youtuber with sub-1000 views to being invited to international trade shows by one of the biggest chip makers in less than three years! As someone who's been here since the VTFET video, I am amazed but also not surprised, because the quality was there from the start. Congratulations, and all the best on your journey going forward; I hope it takes you far! 🍻
It seems a CPU architecture revolution is underway. The days of simply adding more cores are over. Intel and AMD are now innovating far more than they have in the last decade or two, ARM CPUs are reaching insane performance levels, and neural processing is becoming much more prevalent. We've reached a turning point. It reminds me of the innovation that occurred during the move from single-core processors with huge pipelines to multicore processors with reduced pipelines.
@@_EyeOfTheTiger A reduced instruction set is better, not worse, for performance per watt and compiled-code efficiency. A RISC instruction set is enough for any task complexity, and much of the modern workload is vectorized, so RISC vs CISC makes little difference there; it depends more on the vector extensions. Big, complex decoders just waste chip space and energy on CISC, and internally everything gets broken into micro-ops anyway (so Intel and AMD chips are really RISC chips with an on-chip CISC-to-RISC translation layer, ever since i686 times 🤷♂️😂)
The problem is, all of this innovation is only coming because we haven't been able to make gains by upping core counts and building dies on smaller process nodes, like you said; we're reaching a point where we simply can't extract more benefit from those methods. So after all of this restructuring and optimization of the chip is done, my layman opinion is that we're going to plateau. These sorts of organizational innovations can't happen forever; at a certain point you've reached peak efficiency for the tech you have available.
I think a lot of us would be surprised to know how many of these 'new' technologies were actually patented decades ago. It's just a matter of culture and economics I think. No?
I suspect the L0 naming scheme does a few things. It lets L2 keep the same name, since it has the same performance and size as the previous gen, and it lets L3 avoid being renamed L4, which might carry negative connotations in the media (plus the potential for direct comparisons to AMD).
It must also use virtual addresses like the L1, since the larger L1 still has to start physical address translation on a cache miss. So this small cache, for recent data that isn't in the register file, may save energy by (usually) avoiding that work.
@@mikeb3172 Cache coherency requires any cached data to be available to another core. But that requires physical addresses to obtain a read only or exclusive write to the memory cacheline. If L0 is read only with L1 able to invalidate entries then the simpler faster cache type fits. It sounds like a Jim Keller style idea that questions prevailing assumptions.
There was an Intel generation during the stagnation years that had an L4 cache. DDR4 wasn't ready yet and the cores were memory-starved, so they put some eDRAM cache on the package.
First found High Yield's channel 3 months back with the Zen 6 video and have become a fan ever since; I've watched lots of his previous videos as well. Great content!
This is the best explanation of these things I’ve seen so far! I also don’t fully understand everything but I feel like you made it really easy and enjoyable to follow in one video. Thank you!
Adamantine is a separate cache tile that goes between the base tile and active tiles, so it can’t be on Lunar Lake with only 3 tiles. It’s possible for it to be on Arrow Lake as the tile implementation isn’t revealed, but I am doubtful of that.
Interesting, the L0 pre-L1 may avoid work. L1 caches use process-specific virtual addresses, with an address translation needed in parallel to validate that the tag isn't a clash with data from another thread. (There are some great CPU engineering lecture videos on YouTube that explain how L1 operates.) The tiny pre-L1 shouldn't imply the real L1 is an L2 accessed by physical addresses shareable between processes, so a pre-L1 cache ends up named L0 to avoid confusion. Now for some speculation, and this could be why Lion Cove dropped HT: without HT, a cache using logical virtual addresses could make energy-saving simplifications. If it is entirely flushed on thread changes and not shared between threads, no logical-to-physical address translation validation seems necessary. The small size may mean it can be looked up fast enough to pre-filter L1, with misses going to L1 afterwards, or with the energy-inexpensive fast cases going to L1 simultaneously to complete faster, with later validation on an L0 miss. So perhaps the underlying truth of the leaks was that HT went away to allow, effectively, a cache of the register file and most-recently-used logical addresses accelerating L1.
The address translation is needed to figure out which cache line in the selected set holds the desired address (and all caches still store the full physical address each line is for, regardless of whether they're initially addressed virtually). The reason traditional L1s are virtually indexed is specifically to allow doing the translation (the TLB lookup) in parallel. The reason such L1s are so tiny is that they (ab)use the fact that the low 12 bits of physical and virtual addresses are the same (due to 4KB pages), and extend to 32KB or 48KB or whatever by just reading all 8 or 12 (i.e. the associativity) possible matches and selecting between them when the TLB result arrives. A 192KB virtually-indexed cache would imply reading an entire 48 possible cache lines (each being 64 bytes) on each access, which is utterly crazy. That said, assuming that L0 and L1 accesses aren't done in parallel, by the time the L0 concludes that it doesn't have the asked-for data, the TLB lookup will have finished anyway, and thus the L1 will be addressable physically with no additional delay, like a traditional L2.
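The associativity arithmetic above can be sketched in a few lines of Python (my own illustration, assuming the 4KB pages and 64-byte cache lines stated in the comment):

```python
# Why a virtually-indexed, physically-tagged (VIPT) L1 must stay small:
# only the untranslated page-offset bits can be used as the set index.
PAGE_OFFSET_BITS = 12  # 4 KB pages
LINE_BITS = 6          # 64-byte cache lines
SETS = 2 ** (PAGE_OFFSET_BITS - LINE_BITS)  # 64 sets indexable before the TLB answers

def required_ways(cache_bytes: int) -> int:
    """Associativity needed so the index never touches translated bits."""
    return cache_bytes // (SETS * 64)

print(required_ways(32 * 1024))   # 8-way, like a classic 32 KB L1
print(required_ways(48 * 1024))   # 12-way, like a 48 KB L1
print(required_ways(192 * 1024))  # 48-way: 48 candidate lines read per access
```

This reproduces the comment's numbers: 32KB and 48KB caches need the familiar 8 and 12 ways, while a 192KB virtually-indexed cache would need 48.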
@@dzaima The point is that in the L1 the virtual address can be looked up, with the physical address translation happening in parallel for validation, to ensure the data is from the right process. Two different physical cache lines can share the same logical page bits. You don't want the latency penalty of translating virtual addresses first, because that's slow. "Figuring out which cache line has the virtual address" is back to front: process virtual addresses are mapped to physical memory via address mapping. The question "what virtual address does this physical memory have" is meaningless, because it depends on which processes are sharing the memory page; you have a 1:n mapping. But the running process thread has a 1:1 translation. All the code I compiled tried to use relative addresses with relocatable code to minimise such problems.
@@RobBCactive Virtual to physical address mapping isn't a 1:1 translation even within one process - it can be n:1, as a process can map the same page to multiple locations in its own single virtual address space (and this is useful - see "Mesh: compacting memory manager"). Thus, addressing a cache by a full virtual address is impossible to do correctly without still having some physical mapping check somewhere.
@@dzaima just another reason to avoid the need for it, I think you are ignoring the possibility of a read only cache that writes through via L1 with its translation. Actual processing writes mostly to registers and then store operations. If you include all the L1 features what is the benefit of the L0 cache? The address translation isn't going to magically complete faster.
@@RobBCactive I'm saying that it'd be pretty reasonable for the Lion Cove L0 to function exactly like a traditional L1, and for its L1 to largely function like a traditional L2. Haswell (2013) has a 32KB L1 with 4-cycle latency and a 256KB L2 with 12-cycle latency, and it seems very possible to me that, with 10 years of process node improvements, similarly structured caches (with higher bandwidth, of course) map onto Lion Cove's L0 and L1; the difference then ends up being the modernly sized extra level before the very slow L3. I suppose it's possible that Lion Cove's L0 leaves write ops to L1, but that'd obviously result in a rather larger write latency (though perhaps that doesn't matter too much given store forwarding).
@@HighYield Aw, many more, High Yield. After all, I'm sure you've met many great people in the industry. More to come, I say! Also, will we get some Strix or even Turin content? Skymont seems very impressive. I feel like AMD is sitting on Zen 5c, whose IPC is on par with Zen 5; I'm saddened AMD didn't talk about it at all (perhaps at a future Hot Chips). They've left 8 Zen 5c cores for consumer and the rest for Turin (dense). From what I've heard it's also a unified CCX, so no split cache and much better latency (like going from Zen 2 to Zen 3); I don't know why they're sitting on the design. That said, with Turin dense the CCDs look massive, and I don't think it'll fit on AM5. I'm really interested to know why the Zen 5c CCDs look larger than Bergamo's Zen 4c CCDs. My guess is it comes down to having 12 CCDs instead of Bergamo's 8. Could it be more GMI links, to fit more CCDs on the package? Is that why it's bigger? Could a 12-CCD Zen 5c part fit onto the AM5 socket?
@@PKperformanceEU There is no way Intel will reach the M4 Max that quickly. Intel is good, but the last few years haven't been kind to them. Or the last ten, at that.
Intel used to make great ARM chips in their XScale series, up until their Atom SoC push in mobile, and they still hold an ARM architecture license. AMD also based their first competitive x86 products on their Am29000 RISC architecture. x86 (or more accurately AMD64) is just a layer of backwards compatibility and nothing more.
They just directly copied the ARM SoC approach to x86, but, just like for Qualcomm, it took 4 years just to copy the M1, so those bullets come out too slowly. Also, Apple can scale their chips to desktop level, but look at the Snapdragon X Elite: if you increase the power consumption by 250% (from 23W to 80W), the extra performance gained is just 10%. So good luck making a desktop chip with that. The chip itself doesn't mean a lot if it's limited to laptops, since the laptop market is only a small part of the PC market.
@@PaulSpades If it means backwards compatibility without emulation... then I'm not buying a Mac in everything but name. And if it costs like current ARM solutions, it would be reasonable even as the core of an actual PC. But the lack of RAM expandability is still a bit meh.
Look at you now! This is crazy, I remember watching your videos when you had 700 subscribers, and now you're getting invited to these events. Congrats!
It looks weird that the media and display engines are separated. They could swap the display engine and the 8MB side cache, since the media engine does need some cache (though not 8MB).
Will you do a video on Zen 5 and 5c? I'm interested in the capabilities of Zen 5c versus Lunar Lake's E-cores, and how the different paths they took are paying off now.
I'm seriously considering an LNL mini PC as an upgrade from my current 5600G mini PC and 7220U laptop. I feel like this thing can do it all with much lower power and heat output (which will make it more portable than my current AM4 mini PC). 32 gigs is seriously enough for me, since that's where I'm at right now, and the heaviest things I run are probably just War Thunder and RPCS3.
That is what they said 50 years ago about 640kB. You have no idea how much memory we might need 5 or 10 years from now. Maybe even only 2 or 3 years from now.
@@AutieTortie In Windows 7 times, 2GB of RAM was required (8GB for the best experience). Now at least 8GB is required for daily tasks and games that aren't under-engineered. So RAM needs grew 2-4x in 14 years, I'd say. But a 32GB max... I'd say that may only fall short for hardcore professional 3D artists / music producers, but who knows.
Allow a user to add more RAM and use the "onboard" RAM as cache, or allow users to replace the SoC like they do for desktops. That, or CXL 3.1 can access more RAM via a PCIe-enabled port and device.
There are already plenty of mobile use cases that don't need massive compute power but do need more than 32GB of RAM. It's an understandable compromise at times but it would be nice if there were more memory options.
Just came across this video of yours. With the available information, you did awesome work. I'm also looking forward to real-world performance once Lunar Lake is released, and likewise for Arrow Lake. Time has seen the Ultra series take over from the H-series chips of 6 years ago. Interesting times. Videos like this not only educate, they can also be useful for purchase decisions. Subscribed 👍
I think we'll have to wait for the release of Lunar Lake laptops and the benchmark scores, but if you simply multiply the graphics scores of Meteor Lake-H's 3DMark Time Spy and Fire Strike runs by 1.5, you get TS: 5250, FS: 13800. In desktop GPU terms, that's close to GTX 1660 performance. In the country where I live, several articles say it's 50% better in performance than Meteor Lake-U, but if you multiply Meteor Lake-U's GPU performance by 1.5, you just get Meteor Lake-H's GPU performance. On a different note, is the presence or absence of hyperthreading related to the high single-thread performance of Apple silicon?
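The scaling arithmetic above can be sketched quickly; note the Meteor Lake-H baseline scores here are my own assumption, back-derived from the 1.5x results quoted, not measured numbers:

```python
# Rough sketch of the 1.5x GPU scaling estimate. Baseline scores
# (Time Spy ~3500, Fire Strike ~9200) are assumed/back-derived,
# treat them as illustrative only.
baseline = {"Time Spy": 3500, "Fire Strike": 9200}
scaled = {test: int(score * 1.5) for test, score in baseline.items()}
print(scaled)  # {'Time Spy': 5250, 'Fire Strike': 13800}
```

The same multiply-by-1.5 logic is what the comment applies to the Meteor Lake-U vs H comparison.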
Good job @Highyield, love your detailed breakdowns of this silicon. With respect to this video: Intel is finally catching up with the various ARM platforms, including Apple's M series and the Snapdragon X series.
It is also interesting to see that the NPU has roughly similar TOPS per area as the GPU, so expect it to be very power efficient. That might also mean someone finds a way to overclock it, since hardware optimized for efficiency sometimes has insane headroom for overclocking.
I'm really glad to hear intel seems to be going as wide as possible. It seems like that is why Apple chips are so fast and efficient, not ARM doing magic or something
Apple has caches very close to the CPU, reducing latency and energy for data flows. Going wide doesn't help a lot of code; it inherently has serialising data dependencies.
Not sure if it's going wide that's helping here. From what I know, the efficiency of Apple chips came from 4 things:
- Better manufacturing node (M1 was on N5 while everybody else was on 7nm; M3 was on N3 while everybody else was on N5. With Lunar Lake, we're finally on even ground here.)
- On-package RAM (while I hate non-upgradable RAM, I'm glad Intel did this with Lunar Lake. There is a segment which clearly wants battery life much more than upgradeable RAM.)
- Non-bloated OS (nothing to comment here: Windows (and Microsoft) sucks, and Linux doesn't have enough support to be perfectly tuned yet.)
- Laptop and motherboard design. This one is more subjective. The thing is that Apple actually prioritises battery life, while on the PC side it's usually the benchmarks, which is why many laptops are much louder and warmer. Also, simply having extra ports exist, even with nothing connected to them, can increase the minimum power required for the laptop to be on. Apple is famous for not having enough ports; I think this is also a reason for its efficiency.
Edit: forgot to add, the M chips being on ARM also helps efficiency, but not as much as most people claim (as if it were the only thing). My gut feeling is that it helps by like 5-10%. As for the M chips being so fast: other than the big memory bus width (up to 16!! channels on the Ultra chips), I'd say it's also the better manufacturing node. If you compare the N5/N4 generations of chips, Intel and AMD are better than M1 and M2. I mean, efficiency aside, Intel's Raptor Lake and Raptor Lake Refresh, which are on 10nm++++, are quite competitive even with M3 chips. Still, overall, the difference is usually not that big. The M cores/chips are clearly well designed.
Great job on the video as usual bro! Thanks for the info and looking forward to seeing Lunar Lake AND Arrow lake hopefully later this year 🤞. Congrats on having Intel fly you out there too!
@@betag24cn That was the first generation of E-cores. Did you watch the video? Skymont E-cores have similar IPC to Raptor Cove (Raptor Lake)... while being vastly more efficient
@@__aceofspades Doesn't matter, the concept is stupid. It's fake: you don't glue two CPUs together unless you were in a panic. It's a dumb idea, and it points to the fact that your designs are terrible at not generating heat thanks to absurd levels of power consumption. Doesn't matter.
It would be another lie by Intel. They said Gracemont matched Skylake, yet here we are years later, and the Haswell 4-core, 8-thread i7-4700MQ laptop chip that I have shows 25% higher IPC (CPU-Z) than the E-cores on my Alder Lake Core i7-12700K, despite the latter's way, WAY faster DDR5. Lunar Lake is Intel's Bulldozer; there are so many problems with the overall design of the chip. Meteor Lake makes more sense.
ARM works around SMT with an SMP/UMA interconnect. Overall this still obeys the SMT design, but appears to be an alternative view; cost-based power optimization on the network, if I had to guess. This chip does look focused almost exclusively on the SoC. 8MB out the door is basically sequential conversion, and they do not care about capacity; they are using the bandwidth to carry them. Internally a 512-bit bus? How many ports do we get here? It is the SMP interconnect that worries me slightly: all communication from the NPU, E-cluster, P-cluster, GPU and IO die moves through that cache. Bandwidth will allow page swapping, but latency performance could suffer; everything runs from RAM. I noticed the L1 vs L2 split is mostly pointless and suggests the L2 is mostly capacity for keeping things on the P-cluster. This is very interesting for Intel SMT, so I have some concerns within the P-cluster itself (there is a solution to this problem). I already have concerns with P vs E cluster balancing (there is a solution to this problem too). I am guessing they removed most of the turbo boost, despite the low core count. Going this wide at low power is very interesting. Higher emphasis on MIMD over SIMD in a cluster? Heavy on compute and feature-rich IO for the money. Hopefully decent memory performance for the money. It will be interesting to know the range of power density. Hit or miss depending on software applications.
If the P-cores are not directly on the SMP interconnect, then they are effectively worthless. The E-cores, on the other hand, may be SMT or SMP without causing too much concern; schedulers may care, but that would be about it. Qualcomm Oryon used a similar network: 3 SMP cores, each core being SMT-4. With Oryon each core got 2MB of unified cache, while here Intel has significantly less, handing out about 1MB of unified cache per core for significantly more IO. Let's hope this is a coherent interconnect. At a 15W target, we are likely looking at performance or value.
Can you explain the implications of embedded RAM? My PC currently has 128GB of RAM. From several places I've looked, it sounds like with Lunar Lake you're limited to the 32GB of embedded RAM and it won't make use of normal RAM sticks? If so, that's just not an option at all for developers, and I'm really surprised that no one is getting out the pitchforks. BUT I must be missing something, so what am I missing?
Hello, great video. I wanted to ask you, now that both mobile laptop cpus from amd and intel are announced, which cpu do you think is superior overall? Taking everything into account would you go with lunar lake or strix? Thanks
I think Lunar Lake has a real shot at the efficiency crown, but it does launch later, in Q3, while AMD will launch sooner. Always wait for reviews, but for battery life I think LNL will be best. Strix Point should win in raw performance with up to 12 cores.
@@HighYield Thanks for replying. Patiently waiting on the Arrow Lake desktop reveal, as that's what I'm really interested in. I'm looking to upgrade to a new desktop with an RTX 5090, and I'll go with whatever is faster, AMD 9000 or Arrow Lake. The vanilla Ryzen 9000 series kinda disappointed me a bit tbh; pretty much the same gaming performance as the previous gen. We'll have to see what the 9000 X3D chips have to offer.
I don't think they are comparable. AMD doesn't have something to compete with Lunar Lake given its low power target, and Intel has not announced their answer to Strix Point (though we all know it's going to be some variation of Arrow Lake). Intel will win the efficiency battle against Strix Point, and it's very likely their GPU will be very competitive with Hawk Point at lower power, but it's unlikely to touch Strix Point in GPU performance given that Strix Point has 16 CUs. Overall, I'm more and more excited for Lunar Lake. I think in a handheld form factor it's going to be very interesting.
@@sloanNYC The shipping may not be late, but the real issue is supply. Here in my country you can only easily find Phoenix Point/Hawk Point in gaming laptops, while the thin & light category is dominated by Intel.
The NPU is pretty gigantic compared to, for example, what Apple does. Curious about the performance, because Apple's are ridiculously fast for their size.
Do I understand correctly that a 128-bit memory bus has the same bandwidth as we'd get with dual-channel DDR modules, since each module (channel) is usually 64-bit? Just trying to understand the overall memory bandwidth; I know we also get latency benefits, and I'm not downplaying that.
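A rough sketch of that bandwidth arithmetic: a 128-bit bus moves 16 bytes per transfer, exactly like two 64-bit DDR channels. The LPDDR5X-8533 speed is an assumption based on Lunar Lake's announced memory, and DDR5-5600 is just an illustrative comparison point:

```python
# Peak bandwidth = (bus width in bytes) x (transfers per second).
def peak_bandwidth_gbps(bus_bits: int, transfers_per_sec: float) -> float:
    return bus_bits / 8 * transfers_per_sec / 1e9

lpddr5x   = peak_bandwidth_gbps(128, 8533e6)      # 128-bit LPDDR5X-8533
ddr5_dual = peak_bandwidth_gbps(2 * 64, 5600e6)   # two 64-bit DDR5-5600 channels

print(round(lpddr5x, 1))    # ~136.5 GB/s
print(round(ddr5_dual, 1))  # ~89.6 GB/s
```

So yes, the bus width is equivalent to dual-channel; any bandwidth advantage comes from the higher transfer rate, not extra width.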
It would be interesting if someone came up with a hybrid chip that has both x86 and ARM instruction cores, which would allow running both x86 and ARM software natively. It could be an 8-core CPU with 2 x86 P-cores + 6 ARM P-cores.
@@reiniermoreno1653 Implementing the software would be easier than the hardware, since we already have OSes that understand which ISA you are using. Binary executables also carry info on which architecture they are designed for.
They are lying. Raptor Cove has 36MB of L3 cache on a monolithic ring bus. A ring-bus Haswell 4-core, 8-thread laptop Core i7-4700MQ that I have shows 25% higher IPC than the E-cores on my current Alder Lake i7-12700K, despite the HUGE difference in RAM speed.
Speaking of, we REALLY need the dynamic iGPU memory allocation that Apple has. On Windows' side I can see why it's not implemented and why nobody talks about it, as Microsoft couldn't give 2 flying Fs about Windows, especially on the performance side. If it's not ads or tracking the user, then it's priority 7384, to be done 15 years from now. On the Linux side I hope we'll see something, but usually GPU stuff comes from the manufacturer, so that would be Intel or AMD here for the iGPUs. And they're both busy in other areas, like making the actual GPUs competitive. And the drivers for Windows. Linux comes after that. Sigh.
@@sowa705 Oh, ok. I was under the impression that it's settled at boot time. I wonder then why Apple presented it (and people were wowed) as something new. I guess it was new for them.
Can you actually combine the GPU, NPU and CPU for the same inference task, though? Or is Intel just adding up numbers to create a bigger number, while in the real world you'll have to decide where to run any given model?
You are correct; currently you can't just add the numbers. Apparently work is being done to enable mismatched processors for AI batch processing, but I don't expect it to release soon, if ever.
@@martin777xyz thanks for confirming. And yeah, from my understanding it sounds really tough to make these systems complement each other. Maybe some day we'll be running so many models locally that they can run in parallel but even that...
Lunar Lake looks like the biggest improvement for Intel in over a decade. In terms of performance per watt and GPU performance, it looks like Lunar Lake will beat Zen 5 and Qualcomm's X Elite. The only downside is that Lunar Lake is focused exclusively on thin-and-light laptops and handhelds; it's not their highest-performance product for mobile or desktop. That is Arrow Lake, which looks great for performance but will give up some of the efficiency and iGPU gains Lunar Lake brings.
Gluing the RAM closer can yield far better RAM performance than CAMM. I think this CPU will be used in very small systems. I'm betting they will use even faster RAM as soon as it becomes available.
@@ChristopherBurtraw The on-package RAM is mostly for the power savings, not necessarily for bandwidth. Lunar Lake is optimized to be very efficient (and I hope it actually delivers). It should be perfect for ultrabooks that want really, really long battery life, and for gaming handhelds. For the rest of us with normal (or powerful) laptops, we'll have Arrow Lake, and hopefully we'll see LPCAMM2 laptops with that. I dream of a Framework 16 with Arrow Lake and LPCAMM2 in which to add 128 GB of RAM, and of finally upgrading my almost-8-year-old laptop to one that will also last me 7-10 years.
@@Winnetou17 I'm hoping the next gen (after the one they just announced) 13 board will have it too. Framework won't want to implement this one even for the 13...
I'm still waiting for a chip that integrates 32GB of HBM3e as an on-package L4 cache within the same SoC, while also supporting the addition of DDR5 memory modules with ECC capability, rather than being limited to just integrated memory.
The core layout with everything right next to the memory controller makes sense, and I'm glad to see intel moving in this direction. It'll be super interesting to see how x86 power consumption improves with this layout!
Exciting stuff! Great video, as usual. I do have one question, though. Is it certain that a server implementation (or any) of Lion Cove would have SMT? Also, different implementations of the same architecture sounds more like a standard vs Dense Zen situation to me, and I think that it could get expensive to develop lots of just slightly different cores
Depending on the details of moving data between the NPU and the GPU, using both at the same time could work really well for some AI workloads. Training a QLoRA, where the main weights are only used for 4-bit inference that could run on the NPU, while backpropagation is done only for a low-rank adapter in fp32 or fp16 on the GPU, could potentially work well. It won't be faster than a dedicated GPU; even a 3060 should outperform it, and memory bandwidth will likely also severely limit its performance. But often the issue with GPUs is not speed but available memory, and this should also be much more power efficient. It will all depend on software support; that is usually the issue with most non-NVIDIA AI hardware.
I'm currently using a Zen 4 AMD CPU and have been an AMD user for several years now, but honestly Intel's next-gen stuff looks more compelling than Zen 5 and up. That is, if they can pull it off. I really like Thread Director being in hardware. Dealing with a gazillion cores and core types has been a weak point in software as of late, so I hope this can help.
I work for a major computer vendor and you're spot on. Your conclusion 110% speaks my mind and matches exactly what I've been saying since Intel presented LNL to us 3 weeks ago. I said that if LNL almost matches the battery performance of Qualcomm's ARM chips, this is going to be another Windows RT. ARM for Windows doesn't really offer a difference. We already have more performance than needed; NPUs are available en masse thanks to NVIDIA; it's just that MS, for now, firewalls the Copilot marketing-bullshit storytelling and blocks anything other than embedded NPUs from being recognized by Copilot, but this will probably change next year and they'll have to open the gates. What's left? Battery performance. OK, but if that gets matched, what's the point of having the whole industry shift away from x86? Zero... ARM will be the thing that made Intel rethink its architecture and, from there, its power efficiency, and that's a good thing.
The interesting question is how Intel will handle it in the future. Lunar Lake will run parallel to Arrow Lake-U. But what about Panther Lake in 2025? Will it continue the Lunar Lake design, or will it go back to the generic design of Arrow Lake-U?
Looking forward to getting a windows laptop similar to the Macbook Air. I would love to have laptop without the need for a CPU fan, or maybe one that runs only during high workloads.
Sir, make a video on the 14900K and the upcoming Ultra 9 290K: whether it has an NPU, how many TOPS the CPU and integrated GPU offer, whether it has built-in RAM, and any other differences from Lunar Lake.
Cool to see a nice bit of cache on the side to minimize DRAM access. L4 foresight on desktop? Probably not, but I love what I'm seeing from Intel this year; very exciting in more ways than expected. Maybe not quite leadership just yet, but at least on par. The whole E-core thing is evolving into something, and I won't be surprised if it eventually gets to a Zen-dense kind of point. So far it still looks like a split design mentality, but with a high-IPC philosophy, so the ability to use E-cores for most tasks will get the best out of the efficiency. Last time I was this excited was Alder Lake?
I'd love to see something like this for desktops, where I could get an entire SoC with 32-64GB of RAM all bundled together. I know there are upgradeability concerns, but the performance benefits if you over-spec could be really good, especially for RAM-heavy applications.
@@HighYield Why do you think AMD or Intel are not able to compete with Apple in single-core performance, despite being in this field for decades? I mean, just look at the Geekbench 6 single-core performance of the M4; it's insane for an iPad this thin.
windows recall and similar shit, but they did prove it doesn't need an NPU. they are no longer building chips for us, we just need 4 fast cores and a lot of cache to go with them :)
They are betting on an AI world where everybody runs AI locally. Others have the same idea. I personally find that very creepy; I just want the PC to do what I order, not to have its own ideas.
I have two guesses on why Intel 4 and later processes, Meteor Lake and Lunar Lake, are all mobile-oriented and not desktop. 1) The processes are not suitable for high-performance operation, but do get better power efficiency. 2) Managing fab-process capacity.
It's very sad there's no perfect chip. I like some moves Intel made with this design, like the interposer for connectivity, the complete turn-off of tiles when they're not required to maintain a low idle power draw, the extra efficiency on the E-cores so they can be the main cores and lower the power draw under workloads, and the monolithic design of the compute chiplet. If only the E-cores were more like Zen c and less like big.LITTLE, and the RAM weren't integrated on the SoC...
Is there a good chance that all these architectural improvements will help Intel make much more efficient desktop CPUs in the near future? I'm really interested in the Small Form Factor space, and I think Intel has had a bit of a hard time competing there in recent years with their processors.
I know this is a video about Lunar Lake, but it gets me really excited for Battlemage and desktop products like Arrow Lake. If Intel could figure out a V-Cache competitor and commit to multiple years of support for a motherboard platform, they could make AMD straight up unattractive on desktop. I say that as someone with a 7950X3D and invested in AM5! I can't wait to see the next few years.
@@aravindpallippara1577 Well currently, Lion Cove is projected to have higher single-threaded performance than Zen 5 cores. That single-thread lead will help with everything, including gaming. AMD has the biggest advantage in gaming right now with V-Cache, platform support and efficiency. With Skymont, Intel has a real chance of a huge performance/watt uplift, particularly in multi-threaded loads, which is where Intel sucks down a comically large amount of power. That's why I specified V-Cache and platform support would make AMD unattractive on desktop. Because Intel already has a decent chance of class-leading single-threaded performance, adding V-Cache to an Intel CPU would surely boost performance considerably (especially in games that love V-Cache, like Factorio or Kerbal Space Program). And platform support like we have with AM5 would be really great. Having to upgrade every 2 gens is a huge downside compared to AMD's offerings and commitment to 2027+ support, and why I personally went for the 7950X3D and AM5. V-Cache and platform support are just great.
V-Cache is a solution to the slow memory controller on Zen processors. When you glue the RAM this close, you don't have that AMD problem, so there's no need for the same solution. The mystery cache is probably enough if Intel's engineers did their job well.
@@impuls60 Imo that's an F Tier comment. L3 cache is going to beat out faster memory just by virtue of the insanely high bandwidth and lower latency. There's a reason intel loses in those games that favor 3D cache
@@impuls60 Agreed with the above commenter. A CPU cache and RAM have vastly different uses: cache is very raw and hence very fast, as opposed to RAM, which needs to be encrypted and passed through OS-layer checks before being accessed. Cache is still the king for single-threaded performance.
Technically you can upgrade the memory after purchase. You just have to be really good at soldering 😁. I only mention this because I knew someone who did it to his MacBook. He bought an 8GB model and, with some patience and skill, it became a 32GB model 😅.
@@noticing33 I don't know, but I know the device worked afterwards. I lost contact with him after his internship ended. But I think he wouldn't have done it if it hadn't improved performance.
It is a SoC for laptops, but they might offer something for desktops later, or might not. Intel is like a kid lost in the forest at this point, copying everybody but scared all the time.
A cynic would point out Pat was excited about Raptor and Meteor Lake and even Sapphire Rapids. These presentations have not been a reliable guide to what's delivered and when in recent years.
@@HighYield The lack of a ring bus and cache hurts gaming performance. Raptor Lake mobile chips are way faster in games than Meteor Lake ones. The i9-13980HX (what a beast of a mobile CPU) is a desktop Core i9-13900K (the 2nd or 3rd fastest gaming CPU behind the Ryzen 7 7800X3D) with lower power draw and slightly lower MHz. Meteor Lake doesn't max out at 8 P-cores with 36MB of L3.
@@HighYield It is their imaging engine, the "Image Processing Unit" (IPU). It takes up a lot of die real estate. I hope we see devices with better cameras to take advantage of the enhanced processor.
I'm glad we have such a vast choice of mobile CPUs these days: Apple M1/M2/M3, Intel Meteor/Arrow/Lunar Lake, AMD Hawk Point/Strix Point, and a new player, the Snapdragon X Elite, on its way. We've never had a more difficult choice.
respect the heck out of the multiple 'this is the boundary of my expertise' comments. in my experience when someone says that it just reaffirms that everything up to that point is trustworthy, or at least honest. it makes your speculation more interesting.
not in the market for a laptop, but this tech space has been so cool to follow! i love where intel (and amd) is going.
The more things I learn, the more I realize how little I really know.
@@HighYield Yep... but keep it up! :D Awesome to see great content like this on here.
@@HighYield This. You do need to know more than the average person to learn that, and it usually comes at a later age ;-)
Lunar lake looks compelling, but just like Zen 5, I'll believe it when I see it.
The only way you should do it.
Yup. But I am really glad we are seeing competition in this segment, and companies are back to testing out new ideas and being bold with design instead of being afraid of the unknown and stagnating. Testing out new stuff and improving older stuff is never a bad thing.
Well, with regards to IPC anyway, AMD has been on the money with every iteration of gains they said they would get. Intel on the other hand... well, let's just say they have fallen short each and every time.
Intel still has to prove that they can do efficient compute (but this step closer to Apple design might help)
AMD still has to prove they can do efficient platform power, where Apple and Meteor Lake are miles ahead in true achievable battery life.
We'll see when the notebooks are out.
@@Sam-jx5zy Yeah, AMD has been making headway in power gating, but still nothing near Intel's approach. As for Apple, well, that's all based on ARM, which is very efficient, but ARM is by its very nature basic in how it processes requests and can't process long instructions and requests like x86. On the other hand they sip power, so pros and cons.
High Yield explaining chip layouts is like music! I just nerd-out. This man is so good at the niche he is in. I wish this would be as financially lucrative as the value of the knowledge he espouses!
Yeah, this guy is the real deal. Really enjoy his content, even when it’s on x86
nerds
Much better detail than Asianometry, that's for sure
@@aerohk Jon is much smarter than me, he's just looking at the bigger picture most of the time. Like a certain technology, instead of a specific chip.
one characteristic of high IQ people is they pinpoint intelligent aspects in other people 🎩🧤🥂
I think they're doing some very interesting stuff with their SOCs. I'm happy they flew you out to the event, you're a great creator.
CPUs are going through somewhat of an architectural revolution. The days of simply adding more cores is over. The real innovation has begun.
It's insane that this channel went from small youtuber with sub-1000 views to being invited to international trade shows by one of the biggest chip makers in less than three years!
As someone who's here since the VTFET video I am amazed but also not surprised because the quality was there from the start.
Congratulations, and continued best wishes on your journey, which will hopefully take you far! 🍻
Thank you. I'm still surprised.
@@HighYield Understandable, but you've definitely earned it.
Uhhhh huh….
It seems a cpu architecture revolution is underway. The days of simply adding more cores are over. Intel and AMD are now innovating a lot more than they have been the last decade or two, and ARM CPUs are reaching insane performance levels, and neural processing is becoming much more prevalent. It seems that we have reached a turning point. It reminds me of the innovation that occurred during the movement from single core processors with huge pipelines, to multicore processors with reduced pipelines.
ARM processors are still RISC-based while x86 is CISC. ARM does well for certain tasks, but it is a reduced instruction set.
Frankly speaking, after 4 years of inertia Intel made something based on Apple M1 ideas 🤔😉
@@_EyeOfTheTiger A reduced instruction set is better, not worse, for performance per watt and compiled-code efficiency. A RISC set is enough for any task complexity, and much of the modern workload is vectorized, so RISC vs CISC makes no difference there; it depends more on the vector extensions. Big, complex decoders are just a waste of chip space and energy in CISC, and internally they do everything in µops anyway (so Intel and AMD chips are really RISC chips with added on-chip CISC-to-RISC translation, since i686 times 🤷♂️😂).
The problem is, all of this innovation is only coming because we haven't been able to make gains by upping core counts and building dies on smaller process nodes, like you said.
This is because we're reaching a point where we simply can't gain more from those methods. So after all of this restructuring and optimization of the chip is done, my layman opinion is that we're going to plateau. These sorts of organizational innovations can't happen forever; at a certain point you've reached peak efficiency for the tech you have available.
I think a lot of us would be surprised to know how many of these 'new' technologies were actually patented decades ago. It's just a matter of culture and economics I think. No?
I can’t wait until real high-res die shots of these chips will be available
Get a scanning electron microscope
Congrats on getting a press invite dude! You deserve it.
but my rtx 3070 beats a pos NPU🤣🤣
Excited for this video!
I suspect the L0 naming scheme does a few things. Allows for L2 to keep the same name since it has the same performance and size as previous gen and allows L3 to not be named L4, which may have some negative thoughts among the media (plus the potential for direct comparisons to AMD)
And naming their L1 L1.5 likely creates unnecessary problems on the SW side
Also it must be using virtual addresses like L1, the larger L1 still has to start physical address translation for cache misses.
So this small cache for recent data not in the register file may save energy by (usually) avoiding work.
My first thought was that the L0 is not shared per core, or micro code references to cache start at 0 and they're being more cohesive with naming.
@@mikeb3172 Cache coherency requires any cached data to be available to another core. But that requires physical addresses to obtain a read only or exclusive write to the memory cacheline.
If L0 is read only with L1 able to invalidate entries then the simpler faster cache type fits.
It sounds like a Jim Keller style idea that questions prevailing assumptions.
There used to be an Intel generation during the stagnation years that had an L4 cache. DDR4 wasn't ready yet and the cores were memory-starved, so they put some cache on the package.
First found High Yield's channel 3 months back with the Zen 6 video and have been a fan ever since; I've watched lots of his previous videos as well. Great content!
This is the best explanation of these things I’ve seen so far! I also don’t fully understand everything but I feel like you made it really easy and enjoyable to follow in one video. Thank you!
For a while I was expecting a big reveal of the Adamantine L4 cache. Alas, it ended up being the side cache.
I was hoping for that too
Adamantine is a separate cache tile that goes between the base tile and active tiles, so it can’t be on Lunar Lake with only 3 tiles. It’s possible for it to be on Arrow Lake as the tile implementation isn’t revealed, but I am doubtful of that.
@@dex6316 Adamantine is an active silicon base die which was rumoured to contain L4 cache. It is not a die that goes in between.
Interesting, the L0 pre-L1 may avoid work.
L1 caches use process specific virtual addresses with an address translation needed in parallel to validate the tag isn't a clash with data from another thread. (There's some great CPU engineering lecture videos in YT that explain how L1 operates)
The tiny pre-L1 shouldn't imply the real L1 is an L2 accessed by physical addresses shareable between processes.
So a pre-L1 cache ends up as L0 to avoid confusion.
Now for some speculation and this could be why Lion Cove dropped HT.
Without HT that cache using logical virtual addresses could make energy saving simplifications. If it is entirely flushed on thread changes and not shared between threads, no logical to physical address translation validation seems necessary.
The small size may mean it can be looked up fast enough to pre-filter L1, with misses going to L1 after or the energy inexpensive fast cases are going to L1 simultaneously to complete faster with later validation on an L0 miss.
So perhaps the underlying truth of the leaks was HT went to allow effectively a cache of register file and most recently used logical addresses accelerating L1.
The address translation is needed to figure out which cache line in the selected set has the desired virtual address (and all caches still store the full physical address each line is for, regardless of whether they're initially addressed virtually). The reason for traditional L1 being virtually-addressed is specifically to allow doing the translation (aka TLB lookup) in parallel.
The reason such L1-s are so tiny is because they (ab)use the low 12 bits of physical and virtual addresses being the same (due to 4KB pages), and extend to 32KB or 48KB or whatever via just reading all 8 or 12 (aka associativity) possible matches, and selecting between them when the TLB result is gotten. A 192KB virtually-addressed cache would imply it reading an entire 48 possible cachelines (each being 64 bytes) on each access, which is utterly crazy.
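The associativity constraint described above falls out of simple arithmetic. A minimal sketch, assuming standard 4 KB pages; the cache sizes are the ones mentioned in the thread, not confirmed Lion Cove parameters:

```python
# Sketch of the VIPT (virtually-indexed, physically-tagged) constraint:
# if the index bits must come from the untranslated low 12 bits of the
# address (4 KB pages), each way can be at most one page in size, so
# the minimum associativity grows linearly with total cache size.

PAGE_SIZE = 4096  # bytes; low 12 bits identical in virtual and physical addresses

def min_ways(cache_size_bytes):
    """Minimum associativity so each way fits within one page."""
    return cache_size_bytes // PAGE_SIZE

for size_kb in (32, 48, 192):
    print(f"{size_kb} KB cache -> at least {min_ways(size_kb * 1024)}-way")
# 32 KB -> 8-way, 48 KB -> 12-way, 192 KB -> 48-way
```

This is exactly the commenter's point: a 192 KB virtually-indexed cache would need to read 48 candidate lines per access, which is why the big "L1" here is more plausibly a physically-addressed L2-style cache.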
That said, assuming that L0 and L1 accesses aren't done in parallel, by the time the L0 concludes that it doesn't have the asked-for data, the TLB lookup will have finished anyway, and thus the L1 will be addressable physically with no additional delay, like it would with a traditional L2.
@@dzaima The point is that in L1 the virtual address can be looked up, with the physical address translation running in parallel for validation to ensure it's from the right process. Two different physical cachelines can share the same logical page bits.
You don't want the latency penalty of translating virtual addresses first because it's slow.
The figuring out which cache line has the virtual address is back to front, process virtual addresses are mapped to physical memory via address mapping.
The question what virtual address does this physical memory have is meaningless because it depends on what processes are sharing the memory page, you have a 1:n mapping. But the process thread running has a 1:1 translation.
All the code I compiled tried to use relative addresses with relocatable code to minimise such problems.
@@RobBCactive Virtual to physical address mapping isn't a 1:1 translation even within one process - it can be n:1, as a process can map the same page to multiple locations in its own single virtual address space (and this is useful - see "Mesh: compacting memory manager"). Thus, addressing a cache by a full virtual address is impossible to do correctly without still having some physical mapping check somewhere.
@@dzaima just another reason to avoid the need for it, I think you are ignoring the possibility of a read only cache that writes through via L1 with its translation.
Actual processing writes mostly to registers and then store operations.
If you include all the L1 features what is the benefit of the L0 cache? The address translation isn't going to magically complete faster.
@@RobBCactive I'm saying that it'd be pretty reasonable for the Lion Cove L0 to function exactly like traditional L1-s, and its L1 can largely function exactly like a traditional L2.
Haswell (2014) has a 32KB L1 with 4-cycle latency and 256KB L2 with 12-cycle latency, and it seems very possible to me that, with 10 years of process node improvements, similarly-structured caches (with the higher bandwidth of course) can map to Lion Cove's L0 and L1; and then the difference ends up being the modernly-sized extra level before the very-slow L3.
I suppose it would be possible that Lion Cove's L0 leaves write ops to L1, but that'd obviously result in a rather larger write latency (though perhaps that doesn't matter too much given store forwarding).
High Yield with the early Computex coverage!!!!
My first, but hopefully not my last Computex.
@@HighYield Aw, many more, High Yield. After all, I'm sure you've met many great people in the industry. More to come, I say! Also, will we get some Strix or even Turin content? Skymont seems very impressive. I feel like AMD is sitting on Zen 5c, whose IPC is on par with Zen 5; I'm saddened AMD didn't talk about it at all (perhaps at a future Hot Chips). They've left 8 Zen 5c cores for consumer and the rest for Turin (dense). From what I've heard it's also a unified CCX, so no split cache, so much better latency (Zen 2 to Zen 3). I don't know why they're sitting on the design.
That said, with Turin dense the CCDs look massive, and I don't think it'll fit on AM5. I'm really interested to know why the Zen 5c CCDs look larger than Bergamo's Zen 4c CCDs. My thoughts lead me to it having 12 CCDs instead of Bergamo's 8. Could it be more GMI links to fit more CCDs on the package? Is that the reason it's bigger? Could a 12-CCD Zen 5c part fit onto the AM5 package?
M4 and Lunar CPU fight is going to be interesting.
Hopefully Intel becomes competitive against the M4 Max with Arrow Lake.
@@PKperformanceEU There is no way Intel will reach the M4 Max that quickly. Intel is good, but the last few years haven't been kind to them, or the last 10 years at that.
@@GlobalWave1 Most likely. But it would be nice to have an alternative; if the M4 Max ends up more expensive than the M3 Max, I won't buy it.
@@GlobalWave1 Lunar Lake is not a competitor to the M4 Max; it's a competitor to the M4.
The M4 is already a reality and will be joined by the M5 when Lunar Lake hits the market.
The battle is far from over; x86 still has a lot of bullets left to fire.
Haha, this is an advanced ARM SoC copy at every level of the design, except for the instruction decoder.
@@PaulSpades Yeah, just like how Snapdragon ditched the low-power cores for laptops.
Intel used to make great ARM chips in their XScale series, up until their Atom SoC push in mobile. But they still hold on to their ARM architecture license.
AMD also based their first competitive x86 products on their am29k RISC architecture.
x86 (or more accurately AMD64) is just a layer of backwards compatibility and nothing more.
They just directly copied the ARM SoC approach to x86, but just like Qualcomm, it took 4 years just to copy the M1, so those bullets come out too slowly. Also, Apple can scale their chips to desktop level, but check the Snapdragon X Elite: if you increase the power consumption by 250% (from 23W to 80W), the extra performance gained is just 10%. So good luck making a desktop chip with that. The chip itself doesn't mean a lot if it's limited to laptops, since the laptop market is only a small part of the PC market.
@@PaulSpades If it means backwards compatibility without emulation... I am not buying a Mac in everything but name. And if it costs like current ARM solutions, it would be reasonable even as the core of an actual PC. But the lack of RAM expandability is still a bit meh.
Great video! Loved your lucid description of the LNL architecture.
Look at you now! This is crazy, I remember watching your videos when you had 700 subscribers, and now you're getting invited to these events. Congrats!
Thanks! Tbh, it still feels like a dream to me. I'm enjoying it as much as I can.
@@HighYield I'm glad :) and it's well deserved! Your hard work is definitely paying off
It looks weird that the media and display engines are separated. They could swap the display engine and the 8MB side cache, but the media engine does need some cache (just not 8MB).
I remember watching your channel back when you had less than 1000 subscribers. It's good to see you getting big enough to be invited to Intel events.
I've seen your username around for a while now, thanks for sticking with me
Will you do a video on Zen 5 and 5C? I'm interested in the capabilities of Zen 5C versus Lunar Lakes E-Cores and how the different paths they took paid off now.
C cores are wayyyy better than e cores
For sure, but idk when I'll find the time. Soonish I think.
cpu competition is spicy again babyyy
I'm seriously considering an LNL mini PC as an upgrade from my current 5600G mini PC and 7220U laptop. I feel like this thing can do it all with much lower power and heat output (which will make it more portable than my current AM4 mini PC). 32 gigs is seriously enough for me, since that's where I'm at right now, and the heaviest things I run are probably War Thunder and RPCS3.
I love that I came across this channel. I can't believe how good your content is while you're still a sub-50,000 subscriber channel.
Good effort from former IMG guys (among others). Congrats to the team
If 32 GB were the standard for every soldered LPDDR5X configuration, then no one would have an issue with upgradability.
That is what they said 50 years ago about 640kB. You have no idea how much memory we might need 5 or 10 years from now. Maybe even only 2 or 3 years from now.
@@AutieTortie In Windows 7 times, 2GB of RAM was required (8GB for the best experience). Now at least 8GB is required for daily tasks and non-under-engineered games. So that's 2-4x the RAM in 14 years, I'd say.
But a 32GB max... I'd say it may not be enough for ultra-professional 3D artists / music producers. But who knows.
Two RAM options: 16/32GB.
Allow a user to add more RAM and use the "onboard" RAM as cache, or allow users to replace the SoC like they do with desktop CPUs.
That, or CXL 3.1 could access more RAM via a PCIe-enabled port and device.
There are already plenty of mobile use cases that don't need massive compute power but do need more than 32GB of RAM.
It's an understandable compromise at times but it would be nice if there were more memory options.
Just came across this video of yours. With the available information you did awesome work. I'm also looking forward to real-world performance when Lunar Lake releases, and similarly with Arrow Lake. Time has ensured that the Ultra series has taken over from the H series chips of 6 years ago. Interesting times. Such videos not only educate but can also be useful for purchase decisions. Subscribed 👍
I love this channel so much
Proud of you for getting invited to a press event! Well deserved. hoping lunar lake won't be as weird of a launch as strix
i really like ur transparency
Crazy that you can soon build a tiny 4" by 4" workstation that's totally fine for 3D, illustration work and code. This is the way forward.
I love how you go into detail on the die space and die size, the L0/L1/L2 caches, and what each new process in Lunar Lake means! 🎉
Thanks for the disclaimer early in the video. A perfect example of why you are exemplary and trustworthy.
Amazing contents!
I think we'll have to wait for the release of Lunar Lake laptops and their benchmark scores, but if you simply multiply the Meteor Lake-H's 3DMark Time Spy and Fire Strike graphics scores by 1.5, you get TS: 5250, FS: 13800. In desktop GPU terms, that's close to GTX 1660 performance. In my country there are several articles saying it's 50% better than the Meteor Lake-U, but if you multiply the Meteor Lake-U's GPU performance by 1.5, you just get the Meteor Lake-H's GPU performance. On a different note, is the presence or absence of hyperthreading related to the high single-thread performance of Apple silicon?
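For clarity, the projection above just scales known scores by the claimed uplift. A minimal sketch; the base scores here are inferred by dividing the projected numbers by 1.5 and are the commenter's estimates, not measured results:

```python
# Back-of-envelope 3DMark projection: Meteor Lake-H graphics scores
# scaled by the claimed 1.5x Lunar Lake uplift. Base scores are
# inferred (5250 / 1.5, 13800 / 1.5), not benchmark measurements.

UPLIFT = 1.5
base_scores = {"Time Spy": 3500, "Fire Strike": 9200}  # assumed MTL-H baselines

for test, score in base_scores.items():
    print(f"{test}: {score} -> {score * UPLIFT:.0f}")
# Time Spy: 3500 -> 5250, Fire Strike: 9200 -> 13800
```

The same one-liner applies to the Meteor Lake-U comparison: multiplying its baseline by 1.5 lands roughly on Meteor Lake-H territory, which is the commenter's point.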
Great analysis! A real breakthrough.
You produce incredibly great content. Subscribed!
Good job @HighYield, love your detailed reviews of this silicon. With respect to this video: Intel is finally catching up with the various ARM platforms, including Apple's M series and the Snapdragon X series.
It is also interesting to see that the NPU has roughly similar TOPS per area as the GPU, so expect it to be very power efficient. That might also mean someone could find a way to overclock it, since hardware optimized for efficiency sometimes has insane headroom for overclocking.
I'm really glad to hear intel seems to be going as wide as possible. It seems like that is why Apple chips are so fast and efficient, not ARM doing magic or something
Apple have caches v. close to the CPU, reducing latency and energy for data flows.
Going wide doesn't help a lot of code, it inherently has serialising data dependencies.
Not sure if it's going wide that's helping here. From what I know, the efficiency of Apple chips came from 4 things:
- better manufacturing node (M1 was N5, everybody else was on 7nm. M3 was on N3, everybody else was on N5. With Lunar Lake, we're finally on even grounds here)
- on-chip RAM (while I hate non-upgradable RAM, I'm glad that Intel did this with Lunar Lake. There is a segment which clearly want battery life much more than upgradeable RAM)
- non bloated OS (nothing to comment here, Windows (and Microsoft) sucks, Linux doesn't have enough support to be perfectly tuned yet)
- laptop and motherboard design - this is much more subjective. The thing is that Apple actually prioritises battery life, while on the PC side it's usually benchmarks, which is why many laptops are much louder and warmer. I also know that simply having extra ports exist, even with nothing connected to them, can increase the minimum power required for the laptop to be on. Apple is famous for not having enough ports; I think this is also a reason for its efficiency
Edit: forgot to add, M chips being on ARM also help on efficiency ... but not so much as most people claim (as if it's the only thing). My gut feeling is that it helps like 5-10%.
As for the M chips being so fast ... other than the big memory bus width (up to 16!! channels on the Ultra chips) I'd say is also because of better manufacturing node. If you take the N5 and N4 and 5nm and 4nm generation of chips, Intel and AMD are better than M1 and M2. I mean, if you exclude the efficiency, Intel's Raptor Lake and Raptor Lake refresh which are on 10nm++++ are quite competitive even with M3 chips. Still, overall, the difference is not that big usually. The M cores/chips are clearly well designed.
Pretty cool. Looks interesting on paper. We will see how it performs in real life.
Great job on the video as usual bro! Thanks for the info and looking forward to seeing Lunar Lake AND Arrow lake hopefully later this year 🤞. Congrats on having Intel fly you out there too!
As an ASIC Design Engineer , this is an amazing video. Was able to relate to a lot of concepts I learnt in school
If those E cores are getting so good, I wouldn't mind having a budget option with just 6 E cores!
Those things are basically an 8th-gen i5 mixed with an Atom. If you want that, go buy one, don't wait for the future.
@@betag24cn That was the first generation of E-cores. Did you watch the video? Skymont E-cores have similar IPC to Raptor Cove (Raptor Lake)... while being vastly more efficient
@@__aceofspades Doesn't matter, the concept is stupid. It's fake; you don't glue together two CPUs unless you were in a panic. It's a dumb idea and points to the fact that your designs are terrible at not generating heat, thanks to absurd levels of power consumption. Doesn't matter.
It would be another lie by Intel. They said that Gracemont matched Skylake.
Here we are years later, and the Haswell 4-core/8-thread i7-4700MQ laptop chip that I have shows 25% higher IPC (CPU-Z) than the E-cores on my Alder Lake Core i7-12700K, despite the latter's way, WAY faster DDR5.
Lunar Lake is Intel's Bulldozer, there so many problems with the overall design of the chip. Meteor Lake makes more sense.
I'm wondering why they would make it on TSMC N3B when N3E is already in production.
Intel had to take what was available when the order was made would be my bet.
Yup
ARM works around SMT with SMP/UMA interconnect. Overall this still obeys the SMT design, but appears to be an alternative view. Cost based power optimization on the network, if I had to guess. This chip does look like it is focused on SoC almost exclusively.
8MB out the door is basically sequential conversion and they do not care about capacity. They are using the bandwidth to carry them. Internally uses 512 bus? How many ports do we get here? It is the SMP interconnect that worries me slightly. All communication from NPU, E cluster, P cluster, GPU and IO die move through that cache. Bandwidth will allow page swapping, but latency performance could suffer. Everything runs from RAM.
I noticed the L1 vs L2 is mostly pointless and suggests that the L2 is mostly capacity for keeping things on P cluster. This is very interesting for Intel SMT. Therefore I have some concerns within the P cluster itself. (There is a solution to this problem.) I already have concerns with P vs E cluster balancing. (There is a solution to this problem.) I am guessing they removed most of the turbo boost, despite the low core count. Going this wide on low power is very interesting. Higher emphasis on MIMD over SIMD in cluster?
Heavy on the compute and feature rich IO for the money. Hopefully decent memory performance for the money. Will be interesting to know the range of power density. Hit or miss on software applications.
If the P cores are not directly on SMP interconnect, then they are effectively worthless. The E cores on the other hand may be SMT or SMP without causing too much concern. Schedulers may care, but that would be about it. Qualcomm Oryon used a similar network, which is 3 SMP cores each core being SMT-4. With Oryon each core got 2MB unified cache, while here with Intel there is significantly less for significantly more IO. Intel has handed out about 1MB unified cache per core. Lets hope this is a coherent interconnect.
At 15W target, we are likely looking at performance or value.
So there's no Intel process node in Lunar Lake? All TSMC nodes.
The base tile is manufactured by Intel, and they also do testing + packaging. But yes, all the active silicon is TSMC.
@@HighYield If I'm not wrong, aren't they going to use in-house manufacturing in 2025? Basically everything is going to use Intel nodes.
Can you explain the implications of embedded RAM? My PC currently has 128GB of RAM. From several places I've looked, it sounds like with Lunar Lake you're limited to the 32GB of embedded RAM and it won't make use of normal RAM sticks? If so, that's just not an option at all for developers, and I'm really surprised no one is getting out the pitchforks.
BUT I must be missing something, so what am I missing?
Is Arrow Lake actually getting the 20A node, or, just like this, will it be using TSMC all around?
There will be some ARL parts on 20A.
Hello, great video. I wanted to ask: now that the mobile laptop CPUs from both AMD and Intel are announced, which do you think is superior overall? Taking everything into account, would you go with Lunar Lake or Strix?
Thanks
I think Lunar Lake has a real shot at the efficiency crown, but it launches later, in Q3, while AMD will launch sooner. Always wait for reviews, but for battery life I think LNL will be best. Strix Point should win in raw performance with up to 12 cores.
@@HighYield AMD is usually late with actual shipping laptops though, no?
@@HighYield Thanks for replying. Patiently waiting on the Arrow Lake desktop reveal, as that is what I'm really interested in. I'm looking to upgrade to a new desktop with an RTX 5090, and I'm gonna go with whatever is faster, AMD 9000 or Arrow Lake. The vanilla Ryzen 9000 series kinda disappointed me a bit tbh, pretty much the same gaming performance as the previous gen. We'll have to see what the 9000 X3D chips offer.
I don't think they are comparable. AMD doesn't have something to compete with lunar lake given the low power target of lunar lake, and Intel has not announced what their answer to Strix Point is (though we all know it's going to be some variation of arrow lake).
Intel will win the efficiency battle against Strix Point, and it's very likely that their GPU will be very competitive with Hawk Point at lower power, but it is unlikely it will be able to touch Strix Point in GPU performance given that Strix Point has 16 CUs.
Overall, more and more excited for lunar lake. I think in a handheld form factor it's going to be very interesting.
@@sloanNYC The shipping may not be late, but the real issue is supply. Here in my country you're only able to find Phoenix Point/Hawk Point easily in gaming laptops, while the thin & light category is dominated by Intel.
Great video!
Intel is back on the innovation track. I really want to build my next PC with Intel. And an Arc GPU 😅
They are copying apple/arm and using tsmc, like AMD. Can't wait to see them do interesting things besides 14nm++++++++
"innovation track --> build my next pc" reads like "nVidia has -90 class GPUs that are great, let me build my next PC with GT 1030 (DDR4)"
yes, they are looking for new ways to get us on another decade of 4 cores is more than enough
The way tiles allow different parts of a CPU to use the most optimal process node is very cool.
the NPU is pretty gigantic compared to for example what Apple does. Curious about the performance because Apples ones are ridiculously fast for their size
Do I understand correctly that a 128-bit memory bus is the same bandwidth as what we'd get with dual-channel DDR modules? Since each module (channel) is usually 64-bit. Just trying to understand the overall memory bandwidth; I know we also get latency benefits and I'm not downplaying that.
Yes, it's 128-bit like most other consumer chips.
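A quick back-of-the-envelope check confirms the intuition above (assuming LPDDR5X-8533, the speed Intel quotes for Lunar Lake): one 128-bit bus and two 64-bit channels at the same transfer rate have identical peak bandwidth.

```python
# Rough peak-bandwidth sketch: bandwidth = bus width in bytes * transfers/second.
# The 8533 MT/s figure is an assumption taken from Lunar Lake's LPDDR5X spec.
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: int) -> float:
    """Peak theoretical bandwidth in GB/s (decimal)."""
    return bus_width_bits / 8 * transfer_rate_mt_s * 1e6 / 1e9

lnl = peak_bandwidth_gb_s(128, 8533)               # one 128-bit on-package bus
dual_channel = 2 * peak_bandwidth_gb_s(64, 8533)   # two 64-bit DIMM channels

print(f"{lnl:.1f} GB/s vs {dual_channel:.1f} GB/s")  # ~136.5 GB/s either way
```

So the width is a wash; the on-package layout's real wins are the shorter traces (lower power per bit transferred) and latency, as mentioned elsewhere in the thread.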
It would be interesting if someone came up with a hybrid chip that has both x86 and ARM instruction cores, which would allow running both x86 and ARM software natively. It could be an 8-core CPU with 2 x86 P cores + 6 ARM P cores.
I believe AMD, with their chiplets, could do it, if they haven't done it already.
You would also need a hybrid OS that could understand and manage 2 different ISAs and architectures.
@@reiniermoreno1653 you mean, windows 12?
@@reiniermoreno1653 Implementing the software would be easier compared to the hardware, since we already have OSs that understand which ISA you are using. Binary executables also carry info on which architecture they are designed for.
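As a minimal illustration of that last point (executables carrying their target ISA), here is a sketch that reads the `e_machine` field of an ELF header; the constants come from the ELF specification, and this is just one format - PE and Mach-O carry equivalent fields.

```python
import struct

# e_machine values from the ELF specification
EM_X86_64 = 62    # AMD x86-64
EM_AARCH64 = 183  # 64-bit ARM

def elf_machine(path: str) -> int:
    """Return the e_machine value identifying the binary's target ISA."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5) selects endianness; e_machine is a u16 at offset 18
    endian = "<" if header[5] == 1 else ">"
    return struct.unpack_from(endian + "H", header, 18)[0]
```

An OS loader does essentially this check before dispatching a binary, which is why scheduling the right core type (or refusing to run) would be straightforward on the software side.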
All I want to know is whether Intel can produce an APU good enough for me to ditch my Ryzen 7700/Radeon 6700XT desktop?
that's a gen or two away...don't expect all that this soon lol
their TOP gpu right now doesn't even reach 6700 xt performance let alone an APU from this coming gen...
@@ofon2000 Oh well. Have to lug my combo around again... No, not seriously. LOL.
Keep in mind that AMD pushed APUs for decades, and look how far they got. Intel catching up will not happen soon, if it ever happens.
Lol. Thats a good one
14:40 * " would have never expected that the E-cores in Arrow [Lunar] Lake have a higher IPC than the P-cores in Raptor Lake."
They are lying. Raptor Cove has 36MB of L3 cache on a monolithic Ring Bus.
A ring bus Haswell 4-core/8-thread laptop Core i7-4700MQ that I have has 25% higher IPC than the E-cores on my current Alder Lake i7-12700K, despite the HUGE difference in RAM speed.
Is this where the power and signal have been separated (on opposite sides)?
No, it’s on a TSMC node which doesn’t have backside power.
Does this mean the gpu can access all the memory like the m series aka unified memory?
It can access the 8MB GPU L2 cache and the 8MB memory-side cache.
every intel iGPU can access all of system RAM
Speaking of, we REALLY need the dynamic iGPU memory allocation that Apple has. On the Windows side I can see why it's not implemented and why nobody talks about it, as Microsoft couldn't give 2 flying Fs about Windows, especially on the performance side. If it's not ads or tracking the user, then it's priority 7384, to be done 15 years from now.
On Linux side I hope we'll see something, but usually GPU stuff comes from the manufacturer, so it would be Intel or AMD here for the iGPUs. And they're both busy on other areas, like the actual GPUs being competitive. And the drivers for Windows. Linux comes after that. Sigh.
@@Winnetou17 You do have dynamic iGPU allocation on Windows; on Intel, all of system memory is accessible to the iGPU (unlike Ryzen lol)
@@sowa705 Oh, ok. I was under the impression that it's settled at boot time. I wonder then why Apple presented it (and people were wowed) as something new. I guess it was new for them.
Can you just combine GPU, NPU and CPU for the same inference task though? Or is Intel just adding up numbers to create a bigger number but in the real world, you will have to decide where to run any given model?
You are correct. Currently can't just add the numbers. Apparently work is being done to enable mismatched processors for ai batch-processing, but I don't expect it will release soon, if ever.
@@martin777xyz thanks for confirming. And yeah, from my understanding it sounds really tough to make these systems complement each other. Maybe some day we'll be running so many models locally that they can run in parallel but even that...
Lunar Lake looks like the biggest improvement for Intel in over a decade. In terms of performance per watt and GPU performance, it looks like Lunar Lake will beat Zen 5 and Qualcomm's X Elite. The only downside is that Lunar Lake is focused exclusively on thin-and-light laptops and handhelds; it's not their highest-performance product for mobile or desktop. That is Arrow Lake, which looks great for performance but will lose some of the efficiency and iGPU gains Lunar Lake brings.
"looks like the biggest improvement for Intel in a decade"
No, it's since Alder Lake.
I'd like to see a version of this that is compatible with CAMM for memory instead of on-package.
Gluing the RAM closer can yield far better RAM performance than CAMM. I think this CPU will be used in very small systems. I'm betting they will use even faster RAM as soon as it becomes available.
@@impuls60 CAMM supports the same 8533 speed that this package does. I don't buy this explanation.
@@ChristopherBurtraw The on chip is mostly for the power savings, not neccesarly for bandwidth.
Lunar Lake is optimized to be very efficient (and I hope it actually delivers). It should be perfect for ultrabooks which want really really long battery life and for gaming handhelds.
For the rest of normal folk and normal (or powerful) laptops, we'll have Arrow Lake. And hopefully we'll see LPCAMM2 laptops with that. I dream of a Framework 16 with Arrow Lake and LPCAMM2 in which to add 128 GB of RAM and finally upgrade my almost 8 year old laptop, to one that will also last me 7-10 years.
@@Winnetou17 I'm hoping the next gen (after the one they just announced) 13 board will have it too. Framework won't want to implement this one even for the 13...
I'm still waiting for a chip that integrates 32GB of HBM3e as an on-package L4 cache within the same SoC, while also supporting the addition of DDR5 memory modules with ECC capability, rather than being limited to just integrated memory.
bruh
the hell you talking about man lmfao
Could you potentially change the memory modules on the CPU?
If you de-solder the modules and replace them. But that’s not very practical.
Would be very interested in the thermal performance, as they are using TSMC manufacturing.
Thank you for this educational content! Underdog Intel is striking back with a mean kick! This is an amazing SOC! Its real competitor is the Apple M4!
The core layout with everything right next to the memory controller makes sense, and I'm glad to see intel moving in this direction. It'll be super interesting to see how x86 power consumption improves with this layout!
Bro, it's been a while but I still have that love for Intel
Exciting stuff! Great video, as usual. I do have one question, though. Is it certain that a server implementation (or any) of Lion Cove would have SMT? Also, different implementations of the same architecture sounds more like a standard vs Dense Zen situation to me, and I think that it could get expensive to develop lots of just slightly different cores
Yes, Lion Cove in Xeon will have SMT. And yes, LNC is also more flexible. Not really a “LNCc”, but there will be size differences.
@@HighYield Nice! Excited to see what they will come up with
Depending on the details of moving data between the NPU and the GPU, using both at the same time could work really well for some AI workloads. Training a QLoRA, where the main weights are only used for 4-bit inference that could run on the NPU and the backpropagation is done only for a low-rank adapter in fp32 or fp16 on the GPU, could potentially work well. It won't be faster than a dedicated GPU; even a 3060 should outperform it. Memory bandwidth will likely also severely limit its performance. But often the issue with GPUs is not speed but available memory. Also, this should be much more power efficient.
It will all depend on software support; that is usually the issue with most non-NVIDIA AI hardware.
I'm currently using a Zen 4 AMD CPU and have been an AMD user for several years now, but honestly Intel's next-gen stuff looks more compelling than Zen 5 and up. That is, if they can pull it off. I really like thread director being in hardware. Dealing with a gazillion cores and core types has been a weak point in software as of late, so I hope this can help.
I work for a major computer vendor and you're spot on. Your conclusion 110% speaks my mind and matches exactly what I've been saying since Intel presented LNL to us 3 weeks ago. I said that if LNL almost matches the battery performance of Qualcomm's ARM chips, this is going to be another Windows RT. ARM for Windows doesn't really offer a difference.
We already have more performance than needed, and NPUs are available en masse thanks to NVIDIA; it's just MS that, for now, firewalls the marketing-bullshit storytelling about Copilot and blocks anything other than embedded NPUs from being recognized by Copilot, but this will probably change next year and they'll have to open the gates. What's left? Battery performance. Ok, but if this gets matched, what's the point of the whole industry shifting away from x86? Zero...
ARM will be the thing that made Intel rethink its architecture, and from there the power efficiency, and that's a good thing.
An interesting question is how Intel will handle it in the future. Lunar Lake will be parallel to Arrow Lake-U. But what about Panther Lake in 2025? Will it continue the Lunar Lake design or go back to the generic design of Arrow Lake-U?
Looking forward to getting a windows laptop similar to the Macbook Air. I would love to have laptop without the need for a CPU fan, or maybe one that runs only during high workloads.
Sir, make a video on the 14900K and the upcoming Ultra 9 290K: whether it has an NPU, how many TOPS the CPU and integrated GPU offer, whether it has built-in RAM, and how it differs from Lunar Lake.
I think they should have used the empty silicon left on the die to make the GPU more powerful.
Cool to see a nice bit of cache on the side to minimize DRAM access. L4 foresight on desktop? Probably not, but I love what I'm seeing from Intel this year, very exciting in more ways than expected. Maybe not quite leadership just yet, but at least on par. The whole E-cores thing is evolving into something, and I won't be surprised if it eventually gets to a point like Zen Dense. So far it's still a split design mentality but a high-IPC philosophy, so the ability to use E-cores for most tasks will get the best out of the efficiency. Last time I was this excited was Alder Lake?
I'd love to see something like this for desktops where I can get an entire SOC with 32-64GB of ram all bundled together. I know there are upgradeability concerns but the performance benefits if you over spec could be really good, especially for ram heavy applications.
When will you do an M4 chip deep analysis?
As soon as we have a good die-shot and more information on the Pro/Max/Ultra variants.
@@HighYield What do you think: why are AMD and Intel not able to compete with Apple in terms of single-core performance, despite being in this field for decades? I mean, just look at the Geekbench 6 single-core performance of the M4. It's insane for an iPad this thin.
Why waste so much space on an NPU which no one would really utilize?
windows recall and similar shit, but they did prove it doesn't need an NPU. they are no longer building chips for us, we just need 4 fast cores and a lot of cache to go with them :)
blame Microsoft
Wave goodbye to everyone else as they accelerate past you, with the help of ai
they are betting on an AI world where everybody runs AI locally, and others have the same idea. i personally find that very creepy, i just want the pc to do what i order, not to have its own ideas
I have two guesses on why Intel 4 and later processes (Meteor Lake and Lunar Lake) are all mobile-oriented and not desktop: 1) the processes are not suitable for high-performance operation, but do get better power efficiency; 2) managing fab-process capacity.
It's very sad there's no perfect chip. I like some moves Intel made with this design, like the interposer for connectivity, the complete turn-off of some tiles when they're not required to maintain a low idle power draw, the extra efficiency of the E cores as the main cores used to lower power draw under workloads, and the monolithic design of the compute chiplet. If only the E cores were more like Zen c and less like big.LITTLE, and the RAM weren't integrated on the SoC...
Is there a good chance that all these architectural improvements will help Intel make much more efficient desktop CPUs in the near future? I'm really interested in the Small Form Factor space, and I think Intel has had a bit of a hard time competing there in recent years with their processors.
I know this is a video about Lunar Lake but this video gets me really really excited for Battlemage and desktop products like Arrow Lake
If intel could figure out a V-Cache competitor and commit to multiple years of support for a motherboard platform they could make AMD straight up unattractive on desktop. I say that as someone with a 7950x3D and invested into AM5!
I can't wait to see the next few years
Shouldn't intel be exceeding not matching AMD's current offering to make AMD unattractive?
Or the standards are different for different companies?
@@aravindpallippara1577 Well currently, Lion cove is projected to have higher single threaded performance than Zen5 cores. That single thread lead will help with everything, including gaming. AMD has the biggest advantage in gaming rn with V-Cache, platform support and efficiency.
with skymont, intel has a real chance of gaining a huge performance/watt uplift particularly in multi threaded loads which is where intel sucks down a comically large amount of power
That's why I specified V-Cache and platform support would make AMD unattractive on desktop. Because Intel already has a decent chance of having class leading single threaded performance, adding V-Cache to an intel CPU would surely boost performance considerably (especially in games that love v-cache like Factorio, or Kerbal Space Program)
And platform support like we have with AM5 would be really great. Having to upgrade every 2 gens is a huge downside compared to AMD's offerings and commitment to 2027+ support and why I personally went for the 7950x3D and AM5. V-Cache and platform support is just great
The V-Cache is a solution for the slow memory controller on Zen processors. When you glue the RAM this close, you don't have that AMD problem, so no need for the same solution. The mystery cache is probably enough if Intel engineers did their job well.
@@impuls60 Imo that's an F Tier comment. L3 cache is going to beat out faster memory just by virtue of the insanely high bandwidth and lower latency. There's a reason intel loses in those games that favor 3D cache
@@impuls60 Agreed with the above commenter. A CPU cache and RAM have vastly different types of uses - cache is very raw and hence very fast, as opposed to RAM, which needs to be encrypted and passed through OS-layer checks before being accessed - cache is still the king for single-thread performance.
This guy got me high this morning. He got the sample
Technically you can upgrade the memory after purchase. You just have to be really good at soldering 😁.
I only mention this because I knew someone who did this to his MacBook. He bought an 8GB model, and with some patience and skill it became a 32GB model 😅.
But did it help perf
@@noticing33 I don't know, but I know the device worked afterwards. I lost contact with him after his internship ended. But I think he wouldn't have done it if it hadn't improved performance.
Hahaha, it's gonna be a fun DIY project
That works when the memory is soldered to the motherboard, not when it's on the CPU die.
@@mattbosley3531 very true but this was still an Intel Mac with soldered memory.
I’ve been using the Zenbook s14 for a couple of days and my god, the battery life is mind blowing, while also offering leading performance.
Not even 50k subs and you're already getting free Computex trips? Damn, balling
No LP E core island this time?
All E-cores are their own “LP island” in LNL.
@@HighYield Thanks. What is this metallic U shape on the periphery of the chip? Does it have any use, or is it just to provide structural integrity?
@@bruceparker3139 Yes, it's purely to support the PCB.
Will the lunar lake offer more cores?
Nah, only 8 cores (4P + 4E), and no multi-threading
@@NothingNewNerd I don't think it's designed for high-powered 45W+ laptops.
@@saricubra2867 Bruh, did you read what I wrote
Wake up babe High Yield posted 🔥🔥🔥
So... it won't be possible to even add DRAM to the system afterwards? What the actual hell...
it is a soc for laptops, but they might offer something for desktops later, or might not, intel is like a kid lost in the forest at this point, copying everybody but scared all the time
X86 is not dead yet. Pat also seemed very excited for Panther Lake...
A cynic would point out Pat was excited about Raptor and Meteor Lake and even Sapphire Rapids.
These presentations have not been a reliable guide to what's delivered and when in recent years.
@@RobBCactive Raptor Lake was great
@@technewseveryweek8332 if you read the tech news, you'd know better.
he was happy when he offered all their unused fabs to amd and nvidia because they are going extinct and nvidia seems to have listened
@@betag24cn to be fair, the Xeon Sierra Forest server chip is on the Intel 3 node and with 144 E-cores has some advantages over Bergamo's Zen 4c.
So which is best for gaming, Lunar or Meteor Lake?
Lunar Lake should be quite a bit stronger. But like always, wait for benchmarks.
@@HighYield The lack of a ring bus and cache hurts gaming performance.
Raptor Lake mobile chips are way faster in games than Meteor Lake ones.
i9-13980HX (what a beast of a mobile CPU) is a desktop Core i9-13900K (the 2nd or 3rd fastest gaming CPU behind the Ryzen 7 7800X3D) with lower power draw and slightly lower MHz.
Meteor Lake doesn't max out at 8 P-cores with 36MB of L3.
I will neither understand, nor remember most of this, but it was interesting.
so, did intel give up racing moore's law by using tsmc fabs?
Intel has a dozen fabs under construction now.
Below the Lion Cove Cluster is probably the ISP. Maybe?
I think it's Intel's "security area" called the IPU (Infrastructure Processing Unit). Forgot to label it.
@@HighYield It is their imaging engine, called the "Imaging Processing Unit - IPU". It takes up a lot of die real estate; I hope we see devices with better cameras to take advantage of the enhanced processor.
@@HighYield the Platform Controller Tile hosts the Security and Connectivity IP blocks.
@@EnochGitongaKimathi Thanks, that makes more sense. Hate that there are two "IPUs"
I'm glad we have such a vast choice of mobile CPUs these days: Apple M1/M2/M3, Intel Meteor/Arrow/Lunar Lake, AMD Hawk Point/Strix Point, and a new player - Snapdragon X Elite - on its way. We've never had a more difficult choice.