Next-Gen CPU Acceleration: AVX For Generative AI

TechTechPotato

Просмотров 25 тыс.

Добавить в
- Мой плейлист
- Посмотреть позже
Поделиться

HTML-код

Размер видео:

Показать панель управления

Автовоспроизведение

Автоповтор

Опубликовано: 20 янв 2025

Комментарии • 156

@Wunkolo Год назад ⁺⁷³
I contributed AVX512 acceleration in the CPU backend for emulators like Xenia and Yuzu/Citra/Vita3K(Dynarmic). You _can_ currently use AVX512-features on 128,256-bit registers with AVX512VL. The issue is that it defines it as a subset of the 512-bit registers so it requires a full 512-bit implementation rather than defining it as orthogonal super-sets of 128->256->512.
There's also some outdated information here. In the pictured article by Travis Downs here(12:52) the downclocking issue hasn't really been an issue since Icelake(2019). Especially if you only ever touch 128/256-bit registers.
@Lauren_C Год назад ⁺²
Out of curiosity, is it still worth developing for AVX-512 at this time, given Intel has largely appeared to have dropped it on consumer CPUs?
@Wunkolo Год назад ⁺¹⁷
@@Lauren_C AMD has picked up the slack for a while now with Zen 4, so it's becoming more and more ubiquitous between both vendors. Intel appears to be bringing some form of it back with this whole AVX10 thing. The least-exciting part about AVX512 is the vector-width, so there is lots of value to have AVX512 features for smaller 128/256-bit registers. This push for AVX10 is kind of proof of that.
AVX512 work may continue. Both AVX512 and AVX10 use the same EVEX-encoded instructions, which specifies the vector-length in the instruction itself. AVX512VL is the AVX512-feature that allows operating upon 128-bit or 256-bit registers rather than just the 512-bit registers. So AVX512VL and AVX10 instructions can be exactly the same but may now fault if the hardware does not support certain vector widths. Before, hardware was required to support the 512-bit vectors before even supporting 256/128. AVX10 kind of flips the script on that definition and starts at 128-bit and extends to 512-bit.
So if an AVX10 chip runs my 128-bit AVX512F+AVX512VL code, it will be fine. But if I use 256/512 bit registers, then I have to be more careful and ensure the hardware supports it to work on 256/512-bit AVX10, but don't have to check at all in the case of regular AVX512.
So there's no reason to stop writing AVX512 code, since it's the same as AVX10 code but with some extra steps and safeguards to be had.
@arditm2178 Год назад
with all that experience, what's your opinion on Gather and Scatter memory operations? (asking as a noob)
@EyefyourGf 9 месяцев назад
I know this is old comment,but i just wanna say thank you for contribution,and for in depth explanation.
@eekpanggang Год назад ⁺⁵
FINALLY THE GUYS BEHIND CHIPSANDCHEESE! Really love that you brought them with us here, Ian!
@porina_pew Год назад ⁺¹⁷
I found 2 unit AVX-512 (SKX) to be worth it even with the clock drops. It gave a per-clock uplift in performance of around 80% for my interests, so it was still ahead in throughput. Even 1 unit (RKL) gave 40% per-clock boost.
@Wunkolo Год назад ⁺²
Even back with Skylake-X, a %15 reduction in clock speed for an almost x2/x4/x8/etc increase in performance is a _huge_ perf gain.
@flashmozzg Год назад ⁺⁴
The issue was that if you had mixed code it wsn't worth it and the latency was noticeable. I.e. if you had a few avx512 operations surrounded by mostly avx2 or lower ops, then all of them would be downclocked (I think the number to switch back from downlcocking was around 80-100 cycles).
@yuan.pingchen3056 5 месяцев назад
@@flashmozzg just buying an AVX512 capable processor with lower clock multipliers, it's okay, for example, the early alder-lake i5-12400 or i5-12500
@neonmidnight6264 Год назад ⁺³⁴
11:30 The simplest way to address this is to have the compiler fill in the 512b vectors as 256x2, so that when it emits fallback, it is effectively achieving 2x loop unrolling. This is a strategy that .NET takes, even if historically it was intended to simplify vector fallback code, it ended up working a bit too well for pairwise (but not masking and not horizontal) operations.
CoreLib still does bespoke implementations of 512b, 256b and 128b widths, but it's not labour intensive because AdvSimd and AVX2/SSE4.2 features map to each other fairly cleanly (movemask emulation notwithstanding) allowing for a unified API. Nevertheless, this approach appears to be superior because, for example, StdLib variant of memchr in Rust is not vectorized nor LLVM is able to auto-vectorize it, leaving significant amount of performance on the table for such an important operation.
Generally speaking, most well-written libraries which utilize SIMD end up partially reimplementing their own cross-platform abstraction on top of intrinsics to avoid significant code duplication. The arrival of cross-platform SIMD abstractions in both C++ and Rust this late (both are still unstable/experimental) in 2023 is really disappointing.
@alex84632 Год назад ⁺³
It sounds like the compiler should break 512 into 4x128, or 2x256 if it can, or 1x512 if it can. So that all versions of AVX10 work.
@awesommee333 Год назад ⁺²
Doesn’t really work for stuff like shuffles. Or we’ll it does but it’s four instructions minimum at that point for a single 512 bit shuffle
@neonmidnight6264 Год назад ⁺¹
@@awesommee333 Yeah, only for pairwise. For horizontal operations you still need to use the exact supported width.
@neonmidnight6264 Год назад
@@alex84632 While true, .NET's register allocator and level of optimization can't compete with GCC/LLVM for the IR shape produced by unrolling 512b vectors into 128x4 - keep in mind that usually you operate on pairs of vectors meaning that's already 8 V128 vectors in flight - this is a lot of register pressure and usually results in stack spilling and the compiler giving up on certain optimizations.
With that said, I do believe that LLVM can probably do better, but there's tradeoff with choosing 512b width when it comes to regressing small lengths (or code size if you do 512 -> 256 -> 128 -> scalar) if this is general purpose code.
@MrVladko0 10 месяцев назад
С++ have good SIMD libraries like Agner Fog and EVE. Experimental SIMD support in C++/Rust is miles away from them.
@wile123456 Год назад ⁺⁷
*cries in the forgotten TX instruction set which gives a boost for the PS3 emulator*
@woolfel Год назад ⁺²¹
It's nice to see AVX get improvements, but a big part of SIMD is the software stack. Without great SIMD compiler to optimize the execution, better AVX won't necessarily produce the gains. NVidia's CUDA stack is dominant because the compiler is better than competitors. For example, CUDA's default execution is "non-deterministic" to maximize utilization. If you set CUDA to determinant execution, the throughput takes a hit.
@vasudevmenon2496 Год назад ⁺³
I started using clamchowders(Chester) memory benchmark and been a year. Thanks for making it open source. Thank you Ian for the reporting
@JonMartinYXD Год назад ⁺¹
All you need to know about AVX-512 can be found under the _CPUs with AVX-512_ section of the AVX-512 Wikipedia page. Just look at that table and think about what instruction subsets a software developer should try to use in their code..
@twopic5408 Год назад ⁺¹
Had loads of Fun editing this video
@andrey7268 Год назад ⁺³
5:50 AVX-512 is not "limited" to 512-bit vectors; with AVX-512VL (which every CPU that has AVX-512 supports) you can use AVX-512 instructions on 128 and 256-bit vectors. The problem (for Intel) is that AVX-512 *requires* 512-bit vectors to be supported by the CPU. AVX10 makes 512-bit vectors an optional feature. This is mostly Intel solving their own problem with E-cores not wanting to support 512-bit vectors for some reason. AMD showed that 512-bit vector instructions can be implemented on top of 256-bit vector units, so really this is just Intel refusing to do it for some reason.
@LarsDonner Год назад ⁺¹⁴
To support different x86 processor generations I already have to write multiple versions of my functions: SSE, AVX, AVX2, AVX-512 and dispatch to the correct version based on the CPUID flags. Now I get to write another 2 or 3 versions more? Oof.
Also, the shown article at 12:52 concludes that down-clocking was already not a problem on Ice and Tiger Lake. Weird how that story just doesn't want to die.
@tappy8741 6 месяцев назад ⁺¹
AVX10 is 100% an intel retcon to avx512 because all of a sudden they need to be more efficient with silicon to be competitive. That said, the sse/avx/avx2 implementations for a given problem tend to be related, in that same way avx10/avx512 are related. So really there's 3 main implementations, generic, avx2 family, avx512 family.
@LarsDonner 6 месяцев назад
@@tappy8741 My guess would be, that the AVX-512 ans AVX10 instructions would still have different opcodes, even if they do exactly the same. In that case the algorithm may be the same, but I still have to convince my compiler to generate different versions of it.
@tappy8741 6 месяцев назад ⁺¹
@@LarsDonner Yeah it's still a pain, but it's not an unfamiliar path.
This whole 256 bit AVX10 is utter nonsense. All processors that have AVX10.256 will have AVX2 also right? And at this point most programs that want simd have implemented avx2? What's the benefit of porting if an avx2 path already exists and there's not much to be gained?
@LarsDonner 6 месяцев назад
@@tappy8741 I guess you would still gain the ability to apply masks to every operation, control rounding modes and have a more complete instruction set. But now that the 512-bit-cat is out of the bag (and AMD is going hard on it) I find it hard to believe that there will ever be many AVX10/256-only CPUs.
@tappy8741 6 месяцев назад ⁺¹
@@LarsDonner I doubt amd will make avx10.256 only, unless they do an embedded sku or a custom part for console or steamdeck. But intel, intel I think wants to do avx10.256 for most consumer and avx10.512 for server. Might get avx10.512 on intel's halo consumer slash prosumer. At least that's what I assume the plan was when they set this avx10 crap in motion, they couldn't compete on general compute when they tied the avx512 anchor to their necks and had to ditch it. Things may change now that they've bitten the bullet and are using tsmc much more heavily.
I'm not denying that avx10.256 is more rounded, masks are nice etc. But the avx2 implementation already exists, will work on anything that has avx10.256, and even if all future consumer intel is avx10.256 it'll be a decade before there's any sort of penetration worth a damn.
@jannegrey Год назад ⁺²¹
It would be cool, but it seemed Intel almost killed their AVX-512 adoption, by fusing it off from Alder Lake. Yes - it didn't work on E-cores, but code could be at least run. And from memory de-clocking wasn't nowhere nearly as severe as it used to be. Writing so many versions of the code is taxing and a lot of people won't do it - with their limited resources - just to have it run on very few machines. It doesn't look great if AMD implementation seems better than Intel's, in Intel made ISA. So I will wait and see if it really works and there is some movement from Intel to do it well. In practice, not just in theory.
@octagonPerfectionist Год назад ⁺³
could it be run though? is it able to do a hybrid core layout with avx-512? i thought it was one or the other
@jannegrey Год назад ⁺³
@@octagonPerfectionist Bad phrasing on my part (I assume you're talking about: "it didn't work on E-cores, but code could be at least run." part). I meant that code could be run on P cores (so CPU as a whole package). From memory, E-cores couldn't run 512 bit extensions, they didn't have silicon for that. But early Alder Lake CPU's, assuming you had proper motherboard that allowed for it, also allowed for AVX-512 to be run. Problem was that Thread Director could allocate it to E-cores, and then it was bugging out. Rather than deal with it (TBF first 6 months of thread director improvements made a lot of changes for good, but it was gigantic operation) and making it that AVX-512 was scheduled only for P-cores, initially BIOS would not allow for AVX-512 to be scheduled or it would force you to switch off E-cores.
So depending on application - yes it could run AVX-512, but if it wasn't scheduled or written perfectly, it would send it to E-cores (and scheduling it all was a big problem - that is why it depended mostly on application) - and they would have problem. That is why you ended with "one or the other". And I don't even blame Intel for not prioritizing it - again Thread Director improvements in the first 6 months were staggering. But I am annoyed how CPU was initially sold and marketed as AVX-512 capable. In theory, on very well written applications - yes. In practice, it usually meant running with only P-cores enabled, which sucked.
@oj0024 Год назад ⁺⁴
It would be cool if you could cover scalable vector extensions like rvv or sve more in depth. The chips and cheese people covered the P870 and Veyron V1 quite in depth.
@MarekKnapek Год назад ⁺⁸
@1:40 "First we have AVX, introduced in way way way back in Sandy Bridge." No, first we had Intel MMX and AMD 3DNow!, after that SSE and similar.
@treelibrarian7618 Год назад ⁺²
So the comment by clamchowder at the end about breaking up avx512 instructions to enable E-core implementation (as they already have for avx2) got me thinking about what is required to do this. Effectively the only major hurdle is the instructions that allow multi-lane scope, sub-lane granularity selection like vcompress,/vexpand/vpermps/vpermi2ps etc. which have to take more input data than a 128-bit port can handle. Everything else can be broken down into 128bit wide µops quite easily, with the addition of a shifted k-reg input to determine which part of it should be used.
Making a big assumption that each vector execution port must be capable of writing both k-reg and vector registers as output a possible solution could be that the k-reg's could then be used to communicate between µops to describe which parts of the output register have already been processed as the multiple data gets fed to successive µops. But there's obviously something I don't know about, since I also can't see any reason why vperm2f128/vinsert/extractf128/vbroadcast etc wouldn't be handled in the renamer but they clearly aren't, at least not according to intel's description of gracemont avx implementation latency in the optimization manual...
I'm sure there's a good reason why not, but it also occurs that breaking up 512bit vectors might even benefit the P-cores if only having 128-bit registers simplifies the register->execution port crosspoint-switch and XU size enough to allow more 128-bit vector execution ports in the same silicon area, giving the same or greater total compute - and allowing smaller vectors to also use all the available compute, even eliminating the whole "dirty avx registers" issue when using SSE instructions... but whatever.
There will still be a fairly long adoption period since most desktop systems still won't have the features for quite a while
@jeffreybraunjr3962 Год назад ⁺²
I’m glad there are very intelligent people who understand all of this
@Steamrick Год назад ⁺⁵⁵
I forsee developers implementing the 256bit version of AVX10 that runs on every CPU and ignoring the 512bit variants...
@falvyu Год назад ⁺⁸
I think that's probably what's going to happen. That being said, with all 'operations' being available in multiple sizes, porting 256 bits AVX10 to 512 bits should be easier than porting SSE/AVX to AVX512.
@wile123456 Год назад
Only on servers and HED will 512 bit be relevant
@flink1231 Год назад ⁺²
Agree, unless some dev has a very specific server side application that can use it os a very specific customer need
@Lauren_C Год назад ⁺⁷
@@wile123456PlayStation 3 emulation gets a pretty massive performance boost from AVX 512.
@MiesvanderLippe Год назад ⁺³
The speed up is only ever relevant to people writing software that will go through the effort. You can also have the compiler do some trickery if you write code that can be optimised this way.
@Winnetou17 Год назад ⁺⁴
Oh man, I couldn't NOT remember Linus Torvalds rant about AVX ... IIRC against all AVX, not just AVX512. It annoyed the hell out of him the downclocking part, since it wasn't just for the time of those instructions it was for several miliseconds. I wonder is his stance on this has changed and what he thinks of AVX10
@CaptainScorpio24 Год назад ⁺¹
brother i have i7 12700 non k with avx 512 enabled on asus tuf z690 plus wifi d4. I don't knw its use .
@nmopzzz Год назад ⁺¹
Whats the different between AVX and the older SIMD instructions?
@retroanderson Год назад ⁺²
I'd be interested to know why RPCS3 can leverage AVX512 so well.
@TechTechPotato Год назад ⁺⁴
The weirdness of the Cell vector engine/processor maps to big vector instructions well enough :)
@unvergebeneid Год назад ⁺¹
4:43 that's a lot of silicon just to have a slightly faster memcpy...
@MrHav1k 2 месяца назад
This is a great explanation. Thank you!!
@Kiyuja Год назад ⁺¹⁹
I hope Intel pushes for this even in consumer chips. I was so sad seeing they gave up on it after inventing it themselves. AVX 512 can be used for compilers but also emulators. Might be helpful in virtualization, dunno exactly. I genuinely would love seeing it more spread so software can take advantage of it, maybe even games can profit in the future. I'd rather see that than AI units ngl...
@JATmatic Год назад ⁺⁴
The software should not need to recompiled/redesigned on same ISA (Intel x86-64) to the advantage of larger SIMD registers. (SSE, AVX, AVX-256,AVX-512...)
The different SIMD ISA problems with increasing data widths would could be solved by Agner Fogs ForwardComs way of doing SIMD.
His ForwardCom proposal of ISA would allow operate on variable width of SIMD registers.
It is however only an experiment and only few real RISC arches have variable width SIMD functionality today.
@sixteenornumber Год назад ⁺¹⁰
I really wish everyone would get on the same age with vector length.
@esra_erimez Год назад ⁺¹
How much of these advances can be attributed to Pat Gelsinger?
@bjornlindqvist8305 Год назад ⁺¹⁰
Intel appears to have no coherent vision on how they want SIMD to work on their cpus. Most developers even those interested in hpc does not want to have to learn a new instruction set every other year.
@edmunns8825 Год назад ⁺¹
@TechTechPotato Have you had a at AMX on sapphire rappids?
@Veptis Год назад ⁺⁸
Raptor Lake had no AVX-512 but put more cores on the desktop parts. Alder lake had it unintentionally, but actually for the Xeons that used the same cores.
I am not about to buy a xeon for my workstation. I still game a lot so it will be an i9. I want a larger accelerator card with massive memeory to run large models locally. and intel is twlling me to use IDC instead. I said the same thing enough times by now.
Its interesting to see that there is a future, but the stuff I see in ipex is AMX, matrix extensions... not just vectors. So surely we get ATX soon - advanced tensor extensions. but that's already used as an acronym by intel.
@lost4468yt Год назад ⁺³
lol u still usin tensor extensions bro? we usin advanced flexor extensions
@jjdizz1l Год назад
This was very informative and a great show.
@christopherjackson2157 Год назад ⁺⁵
Is the heat issue just because avx512 is powering up such a large, contiguous, area of silicon?
Or am I vastly oversimplifying this in my mental model 😅
@jordanmccallum1234 Год назад ⁺⁷
Yes that it's contiguous, but more so 1. A ton of data is being moved from the register block to this area and 2. A lot of calculations are done in a very short timespan
@kazedcat Год назад ⁺¹
@@jordanmccallum1234It's the movement of data. AMD's solution of doing the 512 execution over two cycles solves the problem because now data is move every other cycle instead of every cycle but the density of calculation is still the same
@jordanmccallum1234 Год назад ⁺¹
@@kazedcat the density over time has halved?
Completing an AVX-512 operation in two cycles is double the time. I'm not talking about the density of the execution section, but the calculation as a whole.
@kazedcat Год назад ⁺⁴
@@jordanmccallum1234 It is not actually double the time unless the instruction can be executed in one cycle. For example if the instruction needs 8 cycles to execute then double pumping would take 9 cycles because execution is pipelined and you only need to wait 1 more cycle for the other half of the execution to be done. If you are doing a lot of simple instruction then yes the amount of computation is halve but if you are doing a lot of complicated instruction then computation is not halve.
@TheParadoxy Год назад
This type of content is why tech tech potato is the best!!
@maxthebean8047 Год назад ⁺¹²
AVX 360, AVX One 🤣
@dannotech2062 9 месяцев назад ⁺¹
🤪🤪🤣🤣😂😂
@matheuswohl Год назад ⁺³
7:26 "compiler devs" shows html code lol
@levygaming3133 Год назад ⁺¹
I mean there’s only so much stock footage
@HakonBroderLund Год назад ⁺²
Really liked this style of reporting
@flink1231 Год назад ⁺⁷
Intel should just implement 512 on e cores like amd did, splitting into 2 256bit ops. It obviously is not free in terms of space, however it likely significantly more space efficient.
@SatsJava Год назад ⁺³
Intel must listen to clamchowder
@milestailprower Год назад ⁺⁷
Really frustrating Intel decided to "nuke" AVX-512 partially thru Alder Lake's release via fusing it off.
We know that Alder Lake can do things like emulate RPCS3 better with AVX-512. They could have done something like "AVX-512 may exist on P-cores, but we will provide no official support for it. If you do try to workaround microcode patches, and enable AVX-512 in BIOS, we provide no guarantee that things will work properly, as we are not validating AVX-512." Instead, they had to go nuclear and fuse off all new CPUs.
@davidmiller9485 Год назад ⁺³
I don't know why this surprises people. Intel is known for just blocking off things they don't want to deal with rather than finding a work around and that includes anything that might cause customer support tickets to increase. I don't have many complaints about Intel, but that is one of them.
@divyanshbhutra5071 Год назад ⁺⁷
I'm VERY excited about the future of X86, with X86S.
@pedrophmg Год назад ⁺¹
2^10 = 1024
I've heard in somewhere that was the reason, isn't?
@Stadtpark90 Год назад ⁺¹
I just watched a 15 min video without understanding a word.
The only thing I understood, was that there is a chicken and egg problem concerning adopting of powerful last Gen tech, and that there were unintended consequences on the thermal side, and that the new implementation is trying to make it easier to get the power from the wheels to the street. - Is it like buying a Subaru? Every model has 4WD now?
@TechTechPotato Год назад ⁺¹
The big clue was at the beginning - advanced vector extensions. The ability to vectorize hard compute problems, of which ML and HPC are two, and enable this in hardware easier. It was kinda aimed at those familiar with programming optimizations for specific hardware (because not all pieces of hardware do all the things).
@billykotsos4642 Год назад ⁺¹
Hotchips for sure is super cool !
@samuel5916 Год назад ⁺¹
Not Chester being a whole entire snack 😳
@Summanis Год назад ⁺³
To make sure I understand, AVX10 on Intel hybrid architectures will still be limited to 256bit on all cores? Or will this allow for the core to select which code path it takes?
@noergelstein Год назад ⁺⁶
I don’t see how that is possibe. If you make a check “if supports 512bit” then the scheduler could move the thread to a core without support between the check and the actual execution of the instruction. If this were possible, it could have been done on Alderlake.
What Intel needs to do on the hybrid architectures is support 512 bit on both but then internally have a fast execution on the P core and a slower execution on the E core (such that it where the same as calling two 256 bit instructions).
@sebtheanimal Год назад
another thing I never knew I needed as a Linux user.
@autarchprinceps Год назад ⁺⁷
I love how the main example for vector use is video encoding, a thing professionally and even non-professionally 99% done on GPUs, or more specificly their dedicated video encoders or occasionally special cards with just that, and then lots of silence follows when asked for another example, followed by something HPC maybe, which is mostly again GPUs or much more dedicated vector accelerators or special purpose CPUs primarily designed around vector units, not for a long time just of the shelf CPUs due to their vector extensions.
Vector extensions are in a really weird place. They are not as good at what they do as GPUs, let alone dedicated hardware, but still not prevelent enough in use in applications where support for GPUs or other accelerators would be too much efford to add.
Now some have come up and said, how about AI, but once again that is just plain not true. There are dedicated AI chips and accelerators, and not just in servers but on the edge too, including in most smartphone SoCs even, and if that is too specialised, lots of AI also gets done on GPUs. Nobody is going to run major AI tasks on just vector extensions if they have any control over the hardware used. Why should they?
@TechTechPotato Год назад ⁺¹¹
The AI use case is very much true. Most DC inference is still done on CPUs today.
@todorkolev7565 Год назад ⁺¹⁰
HI, I used to work for a video teleconferencing platform. Our code, on the server, re-en-coded different video streams and we were highly reliant on our compiler guys to give us an edge, in any way possible, to make video quick and cheap to process.
My first question was: Hey, aren't GPU's better for this stuff?
Our very social and not at all Asperger's compiler devs abruptly corrected me that GPU processing introduces latency, which a premium teleconferencing platform aims to reduce and, anyway, server farms very rarely have much GPU on board to use and then it is expensive and heterogenous (IE: You get different GPU architectures all the time).
Last but not least, Few jobs are purely only-parallel. Once you throw in some work that isn't parallelizable, GPUs start losing their charm!
@lukemcdo Год назад ⁺¹
Tacking onto the video angle, the whole reason that GPUs are efficient is because they have a mostly-fixed-function decoder sitting next to the display engine. The minute you're not displaying, there's nothing special about having the fixed function hardware with the GPU.
Following up with that, the parts of video encoding and decoding that in most situations are fulfilled by fixed function hardware are a huge part of the work but there is often heavily parallel workloads to be applied to the results. Take for example, Intel noting that the "film noise" AV1 filter was running on GPU shaders on Alder Lake/Raptor Lake. Cool, but not something the above software vendor can rely on being present, and now also that is a source of error and/or latency compared to a CPU core on top of the fixed function hardware component.
@dannotech2062 9 месяцев назад
AVX programmer here. For your information, AMD's implementation of AVX-512 runs at AVX2 speed, because they take a 512-bit register and split into 2x 256-bit operations and "double pump" them through the pipeline to get a 512-bit results. So, they take 2 bites from the cookie to generate 1 poop. It's nice because it has ISA compatibility, but there is no additional speedup from the vector size. Having analyzing the IPC of my AMD 7800X3D and compared it my 36-core Intel Sapphire Rapids based CPU, I have come to the conclusion that AMD can execute up to 3x 256-bit operations per clock.
AMD AVX2 instructions per clock - 3
AMD AVX-512 instructions per clock - 1.5
Intel AVX2 instructions per clock - 3
Intel AVX-512 instructions per clock - 2
VTune shows me that AVX2 instructions execute on port 0, port 1 and port 5, where as AVX-512 instructions execute on port 0 and port 5 only.
So clock for clock, Intel has a 33% advantage over AMD's implementation.
If AMD ever decided do double their FP execution width with, say Zen 5 or Zen 6, it's game over for Intel and it's AVX10/360/one/series-S/X specifications will be dead. For as much AVX code as I have written, I will never support AVX10.
Anyone is interested in seeing AVX-512 in action, go to my channel and watch the video "Real-time software rendering with AVX-512"
@tappy8741 6 месяцев назад
Apparently Zen5 will fouble FP execution width, but I'll believe it when I see it.
@johnkost2514 11 месяцев назад
The 8087 has come a long way from when I was a young puppy ..
@Trick-Framed Год назад
Crap. I missed Hot Chips. They let me watch last year. Good time. Lots of info.
@sloanNYC Год назад
Very interesting perspectives for sure.
@afre3398 Год назад ⁺²
Do Intel and AMD have some co-operation or joint group regarding new instructions set additions. Or is it like customers go to Intel/AMD and say we rally need this
@Cormy1 Год назад ⁺¹
I'm baffled how little coverage ChipsandCheese have gotten among techtubers thus far. If any of you are curious about the difficulties Intel are facing with Arc, and its potential, PLEASE go read their articles on it!
Spoiler: Driver maturity will NOT save them.
On another side note, Skylake-X had some instances where the TIM used between the die and IHS was thermal PASTE instead of liquid metal solder, so they had MUCH worse thermal conductivity and therefore suffered more from overheating during high loads like AVX512, which means more throttling and aggressive downclocking.
@TechTechPotato Год назад ⁺¹
Part of the issue is that a lot of Techtubers are often a mile wide, but an inch deep, when it comes to the complexities of microarchitecture. Few of them understand C&C's analysis to begin with, then translating it to something their audience understands is difficult. AnandTech had the same issue.
@Cormy1 Год назад ⁺¹
@@TechTechPotato So long as there are performance comparisons to be made, the audience can grasp SOMETHING.
That's the nice part of Chips and Cheese, they benchmark.
They aren't simply technological exhibitions, but practical metrics.
Even if you don't grasp the entirety of the article on the A770, you can still see when it scores comparably to GPUs from over a decade ago!
I used to love reading Anandtech deep dives, but even I gave up on that when I realized I couldn't extrapolate anything real from them because of how many layers of abstractions there are from the underlying architecture to the end-result. It's particularly disappoint when some of the "improvements" end up having next to no real-world applications (RDNA's dual-issue shenanigans) without manually tuning everything, which isn't feasible.
As consumers, we have NO IDEA how the various components are stressed at that level. Doubling registers means absolutely nothing to us. Even the companies don't fully grasp the impacts their changes will make, this can be seen in the shrinking of cache in RDNA3 by comparison to RDNA2. Clearly they've decided they went overboard on that and the benefits don't scale well to such high amounts.
Meanwhile NVidia's L2 cache investment completely failed to cover for their reduced bus-widths, creating the spectacular failure that is the 4060 TI, but which also extends to the 4070 and 4070 TI, though no one seems to have noticed that yet maybe because they believe the higher VRAM quantities cover for it, but they don't.
You NEED context to make what you're reading meaningful, and Anandtech articles often didn't provide that when discussing architectures (partly because those deep dives were often made before launch and reviews, and rather just based on announcement slides)
Analyze the product, not the technology. You can work backwards to the technology after, referencing other products to demonstrate what the technological differences are achieving.
In addition, Anandtech didn't do micro-benches. They did a lot of application benches, which produce abstract scores that tell you nothing about what is being stressed in the product to those who aren't familiar with those applications.
By comparison, it's very easy to talk about bandwidth or latency of various components of a device, which is what ChipsandCheese demonstrates.
You don't need to understand absolutely everything about what's going on under the hood and how it works to understand what aspects of an architecture or device are either exceptional, or sorely lacking.
Techtubers and audiences can easily see that to gauge something as simple as what areas Battlemage could feasibly greatly improve on, and how wide the gap is in reality, beyond just driver/code maturity.
It's not that deep when presented in that manner.
@Gindi4711 Год назад ⁺¹
If I look at the workloads that are usually run at consumer CPUs only very few of them would profit from AVX512.
Gracemont is optimized for performance/area so Intel will not waste die space if they do not see a big benefit.
If you need 20% extra die space to get a 2x performance increase in less than 5% of applications then this is clearly not worth it.
@defeqel6537 Год назад ⁺¹
Very few consumer workloads are highly thread parallelized too, so apart from mobile power efficiency, the small cores aren't very useful in the first place. When you start crunching more data, you start seeing benefits of instruction level parallelization too
@sflxn Год назад ⁺³
Maybe I missed it but I listened to this video and heard nothing about generative AI.
@TechTechPotato Год назад
Vector extensions are used to accelerate math used in ML and Generative AI
@davidgunther8428 Год назад
Nice details!
@flickeykrunchofficialYT Год назад
Chester your the man
@ObviousCough Год назад ⁺⁴
I need AVX512 for y-cruncher
@semape292 Год назад
Whats that?
@MMGuy Год назад
@@semape292 multithreaded benchmark that computes pi (and other constants)
@Trick-Framed Год назад
Def Leppard finna sue over that retread riff 😂
@dannotech2062 9 месяцев назад
14:55
@axiom1650 Год назад
I didn't like since the counter was on 256.
@kwinzman Год назад ⁺¹
So in summary it's not as flexible as ARM's SVE. And it's not as backwards compatible as AMD'S AVX-512 implementation. Thanks Intel!
@WilliamTaylor-h4r Год назад
You have given us the world, praise angels, be wise.
@m_sedziwoj Год назад ⁺³
6:58 so they adding AVX10 because AVX512 have too many iteration, and want simple flag, but now he talking about iterations... someone do not learn on past errors....
@tappy8741 6 месяцев назад
They added avx10 to retcon avx512 given that intel failed to do it efficiently for years, amd finally supported avx512 themselves, and intel needed to trim fat to be able to compete with amd with generic performance. It was a play to hit the snooze button on simd, which may succeed given that the focus from now on is probably ML
@fd5927 11 месяцев назад
Chip and Cheese duo .... me 😮, now I'm dating myself ... Remember Itv, "Video 'n Chips" ... bet you don't .... kept my I9 11900k cpu and loving it. Before P and E cores ... AVX 512 is still active ... luckily .... Dreading the day new bios update where intel disables microcode of AVX-512 .......
@tyraelhermosa Год назад ⁺²
AVX, AVX 2, AVX 512, AVX 10, and then they're going to AVX 10.1 and 10.2?
Argghhhh
@RadhakrishnanMudliar 4 месяца назад
AVX-512
@irwainnornossa4605 Год назад
It should be named AVE.
@plugplagiate1564 Год назад
it looks like the devolpers of intel, desperatly want to break up with the mathematical foundation of chip making.
saying, the turing engine, that is fully defined by mathematic, goes down the river.
i don't think it is a good idea to swap mathematic with pure try and error programming.
@maciejkowalski6045 7 месяцев назад
its funny theese guys explain and i can`t understand nothjing from their explanation i wonder if they even know what they are talking about
@mathyoooo2 Год назад ⁺²
I really dislike what Intel is doing with avx10. They should have just put a slow version of avx512 on E-cores and be done with it.
@milestailprower Год назад ⁺¹
RPCS3 really benefits from AVX512. I suspect AVX512 will be very important for emulation in the future.
@aravindpallippara1577 Год назад ⁺¹
That's mostly due to the weird cell processor architecture on ps3
Ps4 and ps5 uses x86_64 processors so emulating them or transcoding the instructions to existing processors should be effortless
@milestailprower Год назад
@@aravindpallippara1577 In theory, yeah. You'd probably just have a hypervisor do the heavy lifting. Even outside of RPCS3, other emulators like Yuzu benefit from AVX-512 optimizations. But yeah, it appears that AVX-512 is *really* good particularly for emulating the cell architecture.
You may still need a to work on the PS4/PS5 GPU. Sony has their own proprietary APIs for the radeon GPUs that needs to be reinterpreted or recompiled. With low level APIs like Vulkan and DX12, hopefully the CPU overhead is low.
Also, the memory architecture is different than on PC. Although, maybe that's when resizable bar can show its benefits -- assuming access latency between CPU and VRAM isn't a problem.
I'm not an expert on this, but I'm sure that AVX10 / AVX512 will still be moderately useful when it comes to emulating Nintendo's next console - even if they are going to be using ARM. It will probably be useful when it comes to ARM SIMD extensions like SVE2.
@garrettkajmowicz Год назад
When is Intel going to short-change us on these instructions as well?
@user-sx6nd8zl5k 8 месяцев назад
They left aside the avx512 instructions that are very necessary for the AI due to the stupidity of the E-cores, that is why generation 13 and 14 of Intel suck, and the worst will come with generation 15 of Intel that will no longer have Hyper -Threading, one step forward and two steps back in chip engineering .x86 is dead, it died along with Moore's law
@wmopp9100 Год назад
cool thing, I am pretty sure intel marketing/product management will kill it.
if there hadnt been such a big push for virtualization in the early 2000s they would have killed that one too.
(some intel server CPUs had it, some (higher/more expensive) SKUs didn't. weird times)
@justindressler5992 Год назад
Avx is being used for AI isn't it. I was pretty disappointed after buying the 13900k and found out they remove avx-512 at the same time AMD launched there new chip with avx-512. It was vary annoyed since I was hoping to use avx-512 for AI. It probably makes sense to limit to 256 though I imagine many applications can use this without issues. But it just shows that if AMD can have 16 full cores with AVX-512 how far Intel's nodes are behind.
@josiahsuarez Год назад
I know I shouldn't but mcdonalds hamburgers are good ._.
@pilsen8920 Год назад ⁺¹
Avx10 is going to be better on amd? and it's intel's baby. 😂 they just can't get a win. Lol
@__aceofspades Год назад ⁺¹⁰
AVX10 solves the issue with heterogeneous cores, and it also puts AMD in a very difficult spot where their AVX512 silicon on consumer hardware will soon be obsolete. Obviously AVX512 adoption was very very low for consumer applications, but now developers will have to choose between AVX10 with Intel's 80% market share or AVX512 with AMD's 20% because they wont want to implement both, the choice is pretty obvious. Even before AVX10 was announced, AVX512 for consumers was dying, it had low support to begin with and once Intel stopped using it with Alder Lake, why would developers bother supporting it when now only Zen 4 supports it. AVX512 is more or less dead for consumers, long live AVX10.
@Eleganttf2 Год назад ⁺¹
make sense
@обычныйчел-я3е Год назад ⁺⁹
Zen 4's AVX-512 is in most ways a superset/equivalent of the first versions of AVX10, the Vector Length extension is there too. I hope compilers will be able to compile compatible AVX-512 intrinsics to AVX10 and vice versa, even if they won't convert the vector lengths yet.
With both consumer platforms supporting some kind of modern SIMD ISA with masking, the developers are more incentivized to write such code, even if they can't stick to an ISA.
Another question may be the new APX extension, AMD has no alternatives to that but Intel haven't launched (and likely won't any time soon) any CPUs implementing it.
@Anton1699 Год назад ⁺¹¹
AVX10.1 is basically just another CPUID enumeration method for the AVX-512 features found in Sapphire Rapids. So your dispatcher could check whether the CPU supports the required AVX-512 feature sets or AVX10.1/512.
@endless2239 Год назад ⁺³
if AVX10.1 is basically the same thing as AVX512 and Intel already said that new XEON CPUs will keep compatibility with AVX512 (worse the new 256 instructions will come after granite rapids) then developer will now have to choose between 100% marketshare of AVX512 or whatever granite rapids marketshare will be.
@shaunlunney7551 Год назад
AMD figured out how to make AVX512 work across all it cores, why cant intel? What a headache. This seems to try and make up for that lack of ability.
@KabelkowyJoe Год назад
4:00 What? Intel employee don't know why it's called AVX10? AVX is IDIOTIC idea, it's like Intel Itanium comming back from grave, embedded once again into silicone a VLIW. CPU running instruction(s) processing one cache line 64bytes. At once. Even 128bit was enough 4x32bit vector. 256 is understandable but 512 is not. This is AVX512 from "silicone point of view". Its no different than Itanium. New version would be like inventing bunch of new combination of grouping data and running various combinations on groups 512,256,128,64,32,16,8 bits. Making it only worse. Why to even burn entire compiler into silicone? I wish Intel wipe out this once for good just like they did to Itanium!
EDIT 512: SORRY im THINKING OUT LOUD
Editing milion times simple comment.. im not sure if this makes sense [some things do not]
Im affraid they are making decoder as complicated.
As this comment :>
Im not English native speaker.
Before you read - you have to imagine what i got in mind, 64th of 8 bit ALU with carry bits, like human beings with hands, and imagine you join these together side by side to create various structures. You also have 128bits FPU, 256bit, you also must allow 64 bit operations etc. Must provide carry bit for each "chunk" of data, AVX allow to execute various instructions, and pack data differently and that is main problem here! Depending on how you packed data 8,16,32,64,128bit numbers all these bits and pieces connect differently. Add 16x 32 bit numbers, or 8x64 bit numbers or AND two register together etc. For example to make 512bit wide ALU you put "people" in row. Could be also impemented as 2x256 SSE, so carry bits connect. But you can't make such structure fast, each create delay, so best is to make pipeline, and run one after another. If you make 512 bit wide CPU but that consumes more power, or 64x 8bit chunks you will have 64 instructions delay but less power consumer, result in 1 cycle / each if your pipeline is executing stable fassion. All depend on your implementation. Wrote in next paragraph GPU examle of 8 differently configured groups of SPus embadded in structure, to give an idea of how much transistors is potentially wasted, or how would be possible to mimic AVX on GPU. SPu in GPUs are grouped. And each group can execute one type of instruction on group.
Each group of SPu configured differently statically. Would allow you dispatch instructions and make virtual pipeline. This would allow you to have result in one cycle max 8 steps. No matter how you pack data in your GPU emulated AVX. You throw data into predefined basket. Im not sure if this makes sense, or how do they make it. Im not CPU designer i can only imagine me building this. Having 8 different combinations and AVX allow to pack data differently as 8,16,32,64,128 and 256 wide numbers. Having such structure and execute possibly in one cycle is nightmare. AMD is using FPUs but their AVX was slower, obviously Intel was wasting lot of transistor to make wider structure and run faster possibly in once cycle. 512 bits flipping at once is not easy task. Pipeline of smaller chunks makes it easier, and waste less energy but creates delay. And normally size of ALU, or register file is predefined, it's nightmare if each instruction pack data differently. It's challenge im 100% certain. Just imagine 8 different types of CPU 8bit wide, 64 bit wide, 512 bit wide packed into one silicone. This is more or less idea behind AVX. Programmer is happy Intel is wasting silicone.
Normally i remove comments like this...
You cant go further away from RISC. AVX could be always implemented either as 8 bit CPU with 64 instruction units or 64 step pipeline being able to process chunk of 512 bit memory each chunk differently or all at once, a 16 bit CPU with 32 step pipeline or 32 execution units also being able to process chunk of 512 bit at once, or 128 bit CPU with 4 step pipeline so on so forth. But more you add instructions to it, make it more and more sophisticated structure. It gets more and more complex anyway. No different than VILW 512 bit CPU. With very sophisticated decoder.
and CUT.
I wish they FAIL, wish ever since AVX512 was invented. Get rid of this Krzanich shit please! Just as AMD get rid of 3DNow! Implement for sake of backward compatbility as slower 2x256bit but dont make decoder even more complex. I can imagine more FPU units in CPUs, 256 bit x16 pipelined or even equivalent to AVX 4096 if pipelined. But i cant imagine complex 512 "static" structure" with dozen dozen of new "statically defined" instructions allowing you to pack data as you want. It's not accident why we have NPU and GPU with so many limitations. It's also wasted silicone if SPus are not used at moment but made simpler forces programer to think how to process data. Not this AVX blob of wires. Allowing to do everything in every possible way..
@crispysilicon Год назад ⁺³
I was here first. 😂
@xNenshu Год назад ⁺⁶
What an attention seeker 😂😂❤
@shmarvdogg69420 Год назад ⁺⁶
@@xNenshu IKR, how dare someone have a little fun in youtube comments section! 😂😂❤
@ProjectPhysX Год назад ⁺¹
@@shmarvdogg69420🖖
@Speak_Out_and_Remove_All_Doubt Год назад ⁺⁴
If you could ask Intel for one thing what would it be?
Drop the E-Cores! (at at the very least change the thread director so that AVX-512 is never attempted to be run on the E-Cores so all this won't be an issue.
@__aceofspades Год назад ⁺¹¹
That makes zero sense and will never happen. E-cores benefit every workload that can use the extra cores, while AVX-512 is only used in a few consumer applications. Nearly every consumer would be better off with e-cores than AVX-512 support. Heterogeneous designs are also the future, everyone from Intel to Apple to Qualcomm and AMD will be using some form of heterogeneous designs.
@Speak_Out_and_Remove_All_Doubt Год назад ⁺¹
@@__aceofspades Not quite what I meant, I was saying how it would be great if a program that used AVX-512 knew to only execute on the P-Cores, then we wouldn't have this issue.
As the the e-cores, I still have a lot of trouble with them in some programs get booted onto them and then running real slowly. They have their place in the market but with the latency issues, reduced instruction set, reduced IPC, slower clocks, etc, etc personally I would rather Intel gave use more P-cores.
@falvyu Год назад ⁺¹
@@Speak_Out_and_Remove_All_Doubt I'm pretty sure that Intel has considered that approach. A major problem comes the OS' scheduler: what happens if it migrates an AVX512 task onto an E-core ?
Sure, you'd have an exception and the scheduler might be able to move it back to a P-core. But then, any program that ever runs >1 AVX512 instructions would forever stay on the P-core, including those that only called library functions (e.g. memcpy). This could mean having all processes run on the P-core (=> E-cores would be unused). And I'm pretty sure you'd have other issues related to diverging CPU features.
I think the another other viable options would be AMD's: double-pumped AVX512 (i.e. 'split' 512-bits instructions into 2x256-bits instructions). However, it would probably would still require additional hardware compared to AVX2 (i.e. => bigger E-cores on a limited space => fewer E-cores).
@defeqel6537 Год назад ⁺¹
I see two other solutions: 1) just implement the ISA on all cores, the micro-op level instructions may be different and take variable amount of time, or 2) have the small cores panic the first time they see the instruction and have the OS scheduler stop scheduling the panicked processes (and all threads within) to those cores (and I guess there is 3) the RISC-V approach of OS emulating unsupported instructions)
@millosolo Год назад
AVX is broken and in a very bad place under the pressure of special purpose hardware. Even intel is cautious. Bad.
@labloke5020 Год назад
I have watched the whole video and I still have no idea what AVX is. This was a waste of time.
@TechTechPotato Год назад ⁺²
Advanced vector extensions. Literally at the beginning. Helps if you know what a vector is, and an extension.
@charlesdorval394 Год назад ⁺¹
Putting everything under the same CPU flag... will it be 10, 10.1, 10.2 ...? ... *roll eyes* They're just doing another round of the same thing again aren't they. Let's see how many they come up with this time...
@jesuslovesyoujohn314-21 Год назад
John 3:16 For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
Isaiah 53:6 All we like sheep have gone astray; we have turned every one to his own way; and the LORD hath laid on him the iniquity of us all.
@TechTechPotato Год назад ⁺³
Austin 3:16 - Because Stone Cold Said So

Следующие

Автовоспроизведение