Next-Gen CPU Acceleration: AVX For Generative AI

  • Published: 21 Aug 2024
  • The future is AVX10, so says Intel. Recently a document was released showcasing a post-AVX512 world, and to explain why this matters, I've again invited the Chips And Cheese crew onto the channel. Chester and George answer my questions on AVX10 and why it matters!
    Visit www.chipsandche... to learn more!
    Background Thumbnail from Fritzchens Fritz: www.flickr.com...
    -----------------------
    Need POTATO merch? There's a chip for that!
    merch.techtechp...
    more-moore.com : Sign up to the More Than Moore Newsletter
    / techtechpotato : Patreon gets you access to the TTP Discord server!
    Follow Ian on Twitter at / iancutress
    Follow TechTechPotato on Twitter at / techtechpotato
    If you're in the market for something from Amazon, please use the following links. TTP may receive a commission if you purchase anything through these links.
    Amazon USA : geni.us/Amazon...
    Amazon UK : geni.us/Amazon...
    Amazon CAN : geni.us/Amazon...
    Amazon GER : geni.us/Amazon...
    Amazon Other : geni.us/TTPAma...
    Ending music: • An Jone - Night Run Away
    -----------------------
    Welcome to the TechTechPotato (c) Dr. Ian Cutress
    Ramblings about things related to Technology from an analyst for More Than Moore
    #techtechpotato
    ------------
    More Than Moore, as with other research and analyst firms, provides or has provided paid research, analysis, advising, or consulting to many high-tech companies in the industry, which may include advertising on TTP. The companies that fall under this banner include AMD, Armari, Facebook, IBM, Infineon, Intel, Lattice Semi, Linode, MediaTek, NordPass, ProteanTecs, Qualcomm, SiFive, Tenstorrent.

Comments • 154

  • @Wunkolo
    @Wunkolo 11 месяцев назад +73

    I contributed AVX512 acceleration in the CPU backend for emulators like Xenia and Yuzu/Citra/Vita3K (Dynarmic). You _can_ currently use AVX512 features on 128/256-bit registers with AVX512VL. The issue is that AVX512VL is defined as a subset of the 512-bit registers, so it requires a full 512-bit implementation, rather than being defined as orthogonal supersets of 128->256->512.
    There's also some outdated information here. In the article by Travis Downs pictured at 12:52, the downclocking issue hasn't really been a problem since Ice Lake (2019), especially if you only ever touch 128/256-bit registers.
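    For illustration, a minimal C++ sketch of the 256-bit EVEX form described above, assuming a GCC/Clang toolchain with -mavx512f -mavx512vl (the function name is illustrative, not from the comment):

      #include <immintrin.h>

      // Masked 32-bit add on 256-bit (ymm) registers via AVX512F+AVX512VL.
      // Lanes where the mask bit is 0 keep the value from 'a'; lanes where it
      // is 1 receive a+b. Even though only ymm registers are touched, on
      // pre-AVX10 hardware this still requires a CPU with full 512-bit AVX-512.
      __m256i add_where(__m256i a, __m256i b, __mmask8 mask) {
          return _mm256_mask_add_epi32(a, mask, a, b);
      }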

    • @Lauren_C
      @Lauren_C 11 месяцев назад +2

      Out of curiosity, is it still worth developing for AVX-512 at this time, given that Intel appears to have largely dropped it on consumer CPUs?

    • @Wunkolo
      @Wunkolo 11 месяцев назад +17

      @@Lauren_C AMD has picked up the slack for a while now with Zen 4, so it's becoming more and more ubiquitous across both vendors. Intel appears to be bringing some form of it back with this whole AVX10 thing. The least exciting part of AVX512 is the vector width, so there is a lot of value in having AVX512 features on the smaller 128/256-bit registers. This push for AVX10 is kind of proof of that.
      AVX512 work may continue. Both AVX512 and AVX10 use the same EVEX-encoded instructions, which specify the vector length in the instruction itself. AVX512VL is the AVX512 feature that allows operating on 128-bit or 256-bit registers rather than just the 512-bit registers. So AVX512VL and AVX10 instructions can be exactly the same, but may now fault if the hardware does not support certain vector widths. Before, hardware was required to support 512-bit vectors before even supporting 256/128. AVX10 kind of flips that definition around: it starts at 128-bit and extends up to 512-bit.
      So if an AVX10 chip runs my 128-bit AVX512F+AVX512VL code, it will be fine. But if I use 256/512-bit registers, then I have to be more careful and ensure the hardware supports those widths to work on 256/512-bit AVX10, whereas I don't have to check at all in the case of regular AVX512.
      So there's no reason to stop writing AVX512 code, since it's the same as AVX10 code, just with some extra checks and safeguards.
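      A rough sketch of the kind of runtime check described above, using GCC/Clang's __builtin_cpu_supports (the two kernel functions are assumed to exist elsewhere; an AVX10 build would additionally need to query the maximum supported vector length, which is omitted here):

        #include <cstddef>

        void kernel_avx512_512(float* dst, const float* src, std::size_t n); // 512-bit path (assumed)
        void kernel_avx_256(float* dst, const float* src, std::size_t n);    // 256-bit fallback (assumed)

        void dispatch(float* dst, const float* src, std::size_t n) {
            // Classic AVX-512: if AVX512F+VL are reported, 512-bit registers are guaranteed.
            if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl"))
                kernel_avx512_512(dst, src, n);
            else
                kernel_avx_256(dst, src, n); // an AVX10/256-only part would land here
        }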

    • @arditm2178
      @arditm2178 9 месяцев назад

      with all that experience, what's your opinion on Gather and Scatter memory operations? (asking as a noob)

    • @EyefyourGf
      @EyefyourGf 4 месяца назад

      I know this is an old comment, but I just want to say thank you for the contribution and for the in-depth explanation.

  • @neonmidnight6264
    @neonmidnight6264 11 месяцев назад +34

    11:30 The simplest way to address this is to have the compiler fill in the 512b vectors as 256x2, so that when it emits fallback code it is effectively achieving 2x loop unrolling. This is the strategy .NET takes; even though historically it was intended to simplify vector fallback code, it ended up working a bit too well for pairwise (but not masking and not horizontal) operations.
    CoreLib still does bespoke implementations for 512b, 256b and 128b widths, but it's not labour-intensive because AdvSimd and AVX2/SSE4.2 features map to each other fairly cleanly (movemask emulation notwithstanding), allowing for a unified API. Nevertheless, this approach appears to be superior because, for example, the StdLib variant of memchr in Rust is not vectorized, nor is LLVM able to auto-vectorize it, leaving a significant amount of performance on the table for such an important operation.
    Generally speaking, most well-written libraries which utilize SIMD end up partially reimplementing their own cross-platform abstraction on top of intrinsics to avoid significant code duplication. That cross-platform SIMD abstractions are only arriving in C++ and Rust this late (both are still unstable/experimental as of 2023) is really disappointing.
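    A C++/AVX2 analogue of that 256x2 idea, purely as a sketch (the comment is about .NET's CoreLib, not this code): a pairwise operation written for a notional 512-bit width decomposes into two 256-bit halves per iteration, i.e. a 2x-unrolled fallback.

      #include <immintrin.h>
      #include <cstddef>

      void add_arrays(float* dst, const float* a, const float* b, std::size_t n) {
          std::size_t i = 0;
          for (; i + 16 <= n; i += 16) {  // 16 floats = one notional 512-bit vector
              __m256 lo = _mm256_add_ps(_mm256_loadu_ps(a + i),     _mm256_loadu_ps(b + i));
              __m256 hi = _mm256_add_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8));
              _mm256_storeu_ps(dst + i,     lo);
              _mm256_storeu_ps(dst + i + 8, hi);
          }
          for (; i < n; ++i) dst[i] = a[i] + b[i];  // scalar tail
      }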

    • @alex84632
      @alex84632 11 месяцев назад +3

      It sounds like the compiler should break 512 into 4x128, or 2x256 if it can, or 1x512 if it can. So that all versions of AVX10 work.

    • @awesommee333
      @awesommee333 11 месяцев назад +2

      Doesn't really work for stuff like shuffles. Or well, it does, but it's four instructions minimum at that point for a single 512-bit shuffle.

    • @neonmidnight6264
      @neonmidnight6264 11 месяцев назад +1

      @@awesommee333 Yeah, only for pairwise. For horizontal operations you still need to use the exact supported width.

    • @neonmidnight6264
      @neonmidnight6264 11 месяцев назад

      @@alex84632 While true, .NET's register allocator and level of optimization can't compete with GCC/LLVM for the IR shape produced by unrolling 512b vectors into 128x4 - keep in mind that usually you operate on pairs of vectors meaning that's already 8 V128 vectors in flight - this is a lot of register pressure and usually results in stack spilling and the compiler giving up on certain optimizations.
      With that said, I do believe that LLVM can probably do better, but there's tradeoff with choosing 512b width when it comes to regressing small lengths (or code size if you do 512 -> 256 -> 128 -> scalar) if this is general purpose code.

    • @MrVladko0
      @MrVladko0 5 месяцев назад

      C++ has good SIMD libraries, like Agner Fog's vector class library and EVE. The experimental SIMD support in C++/Rust is miles away from them.

  • @wile123456
    @wile123456 11 месяцев назад +7

    *cries in the forgotten TSX instruction set, which gives a boost to the PS3 emulator*

  • @porina_pew
    @porina_pew 11 месяцев назад +17

    I found 2-unit AVX-512 (SKX) to be worth it even with the clock drops. It gave a per-clock uplift in performance of around 80% for my interests, so it was still ahead in throughput. Even 1 unit (RKL) gave a 40% per-clock boost.

    • @Wunkolo
      @Wunkolo 11 месяцев назад +2

      Even back with Skylake-X, a 15% reduction in clock speed for an almost 2x/4x/8x/etc. increase in performance is a _huge_ perf gain.

    • @flashmozzg
      @flashmozzg 11 месяцев назад +4

      The issue was that if you had mixed code it wasn't worth it and the latency was noticeable. I.e. if you had a few AVX512 operations surrounded by mostly AVX2 or lower ops, then all of them would be downclocked (I think the number to switch back from downclocking was around 80-100 cycles).

    • @yuan.pingchen3056
      @yuan.pingchen3056 21 день назад

      @@flashmozzg Just buy an AVX512-capable processor with lower clock multipliers and it's okay, for example the early Alder Lake i5-12400 or i5-12500.

  • @eekpanggang
    @eekpanggang 11 месяцев назад +6

    FINALLY THE GUYS BEHIND CHIPSANDCHEESE! Really love that you brought them with us here, Ian!

  • @woolfel
    @woolfel 11 месяцев назад +21

    It's nice to see AVX get improvements, but a big part of SIMD is the software stack. Without a great SIMD compiler to optimize the execution, better AVX won't necessarily produce the gains. Nvidia's CUDA stack is dominant because the compiler is better than its competitors'. For example, CUDA's default execution is "non-deterministic" to maximize utilization. If you set CUDA to deterministic execution, the throughput takes a hit.

  • @jannegrey593
    @jannegrey593 11 месяцев назад +21

    It would be cool, but it seems Intel almost killed AVX-512 adoption by fusing it off in Alder Lake. Yes, it didn't work on E-cores, but the code could at least be run. And from memory, the de-clocking wasn't anywhere near as severe as it used to be. Writing so many versions of the code is taxing, and a lot of people won't do it - with their limited resources - just to have it run on very few machines. It doesn't look great when AMD's implementation seems better than Intel's, in an Intel-made ISA. So I will wait and see if it really works and whether there is some movement from Intel to do it well. In practice, not just in theory.

    • @octagonPerfectionist
      @octagonPerfectionist 11 месяцев назад +3

      Could it be run, though? Is it able to do a hybrid core layout with AVX-512? I thought it was one or the other.

    • @jannegrey593
      @jannegrey593 11 месяцев назад +3

      @@octagonPerfectionist Bad phrasing on my part (I assume you're talking about the "it didn't work on E-cores, but code could be at least run" part). I meant that the code could be run on P-cores (so on the CPU as a whole package). From memory, the E-cores couldn't run the 512-bit extensions; they didn't have the silicon for that. But early Alder Lake CPUs, assuming you had a motherboard that allowed for it, also allowed AVX-512 to be run. The problem was that Thread Director could allocate it to E-cores, and then it would bug out. Rather than deal with that (to be fair, the first 6 months of Thread Director improvements made a lot of changes for the better, but it was a gigantic operation) and make AVX-512 schedule only on P-cores, the BIOS initially would not allow AVX-512 to be scheduled, or it would force you to switch off the E-cores.
      So depending on the application - yes, it could run AVX-512, but if it wasn't scheduled or written perfectly, it would be sent to the E-cores (and scheduling it all was a big problem - that is why it depended mostly on the application) - and they would have problems. That is why you ended up with "one or the other". And I don't even blame Intel for not prioritizing it - again, the Thread Director improvements in the first 6 months were staggering. But I am annoyed at how the CPU was initially sold and marketed as AVX-512 capable. In theory, on very well written applications - yes. In practice, it usually meant running with only P-cores enabled, which sucked.

  • @andrey7268
    @andrey7268 10 месяцев назад +3

    5:50 AVX-512 is not "limited" to 512-bit vectors; with AVX-512VL (which every CPU that has AVX-512 supports) you can use AVX-512 instructions on 128- and 256-bit vectors. The problem (for Intel) is that AVX-512 *requires* 512-bit vectors to be supported by the CPU, whereas AVX10 makes 512-bit vectors an optional feature. This is mostly Intel solving their own problem of E-cores not supporting 512-bit vectors. AMD showed that 512-bit vector instructions can be implemented on top of 256-bit vector units, so really this is just Intel refusing to do the same.
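    To make the point concrete, with AVX-512VL the same masked EVEX operation already exists at all three widths; AVX10 keeps the 128/256-bit forms and makes the 512-bit one optional. A minimal sketch (assumes AVX512F+AVX512VL; compiled today these are plain AVX-512 intrinsics):

      #include <immintrin.h>

      // One operation, three vector widths, same mask semantics.
      __m128 madd128(__m128 a, __m128 b, __mmask8 k)  { return _mm_mask_add_ps(a, k, a, b); }
      __m256 madd256(__m256 a, __m256 b, __mmask8 k)  { return _mm256_mask_add_ps(a, k, a, b); }
      __m512 madd512(__m512 a, __m512 b, __mmask16 k) { return _mm512_mask_add_ps(a, k, a, b); }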

  • @Steamrick
    @Steamrick 11 месяцев назад +55

    I foresee developers implementing the 256-bit version of AVX10 that runs on every CPU and ignoring the 512-bit variants...

    • @falvyu
      @falvyu 11 месяцев назад +8

      I think that's probably what's going to happen. That being said, with all 'operations' being available in multiple sizes, porting 256 bits AVX10 to 512 bits should be easier than porting SSE/AVX to AVX512.

    • @wile123456
      @wile123456 11 месяцев назад

      Only on servers and HEDT will 512-bit be relevant.

    • @flink1231
      @flink1231 11 месяцев назад +2

      Agree, unless some dev has a very specific server-side application that can use it or a very specific customer need.

    • @Lauren_C
      @Lauren_C 11 месяцев назад +7

      @@wile123456 PlayStation 3 emulation gets a pretty massive performance boost from AVX-512.

    • @MiesvanderLippe
      @MiesvanderLippe 11 месяцев назад +3

      The speed up is only ever relevant to people writing software that will go through the effort. You can also have the compiler do some trickery if you write code that can be optimised this way.

  • @maxthebean8047
    @maxthebean8047 11 месяцев назад +12

    AVX 360, AVX One 🤣

  • @vasudevmenon2496
    @vasudevmenon2496 11 месяцев назад +3

    I started using clamchowder's (Chester's) memory benchmark and it's been a year now. Thanks for making it open source. Thank you, Ian, for the reporting.

  • @MarekKnapek
    @MarekKnapek 11 месяцев назад +8

    @1:40 "First we have AVX, introduced in way way way back in Sandy Bridge." No, first we had Intel MMX and AMD 3DNow!, after that SSE and similar.

  • @twopic5408
    @twopic5408 11 месяцев назад +1

    Had loads of Fun editing this video

  • @bjornlindqvist8305
    @bjornlindqvist8305 11 месяцев назад +10

    Intel appears to have no coherent vision of how they want SIMD to work on their CPUs. Most developers, even those interested in HPC, do not want to have to learn a new instruction set every other year.

  • @Kiyuja
    @Kiyuja 11 месяцев назад +19

    I hope Intel pushes for this even in consumer chips. I was so sad to see they gave up on it after inventing it themselves. AVX-512 can be used by compilers but also by emulators, and might be helpful in virtualization, I don't know exactly. I genuinely would love to see it more widespread so software can take advantage of it; maybe even games can profit in the future. I'd rather see that than AI units, ngl...

  • @LarsDonner
    @LarsDonner 11 месяцев назад +14

    To support different x86 processor generations I already have to write multiple versions of my functions (SSE, AVX, AVX2, AVX-512) and dispatch to the correct version based on the CPUID flags. Now I get to write another two or three versions? Oof.
    Also, the article shown at 12:52 concludes that down-clocking was already not a problem on Ice Lake and Tiger Lake. Weird how that story just doesn't want to die.
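    One way to keep the number of hand-written variants down, sketched here under the assumption of a GCC/Clang toolchain: target_clones emits several builds of the same function and resolves the right one at load time from CPUID. Whether compilers will accept an "avx10.1" clone target is an open question; the targets below are long-established.

      #include <cstddef>

      // The compiler auto-vectorizes each clone for its target and installs
      // an IFUNC resolver that picks the best supported version at startup.
      __attribute__((target_clones("default", "avx2", "avx512f")))
      void scale(float* x, float s, std::size_t n) {
          for (std::size_t i = 0; i < n; ++i)
              x[i] *= s;
      }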

    • @tappy8741
      @tappy8741 Месяц назад +1

      AVX10 is 100% an Intel retcon of AVX512, because all of a sudden they need to be more efficient with silicon to be competitive. That said, the SSE/AVX/AVX2 implementations for a given problem tend to be related, and in the same way AVX10/AVX512 are related. So really there are three main implementations: generic, the AVX2 family, and the AVX512 family.

    • @LarsDonner
      @LarsDonner Месяц назад

      @@tappy8741 My guess would be that the AVX-512 and AVX10 instructions would still have different opcodes, even if they do exactly the same thing. In that case the algorithm may be the same, but I still have to convince my compiler to generate different versions of it.

    • @tappy8741
      @tappy8741 Месяц назад +1

      @@LarsDonner Yeah, it's still a pain, but it's not an unfamiliar path.
      This whole 256-bit AVX10 thing is utter nonsense. All processors that have AVX10.256 will also have AVX2, right? And at this point most programs that want SIMD have implemented AVX2. What's the benefit of porting if an AVX2 path already exists and there's not much to be gained?

    • @LarsDonner
      @LarsDonner Месяц назад

      @@tappy8741 I guess you would still gain the ability to apply masks to every operation, control rounding modes and have a more complete instruction set. But now that the 512-bit-cat is out of the bag (and AMD is going hard on it) I find it hard to believe that there will ever be many AVX10/256-only CPUs.

    • @tappy8741
      @tappy8741 Месяц назад +1

      @@LarsDonner I doubt AMD will make AVX10.256-only parts, unless they do an embedded SKU or a custom part for a console or the Steam Deck. But Intel, I think, wants to do AVX10.256 for most consumer parts and AVX10.512 for server. We might get AVX10.512 on Intel's halo consumer/prosumer parts. At least that's what I assume the plan was when they set this AVX10 crap in motion; they couldn't compete on general compute with the AVX512 anchor tied to their necks and had to ditch it. Things may change now that they've bitten the bullet and are using TSMC much more heavily.
      I'm not denying that AVX10.256 is more rounded - masks are nice, etc. But the AVX2 implementation already exists, will work on anything that has AVX10.256, and even if all future consumer Intel parts are AVX10.256, it'll be a decade before there's any sort of penetration worth a damn.

  • @JATmatic
    @JATmatic 11 месяцев назад +4

    Software should not need to be recompiled/redesigned on the same ISA (Intel x86-64) to take advantage of larger SIMD registers (SSE, AVX, AVX-256, AVX-512...).
    The problem of SIMD ISAs with ever-increasing data widths could be solved by Agner Fog's ForwardCom way of doing SIMD.
    His proposed ForwardCom ISA would allow operating on variable-width SIMD registers.
    It is, however, only an experiment, and only a few real RISC architectures have variable-width SIMD functionality today.

  • @JonMartinYXD
    @JonMartinYXD 10 месяцев назад +1

    All you need to know about AVX-512 can be found under the _CPUs with AVX-512_ section of the AVX-512 Wikipedia page. Just look at that table and think about what instruction subsets a software developer should try to use in their code.

  • @Winnetou17
    @Winnetou17 11 месяцев назад +4

    Oh man, I couldn't NOT remember Linus Torvalds' rant about AVX... IIRC against all AVX, not just AVX512. The downclocking part annoyed the hell out of him, since it wasn't just for the duration of those instructions, it lasted for several milliseconds. I wonder if his stance on this has changed and what he thinks of AVX10.

  • @treelibrarian7618
    @treelibrarian7618 11 месяцев назад +2

    So the comment by clamchowder at the end, about breaking up AVX512 instructions to enable an E-core implementation (as they already have for AVX2), got me thinking about what is required to do this. Effectively the only major hurdle is the instructions that allow multi-lane scope with sub-lane-granularity selection, like vcompress/vexpand/vpermps/vpermi2ps etc., which have to take more input data than a 128-bit port can handle. Everything else can be broken down into 128-bit-wide µops quite easily, with the addition of a shifted k-reg input to determine which part of it should be used.
    Making the big assumption that each vector execution port must be capable of writing both k-regs and vector registers as output, a possible solution could be that the k-regs are used to communicate between µops, describing which parts of the output register have already been processed as the data gets fed to successive µops. But there's obviously something I don't know, since I also can't see any reason why vperm2f128/vinsertf128/vextractf128/vbroadcast etc. wouldn't be handled in the renamer, but they clearly aren't, at least not according to Intel's description of Gracemont AVX implementation latency in the optimization manual...
    I'm sure there's a good reason why not, but it also occurs to me that breaking up 512-bit vectors might even benefit the P-cores, if only having 128-bit registers simplifies the register-to-execution-port crosspoint switch and XU size enough to allow more 128-bit vector execution ports in the same silicon area, giving the same or greater total compute - and allowing smaller vectors to also use all the available compute, even eliminating the whole "dirty AVX registers" issue when using SSE instructions... but whatever.
    There will still be a fairly long adoption period, since most desktop systems won't have the features for quite a while.

  • @Veptis
    @Veptis 11 месяцев назад +8

    Raptor Lake had no AVX-512 but put more cores on the desktop parts. Alder Lake had it unintentionally, but really for the Xeons that used the same cores.
    I am not about to buy a Xeon for my workstation. I still game a lot, so it will be an i9. I want a larger accelerator card with massive memory to run large models locally, and Intel is telling me to use IDC instead. I've said the same thing enough times by now.
    It's interesting to see that there is a future, but the stuff I see in IPEX is AMX, matrix extensions... not just vectors. So surely we get ATX soon - advanced tensor extensions - but that acronym is already used by Intel.

    • @lost4468yt
      @lost4468yt 11 месяцев назад +3

      lol u still usin tensor extensions bro? we usin advanced flexor extensions

  • @milestailprower
    @milestailprower 11 месяцев назад +7

    It's really frustrating that Intel decided to "nuke" AVX-512 partway through Alder Lake's release by fusing it off.
    We know that Alder Lake can do things like run RPCS3 better with AVX-512. They could have done something like: "AVX-512 may exist on P-cores, but we will provide no official support for it. If you do try to work around microcode patches and enable AVX-512 in the BIOS, we provide no guarantee that things will work properly, as we are not validating AVX-512." Instead, they had to go nuclear and fuse it off on all new CPUs.

    • @davidmiller9485
      @davidmiller9485 11 месяцев назад +3

      I don't know why this surprises people. Intel is known for just blocking off things they don't want to deal with rather than finding a work around and that includes anything that might cause customer support tickets to increase. I don't have many complaints about Intel, but that is one of them.

  • @jeffreybraunjr3962
    @jeffreybraunjr3962 11 месяцев назад +2

    I’m glad there are very intelligent people who understand all of this

  • @flink1231
    @flink1231 11 месяцев назад +7

    Intel should just implement 512 on the E-cores like AMD did, splitting it into 2x 256-bit ops. It obviously is not free in terms of space, but it is likely significantly more space-efficient.

  • @divyanshbhutra5071
    @divyanshbhutra5071 11 месяцев назад +7

    I'm VERY excited about the future of X86, with X86S.

  • @sixteenornumber
    @sixteenornumber 11 месяцев назад +10

    I really wish everyone would get on the same page with vector length.

  • @oj0024
    @oj0024 11 месяцев назад +4

    It would be cool if you could cover scalable vector extensions like RVV or SVE more in depth. The Chips and Cheese people covered the P870 and Veyron V1 quite in depth.

  • @SatsJava
    @SatsJava 11 месяцев назад +3

    Intel must listen to clamchowder

  • @samuel5916
    @samuel5916 11 месяцев назад +1

    Not Chester being a whole entire snack 😳

  • @christopherjackson2157
    @christopherjackson2157 11 месяцев назад +5

    Is the heat issue just because avx512 is powering up such a large, contiguous, area of silicon?
    Or am I vastly oversimplifying this in my mental model 😅

    • @jordanmccallum1234
      @jordanmccallum1234 11 месяцев назад +7

      Yes, partly that it's contiguous, but more so that 1. a ton of data is being moved from the register file to this area and 2. a lot of calculations are done in a very short timespan.

    • @kazedcat
      @kazedcat 11 месяцев назад +1

      @@jordanmccallum1234 It's the movement of data. AMD's solution of doing the 512-bit execution over two cycles solves the problem because now data is moved every other cycle instead of every cycle, but the density of calculation is still the same.

    • @jordanmccallum1234
      @jordanmccallum1234 11 месяцев назад +1

      @@kazedcat the density over time has halved?
      Completing an AVX-512 operation in two cycles is double the time. I'm not talking about the density of the execution section, but the calculation as a whole.

    • @kazedcat
      @kazedcat 11 месяцев назад +4

      @@jordanmccallum1234 It is not actually double the time unless the instruction can be executed in one cycle. For example, if the instruction needs 8 cycles to execute, then double pumping would take 9 cycles, because execution is pipelined and you only need to wait 1 more cycle for the other half of the execution to be done. If you are doing a lot of simple instructions then yes, the amount of computation is halved, but if you are doing a lot of complicated instructions then it is not halved.

  • @autarchprinceps
    @autarchprinceps 11 месяцев назад +7

    I love how the main example for vector use is video encoding, a thing that professionally and even non-professionally is 99% done on GPUs - or more specifically their dedicated video encoders, or occasionally special cards with just that - and then lots of silence follows when asked for another example, followed by maybe something HPC, which is again mostly GPUs or much more dedicated vector accelerators or special-purpose CPUs designed primarily around vector units, not off-the-shelf CPUs chosen for their vector extensions.
    Vector extensions are in a really weird place. They are not as good at what they do as GPUs, let alone dedicated hardware, but still not prevalent enough in applications where support for GPUs or other accelerators would be too much effort to add.
    Now some have come up and said, how about AI, but once again that is just plain not true. There are dedicated AI chips and accelerators, and not just in servers but at the edge too, including in most smartphone SoCs, and if that is too specialised, lots of AI also gets done on GPUs. Nobody is going to run major AI tasks on just vector extensions if they have any control over the hardware used. Why should they?

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад +11

      The AI use case is very much true. Most DC inference is still done on CPUs today.

    • @todorkolev7565
      @todorkolev7565 11 месяцев назад +10

      Hi, I used to work for a video teleconferencing platform. Our code, on the server, re-encoded different video streams, and we were highly reliant on our compiler guys to give us an edge, in any way possible, to make video quick and cheap to process.
      My first question was: hey, aren't GPUs better for this stuff?
      Our very social and not at all Asperger's compiler devs abruptly corrected me: GPU processing introduces latency, which a premium teleconferencing platform aims to reduce, and anyway, server farms very rarely have much GPU on board to use, and then it is expensive and heterogeneous (i.e. you get different GPU architectures all the time).
      Last but not least, few jobs are purely parallel. Once you throw in some work that isn't parallelizable, GPUs start losing their charm!

    • @lukemcdo
      @lukemcdo 11 месяцев назад +1

      Tacking onto the video angle, the whole reason GPUs are efficient there is that they have a mostly fixed-function decoder sitting next to the display engine. The minute you're not displaying, there's nothing special about having the fixed-function hardware with the GPU.
      Following up on that, the parts of video encoding and decoding that are in most situations fulfilled by fixed-function hardware are a huge part of the work, but there are often heavily parallel workloads to be applied to the results. Take, for example, Intel noting that the "film noise" AV1 filter was running on GPU shaders on Alder Lake/Raptor Lake. Cool, but not something the above software vendor can rely on being present, and now that is also a source of error and/or latency compared to a CPU core on top of the fixed-function hardware component.

  • @helpmedaddyjesus7099
    @helpmedaddyjesus7099 10 месяцев назад

    I love the chipsandcheese articles

  • @Summanis
    @Summanis 11 месяцев назад +3

    To make sure I understand, AVX10 on Intel hybrid architectures will still be limited to 256bit on all cores? Or will this allow for the core to select which code path it takes?

    • @noergelstein
      @noergelstein 11 месяцев назад +6

      I don't see how that is possible. If you make a check "if supports 512-bit", then the scheduler could move the thread to a core without support between the check and the actual execution of the instruction. If this were possible, it could have been done on Alder Lake.
      What Intel needs to do on the hybrid architectures is support 512-bit on both, but then internally have a fast execution on the P-core and a slower execution on the E-core (such that it would be the same as issuing two 256-bit instructions).

  • @matheuswohl
    @matheuswohl 11 месяцев назад +3

    7:26 "compiler devs" shows html code lol

    • @levygaming3133
      @levygaming3133 9 месяцев назад +1

      I mean there’s only so much stock footage

  • @retroanderson
    @retroanderson 11 месяцев назад +2

    I'd be interested to know why RPCS3 can leverage AVX512 so well.

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад +4

      The weirdness of the Cell vector engine/processor maps to big vector instructions well enough :)

  • @unvergebeneid
    @unvergebeneid 11 месяцев назад +1

    4:43 that's a lot of silicon just to have a slightly faster memcpy...

  • @dannotech2062
    @dannotech2062 4 месяца назад

    AVX programmer here. For your information, AMD's implementation of AVX-512 runs at AVX2 speed, because they take a 512-bit register, split it into 2x 256-bit operations and "double pump" them through the pipeline to get a 512-bit result. So, they take two bites from the cookie to generate one poop. It's nice because it has ISA compatibility, but there is no additional speedup from the vector size. Having analyzed the IPC of my AMD 7800X3D and compared it to my 36-core Intel Sapphire Rapids based CPU, I have come to the conclusion that AMD can execute up to 3x 256-bit operations per clock.
    AMD AVX2 instructions per clock - 3
    AMD AVX-512 instructions per clock - 1.5
    Intel AVX2 instructions per clock - 3
    Intel AVX-512 instructions per clock - 2
    VTune shows me that AVX2 instructions execute on port 0, port 1 and port 5, whereas AVX-512 instructions execute on port 0 and port 5 only.
    So clock for clock, Intel has a 33% advantage over AMD's implementation.
    If AMD ever decides to double their FP execution width with, say, Zen 5 or Zen 6, it's game over for Intel, and its AVX10/360/one/series-S/X specifications will be dead. For as much AVX code as I have written, I will never support AVX10.
    Anyone interested in seeing AVX-512 in action, go to my channel and watch the video "Real-time software rendering with AVX-512".
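    Not the commenter's code, but a sketch of the usual way such per-clock throughput is probed: several independent accumulator chains so the adds are not serialized by one dependency chain, then retired instructions divided by core cycles in VTune or perf (AVX2; the counts and names are illustrative).

      #include <immintrin.h>
      #include <cstddef>

      __m256i throughput_probe(const int* data, std::size_t n) {
          // Three independent accumulators keep multiple vector ALU ports busy.
          __m256i acc0 = _mm256_setzero_si256();
          __m256i acc1 = _mm256_setzero_si256();
          __m256i acc2 = _mm256_setzero_si256();
          for (std::size_t i = 0; i + 24 <= n; i += 24) {
              acc0 = _mm256_add_epi32(acc0, _mm256_loadu_si256((const __m256i*)(data + i)));
              acc1 = _mm256_add_epi32(acc1, _mm256_loadu_si256((const __m256i*)(data + i + 8)));
              acc2 = _mm256_add_epi32(acc2, _mm256_loadu_si256((const __m256i*)(data + i + 16)));
          }
          return _mm256_add_epi32(acc0, _mm256_add_epi32(acc1, acc2));
      }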

    • @tappy8741
      @tappy8741 Месяц назад

      Apparently Zen 5 will double FP execution width, but I'll believe it when I see it.

  • @Cormy1
    @Cormy1 10 месяцев назад +1

    I'm baffled how little coverage ChipsandCheese have gotten among techtubers thus far. If any of you are curious about the difficulties Intel are facing with Arc, and its potential, PLEASE go read their articles on it!
    Spoiler: Driver maturity will NOT save them.
    On another side note, Skylake-X had some instances where the TIM used between the die and IHS was thermal PASTE instead of liquid metal solder, so they had MUCH worse thermal conductivity and therefore suffered more from overheating during high loads like AVX512, which means more throttling and aggressive downclocking.

    • @TechTechPotato
      @TechTechPotato  10 месяцев назад +1

      Part of the issue is that a lot of Techtubers are often a mile wide, but an inch deep, when it comes to the complexities of microarchitecture. Few of them understand C&C's analysis to begin with, then translating it to something their audience understands is difficult. AnandTech had the same issue.

    • @Cormy1
      @Cormy1 10 месяцев назад +1

      @@TechTechPotato So long as there are performance comparisons to be made, the audience can grasp SOMETHING.
      That's the nice part of Chips and Cheese, they benchmark.
      They aren't simply technological exhibitions, but practical metrics.
      Even if you don't grasp the entirety of the article on the A770, you can still see when it scores comparably to GPUs from over a decade ago!
      I used to love reading AnandTech deep dives, but even I gave up on that when I realized I couldn't extrapolate anything real from them because of how many layers of abstraction there are from the underlying architecture to the end result. It's particularly disappointing when some of the "improvements" end up having next to no real-world applications (RDNA's dual-issue shenanigans) without manually tuning everything, which isn't feasible.
      As consumers, we have NO IDEA how the various components are stressed at that level. Doubling registers means absolutely nothing to us. Even the companies don't fully grasp the impacts their changes will make, this can be seen in the shrinking of cache in RDNA3 by comparison to RDNA2. Clearly they've decided they went overboard on that and the benefits don't scale well to such high amounts.
      Meanwhile NVidia's L2 cache investment completely failed to cover for their reduced bus-widths, creating the spectacular failure that is the 4060 TI, but which also extends to the 4070 and 4070 TI, though no one seems to have noticed that yet maybe because they believe the higher VRAM quantities cover for it, but they don't.
      You NEED context to make what you're reading meaningful, and Anandtech articles often didn't provide that when discussing architectures (partly because those deep dives were often made before launch and reviews, and rather just based on announcement slides)
      Analyze the product, not the technology. You can work backwards to the technology after, referencing other products to demonstrate what the technological differences are achieving.
      In addition, Anandtech didn't do micro-benches. They did a lot of application benches, which produce abstract scores that tell you nothing about what is being stressed in the product to those who aren't familiar with those applications.
      By comparison, it's very easy to talk about bandwidth or latency of various components of a device, which is what ChipsandCheese demonstrates.
      You don't need to understand absolutely everything about what's going on under the hood and how it works to understand what aspects of an architecture or device are either exceptional, or sorely lacking.
      Techtubers and audiences can easily see that to gauge something as simple as what areas Battlemage could feasibly greatly improve on, and how wide the gap is in reality, beyond just driver/code maturity.
      It's not that deep when presented in that manner.

  • @billykotsos4642
    @billykotsos4642 11 месяцев назад +1

    Hotchips for sure is super cool !

  • @johnkost2514
    @johnkost2514 6 месяцев назад

    The 8087 has come a long way from when I was a young puppy ..

  • @Stadtpark90
    @Stadtpark90 11 месяцев назад +1

    I just watched a 15-min video without understanding a word.
    The only thing I understood was that there is a chicken-and-egg problem concerning adoption of powerful last-gen tech, that there were unintended consequences on the thermal side, and that the new implementation is trying to make it easier to get the power from the wheels to the street. - Is it like buying a Subaru? Every model has 4WD now?

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад +1

      The big clue was at the beginning - advanced vector extensions. The ability to vectorize hard compute problems, of which ML and HPC are two, and to make that easier to enable in hardware. It was kinda aimed at those familiar with programming optimizations for specific hardware (because not all pieces of hardware do all the things).

  • @TheParadoxy
    @TheParadoxy 11 месяцев назад

    This type of content is why tech tech potato is the best!!

  • @HakonBroderLund
    @HakonBroderLund 11 месяцев назад +2

    Really liked this style of reporting

  • @nmopzzz
    @nmopzzz 10 месяцев назад +1

    What's the difference between AVX and the older SIMD instructions?

  • @pedrophmg
    @pedrophmg 11 месяцев назад +1

    2^10 = 1024
    I've heard somewhere that was the reason, wasn't it?

  • @CaptainScorpio24
    @CaptainScorpio24 11 месяцев назад +1

    Brother, I have an i7-12700 non-K with AVX-512 enabled on an ASUS TUF Z690-Plus WiFi D4. I don't know its use.

  • @jjdizz1l
    @jjdizz1l 11 месяцев назад

    This was very informative and a great show.

  • @edmunns8825
    @edmunns8825 11 месяцев назад +1

    @TechTechPotato Have you had a look at AMX on Sapphire Rapids?

  • @sebtheanimal
    @sebtheanimal 11 месяцев назад

    another thing I never knew I needed as a Linux user.

  • @kwinzman
    @kwinzman 11 месяцев назад +1

    So in summary it's not as flexible as ARM's SVE. And it's not as backwards compatible as AMD'S AVX-512 implementation. Thanks Intel!

  • @Trick-Framed
    @Trick-Framed 11 месяцев назад

    Crap. I missed Hot Chips. They let me watch last year. Good time. Lots of info.

  • @esra_erimez
    @esra_erimez 11 месяцев назад +1

    How much of these advances can be attributed to Pat Gelsinger?

  • @Trick-Framed
    @Trick-Framed 11 месяцев назад

    Def Leppard finna sue over that retread riff 😂

  • @afre3398
    @afre3398 11 месяцев назад +2

    Do Intel and AMD have some cooperation or a joint group regarding new instruction set additions? Or is it more like customers go to Intel/AMD and say "we really need this"?

  • @Gindi4711
    @Gindi4711 11 месяцев назад +1

    If I look at the workloads that are usually run on consumer CPUs, only very few of them would profit from AVX512.
    Gracemont is optimized for performance/area so Intel will not waste die space if they do not see a big benefit.
    If you need 20% extra die space to get a 2x performance increase in less than 5% of applications then this is clearly not worth it.

    • @defeqel6537
      @defeqel6537 10 месяцев назад +1

      Very few consumer workloads are highly thread parallelized too, so apart from mobile power efficiency, the small cores aren't very useful in the first place. When you start crunching more data, you start seeing benefits of instruction level parallelization too

  • @sflxn
    @sflxn 11 месяцев назад +3

    Maybe I missed it but I listened to this video and heard nothing about generative AI.

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад

      Vector extensions are used to accelerate math used in ML and Generative AI

  • @user-me5eb8pk5v
    @user-me5eb8pk5v 10 месяцев назад

    You have given us the world, praise angels, be wise.

  • @dannotech2062
    @dannotech2062 4 месяца назад

    14:55

  • @flickeykrunchofficialYT
    @flickeykrunchofficialYT 11 месяцев назад

    Chester, you're the man.

  • @ObviousCough
    @ObviousCough 11 месяцев назад +4

    I need AVX512 for y-cruncher

    • @semape292
      @semape292 11 месяцев назад

      What's that?

    • @MMGuy
      @MMGuy 11 месяцев назад

      @@semape292 multithreaded benchmark that computes pi (and other constants)

  • @fd5927
    @fd5927 6 месяцев назад

    Chip and Cheese duo .... me 😮, now I'm dating myself ... Remember Itv, "Video 'n Chips" ... bet you don't .... kept my I9 11900k cpu and loving it. Before P and E cores ... AVX 512 is still active ... luckily .... Dreading the day new bios update where intel disables microcode of AVX-512 .......

  • @irwainnornossa4605
    @irwainnornossa4605 11 месяцев назад

    It should be named AVE.

  • @davidgunther8428
    @davidgunther8428 11 месяцев назад

    Nice details!

  • @sloanNYC
    @sloanNYC 11 месяцев назад

    Very interesting perspectives for sure.

  • @m_sedziwoj
    @m_sedziwoj 11 месяцев назад +3

    6:58 So they're adding AVX10 because AVX512 has too many iterations and they want a simple flag, but now he's talking about iterations... someone did not learn from past errors...

    • @tappy8741
      @tappy8741 Месяц назад

      They added AVX10 to retcon AVX512, given that Intel failed to do it efficiently for years, AMD finally supported AVX512 themselves, and Intel needed to trim fat to be able to compete with AMD on generic performance. It was a play to hit the snooze button on SIMD, which may succeed given that the focus from now on is probably ML.

  • @plugplagiate1564
    @plugplagiate1564 11 месяцев назад

    It looks like the developers at Intel desperately want to break up with the mathematical foundation of chip making.
    Meaning the Turing machine, which is fully defined by mathematics, goes down the river.
    I don't think it is a good idea to swap mathematics for pure trial-and-error programming.

  • @tyraelhermosa
    @tyraelhermosa 11 месяцев назад +2

    AVX, AVX 2, AVX 512, AVX 10, and then they're going to AVX 10.1 and 10.2?
    Argghhhh

  • @axiom1650
    @axiom1650 11 месяцев назад

    I didn't like since the counter was on 256.

  • @josiahsuarez
    @josiahsuarez 11 месяцев назад

    I know I shouldn't but mcdonalds hamburgers are good ._.

  • @mathyoooo2
    @mathyoooo2 11 месяцев назад +2

    I really dislike what Intel is doing with avx10. They should have just put a slow version of avx512 on E-cores and be done with it.

  • @user-sx6nd8zl5k
    @user-sx6nd8zl5k 2 месяца назад

    They left aside the AVX512 instructions that are very necessary for AI due to the stupidity of the E-cores; that is why generations 13 and 14 of Intel suck, and the worst will come with generation 15, which will no longer have Hyper-Threading. One step forward and two steps back in chip engineering. x86 is dead; it died along with Moore's law.

  • @milestailprower
    @milestailprower 11 месяцев назад +1

    RPCS3 really benefits from AVX512. I suspect AVX512 will be very important for emulation in the future.

    • @aravindpallippara1577
      @aravindpallippara1577 11 месяцев назад +1

      That's mostly due to the weird Cell processor architecture in the PS3.
      The PS4 and PS5 use x86-64 processors, so emulating them or transcoding the instructions to existing processors should be effortless.

    • @milestailprower
      @milestailprower 11 месяцев назад

      @@aravindpallippara1577 In theory, yeah. You'd probably just have a hypervisor do the heavy lifting. Even outside of RPCS3, other emulators like Yuzu benefit from AVX-512 optimizations. But yeah, it appears that AVX-512 is *really* good particularly for emulating the Cell architecture.
      You may still need to do work on the PS4/PS5 GPU. Sony has their own proprietary APIs for the Radeon GPUs that need to be reinterpreted or recompiled. With low-level APIs like Vulkan and DX12, hopefully the CPU overhead is low.
      Also, the memory architecture is different than on PC. Although maybe that's where resizable BAR can show its benefits - assuming access latency between CPU and VRAM isn't a problem.
      I'm not an expert on this, but I'm sure that AVX10 / AVX512 will still be moderately useful when it comes to emulating Nintendo's next console - even if they are going to be using ARM. It will probably be useful when it comes to ARM SIMD extensions like SVE2.

  • @pilsen8920
    @pilsen8920 10 месяцев назад +1

    Avx10 is going to be better on amd? and it's intel's baby. 😂 they just can't get a win. Lol

  • @maciejkowalski6045
    @maciejkowalski6045 2 месяца назад

    It's funny - these guys explain and I can't understand anything from their explanation. I wonder if they even know what they are talking about.

  • @wmopp9100
    @wmopp9100 11 месяцев назад

    Cool thing; I am pretty sure Intel marketing/product management will kill it.
    If there hadn't been such a big push for virtualization in the early 2000s, they would have killed that one too.
    (Some Intel server CPUs had it, some (higher/more expensive) SKUs didn't. Weird times.)

  • @__aceofspades
    @__aceofspades 11 месяцев назад +10

    AVX10 solves the issue with heterogeneous cores, and it also puts AMD in a very difficult spot where their AVX512 silicon on consumer hardware will soon be obsolete. Obviously AVX512 adoption was very very low for consumer applications, but now developers will have to choose between AVX10 with Intel's 80% market share or AVX512 with AMD's 20% because they wont want to implement both, the choice is pretty obvious. Even before AVX10 was announced, AVX512 for consumers was dying, it had low support to begin with and once Intel stopped using it with Alder Lake, why would developers bother supporting it when now only Zen 4 supports it. AVX512 is more or less dead for consumers, long live AVX10.

    • @Eleganttf2
      @Eleganttf2 11 месяцев назад +1

      make sense

    • @user-yj1ov9cz9g
      @user-yj1ov9cz9g 11 месяцев назад +9

      Zen 4's AVX-512 is in most ways a superset/equivalent of the first versions of AVX10, the Vector Length extension is there too. I hope compilers will be able to compile compatible AVX-512 intrinsics to AVX10 and vice versa, even if they won't convert the vector lengths yet.
      With both consumer platforms supporting some kind of modern SIMD ISA with masking, the developers are more incentivized to write such code, even if they can't stick to an ISA.
      Another question may be the new APX extension, AMD has no alternatives to that but Intel haven't launched (and likely won't any time soon) any CPUs implementing it.

    • @Anton1699
      @Anton1699 11 месяцев назад +11

      AVX10.1 is basically just another CPUID enumeration method for the AVX-512 features found in Sapphire Rapids. So your dispatcher could check whether the CPU supports the required AVX-512 feature sets or AVX10.1/512.

    • @endless2239
      @endless2239 11 месяцев назад +3

      If AVX10.1 is basically the same thing as AVX512, and Intel already said that new Xeon CPUs will keep compatibility with AVX512 (worse, the new 256-bit instructions will come after Granite Rapids), then developers will have to choose between the 100% market share of AVX512 and whatever Granite Rapids' market share will be.

  • @justindressler5992
    @justindressler5992 11 месяцев назад

    AVX is being used for AI, isn't it? I was pretty disappointed after buying the 13900K and finding out they removed AVX-512 at the same time AMD launched their new chip with AVX-512. I was very annoyed, since I was hoping to use AVX-512 for AI. It probably makes sense to limit it to 256, though; I imagine many applications can use that without issues. But it just shows, if AMD can have 16 full cores with AVX-512, how far Intel's nodes are behind.

  • @garrettkajmowicz
    @garrettkajmowicz 11 месяцев назад

    When is Intel going to short-change us on these instructions as well?

  • @crispysilicon
    @crispysilicon 11 месяцев назад +3

    I was here first. 😂

    • @xNenshu
      @xNenshu 11 месяцев назад +6

      What an attention seeker 😂😂❤

    • @shmarvdogg69420
      @shmarvdogg69420 11 месяцев назад +6

      @@xNenshu IKR, how dare someone have a little fun in youtube comments section! 😂😂❤

    • @ProjectPhysX
      @ProjectPhysX 11 месяцев назад +1

      ​@@shmarvdogg69420🖖

  • @shaunlunney7551
    @shaunlunney7551 10 месяцев назад

    AMD figured out how to make AVX512 work across all its cores, so why can't Intel? What a headache. This seems to try to make up for that lack of ability.

  • @Speak_Out_and_Remove_All_Doubt
    @Speak_Out_and_Remove_All_Doubt 11 месяцев назад +4

    If you could ask Intel for one thing, what would it be?
    Drop the E-cores! (Or at the very least change the Thread Director so that AVX-512 is never attempted on the E-cores, so all this won't be an issue.)

    • @__aceofspades
      @__aceofspades 11 месяцев назад +11

      That makes zero sense and will never happen. E-cores benefit every workload that can use the extra cores, while AVX-512 is only used in a few consumer applications. Nearly every consumer would be better off with e-cores than AVX-512 support. Heterogeneous designs are also the future, everyone from Intel to Apple to Qualcomm and AMD will be using some form of heterogeneous designs.

    • @Speak_Out_and_Remove_All_Doubt
      @Speak_Out_and_Remove_All_Doubt 11 месяцев назад +1

      @@__aceofspades Not quite what I meant; I was saying how it would be great if a program that used AVX-512 knew to only execute on the P-cores, then we wouldn't have this issue.
      As for the E-cores, I still have a lot of trouble with them: some programs get booted onto them and then run really slowly. They have their place in the market, but with the latency issues, reduced instruction set, reduced IPC, slower clocks, etc., personally I would rather Intel gave us more P-cores.

    • @falvyu
      @falvyu 11 месяцев назад +1

      @@Speak_Out_and_Remove_All_Doubt I'm pretty sure that Intel has considered that approach. A major problem comes from the OS scheduler: what happens if it migrates an AVX512 task onto an E-core?
      Sure, you'd have an exception and the scheduler might be able to move it back to a P-core. But then any program that ever runs even one AVX512 instruction would forever stay on the P-cores, including those that only called library functions (e.g. memcpy). This could mean having all processes run on the P-cores (=> E-cores would be unused). And I'm pretty sure you'd have other issues related to diverging CPU features.
      I think the other viable option would be AMD's: double-pumped AVX512 (i.e. 'split' 512-bit instructions into 2x 256-bit instructions). However, it would probably still require additional hardware compared to AVX2 (i.e. => bigger E-cores on limited space => fewer E-cores).

    • @defeqel6537
      @defeqel6537 10 месяцев назад +1

      I see two other solutions: 1) just implement the ISA on all cores, the micro-op level instructions may be different and take variable amount of time, or 2) have the small cores panic the first time they see the instruction and have the OS scheduler stop scheduling the panicked processes (and all threads within) to those cores (and I guess there is 3) the RISC-V approach of OS emulating unsupported instructions)

  • @jesuslovesyoujohn314-21
    @jesuslovesyoujohn314-21 11 месяцев назад

    John 3:16 For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
    Isaiah 53:6 All we like sheep have gone astray; we have turned every one to his own way; and the LORD hath laid on him the iniquity of us all.

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад +2

      Austin 3:16 - Because Stone Cold Said So

  • @charlesdorval394
    @charlesdorval394 11 месяцев назад +1

    Putting everything under the same CPU flag... will it be 10, 10.1, 10.2 ...? ... *roll eyes* They're just doing another round of the same thing again aren't they. Let's see how many they come up with this time...

  • @millosolo
    @millosolo 11 месяцев назад

    AVX is broken and in a very bad place under the pressure of special purpose hardware. Even intel is cautious. Bad.

  • @labloke5020
    @labloke5020 11 месяцев назад

    I have watched the whole video and I still have no idea what AVX is. This was a waste of time.

    • @TechTechPotato
      @TechTechPotato  11 месяцев назад +2

      Advanced vector extensions. Literally at the beginning. Helps if you know what a vector is, and an extension.

  • @KabelkowyJoe
    @KabelkowyJoe 10 месяцев назад

    4:00 What? An Intel employee doesn't know why it's called AVX10? AVX is an IDIOTIC idea; it's like Intel Itanium coming back from the grave, a VLIW embedded once again into silicon: a CPU running instruction(s) processing one 64-byte cache line at once. Even 128-bit was enough (a 4x 32-bit vector); 256 is understandable, but 512 is not. This is AVX512 from the "silicon point of view". It's no different than Itanium. A new version would be like inventing a bunch of new combinations of grouping data and running various combinations on groups of 512, 256, 128, 64, 32, 16, 8 bits, making it only worse. Why even burn an entire compiler into silicon? I wish Intel would wipe this out for good just like they did Itanium!
    EDIT 512: SORRY im THINKING OUT LOUD
    Editing milion times simple comment.. im not sure if this makes sense [some things do not]
    Im affraid they are making decoder as complicated.
    As this comment :>
    Im not English native speaker.
    Before you read - you have to imagine what i got in mind, 64th of 8 bit ALU with carry bits, like human beings with hands, and imagine you join these together side by side to create various structures. You also have 128bits FPU, 256bit, you also must allow 64 bit operations etc. Must provide carry bit for each "chunk" of data, AVX allow to execute various instructions, and pack data differently and that is main problem here! Depending on how you packed data 8,16,32,64,128bit numbers all these bits and pieces connect differently. Add 16x 32 bit numbers, or 8x64 bit numbers or AND two register together etc. For example to make 512bit wide ALU you put "people" in row. Could be also impemented as 2x256 SSE, so carry bits connect. But you can't make such structure fast, each create delay, so best is to make pipeline, and run one after another. If you make 512 bit wide CPU but that consumes more power, or 64x 8bit chunks you will have 64 instructions delay but less power consumer, result in 1 cycle / each if your pipeline is executing stable fassion. All depend on your implementation. Wrote in next paragraph GPU examle of 8 differently configured groups of SPus embadded in structure, to give an idea of how much transistors is potentially wasted, or how would be possible to mimic AVX on GPU. SPu in GPUs are grouped. And each group can execute one type of instruction on group.
    Each group of SPu configured differently statically. Would allow you dispatch instructions and make virtual pipeline. This would allow you to have result in one cycle max 8 steps. No matter how you pack data in your GPU emulated AVX. You throw data into predefined basket. Im not sure if this makes sense, or how do they make it. Im not CPU designer i can only imagine me building this. Having 8 different combinations and AVX allow to pack data differently as 8,16,32,64,128 and 256 wide numbers. Having such structure and execute possibly in one cycle is nightmare. AMD is using FPUs but their AVX was slower, obviously Intel was wasting lot of transistor to make wider structure and run faster possibly in once cycle. 512 bits flipping at once is not easy task. Pipeline of smaller chunks makes it easier, and waste less energy but creates delay. And normally size of ALU, or register file is predefined, it's nightmare if each instruction pack data differently. It's challenge im 100% certain. Just imagine 8 different types of CPU 8bit wide, 64 bit wide, 512 bit wide packed into one silicone. This is more or less idea behind AVX. Programmer is happy Intel is wasting silicone.
    Normally i remove comments like this...
    You cant go further away from RISC. AVX could be always implemented either as 8 bit CPU with 64 instruction units or 64 step pipeline being able to process chunk of 512 bit memory each chunk differently or all at once, a 16 bit CPU with 32 step pipeline or 32 execution units also being able to process chunk of 512 bit at once, or 128 bit CPU with 4 step pipeline so on so forth. But more you add instructions to it, make it more and more sophisticated structure. It gets more and more complex anyway. No different than VILW 512 bit CPU. With very sophisticated decoder.
    and CUT.
    I wish they FAIL, wish ever since AVX512 was invented. Get rid of this Krzanich shit please! Just as AMD get rid of 3DNow! Implement for sake of backward compatbility as slower 2x256bit but dont make decoder even more complex. I can imagine more FPU units in CPUs, 256 bit x16 pipelined or even equivalent to AVX 4096 if pipelined. But i cant imagine complex 512 "static" structure" with dozen dozen of new "statically defined" instructions allowing you to pack data as you want. It's not accident why we have NPU and GPU with so many limitations. It's also wasted silicone if SPus are not used at moment but made simpler forces programer to think how to process data. Not this AVX blob of wires. Allowing to do everything in every possible way..