AVX512 (3 of 3): Deep Dive into AVX512 Mechanisms

  • Published: 24 Jan 2025

Comments •

  • @nayjames123
    @nayjames123 4 years ago +27

    How come you were using vmovupd instead of vmovups in the rounding example to load the single-precision floats? Is there any difference between the two, e.g. does vmovups require 4-byte alignment? Or do the vmovu* instructions remove all alignment requirements and expand to the same micro-ops?

    • @WhatsACreel
      @WhatsACreel 4 years ago +30

      Oh well spied!! It is a mistake. On some CPUs I understand there is a penalty for switching data types like that, so it should definitely be VMOVUPS! Pinned mate, thanks for pointing this out :)

  • @TomStorey96
    @TomStorey96 4 years ago +18

    Watching a video like this makes me understand how CPUs keep gaining millions upon millions of transistors. The muxing, control lines, registers and logic in general to implement all of these instructions, things like broadcasting etc would just keep piling on the transistors..!
    And the detail in that koala drawing ... 🤯

  • @_lapys
    @_lapys 4 years ago +14

    Oh, wasn't expecting the art montage at the end. Appreciate it all the same with the series 🤭

  • @NeilRoy
    @NeilRoy 4 years ago +9

    Interesting stuff. Those masks are quite fascinating. Also love your dad's artwork. Very talented.

    • @WhatsACreel
      @WhatsACreel 4 years ago +3

      Cheers brus! You're a legend Neil :)

  • @stijnkuipers4251
    @stijnkuipers4251 4 years ago +4

    Your dad is an absolute legend indeed!

  • @danielnitzan8582
    @danielnitzan8582 4 years ago +6

    Your channel is a gem ❤

  • @PunmasterSTP
    @PunmasterSTP 7 months ago +1

    Oh man, I can't wait to see some AVX1024 registers 😆

  • @willofirony
    @willofirony 4 years ago +2

    We ain't in Kansas anymore, Toto. Loved this trilogy, thank you. The Kmasks, the compressed displacement, broadcasting, the register files, all of it is exciting. I have experimented with SIMD since you first introduced us to SSE. I suspect that the power of these instructions will only really be experienced after a paradigm shift in the way we structure data. The classic vision of data in records (structs, classes, etc.) has served us well with the classic architectures. These revolved around pointers and pointer arithmetic (shock! horror! bare naked pointers are at the foundation of it ALL). The new architecture is less friendly to the mixing of numerical, textual and bit-field data. It thrives on sequential lists of data all being the same type. So, data currently stored thus: Name: Michael, Age: 69, Salary: beyond your wildest dreams; Name: Creel, Age: ... etc. will need to be stored as Name: Michael, Creel; Age: 69... etc. I perhaps need a lot more examples of data for clarity, but the idea is that each numerical field can be accessed as one long array. Why? When one isn't number crunching enterprise amounts of data, the overhead to gather the numerical data from classic records can erase the advantage of these powerful instruction sets. It is not an obstacle, just a different view of your data.
    I really like your Dad's pictures. I can see why you are so proud of him. Stay healthy.

    • @WhatsACreel
      @WhatsACreel 4 years ago +4

      I’m with you mate! SIMD is really exciting stuff, but it does leave many languages in the dust. There’s just too much flexibility to express with most modern languages.
      I think you are alluding to a topic called “SOA vs AOS”. Storing data as an array of structures, versus a structure of arrays. SIMD is very good at SOA, but computer languages usually use AOS. We want all the names together in an array, all the ages in another array, etc. Then we can manipulate or search 16 ages at once in SIMD, and we don’t have to gather them from all over RAM :)
      That’s a perfect example of one of the ways modern languages are not designed to take advantage of this stuff! I did a video a long time ago on that topic, but I don’t think I covered it well. Maybe we could revisit it?
      I reckon you’re spot on Michael! Thanks for the kind words mate, stay healthy too :)
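
The AOS versus SOA layouts discussed in this exchange can be sketched in C roughly as follows. This is only an illustration: the struct names, field sizes and the `count_older_than_*` helpers are hypothetical and do not come from the video. Both functions compute the same count; the difference is that the SoA version walks one contiguous array of ages, which is the access pattern a 512-bit load of 16 ages wants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array of structures (AoS): each person's fields sit next to each other. */
struct PersonAoS {
    char     name[32];
    uint32_t age;
    double   salary;
};

/* Structure of arrays (SoA): each field lives in its own contiguous array. */
struct PeopleSoA {
    const uint32_t *ages;
    const double   *salaries;
    size_t          count;
};

/* AoS: consecutive ages are sizeof(struct PersonAoS) bytes apart,
   so a vectorised version would have to gather them. */
size_t count_older_than_aos(const struct PersonAoS *people, size_t n, uint32_t limit) {
    assert(people != NULL || n == 0);
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        hits += (people[i].age > limit);
    return hits;
}

/* SoA: ages are contiguous, so 16 of them fit in one 512-bit load. */
size_t count_older_than_soa(const struct PeopleSoA *people, uint32_t limit) {
    assert(people != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < people->count; ++i)
        hits += (people->ages[i] > limit);
    return hits;
}
```

Whether a compiler actually vectorises the SoA loop depends on the compiler and its flags; the sketch only shows the data layout.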

  • @szaman6204
    @szaman6204 2 years ago +1

    You are a master, sir.

  • @OpenGL4ever
    @OpenGL4ever 1 year ago

    @Creel
    20:24
    Did you notice that it rounded myFloats[0] = 1.5 to 2, but myFloats[4] = 0.5 to 0?
    I would consider that strange. If a value is x.5 I would always expect it to be rounded upwards.
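
The 1.5 to 2 and 0.5 to 0 results are what IEEE 754 round-to-nearest with ties going to the even neighbour produces, which is the default "round nearest" behaviour on x86. Assuming the video's conversion used that mode, the scalar C sketch below models the rule explicitly. The function name `round_half_even` is made up for illustration, and this is a model of the rounding rule, not the code from the video.

```c
#include <assert.h>

/* Round to the nearest integer; an exact .5 tie goes to the even neighbour. */
long round_half_even(float x) {
    assert(x > -1.0e9f && x < 1.0e9f);      /* keep the cast to long well defined */
    long  whole = (long)x;                  /* truncates toward zero */
    float frac  = x - (float)whole;         /* same sign as x, magnitude below 1 */
    float mag   = frac < 0.0f ? -frac : frac;
    long  step  = x < 0.0f ? -1 : 1;        /* direction away from zero */

    if (mag > 0.5f) return whole + step;    /* closer to the next integer */
    if (mag < 0.5f) return whole;           /* closer to the truncated value */
    return (whole % 2 == 0) ? whole : whole + step;  /* tie: pick the even one */
}
```

With this rule 1.5 and 2.5 both round to 2, while 0.5 rounds to 0 and 3.5 rounds to 4. Ties alternate direction so that rounding errors do not accumulate in one direction over many values.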

  • @imrank340
    @imrank340 4 years ago +2

    Really very good picture of the Kwala, or Quwala. By the way, great tutorials for the Intel CPU AVX512 series, all three.

  • @robertzavala7064
    @robertzavala7064 2 years ago

    Awesome, I received 11 points! The compressed displacement explanation and example was brilliant. And thank you for sharing your Dad's artwork.

  • @oresteszoupanos
    @oresteszoupanos 4 years ago +3

    Great intro to this instruction set! I'm a database guy, so not quite sure when I'll ever write my first assembly code, but your teaching style is so good that I can't help watching!

  • @KristianDjukic
    @KristianDjukic 1 year ago

    Great video, great lesson about AVX512 mechanisms

  • @fdc4810
    @fdc4810 4 years ago +5

    Really cool and awesome videos talking about some of AVX512! Can you explore more of AVX512 in the future, especially the FMA instructions? CUDA with tensor cores boosts GEMM throughput by such a big degree that the Ampere A100 now basically has more silicon for tensor cores than for the common CUDA cores. The GA100 actually has far fewer FP32 CUDA cores than the GA102 gaming/content-creation lineup (like the RTX A6000 or 3090). It would be interesting to see how an AVX512 FMA implementation of BLAS boosts the speed/throughput in comparison to AVX2 and no AVX at all.

    • @WhatsACreel
      @WhatsACreel 4 years ago +5

      I would love to! Trouble is getting hardware :( I have a 1050ti, that's about it at the moment. Certainly a fine card, but not exactly state of the art. Some more AVX512 vids would be fun!! The fmads are awesome!! Thank you for the suggestions mate, and thanks for watching :)

    • @NUCLEARARMAMENT
      @NUCLEARARMAMENT 4 years ago

      The GA100 has 6912 CUDA cores, with over 13,000 FP32 units. The GA102-derived RTX 3090 and RTX 3080 have around 4,000-5,000 FP32 CUDA cores, with around 8,000-10,000 FP32 FPUs. Most of the die area on Ampere GPUs is still reserved by the shader/CUDA cores. The GA100 lacks RT cores, those are for GA102 and smaller dies only.

  • @matsedv
    @matsedv 3 years ago +2

    Great video - it's interesting as a developer to learn some asm/intrinsics.

  • @piotrlenarczyk5803
    @piotrlenarczyk5803 3 years ago +1

    Thank you for video.

  • @ChrisM541
    @ChrisM541 2 years ago

    Really interesting video, big thanks for this.
    I'm sure I'm missing something, but in that AVXFoundationDetection code, after the cpuid instruction I see you test bit 16 of ebx by first shifting bit 16 into bit 0 (shr ebx, 16) and then testing bit 0 (the old bit 16) with 'and ebx, 1'. Could you also test bit 16 directly with that 'and', thus bypassing the need for a shift? All bits would be zeroed apart from bit 16, so the appropriate status flags (e.g. Z) would still only reflect the status of bit 16, so the true/false return is maintained?
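
The two ways of testing bit 16 can be compared in a small C model. The helper names below are hypothetical; `ebx` simply stands for the value CPUID leaf 7 (subleaf 0) leaves in EBX, where bit 16 reports AVX512 Foundation. Both tests agree on zero versus non-zero, so the zero flag would match. The one difference is the value left behind: the shift version yields exactly 0 or 1, while the mask-only version yields 0 or 0x10000, which matters if the caller reads only the low byte of the result (an assumption about how the return value is consumed, not something confirmed by the video).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AVX512F_BIT 16u   /* CPUID.(EAX=7,ECX=0):EBX bit 16 */

/* Shift bit 16 down to bit 0, then mask: result is exactly 0 or 1. */
uint32_t test_by_shift(uint32_t ebx) {
    return (ebx >> AVX512F_BIT) & 1u;
}

/* Mask bit 16 in place: result is 0 or 0x10000. */
uint32_t test_by_mask(uint32_t ebx) {
    return ebx & (1u << AVX512F_BIT);
}

/* Both tests agree once the result is compared against zero. */
bool has_avx512f(uint32_t ebx) {
    bool by_shift = test_by_shift(ebx) != 0;
    bool by_mask  = test_by_mask(ebx) != 0;
    assert(by_shift == by_mask);
    return by_mask;
}
```

Truncating the mask-only result to 8 bits, as in `(uint8_t)test_by_mask(0x10000u)`, gives 0, so a routine that returns the raw register as a byte-sized boolean would still need the shift or an explicit compare against zero.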

  • @crown8838
    @crown8838 3 years ago

    Could you explain why the assembler code of the AVX512 function is like that, and why we use those registers?

  • @kadiyamsrikar9565
    @kadiyamsrikar9565 4 years ago +2

    Hello mate. Will u make some videos on OpenCL and GPU programming? Should be a nice, interesting addition to ur high performance software computing guide.

  • @lukehanscom482
    @lukehanscom482 4 years ago +3

    Avx2 also has automatic broadcasting cool instruction

  • @mikkoyliharsila3240
    @mikkoyliharsila3240 3 years ago

    Thanks! I was capabale to understand most of the stuff.

  • @reirei_tk
    @reirei_tk 4 years ago +2

    Great one, dude!

  • @duskota2234
    @duskota2234 2 years ago

    loved it, thank you so much!

  • @lordadamson
    @lordadamson 2 years ago

    man I love your videos so much :'D

  • @diegonayalazo
    @diegonayalazo 3 years ago +1

    Thanks

  • @DarshanSenTheComposer
    @DarshanSenTheComposer 4 years ago +3

    Very well taught. Cheers! :)

  • @tom_forsyth
    @tom_forsyth 1 year ago

    8:21 - The only thing special about k0 is that you can't use it in a lot of instructions. The *encoding* "000" is used to mean "no mask". It doesn't *read* k0 when you do that, it just is hardwired to "no mask". That's why changing the contents of k0 doesn't change that behaviour - it never actually reads the register. But because the encoding is reserved, it also means you can't use k0 for most instructions. It's a perfectly good normal register, it's just the *encoding* for most vector instructions is reserved, so you can only really use it in the other mask-register instructions - as a temporary or things like that.
    Annoyingly, some assemblers allow you to use the "{k0}" syntax, which is technically illegal. Because again - the instruction doesn't read k0! They should produce an error, but they don't.
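
The point about the "000" encoding can be made concrete with a toy scalar model of a masked 16-lane add. Everything here is hypothetical scaffolding for illustration: `kregs` stands in for k0..k7, `mask_field` for the 3-bit mask field of the instruction encoding, and `masked_add_epi32` is not a real intrinsic. The model uses merge-masking, so disabled lanes keep their old destination value.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

/* Toy model: dst = a + b on the lanes selected by the 3-bit mask field. */
void masked_add_epi32(int32_t dst[LANES],
                      const int32_t a[LANES], const int32_t b[LANES],
                      const uint16_t kregs[8], unsigned mask_field) {
    assert(mask_field < 8);
    /* Field 000 means "no mask": all lanes are enabled and k0 is never read. */
    uint16_t mask = (mask_field == 0) ? 0xFFFFu : kregs[mask_field];

    for (int lane = 0; lane < LANES; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
        /* merge-masking: a disabled lane leaves dst[lane] untouched */
    }
}
```

In this model, setting `kregs[0]` to zero has no effect on a call with `mask_field == 0`, which mirrors the behaviour described in the comment: the contents of k0 are irrelevant because that encoding never consults the register.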

  • @TKGMoMoSheep
    @TKGMoMoSheep 4 years ago +2

    Great video!! May I know which CPU are you using?

    • @WhatsACreel
      @WhatsACreel 4 years ago +4

      It's an i5 1035g1 I believe. Cheers for watching :)

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      @WhatsACreel For those who want to know the generation, it's generation 10 and an Ice Lake.

  • @rodrigotobar3606
    @rodrigotobar3606 4 years ago +6

    I noticed that when using "round nearest" 1.5 rounds up to 2, but 0.5 rounds down to 0. I think both values can be represented exactly with float4 though, so I found this a bit surprising.

  • @gideonmaxmerling204
    @gideonmaxmerling204 4 years ago +1

    I have a question: why use lea when you can use mov with the pointer operator? I don't really remember the syntax, but you know what I mean.
    Also, in the instruction vcvtps2dq, I understand everything except dq. I understand that it is a 4-byte int, but what is dq?

    • @meneldal
      @meneldal 4 years ago +3

      Lea is typically preferred by most compilers, I believe it is because it doesn't update flags so it avoids creating conflicts in instructions when the cpu wants to do some reordering of instructions.

    • @WhatsACreel
      @WhatsACreel 4 years ago

      I can't remember what the code was doing? LEA makes a pointer, MOV just moves the data. It's possible you can use the data directly in a SIMD instruction, it's usually only the final operand that can be memory, but check the manual if you reckon there might be a faster way than my instructions. It's certainly possible I just did something stupid :)
      As for the DQ, I have to admit, I have no idea!! Haha, as far as I know, it means Double Quadword, so it refers to the 128 bits of an SSE register. I'm not sure what that's got to do with integers though?

    • @gideonmaxmerling204
      @gideonmaxmerling204 4 years ago +1

      @WhatsACreel what you did was:
      lea rax, myDouble
      but in fact, you could have just done:
      mov rax, offset myDouble
      to just load the address of myDouble, since myDouble is just an alias for [*some number*]

    • @WhatsACreel
      @WhatsACreel 4 years ago +1

      @gideonmaxmerling204 Oh that's great!! I've never seen that syntax! I've always just used LEA for addresses and MOV for data. Cheers for sharing mate, that's cool :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 4 years ago

      @WhatsACreel thinking about it, you could also have not done the mov instruction and just "bcst [ OFFSET myDouble]".
      But considering that myDouble is an alias for [*number*], I think you should try doing "bcst myDouble".
      I would test it myself but my cpu only has avx2

  • @overcritical304
    @overcritical304 4 years ago

    Hey Creel, a question: How can I create a txt file and read and write to it using assembly?

  • @arktvrvs
    @arktvrvs 4 years ago +1

    Question: why is SIMD still developed? Aren't they obsoleted by GPUs, or is there something they can do that GPUs can't?

    • @AAA-de6gt
      @AAA-de6gt 4 years ago +6

      It takes a very long time to send data to the GPU, tell the GPU what to do with it (and only 1 thing for all the data), and get the data back. It is only worth the overhead for very large parallel data sets, that don't need to communicate with the CPU too much. CPU SIMD doesn't require the overhead of using the GPU and is also more flexible.

    • @WhatsACreel
      @WhatsACreel 4 years ago +4

      Great point AAA!
      I'd also say that GPU's really are SIMD. The warps in CUDA programming show us that a GPU is really just a 32 way SIMD device, it's not very different from a CPU at all. GPU's are becoming more and more like CPU's while CPU's are becoming more and more GPU like. Maybe they'll meet in the middle at some point?
      Just my two cents, cheers for watching mates :)

    • @reirei_tk
      @reirei_tk 4 years ago +2

      In my experience, debugging/troubleshooting CPU code is way easier than GPU code. Part of this, I believe, is that no mainstream language has GPU code as a first class language feature; you always have to install the vendor's SDK or a third party library. For an example, C# got official ARM support before it got official GPGPU (which is none).

    • @JackMott
      @JackMott 4 years ago

      as gpu capabilities grow we might soon ask: why are cpus developed?

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      Another reason is that high-level compilers that generate normal x86 code have no idea about the GPU. The GPU must therefore always be specially programmed and accessed via APIs. With SIMD units of the CPU, all you have to do is tell the compiler to use SIMD and then it will optimize your code for it as much as possible without you having to rewrite the code. However, it makes sense to design the code in such a way that it can be easily optimized for SIMD units.
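
As a rough illustration of code that is "designed to be easily optimized for SIMD units", the loop below is the kind of thing an optimising compiler may auto-vectorise. This is a sketch under assumptions: the function name `saxpy` is just the conventional name for this operation, and whether any given compiler actually emits AVX2 or AVX512 instructions for it depends on the compiler version and the optimisation and target flags used.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for every element.
   `restrict` promises the compiler that x and y do not overlap,
   which removes one common obstacle to vectorising the loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop has no branches, no cross-iteration dependency and walks both arrays contiguously, which are the properties that make the compiler's job easy.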

  • @lohphat
    @lohphat 4 years ago

    Is there an option with VS to build .exe files which then load the code blocks upon execution depending on the architecture? e.g. You can have procs optimized for AVX512 which run instead if the CPU can handle it and an alternate proc module if not?

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      You could use the preprocessor for this. It's available in C and C++.
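
Note that the preprocessor selects code at compile time, whereas the question is about choosing at run time. One common pattern for that is a function pointer chosen once after a CPU feature check. The sketch below assumes GCC or Clang on x86, where the `__builtin_cpu_supports` built-in is available; the names `sum_scalar`, `sum_avx512` and `select_sum` are hypothetical, and the AVX512 variant is only a placeholder that calls the scalar code. With Visual Studio the feature check would instead go through the `__cpuidex` intrinsic from `<intrin.h>`.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Portable fallback that runs on any CPU. */
static float sum_scalar(const float *v, size_t n) {
    assert(v != NULL || n == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Placeholder: a real build would put hand-written AVX512 code here. */
static float sum_avx512(const float *v, size_t n) {
    return sum_scalar(v, n);
}

/* Returns non-zero when the running CPU reports AVX512 Foundation. */
static int cpu_has_avx512f(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f");
#else
    return 0;   /* unknown compiler or architecture: take the safe path */
#endif
}

/* Pick the implementation once; callers then use the returned pointer. */
sum_fn select_sum(void) {
    return cpu_has_avx512f() ? sum_avx512 : sum_scalar;
}
```

A caller would do something like `sum_fn sum = select_sum();` at start-up and then call `sum(data, count)` everywhere, so the feature check is paid only once.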

  • @AllElectronicsChannel
    @AllElectronicsChannel 4 years ago +1

    Wooooow!!

  • @tom_forsyth
    @tom_forsyth 1 year ago

    17:28 - you're welcome!

  • @reirei_tk
    @reirei_tk 4 years ago +3

    Would you ever do a performance comparison between AVX-512 and AVX2? AVX-512 is notorious for downclocking due to the heat generated. I believe it only happens when using the 512-bit registers, so AVX-512 instructions on 256-bit registers don't have that problem? Not sure, but it'd be great to see an investigation (both the instructions and the extra registers: if we can still use the full 32 256-bit registers at full speed, AVX-512 would still be worth it IMO).

    • @WhatsACreel
      @WhatsACreel 4 years ago +1

      Great topic! I'm not sure if we'll cover it. AVX512 runs at half the speed for floating point, that's even before the downclocking. I didn't check all the instructions, and the broadcasting and masks still add a lot of flexibility. But, certainly at the moment, it seems like AVX512 is best used for integer operations. Cheers for the suggestion :)

    • @CSDT0
      @CSDT0 4 years ago

      @WhatsACreel It really depends on the uarch. I don't know much about Ice Lake, but I do know about Skylake Server (SKX), which has had AVX512 for about 3 years now.
      On this uarch, you can execute 2 FP instructions per cycle, which is the same as for integer ones. So downclocking aside, AVX512 is twice as fast as AVX2 on this uarch.
      Maybe you could do a little micro-benchmark on your machine to see?

  • @paxdriver
    @paxdriver 1 year ago

    Can Rust run different instruction sets too? I can't really find much about its compiler options other than people praising the errors it logs lol
    I've never heard of a low level mask like kmask before!! I'm a noob so that's not surprising, but the idea of bit level masks really tickled my brain 😊

  • @AlFasGD
    @AlFasGD 3 years ago

    This CPU is capabale of ZVX521 Fondation instruction set!

  • @ZedaZ80
    @ZedaZ80 3 years ago +1

    The year is 2051. Intel has added a new instruction: `DOOM`.
    (it runs Doom :P)
    *EDIT:* wow, your dad's art is so cool :0

  • @derzweistein8973
    @derzweistein8973 4 years ago

    Can you make Videos that explain how, for example Loop Streaming works, and how one can abuse it to get Very fast Loops ?

  • @another212shadow
    @another212shadow 9 months ago

    okay series, but great fucking drawing. Your dad totally stole the show. The detail is incredible.

  • @sagivalia5041
    @sagivalia5041 3 years ago

    I find it funny that in almost every other language, it's like: you don't have it? figure out on the internet how to make it work, it will eventually work.
    On assembly: you don't have it? get a CPU which has it

  • @matthewoliver2645
    @matthewoliver2645 4 years ago

    Can you finish your Direct2D series?

  • @Adamchevy
    @Adamchevy 1 year ago

    I really have no idea how to code with AVX512. The last coding class I took in college was in 2010, which was advanced C++. I am extremely interested in getting back into programming, beginning with emulators.

  • @Henrix1998
    @Henrix1998 4 years ago +2

    I'm not sure what to do with all this knowledge

  • @e8root
    @e8root 3 years ago

    Cool stuff. Too bad my 9900k cannot do any of that XD

  • @liquidsoap5850
    @liquidsoap5850 3 years ago +1

    Coming from Intel 8051: Gee, I can subtract 1 from all the big azz registers. Future is now!

  • @nabollo
    @nabollo 4 years ago +1

    You should sell prints of that koala drawing.

  • @Guztav1337
    @Guztav1337 4 years ago

    Your camera can be cut a little bit smaller. But it wasn't in the way for the content, so it was good anyway.

  • @lordadamson
    @lordadamson 2 years ago

    by the way the order of your playlist is reversed. so it goes from part 3 down to 1.

  • @Illya9999
    @Illya9999 4 years ago +1

    Man AVX512 looks so useful. Sad that I won't be able to use it cause almost all the code I write is for an arm5te processor

    • @Dave_thenerd
      @Dave_thenerd 3 years ago

      ARM has NEON though! :P

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      You can emulate AVX512 in bochs and run bochs on your ARM processor. It will be very slow, but for learning it should be good enough.

    • @Illya9999
      @Illya9999 1 year ago

      @OpenGL4ever it's a console

    • @OpenGL4ever
      @OpenGL4ever 1 year ago

      @Illya9999 Well then you lost. As far as I know, on ARM a SIMD unit is only available from a later generation, ARMv6.

    • @Illya9999
      @Illya9999 1 year ago

      @OpenGL4ever yep

  • @lukehanscom482
    @lukehanscom482 4 years ago

    AMD will eventually have to add AVX512 if it becomes popular with software companies, or else AMD will be screwed

    • @mathyoooo2
      @mathyoooo2 2 years ago

      Zen4 will have it afaik

  • @markmanning2921
    @markmanning2921 3 years ago +1

    xor eax, eax
    shr ebx, 17
    adc eax, 0
    just to be more confuzering!

  • @alan2here
    @alan2here 4 years ago +2

    int83886080[3] moon_pos; // 10MB, accurate for 3 million iterations
    int2[4] quadrants = {0, 1, 2, 3};
    unfloat1024 n; // unsigned, float, unit range (0 to 1)