AVX512 (2 of 3): Programming AVX512 in 3 Different Ways

Creel

19K views

Published: 28 Sep 2024

Comments • 74

@punishedsnake492 4 years ago ⁺⁴⁴
This is pure gold. Not much info could be found about using AVX512, so I'm very grateful for this series of videos.
@WhatsACreel 4 years ago ⁺¹²
Glad you liked it, Snake! Cheers for watching :)
@MrGooglevideoviewer 2 years ago ⁺¹
bit late for this comment... but I just wanted to say you are a bloody legend. Thanks for going to all the effort of making these. Cheers! 👍
@HuntingKingYT 1 year ago ⁺²
GCC Auto-Vectorization: I’m gonna end this man’s whole career
@SystemsDevel 2 years ago ⁺³
Thanks for the amazing x86-64 Assembly videos! Learned so much! What is this beautiful syntax highlighting you have in VS? Please tell me :)
@PunmasterSTP 3 months ago
VCL? More like "Very cool, and powerful as hell!" 👍
@timkox9640 4 years ago ⁺⁵
Really interesting stuff, thank you for this great explanation. At some point you mention that the asm code is not faster than standard C++. Is it because of the unaligned arrays or is there more to it?
@WhatsACreel 4 years ago ⁺¹²
It's because the C++ compiler can't inline our ASM, so the function call itself will slow things to the point where there's no point in using ASM like that. If we get into ASM, we usually try to stay there for as long as possible. The way we did it in the vid was just a few instructions, and so the function call will make that much slower than regular C++, even though the AVX512 instructions themselves will probably be very fast! Sorry I was unclear. Thanks for watching mate, hope this clears it up :)
@timkox9640 4 years ago
@@WhatsACreel Ok, it makes sense, thank you very much. I love your content, can't wait for the third part. Cheers ;)
@BSOD.Enjoyer 4 years ago ⁺³
@@WhatsACreel With clang or gcc, you can do inline asm in x64. clang/LLVM is pretty easy to set up in Visual Studio. It is not standard-conforming C++ though.
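Editor's sketch of the inlining point made in this thread, using intrinsics (one of the three approaches the video covers). The helper name `add_mul16` is hypothetical, and the scalar branch exists only so the sketch still builds when AVX512F is not enabled at compile time (for example, without `-mavx512f` on GCC or Clang). Because the function is ordinary C++, the compiler may inline it and keep the intermediate sum in a register, which it cannot do for a routine assembled in a separate file.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: out[i] = (a[i] + b[i]) * c[i] for 16 floats.
inline void add_mul16(const float* a, const float* b, const float* c, float* out) {
    assert(a && b && c && out);
#if defined(__AVX512F__)
    __m512 va = _mm512_loadu_ps(a);   // unaligned loads, so any pointer works
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);
    // The sum lives in a ZMM register; nothing is written back until the end.
    __m512 sum = _mm512_add_ps(va, vb);
    _mm512_storeu_ps(out, _mm512_mul_ps(sum, vc));
#else
    // Scalar fallback (assumption: used when AVX512F is not enabled).
    for (std::size_t i = 0; i < 16; ++i) out[i] = (a[i] + b[i]) * c[i];
#endif
}
```

Calling `add_mul16` from a hot loop leaves the optimizer free to merge it with surrounding code; an external ASM routine would force a call plus a store and reload of every intermediate.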
@hudevin7187 2 years ago
Thanks for your interesting and useful stuff.
@AlyxSharkBite-2000 3 years ago
I just came across this, great video!
@amber1862 3 years ago ⁺³
Beautiful video as always mate! Have you got a PayPal donation account set up at all? I'm sure I'm not alone when I say that I'd love to donate without having to sign up to Patreon :)
I have a future video request/idea relating to a question asked in this comment section: I'd love a video on assembly-related optimization traps beginners (like myself) can fall into, such as seeing the 3 lines of assembly towards the end of this video and not realising that although the SIMD add operation itself is one cycle, the loading and storing portion of the same instruction can take MANY cycles when doing it manually. A top 10 optimization traps video would be INCREDIBLY useful!
@WhatsACreel 3 years ago ⁺¹
Sorry, I don't have a one time donation set up. Cheers for the thought though, I am thinking of setting one up. And it would be excellent to make an ASM traps and pitfalls vid! I really like that idea. At the moment, I'm working on an 'ASM Misconceptions' vid, so that's kind of similar. I hope I can record and share soon. Well, thank you for the suggestions and thank you for watching, have a great day :)
@frognik79 4 years ago ⁺⁶
The first one seems so simple, is there any difference in assembly between them?
@WhatsACreel 4 years ago ⁺¹²
Mr. Fog's library is amazing! There's sometimes a small speed difference between it and intrinsics or native ASM, but it's usually negligible. That would indicate that sometimes it's not translated directly to single instructions. But, VCL includes a whole bunch of really powerful mathematical functions that are not available in ASM, so that's great. Some really fast implementations too!!
There's thousands of instructions in the x86/64 instruction set, and so the library doesn't attempt to capture all the flexibility of native ASM, but I'd say for many conceivable tasks, VCL is certainly a very good way to go! I find it's an excellent way to prototype vector algorithms too. Really simple and easy to debug :)
I'm sure there's differences, yep. I've never looked deeply into what they are, but found the library performs really very well. Hope this helps, and cheers for watching :)
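Editor's sketch of why a vector class can end up very close to hand-written intrinsics. `TinyVec16f` is a hypothetical stand-in written for this note; it is not VCL's source and is not claimed to mirror it. It only illustrates the general idea of an overloaded operator forwarding to a single intrinsic, with a scalar branch as an assumed fallback when AVX512F is not enabled.

```cpp
#include <cassert>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical 16-float wrapper, for illustration only.
struct TinyVec16f {
#if defined(__AVX512F__)
    __m512 v;
    void load(const float* p) { assert(p); v = _mm512_loadu_ps(p); }
    void store(float* p) const { assert(p); _mm512_storeu_ps(p, v); }
    // operator+ is a thin inline wrapper around one add instruction.
    friend TinyVec16f operator+(const TinyVec16f& a, const TinyVec16f& b) {
        TinyVec16f r;
        r.v = _mm512_add_ps(a.v, b.v);
        return r;
    }
#else
    float v[16];
    void load(const float* p) { assert(p); for (int i = 0; i < 16; ++i) v[i] = p[i]; }
    void store(float* p) const { assert(p); for (int i = 0; i < 16; ++i) p[i] = v[i]; }
    friend TinyVec16f operator+(const TinyVec16f& a, const TinyVec16f& b) {
        TinyVec16f r;
        for (int i = 0; i < 16; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
#endif
};
```

With optimization on, `va + vb` on such a type would typically inline to the same add as the raw intrinsic; any remaining difference would come from how the compiler schedules the surrounding loads and stores.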
@diegonayalazo 3 years ago
Thanks
@WhoNoMe 4 years ago ⁺³
Why did you write in the video that the “manual” asm is “much” slower than using the vectorclass.h even tho u use like 3 instructions in asm?
@nayjames123 4 years ago ⁺³
Because you have to load and store the vectors from memory for the add. In C++ they could stay in the registers, removing the need for unnecessary loads/stores
@salainen6850 4 years ago
@@nayjames123 There is also overhead when calling the assembly function.
@WhatsACreel 4 years ago ⁺⁵
Yes, the answers here are right! It's because the ASM will require a function call. The compiler can inline and optimize its own functions, but when we use ASM, it doesn't optimize. So when we go into ASM, we usually want to stay there for as long as possible, otherwise the time for the function call and loading of the data will not be mitigated and the ASM will perform poorly.
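Editor's sketch of the "stay there for as long as possible" advice: process the whole array inside one call so the call cost is paid once, not once per 16 elements. It is written with intrinsics in place of ASM so it stays self-contained; `add_arrays` is a hypothetical name, and the scalar loop doubles as the tail handler and as the fallback when AVX512F is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical routine: out[i] = a[i] + b[i] for n floats, in a single call.
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__AVX512F__)
    // Main loop: 16 floats per iteration, all inside the same function.
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
#endif
    // Tail (and full fallback when AVX512F is not enabled).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The same shape applies to a real ASM routine: pass the pointers and the element count once, and loop inside the routine.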
@steveokinevo 4 years ago ⁺¹
Beautiful Chris, just beautiful man, great video. Like the xmm and ymm sets, with aligned data on 16-byte and 32-byte boundaries respectively, would the avx512 be 64-byte aligned?
@WhatsACreel 4 years ago ⁺⁴
Yes, that's right! AVX512 alignment is 64 bytes. Since the original AVX instruction set, the alignment restrictions have been relaxed. We still have to align for the MOVAPD and other aligned moves, but we can use unaligned operands as the final parameters to AVX and AVX512 instructions. Cheers for watching mate :)
@steveokinevo 3 years ago
@@WhatsACreel No worries, NICE ONE for that, Cheers
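Editor's sketch of the 64-byte alignment point. `alignas(64)` is one way to get a suitably aligned buffer in C++; `is_aligned64` and `sum16` are hypothetical helpers. The aligned load intrinsic requires a 64-byte boundary, while the unaligned form accepts any address. The scalar branch is an assumed fallback for builds without AVX512F.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// True when p sits on a 64-byte boundary (the width of a ZMM register).
inline bool is_aligned64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Hypothetical helper: sums the 16 floats starting at p.
inline float sum16(const float* p) {
    assert(p != nullptr);
#if defined(__AVX512F__)
    // Aligned load only when the address allows it; otherwise the unaligned form.
    __m512 v = is_aligned64(p) ? _mm512_load_ps(p) : _mm512_loadu_ps(p);
    return _mm512_reduce_add_ps(v);
#else
    float s = 0.0f;
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
#endif
}
```

A buffer declared as `alignas(64) float buf[32];` satisfies the check at `buf`, while `buf + 1` does not and takes the unaligned path.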
@bbq1423 4 years ago ⁺⁷
Question: Do SIMD instructions run in parallel on a transistor level or is there some kind of internal for loop in the CPU?
@WhatsACreel 4 years ago ⁺²¹
Yes, they are parallel. For the most part anyway. There might be some complex instructions that split into micro ops, which could be executed by different pipes in sequence. But generally, it's all at the same time. Cheers for watching!
@alan2here 4 years ago ⁺¹
@@WhatsACreel cheers for answering questions :)
@gideonmaxmerling204 4 years ago ⁺¹
may I ask, why can't you pass the zmmwords to assembly and return the result through zmm0 instead of doing pointers (using vectorcall)?
edit: is there a better calling convention than the "c" calling convention?
@WhatsACreel 4 years ago ⁺³
Do you know, I'm not sure if there's a better one. I haven't studied calling conventions for a while. The x64 ones tend to all be very similar. C is pretty good. It uses registers for the first 4 ints and floats. But these vector types are arrays, so I'm not sure there's any better way to pass them than by pointer. Which is essentially what the C convention does.
It's easy to establish your own calling convention once you're in Assembly. Then you don't have to worry about any calling convention, unless you interact with other C functions. Of course, that can be tricky if you're not careful too!
Sorry I can't help more. Thank you for watching, and thank you for this interesting question :)
@gideonmaxmerling204 4 years ago ⁺¹
@@WhatsACreel you should try calling a normal C++ function and passing it an intrinsic zmmword then looking at the disassembly
@OpenGL4ever 1 year ago
There is a WP article about calling conventions; it is called "x86 calling conventions". This should give a nice overview.
@roberthowell8267 4 years ago ⁺¹¹
I don't know about this... I'm most DEFINITELY going with an AMD cpu
@WhatsACreel 4 years ago ⁺¹³
I can't fault AMD right now! Great CPUs :)
@llothar68 3 years ago
And it's also not recommended when programming for macOS now, because Rosetta 2 does not emulate AVX512.
@AlyxSharkBite-2000 3 years ago ⁺¹
If you want to do AVX512 you will need an Intel CPU: either one of the LGA 2066 i9s or one of the upcoming 11th Gen Core i7 or i9 (or equivalent Xeon). AMD doesn’t support AVX512.
@roberthowell8267 3 years ago ⁺¹
@@AlyxSharkBite-2000 ok we'll see how relevant avx512 is soon enough
@AlyxSharkBite-2000 3 years ago ⁺¹
@@roberthowell8267 Oh, I wasn’t saying it was. I was only saying that if you wanted it (figured you did since this was an AVX512 video) you needed an Intel. Didn’t want you to pick up a CPU and have it lack a feature you wanted.
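Editor's sketch for anyone unsure whether their own CPU has the feature discussed in this thread. On GCC and Clang targeting x86, the `__builtin_cpu_supports` builtin can report AVX512F support at run time. The wrapper name `cpu_has_avx512f` is hypothetical, and returning `false` on other compilers or architectures is an assumption made so the sketch builds everywhere.

```cpp
#include <cassert>

// Hypothetical wrapper: true when the running CPU reports AVX512F.
inline bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initializes the CPU feature data used by the next call
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;          // assumption: unknown toolchains are treated as "no AVX512F"
#endif
}
```

A program could use this to choose between an AVX512 code path and a fallback at start-up, so the same binary runs on CPUs with and without the instructions.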
@MagnusTheUltramarine 2 years ago
Why is it that the parameters are passed to rcx, rdx and r8? Also, what is stored in those registers?
Anyways, thanks a lot for these videos!
@WhatsACreel 2 years ago
It's just the calling convention. Folks had to decide on some registers and they just decided on those ones. It's different in Linux. Yeah, but I have no good explanation, just the convention
There's nothing special about those registers, they're general purpose, you can store whatever you like in them.
Hope this helps, have a good one :)
@MagnusTheUltramarine 2 years ago
@@WhatsACreel Thanks man. I watched all your new playlist on modern x64 assembly, and avx512. You truly enjoy what you explain!
Maybe you could make some videos on how to make a mini retro game or some kind of program in masm, in order to put these ideas in practice, just an idea
@OpenGL4ever 1 year ago
Calling conventions are compiler specific and that's what the compiler expects when an extern function is used. Many different compilers adhere to a specific calling convention and as a developer there isn't much you can do about it because then you would have to change the compiler.
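Editor's sketch of what "it's just the calling convention" means in practice. Under the Microsoft x64 convention the first four integer or pointer arguments travel in RCX, RDX, R8 and R9; under the System V AMD64 convention used on Linux they travel in RDI, RSI, RDX, RCX, R8 and R9. So for a routine taking three pointers on Windows, RCX, RDX and R8 would hold those three pointers in order. The lookup function and enum below are hypothetical names used only to make the mapping checkable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Abi { Win64, SysV };

// Returns the register that carries the index-th integer/pointer argument,
// or an empty string once arguments are passed on the stack instead.
inline std::string int_arg_register(Abi abi, std::size_t index) {
    static const char* const win64[] = {"rcx", "rdx", "r8", "r9"};
    static const char* const sysv[]  = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    if (abi == Abi::Win64) return index < 4 ? win64[index] : "";
    return index < 6 ? sysv[index] : "";
}
```

For example, `int_arg_register(Abi::Win64, 2)` gives `"r8"`, while the same argument position on Linux lands in `"rdx"`.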
@Alex-op2kc 3 years ago ⁺¹
Part 3: ruclips.net/video/543a1b-cPmU/видео.html
@dankillinger 4 years ago ⁺⁵
:)
@theterribleanimator1793 4 years ago
:)
@WhatsACreel 4 years ago ⁺⁴
:)
@anthonynjoroge5780 4 years ago
:-}
@NeilRoy 4 years ago
Hey, I seen that Blender folder on your desktop. Whatcha doin' with Blender? :)
@WhatsACreel 4 years ago ⁺²
Blender is amazing!! I use it for the 3D in some of these vids, and photogrammetry (which is creating models from photographs), sometimes I just make little towns and houses and things for fun :) I'd like to sell on TurboSquid or the Unity store eventually, but at the moment, it's such a learning curve, still just a beginner :)
@NeilRoy 3 years ago ⁺¹
@@WhatsACreel Nice! Been messing around with it myself. Another neat program is MakeHuman which is free and allows you to create human 3D models which you can import into Blender. Also free. So, make some people for your towns. :)
@WhatsACreel 3 years ago ⁺¹
@@NeilRoy MakeHuman is really great! Thanks mate :)
@WhatsACreel 3 years ago ⁺²
There's a plugin for Blender called Manuel Bastioni Lab. It's good too. Not a lot of assets though.
@theexplosionist2019 4 years ago ⁺²
There won't be any mainstream AVX-512 until Rocket Lake.
@WhatsACreel 4 years ago ⁺¹
That's an interesting point of view! Maybe this gigantic instruction set won't work out at all? Fascinating time we are in right now :)
@llothar68 3 years ago ⁺³
@@WhatsACreel I thought Intel learned this by blowing 100 billion on the Itanium VLIW architecture. But Intel is run by business graduates and not technical people like Lisa Su.
@TheNoodlyAppendage 2 years ago
The problem with AVX512 is the hardware supports the opcodes, but doesn't support them in hardware. With only 2 FPUs it's no faster than SSE
@OpenGL4ever 1 year ago
This will have the very simple reason that there are currently hardly any applications for the end user that support AVX512. But for compilers and developers it's good that they can buy CPUs that can do AVX512 so that they can adapt their software and compilers to it. That is why the space on the silicon chip was very likely saved.
As soon as the software supports AVX512 better, there will also be hardware with more AVX512 execution units per core, so that the additional performance compared to AVX2 can also be used.
It's basically a chicken and egg problem. But why waste space for the chicken if the egg hasn't even been laid yet?
@ClayWheeler 2 years ago
Not gonna lie. There's an Intel fanboy who said "Intel is better because video games can run on AVX 512 on it".
I was like: "Bruh, show me any video game that requires AVX 512 right now"
@OpenGL4ever 1 year ago
No game currently requires AVX512, nor would that be wise as it would then run on very little hardware. However, there are games that support it and use it when it's available. The Last of Us is one of them and there are also comparisons on YT. Just search for AVX512 on/off.
@jozo035 4 years ago ⁺⁴
AVX512 is just too good to be true. In theory you can get 256 SP-GFlops per core (FMA at 4 GHz). With 28 or 56 cores (if you have dual die Xeons available) you have performance above GPU accelerators at much lower TDP (which is often the most important parameter).
In reality, AVX512 proved to be a disaster (Xeon-Phi 7xxx was released in 2014)...
@WhatsACreel 4 years ago ⁺³
Yep, rough introduction, for sure! Similar things occurred with the original AVX. Though, at that time, they didn't have Ryzen to compete with! I think it was the opposite too. I think the original AVX was slow to start up, but once it was going it sped up?
I hope the throughput of the floating point can be improved. It's my only real worry about the instruction set. Oh, and compilers too - I mean, it was hard enough to effectively vectorise code, trying to automatically wrangle an instruction set like AVX512 from regular C++ code will be very difficult! I'm sure those compiler authors are clever enough to get some amazing things happening already :)
At the moment, throughput is 1 per cycle for the simpler floating instructions. It's 1/2 for AVX, so you get pretty much the same amount of flops. If they can improve that, even a little, I think it will do wonders!
Certainly love the masking abilities!! Really great stuff :)
Only time will tell :)
@OpenGL4ever 1 year ago
@@WhatsACreel The reason is that for AVX2 they use two AVX2 units per core in a superscalar way, and with AVX512 these two units just work together as one. So in the end, the result is the same. But if you ask me, this is not important at the moment, because compilers have to adapt first anyway.
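Editor's sketch of the arithmetic behind the numbers in this thread. Peak single-precision throughput per core can be estimated as lanes × 2 flops per FMA × FMA instructions issued per cycle × clock in GHz. The 256 GFLOPS figure matches 16 lanes with two 512-bit FMAs per cycle at 4 GHz; the two-per-cycle count is an assumption inferred from that figure, not something stated in the thread. The function name `peak_gflops` is hypothetical.

```cpp
#include <cassert>

// Estimated peak GFLOPS per core for FMA-heavy code.
// lanes: floats per vector (16 for a ZMM register, 8 for a YMM register)
// fma_per_cycle: how many vector FMAs the core can start each cycle
inline double peak_gflops(int lanes, int fma_per_cycle, double ghz) {
    assert(lanes > 0 && fma_per_cycle > 0 && ghz > 0.0);
    const int flops_per_fma = 2;  // one multiply plus one add per lane
    return lanes * flops_per_fma * fma_per_cycle * ghz;
}
```

With these inputs, `peak_gflops(16, 2, 4.0)` gives 256, and one 512-bit FMA per cycle, `peak_gflops(16, 1, 4.0)`, gives the same 128 as two 256-bit FMAs per cycle, `peak_gflops(8, 2, 4.0)`, which is the "pretty much the same amount of flops" observation above.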
@AlexDanut 3 years ago ⁺²
Ok, but seriously, what is a creel?
@EpicHardware 3 years ago ⁺⁶
wow 0 dislikes, i guess haters don't care about avx :P
@amber1862 3 years ago ⁺³
Many acclaimed studies have shown it's physically impossible to dislike an Australian talking about low-level performance computing.
@ricos1497 4 years ago ⁺²
Great video. I understood very little of it, but it was interesting nevertheless. Have you ever done a video on your background and what type of programming you do, as I'm quite interested to know where to start on things like this. My experience of coding is writing PowerShell scripts, VBA, a bit of C#, SQL scripts (on the data side, rather than performance) that sort of thing, but I'm entirely self taught and effectively piggyback on what others have done before me. I have little understanding of the back end processes and suchlike. Any recommendations on where to start to get into these things, should I get a degree or should I just try hacking into the FBI and hope for a lucky hit? Much appreciated.
@xniyana9956 3 years ago ⁺¹
I'm nowhere close to being on Creel's level but I do understand a lot of this stuff. I can actually write a fair bit of x86 assembly code but only the old school way, e.g. mov, add, cmp, div etc. and only 32-bit assembly for now. I also know some C/C++, C#, VB.Net, VB6 and a couple other languages.
It's not as intimidating as you think. I'm entirely self-taught. But I'm not going to lie, it won't be easy for you to level up out of basic scripting but just being able to write scripts puts you sooooo far ahead of the average person. I'll recommend you focus heavily on C# since you're already familiar with it and you can learn a lot in that environment. There is plenty out there in the world of C#. Just write a lot of code and read a lot when you get stuck. Rinse and repeat and within 2 years you should be able to do some amazing things. Also, don't be afraid to push yourself.
@orestescm7644 3 years ago
shame this will not work with Ryzen cpus
@lx2222x 3 years ago ⁺¹
LIKE AND SUB, THIS MAN IS AWESOME
@anonmouse-zr9cn 1 year ago
This is great. Very approachable.
@arditm2178 4 years ago
So... Do avx512 gather scatter instructions provide any performance benefit or is it just for cleaner code? And perhaps a chance for future better hardware implementation?
@WhatsACreel 4 years ago ⁺¹
I am not sure on the performance. If I remember correctly, they gather elements based on the bits of a K register. I assume the normal penalty for cache line misses would still hold, since the instructions would only be reading from 1 or 2 cache lines. Pretty much the same as any other instruction that reads the whole 64 bytes.
That's speculation though, and I'd certainly love to explore it a little. My memory is often flawed, so I might be completely wrong. I'd say they're useful, but they're not the completely arbitrary gathers that we might hope for.
Hope this helps, and if you do explore the performance I'd love to read/hear about it if you'd like to share. Cheers for watching mate, have a good one :)
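Editor's sketch of what a masked gather does, for readers who want to explore the question in this thread. As the Intel intrinsics describe it, a gather takes a vector of per-lane indices, and the K register acts as a mask that selects which lanes are actually loaded; unselected lanes keep the value from a source vector. `masked_gather16` is a hypothetical wrapper, and the scalar branch (an assumed fallback when AVX512F is not enabled) spells out the same semantics lane by lane. It says nothing about performance.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical wrapper: for each of 16 lanes,
// out[i] = (bit i of mask set) ? base[idx[i]] : src[i].
inline void masked_gather16(const float* base, const std::int32_t* idx,
                            std::uint16_t mask, const float* src, float* out) {
    assert(base && idx && src && out);
#if defined(__AVX512F__)
    __m512i vidx = _mm512_loadu_si512(idx);  // 16 x 32-bit indices
    __m512 vsrc = _mm512_loadu_ps(src);      // values kept where the mask bit is 0
    // Scale 4 because the indices count floats (4 bytes each).
    __m512 r = _mm512_mask_i32gather_ps(vsrc, static_cast<__mmask16>(mask),
                                        vidx, base, 4);
    _mm512_storeu_ps(out, r);
#else
    for (int i = 0; i < 16; ++i)
        out[i] = ((mask >> i) & 1u) ? base[idx[i]] : src[i];
#endif
}
```

Timing this against 16 scalar loads over index patterns that touch few or many cache lines would be one way to measure the effect the thread speculates about.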
@DaveAxiom 3 years ago
11:28 NASM uses the standard Intel syntax. MASM uses a modified proprietary syntax!
