I came to this video 7 years after it was made, but it's the only source that has explained GPU architecture so well. We need more people like you, Brandon. Thanks for this!
Thanks! I appreciate that. Happy I could help! 👍
8 years on and it's still the best video available that explains GPU architecture.
Totally agree 👍
7 years later this is still legendary
Thank you!
Thanks a lot for this explanation! This is the best video that I could find on this topic!!
This is a nice high level view of architecture without going into too much detail.
Thank you Brandon. When you explained where the number of CUDA cores comes from, I found it so interesting that I calculated right away how many SMMs are in my GTX 960 with its 1024 CUDA cores.
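For anyone who wants to reproduce that calculation: it works because Maxwell packs a fixed number of CUDA cores into each SMM. A minimal sketch, assuming the 128-cores-per-SMM figure of the Maxwell generation:

```cuda
// Back-of-the-envelope SMM count for a Maxwell card like the GTX 960.
// Assumes 128 CUDA cores per SMM (the Maxwell SMM width).
#include <cstdio>

int main() {
    const int cudaCores   = 1024;  // GTX 960 spec-sheet core count
    const int coresPerSmm = 128;   // Maxwell SMM width
    printf("SMMs: %d\n", cudaCores / coresPerSmm);  // prints 8
    return 0;
}
```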
Thank you very much for your work, Brandon! This video contains a lot of useful information. The explanations are simple and concise. I also find your approach to researching information very inspiring: step by step you dive into the topic, and although it's hard, in the end you have a well-structured presentation that you kindly share with other people.
Thank you! I really appreciate that. Glad my video could help.
Great tutorial. Very clear explanation. Thank you, Bradon.
The only video that explains this topic well. Please consider making more on newer stuff, thank you 🙏
Thanks so much!!
Very nice 👍 overview of Nvidia GPU architecture.
Really nice explanation. Thanks for sharing!
As you can probably see, Streaming Multiprocessors are the GPU’s equivalent of a CPU core.
Furthermore, these CUDA "cores" Nvidia refers to are actually execution units. Floating-point FMAs to be precise. They make up the bulk of the SM’s execution units.
In reality, the number of SMs doesn't really matter: Nvidia tends to change the size of an SM between microarchitectures, so comparing SM counts isn't very useful. I would say comparing the number of CUDA cores together with the clock speed is probably more reliable.
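To make that comparison concrete, here is a rough sketch of the usual peak-FP32 estimate. It assumes each CUDA core retires one FMA per clock, counted as 2 FLOPs (the usual spec-sheet convention); the card numbers are made up purely for illustration:

```cuda
// Rough peak-FP32 comparison from CUDA cores and clock speed.
// FMA = 2 FLOPs per core per clock, the usual spec-sheet convention.
#include <cstdio>

double peakTflops(int cudaCores, double clockGhz) {
    return cudaCores * 2.0 * clockGhz / 1000.0;
}

int main() {
    // Hypothetical cards, purely for illustration.
    printf("GPU A: %.2f TFLOPS\n", peakTflops(1024, 1.2));  // 2.46
    printf("GPU B: %.2f TFLOPS\n", peakTflops(2048, 1.0));  // 4.10
    return 0;
}
```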
That depends on your workload. If you have a lot of thread divergence, more SMs with fewer ALUs is better. If you have high utilization and little thread divergence, then fewer SMs with more ALUs would be better. As GPU tasks become more complicated, the amount of divergence increases, so making the SMs smaller is the better approach.
The raw peak performance depends only on the number of ALUs and the clock speed (not the SMs). However, the practically achievable performance can depend heavily on the number of SMs, depending on the workload. For example, 1 SM with 2048 ALUs at 1 GHz might do 2 TFLOPS but in practice only achieve 40% of peak, whereas 4 SMs with 256 ALUs each might peak out at 1 TFLOPS and achieve 90% of peak. The 40% vs 90% could be explained entirely by thread divergence.
Then you have 2 * 0.4 = 0.8 TFLOPS vs 1 * 0.9 = 0.9 TFLOPS. That's an extreme example though: the final throughput numbers end up close together even while the achieved efficiencies are far apart (it also has to do with register file pressure, scheduling, work distribution, etc.)
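Put as a sketch (keeping the 1-op-per-ALU-per-clock counting from the example above, and taking the utilization figures as given):

```cuda
// Effective throughput = ALUs x clock x achieved fraction of peak.
// Numbers are the hypothetical ones from the comment above.
#include <cstdio>

double effectiveTflops(int alus, double clockGhz, double utilization) {
    double peakTflops = alus * clockGhz / 1000.0;  // 1 op per ALU per clock
    return peakTflops * utilization;
}

int main() {
    // One big SM vs four small SMs under heavy thread divergence.
    printf("1 SM,  2048 ALUs, 40%% of peak: %.2f TFLOPS\n",
           effectiveTflops(2048, 1.0, 0.4));       // ~0.8
    printf("4 SMs, 256 ALUs each, 90%% of peak: %.2f TFLOPS\n",
           effectiveTflops(4 * 256, 1.0, 0.9));    // ~0.9
    return 0;
}
```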
I like this level....down to the hardware.
Easy-peasy software like Facebook and Microsoft Office sits on top of the hardware.
The cleverness is in the hardware....billions of transistors.
Very well explained. However, cores are not the same as ALUs.
True, but they do similar tasks, albeit the GPU's cores are more like lots and lots of advanced ALUs with a few special bells and whistles here and there.
VERY useful! Thanks.
Thanks for this simple, but informative tutorial.
I liked the way you explain 👍
Wow, very well done! It was very helpful for me.
Thanks!!
Very helpful for learning graphics, thanks!
Thank you for the nice video. Can a core run more than one thread at the same time, or does a CUDA core execute only one thread at a time?
Great video!!!
Very interesting, thumbs up :)
Hierarchy:
1. GPU
2. GPC
3. SMM
4. CUDA core.
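For a concrete instance of that hierarchy, here's a sketch using the GTX 980's Maxwell (GM204) layout as the example, assuming its published figures of 4 GPCs, 4 SMMs per GPC, and 128 CUDA cores per SMM:

```cuda
// Walking the hierarchy from GPU down to the CUDA-core count.
// Figures are the GTX 980's Maxwell (GM204) layout.
#include <cstdio>

int main() {
    const int gpcs        = 4;    // Graphics Processing Clusters per GPU
    const int smmsPerGpc  = 4;    // Streaming Multiprocessors per GPC
    const int coresPerSmm = 128;  // CUDA cores per SMM
    printf("CUDA cores: %d\n", gpcs * smmsPerGpc * coresPerSmm);  // 2048
    return 0;
}
```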
Very well explained, thanks.
Well explained! Thank you.
Thank you
Thanks!
This can accelerate my Unix OS. I really like it.
Very nicely explained. Can I get a link to all the sequential videos after or before this? I couldn't find the previous one. I want to watch all of them. @Bradon Fredrickson
Himanshu patra Hey, sorry I don't have any more videos. This was just a final project I had to do for school.
Bradon Fredrickson Thank you so much for the quick reply. I asked because somewhere in the video you said "I have explained CUDA in the previous class". It is such a nice video. Thank you so much. 😊
Is there any difference between CPU ALUs and GPU ALUs?
I think the main difference is the number of ALUs. GPUs have a lot of ALUs and a CPU only has a few.
@@bradonf333 Well, they aren't the same though. GPU ALUs are a bit more advanced and tend to have extra features for things like shading calculations, if I recall correctly.
@@ithaca2076 That's not quite correct, depending on what you mean. CPUs don't typically have one type of ALU anymore; they have an integer ALU (add, subtract, logic, shifts), a multiplier, a divider, and an FP ALU (which is often divided into an FP ADD/MUL ALU and an FP DIV/SQRT etc. ALU). Often those ALUs are combined into common pipelines; for example, an x86 CPU may do {Add, Subtract, Logic, Shifts, and Multiplication} in one ALU, and then {Add, Subtract, Logic, and Division} in another ALU. It also depends on the CPU: some of them have multiply-accumulate instructions, while others don't. If you ignore the addition of a MAC instruction, then a GPU "ALU" is going to be much simpler than most CPU ALUs.
The degree to which that's true depends on the GPU architecture: the older ones combined an INT32 and FP32 ALU into a single "core", which may or may not have been unified (i.e. a single pipeline that could do either), or they could have been unique pipelines. The advantage of unique pipelines would be lower latency (fewer cycles) at the cost of area. The current Nvidia architectures have a combined FP32/INT32 pipeline and an FP32-only pipeline. The current AMD architectures have combined FP32 and INT32 pipelines, which is also true for the newer ARM Mali GPUs, as well as PowerVR and Apple's GPUs.
Going back to Nvidia though, the FP32 ALUs can only do FPADD, FPSUB, FPMUL, FPMAC, and FMA, and I think that's also where the Int-to-FP instructions are done.
The INT32 ALUs do integer ADD, SUB, MUL, MAC, Logic, Shift, and FPCompare, and I think that's where the FP-to-Int instructions are done.
The "Cores" / ALUs don't do anything fancier like SQRT, or Division. The GPU is actually incapable of doing either of those operations, and instead can do approximations to those operations using the Special Function Units (SFU / MFU). Those would not be considered ALUs though.
TL;DR The GPU ALUs are in general much simpler than that of a modern (high end) CPU.
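As a concrete illustration of that last point about the SFUs: CUDA exposes the fast approximations directly as intrinsics (__fdividef and rsqrtf are real device functions), while the plain operators compile to longer refinement sequences built on those approximations. A minimal, purely illustrative kernel sketch:

```cuda
// Sketch: division and square root on the GPU.
// The plain operators compile to multi-instruction sequences that refine
// an SFU approximation; the intrinsics use the raw fast path.
__global__ void divAndSqrt(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float exact = x[i] / y[i];             // IEEE-rounded, refined sequence
        float fast  = __fdividef(x[i], y[i]);  // fast approximate division
        float root  = sqrtf(x[i]);             // refined from an SFU estimate
        float rroot = rsqrtf(x[i]);            // fast reciprocal square root
        out[i] = exact + fast + root + rroot;  // keep all results live
    }
}
```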
How can we relate warps and grids?
great!!
The way you describe it, microarchitecture sounds like computer organization.
Pretty sure their GPUs are more complicated than what you can read on Wikipedia; otherwise, any company could easily steal Nvidia's intellectual property.