Nvidia GPU Architecture

  • Published: Dec 4, 2024

Comments • 64

  • @kartik8n8
    @kartik8n8 2 years ago +31

    I came to this video 7 years after it was made, but it's the only source that has explained the GPU architecture so well. We need more people like you, Bradon. Thanks for this!

    • @bradonf333
      @bradonf333  2 years ago +3

      Thanks! I appreciate that. Happy I could help! 👍

  • @Engrbilal143
    @Engrbilal143 8 months ago +4

    8 years on and it's still the best video available that explains GPU architecture

  • @kompila
    @kompila 2 months ago +1

    7 years later this is still legendary

  • @aravinds123
    @aravinds123 3 months ago +2

    Thanks a lot for this explanation! This is the best video that I could find on this topic!!

  • @pubgplayer1720
    @pubgplayer1720 1 year ago +2

    This is a nice high-level view of the architecture without going into too much detail.

  • @anphiano4775
    @anphiano4775 3 years ago +2

    Thank you, Bradon. When you explained where the number of CUDA cores comes from, I found it so interesting that I immediately calculated how many SMMs are in my GTX 960 with its 1024 CUDA cores.
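
    That back-of-the-envelope calculation can also be checked in code. On Maxwell, each SMM holds 128 CUDA cores, so a 1024-core GTX 960 works out to 1024 / 128 = 8 SMMs. Below is a minimal sketch using the CUDA runtime API; the 128-cores-per-SMM constant is an assumption that only holds for Maxwell-class parts.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // Maxwell (compute capability 5.x) packs 128 CUDA cores into each
        // SMM; other generations use different counts, so this constant
        // is architecture-specific.
        const int coresPerSMM = 128;

        printf("Device:     %s\n", prop.name);
        printf("SMMs:       %d\n", prop.multiProcessorCount);
        printf("CUDA cores: %d\n", prop.multiProcessorCount * coresPerSMM);
        // A GTX 960 reports 8 SMMs, i.e. 8 * 128 = 1024 CUDA cores.
        return 0;
    }
    ```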

  • @antonidaweber9184
    @antonidaweber9184 1 year ago +2

    Thank you very much for your work, Bradon! This video contains a lot of useful information. The explanations are simple and concise. I also find your approach to researching information very inspiring: step by step you dive into the topic, and although it's hard, in the end you have a well-structured presentation that you kindly share with other people.

    • @bradonf333
      @bradonf333  1 year ago +1

      Thank you! I really appreciate that. Glad my video could help.

  • @MrPkmonster
    @MrPkmonster 4 years ago +3

    Great tutorial. Very clear explanation. Thank you Bradon

  • @shreyabhandare6056
    @shreyabhandare6056 1 year ago +3

    The only video that explains this topic well. Please consider making more on newer stuff, thank you 🙏

  • @jayashreebhargava2348
    @jayashreebhargava2348 8 months ago +1

    Very nice 👍 overview of Nvidia GPU arch

  • @zhikangdeng3619
    @zhikangdeng3619 7 years ago +10

    Really nice explanation. Thanks for sharing!

  • @billoddy5637
    @billoddy5637 5 years ago +14

    As you can probably see, Streaming Multiprocessors are the GPU’s equivalent of a CPU core.
    Furthermore, these CUDA "cores" Nvidia refers to are actually execution units, floating-point FMA units to be precise. They make up the bulk of the SM’s execution units.
    In reality, the number of SMs doesn’t matter much on its own: Nvidia tends to change the size of an SM between microarchitectures, so comparing SM counts isn’t very useful. Comparing the number of CUDA cores together with the clock speed is probably more reliable.

    • @hjups
      @hjups 3 years ago +4

      That depends on your workload. If you have a lot of thread divergence, more SMs with fewer ALUs is better. If you have high utilization and little thread divergence, then fewer SMs with more ALUs is better. As GPU tasks become more complicated, the amount of divergence increases, so making the SMs smaller is the better approach.
      The raw peak performance depends only on the number of ALUs and the clock speed (not the SMs). However, the practically achievable peak can depend heavily on the number of SMs, depending on the workload. For example, 1 SM with 2048 ALUs at 1 GHz might do 2 TFLOPS peak but in practice only achieve 40% of that, whereas 4 SMs with 256 ALUs each might peak at 1 TFLOPS and achieve 90% of it. The 40% vs 90% gap could be explained entirely by thread divergence.
      Then you have 2 × 0.4 = 0.8 TFLOPS vs 1 × 0.9 = 0.9 TFLOPS. That's an extreme example, though, because the designs are much closer in raw numbers while being further apart in practice (it has to do with register file pressure, scheduling, work distribution, etc.).
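
      A quick sketch of that trade-off arithmetic, using the hypothetical numbers from the reply above (peak counts one operation per ALU per cycle, matching the reply's convention; the 40% and 90% utilization figures are illustrative, not measured):

      ```cuda
      #include <cstdio>

      // Effective throughput = ALUs * clock * achieved utilization.
      // Divergence-heavy workloads waste lanes in wide SMs, so a wider
      // design can lose to a narrower one despite a higher raw peak.
      double effectiveTflops(int smCount, int alusPerSM, double clockGHz,
                             double utilization) {
          double peak = smCount * alusPerSM * clockGHz / 1000.0;  // TFLOPS
          return peak * utilization;
      }

      int main() {
          // 1 SM x 2048 ALUs: 2.048 TFLOPS peak, ~40% achieved -> ~0.82
          printf("wide:   %.2f TFLOPS\n", effectiveTflops(1, 2048, 1.0, 0.40));
          // 4 SMs x 256 ALUs: 1.024 TFLOPS peak, ~90% achieved -> ~0.92
          printf("narrow: %.2f TFLOPS\n", effectiveTflops(4, 256, 1.0, 0.90));
          return 0;
      }
      ```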

  • @petergibson2318
    @petergibson2318 6 years ago +4

    I like this level... down to the hardware.
    Easy-peasy software like Facebook and Microsoft Office sits on top of the hardware.
    The cleverness is in the hardware: billions of transistors.

  • @vladislavdracula1763
    @vladislavdracula1763 8 years ago +26

    Very well explained. However, cores are not the same as ALUs.

    • @ithaca2076
      @ithaca2076 3 years ago +1

      True, but they perform similar tasks. The GPU's cores are more like lots and lots of advanced ALUs with a few special bells and whistles here and there.

  • @klam77
    @klam77 9 months ago +1

    VERY useful! Thanks.

  • @JJJohnson441
    @JJJohnson441 7 years ago +1

    Thanks for this simple, but informative tutorial.

  • @zienabesam4339
    @zienabesam4339 1 year ago +1

    I liked the way you explain 👍

  • @FreakinLobstah
    @FreakinLobstah 8 years ago +2

    Wow, very well done! It was very helpful for me.

  • @zhangbo0037
    @zhangbo0037 4 years ago +1

    Very helpful for learning graphics, thanks

  • @evabasis4960
    @evabasis4960 6 years ago +3

    Thank you for the nice video. Can a core run more than one thread at the same time? Or does a CUDA core execute only one thread at a time?

  • @Rowing-li6jt
    @Rowing-li6jt 5 years ago +2

    Great video!!!

  • @8scorpionx
    @8scorpionx 9 years ago +2

    Very interesting, thumbs up :)

  • @kartikpodugu
    @kartikpodugu 3 years ago +1

    Hierarchy
    1. GPU
    2. GPC
    3. SMM
    4. CUDA core.
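
    As a concrete instance of that hierarchy, here is a sketch with the counts for the Maxwell GM206 chip in the GTX 960 discussed above; the per-level counts are assumptions taken from published Maxwell specs, not queried from hardware.

    ```cuda
    // Hierarchy sketch, top to bottom. Counts are for the GM206 chip
    // (GTX 960) and are assumptions, not queried from the device.
    constexpr int kGPCsPerGPU  = 2;    // 1. GPU -> 2. GPC
    constexpr int kSMMsPerGPC  = 4;    // 2. GPC -> 3. SMM
    constexpr int kCoresPerSMM = 128;  // 3. SMM -> 4. CUDA core

    static_assert(kGPCsPerGPU * kSMMsPerGPC * kCoresPerSMM == 1024,
                  "GM206 totals 1024 CUDA cores");
    ```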

  • @paritoshgavali
    @paritoshgavali 4 years ago

    Very well explained, thanks

  • @breezysaint9539
    @breezysaint9539 8 years ago +2

    Well explained! Thank you

  • @IgorAherne
    @IgorAherne 9 years ago +1

    Thanks!

  • @stizandelasage
    @stizandelasage 4 years ago

    This can accelerate my Unix OS, I really like it

  • @himanshupatra1991
    @himanshupatra1991 7 years ago

    Very nicely explained. Can I get links to the videos that come before and after this one? I couldn't find the previous one. I want to watch all of them. @Bradon Fredrickson

    • @bradonf333
      @bradonf333  7 years ago +1

      Himanshu patra Hey, sorry I don't have any more videos. This was just a final project I had to do for school.

    • @himanshupatra1991
      @himanshupatra1991 7 years ago +1

      Bradon Fredrickson Thank you so much for the quick reply. I thought of asking because somewhere in the video you said "I have explained CUDA in the previous class". It is such a nice video. Thank you so much. 😊

  • @Varaquilex
    @Varaquilex 8 years ago +1

    Is there any difference between CPU ALUs and GPU ALUs?

    • @bradonf333
      @bradonf333  8 years ago +3

      I think the main difference is the number of ALUs. GPUs have a lot of ALUs and a CPU only has a few.

    • @ithaca2076
      @ithaca2076 3 years ago +1

      @@bradonf333 Well, they aren't the same though. GPU ALUs are a bit more advanced and tend to have features for things like calculating shading, if I recall correctly.

    • @hjups
      @hjups 3 years ago +1

      ​@@ithaca2076 That's not quite correct, depending on what you mean. CPUs don't typically have one type of ALU anymore; they have an integer ALU (add, subtract, logic, shifts), a multiplier, a divider, and an FP ALU (which is often divided into an FP ADD/MUL ALU and an FP DIV/SQRT ALU). Often those ALUs are combined into common pipelines. For example, an x86 CPU may do {add, subtract, logic, shifts, and multiplication} in one ALU, and {add, subtract, logic, and division} in another. It also depends on the CPU: some have multiply-accumulate instructions, while others don't. If you ignore the addition of a MAC instruction, then a GPU "ALU" is going to be much simpler than most CPU ALUs.
      The degree to which that's true depends on the GPU architecture. Older ones combined an INT32 and FP32 ALU into a single "core", which may or may not have been unified (i.e. a single pipeline that could do either), or they could have been separate pipelines. The advantage of separate pipelines is lower latency (fewer cycles) at the cost of area. The current Nvidia architectures have a combined FP32/INT32 pipeline and an FP32-only pipeline. The current AMD architectures have combined FP32 and INT32 pipelines, which is also true for the newer ARM Mali GPUs, as well as PowerVR and Apple's GPUs.
      Going back to Nvidia, the FP32 ALUs can only do FPADD, FPSUB, FPMUL, FPMAC, and FMA, and I think that's also where the int-to-FP instructions are done.
      The INT32 ALUs do integer ADD, SUB, MUL, MAC, logic, shift, and FP compare, and I think that's where the FP-to-int instructions are done.
      The "cores" / ALUs don't do anything fancier like SQRT or division. The GPU is actually incapable of doing either of those operations directly; instead it approximates them using the Special Function Units (SFU / MFU). Those would not be considered ALUs, though.
      TL;DR: GPU ALUs are in general much simpler than those of a modern (high-end) CPU.
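
      The SFU point can be seen from CUDA device code: the compiler expands a plain division or sqrtf into a refined multi-instruction sequence built on an SFU approximation, and CUDA also exposes raw, faster approximations as intrinsics. A minimal sketch, assuming a device with unified-memory support (the kernel and variable names are illustrative):

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      // Each pair contrasts the accurate form the compiler expands into a
      // refined sequence with the corresponding fast intrinsic.
      __global__ void approxVsAccurate(float x, float y, float *out) {
          out[0] = x / y;             // refined division sequence
          out[1] = __fdividef(x, y);  // fast SFU-based approximation
          out[2] = sinf(x);           // accurate math-library sine
          out[3] = __sinf(x);         // fast SFU-based sine approximation
      }

      int main() {
          float *out;
          cudaMallocManaged(&out, 4 * sizeof(float));
          approxVsAccurate<<<1, 1>>>(10.0f, 3.0f, out);
          cudaDeviceSynchronize();
          printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
          cudaFree(out);
          return 0;
      }
      ```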

  • @SHIVAMPANDEY-rr8in
    @SHIVAMPANDEY-rr8in 7 years ago

    How can we relate warps and grids?

  • @SHIVAMPANDEY-rr8in
    @SHIVAMPANDEY-rr8in 7 years ago

    Great!!

  • @Supperesed
    @Supperesed 6 years ago +1

    Microarchitecture sounds like organization according to you

  • @223Warlord
    @223Warlord 7 years ago +3

    Pretty sure their GPUs are more complicated than what you can read on Wikipedia; otherwise any company could easily steal Nvidia's intellectual property.