CUDA Crash Course: GPU Performance Optimizations Part 1

  • Published: 10 Nov 2024

Comments • 32

  • @syfaiz
    @syfaiz 4 years ago +21

    We need more videos like this (in-depth performance tuning, with profiling and analysis). Good work, man.

  • @mhnatiuk
    @mhnatiuk 5 years ago +6

    Great material, it's really difficult to find tips for beginning CUDA programmers. I come from PyTorch, and thanks to you I successfully implemented multiplication of a dense vector by a sparse binary matrix. My use case is very specific and very demanding in terms of performance, so your videos really helped a lot. Thanks!

  • @AndrewCodeDev
    @AndrewCodeDev 4 years ago +7

    Hey Nick - you've been tremendously helpful. Thanks for your insights!

  • @muneshchauhan
    @muneshchauhan 4 years ago +4

    Another optimization strategy is to make the block size a multiple of the warp size and the grid size a multiple of the number of SMs in the GPU. That may not apply in your example, as your block size is already a multiple of the warp size. Really enjoyed your explanation.

  • @jfd2595
    @jfd2595 4 months ago +1

    Where is the next video or part 2? I heard you say that the next video would be more about optimizing tiled matrix multiplication, but I can't find it anywhere on your channel. I'm dying to watch it.

  • @aleksandarcvetkovic7045
    @aleksandarcvetkovic7045 10 months ago

    Hey Nick, your videos are truly a life saver, you covered so many important topics and are never afraid to get your hands dirty :).
    Can we expect a part 2 of this video, or is it already somewhere outside of this playlist?

  • @rahulramesh1238
    @rahulramesh1238 4 years ago +3

    Great content Nick. Helped me a lot with understanding things better :)

  • @SnoSixtyTwo
    @SnoSixtyTwo 4 years ago +1

    Love it, just wanted to add my two cents: when you introduced coalescing, it was a bit confusing what exactly you meant. The improvements from your change should be exactly:
    a) only one 32-byte (minimum size) read transaction from A per warp per iteration instead of 2
    b) a coalesced 128-byte read transaction from B per warp per iteration
    c) a coalesced 128-byte write transaction at the very end (likely the least significant)

  • @jerrickhoang6336
    @jerrickhoang6336 3 years ago +1

    Hey Nick, awesome video! Curious when part 2 is going to come out

  • @jianxiang
    @jianxiang 4 years ago +1

    Cool tutorial. You explain everything so clearly.

  • @ryanmckenna2047
    @ryanmckenna2047 1 year ago

    Great series, well done!

  • @eladon19153
    @eladon19153 1 year ago

    Taught me a lot, man! Will have to go through it a couple of times to get the full extent of it, though 😅

  • @eaemmm333
    @eaemmm333 3 years ago +1

    amazing thank you very much for this video

  • @tushargarg8378
    @tushargarg8378 1 year ago

    How did you compute `int row` and `int col`? Is there another guide that I can follow?

  • @nm5paczek
    @nm5paczek 4 years ago +1

    Thank you, you are awesome.

  • @zhengjack3401
    @zhengjack3401 3 years ago

    Well done!

  • @_lilkm
    @_lilkm 5 years ago +1

    Hi Nick, I'm trying to profile the matrix multiplication CUDA code (the same as your naive matrix multiplication code) with NVIDIA Nsight, I tried with 1

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  5 years ago

      I'm unfamiliar with the specs of the 960m, but you should be able to get more information using the following API - docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html . You should be able to use that to get the reason why the kernel launch didn't happen (a lot of times it's a bad launch param, or maybe an earlier memory allocation failed).

  • @thibautmodrzyk6215
    @thibautmodrzyk6215 3 years ago +1

    Hi Nick, great content, but the #pragma unroll actually made performance worse on the system I'm running it on (an NVIDIA V100).
    Typically for a 10k * 10k matrix, I'm going from 720 ms to 750 ms. Any idea why that is?

    • @jcsahnwaldt
      @jcsahnwaldt 1 year ago +1

      If the compiler completely unrolls a loop over 10,000 elements, it will generate 10,000 times n instructions, where n is the number of instructions required for each iteration (calculate indices, load data, multiply data...). That's a lot of machine code. Could easily be a few hundred KB. The code has to be loaded from memory, which takes time.
      If the loop isn't unrolled, its body will probably consist of a few instructions. They'll probably fit in the instruction cache, which means that the code has to be loaded from memory only once and can then be executed 10,000 times.

    • @jcsahnwaldt
      @jcsahnwaldt 1 year ago

      In a thread titled "code instruction cache" on the NVIDIA developer forums, someone mentions a 3% performance penalty for instruction cache misses, which would match your experience. Might be a coincidence though. I don't really know much about GPUs. :-)

    • @jcsahnwaldt
      @jcsahnwaldt 1 year ago

      I'd like to post the link to the forum thread, but the stupid YouTube spam filter won't let me. :-(

    • @jcsahnwaldt
      @jcsahnwaldt 1 year ago

      Search for "nvidia forum 38939"

  • @_lilkm
    @_lilkm 5 years ago +1

    great video

  • @SHASHANKDAMMALAPATI
    @SHASHANKDAMMALAPATI 7 months ago

    great video