CUDA Crash Course: Matrix Multiplication

  • Published: 2 Feb 2025

Comments • 34

  • @yuhaolin7337 · 5 years ago +6

    Your examples and illustrations are helpful! Thanks!

  • @eaemmm333 · 3 years ago +6

    Thank you for the informative course. One piece of feedback: your picture hides the lower right side of the screen, which might contain some information. If you could make it smaller, that would be better.

    • @matrx2wi3omf · 11 months ago +1

      Row = blockIdx.y * blockDim.y + threadIdx.y
      Col = blockIdx.x * blockDim.x + threadIdx.x
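
    For context, a minimal sketch of the kernel those index formulas typically belong to (the kernel name and temp_sum are illustrative, not necessarily the exact code from the video):

      __global__ void matrixMul(const int *a, const int *b, int *c, int n) {
          // Each thread computes one element of c, identified by its global row/col
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;

          if (row < n && col < n) {
              int temp_sum = 0;
              for (int k = 0; k < n; k++) {
                  temp_sum += a[row * n + k] * b[k * n + col];
              }
              c[row * n + col] = temp_sum;
          }
      }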

  • @michaelscheinfeild9768 · 1 year ago +1

    Nick, great CUDA course, thank you!

  • @BlackbodyEconomics · 3 years ago +1

    You totally saved me. Thanks :)

  • @SuperChrzaszcz · 5 years ago +2

    About the inside of `verify_result`: are you relying on malloc initializing `verify_c` to zero?

    • @NotesByNick · 5 years ago +2

      I got lucky that I happened to get a chunk of zeroed-out memory when I recorded the video. The code on GitHub was fixed post-release of this video on Feb 22, and does not rely on undefined behavior.

  • @turnipsheep · 3 years ago +1

    Excellent. Really helpful. Thanks

  • @hanwang5940 · 5 years ago +2

    Is the intermediate variable temp_sum necessary? My train of thought is that since array c is in global memory, it will take longer to accumulate the multiplication for each index using array c. Instead, we use temp_sum, which is in local memory, to accumulate and then assign the final value to global memory, thus reducing the time spent on memory transfers. Is my reasoning correct?

    • @NotesByNick · 5 years ago +3

      The compiler should put temp_sum in a register, so it should be faster to access than global memory. You don't have to do it, but it likely will improve performance (significantly in this case). I just tested the code with and without the temp_sum variable and got about 5.6 and 14.1 seconds respectively (matrix size of 2^14 x 2^14) on a TITAN V GPU.

    • @hanwang5940 · 5 years ago +1

      @NotesByNick
      I see, thanks for the explanation. I tried with my GPU (GTX 1050) and a matrix size of 2^13 x 2^13. With temp_sum and without temp_sum it is 13.5s and 14.5s (the average of several test runs). It's a slight improvement, but not as substantial as yours. I'm guessing it's because the TITAN V is much more powerful than the GTX 1050.

    • @NotesByNick · 5 years ago

      @hanwang5940 Performance optimizations rarely translate 1:1 between GPUs of different architectures (sometimes they don't translate at all). I did the same test on my local machine and found it improved a 2^12 matrix from ~2 seconds to ~1.6 or so. Another influencing factor is whether or not you are using unified memory (if you want to really isolate the performance changes, you don't want paging to influence your results). You also want to make sure that the same GPU isn't being used to drive a display.

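    To make the difference being measured in this thread concrete, a sketch of the two kernel variants under discussion (kernel names are illustrative, assuming row-major n x n matrices):

      // Variant 1: accumulate directly into c in global memory
      // (a global read-modify-write on every iteration of k)
      __global__ void matMulGlobalAccum(const int *a, const int *b, int *c, int n) {
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < n && col < n) {
              c[row * n + col] = 0;
              for (int k = 0; k < n; k++)
                  c[row * n + col] += a[row * n + k] * b[k * n + col];
          }
      }

      // Variant 2: accumulate in a local variable the compiler can keep in a
      // register, then write the result to global memory exactly once
      __global__ void matMulRegisterAccum(const int *a, const int *b, int *c, int n) {
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < n && col < n) {
              int temp_sum = 0;
              for (int k = 0; k < n; k++)
                  temp_sum += a[row * n + k] * b[k * n + col];
              c[row * n + col] = temp_sum;
          }
      }
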
  • @farinazf3040 · 3 years ago

    Thank you very much! You are a lifesaver!

  • @eduardojreis · 3 years ago +1

    Thank you so much for this incredible Crash Course. I think I understood the need for the `thread block` to be 2D, but I am not sure about the `grid size`. Why does it need to be 2D as well? Also, I am a bit confused by `dim3` having only 2 components specified. Shouldn't it be `dim2`?

    • @eduardojreis · 3 years ago

      Would it be the case that `thread blocks` have a size limit and `grids` don't? I might have missed that.

    • @Aditya-ec1ts · 11 months ago

      Yeah @eduardojreis, I was thinking the same. We could still have a 1D grid of blocks with 2D threads in them and make it work. I don't think it would affect the compute that much either.
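
    For reference, a sketch of the kind of launch configuration being discussed (BLOCK_SIZE, GRID_SIZE, matrixMul, and the d_* pointers are illustrative names). dim3 is CUDA's type for all launch dimensions; any component you leave unspecified defaults to 1, which is why there is no dim2:

      int BLOCK_SIZE = 16;                                  // 16 x 16 = 256 threads per block
      int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;    // enough blocks to cover n in each dimension

      // Both launch arguments are dim3; the z component is implicitly 1
      dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
      dim3 blocks(GRID_SIZE, GRID_SIZE);

      // d_a, d_b, d_c are the device-side matrices
      matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);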

  • @ryanmckenna2047 · 1 year ago

    Doesn't that give us 2^12 threads in total, since it's a 16x16 grid with 4 thread blocks along each grid axis, giving a 64 x 64 grid, which is 2^6 x 2^6 = 2^12 threads in total? Why wouldn't we want one thread per element if there are 1024 x 1024 elements in the matrix in total? In the vector addition example we had one thread per element; in this case we have 2^8 = 256 times fewer threads than that. Please explain.

    • @zijiali8349 · 4 months ago

      I had the same confusion in the beginning. I believe he "purposely" allocated more threads and space than necessary, just to demonstrate that you don't need a perfect match. The extra threads are handled by the if statement in the kernel.
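
    A worked example of how the total thread count falls out of the launch parameters, assuming N = 1024 and a 16 x 16 thread block (these numbers may differ from the exact version on GitHub):

      // BLOCK_SIZE = 16 threads per block dimension
      // GRID_SIZE  = ceil(1024 / 16) = 64 blocks per grid dimension
      // Total threads = (64 * 16) x (64 * 16) = 1024 x 1024 = 2^20, one per element.
      // If N were not a multiple of BLOCK_SIZE, the surplus threads in the last
      // row/column of blocks would be masked off by the kernel's bounds check:
      if (row < N && col < N) {
          // ... compute c[row * N + col]
      }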

  • @Omgcoconuts · 4 years ago +1

    What about matrix-vector multiplication? How does the indexing change?

    • @NotesByNick · 4 years ago +1

      Good question! The indexing doesn't really change much. Matrix-vector multiplication is just a special case of matrix-matrix multiplication, where the dimension in one of the matrices is 1. The only major change you would make to the indexing is to do row-major accesses instead of column-major accesses for the vector.
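
    A minimal sketch of the special case described above (matVecMul and its parameter names are assumptions, not the video's code): the matrix is read row by row, and the vector needs only a single linear index.

      __global__ void matVecMul(const float *A, const float *x, float *y, int n) {
          // One thread per output element of y = A * x, with A stored row-major
          int row = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < n) {
              float sum = 0.0f;
              for (int k = 0; k < n; k++) {
                  sum += A[row * n + k] * x[k];   // x is indexed linearly, no column stride
              }
              y[row] = sum;
          }
      }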

  • @seanhiggins2085 · 2 years ago

    I followed the code in this video, and for some reason it runs just fine in Release, but when I try to run it in Debug it actually fails the assert(). Any clue why this is?

    • @VincentZalzal · 1 year ago

      The C version of the matrix multiplication is not initializing the destination memory to 0 before looping over += operations. In Debug, depending on your standard library, OS, etc., memory might be default-initialized to 0 after a call to malloc, but that step is usually skipped in Release mode.
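
    A sketch of the fix being described, assuming a CPU reference check shaped roughly like the video's verify_result (the exact signature may differ): the accumulator is zeroed explicitly instead of relying on malloc handing back zeroed pages.

      #include <assert.h>

      void verify_result(const int *a, const int *b, const int *c, int n) {
          for (int i = 0; i < n; i++) {
              for (int j = 0; j < n; j++) {
                  int tmp = 0;                      // explicit initialization
                  for (int k = 0; k < n; k++) {
                      tmp += a[i * n + k] * b[k * n + j];
                  }
                  assert(tmp == c[i * n + j]);
              }
          }
      }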

  • @KishanKumar-mz3xr · 4 years ago

    Is it necessary to store the data in a linear array? Can't we initialize a and b as 2-D arrays of size n*n?

    • @NotesByNick · 4 years ago +1

      You can, but that would be less efficient. Instead of having 1 pointer to n x n elements, you would need n pointers that each point to 1 x n elements. So now you not only have to store n x n total elements, you'd also need n pointers (instead of just 1). There's also something to be said for not having fragmented memory: why break up a giant piece of memory you know you need into small chunks if you don't have to? This can lead to an even larger amount of memory being used because of padding for each row you allocate individually (instead of just a small amount of padding for a single large allocation).
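
    A host-side sketch contrasting the two layouts described above (function names are illustrative):

      #include <stdlib.h>

      // One contiguous allocation for an n x n matrix, indexed row-major:
      // a single pointer, a single allocation, and element (i, j) lives at m[i * n + j]
      int *alloc_flat(int n) {
          return (int *)malloc((size_t)n * n * sizeof(int));
      }

      // Array-of-pointers layout: n + 1 allocations, n extra pointers to store,
      // and rows scattered across the heap (worse locality, more allocator overhead)
      int **alloc_ragged(int n) {
          int **m = (int **)malloc(n * sizeof(int *));
          for (int i = 0; i < n; i++)
              m[i] = (int *)malloc(n * sizeof(int));
          return m;
      }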

  • @eduardojreis · 3 years ago

    3:50 - If Thread Blocks > Warps > Threads, shouldn't this then be a "tiny 2D warp" instead of a "tiny 2D thread block"?

  • @elucasvargas1874 · 4 years ago

    There is a mistake: int GRID_SIZE = (int)ceil(n / BLOCK_SIZE); must be int GRID_SIZE = (int)ceil(n / BLOCK_SIZE + 1);

    • @NotesByNick · 4 years ago +1

      ELucas Vargas, thanks for the comment. This was already fixed in the code on GitHub, and in version two of the CUDA Crash Course series.
      Cheers,
      -Nick

    • @elucasvargas1874 · 4 years ago

      @NotesByNick Thanks, great job!
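
    To make the pitfall explicit, a sketch of why the original expression can under-count and two common ways to compute the ceiling (this is the general idiom, not a quote of the GitHub fix; n and BLOCK_SIZE are assumed to be ints):

      #include <math.h>

      // n / BLOCK_SIZE is integer division, so it truncates *before* ceil()
      // ever runs; ceil() then has nothing left to round up.
      int GRID_SIZE_wrong = (int)ceil(n / BLOCK_SIZE);

      // Fix 1: promote to floating point so the division keeps its fraction
      int GRID_SIZE_f = (int)ceil((float)n / BLOCK_SIZE);

      // Fix 2: pure integer ceiling division
      int GRID_SIZE_i = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;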

  • @jellyjiggler5311 · 4 years ago

    Aren't there four threads per block?

    • @NotesByNick · 4 years ago

      Only in the slides for the example. The code in the video uses 256 threads per block, and the updated version uses 1024.