We need more videos like this (in-depth performance tuning, with profiling and analysis). Good work, man.
Thanks! Glad you enjoyed the video!
Great material, it's really difficult to find tips for beginning CUDA programmers. I come from PyTorch, and thanks to you I successfully implemented multiplication of a dense vector by a sparse binary matrix. My use case is very specific and very demanding in terms of performance, so your videos really helped a lot. Thanks!
Glad it has been helpful, fella!
Hey Nick - you've been tremendously helpful. Thanks for your insights!
Always happy to help!
Another optimization strategy is to make the block size a multiple of the warp size and the grid size a multiple of the number of SMs in the GPU. Well, this may not apply in your example, as your block size is already a multiple of the warp size. Really enjoyed your explanation.
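For anyone who wants to try that heuristic, here's a minimal sketch (my own, not from the video): a grid-stride kernel whose launch dimensions come from the device properties. `scaleKernel` and the sizes are placeholders.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: a fixed-size grid covers any n.
__global__ void scaleKernel(float *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= 2.0f;
    }
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int n = 1 << 20;
    int block = 4 * prop.warpSize;             // multiple of the warp size (128)
    int grid  = 8 * prop.multiProcessorCount;  // multiple of the SM count

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    scaleKernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```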
Where is the next video or part 2? You said the next video would be more about optimizing tiled matrix multiplication, but I can't find anything like that on your channel. I'm dying to watch it.
Hey Nick, your videos are truly a lifesaver; you've covered so many important topics and are never afraid to get your hands dirty :).
Can we expect a part 2 of this video, or is it already somewhere outside of this playlist?
Great content Nick. Helped me a lot with understanding things better :)
Love it, just wanted to add my two cents: when you introduced coalescing, it was a bit confusing to me what exactly you meant. The improvements from your change should be exactly:
a) only one 32-byte (minimum-size) read transaction from A per warp per iteration instead of 2
b) a coalesced 128-byte read transaction from B per warp per iteration
c) a coalesced 128-byte write transaction at the very end (likely the least significant)
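In code, the pattern being described looks roughly like this (a sketch of a naive row-major matmul, not the exact kernel from the video):

```cuda
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // constant across a warp
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive across a warp
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // A[row * N + k]: one address for the whole warp ->
            //   a single 32-byte transaction, broadcast to all 32 threads.
            // B[k * N + col]: consecutive addresses across the warp ->
            //   one coalesced 128-byte transaction.
            sum += A[row * N + k] * B[k * N + col];
        }
        // Consecutive addresses across the warp ->
        //   one coalesced 128-byte write at the very end.
        C[row * N + col] = sum;
    }
}
```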
Hey Nick, awesome video! Curious when part 2 is going to come out
Cool tutorial. You explain everything so clearly.
Thanks! Glad you liked it!
Great series, well done!
Taught me a lot, man! Will have to go through it a couple of times to get the full extent of it, though 😅
Amazing, thank you very much for this video!
How did you compute `int row` and `int col`? Is there another guide that I can follow?
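For reference, the usual mapping looks like this (a sketch assuming a 2-D grid of 2-D blocks, each thread computing one output element):

```cuda
// Global 2-D coordinates of this thread: which block it is in,
// scaled by the block dimensions, plus its offset within the block.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
```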
Thank you, you are awesome.
Thanks! Glad you liked the video!
Well done!
Hi Nick, I'm trying to profile the matrix multiplication CUDA code (the same as your naive matrix multiplication code) with NVIDIA Nsight. I tried with 1
I'm unfamiliar with the specs of the 960M, but you should be able to get more information using the following API - docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html . You should be able to use that to get the reason why the kernel launch didn't happen (a lot of times it's a bad launch param, or maybe an earlier memory allocation failed).
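A minimal sketch of the checking pattern that API supports (the `CHECK` macro and `myKernel` are placeholders, not code from the video):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err));                          \
    } while (0)

// myKernel<<<grid, block>>>(...);   // launches themselves return no status
// CHECK(cudaGetLastError());        // catches bad launch params
// CHECK(cudaDeviceSynchronize());   // catches errors from the kernel itself
```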
Hi Nick, great content, but the #pragma unroll actually made performance worse on the system I'm running it on (an NVIDIA V100).
For a 10k × 10k matrix, I typically go from 720 ms to 750 ms. Any idea why that is?
If the compiler completely unrolls a loop over 10,000 elements, it will generate 10,000 times n instructions, where n is the number of instructions required for each iteration (calculate indices, load data, multiply data...). That's a lot of machine code. Could easily be a few hundred KB. The code has to be loaded from memory, which takes time.
If the loop isn't unrolled, its body will probably consist of a few instructions. They'll probably fit in the instruction cache, which means that the code has to be loaded from memory only once and can then be executed 10,000 times.
In a thread titled "code instruction cache" on the NVIDIA developer forums, someone mentions a 3% performance penalty for instruction cache misses, which would match your experience. Might be a coincidence though. I don't really know much about GPUs. :-)
I'd like to post the link to the forum thread, but the stupid YouTube spam filter won't let me. :-(
Search for "nvidia forum 38939"
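If the instruction cache is indeed the cause, a middle ground is a bounded unroll factor, so the loop body stays small enough to remain resident in the cache while still amortizing loop overhead. A sketch (not the video's exact kernel):

```cuda
__global__ void matmulUnroll4(const float *A, const float *B, float *C,
                              int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        #pragma unroll 4   // unroll by a fixed factor instead of fully
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
```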
great video
Glad you liked it, fella!
great video