IDK about the others, this playlist helped me get a job. Kudos, brother I appreciate all your efforts to make useful contents like this.
what kind of job do you get?
@@attafriski5901 I got a job of senior ML Engineer.
same question as @atta
i am just curious, what type of job did you get? at which company?
The only useful cuda tutorial that I could find to fit my needs in 2025!
It's kind of funny when you show that you have an assert which checks the result, but you're on a Release build, in which asserts do nothing)
I am basically preparing my university exam with your course!
and I am here to learn how to speed up the simulations I'll be using when I get to start a master's degree
@@derikWG what will you sim?
@@jimmyjudha8424 I am currently working with Brownian motion via particle simulations and numerical solutions of partial differential equations
Check out the entire CUDA programming course here: ruclips.net/video/cvo3gnInQ7M/видео.html
Thank you
This is so clutch thank you
Shouldn't the last row of the vector addition in the SIMT schematic be [3] instead of [2]?
Yep! Just a small mistake when copying the boxes
@@NotesByNick no worries, just thought I'd ask, thanks for the vids. This is a great learning resource!
I don't have a GPU or TPU locally and wanted to know other alternatives for coding in CUDA. Is Google Colab okay with CUDA, and would the syntax be different?
I'm unfamiliar with Google Colab, but I'd imagine it would be the same if it supports CUDA. The other alternative would be to use an Amazon AWS instance with a GPU.
You can also buy a Jetson Nano dev kit; it costs only $99, and you get a little 128-CUDA-core Maxwell GPU for learning the basics. Even its constraints can help you optimize your code.
Thank you for these great courses! Could you share the slides? Which CUDA book would you recommend?
I got an error message:
error: expected primary-expression before ‘>’ token vectorAdd(d_a, d_b, d_c, n);
How should I solve it?
I figured it out, my previous compiler setting was incorrect.
Are you compiling your CUDA code with NVCC? This is usually a problem when you try to compile with a compiler like gcc/g++ that does not understand the "<<<>>>" syntax of kernel launches (because it is not part of the C/C++ standard).
@@NotesByNick Thank you very much for your reply. Your tutorials are awesome!
@@kaokuntai Thanks fella! Always happy to help!
The most important question is never answered: what happens if the program has to multiply two matrices of, e.g., 300x300 numbers into a third 300x300 matrix and my GPU has only 8000 ALUs? How will that be processed in parallel? And why only 256 threads in the example?
Don't forget to free the memory!
Can we assign only two threads to a vector of size 20? Or should the number of threads be exactly the same as the vector size?
My dear, could you elaborate a little on how the multi-dimensional ID system works? (threadIdx.x, threadIdx.y, etc.)
My assumption is that it is just an abstraction to make it easier to launch a specific number of threads that fits the programmer's problem (e.g. a given matrix size), and that it is all calculated into one common index under the hood, just like a multi-dimensional array A[N][M] is really just an abstraction over a regular array A[M * N] in regular C++.
Am I misunderstanding?
Thank you very much for your work!
You are correct in assuming it is just a programming abstraction. At its heart, a multi-dimensional thread ID is just an index. If you're writing a problem that uses matrices, it may make sense to launch a 2D grid of threads, because a matrix is 2D.
@@NotesByNick Great. Thank you, sir!
@@goobensteen Always happy to help, fella!
Hey Nick, I am getting an error in the boundary check:
if (tid < N) c[tid] = a[tid] + b[tid];
but when I corrected it according to VS's suggestion to
if (tid < N) c[tid] == a[tid] + b[tid];
it says warning #174-D: expression has no effect
1> if (tid < N) c[tid] == a[tid] + b[tid];
However, after this warning, it showed "completed successfully".
Can you please explain what happened? I am still confused.
Please point me to some beginner material. I don't know anything about Visual Studio; I usually use C++ for competitive programming, which is how parallel computing caught my interest (how could I optimize my algorithms more?). How can I set up Visual Studio for .cu files so I can create and run a program? Please explain from scratch; it's frustrating that after two days I haven't been able to find anything.
This comment got me confused,
// CTAs per Grid
// We need to launch at LEAST as many threads as we have elements
// This equation pads an extra CTA to the grid if N cannot evenly be divided
// by NUM_THREADS (e.g. N = 1025, NUM_THREADS = 1024)
int NUM_BLOCKS = (N + NUM_THREADS - 1) / NUM_THREADS;
It seems like you are not "padding an extra CTA", because you are not changing the number of CTAs; you are trying to launch enough blocks to accommodate N when N/NUM_THREADS is a fraction and integer division would ditch the fractional part, so you need to jump to the next integer number of blocks - did I get this right?
Hi Nick, thank you for the explanation. I am having trouble finding the code. Can you show the link once more? Thanks!
awesome content :)
what code one should add to see results of vector addition ?
Hi! Question: wasn't the `(int)ceil( n / NUM_THREADS)` supposed to be something like `ceil((float) n / NUM_THREADS)`? Isn't `n/NUM_THREADS` an int?
You are correct. This just works on mine because it's a multiple of 256, so it will always have a remainder of 0. I am uploading a correction video now, and pushing the patch to all the examples where I did that. Thanks for catching that!
amazing
Thank you!
Happy to help!
Are you sure? Is a thread block assigned to a single shader core rather than an SM?
I'm confused too, have you figured it out? 😭
Hey Nick, appreciate all the series. I had a minor doubt regarding the squiggly warning line at the
I'm afraid not (I only really used Windows and VS for the early parts of this series since I work on Linux). Fortunately, it does not have any impact on compilation because it's just something like a linter warning. Quite annoying though, and a Google search about it a while ago didn't really help me much, unfortunately.
@@NotesByNick Just found a workaround to it: we can use the cudaLaunchKernel API for the same purpose. It does the same without the weirdest delimiter somebody could have chosen haha. Reference: docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g5064cdf5d8e6741ace56fd8be951783c
add this to your settings.json:
"files.associations": {
    "*.cu": "cpp",
    "*.cuh": "cpp"
}
I get the following errors...
Error LNK2001 unresolved external symbol threadIdx
Error LNK2001 unresolved external symbol blockIdx
Error LNK2001 unresolved external symbol blockDim
I have included .cu using tools.
I renamed all .cpp files to .cu
I have the following header files included:
#include
#include
#include
I have given the cudarts.lib in the linker
I have included the cuda include directory.
C++ gives error messages with the helpfulness of a monkey. I can't find anything on Google that works either. Useless.
I want to calculate an array that has 10000 elements. How should I allocate the number of threads and the number of blocks?
Pick some number of threads per block (e.g., 512), then divide the number of elements (10k in your case) by the threads per block and round up. Then just handle the excess threads you launch in the kernel (a simple range check to make sure the thread ID doesn't exceed the number of elements).
@@NotesByNick Thank you so much!!
Memory not being freed ...
Hi, great tutorials!
Here is my question: what is the difference between:
cudaMalloc((void**)&d_a, NO_BYTE);
and
cudaMalloc(&d_a, NO_BYTE);
What I understand is that we have to provide a double pointer, and therefore we have to cast our device pointer to a generic double pointer.
Best regards,
Thanks for the question! In modern versions of CUDA, you no longer have to cast to a void**; you just need to pass a double pointer to the API call. There really is no difference between the two.
fast, clean and clear!
Thank you!
This guy clearly knows his stuff, but he rushes through explaining topics.