IDK about the others, this playlist helped me get a job. Kudos, brother I appreciate all your efforts to make useful contents like this.
what kind of job do you get?
@@attafriski5901 I got a job of senior ML Engineer.
same question as @atta
i am just curious, what type of job did you get? at which company?
The only useful cuda tutorial that I could find to fit my needs in 2025!
It's kind of funny when you show that you have an assert which checks the result, but you're on a Release build, in which asserts do nothing)
I am basically preparing my university exam with your course!
and I am here to learn how to speed up the simulations I'll be using when I get to start a master's degree
@@derikWG what will you sim?
@@jimmyjudha8424 I am currently working with Brownian motion via particle simulations and numerical solutions of partial differential equations
Check out the entire CUDA programming course here: ruclips.net/video/cvo3gnInQ7M/видео.html
Thank you
This is so clutch thank you
Shouldn't the last row of the vector addition in the SIMT schematic be [3] instead of [2]?
Yep! Just a small mistake when copying the boxes
@@NotesByNick no worries, just thought I'd ask, thanks for the vids. This is a great learning resource!
I don't have a GPU or TPU locally and wanted to know other alternatives for coding in CUDA. Is Google Colab okay with CUDA, and would the syntax be different?
I'm unfamiliar with Google Colab, but I'd imagine it would be the same if it supports CUDA. The other alternative would be to use an Amazon AWS instance with a GPU.
You can also buy a Jetson Nano dev kit; it costs only $99, and you get a little 128-CUDA-core Maxwell GPU for learning the basics. Even its constraints can help you optimize your code.
Thank you for these great courses! Could you share the slides? Which CUDA book would you recommend?
I got an error message:
error: expected primary-expression before ‘>’ token vectorAdd(d_a, d_b, d_c, n);
How should I solve it?
I figured it out, my previous compiler setting was incorrect.
Are you compiling your CUDA code with NVCC? This is usually a problem when you try to compile with a compiler like gcc/g++ that does not understand the "<<<>>>" syntax of kernel launches (because it is not part of the C/C++ standard).
@@NotesByNick Thank you very much for your reply. Your tutorials are awesome!
@@kaokuntai Thanks fella! Always happy to help!
The most important question is never answered: what happens if the program has to multiply two matrices of, e.g., 300x300 numbers into a third 300x300 matrix and my GPU has only 8000 ALUs? How will that be processed in parallel? And why only 256 threads in the example?
Don't forget to free the memory!
Can we assign only two threads to a vector of size 20? Or should the number of threads be exactly the same as the vector size?
My dear, could you elaborate a little on how the multi-dimensional ID system works? (threadIdx.x, threadIdx.y, etc.)
My assumption is that it is just an abstraction to make it easier to launch a specific number of threads that fits the programmer's problem (e.g. a given matrix size), and that it is all calculated into one common index under the hood, just like a multi-dimensional array A[N][M] is really just an abstraction over a regular array A[M * N] in regular C++.
Am I misunderstanding?
Thank you very much for your work!
You are correct in assuming it is just a programming abstraction. At its heart, a multi-dimensional thread ID is just an index. If you're writing a problem that uses matrices, it may make sense to launch a 2D grid of threads, because a matrix is 2D.
@@NotesByNick Great. Thank you, sir!
@@goobensteen Always happy to help, fella!
Hey Nick, I am getting an error in the boundary check:
if (tid < N) c[tid] = a[tid] + b[tid];
but when I corrected it according to VS's suggestion to
if (tid < N) c[tid] == a[tid] + b[tid];
it says warning #174-D: expression has no effect
1> if (tid < N) c[tid] == a[tid] + b[tid];
However, after this warning, it showed "completed successfully".
Can you please explain what happened? I am still confused.
Please point me to some beginner material. I don't know anything about Visual Studio; I usually use C++ for competitive programming, which is how parallel computing caught my interest (how could I optimize my algorithms more?). How can I set up Visual Studio for .cu files so I can create and run a program? Please explain from scratch; it's frustrating that after two days I haven't been able to find anything.
This comment got me confused,
// CTAs per Grid
// We need to launch at LEAST as many threads as we have elements
// This equation pads an extra CTA to the grid if N cannot evenly be divided
// by NUM_THREADS (e.g. N = 1025, NUM_THREADS = 1024)
int NUM_BLOCKS = (N + NUM_THREADS - 1) / NUM_THREADS;
It seems like you are not "padding an extra CTA", because you are not changing the number of CTAs; you are trying to launch enough blocks to accommodate N when N/NUM_THREADS is a fraction and integer division would ditch the fractional part, so you need to jump to the next integer number of blocks - did I get this right?
Hi Nick, thank you for the explanation. I am having trouble finding the code. Can you show the link once more? Thanks!
awesome content :)
what code one should add to see results of vector addition ?
Hi! Question: wasn't the `(int)ceil( n / NUM_THREADS)` supposed to be something like `ceil((float) n / NUM_THREADS)`? Isn't `n/NUM_THREADS` an int?
You are correct. This just works on mine because it's a multiple of 256, so it will always have a remainder of 0. I am uploading a correction video now, and pushing the patch to all the examples where I did that. Thanks for catching that!
amazing
Thank you!
Happy to help!
Are you sure? Is a thread block assigned to a single shader core rather than an SM?
I'm confused too, have you figured it out? 😭
Hey Nick, appreciate all the series. I had a minor doubt regarding the squiggly warning line at the
I'm afraid not (I only really used Windows and VS for the early parts of this series since I work on Linux). Fortunately, it does not have any impact on compilation because it's just something like a linter warning. Quite annoying though, and a Google search about it a while ago didn't really help me much, unfortunately.
@@NotesByNick Just found a workaround to it: we can use the cudaLaunchKernel API for the same purpose. It does the same without the weirdest delimiter somebody could have chosen haha. Reference: docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g5064cdf5d8e6741ace56fd8be951783c
add this to your settings.json:
"files.associations": {
    "*.cu": "cpp",
    "*.cuh": "cpp"
}
I get the following errors...
Error LNK2001 unresolved external symbol threadIdx
Error LNK2001 unresolved external symbol blockIdx
Error LNK2001 unresolved external symbol blockDim
I have included .cu using tools.
I renamed all .cpp files to .cu
I have the following header files included:
#include
#include
#include
I have given the cudarts.lib in the linker
I have included the cuda include directory.
C++ gives error messages with the helpfulness of a monkey. I can't find anything on Google that works either. Useless.
I want to calculate an array that has 10000 elements. How should I allocate the number of threads and the number of blocks?
Pick some number of threads per block (e.g., 512), then divide the number of elements (10k in your case) by the threads per block and round up. Then just handle the excess threads you launch in the kernel (a simple range check to make sure the thread ID doesn't exceed the number of elements).
@@NotesByNick Thank you so much!!
Memory not being freed ...
Hi, great tutorials!
Here is my question: what is the difference between:
cudaMalloc((void**)&d_a, NO_BYTE);
and
cudaMalloc(&d_a, NO_BYTE);
What I understand is that we have to provide a double pointer, and therefore we have to cast our device pointer to a generic double pointer.
Best regards,
Thanks for the question! In modern versions of CUDA, you no longer have to cast to a void**; you just need to pass a double pointer to the API call. There really is no difference between the two.
fast, clean and clear!
Thank you!
This guy clearly knows his stuff, but he rushes through explaining topics.