We need more videos like this (in-depth performance tuning, with profiling and analysis). Good work, man.
Thanks! Glad you enjoyed the video!
Great material, it's really difficult to find tips for beginning CUDA programmers. I come from PyTorch, and thanks to you I successfully implemented multiplication of a dense vector by a sparse binary matrix. My use case is very specific and very demanding in terms of performance, so your videos really helped a lot. Thanks!
Glad it has been helpful, fella!
Hey Nick - you've been tremendously helpful. Thanks for your insights!
Always happy to help!
Another optimization strategy is to make the block size a multiple of the warp size and the grid size a multiple of the number of SMs in the GPU. Well, this may not apply in your example, as your block size is already a multiple of the warp size. Really enjoyed your explanation.
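For anyone who wants to try that heuristic, here's a minimal sketch (my own, not from the video): a grid-stride kernel whose launch dimensions come from the device properties. `scaleKernel` and the sizes are placeholders.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: a fixed-size grid covers any n.
__global__ void scaleKernel(float *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= 2.0f;
    }
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int n = 1 << 20;
    int block = 4 * prop.warpSize;             // multiple of the warp size (128)
    int grid  = 8 * prop.multiProcessorCount;  // multiple of the SM count

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    scaleKernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```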
Where is the next video or part 2? You said the next video would be more about optimizing tiled matrix multiplication, but I can't find anything like that on your channel. I'm dying to watch it.
Hey Nick, your videos are truly a lifesaver; you've covered so many important topics and are never afraid to get your hands dirty :).
Can we expect a part 2 of this video, or is it already somewhere outside of this playlist?
Great content Nick. Helped me a lot with understanding things better :)
Love it, just wanted to add my two cents: when you introduced coalescing, it was a bit confusing to me what exactly you meant. The improvements from your change should be exactly:
a) only one 32-byte (minimum-size) read transaction from A per warp per iteration instead of 2
b) a coalesced 128-byte read transaction from B per warp per iteration
c) a coalesced 128-byte write transaction at the very end (likely the least significant)
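In code, the pattern being described looks roughly like this (a sketch of a naive row-major matmul, not the exact kernel from the video):

```cuda
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // constant across a warp
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive across a warp
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // A[row * N + k]: one address for the whole warp ->
            //   a single 32-byte transaction, broadcast to all 32 threads.
            // B[k * N + col]: consecutive addresses across the warp ->
            //   one coalesced 128-byte transaction.
            sum += A[row * N + k] * B[k * N + col];
        }
        // Consecutive addresses across the warp ->
        //   one coalesced 128-byte write at the very end.
        C[row * N + col] = sum;
    }
}
```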
Hey Nick, awesome video! Curious when part 2 is going to come out
Cool tutorial. You explain everything so clearly.
Thanks! Glad you liked it!
Great series, well done!
Taught me a lot, man! Will have to go through it a couple of times to get the full extent of it, though 😅
Amazing, thank you very much for this video!
How did you compute `int row` and `int col`? Is there another guide that I can follow?
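For reference, the usual mapping looks like this (a sketch assuming a 2-D grid of 2-D blocks, each thread computing one output element):

```cuda
// Global 2-D coordinates of this thread: which block it is in,
// scaled by the block dimensions, plus its offset within the block.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
```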
Thank you, you are awesome.
Thanks! Glad you liked the video!
Well done!
Hi Nick, I'm trying to profile the matrix multiplication CUDA code (the same as your naive matrix multiplication code) with NVIDIA Nsight. I tried with 1
I'm unfamiliar with the specs of the 960M, but you should be able to get more information using the following API - docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html . You should be able to use that to get the reason why the kernel launch didn't happen (a lot of times it's a bad launch param, or maybe an earlier memory allocation failed).
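A minimal sketch of the checking pattern that API supports (the `CHECK` macro and `myKernel` are placeholders, not code from the video):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err));                          \
    } while (0)

// myKernel<<<grid, block>>>(...);   // launches themselves return no status
// CHECK(cudaGetLastError());        // catches bad launch params
// CHECK(cudaDeviceSynchronize());   // catches errors from the kernel itself
```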
Hi Nick, great content, but the #pragma unroll actually made performance worse on the system I'm running it on (an NVIDIA V100).
For a 10k × 10k matrix, I typically go from 720 ms to 750 ms. Any idea why that is?
If the compiler completely unrolls a loop over 10,000 elements, it will generate 10,000 times n instructions, where n is the number of instructions required for each iteration (calculate indices, load data, multiply data...). That's a lot of machine code. Could easily be a few hundred KB. The code has to be loaded from memory, which takes time.
If the loop isn't unrolled, its body will probably consist of a few instructions. They'll probably fit in the instruction cache, which means that the code has to be loaded from memory only once and can then be executed 10,000 times.
In a thread titled "code instruction cache" on the NVIDIA developer forums, someone mentions a 3% performance penalty for instruction cache misses, which would match your experience. Might be a coincidence though. I don't really know much about GPUs. :-)
I'd like to post the link to the forum thread, but the stupid YouTube spam filter won't let me. :-(
Search for "nvidia forum 38939"
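If the instruction cache is indeed the cause, a middle ground is a bounded unroll factor, so the loop body stays small enough to remain resident in the cache while still amortizing loop overhead. A sketch (not the video's exact kernel):

```cuda
__global__ void matmulUnroll4(const float *A, const float *B, float *C,
                              int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        #pragma unroll 4   // unroll by a fixed factor instead of fully
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
```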
great video
Glad you liked it, fella!
great video