GTC 2022 - How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA

Coding In Rust

19K views

Published: Jan 25, 2025

Comments • 19

@citizensmith3074 2 years ago ⁺¹⁹
This video is pure gold: thanks so much for uploading, I've learnt so much from it. I may have to watch it several times though!!! A great overview and introduction to so many areas for further study.
@KetogenicGuitars 28 days ago
This is so good. As a beginner CUDA programmer I get the seeds of the most important concepts and terminology planted right into good soil. I know what to expect. The first thing to do is fully accept that parallel threads are king.
@rolandsunsun6362 2 months ago
This video is so clear and great.
@TheAIEpiphany 1 year ago ⁺¹
One thing that's confusing:
1) If reading from a memory location in a different row is 3x slower than reading from a memory location in the same row, how come we get a 13x slowdown? In the worst case (if you're deliberately reading from a different row each time) one would expect a 3x slowdown.
What am I missing? Is it the burst mode?
2) You're using the float2 type, so does that mean your thread is loading 4 bytes (for 2 points), not 8 bytes? That would put the 4 warps into 512 B loading territory instead of the optimal 1024 B. EDIT: OK, I just saw that p1 & p2 are actually float pointers, so that does make sense.
3) How can we guarantee that the p1 & p2 arrays (holding the points) are adjacent, i.e. in the same physical row in memory?
Great video! The sound quality is a bit off though.
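The 512 B versus 1024 B figures in point 2 follow from simple arithmetic, sketched below. The only facts assumed are that a warp has 32 threads, a `float` is 4 bytes and a `float2` is 8 bytes; the helper name `bytes_requested` is made up for this illustration and does not come from the talk.

```python
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
FLOAT_BYTES = 4    # one 32-bit float
FLOAT2_BYTES = 8   # two packed 32-bit floats


def bytes_requested(bytes_per_thread, warps):
    """Total bytes requested when every thread in `warps` warps loads one element."""
    return bytes_per_thread * WARP_SIZE * warps


# Four warps, one element per thread:
print(bytes_requested(FLOAT_BYTES, 4))   # 512
print(bytes_requested(FLOAT2_BYTES, 4))  # 1024
```

So a 4-byte load per thread across four warps covers 512 B, while an 8-byte load per thread covers 1024 B, which matches the two numbers in the question.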
@brady1123 1 year ago ⁺²
It's 3x slower for reading a single value, but it gets worse when reading many contiguous values where the burst column read can read many values in one operation.
For example, let's say that we're reading two sets of 10 values, one set of which are all contiguous in a row, and one set that are all on different rows. And you have the three ops in the video: LOAD a row, READ a column, STORE the row back.
For the contiguous values: time = LOAD + BURST READ + STORE = 3 ops
For the disjoint values: time = (LOAD + READ + STORE)*10 = 30 ops
That's how you get the 10x speed-up.
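The op counting in this reply can be written as a tiny cost model. This is only a sketch of the simplified picture in the comment, not a model of real DRAM timing: it assumes LOAD, READ and STORE each cost one op and that a burst read of any length within the open row costs a single READ. The function name `dram_ops` is hypothetical.

```python
def dram_ops(rows):
    """Count LOAD/READ/STORE ops for a sequence of accesses, given each access's row.

    Simplifying assumptions (from the comment above, not real DRAM timing):
    - opening a row costs one LOAD,
    - all consecutive accesses to the open row are served by one burst READ,
    - closing a row costs one STORE.
    """
    ops = 0
    open_row = None
    for row in rows:
        if row != open_row:
            if open_row is not None:
                ops += 1      # STORE the previously open row back
            ops += 2          # LOAD the new row, then one (burst) READ
            open_row = row
        # same row as before: already covered by the burst READ
    if open_row is not None:
        ops += 1              # STORE the last open row back
    return ops


contiguous = dram_ops([0] * 10)      # ten values in one row
scattered = dram_ops(range(10))      # ten values in ten different rows
print(contiguous, scattered, scattered // contiguous)  # 3 30 10
```

With more values per row the ratio grows accordingly, which is one way a slowdown well above 3x can arise.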
@srikanthmalla1127 2 months ago
Thank you!
@steveHoweisno1 1 year ago
Excellent. For the matrix multiply, you’re reusing the same row multiple times but the columns would have to be loaded in every time. So how do you increase compute intensity of the columns?
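One common answer to this kind of question is tiling: stage a small square block of both matrices in fast memory and reuse every staged element several times, so rows and columns are both reused. The sketch below is plain Python, not CUDA, and is not taken from the talk; it only counts how many elements are fetched from "slow" memory under that assumption. The function names and the `loads` counter are made up for illustration.

```python
def naive_matmul(a, b):
    """Square matmul that fetches both operands from slow memory for every multiply."""
    n = len(a)
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                loads += 2            # one element of a, one element of b
            c[i][j] = acc
    return c, loads


def tiled_matmul(a, b, tile):
    """Same product, but operands are staged tile by tile and reused `tile` times."""
    n = len(a)
    assert n % tile == 0, "sketch assumes the tile size divides n"
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of a and one tile of b into "fast memory".
                a_tile = [a[i0 + i][k0:k0 + tile] for i in range(tile)]
                b_tile = [b[k0 + k][j0:j0 + tile] for k in range(tile)]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = c[i0 + i][j0 + j]
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] = acc
    return c, loads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
c_naive, naive_loads = naive_matmul(a, b)
c_tiled, tiled_loads = tiled_matmul(a, b, tile=4)
print(naive_loads, tiled_loads)  # 1024 256
```

The multiply-add count is identical in both versions, so with a tile of width T the arithmetic done per element loaded goes up by a factor of T. On a GPU the staged tiles would typically live in shared memory or registers.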
@webgpu 1 year ago
Christopher, do you think the long time it takes for RAM to be accessed could be decreased by embedding a basic CPU in those RAM modules?
@codinginrust 1 year ago ⁺¹
Good question, I don't know!
@codingmachine2817 1 year ago ⁺³
33:10 FlashAttention proved this wrong
@brady1123 1 year ago
"Occupancy is the most powerful tool that you have for tuning a program. **Once you're doing your best for memory access patterns** there's pretty much no algorithmic optimization that you can do that'll speed your program up by as much as 33%"
I thought FlashAttention's major contribution was optimizing memory access patterns, namely reducing the number of HBM loads/stores.
@ChimiChanga1337 10 months ago
Can you please explain this a bit more? I'm trying to teach myself FlashAttention's CUDA code.
@FinansalEngelli 6 months ago ⁺³
He literally said "Once you're doing your best for memory access patterns", and FlashAttention is a MEMORY ACCESS algorithm: it reduces the number of accesses to the GPU's HBM.
@GeorgePaul82 11 months ago
Is there a chance you could do a video about why AMD's version isn't as good as NVIDIA's?
@codinginrust 11 months ago
I haven't got an AMD gfx card. ZLUDA means it does not really matter: www.phoronix.com/review/radeon-cuda-zluda
@ryderbrooks1783 10 months ago
AMD's issue is tooling and the general software ecosystem. The hardware is reasonably close.
@dGooddBaddUgly 9 months ago ⁺³
Looks like Intel is out of the question here.
@codinginrust 9 months ago
They are getting better in terms of energy efficiency and performance www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html
@SavageBits 6 months ago
@codinginrust A limitation of Gaudi is that it is a less flexible, fixed-function matrix math accelerator. The general-purpose compute engine in the Hopper/Blackwell architecture can better support rapidly evolving LLM algorithms. Another issue is interconnect bandwidth: NVLink 5 absolutely crushes PCIe 5.
