Good video, I really liked your explanation.
Note that cudaMemPrefetchAsync is only supported on Pascal and newer architectures, and only on Linux, per NVIDIA's documentation. The call to cudaMemPrefetchAsync returns cudaErrorInvalidDevice on my Windows 10 machine.
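For anyone hitting the same cudaErrorInvalidDevice, one portable approach is to query the device first and only prefetch when the driver reports support. A minimal sketch (most error handling elided; sizes are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // cudaDevAttrConcurrentManagedAccess is 0 on Windows and on pre-Pascal
    // GPUs; prefetching managed memory is only meaningful when it is non-zero.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    const size_t bytes = 1 << 20;
    float *data;
    cudaMallocManaged(&data, bytes);

    if (concurrent) {
        // Safe to prefetch: migrate the pages to the GPU ahead of the kernel.
        cudaMemPrefetchAsync(data, bytes, device);
    } else {
        printf("Prefetch unsupported on this device/OS; relying on demand paging\n");
    }

    cudaFree(data);
    return 0;
}
```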
What are the advantages of using unified memory other than not needing to transfer data between host and device? After profiling, it seems like using unified memory takes significantly more time than the usual malloc and cudaMalloc approach.
Good question! Unified memory can take longer than a normal allocation and copy, but with adequate prefetching it can yield roughly similar results. It is mainly a programmability feature, and that goes beyond just avoiding duplicate data structures. Modern architectures (Volta and later) support over-subscription, where you can allocate more memory than your GPU physically has. Doing this without unified memory would require you to decompose your problem into multiple kernels so that you can work on a chunk of data, finish the computation, copy new data in, start a new kernel, and so on. With unified memory, none of that is required.
@@NotesByNick Thank you for the prompt answer! I'll look more into pre-fetching.
@@hanwang5940 Of course! I'm always happy to help!
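For reference, the "no explicit transfers" workflow described above looks roughly like this minimal sketch (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void increment(float *x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 20;
    float *x;

    // One allocation visible to both host and device; the driver migrates
    // pages on demand, so no cudaMemcpy calls are needed.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; i++) x[i] = 1.0f;   // host writes directly

    increment<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();                      // wait before the host reads

    // On Volta and later, the same pattern works even when the allocation
    // is larger than device memory (over-subscription).
    cudaFree(x);
    return 0;
}
```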
Great video, thanks for the tutorial! I am still a little confused about where unified memory is located. Is it in global memory?
I get a segmentation fault when I run the init_vector() function. Could someone help me?
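One common cause of a segfault in an init function like that is writing through a pointer that was never successfully allocated. A hypothetical sketch (init_vector is reconstructed from its name and may differ from the actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain host-side initialization through a managed pointer.
void init_vector(float *v, size_t n) {
    for (size_t i = 0; i < n; i++) v[i] = 1.0f;
}

int main() {
    const size_t n = 1 << 16;
    float *v = nullptr;

    cudaError_t err = cudaMallocManaged(&v, n * sizeof(float));
    if (err != cudaSuccess) {
        // Without this check, v stays NULL and init_vector() segfaults.
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    init_vector(v, n);  // safe: allocation verified above
    cudaFree(v);
    return 0;
}
```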
Hi Nick, I'm starting to learn CUDA through your videos and they are awesome. Some time ago I came across CUDA Thrust. Do you have any experience with it? What do you think about using modern C++ with CUDA? Can you do a video about it?
@Nick How does this transfer to embedded systems like Xavier, where the GPU and CPU share the same memory? It seems to me you shouldn't have to cudaMemcpy anything, since everything is in the same physical memory.
My understanding is that things behave slightly differently on those platforms. While it is true that they share the same physical memory, you can still have dedicated device memory (it's probably just part of the same physical memory, but a dedicated buffer not accessible by the CPU). There's a great guide from NVIDIA on this exact topic: docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#overview
@@NotesByNick Wow, thanks, Nick! Do you know if this is valid for Xavier? I believe Tegra was the generation before Xavier?
I would expect that you should not really have to copy anything, but rather just pass pointers, as long as everything is in the same address space. It becomes tricky when you have multiple applications with different address spaces and you have to go through some middleware like ROS.
Maybe you can do a video on CUDA for embedded programming? It would be greatly appreciated!
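For reference, the "just pass pointers" style discussed above can be approximated with mapped pinned memory. A minimal sketch (the kernel is illustrative, and behavior differs between integrated and discrete GPUs, so treat this as an assumption to verify against the Tegra app note):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 16;
    float *h_x, *d_x;

    // Pinned, mapped host memory: on integrated (Tegra-class) devices the
    // GPU accesses the same physical pages, so no copy is performed.
    cudaHostAlloc(&h_x, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_x, h_x, 0);

    for (size_t i = 0; i < n; i++) h_x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();   // host can now read the results through h_x

    cudaFreeHost(h_x);
    return 0;
}
```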
The environment: Windows 10, VS 2017, CUDA 10.1.
A snapshot of the error: drive.google.com/open?id=1V-Mv2xk9Leny3GEUYsMkBWBHumWastxU
However, running the same code in a Linux environment produces no error. Any help is appreciated.
Unfortunately, all I can tell from this error is that the functional test is failing. Because the code works on Linux, this is likely an error related to your build environment. If you're building the example in the repo's directory, I believe there are Visual Studio files from my setup there (using VS 2015 and CUDA 10.0). These may be interfering with your environment. A couple of checks to do: 1) is your kernel even launching, and 2) is your data being copied to/from the GPU? As a simple test of whether it is the build environment, you could make a new directory, copy the code in, and build and run it.
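The two checks suggested above can be wired into a small standalone test, sketched here (vector_add is a stand-in for the tutorial kernel, not the exact repo code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 256;
    int *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&c, n * sizeof(int));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }

    vector_add<<<1, n>>>(a, b, c, n);

    // Check 1: did the kernel even launch? (bad config, unsupported arch)
    cudaError_t launch = cudaGetLastError();
    if (launch != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launch));

    // Check 2: did it run to completion, and is the data visible to the host?
    cudaError_t sync = cudaDeviceSynchronize();
    if (sync != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(sync));

    printf("c[1] = %d (expect 2)\n", c[1]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```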
@@NotesByNick Thank you for your reply.
I installed another NVIDIA graphics card (a newer model) and ran the source code (vector_add_um.cu) from your tutorials, and it works perfectly with Windows 10 + VS 2017 + CUDA 10.1.
The previous (older) NVIDIA card runs with Windows 10 + VS 2015 + CUDA 10.0.
Thank you very much~
Is everything working now? Usually these are build problems related to whether your GPU has SM architecture >= 3.0, and whether you are compiling a 64-bit host application.
@@NotesByNick Yes!
You are really an expert!
The newer card is an NVIDIA GeForce RTX 2080 Ti.
Thank you very much for helping.
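For anyone else debugging a similar setup, the conditions mentioned above (SM architecture >= 3.0 with managed-memory support, and a 64-bit host build) can be queried directly. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Managed memory needs compute capability >= 3.0 and a 64-bit host build.
    printf("GPU: %s, SM %d.%d\n", prop.name, prop.major, prop.minor);
    printf("Managed memory supported: %s\n", prop.managedMemory ? "yes" : "no");
    printf("Host build is %zu-bit\n", sizeof(void *) * 8);
    return 0;
}
```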