Good video, I really liked your explanation.
Note that cudaMemPrefetchAsync is only supported on Pascal and newer architectures, and only on Linux, per NVIDIA's documentation. The call to cudaMemPrefetchAsync returns cudaErrorInvalidDevice on my Windows 10 machine.
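For anyone hitting the same cudaErrorInvalidDevice, one portable approach is to query the device first and only prefetch when the driver reports support. A minimal sketch (most error handling elided; sizes are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // cudaDevAttrConcurrentManagedAccess is 0 on Windows and on pre-Pascal
    // GPUs; prefetching managed memory is only meaningful when it is non-zero.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    const size_t bytes = 1 << 20;
    float *data;
    cudaMallocManaged(&data, bytes);

    if (concurrent) {
        // Safe to prefetch: migrate the pages to the GPU ahead of the kernel.
        cudaMemPrefetchAsync(data, bytes, device);
    } else {
        printf("Prefetch unsupported on this device/OS; relying on demand paging\n");
    }

    cudaFree(data);
    return 0;
}
```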
What are the advantages of using unified memory other than not needing to transfer data between host and device? After profiling, it seems like using unified memory takes significantly more time than the usual malloc and cudaMalloc approach.
Good question! Unified memory can take longer than a normal allocation and copy, but with adequate prefetching it can yield roughly similar results. It is mainly a programmability feature, and that goes beyond just avoiding duplicate data structures. Modern architectures (Volta and later) support over-subscription, where you can allocate more memory than your GPU physically has. Doing this without unified memory would require you to decompose your problem into multiple kernels so that you can work on a chunk of data, finish the computation, copy new data in, start a new kernel, and so on. With unified memory, none of that is required.
@@NotesByNick Thank you for the prompt answer! I'll look more into pre-fetching.
@@hanwang5940 Of course! I'm always happy to help!
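For reference, the "no explicit transfers" workflow described above looks roughly like this minimal sketch (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void increment(float *x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 20;
    float *x;

    // One allocation visible to both host and device; the driver migrates
    // pages on demand, so no cudaMemcpy calls are needed.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; i++) x[i] = 1.0f;   // host writes directly

    increment<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();                      // wait before the host reads

    // On Volta and later, the same pattern works even when the allocation
    // is larger than device memory (over-subscription).
    cudaFree(x);
    return 0;
}
```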
Great video, thanks for the tutorial! I am still a little confused about where unified memory is located. Is it in global memory?
I get a segmentation fault when I run the init_vector() function. Could someone help me?
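One common cause of a segfault in an init function like that is writing through a pointer that was never successfully allocated. A hypothetical sketch (init_vector is reconstructed from its name and may differ from the actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain host-side initialization through a managed pointer.
void init_vector(float *v, size_t n) {
    for (size_t i = 0; i < n; i++) v[i] = 1.0f;
}

int main() {
    const size_t n = 1 << 16;
    float *v = nullptr;

    cudaError_t err = cudaMallocManaged(&v, n * sizeof(float));
    if (err != cudaSuccess) {
        // Without this check, v stays NULL and init_vector() segfaults.
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    init_vector(v, n);  // safe: allocation verified above
    cudaFree(v);
    return 0;
}
```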
Hi Nick, I'm starting to learn CUDA through your videos and they are awesome. Some time ago I came across CUDA Thrust. Do you have any experience with it? What do you think about using modern C++ with CUDA? Can you do a video about it?
@Nick How does this transfer to embedded systems like Xavier, where the GPU and CPU share the same memory? It seems to me you shouldn't have to cudaMemcpy anything, since everything is in the same physical memory.
My understanding is that things behave slightly differently on those platforms. While it is true that they share the same physical memory, you can still have dedicated device memory (it's probably just part of the same physical memory, but a dedicated buffer not accessible by the CPU). There's a great guide from NVIDIA on this exact topic: docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#overview
@@NotesByNick Wow, thanks, Nick! Do you know if this is valid for Xavier? I believe Tegra was the generation before Xavier?
I would expect that you should not really have to copy anything, but rather just pass pointers, as long as everything is in the same address space. It becomes tricky when you have multiple applications with different address spaces and you have to go through some middleware like ROS.
Maybe you can do a video on CUDA for embedded programming? It would be greatly appreciated!
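For reference, the "just pass pointers" style discussed above can be approximated with mapped pinned memory. A minimal sketch (the kernel is illustrative, and behavior differs between integrated and discrete GPUs, so treat this as an assumption to verify against the Tegra app note):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 16;
    float *h_x, *d_x;

    // Pinned, mapped host memory: on integrated (Tegra-class) devices the
    // GPU accesses the same physical pages, so no copy is performed.
    cudaHostAlloc(&h_x, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_x, h_x, 0);

    for (size_t i = 0; i < n; i++) h_x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();   // host can now read the results through h_x

    cudaFreeHost(h_x);
    return 0;
}
```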
The environment: Windows 10, VS 2017, CUDA 10.1.
A snapshot of the error: drive.google.com/open?id=1V-Mv2xk9Leny3GEUYsMkBWBHumWastxU
However, running the same code in a Linux environment produces no error. Any help is appreciated.
Unfortunately, all I can tell from this error is that the functional test is failing. Because the code works on Linux, this is likely an error related to your build environment. If you're building the example in the repo's directory, I believe there are Visual Studio files from my setup there (using VS 2015 and CUDA 10.0). These may be interfering with your environment. A couple of checks to do: 1) is your kernel even launching, and 2) is your data being copied to/from the GPU? As a simple test of whether it is the build environment, you could make a new directory, copy the code in, and build and run it.
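The two checks suggested above can be wired into a small standalone test, sketched here (vector_add is a stand-in for the tutorial kernel, not the exact repo code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 256;
    int *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&c, n * sizeof(int));
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = i; }

    vector_add<<<1, n>>>(a, b, c, n);

    // Check 1: did the kernel even launch? (bad config, unsupported arch)
    cudaError_t launch = cudaGetLastError();
    if (launch != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launch));

    // Check 2: did it run to completion, and is the data visible to the host?
    cudaError_t sync = cudaDeviceSynchronize();
    if (sync != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(sync));

    printf("c[1] = %d (expect 2)\n", c[1]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```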
@@NotesByNick Thank you for your reply.
I installed another NVIDIA graphics card (a newer model) and ran the source code (vector_add_um.cu) from your tutorials, and it works perfectly with Windows 10 + VS 2017 + CUDA 10.1.
The previous (older) NVIDIA card runs with Windows 10 + VS 2015 + CUDA 10.0.
Thank you very much~
Is everything working now? Usually these are build problems related to whether your GPU has SM architecture >= 3.0, and whether you are compiling a 64-bit host application.
@@NotesByNick Yes!
You are really an expert!
The newer card is an NVIDIA GeForce RTX 2080 Ti.
Thank you very much for helping.
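For anyone else debugging a similar setup, the conditions mentioned above (SM architecture >= 3.0 with managed-memory support, and a 64-bit host build) can be queried directly. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Managed memory needs compute capability >= 3.0 and a 64-bit host build.
    printf("GPU: %s, SM %d.%d\n", prop.name, prop.major, prop.minor);
    printf("Managed memory supported: %s\n", prop.managedMemory ? "yes" : "no");
    printf("Host build is %zu-bit\n", sizeof(void *) * 8);
    return 0;
}
```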