Everything in the kernel is deterministic except the atomicAdd. Depending on the order in which the atomicAdds are performed across all thread blocks, you may get slightly different floating point results.
Hey James, thanks for the video. At 9:20 you showed that the performance of the kernel is 5-6 times faster than the CPU version. However, the CPU version is single-threaded. Would you say the CPU version would be equivalent if we made it run on several threads, for example on my 8-thread CPU? I think it would actually win over the GPU. Thanks!
I have not actually implemented a multithreaded CPU version, so I can't say for sure, but I suspect you could make it just as fast (or maybe even faster?) using OpenMP or equivalent. This code was for demonstration purposes. You wouldn't really compute just a dot product on the GPU, as there wouldn't be enough computation to saturate the GPU's potential. What you might do, however, is solve a more complicated algorithm that itself involves computing dot products as one of its steps.
Thanks for the video
thanks for the video, it was helpful :)
Thanks from covid era, this video helps :D
You're welcome!
Hi, can you help me with one question: why do I get the same results every time I run the program?