CUDA Crash Course (v2): Vector Addition

  • Published: 10 Nov 2024

Comments • 20

  • @closerlookcrime
    @closerlookcrime 1 year ago +1

    Wow, excellent video. Well done, sir. Thank you.

  • @whataboutry
    @whataboutry 4 years ago +1

    Very impressive Nick! Thanks for this.

  • @arnold9103
    @arnold9103 4 years ago +1

    Thank you for this very clear and instructive video. I have run into the same problem as viewer Chun Ming Jeffy Tam.
    Perhaps you should change the N =

  • @OurielGotesdyner
    @OurielGotesdyner 5 years ago +2

    Hi, first of all thank you for making these videos. I decided to compare the runtime of this vector add between your CUDA implementation and a regular for-loop implementation, and found that up to a vector size of about 2^25 the regular CPU-based implementation performed better, while at N = 2^30 the regular implementation took about 190 sec and the GPU CUDA-based one took about 156 sec. I am running it in the Visual Studio 2019 environment, using a Quadro P2000 GPU. What could explain this rather disappointing result? Am I doing something wrong?

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago

      Great question! It really depends. Vector addition is a pretty simple kernel, so the faster speed of the CPU and things like prefetchers probably raise the number of elements required for the GPU to win. Are you measuring end-to-end latency, or just the kernel?

    • @OurielGotesdyner
      @OurielGotesdyner 4 years ago +1

      @@CoffeeBeforeArch Thank you for the reply. I was testing the latency of the entire CUDA-related operation (allocation, kernel call, and freeing) vs the iteration.
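
For what it's worth, the two measurements can be separated: CUDA events time just the kernel, while a host-side clock captures the whole allocate/launch/free sequence. A minimal sketch, assuming a `vectorAdd` kernel like the one in the video (host data initialization and copies omitted for brevity):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) c[tid] = a[tid] + b[tid];
}

int main() {
  const int N = 1 << 16;
  const size_t bytes = N * sizeof(int);

  auto t0 = std::chrono::steady_clock::now();  // end-to-end timer starts

  int *a, *b, *c;
  cudaMalloc(&a, bytes);
  cudaMalloc(&b, bytes);
  cudaMalloc(&c, bytes);
  // (host allocation, initialization, and cudaMemcpy calls omitted)

  // CUDA events time only the kernel itself
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  vectorAdd<<<(N + 255) / 256, 256>>>(a, b, c, N);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float kernel_ms;
  cudaEventElapsedTime(&kernel_ms, start, stop);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);

  auto t1 = std::chrono::steady_clock::now();  // end-to-end timer stops
  double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

  printf("kernel: %.3f ms, end-to-end: %.3f ms\n", kernel_ms, total_ms);
  return 0;
}
```

On small inputs the end-to-end number is typically dominated by allocation and transfer rather than the kernel, which is part of why a plain CPU loop can win until N gets large.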

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago

      One way to significantly improve the memcpy transfer times is to use host-pinned memory. This can be done using cudaMallocHost instead of something like malloc or new on the CPU side. This unfortunately means you can't use std::vector, but the performance is likely worth it. I believe with vector addition, the data transfer takes longer than the computation, so making that change may help a lot.
      --Nick
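
A minimal sketch of the change Nick describes, with pinned allocation swapped in for malloc/new (the size and names here are placeholder assumptions, and the kernel launch and error checking are elided):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int N = 1 << 16;
  const size_t bytes = N * sizeof(int);

  // Host-pinned (page-locked) allocation instead of malloc/new.
  // Pinned pages can be DMA'd directly, so cudaMemcpy transfers
  // are typically faster than from pageable memory.
  int *h_a;
  cudaMallocHost(&h_a, bytes);

  int *d_a;
  cudaMalloc(&d_a, bytes);

  for (int i = 0; i < N; i++) h_a[i] = i;
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

  // ... kernel launch and copy back would go here ...

  cudaFree(d_a);
  cudaFreeHost(h_a);  // pinned memory must be freed with cudaFreeHost
  return 0;
}
```

Pinned memory is a limited resource; over-allocating it can degrade overall system performance, so it is usually reserved for transfer buffers.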

    • @OurielGotesdyner
      @OurielGotesdyner 4 years ago

      @@CoffeeBeforeArch Ok, thanks. Losing the ability to use vectors is quite a con when moving to more complicated programs, so I hope CUDA still maintains an edge even without that. I also wanted to ask whether VS users are destined to have all CUDA-related code forever labeled as errors by VS, or is there a way to do anything about it? Thanks again.

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago

      If you're referring to the IntelliSense errors, I'm shocked they're still there. I'm not sure what the status of a fix for that is (I primarily work in a Linux environment).

  • @seyedmasoodmostafavi7440
    @seyedmasoodmostafavi7440 1 year ago

    Thank you for this

  • @chunmingjeffytam8663
    @chunmingjeffytam8663 4 years ago +1

    Hi there, it seems that the program ran almost instantaneously on your end. I have a rather old CPU (4790) but a mid-range GeForce 1670Ti Super graphics card. I was surprised that under nvprof you manage to take only 45/21 microseconds, where it takes a thousand times longer for me (105/52 MILLIseconds), with the same scaling for the kernel call (vectorAdd): about 7 milliseconds on my end vs 7 microseconds in your screenshot.
    I wonder if the hardware difference (not sure about your hardware setup) really makes that huge of a difference? 1000 times is really not worth it if that's the CUDA performance I am getting. I wonder if I am missing something in my setup.

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago +1

      Could be a number of things. One important factor is whether or not you're using your GPU for graphics as well as running your CUDA app. If that's the case, you're competing against another program running side by side on the GPU for resources. All that being said, the numbers do look very high. My setup for the video had a 1050 Ti and an i7 7700 — not a huge leap in performance that would cause these kinds of gaps. Without any other info, though, it's hard to say exactly why it's happening.
      Cheers,
      -Nick

    • @chunmingjeffytam8663
      @chunmingjeffytam8663 4 years ago

      CoffeeBeforeArch thanks Nick. I have another old graphics card on board (750x for PhysX) as output, but I have disabled it (and also set CUDA_VISIBLE_DEVICES to the 1670) and restarted, but it's still really high. Our specs are actually comparable, so I am stumped. Will try taking that card out and reinstalling Windows to see.

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago +1

      I’m not super familiar with CUDA on Windows, so there may be some perf pitfalls there. I mainly work on Linux because it supports more CUDA features.

    • @chunmingjeffytam8663
      @chunmingjeffytam8663 4 years ago

      @@CoffeeBeforeArch Hi, I found out the reason. I copied the code from GitHub, and on GitHub you have N = 2^26 instead of 2^16 as shown in the video, so it is roughly a 2^10 times bigger operation — about a thousand times slower. I changed it back to 2^16 and it is comparable in speed to yours now. I will follow through with the example.
      Would you continue this series? This is really helping me a lot. My ultimate goal is to write my own Monte Carlo engine in CUDA, but I want to learn all the libraries provided, such as cuBLAS and other numerical ones. Thank you for this once again.

    • @chunmingjeffytam8663
      @chunmingjeffytam8663 4 years ago

      P.S. I am considering coding in Linux just to try it out myself as well.

  • @industrialdonut7681
    @industrialdonut7681 4 years ago +1

    Is CUDA code written in C rather than C++? Noob question lol

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  4 years ago +1

      You can use many modern C++ features in CUDA (e.g., things like constexpr). However, definitely not everything (like std::vector, which does dynamic allocation).
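
A small sketch of what that looks like in practice: a constexpr function evaluated at compile time on the host and its result used from device code (the names here are illustrative, not from the video):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A constexpr function: evaluated by the compiler when used in a
// constant expression, so its result is usable in device code.
constexpr int square(int x) { return x * x; }

constexpr int N = square(8);  // 64, computed at compile time

__global__ void fill(int *out) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N) out[tid] = tid;
}

int main() {
  int *d_out;
  cudaMalloc(&d_out, N * sizeof(int));
  fill<<<1, N>>>(d_out);

  int h_out[N];
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  printf("h_out[63] = %d\n", h_out[63]);
  cudaFree(d_out);
  return 0;
}
```

std::vector, by contrast, allocates on the host heap at run time, which is why it cannot be used inside device code.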