Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

  • Published: 14 Oct 2024

Comments • 69

  • @thinhon54
    @thinhon54 13 days ago +3

    This is the best video about Torch distributed I have ever seen. Thanks for making this video!

  • @КириллКлимушин
    @КириллКлимушин 6 months ago +4

    This is the second video I've watched from this channel, after "quantization", and I frankly wanted to express my gratitude for your work, as it is very easy to follow and the level of abstraction makes it possible to understand the concepts holistically.

  • @abdallahbashir8738
    @abdallahbashir8738 6 months ago +5

    I really love your videos. You have a natural talent for simplifying logic and code, in the same capacity as Andrej.

  • @chiragjn101
    @chiragjn101 10 months ago +9

    Great video, thanks for creating this. I have used DDP quite a lot, but seeing the visualizations for communication overlap helped me build a very good mental model.
    Would love to see more content around distributed training - DeepSpeed ZeRO, Megatron DP + TP + PP.

  • @jiankunli7148
    @jiankunli7148 1 month ago

    Great introduction. Love the pace of the class and the balance of breadth vs depth

  • @nithinma8697
    @nithinma8697 1 month ago

    Umar hits the sweet spot (Goldilocks zone) by balancing theory and practice 😄😄😄😄😄

  • @rachadlakis1
    @rachadlakis1 1 month ago +1

    That's an amazing resource! It's great to see you sharing such detailed information on a complex topic. Your effort to explain everything clearly will really help others understand and apply these concepts. Keep up the great work!

  • @amishasomaiya9891
    @amishasomaiya9891 4 months ago +2

    Starting to watch my 3rd video on this channel, after transformer from scratch and quantization. Thank you for the great content and also for the code and notes to look back again. Thank you.

  • @pulkitnijhawan653
    @pulkitnijhawan653 1 month ago +1

    Amazing video. An ideal example of how a video lecture should be done.

  • @wangqis
    @wangqis 15 days ago +1

    A very clear explanation~

  • @tharunbhaskar6795
    @tharunbhaskar6795 3 months ago +1

    Dang. Never thought learning DDP would be this easy. Another great piece of content from Umar. Looking forward to FSDP.

  • @karanacharya18
    @karanacharya18 5 months ago +2

    Super high quality lecture. You have a gift of teaching, man. Thank you!

  • @thuann2cats
    @thuann2cats 2 months ago +1

    Absolutely amazing! You made these concepts so accessible!

  • @Maximos80
    @Maximos80 2 months ago +1

    Incredible content, Umar! Well done! 🎉

  • @thejasonchu
    @thejasonchu 1 month ago +1

    Great work! Thank you!

  • @oliverhitchcock8436
    @oliverhitchcock8436 9 months ago +3

    Another great video, Umar. Nice work

  • @МихаилЮрков-т1э
    @МихаилЮрков-т1э 8 months ago

    The video was very interesting and useful. Please make a similar video on DeepSpeed functionality, and, in general, on how to train large models (for example, LLaMa SFT) on distributed systems (multi-server) where the GPUs are located on different PCs.

  • @maleekabakhtawar3892
    @maleekabakhtawar3892 2 months ago

    Well explained, each and every detail. Great work, great explanation 👍
    Can you make this type of detailed video on distributed training with tensor parallelism? It would be very helpful. Thank you!

  • @vasoyarutvik2897
    @vasoyarutvik2897 5 months ago +2

    This channel is a hidden gem.

  • @cken27
    @cken27 10 months ago +3

    Amazing content! Thanks for sharing.

  • @arpitsahni4263
    @arpitsahni4263 1 month ago

    Thanks for the video!! Can you cover Tensor, Sequence and Pipeline Parallel and D using DTensors in PyTorch next?

  • @mandarinboy
    @mandarinboy 8 months ago

    Great intro video. Do you have any plans to also cover other parallelism strategies: model, pipeline, tensor, etc.?

  • @hu6u-l8c
    @hu6u-l8c 9 months ago +2

    Thank you very much for your wonderful video. Can you make a video on how to use the accelerate library with DDP?

  • @vimukthirandika872
    @vimukthirandika872 2 months ago +1

    Really impressive!

  • @nova2577
    @nova2577 8 months ago +1

    You deserve many more likes and subscribers!

  • @prajolshrestha9686
    @prajolshrestha9686 9 months ago +1

    Thank you so much for this amazing video. It is really informative.

  • @AtanuChowdhury-d6o
    @AtanuChowdhury-d6o 7 months ago +1

    Very nice and informative video. Thanks.

  • @Engrbilal143
    @Engrbilal143 7 months ago

    Awesome video. Please make a tutorial on FSDP as well.

  • @810602jay
    @810602jay 10 months ago +1

    Amazing learning material! Many thanks! 🥰🥰🥰

  • @d.s.7857
    @d.s.7857 10 months ago +1

    Thank you so much for this

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 9 months ago

    If time permits, please make a video on GPUs and TPUs and how to use them effectively, since most of us don't know.
    Please also create a PyTorch playlist for beginners and intermediate users.
    Thanks for reading.

  • @sajjadshahabodini
    @sajjadshahabodini 10 days ago +1

    Thanks

  • @loong6127
    @loong6127 7 months ago +1

    Great video

  • @xugefu
    @xugefu 1 day ago

    Thanks!

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 9 months ago

    Hi Umar, great video, I enjoyed it thoroughly, but I have one question: why are we using the approach of sum(grad1 + grad2 + ... + gradN)? Why can't we use the average of the gradients?

    • @umarjamilai
      @umarjamilai  9 months ago +2

      Of course you can (but you don't have to) use the average of the gradients. Actually, people usually take the average of the gradients. The reason we use the average is that we want the loss to be (more or less) the same as the non-distributed model, so you can compare the plots of the two. I don't know if PyTorch internally automatically takes the average of the gradients, I'd have to check the documentation/source.

    • @madhusudhanreddy9157
      @madhusudhanreddy9157 9 months ago

      @umarjamilai Thanks for the info.
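For readers who want to see the averaging variant in code, here is a minimal hand-rolled sketch (not from the video) using torch.distributed. The helper name average_gradients is made up for illustration; it assumes the process group is already initialized and that each rank has finished its own backward pass. For reference, PyTorch's DistributedDataParallel documents that its bucketed all-reduce computes the mean of the gradients across processes.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks (instead of just summing them).

    Assumes dist.init_process_group(...) has been called and that every
    rank has already run loss.backward() on its own micro-batch.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient over all ranks, in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by the number of ranks to get the mean.
            param.grad /= world_size
```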

  • @riyajatar6859
    @riyajatar6859 7 months ago +1

    In broadcast, if we are sending a copy of the file from the rank 0 and rank 4 nodes to the other nodes, how is the total time still 10 seconds? I still have the same internet speed of 1 MB/s.
    Could anyone explain? I am a bit confused.
    Also, what happens if I have an odd number of nodes?

  • @manishsharma2211
    @manishsharma2211 10 months ago +1

    You teach soooooooo well.

  • @SaurabhK9012
    @SaurabhK9012 2 months ago

    Please create a video on model parallelism and FSDP.

  • @svkchaitanya
    @svkchaitanya 3 months ago +1

    You rock always 😂

  • @YashSharma-c4f
    @YashSharma-c4f 7 months ago +1

    Fantastic

  • @khoapham7303
    @khoapham7303 10 months ago +2

    I'm always confused by DP and DDP. Can you please tell me the difference between them, given that both belong to the data parallelism family?

    • @umarjamilai
      @umarjamilai  10 months ago +6

      DP only works on a single machine, while DDP can work on multiple machines. However, PyTorch now recommends using DDP even for single-machine setups.

    • @khoapham7303
      @khoapham7303 10 months ago

      @umarjamilai thank you for your reply
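To make the reply above concrete, here is a minimal single-machine DDP sketch (not code from the video). It assumes the script is launched with torchrun so that RANK, LOCAL_RANK and WORLD_SIZE are set, and the tiny linear model and random batch are placeholders.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_minimal.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun provides the rank environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # DDP keeps a full model replica per process and all-reduces gradients.
    model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 128, device=device)  # placeholder batch
    loss = model(x).sum()                    # placeholder "loss"
    loss.backward()                          # gradients are synchronized here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

By contrast, the older DataParallel wrapper runs as a single process that scatters each batch across the local GPUs, which is why it cannot span multiple machines.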

  • @waynelau3256
    @waynelau3256 5 months ago

    Working with FSDP and Megatron now, and I really want to figure this out from scratch haha. It sounds fun but is a big headache.

  • @mdbayazid6837
    @mdbayazid6837 10 months ago +1

    Federated learning basics please.❤

  • @ramprasath6424
    @ramprasath6424 10 months ago +1

    Please do something related to large audio models like Conformer, QuartzNet, etc.

  • @코크라이크
    @코크라이크 5 months ago

    Could you provide more videos on model parallelism and pipeline parallelism? Thanks.

  • @Yo-rw7mq
    @Yo-rw7mq 6 months ago +1

    Great!

  • @felipemello1151
    @felipemello1151 5 months ago +1

    I wish I could like it twice.

    • @umarjamilai
      @umarjamilai  5 months ago

      You can share it on social media. That's the best way to thank me 😇

    • @felipemello1151
      @felipemello1151 5 months ago

      @umarjamilai not sure if it's in your plans, but if you are open to suggestions, I would love to watch a video on multimodal models. Again, awesome work!

    • @umarjamilai
      @umarjamilai  2 months ago

      Check my latest video!

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 27 days ago

    Thanks!

  • @rohollahhosseyni8564
    @rohollahhosseyni8564 7 months ago +1

    Great video

  • @sounishnath513
    @sounishnath513 9 months ago +1

    SUUUPERRRR

  • @tryit-wv8ui
    @tryit-wv8ui 10 months ago

    Another banger

  • @ai__76
    @ai__76 5 months ago

    How do you do this in Kubernetes? Please explain it.

  • @Erosis
    @Erosis 10 months ago

    Wouldn't the accumulated gradient need to be divided by the total number of individual gradients summed (or the learning rate divided by this value) to make it equivalent?

    • @umarjamilai
      @umarjamilai  10 months ago +2

      Yes, if you want to treat the "cumulative gradient" as a big batch, then you'd usually divide it by the number of items to keep it equivalent to the single-item setup. But it's not mandatory: as a matter of fact, loss functions in PyTorch have a "reduction" parameter, which is usually set to "mean" (dividing the loss by the number of items) but can also be set to "sum".
      One reason we usually calculate the "mean" loss is that we want to make comparisons between models with different hyperparameters (batch size), so the loss should not depend on the batch size.
      But remember that, mathematically, you don't have to.
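A small self-contained sketch of the point above (illustrative tensors, not code from the video): the "sum" and "mean" reductions differ only by a constant factor equal to the number of elements, and so do their gradients, which is the factor you would otherwise fold into the learning rate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred = torch.randn(8, 5, requires_grad=True)
target = torch.randn(8, 5)

# "mean" divides the summed loss by the number of elements; "sum" does not.
loss_mean = F.mse_loss(pred, target, reduction="mean")
loss_sum = F.mse_loss(pred, target, reduction="sum")
print(torch.isclose(loss_sum / pred.numel(), loss_mean))   # tensor(True)

# The gradients differ by the same constant factor, so "sum" with lr / N
# behaves like "mean" with lr (N = number of elements).
(grad_mean,) = torch.autograd.grad(loss_mean, pred, retain_graph=True)
(grad_sum,) = torch.autograd.grad(loss_sum, pred)
print(torch.allclose(grad_sum / pred.numel(), grad_mean))  # True
```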

  • @milonbhattacharya4097
    @milonbhattacharya4097 8 months ago

    Shouldn't the loss be accumulated? loss += (y_pred - y_actual)^0.5

    • @LCG-f8w
      @LCG-f8w 7 months ago

      In my understanding, yes, the loss is accumulated for one batch theoretically, and the gradients are computed based on this accumulated loss too. But in the parallel implementation, both the loss computed in the forward pass and the gradients computed in backpropagation are executed in parallel. Here @umarjamilai uses a for loop to illustrate the de facto parallel mechanism.
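A tiny toy example (illustrative names, not code from the video) of why the two views agree: because differentiation is linear, accumulating the loss and calling backward once yields the same gradient as calling backward per item and letting .grad accumulate.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
xs = [torch.randn(4) for _ in range(3)]
ys = [torch.randn(()) for _ in range(3)]

# (a) Accumulate the loss over the "batch", then do a single backward pass.
loss = sum((x @ w - y) ** 2 for x, y in zip(xs, ys))
loss.backward()
grad_from_summed_loss = w.grad.clone()

# (b) Backward per item; autograd accumulates the per-item gradients in .grad.
w.grad = None
for x, y in zip(xs, ys):
    ((x @ w - y) ** 2).backward()
grad_from_summed_grads = w.grad.clone()

print(torch.allclose(grad_from_summed_loss, grad_from_summed_grads))  # True
```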

  • @jiemao-v4l
    @jiemao-v4l 9 months ago

    Do you have a Discord channel?

  • @eriboyer2229
    @eriboyer2229 1 month ago

    21:01

  • @Allen-TAN
    @Allen-TAN 9 months ago +1

    Always great to watch your video, excellent work

  • @hellochli
    @hellochli 9 months ago +1

    Thanks!

    • @umarjamilai
      @umarjamilai  9 months ago

      Thank you! Let's connect on LinkedIn.

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 26 days ago

    Thanks!