How Fully Sharded Data Parallel (FSDP) works?

  • Published: 1 Jan 2025

Comments • 66

  • @viharigandrakota1783 · 14 days ago

    The Best video on this topic. period.

  • @tinajia2958 · 9 months ago +4

    This is the best video I’ve watched on distributed training

  • @yixiaoli6786 · 9 months ago +2

    The best video of FSDP. Very clear and helpful!

  • @chenqian3404 · 1 year ago +6

    To me this is by far the best video explaining how FSDP works, thanks a lot!

  • @mahmoudelhage6996 · 4 months ago +5

    As a machine learning research engineer working on fine-tuning LLMs, I normally use DDP or DeepSpeed, and I wanted to understand more about how FSDP works. This video is well structured and provides a detailed explanation of FSDP; I totally recommend it. Thanks Ahmed for your effort :)

  • @MrLalafamily · 1 year ago +2

    Thank you so much for investing your time in creating this tutorial. I am not an ML engineer, but I wanted to build intuition around parallelizing computation across GPUs, and your video was very helpful. I especially liked that you provided multiple examples for the parts that were a bit more nuanced. I paused the video many times to think things over. Again, gratitude as a learner.

  • @quaizarvohra3810 · 2 months ago

    I have been looking for a resource which would explain FSDP conceptually. This one explains it very clearly and completely. Awesome!

  • @dhanvinmehta3294 · 4 months ago +1

    Thank you very much for making such a knowledge-dense, yet self-contained video!

  • @abhirajkanse6418 · 2 months ago

    That makes things very clear! Thanks a lot!!

  • @yuxulin1322 · 9 months ago

    Thank you so much for such detailed explanations.

  • @AntiochSanders · 1 year ago +1

    Wow, this is a super good explanation; it cleared up a lot of misconceptions I had about FSDP.

  • @PalashKumar1010 · 1 month ago

    In Adam, at 1:29, variance should be velocity.

  • @abdelkarimeljandoubi2322 · 2 months ago

    Well explained. Thank you

  • @xxxiu13 · 5 months ago

    A great explanation of FSDP indeed. Thanks for the video!

  • @lazycomedy9358 · 11 months ago

    This is really clear and helped me understand a lot of details in FSDP!! Thanks

  • @amansinghal5908 · 6 months ago

    Great video. One recommendation: make 3 videos, one like this, one that goes deeper into the implementation (e.g. the FSDP code), and finally one on how to use it (e.g. case studies).

  • @saurabhpawar2682 · 11 months ago

    Excellent explanation. Thank you so much for putting this out!

  • @tharunbhaskar6795 · 5 months ago

    The best explanation so far

  • @bharadwajchivukula2945 · 1 year ago

    crisp and amazing explanation so far

  • @pankajvermacr7 · 1 year ago

    Thanks for this. I'm having trouble understanding FSDP; I even read a research paper, but it was hard to understand. I really appreciate your effort. Please make more such videos.

  • @mandeepthebest · 5 months ago

    amazing video! very well articulated.

  • @NachodeGregorio · 10 months ago

    Amazing explanation, well done.

  • @coolguy69235 · 1 year ago

    very good video ! seriously keep up the good work !

  • @yatin-arora · 6 months ago +1

    well explained 👏

  • @yuvalkirstain7190 · 11 months ago

    Fantastic presentation, thank you!

  • @ElijahTang-t1y · 4 months ago

    well explained, great job!

  • @hyunhoyeo4287 · 6 months ago

    Great explanation!

  • @amirakhlaghi8143 · 7 months ago

    Excellent presentation

  • @adityashah3751 · 9 months ago

    Great video!

  • @phrasedparasail9685 · 2 months ago

    This is amazing

  • @p0w3rFloW · 1 year ago

    Awesome video! Thanks for sharing

  • @clarechen1590 · 4 months ago

    great video!

  • @RaviTeja-zk4lb · 1 year ago

    I was struggling to understand how FSDP works, and your video helped me a lot. Thank you. After understanding what these backends are, I see that FSDP definitely requires a GPU: for CPU we use 'gloo' as the backend, and it doesn't support reduce-scatter. It would be great if you also covered parameter-server training using the RPC framework.
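
To make the backend point above concrete, here is a minimal sketch (not from the video) of picking a backend and calling the reduce-scatter collective that FSDP relies on for gradient sharding. It assumes a script launched with torchrun and one GPU per rank; the gloo (CPU) backend does not implement reduce-scatter, which is why FSDP in practice requires GPUs with NCCL.

```python
import torch
import torch.distributed as dist

# Sketch only; assumes `torchrun --nproc_per_node=<N> demo.py` with one GPU per rank.
# NCCL (GPU) provides every collective FSDP needs, including reduce-scatter;
# gloo (CPU) does not.
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank contributes a full-length gradient; after reduce_scatter_tensor,
# every rank holds only its summed shard: the core of FSDP's gradient sync.
full_grad = torch.randn(8 * world, device="cuda")
my_shard = torch.empty(8, device="cuda")
dist.reduce_scatter_tensor(my_shard, full_grad, op=dist.ReduceOp.SUM)
print(f"rank {rank}: got shard of shape {tuple(my_shard.shape)}")

dist.destroy_process_group()
```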

  • @TrelisResearch · 8 months ago

    Great video, congrats

  • @mohammadsalah2307 · 1 year ago

    Thanks for sharing! 19:19 The first FSDP unit to run the forward pass is FWD0; however, this FSDP unit contains layer 0 and layer 3. How could we compute the result of layer 3 without computing the results of layers 1 and 2 first?

    • @ahmedtaha8848 · 1 year ago +2

      Layer3 is computed only after computing layer 1 and layer 2. Please note that there are two 'FWD0': the first one computes layer 0; the second one computes layer 3 after FWD1 (layer 1 and 2) finishes.
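
For readers wondering how one FSDP unit can hold the non-adjacent layers 0 and 3, below is a hedged sketch (my illustration, not the video's code) of manual wrapping that reproduces that grouping: the inner FSDP units own layers 1-2 and 4-5, and whatever is left unwrapped (layers 0 and 3) falls to the root unit. It assumes the process group is already initialized under torchrun, as in the earlier sketch.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def linear():
    # Hypothetical layer size, purely for illustration.
    return nn.Linear(1024, 1024)

model = nn.Sequential(
    linear(),                                 # layer 0: left for the root unit
    FSDP(nn.Sequential(linear(), linear())),  # layers 1-2: inner FSDP unit 1
    linear(),                                 # layer 3: left for the root unit
    FSDP(nn.Sequential(linear(), linear())),  # layers 4-5: inner FSDP unit 2
)
model = FSDP(model)  # root FSDP unit 0 owns the remaining parameters: layers 0 and 3

# Execution order is still layer 0 -> 1 -> 2 -> 3 -> 4 -> 5; "FWD0" appears twice
# only because unit 0's parameters are used for layer 0 and again for layer 3.
```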

  • @bennykoren212 · 1 year ago

    Excellent !

  • @santiagoruaperez7394 · 10 months ago

    Hi. I want to ask you something. At 3:01 you also include the optimizer state in the multiplication for each parameter. Isn't the optimizer state just one for the whole model? What I mean is: if I have a 13B model in comparison with a 7B model, there are going to be more gradients. But does the optimizer state also depend on the number of parameters?

    • @ahmedtaha8848 · 10 months ago +1

      The optimizer state is not just one for the whole model. A 13B model has both more gradients and more optimizer state compared to a 7B model. Yes, the optimizer state depends on the number of parameters. For the Adam optimizer, the optimizer state (Slides 4 & 5) includes both momentum (first moment) and variance (second moment) for each gradient, i.e., for each parameter.

    • @santiagoruaperez7394 · 10 months ago

      Amazing video, super clear @ahmedtaha8848
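
As a back-of-the-envelope check on the reply above (my own arithmetic, assuming plain fp32 training with Adam, i.e. 4 bytes each for parameters, gradients, momentum, and variance):

```python
# fp32 parameter + fp32 gradient + Adam momentum (m) + Adam variance (v), 4 bytes each.
BYTES_PER_PARAM = 4 + 4 + 4 + 4

for n_params in (7e9, 13e9):
    total_gb = n_params * BYTES_PER_PARAM / 1e9
    print(f"{n_params / 1e9:.0f}B params -> ~{total_gb:.0f} GB for weights + grads + optimizer state")

# Roughly 112 GB for a 7B model and 208 GB for a 13B model: the optimizer state
# grows with the parameter count, which is exactly what FSDP shards away.
```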

  • @adamlin120 · 1 year ago

    Amazing explanation 🎉🎉🎉

  • @AIwithAniket · 1 year ago

    it helped a lot. thank you so much

  • @dhineshkumarr3182 · 1 year ago

    Thanks man!

  • @hannibal0466 · 8 months ago

    Awesome, bro! One short question: in the example shown (24:06), why are there two consecutive AG2 stages?

    • @ahmedtaha8848 · 8 months ago

      Thanks! One for the forward pass and another for the backward pass. I suppose you could write a special handler for the last FSDP unit to avoid freeing the parameters and then re-gathering them. Yet, imagine if FSDP unit#0 had another layer (layer#6) after FSDP unit#2, i.e., (layer#0, layer#3, layer#6) in total. The aforementioned special handler wouldn't look so wise then.
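
A rough reconstruction of the schedule being discussed (my reading of this thread and the unit layout from 19:19, not an authoritative trace): the last unit is gathered for its forward pass, freed, and then gathered again to start the backward pass, which is why AG2 shows up twice in a row.

```python
# AG = all-gather parameters, FWD/BWD = forward/backward compute, RS = reduce-scatter grads.
# Unit 0 = layers 0+3, unit 1 = layers 1+2, unit 2 = layers 4+5.
forward = ["AG0", "FWD0(layer0)", "AG1", "FWD1(layers1-2)",
           "FWD0(layer3)", "AG2", "FWD2(layers4-5)"]
backward = ["AG2",               # unit 2 was freed after its forward, so gather it again
            "BWD2", "RS2",
            "AG1", "BWD1", "RS1",
            "BWD0", "RS0"]       # no AG0: the root unit's parameters stayed resident
print(" -> ".join(forward + backward))
```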

  • @Veekshan95 · 1 year ago

    Amazing video with great visual aids and even better explanation.
    I just had one question - at 24:45 you mentioned that FSDP layer 0 is never freed until the end. So does this mean the GPUs will hold layer 0 the whole time and, in addition to that, gather the other layers as needed?

    • @ahmedtaha8848 · 1 year ago +1

      Yes, Unit 0 (layer 0 + layer 3) -- which is the outermost FSDP unit -- will be available across all nodes (GPUs) during an entire training iteration (forward + backward). Quoting from arxiv.org/pdf/2304.11277 (Page #6), "Note that the backward pass excludes the AG0 All-Gather because FSDP intentionally keeps the outermost FSDP unit’s parameters in memory to avoid redundantly freeing at the end of forward and then re-All-Gathering to begin backward."

  • @louiswang538 · 6 months ago

    How is FSDP different from gradient accumulation? It seems both have each mini-batch produce 'local gradients' that are summed up to get a global gradient for the model update.

  • @aflah7572 · 5 months ago

    Thank You!

  • @ManishPrajapati-o4x · 3 months ago

    TY!

  • @RedOne-t6w · 10 months ago

    Awesome

  • @maxxu8818 · 9 months ago

    Hello Ahmed, if it's 4-way FSDP in a node, does that mean only 4 GPUs are used in that node? Usually there are 8 GPUs in a node, so how are the other 4 GPUs used? Thanks!

  • @richeshc · 11 months ago

    Namaste, a doubt. For pipeline parallelism (mins 10 to 12) you mentioned that while we send the weights from GPU 1's training on mini-batch 1 to GPU 2, we start training GPU 1's feedforward network on mini-batch 2. My doubt is: isn't it supposed to be feedforward, followed by backward propagation and a weight update, and only then training on batch 2? The wording suggests that we start feedforward training on GPU 1 with mini-batch 2 as soon as we transfer the weights from GPU 1's mini-batch-1 training to GPU 2.

    • @ahmedtaha8848 · 11 months ago

      For mini-batch 1, we can't do back-propagation till we compute the loss, i.e., till mini-batch 1 passes through all layers/blocks. Same for mini-batch 2. After computing the loss for mini-batch 1, we can back-propagate one layer/block at a time on different GPUs -- and of course update the gradients. Yet, again, other GPUs will remain idle if we are processing (forward/backward) a single mini-batch. Thus, it is better to work with multiple mini-batches, each with a different loss value. These mini-batches will be forwarding/backwarding on different GPUs in parallel.
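
To visualize the reply above, here is a tiny self-contained simulation (with made-up stage and micro-batch counts) of a GPipe-style forward schedule: at step t, stage s works on micro-batch t - s, so after a short fill phase every GPU stays busy; the backward passes begin only once each micro-batch's loss has been computed, as the reply explains.

```python
# Toy forward-only pipeline schedule: 4 stages (GPUs), 8 micro-batches.
STAGES, MICRO_BATCHES = 4, 8

for t in range(STAGES + MICRO_BATCHES - 1):
    cells = []
    for s in range(STAGES):
        mb = t - s  # micro-batch index this stage processes at step t
        cells.append(f"F{mb}" if 0 <= mb < MICRO_BATCHES else "--")
    print(f"step {t:2d}: " + " | ".join(f"GPU{s}:{c}" for s, c in enumerate(cells)))
```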

  • @piotr780 · 10 months ago

    How is the whole gradient calculated if the weights don't fit in a single GPU's memory?

  • @amaleki · 4 months ago

    There is a bit of a discrepancy with how model parallelism is defined in NVIDIA's literature. Namely, in NVIDIA's literature, model parallelism is the overarching idea of sharding the model, and it can take two forms: i) tensor model parallelism (splitting layers between GPUs, where each GPU gets a portion of each layer) and ii) pipeline parallelism (each GPU is responsible for computing some layers entirely).

  • @DevelopersHutt · 9 months ago

    TY

  • @parasetamol6261 · 1 year ago

    That's a great video.

  • @tga3532 · 7 days ago

    Very well explained. Thank you!

  • @dan1ar · 3 months ago

    Great video!

  • @gostaforsum6141 · 4 months ago

    Great explanation!