Turing-NLG, DeepSpeed and the ZeRO optimizer

  • Published: 10 Jan 2025

Comments •

  • @maxim_ml 9 months ago +3

    watching now and hearing 17b is _huge_ really makes you feel the time passing

  • @gabrielchu5798 1 year ago +11

    It's fascinating to revisit this after three years. Must say, by throwing in more parameters, we have indeed made incredible progress in truly understanding languages.

  • @КириллКлимушин 6 months ago

    Thanks for such an interesting overview of the concept. Probably the only decent-quality video I've managed to find on YouTube. Thank you!

  • @meteogold6761 1 year ago +1

    The interconnect between GPUs within a node is called NVLink; InfiniBand is used to connect different nodes.

  • @CodeEmporium 4 years ago +4

    Very interesting. Thanks for the great explanation! I'll read more on this.

  • @ssenie 3 years ago +4

    9:01 the on-board network connecting GPUs is called NVLink

  • @MaxTrex-i7s 3 months ago

    Amazing video, with amazing achievements in AI. Great work Yannic. Thanks a lot!!

  • @maxwang3831 3 years ago +3

    Pretty clear explanation on ZeRO. Appreciate it.

  • @judgeomega 4 years ago +9

    The ZeRO optimizer is described at 11:37.

  • @fahdciwan8709 4 years ago +1

    thanks for making this so simple to understand

  • @sadface7457 4 years ago +9

    Can we get a video series on the graveyard of machine learning: why have ideas like synthetic gradients and capsule nets gone dormant in the deep learning space?

    • @YannicKilcher 4 years ago +8

      My bet is those ideas never worked in the first place. Yes, they "work" in their papers, but you can get anything to work for a paper. They will probably keep re-appearing every couple of months because someone combines them with something else and wants a bit of name recognition.
      In rare cases, someone will figure out how to make one of these actually work, as happened with neural networks (AlexNet) or GANs (Goodfellow). These ideas were around long before, looking like crap. But so were 1000 others that actually are crap.

    • @taylorsmurphy 4 years ago +1

      Capsules were the main topic of Geoffrey Hinton's talk at the recent 2020 Turing Awards (vimeo.com/390347111), at about 1:30:00.

  • @afshinoroojlooy7038 1 year ago

    Very clear explanation. Thanks!

  • @AnyaChuri 2 years ago

    Thank you Yannic 💕💕

  • @jrkirby93 4 years ago +7

    Interesting to note that, when compared to PEGASUS-large, this didn't even perform better on all the tasks. It's a bit misleading to have a column in the performance evaluation called "Previous SOTA" while there's another column in the same chart that beats the "Previous SOTA". And this has ~30x more parameters than the PEGASUS-large model. If this isn't evidence that better training beats raw size, I don't know what is.
    However, this is a cool way of training that allows for very big models with low overhead. On top of that, with this technique, size should scale approximately linearly with the number of GPUs and machines (if you use multicast routers). It makes techniques that use self-supervised local loss to avoid full-model backpropagation seem a lot less necessary.
    All things considered, I think the next important step is learning to distill such large models into smaller models with similar performance (a rough sketch of the idea follows this comment). Primarily, this is needed for inference, because 17 billion parameters have a hard time fitting on GPUs. But it might also be able to extend training beyond its initial bound - perhaps distilling a model and adding more layers over again in a loop.
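    A minimal sketch of the distillation loop mentioned above, assuming PyTorch (illustrative only, not from the video or the Turing-NLG paper; `teacher`, `student`, the temperature `T`, and the mixing weight `alpha` are placeholder names): the small student is trained to match the teacher's softened output distribution, mixed with the ordinary cross-entropy loss on the labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's distribution at temperature T
        # (scaled by T^2, as is conventional, to keep gradient magnitudes comparable).
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    def distill_step(student, teacher, optimizer, inputs, labels):
        with torch.no_grad():                  # the large teacher is frozen
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        loss = distillation_loss(student_logits, teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()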

  • @mastafafoufa5121 4 years ago +1

    Hi, in the first approach of Data Parallelism without ZeRO, is it sequential? Meaning, is Data0 first fed to the network, then backpropagation is done, then all the new parameters are sent to the other GPUs (GPU_1 to GPU_n-1)? And then Data1 is fed to the network stored in GPU_1, which sends its own updated parameters to GPU_2, etc.
    I feel like that would not be optimal. But I do not see the intuition behind the parameter sharing across GPUs. Any idea how that would work? :)

    • @tedbrownlow4617 1 year ago

      My understanding from the video is that the parameters are shared across GPUs because you are trying to train the same model with multiple GPUs. Without sharing/synchronizing parameters, you would effectively be training N separate models, which would quickly become difficult or impossible to reconcile. (A minimal sketch of the synchronization step is below.)
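      A minimal sketch of how plain data parallelism (without ZeRO) keeps the replicas in sync, assuming PyTorch with torch.distributed already initialized (illustrative only, not code from the video): the GPUs run in parallel rather than sequentially; each rank computes forward and backward on its own slice of the batch, gradients are averaged with an all-reduce, and then every replica applies the identical update.

      import torch
      import torch.distributed as dist

      def data_parallel_step(model, optimizer, loss_fn, inputs, targets):
          # Each rank (GPU) holds a full copy of the model and processes
          # its own shard of the global batch.
          optimizer.zero_grad()
          loss = loss_fn(model(inputs), targets)
          loss.backward()

          # Average gradients across ranks so every replica sees the same update.
          world_size = dist.get_world_size()
          for param in model.parameters():
              if param.grad is not None:
                  dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                  param.grad /= world_size

          # Identical gradients + identical starting weights => replicas stay in sync.
          optimizer.step()
          return loss.item()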

  • @whatsinthepapers6112 4 years ago +4

    "AND.... it is a bit better!"

    • @YannicKilcher 4 years ago +6

      NLP is slowly going the way of imagenet.

    • @whatsinthepapers6112 4 years ago +2

      @@YannicKilcher Absolutely. Single GPU plebs like me will have to stick with toy problems for now!

  • @williamleigh816 9 months ago

    great video!

  • @burakhelvacoglu8819 1 year ago

    From an epistemological perspective: commonly known (objective history) facts can be manufactured by the total-possibility language space.

  • @glennkroegel1342 4 years ago +3

    It's a little bit better...but maybe it's sentient (joke).
    But seriously, I think the memory footprint of things like BERT is already pushing it - so much so that the things I've been most interested in over the last year are things like DistilBERT. But even that is too much for the CPU peasants out there, of which there are many. Having said that, I do trust that leaders in the field are aware that the "no replacement for displacement" parameter dick-measuring contest is not the endgame solution.

    • @YannicKilcher 4 years ago +3

      I think to the big companies, it's mostly a recruitment advertising platform.

  • @gigabytechanz9646 1 year ago

    Awesome!

  • @Qumeric 4 years ago

    It's somewhat similar to microprocessor pipelines. Probably there are more pipelining tricks to adapt.

    • @tedbrownlow4617 1 year ago

      I thought the same thing. The classic laundry-machine example is exactly what I thought of during the naive-ish model-splitting section. (A toy schedule illustrating the idea is sketched below.)
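      A toy illustration of that pipelining point (my own sketch, not from the video; the stage and micro-batch counts are arbitrary): splitting the batch into micro-batches lets the GPUs holding different layer groups overlap their work instead of idling, exactly like the laundry pipeline. The snippet only prints which stage works on which micro-batch at each time step.

      def pipeline_schedule(num_stages=4, num_microbatches=8):
          # Naive model splitting: only one stage is busy at a time.
          # With micro-batches, stage s processes micro-batch m at step s + m,
          # so once the pipeline fills up, all stages are busy simultaneously.
          total_steps = num_stages + num_microbatches - 1
          for t in range(total_steps):
              busy = [f"stage{s}:mb{t - s}" for s in range(num_stages)
                      if 0 <= t - s < num_microbatches]
              print(f"t={t:2d}  " + "  ".join(busy))

      pipeline_schedule()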

  • @2107mann 4 years ago +2

    First one on TNLG

  • @florianschmidt6509 4 years ago +3

    "I don't know"'