Turing-NLG, DeepSpeed and the ZeRO optimizer

  • Published: 10 Jan 2025

Comments •

  • @maxim_ml 9 months ago +3

    watching now and hearing 17b is _huge_ really makes you feel the time passing

  • @gabrielchu5798 1 year ago +11

    It's fascinating to revisit this after three years. Must say, by throwing in more parameters, we have indeed made incredible progress in truly understanding languages.

  • @КириллКлимушин 6 months ago

    Thanks for such an interesting overview of the concept. Probably the only decent-quality video I've managed to find on YouTube. Thank you!

  • @meteogold6761 1 year ago +1

    The interconnect between GPUs within a node is called NVLink; InfiniBand is used to connect different nodes.

  • @CodeEmporium 4 years ago +4

    Very interesting. Thanks for the great explanation! I'll read more on this.

  • @ssenie 3 years ago +4

    9:01 the on-board network connecting GPUs is called NVLink

  • @MaxTrex-i7s 3 months ago

    Amazing video, with amazing achievements in AI. Great work Yannic. Thanks a lot!!

  • @maxwang3831 3 years ago +3

    Pretty clear explanation on ZeRO. Appreciate it.

  • @judgeomega 4 years ago +9

    The ZeRO optimizer is described at 11:37.

  • @fahdciwan8709 4 years ago +1

    thanks for making this so simple to understand

  • @sadface7457 4 years ago +9

    Can we get a video series on the graveyard of machine learning: why have ideas like synthetic gradients and capsule nets gone dormant in the deep learning space?

    • @YannicKilcher 4 years ago +8

      My bet is those ideas never worked in the first place. Yes, they "work" in their papers, but you can get anything to work for a paper. They will probably keep re-appearing every couple of months because someone combines them with something else and wants a bit of name recognition.
      In rare cases, someone will figure out how to make one of these actually work, as happened with neural networks (AlexNet) or GANs (Goodfellow). These ideas were around long before, looking like crap. But so were 1000 others that actually are crap.

    • @taylorsmurphy 4 years ago +1

      Capsules were the main topic of Geoffrey Hinton's talk at the recent 2020 Turing Awards (vimeo.com/390347111), at about 1:30:00.

  • @afshinoroojlooy7038 1 year ago

    Very clear explanation. Thanks!

  • @AnyaChuri 2 years ago

    Thank you Yannic 💕💕

  • @jrkirby93 4 years ago +7

    Interesting to note that, when compared to PEGASUS-large, this didn't even perform better on all the tasks. It's a bit misleading to have a column in the performance evaluation called "Previous SOTA" while there's another column in the same chart that beats the "Previous SOTA". And this has ~30x more parameters than the PEGASUS-large model. If this isn't evidence that better training beats raw size, I don't know what is.
    However, this is a cool way of training that allows for very big models with low overhead. On top of that, with this technique, size should scale approximately linearly with the number of GPUs and machines (if you use multicast routers). It makes techniques that use self-supervised local loss to avoid full-model backpropagation seem a lot less necessary.
    All things considered, I think the next important step is learning to distill such large models into smaller models with similar performance (a rough sketch of the idea follows this comment). Primarily, this is needed for inference, because 17 billion parameters have a hard time fitting on GPUs. But it might also be able to extend training beyond its initial bound - perhaps distilling a model and adding more layers over again in a loop.
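    A minimal sketch of the distillation loop mentioned above, assuming PyTorch (illustrative only, not from the video or the Turing-NLG paper; `teacher`, `student`, the temperature `T`, and the mixing weight `alpha` are placeholder names): the small student is trained to match the teacher's softened output distribution, mixed with the ordinary cross-entropy loss on the labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: match the teacher's distribution at temperature T
        # (scaled by T^2, as is conventional, to keep gradient magnitudes comparable).
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    def distill_step(student, teacher, optimizer, inputs, labels):
        with torch.no_grad():                  # the large teacher is frozen
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        loss = distillation_loss(student_logits, teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()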

  • @mastafafoufa5121 4 years ago +1

    Hi, in the first approach of Data Parallelism without ZeRO, is it sequential? Meaning, is Data0 first fed to the network, then backpropagation is done, then all the new parameters are sent to the other GPUs (GPU_1 to GPU_n-1)? And then Data1 is fed to the network stored in GPU_1, which sends its own updated parameters to GPU_2, etc.
    I feel like that would not be optimal. But I do not see the intuition behind the parameter sharing across GPUs. Any idea how that would work? :)

    • @tedbrownlow4617 1 year ago

      My understanding from the video is that the parameters are shared across GPUs because you are trying to train the same model with multiple GPUs. Without sharing/synchronizing parameters, you would effectively be training N separate models, which would quickly become difficult or impossible to reconcile. (A minimal sketch of the synchronization step is below.)
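      A minimal sketch of how plain data parallelism (without ZeRO) keeps the replicas in sync, assuming PyTorch with torch.distributed already initialized (illustrative only, not code from the video): the GPUs run in parallel rather than sequentially; each rank computes forward and backward on its own slice of the batch, gradients are averaged with an all-reduce, and then every replica applies the identical update.

      import torch
      import torch.distributed as dist

      def data_parallel_step(model, optimizer, loss_fn, inputs, targets):
          # Each rank (GPU) holds a full copy of the model and processes
          # its own shard of the global batch.
          optimizer.zero_grad()
          loss = loss_fn(model(inputs), targets)
          loss.backward()

          # Average gradients across ranks so every replica sees the same update.
          world_size = dist.get_world_size()
          for param in model.parameters():
              if param.grad is not None:
                  dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                  param.grad /= world_size

          # Identical gradients + identical starting weights => replicas stay in sync.
          optimizer.step()
          return loss.item()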

  • @whatsinthepapers6112 4 years ago +4

    "AND.... it is a bit better!"

    • @YannicKilcher 4 years ago +6

      NLP is slowly going the way of imagenet.

    • @whatsinthepapers6112 4 years ago +2

      @@YannicKilcher Absolutely. Single GPU plebs like me will have to stick with toy problems for now!

  • @williamleigh816 9 months ago

    great video!

  • @burakhelvacoglu8819 1 year ago

    From an epistemological perspective: commonly known (objective history) facts can be manufactured by the total-possibility language space.

  • @glennkroegel1342 4 years ago +3

    It's a little bit better...but maybe it's sentient (joke).
    But seriously, I think the memory footprint of things like BERT is already pushing it - so much so that the things I've been most interested in over the last year are things like DistilBERT. But even that is too much for the CPU peasants out there, of which there are many. Having said that, I do trust that leaders in the field are aware that the "no replacement for displacement" parameter dick-measuring contest is not the endgame solution.

    • @YannicKilcher 4 years ago +3

      I think to the big companies, it's mostly a recruitment advertising platform.

  • @gigabytechanz9646 1 year ago

    Awesome!

  • @Qumeric 4 years ago

    It's somewhat similar to microprocessor pipelines. Probably there are more pipelining tricks to adapt.

    • @tedbrownlow4617 1 year ago

      I thought the same thing. The classic laundry-machine example is exactly what I thought of during the naive-ish model-splitting section. (A toy schedule illustrating the idea is sketched below.)
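      A toy illustration of that pipelining point (my own sketch, not from the video; the stage and micro-batch counts are arbitrary): splitting the batch into micro-batches lets the GPUs holding different layer groups overlap their work instead of idling, exactly like the laundry pipeline. The snippet only prints which stage works on which micro-batch at each time step.

      def pipeline_schedule(num_stages=4, num_microbatches=8):
          # Naive model splitting: only one stage is busy at a time.
          # With micro-batches, stage s processes micro-batch m at step s + m,
          # so once the pipeline fills up, all stages are busy simultaneously.
          total_steps = num_stages + num_microbatches - 1
          for t in range(total_steps):
              busy = [f"stage{s}:mb{t - s}" for s in range(num_stages)
                      if 0 <= t - s < num_microbatches]
              print(f"t={t:2d}  " + "  ".join(busy))

      pipeline_schedule()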

  • @2107mann 4 years ago +2

    First one on TNLG

  • @florianschmidt6509 4 years ago +3

    "I don't know"'