One thing I don't understand well: after training, how do we manage the final output? For a large input, do we "force" the model to speak directly, i.e., produce an output for each input token, OR do we first feed in all of the input and then read the output at a certain point? Basically, one could wait until the reading part is complete and then force an answer. Maybe I'm not being clear (English isn't my language), but it seems there could be several ways to retrieve an output from this type of transformer. Please be kind, thanks for the video :)
Great vids! At around 3 minutes, when you get into the attention matrices, I think the dimensions aren't right: if Q is d by s and K is d by s, then QK-transpose is a d by s matmul with an s by d matrix, which gives a d by d matrix, but the video shows that matrix as s by s.
Thanks! In that part I transposed the diagram because I thought it looked a little better that way. Sometimes the diagrams I draw are transposed, but I try to label the dimensions to avoid ambiguity. So the resulting matrix is s by s, not d by d. An s by s matrix captures relations between sequence positions, while a d by d matrix would capture relations between dimensions across the entire sequence.
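A quick shape check (a minimal NumPy sketch of my own, not from the video) showing that both conventions land on an s by s score matrix:

```python
import numpy as np

s, d = 8, 64                      # sequence length, head dimension

# Row convention: one token per row
Q = np.random.randn(s, d)
K = np.random.randn(s, d)
print((Q @ K.T).shape)            # (8, 8) -> s by s token-to-token scores

# Column convention (one token per column, as in the transposed diagram)
Qc, Kc = Q.T, K.T                 # each (d, s)
print((Qc.T @ Kc).shape)          # (8, 8) -> still s by s
```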
Thanks for the video! Just one question. The H state dimension is fixed, but it accumulates additional information as it proceeds through the token sequence. After the first segment, H1 just "summarizes" one segment, but after segment N, Hn summarizes the current segment plus Hn-1, which is itself the summary of all the past context. Do you think it would make sense to increase the H dimension as the context grows, i.e., have the dimension of Hn grow with n? The idea is to keep the information per bit in H constant, so that we can really grow to unlimited context without the state becoming a bottleneck.
I think it makes sense to increase the hidden state, though doing so would make memory depend on the sequence length during inference, which is currently a big problem. One can think of a softmax attention transformer as having an infinite hidden state (the keys/values are just stacked), while an RNN has a constant-size hidden state. Perhaps something in the middle would perform better than an RNN but not require as much memory as a Transformer?
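A toy sketch of the two regimes (shapes and readout are my own assumptions, not from the paper):

```python
import numpy as np

d = 64  # head dimension

# Transformer-style: keys/values are stacked, so memory grows with step count t
K_cache, V_cache = [], []
def attn_step(q, k, v):
    K_cache.append(k); V_cache.append(v)       # O(t * d) memory after t steps
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V                   # softmax attention over all history

# RNN-style: constant-size state, updated in place
H = np.zeros((d, d))
def rnn_step(q, k, v):
    global H
    H = H + np.outer(k, v)                     # O(d^2) memory, independent of t
    return q @ H                               # linear-attention-style readout
```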
@@gabrielmongaras Yeah, the trick is to find a state update function xn+1 = S(xn, segment_n) such that dim(xn+1) > dim(xn), i.e., projecting the vector xn into a bigger space while preserving the semantics and folding in segment_n's new data. Indeed, because the state dimension in the paper is tailored for a fairly long context, with a growing state you could even start from a _smaller_ state x1 and grow it with the number of segments... so maybe for a not-too-large context you could even get a memory reduction! But I don't see memory usage as a problem; you can always clamp it to a maximum if really needed, a kind of max_memory parameter... it can't be worse than the original.
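Something like this hypothetical sketch (the update S, the growth step, and the projections are all made up for illustration; in practice they would need to be learned per step):

```python
import numpy as np

rng = np.random.default_rng(0)

def S(x_n, segment_n, grow=16):
    """Hypothetical growing-state update x_{n+1} = S(x_n, segment_n):
    project the old state into a larger space, then fold in a summary
    of the new segment."""
    d_old, d_seg = x_n.shape[0], segment_n.shape[1]
    d_new = d_old + grow
    P = rng.normal(size=(d_new, d_old)) / np.sqrt(d_old)   # expansion map
    W = rng.normal(size=(d_new, d_seg)) / np.sqrt(d_seg)   # segment mix-in
    return P @ x_n + W @ segment_n.mean(axis=0)            # dim(x_{n+1}) > dim(x_n)

x = rng.normal(size=32)                  # start from a smaller state
x = S(x, rng.normal(size=(128, 64)))     # state is now 48-dimensional
```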
Thanks, another great vid!
Great vids! But one thing I couldn't figure out: how is the training process accomplished?
great vid gabriel 👍
17:50, shouldn't H_i be a summation of the k_j, v_j terms over j instead of over i, where j goes from 1 to i?
Yep. Nice catch! Put a note in the video about the reindexing.
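For reference, the corrected indexing written out (using the outer-product form of the hidden-state update; the LaTeX is mine):

```latex
H_i = \sum_{j=1}^{i} k_j v_j^\top = H_{i-1} + k_i v_i^\top
```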
I have to get used to this scientific notation; I struggle to produce code from these articles myself.
so basically what you're telling me is the world is going to end
Absolutely :)