L19.3 RNNs with an Attention Mechanism

  • Published: 23 Dec 2024

Comments • 19

  • @ah-rdk
    @ah-rdk 2 days ago +1

    The most complete video I found for Attention Mechanism! Thank you very much, sir.

    • @SebastianRaschka
      @SebastianRaschka  1 day ago

      Glad you found it useful! Btw I have an even more complete blog post you might like :) Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs (magazine.sebastianraschka.com/p/understanding-and-coding-self-attention)

  • @koiRitwikHai
    @koiRitwikHai 2 years ago +3

    At 17:42, I think the two items (the ones going into the pink "Neural Net" box) should be S_{t'-1} and h_{t},
    because otherwise e_{t,t'} would depend solely on t. Then why even call it e_{t,t'}? You could simply call it e_t.

    • @ricardogomes9528
      @ricardogomes9528 1 year ago

      I think it should be S_{t-1} (as it is) and h_{t'}, because in the formula below t' ranges all the way up to T, which is the maximum index of the encoder time steps. Am I wrong?

    • @borutsvara7245
      @borutsvara7245 1 year ago

      Yes, I think it should be h_{t'}, otherwise the t' dependency does not make sense. But then you would also need to run the yellow RNN on the entire sentence to get S_{t-1} and compute the attention, which does not make much sense either. Also, the next slide has an h_{t'}, which may indicate a correction attempt. (A sketch of this computation follows at the end of this thread.)

    • @Prithviization
      @Prithviization 8 months ago

      Slide 29 here : sebastianraschka.com/pdf/lecture-notes/stat453ss21/L19_seq2seq_rnn-transformers__slides.pdf
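
To make the notation in this thread concrete, here is a minimal, self-contained sketch of an additive (Bahdanau-style) alignment score e_{t,t'} = a(S_{t-1}, h_{t'}) and the resulting attention weights. The tensor names, toy dimensions, and the one-hidden-layer scoring network (W_s, W_h, v) are illustrative assumptions, not the exact parameterization from the slides.

```python
import torch

hidden_dim = 16
T_enc = 5  # number of encoder time steps

s_prev = torch.randn(hidden_dim)       # decoder state S_{t-1}
h = torch.randn(T_enc, hidden_dim)     # encoder states h_{1}, ..., h_{T}

# the small "neural net" box: a one-hidden-layer scoring MLP (additive attention)
W_s = torch.randn(hidden_dim, hidden_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
v = torch.randn(hidden_dim)

e = torch.tanh(s_prev @ W_s + h @ W_h) @ v      # energies e_{t,1..T}, shape (T_enc,)
alpha = torch.softmax(e, dim=0)                  # attention weights, sum to 1
context = (alpha.unsqueeze(1) * h).sum(dim=0)    # context vector c_t, shape (hidden_dim,)
print(e.shape, alpha.shape, context.shape)
```

Note that each e_{t,t'} here depends on both the decoder state S_{t-1} and a specific encoder state h_{t'}, which is why the score carries both indices.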

  • @LoveinPortofino1
    @LoveinPortofino1 2 years ago +2

    Thanks for the very detailed explanation. In the graph you show that S_{t-1} and c_{t} go into the calculation of S_{t}. However, we also need y_{t-1}. In the original paper,
    the formula is S_{t} = f(S_{t-1}, y_{t-1}, c_{t}), which is why I did not quite understand the computation graph for S_{t}.
    Is the formula below correct?
    S_{t} = sigmoid(Weight_{hidden state} * S_{t-1} + Weight_{context} * c_{t} + Weight_{input} * y_{t-1})
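
For reference, a minimal sketch of one decoder update that feeds in all three inputs from the paper's formula S_t = f(S_{t-1}, y_{t-1}, c_t). The plain tanh cell and the weight names below are simplifying assumptions (the original paper uses a gated recurrent unit), so this illustrates the dependency structure rather than the exact update.

```python
import torch

hidden_dim, embed_dim = 16, 8
s_prev = torch.randn(hidden_dim)   # S_{t-1}
y_prev = torch.randn(embed_dim)    # embedding of the previous output token y_{t-1}
c_t = torch.randn(hidden_dim)      # context vector c_{t} from the attention step

W_s = torch.randn(hidden_dim, hidden_dim)   # Weight_{hidden state}
W_y = torch.randn(embed_dim, hidden_dim)    # Weight_{input}
W_c = torch.randn(hidden_dim, hidden_dim)   # Weight_{context}

# all three inputs enter the update of the new decoder state
s_t = torch.tanh(s_prev @ W_s + y_prev @ W_y + c_t @ W_c)
print(s_t.shape)  # torch.Size([16])
```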

  • @Amapramaadhy
    @Amapramaadhy 2 years ago +1

    Thanks for the great content.
    I find the "time step" terminology confusing. Might we call it "next item in the sequence" instead?

  • @mahaaljarrah3236
    @mahaaljarrah3236 3 years ago +3

    Thank you very much, this was really helpful.

  • @abubakarali6399
    @abubakarali6399 3 years ago +1

    When you sum up all the attention weights, the aggregation function gives a single result rather than telling you which word is more important.
    How does this aggregation function remember the attention weight of every word?

    • @SebastianRaschka
      @SebastianRaschka  3 years ago +2

      Good point. The attention weights basically give you the importance of a word. (Here I am using "word" loosely; it also means the word's representation as a real-valued vector.)
      When you aggregate, you weight the "relevant" words more strongly via these weights. As you hinted, you squash the attention-weighted words into a single vector, but this is still more "powerful" than a regular RNN. In a regular RNN, you carry the words along iteration by iteration, so information from early words may be forgotten. Say you are at word 10 of an input sentence: with the attention-weighted version, you can still have a high weight on word 1, and via the aggregation function that word will then have a high influence at time step 10 (see the sketch below).
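
A small numerical sketch of the aggregation described above, with made-up attention weights: even though the weighted words are squashed into a single context vector, a large weight on word 1 lets that word dominate the result at a late decoding step.

```python
import torch

h = torch.randn(10, 16)        # encoder states for a 10-word input sentence
alpha = torch.zeros(10)
alpha[0] = 0.9                 # high attention weight on word 1
alpha[1:] = 0.1 / 9            # remaining weight spread over the other words

# the aggregation: a single context vector, but word 1 dominates it
context = (alpha.unsqueeze(1) * h).sum(dim=0)
print(torch.allclose(context, 0.9 * h[0] + (0.1 / 9) * h[1:].sum(dim=0)))  # True
```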

  • @736939
    @736939 2 years ago +1

    What is the meaning of a bidirectional RNN? Why exactly is this type used for attention?

    • @SebastianRaschka
      @SebastianRaschka  2 years ago +4

      A bidirectional RNN sounds fancier than it really is. You can think of a standard RNN that you use on the input sentence as usual. Then, you use it again on the sentence with the words in reversed order. Then, you concatenate the two representations: the one from the forward sentence and the one from the reversed sentence. Why? I guess it's because you want to capture more of the context: for some words, the relevant words come before them, and for some they come after. (There is a small sketch of this at the end of this thread.)

    • @736939
      @736939 2 года назад

      @@SebastianRaschka Thank you professor.
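
A minimal sketch of the "run forward, run reversed, concatenate" idea described above. The two separate nn.RNN modules and the toy dimensions are illustrative assumptions; PyTorch's built-in bidirectional=True option packages the same idea.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, T = 8, 16, 5
x = torch.randn(1, T, embed_dim)               # one toy sentence, batch_first layout

rnn_fwd = nn.RNN(embed_dim, hidden_dim, batch_first=True)
rnn_bwd = nn.RNN(embed_dim, hidden_dim, batch_first=True)

out_fwd, _ = rnn_fwd(x)                        # states from reading left to right
out_bwd, _ = rnn_bwd(torch.flip(x, dims=[1]))  # states from reading right to left
out_bwd = torch.flip(out_bwd, dims=[1])        # re-align with the original word positions

h = torch.cat([out_fwd, out_bwd], dim=-1)      # concatenated states, shape (1, T, 2*hidden_dim)
print(h.shape)
```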

  • @Mvkv4L
    @Mvkv4L 1 year ago

    Hi Sebastian. I hope you're doing well. I have a question about the attention weights (alpha) and the energies (e) and I was hoping you would help. What are the shapes of alpha and e? Are they vectors or scalars?
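
For what it's worth, in the standard Bahdanau formulation each individual e_{t,t'} and alpha_{t,t'} is a scalar; collecting them over all encoder positions t' = 1, ..., T gives one length-T vector per decoder step t. A toy sketch with assumed shapes:

```python
import torch

T_enc = 5                             # number of encoder time steps
e_t = torch.randn(T_enc)              # e_{t,1}, ..., e_{t,T}: one scalar per encoder position
alpha_t = torch.softmax(e_t, dim=0)   # alpha_{t,1}, ..., alpha_{t,T}: non-negative, sums to 1
print(e_t.shape, alpha_t.shape, float(alpha_t.sum()))
```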

  • @EngineeredFemale
    @EngineeredFemale 1 year ago

    Thanks. The video was clear.

  • @yosimadsu2189
    @yosimadsu2189 1 year ago

    Thanks. More detailed, but not the best. It lacks actual values being calculated. Still confusing, though.

  • @nayanrainakaul9813
    @nayanrainakaul9813 7 months ago

    Thanks kween
