Finally an actual _explanation_ of self-attention, particularly of the key, value and query that was bugging me a lot. Thanks so much!
Exactly! Thanks Mr. Bloem!
OMG, me too, I was thinking of relational databases because they kept saying database and it wasn't making any sense.
This is the best explanation of self-attention I have ever seen! Thank you VERY MUCH!
Wow - only 700 views for probably the best explanation of Transformers I came across so far! Really nice work! Keep it up!!! (FYI: I also read the blog post)
A very clear and broken down explanation of self-attention. Definitely deserves much more recognition.
Best explanation out there, highly recommended. Thank you!
Saved lots of hours with this simple but awesome explanation of self-attention, thanks a lot!
The best ever video showing how self-attention works.
holy shit, been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video.
very well explained, very good video :)
This is the best explanation of multi-head self attention I've seen.
I have gone through 10+ videos on this, but this is the best... hats off!
This is the kind of content that deserves the like, subscribe and share promotion. Thank you for your efforts, keep it up!
Google should rank videos according to likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches. I have seen and read plenty, and this is the first time the QKV-as-dictionary versus RDBMS analogy made sense; that confusion had been so bad it literally stopped my thinking every time I had to consider Q, or K, or V, and thus prevented me from grokking the big idea. I now want to watch/read everything by you.
This is a really excellent video. I was finding this a very confusing topic but I found it really clarifying.
Literally the BEST explanation of attention and transformers EVER!! Agree with everyone else wondering why this is not ranked higher :(
I'm just glad I found it !
Read the blog post and then found this presentation, what a gift!
I had to leave a comment, the best explanation of Query, Key, Value I have seen!
I think one of the best videos describing self-attention. Thank you for sharing.
Take my money, you deserve everything, greatest of all time. GOAT.
This is the best explanation I have ever heard.
Best explanation I found for self-attention and multi-head attention on the internet, thank you sir.
Fantastic explanation for self-attention
Better than the Karpathy explainer video. Enough said!
The best explanation of transformers and self-attention! I am watching all of your videos :)
Finally I have an intuitive view of self-attention. Thank you 😇
Thank you! This is the best introductory video to self-attention!
Great lecture! I really appreciated your presentation by starting with simple self-attention, very helpful.
Very nice explanation of self-attention.
Great video explanation, and there is also a good written version of this for those interested.
Thank you very much professor.
Best transformer explanation so far !!!
Absolutely amazing series of videos! Congrats!
This is a very clear explanation. Why does YouTube not recommend it??!
This is incredible and deserves a lot more views! (glad YouTube ranked it high for me to discover it :))
This really is a spectacular explanation.
Very good course which is easy to understand!
This is a spectacular explanation of transformers. Thank you very much!
Thank you. This is as simple as it can get. Thanks a lot!!!
Best explanation ever! Congratulations and thank you!
Thank you for the video and the slides. Your explanations are very clear.
Thanks, man! You packed some really complex concepts into a very short video.
Going to watch more material that you are producing.
Thanks for sharing! Nice and clear video!
Finally found the best explanation, TY.
You are a great teacher.
This is a great explanation. I have to admit I read your blog thinking the video was just a summary of it, but it's much better than expected. I would appreciate it if you could create lectures in the future on how transformers are used for image recognition; I suspect we are just getting started with self-attention and we'll start seeing more of it in CV.
Man, I was looking for this for a long time, thank you very much for this best explanation, yep it's the best. Btw, YouTube recommended this video, I guess this is the power of self-attention in recommender systems.
Pure gold!
Thank you.
Best explanation I have seen so far on the topic! One of the few that describe the underlying math and not just show a simple flowchart.
The only thing that confuses me: at 6:24 you say W = X^T X, but on your website you show a PyTorch implementation with W = X X^T. Depending on which you use, you get either a [k x k] or a [t x t] matrix?
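For anyone hitting the same confusion, here is a tiny shape check (a sketch that assumes X stores the t word vectors as rows; the slide and the blog post may lay X out differently, which is exactly where the two formulas come from):

```python
import torch

t, k = 5, 16                      # sequence length, embedding dimension
X = torch.randn(t, k)             # here the t word vectors are the rows of X

print((X @ X.t()).shape)          # torch.Size([5, 5])   -> one weight per pair of positions
print((X.t() @ X).shape)          # torch.Size([16, 16]) -> one number per pair of embedding dims

# Self-attention needs the t x t matrix (a weight for every pair of words),
# so which of the two formulas is "right" depends only on whether X stores
# the word vectors as rows or as columns.
```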
Watching this video feels like trying to decipher alien scriptures with a blindfold on.
Thank you so much! This was amazing! Keep it up! This video is so underrated. I will share. :)
This is a great explanation, thanks so much! I got really sick of explanations just skipping over most of the details.
9:55 If the sequence gets longer, the weights become smaller (softmax with many components): is it better to have shorter sequences?
Good sir, I thank ye for this educational video with nice visuals
This is the best video I've seen on attention models. The only thing is that I don't think the explanation of the multi-head part at minute 19 is accurate. What multi-head does is not treating the words "too" and "terrible" differently from the word "restaurant". What it does is that, instead of using the same weight for all elements of the embedding vector, as shown at 5:30, it calculates 2 weights, one for each half of the embedding vector. So, in other words, we break down the embedding vectors of the input words into small pieces and do self-attention on ALL embedding sub-vectors, as opposed to doing self-attention for the embeddings of "too" and "terrible" differently from the attention for "restaurant".
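If it helps others, here is a minimal sketch of the "one weight per half of the embedding" idea described above (the dimensions are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 8)             # 3 word vectors, embedding dimension 8
halves = x.view(3, 2, 4)          # split each embedding into 2 sub-vectors ("heads")

# One t-by-t attention matrix per sub-vector instead of a single shared one.
w = F.softmax(torch.einsum('ihd,jhd->hij', halves, halves), dim=-1)
print(w.shape)                    # torch.Size([2, 3, 3])
```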
Thanks for such a nice explanation!
Such a great explanation with examples :-) One has to love it. Thank you.
Perfect explanation, but don't we have a softmax operation in practical SA just like in simple SA? I could not see the softmax in the representation of practical SA (18:42), unlike simple SA (05:16).
Excellent, excellent, excellent!
So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.
Man, if only I had found this video early on during my academic project, I would've probably been able to do a whole lot better. Shame it's already about to end.
VU Master's student here revisiting this lecture to help with my thesis. Super easy to get back into after a few months away from the concept. I did deep learning last December and I have to say it's my favourite course of the entire degree, mostly due to the clear and concise explanations given by the lecturers. I have one question though: I'm confused as to how simple self-attention would learn, since it essentially doesn't use any parameters? I feel I'm missing something here. Thanks!
You are a God.
Hello, can someone explain this to me: won't the key and the values be the same for each iteration, comparing it to 5:29? Please help me with this.
Thanks, found this very useful !!
best explanation ever!
Great video. Thanks
I love it when you are talking about the different ways of implementing multi-head attention; there are so many tutorials just glossing over it or taking it for granted, but I wish to know more details @ 20:30. I came here because your article discussed it, but I did not feel I had a very clear picture. Here, with the video, I still feel unclear. Which one was implemented in the Transformer and which one in BERT? Supposing they cut the original input vector matrix into 8 or 12 chunks, why did I not see in their code the start of each section? I only saw a line dividing the input dimension by the number of heads. That's all. How would the attention heads know the input vector indices they need to work on? Somehow I feel the heads need to know the starting index ...
Thanks for your kind words! In the slide you point to, the bottom version is used in every implementation I've seen. The way this "cutting up" is usually done is with a view operation. If I take a vector x of length 128 and do x.view(8, 16), I get a matrix with 8 rows and 16 columns, which I can then interpret as the 8 vectors of length 16 that will go into the 8 different heads.
Here is that view() operation in the Huggingface GPT2 implementation: github.com/huggingface/transformers/blob/8719afa1adc004011a34a34e374b819b5963f23b/src/transformers/models/gpt2/modeling_gpt2.py#L208
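For readers who want to try this, a minimal version of that reshaping (an illustrative sketch, not the Huggingface code itself):

```python
import torch

x = torch.randn(128)              # one input vector of length 128
heads = x.view(8, 16)             # reinterpret it as 8 head-vectors of length 16
print(heads.shape)                # torch.Size([8, 16])

# The same idea applied to a batched (batch, seq, emb) tensor:
b, t, e, h = 4, 10, 128, 8
X = torch.randn(b, t, e)
X_heads = X.view(b, t, h, e // h).transpose(1, 2)   # (batch, heads, seq, emb // heads)
print(X_heads.shape)              # torch.Size([4, 8, 10, 16])
```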
This is gold.
I loved your explanation!!!
Thanks for the video! It was super helpful
Question: At 8:46, may I please ask why, since Y is defined as a multiplication, it is purely linear and thus has a non-vanishing gradient, meaning the gradient will be that of a linear operation? While W = softmax(XX^T) is non-linear and thus can cause vanishing gradients. Second, what is the relationship between linearity/non-linearity and vanishing/non-vanishing gradients?
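For anyone puzzling over the same point, one way to phrase the connection (my own summary, not from the lecture): the gradient of a linear map is a constant that does not depend on how large its inputs are, whereas the softmax's gradient shrinks towards zero as its inputs grow, which is exactly the vanishing-gradient risk.

```latex
% Linear part: the gradient is the (constant) weight, independent of x.
\frac{\partial}{\partial x_j} \sum_i w_i x_i = w_j

% Softmax part: with p = \operatorname{softmax}(z),
\frac{\partial p_i}{\partial z_j} = p_i \,(\delta_{ij} - p_j)

% If one z_k dominates, p approaches a one-hot vector, so every entry
% p_i(\delta_{ij} - p_j) approaches zero and the gradient vanishes.
```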
Thanks for the great explanation! Just one question: if simple self-attention has no parameters, how can we expect it to learn? It is not trainable.
The presentation on self-attention was very well put. Thank you for uploading this. I had a doubt @15:56 about how it will suffer from vanishing gradients without the normalization. As dimensionality increases, the overall dot product should be larger. Wouldn't this be a case of exploding gradients? I'd really love some insight on this.
EDIT: Listened more carefully again. The vanishing gradient is in the "softmax" operation. Got it now. Great video 🙂
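A tiny numerical illustration of that softmax-saturation point (just a sketch, not the lecture code): without the 1/sqrt(k) scaling, the dot products grow with the dimension, the softmax output becomes nearly one-hot, and its gradient goes towards zero.

```python
import torch
import torch.nn.functional as F

k = 512
q = torch.randn(k)
keys = torch.randn(5, k)

raw = keys @ q                    # dot products whose typical size grows with sqrt(k)
scaled = raw / k ** 0.5           # the normalisation discussed around 15:56

print(F.softmax(raw, dim=0))      # usually close to a one-hot vector: near-zero gradients
print(F.softmax(scaled, dim=0))   # a noticeably softer distribution
```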
Great video, I just have a question:
When we compute the weights that are then multiplied by the value, are these vectors or just a single scalar value? I know we used the dot product to get w so it should be just a single scalar value, but just wanted to confirm.
As an example, at 5:33 are the values for w a single value or vectors?
Yes, it is a single scalar, the result of the dot product, further normalized by the softmax so that the sum of all weights equals one.
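A quick check of that, with made-up shapes (illustration only):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 4)             # 3 word vectors of dimension 4

w_raw = torch.dot(x[0], x[1])     # one raw weight: a single scalar per pair of words
w = F.softmax(x @ x[0], dim=0)    # all raw weights for word 0, normalised together

print(w_raw.shape)                # torch.Size([]) -> a scalar
print(w.sum())                    # the normalised weights sum to 1
```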
Wow, this is unique.
I know this is the best explanation about transformers I've come across so far. Still, I'm having a problem understanding the key, query and value part. Is there any recommendation where I can learn it completely from the basics? Thanks in advance.
This video was raaaaad THANK YOU
Amazing video!!
Hi, extremely helpful video here, I really appreciate it, but I have a question: I don't understand how multi-head self-attention works if we are not generating extra parameters for each stack of self-attention layers. What is the difference in each stack so that we can grasp the different relations of the same word in each layer?
Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning somehow dividing the 768 vector so you could have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right.) After that, for each of these 3 attention heads we have 3 sets of parameters [K, Q, V] (superscripted for each attention head). For each attention head our K will be of dimension 256 x whatever, Q will be of dimension 256 x whatever and V will be of dimension 256 x whatever. And this is for one head. Concatenating all the learned vectors K, Q and V will end up at 768 x whatever for each of them, the exact size that we would have with a single attention. Voila.
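A rough PyTorch sketch of that head-splitting view, using the dimensions from the comment above (768 split over 3 heads; the names and layout are my own, not the lecture code):

```python
import torch
import torch.nn as nn

emb, heads = 768, 3
head_dim = emb // heads                  # 256 per head
x = torch.randn(10, emb)                 # a sequence of 10 token vectors

# One 768 -> 768 projection each for Q, K and V, computed once...
to_q = nn.Linear(emb, emb, bias=False)
to_k = nn.Linear(emb, emb, bias=False)
to_v = nn.Linear(emb, emb, bias=False)

# ...and then cut into heads with a view: no extra parameters per head.
q = to_q(x).view(10, heads, head_dim)    # (seq, heads, 256)
k = to_k(x).view(10, heads, head_dim)
v = to_v(x).view(10, heads, head_dim)

print(q.shape, k.shape, v.shape)         # each torch.Size([10, 3, 256])
# Concatenating the head outputs restores the full 768 dimensions, so the
# parameter count matches a single-head 768 -> 768 attention.
```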
Hi, thank you for this nice explanation. However, there is one thing that I don't get. How can the self-attention model, for instance in the sentence "John likes his new shoes", compute a high value for "his" and "John"? I mean, we know that they are related, but the embeddings for these words can be very different. Hope you can help me out :)
On page 23, should it not be k_i q_j rather than k_j q_i?
I totally agree with your opinion.
You're right. Thanks for the pointer. We'll fix this in any future versions of the video.
@@dlvu6202 Why the change? I think we are querying with the current i-th input against every other j-th input, and the figure looks right to me.
@@joehsiao6224 It's semantics really. Since the key and query are derived from the same vector it's up to you which you call the key and which the query, so the figure is fine in the sense that it would technically work without problems. However, given the analogy with the discrete key-value store, it makes most sense to say that the key and value come from the same input vector (i.e. have the same index) and that the query comes from a (potentially) different input.
@@dlvu6202 it makes sense. Thanks for the reply!
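To make that key-value-store analogy concrete, here is a small sketch contrasting a hard dictionary lookup with the soft lookup that attention performs (the variable names are illustrative only):

```python
import torch
import torch.nn.functional as F

# Hard lookup: the query must match one key exactly.
d = {'a': 1, 'b': 2, 'c': 3}
print(d['b'])                             # 2

# Soft lookup: the query matches every key to some degree, and the result
# is a weighted mixture of all the values.
keys = torch.randn(3, 4)                  # keys and values share an index,
values = torch.randn(3, 4)                # i.e. come from the same inputs
query = torch.randn(4)                    # derived from a (possibly) different input

weights = F.softmax(keys @ query, dim=0)  # how strongly the query matches each key
result = weights @ values                 # a mixture of the values
print(result.shape)                       # torch.Size([4])
```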
At 16:43, why is d['b'] = 3 rather than 2?
This was a mistake, apologies. We'll fix this in the slides.
Fine, I will subscribe.
Everything was clear till the query, key and value.. does anyone have a slower video or resource for understanding??
How does self-attention make sense on word embeddings, where each word is represented by a random vector, so that this self-correlation has no meaning?
thank you so much
I finally understand it, Jesus Christ.
Is this ASMR?
Thank you very much