Lecture 12.1 Self-attention

  • Published: 26 Dec 2024

Comments • 99

  • @derkontrolleur1904
    @derkontrolleur1904 4 years ago +98

    Finally an actual _explanation_ of self-attention, particularly of the key, value and query that was bugging me a lot. Thanks so much!

    • @Epistemophilos
      @Epistemophilos 3 years ago +1

      Exactly! Thanks Mr. Bloem!

    • @rekarrkr5109
      @rekarrkr5109 2 years ago +1

      OMG, me too, I was thinking of relational databases because they were saying "database" and it wasn't making any sense.

  • @ArashKhoeini
    @ArashKhoeini 1 year ago +8

    This is the best explanation of self-attention I have ever seen! Thank you VERY MUCH!

  • @constantinfry3087
    @constantinfry3087 4 years ago +28

    Wow - only 700 views for probably the best explanation of Transformers I came across so far! Really nice work! Keep it up!!! (FYI: I also read the blog post)

  • @sohaibzahid1188
    @sohaibzahid1188 2 years ago +4

    A very clear and broken down explanation of self-attention. Definitely deserves much more recognition.

  • @dhruvjain4372
    @dhruvjain4372 3 years ago +17

    Best explanation out there, highly recommended. Thank you!

  • @tizianofranza2956
    @tizianofranza2956 3 years ago +6

    Saved lots of hours with this simple but awesome explanation of self-attention, thanks a lot!

  • @thcheung
    @thcheung 3 years ago

    The best ever video showing how self-attention works.

  • @HiHi-iu8gf
    @HiHi-iu8gf 1 year ago

    Holy shit, I'd been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video.
    Very well explained, very good video :)

  • @josemariabraga3380
    @josemariabraga3380 2 years ago

    This is the best explanation of multi-head self attention I've seen.

  • @sathyanarayanankulasekaran5928
    @sathyanarayanankulasekaran5928 3 years ago

    I have gone through 10+ videos on this, but this is the best... hats off

  • @szilike_10
    @szilike_10 2 years ago +1

    This is the kind of content that deserves the like, subscribe and share promotion. Thank you for your efforts, keep it up!

  • @MrOntologue
    @MrOntologue 1 year ago +2

    Google should rank videos according to likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches, because I have seen and read plenty, and this is the first time the QKV-as-dictionary vs. RDBMS distinction made sense. That confusion had been so bad it literally stopped me thinking every time I had to consider Q, or K, or V, and thus prevented me from grokking the big idea. I now want to watch/read everything by you.

  • @Ariel-px7hz
    @Ariel-px7hz 2 years ago +3

    This is a really excellent video. I was finding this a very confusing topic, but this video really clarified it.

  • @workstuff6094
    @workstuff6094 3 years ago

    Literally the BEST explanation of attention and transformers EVER!! I agree with everyone else wondering why this is not ranked higher :(
    I'm just glad I found it!

  • @maxcrous
    @maxcrous 4 years ago

    Read the blog post and then found this presentation, what a gift!

  • @free_guac
    @free_guac 3 years ago

    I had to leave a comment, the best explanation of Query, Key, Value I have seen!

  • @farzinhaddadpour7192
    @farzinhaddadpour7192 1 year ago

    I think one of the best videos describing self-attention. Thank you for sharing.

  • @martian.07_
    @martian.07_ 2 years ago

    Take my money, you deserve everything, greatest of all time. GOAT.

  • @davidadewoyin468
    @davidadewoyin468 2 years ago

    This is the best explanation I have ever heard.

  • @AlirezaAroundItaly
    @AlirezaAroundItaly 1 year ago

    Best explanation I found for self-attention and multi-head attention on the internet, thank you sir.

  • @mohammadyahya78
    @mohammadyahya78 1 year ago

    Fantastic explanation for self-attention

  • @senthil2sg
    @senthil2sg 1 year ago

    Better than the Karpathy explainer video. Enough said!

  • @MonicaRotulo
    @MonicaRotulo 2 years ago

    The best explanation of transformers and self-attention! I am watching all of your videos :)

  • @Mars.2024
    @Mars.2024 7 months ago

    Finally I have an intuitive view of self-attention. Thank you 😇

  • @nengyunzhang6341
    @nengyunzhang6341 3 years ago

    Thank you! This is the best introductory video to self-attention!

  • @RioDeDoro
    @RioDeDoro 1 year ago

    Great lecture! I really appreciated that your presentation starts with simple self-attention; very helpful.

  • @bello3137
    @bello3137 1 year ago

    Very nice explanation of self-attention.

  • @peregudovoleg
    @peregudovoleg 2 years ago

    Great video explanation, and there is also a good written version of this for those interested.
    Thank you very much, professor.

  • @huitangtt
    @huitangtt 3 years ago

    Best transformer explanation so far !!!

  • @olileveque
    @olileveque 8 months ago

    Absolutely amazing series of videos! Congrats!

  • @imanmossavat9383
    @imanmossavat9383 2 years ago

    This is a very clear explanation. Why does YouTube not recommend it??!

  • @shandou5276
    @shandou5276 3 years ago +1

    This is incredible and deserves a lot more views! (glad YouTube ranked it high enough for me to discover it :))

  • @BenRush
    @BenRush 1 year ago +1

    This really is a spectacular explanation.

  • @junliu7398
    @junliu7398 2 years ago

    Very good course which is easy to understand!

  • @juliogodel
    @juliogodel 4 years ago +1

    This is a spectacular explanation of transformers. Thank you very much!

  • @impracticaldev
    @impracticaldev 1 year ago

    Thank you. This is as simple as it can get. Thanks a lot!!!

  • @fredoliveira7569
    @fredoliveira7569 1 year ago

    Best explanation ever! Congratulations and thank you!

  • @marcolehmann6477
    @marcolehmann6477 4 years ago +1

    Thank you for the video and the slides. Your explanations are very clear.

  • @muhammadumerjavaid6663
    @muhammadumerjavaid6663 4 years ago

    Thanks, man! You packed some really complex concepts into a very short video.
    Going to watch more of the material you are producing.

  • @润龙廖
    @润龙廖 1 year ago

    Thanks for sharing! Nice and clear video!

  • @rahulmodak6318
    @rahulmodak6318 3 years ago

    Finally found the best explanation. TY.

  • @zadidhasan4698
    @zadidhasan4698 1 year ago

    You are a great teacher.

  • @maganaluis92
    @maganaluis92 2 years ago

    This is a great explanation. I have to admit I read your blog thinking the video would just be a summary of it, but it's much better than expected. I would appreciate it if you could create lectures in the future on how transformers are used for image recognition; I suspect we are just getting started with self-attention and we'll start seeing more of it in CV.

  • @karimedx
    @karimedx 3 years ago +1

    Man, I was looking for this for a long time; thank you very much for this explanation, yep, it's the best. Btw, YouTube recommended this video; I guess this is the power of self-attention in recommender systems.

  • @m1hdi333
    @m1hdi333 2 months ago

    Pure gold!
    Thank you.

  • @Markste-in
    @Markste-in 3 years ago +1

    Best explanation I have seen so far on the topic! One of the few that describe the underlying math and don't just show a simple flowchart.
    The only thing that confuses me: at 6:24 you say W = X^T X, but on your website you show a PyTorch implementation with W = X X^T. Depending on which you use, you get either a [k x k] or a [t x t] matrix?
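    A minimal PyTorch sketch of the shape difference (my own illustration with assumed sizes, not the lecture's code); self-attention needs one weight per pair of tokens, i.e. the [t x t] version, so which product is right depends on whether rows or columns index the tokens:

        import torch

        t, k = 5, 16                          # t tokens, each a k-dimensional embedding (rows are tokens)
        x = torch.randn(t, k)

        w_tokens = x @ x.t()                  # (t, t): one raw weight for every pair of tokens
        w_dims   = x.t() @ x                  # (k, k): pairs of embedding dimensions instead
        print(w_tokens.shape, w_dims.shape)   # torch.Size([5, 5]) torch.Size([16, 16])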

  • @saurabhmahra4084
    @saurabhmahra4084 1 year ago +1

    Watching this video feels like trying to decipher alien scriptures with a blindfold on.

  • @laveenabachani
    @laveenabachani 2 years ago

    Thank you so much! This was amazing! Keep it up! This video is so underrated. I will share. :)

  • @darkfuji196
    @darkfuji196 2 years ago

    This is a great explanation, thanks so much! I got really sick of explanations just skipping over most of the details.

  • @WahranRai
    @WahranRai 2 years ago +1

    9:55 If the sequence gets longer, the weights become smaller (softmax over many components): is it better to have shorter sequences?
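    A tiny sketch of the effect (my own illustration, uniform logits just to isolate the length effect); note the weights still sum to 1, they are just spread over more positions:

        import torch

        print(torch.softmax(torch.zeros(4), dim=0))     # tensor([0.2500, 0.2500, 0.2500, 0.2500])
        print(torch.softmax(torch.zeros(100), dim=0))   # 100 entries of 0.0100 each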

  • @stephaniemartinez1294
    @stephaniemartinez1294 2 years ago

    Good sir, I thank ye for this educational video with nice visuals

  • @Raven-bi3xn
    @Raven-bi3xn 1 year ago

    This is the best video I've seen on attention models. The only thing is that I don't think the explanation of the multi-head part at minute 19 is accurate. What multi-head does is not treat the words "too" and "terrible" differently from the word "restaurant". What it does is that, instead of using the same weight for all elements of the embedding vector, as shown at 5:30, it calculates 2 weights, one for each half of the embedding vector. So, in other words, we break the embedding vectors of the input words into smaller pieces and apply self-attention to ALL embedding sub-vectors, as opposed to doing self-attention for the embeddings of "too" and "terrible" differently from the attention for "restaurant".

  • @clapathy
    @clapathy 2 years ago

    Thanks for such a nice explanation!

  • @MrFunlive
    @MrFunlive 3 years ago

    Such a great explanation with examples :-) One has to love it. Thank you.

  • @metehkaya96
    @metehkaya96 2 months ago

    Perfect explanation, but don't we have a softmax operation in the practical SA, just like in simple SA? I could not see softmax in the representation of practical SA (18:42), unlike simple SA (05:16).
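    For reference, the standard scaled dot-product formulation does include the softmax,

        Attention(Q, K, V) = softmax(Q K^T / sqrt(k)) V,

    so I assume the normalization is still applied in the practical version and is just not drawn in that slide?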

  • @randomdudepostingstuff9696
    @randomdudepostingstuff9696 2 years ago

    Excellent, excellent, excellent!

  • @aiapplicationswithshailey3600
    @aiapplicationswithshailey3600 1 year ago

    So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.

  • @WM_1310
    @WM_1310 2 years ago +1

    Man, if only I had found this video early on during my academic project, I would probably have been able to do a whole lot better. Shame it's already about to end.

  • @somewisealien
    @somewisealien 2 years ago

    VU Master's student here, revisiting this lecture to help with my thesis. Super easy to get back into after a few months away from the concept. I did Deep Learning last December, and I have to say it's my favourite course of the entire degree, mostly due to the clear and concise explanations given by the lecturers. I have one question though: I'm confused as to how simple self-attention would learn, since it essentially doesn't use any parameters? I feel I'm missing something here. Thanks!

  • @TheCrmagic
    @TheCrmagic 3 years ago +1

    You are a God.

  • @soumilbinhani8803
    @soumilbinhani8803 1 year ago

    Hello, can someone explain this to me: won't the key and the values be the same for each iteration, as we compare it to 5:29? Please help me with this.

  • @SilePastile
    @SilePastile 3 years ago

    Thanks, found this very useful !!

  • @deepu1234able
    @deepu1234able 3 years ago

    best explanation ever!

  • @ChrisHalden007
    @ChrisHalden007 1 year ago

    Great video. Thanks

  • @ax5344
    @ax5344 3 years ago

    I love it when you talk about the different ways of implementing multi-head attention; there are so many tutorials that just gloss over it or take it for granted, but I wanted to know more details @ 20:30. I came here because your article discussed it, but I didn't feel I had a clear enough picture, and with the video I still feel unclear. Which one was implemented in the Transformer and which one for BERT? Supposing they cut the original input vector matrix into 8 or 12 chunks, why did I not see in their code the start of each section? I only saw a line dividing the input dimension by the number of heads. That's all. How would the attention head know which input vector indices it needs to work on? Somehow I feel the head needs to know the starting index...

    • @dlvu6202
      @dlvu6202  3 years ago +1

      Thanks for your kind words! In the slide you point to, the bottom version is used in every implementation I've seen. The way this "cutting up" is usually done is with a view operation. If I take a vector x of length 128 and do x.view(8, 16), I get a matrix with 8 rows and 16 columns, which I can then interpret as the 8 vectors of length 16 that will go into the 8 different heads.
      Here is that view() operation in the Huggingface GPT2 implementation: github.com/huggingface/transformers/blob/8719afa1adc004011a34a34e374b819b5963f23b/src/transformers/models/gpt2/modeling_gpt2.py#L208
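      As a minimal sketch of that view() trick (sizes chosen just for illustration):

          import torch

          x = torch.randn(128)          # one 128-dimensional vector
          print(x.view(8, 16).shape)    # torch.Size([8, 16]): 8 head-vectors of length 16

          # batched version, as in most implementations:
          xb = torch.randn(2, 10, 128)                   # (batch, tokens, emb)
          xh = xb.view(2, 10, 8, 16).transpose(1, 2)     # (batch, heads, tokens, emb // heads)
          print(xh.shape)                                # torch.Size([2, 8, 10, 16])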

  • @Mewgu_studio
    @Mewgu_studio 1 year ago

    This is gold.

  • @adrielcabral6634
    @adrielcabral6634 1 year ago

    I loved your explanation!!!

  • @LukeZhang1
    @LukeZhang1 4 years ago

    Thanks for the video! It was super helpful

  • @mohammadyahya78
    @mohammadyahya78 1 year ago

    Question: At 8:46, may I ask why, since Y is defined as a multiplication, it is purely linear and thus has a non-vanishing gradient (i.e. the gradient is that of a linear operation), while W = softmax(XX^T) is non-linear and thus can cause vanishing gradients? Second, what is the relationship between linearity/non-linearity and vanishing/non-vanishing gradients?
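    A small sketch of the distinction I am asking about (my own experiment, not from the lecture): the gradient of a purely linear map is constant in the inputs, while the softmax saturates, so its gradients can become arbitrarily small when one input dominates:

        import torch

        # softmax saturates: when one logit dominates, the gradients almost vanish
        x = torch.tensor([8.0, 0.0, 0.0], requires_grad=True)
        torch.softmax(x, dim=0)[0].backward()
        print(x.grad)              # roughly [ 0.0007, -0.0003, -0.0003 ]

        # a linear map has a constant gradient, no matter how large the inputs get
        w = torch.randn(3, 3)
        y = torch.tensor([8.0, 0.0, 0.0], requires_grad=True)
        (w @ y)[0].backward()
        print(y.grad)              # simply the first row of w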

  • @小孟滴儿丫
    @小孟滴儿丫 1 year ago

    Thanks for the great explanation! Just one question: if simple self-attention has no parameters, how can we expect it to learn? It is not trainable.

  • @balasubramanyamevani7752
    @balasubramanyamevani7752 2 years ago

    The presentation on self-attention was very well put together. Thank you for uploading this. I had a doubt @15:56 about how it would suffer from vanishing gradients without the normalization. As the dimensionality increases, the overall dot product should be larger; wouldn't this be a case of exploding gradients? I'd really love some insight on this.
    EDIT: Listened more carefully again. The vanishing gradient is in the "softmax" operation. Got it now. Great video 🙂

  • @recessiv3
    @recessiv3 2 years ago

    Great video, I just have a question:
    When we compute the weights that are then multiplied by the values, are these vectors or just single scalar values? I know we used the dot product to get w, so it should be just a single scalar value, but I just wanted to confirm.
    As an example, at 5:33, are the values for w single values or vectors?

    • @TubeConscious
      @TubeConscious 2 years ago +1

      Yes, it is a single scalar: the result of the dot product, further normalized by softmax so that the sum of all weights equals one.
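      In code, a minimal sketch (sizes assumed, not from the lecture):

          import torch

          t, k = 4, 8
          x = torch.randn(t, k)                 # t token vectors of dimension k

          w_prime = x @ x.t()                   # raw weights: w'_ij = x_i . x_j, each one a scalar
          w = torch.softmax(w_prime, dim=-1)    # normalize every row
          print(w_prime.shape, w[0].sum())      # torch.Size([4, 4]), each row sums to ~1.0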

  • @kafaayari
    @kafaayari 3 years ago

    Wow, this is unique.

  • @manojkumarthangaraj2122
    @manojkumarthangaraj2122 3 years ago

    This is the best explanation of transformers I've come across so far. Still, I'm having a problem understanding the key, query and value part. Is there any recommendation where I can learn it completely from the basics? Thanks in advance.

  • @geoffreysworkaccount160
    @geoffreysworkaccount160 3 years ago

    This video was raaaaad THANK YOU

  • @wolfisraging
    @wolfisraging 3 years ago

    Amazing video!!

  • @ecehatipoglu209
    @ecehatipoglu209 1 year ago

    Hi, extremely helpful video here, I really appreciate it, but I have a question: I don't understand how multi-head self-attention works if we are not generating extra parameters for each stack of the self-attention layer. What is the difference in each stack so that we can grasp the different relations of the same word in each layer?

    • @ecehatipoglu209
      @ecehatipoglu209 1 year ago

      Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning you somehow divide the 768 vector so you have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right.) After that, for each of these 3 attention heads we have 3 sets of parameters [K, Q, V] (one set per attention head). For each attention head, our K will be of dimension 256 x whatever, Q will be of dimension 256 x whatever and V will be of dimension 256 x whatever. And this is for one head. Concatenating all the learned vectors K, Q and V ends up at 768 x whatever for each of them, the exact size that we would have with a single attention head. Voila.
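      In a rough sketch (dimensions as in my example above; the layer names are made up, not taken from the lecture code):

          import torch
          import torch.nn as nn

          emb, heads = 768, 3
          s = emb // heads                          # 256 dimensions per head
          x = torch.randn(10, emb)                  # 10 tokens

          # per-head projections on each 256-dim chunk ("narrow" multi-head self-attention)
          to_q = nn.ModuleList(nn.Linear(s, s, bias=False) for _ in range(heads))
          to_k = nn.ModuleList(nn.Linear(s, s, bias=False) for _ in range(heads))
          to_v = nn.ModuleList(nn.Linear(s, s, bias=False) for _ in range(heads))

          chunks = x.view(-1, heads, s)             # (10, 3, 256): the split itself is just a reshape
          outs = []
          for h in range(heads):
              q, k, v = to_q[h](chunks[:, h]), to_k[h](chunks[:, h]), to_v[h](chunks[:, h])
              w = torch.softmax(q @ k.t() / s ** 0.5, dim=-1)   # (10, 10) attention weights for this head
              outs.append(w @ v)                                # (10, 256) output per head
          y = torch.cat(outs, dim=-1)               # (10, 768): concatenation restores the original size
          print(y.shape)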

  • @turingmachine4122
    @turingmachine4122 3 years ago

    Hi, thank you for this nice explanation. However, there is one thing that I don't get. How can the self-attention model, for instance in the sentence "John likes his new shoes", compute a high value for "his" and "John"? I mean, we know that they are related, but the embeddings for these words can be very different. Hope you can help me out :)

  • @donnap6253
    @donnap6253 3 years ago +1

    On page 23, should it not be k_i q_j rather than k_j q_i?

    • @superhorstful
      @superhorstful 3 years ago

      I totally agree with your opinion.

    • @dlvu6202
      @dlvu6202  3 years ago

      You're right. Thanks for the pointer. We'll fix this in any future versions of the video.

    • @joehsiao6224
      @joehsiao6224 3 years ago

      @dlvu6202 Why the change? I think we are querying with the current i-th input against every other j-th input, and the figure looks right to me.

    • @dlvu6202
      @dlvu6202  3 years ago

      @@joehsiao6224 It's semantics really. Since the key and query are derived from the same vector it's up to you which you call the key and which the query, so the figure is fine in the sense that it would technically work without problems. However, given the analogy with the discrete key-value store, it makes most sense to say that the key and value come from the same input vector (i.e. have the same index) and that the query comes from a (potentially) different input.

    • @joehsiao6224
      @joehsiao6224 3 years ago

      @dlvu6202 It makes sense. Thanks for the reply!

  • @jrlandau
    @jrlandau 1 year ago

    At 16:43, why is d['b'] = 3 rather than 2?

    • @dlvu6202
      @dlvu6202  1 year ago

      This was a mistake, apologies. We'll fix this in the slides.

  • @abdot604
    @abdot604 2 years ago

    Fine, I will subscribe.

  • @edphi
    @edphi 1 year ago

    Everything was clear until the query, key and value... does anyone have a slower video or resource for understanding?

  • @cicik57
    @cicik57 2 years ago

    How does self-attention make sense on word embeddings, where each word is represented by a random vector, so that this self-correlation has no meaning?

  • @vadimen181
    @vadimen181 3 years ago

    thank you so much

  • @vukrosic6180
    @vukrosic6180 2 years ago +1

    I finally understand it jesus christ

  • @Isomorphist
    @Isomorphist 1 year ago

    Is this ASMR?

  • @abhishektyagi154
    @abhishektyagi154 3 years ago

    Thank you very much