CS480/680 Lecture 19: Attention and Transformer Networks

  • Published: 25 Jul 2024

Comments • 216

  • @pascalpoupart3507
    @pascalpoupart3507  4 years ago +134

    The slides are posted here: cs.uwaterloo.ca/~ppoupart/teaching/cs480-spring19/schedule.html

    • @tarunluthrabk
      @tarunluthrabk 3 years ago +7

      Hello Professor. Your explanations are amazing. Kindly pin this comment or add it to the description so that it is visible to everyone.

    • @majid0912
      @majid0912 1 year ago +1

      Dear Pascal, I was wondering if you have any presentations describing the article titled Neural Machine Translation By Jointly Learning To Align And Translate.

  • @SuperOnlyP
    @SuperOnlyP 3 years ago +273

    Finally, someone explains simply what queries, keys, and values are in the transformer model. Thank you, Sir!!!

    • @mathisve
      @mathisve 3 years ago +23

      Yeah, I don't understand why nobody else goes over this seemingly pretty important detail

    • @stackoverflow8260
      @stackoverflow8260 3 years ago +14

      Wow, I was going to ask why he didn't explain or give an example for query, key, and value in the case of a simple language translation or modelling example. The machine learning community is not very good at conveying its ideas; when you can't put stuff in rigorous mathematics, at least use a lot of pictures and many examples at every possible step.

    • @andrii5054
      @andrii5054 3 years ago +9

      I can also recommend this explanation: ruclips.net/video/mMa2PmYJlCo/видео.html
      It has helped me a lot

    • @SuperOnlyP
      @SuperOnlyP 3 years ago +1

      @@andrii5054 The video really simplifies the concept. Thanks for sharing!

    • @Darkev77
      @Darkev77 2 years ago +1

      Where was that?

  • @zengrz
    @zengrz 3 years ago +195

    00:00 Attention
    31:32 Transformer
    47:15 Masked Multi-head Attention
    1:01:45 Layer normalization, Positional embedding

  • @graceln2480
    @graceln2480 2 years ago +3

    One of the best explanations of attention & transformers on RUclips. Most of the other videos are junk, with authors pretending to understand the concepts and just adding to the RUclips clutter.

  • @sudhirghandikota1382
    @sudhirghandikota1382 4 years ago +54

    Thank you very much Dr. Poupart. This is the best explanation of transformers I have come across on the internet

  • @GotUpLateWithMoon
    @GotUpLateWithMoon 3 years ago +15

    This is the best lecture on the attention mechanism I can find! Thank you Dr. Poupart! Finally all the details made sense to me.

  • @drdr3496
    @drdr3496 1 year ago +10

    This is the single best video on "Attention Is All You Need", attention, transformers, etc. on the Internet. It's as simple as that. Thanks Dr Poupart.

    • @bleacherz7503
      @bleacherz7503 1 year ago

      Why does a dot product correlate to attention?

    • @drdr3496
      @drdr3496 1 year ago +1

      @@bleacherz7503 a dot product between two vectors shows how similar they are

    • @seldan6698
      @seldan6698 1 year ago

      @@drdr3496 Nice. Can you explain the whole query, key, and value process for an example like "the cat sat on the mat"? What are the query, keys, and values for this sentence?

    • @robn2497
      @robn2497 4 months ago

      ty
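
For the "the cat sat on the mat" question above, here is a minimal sketch of scaled dot-product attention with made-up numbers: every embedding and projection matrix below is a random placeholder, not anything from the lecture, but it shows queries being compared to keys by dot products (the similarity mentioned above) and the resulting weights mixing the values.

```python
import numpy as np

np.random.seed(0)
words = ["the", "cat", "sat", "on", "the", "mat"]
d = 4                                  # toy embedding size
E = np.random.randn(len(words), d)     # one made-up embedding per word

# Learned projections (random here, purely illustrative)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = E @ W_q, E @ W_k, E @ W_v    # queries, keys, values

scores = Q @ K.T / np.sqrt(d)          # dot product = similarity between query i and key j
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over keys
output = weights @ V                   # each word's output is an attention-weighted mix of values

print(weights[2].round(2))             # how much "sat" attends to every word in the sentence
```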

  • @insoucyant
    @insoucyant 9 days ago

    Best video on attention that I have come across

  • @momusi.9819
    @momusi.9819 4 years ago +24

    Thank you very much, this was by far the best explanation of Transformers that I found online!

  • @cwtan501
    @cwtan501 3 years ago

    By far the best I have seen to explain multiheaded attention

  • @JMRC
    @JMRC 4 years ago +24

    Thank you to the person asking the question at 28:49! The softmax gave it away, but I wasn't sure.

  • @dennishuang3498
    @dennishuang3498 2 years ago +1

    Really enjoyed your lecture, Professor Poupart! Very informative, and it simplified many complicated concepts. Thank you very much!

  • @moustafa_shomer
    @moustafa_shomer 2 years ago +3

    This is the best Transformer / Attention explanation ever. Thank you

  • @tagrikli
    @tagrikli 3 years ago +175

    This video just cured my depression.

    • @judgeomega
      @judgeomega 3 years ago +22

      Don't worry, I'm sure the next visit to a public internet forum will once again obliterate hope in humanity

    • @100vivasvan
      @100vivasvan 3 years ago +3

      haha same here

    • @dilettante9576
      @dilettante9576 1 year ago +2

      Cured my ADHD

    • @Mrduralexx
      @Mrduralexx 1 year ago +2

      This video gave me depression…

    • @UmerBashir
      @UmerBashir 1 year ago +1

      @@Mrduralexx Yeah, it's a different level of anxiety that it instills

  • @AI_ML_iQ
    @AI_ML_iQ 1 year ago +9

    Recent work on transformers, titled "Generalized Attention Mechanism and Relative Position for Transformer", shows that different matrices for query and key are not required for the attention mechanism, thus reducing the number of parameters to be trained for the Transformers in GPT, other language models, and Transformers for images/videos.
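
A back-of-the-envelope sketch of the parameter saving such a simplification could give, assuming it amounts to reusing one projection matrix for both queries and keys (an illustration of the comment's claim, not the cited paper's exact construction):

```python
d_model = 512                       # model width, as in "Attention Is All You Need"

standard = 3 * d_model * d_model    # separate W_Q, W_K, W_V (output projection not counted)
shared_qk = 2 * d_model * d_model   # one shared query/key matrix plus W_V

print(standard, shared_qk)          # 786432 vs 524288: one third fewer attention projection weights
```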

  • @utkarshgupta7364
    @utkarshgupta7364 3 years ago +1

    Most awesome video on transformers one could find on youtube

  • @hamzaaliimran6441
    @hamzaaliimran6441 8 months ago

    One of the best and most detailed lectures on attention on YouTube, I must say.

  • @Hotheaddragon
    @Hotheaddragon 3 years ago +4

    You are a blessing, finally understood a very important concept.

  • @TylerMosaic
    @TylerMosaic 3 years ago +21

    Wow! Love the way he answers that great question at around 50:52: "why don't we implement the mask with a Hadamard product outside of the softmax?" Brilliant prof.
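
A small sketch of the point behind that question, assuming a standard causal mask: adding -inf to the forbidden scores before the softmax keeps every row a proper distribution over the allowed positions, whereas a Hadamard product applied after the softmax leaves rows that no longer sum to 1.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(4, 4)                # toy attention scores for a 4-token sequence
causal = np.tril(np.ones((4, 4), dtype=bool)) # position i may attend to positions <= i

# Mask inside the softmax: forbidden positions get -inf, each row still sums to 1
inside = softmax(np.where(causal, scores, -np.inf))

# Hadamard product outside the softmax: rows no longer sum to 1 (would need renormalizing)
outside = softmax(scores) * causal

print(inside.sum(axis=1))   # [1. 1. 1. 1.]
print(outside.sum(axis=1))  # below 1 wherever something was zeroed out
```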

  • @aadeshingle7593
    @aadeshingle7593 10 months ago

    Thanks a lot, Professor Poupart. One of the best explanations of the maths behind transformers!

  • @fengxie4762
    @fengxie4762 4 years ago +5

    A great lecture! Highly recommended!

  • @mi9807
    @mi9807 1 year ago

    One of the best videos!

  • @jinyang4796
    @jinyang4796 4 years ago

    Thank you for the clear explanation and well-illustrated examples!

  • @xhulioxhelilai9346
    @xhulioxhelilai9346 3 months ago

    Thank you for the very comprehensive and understandable course. Being in 2024, I can say that I can understand this course even better and more easily using GPT-4.

  • @orhan4876
    @orhan4876 8 months ago

    thank you for being so thorough!

  • @weichen1
    @weichen1 4 years ago +4

    I am not able to find a better video than this one explaining attention and transformer on the internet

  • @richard126wfr
    @richard126wfr 2 years ago

    The best explanation of the attention mechanism I found on RUclips is the pizza-making analogy by Alfredo Canziani.

  • @benjamindeporte3806
    @benjamindeporte3806 9 months ago +1

    I eventually understood the Q,K,V in attention. Many thanks.

  • @vihaanrajput8082
    @vihaanrajput8082 2 years ago

    His tutorial video is my favorite timepass, especially at night. Hail to Prof. Poupart

  • @ibrahimkaibi4200
    @ibrahimkaibi4200 3 years ago

    A very interesting explanation (wonderful)

  • @ghostoftsushimaps4150
    @ghostoftsushimaps4150 10 months ago

    Brother, love from India. I will watch this lecture at my leisure.

  • @underlecht
    @underlecht 3 years ago +1

    I would call this the best explanation of attention/transformers I have found on YouTube so far.

  • @Siva-Kumar-D
    @Siva-Kumar-D 2 years ago

    This is the best video on the Internet about Transformer networks

  • @giorgioregni2639
    @giorgioregni2639 3 years ago

    Best explanation of transformer I have ever seen, thank you Dr Poupart

  • @aponom84
    @aponom84 4 years ago +1

    Nice lecture! Thanks!

  • @pred9990
    @pred9990 4 years ago +1

    Cool lecture!

  • @brandonleesantos9383
    @brandonleesantos9383 2 years ago

    Truly fantastic wow

  • @parmidagranfar4861
    @parmidagranfar4861 3 years ago

    Finally understood what is going on. Most of the videos are so simple and skip the math. I liked it.

  • @syedhasany1809
    @syedhasany1809 4 years ago +3

    This was a great lecture, thank you.

  • @faatemehch96
    @faatemehch96 3 years ago

    thank you, the video is really useful. 👍🏻👍🏻

  • @jelenajokic9184
    @jelenajokic9184 2 years ago +1

    The simplest explanation of attention, thanks a lot for sharing, great lectures🤗!

  • @sandipbnvnhjv
    @sandipbnvnhjv 1 year ago +1

    I asked chatGPT for the best video on Attention and it brought me here

  • @shifaspv2128
    @shifaspv2128 1 year ago

    Thank you so much for the brainstorming

  • @firstnamelastname3106
    @firstnamelastname3106 3 years ago

    thank you my man, u saved me

  • @Antony25rages
    @Antony25rages 3 years ago

    Thank you for this :)

  • @HeshamChannel
    @HeshamChannel 1 year ago +1

    Very good explanation. Thanks.

  • @minhajulhoque2113
    @minhajulhoque2113 2 years ago

    Great video!

  • @cedricmanouan2333
    @cedricmanouan2333 3 years ago

    very interesting and useful. Thanks Sir

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 4 years ago +1

    Does anyone know which NMT video has the previous intro to attention that the professor cites in this video? I couldn't find his video on neural machine translation.

  • @seminkwak
    @seminkwak 3 years ago

    Beautiful explanations

  • @fit_with_a_techie
    @fit_with_a_techie 2 years ago

    Thank you Professor :)

  • @greyreynyn
    @greyreynyn 3 years ago +1

    41:14 Question, on the output side, why isn't there an additional feed-forward layer between the masked self attention in the output and the attention to the input? And maybe more broadly what are those feed forward units doing?

  • @aileensengupta
    @aileensengupta 1 year ago

    Big fan, big fan Sir!!
    Finally understood this!

  • @MustafaQamarudDin
    @MustafaQamarudDin 4 years ago +2

    Thank you very much. It is very detailed and captures the intuition.

    • @syphiliticpangloss
      @syphiliticpangloss 4 years ago

      Could you explain what the model class looks like then? What is the capacity? What is the "unconstrained" version with higher capacity? I want a full statistical-learning-theory-style discussion in all pedagogical discussions. I don't understand how people think they understand this.
      If your life depended on it, would you feel confident in recommending one of these setups? What questions would you have to ask about the data, the model architecture, the observation process? You need worst-case bounds, model complexity, etc. I see none of that here.

    • @1Kapachow1
      @1Kapachow1 4 years ago +1

      @@syphiliticpangloss Well, in deep learning the theory is far behind the engineering.
      When people say they understand this lecture, they don't mean worst-case bounds (which I strongly doubt anyone in the world knows how to calculate for this without adding so many relaxing assumptions, like convexity, that it becomes basically irrelevant),
      they just mean that:
      1. Engineering-wise, they understand how to build and use it.
      2. They feel they grasp enough intuition about the purpose of each sub-block and why it was added.
      I don't think anyone truly "understands" much simpler models in DL than transformers, which perform at a far superior level to classical machine learning methods.
      For example, fully convolutional neural networks, trained with the Adam optimizer, based on back-propagation, using BN.

    • @syphiliticpangloss
      @syphiliticpangloss 4 years ago

      @@1Kapachow1 So can someone explain what the transformer is doing, then, in a precise way? I would accept answers that reference probability distributions and predictive goals, or a computational description of components like NAND gates, etc.
      Also accepted would be anything related to the eigenvalues, stability, curvature, etc.
      There are lots of people trying to talk about this stuff. For example arxiv.org/abs/2004.09280
      Or Vapnik.
      To be perfectly clear, I think today we tend to say there are only two things really: a) "data", i.e. observations, usually dozens to millions, from some process we take to be slowly changing at most, and b) predicates/models/architecture/constraints ... "observations" usually fewer than dozens, usually manually constructed (from other experiments and observation sets perhaps). For each of these we usually have some sort of "narrative" about where it came from, a way of describing it in some way to humans.
      The second thing is what I'm getting at. "Architecture" is a model constraint. If it is just pulled from thin air without understanding the problem, the meta-problem, etc., it is quite likely that there are buried problems, secret reasons for architecture choices that are not being disclosed or realised.
      Getting better at describing these models/architectures/predicates is how we progress.

  • @c.l.1269
    @c.l.1269 3 years ago

    Great lecture! Thank you Professor!

  • @Vartazian360
    @Vartazian360 7 months ago +3

    Little did anyone know just how groundbreaking this foundation would be for ChatGPT / GPT-4.

  • @hariaakashk6161
    @hariaakashk6161 3 years ago +1

    Great explanation sir... Thank You! Please post more such lectures and I would be the first to look at it...

  • @gudepuvenkateswarlu5648
    @gudepuvenkateswarlu5648 2 years ago

    Excellent session.... Thank you, professor

  • @benjaminw2194
    @benjaminw2194 2 years ago

    I'm a novice and have been praying to get someone who discusses these papers. You're an answered prayer! Great lecturer.

  • @markphillip9950
    @markphillip9950 3 years ago

    Great lecture.

  • @abhijeetnarharshettiwar6175
    @abhijeetnarharshettiwar6175 2 years ago

    Thank you so much for the great explanation, professor.

  • @AnonTrash
    @AnonTrash 1 year ago

    Beautiful.

  • @akashpb4183
    @akashpb4183 3 years ago

    Beautifully explained.. things seem clear to me now .. Thanks a lot sir!

  • @sheikhjubair7133
    @sheikhjubair7133 4 years ago

    Very clear explanation

  • @aymensekhri2133
    @aymensekhri2133 1 year ago

    Thank you very much Sir!

  • @evennot
    @evennot 4 years ago

    19:00 It's basically an exclusionary perceptron layer, isn't it? (It could also be called a fuzzy LUT.) I'm sure it was used before for attention emulation

  • @mohamedabbashedjazi493
    @mohamedabbashedjazi493 3 years ago

    Softmax is computationally expensive; I wonder if it can be replaced somehow with another function that produces probabilities, since softmax is present in many places in all the blocks of the transformer network.

  • @nafeesahmad9083
    @nafeesahmad9083 2 years ago

    Woohoo... Thank you so much

  • @jaeyoungcheong1767
    @jaeyoungcheong1767 3 years ago

    Clearly! Thanks

  • @yashrajwani3322
    @yashrajwani3322 3 years ago

    great explanation

  • @weiyaox6896
    @weiyaox6896 2 years ago +1

    Best explanation

  • @reuben3648
    @reuben3648 1 year ago

    Thank you soo much!!!

  • @yen-linchen7398
    @yen-linchen7398 1 year ago

    Thank you!

  • @justinkim2973
    @justinkim2973 1 year ago

    Best video to watch on the first day of 2023

  • @amitvikramsingh327
    @amitvikramsingh327 3 years ago

    Thank You.

  • @sienloonglee4238
    @sienloonglee4238 1 year ago

    very good video!😀

  • @goldencircle4331
    @goldencircle4331 1 year ago

    Huge thanks for putting this online.

  • @soumyarooproy
    @soumyarooproy 4 years ago

    Great point at 50:50 👍

  • @yd42330
    @yd42330 3 years ago +2

    Question about positional encoding.
    If we sum the Word Embedding (WE) with the Positional Encoding (PE), how does the model tell the difference between WE = 0.5, PE = 0.2 and WE = 0.4, PE = 0.3? (Different words at different positions can yield the same value.)
    Why not keep the PE separate from the WE?

  • @444haluk
    @444haluk 3 years ago

    I heard queries, keys & values were primitive concepts and counter-intuitive, but I didn't know it was THIS primitive.

  • @hackercop
    @hackercop 2 years ago

    This was a great lecture - really explained this to me thanks

  • @varungoel185
    @varungoel185 3 years ago

    Around the @29:50 mark, he first mentions that the key vectors correspond to each output word, but the slide says input word. Could someone please clarify this?

  • @kungchun9461
    @kungchun9461 3 years ago +2

    This year should be the "transformer year", as there has been a breakout in the domain of CV.

  • @opencvitk
    @opencvitk 1 year ago +1

    The explanation of K, V and Q is great. Unfortunately I lost him as soon as he started on multi-head. Must be that the single head I possess is empty :-)

  • @greyreynyn
    @greyreynyn 3 years ago

    46:45 - Also, the output shape is the same as the input shape right? ie, the size of the input sequence?

  • @prof_shixo
    @prof_shixo 4 years ago +1

    Thanks for the nice lecture. I am still confused about how the transformer model can replace RNNs or LSTMs for general sequence learning. In some applications a sequence might be very long, not just a sentence (which can be designed to be fixed in length), so how do we deal with this, especially if we need to keep the complete sequence with us since there is no recurrence? If the answer is to divide the sequence, then how do we link different chunks over time without a recurrence or a carry-over? (Loops over time)

    • @JAKKOtutorials
      @JAKKOtutorials 4 years ago +1

      Transformers are able to "query the recurrences": think of it as, instead of repeating the operation as in RNNs, you just query a database of the possible values and their given inputs 'x' times and check whether it matches the requirements. And because it's not a recurrence (repetition), you can make multiple of these queries at the same time, each being a new operation! Each operation can be resolved without interference, creating new tokens, or pieces, which represent convergence points in the data universe you are travelling.
      It's a huge improvement, confirmed by the models shown at the end of the lecture. Hope this helps :)

    • @venkateshdas5422
      @venkateshdas5422 4 years ago

      As JAKKO mentioned, transformers use the attention mechanism in a very efficient manner. The sequence can be considerably longer than a sentence and the attention mechanism will still be able to capture the dependencies between words at different positions. This creates a contextual representation of the sequence that is better than the plain input embedding vector, and this is how the complete input sequence is captured by the model without recurrence.
      This is really a beautiful approach. (personal opinion)
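
A minimal sketch of the "no recurrence" point, with toy random matrices: the attention scores for the whole sequence come out of a single matrix product, so distant positions interact as directly as neighbouring ones; the trade-off is that the score matrix grows as n x n.

```python
import numpy as np

n, d = 1000, 64                     # sequence length, head dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

scores = Q @ K.T / np.sqrt(d)       # all n*n pairwise interactions in one shot, no loop over time
print(scores.shape)                 # (1000, 1000): position 1 sees position 999 directly
```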

  • @greyreynyn
    @greyreynyn 3 years ago +2

    45:50 - For the multiple linear transformations, are we applying the same linear transform to each set of Q/K/V in a "head" ? Or does each Q/K/V get its own unique linear transform applied?

    • @knoxvoxx
      @knoxvoxx 3 years ago +1

      A unique linear transform each time, I guess. (In the original paper, under section 3.2.2, they mention "h times with different, learned linear projections to dk, dk and dv dimensions, respectively".)
      If we repeat scaled dot-product attention 3 times, then we will have a total of 9 linear projections.

    • @ryanwhite7401
      @ryanwhite7401 2 years ago +2

      They each get their own learned parameters.
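
A minimal sketch consistent with that reading of Section 3.2.2: each head gets its own learned W_Q, W_K, W_V (random placeholders here), the heads run scaled dot-product attention independently, and their outputs are concatenated and passed through a final output projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, n_heads = 6, 16, 4
d_k = d_model // n_heads
X = np.random.randn(n, d_model)                 # toy input sequence

heads = []
for h in range(n_heads):
    # each head has its OWN projections: 3 * n_heads matrices in total, plus W_o below
    W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_k))         # scaled dot-product attention per head
    heads.append(A @ V)

W_o = np.random.randn(n_heads * d_k, d_model)   # final output projection
out = np.concatenate(heads, axis=-1) @ W_o
print(out.shape)                                # (6, 16), same shape as the input
```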

  • @chakibchemso
    @chakibchemso 1 year ago +1

    And that's how GPT was born, my fellas

  • @abhishekrohra9457
    @abhishekrohra9457 2 years ago

    Good explanation

  • @mohamedabdo-dl9dd
    @mohamedabdo-dl9dd 3 years ago

    Thanks, professor, for the easy explanation... can you share the PowerPoint with us?

  • @ephremtadesse3195
    @ephremtadesse3195 2 years ago

    Very helpful

  • @autripat
    @autripat 3 years ago +2

    At 1:18:22, the professor refers to BERT and a "Decoder transformer that predicts a missing word".
    To me, BERT is a masked Encoder (not decoder).
    After all, BERT stands for Bidirectional *Encoder* Representations from Transformers.
    It's minor (and doesn't subtract from this great presentation), but can anyone comment?

    • @abdelrahmanhammad1020
      @abdelrahmanhammad1020 3 years ago +1

      Great lecture. And I believe you are correct, it seems there is a typo here. I was questioning the same!

  • @aricircle8040
    @aricircle8040 1 year ago

    Thank you very much for sharing that great lecture!
    Shouldn't it be the attention vector instead of the value at 27:44?

  • @alexanderblumin6659
    @alexanderblumin6659 2 years ago +1

    Very interesting lecture. Something is not totally clear at minute 46: the multiple heads are presented intuitively as 3 explicit different filters, as in a CNN, producing 3 corresponding feature maps, but in the earlier part of the lecture it was said that attention heads are stacked one after another, producing first information about (word i, word j) and then pairs of that, i.e. one is the input to the next. So what is the right way to understand it? It seems like at minute 46 the inputs to each of the linear layers are the same, but in the earlier part of the lecture it looks like one comes after another, and intuitively the pairs of pairs and so on would change the output size.

  • @leoj5891
    @leoj5891 2 years ago

    does this normalization layer matter in the inference stage?

  • @greyreynyn
    @greyreynyn 3 years ago

    57:30 - I understand the normalization, but what's the intuition for adding? Does it just strengthen the signal from the input sequence?

    • @Victor-oc1ly
      @Victor-oc1ly 2 years ago +1

      I understand the additive part as adding the knowledge from the self-attention operation to the query itself. You can think of it this way: the query (let's call it x) is a proper question, and the sublayer (multi-head attention) output is the contribution to your inquiry from the oracle (the sub-layer before the feed-forward step), which can then be used to refine your question (the query). If you look at the paper, it's really LayerNorm(x + Sublayer(x)) that the authors write for the Encoder and Decoder stacks.
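
A tiny sketch of that Add & Norm step, LayerNorm(x + Sublayer(x)), with a placeholder sublayer standing in for multi-head attention or the feed-forward block (the learned gain and bias of layer norm are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)       # per-position normalization, no learned parameters here

def sublayer(x):                           # stand-in for multi-head attention or the feed-forward block
    return np.tanh(x @ np.random.randn(x.shape[-1], x.shape[-1]))

x = np.random.randn(6, 16)                 # the current representations (the "queries" x)
out = layer_norm(x + sublayer(x))          # residual "add" keeps x's signal, then normalize
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))  # roughly mean 0, std 1 per position
```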

  • @hemanthsharma5630
    @hemanthsharma5630 4 years ago

    Saviour!!!

  • @ryandruckman999
    @ryandruckman999 4 years ago

    1:02:00 For the positional embedding, I am confused... It seems like the formula produces a scalar output (ie you put in a position in your sequence, you get out some sin/cos value). How does it become a vector?

    • @ryandruckman999
      @ryandruckman999 4 years ago

      You could take the value for each dimension in your embedding and then you have a vector of values. But then it seems you'd be encoding the same information in every input, which doesn't seem helpful?

    • @venkateshdas5422
      @venkateshdas5422 4 years ago +1

      @@ryandruckman999 Actually, while doing that, the value fed into the cos and sin functions depends on the dimension index as well as the scalar position of the word. So each positional embedding vector will have different values, which in turn capture the position/order of the sequence.
      Refer to this article: kazemnejad.com/blog/transformer_architecture_positional_encoding/
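
A short sketch of the sinusoidal encoding described above: each position is mapped to a whole d_model-dimensional vector because the sin/cos argument depends on the dimension index as well as the position.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2) even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # a different frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)             # (50, 512): every position gets its own 512-dim vector
print(pe[3][:4].round(3))   # first few dimensions of the vector for position 3
```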

  • @datahacker1405
    @datahacker1405 2 years ago

    Thank you sir

  • @ryandruckman999
    @ryandruckman999 4 years ago +3

    Amazing lecture! I am learning a lot! I have some questions though:

    • @aymensayed7604
      @aymensayed7604 4 years ago +1

      Hey Ryan, you can join one of the online events in this meetup group ;)
      www.meetup.com/Study-Of-Advanced-Natural-Language-Processing-Topics/

    • @syphiliticpangloss
      @syphiliticpangloss 4 years ago

      Why do you think it is amazing? Can you explain exactly how transformers change the capacity, the VC dimension, the Rademacher complexity, how they change the "hypothesis space", and whether there is an equivalent "regularization"? Otherwise this is just explaining some construction with no understanding.
      I think they should always draw the "block" structure across the temporal dimension. This would make it clearer. So "Dense" connections vs attention layers, etc. I don't get the sense anyone is actually trying to think about what this operator class looks like. Kind of disheartening.

    • @karthikvenkatesh3538
      @karthikvenkatesh3538 4 years ago +1

      @@syphiliticpangloss can you point me towards videos or readings with the transformer block across the temporal dimension?

    • @syphiliticpangloss
      @syphiliticpangloss 4 years ago +1

      @@karthikvenkatesh3538 Dunno, that is kind of my point. Always start with an SDE description in terms of functions and then describe the approximation class. I think it would help everyone quickly absorb things. It is easier to talk about the spectrum of the Hessian, and whatever else one might try to think about abstractly, when building the approximation class structure manually.
      I think things like this might get into it: arxiv.org/pdf/2001.08317.pdf