Live: Transformers In-Depth Architecture Understanding - Attention Is All You Need

  • Published: 2 Sep 2020
  • All Credits To Jay Alammar
    Reference Link: jalammar.github.io/illustrated...
    Research Paper: papers.nips.cc/paper/7181-att...
    youtube channel : • Jay's Visual Intro to AI
    Please donate if you want to support the channel through GPay UPI ID,
    Gpay: krishnaik06@okicici
    Discord Server Link: / discord
    Telegram link: t.me/joinchat/N77M7xRvYUd403D...
    Please join as a member of my channel to get additional benefits like materials in Data Science, live streaming for members, and many more
    / @krishnaik06
    Please do subscribe to my other channel too
    / @krishnaikhindi
    Connect with me here:
    Twitter: / krishnaik06
    Facebook: / krishnaik06
    instagram: / krishnaik06

Comments • 228

  • @dandyyu0220
    @dandyyu0220 2 years ago +7

    I cannot express enough appreciation for your videos, especially the NLP deep-learning-related topics! They are extremely helpful and so easy to understand from scratch! Thank you very much!

  • @mohammadmasum4483
    @mohammadmasum4483 1 year ago +12

    @ 40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We use an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore each head uses (512/8 =) 64-dimensional Q, K, V vectors. That way, when we concatenate all the attention heads afterward, we get back the same 512-dimensional word embedding, which is the input to the feed-forward layer.
    Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K, and V vectors. In my opinion, the initial word embedding size and the number of attention heads are hyperparameters.
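
    A minimal NumPy sketch of this dimension bookkeeping (the 512, 8 and 64 come from the paper; the projection matrices below are random placeholders, not trained weights):

    ```python
    import numpy as np

    d_model, num_heads = 512, 8
    d_k = d_model // num_heads               # 512 / 8 = 64 dims for each head's Q, K, V

    seq_len = 5                              # e.g. a 5-word sentence
    X = np.random.randn(seq_len, d_model)    # word embeddings (placeholder values)

    # Each head projects the same X into its own 64-dim subspace (random, untrained weights here)
    head_outputs = [X @ np.random.randn(d_model, d_k) for _ in range(num_heads)]

    Z = np.concatenate(head_outputs, axis=-1)  # 8 heads x 64 dims -> back to 512
    print(Z.shape)                             # (5, 512), fed to the feed-forward layer
    ```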

  • @shrikanyaghatak
    @shrikanyaghatak 11 months ago +2

    I am very new to the world of AI. I was looking for easy videos to teach me about the different models. I did not imagine I would be totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.

  • @ss-dy1tw
    @ss-dy1tw 3 years ago +1

    Krish, I really see the honesty in you, man; a lot of humility, a very humble person. In the beginning of this video, you gave credit several times to Jay, who created the amazing blog on Transformers. I really liked that. Keep being like that.

  • @apppurchaser2268
    @apppurchaser2268 1 year ago

    You are a really good teacher who always checks whether your audience gets the concept or not. I also appreciate your patience and the way you try to rephrase things to give better explanations.

  • @prasad5164
    @prasad5164 3 years ago

    I really admire you now, simply because you give credit to those who deserve it at the beginning of the video.
    That attitude will make you a great leader. All the best!!

  • @nim-cast
    @nim-cast 9 months ago +6

    Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏

    • @sivakrishna5557
      @sivakrishna5557 17 days ago

      Could you please help me get started on the LLM series? Could you please share the playlist link?

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +2

    Thank you, I appreciate your time going through this material.

  • @TusharKale9
    @TusharKale9 3 years ago +1

    Very well covered GPT-3 topic. Very important from an NLP point of view. Thank you for your efforts.

  • @tshepisosoetsane4857
    @tshepisosoetsane4857 1 year ago

    Yes, this is the best video explaining these models so far; even non-computer-science people can understand what is happening. Great work.

  • @story_teller_1987
    @story_teller_1987 3 years ago +13

    Krish is a hard-working person, not for himself but for our country, in the best way he could... We need more people like him in our country

    • @lohithklpteja
      @lohithklpteja 2 months ago

      Alu kavale ya lu kavale ahhh ahhh ahhh ahhh dhing chiki chiki chiki dhingi chiki chiki chiki

  • @akhilgangavarapu9728
    @akhilgangavarapu9728 3 years ago +3

    A million tons of appreciation for making this video. Thank you so much for your amazing work.

  • @Adil-qf1xe
    @Adil-qf1xe 1 year ago

    How did I miss the subscription to your channel? Thank you so much for this thorough explanation, and hats off to Jay Alammar.

  • @sarrae100
    @sarrae100 2 years ago

    Excellent blog from Jay. Thanks Krish for introducing this blog on your channel!!

  • @mequanentargaw
    @mequanentargaw 9 months ago

    Very helpful! Thank you all contributors!

  • @madhu1987ful
    @madhu1987ful 2 years ago

    Jay Alammar's blog is of course awesome. But you made it even simpler while explaining. Thanks a lot

  • @harshavardhanachyuta2055
    @harshavardhanachyuta2055 3 years ago

    Thank you. The combination of your teaching and Jay's blog pulls this topic together. I like the way you teach. Keep going.

  • @jeeveshkataria6439
    @jeeveshkataria6439 3 years ago +22

    Sir, please release the video on BERT. Eagerly waiting for it.

  • @anusikhpanda9816
    @anusikhpanda9816 3 years ago +27

    You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video.
    Thank you Sir🙏🙏🙏

    • @kiran5918
      @kiran5918 3 months ago

      Difficult to understand foreign accents. Desi away zindabad

  • @roshankumargupta46
    @roshankumargupta46 3 years ago +43

    This might help the guy who asked why we take the square root, and also other aspirants:
    The scores get scaled down by dividing by the square root of the dimension of the query and key vectors. This allows for more stable gradients, as multiplying values can have exploding effects.

    • @tarunbhatia8652
      @tarunbhatia8652 3 years ago +1

      Nice, I was also wondering about the same. It all comes back to exploding or vanishing gradients; how could I forget that :D

    • @apicasharma2499
      @apicasharma2499 2 years ago

      Can this attention encoder-decoder be used for financial time series as well, i.e. multivariate time series?

    • @matejkvassay7993
      @matejkvassay7993 2 years ago

      Hello, I think the square root of the dimension is not chosen just empirically; it actually normalizes the length of the vector (or something similar). Vector length scales with the square root of the dimension as the dimension grows, when certain conditions I forget are met, so dividing by it scales the magnitude back down and thus prevents exploding dot-product scores.

    • @kunalkumar2717
      @kunalkumar2717 2 years ago

      @@apicasharma2499 Yes, although I have not used it myself, it can be used.

    • @generationgap416
      @generationgap416 1 year ago

      The normalization should come from the softmax, or from using the triangular (tril) function to zero out the lower part of the concatenated Q, K and V matrix, so as to have well-behaved initialization weights, I think.
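
    To make the formula discussed in this thread concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with random placeholder matrices (not Krish's or Jay's code):

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # divide by sqrt(d_k) to keep scores in a stable range
        weights = softmax(scores, axis=-1)   # attention weights sum to 1 for each query
        return weights @ V                   # weighted sum of value vectors

    seq_len, d_k = 4, 64
    Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
    ```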

  • @lshagh6045
    @lshagh6045 1 year ago

    A huge and tremendous effort; a million thanks for your dedication

  • @suddhasatwaAtGoogle
    @suddhasatwaAtGoogle 2 years ago +37

    For anyone having a doubt at 40:00 as to why we take the square root of 64: as per the research, it was found mathematically to be the best way to keep the gradients stable! Also, note that the value 64, which is the size of the Query, Key and Value vectors, is itself a hyperparameter that was found to work best. Hope this helps.

    • @latikayadav3751
      @latikayadav3751 10 months ago

      The embedding vector dimension is 512. We divide this into 8 heads: 512/8 = 64. Therefore the size of the query, key and value vectors is 64, so that size is not an independent hyperparameter.

    • @afsalmuhammed4239
      @afsalmuhammed4239 10 months ago +1

      Normalizing the data

    • @sg042
      @sg042 8 months ago

      Another reason is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1). When we compute the dot product, it is a sum of 64 terms, which mathematically increases the standard deviation to sqrt(64), making the distribution N(0, sqrt(64)); therefore dividing by sqrt(64) normalizes it back.

    • @sartajbhuvaji
      @sartajbhuvaji 7 months ago +1

      The paper states:
      "While for small values of dk the two mechanisms (the attention functions: additive attention and dot-product attention; note: the paper uses dot-product attention (q·k))
      perform similarly, additive attention outperforms dot-product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(dk)."
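
    A quick NumPy check of the variance argument made in this thread: for unit-variance q and k, the raw dot product has a standard deviation of roughly sqrt(d_k), and dividing by sqrt(d_k) brings it back to roughly 1 (illustrative only, not code from the video):

    ```python
    import numpy as np

    d_k, trials = 64, 100_000
    q = np.random.randn(trials, d_k)   # entries ~ N(0, 1)
    k = np.random.randn(trials, d_k)

    dots = (q * k).sum(axis=1)         # each dot product sums 64 unit-variance terms
    print(dots.std())                  # ~ sqrt(64) = 8
    print((dots / np.sqrt(d_k)).std()) # ~ 1 after scaling
    ```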

  • @harshitjain4923
    @harshitjain4923 3 years ago +13

    Thanks for explaining Jay's blog. To add to the explanation at 39:30, the reason for using sqrt(dk) is to prevent the vanishing-gradient problem mentioned in the paper. Since we apply softmax to Q*K^T, a high dimension of these matrices produces large values, which get pushed close to 1 after the softmax and hence lead to very small gradient updates.

    • @neelambujchaturvedi6886
      @neelambujchaturvedi6886 3 years ago

      Thanks for this Harshit

    • @shaktirajput4711
      @shaktirajput4711 2 years ago

      Thanks for the explanation, but I guess it would be called exploding gradient, not vanishing gradient. Hope I am not wrong.
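
    A small NumPy illustration of the point debated above: with large unscaled scores the softmax saturates towards a one-hot vector, and its gradient terms p*(1 - p) become tiny, so the gradients flowing back vanish rather than explode (made-up numbers, not from the video):

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    small = softmax(np.array([1.0, 2.0, 3.0]))    # moderate scores
    large = softmax(np.array([10.0, 20.0, 30.0])) # unscaled, large-magnitude scores

    print(small)                 # ~[0.09 0.24 0.67] - still spread out
    print(large)                 # ~[0. 0. 1.] - saturated
    print(large * (1 - large))   # diagonal softmax-gradient terms, all nearly 0
    ```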

  • @ashishjindal2677
    @ashishjindal2677 2 years ago

    Wonderful explanation of the blog, thanks for introducing us to Jay. Your teaching style is awesome.

  • @hiteshyerekar9810
    @hiteshyerekar9810 3 years ago +2

    Great session, Krish. Because of the research paper I understood things very easily and clearly.

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    After watching your lecture it's much clearer to me
    Thanks Krish

  • @kiran5918
    @kiran5918 4 months ago

    Wow, what an explanation of transformers. Perfect for us; it aligns with the way we are taught at school.

  • @abrarfahim2042
    @abrarfahim2042 2 years ago

    Thank you Krish. I learned so many things from your video.

  • @wentaowang8622
    @wentaowang8622 1 year ago

    Very clear explanation. And Jay's blog is also amazing!!

  • @underlecht
    @underlecht 3 years ago +2

    I love your patience, how many times you go around explaining things until they become clear even for guys as slow as me. BTW, residual connections are not there because some layers are unimportant and have to be skipped; they are there to solve the vanishing-gradients problem.

  • @tarunbhatia8652
    @tarunbhatia8652 3 years ago

    Thanks Krish, Awesome session, keep doing the great work!

  • @MuhammadShahzad-dx5je
    @MuhammadShahzad-dx5je 3 years ago +7

    Really nice, sir. Looking forward to the BERT implementation 😊

  • @smilebig3884
    @smilebig3884 2 years ago

    Very underrated video... this is a super awesome explanation. I'm watching and commenting for the 2nd time after a month.

  • @gurdeepsinghbhatia2875
    @gurdeepsinghbhatia2875 3 years ago +1

    Sir, thanks a lot, I really enjoyed it (maza aa gaya), sir. Your way of teaching is so humble and honest and, most importantly, patient. Awesome video sir, too good

  • @thepresistence5935
    @thepresistence5935 2 years ago

    It took me more than 5 hours to understand this. Thanks Krish, wonderful explanation.

  • @avijitbalabantaray5883
    @avijitbalabantaray5883 1 year ago

    Thank you Krish and Jay for this work.

  • @junaidiqbal5018
    @junaidiqbal5018 1 year ago +1

    @31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, doing the dot product over the full 512-dimensional embedding would not only be computationally expensive but would also give us only one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.

  • @pavantripathi1890
    @pavantripathi1890 8 months ago

    Thanks to Jay Alammar sir and you for the great explanation.

  • @Schneeirbisify
    @Schneeirbisify 3 years ago +1

    Hey Krish, thanks for the session. Great explanation! Could you please let us know if you have already uploaded the session on BERT? And if not, is it still in your plans? It would be very interesting to dive deeper into practical applications of Transformers.

  • @michaelpadilla141
    @michaelpadilla141 2 years ago

    Superb. Well done and thank you for this.

  • @shanthan9.
    @shanthan9. 3 months ago

    Every time I get confused or distracted while listening to the Transformers, I have to watch the video again; this is my third time watching it, and now I understand it better.

  • @MrChristian331
    @MrChristian331 2 years ago

    Great presentation! I think I understand it fully now.

  • @zohaibramzan6381
    @zohaibramzan6381 3 years ago +1

    Great for overcoming confusion. I hope to get hands-on with BERT next.

  • @jimharrington2087
    @jimharrington2087 3 years ago

    Great effort Krish, Thanks

  • @RanjitSingh-rq1qx
    @RanjitSingh-rq1qx 6 months ago

    The video was so good; I understood each and every thing except the decoder side.

  • @elirhm5926
    @elirhm5926 2 years ago

    I don't know how to thank you and Jay enough!

  • @Deepakkumar-sn6tr
    @Deepakkumar-sn6tr 3 years ago

    Great session!... Looking forward to a Transformer-based recommender system

  • @markr9640
    @markr9640 6 months ago

    Very well explained Sir! Thank you.

  • @aqibfayyaz1619
    @aqibfayyaz1619 3 years ago

    Great Effort. Very well explained

  • @pranthadebonath7719
    @pranthadebonath7719 8 months ago

    Thank you, sir,
    that's a nice explanation.
    also thanks to Jay Alammar sir.

  • @kameshyuvraj5693
    @kameshyuvraj5693 3 years ago

    Sir, the way you explained the topics is the ultimate, sir

  • @prekshagampa5889
    @prekshagampa5889 1 year ago

    Thanks a lot for the detailed explanation. Really appreciate your effort in creating these videos

  • @Ram-oj4gn
    @Ram-oj4gn 6 months ago

    Great explanation.. I understand Transformers now.

  • @parmeetsingh4580
    @parmeetsingh4580 3 years ago +1

    Hi Krish, great session.
    I have a question: the Z we get after the self-attention block of the encoder, is it interpretable? That is, could we figure out, just by looking at Z, what results the multi-head self-attention block gives?
    Kindly help me out with this.

  • @ruchisaboo29
    @ruchisaboo29 3 years ago +3

    Awesome explanation.. When will you post the BERT video? Waiting for it, and if possible please cover GPT-2 as well.. Thanks a lot for this amazing playlist.

  • @faezakamran3793
    @faezakamran3793 1 year ago +3

    For those getting confused by the 8 heads: all the words go to all the heads; it's not one word per head. The X matrix remains the same; only the W matrices change in the case of multi-head attention.
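
    A small sketch of this point, assuming random placeholder weights: every head sees the same X, but each head has its own W_Q, W_K, W_V, so the 8 heads produce 8 different attention patterns:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, num_heads = 4, 512, 8
    d_k = d_model // num_heads
    X = np.random.randn(seq_len, d_model)        # the same X for every head

    head_outputs = []
    for _ in range(num_heads):
        # only these weight matrices differ between heads
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        head_outputs.append(weights @ V)

    Z = np.concatenate(head_outputs, axis=-1)    # back to (4, 512)
    print(Z.shape)
    ```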

  • @jaytube277
    @jaytube277 1 month ago

    Thank you Krish for making such a great video. Really appreciate your hard work. One thing I have not understood here is where the loss gets calculated. Does it happen on the multiple heads or at the encoder-decoder attention layer? What I am assuming is that while we are training the model, the translations will not be accurate and we should get some loss which we will try to minimize, but I am not understanding where that comparison happens.

  • @sujithsaikalakonda4863
    @sujithsaikalakonda4863 7 months ago

    Very well explained. Thank you sir.

  • @tapabratacse
    @tapabratacse 1 year ago

    Superb, you made things look so easy

  • @ganeshkshirsagar5806
    @ganeshkshirsagar5806 7 months ago

    Thank you so much sir for this superb session.

  • @utkarshsingh2675
    @utkarshsingh2675 1 year ago

    Thanks for such free content!!... You are awesome, sir!

  • @generationgap416
    @generationgap416 1 year ago

    The reason to divide by the square root of dk is to keep the values near x = 0, where f(x) approaches y = 1/2 from the left or right and the curve still has slope. Look at the shape of the sigmoid function.

  • @toulikdas3915
    @toulikdas3915 3 years ago

    More videos like this on research-paper explanations and advanced concepts of deep learning and reinforcement learning, sir.

  • @sweela1
    @sweela1 1 year ago +1

    In my opinion, at 40:00 the square root is taken for the purpose of scaling: it transforms larger values into smaller ones so that the softmax of these values can also be computed easily. dk is the dimension whose square root is taken to scale the values.

    • @sg042
      @sg042 8 months ago

      Another reason is probably that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1). When we compute the dot product, it is a sum of 64 terms, which mathematically increases the standard deviation to sqrt(64), making the distribution N(0, sqrt(64)); therefore dividing by sqrt(64) normalizes it back.

  • @121MrVital
    @121MrVital 3 years ago +5

    Hi Krish,
    When are you going to make a video on BERT with a practical implementation?

  • @raghavsharma6430
    @raghavsharma6430 3 years ago

    krish sir, it's amazing!!!!

  • @sagaradoshi
    @sagaradoshi 2 years ago

    Thanks for the wonderful explanation. For the decoder, at the 2nd time step we passed the word 'I'; then at the 3rd time step do we pass both the words 'I' and 'am', or is only the word 'am' passed? Similarly, at the next time step do we pass the words 'I', 'am' and 'a', or just the word 'a'?

  • @armingh9283
    @armingh9283 3 years ago

    Thank you sir. It was awesome

  • @manikaggarwal9781
    @manikaggarwal9781 5 months ago

    superbly explained

  • @hudaalfigi2742
    @hudaalfigi2742 2 years ago

    I really want to thank you for your nice explanation; actually, I was not able to understand it before watching this video

  • @ranjanarch4890
    @ranjanarch4890 2 years ago

    This video describes Transformer inference. Can you do a video on the training architecture? I suppose we would need to provide datasets in both languages for training.

  • @mdmamunurrashid4112
    @mdmamunurrashid4112 11 months ago

    You are amazing as always !

  • @BalaguruGupta
    @BalaguruGupta 3 years ago

    The layer normalization is applied to (X + Z), where X is the input and Z is the result of the self-attention calculation. You mentioned that when the self-attention doesn't perform well, the self-attention calculation is skipped and we jump to layer normalization, hence the Z value will be 'empty' (please correct me if I'm wrong). In this case the layer normalization happens only on X (the input). Am I correct?
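
    For reference, a minimal sketch of the Add & Norm step being asked about: the sublayer computes LayerNorm(X + Z), and if Z contributed nothing (all zeros) it would reduce to LayerNorm(X). This is a simplified illustration with placeholder values, not the video's code:

    ```python
    import numpy as np

    def layer_norm(x, eps=1e-6):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    seq_len, d_model = 4, 512
    X = np.random.randn(seq_len, d_model)        # sublayer input
    Z = np.random.randn(seq_len, d_model)        # self-attention output (placeholder)

    out = layer_norm(X + Z)                      # residual connection, then layer normalization
    out_skip = layer_norm(X + np.zeros_like(Z))  # if the sublayer contributed nothing
    print(out.shape, np.allclose(out_skip, layer_norm(X)))
    ```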

  • @happilytech1006
    @happilytech1006 2 years ago

    Always helpful Sir!

  • @kiran082
    @kiran082 2 years ago

    Great Explanation

  • @digitalmbk
    @digitalmbk 3 years ago +2

    My MS SE thesis completion totally depends on your videos. Just AWESOME!!!

  • @mdzeeshan1148
    @mdzeeshan1148 6 days ago

    Woooow, at last I got clarity. Thanks so much for the wonderful explanation

  • @desrucca
    @desrucca 1 year ago +1

    AFAIK ResNet-style skip connections are not like dropout; instead they carry information from the previous layer forward to the n-th layer, and by doing this, vanishing gradients are less likely to occur.

  • @learnvik
    @learnvik 7 months ago

    Thanks. Question: in step 1 (30:52), what if the randomly initialized weights all have the same value at the start? Then all the resulting vectors will have the same values.

  • @dataflex4440
    @dataflex4440 1 year ago

    Pretty good Explanation Mate

  • @AshishBamania95
    @AshishBamania95 2 years ago

    Thanks a lot!

  • @MayankKumar-nn7lk
    @MayankKumar-nn7lk 3 years ago

    Answer to why we are dividing by the square root of the dimension: basically, we are finding the similarity between the query and each key. There are different ways to measure similarity, such as the dot product or the scaled dot product; here we take the scaled dot product to keep the values in a fixed range.

  • @joydattaraj5625
    @joydattaraj5625 3 years ago

    Good job Krish.

  • @shahveziqbal5206
    @shahveziqbal5206 2 years ago +1

    Thank you ❤️

  • @althafjasar6395
    @althafjasar6395 2 years ago

    Hi Krish, great session. Just wanted to know, have you uploaded the session on BERT?

  • @mayurpatilprince2936
    @mayurpatilprince2936 8 months ago +1

    Why do they multiply each value vector by the softmax score? Because they want to keep intact the values of the word(s) they want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example)... they wanted to suppress whatever irrelevant words the sentence has.
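
    A tiny numeric example of that weighting, with made-up scores and value vectors: the relevant word's value passes through almost unchanged, while the irrelevant one is drowned out by a near-zero weight:

    ```python
    import numpy as np

    values = np.array([[1.0, 2.0],     # value vector of a relevant word
                       [5.0, 5.0]])    # value vector of an irrelevant word
    scores = np.array([0.999, 0.001])  # softmax scores (made-up numbers)

    weighted = scores[:, None] * values
    print(weighted)              # second row shrinks to ~[0.005, 0.005]
    print(weighted.sum(axis=0))  # attention output ~ the relevant word's value vector
    ```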

  • @Rider12374
    @Rider12374 23 days ago

    Thanks krish don!!!

  • @BINARYACE2419
    @BINARYACE2419 2 years ago

    Well Explained Sir

  • @dhirendra2.073
    @dhirendra2.073 2 years ago

    Superb explanation

  • @User-nq9ee
    @User-nq9ee 2 years ago

    Thank you so much ..

  • @bruceWayne19993
    @bruceWayne19993 6 months ago

    thank you🙏

  • @sreevanthat3224
    @sreevanthat3224 1 year ago

    Thank you.

  • @sayantikachatterjee5032
    @sayantikachatterjee5032 6 months ago

    At 58:49 it is said that if we increase the number of heads, they will give importance to different words, so 'it' can give more importance to 'street' as well. So between 'the animal' and 'street', which word will be prioritized more?

  • @adwait92
    @adwait92 2 years ago +3

    For the doubt at 40:00, the attention technique used in the paper is dot-product attention (refer to page 2, section 3.2.1, para 2).
    So for larger values of d_k (the dimension of the query, key and value vectors), the dot product can grow very large in magnitude. Also, keep in mind that the step following the scores is a softmax. For very large inputs the softmax output tends towards 1, so the resulting gradients (during backpropagation) are very close to 0. This would eventually mean the model doesn't learn, as the weights don't get updated.

  • @neelambujchaturvedi6886
    @neelambujchaturvedi6886 3 years ago +2

    Hey Krish, I had a quick question related to the explanation at 1:01:07 about positional encodings. How exactly do we create those embeddings? In the paper the authors use sine and cosine waves to produce them, and I could not understand the intuition behind this. Could you please help me understand this part? Thanks in advance.

    • @1111Shahad
      @1111Shahad 11 days ago

      The use of sine and cosine functions ensures that the positional encodings have unique values for each position.
      Different frequencies allow the model to capture both short-range and long-range dependencies.
      These functions ensure that similar positions have similar encodings, providing a smooth gradient of positional information, which helps the model learn relationships between neighboring positions.
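
    A minimal NumPy sketch of the sinusoidal positional encoding from the paper that this thread discusses, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); illustrative only:

    ```python
    import numpy as np

    def positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]             # positions 0 .. max_len-1
        i = np.arange(0, d_model, 2)[None, :]         # even dimension indices (2i)
        angles = pos / np.power(10000, i / d_model)   # one frequency per dimension pair
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                  # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                  # cosine on odd dimensions
        return pe

    pe = positional_encoding(max_len=50, d_model=512)
    print(pe.shape)   # (50, 512); added to the word embeddings before the encoder
    ```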

  • @ayushrathore8916
    @ayushrathore8916 3 years ago

    After the encoder, is there any repository-like store that holds all the encoder outputs and then passes them one by one to the decoder to get the decoded output one by one?

  • @muraki99
    @muraki99 9 months ago

    Thanks!

  • @mohammedbarkaoui5218
    @mohammedbarkaoui5218 1 year ago

    You are the best 😇

  • @harshjain-cc5mk
    @harshjain-cc5mk 2 years ago

    What are the basic requirements one should have to understand the Transformer? Currently I am in my final year and willing to do a project on this. I have knowledge of machine learning and neural networks, and have just started to learn RNNs and CNNs. Any guidance and suggestions are welcome.

  • @shashireddy7371
    @shashireddy7371 3 years ago

    Sir, does the number of attention heads mean the number of encoders?
    You have taken 6 encoders, so how are you able to get 8 attention heads corresponding to 8 Z outputs?