Nobody knew this paper would change the world
Did it?
@@blitz1867 Hell yeah it did, it's cited 124k times, so it definitely did.
@@yanniammari1491 And we're just getting started.
@@blitz1867 yes, definitely.
very true
Friendship ended with LSTM, transformer is now my best friend.
As far as I know, there are relatively few transformers for audio problems.
LSTMs generally perform better when it comes to short sequences, and remember, the LSTM is the revolution that led to the birth of the Transformer. I love both of them!
LSTM sequentialization is kludged inside transformers. Pay attention.
Same but with GRU
Friendship ended with Transformers, Retentive Networks are now my best friend.
I've watched this maybe 5 times over 1 year, each time getting more and more from it. I think I finally intuitively understand how this works. Thanks for your work and your time!
This has been my experience with ML in general: I have to re-read papers and books over and over again, and each time I understand more. It's hard, but it pays off to finally get a grasp of such an almost mystical concept.
It’s a little bit more complicated than just predicting the next word based on the last, which is the take a lot of people have on it.
@@niedas3426 How's it going...honestly this is how I feel sometimes, having to go through multiple videos and blogposts just to grasp concepts.
@@electric_sand Honestly, still making steady progress. I am now at a place where I am much, much further. I've mainly been preoccupied with datasets (e.g. reducing file storage size, faster reading and calculations, pytorch iterdatapipes) and realised it'd help me to go back more to the fundamentals (linear algebra, calculus, probability, pandas, numpy, data structures, builtin methods etc). It's been fun, overall. :)
@@niedas3426 Thanks for your response. Same here, I decided to go back to the fundamentals as well...I simply got tired of struggling through papers. Wish you the best mate.
I was searching for a channel like "Two minute papers", but with videos that aren't two minutes long and that go in depth. I think I found it!
Subbed!
Finally, someone is drawing vectors to describe what is meant by encoding with vectors, and how the vectors relate to one another. So many talk about this, but barely understand the details.
Really good explanation. You know how to provide the essence without getting lost in details. Details might be important later, but the most important thing at first is the essential nature of the approach, and you presented it crystal clear. Thanks!!!
By far the best explanation of the paper "Attention Is All You Need". Well explained. Thanks, Yannic Kilcher.
@bunch of nerds, BERT is also a transformer, but with bidirectional (forward and backward) movement over the input sentence. GPT is a generative (autoregressive) version of the Transformer. Both are language models able to predict and understand the input sentence(s).
@bunch of nerds, it's much better to use built-in, stable tools than to build from scratch.
It's still like listening to a bad student struggling to explain something they don't themselves understand. And shame on the Google researchers for doing such a shit job explaining themselves, but I guess that's typical of the majority of research papers out there - these people just don't care to teach their ideas to others (except maybe a very narrow circle with whom they have already communicated via other channels).
@@clray123 What parts do you think are explained poorly? To me it feels like Yannic understands the paper quite well, but I'm interested in what you think he might not understand all too well.
The explanation of querying a key-value pair is really nice
I recommend looking at the paper, because they use exactly this analogy. I found their description very helpful: "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
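A minimal NumPy sketch of the attention function that quote describes, for the single-head, unbatched case; the function and variable names are my own, not from the paper or the video:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Each output row is a weighted sum of the rows of V, with weights given by
    a softmax over the scaled dot products of a query with all keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                          # weighted sum of the values

# tiny example: 2 queries attending over 3 key-value pairs
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)
```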
You have the best videos about machine learning I've seen, comparable perhaps only to 3blue1brown, but his videos don't cover topics as advanced as these. It would be really nice if you could make more!
Arxiv Insights and Henry AI are pretty good too
@@snippletrap Arxiv Insights sadly stopped posting a long time ago, and I personally find Henry AI's discussions to be superficial. Try Chris McCormick and the reading groups by Rachael, though. :)
Who do you mean by “Rachael reading groups”?
Definitely both great channels but the comparison doesn't do justice to just how good 3b1b's animations are. Here Yannic writes on a tablet lol. Not really comparable.
It's amazing to have this explanation of the paper that is responsible for all of the AI interest and innovation happening now, described as 'interesting' shortly after it came out. I love it.
I just got a clear understanding of how the positional encoder works here. Kudos to you. Great Explanation!
This is by far the best explanation I've seen of this paper. I'm writing a review of this paper for a class and wouldn't have been able to do it without your video! Immensely grateful!
Who knew this paper would change how we look at sequences forever
Thank you so much Yannic Kilcher, the paper seemed complex but you "encoded", performed "multi-head attention" and "decoded" it in such a simple way (: An amazing job! Undoubtedly the best explanation
An amazing explanation, truly amazing. I can't say how much I appreciate you putting the dot product and softmax into intuitive and easy-to-understand words. Very grateful.
This sounds like someone reading a paper without realizing it would become the third biggest thing to happen to humanity after a pandemic and, from my own perspective, an invasive war in Europe: the spark of AI that would come with ChatGPT and the expansion of generative imaging like Stable Diffusion and Midjourney 😅🎉🎉🎉🎉 I would love to know how many subscribers you gained from back then to just before ChatGPT, and from ChatGPT up to now 😅😅😅😅 You are such an amazing communicator ❤
Yeah, I'm late to the party, but I'd say this video is still very relevant. I've read the paper several times and watched multiple blog posts and videos, but especially the Q, K, V mechanism never really clicked until watching this. Using dot products between Q and K as a lookup mechanism. Ingenious! - Thanks for this video!
Good that you were interrupted at 17:15. I had to strain my ears and go full volume to hear you. After that it was better.
This will never catch on. (Kidding)
Excellent explanation of Transformers. Clear, easy to follow, and great information. Thanks!
1. The words are converted into vector embeddings, then positionally encoded using the sine and cosine functions.
2. This vector is copied, with one copy passed through the multi-head attention layer to be contextualised by projecting each word into a key, query and value, which are learned. There are separate key, value and query vectors for each word.
3. The query is matrix-multiplied with the key after both have passed through linear layers, divided by the square root of the dimensionality of the key, passed through a softmax, and then matrix-multiplied with the value.
4. It is then added to the other copy of the positionally encoded vector.
5. It is then normalised.
6. It then passes into the FFNN, which has two linear layers.
7. In the dense layers it is matrix-multiplied with the weights and then has a bias added, both of which are learned.
8. The ReLU function is applied at each position.
9. It is added to the residual (the vector before it passed through the FFNN).
10. It is then normalised again...
Then voilà... what am I missing here?
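A rough, single-head NumPy sketch of the per-layer flow that comment lists (self-attention, Add & Norm, position-wise FFN, another Add & Norm). All function names, parameter shapes, and initializations are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalise each position's vector to zero mean and unit variance (the "Norm" in Add & Norm)
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(x, Wq, Wk, Wv, W1, b1, W2, b2):
    # x: (seq_len, d_model); single attention head for brevity
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # scaled dot-product self-attention
    x = layer_norm(x + attn)                             # Add & Norm (residual around attention)
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2            # position-wise FFN: two linear maps with ReLU between
    return layer_norm(x + ff)                            # Add & Norm (residual around FFN)

# illustrative shapes only
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
params = [rng.normal(size=s) * 0.1 for s in
          [(d_model, d_model)] * 3 + [(d_model, d_ff), (d_ff,), (d_ff, d_model), (d_model,)]]
print(encoder_layer(x, *params).shape)  # (5, 8)
```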
I can't believe I just learned the intuition behind softmax. Yannic, your videos are pure gold, I hope life is treating you well!
Very well done! I agree with the other comments that this is the clearest explanation I have seen so far. Thanks for the great work!
Fun to see this today after all the recent successful transformer results! (June 2022) Thanks Yannic, keep it up!!
Why couldn't there be such a YouTube explanation from the authors of the paper? It would be very helpful for humanity right now. But this is quite helpful.
because publishing a paper doesn't directly prove the skill of being able to explain it in this way
Thank you very much! This has helped me a lot. All I could find on this specific paper was confusing and hard to understand, I think it was explained extremely well in your video! Please make more of these, I think you might help lots of people :D
You have done an excellent job of explaining the attention method in simple words. Thanks so much!
Great video, and very unique among the machine learning videos on YouTube.
Thank you!
Love how 'All you need is attention' also applies to me in terms of understanding this video. Time to chug down some Adderall and take notes!!
Also, probably not a good start when I have no idea what a vector even is... Take it in bit by bit.
Robert Pirsig reading Thoreau style, ya dig?!
Anyways, plz pray for me. Any God, Joseph Smith, Allah, Bill Murray, etc. And he will do. Pray for my attention, pray for my soul, pray for Bill Murray. Love❤🎉
you have such a cool state of mind ... really adds to making your teaching style more interesting
This paper changed the world. The AI revolution started after this paper.
Question: if there is a finite max length for an input/output sequence, why do you need a positional encoding? Wouldn't the network have a static place for the 1st word, 2nd word, ..., nth word in its inputs? I'm struggling to understand the need for the positional encoder.
Never mind, I think I understand. Since the multi-head attention mechanism uses the same scaled dot-product computation for each word in the sequence (and not different parts of a NN, for example), the positional encoding is necessary in order to get different answers for the same word at different locations in the sequence.
you got it :)
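For anyone with the same question, a small sketch of the sinusoidal positional encoding from the paper (the formula is the paper's; the function and variable names are mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Added to the word embeddings so the position-agnostic attention layers
    can tell identical words at different positions apart. Assumes even d_model."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

print(sinusoidal_positional_encoding(seq_len=4, d_model=6).round(2))
```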
Such a clear explanation of attention. I was struggling to understand attention; I must have watched over 20 videos on this but got no clarity.
Best explanation I've seen on this topic!
Love the relaxing voice. Way better than reading the paper myself. Now I can be on the elliptical and still ingest the gist of papers. Thank you for making this!
by far the most intuitive explanation. Thanks!
I'm watching the history of AGI being built right here
I had to revisit this video several times, but I think transformers finally clicked for me. Thank you!
I always thought about doing a YouTube channel like this, but I guess I don't need to because you are so good at this. Thanks!
This is a great summary, thanks for making this!!
You need to remake this video. You've gotten so much better at doing this since you made this video and this topic is so foundational.
Thanks, nice video. You've come a long way since then, I'm sure, especially with the open assistant stuff.
VERY helpful, thanks! I'd love to see a "part 2" ...
I really appreciate that you are making these videos.
Really awesome job! I was puzzling over what the key-value pairs are. Thanks a lot!
the best transformer video I have watched. Well explained
Always wondered what keys, values and queries are. Thank you for the clear explanation!
Just exceptional explanation. You clear things up so much!
One of the best explanations!!!
You are amazing!
This was such a good explanation. I've been trying to really understand these, but until now I haven't found a good resource. Cheers man!
This is beautiful, I really appreciate your work! Thank you
Very intuitive explanation here, thank you!
What's "Add and Norm" mean at each step of the network in the architecture??
Man. I just found your channel. All the best insh'Allah
While this video did add to my knowledge about the Transformer model, I think it did a poor job overall. I still cannot say that I have a general understanding of how this model works. What I'm interested to know is whether the training and inference models are the same (in LSTM and GRU, the inference decoder is used differently from the training decoder), and also what the inputs and outputs of this model are (I mean the exact structure, maybe even some examples).
Excellent video, thank you so much for illustrating these concepts so clearly.
the paper that changed everything
Very intuitive explanation. Thank you!
Nice explanation. Note that by the time somebody gets to the part where you explain dot products, he or she would likely already know what a dot product is.
How far we've come in 5 years!
Very good intro. Many videos don't focus on visual explanation, which you definitely did cover. I'd be thrilled to see a video that goes more into depth, also on how exactly the decoding is done once it's trained and how embeddings could be obtained for other tasks. But other than that, very very very well done!
Best explanation ever!! Can you make a video in which you explain how to use BERT (for beginners)?
I think huggingface has tutorials on that.
The amount of times this paper has been cited is astounding (rightfully so)
Can you do one for the paper 'Residual Attention Network for Image Classification'? There they've tried to use this concept for CNNs.
Great explanation of the transformer model. Thanks a lot!
I really understood the subject, thanks for your clear explanation.
Thank you! This is a very good explanation which I actually used in presenting this paper. Cheers man!
Great explanation, one of the best I've found so far.
Cool to see someone from Switzerland being this engaged on the platform.
Keep it up!
great explanation. please keep posting such summaries of great papers thanks!
So: a representation in one natural language, into the universal language of math, into another natural language.
Bro reminded me of that one dude who posted a video in 2011 begging people to buy $1 of Bitcoin.
By "traditional RNN" he really means RNNs that were practically used from 2012-2016 :) Although i guess BPTT is a 1970s tech.
This is a great video. Please make a video on Hierarchical Neural Networks.
Thank you for the super neat explanation- Cleared a lot of stuff.
Great explanation !
this is excellent. Thank you so much for sharing!
As a student your videos are very helpful!
This, 6 years after Attention Is All You Need, is just crazy rapid growth.
Pro tip: print out this paper and take notes on it as you try to follow along with this; also read it on your own and take note of any questions you have.
Quite an amazing explanation! thanks a lot
What a wonderful explanation of transformers, straight to the point, really nice. Any on the wav2vec model?
Thanks a lot for this explanation video!!
21:30 - thanks for the great softmax explanation! I've had the "aha" moment :)
It's very well explained; at 9 minutes in I already got the answer about attention vs. LSTMs. I was searching for it on Google for a long time.
Well explained. After watching your newer videos, I think you could do an even better job now! Please give your (filmed) attention to Attention Is All You Need again.
Nice stuff, but could you also show the actual implementations of the papers you review? It would be interesting to watch what neural nets can accomplish with language on this video's topic.
This explanation is amazing!! Thank you for this
Thank You for the overview.
(Let's take care of human attention, train its span.)
Thank you, that was very informative and explained well!
Isn't the dot product of vectors a and b the product of the magnitudes of a and b and the cosine of the angle between them? In the video it was mentioned that it is the angle between a and b, which seems incorrect. However, the dot product is intuitively related to the angle between the vectors.
True, it should probably be interpreted more accurately as a quantity that's proportional to the cosine of the angle (and to the magnitudes)!
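For reference, the standard identity this thread is circling, which is why the dot product tracks the cosine of the angle rather than the angle itself:

```latex
a \cdot b = \sum_i a_i b_i = \|a\|\,\|b\|\cos\theta ,
\qquad
\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}
```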
Very informative, and the details are well explained... Please try to upload more on advanced NLP topics...
Awesome! Thank you very much. Very well explained. I'd suggest you use an Apple Pencil; it works really well for annotating.
Thank you so much for this. Excellent explanation
Best explanation, thanks. One question: do we calculate the query vector on both the encoder and decoder side in the transformer architecture?
Yes, you'd calculate all the vectors for all positions in the input and in the conditioning output.
great explanation. Thank you, master
I see that this thing seems to exist, even if it is hard to believe. But that it was created by learning, starting with nothing, is literally unbelievable for me.
Did it start from something else? But what could that be?
Wait, huh, I kind of expected d(h4) to return the translation of "mouse" and so forth, going from end to beginning.
Thank you sir, very clear explanation
Hello. Thank you for the amazing video. I had some thoughts here. I would think the decoder would first decode the last word that went through the encoder ("mouse"), not the first word ("the"). The first word ("the") went through the encoder N times and therefore should go through the decoder N times as well, whereas the last word went through the encoder only once, which would make it so much easier to decode first.