Great video as always! Minor quibble at 9:00: I have always heard and understood "attend to" as being from the perspective of the query (the video uses the key's perspective), so it would be "the embedding of creature attends to fluffy and blue" instead. It doesn't really matter since the dot product is symmetric; I just haven't heard it used colloquially in that direction (maybe due to the axis that the softmax is applied on?).
I was skimming through to determine whether this would explain it well to friends who are interested, and yeah, I noticed all the matrices in this video are transposed relative to the paper's notation. He's not wrong about what the calculations are doing, but it's more confusing this way, especially when he shows softmax(QK^T/z)V as the notation. With that formulation, Q, K, and V should have the projections as rows so the attention matrix is properly formed. The softmax is then taken over rows, so when multiplying with V on the right, you're aggregating weighted rows of V.
I had the same reaction initially, but I'm not a native English speaker. I googled the meaning of "attend to" and it gave me "has to do with". If you think about it like that, fluffy and blue have to do with creature, not the other way around. Maybe that helps.
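For anyone puzzling over which way "attends to" goes, here is a tiny NumPy sketch (my own toy numbers, nothing from the video): in the row convention the softmax runs along the key axis, so each query ends up with a probability distribution over the keys, which is why the phrase is usually read from the query's side.

```python
# Toy sketch: softmax over the key axis means each *query* gets a
# probability distribution over keys, i.e. "query i attends to the keys".
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                      # 4 tokens, 8-dimensional key/query space
Q = rng.normal(size=(n, d))      # one query per row
K = rng.normal(size=(n, d))      # one key per row

scores = Q @ K.T / np.sqrt(d)    # scores[i, j]: how well query i matches key j
weights = softmax(scores, axis=-1)

print(weights.sum(axis=1))       # each row sums to 1: query i's attention over all keys
```

Summing along the other axis generally won't give 1s, which is the asymmetry hiding behind the symmetric dot product.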
17:50 -- Love the 3b1b humblebrag here: essentially "Those paper writers make things confusing, and I am here to lead you with knowledge." Thank you Grant for bringing this to all of us!
Thank you, 3Blue1Brown, for providing such high-quality intuition for LLMs in this lecture. You are the best teacher I have ever seen at explaining things this clearly.
It is 1:00 in Sydney right now and I'm up late watching your video from my bed. I should probably get some sleep, I have morning classes, it's just that your content is too goddamned interesting. Plus, I'm a teenager. I can't be separated from my phone except by a 16th-century-French-style beheading. POST MORE VIDEOS! If I can't sleep you shouldn't get the luxury!
This series is incredible. I'm an environmental and agricultural sciences master's student who is just super interested in AI. I've dabbled in uni courses which were fairly superficial (given my background and program) and decided to write my master's thesis about a specific practical use case of genAI. I know so very little, and these videos are tremendously helpful to help guide me through this super daunting and scary task. With my specific goal in mind, I have spent 3h on this video and the one before that alone. So so information dense yet highly comprehensible. Thank you from the bottom of my heart. Thanks to the wider "ML community" (and genAI itself lol, thanks to state of the art LLMs), I feel like I can actually do this (and learn so incredibly much - if not for my future career, for sure for being an informed person in this AI era)
By far this is (with ch. 5) the best explanation of Transformer architecture that I've ever seen. I previously understood Attention intuitively but struggled to explain this during job interviews under stress. You just helped me land a job. Thank you!
At the end of the day it is the mathematicians who have the best understanding of any concept, be it physics or AI or just pure abstract math. Thank you Grant for your amazing contribution and taking us along this journey...
When I first learned how matrix math worked in 9th grade, I asked the question many students do: "What's the point?" If I ever end up teaching 9th grade math, I have a lot of cool examples to share with students who are similarly doubtful.
By the time I was at that point, sometime in the 90s, I had already tried implementing a 3D engine, so I knew how useful matrices can be. But back then we didn't really have SIMD; that came a little later, which made it that much more relevant. I think it's similar, in that you're reformulating things to make them as regular and as mutually independent as possible specifically in order to be able to parallelise them, but it gives you some neat advantages and expressive power beyond that as well. So many mathematical concepts could be considered optional, but all of them increase your expressive power.
What an absolute masterpiece, thank you so much for breaking this down - I read that paper over and over but never reached the enlightenment this video provides ❤
Watching this video made me realize how convoluted we, the AI research community, have made the terminology of deep learning. Plainer, simpler terms are needed in research papers, since papers like Attention Is All You Need are read by every generation of researchers coming after us.
I'm afraid this applies to all scientific fields. People tend to create jargon in the subjects they work on to make the job more efficient for themselves and others like them. Papers become shorter and more compressed, and to read them, others first need to learn those "compression algorithms" ;) Unfortunately that makes it harder to learn anything from scratch, so we need people who "decompress" it for us and teach us how to do it ourselves.
@@venugopal-nc3nz Nah, it's a very natural process that happens to everybody. The more you know, the more you "compress" the data to make it more efficient to process. That way one sentence can carry very deep meaning for others like you, and you don't need to write 20 pages to describe a new concept you are working on. In fact, deep learning implements exactly this: very sophisticated compression algorithms ;)
Computer science is a field so full of misnomers, that it's itself a misnomer. It's a field of engineering not science, and its subjects are not computers, those are the subject of computer engineering (electrical engineering), but software. It's also not software engineering per se, which is concerned with software architecture, though that can be considered a sub-field of computer science. It is more like information processing engineering, with heavy focus on algorithms and data structures. Unfortunately a certain amount of hindsight is needed to name things elegantly in a way that would be more leaning towards self explanatory, but there is resistance to renaming things, because it would break existing workflow.
@@SianaGearz I would argue that computer science is part of science, at least the AI part we are talking about. Of course there is a lot of engineering there, but without the science those engineers would have nothing to do ;)
Since maybe 6th grade I thought math was out of reach and that I was too slow and too "dumb" for it. The content that you (especially the Lockdown Math series) and the community have put out so far showed me (kind of opened my eyes) that everything is within reach and can be understood. It unlocks so much, and it's so beautiful as well. One who easily turns on the waterworks might shed a tear. It's like reuniting with a long-lost family member or friend that you thought you would never meet again.
Hi, I'm not good at speaking English, I'm learning, and I want to say thank you. In Spanish there aren't videos like this (sorry, I don't know how to say this on YouTube). Well, thank you; I am 16 years old and this helped me.
I've been spending a lot of time trying to understand transformers before watching this. I have built a very, very small one myself to understand better, but with my limited resources it can't do much. Anyway, the Q, K, V part was very confusing to me, but now I finally understand. The thing that is most difficult to wrap my head around is the insane dimension sizes and parameter counts. Those numbers are so high, and all of them multiplying together to get 57 billion parameters for the attention blocks alone is insane and fascinating to me. And then adding the rest, that is not attention...
Hi Akari! If you have time, I would love to talk with you about your process of building the transformer. I'm also in the process of understanding all of this world. What do you think about having a chat? We learn more when we teach ;)
Watching this entire video was a cathartic experience. Can't wait for chapter 7, where the emphasis is on training the feed-forward neural network weights!
What a beautiful illustration of attention. I want more on this topic!! Variations, details, depth, concepts, illustrations - this is just wonderful. I could watch this video on repeat and learn something new each time.
I have become a witness to such great genius, with such a thorough understanding of these concepts. If there were ever a list of the greatest teachers, this guy would need to be on it.
Humanity owes you a debt of gratitude beyond measure. As for myself, I can't help but feel a tinge of disappointment. Your original artworks in this channel, in their essence, serve as poignant reminders that life is fleeting, revealing the vast expanse of what I have yet to comprehend (not to mention "to teach").
Great video. It kind of saddens me that we're leaving the golden age of open-source AI research. "OpenAI" took all the publicly shared, transformative research that came before (transformers, convolutions, batch norm, etc.), used it, and then shut the door behind them. Well done to them for making billions, but unfortunately this means the next revolutionary advancement like transformers isn't going to be published openly.
I had been seeking a simple, easy-to-understand take on the Attention Is All You Need paper, and this is the BESTEST so far. The rest of the videos basically just read the paper out loud and treat you like a PhD student; if I could do that, I would read it myself. Papers are superbly technical, and I am sure there are tonnes of AI enthusiasts and developers who want a basic understanding of what it's all about, and of the "ORIGIN"... so this just hits the spot. Amazing work, and I always love the Pi's. 🕺💃
As a hobby AI enthusiast I've been trying very hard to understand transformers, but failed. Thank you, now I do! How do people get this without 3Blue1Brown videos? Are they that smart? How many people could grasp all this from the "Attention Is All You Need" paper alone?
Absolutely amazing. I've recently become a student of data science; professionally I'm responsible for RAN network strategy at a telecom operator. I am deeply amazed at how your work helps me with both of these fields! Indescribable :)
00:02 Transformers use attention mechanisms to process and associate tokens with semantic meaning.
02:38 Attention blocks refine word meanings based on context
05:15 Transforming embeddings through matrix-vector products and tunable weights in deep learning.
07:47 Transformers use key matrix to match queries and measure relevance.
10:31 Attention mechanism ensures no later words influence earlier words
12:55 Attention mechanism variations aim at making context more scalable.
15:26 Transformers use weighted sums to produce refined embeddings from attention
17:58 Self-attention mechanism explained with parameter count and cross-attention differentiation.
20:08 Transformers use multi-headed attention to capture different attention patterns
22:34 Implementation of attention differs in practice
24:53 Attention mechanism's success lies in parallelizability for fast computations.
Crafted by Merlin AI.
A few added notes based on common comments I see.
Concerning masking in self-attention, several people have asked about cases where it feels like later words should update the meaning of earlier words, as in languages where adjectives follow nouns. The model can always put the richest meaning into the last token (e.g. early nouns getting baked into later adjectives). For example, @victorlevoso8984 noted below how empirical evidence suggests the meaning of a sentence often gets baked into the embedding of the punctuation mark at its end. Keep in mind that the model doesn't have to conceptualize things the way we humans do, and in all likelihood doesn't, so I wouldn't over-index on the motivating example given in this video.
Also, one thing I should have called out more explicitly is that I personally like to think of vectors such as embeddings, keys, and queries as columns, and by convention display them that way, but other sources, including the Attention Is All You Need paper, may present them organized row by row. This is relevant to parsing the equation shown at 10:29, where the expression from the paper that looks like Q K^T would, by the conventions of this video, instead look like K^T Q.
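To see that the two conventions really are the same thing up to a transpose, here is a small NumPy sketch (toy matrices of my own, not anything from the video or the paper):

```python
# Row convention: softmax(Q K^T / sqrt(d)) V with vectors as rows.
# Column convention: vectors as columns, built from K^T Q instead.
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, d_k, d_v = 5, 16, 16
Q = rng.normal(size=(n, d_k))   # row convention: one query per row
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# Paper / row convention: softmax over each row (over the keys).
out_rows = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

# Video / column convention: vectors are columns, so use K^T Q and
# softmax over each column; the result is just the transpose.
Qc, Kc, Vc = Q.T, K.T, V.T
out_cols = Vc @ softmax(Kc.T @ Qc / np.sqrt(d_k), axis=0)

print(np.allclose(out_rows, out_cols.T))  # True: same numbers, transposed layout
```

The numbers match exactly; only the layout (rows versus columns) differs.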
I hope you make a video explaining the state space model in Mamba; the derivative of x in it quite confuses me.
Ugh, YouTube apparently doesn't like me posting links to papers and deleted my response, so I'll try again.
The source for what I'm saying is "Linear Representations of Sentiment in Large Language Models", but I haven't read it and have only heard a talk by the paper's authors, so take my vague explanation of it with a grain of salt.
On a quick skim, they mention the model moving summary information to "commas, periods and particular nouns."
@MarcusHilarius I had a longer response that got deleted by YouTube (probably because I linked to my paper as an example), but I think this is not exactly correct, in that the purpose of the Q and K matrices is for each attention head to approximately "read" the information in only some of the tokens.
So at each position the model could have the meaning of all the previous tokens, but it doesn't have to, and there's all kinds of interesting dynamics about which positions attend to which position in each layer and attention head for a given input.
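If you want to poke at those per-head dynamics yourself, here is a hedged sketch of one way to do it. It assumes the Hugging Face transformers library and the small GPT-2 checkpoint, which are my own choices for illustration, not anything referenced above.

```python
# Peek at which positions each head attends to, per layer, for a given input.
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("a fluffy blue creature roamed the verdant forest", return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq, seq);
# entry [0, h, i, j] is how strongly position i attends to position j in head h.
attn = outputs.attentions
print(len(attn), attn[0].shape)
```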
Row-by-row is actually, in some sense, the standard notation in deep learning, since most (all?) popular frameworks use it internally. I'm not actually sure why; I guess it's just an old GPU programming convention. It's not uncommon for researchers to use column notation in their papers, but personally I think it might be confusing for people new to the field, as the code will almost surely treat the vectors as rows (after all, the first data dimension is usually the batch size). It's not a very significant thing, of course.
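A tiny illustration of that row convention, assuming PyTorch is available (all sizes here are made up for the example):

```python
import torch

batch, seq_len, d_model = 2, 10, 64
x = torch.randn(batch, seq_len, d_model)   # batch first; each token embedding is a row

W_q = torch.randn(d_model, d_model)        # a learned linear map, applied on the right...
queries = x @ W_q                          # ...because the vectors sit in rows
print(queries.shape)                       # torch.Size([2, 10, 64])
```

Because each token embedding sits in a row, the map multiplies on the right, which is the mirror image of the column picture used in the video.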
The most common explanation I've heard for why there needs to be masking is that otherwise the model could just cheat. Every word would have the same query, "hey, what's the word after my position x?", and then the keys would all be "I'm at x+1" or similar. The value at the x+1 embedding would just be the word itself, so the model would know exactly what it needs to write and would always perfectly predict the next word.
This is still interesting though. What if instead of asking the model to predict the next one word, we asked it to predict the next 10? You'd implement it by sliding the triangle in the masking matrix 10 steps (masking more). Your loss would rise dramatically, since even if the model gets the "gist" of what it should write (like if you want it to talk about "Harry Potter" somewhere), it would still be off if, due to sentence structure, the words "Harry Potter" weren't in the exact same place as in the original text. You'd need to change your loss calculation to somehow focus more on the "semantic correctness" of those 10 words (a challenge, I'm sure), but the benefit would be that the model could generate whole phrases or sentences. I'd probably do this in some staged manner, like starting with a one-word loss and increasing the number of words as training goes on. I'm not sure what the benefit would be, but it might be fun to try.
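Here is a rough NumPy sketch of the masking idea, including one way to read the "slide the triangle" suggestion above; this is my own illustration, not something from the video.

```python
# A causal mask lets position i see positions <= i; "sliding the triangle"
# by k steps only lets it see positions <= i - k, so it must guess further ahead.
import numpy as np

def mask(seq_len, shift=0):
    # mask[i, j] is True where attention from position i to position j is allowed
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i - shift

print(mask(5, shift=0).astype(int))  # standard causal (lower) triangle
print(mask(5, shift=2).astype(int))  # triangle slid 2 steps: more is masked
```

In a real implementation, the early rows that end up with no allowed positions would need special handling, for example by dropping them from the loss.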
I'm a university lecturer with a PhD in AI, and I cannot compete with the quality of this work. Videos like this put the entire higher education system to shame. Fantastic! ❤️
so true
If you haven't already, check out the FAQ on his site about his animations. The library is called Manim and is available to anyone. It seems that Grant is on a mission to create a Math and Code Animation Army. Sign me up! I teach high school CS without a full CS degree and I was able to use it in my classroom. You can install it or just test it in a Jupyter notebook. It still requires some art to scaffold the instruction, but a good visualization can expedite the process.
Another role of universities, I think, is to hold the knowledge system and methodology together, with the help of social systems and social interaction both internally and with other parts of society. So the balance has now shifted: universities increasingly provide that holding service, while YouTube/online courses/GPT educate and transfer existing knowledge to a wide mass of people, simply because of the huge multiplication factor (tens of millions of viewers versus thousands of students). This new balance shouldn't be broken or rejected, but taken into account to increase the quality of education as a phenomenon.
@@iTXS - For sure. I, and the wider higher education sector, need to up our game.
@@iTXS they are failing hard at that too, especially because their "social systems" fell prey to some ideologies that are not conducive to critical thinking. We need to create a better system, starting from what we already have on the internet.
"provide holding services"
Why would we need that? Open source has solved that problem for software; science just needs to use our software development tools. You don't need the bloated organizational structure of universities at all to generate and curate knowledge to a high degree of success; that's what we software developers did with things like Linux.
You guys are just stuck in the past.
The volume of work, attention to detail and clarity we get from Grant is staggering. Bravo sir.
16:13
Are you kidding me? ONE WEEK FOR 2 MASTERPIECES?!
Thank you so much!
Waiting here for someone to correct your spelling...
@@sumedh-girish🤣
Peace man
I understand the context so no use dwelling on his misteak.
@@sumedh-girish on the contrary, I hope nobody ever corrects the mistake just to spite you >:3
@@skmgeek Humanity never learns from its mistakes.
I've got to say - "Attention Is All You Need" is an incredible title for a research paper.
Indeed
Famously so I think. I really really like these names instead of the long boring ones. But... Also people aren't clever and the creative names would be bad, so maybe we stick to long boring ones for the most part haha
I like the boring ones. That way I can easily decide which ones to read.
It’s going to be embedded into our future history lessons of when things began. =]
There are now more than 300 papers ending in "is/are All You Need"
How I wish this video had been available when the "Attention Is All You Need" paper first came out. It was really hard to visualize by simply reading the paper. I read it multiple times but could not figure out what it was trying to do.
Then subsequently, Jay Alammar posted a blog post called The illustrated transformer. That was a huge help for me back then. But this video raises the illustration to an entirely different level.
Great job! I'm sure many undergraduates or hobbyists studying machine learning will benefit greatly.
As a graduating PhD student working in Natural Language Processing, I still found that video to be extremely beneficial. Awesome!
Good luck for your PhD defense!
Wow, a nice competitive field! Best luck to you!!
Good luck out there!
can you please do something to make it safer and not just more powerful?
@@ChannelMath can you be more intelligent and not just scared?
You not only put out some of the best content on youtube but also give constant shutouts to other content creators that you admire. You are a GOAT 3Blue1Brown.
Attention existed before the 2017 paper "Attention Is All You Need".
The main contribution was that attention was... all you needed for sequence processing (you didn't need recurrence). Self-attention specifically was novel though.
Yeah, 3 years earlier by Bahdanau et al.
Before, people debated how many self-attention blocks you should stick into a model, where, and whether it was even worth it.
The paper showed that more was better, and that you're probably better off replacing linear layers and convolutional layers with more self-attention blocks.
There have been many people working on this idea long before 2017. I, myself, was working on a model very similar to this in 2010, and I was inspired by papers and books going back much further. It seems people are not fond of appreciating multiple, independent instances of creation and discovery. It happens a lot in many domains.
@@waylonbarrett3456 Interesting, do you have a source for a much earlier similar model architecture?
@Haligonian the sources for their inspiration and basis are cited in the "attention is all you need" paper. I believe it's at the bottom in the acknowledgment section. Then, you go to those papers and look at their cited sources, repeat ad infinitum. This is the true, branching complexity of creative works. There's not often some lone genius having a singular "Eureka!" moment.
3b1b is the only content producer whose videos I start by first making coffee, then upvoting, then hitting the play button.
... and then be disappointed at the end because you can't upvote a second time.
For me, it’s like first, then coffee second
I just like the video and then go on with life
Geez Grant, I spent thousands of dollars on a very good deep learning executive certification from Carnegie Mellon, and your series here is better than their math slides. This series is really turning out great.
I wish any of the online courses I did for work had this quality.
Never before have I felt the urge to support a creator for a specific video, but you, sir, have knocked it out of the park, and this is by far the best educational video I have seen. Not just on YouTube but in my last 20 years of working with data 🎉
As director of video content for a major educational publisher, this is some of the best educational content I’ve ever seen. Your content gives me ideas of how to shape the future of undergraduate level STEM videos. A true legend and inspiration in this space- thank you for the meticulously outstanding work that you do.
I cannot stress enough what a tour de force this is. It's probably one of the best math classes ever done anywhere in the world, at any time.
You're the best in the game and an inspiration for many. So so much thank you, Grant, you're doing God's work here.
I'm a Computer Science student currently working with a Transformer for my master thesis and this video is absolute gold to me. I think this is the best explanation video I've ever seen. Holy shit, it is so clear and insightful. I'm so looking forward to the third video of the series!!!! The first one was absolutely amazing too. Thank you sooo much for this genius piece of work!!!!
If I could write poetry about how much I appreciate and learn from your videos, I would but I'm not a poet. Thanks to everyone who worked on these videos.
oh how great the knowledge be,
that grant can share so graciously.
i would be lost in space and time,
if you, my precious would not shine.
it got a bit saccharine, but tried my best lol
i do agree though, this channel is amazing.
@@captainrob4656 There is a secret tool named chat GPT. You can use it to help you with your poem creation problems 😉.
In the vast digital sea, a beacon shines bright,
Three Blue One Brown, with wisdom alight.
AI, ML, and DL, in colors so clear,
Guiding minds through concepts, year after year.
With clarity and depth, their tales unfold,
Grateful hearts cherish the knowledge they've told.
In realms where algorithms dance and code reigns supreme,
Where data whispers secrets and dreams in every scheme,
There dwells a sage, a maestro of the digital domain,
Whose words and visuals in tandem, ignite the brain.
Through the ethers of the web, his teachings unfurl,
Like constellations in the night, they guide and swirl,
With each stroke of insight, he paints a vivid scene,
Unraveling mysteries, illuminating what has been.
Oh, if only words could craft the depth of gratitude,
For the knowledge gained, the horizons pursued,
Yet in this humble verse, let it be known and heard,
The gratitude abounds, for the wisdom conferred.
To the minds behind the screens, the unseen hands,
Who weave the fabric of understanding in digital lands,
We extend our heartfelt thanks, our deepest regard,
For kindling the flame of knowledge, burning bright and hard.
- By Chat GPT 3.5
Lessons truly grasped
The videos open minds
Thankfulness ensues
Just Wow, the educational value of this video is incredible.
There are so many highly relevant and original ideas to explain abstract concepts and drastically simplify comprehension.
I'm so thankful that you've made this content available to everyone for free.
I absolutely love it!!
Also, I want to add that making such knowledge available to the mainstream is truly a gift to humanity, and something you deserve to be proud of.
It is simply inexplicable how valuable these videos are to humanity, and I mean that literally. The way your videos convey the underlying ideas and intuitions of such complex technology opens the door for tomorrow's smart minds to learn even faster.
As a Master's student in Data Science and AI, I never really understood how attention worked. Thank you for making this video!
A master's student in DS and AI who never understood attention sounds wrong to me.
@@jabirlang216 He wasn't paying attention \s
@@jabirlang216 these programs are mostly non-rigorous. I only hire MS Statistics or MS CS.
the fact that this is freely available on YT is insane: thanks for all the amazing work throughout the years.
There are people … all over the world … like me … who really, really, really appreciate you. I cannot thank you enough for taking the time to share your knowledge and help others to understand this technology much more deeply. Seriously, kudos and sincerest thanks. ❤
Joining the list of commenters who have worked with transformers, built and trained them, explained them to others, and still learned really basic things from this video about what they're doing that just hadn't occurred to me. Great work.
Give an example.
The drought of 3blue1brown content has finally been overcome!
Don’t jinx it!
It takes time to learn these concepts from scratch and make these high quality videos.
As long as the videos keep coming well thought and well researched as they are, I don't care if they take their time! Congrats Grant and 3b1b team!
Shall I drop the next hint and see how quickly new content comes out?
WE FEAST
Grant is all you need.
This was probably the tenth video or podcast I've seen about the subject, and only now do I understand the underlying motivation for each component it has.
This is the best explanation of the attention mechanism in transformers. The concept of queries, keys and values is hard to grasp for those not well versed in linear algebra. With these high-level visualizations and clear, concise explanations from this amazing teacher, one can really appreciate how remarkable the ideas in this model are. Thanks.
I wish some wealthy guy would fund you millions of dollars so that you can form a team and come up with video explanations for top research papers as early as possible. What a blessing that would be!
I've been trying to understand this paper to this level of detail for years. And you took care of it in 20 minutes. You are a master at what you do.
This is pure gold! Never seen such a good explanation of the attention mechanism before. Thank you for this.
You made the best attention video I have watched online. This should be a compulsory part of all NLP courses right now. Thanks for releasing this banger!
This video is worth gold. Trying to understand exactly why and how self-attention and cross-attention work nearly brought me to tears these past years, and this video explains it so well and with visualized examples.
I work with ML for computer vision and have never really understood transformers. This is by far the most clear explanation of them that I have seen!
The dissemination of scientific concepts transformed into art. Immensely enjoyable and useful for the education of millions. Thank you as many times as the number of model parameters!
Rewatching the video is all I need, apparently :)
On a serious note, I've heard so much about this "attention" mechanism and always wondered how it worked, so thank you for making this video!
Finally someone explains it right.
One simple but important thing I would emphasize for people new to AI is that all the matrices/vectors used in this process are "discovered" by the model through gradient descent (driven by the training loss) for the specific task; in this case, predicting the next word.
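To make "discovered by gradient descent" concrete, here is a minimal single-head sketch in PyTorch. Every size, name, and the fake data are made up for illustration; a real model wraps this in proper modules with many heads and layers.

```python
import torch
import torch.nn.functional as F

d_model, d_head, vocab = 64, 16, 100
# The query/key/value (and output) maps are ordinary learnable parameters.
W_q = torch.nn.Parameter(torch.randn(d_model, d_head) * 0.02)
W_k = torch.nn.Parameter(torch.randn(d_model, d_head) * 0.02)
W_v = torch.nn.Parameter(torch.randn(d_model, d_model) * 0.02)
W_out = torch.nn.Parameter(torch.randn(d_model, vocab) * 0.02)
opt = torch.optim.Adam([W_q, W_k, W_v, W_out], lr=1e-3)

x = torch.randn(1, 10, d_model)              # fake token embeddings
targets = torch.randint(0, vocab, (1, 10))   # fake next-token ids

scores = (x @ W_q) @ (x @ W_k).transpose(-1, -2) / d_head ** 0.5
causal = torch.triu(torch.ones(10, 10), diagonal=1).bool()   # hide future tokens
scores = scores.masked_fill(causal, float('-inf'))
out = torch.softmax(scores, dim=-1) @ (x @ W_v)              # attention output
logits = (x + out) @ W_out                                   # residual add, then unembed

loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
loss.backward()   # gradients flow into W_q, W_k, W_v, W_out
opt.step()        # gradient descent is what "discovers" these matrices
```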
This video will become a part of history.
bot
@@MrZorroZorroZ lmao
Congratulations, your transformers series is a technical-education masterpiece. I've been building products with LLMs for the past two years and shamefully only understood them at a very superficial level. You have a rare gift for clarifying without oversimplifying or condescending.
Your explanation of Q, K, V, which is probably the core of what attention does, is fantastic. I have watched at least several dozen videos, but I don't think anyone has been able to explain it as succinctly as you have with your examples of what an attention head "might be doing". Kudos, count me in as a supporter from now on.
So much value in one video. You are making the world a better place.
Truly
Will you still hold this opinion when someone unleashes an unaligned AI on the world?
@@MrFujinko then it's the "someone" who IS the problem in the first place, dah
@@Otomega1 You ain't a team player. This is the modern world, everyone is to blame, everyone works for the system. You and I and anyone reading this are guilty.
D.Trump can lead the resistance against misaligned AI; he has developed several novel strategies for reducing rogue AI systems into a harmless atavistic state.
the amount and complexity of material this dude is able to condense into 26 mins--all the while making it unintimidating is genius
The video says that masking is for preventing later tokens from influencing earlier ones. How does GPT3 handle the sentence: "The cat that left the town came back this morning"? The phrase "left the town" modifies the noun "cat." Later tokens do influence earlier ones in this case.
In that case, masked self-attention would have to do something a little different from how we humans might naturally think about it, for example, baking in all the meaning of a cat leaving the town into the embedding associated with "town".
@@3blue1brown is it right to say that all of the meaning is embedded in the last token, so "cat", "left", "town", "came" and "back" are all embedded in "morning"?
@@Scubadooper Well, remember that the point is to have, in the last layer, the embedding of a word be the prediction of the next word?
So yes, the goal is exactly to have the whole meaning of the sentence end up on the last token.
@@3blue1brown Incorrect, the word "town" would become empty and meaningless once the cat leaves it, this example would make the GPUs explode.
@@Scubadooper You can think about it like "THE MORNING", that important morning when the cat came back after leaving the town (again that sad town that the cat left) previously.
Your work for humanity is invaluable. Aside from publishing free videos about these subjects, the effort you put into making something this complex reachable to all should always be commended.
I have tried to understand this subject multiple times, reading the papers and everything, but I couldn't create a simple model in my head about what it was doing or what it brought to the table to be such a breakthrough in this field. Thanks a lot!
Do you mean "immeasurable value"? I certainly value it
Maybe you should say "is invaluable" vs "has no value". ;)
@@nelkabosal Thanks!
@@Scubadooper Thanks!
The Simpsons Dr. Nick: “Inflammable means FLAMMABLE??!”
Videos like this put the entire higher education system to shame. Fantastic!
I can’t believe you can watch such an excellent lesson on YouTube for free!
I understood the keys and queries vectors in a more generalist way but your explanation of them as "who's an adjective?" and "I am an adjective, look at me!" is just amazing!
If there were a Nobel Prize for educational content, this would be a strong contender. Well done!
Your way of explaining concepts feels so interactive to me. Follows the narrative > A question appears > 3B1B answers it immediately > On to the next subject. Thanks for the superb content!
To me, the most confusing part when studying this for the first time was the seemingly arbitrary separation between keys, queries and values. In particular, the value matrix seemed sort of redundant if you already had the keys, and the terminology coming from databases doesn't help much imo. This makes it much clearer, thank you!
I think the separation exists because otherwise you'd need some super array for all possible combinations. At least that's the intuition I have about it.
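One way to see the role of the values, with a toy NumPy example of my own (not from the video): the keys and queries only decide how much each position gets read from, while the values decide what actually gets passed along, and they can live in a different space than the keys.

```python
# Keys decide *which* positions match a query; values decide *what* gets copied.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, d_k, d_model = 4, 8, 32
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_model))   # lives in a different (larger) space than K

weights = softmax(Q @ K.T / np.sqrt(d_k))   # "who matches whom"
update = weights @ V                        # "what actually gets written back"

# Reusing K as the values would trap the update in the small d_k-dimensional
# matching space instead of the full embedding space:
update_without_v = weights @ K
print(update.shape, update_without_v.shape)  # (4, 32) vs (4, 8)
```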
These videos should be on the Trending list. One doesn’t necessarily need to have a CS/Math background to appreciate this content and the immense effort behind creating this masterpiece! Grant, the world is indebted to your contributions in revolutionising how Math can be taught! You TEACH intuition!
Ooh, you've improved the patreon preview so much in just several days. Bloody well done, sir.
Thanks, and my appreciation to you and others who helped give feedback on the earlier version.
The way these complex concepts are visualized makes the material very approachable for newcomers like myself. I appreciate this work very much!
I wish you had explicitly mentioned that the step of adding all the computed value vectors back to the original input vector of an attention head is the "residual connection" / "Add" step in the Transformer paper. That's not immediately clear to someone who is connecting the dots between the paper and this video.
But other than that small qualm, this is by far the best video out there :)
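For anyone mapping the video onto the paper's block diagram, here is a minimal sketch of that Add step; the attention_head function is just a placeholder I made up for the weighted sums of value vectors.

```python
import numpy as np

def attention_head(x):
    # Placeholder: in a real head this would be softmax(QK^T / sqrt(d)) V.
    return 0.1 * x

x = np.random.default_rng(3).normal(size=(10, 64))  # token embeddings
delta = attention_head(x)   # weighted sums of value vectors
x = x + delta               # the residual connection, the "Add" in the paper
# (the paper follows this with a LayerNorm, the "Norm" in "Add & Norm")
```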
Man I admire your patience to explain this step by step and with great visuals. I can't thank you enough for making such complex topics somewhat comprehensible for people without the corresponding background. The day you started this channel deserves to be a national holiday.
I'm still absolutely fathomed we're now using O(n^3) algorithms like that because we have so much hardware in the last 10 to 20 years. That's the magic of LLMs, its not that they are "large", but that we have engineered enough hardware to be able to play with such things.
But in the end it's all just a lot of linear algebra, isn't it?
Imagine 25 years ago: you have your Pentium MMX with, if you're lucky, an 8MB Voodoo, 32MB RAM and an 8GB disk, and you're told: in 20 years you're going to have a little supercomputer in your computer, thousands of operands wide, with 8GB of RAM on there or more. You might not be entirely surprised; Moore's Law sort of checks out, after all you had a 486 just previously with a simple Trident VGA, and maybe a 1MHz Commodore 64 before that. And yet it is impossible to grasp this power. And that we'll all be using a dark-faced chat app that runs slower than AIM/ICQ did back then while not doing a whole lot more. But that you would also be able to run mad neural network stuff on your computer with an eerily almost human-like capability in language and other complex fields.
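(Editor's aside on the complexity point above: a naive multiply of two n x n matrices is indeed O(n^3), but for attention the relevant blow-up is usually described as quadratic in context length, since the Q K^T score matrix has one entry per query-key pair. A rough, purely illustrative multiply-add count, using GPT-3-style dimensions as stand-ins:)

def attention_madds(n_tokens, d_embed=12288, d_key=128, n_heads=96):
    # Rough multiply-add count for one attention layer; softmax and other
    # constant factors ignored, numbers purely illustrative.
    proj   = 3 * n_tokens * d_embed * d_key * n_heads   # Q, K, V projections: linear in n
    scores = n_tokens * n_tokens * d_key * n_heads      # Q @ K^T: quadratic in n
    mix    = n_tokens * n_tokens * d_key * n_heads      # weights @ V: quadratic in n
    return proj + scores + mix

for n in (1_000, 10_000, 100_000):
    print(n, f"{attention_madds(n):.2e}")
# Once the n^2 terms dominate, 10x the context costs roughly 100x the score/mix work.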
This is probably the BEST channel to learn the fundamentals of Gen AI. I came here after reading the Foundational Large Language Models & Text Generation paper, where these concepts felt hard to digest. But you hit a home run with all the visuals. Thank you for helping me understand these concepts!!!!
26 minutes of pure joy!
This comment was 4 minutes after the video released!!
YOU DIDN’T WATCH THE VIDEO!!!!!!
Mtfk, this topic is so hard; I can't understand what Grant is saying after 6 minutes into the video!
There's only one other orator I like as much as 3blue1brown, just one. And that other orator who is just as pellucid, and engaging is Donald Trump. Yep, I only pay attention to 3blue1brown and Donald Trump; it's almost like they are the same person.
The amount of intellectual satisfaction this series has brought me 📈📈📈
Thanks a Ton.
I’d really love a series like this on quantum computing. I don’t think any such video exists, let alone anything of 3B1B quality. Well, here’s begging. :)
Veritasium has a few videos on quantum computing that help break it down for better understanding!
I'm very much not a math person but I've been binging your stuff for a good 4 days now. Amid the general sense of wonder, I keep being reminded of the H.G. Wells quote "History is a race between education and catastrophe"
Idk if that ethos is something you deliberately bake into this channel, but I very much see it in everything you upload and I can't thank you enough. Idk if you fully appreciate the reach your efforts could have, but either way, thank you for taking the time to break this stuff down while also not dumbing it down. I don't understand everything you say, but you make me curious enough to figure it out, and I feel genuinely empowered as a result. I'm sure I'm not just speaking for myself.
Great video as always! Minor quibble at 9:00: I have always heard and understood “attend to” as being from the perspective of the query (the video uses the key’s perspective), so it would be “the embedding of creature attends to fluffy and blue” instead. It doesn’t really matter since the dot product is symmetric; I just haven’t heard it used colloquially in that direction (maybe due to the axis that the softmax is applied on?)
Skimming through to determine if this would explain it well to friends who are interested, but yeah, I noticed all the matrices in this video are transposed relative to the paper's notation. He's not wrong about what the calculations are doing, but it's more confusing this way, especially when he shows softmax(Q K^T / sqrt(d_k)) V as the notation. With this formulation, Q, K, V should have rows of projections so the attention matrix is properly formed. Softmax is then done over rows, so when multiplying with V on the right, you're aggregating weighted rows of V.
I had the same reaction initially, but I'm not a native English speaker. I googled the meaning of "attend to" and it gave me: "has to do with". So if you think about it like that, fluffy and blue have to do with creature, not the other way around. Maybe that helps.
That was my understanding of "attend to" in this context also; eg 'creature' would attend to 'fluffy' and 'blue'. @3blue1brown are we mistaken there?
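(For what it's worth, the direction is easiest to see in code: with embeddings as rows, each row of the attention pattern belongs to one query, and the softmax runs across the keys in that row, so that row is the distribution with which that token "attends to" the tokens before it, e.g. the 'creature' row putting most of its weight on the 'fluffy' and 'blue' columns. A minimal sketch under those row-convention assumptions, illustrative only:)

import numpy as np

def attention_pattern(Q, K):
    # Q, K: (n_tokens, d_key), one row per token (row convention).
    # Row i of the result: how much token i's query attends to each earlier token.
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # (n_queries, n_keys)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)               # causal mask: no peeking ahead
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilize
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)               # softmax along each row (over keys)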
17:50 -- Love the 3b1b humblebrag here. Essentially "those paper writers make things confusing, and I am here to lead you with knowledge". Thank you Grant for bringing this to all of us!
This is the best series by far for all AI/ML enthusiasts... happy learning
Thank you, 3Blue1Brown, for providing such a high-quality, intuitive lecture on LLMs. You are the best teacher I have ever seen at explaining things this clearly.
It is 1:00 in Sydney right now and I'm up late watching your video from my bed. I should probably get some sleep, I have morning classes; it's just that your content is too goddamned interesting. Plus, I'm a teenager. I can't be separated from my phone except by 16th-century-French-style beheading. POST MORE VIDEOS! If I can't sleep you shouldn't get the luxury!
This series is incredible. I'm an environmental and agricultural sciences master's student who is just super interested in AI. I've dabbled in uni courses which were fairly superficial (given my background and program) and decided to write my master's thesis about a specific practical use case of genAI. I know so very little, and these videos are tremendously helpful to help guide me through this super daunting and scary task. With my specific goal in mind, I have spent 3h on this video and the one before that alone. So so information dense yet highly comprehensible. Thank you from the bottom of my heart. Thanks to the wider "ML community" (and genAI itself lol, thanks to state of the art LLMs), I feel like I can actually do this (and learn so incredibly much - if not for my future career, for sure for being an informed person in this AI era)
I should currently be studying chemistry, and instead I'm here watching your awesome videos. Thanks for presenting these topics so they are interesting
By far this is (with ch. 5) the best explanation of Transformer architecture that I've ever seen. I previously understood Attention intuitively but struggled to explain this during job interviews under stress. You just helped me land a job. Thank you!
When are you uploading the next video? This series is one of the best resources for learning transformers.
At the end of the day it is the mathematicians who have the best understanding of any concept, be it physics or AI or just pure abstract math.
Thank you Grant for your amazing contribution and taking us along this journey...
When I first learned how matrix math worked in 9th grade, I asked the question many students do: “What’s the point?”
If I ever end up teaching 9th grade math, I have a lot of cool examples to share with students who are similarly doubtful.
By the time I was at that point, sometime in the 90s, I had already tried implementing a 3D engine, so I knew how useful matrices can be. But back then, we didn't really have SIMD. That came a little later, which made it that much more relevant. I think it's similar, in that you're reformulating things to make them as regular and as mutually independent as possible, specifically in order to be able to parallelise them, but it gives you some neat advantages and expressive power beyond that as well. So many mathematical concepts could be considered optional, but all of them increase your expressive power.
Like a delicious coffee, like a fine wine, I sip every minute of these videos deliberately. It should not end after only 26 minutes!!
where is the next chapter 😭😭😭😭😭
Waiting
Oh god I can't breathe 😥😥
He's edging us
Yes, next chapter please!
We need the last chapter
What an absolute masterpiece, thank you so much for breaking this down - I read that paper over and over but never reached the enlightenment this video provides ❤
Watching this video made me realize how convoluted we, the AI research community, have made the terminology of deep learning.
Plainer / simpler terms are needed in research papers, as papers like Attention Is All You Need are read by all the generations of researchers coming after us.
I'm afraid this applies to all science fields. People tend to create jargon in the subjects they work on to make the job more efficient / faster for them and others like them. So papers are shorter, more compressed. Others, to be able to read them, need to learn those "compression algorithms" ;)
Unfortunately it makes it harder to learn anything from scratch, so we need people who "decompress" it for us and teach us how we can do it ourselves.
Researchers use tough language to keep outsiders out
@@venugopal-nc3nz Nah, it's a very natural process that happens to everybody. The more you know, the more you "compress" the data to make it more efficient to process. That way one sentence can have very deep meaning for others like you, and you don't need to write 20 pages to describe the new concept you are working on. In fact, deep learning implements exactly this: very sophisticated compression algorithms ;)
Computer science is a field so full of misnomers that it's itself a misnomer. It's a field of engineering, not science, and its subject is not computers (those are the subject of computer engineering, i.e. electrical engineering) but software. It's also not software engineering per se, which is concerned with software architecture, though that can be considered a sub-field of computer science. It is more like information-processing engineering, with a heavy focus on algorithms and data structures.
Unfortunately, a certain amount of hindsight is needed to name things elegantly, in a way that leans more towards self-explanatory, but there is resistance to renaming things, because it would break existing workflows.
@@SianaGearz I would argue that computer science is part of science, at least the AI part we are talking about. Of course there is a lot of engineering there, but without science those engineers would have nothing to do ;)
Not even the best doctors at our universities can provide such a masterpiece of an explanation! Thank you so much!
Since maybe 6th grade, I thought math was out of reach and that I was too slow and too "dumb" for it.
The content that you (especially the Lockdown Math series) and the community have put out so far showed me (kinda opened my eyes) that everything is within reach and can be understood. It unlocks so much, and it's so beautiful as well.
Someone who easily turns the waterworks on might shed a tear. It's like reuniting with a long-lost family member or friend you thought you would never meet again.
Hi, I'm not good at speaking English, I'm learning, and I want to say thank you. In Spanish there aren't videos like this (I'm sorry, I don't know how to say this on YouTube; I don't know if it's called a video like in Spanish). Well, thank you. I am 16 years old and this helped me.
Grant, your work provides immeasurable value. This video is a striking display of the towering heights mankind is able to achieve.
I've been spending a lot of time trying to understand transformers before watching this. I have built a very, very small one myself to better understand, but with my limited resources it can't do much.
Anyway, the Q, K, V were very confusing to me, but now I finally understand.
The thing that is most difficult to wrap my head around is the insane dimension sizes and parameter counts. Those numbers are so high, and all of them multiplying together to get ~57 billion parameters for the attention blocks alone is insane and fascinating to me. And then adding the rest, which is not attention...
Hi Akari! If you have time, I would love to talk with you about your process of building the transformer. I'm also in the process of understanding this whole world. What do you think about having a chat? We learn more when we teach ;)
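(Editor's note, since the ~57 billion figure comes up a couple of comments above: here is a rough back-of-envelope sketch using GPT-3-scale dimensions, i.e. 12,288-dimensional embeddings, 128-dimensional keys/queries, 96 heads and 96 layers, with the per-head value map factored into a down- and an up-projection. Everything else, biases included, is ignored, so take it as illustrative only.)

def attention_params(d_embed=12288, d_key=128, n_heads=96, n_layers=96):
    # Parameter count for the attention layers alone, GPT-3-scale dimensions.
    per_head = (d_embed * d_key      # query matrix
                + d_embed * d_key    # key matrix
                + d_embed * d_key    # value-down projection
                + d_key * d_embed)   # value-up / output projection
    return per_head * n_heads * n_layers

print(f"{attention_params():,}")     # 57,982,058,496 -- roughly 58 billion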
Watching this entire video was a cathartic experience. Can't wait for chapter 7, where the emphasis is on training the feed-forward neural network weights!
I see upload => I watch
I'm a data architect and just want to say thank you for putting this together ❤️ I'm off to write a LinkedIn post on this 🎉
Where's chapter 7😢
@@toshyamg just came out!
Consistency 🗿
What a beautiful illustration of attention. I want more on this topic!! Variations, details, depth, concepts, illustrations - this is just wonderful. I could watch this video on repeat and learn something new each time.
These learning videos were made by an AI 100 years in the future and sent back to accelerate its own development.
I have witnessed such great genius with such a thorough understanding of these concepts. If there were ever a list of the greatest teachers, this guy would need to be on it.
Those who completely understood attention after this video will be feeling like God/Goddess 👼👼
Humanity owes you a debt of gratitude beyond measure.
As for myself, I can't help but feel a tinge of disappointment. Your original artworks in this channel, in their essence, serve as poignant reminders that life is fleeting, revealing the vast expanse of what I have yet to comprehend (not to mention "to teach").
Great video. Kind of saddens me that we're leaving the golden age of open-sourced AI research. "OpenAI" took all the publicly shared transformative research from before (Transformers / convolutions, BatchNorm, etc.), used it, and then shut the door behind them. Well done to them for making billions. But unfortunately, this means that the next revolutionary advancements like Transformers aren't going to be published openly.
I have watched so many videos on attention, but none of them is as easily understandable as this one. Brilliant work!
Where is Chapter 7
3Blue1Brown is the reason why internet appeared!
It's been 3 months. No Chapter 7
This is hands down the best lecture I've seen in my life.
Had been seeking a simple and easy-to-understand take on the Attention Is All You Need paper, and this is the BESTEST so far. The rest of the videos basically just "read the paper out loud" and treat you like a PhD student, and if I could do that, I would read it myself. Papers are superbly technical, and I am sure there are tonnes of AI enthusiasts and developers who want a basic understanding of "what it's all about" and the "ORIGIN".... so this just hits the spot. Amazing work, and I always love the Pi's. 🕺💃
As a hobby AI enthusiast, I've been trying very hard to understand transformers, but failed. Thank you, now I do! How do people get this without 3blue1brown videos? Are they that smart? How many people could grasp all this from the "Attention Is All You Need" paper?
This series is so very good. Thank you for making the dense field of machine learning more accessible.
Absolutely amazing. Recently I've become a student of data science; professionally, I'm responsible for RAN network strategy for a telecom operator. I am completely, deeply amazed at how your work helps me with both of these fields! Indescribable :)
Pure gold. I have been researching and trying to understand this topic for some time now. Nothing beats the quality of this presentation.
00:02 Transformers use attention mechanisms to process and associate tokens with semantic meaning.
02:38 Attention blocks refine word meanings based on context
05:15 Transforming embeddings through matrix-vector products and tunable weights in deep learning.
07:47 Transformers use key matrix to match queries and measure relevance.
10:31 Attention mechanism ensures no later words influence earlier words
12:55 Attention mechanism variations aim at making context more scalable.
15:26 Transformers use weighted sums to produce refined embeddings from attention
17:58 Self-attention mechanism explained with parameter count and cross-attention differentiation.
20:08 Transformers use multi-headed attention to capture different attention patterns
22:34 Implementation of attention differs in practice
24:53 Attention mechanism's success lies in parallelizability for fast computations.
Crafted by Merlin AI.