I just finished part 2 last night, and I was feeling blue that there was only 1 video left! Then this came up in my notifications, and I just had to share my excitement :)))
Very useful educational videos, thanks for making and sharing them! It's interesting that Andrej also considers the shapes when backpropagating through matrix multiply, just how I came to "memorize" it :)
Thank you Andrej for sharing your experience with us! John Carmack used exactly this learning method, as he said in his interview with Lex Fridman. In his "larval stage", he implemented the whole NN machinery, including backpropagation, in C (so really low-level :)), to make sure that he understood how stuff works!
I don't know why (since I made the same calculations, using 1.0, float, etc.), but starting from the calculation of dhpreact I lose the exact match, then the errors accumulate, and from dbndiff onward even the approximate check fails (it is still pretty close, so I suppose it is some integer/float thing somewhere). Highly frustrating.
@@tommasopellegrino8758 Yeah I've tried several different pytorch versions in colab - made no difference. Couldn't figure out how to change the cuda version so I guess I'll have to cope 😂
Question: At 1:45:54, you conclude in the last derivation step that d sigma^2 / d x_i = 2 / (m-1) * (x_i - mu). This would be correct if mu were just a constant, but in fact, mu is also a function of x_i: mu = 1/m * sum_j x_j. So how does this cancel out so that you still end up with your simple expression?
I noticed that as well when I tried to solve those derivatives by hand. And dx̂_i/dx_i isn't as simple as shown at that timestamp either, for the same reason. Assuming we are right, the final derivative looks even worse 😅
I think this may be intentional from adding up every path through the graph separately. That equation pretends that mu is constant, because its contribution is added separately when we do dL/dmu*dmu/dx_i. dL/dmu does include the connection to sigma^2, as the "33rd fan out" (the first 32 being dxhat/dmu). As we then find out this contribution is zero anyway because a fan-out m cancels out the division by m in the mean, and it all adds up to nothing. Or I might be totally wrong :-)
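A quick numerical check of the question above, as a sketch only. It assumes the Bessel-corrected variance sigma^2 = 1/(m-1) * sum_j (x_j - mu)^2 with mu = 1/m * sum_j x_j, and the batch values are made up. The extra term that appears because mu depends on x_i is proportional to sum_j (x_j - mu), which is zero, so the simple formula should agree with the total derivative:

```python
def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # Bessel-corrected variance; the mean is recomputed from xs on every call
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

xs = [0.3, -1.7, 2.2, 0.9, -0.4]   # made-up batch of activations
m, mu = len(xs), mean(xs)

# Contribution of the path through mu: -2 / (m (m-1)) * sum_j (x_j - mu)
mu_path = -2.0 / (m * (m - 1)) * sum(x - mu for x in xs)

h = 1e-6
max_err = 0.0
for i in range(m):
    up = list(xs); up[i] += h
    dn = list(xs); dn[i] -= h
    numeric = (var(up) - var(dn)) / (2 * h)   # total derivative, mu moves too
    simple = 2.0 / (m - 1) * (xs[i] - mu)     # formula that treats mu as a constant
    max_err = max(max_err, abs(numeric - simple))

print(mu_path, max_err)
```

Both printed numbers should be at rounding-error level, which matches the "it all adds up to nothing" explanation in the reply above.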
Excellent series and delivery as usual. Thanks for all the hard work you put into this. Part of it is challenging to get through but a joy to decipher all the moving parts. I think a good understanding of the math behind backprop helps to understand this. A good resource that covers this from a math perspective is Andrew Ng's original Neural Net course.
Andrej, thank you for the work you put into this (and previous) lectures❤. Thanks to you, me and a lot of other people can enjoy learning NN 😍from the best.
Potentially an easier way to update dC in exercise 1 by flattening Xb and demb:
dC = torch.zeros_like(C)
for x, e in zip(Xb.view(-1), demb.view(-1, 10)):
    dC[x] += e
BTW, thanks for the explanation! It helps me a lot!
Take as much time as you want and try to go through this video the way he instructs; it is the only way to absorb the content of this lecture. Working through the derivatives on your own helps a lot. You may have to watch the exercise 1 part twice.
53:55 if you scroll down, Wolfram Alpha provides 1 - x^2 + 2/3x^4 + O(x^5) as series expansion at x=0 of the derivative of tanh(x), which is the same as the series expansion for 1-tanh(x)^2.
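To see the point about the series expansion concretely, here is a small sketch in plain Python (the sample points are arbitrary). It compares the exact derivative 1 - tanh(x)^2 with the truncated series quoted above and with the cruder 1 - x^2:

```python
import math

def dtanh(x):
    # exact derivative of tanh
    return 1.0 - math.tanh(x) ** 2

def series(x):
    # truncated expansion quoted above: 1 - x^2 + (2/3) x^4
    return 1.0 - x ** 2 + (2.0 / 3.0) * x ** 4

for x in (0.01, 0.1, 0.5):
    # columns: x, exact, series up to x^4, series up to x^2 only
    print(x, dtanh(x), series(x), 1.0 - x ** 2)
```

Near zero all three columns agree closely; as x grows, 1 - x^2 drifts away first, which is what you would expect from a low-order truncation of the same function.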
Such a great man, he just made all these lectures free, while a mean uni will charge you even for irrelevant content. I wish I can make the world a better place using AI in the future. For now, I can do it by commenting on this video, so that the algorithm recommends it to more people trying to learn neural networks. This comment and all the other comments are making the world a better place... And Andrej Sir, I will pay you back with some cool stuff built by me for this world.
Can someone please explain to me where e^li came from in the term -(e^ly * e^li) / (Σe^lj)^2 at 1:30:11 (just under the separation line for i≠y v i=y)? I understand from the above line that we are looking for the derivative of e^ly / Σe^lj. So, when we consider the denominator we would get e^ly * -(Σe^lj)^-2 = -e^ly / (Σe^lj)^2 but the solution multiplies it by e^li which I do not quite get. Cheers!
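A sketch of where the e^li factor seems to come from, assuming the term at that timestamp is the i≠y case of differentiating e^ly / Σe^lj with respect to l_i. The denominator is a function of l_i, so the chain rule adds the inner derivative of the sum, and only the j = i term of the sum survives:

```latex
\frac{\partial}{\partial l_i}\left(\frac{e^{l_y}}{\sum_j e^{l_j}}\right)
= e^{l_y}\cdot\left(-\Big(\sum_j e^{l_j}\Big)^{-2}\right)\cdot
  \frac{\partial}{\partial l_i}\sum_j e^{l_j}
= -\frac{e^{l_y}\,e^{l_i}}{\big(\sum_j e^{l_j}\big)^{2}}
\qquad (i \neq y)
```

So the -(Σe^lj)^-2 part is the outer derivative, and e^li is the inner derivative of the sum with respect to l_i.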
I can say without a doubt that there are not many highly qualified, passionate teachers who are also able to teach their subject. Sharing knowledge in this way is the greatest gift a researcher can give to the world! Me and everyone else thank you for that! :)
I saw his previous micrograd lecture and it literally moved me to tears. I had endured the struggle of drowning in pytorch source code, trying to understand what it is that they are really doing! For someone who simply can't move past without cutting open abstractions, this is pure blessing.
@@vaguebrownfox exactly the same with me
These lectures are so good that I have to watch them several times while relaxing to let them sink in. I’m middle aged but can still follow 😅
Andrej you are a gifted teacher. I love this teaching style of starting from scratch with a simple specific model to set the structure and ideology of the problem. 2. Add necessary and motivated complexity to get where we are today, 3. Seamlessly transfer to modern technology (eg PyTorch) to solve modern problems. 4. You make it all simple and compress it into the essentials without unnecessary lingo. It reinvigorates my passion for the field. Thank you very much for taking so much time to make this for free for everyone.
These lectures are literally GOLD. I'd pay for these, but Andrej is kind enough to give everything for free.
I hope others find these gold lectures. Thank you so much for doing this. Please don't lose steam and I hope you continue to create them.
Bro, just want to say that for the past 3 years I've been looking everywhere on the Internet for an explanation like this for backpropagation. Found all kinds of things (e.g. Jacobian differentiable) but none actually made sense until today. U r the best, you bring so much value and let others light their candles at your light
The line "let others light their candles at your light" 👏👏👏
Man, what a time to be alive. Imagine how hard it would be to get this kind of information just a couple decades ago. And now it's free and easily accessible at any convenient time.
Thank you, Andrej, truly.
I almost completed Exercise 1 all on my own, but I had to step back for a day to refresh the basics because my college algebra was a bit rusty from 10 years of not using it. Exercises 2 and 3 totally overwhelmed me. However, when I follow your explanations, I understand everything. This is huge, because I remember that professors at my college couldn't explain complex concepts so easily. Andrej, you are a gift to this world!
I spent almost a whole day digesting this video. It's definitely worth it!
What impressed me the most is that even at his level, Andrej still works things out in detail with pen and paper, by his own hand, with humility and commitment. And it's really inspiring to see Andrej's happy reaction when he solves a specific simple problem. We should always work with passion. Thanks, Andrej!
It suddenly makes my school pen-and-paper exercises make a lot of sense! Such great teaching
I was "taught" calculus in high school but didn't really understand anything at all. Now, after seven years of no formal math education at all, I was able to immediately understand this exercise thanks to your lecture on micrograd. You're a brilliant teacher and I'm really grateful for that!
This lecture series is excellent. Seriously, some of the best learning resources for Neural Networks available anywhere: up-to-date, and goes deep into the details. These lectures with detailed examples and notebooks are an amazing resource. Thanks so much for this, Andrej.
Bruh, I'd be paying a shit ton of money in education for this otherwise free knowledge if it wasn't for your videos. Thank you so much, man. I cannot believe the ease with which you explain what seemed complex to me from a distance years ago. I cannot even believe I understand this stuff, man.
This dude is based! I can actually cognitively map and visualize his explanations, and I am so grateful to have found him. Keep the videos coming please, and thank you so much.
Hello Andrej, I truly love that you included exercises in your video. Your suggestion to first attempt to solve the exercises and then watch as you provide the solutions is the most effective way for me personally to grasp the concepts. Thank you for your outstanding work!
Andrej is providing the world with so much value, be it through his professional work in the industry (e.g. Tesla AI) or through education. He is literally one of the greatest of all time but is so down to earth and such a sweetheart.
Thank you very much for your hard work to make it easier for all the rest of us and for inspiring us! 💚
thank you for bringing me here Boris
yes, I always wanted to be a backprop ninja, now my dream will become true, thanks Andrej!
Andrej, you are the best teacher. I am 100% sure these lectures will become CORE watching for any student who starts his ML journey. Hope we will have such lectures on CV and RL.
Thank you for providing a series that's so approachable but doesn't shy away from explaining the details. Also love the progression through all the impactful papers
These lectures are my favorite way to spend time at the moment. I love the comparison of training the neural network as a complicated pulley system, as a civil engineer this is very intuitive. Many Thanks Andrej, you are the best!
Thanks for the great content!
I really appreciate the lectures that you share with us. It is not about definitions, raw memorization, or even exercises per se. Instead, it is first-principles thinking: take a big "mess" and then break it down into small manageable pieces. You not only demonstrate the problem-solving approaches brilliantly but also ignite curiosity to dig deeper (to go down to the level of atoms) into a specific topic. Thank you for the preparation, the passion, and the memes! :D
Each time I finish a lecture in this series I feel satisfaction and I say to myself: Andrej must be a wizard! He takes you by the hand from scratch, showing you not only a clean solution but also shedding light on problems and guiding you through to the end. Remarkable talent.
I am still on part 2 but I had to write this comment: your part 4 thumbnail is awesome and funny
I am very grateful for these lectures.
I could feel that the artificial intelligence knowledge that was intertwined inside me was well aligned because of you.
01:25:00 Here is the better implementation of the code:
dC = torch.zeros_like(C)
dC.index_add_(0, Xb.view(-1), demb.view(-1, 10))
Thanks to the ChatGPT :)
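For anyone who wants to convince themselves that the index_add_ one-liner above really matches the explicit loop, here is a self-contained sketch. The shapes (27 characters, 10-dim embeddings, batches of 32 x 3) are assumed from the lecture, and demb is just random data standing in for the real upstream gradient:

```python
import torch

torch.manual_seed(0)
vocab_size, n_embd, block_size, batch_size = 27, 10, 3, 32  # assumed lecture shapes

C = torch.randn(vocab_size, n_embd, requires_grad=True)      # embedding table
Xb = torch.randint(0, vocab_size, (batch_size, block_size))  # character indices
emb = C[Xb]                                                  # (32, 3, 10) lookup
demb = torch.randn_like(emb)                                 # stand-in upstream gradient

# Reference: let autograd push demb back through the lookup.
emb.backward(demb)

# Explicit loop: every use of row C[ix] receives its slice of demb.
dC_loop = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        dC_loop[Xb[k, j]] += demb[k, j]

# Vectorized: scatter-add the flattened rows of demb into the rows picked by Xb.
dC_fast = torch.zeros_like(C).index_add_(0, Xb.view(-1), demb.view(-1, n_embd))

print(torch.allclose(dC_loop, C.grad), torch.allclose(dC_fast, C.grad))
```

The key detail is that index_add_ accumulates when the same index shows up more than once, which a plain `dC[Xb.view(-1)] = ...` assignment would not do.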
I will put your poster on my wall to look at you everyday and remember how a great person you are. Your smile is contagious.
Andrej should read this comment 😂😂
This lecture really makes me appreciate autograd. I commend the ancient ML practitioners for surviving this brutality.
Love that he explains matlab as if it is not still used in 80% of labs in the world. Living in a world of tech giants will heal the matlab ptsd
This is a masterclass - I've never seen it explained so thoroughly and clearly, and i've been around. PEAK EXPERTISE
Thanks for the great content! That's the best explanation I've ever seen!
Also, regarding the last backpropagation step in exercise 1, I've found the following method in PyTorch:
dC = torch.zeros_like(C)
dC.index_add_(0, Xb.view(-1), demb.view(-1, demb.shape[2]))
cmp('C', dC, C)
Try:
1. dC = F.one_hot(Xb, num_classes=C.shape[0]).float().view(-1, C.shape[0]).T @ demb.view(-1, C.shape[1])
2. dC = torch.einsum('bij,bik -> jk', F.one_hot(Xb, vocab_size).float(), demb)
Binge worthy! Ran through all lectures back-to-back after discovering. On the edge of my seat for more. Thanks Andrej!
My god, after 2 years of studying AI and reading wrong explanations of derivatives with matrices and vectors, the algorithm finally led me here
As a bioinformatician and a part-time data scientist, I should say this series is the best educational YouTube content on deep neural networks. Thank you for the videos and for offering the opportunity to learn.
A long but very fruitful lecture, thanks a lot for this series Andrej !!
This is exactly what we invented the internet for.
I'm a 3rd-year Ph.D. student and I started my Ph.D. right after my undergrad, and I had very little idea how all the calculations happen in neural networks back then. In the last three years, to learn about neural nets I watched lots of videos, attended lectures, completed summer camps and courses, and also read books, papers, and blogs. But undoubtedly this is the best lecture on backprop! Thank you!
What uni are you studying at?
the best lectures out there on deep learning
The best teacher in AI in the world.
That's incredible!!! It's impossible to pass on such knowledge without a very deep understanding of neural nets. I really appreciate your work. I hope we can get more videos. This is definitely a golden video!!! Thank you so much!
Hey Andrej,
I don't know if you'll see this, but I just wanted to thank you wholeheartedly for your awesome neural network playlist. It's by far the best and the most in-depth content on NNs I've ever come across. I really appreciate you sharing your knowledge with the community. You're the best! Excited and waiting for more such treasures!
no words to explain my feelings. karpathy is just Supercalifragilisticexpialidocious
I believe the loop implementing the final derivative at 1:24:21 can be vectorized if you just rewrite the selection operation as a matrix operation, then do a matmul derivative like done elsewhere in the video:
X_e = F.one_hot(Xb, num_classes = 27).float() # Convert the selection operation into a selection matrix (emb = C[Xb] = X_e @ C)
dC = (X_e.permute(0,2,1) @ demb).sum(0) # Differentiate like any other matrix operation (dC = X_e.T @ demb; indices to track the batch dimensions)
Imo it's cleaner if you do this instead:
Xe = F.one_hot(Xb.flatten(), num_classes=27).float().permute(1, 0)
dC = Xe @ demb.view((-1, demb.shape[2]))
I think this method is more understandable because it uses a 2D matmul...
@@barni_7762 Thanks, it seems to have worked for me.
Very good point on the fact that C[Xb] = X_e @ C. It makes things much clearer.
I came to the same solution, but from the bottom, experimenting with single records, imagining what I want to get.
final solution is:
dC = (torch.nn.functional.one_hot(Xb, num_classes=C.shape[0]).float().swapaxes(-1,-2) @ demb).sum(0)
and one can investigate what is going on for a single batch element:
torch.nn.functional.one_hot(Xb[0], num_classes=C.shape[0]).T.float() @ demb[0]
dC = torch.einsum('abc,abg->cg', F.one_hot(Xb, vocab_size).float(), demb)
@@barni_7762 very clean solution, this is what i did too!
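The one-hot variants in this thread can be checked against each other with a short sketch like the one below. Shapes are assumed from the lecture (27 x 10 table, 32 x 3 batch) and demb is random stand-in data, so this only shows that the formulations agree, not anything about the real gradients:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_embd = 27, 10                   # assumed lecture shapes
C = torch.randn(vocab_size, n_embd)           # embedding table
Xb = torch.randint(0, vocab_size, (32, 3))    # character indices
demb = torch.randn(32, 3, n_embd)             # stand-in for the gradient flowing into emb

# Forward: indexing is the same as multiplying by a one-hot selection matrix.
X_e = F.one_hot(Xb, num_classes=vocab_size).float()           # (32, 3, 27)
same_forward = torch.allclose(C[Xb], X_e @ C)

# Backward: dC = X_e^T @ demb, summed over the batch dimension.
dC_batched = (X_e.transpose(-1, -2) @ demb).sum(0)            # batched matmul, then sum
dC_flat = X_e.view(-1, vocab_size).T @ demb.view(-1, n_embd)  # one 2D matmul
dC_einsum = torch.einsum('bij,bik->jk', X_e, demb)            # contract b and i directly

# Reference: scatter-add of the flattened gradient rows.
dC_ref = torch.zeros_like(C).index_add_(0, Xb.view(-1), demb.view(-1, n_embd))

print(same_forward, dC_batched.shape)
```

All three matmul-style versions build a (32, 3, 27) one-hot tensor, so they use more memory than the scatter-add; for a 27-character vocabulary that hardly matters, but it would for a large one.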
I suspect this is a video I'll be coming back to for years to come.
Thanks!
Thank you very much for an amazing series!
The logit backprop derivation can be simplified a bit by realizing that log(f/g) is log f - log g. The second term is log Sum, the derivative will be 1/Sum times dSum/dxi which immediately yields the activation output. The first term is the log of an exponent, this cancels and the result has a trivial derivative of 0 or -1 when the index isn't/is the correct answer. This neatly shows that the derivative is "softmax output minus correct answer".
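The "softmax output minus correct answer" result is easy to sanity-check numerically. This is a plain-Python sketch with made-up logits and a made-up correct class, comparing the analytic gradient of loss = log(Σ e^lj) - l_y against central finite differences:

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, y):
    # loss = -log softmax(logits)[y] = log(sum_j e^{l_j}) - l_y
    return -math.log(softmax(logits)[y])

logits = [0.5, -1.2, 2.0, 0.1]   # made-up values
y = 2                            # made-up correct class

# Analytic gradient: softmax output minus the one-hot vector of the correct class.
p = softmax(logits)
analytic = [p[i] - (1.0 if i == y else 0.0) for i in range(len(logits))]

# Central finite differences as an independent check.
h = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits); up[i] += h
    dn = list(logits); dn[i] -= h
    numeric.append((nll(up, y) - nll(dn, y)) / (2 * h))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_err)
```

Note that the analytic gradient sums to zero across the logits, since the softmax outputs sum to 1 and exactly one entry has 1 subtracted.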
This is one of the most valuable videos I have come across for building strong intuition about what is going on in the backpropagation. BTW My solution for dC:
dC = torch.einsum('bij,bik -> jk', F.one_hot(Xb, vocab_size).float(), demb). Gotta love einsum :)
Sprinkling Andrej magic throughout the video - had me cracking up at 43:40
It took me days to backprop through this lecture. Phew! Got it now.
Thank you so much :)
It was a bit tough but very interesting task.
P.S.: 1:25:47 dC can be done with dC.index_add_(0, Xb.view(-1), demb.view(-1, 10)) ;)
very cool, nice find, didn't know about index_add_, ty :)
I arrived at a very similar solution, but I didn't know about index_add_. Instead you can do:
Xb_onehot = F.one_hot(Xb.view(-1), num_classes=C.shape[0]).float()
dC = Xb_onehot.T @ demb.view(-1, C.shape[1])
ty for the video :)
can also be done with torch.einsum without the reshaping (but a little more confusion)
I've done it with a basic approach:
dC = torch.zeros_like(C)# ([27, 10])
for i,iemb in zip(Xb.view(-1).tolist(),demb.view(-1, n_embd)): dC[i]+=iemb # zip (([96]), ([96, 10]))
@@ArvidLunnemark Instead of Xb.view(-1), one could also use Xb.flatten(), which is a bit more straightforward to interpret (and I believe is just a wrapper for view() internally anyway).
Thanks a lot, I found this PyTorch way of computing C.grad:
# Step 1: Reshape `dembcat` to match `emb`'s shape
demb = dembcat.clone().view(emb.shape) # [batch_size, num_chars, emb_dim]
# Step 2: Flatten the batch and character dimensions
# This will create a 2D tensor where each row corresponds to a specific (batch, char) pair
demb_flat = demb.view(-1, demb.size(-1)) # [batch_size * num_chars, emb_dim]
indices = Xb.view(-1) # [batch_size * num_chars]
# Step 3: Initialize `dC` as a zero tensor with the same shape as `C`
dC = torch.zeros_like(C) # [vocab_size, emb_dim]
# Step 4: Accumulate gradients using `index_add_`
# This adds each row in `demb_flat` to the corresponding index in `dC`
dC.index_add_(0, indices, demb_flat)
Andrej, words cannot express enough gratitude for sharing these lectures. Your passion for this subject is truly inspiring and your willingness to share your knowledge speaks to your moral character. Although you recorded this lecture 5 months ago, your words continue to light up lightbulb-smiles across the globe and create intellectual connections with people all over the world. Thank you for your dedication and generosity.
1:06:20 your attention to detail here on the variance of arrays is out of this world
Finally completed this one. I have to say this lecture is the most valuable one throughout all my studying of deep learning. As always, thank you Andrej for your generosity. Moving on to the next one!
1:26:00 can be vectorized using: dC = torch.zeros_like(C).index_add_(0, Xb.view(-1), demb.view(-1, C.shape[1]))
This just gets all the indices where to add by flattening Xb, and the values to add by flattening demb while keeping the last dim as 10, so they can be added into dC along the 0th dimension, as its last dimension is also 10.
Edit: Took me 3-4 days to get through your 1.5 hr video and it was hard, annoying, frustrating, and I had to ask GPT-4o for help with the math. But in the end, when the math just clicks and it works, it's just so amazing!
This is great content. Thanks Andrej for your time.
Just grateful to have the chance to learn from Andrej Karpathy. Thanks heaps, it means a lot!
Excellent tutorial to understand the mathematical process behind Neural net operations. Just shows how intuitively comfortable Andrej is with the fundamentals of the subject. Hats off!
I did my best to do exercise 1 on my own but couldn't really get anything on my own until `dh`. I got through the rest with the video paused. The first day I spent a lot of time just repeating the patterns of calculations Andrej was doing, but I didn't really "get" what we were doing, and truly not for lack of trying, it just didn't click. The next morning it not only made perfect sense but seemed super obvious. I love how this course forces me to understand the theory, and how it pushes me to the edge of my limited ability.
Excellent Andrej!! Can't wait for your next lecture. I'm so excited and motivated 🥰
This is an amazing class. Part 4 rocks. I could do backprop manually.
This one kicked my ass! The way of the ninja is not an easy path, but I really enjoyed it, it was amazing as I started to solve it myself as the lecture progressed. Maybe this is the future of education
my favorite prof with new lecture
Thanks for this very interesting series. I found something strange during this lecture: for me, using PyTorch 1.13.1, backpropagating through the tanh function using (1-tanh^2) gives results different from the ones I get using autograd. The difference is small (around 10^-8), but this is surprising since it seems that the backward pass of tanh is done internally in PyTorch using the same formula as the one you proposed.
Same here. Can't get `dhpreact` to be exact.
1-x² is just a second order approximation of the derivative. Wolfram Alpha is correct. Scroll down and check out alternate forms.
@obnoxiaaeristokles3872 Neither of the alternative forms is exact either.
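A small check may help separate the two things being discussed in this thread. With t = tanh(x), the expression 1 - t^2 is the exact derivative (it equals sech(x)^2), whereas 1 - x^2 written in terms of the input x is only the truncated series around 0. The sketch below is plain Python in float64, not the notebook's float32 tensors, and the helper names are made up for illustration.

```python
import math

def dtanh_formula(x):
    # Derivative written in terms of the tanh output: 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

def dtanh_sech(x):
    # Same derivative in its other closed form: sech(x)^2 = 1 / cosh(x)^2
    return 1.0 / math.cosh(x) ** 2

def dtanh_numeric(x, h=1e-5):
    # Central finite difference, used here as an independent reference
    return (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    a, b, c = dtanh_formula(x), dtanh_sech(x), dtanh_numeric(x)
    print(f"x={x:+.1f}  1-tanh^2={a:.12f}  sech^2={b:.12f}  numeric={c:.12f}")
```

All three columns agree to many digits, so a gap of around 10^-8 against autograd looks more like float32 rounding than a wrong formula, though that last part is my guess and not something confirmed in the lecture.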
Andrej, Bro you are blowing my mind. Thank you.
simply the best! very good lessons with such mastery and passion, thanks a lot for sharing
This was very insightful. Andrej you are the best!
What a wonderful effort Andrej. Thanks for this!
I just finished part 2 last night, and I was feeling blue that there was only 1 video left! And this came up in my notifications, I just had to share my excitement :)))
Very useful educational videos, thanks for making and sharing them!
It's interesting that Andrej also considers the shapes when backpropagating through matrix multiply, just how I came to "memorize" it :)
Wow! This lecture is truly incredible and i have certainly learned a ton. Thank you very much, Andrej :)
Thank you Andrej for sharing your experience with us!
John Carmack used exactly this learning method, as he told in his interview with Lex Fridman. In his "larval stage", he implemented the whole NN machinery, including backpropagation, in C (so really low-level :)), to make sure that he understands how stuff works!
Pure gem...💎💎💎 Thanks Andrej for this amazing lecture.
I come to each of these videos to like them. I can't keep up with his pace of release but I will watch all of them in due time. Thanks Andrej.
Thank you for "making everything fully explicit"!
We're lucky to have you in this world
You can love people you don't know. I love you Andrej.
I don't know why (since I made the same calculations, using 1.0, float, etc.), but starting from the calculation of dhpreact I lose the exact match, then the errors accumulate, and from dbndiff on even the approximate check fails (it is still pretty close, so I suppose it is some integer/float thing somewhere). Highly frustrating.
Apparently this is common. I haven’t dug into the details of it. I think it depends on the precise version of some library in use.
I had the same problem in colab. I solved using jupyter notebook with pytorch 1.13.1
@tommasopellegrino8758 FWIW, I'm seeing this error in colab and it's using pytorch 1.13.1+cu116
@KibberShuriq maybe it is cuda, I did not look more into that (I have an m1 so no cuda)
@tommasopellegrino8758 Yeah I've tried several different pytorch versions in colab - made no difference. Couldn't figure out how to change the cuda version so I guess I'll have to cope 😂
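One possible reason (an assumption on my part, nobody in this thread has confirmed it) is that floating-point addition is not associative, so two mathematically identical formulas can round differently depending on the order in which a library or backend performs the operations. A tiny plain-Python illustration of an exact check failing while an approximate one passes:

```python
import math

# Same three numbers, same math, different grouping
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

exact = (a == b)                # bitwise-style comparison
approximate = math.isclose(a, b)  # tolerance-based comparison

print(exact)        # False
print(approximate)  # True
```

If that is what is going on, an "exact: False, approximate: True" result on a gradient is not a bug in your derivation.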
Question: At 1:45:54, you conclude in the last derivation step that d sigma^2 / d x_i = 2 / (m-1) * (x_i - mu). This would be correct if mu were just a constant, but in fact mu is also a function of x_i: mu = (1/m) * sum_j x_j. So how does this cancel out so that you still end up with your simple expression?
I noticed that as well when I tried to solve those derivatives by hand. Also, dx̂_i/dx_i isn't as simple as shown at that timestamp, for the same reason. Assuming we are right, the final derivative looks even worse 😅
I think this may be intentional from adding up every path through the graph separately. That equation pretends that mu is constant, because its contribution is added separately when we do dL/dmu*dmu/dx_i. dL/dmu does include the connection to sigma^2, as the "33rd fan out" (the first 32 being dxhat/dmu). As we then find out this contribution is zero anyway because a fan-out m cancels out the division by m in the mean, and it all adds up to nothing.
Or I might be totally wrong :-)
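For what it's worth, the cancellation can be checked numerically. Differentiating through mu adds the term -2 / (m (m-1)) * sum_j (x_j - mu), and the deviations from the mean always sum to zero, so the simple expression is also the full derivative of sigma^2. The sketch below is plain Python with made-up sample values and helper names, comparing the simple formula against a finite difference that does let mu move.

```python
def bessel_var(xs):
    # Variance with Bessel's correction; mu is recomputed from xs each call
    m = len(xs)
    mu = sum(xs) / m
    return sum((x - mu) ** 2 for x in xs) / (m - 1)

def analytic_grad(xs):
    # The simple expression from the question: 2 / (m - 1) * (x_i - mu)
    m = len(xs)
    mu = sum(xs) / m
    return [2.0 / (m - 1) * (x - mu) for x in xs]

def numeric_grad(xs, h=1e-6):
    # Central differences; perturbing x_i also shifts mu inside bessel_var
    grads = []
    for i in range(len(xs)):
        up = list(xs); up[i] += h
        dn = list(xs); dn[i] -= h
        grads.append((bessel_var(up) - bessel_var(dn)) / (2 * h))
    return grads

xs = [0.3, -1.2, 2.5, 0.9, -0.4]  # arbitrary sample values
mu = sum(xs) / len(xs)
print(sum(x - mu for x in xs))    # the term that cancels, zero up to rounding
for g_a, g_n in zip(analytic_grad(xs), numeric_grad(xs)):
    print(f"{g_a:+.8f}  {g_n:+.8f}")
```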
Watched this 2 years ago, now I can confidently backprop through batchnorm by hand
Excellent series and delivery as usual. Thanks for all the hard work you put into this. Part of it is challenging to get through but a joy to decipher all the moving parts. I think a good understanding of the math behind backprop helps understand this. A good resource that covers this from a math perspective is Andrew Ng's original Neural Net course.
"...assuming that pytorch is correct..." hahahaha not only a great lecture but also with very funny nuggets. Thank you!
Wow. Really enjoyed this lesson. Thank you Andrej! :)
Andrej, thank you for the work you put into this (and previous) lectures❤. Thanks to you, me and a lot of other people can enjoy learning NN 😍from the best.
Thanks a lot Andrej for all these awesome lectures. Please enable auto-generated subtitles for this lecture.
This is the first time I truly understood. Thank you!
Thank you, thank you, thank you ... What you are doing with these videos is amazing !
Potentially an easier way to update dC in exercise 1 by flattening Xb and demb:
dC = torch.zeros_like(C)
for (x,e) in zip(Xb.view(-1), demb.view(-1, 10)):
    dC[x] += e
BTW, thanks for the explanation! It helps me a lot!
love your channel and content Andrej.. please keep more videos coming!
Andrej's lectures are just as clear as tear drops of a baby
This is exactly how I work through my coding problems as well. I also have similar thought process while developing algorithms.
Take as much time as you want and try to go through this video as he has instructed; it is the only way of absorbing the content of this lecture. Working through the derivatives on your own helps a lot. Maybe you'll have to watch the part with exercise 1 twice.
Thanks for the videos! Please make a lot more! Please continue to share your knowledge with the world! Thanks
Thank you Andrej. I really appreciate your work.
Can't wait to come watch this when school holiday starts!
13 days later: here I am!
Andrej is on-firee! Thanks for this awesome material!
A deluge of knowledge from you so often it's ridiculous. I'm absolutely certain you're a robot. Anyhow, Ninjas are awesome. Wax on Sensei!
53:55 if you scroll down, Wolfram Alpha provides 1 - x^2 + 2/3x^4 + O(x^5) as series expansion at x=0 of the derivative of tanh(x), which is the same as the series expansion for 1-tanh(x)^2.
Have mercy Andrej, my brain hurts! :D Feels like I'll need years to digest just these few lectures.
*_this will make happiness overflow out of you_*
Such a great man, he just made all the lectures free, while a mean uni will charge you even for irrelevant content. I hope I can make the world a better place by using AI in the future. For now, I can do it by commenting on this video, so that the algorithm recommends it to more people trying to learn neural networks.
This comment and all the other comments are making the world a better place...
And Andrej Sir, I will pay you back with some cool stuff built by me for this world.
Can someone please explain to me where e^li came from in the term -(e^ly * e^li) / (Σe^lj)^2 at 1:30:11 (just under the separation line for i≠y vs i=y)?
I understand from the above line that we are looking for the derivative of e^ly / Σe^lj. So, when we consider the denominator we would get e^ly * -(Σe^lj)^-2 = -e^ly / (Σe^lj)^2 but the solution multiplies it by e^li which I do not quite get. Cheers!
okay, got it! 😎
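In case anyone else gets stuck at the same spot: the extra e^li is the chain rule on the denominator, since d/dl_i of Σe^lj is e^li, which gives -(e^ly * e^li) / (Σe^lj)^2 = -p_y * p_i for i≠y. Below is a minimal plain-Python sketch that checks both cases against finite differences. The logits and helper names are made up for illustration and are not from the lecture notebook.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dpy_dli_analytic(logits, y, i):
    # i == y: p_y * (1 - p_y);  i != y: -p_y * p_i (the e^li from the chain rule)
    p = softmax(logits)
    return p[y] * (1 - p[y]) if i == y else -p[y] * p[i]

def dpy_dli_numeric(logits, y, i, h=1e-6):
    # Central difference on the y-th softmax output with respect to logit i
    up = list(logits); up[i] += h
    dn = list(logits); dn[i] -= h
    return (softmax(up)[y] - softmax(dn)[y]) / (2 * h)

logits = [1.5, -0.3, 0.8, 2.1]  # arbitrary example logits
y = 2                           # index of the "correct" class
for i in range(len(logits)):
    print(i, dpy_dli_analytic(logits, y, i), dpy_dli_numeric(logits, y, i))
```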
Thanks for top-level video. Can't wait to see more. Thanks 🙏
Teaching taken to a different level.
Btw, the "low-budget" gray block mask at the end is very creative :D