Andrej, as a third year PhD student this video series has given me so much more understanding of the systems I take for granted. You're doing incredible work here!
1:30:10 The 5/3 gain for tanh comes from the average value of tanh^2(x) where x is distributed as a standard Gaussian, i.e.
integrate (tanh x)^2*exp(-x^2/2)/sqrt(2*pi) from -inf to inf ~= 0.39
The square root of this value is the factor by which tanh squeezes the standard deviation of the incoming variable: 0.39 ** .5 ~= 0.63 ~= 3/5, so the exact gain is its reciprocal and 5/3 is just an approximation of it.
We then multiply by the gain to keep the output variance at 1.
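A quick numerical check of that integral (a sketch assuming numpy and scipy are available; averaging tanh(x)**2 over a large torch.randn sample gives the same number):

```python
# E[tanh(x)^2] for x ~ N(0, 1), and the gain it implies.
import numpy as np
from scipy.integrate import quad

val, _ = quad(lambda x: np.tanh(x)**2 * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi),
              -np.inf, np.inf)
print(val)             # ~0.39
print(val ** 0.5)      # ~0.63: factor by which tanh shrinks the std of a unit gaussian
print(1 / val ** 0.5)  # ~1.6, which the 5/3 gain approximates
```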
Thank you : )
I hope they're using the actual value and just writing 5/3 in the docs as slang
Thank you for the insight!
@leopetrini Can you explain how you calculated the integral?
Awesome, thank you!
This has to be the best hands-on coding tutorial for these small yet super-important deep learning fundamentals online. Absolutely great job!
Turns out this should be the way to teach machine learning: a combination of theory reference and actual coding. Thank you Andrej!
Exactly. My DL course back in 2015 had a ton of obscure math and no coding. I had no idea how to train NNs after that course. I'm rediscovering and learning a ton of stuff from this video alone, way more than from my course.
Andrej you have a wonderful gift for educating others. I’m a self learner of NNs and it’s a painful process but you seriously help ease that suffering… much appreciated! Ty
Great, I am also going through this same painful process. Can you suggest something that can help ease this pain?
If you want to learn the theory, try Soheil Feizi. He is a professor at UMD. Amazing teacher. And the course content is just top notch.
Every time another Andrej Karpathy video drops, it's like Christmas for me. This video series has helped me to develop genuine intuition about how neural networks work. I hope you continue to put these out; it's making a massive impact on making these "black box" technologies accessible to anyone and everyone!
I like that not even the smallest detail is pulled out of thin air, everything is completely explained
The quality of these lectures is off the charts. This channel is a gold mine! Andrej, thank you, thank you very much for these lectures.
You have so much depth in your knowledge,
yet you manage to explain complex concepts with such incredible didactic skill.
This is someone who truly understands his field. Andrej, thank you so much, and even more for the humility with which you do it.
You explain how libraries and languages like Python and PyTorch work and dive into the WHYs of what is happening.
This is absolutely priceless.
I cannot fathom that this video only has 4k likes... He is literally explaining things that no one else goes through, because they simply don't know them, but they are crucial!
thanks for reminding me to give a like ;)
This video series is exceptional. The clarity and practicality of the tutorials are unmatched by any other course. Thank you for the invaluable help for all practitioners of deep learning!
This series is definitely the clearest presentation of ML concepts I have seen, but the teaching generalizes so well. I'll be using this step-by-step intuition-building approach for all complicated things from now on. It's nice that the approach also gives me confidence that I can understand stuff with enough time. Truly appreciate your doing this.
So many small things to scrutinize, and how easily he points them out one by one, step by step, from problem to solution, is just amazing. Love your work Andrej. You are amazing.
Thank you Mr. Karpathy. I am in love with your teaching. Given how accomplished and experienced you are, you are still teaching with such grace. I dream about sitting in one of your classes and learning from you, and I hope to meet you one day. May you be blessed with good health. Lots of love and respect.
My mind is totally blown at the detail I am getting. It feels like an Ivy League-level course, with the content so meticulously covered.
1:33:30 The reason the gradients of the higher layers have a bigger deviation (in the absence of the tanh layers) is that you can write the whole NN as a sum of products, and it is easy to see that each weight of layer 0 appears in 1 term, of layer 1 in 30 terms, of layer 2 in 3000 terms, and so on. Therefore a small change of a weight in higher layers changes the output more.
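A minimal sketch for inspecting those per-layer gradient statistics yourself (made-up dimensions and loss, not the exact network from the video): stack linear layers with no tanh in between, run one forward/backward pass, and print each layer's weight gradient std.

```python
import torch

torch.manual_seed(0)
dims = [30, 100, 100, 100, 100, 27]
layers = [torch.nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(len(dims) - 1)]

x = torch.randn(32, dims[0])
out = x
for layer in layers:
    out = layer(out)                 # purely linear: the tanh layers are removed
loss = out.pow(2).mean()             # arbitrary scalar loss, just to produce gradients
loss.backward()

for i, layer in enumerate(layers):
    print(f"layer {i}: weight grad std {layer.weight.grad.std().item():.3e}")
```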
This whole series is absolutely amazing. Thank you very much Andrej! Being able to code along with you, improving a system as my own knowledge improves is fantastic
Thank you, Andrej, amazing content! As a beginner in deep learning and even in programming, I find most materials out there are either pure theory or pure API applications, and they rarely go this deep and detailed. Your videos cover not just the knowledge of this field, but also so many empirical insights that came from working on actual projects. Just fantastic! Please make more of these lessons!
You're a great teacher Andrej. This has been by far the most interesting ML course/training I have come across. Keep up the good work!
I think your videos are the only ones I've come across that actually explain why you have a validation split, for the developers/data scientists to check and optimise the parameters/distribution. The ability to stop and replay is invaluable for me. Thank-you so much for these fantastic videos.
The batch normalization explanation was amazing! Thank you for your hard work and concise and clear explanations.
Minute by minute, this course is giving us master-level knowledge. We're being molded into experts without even attending a world-class university! 🚀
I remember you, you are badmephisto, I will always remember you. You taught me how to solve the Rubik's cube :) I will never forget the teenager in his room making Rubik's cube videos, and look how far you have come. God bless.
I really enjoy learning and listening to people like Andrej who love what they do and aren't doing it just for money. Shared the channel with my friends ☺️
Keep up the great work Andrej!
Seeing this built out with the code side by side is the most helpful thing you have done here. I think this is the biggest difference in this pedagogical style. When you see the code, it is real and clear what is happening; it is not handwaving or imagination. So this is the most valuable piece. And your gentle personality is just the wonderful cherry on top of the whole thing. Many kudos to you. I hope you can continue putting out more and keep growing your contribution to ML research going forward. Incredible.
This video makes you feel grateful for the internet, where you can learn from masters in this much depth for free. Thank you Andrej, this whole series greatly helped my understanding of NNs!
A lot of material packed into this video; I expect understanding the mechanics of these networks will take many years, even with some prior experience.
You are such a gold mine of knowledge, it's insane. I wish you were my DL professor during my PhD.
Thank you so much for this, Andrej! Your series single-handedly revitalized my love for deep learning! Please keep this series going :)
It's done the same for me. I'm excitedly going through each video. It feels good to be back!
This is filling in a lot of gaps for me, thank you! I especially appreciate your insights about reading a network's behavior during training; they gave me a few epiphanies.
To put BatchNorm into perspective: I am going through Geoffrey Hinton's 2012 lecture notes on the bag of tricks for mini-batch descent, from when AlexNet was first published. Hinton was saying there was no one best way for learning method/gradient descent with mini-batches. Well, here it is: BatchNorm. Hinton: "Just think of how much better neural nets will work, once we've got this sorted out". We are living in that future :)
What did he mean by "Hinton was saying there was no one best way for learning method/gradient descent with mini-batches"?
Did he mean initializing them?
Great material on the intricacies of how neural networks work. Until now, I hadn't paid attention to the distribution of values entering the activation layers, and as it turns out, this is an extremely important issue. Thanks!
What an inspiration. Like others have alluded in the comments, I find Andrej's teaching so remarkably therapeutic.
I completed this one today, and I just want to show Andrej my gratitude. Looking forward to the next one. Thank you very much, Andrej. Thank you!
I have shot myself in the foot multiple times before these videos. Training big models is much more difficult than I initially anticipated. Time wasted, sadly. But I have more confidence in myself thanks to these videos. Thanks Andrej
Thank you @Andrej for bringing this series. You are a great teacher, the way you have simplified such seemingly complex topics is valuable to all the students like me. 🙏
This video totally opened my mind about this subject I've been obsessing over for over a year. Truly an amazing video; thanking you feels like the bare minimum.
Thank you Andrej! Your sharing definitely makes a lot of humans happy!
This is my 3rd year in AI. Very grateful I came across these videos.
I should have watched these long ago, the second best time to watch is now.
I don't think any other book or blog or videos cover what Andrej has covered. Awesome insights. THANK YOU Andrej.
You explain very simple things that I already know and always give me a new perspective on them! Your way of transmitting knowledge is incredible.
Exactly!
Your choice of visualizations as diagnostic tool is super insightful. Thanks so much for sharing your experience.
These lectures are an awesome gift to us mortals. Such a clear explanation on the principles of neural networks.
I only need to be able to afford access to massive TPU cloud compute and huge corpus, but at least I can now gain insight and understand the principles of these technologies.
You're an incredible teacher. You really have a gift. Thanks for these lectures!!!!
Thank you.. I am a self learner and your series has been a milestone for me.
Thank you for the deep dive into batch normalization and diagnostic approaches! Really useful to see it explained from the paper with the code.
*Abstract Building makemore Part 3*
This lecture highlights the importance of understanding activation and
gradient statistics for successful deep learning. While techniques
like batch normalization provide significant stability, understanding
these concepts remains crucial for building and analyzing deep neural
networks. Modern innovations like batch normalization and advanced
optimizers make training deep networks more manageable, but proper
initialization and diagnostic tools are still essential for achieving
optimal performance.
*Summary*
*Initialization and Activations:*
- Initial loss (4:25): High initial loss (e.g., 27) indicates improper network initialization.
- Softmax logits should be close to zero at initialization to produce a uniform probability distribution and expected loss.
- This avoids confident mispredictions and the "hockey stick" loss curve.
- Scaling down weights of the output layer can achieve this (9:28).
- Saturated activations (13:09): Tanh activations clustered around -1 and 1 indicate saturation, hindering gradient flow.
- Saturated neurons update less frequently and impede training.
- This can lead to dead neurons, which never activate and don't learn (19:19).
- Scaling down weights of the hidden layer can help prevent saturation (24:59).
- Kaiming initialization (27:58): A principled approach to weight scaling, aiming for unit gaussian activations throughout the network.
- Calculates standard deviation based on fan-in and gain factor specific to the non-linearity used (31:46).
- PyTorch offers torch.nn.init.kaiming_normal_ for this (33:56).
*Batch Normalization (**40:49**):*
- Concept: Normalizes activations within each batch to be roughly unit gaussian.
- Controls activation scale, stabilizing training and mitigating the need for precise weight initialization.
- Offers a regularization effect due to coupling examples within a batch (51:55).
- Implementation (42:17; a minimal code sketch follows after this summary):
- Normalizes activations by subtracting batch mean and dividing by batch standard deviation (42:41).
- Learnable gain and bias parameters allow the network to adjust the normalized distribution (45:54).
- Running mean and variance are tracked during training and used for inference (54:38).
- Caveats:
- Couples examples within a batch, leading to potential bugs and inconsistencies (50:20).
- Requires careful handling at inference time due to batch dependency (54:03).
- Makes bias terms in preceding layers redundant (1:01:37).
*PyTorch-ifying the code (**1:18:40**):*
- Code is restructured using torch.nn.Module subclasses for linear, batch normalization, and tanh layers (1:19:26).
- This modular approach aligns with PyTorch's structure and allows easy stacking of layers.
- Default PyTorch initialization schemes and parameters are discussed (1:08:52).
*Diagnostic Tools (**1:19:13**):*
- Visualization of statistics: Histograms of activations, gradients, weights, and update:data ratios reveal potential issues during training (1:26:53).
- Forward pass activations: Should exhibit a stable distribution across layers, indicating proper scaling (1:26:53).
- Backward pass gradients: Should be similar across layers, signifying balanced gradient flow (1:30:57).
- Parameter weights: Distribution and scale should be monitored for anomalies and asymmetries (1:36:20).
- Update:data ratio: Should be around -3 on a log scale, indicating a good learning rate and balanced parameter updates (1:39:56).
I used Gemini 1.5 Pro to summarize the transcript.
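For reference, a minimal sketch of the batch-norm forward pass described in the summary above (hypothetical class and variable names, not PyTorch's own API; see the lecture code for the real implementation):

```python
import torch

class BatchNorm1dSketch:
    def __init__(self, dim, eps=1e-5, momentum=0.001):
        self.eps, self.momentum, self.training = eps, momentum, True
        self.gamma = torch.ones(dim, requires_grad=True)   # learnable gain
        self.beta = torch.zeros(dim, requires_grad=True)   # learnable bias
        self.running_mean = torch.zeros(dim)  # tracked during training, used at inference
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(0), x.var(0)   # statistics of the current batch
        else:
            mean, var = self.running_mean, self.running_var
        xhat = (x - mean) / torch.sqrt(var + self.eps)   # roughly unit gaussian
        if self.training:
            with torch.no_grad():             # exponential moving average of the stats
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return self.gamma * xhat + self.beta  # scale and shift
```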
These are some of the best lectures I've ever seen. I love the explanation in the first part about tanh saturation. It really tries to get the viewer to develop intuition.
This is the most amazing video on neural network mathematics knowledge I've ever seen; thank you very much, Andrej!
Amazing explanation about the nitty gritty details of Deep Learning, the "dark arts" of the trade.
As a former HTML nerd, I am forever in awe of the amount of precise calculation and its limits, as expressed in this video…
Can't even explain how impactful this video is for my understanding of NNs... Thank you so much!
I keep coming back to these videos again and again. Andrej is legend!
The amount of useful information in this video is impressive. Thanks for such good content.
1:18:36
The "Okay, so I lied" moment was too relatable xD
You are an Angel sir
The land of AI is blessed and the harvest is plenty. New AI warriors will rise from this
Thanks
He says 'Bye', but looking at the time, it seems too early [01:18:30]. Most people don't want lectures to be long, but I'm happy this one didn't end there.
This is a great lecture, especially the second half building intuition about diagnostics. Amazing stuff.
This is awsome!!
Thank you so much for taking the time to do this Andrej.
Please keep this going, I am learning so much from you.
My life is much better now because of your videos.
love the short snippets about how to implement these tools in production.
What a gem. Thanks for the lectures, Andrej
Thanks very much for this, please keep them coming, you are changing lives.
Thank you Andrej, i have finally found time to go through your lectures. I have learn and understood a lot more than before, thank you.
If you decide to make more content, a video series like this with a focus on self-driving or RL for robotics would be awesome. Not that you don't have enough going on, but that's my wish-list item :) Thanks for putting an incredibly in-depth resource out here free on the internet.
Here are some calculations to appreciate how good the prediction already is for Andrej's model.
Perplexity = e^(CE). A random guess gives a perplexity of e^(3.3) ~= 27; you can think of it as choosing one out of 27 options. After training, the CE is 2.0, so perplexity = e^(2.0) ~= 7: you only need to select from about 7 options. Previously the bigram model had a CE of 2.5, which is e^(2.5) ~= 12 options to choose from, so this accuracy is really, really good.
Then there are the inherent uncertainties in language, as Andrej mentioned: the first character can be almost anything, so 7 on average is pretty good already.
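For reference, the same arithmetic in a few lines of Python (the cross-entropy values are the ones quoted above):

```python
import math

for name, ce in [("uniform over 27 chars", 3.3), ("bigram", 2.5), ("this MLP", 2.0)]:
    print(f"{name}: CE = {ce}, perplexity = e^CE ~= {math.exp(ce):.1f}")
# uniform over 27 chars: ~27 options, bigram: ~12, this MLP: ~7
```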
It's so nice to have you back on RUclips! Thanks for teaching me Rubik's Cube back in the day and thanks for teaching us deep learning now!
Really enjoy your classes. I learnt a lot of tips for training and feel comfortable now. Will continue finishing this series.
Amazing, knowledge that is hell hard to find in other videos and also, you have AMAZING skill in clearly explaining complex stuff.
Andrej, thank you so much for your tutorials here. You've no idea how much your videos helped me. Please keep doing more videos.
Thank you for delving deep into the nitty-gritty details. I appreciate the exercises!
These videos are so useful, Andrej thank you so much. The parts when you wrap up the lecture, and then change your mind to add more content are my fav. 😄
Hi, thank you for deciding to make videos for the wider public!
Andrej, thank you ever so much. You are an inspiration, and thanks to you I have a better understanding of the concepts.
Your lecture is so amazing. Please keep updating, thanks for sharing and educating.
🎯Course outline for quick navigation:
[00:00-03:21]1. Implementing and refactoring neural networks for language modeling
-[00:00-00:30]Continuing makemore implementation with multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
-[00:31-01:03]Understanding neural net activations and gradients in training is crucial for optimizing architectures.
-[02:06-02:46]Refactored code to optimize neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
-[03:03-03:28]Using the torch.no_grad decorator to prevent gradient computation.
[03:22-14:22]2. Efficiency of torch.no_grad and neural net initialization issues
-[03:22-04:00]Using torch's no_grad makes computation more efficient by eliminating gradient tracking.
-[04:22-04:50]Network initialization causes high loss of 27, rapidly decreases to 1 or 2.
-[05:00-05:32]At initialization, the model aims for a uniform distribution among 27 characters, with roughly 1/27 probability for each.
-[05:49-06:19]Neural net creates skewed probability distributions leading to high loss.
-[12:08-12:36]Loss at initialization as expected, improved to 2.12-2.16
[14:24-36:39]3. Neural network initialization
-[16:03-16:31]The chain rule with local gradient is affected when outputs of tanh are close to -1 or 1, leading to a halt in back propagation.
-[18:09-18:38]Concern over destructive gradients in flat regions of h outputs, tackled by analyzing absolute values.
-[26:03-26:31]Optimization led to improved validation loss from 2.17 to 2.10 by fixing softmax and tanh layer issues.
-[29:28-30:02]Standard deviation expanded to three, aiming for unit gaussian distribution in neural nets.
-[30:17-30:47]Scaling down by 0.2 shrinks gaussian with standard deviation 0.6.
-[31:03-31:46]Initializing neural network weights for well-behaved activations, Kaiming He et al.
-[36:24-36:55]Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
[36:39-51:52]4. Neural net initialization and batch normalization
-[36:39-37:05]Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
-[40:32-43:04]Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.
-[40:51-41:13]Batch normalization from 2015 enabled reliable training of deep neural nets.
-[41:39-42:09]Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.
-[43:20-43:50]Calculating standard deviation of activations, mean is average value of neuron's activation.
-[45:45-46:16]Back propagation guides distribution movement, adding scale and shift for final output
[51:52-01:01:35]5. Jittering and batch normalization in neural network training
-[52:10-52:37]Jittering input examples adds entropy, augments data, and regularizes neural nets.
-[53:44-54:09]Batch normalization effectively controls activations and their distributions.
-[56:05-56:33]Batch normalization paper introduces running mean and standard deviation estimation during training.
-[01:00:46-01:01:10]Eliminated explicit calibration stage, almost done with batch normalization, epsilon prevents division by zero.
[01:01:36-01:09:21]6. Batch normalization and ResNet in PyTorch
-[01:02:00-01:02:30]Biases are subtracted out in batch normalization, reducing their impact to zero.
-[01:03:13-01:03:53]Using batch normalization to control activations in neural net, with gain, bias, mean, and standard deviation parameters.
-[01:07:25-01:07:53]Creating deep neural networks with weight layers, normalization, and non-linearity, as exemplified in the provided code.
[01:09:21-01:23:37]7. PyTorch weight initialization and batch normalization
-[01:10:05-01:10:32]PyTorch initializes weights from a uniform distribution scaled by 1/sqrt(fan-in).
-[01:11:11-01:11:40]Scaling weights by 1 over sqrt of fan-in, using the batch normalization layer in PyTorch with 200 features.
-[01:14:02-01:14:35]Importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
-[01:16:00-01:16:30]Batch normalization centers data for gaussian activations in deep neural networks.
-[01:17:32-01:18:02]Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
[01:23:39-01:55:56]8. Custom pytorch layer and network analysis
-[01:24:01-01:24:32]Updating buffers using an exponential moving average with the torch.no_grad context manager.
-[01:25:47-01:27:11]The model has 46,000 parameters and uses pytorch for forward and backward passes, with visualizations of forward pass activations.
-[01:28:04-01:28:30]Saturation stabilizes at 20% initially, then stabilizes at 5% with a standard deviation of 0.65 due to gain set at 5 over 3.
-[01:33:19-01:33:50]Setting gain correctly at 1 prevents shrinking and diffusion in batch normalization.
-[01:38:41-01:39:11]The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
-[01:43:18-01:43:42]Monitoring the update:data ratio for parameters to ensure efficient training, aiming for -3 on a log plot (a small helper sketch follows after this outline).
-[01:51:36-01:52:04]Introduced batch normalization and PyTorch modules for neural networks.
-[01:52:39-01:53:06]Introduction to diagnostic tools for neural network analysis.
-[01:54:45-01:55:50]Introduction to diagnostic tools in neural networks, active research in initialization and backpropagation, ongoing progress
offered by Coursnap
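A small helper along the lines of the update:data ratio diagnostic mentioned in the outline above (a sketch with assumed names; the lecture computes this every training step and plots the history per parameter):

```python
import torch

def update_to_data_ratios(parameters, lr):
    # For each parameter: log10(std of the SGD update / std of the parameter values).
    # Values hovering around -3 on this log scale suggest a reasonable learning rate.
    with torch.no_grad():
        return [((lr * p.grad).std() / p.data.std()).log10().item()
                for p in parameters if p.grad is not None]
```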
Dives straight in and kills the presentation…. Another banger…. Can make old papers fun to go through….
Thanks for the fantastic download! You have changed my learning_rate in this area from 0.1 to something >1!
Yeah! New video. I 🥰😍 love it. I have decided: after finishing all of your videos, I am planning to use your model as a starting point to solve the true AI problem, just like the Wright brothers.
I want to try my own way. I still think the idea of building a brain-simulation neural network is wrong, and I think I have my own way to solve the problem.
That is the way to go, @Jonathan Sum. Pick a really hard problem, stay with it yourself, trying to solve it in many, many ways, for years if needed. Whatever it takes. You can refer to others' work as needed, but ONLY AFTER you have tried each subproblem on your own. I learnt this method of solving problems for oneself from Newton's biography. That is how you create new knowledge.
Have you heard of Openworm? Or the FlyEM project? They might be worth a look.
Oh man, this is top-notch content! Not sure there is other available content on these topics with so much clarity about the inner gears, with reproducible examples. Thank you so much! You're a DL hero.
I finally understand when statistics comes into play in machine learning: it's when you introduce the randomized weights (matrices)!
Thank you so much for your clear and thoughtful explanation Andrej!
Thank you Andrej, this is truly on another level :)
Finally, all those small techniques make sense to me. Thank you so so much!
Better than studying at a university. Keep going, A.K. I'll share your video for sure.
These videos have been incredible. Thank you so much for taking the time to make them, and I look forward to all the future ones!!!
lol, I don't really understand what's going on here, but I'm just liking and commenting to support Andrej! Keep it up Andrej!
Nice to be lectured again after watching the Stanford CS231 multiple times!
Thank you for explaining everything in such detail. It makes everything much more understandable
Awesome explanation Andrej! Thank you very much for sharing your knowledge.
Thank you so much Andrej, you are a real inspiration for me and I really appreciate you
You have to watch these videos twice. The first time, just watch. The next time, try to write the code Andrej is writing from memory or from your notes. Don't move on until you are stuck, and only then play those parts of the video as the solution.
I was honestly looking for this comment. I was starting to think everyone in the comment section is getting it on the first attempt 😂😂.
Thank you so much for the time and effort put into the videos of this series. Appreciate it very much.
You will be remembered❤
I mean, he didn't die
Very interesting how you've described the concept of pre-tuning NN.
following your lectures is a delight! Thanks for taking the time to make them :)
At 1:04:35 would it help at the end of the training to optimize with bnmean_running and bnstd_running to normalize the preactivations hpreact? Maybe at that point regularization isn't necessary anymore and the rest of the weights can be optimized for the particular batch norm calibration that will be used during inference.
Great question, I’ve done this a number of times myself too because I am very uncomfortable with the train test mismatch. I never noticed dramatic and consistent improvement with it though ;(
@@AndrejKarpathy bnmean_running = (0.999 * bnmean_running) + (0.001 * bnmeani): why are you multiplying bnmean_running by 0.999 and bnmeani by 0.001? Why doesn't bnmean_running = bnmean_running + bnmeani work?
@@narenbabu629 That would just indefinitely increase bnmean_running. The idea is to change bnmean_running only slightly each step. For example, 0.999 * 3 + 0.001 * 5 = 3.002, so the previous value is almost unchanged, just increased a bit.
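A tiny illustration of that exponential moving average update (made-up numbers):

```python
bnmean_running = 3.0
bnmeani = 5.0      # mean of the current batch
momentum = 0.001
bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani
print(bnmean_running)  # 3.002: nudged only slightly toward the batch mean
# Plain accumulation (bnmean_running += bnmeani) would grow without bound
# instead of settling near the average batch statistics.
```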
Wow, I didn't know such content existed, and from Karpathy himself. Thank you!
Thanks. This is a very helpful and intuitive lecture.