This series is the most interesting resource for DL I've come across, being a junior ML engineer myself. To be able to watch such a knowledgeable domain expert as Andrej explain everything in the most understandable way is a real privilege. A million thanks for your time and effort; looking forward to the next one and hopefully many more.
I love how we are all so stressed and worried that Andrej might grow apathetic to his YouTube channel, so everyone wants to be extra supportive 😆 Really shows how awesome of a communicator he is.
I was literally thinking about that when I saw this comment.
Unfortunately he did it :(
@jordankuzmanovik5297 Hopefully he comes back.
haha, so true!
I really, really appreciate you putting in the work to create these lectures. I hope you can really feel the weight of the nearly hundred thousand humans who pushed through 12 hours of lectures on this because you've made it accessible. And that's just so far. These videos are such an incredible gift. Half of the views are me, because I needed to watch each one so many times in order to understand what's happening, because I started from so little. Also, it's super weird how different you are from other YouTubers and yet how likable you become as a human during this series. You are doing this right, and I appreciate it.
As an independent deep learning undergrad student, your videos help me a lot. Thank you Andrej, never stop this series.
We're on the same road!
Love the series as well! Coding through all of it. Would love to get together with people to replicate deep learning papers, like Andrej does here, to learn faster and not by myself.
@tanguyrenaudie1261 I'm in the same boat as well. Do you have a Discord or something where we can talk further?
@Anri Lombard @ Nervous Hero
@raghavravishankar6262 Andrej does have a server; we could meet there and then start our own. My handle is vady. (with a dot) if anyone wants to add me, or ping me in Andrej's server.
A notification for a new Andrej video feels like a new season of Game of Thrones just dropped at this point.
I experimented a bit with the MLP with 1 hidden layer and managed to scale it up to match your fancy hierarchical model. :)
Here is what I got:
MLP(105k parameters):
block_size = 10
emb_dim = 18
n_hidden = 500
lr = 0.1 # used the same learning rate decay as in the video
epochs = 200000
mini_batch = 32
lambd = 1 # added L2 regularization
seed = 42
Training error: 1.7801
Dev error: 1.9884
Test error: 1.9863 (I checked this only because I was worried that somehow I had overfitted the dev set)
Some examples generated from the model that I kinda liked:
Angelise
Fantumrise
Bowin
Xian
Jaydan
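Not the original poster, but for anyone wanting to reproduce this, here's a minimal sketch of what that setup might look like in PyTorch (the init scales and the exact form of the L2 penalty are my guesses, not the commenter's actual code):

import torch
import torch.nn.functional as F

# hyperparameters from the comment above
block_size, emb_dim, n_hidden, vocab_size = 10, 18, 500, 27
lambd = 1.0  # L2 strength

g = torch.Generator().manual_seed(42)
C  = torch.randn((vocab_size, emb_dim), generator=g)
W1 = torch.randn((block_size * emb_dim, n_hidden), generator=g) * (5/3) / (block_size * emb_dim)**0.5
b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.zeros(vocab_size)
parameters = [C, W1, b1, W2, b2]
print(sum(p.nelement() for p in parameters))    # 104513, i.e. the ~105k quoted above
for p in parameters:
    p.requires_grad = True

def loss_fn(Xb, Yb):                             # Xb: (batch, block_size) indices, Yb: (batch,) targets
    emb = C[Xb].view(Xb.shape[0], -1)            # (batch, block_size*emb_dim)
    h = torch.tanh(emb @ W1 + b1)                # (batch, n_hidden)
    logits = h @ W2 + b2                         # (batch, vocab_size)
    l2 = sum((p**2).mean() for p in (W1, W2))    # one plausible way to add the L2 term
    return F.cross_entropy(logits, Yb) + lambd * l2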
What's the formula to calculate the number of parameters of an MLP model?
@oklm2109 You just add up the trainable parameters of every layer.
If the model contains only fully connected layers (aka Linear in PyTorch or Dense in TF), the number of parameters of each layer is:
n_weights = n_in * n_hidden_units
n_biases = n_hidden_units
n_params = n_weights + n_biases = (1 + n_in) * n_hidden_units
n_in: number of inputs (think of it as the number of outputs, or hidden units, of the previous layer).
This formula is valid for Linear layers; other types of layers may have different formulas.
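A quick sanity check of that formula in code, using the layer sizes from the parent comment (the embedding table is counted as vocab_size * emb_dim; these sizes are just the example above, nothing canonical):

def linear_param_count(n_in, n_out, bias=True):
    # (1 + n_in) * n_out when bias=True, matching the formula above
    return n_in * n_out + (n_out if bias else 0)

emb = 27 * 18                               # embedding table: vocab_size * emb_dim
hidden = linear_param_count(10 * 18, 500)   # block_size*emb_dim -> n_hidden
out = linear_param_count(500, 27)           # n_hidden -> vocab_size
print(emb + hidden + out)                   # 104513, i.e. the ~105k above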
I'd say it's slightly unfair to compare models with different block sizes, because the block size influences not only the number of parameters but also the amount of information given as input.
@Zaphod42Breeblebrox out of curiosity, what do your losses and examples look like without the L2 regularization?
Also, love the username :P
^ Update: I just tried this architecture locally and got the following without L2 regularization:
train 1.7468901872634888
val 1.9970593452453613
How were you able to tell that there was overfitting to the training set?
Some examples:
arkan.
calani.
ellizee.
coralym.
atrajnie.
ndity.
dina.
jenelle.
lennec.
laleah.
thali.
nell.
drequon.
grayson.
kayton.
sypa.
caila.
jaycee.
kendique.
javion.
Please don't stop making these videos, they are gold!
Andrej, thanks a lot for the video! Please do not stop continuing the series. It's an honor to learn from you.
Thank you so much for creating this video lecture series. Your passion for this topic comes through so vividly in your lectures. I learned so much from every lecture and especially appreciated how the lectures started from the foundational concepts and built up to the state-of-the-art techniques. Thank you!
Best resource by far for this content. Please keep making more of these; I feel I'm learning a huge amount from each video.
Thank you so much for these videos. I really enjoy these deep dives, things make so much more sense when you're hand coding all the functions and running through examples. It's less of a black box and more intuitive. I hope this comment will encourage you to keep this going!
Absolutely love this series Andrej sir... It not only teaches me stuff but gives me confidence to work even harder to share whatever I know already.. 🧡
Can't wait for part 6! So clear and I can follow step by step. Thanks so much
I rarely comment on videos, but thank you so much for this series, Andrej. I love learning, and I learn many things that interest me. The reason I say that is that I have experienced a lot of tutors/educators over time. And for what it's worth, I'd like you to know you're truly gifted when it comes to your understanding of AI development and communicating that understanding.
Your work ethic and happy personality really move me. Respect to you, Andrej, you are great.🖖
All the makemore lessons have been awesome Andrej! Huge thanks for helping me understand better how this world works.
I love this series so much :) it has profoundly deepened my understanding of neural networks and especially backpropagation. Thank you
Hey Andrej! I hope you continue and give us the RNN, GRU & Transformer lectures as well! The ChatGPT one is great, but I feel like we missed the story in the middle and jumped ahead because of ChatGPT.
The ChatGPT lecture is the Transformer lecture... And regarding RNNs, I don't see why anyone would still use them...
Transformers, yes. But it's not like anyone will build bigrams either; it's about learning concepts like BPTT from the roots.
@kshitijbanerjee6927 Bigrams and MLPs help you understand Transformers (which are the SOTA)... Anyway, IMO it would be a waste of time to create a lecture on RNNs, but if the majority want it, then maybe he should do it... I don't care.
Fully disagree that it's not useful.
I think the concepts of how they came up with unrolling and BPTT, and the gates used to solve long-term memory problems, are invaluable for appreciating and understanding why transformers are such a big deal.
@@SupeHero00 RNNs are coming back due to SSMs like Mamba.
Learned a lot of practical tips and theoretical knowledge of why we do what we do and also the history of how Deep Learning evolved. Thanks a lot for this series. Requesting you to continue the series.
Thank you Andrej for creating this series. It has been very helpful. I just hope you get the time to continue with it.
This is the best deep learning course I've followed! Even better than the one on Coursera. Thanks!
Thank you so much Andrej for the series, it helps me a lot. You are one of the reasons I was able to get into ML and build a career there. I admire your teaching skills!
I didn't get why the sequence dim has to be part of the batch dimension, and I didn't hear Andrej talk about it explicitly, so here is my reasoning:
The sequence dimension is an additional batch dimension because the output before batch norm is created by a linear layer with (32, 4, 20) @ (20, 68) + (68), which performs the matrix multiplication only with the last dimension (.., .., 20) and in parallel over the first two. So, the matrix multiplication is performed 32 * 4 times with (20) @ (20, 68). Thus, it's the same as having a (128, 20) @ (20, 68) calculation, where 32 * 4 = 128 is the batch dimension. So the sequence dimension is effectively treated as if it were a "batch dimension" in the linear layer, and it must be treated that way in batch norm too.
(would be great if someone could confirm)
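Not Andrej, but a quick numerical check seems to confirm this reasoning (shapes taken from the comment: batch 32, 4 groups per example, 20 = 2*10 fused embedding dims, 68 hidden units):

import torch

x = torch.randn(32, 4, 20)
W, b = torch.randn(20, 68), torch.randn(68)

out = x @ W + b                                       # the linear layer broadcasts over the first two dims
flat = (x.view(32 * 4, 20) @ W + b).view(32, 4, 68)
print(torch.allclose(out, flat))                      # True: (32, 4) really acts as one batch of 128

# so batch norm statistics should be taken over dims (0, 1), i.e. over all 128 rows per channel:
mean = out.mean(dim=(0, 1), keepdim=True)             # shape (1, 1, 68)
var = out.var(dim=(0, 1), keepdim=True)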
So far THE BEST lecture series I have come across on YouTube. Alongside learning neural networks in this series, I have learned more PyTorch than from watching a 26-hour PyTorch video series by another YouTuber.
Just wanna say thank you for sharing your experience -- love this from-scratch series starting from first principles!
When I did the mean() trick at ~8:50 I let out an audible gasp! That was such a neat trick, going to use that one in the future.
Yes! I've been telling everyone about these videos. I've been checking every day whether you posted the next video. Thank you.
I was worried I was going to have to wait a couple of months for the next video as I finished part 4 just last week. Can't wait to get into this one, thanks a lot for this series Andrej
These videos are amazing, please never stop making this type of content!
This is truly the best dl content out there. Most courses just focus on the theory but lack deep understanding.
Enjoying these videos so much. Using them to refresh most of what I've forgotten about Python and to begin playing with PyTorch. The last time I did this stuff myself was with C# and CNTK. Now going back to rebuild and rerun old models and data (much faster even, and "better" results). Thank you.
Every time a new video is out it's like Christmas for me! Please don't stop doing this, best ML content out there.
Please continue, I really like this series. You are an awesome teacher!
My best way to learn is to learn from one of the most experienced people in the field. Thanks for everything Andrej.
Very grateful for these. An early endearing moment was in the Spelled-Out Intro when you took a moment to find the missing parentheses for 'print.'
Totally amazed by the amount of good work you put in. You've helped a lot of people Andrej. Keep up the good work
My favorite way to start a Monday morning is to wake up to a new lecture in Andrej's masterclass :)
Finally Completed this one. As always thank you Andrej for your generosity! Next I will practice through all five parts again and learn how to accelerate the training process by using GPUs.
Andrej you are the absolute greatest. Keep making your videos. Anxiously waiting to implement Transformers with you
Was looking forward to this one. Thanks, Andrej!
Please continue this series, Sir Andrej. You are the savior!
Thanks so much Andrej! Hope to see a Part 6
Thanks so much for this series, I feel like this is the most important skill I might ever learn and it’s never been more accessible than in your lectures. Thank you!
How cool is it that anyone with an internet connection has access to such a great teacher? (answer: very)
Thank you, Andrej! Looking forward to the rest of the series!
I've been using this stepped learning rate schedule and I've consistently been getting slightly better training and validation losses. With this I got a 1.98 val loss:
lr = 0.1 if i < 100000 else (0.01 if i < 150000 else 0.001)
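For anyone curious how that schedule behaves over the run, a tiny self-contained check (the 200k total steps are my assumption, matching the training length used in the lecture):

def lr_at(i):
    return 0.1 if i < 100000 else (0.01 if i < 150000 else 0.001)

for step in (0, 99_999, 100_000, 149_999, 150_000, 199_999):
    print(step, lr_at(step))   # 0.1 up to 100k steps, 0.01 up to 150k, then 0.001
# inside the training loop you would simply replace the fixed lr with lr_at(i)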
Thank you so much for these lectures ! Can you please make a video on the “experimental harness” you mention at the end of the video? It would be super helpful and informative.
This guy is literally my definition of wholesomeness. Again, Thank you, Andrej!
@Andrej thank you for making this. Please continue making such videos. It really helps beginners like me. If possible, could you please make a series on how actual development and production is done?
Thanks again Andrej! Love these videos! Dream come true to watch and learn these!
Thanks for all you do to help people! Your helpfulness ripples throughout the world!
Thanks again! lol
Great video again Andrej, keep up the good work and thank you as always!
I just found your YouTube channel, and this is just amazing. Please do not stop making these videos, they are incredible.
Great series! I really enjoy the progress and good explanations.
Andrej, thanks for all you do for us. You're the best.
Beautifully explained as always - thanks. It shows how much passion you have to come up with these awesome videos. We are all blessed!
This is philanthropy! I love you man!
Absolutely awesome stuff Andrej. Thank you for doing this.
The sentence that Andrej said at 49:26 made me realize something, something very deep. 🔥
So I finished this lecture series. I was expecting RNN/LSTM/GRU, but they weren't there; still, I learnt a lot throughout that I can definitely continue on my own. Thanks Andrej.
Deliberate errors in the right spots... Your lectures are great.
How can we help you keep putting these treasures out, Andrej? I think the expected value of helping hundreds of thousands of ML practitioners improve their understanding of the building blocks might be proportional to (or even outweigh) the value of your individual contributions at OpenAI. That's not to say that your technical contributions are not valuable; on the contrary, I'm using their value as a point of comparison because I want to emphasise how amazingly valuable I think your work on education is. A useful analogy would be to ask which ended up having more impact on our progress in the field of physics: Richard Feynman's lectures, which motivated many to pursue science and improved everyone's intuitions, OR his individual contributions to the field? At the end of the day it's not about one or the other but just finding the right balance given the effective impact of each and, of course, your personal enjoyment.
Thank you Sir. Have been waiting for this.
Hi Andrej, thank you for taking the time to create these videos. In this video, for the first time, I'm having difficulties understanding what the model is actually learning. I've watched it twice and tried to understand the WaveNet paper, but that isn't really helping. Given an example input "emma", where the following character is supposed to be ".", why is it beneficial to create a hidden layer to process "em", "ma", and then "emma"? Are we essentially encoding that, given a 4-character word, IF the first two characters are "em" it is likely that the 5th character is ".", no matter what the third and fourth characters are? In other words, would this implementation probably assign a higher probability that "." is the fifth character after an unseen name, e.g. "emli", simply because it starts with the bigram "em"? Thanks in advance, Dimitri.
Loved this series. Would you please be willing to continue it so we get to work through the rest of CNN, RNN, and LSTM? Thanks!
Every video another solid pure gold bar
Andrej, we all love you. You're amazing!
Incredible video, this helps a lot. Thank you for the videos; I especially loved your Stanford videos on machine learning from scratch, showing how to do it without any libraries like TensorFlow and PyTorch. Keep going, and thank you for helping hungry learners like me!!! Cheers 🥂
Thank you for this beautiful tutorial 🔥
*Abstract*
This video continues the "makemore" series, focusing on improving the character-level language model by transitioning from a simple multi-layer perceptron (MLP) to a deeper, tree-like architecture inspired by WaveNet. The video delves into the implementation details, discussing PyTorch modules, containers, and debugging challenges encountered along the way. A key focus is understanding how to progressively fuse information from input characters to predict the next character in a sequence. While the video doesn't implement the exact WaveNet architecture with dilated causal convolutions, it lays the groundwork for future explorations in that direction. Additionally, the video provides insights into the typical development process of building deep neural networks, including reading documentation, managing tensor shapes, and using tools like Jupyter notebooks and VS Code.
*Summary*
*Starter Code Walkthrough (**1:43**)*
- The starting point is similar to Part 3, with minor modifications.
- Data generation code remains unchanged, providing examples of three characters to predict the fourth.
- Layer modules like Linear, BatchNorm1D, and Tanh are reviewed.
- The video emphasizes the importance of setting BatchNorm layers to training=False during evaluation.
- Loss function visualization is improved by averaging values.
*PyTorchifying Our Code: Layers, Containers, Torch.nn, Fun Bugs (**9:19**)*
- Embedding table and view operations are encapsulated into custom Embedding and Flatten modules.
- A Sequential container is created to organize layers, similar to torch.nn.Sequential.
- The forward pass is simplified using these new modules and container.
- A bug related to BatchNorm in training mode with single-example batches is identified and fixed.
*Overview: WaveNet (**17:12**)*
- The limitations of the current MLP architecture are discussed, particularly the issue of squashing information too quickly.
- The video introduces the WaveNet architecture, which progressively fuses information in a tree-like structure.
- The concept of dilated causal convolutions is briefly mentioned as an implementation detail for efficiency.
*Implementing WaveNet (**19:35**)*
- The dataset block size is increased to 8 to provide more context for predictions.
- The limitations of directly scaling up the context length in the MLP are highlighted.
- A hierarchical model is implemented using FlattenConsecutive layers to group and process characters in pairs.
- The shapes of tensors at each layer are inspected to ensure the network functions as intended.
- A bug in the BatchNorm1D implementation is identified and fixed to correctly handle multi-dimensional inputs.
*Re-training the WaveNet with Bug Fix (**45:25**)*
- The network is retrained with the BatchNorm1D bug fix, resulting in a slight performance improvement.
- The video notes that PyTorch's BatchNorm1D has a different API and behavior compared to the custom implementation.
*Scaling up Our WaveNet (**46:07**)*
- The numbers of embedding and hidden units are increased, leading to a model with 76,000 parameters.
- Despite longer training times, the validation performance improves to 1.993.
- The need for an experimental harness to efficiently conduct hyperparameter searches is emphasized.
*Experimental Harness (**46:59**)*
- The lack of a proper experimental setup is acknowledged as a limitation of the current approach.
- Potential future topics are discussed, including:
- Implementing dilated causal convolutions
- Exploring residual and skip connections
- Setting up an evaluation harness
- Covering recurrent neural networks and transformers
*Improve on My Loss! How Far Can We Improve a WaveNet on This Data? (**55:27**)*
- The video concludes with a challenge to the viewers to further improve the WaveNet model's performance.
- Suggestions for exploration include:
- Trying different channel allocations
- Experimenting with embedding dimensions
- Comparing the hierarchical network to a large MLP
- Implementing layers from the WaveNet paper
- Tuning initialization and optimization parameters
I summarized the transcript with Gemini 1.5 Pro.
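For anyone skimming the summary: the FlattenConsecutive layer mentioned under "Implementing WaveNet" is roughly the following (a reconstruction from memory of the lecture, so treat the details as approximate):

import torch

class FlattenConsecutive:
    # groups n consecutive time steps into the channel dimension
    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)   # fuse n neighbours: (B, T/n, C*n)
        if x.shape[1] == 1:                      # at the top of the tree, drop the singleton time dim
            x = x.squeeze(1)
        self.out = x
        return self.out

    def parameters(self):
        return []

# e.g. a batch of 8 character embeddings of size 10 gets fused in pairs:
print(FlattenConsecutive(2)(torch.randn(32, 8, 10)).shape)   # torch.Size([32, 4, 20])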
Um, can I find Part 6 somewhere? (RNN, LSTM, GRU...) I was under the impression that the next video in the playlist is about building GPT from scratch.
That was a great playlist, easy to understand and very helpful, thank you very much!!
Haven't watched this video (yet), but I'm wondering if Andrej discussed WaveNet vs transformers. I know that the WaveNet paper came out around the same time as Attention Is All You Need. It seems like both WaveNet and transformers can do sequence prediction/generation, but transformers have taken off. Is that because of transformers' better performance in most problem domains? Does WaveNet still outperform transformers in certain situations?
great video, been learning a ton from you recently. thank you andrej!
I am subscribing, Andrej, just to support someone from our country, Slovakia. Even though I don't understand anything from the video >D
Been waiting for awhile. Thankyouuu !!
Hi Andrej. Is there going to be an RNN, LSTM, GRU video? Or maybe even a part 2 on the topic of WaveNet, with the residual connections?
NumPy / torch / tf tensor reshaping always feels like handwavy magic.
Finally finished all the lectures, and I realized I have a weak understanding of the math, and of dimensionality and the operations over it. Anyway, thank you for helping out with the rest of the concepts and practices; I now better understand how backprop works, what it is doing, and what for.
Jon Krohn has a full playlist on the algebra and calculus to cover before starting machine learning.
Please keep these coming!
Much appreciated, Andrej. Your tutorials are gem!
Bro pls release part 6, I know you are busy with chatgpt but the word needs you
Andrej is literally the bridge between worried senior engineers and the world of gen ai.
Thank you! Love the series! Helped me a lot with my learning experience with PyTorch
Thanks for continuing this fantastic series.
At 38:00, it sounds like we compared two architectures, both with 22k parameters and an 8-character window:
* 1 layer, full connectivity
* 3 layers, tree-like connectivity
In a single layer, full connectivity outperforms partial connectivity.
But partial connectivity uses fewer parameters, so we can afford to build more layers.
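Back-of-the-envelope arithmetic for that comparison (my own numbers: the tree widths match the ~22k hierarchical model as I remember it, while the flat hidden width is a hypothetical value chosen to land near the same budget; batch-norm gains/biases are ignored):

def linear(n_in, n_out):                 # weights + biases of a fully connected layer
    return n_in * n_out + n_out

vocab, n_embd, block = 27, 10, 8
emb = vocab * n_embd

h_flat = 200                             # hypothetical width for the flat MLP
flat = emb + linear(block * n_embd, h_flat) + linear(h_flat, vocab)

h = 68                                   # hierarchical model width, fusing characters in pairs over 3 layers
tree = emb + linear(2 * n_embd, h) + linear(2 * h, h) + linear(2 * h, h) + linear(h, vocab)

print(flat, tree)                        # 21897 22193 -- both land in the low 20k range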
Awesome! Well explained, and it's clear what's being done. Please keep making these fantastic videos!!!
Thanks. Very helpful and intuitive.
I rarely finish an entire episode. Hey Andrej 👌
Great content, Andrej! Keep them coming!
Thanks for the masterclass!!!!!! ... By the way, I found you through an interview of geohotz with Lex... I heard you like to teach, and they are right about that statement :)
Can't wait for this next step in the process!
Thank you so much for the series. I recently started it and it's the best thing on all of YouTube. Keep it up.
Can we use WaveNet for LLM training? Interested in its performance.
AI Devil is back. Thanks for the video @Andrej Karpathy.
Thanks Andrej, this course is awesome for base building.
I'm learning so much. I really appreciate the lucidity and simplicity of your approach. I do have a question. Why not initialize running_mean and running_var to None and then set them on the first batch? That would seem to be a better approach than to start them at zero and would be consistent with making them exponentially weighted moving averages - which they are except for the initialization at 0.0.
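Not Andrej, but here's a sketch of the lazy-initialization variant described above, written on top of a lecture-style batch norm (simplified to 2-D inputs, gamma/beta omitted for brevity; this is just my illustration of the idea, not a claim about what PyTorch does internally):

import torch

class BatchNorm1dLazy:
    def __init__(self, eps=1e-5, momentum=0.001):
        self.eps, self.momentum = eps, momentum
        self.training = True
        self.running_mean = None                  # set from the first batch instead of zeros
        self.running_var = None

    def __call__(self, x):                        # x: (batch, dim)
        if self.training:
            mean = x.mean(0, keepdim=True)
            var = x.var(0, keepdim=True)
            with torch.no_grad():
                if self.running_mean is None:     # first batch: just copy the batch statistics
                    self.running_mean = mean.clone()
                    self.running_var = var.clone()
                else:                             # afterwards: the usual exponential moving average
                    self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                    self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / torch.sqrt(var + self.eps)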
Thanks for the content & explanations Andrej and have a great time in Kyoto :)
Please don't stop the series😢
I have a challenging question (for me :) ). I made a very simple network which takes x as an input and produces y as an output. The network looks like this: y = sin(ax + b), where a and b are learnable variables. The training data is built from sin(3x + 4312). The loss function is mean squared error. Using the usual approaches I couldn't make it work! What do you think the problem is?
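My best guess, with a small reproduction (the data range, init, and optimizer are details I filled in myself): the loss as a function of a is wildly oscillatory, so plain gradient descent from a random init almost always lands in a local minimum far from a = 3. Fitting the frequency of a sinusoid by gradient descent on MSE is a classic hard case.

import torch

torch.manual_seed(0)
x = torch.linspace(-3, 3, 512)
y = torch.sin(3 * x + 4312)                          # the target function from the comment

a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
opt = torch.optim.SGD([a, b], lr=0.01)

for step in range(5000):
    loss = ((torch.sin(a * x + b) - y) ** 2).mean()  # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(a.item(), b.item(), loss.item())               # typically does not recover a ≈ 3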
Dear Andrej, your work is amazing. We are here to share and have a beautiful world all together, and you are doing that.
If you could make a video about convolutional NNs, ImageNet-winning architectures, or anything deep-learning related to vision, that would be great.
Thank you!
SO EXCITED TO SEE THIS POSTED LEEEEETS GOOOOOOOO