Note that at 1:53 I made a mistake. It should be that b ∈ R^N. The batch dimension B was already accounted for when I wrote the bias matrix as repeated rows of b.
Yeah, to be precise, b ∈ R^(1×N), and how it gets added to each instance really depends on the implementation in the code. It is frustratingly confusing for beginners, because the rows need not actually be repeated in PyTorch or NumPy thanks to broadcasting. Thanks for the awesome lecture.
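For anyone tripped up by that point, here is a minimal NumPy sketch (the shapes are illustrative assumptions, not taken from the video) showing that broadcasting gives the same result as explicitly repeating the bias row:

```python
import numpy as np

B, D, N = 4, 3, 2              # batch size, input dim, output dim (illustrative)
X = np.random.randn(B, D)      # batch of inputs
W = np.random.randn(D, N)      # weight matrix
b = np.random.randn(N)         # bias: shape (N,); no repeated rows are stored

Y_broadcast = X @ W + b                      # broadcasting adds b to every row
Y_repeated  = X @ W + np.tile(b, (B, 1))     # explicit "repeated rows of b" version
assert np.allclose(Y_broadcast, Y_repeated)  # identical results
```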
This is one of the most thorough videos explaining the matrix form of backpropagation - too many courses try to bypass the necessary rigour needed to achieve full understanding. I hope you can make more videos like this. You saved my ass
That was one of the greatest lecture videos I've ever seen. Thanks.
Amazing! This is probably the hardest part of all of backpropagation, but this video made it so easy to visualize. My handout just says ∂L/∂W = (∂L/∂Y)^T · X, which is just the transpose of the video's X^T · (∂L/∂Y) (the same quantity under the other layout convention), via the identity (A·B)^T = B^T · A^T, something you can prove visually after watching 3B1B's intuition series on linear algebra. (I only watched the ∂L/∂W part.)
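If anyone wants to sanity-check that equivalence numerically, here is a small NumPy sketch (the toy loss, the shapes, and the omission of the bias are my own assumptions, not from the video or the handout):

```python
import numpy as np

B, D, N = 4, 3, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((B, D))
W = rng.standard_normal((D, N))

def loss(W):
    Y = X @ W                      # bias omitted for brevity
    return 0.5 * np.sum(Y ** 2)    # toy loss chosen so that dL/dY = Y

Y = X @ W
dLdY = Y                           # analytic dL/dY for this toy loss
dLdW_video   = X.T @ dLdY          # video's form, shape (D, N), same as W
dLdW_handout = (dLdY.T @ X).T      # handout's form transposed back: identical

# finite-difference check of dL/dW
num, eps = np.zeros_like(W), 1e-6
for i in range(D):
    for j in range(N):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(dLdW_video, dLdW_handout)
assert np.allclose(dLdW_video, num, atol=1e-4)
```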
This is how I like to learn and understand things. Unfortunately my deep learning course is going so fast that I'll never be able to catch up if I go at this pace, but maybe I'll revisit this stuff in depth someday.
great explanation!
great stuff Alex
Very good
thank you so much!!
Great video, really useful! If you could also do dL/d sigmoid(y) and dL/d PReLU(h), that'd help a lot!
This is a very helpful video! I'm learning backpropagation for the first time and was totally confused by the shapes of the matrices, which would never align for me.
However, one thing makes me wonder: can this method only be used for (linear) NNs that don't use activation functions? This appears to be the case. Does that mean that if I wanted to do the same derivations for NNs that do use activation functions, it would be even more complicated? Oh man, and there's me, thinking this would be easy and straightforward, haha.
I'd say this video covers the "hardest" bit. Activations are easy to incorporate because they typically act on one neuron at a time, so there's no index tracking to do; it's just i -> i.

In fact, this video already shows how to deal with activations. If you look at my final expressions, they still have dL/dy in them. I left the loss function general, leaving you to fill that in depending on which specific loss function you are using. But what if this were a layer somewhere in the middle of the neural network, and I were really calculating da/dx, da/dW and da/db (a is for "activation")? All the math in the video would be exactly the same, but instead of dL/dy in the final expressions, you'd have da/dy.

So you see, incorporating an activation just amounts to incorporating its derivative into the chain rule, and since the activation is a scalar-to-scalar function, there's no matrix multiplication involved; it's just a scalar factor. For example, suppose my activation is a(y) = y**2 / 2. Then da/dy = y, and you just plug y in wherever I have dL/dy in the video.
If it's still not clear to you after reading this, I'd encourage you to just sit with it for some time and try to work through it. I feel like you are close.
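To make the reply above concrete, here is a minimal NumPy sketch (the shapes and the made-up upstream gradient are illustrative assumptions) of folding the elementwise activation a(y) = y**2 / 2 into the video's matrix expressions:

```python
import numpy as np

B, D, N = 4, 3, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((B, D))
W = rng.standard_normal((D, N))
b = rng.standard_normal(N)

# forward pass
Y = X @ W + b                  # pre-activation
A = Y ** 2 / 2                 # activation a(y) = y**2 / 2, applied elementwise

# pretend gradient flowing back from later layers (made up for illustration)
dLdA = rng.standard_normal((B, N))

# backward pass: the activation contributes only an elementwise factor da/dy = y
dLdY = dLdA * Y                # elementwise product, no matrix multiplication
dLdW = X.T @ dLdY              # same expressions as the video, with dL/dy swapped in
dLdb = dLdY.sum(axis=0)        # bias gradient sums over the batch dimension
dLdX = dLdY @ W.T
```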