- Videos 6
- Views 39,707
Alex-AI
United Kingdom
Added 5 Jul 2021
Deriving Matrix Equations for Backpropagation on a Linear Layer
Doing the index tracking to figure out the matrix form of backpropagation is one of the more tedious aspects of working with neural networks but still quite useful to go through in detail every now and then. I can't claim you'll find this video entertaining or particularly interesting, but I hope some of you will find it useful.
Note that at 1:53 I made a mistake. It should be that b ∈ R^N. The batch dimension B was already accounted for when I wrote the bias matrix as repeated rows of b.
Sections:
0:00 - Setting up notation
6:50 - ∂L / ∂W
20:10 - ∂L / ∂b
23:30 - ∂L / ∂x
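For reference, here is a minimal NumPy sketch of the three results derived in the video, assuming y = xW + b with x ∈ R^(B×M), W ∈ R^(M×N) and b ∈ R^N (variable names are mine, not the video's):

```python
import numpy as np

# Shapes follow the video: a batch of B inputs with M features,
# a linear layer y = x @ W + b with W in R^{M x N} and b in R^N.
B, M, N = 4, 3, 5
x = np.random.randn(B, M)
W = np.random.randn(M, N)
b = np.random.randn(N)

y = x @ W + b                # broadcasting repeats b across the batch rows

# Suppose the upstream gradient dL/dy is given (same shape as y).
dL_dy = np.random.randn(B, N)

dL_dW = x.T @ dL_dy          # (M, N) -- matches dL/dW = x^T (dL/dy)
dL_db = dL_dy.sum(axis=0)    # (N,)   -- the repeated-rows bias collapses to a sum over the batch
dL_dx = dL_dy @ W.T          # (B, M) -- matches dL/dx = (dL/dy) W^T

assert dL_dW.shape == W.shape and dL_db.shape == b.shape and dL_dx.shape == x.shape
```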
Views: 7,787
Videos
Bellman Equation Derived In Excruciatingly Baby Steps
2.2K views · 3 years ago
All the Bellman Equation derivations I found were too quick for me. So I really broke it down step by step until I understood it well enough to explain to someone. If the title of this video appealed to you, I hope this gives you the explanation you were looking for! DISCLAIMER - I'm not as knowledgeable with reinforcement learning as I am with computer vision. Consider me a study buddy rather ...
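For reference, the equation the video works toward is the standard Bellman expectation equation for the state-value function (written here in common Sutton-and-Barto notation, not necessarily the video's own):

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
```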
A Common Misconception About Scaling Neural Network Inputs
733 views · 3 years ago
Back to the fundamentals for this one. Here I explain why the statement "You should scale all your input features to be in the same range" is misleading, and how to fix it. Something I forgot to mention in the video is that scale may matter in more ways than in the toy example I provided, but those are beyond the scope of the discussion. Comments and challenges welcome!
Feature Extraction With TorchVision's Newest Utility
7K views · 3 years ago
In this video I walk you through how to use Torchvision's new feature extraction utility. Questions welcome in the comments! Walkthrough code: github.com/alexander-soare/torchvision_feature_extraction_walkthrough Torchvision docs: pytorch.org/vision/main/feature_extraction.html Timm library: github.com/rwightman/pytorch-image-models XCiT code: github.com/rwightman/pytorch-image-models/blob/mast...
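A minimal sketch of the utility, assuming a ResNet-50 backbone and the 'layer4' node as an illustrative choice (not taken from the linked walkthrough code):

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

model = resnet50()  # illustrative backbone; any traceable torchvision model works

# Map internal node names to output keys; 'layer4' is just an example choice.
extractor = create_feature_extractor(model, return_nodes={"layer4": "features"})

out = extractor(torch.randn(1, 3, 224, 224))
print(out["features"].shape)  # e.g. torch.Size([1, 2048, 7, 7])
```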
Aggregating Nested Transformers
1K views · 3 years ago
A walkthrough of the paper "Aggregating Nested Transformers". I briefly discuss the original Vision Transformer before diving into (13:15) the adaptations made by NesT. Sections: 0:00 - Introduction 1:42 - Hierarchy in convnets 4:15 - Abstract 5:54 - Vision Transformer in a nutshell 13:15 - Key motivation 16:17 - NesT architecture 24:20 - NesT as a decoder 25:07 - Visual interpretability 30:...
Key Query Value Attention Explained
21K views · 3 years ago
I kept getting mixed up whenever I had to dive into the nuts and bolts of multi-head attention so I made this video to make sure I don't forget. It follows the formulation in arxiv.org/pdf/1706.03762.pdf
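A minimal sketch of the scaled dot-product attention from that formulation (single head, no masking; the sizes below are illustrative):

```python
import torch
import torch.nn.functional as F

N, d_k, d_v = 10, 64, 64            # N tokens, key/query dim d_k, value dim d_v
Q = torch.randn(N, d_k)
K = torch.randn(N, d_k)
V = torch.randn(N, d_v)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / d_k ** 0.5       # (N, N): compatibility of each query with each key
weights = F.softmax(scores, dim=-1) # each row sums to 1
out = weights @ V                   # (N, d_v): a weighted sum of the values per token
```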
Finally! A proper, thorough mathematical explanation of backpropagation!!!!
Amazing explanation. Never got QKV before this. Great job of combining the math with perfect analogies!
Thanks.
Type of video I'm gonna have to watch once a month, my memory is just so bad... thanks brother, great explanation!
Really good explanation. YT has so many attempted explanations, including those with sadly incorrect & misleading info, but this is great. Am I understanding correctly, that the Key matrix is identical to the Query matrix, just transposed 90 degrees to allow matrix multiplication? What numerical values comprise the VALUE matrix? Thanks very much
Thanks! The key and query matrices have the same dimensions but they contain different values. That's because they are created by applying different fully connected layers (one for the key, one for the query) to the tokens of the previous transformer layer. The value matrix is created with yet another fully connected layer. Hope that helps.
Great. Happy to learn this new information. Does this "fully connected layer" refer to a Neural Network, and is that what happens in the "Linear Transformation". If so, where do the weights and biases for that NN come from? And if a loss function is utilized, what does it use to determine whether the results are going in the good or bad direction. Thank you VERY much
@@19AKS58 when I say "fully connected layer" that's interchangeable with "linear layer" but the terminology is more precise (in my opinion). And yes this is a layer of a neural network. It's just a matrix with learnable parameters (the "weights") and a vector with learnable parameters (the "bias"). The choice of loss function, and target is totally independent of everything I discuss in this video. In image classification, one might use cross entropy loss with the ground truth labels of the dataset as targets. In natural language processing, one might use cross entropy loss with the ground truth token labels for "next token prediction".
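A rough sketch of that point, assuming a matrix X of tokens coming out of the previous layer (the layer sizes and names here are illustrative, not from the video):

```python
import torch
import torch.nn as nn

d_model, d_k, d_v = 512, 64, 64
to_q = nn.Linear(d_model, d_k)   # three separate fully connected (linear) layers,
to_k = nn.Linear(d_model, d_k)   # each with its own learnable weight matrix and bias vector
to_v = nn.Linear(d_model, d_v)

X = torch.randn(10, d_model)     # 10 tokens from the previous transformer layer
Q, K, V = to_q(X), to_k(X), to_v(X)  # same input, different values in Q, K and V
```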
Amazing! Probably this is the hardest part of all of backpropagation but this made me visualize it so well! My handout just says ∂L/∂W = (∂L/∂Y)^T . X, which is of course just the transpose of the video's X^T . (∂L/∂Y), from the matrix multiplication fact (A.B)^T = B^T.A^T, something that can be visually proved after watching 3B1B's intuition series on linear algebra. (I only watched the ∂L/∂W part.)
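A quick numerical check of that identity (array names are mine):

```python
import numpy as np

X = np.random.randn(4, 3)      # batch of inputs
dL_dY = np.random.randn(4, 5)  # upstream gradient

# (dL/dY)^T X and X^T (dL/dY) are transposes of each other, via (A B)^T = B^T A^T
assert np.allclose((dL_dY.T @ X).T, X.T @ dL_dY)
```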
This is how I like to learn and understand things, unfortunately my deep learning course is going so fast, I'll never be able to catch up if I go like this, maybe I'll revisit this stuff someday in depth
Finally an explanation that doesn't hide the math! Finally I begin to understand this
This is one of the most thorough videos explaining the matrix form of backpropagation - too many courses try to bypass the necessary rigour needed to achieve full understanding. I hope you can make more videos like this. You saved my ass
Thank you, I thought I was going crazy seeing those two lines in the DeepMind lecture
Very good
Yeah, people liked it a lot
thank you so much!!
Video: Excruciating baby steps Me watching in 0.5X struggling to keep up:
Pretty stupid mate.
❤❤❤
great stuff Alex
Bell eq is so profound
Not sure why you're scaling X1 by dividing it with a scalar value. I've never seen that recommended anywhere, and it would leave zero values unchanged, distorting the feature range. The course you're showing suggests scaling all features -1 to 1, which would move the zero (minimum) values to -1.
Hmm, so you're saying that incorporating an offset is essential? What if your data is already zero centered?
@@swazza9999 Even if the data is zero centered, wouldn't dividing every datapoint by a scalar value "distort" their distances to zero? Because zero doesn't "move" when divided, while every other value does. Min-max scaling "moves" every value including zero (unless the minimum is zero & the transformation range is 0-max) so it shouldn't have the same problem. Either way, I've never seen scaling (by dividing with the standard deviation or otherwise) without centering recommended before for neural network inputs. I might be wrong.
@@activatealmonds yeah I dropped centering from the vid not because I think that it's recommended not to use it, but because I think it's an unnecessary detail for the purposes of my argument. Like for example just consider that we're training an ML model to guess the sex of a baby based on their birth weight and length. We can center the data with respect to both of these variables, but if the units of weight are grams, and the units of length are meters, you're still going to have a situation where incrementing the weight by one unit is far less informative than incrementing the length by one unit.
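A tiny sketch of that argument with made-up numbers: centering alone leaves the two features on wildly different scales, while also dividing each by its own standard deviation puts them on an equal footing.

```python
import numpy as np

# Made-up baby measurements: weight in grams, length in metres.
weight_g = np.array([3200.0, 3500.0, 2900.0, 4100.0])
length_m = np.array([0.49, 0.52, 0.47, 0.55])

# After centering, weight still varies by hundreds while length varies by hundredths,
# so a unit step in weight is far less informative than a unit step in length.
centered = np.stack([weight_g - weight_g.mean(), length_m - length_m.mean()], axis=1)
print(centered.std(axis=0))   # roughly [440, 0.03]

# Dividing each feature by its own standard deviation puts them on a comparable scale.
scaled = centered / centered.std(axis=0)
print(scaled.std(axis=0))     # [1., 1.]
```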
Thanks for that. I have found myself searching on YouTube for something I forgot, and found a video of me explaining it that I recorded a couple of years ago. It's great to put that stuff out there.
thanks a lot alex. keep up the good work. really appreciate it
Very good explanation! Subscribed to your channel :)
That was the one of the greatest lecture videos I've ever seen. Thanks.
This is a very helpful video! I'm learning backpropagation for the first time and was totally confused by the shapes of the matrices that would never align for me. However one thing makes me wonder: can this method only be used for (linear) NNs that don't use activation functions? This appears to be the case. Does that mean that if I'd want to do the same derivations for NNs that do use activation functions it would be even more complicated? Oh man, and there's me, thinking that this would be easy and straightforward haha
I'd say this video covers the "hardest" bit. Activations are easy to incorporate because they typically act on one neuron at a time, so there's no index tracking to do, it's just i -> i. In fact, this video already shows how to deal with activations. If you look at my final expressions they still have dL/dy in them. I left the loss function general, leaving you to fill that in depending on which specific loss function you are using. But what if this was a layer somewhere in the middle of the neural network, and I was really calculating da/dx, da/dW and da/db (a is for "activation"). All the math in the video would be exactly the same, but instead of dL/dy in the final expressions, you'd have da/dy. So you see, incorporating an activation just amounts to incorporating its derivative into the chain rule, and since the activation is a scalar to scalar function, there's no matrix multiplication. It's just a scalar. For example, what if my activation is a(y) = y**2 / 2. Then da/dy = y. Then you just plug y in where I have all the dL/dy in the video. If it's still not clear to you after reading this, I'd encourage you to just sit with it for some time and try to work through it. I feel like you are close.
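A small numerical sketch of that last point, using the toy activation a(y) = y**2 / 2 from the comment above (variable names are mine):

```python
import numpy as np

B, M, N = 4, 3, 5
x = np.random.randn(B, M)
W = np.random.randn(M, N)
b = np.random.randn(N)

y = x @ W + b
a = y ** 2 / 2                 # toy elementwise activation: a(y) = y^2 / 2

# Upstream gradient arriving at the activation output (from the rest of the network).
dL_da = np.random.randn(B, N)

# The activation acts on one element at a time, so its derivative multiplies elementwise:
dL_dy = dL_da * y              # da/dy = y for this activation

# From here the matrix formulas from the video apply unchanged.
dL_dW = x.T @ dL_dy
dL_db = dL_dy.sum(axis=0)
dL_dx = dL_dy @ W.T
```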
Dam this is good, thank you for good lecture -- keep doing it!
great explanation!
Thank you! that was useful
Excellent explanation!
Hi Alex, this is the clearest explanation I have ever watched, thanks a lot. I have a single question about Q, K and V initialization. Do all of them initialize in the same way or is there a difference? And how do d_q, d_k and d_v differ from each other?
As for your question about the dimensions d_q, d_k and d_v: d_q = d_k, so sometimes it's just written d_qk. Whereas d_v can be different. You can see the first part is true just by looking at the expression QK.T, where we know Q is Nxd_q and K is Nxd_k. So we must have d_q = d_k for the matrix multiplication to work. (PS: it is AlexAI, I just keep forgetting to switch from my personal YT account when I respond to these).
Thanks for this explanation. After watching a few videos, this was the one that painted a clearer picture.
lol, cat stool journal. i have a dog and i know the importance of my dog's poo schedule, too. great video, man!
hahaha. Yeah my Russian Blue had diarrhea for like a year but we finally solved it.
Best explanation I've found. Thanks.
A very succinct description. Perhaps too succinct? Do you have a video that explains clearly how the W_q, W_k, and W_v matrices go from random initialization to appropriate weights through back propagation?
Honestly, that would be too broad. There's nothing fundamentally different about how that happens in this particular configuration vs any other neural network configuration. They are just matrix operations in a series of many, and they get their own partial derivatives just like any other weights. Is there something more to your question that I'm missing?
@@alex-ai7517 ok fair enough…I'm just befuddled as to where the transformer back propagates. I THINK it's in these weight matrices, but I don't see how…I would think that Wq approx equals Wk so that q dot k approx equals 1. But that can't happen with random W_q and W_k.
@@johnryan8645 yeah I think you may be mixing up different ideas here. There's no need to attempt to make Wq~Wk to satisfy yourself that backprop works. I don't know how to set you on the right path though unfortunately (because I'm not sure where you took a wrong turn). I can just say maybe you need to backtrack a bit and take a different path.
The way you explain it is what I want.
Thank you! I tried to wrap my head around this for a while, and this is the first explanation that made me kind-of-understand :)
Awesome video! This clears up a lot for me with the analogy and visuals. I have a question, why is using a matrix of attention scores and a weighted average better than just, say, an MLP? The general explanation is, "It allows the network to identify important information", but that doesn't really explain why it's better than a more naive approach. Maybe it has to do with how it's back-propagated? It feels arbitrary to me that we have 3 inputs (query, keys, and values) rather than just two and I can't make sense of it. I have a hard time understanding why we can't just combine keys and values and use a more complicated function for that layer or something. Edit: I think it just clicked!! Correct me if I'm wrong, but I think *it comes down to the attention scores all summing up to one.* So a hypothetical is: we have 10 people to query from, and 1 of them is *slightly* compatible, and the other 9 of them are *not compatible at all* , then the person who is *slightly* compatible will have *100% of attention paid to them bc they're the only person that could have slightly relevant information* . So it's because they're important *in relation* to the other people there. In an MLP, the model could pick up on that some, but with a lot of noise and diverse data, it wouldn't be able to prioritize information relatively like an attention network does. (Formatting is being a bit weird with asterisks, apologies.)
Just saw your edit but drafted this response before. So here it is: Attention is an explicit form of interaction between activations in the same layer. Whereas with an MLP or CNN you only get "flow of information" from one layer to the next. It's not necessarily "better" to use a transformer. A CNN learns to parse images with less data and training time than a transformer, but the transformer can catch up and surpass the CNN because it uses a more general mechanism (though this is not totally clear cut IMO). Also, if you're looking for the motivation behind the transformer architecture it might make more sense to consider it in the light of natural language processing where it originated. If you follow the evolution of RNNs, LSTMs and GRUs the transformer will feel like a natural win. As for why this specific key, query, value mechanism, you're totally right that it's arbitrary (to a degree). These just happened to be the details that were settled on and carried forward in follow-up papers/works. The main point is that there's an explicit interaction between activations (rather than just activations * weights -> activations). In fact a transformer is effectively a fully connected graph neural network (GNN). And GNN mechanisms come in a wide variety of flavors. You can look them up. Even in just the "attention" literature there are 3 main mechanisms (you can probably find them in the 2017 attention is all you need paper if I recall correctly). Hope that helps.
@@alex-ai7517 This is great, thanks! In the second paragraph you mention that a transformer is effectively a fully connected GNN. Is that because graph neural networks normalize each list of connected node's values when propagating? (I don't know much about GNNs but hopefully that makes sense.) But ofc that they aren't connecting to all other nodes per se. Also "Attention is an explicit form of interaction between activations in the same layer ..." is very helpful for understanding. It's like horizontal info flow + vertical info flow instead of just vertical.
@@joeystenbeck6697 to respond to your question there, no I wasn't thinking of normalisation exactly. That would be more specific than I wanted to be. I literally just mean that the tokens in the transformer framework form the nodes of a graph, and since the self attention is global (as an aside, with some vision transformers it actually isn't) each node is connected to all others.
@@swazza9999 Ah gotcha, that makes sense
Thanks, I was stuck lol
Hi! Please post a video about multi-headed attention
Great video, really useful! If you could also do dL/d sigmoid(y) and dL/d Prelu(h) That’d help a lot!
Thank you for this Alex, would you consider doing a similar video for cross attention? Your video helped me wrap my head around self attention, but I feel like I still struggle to intuit how the key and query interact in cross attention (for the value it's still a weighted sum). I'm especially confused with multimodal attention, e.g where the key is an image embedding and the query is a text embedding or vice versa. Any resources you'd recommend? Thank you!
Finally! Q, K, V properly explained, thanks!
9:22 "A weighted sum of all the values." Worth the price of admission! Subbed!
very, very nice. having a visual makes things much more clear. Next can be the intuition behind key, query, and values. They are each a function of the embedding vector, so what to pay attention to is a function of semantics (meaning and also grammatical function) of words, I would say. These kinds of insights into this whole attention mechanism would be very valuable also.
This is such a brilliant way of remembering the underlying purposes of Q,K,V. The implementation is straight forward enough but I found myself continually coming back and having to sense check myself. Props for sharing your thought process!
really cool explanation :D
Thank you, great explanation
Great explanation. Thank you
Excellent! Thank you for this. I have studied this idea and used this idea, but still, I kept forgetting. Hopefully, I will be able to retain this for a long time :)
I found the derivations quite brief too and was looking for a more rigorous explanation, so this was useful. An important point at 19:39 that I think should be mentioned is that you get E(G_(t+1) | s',r,s,a) and since it's a Markov decision process, the rewards obtained from state s' would be independent of what action you took at s and what reward you got before arriving at s'. So this would equal E(G_(t+1) | s') , which you have written.
Yep. I suppose there are some other places in my derivation where I haven't been totally explicit about the conditions. For example, I often drop the pi once I have pinned down an action. But that's just because I know I'm not going to need to talk about it again and it's implicit. Thanks for raising this.
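In symbols, the step discussed above (standard notation, not copied from the video): because the process is Markov, the return from t+1 onward depends only on the state s' actually reached, so

```latex
\mathbb{E}\bigl[\, G_{t+1} \mid S_{t+1} = s',\, R_{t+1} = r,\, S_t = s,\, A_t = a \,\bigr]
  \;=\; \mathbb{E}\bigl[\, G_{t+1} \mid S_{t+1} = s' \,\bigr]
```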