- Videos 6
- Views 39,707
Alex-AI
United Kingdom
Added 5 Jul 2021
Deriving Matrix Equations for Backpropagation on a Linear Layer
Doing the index tracking to figure out the matrix form of backpropagation is one of the more tedious aspects of working with neural networks but still quite useful to go through in detail every now and then. I can't claim you'll find this video entertaining or particularly interesting, but I hope some of you will find it useful.
Note that at 1:53 I made a mistake. It should be that b ∈ R^N. The batch dimension B was already accounted for when I wrote the bias matrix as repeated rows of b.
Sections:
0:00 - Setting up notation
6:50 - ∂L / ∂W
20:10 - ∂L / ∂b
23:30 - ∂L / ∂x
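For reference, here is a minimal NumPy sketch of the three results derived in the video, assuming y = xW + b with x ∈ R^(B×M), W ∈ R^(M×N) and b ∈ R^N (variable names are mine, not the video's):

```python
import numpy as np

# Shapes follow the video: a batch of B inputs with M features,
# a linear layer y = x @ W + b with W in R^{M x N} and b in R^N.
B, M, N = 4, 3, 5
x = np.random.randn(B, M)
W = np.random.randn(M, N)
b = np.random.randn(N)

y = x @ W + b                # broadcasting repeats b across the batch rows

# Suppose the upstream gradient dL/dy is given (same shape as y).
dL_dy = np.random.randn(B, N)

dL_dW = x.T @ dL_dy          # (M, N) -- matches dL/dW = x^T (dL/dy)
dL_db = dL_dy.sum(axis=0)    # (N,)   -- the repeated-rows bias collapses to a sum over the batch
dL_dx = dL_dy @ W.T          # (B, M) -- matches dL/dx = (dL/dy) W^T

assert dL_dW.shape == W.shape and dL_db.shape == b.shape and dL_dx.shape == x.shape
```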
Views: 7,787
Videos
Bellman Equation Derived In Excruciatingly Baby Steps
2.2K views · 3 years ago
All the Bellman Equation derivations I found were too quick for me. So I really broke it down step by step until I understood it well enough to explain to someone. If the title of this video appealed to you, I hope this gives you the explanation you were looking for! DISCLAIMER - I'm not as knowledgeable with reinforcement learning as I am with computer vision. Consider me a study buddy rather ...
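For reference, the equation the video works toward is the standard Bellman expectation equation for the state-value function (written here in common Sutton-and-Barto notation, not necessarily the video's own):

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
```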
A Common Misconception About Scaling Neural Network Inputs
733 views · 3 years ago
Back to the fundamentals for this one. Here I explain why the statement "You should scale all your input features to be in the same range" is misleading, and how to fix it. Something I forgot to mention in the video is that scale may matter in more ways than in the toy example I provided, but those are beyond the scope of the discussion. Comments and challenges welcome!
Feature Extraction With TorchVision's Newest Utility
7K views · 3 years ago
In this video I walk you through how to use Torchvision's new feature extraction utility. Questions welcome in the comments! Walkthrough code: github.com/alexander-soare/torchvision_feature_extraction_walkthrough Torchvision docs: pytorch.org/vision/main/feature_extraction.html Timm library: github.com/rwightman/pytorch-image-models XCiT code: github.com/rwightman/pytorch-image-models/blob/mast...
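A minimal sketch of the utility, assuming a ResNet-50 backbone and the 'layer4' node as an illustrative choice (not taken from the linked walkthrough code):

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

model = resnet50()  # illustrative backbone; any traceable torchvision model works

# Map internal node names to output keys; 'layer4' is just an example choice.
extractor = create_feature_extractor(model, return_nodes={"layer4": "features"})

out = extractor(torch.randn(1, 3, 224, 224))
print(out["features"].shape)  # e.g. torch.Size([1, 2048, 7, 7])
```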
Aggregating Nested Transformers
1K views · 3 years ago
A walkthrough of the paper "Aggregating Nested Transformers". I briefly discuss the original Vision Transformer before diving into (13:15) the adaptations made by NesT. Sections: 0:00 - Introduction 1:42 - Hierarchy in convnets 4:15 - Abstract 5:54 - Vision Transformer in a nutshell 13:15 - Key motivation 16:17 - NesT architecture 24:20 - NesT as a decoder 25:07 - Visual interpretability 30:...
Key Query Value Attention Explained
21K views · 3 years ago
I kept getting mixed up whenever I had to dive into the nuts and bolts of multi-head attention so I made this video to make sure I don't forget. It follows the formulation in arxiv.org/pdf/1706.03762.pdf
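A minimal sketch of the scaled dot-product attention from that formulation (single head, no masking; the sizes below are illustrative):

```python
import torch
import torch.nn.functional as F

N, d_k, d_v = 10, 64, 64            # N tokens, key/query dim d_k, value dim d_v
Q = torch.randn(N, d_k)
K = torch.randn(N, d_k)
V = torch.randn(N, d_v)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / d_k ** 0.5       # (N, N): compatibility of each query with each key
weights = F.softmax(scores, dim=-1) # each row sums to 1
out = weights @ V                   # (N, d_v): a weighted sum of the values per token
```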
Finally! A proper, thorough mathematical explanation of backpropagation!!!!
Amazing explanation. Never got QKV before this. Great job of combining the math with perfect analogies!
Thanks.
Type of video I'm gonna have to watch once a month, my memory is just so bad... thanks brother, great explanation!
Really good explanation. YT has so many attempted explanations, including those with sadly incorrect & misleading info, but this is great. Am I understanding correctly, that the Key matrix is identical to the Query matrix, just transposed 90 degrees to allow matrix multiplication? What numerical values comprise the VALUE matrix? Thanks very much
Thanks! The key and query matrices have the same dimensions but they contain different values. That's because they are created by applying different fully connected layers (one for the key, one for the query) to the tokens of the previous transformer layer. The value matrix is created with yet another fully connected layer. Hope that helps.
Great. Happy to learn this new information. Does this "fully connected layer" refer to a Neural Network, and is that what happens in the "Linear Transformation". If so, where do the weights and biases for that NN come from? And if a loss function is utilized, what does it use to determine whether the results are going in the good or bad direction. Thank you VERY much
@@19AKS58 when I say "fully connected layer" that's interchangeable with "linear layer" but the terminology is more precise (in my opinion). And yes this is a layer of a neural network. It's just a matrix with learnable parameters (the "weights") and a vector with learnable parameters (the "bias"). The choice of loss function, and target is totally independent of everything I discuss in this video. In image classification, one might use cross entropy loss with the ground truth labels of the dataset as targets. In natural language processing, one might use cross entropy loss with the ground truth token labels for "next token prediction".
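A rough sketch of that point, assuming a matrix X of tokens coming out of the previous layer (the layer sizes and names here are illustrative, not from the video):

```python
import torch
import torch.nn as nn

d_model, d_k, d_v = 512, 64, 64
to_q = nn.Linear(d_model, d_k)   # three separate fully connected (linear) layers,
to_k = nn.Linear(d_model, d_k)   # each with its own learnable weight matrix and bias vector
to_v = nn.Linear(d_model, d_v)

X = torch.randn(10, d_model)     # 10 tokens from the previous transformer layer
Q, K, V = to_q(X), to_k(X), to_v(X)  # same input, different values in Q, K and V
```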
Amazing! Probably this is the hardest part of all of backpropagation but this made me visualize it so well! My handout just says ∂L/∂W = (∂L/∂Y)^T . X, which is of course just the transpose of the video's X^T . (∂L/∂Y), from the matrix multiplication fact (A.B)^T = B^T.A^T, something that can be visually proved after watching 3B1B's intuition series on linear algebra. (I only watched the ∂L/∂W part.)
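A quick numerical check of that identity (array names are mine):

```python
import numpy as np

X = np.random.randn(4, 3)      # batch of inputs
dL_dY = np.random.randn(4, 5)  # upstream gradient

# (dL/dY)^T X and X^T (dL/dY) are transposes of each other, via (A B)^T = B^T A^T
assert np.allclose((dL_dY.T @ X).T, X.T @ dL_dY)
```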
This is how I like to learn and understand things, unfortunately my deep learning course is going so fast, I'll never be able to catch up if I go like this, maybe I'll revisit this stuff someday in depth
Finally an explanation that doesn't hide the math! Finally I begin to understand this
This is one of the most thorough videos explaining the matrix form of backpropagation - too many courses try to bypass the necessary rigour needed to achieve full understanding. I hope you can make more videos like this. You saved my ass
Thank you, I thought I was going crazy seeing those two lines in the DeepMind lecture
Very good
Yeah, people liked it a lot
thank you so much!!
Video: Excruciating baby steps Me watching in 0.5X struggling to keep up:
Pretty stupid mate.
❤❤❤
great stuff Alex
Bell eq is so profound
Not sure why you're scaling X1 by dividing it with a scalar value. I've never seen that recommended anywhere, and it would leave zero values unchanged, distorting the feature range. The course you're showing suggests scaling all features -1 to 1, which would move the zero (minimum) values to -1.
Hmm, so you're saying that incorporating an offset is essential? What if your data is already zero centered?
@@swazza9999 Even if the data is zero centered, wouldn't dividing every datapoint by a scalar value "distort" their distances to zero? Because zero doesn't "move" when divided, while every other value does. Min-max scaling "moves" every value including zero (unless the minimum is zero & the transformation range is 0-max) so it shouldn't have the same problem. Either way, I've never seen scaling (by dividing with the standard deviation or otherwise) without centering recommended before for neural network inputs. I might be wrong.
@@activatealmonds yeah I dropped centering from the vid not because I think that it's recommended not to use it, but because I think it's an unnecessary detail for the purposes of my argument. Like for example just consider that we're training an ML model to guess the sex of a baby based on their birth weight and length. We can center the data with respect to both of these variables, but if the units of weight are grams, and the units of length are meters, you're still going to have a situation where incrementing the weight by one unit is far less informative than incrementing the length by one unit.
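A tiny sketch of that argument with made-up numbers: centering alone leaves the two features on wildly different scales, while also dividing each by its own standard deviation puts them on an equal footing.

```python
import numpy as np

# Made-up baby measurements: weight in grams, length in metres.
weight_g = np.array([3200.0, 3500.0, 2900.0, 4100.0])
length_m = np.array([0.49, 0.52, 0.47, 0.55])

# After centering, weight still varies by hundreds while length varies by hundredths,
# so a unit step in weight is far less informative than a unit step in length.
centered = np.stack([weight_g - weight_g.mean(), length_m - length_m.mean()], axis=1)
print(centered.std(axis=0))   # roughly [440, 0.03]

# Dividing each feature by its own standard deviation puts them on a comparable scale.
scaled = centered / centered.std(axis=0)
print(scaled.std(axis=0))     # [1., 1.]
```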
Thanks for that. I have found myself searching on YouTube for something I forgot, and found a video of me explaining it that I recorded a couple of years ago. It's great to put that stuff out there.
thanks a lot alex. keep up the good work. really appreciate it
Very good explanation! Subscribed to your channel :)
That was the one of the greatest lecture videos I've ever seen. Thanks.
This is a very helpful video! I'm learning backpropagation for the first time and was totally confused by the shapes of the matrices that would never align for me. However one thing makes me wonder: can this method only be used for (linear) NNs that don't use activation functions? This appears to be the case. Does that mean that if I'd want to do the same derivations for NNs that do use activation functions it would be even more complicated? Oh man, and there's me, thinking that this would be easy and straightforward haha
I'd say this video covers the "hardest" bit. Activations are easy to incorporate because they typically act on one neuron at a time, so there's no index tracking to do, it's just i -> i. In fact, this video already shows how to deal with activations. If you look at my final expressions they still have dL/dy in them. I left the loss function general, leaving you to fill that in depending on which specific loss function you are using. But what if this was a layer somewhere in the middle of the neural network, and I was really calculating da/dx, da/dW and da/db (a is for "activation"). All the math in the video would be exactly the same, but instead of dL/dy in the final expressions, you'd have da/dy. So you see, incorporating an activation just amounts to incorporating its derivative into the chain rule, and since the activation is a scalar to scalar function, there's no matrix multiplication. It's just a scalar. For example, what if my activation is a(y) = y**2 / 2. Then da/dy = y. Then you just plug y in where I have all the dL/dy in the video. If it's still not clear to you after reading this, I'd encourage you to just sit with it for some time and try to work through it. I feel like you are close.
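A small numerical sketch of that last point, using the toy activation a(y) = y**2 / 2 from the comment above (variable names are mine):

```python
import numpy as np

B, M, N = 4, 3, 5
x = np.random.randn(B, M)
W = np.random.randn(M, N)
b = np.random.randn(N)

y = x @ W + b
a = y ** 2 / 2                 # toy elementwise activation: a(y) = y^2 / 2

# Upstream gradient arriving at the activation output (from the rest of the network).
dL_da = np.random.randn(B, N)

# The activation acts on one element at a time, so its derivative multiplies elementwise:
dL_dy = dL_da * y              # da/dy = y for this activation

# From here the matrix formulas from the video apply unchanged.
dL_dW = x.T @ dL_dy
dL_db = dL_dy.sum(axis=0)
dL_dx = dL_dy @ W.T
```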
Dam this is good, thank you for good lecture -- keep doing it!
great explanation!
Thank you! that was useful
Excellent explanation!
Hi Alex, this is the clearest explanation I have ever watched, thanks a lot. I have a single question about Q, K and V initialization. Do all of them initialize in the same way or is there a difference? And how do d_q, d_k and d_v differ from each other?
As for your question about the dimensions d_q, d_k and d_v: d_q = d_k, so sometimes it's just written d_qk. Whereas d_v can be different. You can see the first part is true just by looking at the expression QK.T, where we know Q is Nxd_q and K is Nxd_k. So we must have d_q = d_k for the matrix multiplication to work. (PS: it is AlexAI, I just keep forgetting to switch from my personal YT account when I respond to these).
Thanks for this explanation. After watching a few videos, this was the one that painted a clearer picture.
lol, cat stool journal. i have a dog and i know the importance of my dog's poo schedule, too. great video, man!
hahaha. Yeah my Russian Blue had diarrhea for like a year but we finally solved it.
Best explanation I've found. Thanks.
A very succinct description. Perhaps too succinct? Do you have a video that explains clearly how the W_q, W_k, and W_v matrices go from random initialization to appropriate weights through back propagation?
Honestly, that would be too broad. There's nothing fundamentally different about how that happens in this particular configuration vs any other neural network configuration. They are just matrix operations in a series of many, and they get their own partial derivatives just like any other weights. Is there something more to your question that I'm missing?
@@alex-ai7517 ok fair enough…I'm just befuddled as to where the transformer back propagates. I THINK it's in these weight matrices, but I don't see how…I would think that Wq approx equals Wk so that q dot k approx equals 1. But that can't happen with random W_q and W_k.
@@johnryan8645 yeah I think you may be mixing up different ideas here. There's no need to attempt to make Wq~Wk to satisfy yourself that backprop works. I don't know how to set you on the right path though unfortunately (because I'm not sure where you took a wrong turn). I can just say maybe you need to backtrack a bit and take a different path.
The way you explain it is what I want.
Thank you! I tried to wrap my head around this for a while, and this is the first explanation that made me kind-of-understand :)
Awesome video! This clears up a lot for me with the analogy and visuals. I have a question, why is using a matrix of attention scores and a weighted average better than just, say, an MLP? The general explanation is, "It allows the network to identify important information", but that doesn't really explain why it's better than a more naive approach. Maybe it has to do with how it's back-propagated? It feels arbitrary to me that we have 3 inputs (query, keys, and values) rather than just two and I can't make sense of it. I have a hard time understanding why we can't just combine keys and values and use a more complicated function for that layer or something. Edit: I think it just clicked!! Correct me if I'm wrong, but I think *it comes down to the attention scores all summing up to one.* So a hypothetical is: we have 10 people to query from, and 1 of them is *slightly* compatible, and the other 9 of them are *not compatible at all* , then the person who is *slightly* compatible will have *100% of attention paid to them bc they're the only person that could have slightly relevant information* . So it's because they're important *in relation* to the other people there. In an MLP, the model could pick up on that some, but with a lot of noise and diverse data, it wouldn't be able to prioritize information relatively like an attention network does. (Formatting is being a bit weird with asterisks, apologies.)
Just saw your edit but drafted this response before. So here it is: Attention is an explicit form of interaction between activations in the same layer. Whereas with an MLP or CNN you only get "flow of information" from one layer to the next. It's not necessarily "better" to use a transformer. A CNN learns to parse images with less data and training time than a transformer, but the transformer can catch up and surpass the CNN because it uses a more general mechanism (though this is not totally clear cut IMO). Also, if you're looking for the motivation behind the transformer architecture it might make more sense to consider it in the light of natural language processing where it originated. If you follow the evolution of RNNs, LSTMs and GRUs the transformer will feel like a natural win. As for why this specific key, query, value mechanism, you're totally right that it's arbitrary (to a degree). These just happened to be the details that were settled on and carried forward in follow-up papers/works. The main point is that there's an explicit interaction between activations (rather than just activations * weights -> activations). In fact a transformer is effectively a fully connected graph neural network (GNN). And GNN mechanisms come in a wide variety of flavors. You can look them up. Even in just the "attention" literature there are 3 main mechanisms (you can probably find them in the 2017 attention is all you need paper if I recall correctly). Hope that helps.
@@alex-ai7517 This is great, thanks! In the second paragraph you mention that a transformer is effectively a fully connected GNN. Is that because graph neural networks normalize each list of connected node's values when propagating? (I don't know much about GNNs but hopefully that makes sense.) But ofc that they aren't connecting to all other nodes per se. Also "Attention is an explicit form of interaction between activations in the same layer ..." is very helpful for understanding. It's like horizontal info flow + vertical info flow instead of just vertical.
@@joeystenbeck6697 to respond to your question there, no I wasn't thinking of normalisation exactly. That would be more specific than I wanted to be. I literally just mean that the tokens in the transformer framework form the nodes of a graph, and since the self attention is global (as an aside, with some vision transformers it actually isn't) each node is connected to all others.
@@swazza9999 Ah gotcha, that makes sense
Thanks, I was stuck lol
Hi! Please post a video about multi-headed attention
Great video, really useful! If you could also do dL/d sigmoid(y) and dL/d Prelu(h) That’d help a lot!
Thank you for this Alex, would you consider doing a similar video for cross attention? Your video helped me wrap my head around self attention, but I feel like I still struggle to intuit how the key and query interact in cross attention (for the value it's still a weighted sum). I'm especially confused with multimodal attention, e.g where the key is an image embedding and the query is a text embedding or vice versa. Any resources you'd recommend? Thank you!
Finally! Q, K, V properly explained, thanks!
9:22 "A weighted sum of all the values." Worth the price of admission! Subbed!
very, very nice. having a visual makes things much more clear. Next can be the intuition behind key, query, and values. They are each a function of the embedding vector, so what to pay attention to is a function of semantics (meaning and also grammatical function) of words, I would say. These kinds of insights into this whole attention mechanism would be very valuable also.
This is such a brilliant way of remembering the underlying purposes of Q,K,V. The implementation is straight forward enough but I found myself continually coming back and having to sense check myself. Props for sharing your thought process!
really cool explanation :D
Thank you, great explanation
Great explanation. Thank you
Excellent! Thank you for this. I have studied this idea and used this idea, but still, I kept forgetting. Hopefully, I will be able to retain this for a long time :)
I found the derivations quite brief too and was looking for a more rigorous explanation, so this was useful. An important point at 19:39 that I think should be mentioned is that you get E(G_(t+1) | s',r,s,a) and since it's a Markov decision process, the rewards obtained from state s' would be independent of what action you took at s and what reward you got before arriving at s'. So this would equal E(G_(t+1) | s') , which you have written.
Yep. I suppose there are some other places in my derivation where I haven't been totally explicit about the conditions. For example, I often drop the pi once I have pinned down an action. But that's just because I know I'm not going to need to talk about it again and it's implicit. Thanks for raising this.
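In symbols, the step discussed above (standard notation, not copied from the video): because the process is Markov, the return from t+1 onward depends only on the state s' actually reached, so

```latex
\mathbb{E}\bigl[\, G_{t+1} \mid S_{t+1} = s',\, R_{t+1} = r,\, S_t = s,\, A_t = a \,\bigr]
  \;=\; \mathbb{E}\bigl[\, G_{t+1} \mid S_{t+1} = s' \,\bigr]
```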