This channel is just getting exponentially better with every video. I have no academic background in ML, but you made me watch a 40-minute video on this. Amazing!!!
Yannic, if you team up with someone doing animations to visualize the concepts, your channel will become the most influential in AI.
I prefer the drawings by hand in real time
I think it is perfect the way it is right now. It feels much more personal, and not to mention the videos can be put out much faster (not that I can keep up with them now, haha). The additions to the drawings and the hand-drawn highlights as the talk progresses are an unparalleled learning aid.
Actually, I also very much prefer the real time drawings by hand - really helps with learning.
@@zoeweil7844 I agree. In-progress scribbles from first principles teach a lot more than a polished finished product, because otherwise we don't know where to pay attention first :) Pun unintended.
No, too many graphics are distracting. Please don't go sirajing like Siraj
Window + random attention reminds me of functional columns in the neocortex plus long-range connections. Maybe the random connections could be drawn from a power-law distribution (dependent on distance) to emulate biological neural networks. At the latest, once we move to neuromorphic chips, we will have to make similar trade-offs as biological systems (connections-per-volume limits, etc.).
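As a sketch of what distance-dependent power-law sampling of random attention links might look like (plain NumPy; the exponent alpha and the number of links per token are made-up illustrative values, not from the paper):

```python
import numpy as np

def powerlaw_random_links(n_tokens, links_per_token, alpha=1.5, seed=0):
    """Sample random attention links where the probability of linking
    token i to token j decays as a power law of their distance |i - j|."""
    rng = np.random.default_rng(seed)
    links = {}
    for i in range(n_tokens):
        others = np.array([j for j in range(n_tokens) if j != i])
        dist = np.abs(others - i).astype(float)
        p = dist ** -alpha          # power-law decay with distance
        p /= p.sum()                # normalize to a probability distribution
        links[i] = rng.choice(others, size=links_per_token, replace=False, p=p)
    return links

links = powerlaw_random_links(n_tokens=64, links_per_token=3)
```

Most sampled links land near the diagonal, but a few long-range ones survive, which is roughly the neocortex analogy above.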
I've worked with random projections in compressed sensing, fast SVD, and neural networks for decades, and I delight whenever I see them put to good use (although, as you point out, here they build up the random attention, then don't really use it!). These applications amuse me every time I think of Yann LeCun saying that random projections are the stupidest thing you could possibly do. He was worked up about ELMs at the time, which indeed have a huge number of problems with attribution, but it still makes me grin.
Thanks for getting to this one so promptly.
Very clear explanation as always!
I don't quite feel the criticism is fully merited.
The whole idea of deep nets is that they preserve universal approximation ability while reducing compute and memory demands. A single-hidden-layer net is already a universal approximator; the later, deeper nets just offer a different trade-off in the computing regime.
Per this paper, the addition of random attention is a form of approximation of the full attention mechanism. By itself it is straightforward, even trivial. But it offers a new trade-off space where one can train longer with less memory and meanwhile get a bigger attention range. That makes the exploration more flexible.
Love to hear others' comments.
True, maybe I was a bit harsh :)
Yannic, I know that DistilBERT came out a while ago but do you think you could do a video on it? Keep up the amazing work!!!
Funny idea - definitely an idea for a larger attention span. It moves from O(N^2) to O(KN), where K is a constant - you need more layers than before to facilitate the random walk. So for N=1 you end up more expensive; for N=2 you end up more expensive if K is 2 - likely it is larger. The benefit materializes slowly. N=5 means K has to be 5 to break even... that may already be hard. I wonder, though, about the overhead of the random blocks - it is a lot more processing than a simple matrix operation, so that has to be evaluated.
I would also assume that the random points are identical on every layer - definitely more efficient. The problem with sampling them anew on every layer is that a cluster may point to a dead end. That leaves more and more (with more layers) connections that are dead ends.
Definitely a nice idea to go from a very brutal use of computing resources to something that is deeper - significantly deeper - but not as compute-intensive on every layer. And given that full attention is quadratic and there is a serious need for serious attention windows (i.e. 64k+), that may be a brutal benefit.
There is also the idea of layers with different sizes: first and last the same, then going inward with N/2 in every other layer.
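As a rough sanity check of the trade-off being discussed, here is a back-of-the-envelope count of attention edges (K and the layer counts are illustrative guesses, not the paper's numbers):

```python
def full_attention_edges(n, layers):
    # full attention: every token attends to every token, in every layer
    return layers * n * n

def sparse_attention_edges(n, k, layers):
    # sparse attention: each token attends to ~k tokens, in every layer
    return layers * k * n

n = 4096
# sparse needs more layers so information can random-walk between tokens,
# so we charge it double the depth here
full = full_attention_edges(n, layers=12)
sparse = sparse_attention_edges(n, k=64, layers=24)
print(full, sparse, full / sparse)  # → 201326592 6291456 32.0
```

Even with double the depth, sparse attention touches ~32x fewer edges at this sequence length, which is the "benefit slowly materializes" point: at small N the constant K dominates, at large N the quadratic term does.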
Thanks for taking the time to make this video. Do you have an opinion which one is better: performer or big bird?
out of the box, probably big bird
This sounds like it should be equivalent to a strangely shaped dropout
Good viewpoint. Actually, there is a term for this: DropConnect. While Dropout means dropping activations, DropConnect means zeroing some weights (connections), and it is used for regularization. I first saw it in the AWD-LSTM implementation by Stephen Merity.
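For readers who haven't seen the distinction, a minimal NumPy sketch (just the general idea, not the AWD-LSTM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p):
    """Dropout: zero random *activations*, scale the survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def dropconnect_linear(x, W, p):
    """DropConnect: zero random *weights* (connections), then apply the layer."""
    mask = rng.random(W.shape) >= p
    return x @ (W * mask / (1.0 - p))

x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 3))
out = dropconnect_linear(x, W, p=0.5)
```

The parallel to the video is loose: BigBird's sparsity is a fixed structural pattern rather than a per-step regularizer, but the "zeroed connections" picture is the same.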
What is missing is dynamic global attention. I.e. instead of just CLS and SEP tokens, let the model discover which tokens should have global / greater attention window size.
very well done video. Thanks, man.
Regarding the transition of O(n^2) to O(n). You are, of course, right that the necessity to add more layers (in the worst case) mitigates this win in overall compute and even memory. However, if you think about model partition (more specifically placing each layer on its own compute node), it seems to me that, if price is not an object, you can train much larger models this way.
Now couple this with model distillation (though dealing with sparse attention in model distillation won't be entirely trivial) and it seems you might be able to distill reasonably sized, well-performing models for long sequences.
Do you know how large N must be so that even a single full-attention layer no longer fits on, let's say, a 2080 Ti?
@@timdernedde993 It depends on a couple of factors. Let's make a few simplifying assumptions: We use base BERT's configuration and only consider the memory requirements for the weights of one transformer layer plus the states for the input, output and intermediate states with batch size 1. Let's also assume that all states and weights are 4 byte floats.
This leads to
- 3 * 12 * 768 * (768/12) = 3 * 768^2 weights for the attention projection matrices
- 768^2 weights for the combination matrix
- 2 * 768 * 2048 weights for the feed forward network
plus
- n * 768 states for the input
- 3 * n * 12 * (768/12) = n * 3 * 768 states for the attention projections
- n^2 * 12 * (768/12) = n^2 * 768 states for the attention results
- n * 768 states for the combination result
- n * 2048 states for the hidden layer in the feed-forward network
- n * 768 states for the output layer of the feed-forward network
In total this results in
(768n^2 + 6656n + 5505024) * 4 bytes.
A 2080 Ti has 11GB = 11000000000 bytes VRAM
This leads to a maximal sequence length of 1886 tokens for the forward pass.
In reality, the number is lower because our optimizer (e.g. Adam) requires us to store additional information.
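For what it's worth, the final step above is just solving a quadratic; a quick sketch using the numbers from this thread (note 6656 = (6 * 768 + 2048) state entries per token from the breakdown):

```python
import math

# memory(n) = (768*n**2 + 6656*n + 5505024) * 4 bytes  <=  11e9 bytes
a = 768.0                   # quadratic term: attention result states
b = 6656.0                  # linear term: per-token states
c = 5505024.0 - 11e9 / 4    # weights minus the budget (in float counts)

n = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
print(int(n))  # → 1886 tokens for the forward pass
```

This reproduces the 1886-token figure above, and makes it easy to play with other budgets (e.g. a 24 GB card) or other hidden sizes.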
Thanks for the detailed answer! Somehow this was surprising to me. Just by gut feeling, I would have hoped for more tokens.
@@timdernedde993 Neural NLP is, in many cases, very resource intensive. Training BERT from scratch costs something in the neighborhood of $250,000 in terms of cloud-compute costs. (Assuming you pay full price in Google's cloud.)
Amazing summary! I was thinking about these 3 points
1) Table 1 shows random attention outperforming window attention. This seems counter-intuitive to me, given that most of the information is in the window tokens.
2) They start MLM training from RoBERTa checkpoint - is that ok given RoBERTa was trained with full attention?
3) Is this faster than RoBERTa for shorter sequences (< 512 tokens)?
-> see answer on twitter, I guess :)
@@YannicKilcher link please
Nice video. I would have liked a little more attention to the random connections vs no random connections benchmarks to see how much they really buy.
If the only reason they use random attention is to make sure they can route the information in a logarithmic number of steps (i.e., layers), why not just use dilated windowed attention with a dilation that doubles at each layer, like dilated convolutions in CNNs (as WaveNet used)? That seems a much more straightforward approach to achieving the same thing.
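A quick toy reachability check of the dilation idea (a sketch of the commenter's proposal, not anything from the paper; window size and sequence length are arbitrary):

```python
import numpy as np

def dilated_window_mask(n, w, dilation):
    """Boolean mask: token i attends to tokens i + j*dilation for |j| <= w."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(-w, w + 1):
            t = i + j * dilation
            if 0 <= t < n:
                mask[i, t] = True
    return mask

# count how many layers until information from any token can reach any other,
# with the dilation doubling at each layer
n, w = 64, 2
reach = np.eye(n, dtype=int)
layers = 0
while not reach.all() and layers < 12:
    m = dilated_window_mask(n, w, 2 ** layers).astype(int)
    reach = ((reach @ m) > 0).astype(int)
    layers += 1
print(layers)  # → 6, i.e. all-to-all reachability in O(log n) layers
```

So the deterministic dilation pattern does give the logarithmic routing the random links are invoked for, which is the commenter's point.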
Dear Yannic,
New subscriber here. I was totally shocked when you explained that each 512 token sequence is a separate and independent document. In all the times I have read about BERT or looked at tutorials, that point has never been made clear and driven home as you just did. I thought, ‘Well, I’ll just put them all back together again at the end’.
I’m still not really sure what all of this means. I would guess that the embeddings are all in one space no matter how many documents there were. This would enable BERT to recognize a new combination of (word 3 + word 4) when it had previously seen them separately in two different documents. But what about making sense of a single large document, or an entire corpus of them? Does BERT experience ‘forgetting’, be it catastrophic or not? How effective, or efficient, or even sensible, is an aggregation of 20, 200, or 2000 512 token sequences going to be? If a lengthy article starts talking about x, and then ten pages later shows how x impacts the interpretation of result y, how is an aggregation going to know what that is?
And while I have seen tutorials talk about truncating longer token sequences in one fashion or another, you are also the first one to actually explain why we have the 512 limit in the first place.
Finally, I would ask that you do a single video with a head to head comparison of Longformer vs. BigBird, and any other long form transformer I haven’t heard of yet. Thanks so much.
I skimmed through the paper and couldn't find the part where they state that the random attention pattern is the same from layer to layer... Are you sure the layers didn't have different patterns for the same batch? Mind pointing to the exact location in the paper where you got this idea? (Sorry for nit-picking, but this part seems important.)
This video is s0o0o0o0 excellent.
I think the reformer idea is more promising. Somehow clustering the embeddings makes more sense to me than just random combinations or larger window sizes.
They are a bit half-hearted about their random tokens. You'd think ITC would reliably outperform ETC if it was important.
Good analogy with convolutions. Couldn't the random/window/global mechanisms be used to implement approximations of simple dense NN layers (and extend convolution layers with more information)? In computer vision, this could allow unifying convolutional and dense layers (and using more dense-analogue layers). I wonder what the performance would be?
True, maybe worth a try
At 27:38 in the left figure, why are the keys and queries assigned blocks? Shouldn't each of them be assigned cells, which we then group together to save computation time? But if they do need to be assigned blocks, then what are the individual cells?
I think this might just be sloppy notation / visualization for making a point.
@@YannicKilcher Got it! Thanks Yannic!
Looks like we could do better than attending to nodes randomly + window + globally by learning it from the data.
How many layers are needed for the full-attention transformer to compute any function? I'd need to see how they define universality, hmm.
The “global attention” for important things that don’t change... isn’t that what we usually call “context”?
It’s just getting combined with this global message slot... like “squeeze-excitation” layers?
I think context refers to slightly different things, this is for special tokens, like CLS or SEP.
And I don't know much about squeeze-excitation to give a coherent answer here :)
I like how you think!! Membit or possible share membit at times, routing of information to be able make decisions on. 😉
I would be interested in results with a block_size small enough to fit a reasonable batch size and sequence length on state-of-the-art (lol) commodity hardware. As you mention, all of these papers say it's linear complexity, but then just use the gained performance to increase the model size in some sense. ResNet and other papers show results at smaller sizes, which leads me to the conclusion that MLM quality degrades fast when the model is under the current SOTA size. If that's true, it would be an interesting insight - in line with the performance gains from increasing GPT's size and data.
Do the techniques presented in this paper help with the issues Dmitry Lepikhin and collaborators found in their 1-trillion-parameter GShard model?
Love your videos.
They are orthogonal, since the GShard paper deals with the feed-forward layers.
This attention mechanism seems reminiscent of Echo State Networks
3:29 "...you need to have information routed FROM every token TO every token." So this sounds like a fully-connected (dense) layer. What's the difference?
in attention, the routing is dynamic
To add to Yannic's answer: the output of one attention head is calculated as

y = softmax[ (W_q X) (W_k X)^T * (1/sqrt(k)) ] (W_v X) = F( Q K^T ) V

where ^T denotes the transpose and k is the length of the latent representation (a hyperparameter of the model architecture). A fully connected layer, by contrast, is typically just

y = act_func( W x + b )

where W is a fixed, dense weight matrix. So there is a lot more going on in the attention head, and I think the most important part is that the "weight matrix" F(Q K^T) is a nonlinear function of the input, so you could say that the encoder stack learns a kind of higher-order perspective and models the distribution W(X).
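To make the contrast concrete, here is a toy single-head sketch in NumPy (dimensions are arbitrary illustrative choices; no masking or multi-head bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 4, 5           # model dim, head dim, sequence length
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, k)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# attention: the mixing matrix F(Q K^T) is computed from the input X itself
Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn_out = softmax(Q @ K.T / np.sqrt(k)) @ V

# fully connected layer: W is fixed and independent of the input
W, b = rng.standard_normal((d, k)), np.zeros(k)
dense_out = np.tanh(X @ W + b)
```

Both produce an (n, k) output, but only the attention output routes information between sequence positions with input-dependent weights.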
You did not say what a block really is :(
It's block-wise matrix computation.
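To illustrate what "block-wise" means here, a generic blocked matrix multiply (a sketch of the general technique, not BigBird's actual kernels):

```python
import numpy as np

def blocked_matmul(A, B, bs):
    """Multiply A @ B by iterating over bs-by-bs tiles; same result,
    but the work is organized in contiguous blocks (GPU-friendly)."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(0, n, bs):
        for j in range(0, p, bs):
            for kk in range(0, m, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, kk:kk+bs] @ B[kk:kk+bs, j:j+bs]
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
assert np.allclose(blocked_matmul(A, B, 2), A @ B)
```

In BigBird's case, sparsity is then defined at the granularity of these blocks (whole tiles of the attention matrix are kept or dropped) rather than per token pair, which is what makes the sparse pattern fast on accelerators.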
So, as an engineering hack to speed it up, they let the number of random tokens be a fraction of the input length... making it O(n^2) again.
Is the code for BigBird available?
doesn't seem like it, I'd keep up to date here: github.com/huggingface/transformers/issues/6113
This looks the same as Longformer?
It adds random attention and has a theoretical analysis.
Of course Yannic posted a new video on the newest interesting paper!
What is the difference between Google Research, USA and Google Research, Mountain View, CA, USA?
Good question. No idea why they split the authors. Makes more sense internally to Google than it does to the rest of the world.
@@snippletrap I mean, if it's Google USA plus Google Switzerland, then that is a different story.
This all reminds me of mixing genes via sexual reproduction, and the reason it's sometimes superior to asexual reproduction.
Here is the correspondence:
generations - layers
specimens - nodes
mating - attending
genes - vector coordinates
In real life, specimens tend not to mate entirely randomly; the best tend to mate with the best, and a specimen with poor genetics should not expect to find a good mate.
That's part of the reason why good genes agglomerate with good genes while bad ones are brushed out of the population.
So you could attend better than at random - otherwise we would have mated randomly.
I would like to know how the probability of node interaction (genes appearing in one body) depends on the network graph (ancestry) and its construction procedure, and how graph theory comes into these scenarios.
see also: en.wikipedia.org/wiki/Vicar_of_Bray_(scientific_hypothesis)
Please use a microphone pop filter.