I find your level of detail absolutely spot on! None of my profs ever felt so well rehearsed while at the same time going just deep enough so that the audience (well, me) has a chance to actually follow along in real time. Big ups!
Please continue your classical paper series. That's very helpful for beginners like me.
yes, it is... even for PhD students 👀!
Yeeeees, I love it!
Yannic made close to no sassy remarks. This paper must be huge.
I just recently discovered your channel, but I love your videos so much, Yannic.
It is so fantastic having real experts posting stuff on YouTube. Your channel is without any doubt Triple-A Plus 👍
Thank you so much for putting all that energy into your videos.
Agreed 100%...keep up the good work😎🎓
Yes, they are really great, always really high quality. I just miss the days when he published one every day :D
@@martinpflaum882 I couldn't keep up back then.
Thank you for walking through the key concepts and confusions :) I got chills, thinking about how much of an accelerant this is for rolling out massive attention. Every time a researcher says "oh, look, a lotto ticket!" we casually assume that the efficiencies will make it easier for lower-tier compute to compete... while Microsoft leans over and says "I'll give you *two* billion, kid..."
Also, at 23:54 -> you draw two sad skulls looking out the window of a bus at night, with a third skull at the back of the bus, asleep.
Hi! I have to study this paper for my final project at university! Without your video I wouldn't have understood all these details! Thank you!
I still watch each of your videos. There is no one on YouTube who goes as deep into the papers as you do. I also like the way you present your perception of the paper. #fanboy :D
You explain difficult things in a really enjoyable and easy way! Thanks for your work
Great video. I read the paper a few days ago but it's nice to have someone talk you through it as well. Nice clear explanations. Thanks 😊👍
TL;DR:
49:07 - of course they beat everything.
Whether it is going to be the next thing that everyone uses, we don't know, but according to the paper it is fairly possible.
Google labs at this point have to release a slightly better transformer each time - the first 48 mins are what I came for. If not this, then hopefully at some point there will be a true linear attention (or something stronger, more general, and also linear in sequence length). And that will be a great deal for all of us "gaming hardware" DL engineers.
Very well presented! Definitely worth watching more of your videos to learn your presentation skills :)
The most interesting paper that has come out so far in 2020 IMO. Thanks for the detailed video!
Straightforward explanation. Pretty cool.
Is it correct to think that Random Fourier Features is "the" modern breakthrough that's preventing Kernel methods from being banished into relative obscurity (except for niche applications or when you have a small data set) ?
Yes, one of the last few things that keeps kernels going.
My head exploded. Thanks Yannic, no way I can understand this paper without your awesome explanation.
The Street Talk on Kernel with Alex Stenlake: ruclips.net/video/y_RjsDHl5Y4/видео.html (mentioned at 12:43)
Hello Yannic, thanks for the great video. Can you please share with us which software you use to record your screen and to edit the PDFs?
OneNote
However, when we use Transformers, we find that the MLP computation is the bottleneck too, because the latent size d is very big and the sequence size N is not that big. I wonder, is there an article rethinking the MLP layer?
I still think that there is some kind of No Free Lunch effect going on when it comes to attention. Sometimes you just need a large amount of compute. Regardless, this is the best tradeoff I have seen so far.
I was very very excited about it, but then I saw this paper comparing performers vs other attention mechanisms: openreview.net/pdf?id=qVyeW-grC2k
It seems that the performer attention doesn't do as well as other attentions when there is some hierarchical structure (check listOps results). There are some interesting comments here: github.com/lucidrains/performer-pytorch/issues/2
What's the purpose of the function "g" at 15:55? It looks like they introduce it, but then they don't include it in the definition of phi(x)
Hello Yannic, you say they approximate the attention matrix, which implies there is some ground-truth attention matrix. Does this mean these methods are only applied at inference? That is, are the models still trained on the actual softmax attention, with the approximation only made at inference?
If not, and this is actually used during training, meaning the model is trained to work as well as it can with this approximation of the softmax, why do we still talk about unbiasedness towards the actual attention matrix? We basically came up with a new type of model, so why compare it to the softmax version? Just because we know that works? Why do we desire that our model approximate the original transformer? Why can it not be its own thing?
Thank you in advance :)
Great video! I think there's a typo in the description of the video, should be Performer rather than Reformer
Thank you!
So can we train a model with GPT-3 performance and the same input sequence length faster using these, or does this only allow us to have longer input sequences?
Technically yes, but whether it reaches GPT-3 is not clear.
This isn't mathematics, this is grunt work!
I am adapting this work for very long videos (egocentric lifelogging videos). However, I am stuck on equation 5. It would be a great help if you could provide a proof of, or resources for, equation 5.
I also read the related work titled 'Orthogonal Random Features.' In that work I follow the third equation, which seems to be a special case of equation 5. However, I still don't understand how h(x) is introduced in equation 5.
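In case it helps anyone else stuck there: I don't have the paper open, but assuming equation 5 is the general random-feature map with the h(x) prefactor (φ(x) = (h(x)/√m)(f₁(ω₁ᵀx), …)), then for the softmax kernel the prefactor h(x) = exp(-‖x‖²/2) falls straight out of the Gaussian moment-generating function. A sketch of just that step:

```latex
% Gaussian MGF: for w ~ N(0, I_d) and any z,  E[ exp(w^T z) ] = exp(||z||^2 / 2).
% Apply it with z = x + y and rearrange:
\mathbb{E}_{w \sim \mathcal{N}(0, I_d)}\!\left[ e^{w^\top (x+y)} \right]
    = e^{\tfrac{1}{2}\|x+y\|^2}
    = e^{\tfrac{1}{2}\|x\|^2}\, e^{x^\top y}\, e^{\tfrac{1}{2}\|y\|^2}
\;\;\Longrightarrow\;\;
e^{x^\top y}
    = \mathbb{E}_{w}\!\left[
        \Big( e^{-\tfrac{1}{2}\|x\|^2} e^{w^\top x} \Big)
        \Big( e^{-\tfrac{1}{2}\|y\|^2} e^{w^\top y} \Big)
      \right].
% So the softmax (exp) kernel is an expectation of a product of two positive
% terms, one depending only on x and one only on y. The exp(-||x||^2 / 2)
% factor is the h(x) prefactor; averaging over m sampled w's and dividing by
% sqrt(m) gives the random feature map.
```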
excellent video! Thanks, Yannic!
Can you do a video on a code along for some neural rendering repo on colab?
Thanks for your neat explanation.
I am curious to know how effective a Performer-based transformer is on different NPUs. Are there any limitations?
Is there any mention of the actual on-target (e.g. TPU) latency comparisons between conventional Transformer and Performers? (I don't see it in the paper, unless I am missing something)
there is not, as far as I can tell
Beats me why I have not heard of a SOTA model using Performers.
Laplacian differentials in the multi-layer 225-byte kernel don't really interpolate themselves in the distraction progress; it could be generating more costable errors in the R²d (upper) and R²d (lower) maximal interpolation rate, counted by unlinear / meta-linear differentiation, if we comfortably use only one kernelization-estimating network in one layer per product.
37:00: PRF are infinitely better than the trigonometric approximation: Why are the ratios between the MSEs going down to 0 and not just 1 for length differences close to 0? Does that not mean that in that area the PRF is infinitely worse than the trigonometric approximation?
Good point, I don't know.
At 8:38, is doing Q·(Kᵀ·V) instead of (Q·Kᵀ)·V the same as in the "Transformers are RNNs" paper?
Good connection, I don't know exactly.
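For anyone curious, the shared idea is just associativity of matrix multiplication: once the softmax is replaced by a feature map, you can compute KᵀV first and never materialize the N×N matrix. A minimal numpy sketch of generic linear attention (the elu(x)+1 feature map here is the "Transformers are RNNs" choice, standing in for the Performer's random features):

```python
import numpy as np

N, d = 1024, 64                      # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Toy positive feature map standing in for softmax (elu(x)+1);
# the Performer uses random features instead.
phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))

Qp, Kp = phi(Q), phi(K)

# Quadratic order: (Q' K'^T) V  -> builds an N x N matrix, O(N^2 d)
A = Qp @ Kp.T                         # (N, N)
out_quadratic = (A / A.sum(-1, keepdims=True)) @ V

# Linear order: Q' (K'^T V)    -> only d x d intermediates, O(N d^2)
KV = Kp.T @ V                         # (d, d)
norm = Qp @ Kp.sum(0)                 # (N,) row-wise normalizer
out_linear = (Qp @ KV) / norm[:, None]

print(np.allclose(out_quadratic, out_linear))   # True: same result, cheaper
```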
I think there are many technical blog sites that also wait for your videos. Once you explain a paper here, they just summarise the video there. 😅
I totally agree (haha). Many Chinese tech blogs I follow post summaries of the videos he makes.
Matthew Tang these people who summarize the vid have no shame.
Great video! Covers every aspect of it... I have one doubt though: how do you perform masking in the bidirectional case?
Will it be the same as in the transformer?
In the transformer, QKᵀ was masked and then the softmax was applied, but how do you do it here?
I actually don't know exactly
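For what it's worth, the usual trick in linear-attention implementations (e.g. the "Transformers are RNNs" paper and the performer-pytorch repo) is that the bidirectional case needs no mask at all, you just sum over all keys, while the causal case replaces the mask with a running prefix sum over K'V, so position i only ever sees keys up to i. A rough numpy sketch of that causal variant, not necessarily the paper's exact FAVOR+ formulation:

```python
import numpy as np

def causal_linear_attention(Qp, Kp, V):
    """Causal attention with already-featurized Q', K' (shape N x m) and V (N x d).
    Instead of masking an N x N matrix, keep running sums of k'_j v_j^T and k'_j."""
    N, m = Qp.shape
    d = V.shape[1]
    S = np.zeros((m, d))          # running sum of outer products k'_j v_j^T
    z = np.zeros(m)               # running sum of k'_j (for normalization)
    out = np.empty((N, d))
    for i in range(N):
        S += np.outer(Kp[i], V[i])
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out
```

The bidirectional case is the same computation with the loop replaced by a single Kp.T @ V over the whole sequence.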
Very nice video. Many thanks!
7:00 It's nTokens * nTokens (or MAX_TOKENS * MAX_TOKENS if you're batch-training and using padding) not L*L
And wait, what -- the values aren't coming from layer L+1. They're coming from layer L, the same as Q and K. The inputs to layer L are matmul'd with W_Q and W_K and softmaxed, which generates the attention matrix, which is then applied to V (= matmul(inputs, W_V)).
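A bare-bones single-head sketch just to make that concrete (toy code, not from the paper or the video): Q, K and V are all linear maps of the same layer-L inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(inputs, W_Q, W_K, W_V):
    """inputs: (n_tokens, d_model); all three projections come from the same inputs."""
    Q = inputs @ W_Q                               # (n_tokens, d_k)
    K = inputs @ W_K                               # (n_tokens, d_k)
    V = inputs @ W_V                               # (n_tokens, d_v)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # (n_tokens, n_tokens) attention matrix
    return A @ V                                   # weighted sum of values
```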
I want causal performers in pytorch!!! 😍
This DALL-E + CLIP colab uses sigmoid and softmax. I thought that was modern...
What a great paper. This is the kind that will change the future.
Oh man this was cooler than Marvel, thank you!
You say "and of course they beat everything". What is your opinion of that after looking at the "long-range arena": openreview.net/forum?id=qVyeW-grC2k which compares many different efficient transformer ideas including the Performer?
Well, obviously it's from Google.
Thanks for the paper link -- interesting results.
Cliffs: Performer is on the (current) Pareto-optimal curve with a great performance/accuracy tradeoff.
Big Bird also on the PO curve and outdoes vanilla Transformer's accuracy slightly with less memory but similar (bad) performance.
Reformer and Local Attention suck.
Linformer and Linear Transformer are similar, but slightly dominated by Performer.
@@DavenH What does Pareto-optimal curve mean? I have only heard about Pareto optimality from game theory. And why do you say Performer slightly dominates Linformer and Linear Transformer, and that Big Bird has bad performance, even though the Performer performs very much worse than the other models on, for instance, ListOps?
@@hannesstark5024 It's a term used in economics too. It means the curve on a multivariate function that expresses the best trade-offs possible. I'm using the term a bit flexibly because these are merely best / SOTA results, rather than known-optimal results.
An example could be a measurement device that determines the momentum and position of a particle to the joint accuracy limit prescribed by the Planck constant -- you can make a tradeoff on either measurement, and so long as the product of errors of those quantities is the Planck constant, it will fall on the Pareto-optimal curve of greatest possible accuracy. In contrast, if you had a measurement device whose product of errors in each measurement was greater than the Planck constant, it would not be Pareto optimal. If I haven't described it well enough, search "wiki Pareto Frontier".
The comments about dominating Linformer and LT are from the overall results on the Long Range Arena task plotted in their Figure 3.
You can see Performer lies on the Pareto frontier, as do Big Bird and Synthesizer, meaning their particular combinations of accuracy and performance are not dominated.
Performer is better in both accuracy and performance than LF, LT, LA, Reformer, and Sinkhorn, so those models are dominated and never the right choice (overall). But they could be the right choice for a particular task.
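To make "dominated" concrete, here's a tiny sketch with made-up (accuracy, speed) numbers, purely illustrative and not the paper's actual figures, that filters out the dominated points:

```python
# Hypothetical (accuracy, steps/sec) pairs -- illustrative only, not from the paper.
models = {
    "Performer": (0.72, 5.7),
    "Big Bird":  (0.75, 1.2),
    "Reformer":  (0.69, 0.8),
    "Linformer": (0.71, 5.5),
}

def pareto_front(points):
    """A point is on the front if no other point is >= in both metrics (and > in one)."""
    front = {}
    for name, (acc, speed) in points.items():
        dominated = any(
            (a >= acc and s >= speed) and (a > acc or s > speed)
            for other, (a, s) in points.items() if other != name
        )
        if not dominated:
            front[name] = (acc, speed)
    return front

print(pareto_front(models))   # with these toy numbers: Performer and Big Bird remain
```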
@@DavenH Ah, nice, thanks for the explanation and pointer! Btw, do you know whether the "size of the circle" representing the memory footprint refers to the radius or the area of the circles?
Hey Yannic, can you make a video on PRADO?
Hi Yannic,
Can you please make a video on PRADO? Attaching the link to the paper (aclweb.org/anthology/D19-1506.pdf) for your reference.
Bochner never got enough love.
Dope, great theoretical breakthroughs
I think you meant to say "The Performer" instead of "The Reformer" in the video description. Thank you as always, keep up the good work!
You are a star! I was wondering how this architecture works, and I'm too lazy/dumb to read the paper.
This is a seriously amazing video, make sure you all get over to Yannic's SubscribeStar and cough up! It's more cost-effective than going to university I promise! www.subscribestar.com/yannickilcher
Does this mean sparse attention is dead?
Your videos are really awesome
13:00 - "What is Kernel?
A kernel is a function used in SVM for helping to solve problems. They provide shortcuts to avoid complex calculations.
The amazing thing about kernel is that we can go to higher dimensions and perform smooth calculations with the help of it
We can go up to an infinite number of dimensions using kernels. Sometimes, we cannot have a hyperplane for certain problems. This problem arises when we go up to higher dimensions and try to form a hyperplane. A kernel helps to form the hyperplane in the higher dimension without raising the complexity." techvidvan.com/tutorials/svm-kernel-functions/
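That quote is a bit hand-wavy, so here is a standard toy example (not from the linked tutorial) of what "going to higher dimensions without raising the complexity" means: for the degree-2 polynomial kernel, evaluating k(x, y) = (xᵀy)² in the original space gives exactly the inner product of an explicit higher-dimensional feature map, without ever constructing it. For the Gaussian/RBF kernel the corresponding feature space is infinite-dimensional, which is the "infinite number of dimensions" part.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Explicit feature map for the degree-2 polynomial kernel in 2D:
#   phi(v) = (v1^2, sqrt(2) * v1 * v2, v2^2)  -> 3 dimensions
phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

explicit = phi(x) @ phi(y)      # inner product in the lifted space
kernel   = (x @ y) ** 2         # kernel evaluation in the original space

print(explicit, kernel)         # both 16.0 -- same number, no lifting needed
```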
Just saw this and now I’m clearing my schedule lol
thank you!
What tablet do you use?
Found the answer: it's a Surface, according to an older video.
Of course orthogonal w's are better; random w's will put your original vector into a latent space within the new high-dimensional space. That is 40-year-old knowledge.
'what is this paper doing? it's exactly doing what I said was impossible' xD
"Believe it or not, you young kids" - don't make me feel even older than I am, you impudent zoomer! It's just... ten years ago or so :-|
In Andrew Ng's first machine learning course (which had only a small chapter on neural networks; at the time they didn't impress me, since they performed no better than SVMs and took ten times as long to train), I don't remember which activation function we used, but it was certainly not ReLU.
Great explanation - random Fourier features are becoming quite trendy lately (demonstrations are on "coordinate-based MLPs"): arxiv.org/abs/2006.10739. This random features idea works ridiculously well.
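Agreed, and the original Rahimi & Recht recipe is only a few lines. A quick sketch approximating the Gaussian/RBF kernel with random Fourier features (the classic construction, nothing Performer-specific; names and parameters here are just my own):

```python
import numpy as np

def rff_features(X, n_features=1024, sigma=1.0, seed=0):
    """Random Fourier features z(x) with E[z(x) . z(y)] ~= exp(-||x-y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))   # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
Z = rff_features(X)

exact  = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)   # true RBF kernel, sigma=1
approx = Z @ Z.T                                                  # its random-feature estimate
print(np.abs(exact - approx).max())                               # small, shrinks with n_features
```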
Alright, so we're back to kernel methods. I'm sure most of this has been done before.
thank you
I don't quite get what an attention matrix is at 7:50. I thought we had a separate Q, K and V matrix, not one big attention matrix A
A is the product of Q and Kᵀ (followed by the softmax).
Why are there no subtitles?
How would I dare say anything?