Rethinking Attention with Performers (Paper Explained)

  • Published: 30 Nov 2024

Comments • 97

  • @imgtsmnlom
    @imgtsmnlom 4 years ago +40

    I find your level of detail absolutely spot on! None of my profs ever felt so well rehearsed while at the same time going just deep enough so that the audience (well, me) has a chance to actually follow along in real time. Big ups!

  • @SunilMeena-do7xn
    @SunilMeena-do7xn 4 years ago +54

    Please continue your classical paper series. That's very helpful for beginners like me.

  • @andrehoffmann2018
    @andrehoffmann2018 4 years ago +77

    Yannic made almost no sassy remarks. This paper must be huge.

  • @toom2141
    @toom2141 4 years ago +29

    I just recently discovered your channel, but I love your videos so much, Yannic.
    It is so fantastic having real experts posting stuff on youtube. Your channel is without any doubt Triple-A Plus 👍
    Thank you so much for putting all that energy into your videos.

    • @klammer75
      @klammer75 4 years ago +4

      Agreed 100%...keep up the good work😎🎓

    • @martinpflaum882
      @martinpflaum882 4 years ago +3

      Yes, they are really great - always really high quality. I just miss the days when he published one every day :D

    • @jonatan01i
      @jonatan01i 4 years ago +3

      @@martinpflaum882 I couldn't keep up back then.

  • @anthonyrepetto3474
    @anthonyrepetto3474 4 years ago +11

    Thank you for walking through the key concepts and confusions :) I got chills, thinking about how much of an accelerant this is, for rolling-out massive attention. Every time a researcher says "oh, look a lotto ticket!" we casually assume that the efficiencies will make it easier for lower-tier compute to compete... while Microsoft leans over and says "I'll give you *two* billion, kid..."
    Also, at 23:54 -> you draw two sad skulls looking out the window of a bus at night, with a third skull at the back of the bus, asleep.

  • @alessiaventani9504
    @alessiaventani9504 4 years ago +1

    Hi! I have to study this paper for my final project at university! Without your video I wouldn't have understood all these details! Thank you!

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +18

    I still watch each of your videos. There is no one on RUclips who goes as deep into the papers as you do. I also like the way you present your take on the paper. #fanboy :D

  • @allessandroable
    @allessandroable 4 years ago +2

    You explain difficult things in a really enjoyable and easy way! Thanks for your work

  • @katiefaery
    @katiefaery 4 years ago

    Great video. I read the paper a few days ago but it's nice to have someone talk you through it as well. Nice clear explanations. Thanks 😊👍

  • @konghong3885
    @konghong3885 4 years ago +34

    TL;DR:
    49:07 -- of course they beat everything

    • @Neural_Causality
      @Neural_Causality 4 years ago +1

      Whether it is going to be the next thing that everyone uses, we don't know, but it seems fairly possible.

    • @norik1616
      @norik1616 4 years ago +2

      Google labs at this point have to release a slightly better transformer each time - the first 48 mins are what I came for. If not this one, then hopefully there will eventually be a true linear attention (or something stronger, more general, and also linear in sequence length). And that will be a great deal for all of us "gaming hardware" DL engineers.

  • @Jason-zb8xq
    @Jason-zb8xq 4 years ago +1

    Very well presented! Definitely worth watching more of your videos to learn your presentation skills :)

  • @NilabhraRoyChowdhury
    @NilabhraRoyChowdhury 4 years ago

    The most interesting paper that has come out so far in 2020 IMO. Thanks for the detailed video!

  • @wentianbao5368
    @wentianbao5368 4 years ago +1

    Straightforward explanation. Pretty cool.

  • @JI77469
    @JI77469 3 years ago +1

    Is it correct to think that Random Fourier Features is "the" modern breakthrough that's preventing kernel methods from being banished into relative obscurity (except for niche applications or when you have a small data set)?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Yes, one of the last few things that keeps kernels going.
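
      For reference, a minimal numpy sketch of the classical random Fourier feature idea (Rahimi & Recht) behind this exchange: a random feature map phi whose dot products approximate a Gaussian kernel. Sizes and names here are illustrative only, and the Performer itself uses positive rather than trigonometric features.

          import numpy as np

          rng = np.random.default_rng(0)
          d, m = 16, 4096                        # input dimension, number of random features

          W = rng.normal(size=(m, d))            # rows drawn from N(0, I_d)
          b = rng.uniform(0, 2 * np.pi, size=m)  # random phases

          def phi(x):
              # random feature map: phi(x) @ phi(y) approximates exp(-||x - y||^2 / 2)
              return np.sqrt(2.0 / m) * np.cos(W @ x + b)

          x = 0.2 * rng.normal(size=d)           # nearby points, so the kernel value is not tiny
          y = 0.2 * rng.normal(size=d)
          exact = np.exp(-np.sum((x - y) ** 2) / 2)   # true Gaussian (RBF) kernel value
          approx = phi(x) @ phi(y)                    # Monte Carlo estimate from random features
          print(exact, approx)                        # the two values agree closely for large m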

  • @clee5653
    @clee5653 4 years ago

    My head exploded. Thanks Yannic, no way I can understand this paper without your awesome explanation.

  • @iuhh
    @iuhh 4 years ago +1

    The Street Talk episode on kernels with Alex Stenlake: ruclips.net/video/y_RjsDHl5Y4/видео.html (mentioned at 12:43)

  • @ilyasaroui7745
    @ilyasaroui7745 4 years ago +2

    Hello Yannic, thanks for the great video. Can you please share with us which software you use to record your screen and to edit the PDFs?

  • @weikailin4342
    @weikailin4342 3 years ago

    However, when we use the Transformer, we find that the MLP computation is the bottleneck too, because the latent size d is very big and the sequence length N is not that big. I wonder, is there an article rethinking the MLP layer?

  • @herp_derpingson
    @herp_derpingson 4 years ago +2

    I still think that there is some kind of No Free Lunch effect going on when it comes to attention. Sometimes you just need a large amount of compute. Regardless, this is the best tradeoff I have seen so far.

  • @felipemello1151
    @felipemello1151 4 years ago +4

    I was very, very excited about it, but then I saw this paper comparing the Performer vs. other attention mechanisms: openreview.net/pdf?id=qVyeW-grC2k
    It seems that the Performer attention doesn't do as well as other attentions when there is some hierarchical structure (check the ListOps results). There are some interesting comments here: github.com/lucidrains/performer-pytorch/issues/2

  • @francescocariaggi1145
    @francescocariaggi1145 3 years ago

    What's the purpose of the function "g" at 15:55? It looks like they introduce it, but then they don't include it in the definition of phi(x)

  • @thomasevers1938
    @thomasevers1938 6 months ago

    Hello Yannic, you say "approximate the attention matrix", which implies there is some ground-truth attention matrix. Does this mean these methods are only applied at inference? Meaning to say, are the models still trained on the actual softmax attention, and then an approximation is made during inference?
    If not, and this is actually used during training, meaning the model is trained to work as best it can with this approximation of the softmax, why do we still talk about unbiasedness towards the actual attention matrix? We basically came up with a new type of model, so why compare it to the softmax version? Just because we know that works? Why do we desire our model to approximate the original transformer? Why can it not be its own thing?
    Thank you in advance :)

  • @NielsRogge
    @NielsRogge 4 years ago +6

    Great video! I think there's a typo in the description of the video, should be Performer rather than Reformer

  • @asilserhan685
    @asilserhan685 4 years ago +1

    So can we train a model with GPT-3 performance and the same input sequence length faster using these, or does this only allow us to have longer input sequences?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Technically yes, but whether it reaches GPT-3 is not clear.

  • @Myblogband
    @Myblogband 11 months ago

    This isn't mathematics, this is grunt work!

  • @pravinnagar6111
    @pravinnagar6111 2 years ago

    I am adapting this work for very long videos (egocentric lifelogging videos). However, I am stuck on equation 5. It would be a great help if you could provide a proof of / resources for equation 5.
    I also read the related work titled 'Orthogonal Random Features.' In that work, I follow the third equation. That equation seems to be a special case of equation 5. However, I still don't understand how h(x) is introduced in equation 5.

  • @chaoqiyang5219
    @chaoqiyang5219 3 years ago +1

    excellent video! Thanks, Yannic!

  • @shivamraisharma1474
    @shivamraisharma1474 4 years ago +2

    Can you do a code-along video for some neural rendering repo on Colab?

  • @sarathks9911
    @sarathks9911 3 years ago

    Thanks for your neat explanation.
    I am curious to know how effective the Performer-based transformer is on different NPUs. Are there any limitations?

  • @YashBhalgat
    @YashBhalgat 4 years ago

    Is there any mention of the actual on-target (e.g. TPU) latency comparisons between conventional Transformer and Performers? (I don't see it in the paper, unless I am missing something)

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      there is not, as far as I can tell

  • @andres_pq
    @andres_pq 3 years ago

    Beats me why I have not heard of a SOTA model with Performers.

  • @G12GilbertProduction
    @G12GilbertProduction 4 years ago +1

    Laplacian differentials in the multi-layer 225-bytes kernel isn't really interpolate themselves in the distraction progress, it could be generating more costable errors in R²d (upper) and R²d (lower) maximal interpolation rate counting by unlinear / meta-linear differentation, if we comfortably using only one of kernelization estimating network in one layer by product.

  • @hannesstark5024
    @hannesstark5024 4 years ago

    37:00: PRF are infinitely better than the trigonometric approximation: Why are the ratios between the MSEs going down to 0 and not just 1 for length differences close to 0? Does that not mean that in that area the PRF is infinitely worse than the trigonometric approximation?

  • @faizanahemad
    @faizanahemad 4 years ago

    At 8:38, is doing Q.(K^T.V) instead of (Q.K^T).V the same as in the "Transformers are RNNs" paper?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Good connection, I don't know exactly.
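
      The associativity in the question can be checked in a few lines of numpy. This is only the shape/complexity argument with the softmax and normalization left out, not the Performer itself; the Performer applies this trick after replacing Q and K with random feature maps.

          import numpy as np

          rng = np.random.default_rng(0)
          L, d = 1024, 64                  # sequence length, head dimension
          Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))

          out_quadratic = (Q @ K.T) @ V    # materialises an L x L matrix: O(L^2 * d)
          out_linear = Q @ (K.T @ V)       # same result by associativity:  O(L * d^2)

          print(np.allclose(out_quadratic, out_linear))   # True, up to floating-point error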

  • @RohitKumarSingh25
    @RohitKumarSingh25 4 years ago +1

    I think there are many technical blog sites that also wait for your videos. Once you explain a paper here, they just summarise the video there. 😅

    • @matthewtang1489
      @matthewtang1489 4 years ago

      I totally agree with this (haha). Many Chinese tech blogs I follow post writeups of the videos he makes.

    • @shairuno
      @shairuno 4 years ago

      Matthew Tang these people who summarize the vid have no shame.

  • @Shivam98i
    @Shivam98i 4 years ago

    Great video! Covers every aspect of it... I have one doubt though: how do you perform masking in the bidirectional case?
    Will it be the same as in the transformer?
    In the transformer, QK^T was masked and then the softmax was applied, but how do you do it here?

  • @yi-hsuanyang1518
    @yi-hsuanyang1518 4 years ago +1

    Very nice video. Many thanks!

  • @pi5549
    @pi5549 1 year ago

    7:00 It's nTokens * nTokens (or MAX_TOKENS * MAX_TOKENS if you're batch-training and using padding) not L*L

    • @pi5549
      @pi5549 1 year ago

      And wait, what -- the values aren't coming from layer L+1. They're coming from layer L, the same as Q and K. The inputs to layer L are matmul'd by W_Q and W_K and softmaxed, which generates the attention matrix, which is then applied to V (= matmul(inputs, W_V)).
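
      In code, the point reads roughly like this minimal single-head sketch (variable names and sizes are illustrative, not from the video):

          import numpy as np

          def softmax(z):
              z = z - z.max(axis=-1, keepdims=True)
              e = np.exp(z)
              return e / e.sum(axis=-1, keepdims=True)

          rng = np.random.default_rng(0)
          n_tokens, d_model, d_head = 128, 256, 64
          X = rng.normal(size=(n_tokens, d_model))   # inputs to this layer

          W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
          Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # all three come from the same inputs X

          A = softmax(Q @ K.T / np.sqrt(d_head))     # attention matrix: n_tokens x n_tokens
          out = A @ V                                # n_tokens x d_head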

  • @Zenol42
    @Zenol42 4 years ago +1

    I want causal performers in pytorch!!! 😍

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 3 years ago

    This DALL-E + CLIP Colab uses sigmoid and softmax. I thought that was modern...

  • @hyunsunggo855
    @hyunsunggo855 4 years ago

    What a great paper. This is the kind that will change the future.

  • @ivanvoid4910
    @ivanvoid4910 3 years ago

    Oh man this was cooler than Marvel, thank you!

  • @hannesstark5024
    @hannesstark5024 4 years ago +7

    You say "and of course they beat everything". What is your opinion of that after looking at the "long-range arena": openreview.net/forum?id=qVyeW-grC2k which compares many different efficient transformer ideas including the Performer?

    • @clee5653
      @clee5653 4 years ago

      Well, obviously it's from Google.

    • @DavenH
      @DavenH 4 years ago

      Thanks for the paper link -- interesting results.
      Cliffs: Performer is on the (current) Pareto-optimal curve with a great performance/accuracy tradeoff.
      Big Bird is also on the PO curve and slightly outdoes the vanilla Transformer's accuracy with less memory but similar (bad) performance.
      Reformer and Local Attention suck.
      Linformer and Linear Transformer are similar, but slightly dominated by Performer.

    • @hannesstark5024
      @hannesstark5024 4 years ago

      @@DavenH what does pareto-optimal curve mean? I only heard about pareto optimality from game theory. And why do you say Performer slightly dominates Linformer and Linear Transformer and BigBird has bad performance even though the Performer performs very much worse than the other models on, for instance, the list ops?

    • @DavenH
      @DavenH 4 years ago +2

      @@hannesstark5024 It's a term used in economics too. It means the curve on a multivariate function that expresses the best trade-offs possible. I'm using the term a bit flexibly because these are merely best / SOTA results, rather than known-optimal results.
      An example could be a measurement device that determines the momentum and position of a particle to the joint accuracy limit prescribed by the Planck constant -- you can make a tradeoff on either measurement, and so long as the product of errors of those quantities is the Planck constant, it will fall on the Pareto-optimal curve of greatest possible accuracy. In contrast, if you had a measurement device whose product of errors in each measurement was greater than the Planck constant, it would not be Pareto-optimal. If I haven't described it well enough, search "wiki Pareto Frontier".
      The comments about dominating Linformer and LT are from the overall results on the Long Range Arena task plotted in their Figure 3.
      You can see the Performer lies on the Pareto frontier, as do Big Bird and Synthesizer, meaning their particular combinations of accuracy and performance are not dominated.
      The Performer is better in both accuracy and performance than LF, LT, LA, Reformer, and Sinkhorn, so those models are dominated and never the right choice (overall). But they could be the right choice for a particular task.

    • @hannesstark5024
      @hannesstark5024 4 years ago +1

      @@DavenH Ah, nice thanks for the explanation and pointer! Btw, do you know if the "size of the circle" representing the memory footprint is the radius or the area of the circles?
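
      To make the dominance notion in this thread concrete, here is a tiny sketch with made-up (accuracy, speed) numbers, purely illustrative and not results from any paper: a point sits on the Pareto frontier if no other point is at least as good on both axes.

          # made-up (accuracy, speed) pairs, just to illustrate the dominance check
          models = {"a": (0.85, 5.7), "b": (0.78, 1.6), "c": (0.84, 5.5), "d": (0.86, 1.0)}

          def dominates(p, q):
              # p dominates q: no worse on either axis, and not the same point
              return p[0] >= q[0] and p[1] >= q[1] and p != q

          frontier = [name for name, p in models.items()
                      if not any(dominates(q, p) for q in models.values())]
          print(frontier)   # ['a', 'd'] -- the non-dominated points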

  • @karteekmenda7621
    @karteekmenda7621 3 years ago

    Hey Yannic, can you make a video on PRADO?

    • @karteekmenda3282
      @karteekmenda3282 3 years ago

      Hi Yannic,
      Can you please make a video on PRADO? Attaching the link to the paper (aclweb.org/anthology/D19-1506.pdf) for your reference.

  • @scottmiller2591
    @scottmiller2591 4 years ago

    Bochner never got enough love.

  • @ProfessionalTycoons
    @ProfessionalTycoons 4 years ago

    Dope, great theoretical breakthroughs

  • @justincho2043
    @justincho2043 4 years ago +1

    I think you meant to say "The Performer" instead of "The Reformer" in the video description. Thank you as always, keep up the good work!

  • @TheReferrer72
    @TheReferrer72 4 years ago

    You are a star! I was wondering how this architecture works, and I'm too lazy/dumb to read the paper.

  • @machinelearningdojo
    @machinelearningdojo 4 years ago

    This is a seriously amazing video, make sure you all get over to Yannic's SubscribeStar and cough up! It's more cost-effective than going to university I promise! www.subscribestar.com/yannickilcher

  • @shaneacton1627
    @shaneacton1627 4 years ago

    Does this mean sparse attention is dead?

  • @ksy8585
    @ksy8585 3 years ago

    Your videos are really awesome

  • @johnpope1473
    @johnpope1473 3 years ago

    13:00 - "What is Kernel?
    A kernel is a function used in SVM for helping to solve problems. They provide shortcuts to avoid complex calculations.
    The amazing thing about kernel is that we can go to higher dimensions and perform smooth calculations with the help of it
    We can go up to an infinite number of dimensions using kernels. Sometimes, we cannot have a hyperplane for certain problems. This problem arises when we go up to higher dimensions and try to form a hyperplane. A kernel helps to form the hyperplane in the higher dimension without raising the complexity." techvidvan.com/tutorials/svm-kernel-functions/
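
    As a toy example of that "shortcut" (the numbers are hypothetical, just to illustrate the kernel trick): the degree-2 polynomial kernel equals a dot product in the space of all pairwise products, computed without ever building that higher-dimensional vector.

        import numpy as np

        x = np.array([1.0, 2.0, 3.0])
        y = np.array([0.5, -1.0, 2.0])

        # kernel trick: (x . y)^2 equals a dot product in the 9-dimensional space of
        # pairwise products x_i * x_j, computed without ever leaving 3 dimensions
        k_implicit = (x @ y) ** 2

        phi = lambda v: np.outer(v, v).ravel()   # the explicit feature map, shown for comparison
        k_explicit = phi(x) @ phi(y)

        print(k_implicit, k_explicit)            # both print 20.25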

  • @ross825
    @ross825 4 years ago

    Just saw this and now I’m clearing my schedule lol

  • @TIENTI0000
    @TIENTI0000 4 months ago

    thank you!

  • @krooqs
    @krooqs 4 years ago

    What tablet do you use?

    • @krooqs
      @krooqs 4 years ago +1

      Found the answer: it's a Surface, according to an older video.

  • @444haluk
    @444haluk 3 years ago

    Of course orthogonal w's are better; random w's will put your original vector into a latent space in the new high-dimensional space. That is 40-year-old knowledge.

  • @shivani404sheth4
    @shivani404sheth4 3 years ago

    'what is this paper doing? it's exactly doing what I said was impossible' xD

  • @Mnnvint
    @Mnnvint 3 years ago

    "Believe it or not, you young kids" - don't make me feel even older than I am, you impudent zoomer! It's just... ten years ago or so :-|
    In Andrew Ng's first machine learning course (which had only a small chapter on neural networks; at the time they didn't impress me, since they performed no better than SVMs and took ten times as long to train), I don't remember which activation function we used, but it was certainly not ReLU.

  • @charilaosmylonas5046
    @charilaosmylonas5046 4 years ago

    Great explanation - random Fourier features are becoming quite trendy lately (demonstrations on "coordinate-based MLPs": arxiv.org/abs/2006.10739). This random features idea works ridiculously well.

  • @marijnstollenga1601
    @marijnstollenga1601 4 years ago +1

    Alright, so we're back to kernel methods. I'm sure most of this has been done before.

  • @granttao7504
    @granttao7504 4 years ago

    thank you

  • @klingefjord
    @klingefjord 4 years ago

    I don't quite get what an attention matrix is at 7:50. I thought we had separate Q, K and V matrices, not one big attention matrix A.

  • @ВладимирПетров-ч8д8е
    @ВладимирПетров-ч8д8е 4 years ago +3

    Why are there no subtitles?

  • @V2Vex
    @V2Vex 4 years ago +2

    I wouldn't dare to say anything.