How do Vision Transformers work? - Paper explained | multi-head self-attention & convolutions

  • Published: 1 Oct 2024

Comments • 62

  • @AICoffeeBreak
    @AICoffeeBreak  2 years ago +19

    Okay bye!

  • @MrMIB983
    @MrMIB983 2 years ago +9

    Love your channel, best ml videos, you are so kind

  • @AD173
    @AD173 2 years ago +6

    Why all this emphasis on local minima at the end? You are highly likely to reach them anyhow.
    The bigger issue in higher-dimensional space is saddle points; it has been shown that momentum and the like can help avoid those. The Dauphin paper that is referenced in this paper is about saddle points, not local minima. Heck, its title is "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization".
    I mean, it's kind of a myth that you are aiming for global minima in deep learning. The point behind using deep learning is that it has been shown empirically, by Yann LeCun and others, that NNs with a higher number of neurons tend to have more "good" local minima (that is, local minima whose losses are not much higher than the global one). See, for instance, "The Loss Surfaces of Multilayer Networks" by Choromanska et al.
    I would love to be shown a paper that definitively proves or demonstrates that any method, momentum or not, can "skip" over a local minimum to a better one or somehow find the global minimum. Several papers do contain speculative sentences implying, or straight up claiming, that, but I have never seen a paper actually prove it mathematically or empirically. Do you know something I don't?
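
    A minimal toy sketch of the saddle-point claim above (my own example, not from the video or the paper): on f(x, y) = x^2 - y^2, plain gradient descent crawls away from the saddle at the origin, while the same update with momentum builds up velocity along the negative-curvature direction and escapes much faster.

    import numpy as np

    def grad(p):
        # gradient of the toy saddle f(x, y) = x^2 - y^2
        x, y = p
        return np.array([2 * x, -2 * y])

    def descend(momentum, steps=100, lr=0.05):
        p = np.array([1.0, 1e-6])   # start almost exactly on the saddle ridge
        v = np.zeros(2)
        for _ in range(steps):
            v = momentum * v - lr * grad(p)
            p = p + v
        return p

    print("plain GD   :", descend(momentum=0.0))  # y has barely moved off the saddle
    print("momentum GD:", descend(momentum=0.9))  # y is far down the escape direction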

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      I was simplistically referring to bad local minima. Better: bad optima.
      Even better: your comment.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Edit: found the quiz question about this! ruclips.net/user/postUgysXVJwPznKibv1LR14AaABCQ
      Original: I remember reading about this problem that saddle points are more common than local minima in high dimensions in the "Deep Learning" book by Goodfellow et al. 2016. I could swear we also had a quiz question on this but now I cannot find it and I assume I only dreamed about it. 😬

    • @AD173
      @AD173 2 years ago +3

      @@AICoffeeBreak Hehe, thank you, but I kinda hate my comment because I feel a bit pedantic.
      Can I ask you for a favor...?
      Look, the internet is full of bad information regarding our field, and there are tons of people covering the research done by the giants. I get that their results probably drive traffic to your channel, but it's already being covered by everyone.
      The weakness of a lot of those resources is that they get a lot of the basics wrong; results like the ones I mentioned in my previous comment get ignored. You'll hardly find an "AI expert" here on RUclips talking about how finding good local minima is the reason we create deeper and deeper NN architectures.
      How about covering the more rigorous side of deep learning? I mean, probably not something super-theoretical like the well-posedness of inverse Bayesian problems, but just providing deeper insights into deep learning? That probably requires going through the papers cited by the giants rather than the giants' papers themselves. No one seems to do that, which leads to a lot of misconceptions in the community (which is full of people who hate mathematics and thus do not read papers but instead learn through simplified intros on RUclips, Towards Data Science and the like).
      It would be good to have a resource focusing on rectifying misconceptions within the community.

    • @AD173
      @AD173 2 years ago +3

      @@AICoffeeBreak Yeah, the problem isn't just that they exist, but also that they are hard to overcome. Essentially, the situation is "my loss has stopped improving... Am I at a local minimum now? Or at a saddle point? How can I tell? I mean, the gradient is practically zero in all directions." That's a huge part of the reason behind the creation, and success, of momentum.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      I knew it: Here is the question! Buried so deep that YT did not first show it to me. ruclips.net/channel/UCobqgqE4i5Kf7wrxRxhToQAcommunity?lb=UgysXVJwPznKibv1LR14AaABCQ

  • @bradleypliam110
    @bradleypliam110 2 years ago +3

    Ms. Coffee Bean isn't animating herself... yet 😉

  • @MachineLearningStreetTalk
    @MachineLearningStreetTalk 2 years ago +2

    Amazing video! 😎

  • @ronen300
    @ronen300 1 year ago +3

    I wonder if the high-frequency details would remain in MSA if they did not patchify the image into (16 by 16) or (8 by 8) patches and instead used each pixel individually...
    Because it seems to me that the high-frequency robustness of MSA could be related to this process
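
    A rough back-of-the-envelope sketch of why per-pixel tokens are rarely tried (my own numbers, not from the paper): self-attention cost grows with the square of the token count, so dropping the 16-by-16 patchification makes the attention matrix explode.

    H = W = 224                                  # typical ViT input resolution
    for patch in (16, 8, 1):                     # patch size 1 = one token per pixel
        tokens = (H // patch) * (W // patch)
        print(f"{patch:>2}x{patch:<2} patches -> {tokens:>6} tokens, "
              f"{tokens ** 2:,} attention entries per head")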

  • @bielmonaco
    @bielmonaco 1 year ago +3

    Where are you, Letitia? We need you to explain more papers to us! Please come back 😊

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +3

      It's been hard to find time to publish more than one video per month these days, sorry. :(

  • @_bustion_1928
    @_bustion_1928 2 years ago +3

    Very nice video and very nice paper. I've had the idea of combining convs and MSA for a long time now...

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 2 years ago +6

    From my experience, rather than going full ViT, I think a ConvNet backbone that extracts features which are then fed to MSA (à la DETR) is a stronger structure for vision, as evident in current SOTA approaches for image classification.
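
    A rough sketch of that hybrid layout (simplified, with made-up layer sizes and without positional embeddings; not the DETR implementation itself): a small conv stack extracts a feature map, which is flattened into tokens and passed through self-attention layers.

    import torch
    import torch.nn as nn

    class HybridBackbone(nn.Module):
        def __init__(self, dim=256, heads=8, num_classes=1000):
            super().__init__()
            self.conv = nn.Sequential(            # toy conv feature extractor (stride 8 overall)
                nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.msa = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, x):                            # x: (B, 3, H, W)
            feats = self.conv(x)                         # (B, dim, H/8, W/8)
            tokens = feats.flatten(2).transpose(1, 2)    # (B, H/8 * W/8, dim)
            tokens = self.msa(tokens)                    # self-attention over spatial tokens
            return self.head(tokens.mean(dim=1))         # mean-pool + classification head

    logits = HybridBackbone()(torch.randn(1, 3, 224, 224))   # -> shape (1, 1000)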

  • @DerPylz
    @DerPylz 2 years ago +7

    Thank you for summarising this paper!

  • @HoriaCristescu
    @HoriaCristescu 2 years ago +5

    Very good video, especially the mention of how the Hessian eigenvalues relate to the curvature of the loss landscape.
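
    For anyone who wants that intuition in code, here is a tiny sketch (a toy 2-parameter surface, not an actual network loss): the sign of each Hessian eigenvalue tells you whether the loss curves up or down along the corresponding eigenvector, and the magnitude tells you how sharply.

    import numpy as np

    def loss(w):
        # toy "loss landscape" with a saddle at the origin
        return 3.0 * w[0] ** 2 - 0.5 * w[1] ** 2

    def hessian(f, w, eps=1e-4):
        # central finite differences; fine for a 2x2 toy example
        n = len(w)
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                di, dj = np.eye(n)[i] * eps, np.eye(n)[j] * eps
                H[i, j] = (f(w + di + dj) - f(w + di - dj)
                           - f(w - di + dj) + f(w - di - dj)) / (4 * eps ** 2)
        return H

    eigvals = np.linalg.eigvalsh(hessian(loss, np.zeros(2)))
    print(eigvals)   # ~[-1.  6.]: one descent (saddle) direction, one sharply curved ascent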

  • @anadianBaconator
    @anadianBaconator 2 years ago +3

    Fantastic!

  • @muhammadwaseem_
    @muhammadwaseem_ 1 year ago +3

    I would like to learn more about the Hessian and the interpretation of its eigenvalues on loss landscapes. Could anyone suggest any good materials?

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +2

      The Deep Learning book by Goodfellow et al. is a good start. And it is freely available.

    • @muhammadwaseem_
      @muhammadwaseem_ 1 year ago +1

      @@AICoffeeBreak Thank you so much!

  • @PritishMishra
    @PritishMishra 2 years ago +4

    I guess I can collect all the voice samples of your videos and try to train a model to predict which kind of expression Coffee Bean will make 😆

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      That would be so cool. Let me know if you really try it out. 🤝 I want to know how well it works!

    • @PritishMishra
      @PritishMishra 2 years ago +2

      ​@@AICoffeeBreak sure, I will let you know :-)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      So excited about this! 😄

  • @wolfganggro
    @wolfganggro 2 years ago +5

    Quick question: how are you animating your videos? Sorry if this has been answered before.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      Everything but the Coffee Bean is just good old PowerPoint. 🙈
      What are your plans? :) The last time I got this question, someone wanted to start a channel himself. (Outlier: ruclips.net/video/wcqLFDXaDO8/видео.html )

    • @wolfganggro
      @wolfganggro 2 years ago +2

      @@AICoffeeBreak Hmm... ja 😅 good question. Nothing in particular; sometimes I think this could be fun, but doing it like 3Blue1Brown with Manim seems like a lot of work. The way you are doing it seems more approachable, and you still manage to create a great video and get the information across.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      @@wolfganggro You can save even more effort by working with the paper directly, see Yannic. That works well too! :)
      Manim is great for math stuff, but I am not sure how this would scale for less math-centric visualizations.

    • @wolfganggro
      @wolfganggro 2 years ago +2

      @@AICoffeeBreak Thanks for the input. I'm actually more interested in the mathy topics, but I guess I just have to give it a try at some point :-)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Exactly. Try it out. And send me a link when you do. 😶‍🌫️

  • @muhammadsalmanali1066
    @muhammadsalmanali1066 2 years ago +3

    Thank you so much for all the hard work
    Regards
    A struggling PhD student

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      Thanks, and I wish you all the best!
      A fellow struggling PhD student. 🙃

  • @edd36
    @edd36 2 years ago +3

    Thank you very much for the video!

  • @flamboyanta4993
    @flamboyanta4993 2 years ago +3

    Good luck with the Phd Letitia (and ms Coffee bean, too)!
    Stay strong!

  • @willsmithorg
    @willsmithorg 2 years ago +3

    Thank you. Interesting and well summarised as always.

  • @mr_tpk
    @mr_tpk 2 years ago +3

    Thank you for sharing ❤️

  • @LermanProductions
    @LermanProductions 2 years ago +4

    The graphics and animations here are so good. I'm looking to make deep learning videos. How do you produce these animations, like the one with the convolution kernel?? Not going to rip off your channel btw haha, just surprised to see such fancy graphics in an AI technical video

    • @PritishMishra
      @PritishMishra 2 years ago +3

      I am not promoting my channel here... but you can look at my latest video on convolutional neural networks, which I made with Manim, the package created by Grant Sanderson aka 3Blue1Brown.

    • @LermanProductions
      @LermanProductions 2 years ago +3

      @@PritishMishra Thanks

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Definitely have a look at manim if you plan to visualise maths.
      I animated that conv with PowerPoint 🙈🙈

  • @김성수-i2e1g
    @김성수-i2e1g 2 years ago +4

    Thank you very much for sharing high quality videos!
    17:42 As far as I know, the blue region of Figure 9 (left) indicates the pooling layers of ConvNets, not MSA. They also reduce the variance of feature maps by subsampling, which behaves like the spatial smoothing of MSA layers.
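
    A quick numeric sketch of that smoothing effect (my own toy example, not from the paper): average pooling, like the low-pass behaviour attributed to MSA layers, lowers the variance of a noisy feature map by averaging neighbouring activations.

    import numpy as np

    rng = np.random.default_rng(0)
    fmap = rng.normal(size=(64, 64))                        # noisy feature map, variance ~1
    pooled = fmap.reshape(32, 2, 32, 2).mean(axis=(1, 3))   # 2x2 average pooling (subsampling)

    print(f"variance before pooling: {fmap.var():.3f}")     # ~1.0
    print(f"variance after pooling : {pooled.var():.3f}")   # ~0.25, i.e. smoothed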

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Yes, correct! I remember exactly that I added these pen highlights last minute and meant to put the MSA marking on the grey parts of the right-hand side of the figure (for ViT). Somehow (the inexplicability of doing something and not looking twice) it landed on the ResNet side. 🙈

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Thanks for pointing this out! 🎯

  • @saeednuman
    @saeednuman 2 years ago +4

    Thank you for such a detailed video; I have a question regarding the loss landscapes. Recently, I read the paper "When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations," which also talks about the loss landscapes of ViTs and ResNets, but if I compare that to this paper, the conclusion is different. Can you please help me understand what I am missing here? Note: Sorry, I am reposting this because I cannot see my previous comment.

    • @namukpark
      @namukpark 2 years ago +4

      Hi, I'm the author of the paper "How Do Vision Transformers Work?" Thanks for the thoughtful feedback!
      The difference between the loss landscape visualization in the paper "When Vision Transformers Outperform ResNets..." and our empirical results is due to the following aspects: (1) They only visualized cross-entropy (NLL) landscapes, but we visualize loss (NLL + L2) landscapes on augmented datasets. Since the NN optimizes "NLL + L2" on "augmented datasets" -- not "NLL" on "clean datasets" -- we believe it is appropriate to visualize NLL + L2 on augmented datasets. (2) They used a training configuration that is significantly different from standard practice, while we use a DeiT-style configuration. Since the DeiT-style configuration is the de facto standard in ViT training, we believe our insights can be applied to a larger number of studies. (3) Other evidence: a box blur (the simplest low-pass filter) also flattens the loss (arxiv.org/abs/2105.12639); the hybrid model has a flat loss; learning trajectories and Hessian spectra, i.e. the *_set_* of Hessian eigenvalues, also lead to the same conclusion; ViT (flat loss) is robust against data perturbations; and so on.
      As pointed out in our paper, loss landscape smoothing methods can improve optimization by reducing negative Hessian eigenvalues.
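
      A small sketch of the quantity described in point (1) (assuming a PyTorch classifier; the weight-decay value here is arbitrary, for illustration only): the visualised objective is the NLL plus the L2 penalty, evaluated on augmented inputs rather than on clean ones.

      import torch
      import torch.nn.functional as F

      def visualised_loss(model, augmented_x, targets, weight_decay=1e-4):
          nll = F.cross_entropy(model(augmented_x), targets)       # NLL on augmented data
          l2 = sum((p ** 2).sum() for p in model.parameters())     # L2 regularisation term
          return nll + 0.5 * weight_decay * l2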

  • @gergerger53
    @gergerger53 2 years ago +3

    "We are talking about inference, here, or about one iteration during training" (5:18) - I know you know that people use the term inference differently between ML and statistics: in statistics it means learning something about a system based on a sample, while in ML it means putting new data through a trained model. Here, you seem to imply that inference is connected to training, which jars with all previous experience I have with this term. I'd just do what many other researchers do and avoid using the term altogether, in favour of less ambiguous / contentious words or phrasings.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Sorry, I do not use any new meaning here. I mean inference as in making a prediction by running a sample through a model (forward pass).

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      I should have just said forward pass. It encompasses both inference (the prediction step at test time) and a snapshot of one training step.

    • @gergerger53
      @gergerger53 2 years ago +3

      @@AICoffeeBreak That would definitely have been much clearer to me. I didn't mean it as a criticism (because I love your stuff); it was just a tip, and sometimes it's good to suggest these things, as it means future content is even clearer, more concise and easier to follow.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@gergerger53 constructive criticism is my favourite, thanks!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      @@gergerger53 "ML" and "good terminology" have quite a low overlap, and I try my best not to add to the confusion. 😅

  • @muhammadsalmanali1066
    @muhammadsalmanali1066 2 years ago +1

    For MSA we perform computations on the keys (K) and queries (Q). We get their values by multiplying our (data-dependent) feature values with the weights learned during training. So even during inference the data might vary, but the weight values used to get K and Q are fixed. So how are MSAs data-agnostic?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      Sorry, I am confused now. 😕 We were making this point for convolutions, not for MSA. MSAs are dynamic w.r.t. the data even at inference time. Could you maybe post the timestamp where you understood it otherwise?
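
      A tiny sketch of that distinction (toy dimensions, with fixed random matrices standing in for the learned W_Q and W_K): even with frozen weights, the attention map softmax(QK^T / sqrt(d)) changes with the input, whereas a convolution kernel applies the exact same fixed weights to every input.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 8
      W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # "learned", fixed at inference

      def attention_map(x):                       # x: (tokens, d)
          q, k = x @ W_q, x @ W_k
          scores = q @ k.T / np.sqrt(d)
          scores -= scores.max(axis=-1, keepdims=True)             # numerically stable softmax
          w = np.exp(scores)
          return w / w.sum(axis=-1, keepdims=True)

      a = attention_map(rng.normal(size=(4, d)))                   # two different inputs...
      b = attention_map(rng.normal(size=(4, d)))
      print(np.allclose(a, b))                    # False: the mixing weights depend on the data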

  • @simsonyee
    @simsonyee 1 year ago

    QK^T @ 4:43!!