Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

  • Published: 16 Sep 2024

Comments • 63

  • @TheAIEpiphany
    @TheAIEpiphany  3 years ago +13

    Transformers are ruining everything! They first ruled the NLP world and finally, they are killing it in computer vision as well.
    I make 2 predictions in this video:
    1. We can expect much bigger transformers being used in computer vision (same trend as in NLP)
    2. We can expect a smaller patch size combined with efficient transformers (Reformer, Linformer, Longformer, etc.) any time soon
    Forgot to mention 1 interesting thing. The transformer is in a way more general than CNNs and LSTMs (i.e. it has fewer inductive biases).
    It turns out that the transformer is a special case of a GNN (graph neural network), in particular GAT (well, everything is a graph haha, 0 shannons here, but still).
    Check out this blog: thegradient.pub/transformers-are-graph-neural-networks/
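
    A quick back-of-the-envelope on prediction 2, as a minimal sketch in plain Python (the 224x224 resolution is the paper's pre-training setting; the rest is just arithmetic): the token count is (H/P)*(W/P) and vanilla self-attention is quadratic in it, so halving the patch size multiplies the attention cost by roughly 16x -- hence the interest in Reformer/Linformer-style attention.

      # Why smaller patches call for efficient attention:
      # an H x W image cut into P x P patches yields (H/P)*(W/P) tokens,
      # and vanilla self-attention scales quadratically with that count.

      def vit_tokens(h: int, w: int, patch: int) -> int:
          """Number of patch tokens for an h x w image with patch x patch patches."""
          return (h // patch) * (w // patch)

      for patch in (16, 8, 4):
          n = vit_tokens(224, 224, patch)
          print(f"patch {patch:2d}x{patch:<2d} -> {n:5d} tokens, "
                f"attention matrix ~{n * n:,} entries")

      # patch 16x16 ->   196 tokens, attention matrix ~38,416 entries
      # patch  8x8  ->   784 tokens, attention matrix ~614,656 entries
      # patch  4x4  ->  3136 tokens, attention matrix ~9,834,496 entries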

    • @DavenH
      @DavenH 3 years ago

      I am so pumped to see what happens with Performers* on whole documents, books, images, movies, audio files... I'm sure multiple companies are training 1T+ parameter Performer models as we type. It's going to be a great year ahead of us.
      *linear scaling Transformers that are better than linformers/linear xf/reformer/sparse xf etc

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      @@DavenH Me too! I am also excited about many other areas of AI, especially graph neural nets and RL!
      Mostly because I'm going to dig deeper into them over the coming period! 😅
      I just researched AlphaStar a bit; it also uses transformers to beat pro players at StarCraft II. It's RL as well, but I was happy to see transformers in there too! 😂

    • @ibrahimaba8966
      @ibrahimaba8966 3 years ago

      Hello. Can we use a GAT with image patches to do the same work 🤔?

  • @masteronepiece6559
    @masteronepiece6559 3 years ago +13

    This guy will hit 500K subs by the end of 2021.
    You're the only person on YouTube who gives 100000....000% effort in his videos. I'm learning a lot from you.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      Hehe. 500k might be overkill considering this is a niche channel, but I'll give it my best shot!
      Thanks a lot for your kind words!

  • @lukkuuu6368
    @lukkuuu6368 2 years ago +8

    I read your blog post on your journey to becoming a DeepMind engineer.
    It was very very inspiring. Thank you for spending time writing that!

  • @DavenH
    @DavenH 3 years ago +7

    Excellent, love these paper breakdowns. Keep it up!

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago +4

      Thanks man! Appreciate it. If people find these kinds of breakdowns useful (which I'll know if I get enough similar feedback), I'll definitely make more of them.
      I learn a lot by doing this as well.
      As a side note, I'm noticing a huge gap in the community: on the one hand, people don't know how to start learning ML, and on the other, people need help understanding the papers. I'm trying to balance it out.
      Still not sure what the best strategy is but I'll continue covering seminal papers from different areas.

    • @DavenH
      @DavenH 3 years ago +1

      @@TheAIEpiphany My opinion is not to worry about the people just starting ML, unless that's what you want to do of course. There are many good resources for beginners...free courses, blogs, playlists etc.
      However, I find that there aren't many channels aimed at the level of actual ML practitioners like yourself. Yannic Kilcher's channel is one such beacon, and it's doing very well. I also want to say you present stuff in a very clear and digestible way, and a 30+ minute video is more than fine.
      There are so many papers coming out every day on arXiv that it's impossible to keep up, so having any help distilling them is wonderful. One symptom of this bottleneck, I notice, is that people generally just read the highlights (papers from Google Brain, FAIR, DeepMind, OpenAI). Unfortunately, by dominating the mindshare this has pulled research into areas well beyond the compute capabilities of PhD students, independent practitioners, or startup companies. I read the wav2vec 2.0 paper yesterday and got all excited to try to apply their methods, until I saw they trained for multiple days on 100 GPUs, expensive V100s at that. Google papers are even worse this way; they never train on anything less than 1000 TPUs, it seems!
      I guess they are probably the highest-quality papers too, so there's a feedback effect. But there are surely gems from all the universities that get missed; I would assume they focus less on scaling and more on theory or other insights.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      @@DavenH Extremely valuable feedback, thank you!
      I somehow tend not to go over 30 minutes for now; I'm still figuring it out.
      I agree, I even started writing down the amount of compute needed in my OneNote. 😅 And it's crazy.
      I also agree there is a lot of valuable research that will be done aside from mainstream deep learning, of that I'm certain.
      Judea Pearl's ideas, etc.

  • @taekwondosplit
    @taekwondosplit 3 years ago +2

    Excellent explanation! Thank you.

  • @irinelmanolache601
    @irinelmanolache601 2 years ago +1

    Thanks man, I really enjoy watching your explanations!

  •  2 years ago +1

    Great work explaining the paper Aleksa

  • @tongluo9860
    @tongluo9860 a year ago

    Thank you for this great video, it explains ViT very well. It took me a lot of time to understand the Transformer and BERT series. Your video makes the vision part much easier to understand.

  • @amirzarei4558
    @amirzarei4558 2 years ago

    Thanks a lot for your great explanation of how the Vision Transformer works.

  • @NasheedYasin08
    @NasheedYasin08 3 years ago +1

    Informative as anything. Definitely 25 min well spent.

  • @Raulvic
    @Raulvic 3 years ago +1

    Thx for the video!

  • @lukasugar94
    @lukasugar94 2 years ago +1

    Very nice!

  • @SpesMagisteriiGradus
    @SpesMagisteriiGradus 10 months ago +1

    thank you so much

  • @SH94
    @SH94 3 years ago +1

    Keep up the good work bro!

  • @kassem6436
    @kassem6436 3 years ago +1

    Great work... keep going!

  • @НиколайНовичков-е1э

    Great work! Thank you!

  • @clapathy
    @clapathy 2 years ago

    Nice job! Thank you very much!

  • @meidachen8489
    @meidachen8489 a year ago

    Nice work! Thank you! (Really nice prediction that big tech would have some large Transformer coming; now "Segment Anything" is here 😁)

  • @DavenH
    @DavenH 3 years ago +1

    The discussion in this video about the inductive bias of resnets vs the unbiased Transformers got me thinking. Right now I'm doing a fun network architecture search (NAS) project, and it evolves architectures and tests each one on small amounts of data to compare against other architectures. This is somewhat similar to how Transformers learn dynamic routing, whereas the genetic search algorithm in the NAS-space is "learning" this routing by discrete methods.
    So if the comparison is valid -- the learned inductive bias of Transformers vs. the searched inductive bias of a NAS -- I wonder which of these methods is more compute-efficient overall? NAS is slow indeed, as it churns through many small yet specialized architectures. But Transformers are perhaps equally slow, as they are searching over a similar space.
    I suspect that, given constant compute resources -- a Transformer-based arch vs. a NAS search over smaller nets -- the Transformer still comes out ahead, as its search method is at least exploiting a gradient, while evolutionary strategies scale poorly with parameter count.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      Nice connection! Interesting way to look at it!
      To me it looks like NAS is probably more flexible, but it's also more compute-intensive.
      I haven't gotten to play with NAS so far, so I can't get into any serious discussion without taking (at least) a day to investigate it a bit more.
      Nice thoughts, keep them coming!
      Have you played with GNNs? They are even more general than transformers, e.g. GAT.

  • @shandi1241
    @shandi1241 3 years ago +1

    Thanks man, it was really helpful!

  • @ameynaik2743
    @ameynaik2743 2 years ago +1

    Thanks for the detailed overview. What exactly is the class embedding? Why is it required?
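
    For context from the paper (a minimal PyTorch-style sketch, not the official code; embed_dim=768 matches ViT-Base): the class embedding is a single learnable vector, prepended to the patch tokens much like BERT's [CLS] token, and the encoder's output at that position is what the classification head reads. Without it, you would need some other way to pool the patch outputs into one image representation.

      import torch
      import torch.nn as nn

      class ClassToken(nn.Module):
          """Prepend a learnable [class] token to a sequence of patch embeddings."""
          def __init__(self, embed_dim: int = 768):
              super().__init__()
              # One learnable vector, shared across all images.
              self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

          def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
              # patch_tokens: (batch, num_patches, embed_dim)
              cls = self.cls_token.expand(patch_tokens.shape[0], -1, -1)
              return torch.cat([cls, patch_tokens], dim=1)  # (batch, 1 + num_patches, embed_dim)

      # After the encoder, hidden_state[:, 0] (the [class] token) goes to the MLP
      # classification head, analogous to BERT's [CLS] token.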

  • @present-bk2dh
    @present-bk2dh 26 days ago

    crazy to see that this was just 3 years ago

  • @aleksabisercic1410
    @aleksabisercic1410 3 years ago +1

    Love it !

  • @Deshwal.mahesh
    @Deshwal.mahesh 3 years ago +1

    The flattened patch is not 14*14 only. To flatten it, you have to take the channels into consideration too, so 14*14*3. Please correct me if I am wrong.
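
    A quick dimension check on the point above, as a sketch assuming the paper's standard setup (224x224 RGB input, 16x16 patches): the channels are indeed flattened into each patch vector, so every patch becomes a 16*16*3 = 768-dimensional vector, while 14*14 = 196 is the number of patches on the grid.

      import torch

      H = W = 224   # pre-training resolution used in the ViT paper
      P = 16        # patch size (ViT-B/16)
      C = 3         # RGB channels

      img = torch.randn(1, C, H, W)

      # Cut the image into non-overlapping P x P patches and flatten each one.
      patches = img.unfold(2, P, P).unfold(3, P, P)                        # (1, C, 14, 14, P, P)
      patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, C * P * P)

      print(patches.shape)  # torch.Size([1, 196, 768]) -> 14*14 patches, each 16*16*3 values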

  • @nire-hj9pe
    @nire-hj9pe 2 years ago

    super cool

  • @XX-vu5jo
    @XX-vu5jo 3 years ago +1

    I love this paper. But I hate that I cannot train shit with this LOL

  • @mickeymilo2753
    @mickeymilo2753 3 years ago

    Yes! NEW CLIP!

  • @sathishkumarthirumalai3722
    @sathishkumarthirumalai3722 a year ago

    Are the position encodings learnt in the Vision Transformer? In the "Attention is all you need" transformer, positions are not learnt.

  • @samirelzein1095
    @samirelzein1095 2 years ago

    make more of these :)

  • @JohnSmith-ut5th
    @JohnSmith-ut5th 2 years ago

    There are so many applications for AI/ML. I'm curious, why am I not seeing this being applied? So many people and companies are claiming to do "AI/ML" but I'm not seeing commercial applications.

  • @lifted1785
    @lifted1785 2 years ago

    bro this shit was helpful as fuck, u just helped me do my fucking capstone for MIT. good looking out homie! I'm subbing

  • @parker1981xxx
    @parker1981xxx 3 years ago

    What happens if the positional embeddings are not trainable (so they are constant)?

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      They are constant for ViT, they didn't gain much by learning them.

    • @parker1981xxx
      @parker1981xxx 3 years ago

      @@TheAIEpiphany That is exactly my observation: if the data is scarce and/or there are too many patches then trainable positional embeddings become a liability. Your message just confirmed it, thanks.
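
      On the trainable-vs-constant question in this thread, either option is a one-line change in most implementations, since the positional embedding is just a (1 + num_patches) x dim tensor added to the token sequence; a minimal sketch, assuming PyTorch and the ViT-B/16 sizes (197 tokens, dim 768):

        import torch
        import torch.nn as nn

        class PositionalEmbedding(nn.Module):
            """Additive positional embedding for a ViT-style token sequence."""
            def __init__(self, num_tokens: int = 197, dim: int = 768, trainable: bool = True):
                super().__init__()
                pos = torch.zeros(1, num_tokens, dim)
                nn.init.trunc_normal_(pos, std=0.02)
                if trainable:
                    # Learned embedding: updated by the optimizer.
                    self.pos_embed = nn.Parameter(pos)
                else:
                    # Constant embedding: moves with .to(device) but gets no gradient updates.
                    self.register_buffer("pos_embed", pos)

            def forward(self, tokens: torch.Tensor) -> torch.Tensor:
                # tokens: (batch, num_tokens, dim), with the [class] token at position 0
                return tokens + self.pos_embed

      (For a frozen variant one would typically initialize the tensor with a fixed sinusoidal pattern rather than random noise.)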

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 3 years ago +2

    Why can't we just use THEIR pre-trained model? They already did it once; what's the sense in doing the same process all over again, wasting energy, when they already have the model?

  • @quanhua92
    @quanhua92 3 years ago +1

    How would you implement this Vision Transformer? I think ImageNet is the obvious choice; however, it is still worse than ResNet there. I would start with the Imagenette dataset from fastai for fast iterations, then switch to ImageNet, which would be trained on Lambda Labs' GPU cloud with 4 GPUs.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      I am not 100% sure I understood your question. Did you mean to ask:
      1. How would you train it, i.e. given the amount of compute it requires, what would be the correct machine setup?
      2. How would we train it, given that JFT-300M is not a public dataset?
      3. Or did you mean to ask how to implement it? That's fairly simple, as it's almost a regular transformer (except for the input preprocessing contained in the stem of the model).

    • @quanhua92
      @quanhua92 3 years ago +1

      @@TheAIEpiphany Likely 3. I want to implement it from scratch to understand all the details. So I will need to find a replacement for the dataset and maybe a smaller version of the model. I don't think it makes sense to train for 600 GPU-days as an individual researcher.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago +1

      @@quanhua92 Neither could you unless your dad is Bill Gates haha.
      Hm, check out my GitHub project, and also The Annotated Transformer and Jay Alammar's blog. I did a couple of videos on how I did it; maybe those could help as well.
      Your question is basically "how do I implement the transformer". The preprocessing step is really simple, and that's what's beautiful about this model.
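
      For anyone implementing that stem from scratch, here is a minimal sketch (assuming PyTorch and the ViT-B/16 configuration; not the official code): the patch extraction plus linear projection can be written as a single convolution with kernel size and stride equal to the patch size, and everything after it is a standard transformer encoder.

        import torch
        import torch.nn as nn

        class PatchEmbed(nn.Module):
            """ViT stem: split the image into patches and linearly project each one.

            A Conv2d with kernel_size == stride == patch_size is equivalent to
            flattening non-overlapping patches and applying a shared linear layer.
            """
            def __init__(self, img_size: int = 224, patch_size: int = 16,
                         in_chans: int = 3, embed_dim: int = 768):
                super().__init__()
                self.num_patches = (img_size // patch_size) ** 2
                self.proj = nn.Conv2d(in_chans, embed_dim,
                                      kernel_size=patch_size, stride=patch_size)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                x = self.proj(x)                     # (batch, embed_dim, H/P, W/P)
                return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

        # Pipeline: PatchEmbed -> prepend [class] token -> add positional embedding
        # -> standard transformer encoder -> classification head on token 0.
        tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
        print(tokens.shape)  # torch.Size([2, 196, 768])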

    • @marijnspijker5199
      @marijnspijker5199 3 years ago

      ​@@quanhua92 Did you get this working? I am looking for a thesis topic and I am wondering if it is feasible to make this work. Thanks.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago

    Cool

  • @navinbondade5365
    @navinbondade5365 3 years ago +1

    Can you make a coding video on it?

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      Will do, check out my DINO video and let me know what u think.

    • @jinorohit2921
      @jinorohit2921 2 years ago

      @@TheAIEpiphany thanks? ahahha great video btw

    • @TheAIEpiphany
      @TheAIEpiphany  2 years ago +1

      @@jinorohit2921 failed to edit it lolz

    • @TheAIEpiphany
      @TheAIEpiphany  2 years ago +1

      @@jinorohit2921 thanks! 🤣

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 3 years ago

    How do I convert a VIT-L_32.npz checkpoint to VIT-L_32.pt so I can load it with CLIP? Anyone know?