Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

  • Published: 19 Nov 2024

Comments • 17

  • @TheAIEpiphany
    @TheAIEpiphany  3 years ago +1

    👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
    Discord ► discord.gg/peBrCpheKE
    📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER (it's comin'!):
    Substack ► aiepiphany.substack.com/

  • @sammay1540
    @sammay1540 2 years ago +1

    You’ve earned my subscription!

  • @tongluo9860
    @tongluo9860 2 years ago

    Great work explaining ViT

  • @siddharthkapoor3178
    @siddharthkapoor3178 3 years ago +1

    Regarding the Figure 9 similarity, where tokens at the corners or edges have high similarity to the rest of the tokens at the boundary: it could be due to the type of data, where corner/edge tokens are generally uniform background and similar in nature.

  • @dontaskme1625
    @dontaskme1625 3 years ago +3

    I want to suggest this paper for a future video "Variational Diffusion Models"

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      Great I'll check it out, feel free to share your suggestions on Discord as well: discord.com/invite/peBrCpheKE

    • @MrMIB983
      @MrMIB983 3 years ago

      Simon Simon, diffusion models are hot right now

  • @bowenchen4908
    @bowenchen4908 3 years ago +1

    Thank you for the explanation! A quick dumb question on this paper: what does the CLS token mean? And why do we have multiple when we are training for a classification task?
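For readers with the same question: roughly, the CLS token is a single learned embedding that ViT prepends to the sequence of patch embeddings, and the classification head reads only that token's final representation. A minimal sketch of just the prepending step (all shapes and values below are made up for illustration, not taken from the paper):

```python
# Hypothetical toy shapes: a 3x3 patch grid, embedding dimension 4.
num_patches, dim = 9, 4
patch_tokens = [[0.0] * dim for _ in range(num_patches)]  # per-image patch embeddings
cls_token = [0.1] * dim  # one learned parameter vector, shared across all images

# The CLS token is prepended, so the encoder sees num_patches + 1 tokens;
# the classifier head later reads only position 0.
sequence = [cls_token] + patch_tokens
print(len(sequence))  # 10
```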

  • @sayakpaul3152
    @sayakpaul3152 3 years ago +1

    Could you explain the sqrt(18) part in a bit more detail? I could not quite follow how you got to that.

    • @Peebuttnutter
      @Peebuttnutter 3 years ago +1

      My guess is it's just the Euclidean distance.
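If that guess is right, sqrt(18) falls out of plain Euclidean distance on the patch grid: an offset of 3 patches along both the row and column axes gives sqrt(3^2 + 3^2) = sqrt(18). A quick check (the offset of 3 is an assumption for illustration, not a value confirmed by the paper):

```python
import math

def patch_distance(p1, p2):
    """Euclidean distance between two (row, col) positions on the patch grid."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

# An offset of 3 patches along both axes gives sqrt(9 + 9) = sqrt(18) ~ 4.243.
print(patch_distance((0, 0), (3, 3)))
print(math.sqrt(18))
```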

  • @sacramentofwilderness6656
    @sacramentofwilderness6656 3 years ago +1

    Thanks a lot for a nice review of the paper. A few points raised questions in my mind. First of all, what is the purpose of the matrix H in the HSIC computation? It is actually a projection onto the subspace orthogonal to the vector of ones - so in some sense the comparison is between the non-uniformities inside the Gram matrices? Also, is there an explanation for why they chose a 50-layer ResNet for comparison? It seems a fairer comparison would be between models of comparable scale, say ResNet-152 - or should one not expect a noticeable change from this choice?

    • @RavenTheCute
      @RavenTheCute 3 years ago

      From my understanding, H may be used for normalization: multiplying a vector by the centering matrix has the same effect as subtracting the mean of the vector. Also, I believe they show the comparison of the 14-patch ViT with ResNet-152 in the appendix (Figure B.1).
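That reading checks out numerically: the centering matrix H = I - (1/n)·11ᵀ subtracts the mean from a vector and sends the all-ones vector to zero, i.e. it projects onto the subspace orthogonal to the ones vector. A small self-contained check in plain Python (a sketch of the general linear-algebra fact, not the paper's exact implementation):

```python
def centering_matrix(n):
    """H = I_n - (1/n) * ones(n, n): the centering matrix used inside HSIC/CKA."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

x = [3.0, 1.0, 4.0, 1.0, 5.0]
H = centering_matrix(len(x))
mean = sum(x) / len(x)

# Multiplying by H is the same as subtracting the mean of the vector...
print(all(abs(a - (b - mean)) < 1e-9 for a, b in zip(matvec(H, x), x)))  # True
# ...and H annihilates the all-ones vector, so it is exactly the projection
# onto the subspace orthogonal to the ones vector.
print(all(abs(a) < 1e-9 for a in matvec(H, [1.0] * len(x))))  # True
```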

  • @manub.n2451
    @manub.n2451 2 years ago

    When will you do a video on Swin Transformers?

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 3 years ago

    Have you had a chance to try any of the notebooks where a ViT guides an image generation model, such as VQGAN, or even just raw RGB noise, to generate imagery from text prompts? The abilities of the ViT, even ViT-Base/32, are VAST. Compared to ResNet-101, ResNet-50, ResNet-50x4, ResNet-50x16, etc. - we've experimented with all of the above - ResNet is absolute garbage compared to the ViTs lol. I don't think you've experienced what ViT is capable of, or else you'd be raving about it haha.
    I'm not sure the scientists who created it even know what it can do, since literally all they talk about is classification and getting scores on benchmarks. Make an image. Tell it to create a universe where heads are upside down. Tell it to show an image of a car with square wheels. Then try ResNet - ResNet just falls short in every case.
    Also, the smaller the patch size the better: a smaller patch size means higher resolution and more detail per image. Of course I'd love to be able to try this with ViT-H/14, but I'm not advanced enough to rig Google's generic version for it - the regular ViT on Google Vision doesn't have a text encoder trained with it multimodally.

    • @TheAIEpiphany
      @TheAIEpiphany  3 years ago

      I haven't yet, but I will - over the next period I'll be doing code walk-throughs. Thanks for flagging that!
      And it makes sense, I guess: the spatial information being preserved contributes heavily to that.

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 3 years ago

      @@TheAIEpiphany it even suggests some knowledge of temporal flow, but I'm not 100% sure - it might be VQGAN 16384 itself

    • @vybhavramachandran
      @vybhavramachandran 3 years ago

      Any links to these notebooks? Thank you!