Vision Transformer in PyTorch

  • Published: 17 Oct 2024

Comments • 234

  • @mildlyoverfitted
    @mildlyoverfitted  2 years ago +3

    Errata:
    * lines 217/218 of `custom.py`: the shape should be (n_samples, n_patches + 1, out_features)

  • @AladdinPersson
    @AladdinPersson 3 years ago +53

    This is awesome! Glad this got recommended, will watch this later 👍

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +8

      Appreciate the message! I hope you will like it:) BTW you are creating great content! Keep it up!

    • @AladdinPersson
      @AladdinPersson 3 years ago +2

      @@mildlyoverfitted Yeah I liked it :)

    • @Patrick-wn6uj
      @Patrick-wn6uj 6 months ago

      @devstuff2576 Put it into ChatGPT

  • @liam9519
    @liam9519 3 years ago +45

    LOVE this live coding channel format! I always find it much easier to understand a paper when I see a simple implementation and this makes it even easier! Keep it up!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Thank you! It is funny that you say that because I am exactly like you!

  • @thuancollege5594
    @thuancollege5594 2 years ago +2

    I don't understand why you use an nn.Conv2d layer in the patch embedding module at 2:53. In my mind, I would just use nn.Linear(in_channels, out_channels). Can you explain it?
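
For context: a Conv2d whose kernel size and stride both equal the patch size touches each patch exactly once, so it is mathematically the same as flattening each patch and feeding it through a shared nn.Linear, just faster and tidier. A minimal numpy sketch (toy sizes and names are illustrative) verifies the equivalence:

```python
import numpy as np

# Toy sizes: 3-channel 8x8 image, patch size 4, embedding dim 5.
C, H, W, P, D = 3, 8, 8, 4, 5
rng = np.random.default_rng(0)
img = rng.standard_normal((C, H, W))
weight = rng.standard_normal((D, C, P, P))  # Conv2d weight, no bias

# 1) "Convolution" with kernel_size = stride = P: one output per patch.
conv_out = np.zeros((D, H // P, W // P))
for i in range(H // P):
    for j in range(W // P):
        patch = img[:, i*P:(i+1)*P, j*P:(j+1)*P]
        conv_out[:, i, j] = (weight * patch).sum(axis=(1, 2, 3))

# 2) Equivalent nn.Linear view: flatten each patch, multiply by (D, C*P*P) weight.
patches = np.stack([img[:, i*P:(i+1)*P, j*P:(j+1)*P].reshape(-1)
                    for i in range(H // P) for j in range(W // P)])  # (n_patches, C*P*P)
linear_out = patches @ weight.reshape(D, -1).T                       # (n_patches, D)

assert np.allclose(conv_out.reshape(D, -1).T, linear_out)
```

So the conv is not adding any modeling power here; it is an implementation convenience that avoids manually cutting the image into patches.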

  • @TheAero
    @TheAero 1 year ago

    Has the literature tried combining Transformers with CNNs, e.g. replacing the 2D poolings with attention?

  • @tuankhoanguyen3222
    @tuankhoanguyen3222 3 years ago +5

    A great channel about PyTorch. I like the way you carefully explain the meaning of each function. It encourages me to get my hands dirty. Thank you, and looking forward to seeing more videos from you.

  • @mohitlamba6400
    @mohitlamba6400 2 years ago +1

    A much needed video. I went through several iterations of the paper and supplementary videos online explaining it. I always had some residual doubts remaining and didn't understand things with pinpoint accuracy. After this video everything is now clear!!

  • @100vivasvan
    @100vivasvan 3 years ago +15

    ❤️❤️ Absolutely fantastic presentation. This cured my depression after 5 days of banging my head against the wall.
    The pace of this video is ideal.
    One suggestion I want to propose: add the network architecture figure/diagram from the paper while writing the code, so it's easier for new ML/DL coders to understand.
    Keep it up. Looking forward to more. Amazing work. ❤️ Thank you so much ❤️

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +2

      Heh:) Thank you for the kind words! That is a great suggestion actually!

    • @rushirajparmar9602
      @rushirajparmar9602 3 years ago +1

      @@mildlyoverfitted Yes the diagram might be very helpful!

  • @tranquangkhai8329
    @tranquangkhai8329 3 years ago +4

    I haven't watched the full video yet, but I like the way you explain things clearly with IPython demos for a beginner like me. Nice video!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Appreciate it! Nice to know that you enjoyed the ipython segments:) I will definitely try to include them in my future videos too!

  • @danielasefa8087
    @danielasefa8087 5 months ago +1

    Thank you so much for helping me to understand ViT!! Great work

  • @prajyotmane9067
    @prajyotmane9067 6 months ago

    Where did you include the positional encoding? Or is it not needed when using convolutions for patching and embedding?
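
For context: in the ViT paper the convolution itself carries no positional information; a learnable positional embedding is added right after patchification and CLS-token concatenation. A minimal numpy sketch of the tensor flow (shapes for a 224x224 image with 16x16 patches; names illustrative):

```python
import numpy as np

n_samples, n_patches, embed_dim = 2, 196, 768  # 224x224 image, 16x16 patches

patch_emb = np.random.rand(n_samples, n_patches, embed_dim)  # output of the Conv2d patchifier
cls_token = np.zeros((1, 1, embed_dim))                      # learnable nn.Parameter in the real model
pos_embed = np.zeros((1, n_patches + 1, embed_dim))          # learnable nn.Parameter, zero-initialized

x = np.concatenate([np.repeat(cls_token, n_samples, axis=0), patch_emb], axis=1)
x = x + pos_embed  # broadcast over the batch dimension
assert x.shape == (n_samples, n_patches + 1, embed_dim)
```

Without pos_embed the model would be permutation-invariant across patches, so the positional embedding is definitely still needed.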

  • @froghana1995
    @froghana1995 2 years ago +1

    Thank you for helping me understand ViT! It's a great and kind video!!

  • @news2000tw
    @news2000tw 1 year ago

    Thank you!!!!! Super useful. Before, I knew how dropout works but I didn't know how PyTorch handles it.

  • @yichehsieh243
    @yichehsieh243 3 years ago +1

    Thank you for uploading this video. It taught me a lot and made me more familiar with the ViT model.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Glad you enjoyed it!

    • @yichehsieh243
      @yichehsieh243 3 years ago

      @@mildlyoverfitted After some study, I realized that the ViT model is actually the encoder part of the Transformer. May I look forward to an introduction of the decoder part or a complete seq2seq model in the future 🤣
      Besides, I was surprised that the ViT implementation was completed without using nn.MultiheadAttention or nn.Transformer. Isn't it more convenient to use them?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      @@yichehsieh243 Good question actually. I guess one of the goals of this channel is to do things "from scratch" mostly for educational purposes. However, in real life I would always go for well maintained libraries rather than reinventing the wheel and reimplementing everything.

  • @mkamp
    @mkamp 2 years ago +4

    Beautiful code, wonderful explanations to follow along. Thanks for taking the extra time to look at some of the essential concepts in iPython. Superb content!

  • @vishalgoklani
    @vishalgoklani 3 years ago +6

    Excellent presentation, thank you for sharing!
    A few reasons why I enjoyed the video:
    1. < 30min, no fluff, no typos, no BS, good consistent pace. Everyone is busy, staying under 30min is extremely helpful and will force you to optimize your time!
    2. Useful sidebars; break down key concepts into useful nuggets. Very helpful
    3. Chose a popular topic, based on one of the best repos, and gave a nice intro
    4. Stay with pytorch please, no one likes tensorflow ;)
    I look forward to more of your work.
    Thank you

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +2

      Thank you very much:) Very encouraging and motivating comment!

    • @jamgplus334
      @jamgplus334 3 years ago

      No one likes TensorFlow, haha. Strongly agree with you.

  • @mevanekanayake4363
    @mevanekanayake4363 2 years ago +2

    Loved the video! Just a quick question: here, you save the custom_model, which has not been trained for a single epoch. How is it able to predict the image correctly (without training)? Or am I missing something here?

    • @mevanekanayake4363
      @mevanekanayake4363 2 years ago

      I got it! You are copying the learned weights from the official_model to the custom_model. I missed it the first time!

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      Yeh, that’s right! Anyway, thank you for your comment!!

    • @ibtissamsaadi6250
      @ibtissamsaadi6250 2 years ago

      I have the same problem!! I can't understand how it is able to predict without training. Please, can you explain to me what happens? And how can I train this model?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      @@ibtissamsaadi6250 I just took a pretrained model from `timm` and copied its weights

    • @ibtissamsaadi6250
      @ibtissamsaadi6250 2 years ago

      @@mildlyoverfitted Thanks for your reply! Can you help me set up training and testing for your code? Is it possible? 1) load the pretrained model,
      2) fine-tune this model and train it,
      3) test step.
      Is that correct? I want to apply ViT to facial expression classification but I didn't find any example of how to do it.

  • @StasGT
    @StasGT 1 year ago +1

    Thank you! It's the best video about ViT for understanding.

    • @mildlyoverfitted
      @mildlyoverfitted  1 year ago

      Appreciate your comment!

    • @StasGT
      @StasGT 11 months ago

      @@mildlyoverfitted, in the PyTorch transformer (torch.nn.modules.transformer.py), q, k and v are all set to x. That was a discovery for me, and it gives better convergence of the net. I didn't know that until yesterday.

          # self-attention block
          def _sa_block(self, x: Tensor,
                        attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
              x = self.self_attn(x, x, x,
                                 attn_mask=attn_mask,
                                 key_padding_mask=key_padding_mask,
                                 need_weights=False)[0]
              return self.dropout1(x)

      This method pushes `x` to the MultiheadAttention class in torch.nn.modules.activation.py.

  • @AnkityadavGrowConscious
    @AnkityadavGrowConscious 3 years ago +2

    Amazing clarity. Your tutorial is gold!! Great work.

  • @goldfishjy95
    @goldfishjy95 3 years ago +2

    Thank you so much.. this is a lifesaver! Bless you my friend!

  • @aravindreddy4871
    @aravindreddy4871 3 years ago +2

    Hi, great explanation. Can this transformer be used only for embedding extraction, leaving out the classification head?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Thank you! You can simply take the final CLS token embedding:)

  • @mlearnxyz
    @mlearnxyz 2 years ago +2

    Excellent material. Thanks for preparing and sharing it! Keep up the good work.

  • @조원기-w6b
    @조원기-w6b 3 years ago +1

    Thanks for your video. I have a question: I got a result from the trained model, but I can't see a result like in your video. Did you train the ViT model on ImageNet data?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      I used the pretrained model from the timm package as shown in the video. Not sure what it was trained on.

  • @vasylcf
    @vasylcf 2 years ago +2

    Thanks!!! I like your clear way of explaining.

  • @sushilkhadka8069
    @sushilkhadka8069 1 year ago +1

    shape of v : (n_samples, n_heads, n_patches + 1, head_dim)
    shape of atten : (n_samples, n_heads, n_patches + 1, n_patches + 1)
    How can you multiply these two tensors?
    And how is the result's shape the same as v's?
    Please explain. BTW great content. Glad I found this channel.

    • @suleymanerim2119
      @suleymanerim2119 1 year ago +1

      atten @ v can be done: (n_samples, n_heads, n_patches + 1, n_patches + 1) @ (n_samples, n_heads, n_patches + 1, head_dim) = (n_samples, n_heads, n_patches + 1, head_dim). For example, if you have two matrices with shapes (2, 2, 5, 5) and (2, 2, 5, 3), the output will be (2, 2, 5, 3).
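
The shape arithmetic above can be checked directly; numpy's `@` follows the same batched-matmul rule as torch (matrix multiply over the last two dims, leading batch dims must line up):

```python
import numpy as np

n_samples, n_heads, n_tokens, head_dim = 2, 2, 5, 3
atten = np.random.rand(n_samples, n_heads, n_tokens, n_tokens)
v = np.random.rand(n_samples, n_heads, n_tokens, head_dim)

out = atten @ v  # matmul over the last two dims: (5, 5) @ (5, 3) -> (5, 3)
assert out.shape == (n_samples, n_heads, n_tokens, head_dim)

# v @ atten would raise instead: (5, 3) @ (5, 5) has mismatched inner dims (3 vs 5).
```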

    • @sushilkhadka8069
      @sushilkhadka8069 1 year ago

      @@suleymanerim2119 Sorry, my bad. I was doing v @ atten instead of atten @ v. Thanks anyway.

  • @talhayousuf4599
    @talhayousuf4599 3 years ago +5

    I subscribed owing to such a clean implementation, well explained. I love how you comment the code and check shapes on the go. I request you to please make a video on your approach to implementing papers.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Great to hear that! I guess it is easier to take an existing code and modify it rather than starting from scratch:)

  • @sanskarshrivastava5193
    @sanskarshrivastava5193 3 years ago +1

    I'm so glad that i found this channel , you are a gem :) !!

  • @DCentFN
    @DCentFN 2 years ago

    Quick question: in the forward.py file, what is the purpose of k=10? I see it's used for the topk function, but I was curious what the k variable denotes and why 10 specifically was chosen.
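
For context: most likely k here is just the number of highest-probability classes to display, and 10 is an arbitrary display choice (the name and value are from the video's forward.py, not verified here). The semantics of a top-k selection can be sketched in plain numpy:

```python
import numpy as np

probs = np.array([0.05, 0.60, 0.10, 0.25])  # softmax output over 4 classes
k = 2

top_ix = np.argsort(probs)[::-1][:k]  # indices of the k largest values
top_val = probs[top_ix]               # their probabilities, descending

assert top_ix.tolist() == [1, 3]
assert np.allclose(top_val, [0.60, 0.25])
```

torch.topk(probs, k) returns the same (values, indices) pair in one call.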

  • @shahriarshayesteh8602
    @shahriarshayesteh8602 2 years ago +2

    Just found your amazing channel. I love it, please continue.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      Thank you for the kind message! I will definitely continue:)

  • @harrisnisar5345
    @harrisnisar5345 7 months ago

    Amazing video. Just curious, what keyboard are you using?

  • @nishantbhansali3671
    @nishantbhansali3671 2 years ago +1

    Very helpful video. Please make a similar video explaining the decoder architecture as well.

  • @fuat7775
    @fuat7775 2 years ago +1

    Thank you for the tutorial, your explanation was perfect!

  • @zeamon4932
    @zeamon4932 3 years ago +1

    I like the shape-checking part and your vim usage. Using old-style vim just shows your ability to play around with code.

  • @elaherahimian3619
    @elaherahimian3619 3 years ago +1

    Thanks for your great video and description, I have learned a lot.

  • @dhananjayraut
    @dhananjayraut 3 years ago +1

    Really like the videos on the channel, keep them coming. I knew I had to subscribe just a few minutes into the video.

  • @laxlyfters8695
    @laxlyfters8695 3 years ago +2

    Great video. Saw this posted in the Artificial Intelligence and Deep Learning group on Facebook.

  • @georgemichel9278
    @georgemichel9278 2 years ago

    One quick question: I have implemented ViT, but when I try to train it from scratch it seems like it is not learning at all (the loss is not going down). I have been using a simple dataset (cats vs dogs) with the AdamW optimizer and lr = 0.001. What should I do other than loading the pretrained weights?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      I would definitely try to overfit one single batch of your training data. If that is possible, then in theory your setup is correct and you just need to train on more data/longer. If it is not possible, something went wrong with your architecture/loss.
      I hope that helps. Also, I know that this might be against the spirit of what you are trying to do, but there are a bunch of frameworks that implement the architecture/training logic already. Examples:
      * rwightman.github.io/pytorch-image-models/models/vision-transformer/
      * huggingface.co/docs/transformers/model_doc/vit
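
The "overfit a single batch" sanity check suggested in the reply can be sketched as follows (a tiny stand-in classifier instead of the full ViT; the model, sizes and hyperparameters are illustrative, not the video's):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # stand-in for the ViT
x = torch.randn(8, 3, 32, 32)        # one fixed batch, reused every step
y = torch.randint(0, 2, (8,))
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(300):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())

# A healthy setup should be able to memorize this one batch almost perfectly.
assert losses[-1] < losses[0]
```

If the loss refuses to collapse on a single repeated batch, the bug is in the architecture, loss, or optimizer wiring rather than in the data.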

  • @preetomsahaarko8145
    @preetomsahaarko8145 1 year ago

    I am building a custom model based on ViT. It is almost the same as ViT, with just a few additional layers. I am trying to load the pretrained weights of ViT using the load_state_dict() function. But the size of the input image I am feeding to the model is not 384x384, but rather 640x640. So the positional embedding layer of my model has more parameters than ViT. How do I handle these extra positional embedding parameters? Can I perform some interpolation of the existing parameters?
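
Interpolation is indeed the common answer: reshape the patch-position embeddings into their 2D grid, resize the grid (timm's resize_pos_embed helper uses bicubic interpolation), and keep the CLS position as-is. A sketch with the 384-to-640 sizes from the question, assuming 16x16 patches and ViT-Base's 768-dim embeddings:

```python
import torch
import torch.nn.functional as F

# Pretrained ViT: 384x384 image, 16x16 patches -> 24x24 = 576 patch tokens (+1 CLS).
old_pos = torch.randn(1, 577, 768)               # stand-in for the pretrained parameter
cls_pos, grid_pos = old_pos[:, :1], old_pos[:, 1:]

# Target: 640x640 image -> 40x40 = 1600 patch tokens.
grid = grid_pos.reshape(1, 24, 24, 768).permute(0, 3, 1, 2)  # (1, 768, 24, 24)
grid = F.interpolate(grid, size=(40, 40), mode="bicubic", align_corners=False)
grid = grid.permute(0, 2, 3, 1).reshape(1, 40 * 40, 768)

new_pos = torch.cat([cls_pos, grid], dim=1)
assert new_pos.shape == (1, 1601, 768)
```

Load the rest of the state dict normally and overwrite the positional embedding with new_pos; fine-tuning afterwards usually recovers most of the accuracy.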

  • @omerfarukyasar4681
    @omerfarukyasar4681 2 years ago +1

    Thanks for all great content!

  • @visuality2541
    @visuality2541 2 years ago +1

    EXTREMELY HELPFUL AS ALWAYS. KUDOS

  • @klindatv
    @klindatv 1 year ago

    Are the positional embeddings of the patches learned during training? And why?

  • @HassanKhan-fe3pn
    @HassanKhan-fe3pn 3 years ago +1

    Is it possible to fine tune vision transformers on a single GPU machine? Given that they’ve been trained using tons of TPUs, I’m inclined to think fine tuning also requires huge compute power and thus out of reach of most people at the moment.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      I have never fine-tuned a Vision Transformer myself; however, I would imagine it takes fewer resources than training it from scratch. Just give it a go with a single GPU and monitor the performance:) Good luck!

  • @DCentFN
    @DCentFN 2 years ago

    How would such an implementation be modified to accommodate for the vit_base_patch16_224_sam model?

    • @DCentFN
      @DCentFN 2 years ago

      Also how would fine-tuning be done with this model or the sam model to customize for more unique datasets?

  • @陈文-p6u
    @陈文-p6u 3 years ago +1

    Very helpful video! Thanks!! BTW, what's your development environment?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      You are welcome! Thank you for your comment! I only use vim and tmux:)

  • @dewan_shaheb
    @dewan_shaheb 3 years ago +1

    Loved it!
    Please make a video on a Transformer in Transformer (TNT) PyTorch implementation.

  • @vishakbhat3032
    @vishakbhat3032 1 year ago

    Amazing explanation!!! Just loved it !!!

  • @nikhilmehra5559
    @nikhilmehra5559 1 year ago

    Hi, I couldn't get why the position embedding is initialized as a zero tensor. Why isn't it initialized with the index values of the patches in the original image, as the flow diagram suggests? I would highly appreciate clarification on this. Great video btw!!

  • @danyellacarvalho4120
    @danyellacarvalho4120 1 year ago

    Very helpful explanation. Thank you!

  • @hamedhemati5151
    @hamedhemati5151 3 years ago +2

    Hi, great video indeed!
    Thanks for taking the time to make such a video and for sharing it with the community. Do you have plans to create further videos on the implementation of other types of architectures, or on training/inference setups that might be more difficult than or different from the standard ones?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +3

      Thank you for your message! Really nice to hear that. Yes, I am planning to create many videos like this on different architectures/approaches. The format will stay very similar to this video: Taking an existing github repo and implementing it from scratch! I guess the goal is to change things up and cover different fields and topics!

  • @陳思愷-b1y
    @陳思愷-b1y 3 years ago

    Fantastic live coding video!!!!!!!! You saved my day, and I hope you can keep on making such nice videos. I believe it is the best video explaining ViT~

  • @macx7760
    @macx7760 8 months ago +1

    Fantastic video, just a quick note: at 16:01 you say that "none of the operations are changing the shape of the tensor", but isn't this wrong? When applying fc2, the last dim should be out_features, not hidden_features, so the shapes are also wrongly commented.

    • @mildlyoverfitted
      @mildlyoverfitted  8 months ago +1

      Nice find and sorry for the mistake:)! Somebody already pointed it out a while ago:) Look at the pinned errata comment:)

    • @macx7760
      @macx7760 8 months ago

      @@mildlyoverfitted Ah I see, my bad :D

  • @HamzaAli-dy1qp
    @HamzaAli-dy1qp 2 years ago

    How can I train FaceForensics++ on the Vision Transformer, given that you have used already existing classes?

  • @EstZorion
    @EstZorion 2 years ago +1

    THANK YOU! JUST THANK YOU! 😂 I don't know why I thought the linear layer only accepts 2D tensors...

  • @pranavkathar5383
    @pranavkathar5383 2 years ago +1

    Amazing clarity. Your tutorial is gold!! Great work.
    Can you please make a video on a code implementation of the VOLO-D5 model (Vision Outlooker for Visual Recognition)?

    • @mildlyoverfitted
      @mildlyoverfitted  9 months ago

      Appreciate it! Thank you for the suggestion!

  • @junhyeokpark1214
    @junhyeokpark1214 2 years ago

    Love this vid :)
    Clearly explained with nice examples.

  • @tuoxin7800
    @tuoxin7800 1 year ago

    Great video! Love it! My question is: why did you set pos_embedding as a learnable parameter?

    • @mildlyoverfitted
      @mildlyoverfitted  1 year ago

      I think I just did what the paper suggested. However, yes, there are positional encodings that are not learnable so it is a possible alternative:)

  • @marearts.
    @marearts. 3 years ago +1

    Thank you. This is a really helpful video.

  • @_shikh4r_
    @_shikh4r_ 3 years ago +2

    Love this format 👍

  • @abdallahghazaly359
    @abdallahghazaly359 2 years ago +1

    Thank you for the tutorial, it helped me very much.

  • @gopsda
    @gopsda 2 years ago

    Thanks again for the great hands-on tutorial on ViT. This helped me greatly to understand the Transformer implementation in PyTorch. My understanding is that you have covered the encoder part here (for classification tasks). Do you have a separate session on the decoder part, or is it implemented here?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      Glad it helped! As you pointed out, this video is purely about the encoder! I don't have a video on the BERT decoder with cross attention, however, I have a new video on the GPT-2 model that does contain a variant of the decoder block. Feel free to check it out:)

    • @gopsda
      @gopsda 2 years ago

      @@mildlyoverfitted Ok, Thanks. Will check it out soon.

  • @StasGT
    @StasGT 1 year ago

    I tried changing the hyperparameters, adding MLP blocks, and training the network. But the result is the same: 61% validation accuracy on CIFAR-10. Why...?

  • @ahmedyassin7684
    @ahmedyassin7684 1 year ago

    what a beautiful demo, Thank you

  • @mybirdda
    @mybirdda 3 years ago +2

    You're awesome, literally! Please make more videos!

  • @hakunamatata0014
    @hakunamatata0014 1 year ago

    Hi, thanks for the nice video. I have a question: I am doing a CNN + ViT project using 3 conv layers. Can you show me how to incorporate the CNN layers into the ViT architecture that you implemented in your video, and how I can optimize it? Please help me. Thank you very much.

  • @MercyPrasanna
    @MercyPrasanna 2 years ago

    It's a great explanation, found it extremely useful!!

  • @jjongjjong2365
    @jjongjjong2365 3 years ago +1

    This is a perfect code review.
    Thank you for sharing a good review.

  • @vaehligchs
    @vaehligchs 2 years ago

    Hi, fantastic video! Is it better to get the input images in the range -1 to 1 or 0 to 1?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! I guess it should not make a difference as long as you do the same thing at training and inference time.

    • @muhammadnaufil5237
      @muhammadnaufil5237 1 year ago

      I think the batchnorm layer does that in the forward pass

  • @PriyaDas-he4te
    @PriyaDas-he4te 1 month ago

    Can we use this code for change detection between two satellite images?

  • @bhanu0669
    @bhanu0669 3 years ago +1

    Best video ever. Please implement the Swin Transformer, which is the latest in the image Transformer family. I find it difficult to understand the source code of Window Attention in the Swin Transformer. It would be very useful if you could upload either a walkthrough or an implementation of the Swin Transformer code.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Appreciate it:) Anyway, I haven't even heard about this Swin Transformer. I will definitely try to read up on it and maybe make a video on it:)

  • @macknightxu2199
    @macknightxu2199 1 year ago +1

    Hi, can I run this code on a laptop without a GPU?

  • @iinarrab19
    @iinarrab19 3 years ago +4

    I love things like these that are application focused. I am currently experimenting on editing backbones so that they should start with Gabor filters. These backbones are loaded from mmdetection or detectron2. Can you do something like that? As to how we could edit backbones? That might be useful to people that want to experiment.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      Thank you for the feedback! Interesting suggestion! I am writing it down:)

    • @iinarrab19
      @iinarrab19 3 years ago +2

      @@mildlyoverfitted Thanks. It's kind of a transfer learning but with the ability to either replace layers or edit them. Thanks for these videos, btw

  • @gopsda
    @gopsda 2 years ago

    Thanks so much for the video. Easy to follow, and the detours to explain side topics are also relevant. Should the comments on lines 217/218 about the shape be changed to (n_samples, n_patches + 1, out_features), or am I wrong?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! You are absolutely right! Nice find! I will create an errata comment and fix it on github.

  • @siddharthmagadum16
    @siddharthmagadum16 2 years ago

    Can I train this on the Google Colab free plan on a dataset of 21.4k cassava leaf images?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      I guess it should be possible:) Just give it a try:)

  • @mayukh3556
    @mayukh3556 3 years ago +1

    Instantly subscribed

  • @HaiderAli-lr9fw
    @HaiderAli-lr9fw 1 year ago

    Thanks for the explanation. Can you explain how to train ViT?

  • @kbkim-f4z
    @kbkim-f4z 3 years ago +1

    what a video!

  • @macx7760
    @macx7760 8 months ago

    Why is the 2nd dim of the MLP input shape n_patches + 1? Isn't the MLP just applied to the class token?

    • @mildlyoverfitted
      @mildlyoverfitted  8 months ago +1

      So the `MLP` module is used inside of the Transformer block, and it takes a 3D tensor as input. See this link for the only place where the CLS token is explicitly extracted: github.com/jankrepl/mildlyoverfitted/blob/22f0ecc67cef14267ee91ff2e4df6bf9f6d65bc2/github_adventures/vision_transformer/custom.py#L423-L424
      Hope that helps:)

    • @macx7760
      @macx7760 8 months ago

      @@mildlyoverfitted Thanks, yeah, I confused the MLP inside the block with the MLP at the end for classification.
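
To make the point in this exchange concrete: a linear layer acts only on the last dimension, so the MLP inside the block transforms every token (CLS included) with the same weights. A numpy sketch with illustrative shapes:

```python
import numpy as np

n_samples, n_tokens, hidden = 2, 197, 768  # 196 patch tokens + 1 CLS token
x = np.random.rand(n_samples, n_tokens, hidden)
w1 = np.random.rand(hidden, 4 * hidden)  # fc1 weight (the usual 4x MLP expansion)
w2 = np.random.rand(4 * hidden, hidden)  # fc2 weight

h = np.maximum(x @ w1, 0.0)  # GELU in the real model; ReLU here for brevity
out = h @ w2                 # every one of the 197 tokens goes through the same MLP

assert out.shape == (n_samples, n_tokens, hidden)
# Only the final classification head picks out out[:, 0] (the CLS token).
```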

  • @lauraennature
    @lauraennature 3 years ago +1

    🎉🎉🎉 1000 subscribers 👏👏👏

  • @cwang6936
    @cwang6936 3 years ago

    freaking awesome, Niubility!

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      hehe:) I had to google that word:) Thank you !

    • @cwang6936
      @cwang6936 3 years ago

      @@mildlyoverfitted We call it Chinglish (Chinese English). Ha, ha, ha.

  • @dontaskme1625
    @dontaskme1625 3 years ago +2

    Can you make a video about the "Rethinking Attention with Performers" paper? :D

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      That is a great idea actually! Thank you! I just read through the paper and it looks interesting.

  • @youngyulkim3072
    @youngyulkim3072 1 year ago

    Thanks so much! This helped me a lot.

  • @jeffg4686
    @jeffg4686 7 months ago

    "mildly overfitted" is how I like to keep my underwear so I don't get the hyena.

  • @hamzaahmed5837
    @hamzaahmed5837 3 years ago +2

    Great!

  • @andreydung
    @andreydung 2 years ago +1

    Awesome!!!!

  • @jamgplus334
    @jamgplus334 3 years ago +1

    awesome video

  • @vidinvijay
    @vidinvijay 8 months ago +1

    The novelty explained in just over 6 minutes. 🙇

  • @mehedihasanshuvo4874
    @mehedihasanshuvo4874 3 years ago +3

    Excellent video. Could you create a YOLO algorithm tutorial? It would be very helpful for me.

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Really appreciate your feedback:) Thank you! I will definitely try to create a YOLO video in the future:)

  • @awsaf49
    @awsaf49 3 years ago

    What coding speed!!! Did you speed up the video or were you actually coding in real time?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      It is all sped up:) The goal is to keep the videos as short as possible:)

    • @awsaf49
      @awsaf49 3 years ago

      @@mildlyoverfitted Oh no :v I was kinda motivated to code fast. Nice tutorial, by the way :)

  • @TimurIshuov
    @TimurIshuov 1 year ago

    Thank God, and thank you, man!

  • @yizhouchen9543
    @yizhouchen9543 3 years ago

    Pure curiosity: is this your real typing speed?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago +1

      Hehe, of course not:) I speed up the coding parts in my editing software so that the videos don't get too long:)

  • @saniazahan5424
    @saniazahan5424 2 years ago

    Hi, thanks for sharing. It's great. Could you please share your experience of training a transformer from scratch? I am trying to train one on skeleton datasets in a self-supervised manner with a SimCLR loss, and my transformer doesn't seem to learn much; after a few epochs the loss increases. I am new to this and don't understand what's wrong.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      Hey! Thank you for your comment! Hmmmm, it is a pretty hard question since I don't know what your code and exact setup look like. Anyway, a few ideas I would recommend (not sure if they apply to your problem):
      * Make sure it is possible to "overfit" your network on a single batch of samples
      * Track as many relevant metrics (+ other artifacts) as possible (with tools like TensorBoard) to understand what is going on
      * Try to use a popular open-source package/repository for the training before actually writing custom code

    • @saniazahan5424
      @saniazahan5424 2 years ago +1

      @@mildlyoverfitted Thanks a lot. I have just one concern. Transformers are really great on NLP and image or video data, but my data is a sequence of frames, each frame containing just 30 values (10 joints with 3 coordinates, x-y-z). Do you think a 300x30 dimension is too low for a Transformer to learn something meaningful?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      @@saniazahan5424 Interesting! I don't think that should be a problem. However, as I said, it is really hard to give any tips without actually knowing all the technical details:( Good luck with your project!!!

    • @saniazahan5424
      @saniazahan5424 2 years ago +1

      @@mildlyoverfitted I guess it is. Thanks.

  • @johanngerberding5956
    @johanngerberding5956 3 years ago +1

    very cool channel, keep going! :)

  • @sayeedchowdhury11
    @sayeedchowdhury11 2 years ago

    Thanks! Can you please implement, or point me to, a repo which uses ViT for image captioning?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago +1

      You're welcome! I am sorry but I have very little knowledge about image captioning:(

    • @sayeedchowdhury11
      @sayeedchowdhury11 2 years ago

      @@mildlyoverfitted No worries, thanks for your work anyway, really love it!

  • @rafaelgg1291
    @rafaelgg1291 3 years ago +2

    Thank you! New subscriber :D

  • @danieltello8016
    @danieltello8016 6 months ago

    Great video. Can I run the code as-is on a Mac with an M1 chip?

  • @maralzarvani8154
    @maralzarvani8154 2 years ago +1

    Thank you! That is fantastic; I could understand it deeply. Could you please present the Swin Transformer like this?

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Glad you liked it! Thank you for the suggestion!

  • @baveshbalaji301
    @baveshbalaji301 2 years ago

    Great video on Vision Transformers. However, I have a small problem with my implementation. When I tried to train the model I implemented, I was getting the same outputs for all the images in a batch. On further investigation, I found out that the first row of every tensor in a batch, i.e., the cls_token for every image, was not changing as it passed through the layers. Is this problem occurring because we give the same cls_token to every image, or is it some other implementation error? It would be really great if someone could answer. Thanks in advance.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      Thank you! AFAIK if your batch contains different images then you should indeed have different embeddings of the CLS token after the forward pass. Anyway, it is hard to say what the issue could be without seeing the code. If you think the problem is coming from my code feel free to create an issue on github where we could discuss it in detail!
      Cheers!

    • @baveshbalaji301
      @baveshbalaji301 2 years ago +1

      @@mildlyoverfitted Thanks for the reply. In my implementation, I was passing the tensor we get after layer normalization directly to the attention layer as the query, key and value. However, in your implementation and in timm's implementation, you pass the input tensor through a linear layer and reshape it to get the query, key and value. That was the problem with my code, but I still do not understand the reasoning behind my mistake. In the original Transformer, I thought we just pass the embeddings as key, value and query directly without any linear projections, so I assumed the same would apply here. However, that was not the case. If anyone can give the reasoning behind this procedure, it would be really appreciated. Thanks in advance.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      AFAIK one always needs to apply the linear projection. What paper do you refer to when you say "original transformer"?

    • @baveshbalaji301
      @baveshbalaji301 2 years ago

      @@mildlyoverfitted In Vaswani et al., in the description of the attention module, I thought they never mentioned applying a linear projection. However, I might have missed that information in the original paper. Anyway, thanks for the reply.

    • @mildlyoverfitted
      @mildlyoverfitted  2 years ago

      @@baveshbalaji301 Just checked the paper. Figure 2 (right) shows the linear mapping logic, but I agree that it is easy to miss:) In the text they actually use the W^{Q}, W^{K}, W^{V} matrices to refer to this linear mapping (no bias).

  • @adityamishra348
    @adityamishra348 3 years ago +1

    How about making a video on the "Escaping the Big Data Paradigm with Compact Transformers" paper?

    • @mildlyoverfitted
      @mildlyoverfitted  3 years ago

      I am not familiar with this paper. However, I will definitely try to read it! Thanks for the tip!

  • @stanley_george
    @stanley_george 1 year ago

    Have you tried exploring what the different inner layers of the Vision Transformer see?

  • @KountayDwivedi
    @KountayDwivedi 2 years ago +1

    Many thanks for this amazing explanation. Do you happen to know of a tutorial on how to apply Transformers to tabular data (using PyTorch)?
    Thanks again.
    :-}