Transformers are ruining everything! They first ruled the NLP world and finally, they are killing it in computer vision as well.
I make 2 predictions in this video:
1. We can expect much bigger transformers being used in computer vision (same trend as in NLP)
2. We can expect a smaller patch size combined with efficient transformers (Reformer, Linformer, Longformer, etc.) any time soon
Forgot to mention 1 interesting thing. The transformer is in a way more general than CNNs and LSTMs (i.e. it has fewer inductive biases).
It turns out that the transformer is a special case of a GNN (graph neural network), in particular of the GAT (well, everything is a graph haha, 0 Shannons of surprise there, but still).
Check out this blog: thegradient.pub/transformers-are-graph-neural-networks/
I am so pumped to see what happens with Performers* on whole documents, books, images, movies, audio files... I'm sure multiple companies are training 1T+ parameter Performer models as we type. It's going to be a great year ahead of us.
*linear-scaling Transformers that are better than Linformer/Linear Transformer/Reformer/Sparse Transformer, etc.
@@DavenH Me too! I am also excited about many other areas of AI, especially graph neural nets and RL!
Mostly because I am going to dig deeper into them over the next period! 😅
I just researched AlphaStar a bit; it also uses transformers to beat pro StarCraft II players. It's RL as well, but I was happy to see transformers in there! 😂
Hello. Can we use a GAT on image patches to do the same work 🤔?
This guy will hit 500K subs by the end of 2021.
You're the only person on RUclips who gives 100000....000% effort in his videos. I'm learning a lot from you.
Hehe. 500K might be overkill considering this is a niche channel, but I'll give it my best shot!
Thanks a lot for your kind words!
I read your blog post on your journey to becoming a DeepMind engineer.
It was very very inspiring. Thank you for spending time writing that!
Thank you! 🙏
Excellent, love these paper breakdowns. Keep it up!
Thanks man! Appreciate it. If people find these kinds of breakdowns useful (which I'll know if I get enough similar feedback), I'll definitely make more of these.
I learn a lot by doing this as well.
As a side note, I am noticing a huge gap in the community. On the one hand, people don't know how to start learning ML, and on the other hand, people need help understanding the papers. I'm trying to balance it out.
Still not sure what the best strategy is but I'll continue covering seminal papers from different areas.
@@TheAIEpiphany My opinion is not to worry about the people just starting ML, unless that's what you want to do of course. There are many good resources for beginners...free courses, blogs, playlists etc.
However, I find that there aren't many channels aimed at the level of actual ML practitioners like yourself. Yannic Kilcher's channel is one such beacon, and it's doing very well. I also want to say you present stuff in a very clear and digestible way, and a 30+ minute video is more than fine.
There are so many papers coming out every day on arXiv that it's impossible to keep up, so having any help distilling them is wonderful. One symptom of this bottleneck, I notice, is that people generally just read the highlights (papers from Google Brain, FAIR, DeepMind, OpenAI). Unfortunately, by dominating the mindshare this has pulled the research into areas well beyond the compute capabilities of PhD students, independent practitioners, or startup companies. I read the wav2vec 2.0 paper yesterday and got all excited to try to apply their methods, until I saw they trained for multiple days on 100 GPUs, expensive V100s at that. Google papers are even worse this way, they never seem to train on anything less than 1000 TPUs!
I guess they are probably the most high quality papers too, so there's a feedback effect. But there are gems that surely get missed by all the universities, which I would assume focus less on scaling and more on theory or other insights.
@@DavenH Extremely valuable feedback, thank you!
I somehow tend not to go over 30 minutes; for now I am still figuring it out.
I agree, I even started writing down the amount of compute needed in my OneNote. 😅 And it's crazy.
I agree there is a lot of valuable research that will be done aside from mainstream deep learning, of that I'm certain.
Judea Pearl's ideas, etc.
Excellent explanation! Thank you.
Great work explaining the paper Aleksa
Thanks! 🙏😄
Thanks man, I really enjoy watching your explanations!
Thank you for this great video, it explains ViT very well. It took me a lot of time to understand the Transformer and BERT series. Your video makes the vision part much easier to understand.
thank you so much
Thx for the video!
Informative as anything. Definitely 25 min well spent.
Glad to hear that 😁
Thanks for the detailed overview. What exactly is class embedding? Why is it required?
Keep up the good work bro!
Thanks man!
The discussion in this video about the inductive bias of ResNets vs the unbiased Transformers got me thinking. Right now I'm doing a fun neural architecture search (NAS) project; it evolves architectures and tests each one on small amounts of data to compare against other architectures. This is somewhat similar to how Transformers learn dynamic routing, whereas the genetic search algorithm in the NAS space is "learning" this routing by discrete methods.
So if the comparison is valid (the learned inductive bias of Transformers vs the searched inductive bias of a NAS), I wonder which of these methods is more compute-efficient overall? NAS is slow indeed, as it churns through many small yet specialized architectures. But Transformers are perhaps equally slow, as they are searching over a similar space.
I suspect given constant compute resources, a Transformer-based arch vs a NAS-search over smaller nets, the Transformer probably comes out ahead still as its search method is at least exploiting a gradient, while evolutionary strategies scale poorly with parameter size.
Nice connection! Interesting way to look at it!
To me it looks like NAS is probably more flexible, but it's also more compute-intensive.
I didn't get to play with NAS so far so I can't get into any serious discussion without taking (at least) a day to investigate it a bit more.
Nice thoughts, keep them coming!
Did you get to play with GNNs? They are even more general than transformers, e.g. GAT.
Thanks a lot for your great explanation of how the Vision Transformer works.
Great work... keep going!
Nice job! Thank you very much!
Great work! Thank you!
Very nice!
Thanks man, it was really helpful!
You're welcome Alexander!
Nice work! Thank you! (Really nice prediction that big tech will have some large Transformer coming, now "Segment Anything" is here😁)
crazy to see that this was just 3 years ago
Are the position encodings learnt in the Vision Transformer? In the "Attention Is All You Need" transformer, positions are not learnt.
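For what it's worth, the ViT paper uses learnable 1D position embeddings, while the original "Attention Is All You Need" setup defaults to fixed sinusoidal encodings. A minimal PyTorch sketch of the contrast, with illustrative sizes:

```python
import math
import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 197   # illustrative: ViT-Base, 14*14 patches + 1 [class] token

# ViT: positions are LEARNT, i.e. a plain trainable parameter added to the token embeddings.
learned_pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

# Original "Attention Is All You Need": fixed sinusoidal encodings, nothing is trained.
def sinusoidal_pos_embed(num_tokens: int, embed_dim: int) -> torch.Tensor:
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)            # (num_tokens, 1)
    div = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / embed_dim))                         # (embed_dim/2,)
    pe = torch.zeros(num_tokens, embed_dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe.unsqueeze(0)                                                      # (1, num_tokens, embed_dim)
```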
Love it !
Glad to hear that!
There are so many applications for AI/ML. I'm curious, why am I not seeing this being applied? So many people and companies are claiming to do "AI/ML" but I'm not seeing commercial applications.
I love this paper. But I hate that I cannot train shit with this LOL
make more of these :)
The flattened patch is not 14*14 only. To flatten it, you have to take the channels into consideration too, so it's 14*14*3. Please correct me if I am wrong.
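You're right that the channels are part of it. A quick sketch of the flattening (patch size and image size here are just example values):

```python
import torch

B, C, H, W = 2, 3, 224, 224   # example batch of RGB images
P = 14                        # example patch size (the paper also uses 16 and 32)

x = torch.randn(B, C, H, W)

# Cut the image into non-overlapping P x P patches and flatten each one,
# keeping the channel dimension: every patch becomes a vector of length P*P*C.
patches = x.unfold(2, P, P).unfold(3, P, P)     # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, (H // P) * (W // P), C * P * P)

print(patches.shape)  # torch.Size([2, 256, 588]); 588 = 14*14*3, as you said
```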
Yes! NEW CLIP!
It's burning baby!
super cool
How would you implement this Vision Transformer? I think ImageNet is the natural dataset choice; however, trained only on ImageNet it is still worse than ResNet. I would start with the Imagenette dataset from FastAI for fast iterations, then switch to ImageNet trained on Lambda Labs' GPU cloud with 4 GPUs.
I am not 100% sure I understood your question. Did you mean to ask:
1. How would you train it, i.e. given the amount of compute it requires, what would be the correct machine setup?
2. How would we train it given that JFT-300M is not a public dataset?
3. Or did you mean to ask how to implement it? Which is fairly simple as it's almost a regular transformer (except for the input preprocessing part contained in the stem of the model).
@@TheAIEpiphany likely 3. I want to implement from scratch to understand all the details. So I will need to find a replacement for the dataset and maybe a smaller version of the model. I don't think that it makes sense to train for 600 GPU days as an individual researcher.
@@quanhua92 Neither could you unless your dad is Bill Gates haha.
Hm, check out my GitHub project, and also The Annotated Transformer and Jay Alammar's blog. I did a couple of videos on how I did it, maybe those could help as well.
Your question is basically "how do I implement the transformer". The preprocessing step is really simple, and that's what's beautiful about this model.
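To make the "simple preprocessing" point concrete, here is a minimal sketch of a ViT-style stem in PyTorch (sizes and names are illustrative, not the official code): a strided convolution does the patchify-and-project step in one shot, then a learnable [class] token is prepended and position embeddings are added. Everything after this is a standard transformer encoder plus a small classification head on the [class] token.

```python
import torch
import torch.nn as nn

class ViTStem(nn.Module):
    """Sketch of the ViT input stem: patchify, project, prepend the [class] token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Conv with kernel = stride = patch size is equivalent to
        # "cut into patches, flatten each one, multiply by a projection matrix".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.proj(x)                             # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)             # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)               # (B, 197, 768)
        return x + self.pos_embed                    # ready for a plain transformer encoder
```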
@@quanhua92 Did you get this working? I am looking for a thesis topic and I am wondering if it is feasible to make this work. Thanks.
Can you make a coding video on it?
Will do, check out my DINO video and let me know what u think.
@@TheAIEpiphany thanks? ahahha great video btw
@@jinorohit2921 failed to edit it lolz
@@jinorohit2921 thanks! 🤣
Why can't we just use THEIR pre-trained model? They already did it once; what's the sense in doing the same process over again, wasting energy, when they already have the model?
Bro this shit was helpful as fuck, u just helped me do my fucking capstone for MIT. Good looking out homie! I'm subbing
Cool
Thanks!
How do I convert a VIT-L_32.npz checkpoint to VIT-L_32.pt so I can load it with CLIP? Anyone know?
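Not sure of your exact setup, but the generic container conversion is roughly this (paths are placeholders; the tensor names inside the .npz will almost certainly not match CLIP's state_dict keys, so a model-specific key/shape remapping is still needed on top):

```python
import numpy as np
import torch

npz = np.load("VIT-L_32.npz")  # placeholder path to the JAX/Flax-style checkpoint

# Convert every array in the archive to a torch tensor.
state = {name: torch.from_numpy(np.asarray(npz[name])) for name in npz.files}

# NOTE: this only changes the file format. To load the weights into CLIP's vision
# tower you still have to rename keys (and likely transpose some matrices) so they
# match target_model.state_dict() of whichever implementation you are using.
torch.save(state, "VIT-L_32.pt")  # placeholder path
```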