You’ve earned my subscription!
Great work explaining ViT!
The Figure 9 similarity pattern - where tokens at the corners or edges have high similarity to the rest of the tokens along the boundary - could be due to the type of data: corner and edge tokens are generally background, and thus uniform and similar in nature.
I want to suggest this paper for a future video: "Variational Diffusion Models".
Great, I'll check it out! Feel free to share your suggestions on Discord as well: discord.com/invite/peBrCpheKE
Simon, Simon, diffusion models are hot right now!
Thank you for the explanation! A quick dumb question on this paper: what does the CLS token mean? And why do we have multiple of them when we are training for a classification task?
Could you explain the sqrt(18) part in a bit more detail? I could not quite follow how you arrived at that.
My guess is it's just the Euclidean distance.
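If that guess is right, here is one purely hypothetical way sqrt(18) could arise - as the Euclidean distance between two positions on the patch grid offset by 3 in each direction (the offset of 3 is my own illustrative assumption, not from the video):

```latex
% Hypothetical: sqrt(18) as the Euclidean distance for a (3, 3) offset
% on the patch grid.
d = \sqrt{3^2 + 3^2} = \sqrt{9 + 9} = \sqrt{18} = 3\sqrt{2} \approx 4.24
```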
Thanks a lot for a nice review of the paper. A few points raised questions in my mind. First of all, what is the purpose of the matrix H in the HSIC computation? It is actually a projection onto the subspace orthogonal to the vector of ones - so the comparison is between the non-uniformities inside the Gram matrices, in some sense? Is there an explanation for why they chose a 50-layer ResNet for comparison? It seems like a fairer comparison would be between models of comparable scale, say ResNet-152 - or should one not expect a noticeable change with this choice?
From my understanding, H is used for centering: multiplying a vector by the centering matrix has the same effect as subtracting the vector's mean. Also, I believe they have shown the comparison of the 14-patch ViT with ResNet-152 in the appendix (Figure B.1).
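To make the centering intuition concrete, here is a minimal numpy sketch (my own illustration of the standard CKA/HSIC formulation, not the paper's code):

```python
import numpy as np

def centering_matrix(n):
    # H = I - (1/n) * 1 1^T: projects onto the subspace orthogonal to
    # the all-ones vector, i.e. subtracts the mean.
    return np.eye(n) - np.ones((n, n)) / n

def hsic(K, L):
    # Biased HSIC estimator: compares the *centered* Gram matrices K and L.
    n = K.shape[0]
    H = centering_matrix(n)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Multiplying by H has the same effect as subtracting the mean:
x = np.array([1.0, 2.0, 6.0])
H = centering_matrix(3)
print(H @ x)         # [-2. -1.  3.]
print(x - x.mean())  # [-2. -1.  3.]
```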
When will you do a video on Swin Transformers?
Have you had a chance to try any of the notebooks where a ViT guides an image generation model, such as VQGAN, or even just raw RGB noise, to generate imagery from text prompts? The abilities of the ViT, even ViT-Base/32, are vast. Compared to ResNet-101, ResNet-50, ResNet-50x4, ResNet-50x16, etc. - we've experimented with all of the above - ResNet is absolute garbage compared to the ViTs lol. I don't think you've experienced what ViT is capable of, or else you'd be raving about it haha.
I'm not sure the scientists who created it even know what it can do, since literally all they talk about is classification and getting scores on benchmarks. Make an image. Tell it to create a universe where heads are upside down. Tell it to show an image of a car with square wheels. Then try ResNet - ResNet just falls short in every case.
Also, the smaller the patch size the better: a smaller patch size means higher effective resolution and more detail per image. Of course I'd love to try this with ViT-H/14, but I'm not advanced enough to rig Google's generic version for it - the regular ViT from Google doesn't have a text encoder trained with it multimodally.
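For anyone curious, here is a rough sketch of the kind of notebook being described, assuming OpenAI's clip package; the prompt, step count, and learning rate are illustrative. It nudges raw RGB pixels toward a text prompt by maximizing CLIP similarity; VQGAN guidance works the same way but optimizes latent codes instead of pixels.

```python
# Sketch of CLIP-guided generation from raw RGB noise, assuming
# OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # avoid fp16 issues when backpropagating on GPU
for p in model.parameters():
    p.requires_grad_(False)

# Encode the text prompt once.
text = clip.tokenize(["a car with square wheels"]).to(device)
text_feat = model.encode_text(text)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Raw RGB noise as the "image"; 224x224 is ViT-B/32's input resolution.
pixels = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    img = (pixels.clamp(0, 1) - mean) / std
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```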
I haven't yet, but I will - over the next period I'll be doing code walk-throughs. Thanks for flagging that!
And it makes sense - I guess the fact that spatial information is preserved contributes heavily to that.
@TheAIEpiphany It even suggests some knowledge of temporal flow, but I'm not 100% sure - it might be VQGAN 16384 itself.
Any links to these notebooks? Thank you!