Wow, you made variational AEs and all the explanations so intuitive and easy to understand!
🙌🏼🙌🏼 Thanks! Glad it worked.
Nice explanation of the VQ-VAE. Thank you!
I accidentally discovered your videos and have been loving them ever since. Very elegant and simple explanations of complex concepts. Well done. Thanks a lot for making these. 🙂
😇
My mind can't decode into words how grateful I am. Great video!
Haha this made my day. Merry Christmas! :)
Your videos are great. I like the switching of scenes from outside to the whiteboard, etc. Really professional and engaging. Keep it up.
Awesome! Thanks!
Awesome video! This had the right balance of technical and intuitive details for me. Keep em coming!
Amazing explanation and video quality!
This is amazing! This is what science is supposed to be about!
brilliant explanation, thank you
Great video, this helps me a lot in trying to understand multimodal AI. Hope you will keep making this type of video!
Very well done, thank you very much
Absolutely wonderful!
🙌🏼 Thanks!!
You are very good at explaining, thanks!
Glad it was helpful!
Amazing video! Looking forward to training an LLM using VQ-VAE
Hell yeah! Let me know how it goes!
thanks, love your videos
Learnt something before hitting the bed lol! Thanks for this... I finally know something about Gemini and how it works.
Thanks! Glad you enjoyed it.
Accidentally saw your video and you earned a sub
Thanks! Welcome to the community!
Here from Reddit.
Great video, thought it might be over my head but not at all.
Also love the style 🏆
Awesome to know man! Thanks!
Thanks for the great video! I'm curious if there is existing work/paper on this LLM+VQ-VAE idea.
GREAT VIDEO!
This video has the feel of the Feynman Lectures.
Wow... that's high praise, my friend! Appreciate it.
@@avb_fj Keep them coming. As an SE and technologist, I like the way you present complex facts in a simplified way.
I LOVE HOW YOU START THE VIDEO
Haha thanks for the shoutout! I got that instant camera as a present the week I was working on the video, thought it’d be a perfect opportunity to play with it for the opening shot!
very nice!!
Why are diffusion models used more these days than VQ-VAE coupled with autoregressive transformers?
Diffusion models have in general shown more success in producing high-quality and diverse images, so they are the architecture of choice for text-to-image models. However, diffusion models can't easily produce text. VQ-VAE is a special architecture because it can be trained with an LLM, letting the LLM understand user-input images and generate images alongside text.
So… in short, if you want your model to input text + images AND output text + images, VQVAE+transformers are a great choice.
If you want to input text and generate images (no text), use something like stable diffusion with a control net.
Hope that’s helpful.
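To make the codebook idea above concrete, here is a minimal PyTorch-style sketch (not code from the video) of the vector-quantization step that turns encoder features into discrete image tokens an autoregressive transformer can then model alongside text. All names and sizes (codebook_size, embed_dim, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ-VAE quantizer: maps encoder features to codebook indices."""

    def __init__(self, codebook_size=1024, embed_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, z_e):
        # z_e: encoder features, shape (batch, height*width, embed_dim)
        flat = z_e.reshape(-1, z_e.size(-1))                      # (B*HW, D)
        # Squared L2 distance from every feature vector to every codebook entry.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))            # (B*HW, K)
        indices = dists.argmin(dim=-1).view(z_e.shape[:-1])       # (B, HW): discrete "image tokens"
        z_q = self.codebook(indices)                              # quantized features for the decoder
        # Straight-through estimator: copy gradients from z_q back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices

# Quick shape check with dummy encoder features.
vq = VectorQuantizer()
z_q, tokens = vq(torch.randn(2, 64, 256))
print(tokens.shape)  # torch.Size([2, 64]) -> 64 image tokens per image
```

The `indices` grid is what gets flattened and interleaved with text tokens, so a single autoregressive transformer can read and generate both modalities.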
@@avb_fj Yes, it helps a lot. I tried to produce images with VQ-VAE + text embeddings but I can't get diversity. Maybe a random layer in the first embedding before the image patches could be effective, I don't know; it seems that VQ-VAE alone can't produce good diversity. Maybe with a PixelCNN; I will try.
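On the diversity point: in the original VQ-VAE setup the decoder is deterministic, so sample variety has to come from sampling an autoregressive prior (e.g. a PixelCNN or transformer) over the codebook indices, not from the VQ-VAE itself. Below is a rough sketch of that sampling loop, where `prior_model` is a hypothetical autoregressive model over code indices, assumed to return next-code logits.

```python
import torch

@torch.no_grad()
def sample_image_tokens(prior_model, text_embedding, num_tokens=256, temperature=1.0):
    """Sample a sequence of codebook indices one at a time from a learned prior."""
    tokens = []
    for _ in range(num_tokens):
        logits = prior_model(text_embedding, tokens)           # logits over the codebook for the next code
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sampling (not argmax) is what gives diversity
        tokens.append(int(next_token))
    return tokens  # feed these indices to the frozen VQ-VAE decoder to get an image
```

Running this multiple times with the same text conditioning yields different index sequences, and hence different decoded images.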
What if I fine-tune a Mistral 7B for next-frame prediction on a big dataset of 1500 hours? What do you recommend for next-frame prediction (videos of a similar kind)?
Sounds like a pretty challenging task, especially because Mistral 7B (afaik) isn't a multimodal model. There might be a substantial domain shift in your finetuning data compared to the original text dataset it was trained on. If you want to use Mistral only, you may need to follow a VQ-VAE-like architecture (described in the video) to get a codebook-based image/video generation model that autoregressively generates visual content, similar to the original DALL-E. These are extremely compute-expensive because each video requires multiple frames (and each frame requires multiple tokens). It's hard to suggest anything without knowing more about the project (mainly compute budget, whether it needs to be multimodal, i.e. text + vision, whether it is purely an image reader or will also need to generate images, and if videos, how long), as the optimal answer changes accordingly: from VQ-VAE codebooks, to LLaVA-like models that only use image encoders and no image decoders/generation, to good old Conv-LSTM models that have huge memory benefits for video generation (but are hard to make multimodal), to hierarchical-attention-based models. I don't have a specific video that jumps to mind to share with you.
@@avb_fj The dataset videos are a minute long at 20 fps, with 128 tokens per frame, so 1200 * 128 tokens per video. The videos are highway car driving, and the model needs to generate the next frame so it looks like a real driving video. Imagine synthetic data for self-driving.
Also, we can condition the model with commands like "move left" (a set of discrete commands), and it would generate the next frame as if the car is moving to the left. There are about 100,000 videos, so 1650+ hours of video.
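A quick back-of-the-envelope calculation with the numbers from this thread shows why autoregressive video generation at this scale gets compute-expensive; the context-window figures are illustrative assumptions, not a claim about any particular checkpoint.

```python
# Numbers taken from the thread above.
frames_per_video = 60 * 20                      # 1-minute clips at 20 fps -> 1200 frames
tokens_per_frame = 128
tokens_per_video = frames_per_video * tokens_per_frame
print(tokens_per_video)                         # 153,600 tokens for a single video

num_videos = 100_000
print(f"{num_videos * tokens_per_video:,}")     # ~15.4 billion visual tokens in the corpus

# An 8k-token context window would cover only 8192 / 128 = 64 frames (~3.2 s of video),
# and a 32k window about 256 frames (~12.8 s), which is why generating full-length
# videos autoregressively at frame-token granularity gets expensive so quickly.
```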
Sir, could you refer me to some follow-up resources to learn more about this?
Hello! There are some papers and videos linked in the description as follow-up resources. I would also recommend searching Yannic Kilcher's channel to see if he has a video covering a topic you are interested in.
But I tested Gemini Vision, and it does not respond in real time... I have built several vision apps using the Gemini Vision API and none of them respond in real time. I think the Google video is a trick.
I LOVE this video! My algorithm knows EXACTLY what I want, and to think I got it served less than an hour after it was posted 🥲 I feel so special.
Shout out to the creator for making such a great video, and shout out to YouTube for bringing me here.
So glad! :)