Wow, you made variational AEs and all the explanations so intuitive and easy to understand!
🙌🏼🙌🏼 Thanks! Glad it worked.
Nice explanation of the VQ-VAE. Thank you!
I accidentally discovered your videos and have been loving them ever since. Very elegant and simple explanations of complex concepts. Well done. Thanks a lot for making these. 🙂
😇
My mind can't decode into words how grateful I am. Great video!
Haha this made my day. Merry Christmas! :)
Your videos are great. I like the switching of scenes from outside to the whiteboard, etc. Really professional and engaging. Keep it up.
Awesome! Thanks!
Awesome video! This had the right balance of technical and intuitive details for me. Keep em coming!
Amazing explanation and video quality!
This is amazing! This is what science is supposed to be about!
brilliant explanation, thank you
Great video, this helps me a lot in trying to understand multimodal AI. Hope you will keep making this type of video!
Very well done, thank you very much
Absolutely wonderful!
🙌🏼 Thanks!!
You are very good at explaining, thanks!
Glad it was helpful!
Amazing video! Looking forward to training an LLM using VQ-VAE
Hell yeah! Let me know how it goes!
thanks, love your videos
Learnt something before hitting the bed lol! Thanks for this... I finally know something about Gemini and how it works.
Thanks! Glad you enjoyed it.
Accidentally saw your video and you earned a sub
Thanks! Welcome to the community!
Here from Reddit.
Great video, thought it might be over my head but not at all.
Also love the style 🏆
Awesome to know man! Thanks!
Thanks for the great video! I'm curious if there is existing work/paper on this LLM+VQ-VAE idea.
GREAT VIDEO!
This video has the feel of the Feynman Lectures.
Wow... that's high praise, my friend! Appreciate it.
@@avb_fj Keep them coming. As an SE and technologist, I like the way you present complex facts in a simplified way.
I LOVE HOW YOU START THE VIDEO
Haha thanks for the shoutout! I got that instant camera as a present the week I was working on the video, thought it’d be a perfect opportunity to play with it for the opening shot!
very nice!!
Why are diffusion models used more these days than VQ-VAE coupled with autoregressive transformers?
Diffusion models have in general shown more success in producing high-quality and diverse images, so they are the architecture of choice for text-to-image models. However, diffusion models can't easily produce text. VQ-VAE is a special architecture because it can be trained with an LLM, letting the LLM understand user-input images and generate images alongside text.
So… in short, if you want your model to input text + images AND output text + images, VQVAE+transformers are a great choice.
If you want to input text and generate images (no text), use something like stable diffusion with a control net.
Hope that’s helpful.
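To make the codebook idea above concrete, here is a minimal PyTorch-style sketch (not code from the video) of the vector-quantization step that turns encoder features into discrete image tokens an autoregressive transformer can then model alongside text. All names and sizes (codebook_size, embed_dim, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ-VAE quantizer: maps encoder features to codebook indices."""

    def __init__(self, codebook_size=1024, embed_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, z_e):
        # z_e: encoder features, shape (batch, height*width, embed_dim)
        flat = z_e.reshape(-1, z_e.size(-1))                      # (B*HW, D)
        # Squared L2 distance from every feature vector to every codebook entry.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))            # (B*HW, K)
        indices = dists.argmin(dim=-1).view(z_e.shape[:-1])       # (B, HW): discrete "image tokens"
        z_q = self.codebook(indices)                              # quantized features for the decoder
        # Straight-through estimator: copy gradients from z_q back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices

# Quick shape check with dummy encoder features.
vq = VectorQuantizer()
z_q, tokens = vq(torch.randn(2, 64, 256))
print(tokens.shape)  # torch.Size([2, 64]) -> 64 image tokens per image
```

The `indices` grid is what gets flattened and interleaved with text tokens, so a single autoregressive transformer can read and generate both modalities.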
@@avb_fj Yes, it helps a lot. I tried to produce images with VQ-VAE + text embeddings but I can't get diversity. Maybe a random layer in the first embedding before the image patches could be effective, I don't know; it seems that VQ-VAE alone can't produce good diversity. Maybe with a PixelCNN; I will try.
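On the diversity point: in the original VQ-VAE setup the decoder is deterministic, so sample variety has to come from sampling an autoregressive prior (e.g. a PixelCNN or transformer) over the codebook indices, not from the VQ-VAE itself. Below is a rough sketch of that sampling loop, where `prior_model` is a hypothetical autoregressive model over code indices, assumed to return next-code logits.

```python
import torch

@torch.no_grad()
def sample_image_tokens(prior_model, text_embedding, num_tokens=256, temperature=1.0):
    """Sample a sequence of codebook indices one at a time from a learned prior."""
    tokens = []
    for _ in range(num_tokens):
        logits = prior_model(text_embedding, tokens)           # logits over the codebook for the next code
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sampling (not argmax) is what gives diversity
        tokens.append(int(next_token))
    return tokens  # feed these indices to the frozen VQ-VAE decoder to get an image
```

Running this multiple times with the same text conditioning yields different index sequences, and hence different decoded images.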
What if I fine-tune a Mistral 7B for next-frame prediction on a big dataset of 1500 hours? What do you recommend for next-frame prediction (videos of a similar kind)?
Sounds like a pretty challenging task, especially because Mistral 7B (afaik) isn't a multimodal model. There might be a substantial domain shift in your finetuning data compared to the original text dataset it was trained on. If you want to use Mistral only, you may need to follow a VQ-VAE-like architecture (described in the video) to get a codebook-based image/video generation model that autoregressively generates visual content, similar to the original DALL-E. These are extremely compute-expensive because each video requires multiple frames (and each frame requires multiple tokens). It's hard to suggest anything without knowing more about the project (mainly compute budget, whether it needs to be multimodal, i.e. text + vision, whether it is purely an image reader or will also need to generate images, and if videos, how long), as the optimal answer changes accordingly: from VQ-VAE codebooks, to LLaVA-like models that only use image encoders and no image decoders/generation, to good old Conv-LSTM models that have huge memory benefits for video generation (but are hard to make multimodal), to hierarchical-attention-based models. I don't have a specific video that jumps to mind to share with you.
@@avb_fj The dataset videos are a minute long at 20 fps, with 128 tokens per frame, so 1200 * 128 tokens per video. The videos are highway car driving, and the model needs to generate the next frame so it looks like a real driving video. Imagine synthetic data for self-driving.
Also, we can condition the model with commands like "move left" (a set of discrete commands), and it would generate the next frame as if the car is moving to the left. There are about 100,000 videos, so 1650+ hours of video.
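A quick back-of-the-envelope calculation with the numbers from this thread shows why autoregressive video generation at this scale gets compute-expensive; the context-window figures are illustrative assumptions, not a claim about any particular checkpoint.

```python
# Numbers taken from the thread above.
frames_per_video = 60 * 20                      # 1-minute clips at 20 fps -> 1200 frames
tokens_per_frame = 128
tokens_per_video = frames_per_video * tokens_per_frame
print(tokens_per_video)                         # 153,600 tokens for a single video

num_videos = 100_000
print(f"{num_videos * tokens_per_video:,}")     # ~15.4 billion visual tokens in the corpus

# An 8k-token context window would cover only 8192 / 128 = 64 frames (~3.2 s of video),
# and a 32k window about 256 frames (~12.8 s), which is why generating full-length
# videos autoregressively at frame-token granularity gets expensive so quickly.
```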
Sir, could you refer me to some follow-up resources to learn more about this?
Hello! There are some papers and videos linked in the description as follow-up resources. I would also recommend searching Yannic Kilcher's channel to see if he has a video covering a topic you are interested in.
But I tested Gemini Vision, and it does not respond in real time... I have built several vision apps using the Gemini Vision API and none of them respond in real time. I think the Google video is a trick.
I LOVE this video! My algorithm knows EXACTLY what I want, and to think I got it served less than an hour after it was posted 🥲 I feel so special.
Shout out to the creator for making such a great video, and shout out to YouTube for bringing me here.
So glad! :)