Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm
Can you explain how the (text) guidance with CLIP works? I can't find any information other than that CLIP is used to influence the UNet during training through attention layers (as also indicated by the "famous" LDM diagram). However, it is not mentioned how the CLIP embeddings are aligned with the latents used by the UNet or VAE. I suppose it must be involved in the training process somehow? Otherwise the embeddings could not be compatible, right?
@@christopherhornle4513 Hi Christopher! I'm preparing a video on how to code Stable Diffusion from scratch, without using any external library except for PyTorch. I'll explain how the UNet works and how CLIP works (with classifier and classifier-free guidance). I'll also explain advanced topics like score-based generative models and k-diffusion. The math is very hard, but I'll try to explain the concepts behind it rather than the proofs, so that people with little or no math background can understand what's going on even if they don't understand every detail. Since time is limited and the topic is vast, it will take me some more time before the video is ready. Please stay tuned!
@@umarjamilai That sounds awesome, thank you! I know pretty much how CLIP and the UNet work independently of each other. Cross-attention is also clear. I'm just wondering how the text embeddings can be compatible with the UNet if they come from a separate model (CLIP). I guess the UNet is trained by feeding in CLIP text via attention to reproduce the CLIP images (frozen VAE). Just strange that it's not mentioned in the places I looked.
@@christopherhornle4513 Let me simplify it for you: the UNet is a model trained to predict the noise added to a noisy image at a particular step of a time schedule, so given X + noise and the time step T, the UNet has to predict that noise (from which X can be recovered). During training, we not only provide X + noise and T, we also provide the CLIP embeddings (that is, the embeddings of the caption associated with the image). So during training the UNet receives X (image) + noise + T (time step) + CLIP_EMBEDDINGS (embeddings of the image's caption). When T = 1000, the image is complete noise from the Normal distribution. When you sample (generate an image), you start from complete noise (T = 1000). Since it is complete noise, the model could output any image when denoising, because it doesn't know which image the noise corresponds to. To "guide" the denoising process, the UNet needs some "help", and that guidance is your prompt. Since CLIP's embeddings kind of represent a language model, if an image was trained with the caption "red car with man driving" and you use the prompt "red car with woman driving", CLIP's embeddings will tell the UNet how to denoise the image so as to produce something close to your prompt. So, summarizing: the UNet and CLIP are connected because CLIP's embeddings (obtained by encoding the caption of the image being trained on) are fed into the UNet during training (through its attention layers), and CLIP's embeddings of your prompt are given as input to the UNet to help it denoise when generating an image. I hope this clarifies the process. In my next video, which hopefully will come within two weeks, I'll explain everything in detail.
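To make the recipe above concrete, here is a minimal PyTorch sketch of one text-conditioned training step. Everything in it is an illustrative stand-in: TinyCrossAttnDenoiser is a toy module with a single cross-attention layer rather than the real Stable Diffusion UNet, and random tensors take the place of real images and CLIP caption embeddings. Only the overall idea follows the explanation above: noise the image at a random timestep, feed noisy image + timestep + text embeddings, and regress the added noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCrossAttnDenoiser(nn.Module):
    """Toy noise predictor: image features attend to caption-token embeddings."""
    def __init__(self, channels=64, text_dim=32):
        super().__init__()
        self.inp = nn.Conv2d(3, channels, 3, padding=1)
        self.time_mlp = nn.Sequential(nn.Linear(1, channels), nn.SiLU(),
                                      nn.Linear(channels, channels))
        self.attn = nn.MultiheadAttention(channels, num_heads=4,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.out = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, noisy_x, t, text_emb):
        h = self.inp(noisy_x)                                  # (B, C, H, W)
        h = h + self.time_mlp(t.float().view(-1, 1))[:, :, None, None]
        b, c, hh, ww = h.shape
        q = h.flatten(2).transpose(1, 2)                       # (B, H*W, C): image "tokens"
        attn_out, _ = self.attn(q, text_emb, text_emb)         # cross-attention over caption tokens
        h = h + attn_out.transpose(1, 2).reshape(b, c, hh, ww)
        return self.out(h)                                     # predicted noise, same shape as x

# Linear beta schedule with T = 1000, as in the explanation above
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

model = TinyCrossAttnDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One toy training step (random tensors stand in for real images and CLIP text embeddings)
x = torch.randn(8, 3, 32, 32)         # "images" (or VAE latents, in latent diffusion)
text_emb = torch.randn(8, 77, 32)     # "CLIP" caption-token embeddings
t = torch.randint(0, T, (8,))
noise = torch.randn_like(x)
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
noisy_x = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise   # forward process: add noise at step t

pred_noise = model(noisy_x, t, text_emb)   # model sees noisy image + timestep + caption embeddings
loss = F.mse_loss(pred_noise, noise)       # ...and learns to predict the noise that was added
loss.backward()
opt.step()
opt.zero_grad()
```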
Thank you very much!! Now I understand: during training the UNet learns to predict the noise given the text embedding (plus the timestep and other inputs if provided). So it learns which (text) embeddings are associated with specific image features and with the noise predictions for those images. During sampling we start from noise (no encoded image) and provide an embedding, and the model uses it as guidance to denoise along the features it has learned to associate with that embedding.
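And a rough sketch of the sampling side described in this summary, continuing from the toy training sketch above (same model, T, betas, alphas_bar): start from pure noise and run the reverse process step by step, passing the prompt embedding at every step so it steers the denoising. This uses the plain DDPM update rule with simple conditioning only; a real Stable Diffusion sampler would also apply classifier-free guidance and decode the latent with the VAE.

```python
# Toy sampling loop: reuses model, T, betas, alphas_bar from the training sketch above.
alphas = 1.0 - betas
prompt_emb = torch.randn(1, 77, 32)          # stand-in for the CLIP embedding of the prompt

with torch.no_grad():
    x_t = torch.randn(1, 3, 32, 32)          # complete noise: the starting point of generation
    for step in reversed(range(T)):
        t_batch = torch.full((1,), step, dtype=torch.long)
        eps = model(x_t, t_batch, prompt_emb)            # noise prediction, conditioned on the prompt
        a, a_bar = alphas[step], alphas_bar[step]
        mean = (x_t - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
        if step > 0:
            x_t = mean + betas[step].sqrt() * torch.randn_like(x_t)  # add back a little noise
        else:
            x_t = mean                                               # final denoised sample
```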
This is a great conceptual breakdown of diffusion models, thank you!
I would love a video of you breaking down the math :)
Hi! A new video is coming soon :) stay tuned!
So cool! Thank you for your explanations
Thank you for the clear explanation!
Dear Sir, please make a video with a detailed explanation of the diffusion model code. It would be very helpful. Thanks for the valuable video!
I can only say: most genius guy ever!
Can you do the code for inpainting in a diffusion model, please?
I came here from your VAE video. After that, should I be doing the 5-hour-long Stable Diffusion video or this one? What do you suggest?
I watched the 5-hour one first and then came to this one. Now I would say I know how to train the model, thanks to Umar.
Great!!!!
It would be great if you could implement an example in the next tutorial, like you did for the transformer! 😊
You can start by browsing the code I've shared, because it's full working code to train a diffusion model. I'll try to make a video explaining each line of code as well.
@@umarjamilai Okay, thanks guys!
Can you do a BERT coding video?
Thanks for the suggestion, I'll try my best
Hi! My new video on BERT is out: ruclips.net/video/90mGPxR2GgY/видео.html