It's just crazy to see Hu-po understand all the concepts in this paper; what an insane guy!
Parti prompts are all about "two dogs on the left, the dog on the left is black and the one on the right is white, and a cat on the right holding up its right paw, with 12 squares on the carpet and a triangle on the wall"
Summary starts at 1:52:24
*Abstract*
Stability AI, the open-source AI pioneers, have released a fantastic
paper on scaling diffusion models for high-resolution image
generation. This paper is a must-read - a deep dive into the math and
techniques used, packed with valuable experimental results. Let's
break it down:
* *Rectified Flow: Making Diffusion Efficient* Think of image
generation like getting from point A to point B. Traditional
diffusion models take a roundabout route, but rectified flow aims
for a straight line - more efficient and better results!
* *The Power of Simplicity:* Rectified flow is surprisingly simple,
  yet when combined with a clever time-step sampling technique
  ('logit-normal' sampling), it outperforms more complex methods. This
  saves researchers a ton of compute and energy!
* *New Architecture, Better Results:* Stability AI introduced a new
  transformer-based architecture (MM-DiT) that separates visual and
  text features, improving how the model understands both.
* *Scaling Up = Better Images:* Unsurprisingly, bigger models (within
reason) give better images. This is great news for the future of AI
image generation.
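The "straight line from A to B" idea above is simple enough to sketch in a few lines. This is a minimal, illustrative rectified-flow training step in plain NumPy; `zero_model` is a toy stand-in for the real denoising network:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(model, x0, t):
    """One rectified-flow training step on a batch of clean samples x0.

    The noisy sample lies on the straight line between data and noise:
        x_t = (1 - t) * x0 + t * eps
    and the regression target is the constant velocity along that line:
        v = eps - x0
    """
    eps = rng.standard_normal(x0.shape)
    t_ = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over sample dims
    x_t = (1.0 - t_) * x0 + t_ * eps
    v_target = eps - x0
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)

# Toy "model" that predicts zero velocity everywhere (so the loss is nonzero).
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal((8, 4))
t = rng.uniform(size=8)
loss = rectified_flow_loss(zero_model, x0, t)
print(float(loss))
```

Sampling then just integrates the predicted velocity from noise back to data, which is where the "straight path = fewer steps" efficiency comes from.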
Stability AI's focus on sharing their findings is admirable. This
paper helps the whole field, potentially saving tons of wasted compute
and making AI a bit more environmentally friendly.
Disclaimer: I used Gemini Advanced 1.0 (2024.03.04) to summarize the
video transcript. This method may make mistakes in recognizing words.
Super interesting to listen to someone with actual understanding of how all that magic works ;)!
The VAE scaling isn't new, it was shown in the EMU paper. One thing neither paper discusses is that there's an issue with scaling the channels - they have diminishing returns in terms of information density. For example, with d=8, if you sort the channels by PCA variance, the first 4 channels have the most information, then the next 2 have high-frequency detail, and the last two are mostly noisy. There's still valid information in that "noise", but it may not be worth the increased computational complexity. Alternatively, this could be a property of the KL regularization, which concentrates information density into a few channels.
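The per-channel variance ordering described above is easy to check empirically. A minimal sketch in NumPy; `latents` here is a random stand-in with hand-picked decaying scales to mimic the effect (real latents would come from a VAE encoder, shape [N, 8, H, W]):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for d=8 encoded latents; decaying per-channel scales mimic
# the concentration of information in the leading channels.
scales = np.array([8.0, 6.0, 4.0, 3.0, 1.5, 1.0, 0.3, 0.2])
latents = rng.standard_normal((64, 8, 16, 16)) * scales[None, :, None, None]

# Flatten to (samples, channels): each spatial position is one observation.
flat = latents.transpose(0, 2, 3, 1).reshape(-1, 8)
flat = flat - flat.mean(axis=0)

# PCA via covariance eigendecomposition; eigenvalues = explained variance.
cov = flat.T @ flat / (flat.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
explained = eigvals / eigvals.sum()
print(np.round(np.cumsum(explained), 3))
```

On latents with this kind of scale profile the cumulative explained variance saturates after the first few components, which is the diminishing-returns pattern the comment describes.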
The idea of shifting the timestep distribution was proposed in the UViT paper (Simple Diffusion); I'm surprised they did not reference it directly. That said, the UViT paper argued from a theoretical perspective, which does not necessarily align with human preferences.
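Both timestep-distribution tricks discussed in the paper are only a couple of lines each. A minimal sketch of the logit-normal sampler and the resolution-dependent timestep shift as described in the SD3 paper; the parameter values (`m`, `s`, `shift`) here are illustrative, not the paper's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_t(n, m=0.0, s=1.0):
    """Logit-normal timestep sampling: t = sigmoid(u), u ~ N(m, s).
    Concentrates training on middle timesteps rather than the uniform ends."""
    u = rng.normal(m, s, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def shift_t(t, shift=3.0):
    """Resolution-dependent timestep shift: for shift > 1, pushes sampling
    toward higher noise levels, as higher-resolution latents require."""
    return shift * t / (1.0 + (shift - 1.0) * t)

t = logit_normal_t(100_000)
print(round(float(t.mean()), 2))           # mass centered near 0.5
print(round(float(shift_t(t).mean()), 2))  # mean pushed toward 1.0
```

The shift is monotone above the identity for shift > 1, so every sampled timestep moves toward the high-noise end, matching the UViT/Simple Diffusion intuition that larger images need more noise at a given t.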
I wish they had done a more comprehensive search with the diffusion methods... It's missing some of the other training objectives (likely due to author bias and not compute), which means it's not quite as useful as claimed.
I have a doubt about your "increasing the complexity" part. Does increasing the channels increase complexity significantly? Increasing the spatial dimensions of the latent is costly.
@@KunalSwami Increasing channels does not significantly increase the diffusion model's computational complexity, but it increases the semantic complexity of the latent space (potentially making it harder to learn - there are ways around this), and it increases both the semantic and computational complexity of the VAE. The SD3 paper showed the former, where the VAE with more channels performed worse until a certain model size was reached (indicating that the latent space was harder to learn). The latter claim comes from anecdotal evidence from training VAEs - you typically need to increase the VAE base channel dim to support a deeper latent space, and VAEs can be quite slow to train.
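The computational-cost part of this is easy to sanity-check: in a latent diffusion backbone, only the input and output projections see the latent channel count, so going from 4 to 16 latent channels barely moves the parameter count. A rough sketch with a toy conv backbone (layer sizes are illustrative, not any real model's config):

```python
def conv_params(c_in, c_out, k=3):
    """Parameter count of a k x k conv layer: weights + bias."""
    return c_in * c_out * k * k + c_out

def backbone_params(latent_ch, width=320, depth=20):
    """Toy conv backbone: one stem conv from the latent, `depth` inner
    convs at a fixed width, one head conv back to the latent."""
    stem = conv_params(latent_ch, width)
    body = depth * conv_params(width, width)
    head = conv_params(width, latent_ch)
    return stem + body + head

p4, p16 = backbone_params(4), backbone_params(16)
print(p4, p16, f"{(p16 - p4) / p4:.2%}")  # relative increase is well under 1%
```

The VAE is the opposite story: its decoder has to reconstruct pixels from those channels, so widening the latent tends to force a wider (and slower-to-train) VAE.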
Great explanations, I can't wait to test the multimodal inputs
broo this is such an informative video man. kudos to you on making such complicated equations so easy and intuitive to understand for beginners
Intel does have a big investment.
Personally, I think they should sell the model.
Keep it open source, but sell rights to use the high end models.
That way, they have a solid business plan.
Thank you, helps a lot!! Next the SD3-Turbo Paper please :)
Talking about signal/noise ratio on a mic input that is clipping. Nice 😂
That's not what Parti prompts are for. It's not about visually pleasing images; it's about accurately captioned images.
Thanks for the clarification, sorry I got this wrong :(
Great video. Do you know if rectified flow is in diffusers?
What pdf reader do you use for annotation?
How many convergence points does the vector field have?
Really good video!
i was here
witnessed
dude, so much nonsense in one single video... you're the champ man. do you actually know anything that you're talking about? (rhetorical question, you obviously don't). by the time i got to your "what's a vector field" (@min 23) i just gave up (what you're showing in that image is a representation of a function f:R^2->R^2, which is anything but a vector field, it's a function bro, a function, get it?)