It's just crazy to see Hu-po understand all the concepts in this paper; what an insane guy!
Parti prompts are all about "two dogs on the left, the dog on the left is black and the one on the right is white, and a cat on the right holding up its right paw, with 12 squares on the carpet and a triangle on the wall"
Summary starts at 1:52:24
*Abstract*
Stability AI, the open-source AI pioneers, have released a fantastic
paper on scaling diffusion models for high-resolution image
generation. This paper is a must-read - a deep dive into the math and
techniques used, packed with valuable experimental results. Let's
break it down:
* *Rectified Flow: Making Diffusion Efficient* Think of image
generation like getting from point A to point B. Traditional
diffusion models take a roundabout route, but rectified flow aims
for a straight line - more efficient and better results!
* *The Power of Simplicity:* Rectified flow is surprisingly simple,
  yet when combined with a clever time-step sampling technique
  ('logit-normal' sampling), it outperforms more complex methods. This
  saves researchers a ton of compute and energy!
* *New Architecture, Better Results:* Stability AI introduced a new
  transformer-based architecture (MM-DiT) that separates visual and
  text features, improving how the model understands both.
* *Scaling Up = Better Images:* Unsurprisingly, bigger models (within
reason) give better images. This is great news for the future of AI
image generation.
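The "straight line from A to B" idea above is simple enough to sketch in a few lines. This is a minimal, illustrative rectified-flow training step in plain NumPy; `zero_model` is a toy stand-in for the real denoising network:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(model, x0, t):
    """One rectified-flow training step on a batch of clean samples x0.

    The noisy sample lies on the straight line between data and noise:
        x_t = (1 - t) * x0 + t * eps
    and the regression target is the constant velocity along that line:
        v = eps - x0
    """
    eps = rng.standard_normal(x0.shape)
    t_ = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over sample dims
    x_t = (1.0 - t_) * x0 + t_ * eps
    v_target = eps - x0
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)

# Toy "model" that predicts zero velocity everywhere (so the loss is nonzero).
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal((8, 4))
t = rng.uniform(size=8)
loss = rectified_flow_loss(zero_model, x0, t)
print(float(loss))
```

Sampling then just integrates the predicted velocity from noise back to data, which is where the "straight path = fewer steps" efficiency comes from.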
Stability AI's focus on sharing their findings is admirable. This
paper helps the whole field, potentially saving tons of wasted compute
and making AI a bit more environmentally friendly.
Disclaimer: I used Gemini Advanced 1.0 (2024.03.04) to summarize the
video transcript. This method may make mistakes in recognizing words.
Super interesting to listen to someone with actual understanding of how all that magic works ;)!
The VAE scaling isn't new, it was shown in the EMU paper. One thing neither paper discusses is that there's an issue with scaling the channels - they have diminishing returns in terms of information density. For example, with d=8, if you sort the channels by PCA variance, the first 4 channels have the most information, then the next 2 have high-frequency detail, and the last two are mostly noisy. There's still valid information in that "noise", but it may not be worth the increased computational complexity. Alternatively, this could be a property of the KL regularization, which concentrates information density into a few channels.
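The per-channel variance ordering described above is easy to check empirically. A minimal sketch in NumPy; `latents` here is a random stand-in with hand-picked decaying scales to mimic the effect (real latents would come from a VAE encoder, shape [N, 8, H, W]):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for d=8 encoded latents; decaying per-channel scales mimic
# the concentration of information in the leading channels.
scales = np.array([8.0, 6.0, 4.0, 3.0, 1.5, 1.0, 0.3, 0.2])
latents = rng.standard_normal((64, 8, 16, 16)) * scales[None, :, None, None]

# Flatten to (samples, channels): each spatial position is one observation.
flat = latents.transpose(0, 2, 3, 1).reshape(-1, 8)
flat = flat - flat.mean(axis=0)

# PCA via covariance eigendecomposition; eigenvalues = explained variance.
cov = flat.T @ flat / (flat.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
explained = eigvals / eigvals.sum()
print(np.round(np.cumsum(explained), 3))
```

On latents with this kind of scale profile the cumulative explained variance saturates after the first few components, which is the diminishing-returns pattern the comment describes.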
The idea of shifting the timestep distribution was proposed in the UViT paper (Simple Diffusion); I'm surprised they did not reference it directly. That said, the UViT paper argued from a theoretical perspective, which does not necessarily align with human preferences.
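Both timestep-distribution tricks discussed in the paper are only a couple of lines each. A minimal sketch of the logit-normal sampler and the resolution-dependent timestep shift as described in the SD3 paper; the parameter values (`m`, `s`, `shift`) here are illustrative, not the paper's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_t(n, m=0.0, s=1.0):
    """Logit-normal timestep sampling: t = sigmoid(u), u ~ N(m, s).
    Concentrates training on middle timesteps rather than the uniform ends."""
    u = rng.normal(m, s, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def shift_t(t, shift=3.0):
    """Resolution-dependent timestep shift: for shift > 1, pushes sampling
    toward higher noise levels, as higher-resolution latents require."""
    return shift * t / (1.0 + (shift - 1.0) * t)

t = logit_normal_t(100_000)
print(round(float(t.mean()), 2))           # mass centered near 0.5
print(round(float(shift_t(t).mean()), 2))  # mean pushed toward 1.0
```

The shift is monotone above the identity for shift > 1, so every sampled timestep moves toward the high-noise end, matching the UViT/Simple Diffusion intuition that larger images need more noise at a given t.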
I wish they had done a more comprehensive search with the diffusion methods... It's missing some of the other training objectives (likely due to author bias and not compute), which means it's not quite as useful as claimed.
I have a doubt about your "increasing the complexity" part. Does increasing the channels increase complexity significantly? Increasing the spatial dimensions of the latent is costly.
@@KunalSwami Increasing channels does not significantly increase the diffusion model's computational complexity, but it increases the semantic complexity of the latent space (potentially making it harder to learn - there are ways around this), and it increases both the semantic and computational complexity of the VAE. The SD3 paper showed the former, where the VAE with more channels performed worse until a certain model size was reached (indicating that the latent space was harder to learn). The latter claim comes from anecdotal evidence from training VAEs - you typically need to increase the VAE base channel dim to support a deeper latent space, and VAEs can be quite slow to train.
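The computational-cost part of this is easy to sanity-check: in a latent diffusion backbone, only the input and output projections see the latent channel count, so going from 4 to 16 latent channels barely moves the parameter count. A rough sketch with a toy conv backbone (layer sizes are illustrative, not any real model's config):

```python
def conv_params(c_in, c_out, k=3):
    """Parameter count of a k x k conv layer: weights + bias."""
    return c_in * c_out * k * k + c_out

def backbone_params(latent_ch, width=320, depth=20):
    """Toy conv backbone: one stem conv from the latent, `depth` inner
    convs at a fixed width, one head conv back to the latent."""
    stem = conv_params(latent_ch, width)
    body = depth * conv_params(width, width)
    head = conv_params(width, latent_ch)
    return stem + body + head

p4, p16 = backbone_params(4), backbone_params(16)
print(p4, p16, f"{(p16 - p4) / p4:.2%}")  # relative increase is well under 1%
```

The VAE is the opposite story: its decoder has to reconstruct pixels from those channels, so widening the latent tends to force a wider (and slower-to-train) VAE.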
Great explanations, I can't wait to test the multimodal inputs
broo this is such an informative video man. kudos to you on making such complicated equations so easy and intuitive to understand for beginners
Intel does have a big investment.
Personally, I think they should sell the model.
Keep it open source, but sell rights to use the high end models.
That way, they have a solid business plan.
Thank you, helps a lot!! Next the SD3-Turbo Paper please :)
Talking about signal/noise ratio on a mic input that is clipping. Nice 😂
That's not what Parti prompts are for. It's not about visually pleasing images; it's about accurately captioned images.
Thanks for the clarification, sorry I got this wrong :(
Great video. Do you know if rectified flow is in diffusers?
What pdf reader do you use for annotation?
How many convergence points does the vector field have?
Really good video!
i was here
witnessed
dude, so much nonsense in one single video... you're the champ man. do you actually know anything that you're talking about? (rhetorical question, you obviously don't). by the time i got to your "what's a vector field" (@min 23) i just gave up (what you're showing in that image is a representation of a function f:R^2->R^2, which is anything but a vector field, it's a function bro, a function, get it?)