Stable Diffusion 3 - RAW First Impression!
- Published: 22 Feb 2024
- Stable Diffusion 3 - Review and Critical Look. How good is it really? Where are its limits? What can SD3 really do better? Let's have a look.
My Twitter: / oliviosarikas
My Patreon: / sarikas
#### Links from my Video ####
stability.ai/stablediffusion3
stability.ai/news/stable-diff...
/ 1760660709308846135
/ 1760783270105457149
/ 1760725050095747249
/ 1760676926421954736
/ 1760676723836993554
/ 1760676074491687310
/ 1760668434772156552
/ 1760666901326315870
/ 1760656804206317918
#### Join and Support me ####
Buy me a Coffee: www.buymeacoffee.com/oliviotu...
Join my Facebook Group: / theairevolution
Join my Discord Group: / discord
AI Newsletter: oliviotutorials.podia.com/new...
Support me on Patreon: / sarikas
My Twitter: twitter.com/OlivioSarikas
My Patreon: www.patreon.com/sarikas
👋
I hear Gemini is bad at handling colors, especially the color white.
SD3 looks pretty amazing, much more useful than MJ tbh. Midjourney turns every prompt into a dramatic artwork with deep shadows and out of focus lights. Cool, but you didn't ask for that. You ask for computer, SD3 gives you computer. Perfect.
Noooooo.....don't zoom in on the clowns!!!! Now I'll have nightmares.....
Bwahahhahahaha!!!! 🤡🤡🤡🤡🤡
🤡
😂
The most impressive image is the bottle and the triangle square thing afterwards. Imagine prompting for a beautiful girl wearing blue jeans, a yellow shirt, and red hair. And actually getting what we prompted. Without any ComfyUI shenanigans, simply by prompting. Greetings from vienna, love your content.
You can in A1111: activate Regional Prompter and just separate what you want with the BREAK keyword. It works 10/10 times with something so simple.
Dude, just get the AlbedoBase checkpoint for SDXL. I pasted your "prompt" from this comment and it did it for me correctly literally on the first try. Btw I did a batch of two and both images are good with what you asked for. I'm using the A1111 webui without doing anything else, just pasting your text "a beautiful girl... etc". And the hand is also correct in the first pic; the second one nicely cropped it at the image border, so it's also "good" since it's not visible, so no chance for it to mess up lol. The hands are about 50% of the time good or acceptable with this model, which is pretty good.
SDXL is pretty decent when it comes to adhering to character- and portrait-related prompts, especially the OpenDall-E model.
But for sure SD3 is several steps beyond that.
They are so quick in DEV. Can't wait to see more soon. ❤
i hope we get to play with this soon
@@OlivioSarikas, me too!
Midjourney is intended to be very hands off -- You want a fast and fantastic result with a short prompt, so it's less realistic and less controllable, but "pretty by default".
Stable Diffusion being "boring by default" is a good thing. It should be realistic and prompt-following, because while it requires more work, it allows for greater creative control by the end user.
I hope the community can work toward a non-Stability, fully-open-source model that’s open for commercial use. Stability has trended toward the same censored and eventually proprietary path that “Open”AI went down. Had 1.5 not been released by their partner at the time, we never would have gotten it. And they’re no longer releasing models for free commercial use.
@8:32 => if you look at the reflections of the light sources off the bottles, you can see that the size of the reflections changes as you move to the next adjacent bottles. This happens due to proximity to the light sources. Therefore, the green bottle is closest to the right light source, and so it has the biggest right-light-source reflection but the smallest left-light-source reflection. This is actually impressive!!
I wonder if the low-parameter models would make domain-limited fine-tuning/training more accessible/faster. For example, say you want to make a model that generates South Park characters, and does that really well, but only that. Like the prompt 'a person' would generate a random South Park character.
Pretty sure this is covered with Lora and Textual Inversions.
@@Zhoul-is-back TI would take forever to train :D Haven't actually thought about a Lora being able to do this (or tried, rather). I suspect it will take a lot more time too, though. I was more thinking about 'removing' prior knowledge than transforming it.
I hope. The prompting seems powerful.
You know what will be mindblowing? That would be feet and hands that are not deformed or missing.
Yeah, I was waiting for you to talk about it!
thank you :)
Thanks for the video. Everything is developing so fast: last week Sora, now this. I am curious when DALL-E 4 comes out. Yesterday? Tomorrow? Next week?
Why are they so worried about text?
They could invest more in better hands and details in general.
Screw text! haha ha
It's the plebian way to demo that something "follows prompt". Sadly I agree with you in this case, you could reasonably train a separate LoRA or module for fixing text, the overfocus on text over details is kind of a bummer. The other thing I worry is they repeatedly mention "safety" in their blog post, I wonder if it means greatly reduced NSFW or human face generation or at least CLIP capabilities in this area. I know this is an industry wide pressure campaign thing but still, the lack in NSFW will matter more in SD than the other AIs as it is one of SD's traditional strength over the others, one of the main reasons to use SD over the others.
agree!
@@zxbc1 SDXL has greatly reduced capabilities to create NSFW. Also, SDXL's creativity is just not that great. From what I can see, SD 3 will be even worse. I use SDXL for some generations and use SD 1.5 on top of the image to make it more realistic, because SDXL's photorealism is weaker than SD 1.5's. Some community models of SDXL can create good pictures, but if you really want to add detail you can only do it with SD 1.5... and people think SDXL is "higher" resolution... it's actually lower resolution. SD 1.5 is still the best AI model for images. The most photorealistic-looking images I have are from SD 1.5, not SDXL. I actually asked my wife to look at pictures and tell me which ones are photos and which ones are AI generated, and she couldn't believe the SD 1.5 ones were not photos. I had to demonstrate how the image is generated and not a real photo; she thought the SD 1.5 images were photos. I even made slight adjustments so she could see how the thing actually works and that it's not some photo. For SDXL she was not that impressed and immediately told me these were AI generations.
@@zxbc1 there will always be previous SD models, the NSFW cat is out of the bag
The reason why they are so worried about text is that it's the biggest untackled problem for realism. The first giveaway that an image is AI generated is that cities do not have any legible words. Unless you are in nature, it is completely impossible to look around you and not see any words: shop signs, street signs, posters, etc. While there are many options to improve hands (ADetailer, inpainting, LoRAs), there isn't a single way to improve text.
Thanks!
Thank you so much again for your support ❤️
Always like Olivio 😍
very interested in this!
So... basically, SDXL wasn't a "soft reboot" toward SD3 at all, but just a double-resolution version of ordinary SD, which served as filler before SD3 comes out. Speaking of small details: they should implement some kind of after-inpainter, which automatically masks areas that were poorly rendered during the first run and re-renders them separately in a second run.
You seriously can't see the difference between SD and SDXL? The quality is miles away lol.
How would the AI understand a "poorly rendered area" so it can automatically fix it? I don't think there is such a thing.
The last thing on my mind to fix in diffusion is Text .. 🤦♂
How so? It's impossible to create realistic city shots with the garbled nonsense you get now.
Dude... the clowns are perfect. They are as scary as heck, just like in real life.
Nice video, Olivio! Let's see when it arrives to the public! I agree that SD3 looks less stylish than MJ, but the realistic feeling that SD3 achieves makes it easier to be confused about whether it's real or not, and not only in an artistic way. What do you think? And thank you for your videos; they're amazing!
The picture result (the style) isn't a negative point in my opinion. In fact, I like what the SD3 model gives out, because the prompt didn't mention the style of the picture.
What I mean is, the more specific you are in the prompt, the more accurate the result will be, not like Midjourney, which throws in its style even if you don't mention one.
The way I see SD3 is that it follows your words precisely, which I love and am very excited about, and it tops all other AI models.
About the hands and fingers, I am with you on that, especially the fingers of that robot, but it is something that can be solved easily, I guess.
Let's see when we try it. I applied for access and got a reply that they will send me an invitation soon. I can't wait to try it out.
Absolutely. MJ makes great images but SD3 gives what you ask. I usually don't want the extra aesthetic gravy that MJ pours over every image because it drowns out my own ideas.
It's actually better to not have Midjourney styles burned into the base model. It will make it much easier to burn any other style into the model. It's kind of like working with 3D content, where the base assets are realistic and then you stylize them manually into anime, WoW, Fortnite, or Pixar styles.
Those bottles are an excellent example. Just prompting for the bottles gives you a plain room and just the bottles. MJ adds tons of clutter noise in the background and stylistic lighting choices. So if you wanted plain bottles minus all that noise, you'd need to waste tokens fighting off that style.
I liked the text in image results, hope in the future we don’t need a controlnet or inpainting for that
Your comment on the cat's small head had me laughing.
I wonder how compositions will be! Because sending that back to really cool, already established models would be awesome.
Do you know anything about the release? I haven’t gotten an invite yet and was wondering if you know more.
Any idea when we will have access to it? thanks
Multi-modal is UI-design language focused on the interaction mode. I would guess that e.g. voice inputs could be meant by this - or Neuralinks! 😮
Do you think SD 1.5 is still more reliable than this? I tried to invest my time learning Midjourney, but everything gets a cinematic final touch. So I started with Stable Diffusion; I hope I won't regret it.
are we finally getting a 1.5 replacement?
Aw. I need to buy another 2TB SSD. Again.
If you get access, test its limitations with a prompt like 44 items:
a photo of a lineup of a dog, cat, bird, frog, lizard, elephant, pig, squirrel, monkey, rabbit, chicken, cow, fox, horse, giraffe, deer, camel, bear, goat, deer, alligator, snake, turtle, rhinoceros, raccoon, buffalo, rat, otter, walrus, lion, tiger, crab, ostrich, beaver, spider, butterfly, beetle, kangaroo, hedgehog, koala, skunk, opossum, lobster and a parrot
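Side note: that list actually contains "deer" twice, so it's 44 items but only 43 unique animals. A quick sanity check in Python (just splitting the prompt string, nothing model-related):

```python
# Count total vs unique items in the animal-lineup prompt.
prompt = ("a photo of a lineup of a dog, cat, bird, frog, lizard, elephant, pig, "
          "squirrel, monkey, rabbit, chicken, cow, fox, horse, giraffe, deer, camel, "
          "bear, goat, deer, alligator, snake, turtle, rhinoceros, raccoon, buffalo, "
          "rat, otter, walrus, lion, tiger, crab, ostrich, beaver, spider, butterfly, "
          "beetle, kangaroo, hedgehog, koala, skunk, opossum, lobster and a parrot")

# Strip the leading phrase, normalize the final "and a", then split on commas.
items = prompt.removeprefix("a photo of a lineup of a ").replace(" and a ", ", ").split(", ")
print(len(items), len(set(items)))  # 44 listed, 43 unique ("deer" appears twice)
```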
Oh my goodness, if you're eating THAT much wasabi with your sushi, you must have a digestive tract of steel *HOT*
Saitama enters the chat
@@OlivioSarikas LOL!!!! 🤣🤣🤣🤣🤣
Those clowns looked like "It 2" style horror clowns, complete with their own disturbing physical mutations added by the AI.
the large zombie hand certainly had IT powers ;)
Thanks for nit-picking the details.
It may seem hateful to people new to this, but once you are generating hundreds of images, we all do it and we all favor models that are better and more accurate.
Oh yes, easy background change is a must. Something like what Photoshop is doing would be amazing.
7:54 - We are in the middle of the journey!
Ba-dum tsss! xD
If they nail text it will really be amazing!
me too
I don't understand that obsession with text; you can add text with ControlNet or inpainting. It's one of the easiest things to do... generating correct hands is much more important and much harder to repair.
Will SD3 also be possible to use on, e.g., a Mac M1 without frying it?
Pretty sure that should work, since the M1 is supported by A1111 as far as I know. Not a Mac user though.
They mentioned a 800M parameters option. For comparison, SDXL is 3.5B and SD1.5 is about 1B parameters.
We'll see...
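For a rough sense of what those parameter counts mean on disk and in VRAM, here's a back-of-envelope sketch. Caveat: these are the figures quoted in this thread (800M for the smallest SD3 option, 866M for SD1.5, 3.5B for SDXL), not official specs, and fp16 (2 bytes per weight) is just the common default precision:

```python
# Rough fp16 checkpoint-size estimate from parameter count.
def fp16_size_gb(params: float) -> float:
    """Bytes for fp16 weights (2 bytes/param), converted to GiB."""
    return params * 2 / 1024**3

# Parameter figures as quoted in the comments above (not official).
for name, params in [("SD3 small", 0.8e9), ("SD1.5", 0.866e9), ("SDXL", 3.5e9)]:
    print(f"{name}: ~{fp16_size_gb(params):.1f} GiB in fp16")
```

So an 800M-parameter model would be around 1.5 GiB of weights in fp16, which is why the small SD3 variant raises hopes for low-VRAM hardware.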
Hi Olivio, I have a few questions regarding SD.
Hoping it is an SD1.5 / SD2 improvement at 512x512 / 768x768, because SDXL is the heaviest and we need an in-between.
It's likely SDXL resolution, at least the ones showcased are.
But the 800M parameter model should be SD1.5 resolution, I guess, because SD1.5 itself is 866M parameters.
But since it's a different architecture, I don't know.
"...it's very good at text..." which they also said about SDXL and I have never found that to be true. :(
Midjourney looks like SD with random LORAs added. The problem is that you can't specify the LORA.
It's over for OpenAI and Midjourney.
Hmmm, that remark about aesthetics of MidJourney vs StableDiffusion 3... If the prompting of SD3 is as good as it appears to be here then I'm guessing (hoping?) that, given the correct prompting, then SD3 might do better than these *examples* are showing. BUT "Beauty is in the eye of the beholder", right? I also really don't want to see a "one size fits all" thing - it cramps style, spontaneity, experimentation and future development.
I actually don't understand Olivio's argument about the lack of styling from SD3 in the examples. There is no style mentioned in the prompts for SD3, so why would it create stylized images? The default SD generations without a style/modifier in the prompt are photos, or something close to photos.
If I will be able to train my own models and Loras, just give it to me :D :D Fingers crossed.
The better AI gets at placing text, the closer it comes to taking all graphic designers' jobs. Now with the new version of SD, or even MJ, you can place simple text correctly. Two more papers (updates) down the line and it will be able to design a complete infographic or even a magazine page layout. 😢
I hope they improve hands! ✋ ...why are they so obsessed with the text stuff?
@2:22
"If you eat sushi in the morning, sushi in the evening, and sushi for supper, you wouldn't be so fat" - They Call Me Bruce
ruclips.net/video/oqEtrmqmgus/видео.html
I think prompt understanding is a huge leg up over the competition, but we still yet have to see how it handles more esoteric imagery like "Gandalf blowing a smoke cloud in the shape of a pirate ship." As for the horrible, terrible hands, hopefully an improved hand detailer will come at some point.
While I'm definitely interested and it looks great, like Sora AI, it seems a bit too good to be true. Even allowing for cherry picking.
I hope it delivers but I'll need to see it in action before I fully believe it.
Just to say but we can get sushi around here where fish is wrapped around or is on top of the roll.
I kind of feel like they are focusing a bit too much on text.
Yeah, how many re rolls did SD3 have to do
It's very likely the model is quite censored in terms of human anatomy, hence why it's arguably gone backwards from 1.5 in a lot of anatomical details, and the only images they were forthcoming with were of clowns. You would think you wouldn't need fully nude people to get those sorts of details right, but maybe they've had to be really overzealous in pruning the data set to ensure it has no idea of what a naked human body looks like. A bunch of pictures of people in baggy, oversized clothes, perhaps.
Video generation is very interesting 🤔, I hope they get this running on multigpu setups.
yes, looking forward to video too. hm.. you think many users will have multiple GPUs or are you thinking about professional use?
@@OlivioSarikas multi-GPU is also interesting for casual users: if you upgrade your GPU you can just leave the old one in and utilize both for a significant boost.
My concern is that they may be throwing away the aesthetic part of image generation in favor of this kind of 'comprehension' that sort of looks like a cheap photoshop collage. Sure I can have bottles accurately labeled 1 2 and 3 but if it looks like a cheap render, why bother? Same with the shapes/dog/cat one. None of the results look particularly pleasing to look at which is a concern when it's a pattern across every image I've seen from this model. And these are the ones they want us to see.
I just hope it's good enough for the community to breathe life into.
Wait... Why you have Thai Twitter trends?
i hope they can run on low vram TwT
Today, "low VRAM" is 8 GB or less ;-)
Why did you rush through that final image just saying it was very nice without scrutinizing the details like you did with the rest of the images? I would point out that the smoke words are in the wrong place, replacing most of the train cars. It would be much better if the smoke words had replaced most of the trail of smoke instead of the train cars and if they had been in a billowy flowing cursive script instead of separate block letters.
Interesting point of view. I thought having it as the train cars is pretty sweet, even if it doesn't make sense for the smoke to be a car, as you say.
Why, oh why, can they not just commit to getting the hands right??
What about AMD GPUs? :D
several UIs support AMD AI rendering
@@OlivioSarikas Indeed, A1111 and ComfyUI work reasonably well with my old 8GB AMD Radeon RX 580X through DirectML. Still no xformers support, though.
@@OlivioSarikas but not all extensions, like ReActor.
@@VisualWebsolutions True. That's because programming CUDA cores is much simpler than just using general shaders (as far as I know).
Yeah SD3, be ready to need a 4090 to run it locally.😉
I run SD on a Pi 5; the 800M model should maybe work on the Pi 5.
Honestly, I'm not that impressed. I mean, working text is nice, but that's not enough to justify the need for more space and power once again. Unless the color thing actually works most of the time. Oh well, I still think they should just upgrade 1.5.
I think getting AI to master hands, feet and limbs need to be the priority.
Funny: I like the SD3 style more than the MJ style at 6:20
I guess I'm tired of seeing the MJ style.
interesting take
yeah, same here
The MJ style makes an image very easy to notice as AI generated.
@@reytr21 exactly.
@@OlivioSarikas also, you mentioned with each image how much you liked Midjourney's "aesthetics", but these examples were about prompt adherence. It doesn't matter how fancy an image looks if it isn't what you wanted. Like the computer image: did they ask for a dystopian-looking computer? The styling is built into Midjourney, but with SD you can easily add effects and characteristics to the image prompt if you want, and you aren't stuck with what Midjourney pumps out. With Midjourney you get what they decide to give you. With SD variants you get hundreds of tuned models and LORAs to tailor and customize your image.
Until anyone can do hands and follow prompts like D3 …. It’s just second tier.
The sushi ripped up because the raccoon ate some. Obviously.
Well... all of these tools are only focusing on standard graphics stuff.
But not one of them masters architecture in even an acceptable way. Waiting for this...
There are models that are trained on architecture ;-)
Same for pixel art and so much more!
It's just the general models that do everything. And that "everything" is mediocre.
@@igorthelight Well, you know, I tested many of them in a big architectural company.
Most of them don't get even close yet to creating decent, consistent images that you could prompt for a specific style.
It works sometimes for family houses, rooms, or small business interiors.
But what we do in our company are airports, giant commercial districts, and stadiums.
Here they are all failing.
The best so far is still Realistic Vision.
The problem in real architecture, compared to advertising and graphics work, is that you don't just need somewhat nice-looking images.
You need to pull off exactly the precise style you want, with perfectly linear lines and logical facades.
I know IP-Adapter helps out here a bit, and with ControlNet you get somewhat in the right direction, but not yet with seriously big architecture and images for competitions.
You can't even ask the AI for architectural styles besides Zaha Hadid or Gothic.
So... it's really a long road to go here.
@@tomschuelke7955 Agree! Today what models could generate is somewhat like fuzzy drawings. I think we need at least 5 more years!
the search and replace vid is not about SD3, fyi.
it's not? but the tweet it's in talks about SD3
@Sarikas Emad was talking about things "other than SD3". I also worked on the model used to make that video (not DreamShaper). Side note: the SD3 we're showing is not even half cooked. The latest examples I posted on Twitter are already newer and much better, and it's still not close to being final.
you eat wrong sushi.
sushi with fish on top exists.
The chef was drunk while slicing the sushi, but overall it's more than okay.
That fish on top of the sushi rolls is sashimi. It's a type of sushi.
Sashimi isn't actually a type of sushi :P
hm... you are right. i love sushi, but i have never seen that
This would still be considered sushi. Sometimes it's all inside in a 'roll'; sometimes slices of fish/seafood are on top. It being raw doesn't make it sashimi though; sashimi isn't served on top of or inside rice. If it's served with rice, it would be on the side.
I certainly hope it does eat D3 for breakfast (or at least makes a darn good meal out of it)
There are still things that, even with all the handholding, forced guidance, and blood sacrifices to every god ever conceived, SD1.5 - XL cannot do (or get beyond tripped up in pitiful attempts).
And that's no fun 👿
Even though the robot image at 2:30 has a lot of background noise, there are some very interesting details if you pay close attention. For example, those bags on the shelf are clearly Fritos bags, and there are Coca-Cola bottles on the floor in the background. I don't know if Wal-Mart was specifically used in the prompt, but you can clearly tell the image takes place in one.
"Real Artists" are going to be even more pissed with this...love it 😂
No, the clowns weren't flawed. It wasn't even AI, it was nightmare fuel. Also, what horsepower is this new text model going to need?
Quite underwhelming. Most of the examples look like creepy low-res stock photos full of glitches and errors still. Makes you wonder what this is really good for? Legally and creatively, you can just download an image off Google in many cases. Illustrations fare better in generators, then again while the form is better, they are almost always dull in terms of expression and pose and make you feel empty and numb.
Stable Diffusion 3 still feels like cartoonish, childlike art.
Either keep the beard dyed or don't
It's the moon phases. I'm a celestial being. They call me the grey walker
Olivio stop calling it twitter. It's 'X' 😭😭
Don't be such a slave to branding ;)
@@OlivioSarikas 😅lol
Graphic Designers => unemployed.