Get your copy of "Building LLMs for Production": amzn.to/4bqYU9b
Thank you for cutting through the hype. The aim of every new AI model is to do things not just better but also more efficiently than the competition. In that respect, Stable Diffusion wins hands down. SD is also free of the censorship hampering users of the other models, whose content policies are so vague that users don't know whether they are violating them or not.
Agreed! Thanks to Emad and everyone behind SD.
References:
►Read the full article: www.louisbouchard.ai/latent-diffusion-models/
►Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695), arxiv.org/pdf/2112.10752.pdf
►Latent Diffusion Code: github.com/CompVis/latent-diffusion
►Stable Diffusion Code (text-to-image based on LD): github.com/CompVis/stable-diffusion
►Try it yourself: huggingface.co/spaces/stabilityai/stable-diffusion
►Web application: stabilityai.us.auth0.com/u/login?state=hKFo2SA4MFJLR1M4cVhJcllLVmlsSV9vcXNYYy11Q25rRkVzZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIFRjV2p5dHkzNGQzdkFKZUdyUEprRnhGeFl6ZVdVUDRZo2NpZNkgS3ZZWkpLU2htVW9PalhwY2xRbEtZVXh1Y0FWZXNsSE4
►My Newsletter (A new AI application explained weekly to your emails!): www.louisbouchard.ai/newsletter/
To get better!
Oh wow, that is a lot of hate in a single message. I'm sorry you cannot stand how I speak. It is hard for me to speak a second language and I do my best at it. Hopefully I will get better over time, as I am now also able to chat with people in English, which will surely help too.
Wow, that is a first! I actually also write the articles, so if you hate the voiceover that much, you don't have to listen to it.
I'm really surprised that it is that hard to understand. I'm sorry that a video can hurt you this much.
I'm not sure whether to be amused by your comments or sad about whatever reality is causing you to insult a random person online like this. I hope you'll figure out how to feel good and be as happy as I am! Maybe you should try focusing on yourself for a little while, but you should talk to a specialist and not listen to me.
Nice! Your quality is great. I'm trying to get my quality to this level on my own YouTube channel.
Thank you! I am sure you can do even better haha!
This video is short and super to the point!
That was the goal! We don’t play around 😎
Just discovered this channel - well done! Excellent coverage of Stable Diffusion. I like that you didn't skimp on the technical details
Glad you think so! That’s exactly my goal :)
Hey guys! Please have a look at Qwak's website for me. I'm grateful to have my friends sponsoring this video, and I'm sure their tool will be useful to some of you :)
www.qwak.com
I deleted my comment after asking my colleagues a similar question, but before realizing you replied, since I didn’t want to add to anyone’s confusion with the question 😅
But, I was able to read most of your response through the YT notification. Thank you for replying!
My pleasure! :)
Hey, I think you may have gotten one fact wrong. The diffusion models were trained on "hundreds of GPUs", I'm sure, but I don't think the prompts are being run through a bunch of GPUs, or we would probably get them back instantly. I say this because I have the free GRisk Stable Diffusion, which uses your own graphics card (a GTX 1080 in my case), and it works just as well if not better than some of the others. It's just limited to 512x512, but most of them are, unless you upscale within the model. GRisk is limited, but it's really good and I encourage everyone to try it, especially if you have a beefy GPU. And if you don't, you could use Topaz Gigapixel AI in trial mode to upscale your 512x512 images.
You are right and I may have said that wrong, sorry for this error in the video! I haven’t heard of GRisk, thank you for sharing, will check it out!
This tool is about to end careers, it's incredible. The models are only about 2-5 GB, which is just insane. I wonder if you can train it on videos and noise/denoise the frames.
There are custom models out there that are 40gb+ trained on 'special' art websites that exhaustively tag all of their art with various highly descriptive parameters. The results are better than ever but most of the models are more or less 'secret'
@@Sammysapphira what models are these?
I wonder if there will be a rough sketch to detailed image AI in the future.
Stable Diffusion can in fact already do this! There is an image-to-image feature to which you can feed a rough sketch, and it works quite well. Not perfect obviously, but really cool!
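For anyone who wants to try the image-to-image route, here is a rough sketch assuming the Hugging Face diffusers library (the model ID, file names and exact argument names are assumptions and may differ between versions):

```python
# Rough image-to-image sketch (assumed diffusers API; check your installed version).
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A rough sketch as the starting image (hypothetical file name).
sketch = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a detailed fantasy castle, digital painting",
    image=sketch,
    strength=0.75,       # how much the model may repaint the input
    guidance_scale=7.5,  # how strongly to follow the prompt
).images[0]
result.save("detailed_image.png")
```

The "strength" value is the main knob: low values keep the result close to the sketch, high values let the model repaint more freely.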
You can give it a very simple image (stick figures) and it will spit out gold.
There is an app/platform that you can get on the waitlist for which does this. I forgot the name but you can Google it
This rocks thank you for the insight
Always my pleasure Job! 😊
To learn to reconstruct the original image... I would be very interested in how this "learned result" is saved, because it has to be saved somewhere. In a database? Since an image can have, let's say, 12 megapixels, how is it even possible to save the result of the learning process? It is not clear to me in which form the result is stored. As a 3D model?
Isn't it saved via model weights?
Yes, it is saved in the model's weights, which consist of millions of parameters and apply transformations through simple functions. Adding up all those functions and learned parameters, you are able to reconstruct an image starting from the right point in a "latent space". This is basically a very tiny representation of the image that is learned during training, and it moves thanks to the text prompt you give it or a conditioning image :)
The model then reconstructs the image thanks to the millions of functions which together represent one very complex function that, in theory, could "predict" any signal (if big enough and trained enough). And in this case our signal is a 12 MP image :)
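To make that concrete, here is a tiny, purely illustrative PyTorch sketch (the architecture and sizes are invented for the example): the only thing that is "stored" is the decoder's weights, and an image only comes into existence when a latent vector is pushed through them.

```python
# Minimal sketch: no database, no stored pixels. The "memory" is the weights.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # A stack of simple learned functions (linear + upsampling convolutions).
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 8, 8)
        return self.up(x)  # (batch, 3, 32, 32) RGB image

decoder = TinyDecoder()
z = torch.randn(1, 64)   # a tiny latent "starting point"
image = decoder(z)       # the image only exists after this forward pass
print(image.shape)       # torch.Size([1, 3, 32, 32])
```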
Please, I can't find the video where you discussed attention, mentioned at 04:54. Can you point me to it, or did you just mention it in passing in another video?
Hi! I discuss the attention process in two videos.
Here, a long time ago, with transformer networks: youtube.com/watch?v=sMCHC7XFynM
And here, more recently, when vision transformers were introduced: youtube.com/watch?v=QcCJJOLCeJQ
@@WhatsAI Thank you very much
Can someone explain to me or point to a very high level explanation of how the text prompt is combined with the latent space data to create a new image?
You transform the text into token embeddings of the same shape as the image representation. This puts the text information into a higher-dimensional space. Then you add this information to the latent representation of the image by multiplication, addition or other techniques. To not overwrite the complete model, skip connections are used.
Hope that helped a bit!
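As a rough illustration of that idea, here is a minimal PyTorch sketch (the shapes and the simple addition-based mixing are assumptions made for clarity; Stable Diffusion itself injects the text through cross-attention inside the UNet):

```python
# Minimal sketch: project a text embedding into the latent's channel space
# and mix it into the latent by addition.
import torch
import torch.nn as nn

batch, channels, height, width = 1, 4, 64, 64
text_dim = 768                                 # e.g. a CLIP-sized embedding

latent = torch.randn(batch, channels, height, width)
text_embedding = torch.randn(batch, text_dim)  # stand-in for a real text encoder output

project = nn.Linear(text_dim, channels)        # learned projection to latent channels
cond = project(text_embedding).view(batch, channels, 1, 1)

conditioned_latent = latent + cond             # broadcast-add the text information
print(conditioned_latent.shape)                # torch.Size([1, 4, 64, 64])
```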
@@abail7010 Thank you - that does help a bit - this aspect seems to be less covered in the high level explanations so I appreciate that - cheers.
Holy shit! Brother, I just wanted to know what Stable Diffusion is! What did you do like that!
Impressive man🔥
I just tried to generate an image using this tool hosted on Hugging Face. Guess what, I'm in a long queue 😂 people are going crazy.
First, text-to-text language models like GPT...
Next, text-to-image models like DALL-E...
I think text-to-video is up next...
Oh yes it is! In fact, the Transframer model shared a few days ago by DeepMind does just that haha, and there was another one too. They are just the first steps, but it is definitely coming.
@@WhatsAI Yeah, I read about it. For now it can only generate a short clip in low resolution. We may see high-resolution full movies in the near future. Having all these tools makes me wonder how AI might compromise some jobs and disrupt some industries. Time will tell.
Indeed, only time will tell!
How does it compare with the SDEdit paper from Stanford University?
I actually covered SDEdit on my channel! Stable Diffusion is different in that it learns, from a dataset, to denoise an input (in this case an image in the latent space, i.e. the space encoded with a VAE) using a UNet that learns Gaussian parameters to remove the noise step by step. Then you can simply send in noise and get images back in seconds. SDEdit works similarly but directly in image space, so it is much slower, and it uses stochastic differential equations to sample the Gaussian parameters for removing the noise, instead of a UNet predicting them.
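For readers who want to see what that training objective looks like, here is a minimal, illustrative PyTorch sketch of one latent-diffusion training step (the tiny network and the noise schedule are stand-ins, not the real UNet or scheduler):

```python
# Minimal sketch: noise a latent at a random timestep, train a network to predict that noise.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the UNet; predicts the noise that was added to the latent."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_latent, t):
        return self.net(noisy_latent)  # a real UNet also uses t and text embeddings

denoiser = TinyDenoiser()
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

latent = torch.randn(8, 4, 32, 32)        # pretend output of the frozen VAE encoder
t = torch.randint(0, 1000, (8,))          # random diffusion timesteps
noise = torch.randn_like(latent)
alpha = (1 - t / 1000).view(-1, 1, 1, 1)  # toy noise schedule
noisy_latent = alpha.sqrt() * latent + (1 - alpha).sqrt() * noise

loss = nn.functional.mse_loss(denoiser(noisy_latent, t), noise)
loss.backward()
optimizer.step()
print(loss.item())
```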
@@WhatsAI Thanks for the reply, I really appreciate your time. Would you make a video or blog post about how researchers come up with neural network architectures for a specific job, like GANs, diffusion models, etc.? I'm really curious to know how they approach the problem and then work their way out of it.
All my pleasure, thanks to you for following my work!
That is a great subject to cover, thank you for the suggestion! I feel it will be quite complex, maybe an interview format would work best 🤔
Would love to know how you think this should be done to get the best possible video format.
@@WhatsAI Yes please, as long as it serves the purpose: the intuition or key idea behind their approach. Interviews would help give insight into the approach they follow.
Hi Louis, thanks so much for your great video :D I saw a somewhat similar model for brain anomaly detection and segmentation in the paper "Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models". From what I observe, it also uses an encoder-decoder architecture with the latent diffusion model to learn the latent distribution. My question is: during training, can I train the encoder-decoder (VQ-VAE) separately from the diffusion model? Let's say I first train the VAE model and then freeze the VAE weights to train my diffusion model. I'm not sure how it's done for this paper; I've been checking the code and it looks like the autoencoder and diffusion model are trained separately, but I might be misunderstanding the code :D Thank you
Hi! Thank you very much.
From what I understood, they are trained separately! :)
The VAE was trained only to encode and decode a signal, and then the diffusion part is trained using the fixed VAE!
Are you working in the medical field? Because I am too and the paper you referred seems pertinent for my work haha!
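To make the two-stage recipe from the reply above concrete, here is a minimal PyTorch sketch (both models are placeholders, not the paper's code): the autoencoder is trained first, then frozen while only the diffusion model is updated.

```python
# Minimal sketch of the two-stage setup: frozen autoencoder, trainable diffusion model.
import torch
import torch.nn as nn

vae = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1))        # placeholder "encoder"
diffusion = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))  # placeholder denoiser

# Stage 1: train the VAE on a reconstruction objective (omitted here).
# Stage 2: freeze the VAE and train only the diffusion model on its latents.
for p in vae.parameters():
    p.requires_grad = False
vae.eval()

optimizer = torch.optim.Adam(diffusion.parameters(), lr=1e-4)

images = torch.randn(2, 3, 64, 64)
with torch.no_grad():              # no gradients flow into the frozen VAE
    latents = vae(images)

noise = torch.randn_like(latents)
loss = nn.functional.mse_loss(diffusion(latents + noise), noise)
loss.backward()
optimizer.step()
print(loss.item())
```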
@@WhatsAI Thanks for your fast response Louis. Yes, I've been working on unsupervised brain anomaly segmentation, and the paper I mentioned seems to be one of the newest MICCAI 2022 papers related to the topic :D I'm currently re-implementing the code, but there is one section of the inference trick mentioned in the paper that I still can't reproduce T.T Please let me know if you also decide to re-implement the paper :D
Will do! Could you message me on Twitter, LinkedIn or by email? I would love to work on that with you if you are working on it, or just share results! I also work on a very similar application, so it's definitely worth staying in touch!
I understood nothing.
Same to me
Wow!
😊
😮😮🎉q
You wouldn't happen to be French by any chance, hahaha?
Almost! Québécois :)
first
Congrats!! 😉
this is way too complicated to understand
Sorry, I would love to watch your video, but I simply cannot understand your English, so I am off.
That is unfortunate! I didn’t know it was a hard accent to understand.
@@WhatsAI I understand you perfectly and clearly.
After reading the comment, I listened to your pronunciation in detail. (I work in speech recognition, so pronunciation is a familiar topic to me.) I think your pronunciation is not too far off from regular English pronunciation, but one weak point is sentence prosody, i.e. the pitch of your voice over the course of the sentence. It seems you do not think about the sentence as a whole, but only about short segments at a time, which leads to an unnatural and "chunky" prosody that almost sounds like last-generation speech synthesis. If you want to improve, mainly focus on sentence prosody; pronunciation is only a secondary and minor issue, in my humble opinion.
Thank you very much for this amazing feedback, Charles! I think it may come from reading a script and not having the whole sentence in mind while saying it! It would be incredible if you could have a listen to some of my recent, longer podcast-format episodes and let me know if the same problem is there or how I could improve!