Diffusion models from scratch in PyTorch

  • Published: 24 Dec 2024

Comments • 211

  • @AndrejKarpathy
    @AndrejKarpathy 2 years ago +61

    Quite good!

  • @kiunthmo
    @kiunthmo 1 year ago +45

    Really well explained, and a compact notebook. It's basically all written directly in PyTorch; very refreshing to see when so much content is heavily reliant on higher-level APIs.

    • @VeryToxicToast
      @VeryToxicToast 6 months ago

      Yeah, it would only take 16 years to create an API like Torch to run on a GPU and then show it in a video.

  • @alexvass
    @alexvass 1 year ago +1

    great video, very clear

  • @LuisPereira-bn8jq
    @LuisPereira-bn8jq 1 year ago +3

    Thanks a lot for the video, really helpful for someone trying to grasp these models.
    Also, little typo I noticed: at 16:06 in the cell "# Simulate forward diffusion" noise is being added a little faster than intended.
    The culprit is the line "image, noise = forward_diffusion_sample(image, t)": it rewrites the variable "image" at each step of the loop, even though forward_diffusion_sample was built expecting the initial, non-noisy image. So from the second iteration onwards we're adding noise to an already noisy image.
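
    A minimal sketch of the fix being described, meant to drop into that notebook cell (it assumes the cell's existing names: image holding the original x_0, T, forward_diffusion_sample(x_0, t) returning the noised image and the noise, and show_tensor_image):

    # Simulate forward diffusion: always noise the ORIGINAL image,
    # never the output of the previous iteration.
    num_images = 10
    stepsize = int(T / num_images)

    for idx in range(0, T, stepsize):
        t = torch.tensor([idx], dtype=torch.int64)
        # keep "image" untouched; store the result in a new variable
        noisy_image, noise = forward_diffusion_sample(image, t)
        show_tensor_image(noisy_image)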

    • @DeepFindr
      @DeepFindr  1 year ago +2

      Hehe thanks for this finding, this is indeed a bug. I just checked; it doesn't look very different with the correction (assigning to a new variable). If I'm not mistaken, this led to a multiplication by 2, as in every step the pre-computed noise for this t is added, plus the cumulative noise until t (which should be the same as the pre-computed one), hence leading to twice the noise as intended. Anyways, thanks for this comment! :)

    • @LuisPereira-bn8jq
      @LuisPereira-bn8jq 1 year ago +6

      ​@@DeepFindr Hi again. Yeah, the bug didn't really affect the images much, but it might confuse some viewers about whether you're computing x_t from x_0 or from x_{t-1}.
      As for the "multiplication by 2" bit, it's not going to be exactly that since the betas are changing and you're adding (t-1)-step noise to t-step noise. Moreover, adding a N(0,1) to another independent N(0,1) is a N(0,2), whose standard deviation is sqrt(2), so what was happening should be closer to multiplication by sqrt(2), even if also not exactly that.
      Anyway, since my previous comment I've now finished the video and trained it for 100 epochs so far (with comparable results to yours).
      I have two more comments in the latter bits of the video, namely the "sample_timestep" function at 26:59:
      - I was rather confused for a while as to why we were returning "model_mean" rather than just "x" for t=0. Though eventually I realized that the t's in the code are offset from the t's in the paper: the code is 0-indexed but the paper is 1-indexed. So the t=0 case in the sample_timestep is really inferring x_0 from x_1 in terms of the paper.
      It might be worth adding a comment about this in either the video or the code.
      - it took me quite a bit to understand the output of the sample_timestep function. I think I mostly got it now, but this is a really subtle step that is worth demystifying.
      Here's my current understanding: in effect our model is always trying to predict x_0 from x_t, but we don't expect the prediction to be great for large t. However, the distribution of p(x_(t-1) | x_t, x_0) is a fully known normal distribution, so we instead use the predicted x_0 to approximate this p(...), then sample from that to get x_(t-1).
      In retrospect, I've seen multiple videos on diffusion try to describe this process in words as "we predict the full noise, but then we add some of the noise back", but that vague description never made sense to me.
      So maybe an extra comment on this could help a future viewer as well.
      Anyway, let me thank you again for the video. My hope is to eventually actually understand stuff like stable diffusion with all its bells and whistles, and this already helped a lot.
      And on that note, I noticed that the weights for the network in the video take up 700 MB, compared to something like 4 GB for Stable Diffusion, so it's maybe not so surprising that this would require a while to train from scratch.
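
      For future viewers, here is roughly what that reverse step looks like with the indexing comment made explicit. This is an illustrative sketch in the spirit of the video's sample_timestep, assuming the pre-computed tensors (betas, sqrt_one_minus_alphas_cumprod, sqrt_recip_alphas, posterior_variance) and the trained noise-prediction model are in scope; it is not the notebook cell verbatim:

      def get_index_from_list(vals, t, x_shape):
          # pick vals[t] for each batch element and reshape so it broadcasts over image dims
          out = vals.gather(-1, t.cpu())
          return out.reshape(t.shape[0], *((1,) * (len(x_shape) - 1))).to(t.device)

      @torch.no_grad()
      def sample_timestep(x, t):
          betas_t = get_index_from_list(betas, t, x.shape)
          sqrt_one_minus_t = get_index_from_list(sqrt_one_minus_alphas_cumprod, t, x.shape)
          sqrt_recip_alphas_t = get_index_from_list(sqrt_recip_alphas, t, x.shape)

          # posterior mean: current image minus the (scaled) predicted noise
          model_mean = sqrt_recip_alphas_t * (x - betas_t * model(x, t) / sqrt_one_minus_t)

          if t == 0:
              # code t=0 is the paper's t=1: we are inferring x_0 from x_1,
              # so no fresh noise is added and the mean is returned as the final image
              return model_mean
          posterior_variance_t = get_index_from_list(posterior_variance, t, x.shape)
          return model_mean + torch.sqrt(posterior_variance_t) * torch.randn_like(x)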

    • @DeepFindr
      @DeepFindr  1 year ago

      @Luis Pereira yes I totally agree, in retrospect some things could've been more in depth. Meanwhile I've also experimented more and read other papers about these models (and also the connection to score based approaches) which could also be added here. Maybe I'll make an update video some day :)

    • @LuisPereira-bn8jq
      @LuisPereira-bn8jq 1 year ago

      @@DeepFindr No worries. My experience is that in retrospect nearly everything could have been improved in some way or other.
      And if you ever find the time for another video, I at least would be interested. There are a decent number of good YouTube videos on this topic, but this is one of the best I've found.

  • @rafa_br34
    @rafa_br34 7 months ago +2

    Great video! For me, the code makes it easier to understand the math than the actual formulas, so videos like these really help.

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Yes - I am a slow learner and I will reverse-engineer the Python code to understand the math.
      It's all probabilities, and the math we hear in the video feels more like a quantum computer.
      Then we approximate all the quantum-computer moves with classical CPU (or GPU) ops, and we have to understand how much the formulas were rounded or truncated, and which parts were ignored.
      I took first prize in exact math, and I still don't have the right soul to digest all these elucubrations ... but I am working on it.
      Wait until some of these operations are executed on quantum hardware - then it will blow our minds, because this Gauss is a joke compared to Young.

  • @arnabkumarpan5615
    @arnabkumarpan5615 1 year ago +1

    You are seeing something that's going to change the way we see our universe in the upcoming 2-3 years! Save my comment!

  • @tanbui7569
    @tanbui7569 2 years ago +15

    Extremely fantastic implementation. I understood the whole idea of diffusion, and all the mathematical details made sense to me just from your code.

  • @ViduzTube
    @ViduzTube 1 year ago +13

    Very good job! A note: I think you should add torch.clamp(image, -1.0, 1.0) after each forward_diffusion_sample() call. You can check the behavior with and without the clamp when simulating forward diffusion. Images shown without the clamp seem "not naturally noisy", as the pixel range is no longer between -1 and 1. I don't know how much this affects the final training result; it would be worth trying.
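
    For reference, the suggested change in the "Simulate forward diffusion" cell would look roughly like this (variable names as in that cell; noisy_image is a new name introduced here):

    noisy_image, noise = forward_diffusion_sample(image, t)
    # keep pixel values in the same [-1, 1] range the rest of the pipeline expects
    noisy_image = torch.clamp(noisy_image, -1.0, 1.0)
    show_tensor_image(noisy_image)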

    • @DeepFindr
      @DeepFindr  1 year ago +3

      Yes very good point. Later I also realized this and it actually led to an improvement (on a different dataset however). :)

    • @oliverliu9248
      @oliverliu9248 1 year ago

      Thank you! When I was watching I was wondering what would happen as you add the variance and the value exceeds 1. Your answer helped me understand it.

  • @sachinmotwani2905
    @sachinmotwani2905 1 year ago +3

    Unable to access the dataset - stanford-cars.

  • @roblee5721
    @roblee5721 1 year ago +3

    Very nice video with a good explanation. I would like to point out that in your Block class the same batchnorm is used in different places. Batchnorm is trainable and has weights, so you might want to treat it more like an actual layer rather than a memory-less operation like pooling or ReLU.
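
    A simplified sketch of the kind of fix being suggested: each convolution gets its own BatchNorm instance instead of reusing one (the real Block in the video also handles the time embedding, omitted here):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.bnorm1 = nn.BatchNorm2d(out_ch)  # own trainable scale/shift + running stats
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
            self.bnorm2 = nn.BatchNorm2d(out_ch)  # separate instance, not shared
            self.relu = nn.ReLU()                 # ReLU is stateless, so reusing it is fine

        def forward(self, x):
            x = self.bnorm1(self.relu(self.conv1(x)))
            x = self.bnorm2(self.relu(self.conv2(x)))
            return x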

    • @DeepFindr
      @DeepFindr  1 year ago

      Hi, thanks for pointing this out. This was a little bug, which I've corrected in the original notebook. :)

  • @ioannisd2762
    @ioannisd2762 9 months ago

    Amazing video! Highly suggested before diving into the paper

  • @shakibyazdani9276
    @shakibyazdani9276 2 years ago +2

    what a great explanation, I will take a deeper look at the code. Thanks

  • @LiquidMasti
    @LiquidMasti 1 year ago +1

    Loved the simple implementation also thanks for sharing additional articles

  • @amortalbeing
    @amortalbeing 10 months ago +1

    @13:09 Why isn't sqrt_recip_alphas used anywhere?
    Also, why do you calculate sqrt_one_minus_alphas_cumprod, when in the equation we only have 1-alphas_cumprod? Is that a typo?
    What's alphas_cumprod_prev exactly?
    Can someone please explain what is being done here? And what's the posterior_variance?
    Thanks a lot in advance
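
    For anyone else stuck on that cell, here is roughly what the pre-computed tensors mean in the standard DDPM formulation (a sketch using the same names as in the video; the concrete start/end values are just the usual defaults):

    import torch
    import torch.nn.functional as F

    T = 300
    betas = torch.linspace(0.0001, 0.02, T)        # per-step noise variance

    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)  # "alpha bar": cumulative signal retention
    alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)  # alpha bar at step t-1
    sqrt_recip_alphas = torch.sqrt(1.0 / alphas)   # not used in the forward pass; it scales x_t
                                                   # in the reverse-step mean during sampling
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)                 # multiplies x_0 in the closed form
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
    # (1 - alpha bar) is a VARIANCE; the N(0, 1) noise sample is multiplied by its square root,
    # i.e. the standard deviation, so the sqrt is not a typo.
    posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
    # variance of q(x_{t-1} | x_t, x_0), used when sampling the reverse step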

  • @henriwang8603
    @henriwang8603 1 year ago +3

    Maybe it's because I got to this video too late, but the Stanford Cars dataset link is now invalid - a 404 error.

  • @MassivaRiot
    @MassivaRiot 2 years ago +9

    Absolutely phenomenal content! Love it ❤️

  • @egoistChelly
    @egoistChelly 1 year ago +2

    The dataset is no longer available.

  • @leonliang9185
    @leonliang9185 2 years ago +2

    Bro, one day in the future when this channel becomes famous, don't forget I am one of your early fans!

  • @nicksanders1438
    @nicksanders1438 4 months ago

    It's a good, concise walk-through with good code implementation examples. However, I'd recommend avoiding some ambiguous variable names in the code, like betas, Block, etc.

  • @Ayanshandseals
    @Ayanshandseals 1 year ago +2

    The Stanford Cars dataset is no longer available in PyTorch datasets. Do you have any alternate locations for the same data?

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Search for "Kaggle Stanford Cars" - I tried to add the link but YouTube's AI is blocking all external links (it sends them to manual approval).

  • @curiousseeker3784
    @curiousseeker3784 1 year ago

    Question: at 15:18, why did we not scale directly between -1 and 1? Or are there two different tensors being scaled, one between 0 and 1 and the other between -1 and 1?

  • @SeonhoonKim
    @SeonhoonKim 1 year ago

    Thanks a lot for your contribution on this. But I'm a bit confused: at 7:30, is q(Xt | Xt-1) the distribution that "the sampled noise" follows, or the one that "the noised image" follows?

    • @DeepFindr
      @DeepFindr  1 year ago

      It's the distribution of the noised image :) the distribution of the noise is always gaussian. This formula expresses the mixture of the original input and the noise distribution, hence the distribution of the noised image.

    • @SeonhoonKim
      @SeonhoonKim 1 year ago

      @@DeepFindr Thanks for your reply!! Just one more, please? Then q(Xt | Xt-1) = N(Xt; ..., BtI) means the variance of Xt is Bt? Someone says V(Xt) eventually becomes 1 at every step, so I'm a bit confused...

    • @DeepFindr
      @DeepFindr  1 year ago

      @@SeonhoonKim Bt is just the variance of this single step. Have a look at the "closed form" part with alpha. Ideally alpha bar becomes 0 at the end (the cumulative product), which leads to a variance of 1.
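
      In formulas, the two statements being reconciled here (standard DDPM notation):

      q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
      q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\big), \qquad \bar\alpha_t = \textstyle\prod_{s=1}^{t}(1-\beta_s)

      So beta_t is the variance added by a single step, while the total variance of x_t given x_0 is 1 - alpha_bar_t, which approaches 1 as alpha_bar_t goes to 0.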

  • @nikhilprem7998
    @nikhilprem7998 10 months ago +1

    The StanfordCars dataset is no longer available; what alternative can I use?

    • @henrysun6430
      @henrysun6430 9 months ago +1

      ^ having the same problem

  • @derekyun5109
    @derekyun5109 1 year ago +1

    Looks like the torchvision dataset for StanfordCars is now deprecated or something; the original URL from which the function pulls the data is gone.

  • @emmanuelkoupoh7979
    @emmanuelkoupoh7979 2 years ago +3

    Hey, very interesting. The math is extremely comprehensible the way you explained it. Thanks

  • @jby1985
    @jby1985 1 year ago

    At 10:19, q is the noise and the next forward image is x+q. Do I understand that right? Or are q and x just used interchangeably?

  • @namirahrasul
    @namirahrasul 8 months ago +1

    Also, StanfordCars is no longer available; can you please change it?

  • @peterthegreat7125
    @peterthegreat7125 1 year ago

    Oh my god, this explanation is SUPER CLEAR! 🤯

  • @sergiobromberg9233
    @sergiobromberg9233 1 year ago +1

    Thank you! I really liked your graphic interpretation of the beta scheduling. It's missing in many other videos about diffusion.

  • @akurmustafa_
    @akurmustafa_ 12 days ago

    Thanks, great explanation. I wonder how sampling with T can produce plausible images in just a single pass. I would expect to call the sampling code recursively T times, with a timestep of 1 at each call and z updated according to the result of the previous call, similar to autoregressive architectures.
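
    That is in fact what happens: sampling is iterative, not a single pass. A rough sketch of the outer loop, assuming a sample_timestep(x, t) function like the one in the video that performs one reverse step (drawing fresh noise z internally for t > 0) and the step count T:

    import torch

    @torch.no_grad()
    def sample_image(img_size=64, device="cpu"):
        x = torch.randn((1, 3, img_size, img_size), device=device)  # start from pure noise x_T
        for i in reversed(range(T)):                                 # T-1, ..., 1, 0
            t = torch.full((1,), i, dtype=torch.long, device=device)
            x = sample_timestep(x, t)                                # one reverse step
        return x.clamp(-1.0, 1.0)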

  • @michael2826
    @michael2826 9 months ago

    Thanks! From the training result, what I saw was an image that went from its original version to a less noisy one? I was expecting to see a noisy image converted to a less noisy one, or to its original one?

  • @ovidiuluciancristina9127
    @ovidiuluciancristina9127 1 month ago

    Great video, really well explained!
    One thing I found suspicious is that in the forward diffusion process you generate noise like this: "noise = torch.randn_like(x_0)". As far as I know, this samples the uniform distribution U(0,1) and not the standard Gaussian.
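
    (For reference: in PyTorch, torch.randn_like draws from the standard normal N(0, 1); it is torch.rand_like that draws from the uniform U(0, 1), so the quoted line does sample Gaussian noise. A quick sanity check:)

    import torch

    x0 = torch.zeros(1_000_000)
    gaussian = torch.randn_like(x0)   # mean ~ 0.0, std ~ 1.0
    uniform = torch.rand_like(x0)     # mean ~ 0.5, std ~ 0.29
    print(gaussian.mean().item(), gaussian.std().item())
    print(uniform.mean().item(), uniform.std().item())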

  • @cankoban
    @cankoban 2 years ago +4

    Great effort, thank you! The simplified version of it is still complicated though :D Probably I need to watch this a couple more times after reading the resources you attached.

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Yes - I am at view number 10 ... and there will be 10 more for the math (I kept the dessert for last 🙂)

  • @adamgrygielski7395
    @adamgrygielski7395 2 years ago +2

    Shouldn't you use separate BN layers for the 1st and 2nd convolution in a block? In your implementation batch statistics are shared between the 2 layers, which seems to be a bug.

    • @DeepFindr
      @DeepFindr  2 years ago +3

      Yep, you are right. I updated the notebook.
      Actually I also found that bug in a local version of the code and forgot to adjust the notebook. Bnorm layers can't be shared as each layer learns individual normalization coefficients.
      Thanks for pointing this out :)

  • @tensenpark
    @tensenpark 1 year ago

    Damnnn I wish I watched your video first thing when trying to understand this. Great explanation

  • @FlummoxTheMagnificent
    @FlummoxTheMagnificent 1 year ago +1

    How do I make it match a prompt?

    • @DeepFindr
      @DeepFindr  1 year ago +2

      For that you need to conditionally generate images, for example using CLIP latents. This simply means that you add some text embedding during the denoising. The best explanation for this can probably be found here: jalammar.github.io/illustrated-stable-diffusion/

    • @FlummoxTheMagnificent
      @FlummoxTheMagnificent 1 year ago +1

      @@DeepFindr that’s interesting, ok, thank you!

    • @DorinIonescu
      @DorinIonescu 1 month ago

      We will have to add context to the model - I hope to come back with some examples

  • @cerann89
    @cerann89 1 year ago +2

    Thanks for a great tutorial. I think there is a small bug though in the implementation of the output layer of the UNet: the output channel dimension is swapped with the kernel size and set to a fixed 3. Shouldn't it instead look like this: self.output = nn.Conv2d(up_channels[-1], out_dim, 3)

    • @DeepFindr
      @DeepFindr  1 year ago +1

      Oh yes :D bugs everywhere.
      With output dim 1 it would just produce black and white images, so this bug led to color ;-) have you tried it with another kernel size? Did it make a difference?

    • @cerann89
      @cerann89 1 year ago +2

      @@DeepFindr I actually tried it on medical MRI images which have only one color dim (greyscale). That is where the error was triggered. I kept the kernel size at 3, so no I can’t give any input on the influence of the kernel size.

  • @MonkkSoori
    @MonkkSoori 1 year ago

    Hello thank you for the video and code. I have two questions:
    Q1- In the Block module 24:16 why is the input channel in `self.conv1` multiplied by 2? The input channel is twice the size of the output based on the `up_channels` list in the `init` of your SimpleUnet class. Is this related to adding "residual x as additional channels" at 24:50?
    Q2- How do you control the direction the diffusion goes in? I know this is a very simplified example model, but how would you add the ability to steer the generation towards a certain class of car, or a car based on a text description like "red SUV"? Is there a good explanatory paper, blog post or video on this that you can recommend (preferably practical, without a lot of math)?
    (Thank you again for the video)
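
    On Q1: in a U-Net the decoder blocks receive the upsampled features concatenated channel-wise with the skip connection from the matching encoder level, which doubles the channel count - hence the factor of 2 on conv1's input channels. A hypothetical, simplified forward pass to illustrate:

    import torch

    def forward(self, x, skip_x):
        # concatenate upsampled features with the encoder's skip connection along channels
        x = torch.cat([x, skip_x], dim=1)  # channel count doubles here
        return self.conv1(x)               # so conv1 was built with 2 * in_ch input channels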

    • @DorinIonescu
      @DorinIonescu 1 month ago

      We will have to add context to the model - I hope to come back with some examples

  • @lchunleo
    @lchunleo 1 year ago

    Thanks for the clear explanation and code. But I wonder how I can use the trained model to generate images? Can you advise?

  • @itsthenial
    @itsthenial 1 year ago

    How do I save the model and generate images after training?
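
    Since this question comes up a few times in the thread: a minimal sketch using standard PyTorch (not taken from the notebook), assuming the SimpleUnet class, the total step count T, and a per-step sample_timestep function like the ones shown in the video:

    import torch

    # after training
    torch.save(model.state_dict(), "ddpm_unet.pth")

    # later / in a fresh session
    model = SimpleUnet()
    model.load_state_dict(torch.load("ddpm_unet.pth", map_location="cpu"))
    model.eval()

    with torch.no_grad():
        img = torch.randn((1, 3, 64, 64))             # start from pure noise
        for i in reversed(range(T)):                  # run the full reverse process
            t = torch.full((1,), i, dtype=torch.long)
            img = sample_timestep(img, t)
        img = (img.clamp(-1, 1) + 1) / 2              # rescale from [-1, 1] to [0, 1] for viewing/saving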

  • @usama57926
    @usama57926 1 year ago +1

    Can you make a video on *Conditional generation in Diffusion models*?

  • @hilmiyafia
    @hilmiyafia 2 years ago

    In the Google Colab there are log and exp in the Sinusoidal Embedding block. You did not explain where those come from. I don't see them in the formula at 20:26.

    • @DeepFindr
      @DeepFindr  2 years ago

      Hi :)
      Some implementations of positional embeddings are calculated in log space, that's why you see exp and log there. This usually improves numerical stability and is sometimes also done for loss functions
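      For the curious, the usual log-space formulation looks roughly like this; it produces the same 1/10000^(2i/d) frequencies as the formula at 20:26, just computed via exp and log for numerical convenience (an illustrative sketch, not necessarily the notebook line for line):

      import math
      import torch

      def sinusoidal_embedding(t, dim):
          half = dim // 2
          # frequencies 1/10000^(i/(half-1)), computed in log space
          freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / (half - 1))
          args = t[:, None].float() * freqs[None, :]
          return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

      emb = sinusoidal_embedding(torch.tensor([0, 10, 100]), dim=32)  # shape (3, 32)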

  • @realdreray
    @realdreray 2 years ago

    Thanks for putting this together

  • @terguunzoregtiin8791
    @terguunzoregtiin8791 1 year ago

    Really helpful content, and the recommended resources are very good, thanks

  • @bluebear7870
    @bluebear7870 2 years ago

    I have a question, sir:
    At 13:22 the formula is (1-alpha_bar), so why does the code use sqrt(1-alpha_bar)?
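
    (The short answer, in the paper's notation: (1 - alpha_bar_t) is the variance of the closed-form distribution, and the reparameterized sample multiplies unit-variance noise by the standard deviation, i.e. its square root:)

    q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\big)
    \;\Longrightarrow\;
    x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})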

  • @タンココア
    @タンココア 2 years ago

    Thank you! This is the best video I've ever seen

    • @DeepFindr
      @DeepFindr  2 years ago

      Glad that you liked it!

  • @nicolasf1219
    @nicolasf1219 1 year ago

    So, stupid question:
    In the SimpleUnet class we define the output layer with a parameter of 3 to regain the number of channels our image has. Couldn't we then just pass the image_channels variable there? What if my image is grayscale and has only 1 channel?

  • @int16_t
    @int16_t 1 year ago

    Say I have a still image x0 and a pre-initialized noisy image N. I think I can apply noise to x0 via "(1-B)x0 + BN". When B=1 the output is N, the noisy image; when B=0 the output is the still image. But that's just the linear version.
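
    (For comparison, the DDPM forward step mixes with square-root coefficients so that the variances, rather than the amplitudes, combine linearly and the overall variance stays close to 1:)

    x_t = \sqrt{1-\beta_t}\;x_{t-1} + \sqrt{\beta_t}\;\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I})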

  • @glacialclaw1211
    @glacialclaw1211 1 year ago

    How do I put my own training dataset to the ipynb script?

  • @rajatsubhrachakraborty6767
    @rajatsubhrachakraborty6767 1 year ago

    How can we save all the generated images? As far as my understanding goes, at the end of training there would be generated images of Stanford cars produced from completely noised images.

  • @maxlchn1462
    @maxlchn1462 5 months ago

    Very well explained and implemented 👏

  • @michakowalczyk7411
    @michakowalczyk7411 2 years ago +40

    Always been told after math classes: "You won't need that in real life anyways" xd

    • @DeepFindr
      @DeepFindr  2 years ago +6

      I can relate xD

    • @snoosri
      @snoosri 1 year ago +2

      but what is real life?

    • @michakowalczyk7411
      @michakowalczyk7411 1 year ago +2

      ​@@snoosri Don't take words so literally, especially on media. I believe you can grasp the meaning from context my friend :)

    • @superpie0000
      @superpie0000 1 year ago

      @@snoosri underrated comment

    • @sparshagarwal7289
      @sparshagarwal7289 2 months ago

      So true

  • @marcinwaesa8713
    @marcinwaesa8713 2 years ago +3

    When you were explaining the code for the noise scheduler, the T value changed from 200 to 300, which I think should also be reflected in different (smaller) betas, since we end up with a smaller cumulative alpha.

  • @DmitryFink
    @DmitryFink 1 year ago

    How many epochs does it take to produce anything that does not look like noise? I've downloaded the dataset from Kaggle and replaced the data loader code in the Colab; the forward process works, however training doesn't seem to work. The loss is stuck from the very beginning at ~0.81, it doesn't go down, and the sampled pictures still look like noise. I am at epoch 65 and it does not seem to improve at all.

  • @namirahrasul
    @namirahrasul 8 months ago

    I didn't understand... why do we have to convert images to tensors?

  • @arymansrivastava6313
    @arymansrivastava6313 1 year ago

    Will it be possible to generate new images using this model if it is saved after training? Please share how to generate new images, if possible.

  • @ashishkannad3021
    @ashishkannad3021 3 months ago

    Why are we adding the time embedding to the input features, like literally adding them together? Would a simple concatenation of the input features and the time embedding be possible? Btw dope video, thanks for sharing.

  • @AI_ML_DL_LLM
    @AI_ML_DL_LLM 1 year ago +1

    Thank you mate for the video, can you make one for the "conditional Diffusion" too? thanks

  • @ShobeirKSMazinani
    @ShobeirKSMazinani 2 years ago

    What a great video! Loved it!

  • @harsh9558
    @harsh9558 1 year ago

    The explanation was awesome 🔥

  • @usama57926
    @usama57926 1 year ago +1

    Thank you! It was helpful

  • @tomquilter4370
    @tomquilter4370 11 months ago +1

    data = torchvision.datasets.StanfordCars(root=".", download=True) is giving an instant error?

  • @王涛-d5x
    @王涛-d5x 1 year ago

    Excellent explanation. Learned a lot from your video, thank you~

  • @senpeng6441
    @senpeng6441 2 years ago

    Really good introduction! Thanks

  • @FrankWu-hc1dl
    @FrankWu-hc1dl 8 months ago

    Hey I have a question: I think in the Colab notebook you only sample one time step from each image in a batch, but I was wondering why we don't take more intermediate time steps from each sample?
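
    (For context, a rough sketch of the usual training step, not the notebook verbatim and assuming torch, F = torch.nn.functional and the notebook's names: one timestep t is drawn uniformly at random per image, so over many epochs every image is seen at many different noise levels, which is much cheaper than computing all T noised versions of every image in every batch.)

    for batch, _ in dataloader:                                    # images scaled to [-1, 1]
        t = torch.randint(0, T, (batch.shape[0],), device=device)  # one random timestep per image
        x_noisy, noise = forward_diffusion_sample(batch, t)
        noise_pred = model(x_noisy, t)
        loss = F.l1_loss(noise, noise_pred)                        # or MSE against the true noise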

  • @erank3
    @erank3 2 years ago +4

    Thanks so much !! This is gold! Keep them coming. I’m curious to see the results of the model, can you share some more pictures?

    • @DeepFindr
      @DeepFindr  2 years ago +2

      Thank you! Happy that you liked it!
      I only have the pictures at the end of this video. Unfortunately I didn't save the model weights after the longer training, because I thought I wouldn't need them anymore :/

  • @sanjeevlvac1784
    @sanjeevlvac1784 1 year ago

    If anyone wants to implement DDPM for time-series data, which model would be good instead of a U-Net? Any suggestions?

  • @Zindit
    @Zindit 2 years ago

    Forward processing is very clear. Could you categorize the code blocks of backward processing?

  • @xczhou3340
    @xczhou3340 1 year ago

    Thanks for the amazing video !

  • @omarlopezrincon
    @omarlopezrincon 2 years ago

    did you share the code of the implementation for your personal gpu?

    • @DeepFindr
      @DeepFindr  2 years ago +1

      Hi, I don't have it anymore but it's basically the same as this one, just in a Python file.
      Also there were some comments below about changing parts of the code, e.g. the positional embeddings and final filter sizes, which might improve the performance.

    • @omarlopezrincon
      @omarlopezrincon 2 years ago

      @@DeepFindr thanks, are you a researcher ?

    • @DeepFindr
      @DeepFindr  2 years ago +1

      I work in applied research in the industry. For me it's the sweet spot between pure research and building software products :)

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Check the value of the variable "device" (it should be cpu or cuda (for GPU))

  • @xhinker
    @xhinker 1 year ago

    At 11:45, the third line of the formula should have a bar on top of alpha_t.

  • @jeonghwanh8617
    @jeonghwanh8617 1 year ago

    best explanation, and perhaps the sigma in the normal distribution graph should be sigma^2

  • @catfood7859
    @catfood7859 2 years ago +1

    Thanks for the great video first of all! I trained the model on a human face dataset using your code, and the sampling results show checkerboard (grid pattern) artifacts. How can I solve this?

    • @DeepFindr
      @DeepFindr  2 years ago

      Hi! Make sure to train the model long enough (e.g. Set 1000 epochs and see what happens). Also you might want to fine tune the model architecture and add more advanced layers like attention.
      I also encountered weird patterns at first but after training longer the quality got better.

    • @catfood7859
      @catfood7859 2 years ago +1

      @@DeepFindr Thanks for the advice, I'll try it : )

    • @KJPCox
      @KJPCox 2 years ago +1

      It helps if you set the final layer's filter size to 1

  • @sohampyne8009
    @sohampyne8009 1 year ago +1

    Really nice exposition. Can you please elaborate on the specifications of the machine it was trained on and approx how long the training took?

  • @infocus2160
    @infocus2160 2 years ago +1

    Many thanks. Excellent explanation. Can we use the diffusion models for the deblurring of images? These are Generative models, and I want to use them for image restoration problems. Thanks

    • @DeepFindr
      @DeepFindr  2 years ago +2

      Hi!
      Yes they can be used for image restoration as well. Have you seen this paper: arxiv.org/abs/2201.11793
      :)

    • @infocus2160
      @infocus2160 2 years ago

      @@DeepFindr Excellent thanks. You are amazing.

  • @asheeshmathur
    @asheeshmathur 1 year ago

    Excellent tutorial, but it looks like the code needs to be updated; the Stanford Cars dataset is gone. It's available at Kaggle - could you please update it accordingly?

  • @user-dc2vc5ju3m
    @user-dc2vc5ju3m 9 months ago

    Amazing explanation

  • @jfliu730
    @jfliu730 1 year ago

    I don't know what the p(x0:T) means at 19:12

  • @AyePhyuPhyuAung-o8g
    @AyePhyuPhyuAung-o8g 1 year ago

    Hi, thank you for the well explained video! I've been following your code and training the same on StanfordCars dataset. At epoch 65, the sampled images of my training just come out as grey images. Is there something wrong with my training? Should I adjust the learning rate?

    • @neelsortur1036
      @neelsortur1036 11 months ago

      also having this issue, did you figure it out?

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Yes, the images come out close to gray - there is something in there that averages the colors, or loses the RGB toward grays.

  • @AbhishekSingh-hz6rv
    @AbhishekSingh-hz6rv 1 year ago

    Did anyone get an output? I don't know why I am getting only noisy images at epoch 0.

  • @jungminhwang8115
    @jungminhwang8115 1 year ago

    Hi, thanks for the awesome work. I would like to reduce the image size, but when I changed it, training stopped working - could you send me some info? Also, I would like to change your code to the DDIM method; is it only the sampling part that changes? Could you send me detailed info?

  • @orrimoch5226
    @orrimoch5226 2 years ago +3

    Thanks! Great animation and explanations..amazing 🙏

  • @adamtran5747
    @adamtran5747 2 years ago

    love the content brother

  • @anonymousperson9757
    @anonymousperson9757 1 year ago +2

    Thanks for this amazing video! Do you plan on extending this video to include conditional generation at some point in the future? I would love to see an implementation of the SR3/Palette models that use DDPM for image to image translation tasks such as super-resolution, JPEG restoration etc. In this case, the reverse diffusion process is conditioned on the input image.

  • @tidianec
    @tidianec 2 years ago

    That was really clear. Thank you !

  • @oglee815
    @oglee815 8 months ago

    Beautifully explained

  • @yangjun330
    @yangjun330 1 year ago

    Thanks a lot. Can I ask how to choose the number of timesteps in diffusion? Is a larger number of timesteps always better?

    • @DeepFindr
      @DeepFindr  1 year ago

      Basically it's a hyperparameter. Not only the number of steps is relevant, but also the beta schedule (so the start and end values). In my experiments I was simply visualizing the data distributions to determine a good value. You have a good schedule if the last distribution follows a normal Gaussian with zero mean and std 1. Also, I have the feeling that a higher number of steps leads to higher fidelity, but I didn't look into this further.

  • @sriharsha580
    @sriharsha580 2 years ago

    Why is the time embedding appended to the features after the first CNN layer in the UNet? Why not add the time embedding at the initial step (before the UNet)?

    • @DeepFindr
      @DeepFindr  2 years ago

      You could also do that, but I added the timestep in each of the Unet blocks.
      I think that there are many possibilities to try things out :)

  • @isaacsalvador4188
    @isaacsalvador4188 2 years ago

    Do you know how to modify this diffusion model to accept a custom data set?

    • @DeepFindr
      @DeepFindr  2 years ago

      Yes, simply exchange the Dataset class with a custom dataset from pytorch. As long as it's images, the rest should work fine :)
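
      A minimal sketch of that swap with standard torchvision (the transforms mirror the ones discussed in the video; the folder path and layout are placeholders):

      import torchvision
      from torchvision import transforms
      from torch.utils.data import DataLoader

      IMG_SIZE = 64
      transform = transforms.Compose([
          transforms.Resize((IMG_SIZE, IMG_SIZE)),
          transforms.RandomHorizontalFlip(),
          transforms.ToTensor(),                     # to [0, 1]
          transforms.Lambda(lambda t: (t * 2) - 1),  # scale to [-1, 1]
      ])

      # expects a layout like my_images/<class_name>/xxx.jpg
      data = torchvision.datasets.ImageFolder(root="my_images", transform=transform)
      dataloader = DataLoader(data, batch_size=128, shuffle=True, drop_last=True)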

  • @피카라이언
    @피카라이언 1 year ago

    Thank you for the awesome guide :D
    Just 1 simple question: in the plotted image, are we looking at x0, x1, ..., x10, where x0 is the image at the very left (the denoised version) and x10 is the image at the very right (the most noised)?

  • @chrislloyd1734
    @chrislloyd1734 2 years ago +20

    How can a model that is only 3.2 GB produce the almost infinite image combinations that can be generated from a simple text prompt, with so many language variables? What I am interested in is how a prompt of, say, "a monkey riding a bicycle" can produce something that visually represents the prompt. How are the training images tagged and categorized to do this?
    As a creative person we often say that an idea is still misty and has not formed yet. What strikes me about this diffusion process is the similarity to how our minds seem to work at a creative level. We iterate and de-noise the concept until it becomes concrete, using a combination of imagination and logic. It is the same process that you described to arrive at the finished formula.
    What also strikes me about the images produced by these diffusion algorithms is that they look so creative and imaginative. Even artists are shocked when they see them for the first time and realize a machine made them. My line of thinking here is that we use two main tools to acquire and simulate knowledge and experience: images and language. Maybe this input is then stored in a similar way to a diffusion model within our memory. Logic, creativity and ideas are just a consequence of reconstituting this data according to our current social or environmental needs. This could explain our thinking process and why our memory is of such low resolution. The de-noising process could also explain many human conditions such as depression, and even why we dream, etc.
    This brings up the interesting question: "Could a diffusion model be created to simulate a human personality?" Or provide new speed-think concepts and formulas for solving a multitude of complex problems, for that matter. The path would be 1) diffusion model idea/concept, 2) ask a model like GPT-3 to check if it works, 3) feed back to the diffusion model and keep iterating until it does, in much the same way as de-noising a picture. Just a thought from a diffusion brain.

    • @flubnub266
      @flubnub266 1 year ago +7

      It's because the subset of possible images we humans are interested in is actually very specific. If you think about it, infinite combinations isn't that complicated. It's when we want specific things that you need more information. It only takes a few KB of code to make a pseudorandom number generator that can theoretically output every possible image, but we would see the vast majority of those permutations as boring rainbow noise. Ironically, the storage space used by generative models is needed to essentially explain what we DON'T want, so that we are left with the very specific subset that does meet our requirements.

    • @DorinIonescu
      @DorinIonescu 1 month ago

      1. 3.2 GB - I suppose you are talking about Midjourney and similar, as this model is only 249 MB.
      2. Yes, we are reinventing the brain here. The kid who cannot reach the toy until his 200th move (epoch) develops the CNN inside his head to paint, and then sees an electronic CNN doing something similar ... it's magic until you see this video 🙂

  • @curiousseeker3784
    @curiousseeker3784 1 year ago

    that's insane math

  • @FelipeOliveira-gt9bf
    @FelipeOliveira-gt9bf 2 years ago +1

    Awesome video! What software are you using to draw these examples?

    • @DeepFindr
      @DeepFindr  2 years ago +1

      Thanks!
      It's nothing fancy - a mix of PowerPoint and DaVinci Resolve. :)

  • @Crized_man
    @Crized_man 11 months ago

    Great video! Can you give any tips for 128x128 image generation with that model, please?

    • @DorinIonescu
      @DorinIonescu 1 month ago

      Coincidence - I was working on that. The 128x128 results (a 1152 x 768 mosaic) are on my site, Videnda AI.

    • @DorinIonescu
      @DorinIonescu 1 month ago

      I also added the models. I saved 2 versions, state dict and full model, as I noticed at sampling time that you may find differences:
      model_dict_epoch_250_128x128_cars.pth
      model_full_epoch_250_128x128_cars.pth
      model_dict_epoch_499_64x64_cars.pth
      model_full_epoch_499_64x64_cars.pth

  • @frederictost6659
    @frederictost6659 1 year ago

    Thank you for the video.

  • @sienloonglee4238
    @sienloonglee4238 2 years ago

    very nice video and very easy to understand

  • @arnob3196
    @arnob3196 2 years ago +1

    How long did it take to train 500 epochs on your RTX 3060?

    • @DeepFindr
      @DeepFindr  2 years ago

      Hi! Good question, it was certainly several hours. I ran it overnight

  • @kidzheng8531
    @kidzheng8531 2 years ago +2

    Thanks for sharing this tutorial. It's very kind to beginners.

  • @xingyubian5654
    @xingyubian5654 2 years ago

    goated content

  • @JDechnics
    @JDechnics 2 years ago +2

    How did you make sure that the cars generated at the end are truly original generations and not just copies of some cars in the dataset?

    • @DeepFindr
      @DeepFindr  2 years ago +3

      This actually relates to all generative models - how to make sure that the model doesn't simply memorize the train set.
      For example I've also seen this discussion for GANs: www.lesswrong.com/posts/g9sQ2sj92Nus9DeKX/gan-discriminators-don-t-generalize
      There is also some research in that direction: openreview.net/forum?id=PlGSgjFK2oJ
      To answer your question: you need to sample some data points and compare them with the nearest matches in the Dataset to be sure the model didn't overfit. More data always helps of course, to make it less likely that the model memorizes the whole dataset.