This is one of the most principled, logically organized, and reasonably neat explanations I have ever watched on score-based and diffusion models. Amazing, Song.
This is one of the best presentations I have ever attended.
recommended to anyone who wants to understand beyond the mere "noising/denoising" type explanations on diffusion models
Thank you guys for making this talk available on your YouTube channel. This is pure gold.
What an amazing explanation. I wish there were an AI, or more authors, explaining their papers this clearly.
This probably has to be the best explanation of diffusion models out there. Thank you!
The presentation kept me interested throughout. The simplicity and effectiveness of the presentation blew my mind. Song is a genius. Long way to go!
9:52 it is intractable to compute the integral of the exponential of a neural network (the normalizing constant)
12:00 desiderata of deep generative models
19:00 the goal is to minimize the Fisher divergence between ∇_x log p_data(x) and the score function s_θ(x); we don't know the ground-truth ∇_x log p_data(x), but the score matching objective equals the Fisher divergence up to a constant, so they are the same from the optimization perspective.
23:00 however, score matching is not scalable, largely because of the Jacobian term, which requires many backpropagation passes. So, before computing the Fisher divergence, project the scores onto a random vector v; this removes the full Jacobian (only a vector-Jacobian product is needed) and makes the objective scalable. This is called sliced score matching.
29:00 denoising score matching. The objective is tractable because we design the perturbation kernel by hand (the kernel is easy to compute). However, because of the added noise, denoising score matching can only estimate the score of the noised distribution, not the noise-free one. Also, the variance of the denoising score matching objective grows and eventually explodes as the noise magnitude gets smaller.
31:20 with a Gaussian perturbation kernel, the denoising score matching objective takes a much simpler form. Optimize the objective with stochastic gradient descent, and be careful to choose an appropriate magnitude for sigma.
36:00 sampling with Langevin dynamics: initialize x_0 from a simple distribution (Gaussian, uniform), draw z from N(0, I), and then repeat the update step (see the sketch after these notes).
37:20 the naive version of Langevin dynamics sampling does not work well in practice because of low-density regions.
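To make the 31:20 and 36:00 steps concrete, here is a minimal PyTorch sketch (my own, not code from the talk) of the Gaussian-kernel denoising score matching loss and naive Langevin dynamics sampling; `score_net`, `sigma`, the step size, and the step count are illustrative assumptions.

```python
# A minimal sketch, not code from the talk: denoising score matching with a
# Gaussian perturbation kernel (31:20) and naive Langevin dynamics (36:00).
# `score_net`, sigma, step sizes, and step counts are illustrative placeholders.
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """x: (batch, dim). With q_sigma(x_tilde | x) = N(x, sigma^2 I), the kernel's
    score is -(x_tilde - x) / sigma^2, which becomes the regression target."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2
    return ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()

def langevin_sample(score_net, shape, n_steps=1000, step_size=1e-4):
    """Naive Langevin dynamics: x <- x + (eps / 2) * s(x) + sqrt(eps) * z, z ~ N(0, I)."""
    x = torch.randn(shape)  # initialize from a simple prior
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step_size * score_net(x) + step_size ** 0.5 * z
    return x
```

In practice one trains over several noise levels and uses the annealed sampler discussed further down the thread; this only shows the basic objective and update.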
It really shows how good the explanation is when even I can follow along. Thanks for sharing!
This is the best summarizing resource I have found on the topic. The visual aids are really helpful and the nature of the problem and series of steps leading to improved models, along with the sequence of logic are so clear. What an inspiration!
Wow... one of the best presentations on generative modelling!!
What a pleasant insight to think of the gradient of the log-density (the unnormalized "logits") as the score function! Thank you for sharing the great idea.
16:54 all papers referenced... This man is amazing
This man is dangerously good!!
Very good explanation !!!
"Sab ka saath, sab ka vikaas" (together with all, progress for all)
Amazing insights into generative models! Thanks for sharing this valuable knowledge!
I am 44 seconds into the talk and already wanna say thank you! :)
Really amazing explanation for the entire diffusion model. Clear, great, wonderful work.
Extremely insightful lecture that is worth every minute of it. Thanks for sharing it.
Amazing how useful just adding noise is!
I like your presentation; in the end, what we appreciate are the interpretations and the intuitions behind the methods, so that we can use them to solve other problems.
Very clear! Thanks for this amazing lecture!
46:30 was a true mic-drop moment from Yang Song 😄
a very accessible and amazing tutorial that explained everything clearly and thoroughly!
damn straight from yang song...
I love this talk! amazing and clear explanation!
Amazing explanation; it really helped me understand the diffusion model!
Crazy this is available for free!! ty
Oh, very clear explanation! Would it be possible to share this slide?
actually such a fire talk
Thanks for the talk, very illuminating.
Amazing explanation. Could you share the slides for the presentation?
Such an incredible talk. I was just curious how everyone here keeps track of all this knowledge; would love to hear from you all.
literally the best explanation!!!
amazing presentation!
Such a great explanation
Great, great stuff from an absolute expert.
Great video, thank you!
Amazing tutorial
MindBlowing
What is utilizing this now, and is it still SoTA? Did this improve OpenAI, MJ, Stability, etc.? It sounds promising, but I need more up-to-date information.
Excellent overview of excellent work, thank you! I am worried about simplified CT scans, however. Wouldn't we get bias toward priors when we're looking instead for anomalies? There needs to be a way to detect all abnormalities with 100% reliability while still reducing radiation. Is this possible?
Thank you!
Thank you so much, Song.
why solving "maximize likelyhood" problem is equal to solve "explicit score matching " problem? For example, once you get S(x, theta), you do get corresponding p(x); but is it the same P(x; theta) where theta maximize likelyhood?
nice video!
Why use annealed Langevin dynamics from the highest noise level down to the lowest, instead of running Langevin dynamics directly with the score model at the lowest noise level?
You use the perturbation noise to traverse the space and converge to high-density areas via Langevin dynamics. Due to the manifold hypothesis, large regions of the data space have essentially no density and thus no useful gradient for Langevin dynamics to follow. The large noise lets you traverse those regions. Once you are closer to the high-density areas, the noise level can be decreased and the process repeats.
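A rough sketch of that annealing loop, assuming a noise-conditional score model `score_net(x, sigma)` and a step size that shrinks with the noise level (both are my own illustrative assumptions, not code from the talk):

```python
# Rough sketch of annealed Langevin dynamics as described above; the
# noise-conditional score model `score_net(x, sigma)` and the exact
# step-size scaling are assumptions for illustration.
import torch

def annealed_langevin(score_net, shape, sigmas, n_steps_each=100, eps=2e-5):
    x = torch.rand(shape)  # start anywhere; at large sigma the score is informative everywhere
    for sigma in sigmas:   # sigmas sorted from largest to smallest
        step_size = eps * (sigma / sigmas[-1]) ** 2  # smaller steps at lower noise levels
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + 0.5 * step_size * score_net(x, sigma) + step_size ** 0.5 * z
    return x  # final samples after the smallest noise level
```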
This is amazing!
I have a question about the "prob. evaluation" part: does it mean using the ODE to calculate the likelihood p_theta(x_0)? But then how do we feed the original data x_0 into the diffusion model?
If we can obtain an equivalence between DDPM (training the network to predict the noise) and score-based training, then shouldn't both give the same kind of results?
How to be good in math like Dr. Song?
Stanford is awesome as always, thanks.
Now I need some GPUs
Wow, that was crystal clear.
27:44
Awesome.
Absolutely amazing.