Absolute BEST MCMC video, this helped me develop a true intuition of the topic
🙏
This is the best MCMC video I have seen. I also like the history part. Please continue making such high quality videos.
🙏
I watched this video to clean up some of my understanding around MH and get some visual grounding. Excellently explained and nice cuts of humor to keep it engaging!
What I didn't expect was this touching tribute to Arianna Rosenbluth. Women certainly did not get proper recognition despite the incredible contributions they made to the field. Thanks for sharing this important slice of history and putting emphasis on her contributions.
🙏
One of the best explanations for one of the most challenging topics to understand from a book... your knowledge, research, and experience are reflected in the video. I personally find you underrated... best wishes for more subscribers!
Thanks 🙏… please share it with your colleagues.
Please, could you make a video on Subset Simulation (SS)?
I have not used Subset Simulation in practice (I generally do not like to make tutorials on subjects that I have not used in practice); maybe I will make a tutorial some time later after I have tried it myself. No promises. Sorry!
Wow, this is a rare hidden gem. Best video on this subject. I also enjoy the history of the field and the researchers who had worked on this almost a century before me. Quite chilling to think about it.
🙏
Thank you, Sir Sachdeva, I love your witty way of explaining and also the visual representation!
🙏😃
I am grateful for your lectures ❤ What a wonderful service you've done for all learners.
Dr. Sachdeva, I am riveted by your tutorials. So clear. Thank you for the historical perspective, especially Arianna Rosenbluth. So important to remember who these ideas/methods came from. In my world (Markov modeling of ion channels, and other stuff), I think we use the phrase "detailed balance" interchangeably with "microscopic reversibility" (a phrase due to Tolman, I think, from the 1930s?). The idea is that a closed physical Markov system (say, an ion channel protein transitioning between different conformational states), at thermodynamic equilibrium, should be completely symmetrical and reversible. If there is a loop in the Markov chain (e.g., S1→S2→S3→S1, where Sn is a discrete state), then the *product* of transition rates should be the same clockwise vs counter-clockwise. Thermodynamically, this is equivalent to conservation of free energy, similar to the idea that traversing any landscape and returning to the same point will result in the same final energy. Is this the same meaning that you intend for "detailed balance"? If so, why is this mathematically necessary? In real life, ion channels are not in a closed system and are rarely at thermodynamic equilibrium. There are ionic gradients and voltage gradients. A biological cell (e.g., a neuron) is not a closed thermodynamic system; it exchanges energy with its environment (the "bath"). So the assumptions about "microscopic reversibility" may not (do not) strictly apply. Nevertheless, I have encountered a lot of resistance about "violating" microscopic reversibility in some of my published models of ion channel function. So, my questions are: a) is "microscopic reversibility" the same as what you mean by "detailed balance"; b) is detailed balance a mathematical necessity, or can it depend on conditions; c) should I retract all of my previously published papers :-}? Again, thank you. These tutorials are wonderful.
Will have to spend time pondering over it :) … it’s been a while now. Thanks for your kind words.
Wow, I wish I had found this video earlier. Your channel in general actually! So far it is a hidden gem but I am sure it will not remain that for long
🙏
Thanks so much for creating the video. Your videos are always concrete, visual, and intuitive. I wish the other videos that dump math formulas from textbooks would learn from you how to teach and explain complex things in a simple, visual, intuitive manner.
🙏 thanks for the kind words. Much appreciated.
Simply excellent; it shows how much hard work you have put into making this video.
🙏
The only explanation that I didn't fall asleep listening to. Thank you very much.
😃😃
Excellent explanation. It took me some hours of searching about this algorithm; you explain it perfectly.
Actually, the whole point is the acceptance probability, which relates to the shape of our final curve/distribution. For example, in Bayes we don't need to compute likelihood*prior analytically but just to take a number out of it.
The normal is used as a distribution to search around the point x_t, so we are searching around each point of the x-axis and accepting it as "part" of our distribution according to the acceptance probability.
Perfect! Thank you
🙏 glad it was helpful to you!
Can you further elaborate on your conclusion?
this is the best explanation I have ever seen
🙏
This video is a gem on this subject. Really, congrats on your teaching skills!
🙏
This is exactly what I have been looking for, thank you so much!
🙏
Quite incredible honestly. I would pay for those videos, they're that good.
🙏 Just this appreciation is enough.
I haven't had any cycles in the last few months to work on videos, but I will make them for sure.
@@KapilSachdeva I hope you make a video on the convergence criteria for MCMCs, especially Metropolis-Hastings, and the thinning phenomenon.
What an amazing video! Congratulations on your work!
Outstanding explanation! Thank you so much Sir!
🙏
How can we just sample from the target distribution in order to calculate the acceptance ratio? Doesn't that defeat the purpose of the algorithm: Why wouldn't I just take samples from the target directly? Or is the problem that we are not able to draw independently from the target?
best MCMC video I have seen
🙏
Thank you for such a brilliant tutorial. Any videos on Gibbs sampling?
🙏
Excellent overview. Highly appreciated.
Excellent video, very clear explanation -- thank you so much!
🙏
Excellent! That was an amazing tutorial!
Thank you very much for the excellent content and presentation. Stay blessed
🙏
This video is so very intuitive! Really, thank you. You saved me 😊😊😊😊😊😊😊😊!!!
🙏
Sir, I am an assistant professor of Bioinformatics and I am using this for molecular phylogenetics... thank you so much, sir, for this video.
🙏
Thanks a lot for this clarifying video.
🙏
Maybe it's a silly question...but why not just sample over the entire known range [-6, 6]? What is the need for sampling using the uniform distribution?
When we say we sample, it means that those samples should come in the same proportion as the distribution function. In other words, for a given function, samples between [-6, 6] that have a high probability of occurring should occur more often.
Now the problem is: how do you do sampling for an arbitrary function? We can only obtain samples from the uniform distribution, which gives samples that are all equally likely. The sampling algorithms (rejection, importance, etc.) are about transforming the samples from the uniform distribution to the target distribution.
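Here is a minimal sketch of that idea using rejection sampling (not the exact code from the tutorial; the bimodal target and the bound M are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical unnormalized bimodal target on [-6, 6].
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

M = 1.1  # an upper bound on f over [-6, 6]

samples = []
while len(samples) < 10_000:
    x = rng.uniform(-6, 6)   # every candidate x is equally likely
    u = rng.uniform(0, M)    # uniform height under the bound
    if u < f(x):             # keep x in proportion to f(x)
        samples.append(x)

# A histogram of `samples` now follows f, even though we only ever
# drew from uniform distributions.
```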
How do we evaluate the target function f, if we assume that it is not known and we want to discover it?
Nice and simple explanation ☺️
Thank you 🙂
At 3:13, why do you refer to x = 0.6 as the first value while immediately after you refer to x = 0.6 as the first sample? Shouldn't x = 0.6 be a single value from a sample of data?
It’s the first sample.
I have a question. If we know f(x), since we need it to calculate alpha, what is the point of the Metropolis-Hastings algorithm? To guess the distribution of X, which we already use in our algorithm?
You need to consider a few things:
Even if you know the functional form of your distribution function, you still cannot directly sample from it. Computing the log prob and sampling are two different things.
Secondly, most of the time you do not know the functional form. All you have is the product of likelihood and prior. This is where Metropolis-Hastings is useful.
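A minimal sketch of that distinction (the target below is hypothetical; we can evaluate its unnormalized log density at any point, yet we still cannot draw from it directly, so we let Metropolis-Hastings do it):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_f(x):
    # Unnormalized log target, e.g. log(likelihood * prior).
    # We can evaluate it, but we cannot sample from it directly.
    return np.logaddexp(-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2)

x = 0.0
samples = []
for _ in range(50_000):
    x_new = rng.normal(x, 0.4)             # symmetric Normal proposal
    log_alpha = log_f(x_new) - log_f(x)    # normalizer cancels in the ratio
    if np.log(rng.uniform()) < log_alpha:  # accept with prob min(1, alpha)
        x = x_new
    samples.append(x)                      # a rejected step repeats current x
```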
Thank you very much for this video, it was really helpful.
🙏
Thanks a lot! Awesome explanations! please extend it for HMC and NUTS as well :)
🙏
At 20:37 you say that f = likelihood * prior, but shouldn't it be f = likelihood * prior / N, so that when calculating alpha the N's cancel each other out, and that's how we avoid the integral in N? Also, in my opinion you missed the most important part: HOW WILL THE ESTIMATED distribution look? How do I get it from these sample points? What is q? Is it the pdf of N(x_t, 0.4) in this example?
The (normalized) target function does include the normalizing constant.
I do show the approximated (estimated) function using the histogram in the tutorial. It was constructed using the samples that were collected.
q is the Normal distribution in this example. I also show that using the yellow function 😳
@@KapilSachdeva Well, at 20:42 there is literally f(x) = likelihood * prior on the screen, and it got me confused. In my opinion it should be clarified that f(x) = likelihood * prior / N, and then when calculating f(x_{t+1})/f(x_t) the N's cancel out. It is already a hard topic, and when I saw f(x) = likelihood * prior, at first I thought we were trying to approximate just the integral in the denominator, not the whole thing.
Apologies for the inconvenience it caused you.
As you correctly figured out we are "not" trying to approximate the integral in the denominator.
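To make the cancellation explicit (writing N for the normalizing constant, a sketch of the algebra):

$$
\alpha = \min\left(1, \frac{f(x_{t+1})}{f(x_t)}\right)
= \min\left(1, \frac{\text{likelihood}(x_{t+1})\,\text{prior}(x_{t+1})/N}{\text{likelihood}(x_t)\,\text{prior}(x_t)/N}\right)
= \min\left(1, \frac{\text{likelihood}(x_{t+1})\,\text{prior}(x_{t+1})}{\text{likelihood}(x_t)\,\text{prior}(x_t)}\right)
$$

So evaluating the unnormalized product is all the algorithm ever needs.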
I have an algorithm for MCMC and would like your interpretation of it. Is this possible?
I am not sure if I understand your question. Are you asking to explain another MCMC algorithm?
@@KapilSachdeva I have an MCMC problem that is solved, but I don't understand which is which in it, i.e., what is f(x), what is the proposal distribution, etc.
In real life we don't know the target distribution f(x). How did you calculate alpha, f(x_{t+1})/f(x_t), for the various sample points?
amazing explanation
🙏
Thank you so much. People like yourself are really creating a great impact in the education industry! I paid probably $xxxx.xx to study this concept, which you taught in ~20 min. Can you help with Gibbs sampling as well?
🙏
Amazing video. Thank you so much!
🙏
Excellent as always!
Can you make a video on Langevin Monte Carlo as well?
🙏
When you say states, do you mean the random variable or the row vector of probabilities? In previous videos, states (s1, s2, s3, and s4) were referred to as the probabilities within the random variable x.
Plus, when you say parameter, do you mean the states defined above or the random variables?
States do not mean probabilities. States mean the values a random variable can take. With each state is "associated" its probability.
Parameters in Bayesian statistics are random variables. This means a parameter can have various states.
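A tiny concrete picture (the numbers are hypothetical):

```python
# A discrete random variable with four states.
states = ["s1", "s2", "s3", "s4"]  # the values the variable can take
probs = [0.1, 0.4, 0.3, 0.2]       # the probability associated with each state
```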
@@KapilSachdeva Very clear. I carried on the notations from your previous video. By the way, you are very good at teaching.
Question: if a sample is rejected... then how does it get accounted for in the final histogram? Intuitively, it seems like once a sample is rejected, it will always be rejected, and the final histogram would be missing quite a bit. Also, it seems that where we accept samples, the acceptance would grow indefinitely at that value of the random variable, and the histogram would overextend above the distribution we want to approximate in that specific area.
"once a sample is rejected it will always be rejected"
Above conclusion is not necessarily correct. Once you have the "stationary markov chain" then the sample will be accepted (in a longer run) in accordance with its plausibility. In other words, a less likely sample will be accepted less number of times. It is shown in the histogram as well. See the links in the description of the video to run the code yourself.
I have feeling that in you are not including/considering the notion of stationary markov chains.
There is another concept that I have not gone over - "burn in" (or discarding the initial samples). That "burn in" (or the samples we throw away in the initial quite many steps) would help prevent some of the corner cases that you are worried about.
This all being said, the quality of the reproducibility of the target function using samples depends on how complex the target function is . Complexity would equate to higher dimensional functions and shape of it. This is where more advanced MCMC algorithms play a role.
Hope this helps in reducing the concerns.
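A minimal sketch of burn-in (the toy target, starting point, and burn-in length are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_f(x):
    return -0.5 * x ** 2  # toy unnormalized target: a standard Normal

x = 10.0  # deliberately bad start, far from where the mass is
chain = []
for _ in range(20_000):
    x_new = rng.normal(x, 0.4)
    if np.log(rng.uniform()) < log_f(x_new) - log_f(x):
        x = x_new
    chain.append(x)

burn_in = 2_000           # a judgment call; enough for this toy chain
kept = chain[burn_in:]    # the early samples still "remember" x = 10
# A histogram of `kept` matches the target better than one of the full chain.
```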
@@KapilSachdeva Ah ha, thanks for nudging my assumptions in the correct direction. So in layman's terms... a stationary Markov chain means the probability at f(x_t) is equal to the probability at f(x_{t+1}), or in other words the histogram at f(x_t) is of the same height as the histogram at f(x_{t+1})... in that case we would start rejecting the sample at f(x_t) and begin to accept the nearby samples that were once rejected, as that once-rejected sample would be more plausible. Visually speaking, it's almost as if the peak of our Gaussian distribution rises with the histogram height to the point where, once the Gaussian rises above the underlying distribution, the once-rejected samples begin to get accepted.
It does not mean that.
A stationary Markov chain means that the transition probabilities between the various states are "fixed". The transition probability to move from state 1 to state 2 will be different from that from state 1 to state 3, etc.
It might be a good idea to watch the Markov Chain tutorial in this series, which is required to understand MCMC algorithms.
ruclips.net/video/CIe869Rce2k/видео.html
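A small illustration of "fixed" transition probabilities (the 3-state matrix below is hypothetical):

```python
import numpy as np

# Row i gives P(next state | current state i); these numbers never change
# from step to step, which is what the reply above means by a stationary
# (time-homogeneous) Markov chain.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

pi = np.array([1.0, 0.0, 0.0])  # start entirely in state 1
for _ in range(100):
    pi = pi @ P                 # one step of the chain
print(pi)  # converges to the stationary distribution, where pi @ P == pi
```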
I don't know if the sum of two independent normal variables has 2 peaks..
Thanks for the video! Kris Hauser's notes are not accessible anymore! Could you perhaps upload it and give us a new link, if you have the pdf saved? I would really appreciate that. Thanks!!
Indeed, the link is invalid now.
But I do have a copy of it :) Let me know if the link below is accessible to you:
drive.google.com/file/d/1_pGhAdG2q58H-T6NaFuFGF_ut6JJhH7i/view?usp=share_link
@@KapilSachdeva Awesome! Thanks a lot
It is still not very clear to me why the Hastings generalization is required. It would have been better if you had explicitly pointed out the problem with Metropolis that requires the generalized solution.
Very good, very helpful!
🙏
Very nice video, thank you very much!
🙏
Excellent!!!!!!!!!!!!!!!!!!
🙏
What if you want to sample from a distribution without knowing what the target distribution is?
This is the primary purpose of MCMC methods, i.e., to sample from an unknown target distribution function.
This is why you create multiple Markov chains and compare their distributions. If they are similar, it is a clue that you have succeeded in sampling from the unknown target distribution function. I talk about it around 12:45 in the tutorial.
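A minimal sketch of that multi-chain check (toy target; agreement of the summaries is a clue, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_f(x):
    return -0.5 * x ** 2  # toy unnormalized target

def run_chain(x0, steps=20_000):
    x, out = x0, []
    for _ in range(steps):
        x_new = rng.normal(x, 0.4)
        if np.log(rng.uniform()) < log_f(x_new) - log_f(x):
            x = x_new
        out.append(x)
    return np.asarray(out)

# Start two chains from very different points; after burn-in their
# summary statistics should roughly agree if both found the target.
a, b = run_chain(-8.0), run_chain(+8.0)
print(a[2_000:].mean(), a[2_000:].std())
print(b[2_000:].mean(), b[2_000:].std())
```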
@@KapilSachdeva How do you compute alpha at 9:29 if you don't know f(x), since you say MCMC samples from an unknown target distribution function?
@@nikolaobradovicpi Let's answer your question considering 2 scenarios:
1) You know the target function but you cannot sample from it. Knowing the function here refers to having the formula of the function (pass in the input and get the output). This is the case with our toy example. Sampling to recreate a function vs evaluating a function are two different things.
2) The second situation to consider is that you do not even know the functional form, and then your question is valid, i.e., how do we even get to evaluate it? This is where you rely on computing/evaluating it "indirectly". As an example, in the (Bayesian statistics) posterior distribution, the numerator consists of 2 components: likelihood and prior. You have access to their functional forms but not to the functional form of the product of these components. However, for the evaluation you need to evaluate the likelihood and prior separately and multiply the results. This is how you will get alpha!
I talk about this later in the tutorial when explaining the "Hastings" part and show that your target function is the product of likelihood and prior.
Go to 20:00 and you will see the explanation in the tutorial.
Hope this makes sense.
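A minimal sketch of scenario 2 (the model, data, and priors below are entirely made up for illustration):

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.2, 0.7, 1.9, 1.1])  # hypothetical observations

def log_target(theta):
    # Evaluate likelihood and prior separately, then combine
    # (multiplication becomes addition in log space).
    log_lik = norm.logpdf(data, loc=theta, scale=1.0).sum()
    log_prior = norm.logpdf(theta, loc=0.0, scale=5.0)
    return log_lik + log_prior  # unnormalized log posterior

theta_t, theta_new = 0.5, 0.8
alpha = min(1.0, np.exp(log_target(theta_new) - log_target(theta_t)))
```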
@@KapilSachdeva Thank you very much sir for the response. Your tutorials are great!
🙏
Thank you!
🙏
Thank you so much, sir.
🙏
thank you!
🙏
Superb explanation
🙏
When you don't know how to name something, you simply call it a Kernel, hahah, good one!
😊🙏
Hah, never knew anything about this topic. Thanks. But there was a lot of noise cut-off in between. Idk.
Thanks Sunny. Noise cut-off... :( ... sorry if that is the case, but I am not able to observe it. At what time does it happen?
@@KapilSachdeva Ah, my bad. My Bluetooth was running out of battery. Didn't realize it was the headphones 🤦‍♂️
:) … you made me nervous … but thanks for verifying. 🙏