Thanks for showing the pure beauty of math underlying beneath these ML concepts. My goal is to finish this playlist by the end of 2024! Thanks a Kilian, from Azerbaijan!
impressive lecture, thanks a lot! I was also impressed to discover that, if instead of taking the MAP you take the EAP (expected a posteriori), then the Bayesian approach implies smoothing even with uniform prior (that is alpha=beta=1)! beautiful
"Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection
At circa 37:28, professor, you say some on the lines of 'which parameter makes our data most likely', could I say in other words: 'which parameter it is that corresponds to this distribution of data' ? But not 'which parameter most probable corresponds to this distribution' ? Or neither? Because what confuses me is reading this P(D|theta) . I read as what's the probability of this data / dataset given I got this theta/parameters/weights, because when I start, I start with the data, then I try to estimate the parameters not the opposite. Suppose I have somehow weights then I try to discover the probability that this weights/parameteres/theta belongs to this dataset. Weird. Am I a Bayesian? Lol. (e.g. logistic classification task for fraud). Kind Regards!
Yes, you may be in the early stadium of turning into a Bayesian. Basically if you treat theta is a random variable and assign it a prior distribution you can estimate P(theta|D) i.e. what is the most likely parameter given this data. If you are frequentist, then theta is just a parameter of a distribution and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta) i.e. which parameter theta makes my data most likely. (In practice these two approaches end up being very similar ...)
I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong but looking at 18:01 wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D;theta)? Since we're looking at the probability of H given the Data, while using theta as a parameter?
@@kilianweinberger698 sir I saw that lec and it's notes as well but in notes it's mentioned about Bayes optimal classifier but I don't think it's there in the video lec. Please correct me if I'm wrong. Thank you for your reply 😊
Edit: After thinking, and checking, and finishing the lecture, and watching a bit of the lecture after this one i have came to the conclussion that my first explanation was wrong, as i didn't have enough knoweledge. The way it is calculated is good and fine, where i struggled was to understand the right PDF the Professor was using. What threw me off was P(D; theta) which is a joint PDF (i know it's PMF, but for me they are all pdfs if you put delta function in there) of obdaining exactly data D, because D is a realization of some random vector X, so to be more precise in notation P(D; theta) should be written as P(X = D; theta). But what Professor meant was the PDF P(H = n_h; len(D), theta) which is a binomial distribution. Then we can calculate MLE just as it was calculated during the lectures. But this is not the probability of getting the data D, but the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have conditional PDF H|theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)), we treat theta as random variable but len(D) as a parameter. There are two problems with explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered, this data is the realization of random vector X. n_h is the number of heads tossed in D. nCr(x, y) is combinations of x choose y. 1. Professor writes that P(D;theta) is equal to the binomial distribution of the number of heads tossed which is not true. Binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of obtaining a desired outcome (p), thus theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we have choosen it, and so n is fixed and most importantly it is known to us! Because of that, let us denote n = len(D) and then theta = p. Let now H = number of heads tossed, then P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_t) is precisely the distribution that was written by the Professor. I have also noticed that one person in comments asked why cannot we write P(H|D;theta), and more precisely P(H = n_h|len(D); theta). The reason for that is that len(D) is not a random variable, we are the one choosing the number of tosses and there is nothing random about it. Note that in this notation used in a particulat comment theta is treated as a parameter as it is written after ";". 2. To be precise P(X = D; theta) is a joint distribution. For example if we would have tossed the coin three times, then D = (d1, d2, d3) with d_i = {0, 1} (0 for tails and 1 for heads), and P(X = D;theta) = P(d1, d2, d3;theta). P(X = D;theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta|D) = P(X = D;theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta|D) is as function of theta, which value also depends on the underlying measurements D. Now, if the data is i.i.d. then we can write that P(X = D;theta) = P(X_1 = d1;theta) * P(X_2 = d2;theta) * ... * P(X_len(D) = d_len(D);theta) = L(theta|D) In our example of coin tossing P(X_i = d_i;theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i = {0, 1} (0 for tails and 1 for heads) Given that L(theta|D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)), where sum(d_i) is simply n_h, the number of heads observed. And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lacures was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly data D. Also for any curious person, the "true bayesian method" that was described by the Professor at the end is called minimum mean-squared estimation (MMSE), that aims to minimize the expected squared error between random variable theta and some estimation of theta using the data random vector g(X). To support my argumenting, here are sources i used to write the above statements: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2), and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as english is not my first language. As i'm still learning all this data science stuff i can be wrong, and i'm very open to any criticism and discussion. Happy learning!
You're mixing up 'distribution' and 'density'.P(d1, d2, d3;theta), this notation is correct but P(X = D;theta) is wrong as its a density function and you can't write like that. But since they are also probabilities(discrete), you can write like that here
Hi Killian, While calculating the likelihood function in the example. You have taken (nH+nT)choose(nH) also into consideration. Which doesn’t change the optimization though, but shouldn’t be there I guess, because in P(Data | parameter) , all samples being independent should just be Q^nH * (1-Q)^nT. Rght?
18:19 how did that expression even came, like what is this expression even called in maths. By the way I am b.tech. student so I guess I might not have read the math behind this expression.
Just look it up Binomial Distribution. Thats a usual way of writing probability of an event which follows binomial distribution. You may also wanna check Bernoulli's Distribution first.
I raise my hand. Why you assume any type of distribution when discussing? What if I don't know that formula? But what I see is nH and nT. Why not work with those?
I have a question. Is P(D;theta) the same as (D|theta)? The same value seems to be used for both in the lecture, but I recall Dr.Weinberger saying that there is a difference earlier in the lecture.
Well, for all means and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior the two notations are identical in practice.
At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D ; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach...I'm confused.
Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn‘t have likelihoods, it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
Hey Prof, Your lectures are really good. But if you could provide some real time applications/examples while explaining a few concepts it would let every one understand the concepts better !
@@kilianweinberger698 okay. I am now having a, "omg he replied!!" moment. :D. Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covered it from a mathematical perspective. I found yours on a friday and i have now finished 9 lectures in 3 days. Danke schön! :)
He is a fantastic teacher. And all the energetic teaching has given him a bad throat.
The key to deeper understanding of algorithms is the assumptions about the underlying data. Thank you and great respect.
This lecturer is amazing. As a Ph.D candidate, I always revisit the lectures to familiarise myself with the basics.
This is probably the best explanation I came across regarding the difference between the Bayesian and Frequentists statistics. :D
Thanks for showing the pure beauty of math underlying beneath these ML concepts. My goal is to finish this playlist by the end of 2024! Thanks a Kilian, from Azerbaijan!
48:46 He looks so happy.
impressive lecture, thanks a lot!
I was also impressed to discover that, if instead of taking the MAP you take the EAP (expected a posteriori), then the Bayesian approach implies smoothing even with uniform prior (that is alpha=beta=1)! beautiful
Makes the concepts of MLE and MAP very very clear. We also get to know that - Bayesians and frequentists both trust the Bayes rule.
Someone from Algeria confirms that this lecture is incredible. You have transformed complex concepts very simple
"Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection
You are totally wrong !
Thank you Kilian, you are very talented!!
At circa 37:28, professor, you say some on the lines of 'which parameter makes our data most likely', could I say in other words: 'which parameter it is that corresponds to this distribution of data' ? But not 'which parameter most probable corresponds to this distribution' ? Or neither? Because what confuses me is reading this P(D|theta) . I read as what's the probability of this data / dataset given I got this theta/parameters/weights, because when I start, I start with the data, then I try to estimate the parameters not the opposite. Suppose I have somehow weights then I try to discover the probability that this weights/parameteres/theta belongs to this dataset. Weird. Am I a Bayesian? Lol. (e.g. logistic classification task for fraud). Kind Regards!
Yes, you may be in the early stadium of turning into a Bayesian. Basically if you treat theta is a random variable and assign it a prior distribution you can estimate P(theta|D) i.e. what is the most likely parameter given this data.
If you are frequentist, then theta is just a parameter of a distribution and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta) i.e. which parameter theta makes my data most likely.
(In practice these two approaches end up being very similar ...)
Thank you Dr. Weinberger. you are a great lecturer and also RUclips algorithm subtitles your "also" as "eurozone".
I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong but looking at 18:01 wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D;theta)? Since we're looking at the probability of H given the Data, while using theta as a parameter?
Starts at 2:37
From 30:00 pure gold content. :)
can someone tell where's the lecture in which he proved K nearest algorithm which he mentioned @5:09
ruclips.net/video/oymtGlGdT-k/видео.html
@@kilianweinberger698 sir I saw that lec and it's notes as well but in notes it's mentioned about Bayes optimal classifier but I don't think it's there in the video lec. Please correct me if I'm wrong. Thank you for your reply 😊
Thank you professor Kilian.
Amazing lecture, best explanation of MLE vs MAP
I think I finally understood a difference between frequentist and bayesian reasoning, thank you :)
when he says basically, it sounds like bayesly. And most of the time it still makes sense
48:18 loop this!! Thx Professor Weinberger!
Edit: After thinking, and checking, and finishing the lecture, and watching a bit of the lecture after this one i have came to the conclussion that my first explanation was wrong, as i didn't have enough knoweledge. The way it is calculated is good and fine, where i struggled was to understand the right PDF the Professor was using. What threw me off was P(D; theta) which is a joint PDF (i know it's PMF, but for me they are all pdfs if you put delta function in there) of obdaining exactly data D, because D is a realization of some random vector X, so to be more precise in notation P(D; theta) should be written as P(X = D; theta). But what Professor meant was the PDF P(H = n_h; len(D), theta) which is a binomial distribution. Then we can calculate MLE just as it was calculated during the lectures. But this is not the probability of getting the data D, but the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have conditional PDF H|theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)), we treat theta as random variable but len(D) as a parameter.
There are two problems with explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered, this data is the realization of random vector X. n_h is the number of heads tossed in D. nCr(x, y) is combinations of x choose y.
1. Professor writes that P(D;theta) is equal to the binomial distribution of the number of heads tossed which is not true. Binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of obtaining a desired outcome (p), thus theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we have choosen it, and so n is fixed and most importantly it is known to us! Because of that, let us denote n = len(D) and then theta = p. Let now H = number of heads tossed, then
P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_t)
is precisely the distribution that was written by the Professor. I have also noticed that one person in comments asked why cannot we write P(H|D;theta), and more precisely P(H = n_h|len(D); theta). The reason for that is that len(D) is not a random variable, we are the one choosing the number of tosses and there is nothing random about it. Note that in this notation used in a particulat comment theta is treated as a parameter as it is written after ";".
2. To be precise P(X = D; theta) is a joint distribution. For example if we would have tossed the coin three times, then D = (d1, d2, d3) with d_i = {0, 1} (0 for tails and 1 for heads), and P(X = D;theta) = P(d1, d2, d3;theta). P(X = D;theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta|D) = P(X = D;theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta|D) is as function of theta, which value also depends on the underlying measurements D. Now, if the data is i.i.d. then we can write that
P(X = D;theta) = P(X_1 = d1;theta) * P(X_2 = d2;theta) * ... * P(X_len(D) = d_len(D);theta) = L(theta|D)
In our example of coin tossing
P(X_i = d_i;theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i = {0, 1} (0 for tails and 1 for heads)
Given that
L(theta|D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)),
where sum(d_i) is simply n_h, the number of heads observed. And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lacures was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly data D.
Also for any curious person, the "true bayesian method" that was described by the Professor at the end is called minimum mean-squared estimation (MMSE), that aims to minimize the expected squared error between random variable theta and some estimation of theta using the data random vector g(X).
To support my argumenting, here are sources i used to write the above statements: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2), and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as english is not my first language. As i'm still learning all this data science stuff i can be wrong, and i'm very open to any criticism and discussion. Happy learning!
You're mixing up 'distribution' and 'density'.P(d1, d2, d3;theta), this notation is correct but P(X = D;theta) is wrong as its a density function and you can't write like that. But since they are also probabilities(discrete), you can write like that here
Just Brilliant !! Thank you prof. kilian !!!
Hi Killian,
While calculating the likelihood function in the example. You have taken (nH+nT)choose(nH) also into consideration. Which doesn’t change the optimization though, but shouldn’t be there I guess, because in P(Data | parameter) , all samples being independent should just be Q^nH * (1-Q)^nT.
Rght?
Thank you. This is great service.
18:19 how did that expression even came, like what is this expression even called in maths.
By the way I am b.tech. student so I guess I might not have read the math behind this expression.
Just look it up Binomial Distribution. Thats a usual way of writing probability of an event which follows binomial distribution. You may also wanna check Bernoulli's Distribution first.
I raise my hand. Why you assume any type of distribution when discussing? What if I don't know that formula? But what I see is nH and nT. Why not work with those?
34:52 very interesting point!
MLE starts at 11:40
I have a question. Is P(D;theta) the same as (D|theta)? The same value seems to be used for both in the lecture, but I recall Dr.Weinberger saying that there is a difference earlier in the lecture.
Well, for all means and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior the two notations are identical in practice.
Ok, I get it now. Thank you for explaining this!
What's the name of the last equation?
nH + nT choose nH what exactly do u mean here?
At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D ; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach...I'm confused.
Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn‘t have likelihoods, it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
@@kilianweinberger698 Thanks a lot for the explanation. Highly appreciated.
I teared up at the end as well
Very enjoyable. I think a Killian is like a thousand million right?
I got confused at the end though. I need to revise.
In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
Impeccable
Made my day !! Learnt a lot !!
Great!
bayesian @23:30, then 32:00
thanks prof
Amazing
Hey Prof, Your lectures are really good. But if you could provide some real time applications/examples while explaining a few concepts it would let every one understand the concepts better !
xkcd commic mentioned in the lecture: xkcd.com/1132/
MVP
Wow, apt explanation. The captions were bad, at some point: "This means that theta is no longer parameter it's a random bear" 🤣
12:46
So the trick to getting past the spam filter is to use obscure words in the english language eh. Who wohuld have thought xD
Not the lesson I was trying to get across, but yes :-)
@@kilianweinberger698 okay. I am now having a, "omg he replied!!" moment. :D. Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covered it from a mathematical perspective. I found yours on a friday and i have now finished 9 lectures in 3 days. Danke schön! :)
I'm in the 7th lecture. I hope I find myself commenting on the last one.
Don’t give up!
talking about logistics and taking stupid questions from students are major waste of talent of this great teacher.