I've never encountered anything better than this playlist. Thank you, Professor, for the detailed explanation and, most importantly, for presenting it in such an engaging way that it sparks a deep passion in everyone. Here I am in 2024, following since lecture one and feeling how much I have developed after your lectures.
The mediocre teacher tells. The good teacher explains. The superior teacher demonstrates. The great teacher inspires. ― William Arthur Ward
What a guy! Love the way you go about things, Prof. I hope you have a million subs soon.
I don't. If he does, we're out of jobs!
God that variance explanation using the minimization graph at the end hit so hard. Loved that!
This video and the whole playlist is a treasure trove for ML students
The fact that I am one of just 18k people who watched this lecture is amazing to me. Thanks for this great content; it really satisfied my passion for ML.
That was the best explanation I've ever seen. Thank you, sir, for sharing this knowledge with us.
The lecture very clearly explains the concepts; worth subscribing.
Amazing Way of Teaching. You're really a lifesaver, Sir.
K for Kilian!! How amazing these lectures are. Thanks, Prof. :)
Please, please share your other courses online. It's simply beautiful and very clearly explained. I just enjoy them without even trying hard to understand :)
Really beautiful explanation of the relation between regularization and early stopping. Just a question about the Bayes optimal classifier for regression at 22:48: I know how it is used for classification, but how is it used for regression?
If you have P(Y|X) you would typically predict the expected value or its mode (depends a little on the application).
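(A minimal sketch of that reply, with a hypothetical conditional distribution Y | X = x ~ N(2x + 1, 0.5); nothing here comes from the lecture itself. Under squared loss the Bayes optimal regressor predicts the conditional mean E[Y | X = x]; for a skewed or multimodal P(Y|X) you might predict the mode instead.)

```python
import numpy as np

# Hypothetical toy model: Y | X = x  ~  Normal(mean = 2*x + 1, std = 0.5)
def true_conditional_mean(x):
    return 2.0 * x + 1.0

def bayes_optimal_regressor(x):
    # Under squared loss, the Bayes optimal prediction is E[Y | X = x].
    return true_conditional_mean(x)

def conditional_mode_estimate(samples, bins=50):
    # For an asymmetric or multimodal P(Y|X), one could predict the mode instead;
    # here it is approximated from samples with a simple histogram.
    counts, edges = np.histogram(samples, bins=bins)
    i = np.argmax(counts)
    return 0.5 * (edges[i] + edges[i + 1])

x_query = 3.0
print(bayes_optimal_regressor(x_query))        # -> 7.0 for this toy distribution
samples = np.random.default_rng(0).normal(true_conditional_mean(x_query), 0.5, 10000)
print(conditional_mode_estimate(samples))      # close to 7.0 (the distribution is symmetric)
```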
could you by any chance link the "projects" that you set for you class please? it would be very beneficial for myself however if it would be time consuming dont worry about it :)
@jack - information about the projects and homework assignments is in the Notes to Lecture #1
Dear Kilian, one question please: when we want to detect underfitting, isn't looking at the training error and test error enough? (If we see both are high, we conclude underfitting.) Why do we need that graph (the one with increasing training instances)?
Yes, that’s fair. If it is a very clear case (and train/test error are high and almost identical) then you won’t need the graph.
Kilian Weinberger awesome. I got the idea. Thanks a lot.
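(For readers who want to reproduce that graph with increasing training instances: a rough sketch below, not the course's code, using scikit-learn's learning_curve on a made-up dataset. An underfitting model shows training and validation error flattening out close together at a high value.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Made-up dataset; a simple linear model stands in for a potentially underfitting learner.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

train_err = 1 - train_scores.mean(axis=1)   # training error per training-set size
val_err = 1 - val_scores.mean(axis=1)       # validation error per training-set size
for n, te, ve in zip(sizes, train_err, val_err):
    print(f"n={n:5d}  train error={te:.3f}  val error={ve:.3f}")
# Underfitting signature: both curves flatten out close together at a high error.
```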
A little bit of confusion about the noise: the lecture notes mention that the algorithm can never beat the noise part, because it is intrinsic to the data. But in the lecture you mentioned that we could reduce the noise by introducing more features. My confusion is: shouldn't introducing more features be a choice of the algorithm and be considered part of the algorithm? Many thanks.
Here I consider the feature extraction part independent of the algorithm. I.e. step 1 is you create a data set (with features), step 2 you train an ML algorithm on that data set. In high noise settings the algorithm (step 2) cannot really do much, but you can improve the results through the data (either by cleaning up features, removing mislabeled samples, or creating new features).
In the case of the features, the way I see it is that the noise is a term that represents all the features in the universe that I don't take into account but that could affect my data. For example, if I want to predict whether my water is boiling given the temperature, I would see that at a given temperature sometimes it is boiling and sometimes it is not. That could be caused by many factors, one of which is a variation in pressure. So if I now consider not only the temperature but also the pressure, i.e., add a new feature, the noise in my data should be reduced, because I'm accounting for a new factor that affects my data.
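(A toy simulation of that boiling-water intuition, with made-up numbers and a crude linear boiling-point model; it only illustrates how an ignored feature shows up as irreducible noise.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic version of the boiling-water example (all numbers are invented):
# the true boiling point depends on pressure, which acts as "noise" if we ignore it.
rng = np.random.default_rng(0)
n = 5000
temp = rng.uniform(80, 110, n)                 # degrees Celsius
pressure = rng.uniform(0.7, 1.3, n)            # atm
boiling_point = 100 + 25 * (pressure - 1.0)    # crude linear stand-in
y = (temp > boiling_point).astype(int)

X_temp_only = temp.reshape(-1, 1)
X_both = np.column_stack([temp, pressure])

for name, X in [("temperature only", X_temp_only), ("temperature + pressure", X_both)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    err = 1 - LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name:25s} test error = {err:.3f}")
# With temperature alone, the pressure variation behaves like irreducible noise;
# adding the pressure feature makes most of that error disappear.
```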
Thank you! I've watched a few of these lectures now, and you have a brilliant way of sharing these concepts.
I’m curious to hear your thoughts on why you would reserve a testing set in the implementation of k-fold cross-validation you describe.
My understanding is that you propose to split the data (D) from distribution (p) into a training and testing set and then run k-fold cross-validation on the training set. You would then use the testing set to estimate E[summation(loss(xi, yi))]~p, i.e., the expectation of the cost under distribution p. Yes, this would be an unbiased estimate of this cost.
However, is there not some value of k whereby simply running k-fold cross-validation on the entire dataset (D) would converge to this expectation, E[summation(loss(xi, yi))]~p?
Of course k=1 would be biased, but you reduce that bias when you average over multiple cross-validation sets. My hypothesis would be that it would converge to E[summation(loss(xi, yi))]~p and be the error you could tell your boss.
It boils down to expectation in both scenarios really being the expectation of cost on D. We use D because we don’t have p.
Would you point me in a good direction here?
Hmm, there may be bias given that we don't shuffle D each time we validate on a new subset... Will think on it.
The reason is that you use k-fold cross validation to pick your hyper-parameters across a set of many options. Let's say for nearest neighbors with k=1,3,5,7 you get an average cross validation error across the leave-out-sets of 3.1%, 4.2%, 2.7%, 3.5% respectively. So you pick k=5 and conclude your validation error is 2.7%. Very likely your true test error will be higher than that. The reason is that you picked the lowest value on the validation sets, so you are overly optimistic (i.e. you cheated ;-)). An estimate of the test error should always be that you have all hyper-parameters fixed prior to the evaluation, you run your classifier over the set that this classifier has never seen before and measure the error. Ultimately that's how it will also be when your classifier is exposed to new data. Hope this helps.
Kilian Weinberger thank you for your reply! And thanks again for your videos. I’ve really enjoyed them over the last week. I hope you have a good day
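(To make the protocol from the reply above concrete, a rough sketch on toy data; the dataset, fold count, and candidate k values are arbitrary. The key point is that the test set is split off first and touched exactly once, after the hyper-parameter is fixed.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for D; the point is the protocol, not the numbers.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1) Hold out a test set FIRST; it is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) Use k-fold cross-validation on the training portion to pick the hyper-parameter.
best_k, best_cv_err = None, np.inf
for k in [1, 3, 5, 7]:
    cv_err = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=5).mean()
    if cv_err < best_cv_err:
        best_k, best_cv_err = k, cv_err

# 3) Only now, with the hyper-parameter fixed, estimate the test error once.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_err = 1 - model.score(X_test, y_test)
print(f"best k={best_k}, cv error={best_cv_err:.3f}, held-out test error={test_err:.3f}")
# The cv error is optimistically biased (we picked its minimum); the test error is not.
```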
Hi Professor, it's a great lecture series. Thank you for sharing it with us. My question is: if the Bayes classifier has zero variance and zero bias error, then why don't we always get the best result with Bayes?
I had the same question too. I think he's referring to the "Bayes Optimal Classifier" which is defined in terms of the true underlying data distribution. So it actually doesn't have variance, because it doesn't depend on the dataset - it uses the true underlying distribution.
But no real-world classifier can exactly replicate this, because we don't know the real distribution. So for something like Naive Bayes, you are depending on the dataset (as that's how you calculate the data and feature probabilities) - and therefore you would have variance.
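(A small numpy sketch of that distinction, using a made-up two-Gaussian generative model: the Bayes optimal rule is computed from the true densities and is therefore identical no matter which training set is drawn, while a data-dependent rule shifts from sample to sample.)

```python
import numpy as np
from scipy.stats import norm

# Made-up generative model: P(y=0)=P(y=1)=0.5, X|y=0 ~ N(-1,1), X|y=1 ~ N(+1,1).
def bayes_optimal(x):
    # Uses only the TRUE distribution, never a dataset, so it is identical for every
    # training set (zero variance). For this symmetric toy model it is "predict 1 iff x > 0".
    return (norm.pdf(x, loc=+1) > norm.pdf(x, loc=-1)).astype(int)

rng = np.random.default_rng(0)
def sample_dataset(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2 * y - 1, scale=1.0, size=n)
    return x, y

# A data-dependent classifier (threshold halfway between the class means) changes
# from dataset to dataset; that is where variance comes from.
thresholds = []
for _ in range(5):
    x, y = sample_dataset(50)
    thresholds.append(0.5 * (x[y == 0].mean() + x[y == 1].mean()))

print("Bayes optimal prediction at x = [-0.5, 0.5]:", bayes_optimal(np.array([-0.5, 0.5])))
print("learned thresholds vary across datasets:", np.round(thresholds, 3))
```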
great great lecture!
I have one doubt: how does the training-set size influence bias? After all, bias is defined with respect to the average classifier over datasets of a fixed size (the parameter n), right? If n influences the variance, shouldn't it influence the bias as well?
Actually n does not affect bias. Bias is the error that the expected classifier would still make. However, as n->large, the variance will become very small and the remaining error will be dominated by bias. So if you have n very large and still high error, it is usually good to fight bias (i.e. get a more powerful classifier).
@@kilianweinberger698 But if I have little data, the possibility of being highly biased should be lower, right? (It is easier to fit less data than lots of it.)
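(A small simulation of the point above, under an assumed setup of fitting a straight line to a quadratic target: as n grows, the fitted lines agree more and more with each other, so variance shrinks, but their average is still wrong in the same way, so the bias stays put.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: x ** 2                     # target a straight line cannot represent
x_grid = np.linspace(-1, 1, 200)

for n in [10, 100, 1000]:
    preds = []
    for _ in range(200):                      # many independent training sets of size n
        x = rng.uniform(-1, 1, n)
        y = true_f(x) + rng.normal(0, 0.1, n)
        w1, w0 = np.polyfit(x, y, deg=1)      # fit a straight line
        preds.append(w0 + w1 * x_grid)
    preds = np.array(preds)
    mean_pred = preds.mean(axis=0)            # the "average classifier" for this n
    variance = preds.var(axis=0).mean()
    bias_sq = ((mean_pred - true_f(x_grid)) ** 2).mean()
    print(f"n={n:5d}  variance={variance:.4f}  bias^2={bias_sq:.4f}")
# Variance shrinks with n; bias^2 stays roughly constant (the line is still not a parabola).
```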
You should have 800k subs, not 8k
I guess even 80M would be too few, for the content he provides.
"kernel = linear classifier on steroid" - Killian lol