Machine Learning Lecture 20 "Model Selection / Regularization / Overfitting" -Cornell CS4780 SP17

  • Published: 7 Jan 2025

Comments • 34

  •  9 months ago +3

    I've never encountered anything better than this playlist. Thank you, Professor, for the detailed explanation and, most importantly, for presenting it in such an engaging way that it sparks a deep passion in everyone. Here I am in 2024, following along since lecture one and feeling how much I have developed after your lectures.

  • @utkarshtrehan9128
    @utkarshtrehan9128 4 years ago +10

    The mediocre teacher tells. The good teacher explains. The superior teacher demonstrates. The great teacher inspires. ― William Arthur Ward

  • @venugopalmani2739
    @venugopalmani2739 5 years ago +22

    What a guy! Love the way you go about things, Prof. I hope you hit a million subs soon.

  • @StevenSarasin
    @StevenSarasin 1 year ago

    God that variance explanation using the minimization graph at the end hit so hard. Loved that!

  • @vaibhavsingh8715
    @vaibhavsingh8715 3 years ago +1

    This video and the whole playlist are a treasure trove for ML students

  • @llll-dj8rn
    @llll-dj8rn 1 year ago

    The fact that I am one of just 18k who watched this lecture is amazing to me. Thanks for this great content; it really satisfied my passion for ML.

  • @alihajji2613
    @alihajji2613 5 years ago +4

    That was the best explanation I've ever seen. Thank you, sir, for sharing this knowledge with us.

  • @sandeepreddy6295
    @sandeepreddy6295 4 years ago +2

    The lecture explains the concepts very clearly; worth subscribing.

  • @PriyanshuSingh-hm4tn
    @PriyanshuSingh-hm4tn 2 years ago

    Amazing Way of Teaching. You're really a lifesaver, Sir.

  • @Ankansworld
    @Ankansworld 4 years ago +1

    K for Kilian!! How amazing these lectures are. Thanks, Prof. :)

  • @deepfakevasmoy3477
    @deepfakevasmoy3477 4 years ago +1

    Please, please share your other courses online. It's simply beautiful and very clearly explained. I just enjoy them, without even trying hard to understand :)

  • @jiahao2709
    @jiahao2709 4 years ago +1

    Really beautiful explanation of the relation between regularization and early stopping. Just one question about the Bayes optimal classifier for regression at 22:48: I know how it is used for classification, but how is it used for regression?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +1

      If you have P(Y|X) you would typically predict the expected value or its mode (depends a little on the application).
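
A tiny sketch of what that looks like for regression (illustrative only; the discrete conditional distribution below is made up, not from the lecture):

```python
import numpy as np

# Given a known conditional distribution P(Y | X = x) over a few possible y-values,
# the Bayes optimal prediction for regression is the conditional mean (optimal under
# squared loss) or the conditional mode (optimal under a 0-1 style loss).
y_values = np.array([1.0, 2.0, 3.0, 4.0])            # possible outcomes
p_y_given_x = np.array([0.1, 0.5, 0.3, 0.1])          # P(Y = y | X = x), sums to 1

mean_prediction = np.sum(y_values * p_y_given_x)      # E[Y | X = x] = 2.4
mode_prediction = y_values[np.argmax(p_y_given_x)]    # argmax_y P(y | x) = 2.0
print(mean_prediction, mode_prediction)
```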

  • @MrJackstarman
    @MrJackstarman 6 years ago +8

    Could you by any chance link the "projects" that you set for your class, please? It would be very beneficial for me; however, if it would be time consuming, don't worry about it :)

    • @billwindsor4224
      @billwindsor4224 5 years ago +1

      @jack - information about the projects and homework assignments is in the Notes to Lecture #1

  • @in100seconds5
    @in100seconds5 4 years ago +2

    Dear Kilian, one question please: when we want to detect underfitting, isn't looking at the training error and test error enough? (If we see both are high, we conclude underfitting.) Why do we need that graph (the one with increasing training instances)?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +2

      Yes, that’s fair. If it is a very clear case (and train/test error are high and almost identical) then you won’t need the graph.
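
A quick synthetic illustration of that rule of thumb (hypothetical data, not the course projects): a linear model on a non-linear concept leaves train and test error high and nearly identical, which is the underfitting signature even without the learning-curve plot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] ** 2 > 1.0).astype(int)       # target a linear model cannot represent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)  # deliberately too simple a model

train_err = 1 - clf.score(X_tr, y_tr)
test_err = 1 - clf.score(X_te, y_te)
# Both errors come out around 0.3 and close together -> underfitting (high bias);
# the learning-curve graph is mainly useful in less clear-cut cases.
print(f"train error {train_err:.2f}, test error {test_err:.2f}")
```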

    • @in100seconds5
      @in100seconds5 4 years ago

      Kilian Weinberger awesome. I got the idea. Thanks a lot.

  • @vincentxu1964
    @vincentxu1964 5 years ago +1

    A little bit of confusion about the noise. The lecture notes mention that the algorithm can never beat the noise part, because it is intrinsic to the data. But in the lecture you mentioned that we could reduce the noise by introducing more features. My confusion: shouldn't introducing more features be a choice of the algorithm and therefore considered part of the algorithm? Many thanks.

    • @kilianweinberger698
      @kilianweinberger698  5 years ago +5

      Here I consider the feature extraction part independent of the algorithm. I.e. step 1 is you create a data set (with features), step 2 you train an ML algorithm on that data set. In high noise settings the algorithm (step 2) cannot really do much, but you can improve the results through the data (either by cleaning up features, removing mislabeled samples, or creating new features).

    • @ivanvignolles6665
      @ivanvignolles6665 5 years ago +11

      In the case of the features, the way I see it is that the noise is a term that represents all the features in the universe that I don't take into account but that could affect my data. For example, if I want to predict whether my water is boiling given the temperature, I would see that for a given temperature it is sometimes boiling and sometimes not; that could be caused by many factors, one of which is a variation in pressure. So if I now consider not only the temperature but also the pressure, i.e. add a new feature, the noise in my data should be reduced because I'm accounting for a new factor that affects my data.
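
A small made-up simulation of that boiling-water intuition (the numbers and "physics" below are invented purely for illustration): the part of the label driven by pressure looks like irreducible noise to a temperature-only model, and disappears once pressure is added as a feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
temperature = rng.normal(100, 5, n)
pressure = rng.normal(1.0, 0.2, n)
boiling_point = 100 + 25 * (pressure - 1.0)   # made-up relation between pressure and boiling point
margin = temperature - boiling_point          # > 0 means the water boils

# Model that only sees temperature vs. model that also sees pressure.
model_T = LinearRegression().fit(temperature[:, None], margin)
model_TP = LinearRegression().fit(np.column_stack([temperature, pressure]), margin)

print("residual std, temperature only:     ",
      np.std(margin - model_T.predict(temperature[:, None])))          # ~5: leftover "noise"
print("residual std, temperature + pressure:",
      np.std(margin - model_TP.predict(np.column_stack([temperature, pressure]))))  # ~0
```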

  • @colinmanko7002
    @colinmanko7002 4 years ago +1

    Thank you! I've watched a few of these lectures now, and you have a brilliant way of sharing these concepts.
    I'm curious to hear your thoughts on why you would reserve a testing set in the implementation of k-fold cross-validation you describe.
    My understanding is that you propose to split the data D, drawn from distribution p, into a training and testing set and then run k-fold cross-validation on the training set. You would then use the testing set to estimate E_{(x,y)~p}[loss(x, y)], i.e. the expected cost under distribution p. Yes, this would be an unbiased estimate of this cost.
    However, is there not some value k whereby simply running k-fold cross-validation on the entire dataset D would converge to this expectation E_{(x,y)~p}[loss(x, y)]?
    Of course k=1 would be biased, but you reduce that bias when you average over multiple cross-validation sets. My hypothesis is that it would converge to E_{(x,y)~p}[loss(x, y)] and be the error you could tell your boss.
    It boils down to the expectation in both scenarios really being the expectation of the cost on D. We use D because we don't have p.
    Would you point me in a good direction here?

    • @colinmanko7002
      @colinmanko7002 4 years ago +1

      Hmm, there may be bias given that we don't shuffle D each time we validate on a new subset... Will think on it.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +2

      The reason is that you use k-fold cross validation to pick your hyper-parameters across a set of many options. Let's say for nearest neighbors with k=1,3,5,7 you get an average cross-validation error across the left-out sets of 3.1%, 4.2%, 2.7%, 3.5% respectively. So you pick k=5 and conclude your validation error is 2.7%. Very likely your true test error will be higher than that. The reason is that you picked the lowest value on the validation sets, so you are overly optimistic (i.e. you cheated ;-)). An estimate of the test error should always be obtained with all hyper-parameters fixed prior to the evaluation: you run your classifier over a set it has never seen before and measure the error. Ultimately that's how it will also be when your classifier is exposed to new data. Hope this helps.
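
A minimal sketch of that protocol (the dataset and hyper-parameter grid below are stand-ins, not from the course): cross-validate on the training split to pick k, then report the error of the chosen model on a test set it has never touched.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Pick the kNN hyper-parameter by 5-fold cross-validation on the training split only.
best_k, best_cv_err = None, np.inf
for k in [1, 3, 5, 7]:
    cv_err = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_tr, y_tr, cv=5).mean()
    if cv_err < best_cv_err:
        best_k, best_cv_err = k, cv_err

# best_cv_err is optimistic (we selected the minimum); the single evaluation on the
# untouched test set is the honest estimate to report.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_err = 1 - final.score(X_te, y_te)
print(best_k, best_cv_err, test_err)
```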

    • @colinmanko7002
      @colinmanko7002 4 years ago

      Kilian Weinberger thank you for your reply! And thanks again for your videos. I’ve really enjoyed them over the last week. I hope you have a good day

  • @tanaytalukdar4875
    @tanaytalukdar4875 3 years ago +1

    Hi Professor, it's a great lecture series. Thank you for sharing it with us. My question is: if the Bayes classifier has zero variance and zero bias error, then why don't we always get the best result with Bayes?

    • @rossroessler5159
      @rossroessler5159 3 months ago

      I had the same question too. I think he's referring to the "Bayes Optimal Classifier" which is defined in terms of the true underlying data distribution. So it actually doesn't have variance, because it doesn't depend on the dataset - it uses the true underlying distribution.
      But no real-world classifier can exactly replicate this, because we don't know the real distribution. So for something like Naive Bayes, you are depending on the dataset (as that's how you calculate the data and feature probabilities) - and therefore you would have variance.
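
For reference, the squared-loss bias-variance decomposition (written roughly as in the course notes, with h_D the model trained on dataset D, \bar h the expected model, and \bar y(x) = E[y|x]) makes this precise:

```latex
\mathbb{E}_{x,y,D}\big[(h_D(x)-y)^2\big]
= \underbrace{\mathbb{E}_{x,D}\big[(h_D(x)-\bar h(x))^2\big]}_{\text{variance}}
+ \underbrace{\mathbb{E}_{x}\big[(\bar h(x)-\bar y(x))^2\big]}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_{x,y}\big[(\bar y(x)-y)^2\big]}_{\text{noise}},
\qquad \bar h(x)=\mathbb{E}_D[h_D(x)],\quad \bar y(x)=\mathbb{E}[y\mid x].
```

The Bayes optimal predictor is \bar y(x) itself, so its variance and bias terms vanish and only the noise remains; but computing \bar y(x) requires the true P(X, Y), which a practical method like Naive Bayes has to estimate from data, and that estimation is where its variance comes from.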

  • @deltasun
    @deltasun 4 years ago

    Great, great lecture!
    I have one doubt: how does the dataset size influence bias? After all, bias is defined via the classifier averaged over datasets of fixed size n, right? If this n influences the variance, shouldn't it influence the bias as well?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +1

      Actually, n does not affect bias. Bias is the error that the expected classifier would still make. However, as n grows large, the variance becomes very small and the remaining error will be dominated by bias. So if you have a very large n and still high error, it is usually good to fight bias (i.e. get a more powerful classifier).

    • @deltasun
      @deltasun 4 years ago

      @@kilianweinberger698 But if I have little data, the possibility of being highly biased should be lower, right? (It is easier to fit a small dataset than a large one.)
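
A small simulation of the point in the reply above (my own sketch, not from the lecture): fitting a straight line to a quadratic target, the bias of the average fit stays roughly constant as n grows, while the variance of the predictions shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                       # true (quadratic) target
x_test = 0.5                               # query point, f(x_test) = 0.25

for n in [10, 100, 1000]:
    preds = []
    for _ in range(500):                   # 500 independent datasets of size n
        x = rng.uniform(-1, 1, n)
        y = f(x) + 0.1 * rng.normal(size=n)
        w = np.polyfit(x, y, deg=1)        # linear (degree-1) least-squares fit
        preds.append(np.polyval(w, x_test))
    preds = np.array(preds)
    bias = preds.mean() - f(x_test)        # error of the average model: stays near +0.08
    variance = preds.var()                 # shrinks roughly like 1/n
    print(f"n={n:5d}  bias={bias:+.3f}  variance={variance:.5f}")
```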

  • @yamacgulfidanalumni6286
    @yamacgulfidanalumni6286 4 years ago +5

    You should have 800k subs, not 8k

  • @kc1299
    @kc1299 4 years ago +2

    "kernel = linear classifier on steroid" - Killian lol