Regularization Part 1: Ridge (L2) Regression

  • Published: 29 Sep 2024

Comments • 1.5K

  • @statquest
    @statquest  4 years ago +144

    Correction:
    13:39 I meant to put "Negative Log-Likelihood" instead of "Likelihood".
    A lot of people ask about 15:34 and how we are supposed to do Cross Validation with only one data point. At this point I was just trying to keep the example simple and if, in practice, you don't have enough data for cross validation then you can't fit a line with ridge regression. However, much more common is that you might have 500 variables but only 400 observations - in this case you have enough data for cross validation and can fit a line with Ridge Regression, but since there are more variables than observations, you can't do ordinary least squares.
    ALSO, a lot of people ask why lambda can't be negative. Remember, the goal of lambda is not to give us the optimal fit, but to prevent overfitting. If a positive value for lambda does not improve the situation, then the optimal value for lambda (discovered via cross validation) will be 0, and the line will fit no worse than the Ordinary Least Squares Line.
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @statquest
      @statquest  4 years ago +2

      @VINAY MALLU To repeat what I wrote in the comment you replied to: A lot of people ask why lambda can't be negative. Remember, the goal of lambda is not to give us the optimal fit, but to prevent overfitting. If a positive value for lambda does not improve the situation, then the optimal value for lambda (discovered via cross validation) will be 0, and the line will fit no worse than the Ordinary Least Squares Line.

    • @statquest
      @statquest  4 years ago +5

      @VINAY MALLU The larger the dataset, the less likely you are to overfit the data. So in some sense, regularization becomes less important. However, Lasso (L1) regularization is still helpful for removing extra variables regardless of the size of the dataset. And even with very large datasets, ML algorithms that depend on weak learners benefit from regularization.

    • @sfz82
      @sfz82 3 years ago

      @@statquest Coming back to Vinay's question: In the counterexample he gives, a negative lambda would not achieve a better fit to the training data, but would prevent overfitting (in that case, overfitting to a too-shallow slope). I really liked the video and found most of it very intuitive, but the fact that ridge regression favours a more shallow slope is not. With a large set of predictors, it's easy to see that enforcing sparsity may provide better out-of-sample predictions in practice. But with a single predictor, the prior assumption that 'the observed data tend to overestimate the influence of the predictor' seems no more justified than its opposite would be. In other words: under OLS assumptions the distribution of OLS fitted slopes will be symmetrically centered on the 'true' slope. But the example was really helpful to understand that ridge regression doesn't work that way and instead biases the fit towards the intercept.

    • @cosworthpower5147
      @cosworthpower5147 3 years ago +1

      Is there an intuitive explanation for why the intercept beta 0 is not included in the regularization process?

    • @statquest
      @statquest  2 years ago +1

      @@cosworthpower5147 The goal is to reduce the model's sensitivity to the input variables. The y-axis intercept is not attached to any of the variables, so there's no reason to shrink it. Instead, as the other parameters go to 0, the intercept goes to the mean y-axis value.
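
A minimal sketch of the "500 variables but only 400 observations" scenario from the pinned correction above (assuming scikit-learn; the synthetic data and the lambda grid are made up for illustration). Ordinary least squares has no unique solution here, but ridge regression still fits, with lambda (called alpha in scikit-learn) chosen by cross validation:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 500))   # 400 observations, 500 variables
true_coefs = np.zeros(500)
true_coefs[:10] = 1.0             # only 10 variables actually matter
y = X @ true_coefs + rng.normal(scale=0.5, size=400)

# Cross validation picks the best lambda from a grid of candidates.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=10).fit(X, y)
print("lambda chosen by cross validation:", model.alpha_)
```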

  • @ardakosar3826
    @ardakosar3826 3 years ago +91

    Explaining things of this complexity at this level of simplicity is a real skill! Awesome channel!

  • @ryzary
    @ryzary 4 years ago +231

    After watching dozens of StatQuest videos, I finally know when to say 'BAM!'

    • @statquest
      @statquest  4 years ago +26

      Bam! :)

    • @akshaydeshmukh4916
      @akshaydeshmukh4916 3 years ago

      🤣🤣🤣🤣🤣🤣🤣🤣

    • @nailashah6918
      @nailashah6918 3 years ago +3

      plz tell me when to say BAM
      I'm still unable to understand 😭

    • @loayzag91
      @loayzag91 3 years ago +9

      I had to build a ML model to help me predict the proper times to say ‘BAM!’

    • @ashishshrivastava8864
      @ashishshrivastava8864 3 years ago

      BAM!

  • @jeremyandersen9456
    @jeremyandersen9456 4 years ago +2

    8:11 would seem to be esoteric knowledge but it does help a lot in contextualizing things

  • @josephpark3949
    @josephpark3949 10 months ago +1

    Nice Holy Grail reference

  • @jacksonchow3359
    @jacksonchow3359 4 years ago

    Tony Stark: we don't really need to start a conversation.
    Me: you don't really need to sing a song to explain regularization

    • @statquest
      @statquest  4 years ago +1

      It’s true, but I just can’t help it.

    • @jacksonchow3359
      @jacksonchow3359 4 years ago +1

      @@statquest that's just a side joke~ by the way, I liked your video. Thanks for your effort.

    • @statquest
      @statquest  4 years ago

      @@jacksonchow3359 Thanks! :)

  • @santoshbala9690
    @santoshbala9690 3 years ago

    Hi Josh, thanks for the video. Towards the end, you mentioned that Ridge uses cross validation to find a solution when there are more parameters (variables) than observations. How is that done? Is there a video on that already? Please help me understand.

    • @statquest
      @statquest  3 years ago

      Unfortunately I don't have a video on how that works yet.

    • @19Lxndr97
      @19Lxndr97 3 years ago

      @@statquest Can you share the references on that? Please

    • @statquest
      @statquest  3 years ago +1

      @@19Lxndr97 Here's one thing stats.stackexchange.com/questions/223486/modelling-with-more-variables-than-data-points and the concepts that apply to Elastic Net, also apply here: web.stanford.edu/~hastie/Papers/elasticnet.pdf

  • @Sickkkkiddddd
    @Sickkkkiddddd 1 year ago

    Are you saying introducing bias is particularly helpful when we do not have an adequate sample size to train on?

  • @Charles_Reid
    @Charles_Reid 5 months ago

    What if the model is overtrained, but the slope must be brought further away from zero in order to correct for the overtraining? Is there a way to do this with ridge regression? It seems like ridge regression can only bring the slope closer to zero for a linear model. Also, why can't lambda be negative? Thanks

    • @statquest
      @statquest  5 months ago +1

      The idea is to make the dependent variable less sensitive to changes in the independent variable. Increasing the slope would only increase the sensitivity to these changes, and thus, result in a model that is more overfit.

  • @scubashar
    @scubashar 3 years ago +513

    I am a machine learning engineer at a large, global tech company with a decade of experience in industry and a computer science graduate student. Your channel has helped me immensely in learning new concepts for work and job interviews, and your videos are so enjoyable to watch. They make learning feel effortless! Thank you so much!!

    • @statquest
      @statquest  3 years ago +21

      Wow! Thank you very much! :)

    • @VainCape
      @VainCape 3 years ago +12

      can you give me a job plz?

    • @LucasPossatti
      @LucasPossatti 2 years ago +11

      @Son Of Rabat , some people (like me) might have skipped the "simple stuff" to jump right into the complex stuff because it gives better results. For example, I was introduced to ML by working with image classification and object detection right away, where deep learning is king. I studied backpropagation, gradient descent, etc, but never heard of Ridge Regression, for example, until recently. Now I'm trying to collect the pieces I left behind.
      (I also always sucked with the theoretical parts. As long as the evaluation metrics were good, it was fine... And it kind of worked for me, for some time. I'm now trying to change that, and deepen my theoretical knowledge.)

    • @LucasPossatti
      @LucasPossatti 2 years ago +5

      Today, I also work for a global tech company (as a Data Scientist). Not for a decade though. 😅

    • @joshsherfey
      @joshsherfey 2 years ago +3

      @@LucasPossatti same for me. I work as a DS at a large tech company, but still learn a lot from SQ

  • @lucaspenna6009
    @lucaspenna6009 4 years ago +66

    Professors in general teach Ridge Regression with many complicated equations and notations. You made this topic very clear and easy to understand. Thank u very much again.

  • @PolitePolice563
    @PolitePolice563 7 months ago +17

    This channel is by far the best at explaining mathematical concepts related to machine learning. I'm in a machine learning class at my university and go to every class lecture. I leave not having understood an hour and fifteen minutes of lecture. I immediately pull up this channel and watch a video on the same concept and "BAM". It makes sense.

  • @JT2751257
    @JT2751257 4 years ago +21

    Josh, I have been practicing data science for the last 4 years and have used Ridge regression as well. But now I am feeling embarrassed after watching this explanation because before the video I only had half-baked knowledge. You deserve a lot of accolades my friend :)

    • @statquest
      @statquest  4 years ago +1

      Awesome! I'm glad the videos are helpful. :)

  • @vspecky6681
    @vspecky6681 4 years ago +32

    I was listening with extreme focus and you suddenly threw "Airspeed of Swallow" at me. I died XDDDDDDDDDDDD

    • @statquest
      @statquest  4 years ago

      Awesome! :)

    • @oldcowbb
      @oldcowbb 3 years ago +1

      What do you mean, African or European swallow?

  • @petax004
    @petax004 5 years ago +27

    You just spoon-fed my brain with your clear explanation, thanks man!

  • @nathanx.675
    @nathanx.675 4 years ago +9

    Who's watching this the day before their machine learning finals?

  • @juhipathak8433
    @juhipathak8433 5 years ago +150

    Your channel is a god send!

  • @TheGoldenFluzzleBuff
    @TheGoldenFluzzleBuff 5 years ago +32

    I have a big data economics exam tomorrow and you literally just saved my life. I don't always understand what my professor is trying to explain, but you did it super clearly. Actual life saver

  • @Nicole-se7zj
    @Nicole-se7zj 2 years ago +10

    I've spent so much time trying to read and understand what EXACTLY is ridge regression. This video made it much easier to understand. Thank you so much for simplifying this complex concept!

  • @SpL-mu5zu
    @SpL-mu5zu 4 years ago +19

    YOU ARE THOUSANDS OF TIMES BETTER THAN MY PROF...CLEAR & SIMPLE. THANKSSSSS

  • @andersonarroyo7238
    @andersonarroyo7238 4 years ago +9

    This is my first video and I am so impressed by how you explain things!!! It is like my buddy from college explaining it to me in plain words. You rock StatQuest, I am a follower from now on!! Thank you

    • @statquest
      @statquest  4 years ago +1

      Awesome! Thank you!

  • @EvaPev
    @EvaPev 9 months ago +9

    I have no words to express how good this lecture is.

  • @NaggieNag
    @NaggieNag 4 years ago +7

    I don't know how my stat teacher can make something this easy to understand that complicated. Every time I can't understand what he's talking about in class, I know that I have to turn to StatQuest. Thank you for what you're doing.

  • @iefe65
    @iefe65 5 years ago +6

    Small question: Does ridge regression only decrease sensitivity? What if, instead of this example, our test set was above the red line? Wouldn't we then need to increase sensitivity?

    • @vishaltyagi2983
      @vishaltyagi2983 8 months ago

      This will be taken care of... if you are taking a random sample ... don't worry

    • @Niglnws
      @Niglnws 1 month ago

      Did you understand why?

    • @Niglnws
      @Niglnws 1 month ago

      @@vishaltyagi2983 Can you explain more? I have been trying for an hour to prove it myself and got as far as the random sample having less variance, but that doesn't seem to matter. Then I found your reply.

  • @kslm2687
    @kslm2687 6 years ago +8

    Thank you for this video, it's so helpful! I can't believe it has only 500 views. Please consider a Patreon account so that people can thank you for your work!

    • @statquest
      @statquest  6 years ago +4

      Thank you! I'll look into the Patreon account. In the meantime you can support my channel through my bandcamp site - even if you don't like the songs, you can buy an album and that will support me. joshuastarmer.bandcamp.com/

  • @anamfatima5489
    @anamfatima5489 4 years ago +17

    I came to know about this channel 2 hours ago. Simple and Outstanding explanation. My aim is to watch each and every video.
    Loving your style of teaching.
    From India.

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @charissapoh1159
    @charissapoh1159 3 years ago +7

    your explanations are insane... they're so easy to understand and literally capture the essence of the topic without being overly complicated! i've bingewatched so many of your videos ever since chancing upon your channel last night - i specially love the little jingles you add in at the start of your videos, they really add such a fun and personal touch~ thank you so so soo much, your channel has really helped me immensely!!!

  • @tusharpatil96
    @tusharpatil96 4 years ago +4

    Probably the most sensible explanation available on youtube..and yes...BAM!! ;)

  • @anubhavsoni7620
    @anubhavsoni7620 5 months ago +2

    Thanks for this, your videos always help me 🙏❤

  • @tymothylim6550
    @tymothylim6550 3 years ago +6

    Thank you, Josh, for another fun StatQuest! I really enjoyed learning the use and benefits of Ridge Regression!

  • @republic2033
    @republic2033 4 years ago +6

    You have that ability to explain difficult topics in a very simple way, this is amazing! Thank you so much

  • @DragomirJtac
    @DragomirJtac 6 years ago +4

    Incredibly clear explanation. I'm using your Machine Learning videos to study for my midterm for sure. It's so nice to know that these concepts aren't above my head after all.

    • @statquest
      @statquest  6 years ago +2

      Nice!! Good luck on your mid terms!

  • @monicakulkarni3319
    @monicakulkarni3319 5 years ago +7

    I really appreciate your videos! Keep up the good work.

  • @akshitmiglani5419
    @akshitmiglani5419 3 years ago +2

    That's a great explanation. However, I have a doubt about this.
    When you showed the example @6:25, you reduced the slope from 1.3 to 0.8 and then mentioned "Setting lambda = 1 resulted in a smaller slope". But we already reduced the slope to 0.8 ourselves; it's not the lambda that did it, lambda is just the penalty multiplier in the cost function. So are we tuning only lambda, or lambda and the slope? How does this work?
    Thank you!

    • @statquest
      @statquest  3 years ago +3

      In practice you set lambda, and the optimal (usually smaller) slope, given the additional penalty, is solved for you by the ridge regression algorithm.
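
For a single input variable, the "solved for you" step in the reply above even has a closed form: minimizing the sum of squared residuals plus lambda * slope^2 (with the intercept left unpenalized) gives slope = sum((x - x̄)(y - ȳ)) / (sum((x - x̄)²) + lambda). A minimal sketch with made-up numbers:

```python
import numpy as np

def ridge_fit_1d(x, y, lam):
    # Slope that minimizes: sum of squared residuals + lam * slope**2
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc + lam)   # lambda only inflates the denominator
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.3, 2.8, 4.1])
print(ridge_fit_1d(x, y, lam=0.0))  # lambda = 0: ordinary least squares
print(ridge_fit_1d(x, y, lam=1.0))  # lambda = 1: a smaller slope
```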

  • @seetarajpara7626
    @seetarajpara7626 3 years ago +4

    This is incredibly helpful!! I will be watching many of your videos to supplement my stats/data science studies :) Thank you!

    • @statquest
      @statquest  3 years ago

      Glad it was helpful!

  • @programminginterviewprep1808
    @programminginterviewprep1808 5 years ago +21

    These videos are awesome!
    Somehow, listening to the video, I feel it comes from/for someone with a background in stats, rather than a typical computer science machine learning video.

    • @statquest
      @statquest  5 years ago +9

      Interesting. My background is both computer science and statistics - but I did biostatistics for years before I did machine learning, so that might explain it.

  • @pritamfeb13
    @pritamfeb13 5 months ago +2

    Only StatQuest can make someone emotional while learning statistics. The ease with which the concepts flow flawlessly into my brain makes me teary. Thank you so much 🥺❣

  • @the40yearpuzzle
    @the40yearpuzzle 1 year ago +4

    I am brand-new to statistics, and I'm in school to be a data scientist. so many times, I lose the plot watching lectures from my professors who have the Curse of Knowledge. I end up spending hours watching your videos and they help so much, I just don't even have words! I've recommended your channel to all my classmates--and I mentioned it so much, my professor is considering adding your channel to recommended materials for next semester! you are a shining light of joy in a jargon-filled sea of confusion.

    • @statquest
      @statquest  1 year ago +1

      Thank you so much and good luck with your coursework! :)

    • @Dreadheadezz
      @Dreadheadezz 1 year ago +1

      I study data science too at a uni and his videos are helping me stay afloat in my statistical learning course. Not all heroes wear capes and he's truly one of them!

    • @shivanit148
      @shivanit148 1 year ago

      @Linda Wallberg @Josh Sherfey @Lucas Possatti I don't see why we even use lambda, it doesn't seem to change anything 🤔. I'd understand if it were a value between 0 and 1, but not any value >= 0. Can someone please explain? Multiplying slope² by lambda (a scalar) should only scale it, right? We basically just take some smaller arbitrary slope (introduce bias) and that's all.

    • @statquest
      @statquest  1 year ago

      @@shivanit148 No, we don't take an arbitrary smaller slope. We find the one slope that minimizes the SSR + penalty

  • @existentialrap521
    @existentialrap521 1 year ago +1

    My Crips lurkin', don't die tonight
    I just want to dance wit' you, baby
    Just don't move too fast, I'm too crazy
    Man down, down the ave and get shaded
    WE OUT HERE LEARNIN RIDGE BOIIS AY AY AYAYAYA YAY
    thanks Josh

  • @meichendong3434
    @meichendong3434 5 years ago +3

    I love your videos. They make it so easy to follow and understand complicated concepts and procedures! Thanks for sharing all of the brilliant ideas!

    • @statquest
      @statquest  5 years ago

      Awesome! Thank you! :)

  • @furo.v
    @furo.v 6 months ago +1

    I think the problem of having two examples and trying to fit the line so that it fits other imaginary examples that aren't present, and saying this is the unsolvable problem that Ridge solves, is a bit misleading. If you only have two examples, you have to find more data; the solution to that isn't to use Ridge.

    • @statquest
      @statquest  6 months ago

      Noted. However, the point is just to provide some intuition for how to deal with the very real problem that happens when you have a model with 1,000s of parameters (see 17:24).

  • @akashdesarda5787
    @akashdesarda5787 5 years ago +6

    Quadruple bam!!!! For your explanation

    • @statquest
      @statquest  5 years ago

      Hooray! I'm glad you like it! :)

  • @akshaygupta8837
    @akshaygupta8837 4 years ago +3

    Great video. One question though: does ridge regression always reduce the slope? What if the least squares line had a low slope from the beginning and a good fit would be one with a higher slope? Will the regularisation increase the slope?

    • @statquest
      @statquest  4 years ago

      In this case, usually the optimal value for lambda will be 0, so Ridge Regression will do nothing.

    • @안용수-o4y
      @안용수-o4y 4 years ago +2

      This was my question as well. Thanks for asking, and also the reply!

    • @akshaygupta8837
      @akshaygupta8837 4 years ago +2

      @@statquest got it. Thanks

  • @Tntpker
    @Tntpker 6 years ago +4

    How would you do cross validation for the example @ 10:16 to determine lambda? For example, would you then take 10 random samples of 2 (out of 8) data points and try different lambdas (for example lambda 1-20) for each _individual_ sample? And then determine which value of lambda across all those 10 samples gives the lowest variance?

    • @statquest
      @statquest  6 years ago +1

      That's the idea. In practice, there are usually many more samples, so you're not just picking 2 samples at a time, but that's the idea.

    • @Tntpker
      @Tntpker 6 years ago

      @@statquest Thanks!

    • @dadipsaus332
      @dadipsaus332 5 years ago

      How do you calculate that variance then?
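
A sketch of the cross validation loop described in this thread (assuming scikit-learn; the data and the lambda grid are made up). For each candidate lambda, fit on the training folds, score on the held-out fold, and keep the lambda with the lowest average held-out error:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(scale=1.0, size=40)

best_lam, best_err = None, np.inf
for lam in [0.0, 0.1, 1.0, 10.0, 20.0]:     # lambda = 0 is plain least squares
    errs = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=lam).fit(X[train], y[train])
        errs.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
    if np.mean(errs) < best_err:            # keep the lambda with lowest error
        best_lam, best_err = lam, np.mean(errs)
print("best lambda:", best_lam)
```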

  • @monazaizan947
    @monazaizan947 11 months ago +3

    You made learning this complicated topic (for me) a lot more fun than from reading from a textbook or from my own lecturer. Very entertaining too... Well done!

    • @statquest
      @statquest  11 months ago

      Glad it was helpful!

  • @zeerot
    @zeerot 6 years ago +6

    Josh, you're a true hero with your explanations. Thanks a bunch!
    I have one question though. In the video (in the graph at 19:20 for example) you show that a ridge regression would fit real world data better, as it shrinks the beta (the graph shows that in the real world this beta is also smaller, due to most green points (=real world data) being positioned below the red line (=training data)).
    However, would ridge regression still be better if, for example, most of the green dots were above the red line? Because with ridge regression we would shrink the beta, while the real-world beta in reality has an even higher slope than the slope of the red line (thus in this case ridge would lead to an increase in both variance and bias for real-world data?)

    • @statquest
      @statquest  6 years ago +3

      This is a great question - the key is that when lambda = 0, then you get the exact same result as least squares - so Ridge Regression cannot do worse than Least Squares, it can only do better. In the case you mention, sure, if all of the green dots are above the red dots, neither Least Squares nor Ridge Regression will do well - but Ridge Regression will do no worse than Least Squares. (A numeric check of this appears at the end of this thread.)

    • @CyberSinke
      @CyberSinke 3 years ago +1

      Thank you for posting this question. One thousand comments on this video, all well deserved praise as this video and the whole channel are awesome. Yet only you asked this obvious question. Makes me wonder how many people actually bothered to understand the whole point of Ridge Regression.

    • @Niglnws
      @Niglnws 1 month ago

      @@CyberSinke Exactly what shocked me too. I have been trying for an hour to understand it by assuming the sample variance underestimates the population's, but that doesn't matter; it is just a sample that was picked randomly.

    • @Niglnws
      @Niglnws 1 month ago

      @@statquest Why will it not do worse? It will make the slope flatter, which is further from the real relationship, which is steeper.

    • @statquest
      @statquest  1 month ago

      @@Niglnws It will do no worse because we will compare it to the simple least squares fit. If it performs worse, we won't use it.
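
A quick numeric check of the point made in this thread (illustrative, numpy only; for brevity this version penalizes all coefficients and omits the intercept): with lambda = 0 the ridge solution is exactly the least squares solution, so a lambda grid that includes 0 can never leave you worse off than least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

def ridge_coefs(X, y, lam):
    # Closed form: (X^T X + lambda * I)^(-1) X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge_coefs(X, y, 0.0), ols))  # True: lambda = 0 is OLS
```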

  • @elliotyip9844
    @elliotyip9844 1 year ago +2

    The way you go through the logic step by step makes you a good teacher. In many of my research occasions they just say "adjust your alpha higher or lower until you don't overfit / underfit" but I don't even know what I am looking at. Bless you.

  • @lazypunk794
    @lazypunk794 5 years ago +3

    So from what I understand, ridge regression keeps the slope from getting big, right? This affects bias but reduces variance a lot, so overall it's better.
    But what if my true model has a slope that is actually bigger (steeper) than what I got using my training data? In that case wouldn't you be making the model worse by using regularization?
    In other words, why are we "desensitizing" when we don't know what the underlying model is? What if the sensitivity in the actual model is higher?

    • @sidsr
      @sidsr 5 years ago

      I have this exact same doubt! I guess we use trial and error and see whether the model improves; if it doesn't, the only way is to either use a more complex function or get more training data.

    • @lazypunk794
      @lazypunk794 5 years ago

      @@sidsr Oh okay.. but still, regularization works pretty much every time, right?

    • @meinizizheng9867
      @meinizizheng9867 5 years ago

      I think once you test all possible values of lambda, the one that gives you the smallest test error will be the best one. So if the true model is steeper (and assuming the test error gives you an approximation of the true error), lambda will reduce to zero.

    • @-long-
      @-long- 5 years ago

      by trial and error, your model will get the best performance when lambda=0, which means "no regularizer used".

  • @aliciachen9750
    @aliciachen9750 5 years ago +4

    wow. seriously better explained than lectures from my professor in the data science department

  • @hrushikeshkulkarni7353
    @hrushikeshkulkarni7353 1 year ago +3

    The lecture was at a whole different level.....thank you for such amazing content dear Josh

  • @fisicaparalavida108
    @fisicaparalavida108 4 months ago +1

    Those graphs are excellent. How much work must have gone into them! Thank you so much!

    • @statquest
      @statquest  4 months ago

      Thanks! Lots of work goes into these.

  • @jobandeepsingh1929
    @jobandeepsingh1929 5 years ago +4

    your channel deserves more recognition, Keep up the good work

  • @aaryan9058
    @aaryan9058 1 month ago +1

    Hey Josh! Thanks for the video. I have a small doubt.
    At 17:44, you mention we could use measurements from 10,000 genes to predict size. And that would mean getting gene expression from 10,001 mice. But is that really the case?
    Suppose we are using only 3 genes in our equation. So, we are looking for 4 parameters here. We could just find the gene expression values from 2 mice, giving a total of 6 data points (3 gene expression values from each mouse). And shouldn't that be enough to find the parameters?
    Please let me know if I said anything incorrect here :)

    • @statquest
      @statquest  1 month ago +1

      No, that doesn't work. Each mouse only contributes a single observation (its size, together with its gene expression values), so 3 gene expression values from 2 mice give 2 data points, not 6. With 4 parameters, we need at least 5 mice. (See the numeric illustration after this thread.)

    • @aaryan9058
      @aaryan9058 1 month ago +1

      @@statquest Got it! Thanks a lot! Super BAM
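
A numeric illustration of the exchange above (made-up data, numpy only): each mouse contributes one row, so with an intercept plus 3 genes (4 parameters) and only 2 mice, the least squares normal equations are singular, while 5 mice give a full-rank, solvable system.

```python
import numpy as np

rng = np.random.default_rng(4)
X2 = np.column_stack([np.ones(2), rng.normal(size=(2, 3))])  # 2 mice, 4 params
X5 = np.column_stack([np.ones(5), rng.normal(size=(5, 3))])  # 5 mice, 4 params

print(np.linalg.matrix_rank(X2.T @ X2))  # 2 < 4: no unique OLS solution
print(np.linalg.matrix_rank(X5.T @ X5))  # 4: full rank, OLS is solvable
```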

  • @spencerprice1676
    @spencerprice1676 5 years ago +4

    Thank you so much. You made this so much easier to understand than my professor. Really appreciate it

    • @statquest
      @statquest  5 years ago

      You're welcome! I'm glad to hear that the video was helpful. :)

  • @fjumi3652
    @fjumi3652 2 years ago +1

    this is actually quite simple (conceptually at least), why did my professor make it so complicated?!

  • @kadhirn4792
    @kadhirn4792 4 years ago +15

    Love from India. Wish me good luck interview in less than days.

    • @statquest
      @statquest  4 years ago +5

      Thank you and good luck with your interviews. Let me know how they go. :)

    • @Whoasked777
      @Whoasked777 3 years ago

      @@statquest narrator: they never did let StatQuest know...

    • @statquest
      @statquest  3 years ago

      @@Whoasked777 Totally! I hope they went well.

  • @ArinzeDavid
    @ArinzeDavid 2 years ago +2

    I study financial Technology at Imperial College Business School; I must say your content made the "Big Data in Finance" module damn easier to understand

    • @statquest
      @statquest  2 years ago

      Hooray! I'm glad my videos are helpful! :)

  • @fmetaller
    @fmetaller 5 years ago +3

    Great explanation as always. There is something that's not convincing me about this type of regression. Ridge regression assumes that the training data are always overestimating the slope. Isn't it possible that the training data are underestimating the slope instead?

    • @statquest
      @statquest  5 years ago +2

      If the training data underestimate the slope, then shrinking it will not improve the fit during cross validation. In this case the best value for lambda will be zero. So ridge regression can’t make things worse. Does this make sense?

    • @fmetaller
      @fmetaller 5 years ago

      @@statquest yes it's clear. Thank you for your explanation.

    • @akhilmahajan1417
      @akhilmahajan1417 5 years ago +3

      I also had the same question. Thankfully, I found your comment!

  • @jushkunjuret4386
    @jushkunjuret4386 5 years ago +1

    What if the data we are observing actually gives us a smaller slope than it should? Then having a Ridge penalty term would make the model worse.

    • @statquest
      @statquest  5 years ago

      If your sample is a poor representation of the true population, usually because the sample size is too small, then there's a good chance the parameter estimates, regardless of the results of ridge regression, will not be significantly different from 0.

  • @hzyTMU
    @hzyTMU 5 years ago +4

    How do you prove that the slope goes to 0 as lambda increases, at 9:42?

    • @badoiuecristian
      @badoiuecristian 4 years ago

      I have the same question

    • @chandankumar-jo7rf
      @chandankumar-jo7rf 4 years ago +1

      When lambda tends to infinity, the SSE will be negligible compared to lambda * slope^2, hence the slope has to go to 0.
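
A numeric illustration of chandankumar's argument, using the one-variable closed form slope = sum(x̃ỹ) / (sum(x̃²) + lambda) with an unpenalized intercept (made-up data): as lambda grows, the penalty dominates and the fitted slope is forced toward 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 3.2, 3.9])
xc, yc = x - x.mean(), y - y.mean()   # centered data

for lam in [0.0, 1.0, 10.0, 100.0, 1e6]:
    slope = (xc @ yc) / (xc @ xc + lam)   # shrinks as lambda grows
    print(f"lambda = {lam:>9}: slope = {slope:.6f}")
```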

  • @麦泳琳-r5l
    @麦泳琳-r5l 4 years ago +3

    Hi Josh, your videos are amazing and I love it.
    You mentioned at the end of the video that we can use ridge regression and cross validation to fit some data that least squares cannot.
    But how can we fit a sample with only one data point when we are not able to use cross validation here? (since there is only one data point)

    • @statquest
      @statquest  4 years ago

      I was just trying to keep the example simple, but you are correct: If you don't have enough data for cross validation then you can't fit a line with ridge regression. Instead, imagine you have 500 variables but only 400 observations - in this case you have enough data for cross validation and can fit a line with Ridge Regression, but since there are more variables than observations, you can't do ordinary least squares.

    • @麦泳琳-r5l
      @麦泳琳-r5l 4 years ago +1

      StatQuest with Josh Starmer Thank you!

  • @PedroRibeiro-zs5go
    @PedroRibeiro-zs5go 4 years ago +3

    Thanks Josh! You’re absolutely the best 💪🏻

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @hansenmarc
    @hansenmarc 2 years ago +1

    StatQuest or: how I learned to stop worrying and love the Greek characters.
    Yee-haw! 🤠

  • @aliozcankures7864
    @aliozcankures7864 2 years ago +3

    absolutely amazing, thank you sir!

  • @dfsgjlgsdklgjnmsidrg
    @dfsgjlgsdklgjnmsidrg 6 months ago +1

    You said at 3:40 the line is overfitting, but the definition of overfitting as low bias and high variance doesn't hold here. Two data points in the training data is, in this case, underfitting.

    • @statquest
      @statquest  6 months ago

      We are overfitting the specific data that we have trained our model on. This results in high variance.

  • @kaimueric9390
    @kaimueric9390 4 years ago +3

    BAM! The concepts are presented in the clearest way ever.

  • @AanyaSS
    @AanyaSS 7 months ago +1

    How are you so amazing, how is it possible....a thousand thanks!!

    • @statquest
      @statquest  7 months ago

      Wow, thank you!

  • @1pompeya170
    @1pompeya170 4 years ago +6

    You are my sunshine, my only sunshine, you make me happy when f**king math puzzled me!

  • @shangxiang7160
    @shangxiang7160 6 years ago +3

    Hi, thank you for the nice video, but I have one concern: your example works well when the slope on the training data (red data) is bigger than on the original data (green data). What if the slope is smaller than the original one? Will Ridge regression make things even worse?

    • @statquest
      @statquest  6 years ago +1

      This is a good question. At 8:49, I talk about how different values for lambda influence the Ridge Regression Line and I show that when lambda = 0, then there is no difference between Ridge Regression and regular old Linear Regression (aka "Least Squares"). I then say that in order to find the best value for lambda, we try a bunch of different values and use 10-Fold Cross Validation to decide which one is best. So, when considering different values for lambda, make sure you include 0 as a possible value. This ensures that your Ridge Regression Line will never be worse than linear regression/regular old least squares.

    • @shangxiang7160
      @shangxiang7160 6 years ago +1

      @@statquest So, lambda will always be greater than 0?

    • @statquest
      @statquest  6 years ago

      @@shangxiang7160 Lambda can be any value greater than or equal to 0. So lambda can be 0.

  • @macilguiddir3680
    @macilguiddir3680 6 years ago +9

    Josh, even though I have just started Machine Learning and Data Science in my French Engineering "Grande Ecole", watching your videos just replaced most of the teachers I had met in my life. Great BAM my friend and thank you, just keep it up! You got a rare gift

    • @statquest
      @statquest  6 years ago +1

      Thank you so much! I'm so happy to hear that my videos are helpful! :)

    • @macilguiddir3680
      @macilguiddir3680 6 years ago +1

      StatQuest with Josh Starmer Even French people rely on you and are looking forward to studying your next videos ;)

    • @statquest
      @statquest  6 years ago

      Hooray!

    • @luisakrawczyk8319
      @luisakrawczyk8319 5 years ago

      lol you must have really bad profs then, which school is it?

  • @theredviper24
    @theredviper24 1 year ago +1

    I'm convinced this is Phil from Modern family teaching us statistics.

  • @longkhuong8382
    @longkhuong8382 6 years ago +3

    Mega BAM!!!! Thank you
    I can't wait to learn the next lesson

    • @statquest
      @statquest  6 years ago +1

      Hooray!!!! :) The next one, on Lasso Regression, should come out in the next week or so.

    • @longkhuong8382
      @longkhuong8382 6 years ago +1

      Yeah!, It's great. Thank you

  • @vusalaalakbarova7378
    @vusalaalakbarova7378 2 years ago +1

    Thanks for the great explanation, but at 5:55 you magically jumped from a 0.4 intercept to 0.9 and from a 1.3 slope to 0.8, and didn't explain how you got these values. I'm a total beginner so I couldn't understand this point.

    • @statquest
      @statquest  2 years ago +2

      I just picked a different line so that we could compare the total scores of the two lines.

    • @vusalaalakbarova7378
      @vusalaalakbarova7378 2 years ago +1

      @@statquest Okay thank you :)

  • @tommcnally3231
    @tommcnally3231 4 years ago +5

    My lecturer explained this by just putting the equation in front of us on the slides. The maths is easy but I didn't understand the point or intuition behind adding a penalty. Now I do. Thank you.

    • @statquest
      @statquest  4 years ago

      I'm glad the video was helpful. :)

  • @SomeOfOthers
    @SomeOfOthers 5 years ago +2

    I've taken 4 machine learning courses and always wondered what ridge regression was, because I've heard it several times, but I was never taught it. I never realized it was just adding the regularization parameter! Awesome! Thank you so much.

    • @statquest
      @statquest  5 years ago +1

      Hooray! I'm glad the video helped clear up a long standing mystery. As you've noticed, a lot of machine learning is about giving old things new names - which makes it a lot easier to understand than we might think at first.

  • @shashankupadhyay821
    @shashankupadhyay821 4 years ago +3

    This is so cool, it's almost like magic.

  • @LetWorkTogether
    @LetWorkTogether 4 years ago +2

    So awesome!!! Many complicated things are simply put. You're great! :D Thank you.

  • @gramble10
    @gramble10 5 years ago +3

    14:57 An African or European swallow?

    • @statquest
      @statquest  5 years ago

      A most excellent question sir! :)

  • @winghho9
    @winghho9 5 years ago +2

    Didn't even realize this StatQuest video is super long until you mentioned it. I truly enjoy the way you explain, thanks))))))))

    • @statquest
      @statquest  5 years ago

      Hooray! I'm glad you liked it. :)

  • @sam271183
    @sam271183 5 years ago +4

    Just Brilliant!! Josh Starmer - You are a genius!

  • @itsIs263
    @itsIs263 5 years ago +2

    I wish my school, UT Dallas, had professors like you, so I wouldn't need to struggle on the web. But thankfully I found you.

  • @usamanavid2044
    @usamanavid2044 4 years ago +3

    Love from 🇵🇰 Pakistan.

  • @daliakamal5621
    @daliakamal5621 3 years ago +2

    Amazing video. I have read many articles and watched many videos trying to understand the idea behind Ridge & Lasso Regression, and finally you explained it in the simplest way. Many thanks for your effort.

    • @statquest
      @statquest  3 years ago

      Glad it was helpful!

  • @herp_derpingson
    @herp_derpingson 6 years ago +3

    This reminds me of L2 regularization of weights in neural networks.

    • @statquest
      @statquest  6 years ago

      Yes! This is the exact same thing, only applied to Regression. I think it appeared first in the regression context, but I'm not sure.

  • @aarondijkstra7623
    @aarondijkstra7623 2 years ago +1

    Hahahah my college professor can't explain shit, thanks for the clarification

    • @statquest
      @statquest  2 years ago

      I'm glad this was helpful.

  • @MrChryssy1
    @MrChryssy1 5 years ago +11

    How do we get the new line at 3:40? We calculated 1.69 and 0.74, what did we do with them to get the new line?

    • @statquest
      @statquest  5 years ago +17

      In practice, ridge regression starts with the least squares estimates for the slope and intercept. Then it changes the slope a little bit to see if the sum of the squared residuals plus lambda times the squared slope gets smaller. If so, keep the new value. Then make the slope a little smaller and see if the sum of squared residuals plus lambda times the squared slope gets smaller. If so, keep the new value. Repeat those steps over and over again until the sum of the squared residuals plus lambda times the squared slope no longer gets smaller. Does that make sense? (A code sketch of this search appears at the end of this thread.)

    • @utkarshkulshrestha2026
      @utkarshkulshrestha2026 5 years ago +3

      @@statquest Hi Josh, the slope that you are referring to is just one of our parameters that we want to minimize right? For a higher order fitting, can it be any other parameter apart from slope?

    • @statquest
      @statquest  5 years ago +5

      @@utkarshkulshrestha2026 Least Squares will work to minimize the sum of the squared residuals using all of the parameters and the ridge regression will be applied to all parameters except for the intercept. Thus, for all parameters other than the intercept, we try to minimize the sum of the squared residuals plus the ridge regression penalty. Usually reducing the parameter values will increase the sum of the squared residuals a little bit and decrease the ridge regression penalty a lot. Does that make sense?

    • @utkarshkulshrestha2026
      @utkarshkulshrestha2026 5 years ago

      @@statquest Yes, this was pretty very much clear. Thank you..!!

    • @MrChryssy1
      @MrChryssy1 5 years ago +1

      @@statquest I mean the calculation ^^ That is what I am not quite sure about
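
A sketch of the step-by-step search Josh describes at the top of this thread (illustrative only; real implementations use a closed form or gradient methods, and the data here are made up). Start at the least squares slope and keep shrinking it while the sum of squared residuals plus lambda times the squared slope keeps dropping:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 3.2, 3.9])
lam, step = 1.0, 0.001

def cost(slope):
    intercept = y.mean() - slope * x.mean()      # intercept is not penalized
    residuals = y - (intercept + slope * x)
    return (residuals ** 2).sum() + lam * slope ** 2

xc, yc = x - x.mean(), y - y.mean()
slope = (xc @ yc) / (xc @ xc)                    # least squares starting point
while cost(slope - step) < cost(slope):          # shrink while the cost drops
    slope -= step
print("ridge slope:", round(slope, 3))           # smaller than least squares
```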

  • @malini76
    @malini76 2 years ago +2

    Whenever I feel some concept in ML or DS is not easily understood, I come to this channel because you explain it in a simple way with good examples.

  • @MridhuMathsMagician
    @MridhuMathsMagician 4 years ago +1

    Your channel is a blessing. I request you to kindly make videos on text analysis.

    • @statquest
      @statquest  4 years ago

      Thank you. I have one video on Naive Bayes that might be helpful: ruclips.net/video/bTs-QA2oJSE/видео.html

  • @makemymarket1772
    @makemymarket1772 13 days ago +1

    Such an amazing video, thank you Josh!

  • @youknowwhatlol6628
    @youknowwhatlol6628 8 months ago +1

    Greetings from Ukraine, Josh! I'd like to say thanks to you: even though we are in a difficult situation here, your videos on machine learning techniques always help me comprehend topics in this field... I am grateful to you! Thank you so much!!!

    • @statquest
      @statquest  8 months ago +1

      Wow! I can't imagine trying to learn ML in your situation, but I'm happy that I can help in some way.

  • @codewithsid2063
    @codewithsid2063 6 years ago +2

    Your videos are so underrated. Please have a Patreon account so that the community can help you keep bringing these high-quality videos.

    • @statquest
      @statquest  6 years ago +2

      Thank you! I'll look into the Patreon account. In the meantime you can support my channel through my bandcamp site - even if you don't like the songs, you can buy an album and that will support me. joshuastarmer.bandcamp.com/

  • @danilafarga6810
    @danilafarga6810 11 months ago +1

    if I pass my ML class I am dedicating my PhD to you lol

    • @statquest
      @statquest  11 months ago

      Good luck! :)

  • @dollysiharath4205
    @dollysiharath4205 1 year ago +1

    I do enjoy your singing as well as learning from your teaching :)

  • @anonyme103
    @anonyme103 4 years ago +1

    How come you are not a kaggle GrandGrandTripleBAM master?

  • @adenuristiqomah984
    @adenuristiqomah984 3 years ago +1

    Hey Josh, it might be unrelated to the video's content but have you considered making videos about Gaussian processes? Thanks!

  • @cyan-chunyuezheng7783
    @cyan-chunyuezheng7783 3 years ago +1

    Thanks, your explanation is much better than our lecturer's :0

  • @CWunderA
    @CWunderA 6 years ago +2

    Looking forward to that next video, it doesn't seem very intuitive how ridge regression can still find a decent fit when there are not enough examples. I'm guessing this ends up being very sensitive to the value of lambda?

    • @statquest
      @statquest  6 years ago +2

      It's all about the cross validation - that's what helps figure out the ideal lambda. That said, the example in the video is a super simple model with a super limited sample size just to keep things easy to see. However, in practice, Ridge Regression is usually used on larger models, with lots of variables, so you can have a bunch of samples, but still not many more samples than variables.

  • @dainegai
    @dainegai 4 years ago +1

    Great 'Quest as always!
    Small visual typo: at 12:48, as lambda increases, the model should converge to "the mean of all the (training) samples", right? (As lambda -> infinity, we set "diet difference = 0", and to minimize sum of squared residuals term, we'd set the intercept term to be the mean of all the samples.)
    So the "high-fat diet line" goes down, *but also* the "normal-diet line" would go up, right?

  • @skylarj720
    @skylarj720 2 years ago +2

    Thank you, Josh, you made ML and stats easy and enjoyable. Hands down better than most stats profs.

    • @statquest
      @statquest  2 years ago

      Thank you very much! :)

  • @Ahmadalisalh6012
    @Ahmadalisalh6012 3 years ago +1

    Your videos are super helpful, THANK YOU