Man I've discovered your channel and am watching your videos non-stop. No matter which topic, it is ALL as if a stream of light shines and makes it all understandable. You've got a gift.
Agreed! You've got a gift to shine the light over topics.
No
Non-math person here, and even I could understand this tutorial. I'll probably have to watch it a couple more times because I'm a bit slow in my 40s now. But you really have a gift. Keep up the good work.
You always make your content so easy to understand. Just the right amount of math mixed with simple examples that clearly illustrate the main ideas of whatever topic you are talking about. Keep up the great work!
Hey thanks a lot, was literally just searching about Gradient Boosting today and your explanations have always been great. Good pacing and explanations even with some math involved.
you are awesome man! I just love coming back to your videos every time. they are just the right length, and the perfect depth.. Kudos!
Great video! A bonus for using squared error loss (which is commonly used) as the loss function for regression problems: the gradient of squared error loss is just the residual! So each weak learner is essentially trained on the previous residual, which makes sense intuitively. (I think that's why each gradient is called "r"?)
Yeah, squared error is easier to differentiate than alternatives like root squared error, and it does not depend on the number of observations the way mean squared error or root mean squared error does. If you want the gradient to be exactly equal to the residual, you can take (1/2)(squared error) as the loss function.
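A small numerical check of the point made in this thread, offered as a sketch rather than as the video's own derivation. It assumes the loss is (1/2)(y - y_hat)^2 and differentiates with respect to the prediction y_hat. Under that assumption the gradient is -(y - y_hat), so strictly speaking it is the negative gradient that equals the residual. The function names and the sample numbers are made up for illustration.

```python
import numpy as np

def half_squared_error(y, y_hat):
    # Assumed loss: one half of the squared error, per observation.
    return 0.5 * (y - y_hat) ** 2

def gradient_wrt_prediction(y, y_hat):
    # d/dy_hat of 0.5 * (y - y_hat)^2 is -(y - y_hat).
    return -(y - y_hat)

# Illustrative values only.
y = np.array([3.0, -1.0, 2.5])
y_hat = np.array([2.0, 0.5, 2.5])

residual = y - y_hat
negative_gradient = -gradient_wrt_prediction(y, y_hat)

# Independent check of the analytic gradient using central differences.
eps = 1e-6
numeric_gradient = (
    half_squared_error(y, y_hat + eps) - half_squared_error(y, y_hat - eps)
) / (2 * eps)

print(residual)
print(negative_gradient)
print(numeric_gradient)
```

Without the factor of 1/2 the same relationship holds up to a constant factor of 2, which a learning rate can absorb.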
This is a fantastic video. Thank you for sharing!
Glad you enjoyed it!
The last part of 'Why does it work?' made all the difference.
totally agree
Your videos on data science are awesome! They help me to prepare for my university exam a lot. Thank you very much!
Man, you're the 5th person; none has explained it as simply and clearly as you. Thanks a ton.
Incredible video, you make a really hard concept understandable. Keep teaching like this and big things will come!
Completely agree, you are changing our lives! Cheers!
Great video as always! I would love it if you could build on that video and talk about XGBoost and the math behind it next!
I worked on this 5(?) years ago, but needed a reminder - thanks!
You're an amazing teacher, thanks a lot from Sweden!
Thanks for the effort you put in to help your viewers understand, it really helped me understand the concept behind gradient descent!
Your channel is criminally underrated. Just one question. You mentioned using linear weak learners, i.e. f(x) is a linear function of x. In this case, how would you ever get anything other than a linear function after any number of iterations? At the end of the day, you are just adding multiple linear functions. It seems this whole procedure would only make sense if you pick a nonlinear weak learner.
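The concern in this question can be checked numerically. The sketch below is an illustration under stated assumptions, not the video's setup: squared error loss, weak learners that are ordinary least-squares lines, a learning rate of 0.5, and synthetic data generated from a sine curve. Because every learner is linear in the same features, the boosted sum is itself a single line, and in this setup it converges to the ordinary least-squares fit.

```python
import numpy as np

# Synthetic, clearly nonlinear data (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])  # intercept and slope features

def fit_linear(X, target):
    # Least-squares linear weak learner; returns [intercept, slope].
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

learning_rate = 0.5
total_coef = np.zeros(2)  # the sum of linear learners is one coefficient vector
for _ in range(50):
    residual = y - X @ total_coef          # negative gradient for squared error
    total_coef += learning_rate * fit_linear(X, residual)

ols_coef = fit_linear(X, y)
print(total_coef)
print(ols_coef)
```

In this setting boosting only re-derives the plain linear regression, which is consistent with the commenter's intuition that nonlinear weak learners such as shallow trees are what make the procedure worthwhile.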
Unbelievable variety of topics in this channel! What is your daily job? You have an amazing amount of knowledge
Thanks for the video, also really like the whiteboard format
Pls don't stop making these videos
Finally understood it really well, thanks!
Very awesome, thanks for the explanation 👍
Waw thank you so much for this amazingly clear video explanation 🤗!!! Instantly subscribed :)
Phenomenal. Thank you again for making these videos
You are doing a great job, really enjoying your videos.
that was very clear and useful, thank you
Thank you so much! You just blew my mind
You're very welcome!
thanks man you explain it so much better than my uni professor :)
Glad to hear that!
Can mathematics behind ML be less dreadful and more fun? Well yes, if we have a tutor like him... amazing explanation ❤️
Best boosting definition yet.
well done - gee there is something to be said about a good explanation and a whiteboard. Fantastic explanation.
Thanks!
Perfect, really well done!
Thanks!
Thank you for this good explanation.
Just asking: is the concept of gradient boosting similar to a Taylor series? Each term on its own is not very good at approximating the function, but as you add more terms, the approximation gets better.
So so well explained
Any chance you're interested in doing a video introducing the EM algorithm with a toy example? Love your videos, please keep them coming!
Great video brother.
Amazing video. Thanks.
You're the man. Thank you!
Hi! Why do we use f2(x) instead of the raw r1_hat? I mean, why fit a model to the residuals and use its predictions if we already have the exact value of the gradient?
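One possible way to see the answer, sketched under assumptions not taken from the video: the exact residuals exist only at the training points, so a fitted function is needed to say anything about an input that was not in the training set. Here the first model f1 is assumed to be the mean of y, the weak learner f2 is assumed to be a straight line fitted to the residuals, and the data are made up.

```python
import numpy as np

# Illustrative training data.
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.0, 4.1, 5.9, 8.0])

f1 = y_train.mean()   # assumed first model: a constant
r1 = y_train - f1     # residuals, known only at the four training points

# Assumed weak learner f2: a line fitted to the pairs (x, r1).
f2_coef = np.polyfit(x_train, r1, deg=1)

# New inputs have no stored residual, but f2 can still be evaluated on them.
x_new = np.array([2.5, 5.0])
prediction = f1 + np.polyval(f2_coef, x_new)
print(prediction)
```

Adding the raw residuals back would reproduce the training targets exactly, which is memorization rather than a model that can predict at x_new.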
Thanks for sharing!
Thanks! great videos.
Man, you are amazing!
One question: in Step 3, is your target variable the gradient with respect to the previous prediction? If so, don't you think there is a possibility of it becoming infinite, so that we would be trying to fit something to infinity?
Hi, thank you for this informative video. I have some trouble understanding the graph at 5:27. How do you map out the curve on the graph if you have only a single pair of prediction and loss function values? Do you create some mesh out of the given pair?
In words, is it correct to phrase Gradient Boosting as being multiple regression models combined, where each subsequent model aims to correct the error that the previous models couldn't account for?
Love the videos! Great topic
let's talk about the first word in gradient boosting..... boosting :D Nice video as always
Thumbs up for the pen catch recovery at the start.
😂
Excellently explained. I was just reviewing this and was very helpful to see how someone else think through this.
Great video! Never seen gradient descent used with the derivative of the loss function with respect to the prediction. Not sure if I understand it 100%, but if the gradient were, for example, -1 for ri, would the subsequent weak learner fit a model to -1? Or would the new weak learner fit a model to (old pred - (learning rate * gradient))? Would love to see a simple example worked out for 1 or 2 iterations if possible. Thank you! :)
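A worked example of two iterations, as a minimal sketch under assumptions that may differ from the video: the loss is (1/2)(y - pred)^2, the weak learner is a one-split regression stump (the helper fit_stump is hypothetical), the starting prediction is the mean of y, and the learning rate is 0.5. In this setup each weak learner is fitted to the negative gradient itself, and the learning rate is applied afterwards when the learner's output is added to the old prediction.

```python
import numpy as np

def fit_stump(x, target):
    """Hypothetical weak learner: the single split with the lowest squared error."""
    best = None
    xs = np.unique(x)
    for threshold in (xs[:-1] + xs[1:]) / 2:   # midpoints between sorted x values
        left = target[x <= threshold].mean()
        right = target[x > threshold].mean()
        fitted = np.where(x <= threshold, left, right)
        sse = np.sum((target - fitted) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left, right)
    _, threshold, left, right = best
    return lambda q: np.where(q <= threshold, left, right)

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.5, 3.5, 4.0])
learning_rate = 0.5

pred = np.full_like(y, y.mean())               # start from a constant prediction
losses = [np.mean(0.5 * (y - pred) ** 2)]
for step in range(2):
    neg_grad = y - pred                        # target for the weak learner
    stump = fit_stump(x, neg_grad)             # fit the learner to the negative gradient
    pred = pred + learning_rate * stump(x)     # then take a scaled step
    losses.append(np.mean(0.5 * (y - pred) ** 2))
    print(step + 1, neg_grad, pred, losses[-1])
```

So with a gradient of -1 at some point, the learner's target at that point would be +1 (the negative gradient), and the update would add the learning rate times whatever the learner predicts there.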
great vid!
nice video series
It's not super clear to me how or where the learning rate comes into play here and what its relation to the scaling factor gamma is.
Can you please make a video on XGBoost and its advantages by comparison? Thank you.
are the initial weak learners randomly selected? If so, can this initial random selection be optimized?
great video
Yeiii you are the best !!
Isn't gradient the partial derivative with respect to feature(xi), not with respect to the prediction(y^)?
Thanks man
Hmmmm v interesting. Something to think about. Thx
move on so I can get screenshot 😂.
Brilliant explanation, well done
after 4 hrs. of searching in vain, this has truly proven to be a savior!
The first time I watched this video, I understood shit! Now, the second time, after studying the subject and learning more :), it is much clearer :)
Bro, it's late AF and I'm not gonna lie, I'm passing out now, but I'mma DEFINITELY catch this shit tomorrow. 👍
😂 come back anytime
@ritvikmath Well, it's been a year, but I came back! 😂
Learners together strong
Come to think of it, concepts from gradient boosting apply perfectly to less mathematical aspects of life too. Just take a tiny step in the right direction and repeat!
yes love when math reflects life!
Very good content, but it would be great if you could stay at the corner so that we can see the board and follow along. Otherwise, great session.
Thanks for the suggestion !
Honestly, StatQuest has a much better way of explaining this. First he explains the logic by means of an example and then he explains the algebra afterwards. I'd recommend his videos on gradient boosting for anyone who didn't understand this. Without having seen his videos on it I would have been unable to understand the algebra.
Ripped...
Hello Ritvik, are you on LinkedIn? Would love to connect with you!
That was amazing. Thanks a lot.