Prof, can't we do regularization like L1 or L2 to overcome overfitting in gradient boosting?
Some penalty of k!
25:51
Prof., please drop a computational linear algebra playlist.
Thanks
I don't understand this part about regression. Why does averaging the predictions work for regression? Since each predictor is a weak learner, won't averaging them still give a poor result?
For bagging or random forests, each predictor has high variance (due to overfitting). Taking an average of these trees reduces the variance.
The key idea is that the individual predictors have low bias but suffer from high variance. Upon averaging, the bias (which was already small) remains unchanged, while the variance scales down. As a result, the mean squared error (which is bias^2 + variance) decreases because of the averaging operation.
Ashish Katiyar yeah, but isn't the regression output a continuous value? I understand how it works for classification.
@@Michael-kt3tf The idea is exactly the same for regression. Since the individual predictors are unbiased and (hopefully) close to independent, one can expect their outputs to be spread around the ideal prediction. In such a scenario, averaging should give you something close to the ideal prediction. Note that it is important for the predictors to be independent. If they are not (in particular, if they are positively correlated), one can expect all of them to make similar errors, and hence averaging wouldn't be very useful. A small simulation illustrating this is sketched after this exchange.
Ashish Katiyar got it, thank you
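A minimal simulation sketch (not from the lecture or the thread above) illustrating both points: averaging many unbiased, unit-variance predictors shrinks the MSE by roughly the ensemble size when their errors are independent, but barely helps when the errors are strongly correlated. The true value, the correlation levels, and the helper name mse_of_average are made up purely for illustration.

```python
# Monte Carlo sketch: each "predictor" is an unbiased but noisy estimate of the
# true value at one test point; averaging shrinks the variance term of the MSE.
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0          # the ideal prediction at one test point (assumed)
n_predictors = 50         # size of the bagged ensemble (assumed)
n_trials = 10_000         # Monte Carlo repetitions

def mse_of_average(correlation):
    """MSE of the averaged prediction when predictor errors share a common
    (correlated) noise component plus independent noise; each error has
    variance 1 and pairwise correlation equal to `correlation`."""
    shared = rng.normal(0.0, 1.0, size=(n_trials, 1))                 # common error
    independent = rng.normal(0.0, 1.0, size=(n_trials, n_predictors)) # per-tree error
    errors = np.sqrt(correlation) * shared + np.sqrt(1 - correlation) * independent
    predictions = true_value + errors          # unbiased, unit-variance predictors
    averaged = predictions.mean(axis=1)        # the bagged prediction
    return np.mean((averaged - true_value) ** 2)

print("single predictor MSE            ~ 1.0")
print("average of 50 independent preds :", round(mse_of_average(0.0), 3))  # ~ 1/50
print("average of 50 correlated preds  :", round(mse_of_average(0.8), 3))  # ~ 0.8
```

With equicorrelated errors the variance of the average is rho + (1 - rho)/N, which is why the correlated case stays close to the single-predictor MSE while the independent case drops by a factor of about N.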
finished. 2024/9/2
So in gradient boosting, there will be just one learner which gets optimized by fitting to the residuals in subsequent steps? Unlike AdaBoost, where we have multiple weak learners?
In gradient boosting too, there will be multiple weak learners. It is a forward stepwise additive process: at each step, you fit a new (high-bias) weak learner to the negative gradient (which, for squared-error loss, happens to be the residuals) and add it to the current estimate of the model. A minimal code sketch of this loop is given below.
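A minimal sketch of that loop, assuming squared-error loss, depth-1 trees (scikit-learn's DecisionTreeRegressor) as the weak learners, and a synthetic 1-D dataset; the learning rate, step count, and data below are chosen for illustration only.

```python
# Forward stepwise additive fitting: each step fits a weak learner to the
# current residuals and adds it (with shrinkage) to the running model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # noisy synthetic target

n_steps = 100
learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # F_0: constant initial model
weak_learners = []

for _ in range(n_steps):
    residuals = y - prediction                      # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=1)      # high-bias weak learner
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)  # F_m = F_{m-1} + nu * h_m
    weak_learners.append(stump)

print("train MSE:", np.mean((y - prediction) ** 2))
print("number of weak learners:", len(weak_learners))
```

The list of fitted stumps makes the contrast with the question explicit: the ensemble keeps every weak learner, and the final prediction is the shrunken sum of all of them, not a single repeatedly refit model.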