I think the intuitions explained in all his lectures are amazing and so helpful. This is probably one of the most approachable lecture series on ML there is.
Thanks a lot, Professor, for the entire course, and greetings from Greece. I studied data science and machine learning in my master's, but your lectures are an absolute masterpiece.
The fact that all the topics are covered so exhaustively makes them a must-watch. I started from Decision Trees, but I will re-watch the whole series.
Thank you, Kilian, for posting the videos.
I watched some of the videos and they are really amazing. He explains things in a way that is very easy to understand. Thank you, Sir, for sharing them. I appreciate it.
Unlike the squared loss function, to my understanding the exponential loss won't be minimized at H = Y (the vector of labels). Take a very simple dataset with two labels (say [+1, +1]): the loss at H = [+1, +1] is 2*exp(-1*1), whereas at H = [+2, +2] it is 2*exp(-1*2), and the latter is clearly smaller. Does this contradict your contour lines at 5:05, Dr. Kilian? Grateful for all the explanations you've provided so far.
Yes, good point: for the exponential loss the minimizer only exists in the limit. Still, the principle is the same ...
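A quick numerical check of the example in the question above (a minimal Python sketch; the variable names are made up, not from the lecture):

```python
import numpy as np

# Two training points, both labeled +1
y = np.array([+1.0, +1.0])

def exp_loss(H, y):
    """Exponential loss: sum_i exp(-y_i * H_i)."""
    return np.sum(np.exp(-y * H))

print(exp_loss(np.array([1.0, 1.0]), y))  # 2*exp(-1) ~ 0.736
print(exp_loss(np.array([2.0, 2.0]), y))  # 2*exp(-2) ~ 0.271, smaller
# Pushing H further in the direction of y keeps lowering the loss,
# so the minimum is only approached in the limit |H| -> infinity.
```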
Such an amazing lecture on gradient boosting.
Can you provide any reference on Gradient Boosted Classification Trees?
-- For instance, what loss function is used in that case?
-- What training dataset is used for building the classifier ht?
Thanks in advance.
Sir, could I please, please, pretty please get the coding assignments too? The intuition building in these lectures is perfect; the only thing missing, I think, is the coding assignments.
Hi Professor, thank you so much for the lecture. I wonder, is it possible to stop AdaBoost once the training error reaches zero? From your demo, after the training error gets close to zero and the exponential loss keeps shrinking, the test error doesn't change much. I guess we don't need to waste time making the exponential loss smaller and smaller.
No, it typically doesn't stop when zero training error is reached. The reason is that even if the training error is zero, the training LOSS will still be >0 and can be further reduced (e.g. by increasing the margin of the decision boundary).
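To make the distinction concrete, the exponential training loss is a sum of strictly positive terms, so it stays above zero even once every training point is classified correctly, and increasing the margins y_i H(x_i) keeps reducing it:

\[
\mathcal{L}(H) \;=\; \sum_{i=1}^{n} e^{-y_i\,H(\mathbf{x}_i)} \;>\; 0 .
\]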
Hi Professor, great intuition for explaining why AdaBoost overfits slowly and why this is best observed on a log(#iterations) scale. One question though: I work with GBM and XGBoost all the time, and they behave very similarly to AdaBoost when it comes to overfitting. Do you have any intuition for this?
Hmm, good question ... XGBoost in its vanilla form is just GBRT with the squared loss, so the same rationale doesn't apply here. My guess would be that your data set is large enough that it simply takes a long time to overfit.
Thanks a lot !!
The two losses are the global loss and the one used to fit the weak learners, right, Sir?
Hi Prof Kilian. Quick question on the boosting method.
I watched the videos twice already (and I was also a certificate program student) but couldn't find an explanation.
The previous lecture mentions that one of the requirements of boosting is that the weak learners must at least point in the right direction.
How can we check that the weak learners point in the right direction when running boosting? Does this happen by trial and error, or is there a method or a test?
Thanks for the great class!!!
In AdaBoost, the error on the re-weighted training set must be <0.5 (or you would just flip the predictions), so you stop the moment your best weak learner has an error of 0.5 (which means you just cannot learn anything useful anymore).
In AnyBoost, the inner product of the weak learner predictions and the gradients should be >0. (Same thing: it can never be <0, because then you would just flip the predictions.)
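A minimal sketch of that AdaBoost stopping check, assuming weak-learner predictions in {-1, +1} and training-set weights that sum to one (the helper names here are made up, not from the lecture):

```python
import numpy as np

def weighted_error(weights, y_true, y_pred):
    """Weighted 0/1 error of a weak learner on the re-weighted training set."""
    return np.sum(weights * (y_pred != y_true))

def adaboost_step(weights, y_true, y_pred, tol=1e-12):
    """Return (possibly flipped) predictions, or None if boosting should stop."""
    eps = weighted_error(weights, y_true, y_pred)
    if eps > 0.5:             # worse than chance: flip the predictions
        y_pred, eps = -y_pred, 1.0 - eps
    if abs(eps - 0.5) < tol:  # exactly chance level: nothing left to learn
        return None
    return y_pred
```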
Thank you very much for the explanation, I will check my runs accordingly!
How come this video only has 10 thousand views!! while those junk videos over there get tons of viewers :D:D:D students these days really don't know how to pick the good stuff at all.
Built-in safeguard against overfitting, because the \alpha values decrease.
Can we say AdaBoost is like coordinate descent in function space??
Yes, with an adaptive stepsize.
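For reference, the adaptive stepsize is the usual AdaBoost coefficient for the weak learner chosen in round t, which comes from an exact line search of the exponential loss along that coordinate (writing \epsilon_t for the weighted training error of h_t):

\[
\alpha_t \;=\; \frac{1}{2}\,\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right).
\]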
Hahahaha, AdaBoost never overfits, of course!! Fixing a bug tends to create another bug :D:D
How can AdaBoost work with SVMs? Wouldn't a linear combination of such linear classifiers still be linear?
In AdaBoost we assume each weak learner only outputs +1 / -1. So you have to take the output of each linear classifier and apply the sign() function. Now you are combining multiple linear classifiers in a non-linear fashion.
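A tiny sketch of that structure, with made-up linear classifiers and voting weights, just to show that the weighted vote over sign() outputs is no longer a linear function of x:

```python
import numpy as np

# Each weak learner is the sign of a linear score, so it outputs +1 or -1.
def weak_learner(w, b):
    return lambda x: np.sign(x @ w + b)

# Hypothetical ensemble of three thresholded linear classifiers.
hs = [weak_learner(np.array([1.0, 0.0]), -0.5),
      weak_learner(np.array([0.0, 1.0]), -0.5),
      weak_learner(np.array([-1.0, -1.0]), 1.5)]
alphas = [0.8, 0.6, 0.4]  # made-up AdaBoost weights

def H(x):
    # Sign of a weighted vote over sign() outputs: non-linear in x,
    # even though every individual h is built from a linear score.
    return np.sign(sum(a * h(x) for a, h in zip(alphas, hs)))

print(H(np.array([1.0, 1.0])), H(np.array([0.0, 0.0])))  # +1.0 -1.0
```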
@kilianweinberger698 Thank you very much for your answer, Prof. Weinberger. I see now that by taking only the sign of each classifier you are applying a step function, so they are no longer linear, similar to an activation function in a neural network (now I wonder if there would be any advantage to using other functions for boosting).
I would also like to thank you for your amazingly clear and understandable course. I can say that I have understood all the topics (even Gaussian processes, which I previously believed to be impossible), and I will be very interested to watch if you decide to do any further videos, either other courses or opinion pieces, and to contribute to any Patreon-like funding. Best regards.