He is the only professor I've seen who asks, "Raise hands if this makes sense". He could have asked "Raise hands if it doesn't make sense", but he knows people who didn't understand feel under-confident and are generally hesitant to ask questions. He repeats key concepts till most of the class has understood. Great respect for you sir! And great teaching method!
Finding your channel has helped me SOO MUCH. I really appreciate when there is some mathematical rigour; most of the ML guys on the internet just wave their hands every time a nuanced concept comes along.
Thank you
Professor, you are a true genius at explaining this stuff. Your lectures are so much fun and motivating!
Fantastic lecture!
Most people are viewing these lectures because they came across one video and now they want to learn everything Prof Kilian has to share. But if this were offered as a certified course online, more people would know about it and could officially take time out to complete the course.
Actually, it is: online.cornell.edu/machine-learning/
Thanks a million prof kilian 😀
Thank you, Sir, for this clear and concise explanation of the topic.
Thank you for the amazing lecture series.
Awesome explanation..!!
lectures so good, kilian on my poster wall. :)
Hello everyone, including Prof. Kilian. First of all, thank you from the bottom of my heart for uploading these gem video lectures online. Absolutely loved them.
Can anyone, including Dr. Kilian, please recommend a deep learning course like this one?
I'd be highly obliged. It has to have quality like this course.
I can recommend
@@jayantpriyadarshi9266 tell
@@rohit2761 search for deep learning by Mitesh Khapra .. on RUclips
@@jayantpriyadarshi9266 Bro thanks, I've already done that course. Please suggest some other one.
@@rohit2761 There is Deep Learning 2 by him as well, which is more about GANs and all. If you did this much, I think you can read papers.
"before each split randomly subsample k ≤ d features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)" ...is the line from the lecture notes... I think it should be "decrease the variance"...
No, it is correct. It is a little confusing, because generally fewer features would mean you decrease the variance of your classifier.
However by reducing the features for each individual split, you build very different trees - so this *increases* the variance *across* the trees, which you then reduce again by averaging.
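If it helps to see the mechanics, here is a minimal Python sketch of that per-split feature subsampling, under my own assumptions: the function names, the use of entropy as the impurity, and the exhaustive threshold search are illustrative choices of mine, not the lecture's code.

import numpy as np

def entropy(labels):
    # empirical entropy of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def choose_split(X, y, k, rng=None):
    # X: n x d data matrix, y: labels, k: number of features considered per split (k <= d)
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    features = rng.choice(d, size=k, replace=False)  # subsample k of the d features, without replacement
    best = None
    for f in features:                               # only the sampled features are considered
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n  # size-weighted impurity
            if best is None or score < best[0]:
                best = (score, f, t)
    return best  # (impurity, feature index, threshold); a different random feature subset each call

Because every call sees a different random subset of features, two trees grown on the same data will typically choose different splits - that is the extra variance *across* trees being described, which the averaging then reduces again.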
@@kilianweinberger698 Thanks a lot...
"I ask my uncle who's a magician".
For the impatient, the lecture starts at 2:55.
Hi Dr. Kilian, would you recommend pruning before bagging, or does it not matter?
I love this teaching style. I would love to see TU Berlin create such tutorial sessions and lecture notes, rather than copying snippets from universities all over the world and just pasting them. A lot of the time they use code from huge projects but only a little part of it for the task, and they don't explain the rest of the code, even though you need to understand the whole code to fill the gaps. So you end up with a project that is quite advanced but has some parts left out, and suddenly you need to translate everything, including the names of the math symbols - and they use different symbols in English, so you always need to keep that in mind. So they don't even create their own content for the homework, and that sucks.
Thanks for the lecture.
I have a question concerning showing that bootstrapped data sets are drawn from the original distribution P. In your lecture notes you are assuming that P is discrete, is it an assumption that simplifies the proof of P(X=x) = Q(X=x)? Or in other words, can we show that P and Q are identical distributions in the case when P is not discrete (in the general case)?
Thanks a lot Sir !!
I really enjoy your lectures, Prof! Could you please elaborate on your statement that the sets D_1,...,D_M are not i.i.d.? Since you sample with replacement, I do not see any reason that knowing the previous sets D_1,...,D_(i-1) would help you to predict the set D_i (except for maybe the not very interesting case where D has only a single element; in that case the whole process is doomed anyway...).
Let’s say the elements in D are drawn i.i.d. from P.
Given P, if you take the first half of D, you gain no information about the second half.
In contrast, if you now create data sets D_1, D_2,..., D_M from D (through sampling with replacement), these data sets are not i.i.d.
For example, imagine P(z)>0 i.e. there is a non-zero chance that we sample a specific element z from P. But imagine that just by chance, z is not in D.
If M is large enough and you observe that z is not in D_1, not in D_2, not in D_3, ..., not in D_{M-1}, you can conclude that z was almost certainly not in D, and therefore is extremely unlikely to be in D_M. In other words, the data sets D_1,...,D_{M-1} tell you something about D_M - so they are not independent.
Of course you can also simply look at the extreme case where D={z}. Clearly D_1 tells you something about D_2.
Hope this helps.
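For anyone who wants to see this numerically, here is a small simulation sketch of exactly that argument; the uniform distribution over {0,...,9}, the sizes n=20, M=50, and the choice z=0 are arbitrary assumptions of mine, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, M, trials = 20, 50, 10000
z = 0                                     # a specific element with P(z) > 0

in_last = cond_in_last = cond_total = 0
for _ in range(trials):
    D = rng.integers(0, 10, size=n)                                   # D drawn i.i.d. from P (uniform over 0..9)
    boots = [rng.choice(D, size=n, replace=True) for _ in range(M)]   # bootstraps D_1, ..., D_M
    in_last += int(z in boots[-1])
    if all(z not in b for b in boots[:-1]):                           # condition: z absent from D_1, ..., D_{M-1}
        cond_total += 1
        cond_in_last += int(z in boots[-1])

print("P(z in D_M)                          ~", in_last / trials)
print("P(z in D_M | z not in D_1..D_{M-1})  ~", cond_in_last / max(cond_total, 1))

In this simulation the first number comes out well above one half while the second is essentially zero, so observing D_1,...,D_{M-1} really does change what you should believe about D_M.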
@@kilianweinberger698 Thanks a lot for the specific answer!! I understand that for outliers (usually the rarest elements, maybe also the common ones) you can learn something about D_M if you are given D_1,...,D_{M-1}.
It is weird that even Wikipedia (perhaps not the most accurate source of truth, but still usually pretty accurate) states that: "This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling" here
en.wikipedia.org/wiki/Bootstrap_aggregating
Oh I see the confusion. The subsets are dependent samples from the original distribution P, but they are independent bootstraps from the original data set.
Even in the extreme case D={x}, the bootstraps are independent given D, as D_1 tells you nothing about D_2 that you couldn't already infer from D. Hope this helps.
@@kilianweinberger698 Yes, that's exactly was the confusion indeed!! Thanks for clearing it up Prof'!
I think the running time of the naive implementation may be O(dn^2) instead of O(dkn^2). We have to try n thresholds in every dimension, so nd is a must. But I don't think we have to do the rest of the operations k times.
Using a loop such as:
left_pt = 0        # number of points in the left child
right_pt = 0       # number of points in the right child
left_class_1 = 0   # left-child points with label 1
left_class_2 = 0   # left-child points with label 2
right_class_1 = 0  # right-child points with label 1
right_class_2 = 0  # right-child points with label 2
for i in range(n):
    # indicators for the label of the i-th point
    is_class_1 = 1 if y[i] == 1 else 0
    is_class_2 = 1 if y[i] == 2 else 0
    if x[i][d] < some_threshold:    # point goes to the left child
        left_pt += 1
        left_class_1 += is_class_1
        left_class_2 += is_class_2
    else:                           # point goes to the right child
        right_pt += 1
        right_class_1 += is_class_1
        right_class_2 += is_class_2
Effectively, this will calculate how many data points we have to the left and right. Additionally, it will calculate p_k as well since p_left_class_1 = left_class_1 / left_pt.
Is there any reason we should also go over the labels with another loop?
I was thinking the same. We can make one pass through the entire dataset and keep running counts: one for the left child, one for the right child, and one per class within each child.
As soon as we encounter a point, we increase either the left or the right counter and simultaneously update the corresponding class counter on that side. That, I believe, will remove k from the algorithmic complexity, as you suggested.
I think he got mixed up between dealing with continuous and discrete features. Yes, you are right: for continuous features it is O(n^2 d).
@prof, great lecture. Especially liked how you started from kd trees as a data structure and arrived at decision trees. Btw, how do you show that the D_i (the subsamples) come from the same distribution as D?
Also, in random forests, k = sqrt(d) - what would be a justification / intuition for that?
Yes, that should be in the notes. The D_i are sampled from the same distribution as D, but they are *not* independent.
@@kilianweinberger698 yes, found that in the notes later. What about the intuition for the sqrt(d) feature subsample in random forests, though? Not sure if I missed that too!
Btw prof, the whole video series is awesome! Have to finish up kernels and Gaussian process ,but the rest have been wonderful!
Can someone help me with why the bagging datasets are not independent? Are we not drawing them randomly?
What would be an example that is independent, then?
Well, imagine in the extreme case your original training data set only has a single element {x}, drawn from some distribution P. Now you apply bagging on this data set of size one, and you re-sample m new data sets from it. But because it only contains one element, they are all identical {x},{x},{x},...,{x}. So clearly they are not independent. If I tell you what one data set is, you know all the others. Hope this helps.
The concept is bootstrapping. You are creating random subsets from the same data; samples may be repeated in any of the trees.
Why is the complexity of the first approach O(d*k*N^2)? We can compute all p_k first and then multiply with H(S_L) (for all K class labels). Therefore we can do this in O(d*N*(N+K)), can't we?
I am thinking the same.
exactly , that what i was thinking about .
Thanks
Is it a property of entropy that it's additive? I can't seem to figure out why, intuitively, we use the weighted sum of H(Sr) and H(Sl).
If you didn’t use the weighted sum, and just the sum H(Sr)+H(Sl) you would end up splitting off a single element. That would make one side completely pure (entropy 0), but wouldn’t help you much to improve your classifier. Hope this helps.
@@kilianweinberger698 Ahh, thanks! But what is wrong with the "regular average", i.e. weights = 1/2?
Edit:
For example, if we have target variable = [1, 0, 1, 0, 0, 0, 0, 1], then a 1:(n-1) split will "always" have maximal information gain compared to 2:(n-2), 3:(n-3), etc., whether we use a weighted average, a "normal" average, or no average at all to calculate the information gain. I think in real life it's not very likely we will have a feature that splits the target in a 1:(n-1) fashion, but I'm still curious.
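Just to put numbers on that example, here is a small sketch that computes both scores (the size-weighted sum vs. the plain average of the two child entropies) for a few hand-picked partitions of that target vector; the entropy helper and the candidate partitions are my own choices.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([1, 0, 1, 0, 0, 0, 0, 1])   # the target variable from the comment above
n = len(y)

# candidate splits, given as the indices that go to the left child
candidates = {
    "split off one 1 (1:n-1)":  [0],
    "split off two 1s (2:n-2)": [0, 2],
    "all 1s vs all 0s":         [0, 2, 7],
}

for name, left_idx in candidates.items():
    mask = np.zeros(n, dtype=bool)
    mask[left_idx] = True
    Sl, Sr = y[mask], y[~mask]
    weighted = (len(Sl) * entropy(Sl) + len(Sr) * entropy(Sr)) / n   # size-weighted sum
    plain    = (entropy(Sl) + entropy(Sr)) / 2                       # "regular average", weights = 1/2
    print(f"{name:26s} weighted = {weighted:.3f}   plain average = {plain:.3f}")

For this particular vector the perfectly pure split of the three 1s against the five 0s scores zero under both criteria, so it wins either way; the weighting matters more when no pure split is available.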
Both Gini impurity and entropy are concave functions; the weighted sum matches exactly the form that appears in the defining property of a concave function.
@@haizhoushi8287 Can you please elaborate more!
@@71sephiroth Actually, I think you are right that it is a weighted sum, i.e., weighted by the sizes of Sr and Sl. Let's use c(X) to indicate the class (c1, c2, ..., cK) a sample X belongs to. The weighted sum then approximates H(Sl)Pr(Sl) + H(Sr)Pr(Sr) = H(c(X) | X in Sl) Pr(X in Sl) + H(c(X) | X in Sr) Pr(X in Sr) = H(c(X) | s(X)), where s(X) is the index indicating whether X is in the left node or the right node, e.g. s(X)=1 if X in Sl and s(X)=0 if X in Sr.
Since the conditional entropy H(c(X) | s(X)) can be viewed as the remaining uncertainty about c(X) given that s(X) is known, it is very natural that we would like to minimize H(c(X) | s(X)) to get the optimal split.
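If anyone wants to verify that identity numerically, here is a tiny sketch with a made-up 2x2 table of child-by-class counts (the counts are arbitrary assumptions of mine); it compares the size-weighted sum of child entropies with H(c(X)|s(X)) computed via the chain rule.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# counts[s][c] = number of points in child s with class c (made-up numbers)
counts = np.array([[3, 1],    # left child:  3 of class c1, 1 of class c2
                   [2, 4]])   # right child: 2 of class c1, 4 of class c2
n = counts.sum()

# size-weighted sum of child entropies, as used by the split criterion
weighted = sum(counts[s].sum() / n * entropy(counts[s] / counts[s].sum()) for s in range(2))

# conditional entropy H(c(X) | s(X)) from the joint distribution, via H(c|s) = H(c,s) - H(s)
joint = counts / n
conditional = entropy(joint.ravel()) - entropy(joint.sum(axis=1))

print(weighted, conditional)   # the two values coincide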
Why do we solve the problem of variance by just not insisting on every point being correctly classified? Why don't we just keep branching until a certain threshold and then do a majority vote?
Indeed, you can do that. If you enforce a max leaf-node size (e.g. stop splitting if
@@kilianweinberger698 It is the first time I hear about "fiddling parameters". Could you please recommend some resource where I can read about it?
@@alimamdouh1764 Hey, he is trying to say that with bagging there are fewer parameters to deal with.
The person who gave the thumbs down (dislike) to this video is currently jobless and wasting his money in data science diplomas/courses and consuming bullshit content :P
43:38 ahaha you are the number one
Regression 21:40
Bagging starts from: ruclips.net/video/0LB1cy2sCXc/видео.html (33:52)