Machine Learning Lecture 30 "Bagging" -Cornell CS4780 SP17

Поделиться
HTML-код
  • Опубликовано: 12 янв 2025

Комментарии • 64

  • @aakashPotter
    @aakashPotter 3 года назад +15

    He is the only professor I've seen who asks, "Raise hands if this makes sense". He could have asked "Raise hands if it doesn't make sense", but he knows people who didn't understand feel under-confident and are generally hesitant to ask questions. He repeats key concepts till most of class has understood. Great respect for you sir! And great teaching method!

  • @astutzorr3355
    @astutzorr3355 4 года назад +24

    finding your channel has helped me SOO MUCH, I really appreciate when there is some mathematical rigour and most of the ML guys on the internet just do a wave of hands every time a nuanced concept comes along.
    Thank you

  • @jenishah9825
    @jenishah9825 2 года назад +3

    Most people are viewing these lectures because they came across one video and now they want to learn everything Prof Kilian has to share. But if this was offered as a certified course online, more people would know about this and can officially take time out to complete the course.

  • @nicolasrover8803
    @nicolasrover8803 3 года назад +1

    Professor, you are a true genius at explaining this stuff. Your lectures are so much fun and motivating!

  • @Saganist420
    @Saganist420 5 лет назад +11

    Fantastic lecture!

  • @vivekupadhyay6663
    @vivekupadhyay6663 3 года назад +2

    Thank You Sir! for this clear and concise explanation of the topic.

  • @jorgeestebanmendozaortiz873
    @jorgeestebanmendozaortiz873 3 года назад

    For the inpatients, lecture starts at 2:55.

  • @salmaabdelmonem7482
    @salmaabdelmonem7482 4 года назад +5

    Thanks a million prof kilian 😀

  • @hdang1997
    @hdang1997 5 лет назад +6

    "I ask my uncle who's a magician".

  • @rohit2761
    @rohit2761 3 года назад +2

    Hello Everyone, including The Prof Kilian, First of all thank you from the bottom of my heart to upload this online. These Gem Video Lectures. Absolutely Loved them.
    Anyone including Dr. Kilian , Can you please recommend a Deep learning course like this one ?
    I'd be highly obliged. It has to have quality like this course.

    • @jayantpriyadarshi9266
      @jayantpriyadarshi9266 2 года назад

      i can reccommend

    • @rohit2761
      @rohit2761 2 года назад

      @@jayantpriyadarshi9266 tell

    • @jayantpriyadarshi9266
      @jayantpriyadarshi9266 2 года назад

      @@rohit2761 search for deep learning by Mitesh Khapra .. on RUclips

    • @rohit2761
      @rohit2761 2 года назад

      @@jayantpriyadarshi9266 bro thanks, I've already done that course. Please suggest suve other one

    • @jayantpriyadarshi9266
      @jayantpriyadarshi9266 2 года назад

      @@rohit2761 there is deep learning 2 as well by him .. which is more about GAN s and all.. if you did this much i think you can read papers.

  • @sumitsingh-yz3vm
    @sumitsingh-yz3vm 3 года назад +1

    Thank you for the amazing lecture series.

  • @thatsharma1066
    @thatsharma1066 4 месяца назад +1

    lectures so good, kilian on my poster wall. :)

  • @varunjindal1520
    @varunjindal1520 4 года назад +1

    Awesome explanation..!!

  • @ChiragSinghishere
    @ChiragSinghishere 4 года назад +1

    before each split randomly subsample k≤dk≤d features (without replacement) and only consider these for your split. (This further increases the variance of the trees.) .....is the line from lecture notes..... i think it should be "decrease the variance" ...

    • @kilianweinberger698
      @kilianweinberger698  4 года назад +4

      No, it is correct. It is a little confusing, because generally fewer features would mean you decrease the variance of your classifier.
      However by reducing the features for each individual split, you build very different trees - so this *increases* the variance *across* the trees, which you then reduce again by averaging.

    • @ChiragSinghishere
      @ChiragSinghishere 4 года назад +1

      @@kilianweinberger698 Thanks alot...

  • @aragasparyan8295
    @aragasparyan8295 3 года назад

    Thanks for the lecture.
    I have a question concerning showing that bootstrapped data sets are drawn from the original distribution P. In your lecture notes you are assuming that P is discrete, is it an assumption that simplifies the proof of P(X=x) = Q(X=x)? Or in other words, can we show that P and Q are identical distributions in the case when P is not discrete (in the general case)?

  • @DommageCollateral
    @DommageCollateral Год назад

    i love this teaching style, i would love to see tu berlin create such tutorium and lecture notes, rather than copying snippets from universities from all over the world and then just paste it. a lot of times the use code from huge projects but they just use a little part for the task, but they dont explain the rest of the code, but you need to understand the whole code to fill the gaps. so youre ending up with a project that is quite advanced but some parts are left out and suddenly you need to translate everything including the names of the math symbols and they use different symbols in english, so you always need to think of that. so they dont even crate their own content for the homework and that sucks.

  • @broccolee4328
    @broccolee4328 2 года назад

    Hi Dr. Kilian, would you recommend pruning before bagging or it doesn't matter?

  • @roniswar
    @roniswar 3 года назад +1

    I really enjoy your lectures Prof'! Could you please elaborate on your state that the sets D1,..,Dm are not I.I.D? Since you sample with replacement, I do not see any reason that knowing the previous sets D_1,..,D_(i-1), will help you to predict the set D_i (except for maybe the not very interesting case that D has only a single element, in this case all this process is doomed anyway...).

    • @kilianweinberger698
      @kilianweinberger698  3 года назад +4

      Let’s say the elements in D are drawn i.i.d. from P.
      Given P, if you take the first half of the D, you gain no information about the second half.
      In contrast, if you now create data sets D_1, D_2,..., D_M from D (through sampling with replacement), these data sets are not i.i.d.
      For example, imagine P(z)>0 i.e. there is a non-zero chance that we sample a specific element z from P. But imagine that just by chance, z is not in D.
      If M is large enough, and you observe that z is not in D_1, not in D_2, not in D_3, ... , not in D_{M-1} you can conclude that z was almost certainly not in D, and therefore is extremely unlikely to be in D_M. In other words, the data sets D_1,..,D_{M-1} tell you something about D_M - and they are not independent. Hope this helps.
      Of course you can also simply look at the extreme case where D={z}. Clearly D_1 tells you something about D_2.
      Hope this helps.

    • @roniswar
      @roniswar 3 года назад +1

      @@kilianweinberger698 Thanks a lot for the specific answer!! I understand that for outliers (usually the rarest ones, maybe also the common ones) you can learn something about D_m assuming that you have D_1, .., D_{M-1}.
      It is weird that even Wikipedia (perhaps not the accurate source of truth, but still usually pretty accurate) stats that : " This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling" here
      en.wikipedia.org/wiki/Bootstrap_aggregating

    • @kilianweinberger698
      @kilianweinberger698  3 года назад +2

      Oh I see the confusion. The subsets are dependent samples from the original distribution P, but they are independent bootstraps from the original data set.
      The bootstraps are even independent of D={x}, as D_1 tells you nothing about D_2 that you couldn’t already infer from D. Hope this helps.

    • @roniswar
      @roniswar 3 года назад +1

      @@kilianweinberger698 Yes, that's exactly was the confusion indeed!! Thanks for clearing it up Prof'!

  • @astutzorr3355
    @astutzorr3355 4 года назад +1

    Can someone help me with why the bagging datasets are not inependant? are we not drawing them randomly?
    what would be an example that is independent then?

    • @kilianweinberger698
      @kilianweinberger698  4 года назад +6

      Well, imagine in the extreme case your original training data set only has a single element {x}, drawn from some distribution P. Now you apply bagging on this data set of size one, and you re-sample m new data sets from it. But because it only contains one element, they are all identical {x},{x},{x},...,{x}. So clearly they are not independent. If I tell you what one data set is, you know all the others. Hope this helps.

    • @srisaisubramanyamdavanam9912
      @srisaisubramanyamdavanam9912 2 месяца назад

      The concept is bootstrapping.u are creating random subsets from same data.samples may be repeating in any of the trees..

  • @ugurkap
    @ugurkap 5 лет назад +2

    I think running time of the naive implementation may be O(dn^2) instead of O(dkn^2). We have to split every dimension n times, so nd is a must. But I don't have to do the rest of the operations k times, I believe.
    Using a loop such as:
    left_pt = 0
    right_pt = 0
    left_class_1 = 0
    left_class_2 = 0
    right_class_1 = 0
    right_class_2 = 0
    for i = 1 to n:
    class_1 = 0
    class_2 = 0
    if y_i == class_1:
    class_1 += 1
    if y_i == class_2:
    class_2 += 1
    if x_i[d] < some_threshold:
    left_pt += 1
    left_class_1 += class_1
    left_class_2 += class_2
    if x_i[d] >= some_threshold:
    right_pt += 1
    right_class_1 += class_1
    right_class_2 += class_2
    Effectively, this will calculate how many data points we have to the left and right. Additionally, it will calculate p_k as well since p_left_class_1 = left_class_1 / left_pt.
    Is there any reason we should also go over the labels with another loop?

    • @aayushchhabra7465
      @aayushchhabra7465 5 лет назад +2

      I was thinking the same. We can make one pass through the entire dataset and keep a running count on different variables for left, right, and one for each class on left and right each.
      As soon as we encounter any point, we either increase the left or right counter, and simultaneously update the corresponding class counter in the left or right variables. That I believe will remove k from the algorithmic complexity as you suggested.

    • @srisaisubramanyamdavanam9912
      @srisaisubramanyamdavanam9912 2 месяца назад

      I think he got mixed up in dealing continuous and discrete features.yes u are right for continuous features is O(n^2d)

  • @rajatpatel4657
    @rajatpatel4657 4 года назад +6

    why the complexity of 1st approach is O(d*k*N^2)? we can compute all Pk first, and then multiply with H(SL) (for all K class labels ). Therefore, we can do this in O(d*N*(N+K)). isn't it?

    • @StarzzLAB
      @StarzzLAB 4 года назад

      I am thinking the same.

    • @husamalsayed8036
      @husamalsayed8036 3 года назад

      exactly , that what i was thinking about .

  • @TheCuriousCurator-Hindi
    @TheCuriousCurator-Hindi 23 дня назад

    I have been a [senior] applied scientist at places like Amazon & Microsoft. Have been studying ML for 12 years and 7+ years industry experience at big tech. What are the chances I get into cornell PhD program? In terms of fluency I can read 80-90% PML : an introduction by Kevin Murphy in 3-4 weeks.

  • @pmdl
    @pmdl 4 года назад

    @prof, great lecture. Especially liked how you started from kd trees as a data structure and arrived at decision trees. Btw how do you show Di (the subsamples) is coming from the same distribution as D?
    Also, in random forests, k = sqrt(d) - what would be a justification / intuition for that?

    • @kilianweinberger698
      @kilianweinberger698  4 года назад +2

      Yes, that should be in the notes. The D_i are sampled from the same distribution as D, but they are *not* independent.

    • @pmdl
      @pmdl 4 года назад

      @@kilianweinberger698 yes, found that in the notes later. What about the intuition for the sqrt(n) for random forests feature subsample though? Not sure if I missed that too!

    • @pmdl
      @pmdl 4 года назад

      Btw prof, the whole video series is awesome! Have to finish up kernels and Gaussian process ,but the rest have been wonderful!

  • @71sephiroth
    @71sephiroth 4 года назад

    Is it a property of an entropy that it's additive, because I can't seem to figure out why intuitievly we use the weighted sum of H(Sr) and H(Sl)?

    • @kilianweinberger698
      @kilianweinberger698  4 года назад +2

      If you didn’t use the weighted sum, and just the sum H(Sr)+H(Sl) you would end up splitting off a single element. That would make one side completely pure (entropy 0), but wouldn’t help you much to improve your classifier. Hope this helps.

    • @71sephiroth
      @71sephiroth 4 года назад

      @@kilianweinberger698 Ahh, thanks! But what is wrong with the 'regular average' aka weights = 1/2?
      @edit
      For example if we have target variable = [1, 0, 1, 0, 0, 0, 0, 1] then 1:(n-1) split will 'always' have maximal information gain compared to 2:(n-2), 3:(n-3) ... etc., be it that we use weighted average, 'normal' average or not using average at all in order to calculate information gain. I think in real life that it's not very likely we will have such feature that splits the target in 1:(n-1) fashion, but still I'm curious.

    • @haizhoushi8287
      @haizhoushi8287 4 года назад

      Both Gini and Entropy are concave functions, the form of weighted sum matches the form of the property of the concave function.

    • @71sephiroth
      @71sephiroth 4 года назад

      @@haizhoushi8287 Can you please elaborate more!

    • @ousam2010
      @ousam2010 4 года назад

      @@71sephiroth Actually, I think you are right that it is a weighted sum, i.e., weighted by the size of Sr and Sl. Let's use c(X) to indicate the class (c1,c2,...cK) a sample X belong to. So the weighted sum actually approximate H(Sl)Pr(Sl)+H(Sr)Pr(Sr) = H(c(X)|X in Sl) Pr(X in Sl) + H(c(X)|X in Sr) Pr(X in Sr) = H(c(X)|s(X)), where s(X) is the index indicating if X is in the left node or right node. For example, s(X)=1 if X in Sl and s(X)=0 if X in Sr.
      Since the conditional entropy H(c(X)|s(X)) can be viewed as the amount of remaining information c(X) given that s(X) is known. It is very natural that we would like to minimize H(c(X)|s(X)) to get the optimum split.

  • @alimamdouh1764
    @alimamdouh1764 3 года назад +1

    why do we solve the problem of variance by just not insisting on every point being correctly classified ? why dont we just keep branching until certain threshold then just do a majority vote ?

    • @kilianweinberger698
      @kilianweinberger698  3 года назад +2

      Indeed, you can do that. If you enforce a max leaf-node size (e.g. stop splitting if

    • @alimamdouh1764
      @alimamdouh1764 3 года назад

      @@kilianweinberger698 it is thr first time I hear about "fiddling parameters". could you please recommend me some resource in which I can read about it.

    • @srisaisubramanyamdavanam9912
      @srisaisubramanyamdavanam9912 2 месяца назад

      @@alimamdouh1764 Hey he is trying to tell that we need less parameters to deal for bagging.

  • @vocabularybytesbypriyankgo1558
    @vocabularybytesbypriyankgo1558 3 месяца назад

    Thanks a lot Sir !!

  • @shrishtrivedi2652
    @shrishtrivedi2652 3 года назад

    Regression 21:40

  • @sandeshhegde9143
    @sandeshhegde9143 5 лет назад

    Bagging Starts from: ruclips.net/video/0LB1cy2sCXc/видео.html ( 33.52)

  • @marcogelsomini7655
    @marcogelsomini7655 3 года назад

    43:38 ahaha you are the number one

  • @umangsharma6908
    @umangsharma6908 3 года назад +1

    The person who gave the thumbs down (dislike) to this video is currently jobless and wasting his money in data science diplomas/courses and consuming bullshit content :P

  • @yannickpezeu3419
    @yannickpezeu3419 3 года назад

    Thanks