Beyond neural scaling laws - Paper Explained

  • Published: 12 Dec 2024

Comments • 49

  • @amenezes
    @amenezes 2 years ago +5

    Great summary of the paper, thank you!
    I dug a bit deeper into it, and I think the explanation of the theoretical setup in the video does not fully match the one in the paper.
    What I got from the video:
    1. We have a labeled (infinite) dataset
    2. The teacher perceptron learns to label the training data
    3. The student also learns on the training data but only for a few epochs
    4. The margin is the difference between the point's distance to the teacher boundary and its distance to the student boundary
    What I got from the paper:
    1. We get (infinite) data points from a normal distribution
    2. We initialize the teacher perceptron with a random weight vector and use it to label the data (i.e., the teacher is only used to generate synthetic labels)
    3. The student learns from the labeled data
    4. The margin is the distance from the point to the student boundary (the teacher is not involved here)
    The results in Fig.1 assume the student is perfectly aligned with the teacher (i.e. the margin perfectly reflects the distance to the real class boundary), while in Fig.2 the authors show the effect of having a misaligned student.
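    For concreteness, here is a toy numpy sketch of the setup as I read it (my own code, not from the paper; the dimension, sample size, and number of passes are arbitrary):
    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 100, 10_000  # input dimension, (finite stand-in for infinite) sample size

    # 1. Data points from a normal distribution
    X = rng.standard_normal((n, d))

    # 2. A teacher perceptron with a random weight vector generates the labels
    w_teacher = rng.standard_normal(d)
    y = np.sign(X @ w_teacher)

    # 3. The student perceptron learns from the labeled data
    w_student = np.zeros(d)
    for _ in range(5):  # a few passes over the data
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w_student) <= 0:  # misclassified: perceptron update
                w_student += y_i * x_i

    # 4. The margin is the distance of a point to the STUDENT boundary only
    margins = np.abs(X @ w_student) / np.linalg.norm(w_student)
    ```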
    Let me know your thoughts on this :)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Thanks for the comment! The way you understood it is the way I understood it at first too, but on second thought, it made no sense.
      Thanks to your comment, I am on my third iteration, and your explanation makes sense again, so let's discuss a bit: you mean that we do not need the teacher model for anything other than labelling the data. Then why bother using it to generate the labels if we could just assume some labels?
      Also, how would they otherwise estimate the angle Theta between the probe student and the teacher T? (paper page 5, top)

    • @amenezes
      @amenezes 2 years ago +3

      @@AICoffeeBreak Thanks for the questions; they also helped me clarify my thoughts. The point where I say the teacher is ONLY used to generate the labels is indeed incorrect. I meant to emphasize that the teacher is not used to compute the margin, but it is actually relevant for the rest of the study.
      To elaborate on the questions, I will explain my understanding of the teacher-student perceptron setup for studying learning mechanics in general, regardless of the phenomenon being studied (data pruning in this case).
      In general, in a machine learning task we have
      1. a set of observations from the "world", which is governed by some unknown real model of the "world"
      2. a model with learnable parameters, which we assume is able to approximate the real model of the "world"
      3. the learning process, where we find the parameters that best fit the observations
      This theoretical setup allows us to isolate the learning process, since
      1. the observations are taken from a "world" which is governed by a known model: the teacher perceptron
      2. the assumption that our model (the student) is able to approximate the real model of the "world" perfectly holds, since they are both perceptrons (we just need to find the right parameters)
      Combining these points with the limit of infinite data and infinite parameters, we get a perfect scenario where we can study the learning process without the influence of the limitations that exist in real scenarios. And since we know the real model that governs our synthetic world, we can also quantify the actual deviation between the real and the learned models (which is different from the error on the observations) and use it when studying learning mechanics.
      I guess this would explain the points you've raised. Disclaimer: I didn't go into the statistical mechanics papers in the references; this is just my interpretation of this paper.
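      To make the "quantify the deviation" point concrete anyway, continuing the toy sketch from my first comment (my own code; the theta/pi disagreement formula is the standard perceptron result for rotationally symmetric inputs, not something taken from the paper):
      ```python
      import numpy as np

      def teacher_student_angle(w_teacher, w_student):
          """Angle between the known 'world' model and the learned one."""
          cos = w_teacher @ w_student / (
              np.linalg.norm(w_teacher) * np.linalg.norm(w_student))
          return np.arccos(np.clip(cos, -1.0, 1.0))

      # For Gaussian inputs, the probability that teacher and student disagree
      # on a fresh point is theta / pi. This deviation is measurable only
      # because the true model of the synthetic "world" is known.
      theta = teacher_student_angle(w_teacher, w_student)
      generalization_error = theta / np.pi
      ```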

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      @@amenezes Thanks for the clarification. I think you are right. This also makes the proposed method (the one with k-means clustering of representations taken from a pretrained model) more theoretically motivated. The discrepancy between the theory, as I had understood it, and the proposed method was really stark.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      I've pinned your comment as an erratum to the video explanation.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      I am still a bit confused (or the paper is extremely confusing). I quote from the paper's introduction (page 2, numbered point 1): "where examples are pruned based on their teacher margin", so the distance to the teacher boundary is relevant after all.

  • @Neptutron
    @Neptutron 2 years ago +8

    So awesome that you have NVIDIA as a sponsor xD

  • @frommarkham424
    @frommarkham424 3 months ago +1

    3:22 thanks for the knowledge🙏we gonna make it out the data center with this tutorial🗣🗣🗣🗣

  • @WilliamDye-willdye
    @WilliamDye-willdye 2 years ago +4

    The comparison of pruning strategies was very helpful to me. Thank you for summarizing the paper, and best wishes at the conference.

  • @lighterswang4507
    @lighterswang4507 1 year ago +2

    Very similar to the idea of active learning

  • @Erosis
    @Erosis 2 years ago +5

    These results feel intuitive with what I've felt in practice. The math is nuts, though. :)

  • @thipoktham5164
    @thipoktham5164 2 years ago +2

    I was going to read this paper, thanks for the nice explanation!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Great timing! ⏲ Glad it was helpful! :)

  • @mandarjoshi6814
    @mandarjoshi6814 2 years ago +2

    11:40 So in the experiment, the authors selected the top 80% most difficult/hard examples from the clusters and did not include the bottom 20% of easy examples during training, because the initial dataset (ImageNet) is fairly large. Is my understanding correct?
    Thanks for explaining.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Exactly! They do not discard much data: only 20%, which keeps the same performance as when discarding nothing.
      But imagine: when working with billions of examples, discarding 20% of the data is a considerable amount. :)
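      If it helps, here is a rough sketch of the metric (my own toy code, not the authors'; it assumes the embeddings from a pretrained model are already computed, and the cluster count is a made-up hyperparameter):
      ```python
      import numpy as np
      from sklearn.cluster import KMeans

      def keep_hard_examples(embeddings, keep_frac=0.8, n_clusters=100):
          # Cluster the pretrained representations with k-means.
          km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
          # Difficulty score: distance to the assigned centroid
          # (close to a prototype = "easy", far away = "hard").
          dists = np.linalg.norm(
              embeddings - km.cluster_centers_[km.labels_], axis=1)
          n_keep = int(keep_frac * len(embeddings))
          # Keep the hardest keep_frac, i.e. drop the easiest 20%.
          return np.argsort(dists)[-n_keep:]

      # e.g. on 10k hypothetical 512-dimensional embeddings:
      kept_idx = keep_hard_examples(np.random.randn(10_000, 512))
      ```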

    • @mandarjoshi6814
      @mandarjoshi6814 2 years ago +1

      @@AICoffeeBreak Thank you 🤗

  • @Self-Duality
    @Self-Duality 2 years ago +2

    Nice summary analysis 😊💭

  • @frommarkham424
    @frommarkham424 3 months ago +1

    4:39 mann the diminishing returns be hitting real hard today💀

  • @cipritom
    @cipritom 2 years ago

    Super nice explanation and reasoning! Thanks for the insight.

  • @TheNettforce
    @TheNettforce 2 years ago

    Thanks for the great introduction to this topic

  • @joecincotta5805
    @joecincotta5805 4 months ago

    Super interesting. I thought they were going to map the entropy of the dataset, which is kind of what they imply: easy vs. hard is equivalent to non-novel vs. novel data in the distribution of the data.

  • @dr.mikeybee
    @dr.mikeybee 2 years ago +1

    Nicely done!

  • @RfMac
    @RfMac 2 years ago +1

    Awesome video, love your explanations!

  • @vadrif-draco
    @vadrif-draco 2 years ago +1

    very exciting, thank you

  • @ScriptureFirst
    @ScriptureFirst 2 years ago +1

    Great content, very accessible. Thank you!

  • @frenchmarty7446
    @frenchmarty7446 2 years ago +3

    Could this be useful for data augmentation?
    For example: assuming I start with a dataset of a certain size and don't need to prune any examples, could/should I make more augmented copies of the more informative samples? Could I also test which kinds of augmentations are more or less useful?

  • @worldofai2924
    @worldofai2924 2 years ago

    Thank you for a great video!

  • @averma12
    @averma12 2 years ago

    How does this compare to fine-tuning the same model on a smaller dataset? How much data would be needed?

  • @flamboyanta4993
    @flamboyanta4993 2 years ago

    The screenshot of the mathematics made me chuckle... in horror. Thanks, Letitia, for an excellent video!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      🤣🤣🤣 Yeah, the math part is really impressive. 😏

  • @sonataarcfan9279
    @sonataarcfan9279 2 years ago +1

    How do you make those animations, like in the "Exponential scaling in theory" part? Which software do you use? I would really appreciate it if you could tell me :)

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      With PowerPoint. I draw with the drawing functionality. Then I select the drawing, go to the Animations tab and click on Replay.

  • @kailashj2145
    @kailashj2145 2 years ago +1

    hey, any update on the giveaway?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      Check your email. You should have received a notification whether you won or not. ☺️

  • @brandomiranda6703
    @brandomiranda6703 2 years ago

    Do you have a mailing list?

  • @joecincotta5805
    @joecincotta5805 4 months ago +1

    My new favourite video

  • @DerPylz
    @DerPylz 2 years ago +2

    📈

  • @kornellewychan
    @kornellewychan 2 years ago +1

    Great work, more like that!

  • @poketopa1234
    @poketopa1234 1 year ago

    Isn't this just hard sample mining?

  • @Quaquaquaqua
    @Quaquaquaqua 3 months ago

    Shouldn't you use density-based clustering?

  • @TheTimtimtimtam
    @TheTimtimtimtam 2 years ago +1

    First :)

  • @JorgetePanete
    @JorgetePanete 2 years ago +1

    0:10 " "*

  • @brandomiranda6703
    @brandomiranda6703 2 years ago +1

    Funny they prune the "easy" examples close to the prototypical centroids. Most few-shot learning methods, like fo-proto-maml, use prototypical examples as the key. Is this suggesting that doing that is wrong?
    Also, I would have intuitively expected the prototypical examples to summarize the data better and thus be the ones to keep. But they do the opposite. That seems bizarre.
    I think, at least as a sanity check that their theory really holds in practice and to truly challenge their hypothesis, they should've tested the reverse: remove the "hard" examples instead, and if the results still work out, I'd personally be very skeptical. It probably didn't occur to them to do this due to confirmation bias... it's happened to me! 😳🫣 But it's no excuse. As a reviewer I'd immediately reject it unless my experiment, or something equivalent, is done: a falsification experiment (sketched below).
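    The control arm is cheap to express. Given any per-example difficulty score (for instance, distance to the nearest cluster centroid), something like this hypothetical sketch would do:
    ```python
    import numpy as np

    def prune_both_ways(difficulty, keep_frac=0.8):
        """Same budget, opposite directions: train once on each subset
        and compare test accuracy to challenge the hypothesis."""
        order = np.argsort(difficulty)       # ascending: easy to hard
        n_keep = int(keep_frac * len(difficulty))
        keep_hard = order[-n_keep:]          # the paper's direction
        keep_easy = order[:n_keep]           # the reverse, control arm
        return keep_hard, keep_easy
    ```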

    • @huonglarne
      @huonglarne 1 year ago

      Thanks for the insight. I never would have realized that

    • @huonglarne
      @huonglarne 1 year ago

      I think maybe they want the model to generalize, even for "outliers" in the data.
      Or maybe, when the dataset is imbalanced and some classes are under-represented, pruning the easy samples may help the model not overfit.
