Lecture 7 | Acceleration, Regularization, and Normalization

  • Published: Oct 20, 2024

Comments • 4

  • @rashidkp123 • 4 years ago • +1

    Fantastic lecture; I found the last 5 minutes particularly mind-blowing. Thanks, Professor.

  • @ronmedina429 • 4 years ago

    Prof Bhiksha, in the slides it is said that mini-batch gradient descent results in a degradation of \sqrt{B}. I'm not sure this is correct. I think it should be calculated as O(1/\sqrt{Bk}) = O(1/\sqrt{B(t/B)}) = O(1/\sqrt{t}), where B is the batch size, k is the number of mini-batch updates, and t = Bk is the number of per-sample iterations. That is the same O(1/\sqrt{t}) decay rate per iteration as SGD.
    This can also be seen by counting the total updates needed for a given epsilon: mini-batch needs O(1/(B epsilon^2)) updates to reach epsilon-convergence, but each update costs B per-sample iterations, so the total is O(1/epsilon^2) iterations to reach epsilon-convergence, the same as SGD. Please correct me if there is a misunderstanding on my part. Thank you.
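
A sketch of the counting argument above in LaTeX, using the commenter's notation (B is the batch size, k the number of mini-batch updates, and t = Bk the total number of per-sample iterations):

```latex
% Error after k mini-batch updates of size B, rewritten in terms of t = Bk:
\[
  O\!\left(\frac{1}{\sqrt{Bk}}\right)
  \;=\;
  O\!\left(\frac{1}{\sqrt{B\,(t/B)}}\right)
  \;=\;
  O\!\left(\frac{1}{\sqrt{t}}\right),
\]
% i.e. the same O(1/sqrt(t)) decay per per-sample iteration as plain SGD, so no
% extra sqrt(B) degradation once the cost of each update is accounted for.
```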

  • @RedShipsofSpainAgain • 5 years ago

    14:01 We don't know, a priori, the *true* curve (blue line). How is this particular point chosen here (the yellow bar)? One would think intuitively that not all sample points are equally worthwhile; the points that have the most error (distance between predicted and actual curve) could also improve the curve the most if they're adjusted. In contrast, the points that are already pretty darn close to their true value aren't "worth" as much to correct/improve.
    So is there a way to know which of the sample points are going to improve the curve most? Because it would make sense to focus on those points first. Maybe this is like Bayesian optimization where you choose an x point that is furthest from two consecutive known points because that chosen point will give the most information gain.

    • @carnegiemellonuniversityde4339 • 5 years ago • +2

      You may want to read more into the idea of statistical leverage and what is considered a high-leverage point. This tells you, of the points sampled, which have the most effect on the outcome of the model. As a follow-up, you may then want to understand, first in the case of linear regression, how the projection matrix relates the leverage to the model errors. This may help give perspective.
      That leads to the following claim: a neural network with linear activations has its neurons effectively solving for this projection matrix between dimensions. Though this depends on your loss function, convincing yourself of this will help build your intuition of what is happening.
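
A minimal numpy sketch of the leverage idea in the reply above, assuming an ordinary-least-squares setting (the random data, variable names, and the 2(d+1)/n rule-of-thumb threshold are illustrative assumptions, not from the lecture). The hat (projection) matrix H = X (X^T X)^{-1} X^T maps the observed targets to the fitted values, and its diagonal entries are the per-sample leverages:

```python
import numpy as np

# Illustrative data: n samples, d features, plus an intercept column.
rng = np.random.default_rng(0)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])

# Hat / projection matrix for ordinary least squares: H = X (X^T X)^{-1} X^T.
# It maps observed targets y to fitted values y_hat = H @ y.
H = X @ np.linalg.inv(X.T @ X) @ X.T

# The leverage of each sample is the corresponding diagonal entry of H.
leverage = np.diag(H)

# Rule-of-thumb (an assumption here, not from the lecture): flag points whose
# leverage exceeds 2 * (d + 1) / n as high-leverage.
threshold = 2 * (d + 1) / n
high_leverage_idx = np.where(leverage > threshold)[0]

print("mean leverage:", leverage.mean())          # equals (d + 1) / n
print("high-leverage points:", high_leverage_idx)
```

Points with leverage well above the mean (d + 1)/n are the ones whose target values pull the fitted curve most strongly toward themselves, which is one way to make "which sample points matter most" precise in the linear-regression case.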