Machine Learning Lecture 16 "Empirical Risk Minimization" -Cornell CS4780 SP17

  • Published: 1 Aug 2024
  • Lecture Notes:
    www.cs.cornell.edu/courses/cs4...

Comments • 37

  • @matthieulin335
    @matthieulin335 4 years ago +32

    As a CS student at Tsinghua, I would say this is the best ML course you can find out there.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +14

      Thanks! Please send my warmest regards to Prof. Gao Huang.

    • @matthieulin335
      @matthieulin335 4 years ago

      @@kilianweinberger698 Will do when this virus ends!

  • @user-me2bw6ir2i
    @user-me2bw6ir2i 1 year ago +1

    I'm incredibly grateful for your intuitive explanation of SVM; it really helped me understand the topic.

  • @rajeshs2840
    @rajeshs2840 4 years ago +2

    Thank you Prof ... after your videos, I started loving ML.

  • @raviraja2691
    @raviraja2691 3 years ago +3

    I really want to put my laptop away... But I'm watching Prof Kilian's awesome lectures... So can't help it!

  • @sansin-dev
    @sansin-dev 4 years ago +3

    Brilliant lecture.

  • @in100seconds5
    @in100seconds5 4 years ago +3

    Boy this is wonderful

  • @StevenSarasin
    @StevenSarasin 1 year ago +1

    log(cosh(x)) is such a clever idea: asymptotically linear and locally (near x=0) quadratic, a smooth version exactly analogous to the Huber idea of mixing the L1 and L2 norms. Worth checking out the Taylor expansion (which can be thought of as a microscope for functions, telling you what polynomial the function looks like close to a point, typically 0). You will get that log(cosh(x)) = 0.5*x^2 + O(x^4), i.e. quadratic near 0.
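
    A quick numerical check of that expansion (my own sketch in NumPy, not from the lecture; the sample points are arbitrary):

        import numpy as np

        def log_cosh(x):
            # Numerically stable log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
            return np.abs(x) + np.log1p(np.exp(-2.0 * np.abs(x))) - np.log(2.0)

        x = np.array([0.01, 0.1, 1.0, 10.0])
        print(log_cosh(x))              # exact loss values
        print(0.5 * x**2)               # quadratic approximation, accurate near 0
        print(np.abs(x) - np.log(2.0))  # linear approximation, accurate for large |x|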

  • @8943vivek
    @8943vivek 3 years ago

    Wow! CRISP!

  • @jachawkvr
    @jachawkvr 4 years ago +3

    I loved your visualization for L1 and L2 regularization. I had seen these before, but never really understood what they meant.
    I have a question here: how would we optimize the objective function while using L1 regularization? I think gradient descent would not work well, since the function is not differentiable at some very key points.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +6

      Yes, good point. SGD gets a little tricky. If you use the full gradient (summed over all samples) you can use sub-gradient descent. As long as you make sure you reduce your step size, it should converge nicely.
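
      A minimal sketch of that recipe for L1-regularized squared loss (my own illustration, assuming NumPy; the function name and step-size schedule are placeholders, not from the lecture):

          import numpy as np

          def subgradient_descent_l1(X, y, lam, steps=1000, eta0=0.1):
              # Full-batch sub-gradient descent on (1/n) * ||Xw - y||^2 + lam * ||w||_1
              n, d = X.shape
              w = np.zeros(d)
              for t in range(1, steps + 1):
                  grad_sq = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the squared-loss term
                  sub_l1 = np.sign(w)                      # a valid sub-gradient of ||w||_1 (0 where w_j = 0)
                  w -= (eta0 / np.sqrt(t)) * (grad_sq + lam * sub_l1)  # shrinking step size
              return w

      The step size decaying like 1/sqrt(t) is the "reduce your step size" part; with a constant step the iterates would only hover around the optimum rather than converge to it.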

  • @JoaoVitorBRgomes
    @JoaoVitorBRgomes 3 years ago

    @kilian weinberger In the first 20 min of the lecture you say the derivative of squared loss is the mean. But shouldn't it be the bias and variance? Or the intercept or the weights?

  • @theflippedbit
    @theflippedbit 4 years ago +1

    Hi, Professor. I really like your way of explaining the ML concepts. I wish there were assignments/quizzes on the related topics, where we could try out these learning algos and get more hands-on experience. I checked the course page but couldn't find any assignments.

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +9

      Past 4780 exams are here: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
      Past 4780 Homeworks are here: www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0
      Unfortunately, I cannot hand out the programming assignments from the Cornell class. There is an online version of the class (with interactive programming assignments and all that stuff), but the university does charge tuition.
      www.ecornell.com/certificates/technology/machine-learning/

    • @danielrudnicki88
      @danielrudnicki88 3 years ago +1

      @@kilianweinberger698 Is there a current version of the link with the exams? The one above has unfortunately expired.
      Thanks for these amazing lectures :)

  • @vishnuvardhanchakka1308
    @vishnuvardhanchakka1308 3 years ago

    Sir, in the Plots of Common Regression Loss Functions, the x-axis should be h(Xi) - Yi, but on the course page it's showing h(Xi) * Yi.

  • @omarjaafor6646
    @omarjaafor6646 2 years ago

    Where were you all these years

  • @Theophila-FlyMoutain
    @Theophila-FlyMoutain 5 months ago

    Hi Professor. Thank you for sharing the video. I am now using Gaussian Process Regression in the physics field. One thing I noticed is that even though there exist specific loss functions for GPR, many people use root-mean-squared error as the loss function. Is there any rule for choosing the loss function and regularization?

  • @sekfook97
    @sekfook97 3 years ago

    I'm starting to understand why we optimize wTw instead of just w: wTw would be a scalar value, while w is a vector. I guess it is much easier for us to use a scalar value as a constraint? Also, it would form a bigger vector space for us to search for the optimal w if our constraint is wTw.

    • @kilianweinberger698
      @kilianweinberger698  3 years ago +1

      It is tricky to optimize over a vector like w. Imagine w is two-dimensional: which vector is more optimal, [1,2], [2,1], or [4,0]? When you optimize w’w you get a scalar, for which minimization and maximization are well defined.
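
      For reference, that scalar is exactly what appears in the textbook way of writing the regularized objective, either as a constraint or as a penalty (my paraphrase of the standard formulation, with \lambda depending on the budget B):

          \min_{\mathbf{w}} \sum_{i=1}^{n} \ell\big(h_{\mathbf{w}}(\mathbf{x}_i), y_i\big) \ \text{ s.t. } \ \mathbf{w}^{\top}\mathbf{w} \le B
          \qquad \text{vs.} \qquad
          \min_{\mathbf{w}} \sum_{i=1}^{n} \ell\big(h_{\mathbf{w}}(\mathbf{x}_i), y_i\big) + \lambda\, \mathbf{w}^{\top}\mathbf{w}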

    • @sekfook97
      @sekfook97 3 years ago

      @@kilianweinberger698 Thanks for the detailed explanation!

  • @hdang1997
    @hdang1997 4 years ago +2

    Is using MAP estimation synonymous with regularizing?

    • @kilianweinberger698
      @kilianweinberger698  4 years ago +5

      No, not exactly, but in many settings the resulting parameter estimate is identical to what you would obtain with a specific regularizer (depending on the prior). The idea of enforcing a prior can be viewed as a form of regularization.
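
      As a concrete instance of that correspondence (the standard Gaussian case, added here for illustration rather than quoted from the reply): with y_i = \mathbf{w}^{\top}\mathbf{x}_i plus N(0, \sigma^2) noise and a prior \mathbf{w} \sim N(0, \tau^2 I), the MAP estimate is

          \hat{\mathbf{w}}_{\mathrm{MAP}}
            = \arg\max_{\mathbf{w}} \Big[ \log P(\mathbf{y} \mid X, \mathbf{w}) + \log P(\mathbf{w}) \Big]
            = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \big(\mathbf{w}^{\top}\mathbf{x}_i - y_i\big)^2 + \frac{\sigma^2}{\tau^2}\, \mathbf{w}^{\top}\mathbf{w},

      i.e. squared loss with an L2 regularizer whose strength \lambda = \sigma^2 / \tau^2 is set by the prior.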

  • @bnglr
    @bnglr 4 years ago

    Does this have anything to do with ERM?

  • @aloysiusgunawan7709
    @aloysiusgunawan7709 2 years ago

    Hello Prof, if the constraint is w1^2 + w2^2

  • @smsubham342
    @smsubham342 1 year ago

    Why does squared loss estimate the mean and absolute error estimate the median? I googled this but found no clear answer.

    • @kilianweinberger698
      @kilianweinberger698  1 year ago

      You can derive it pretty easily if you let your classifier be a constant predictor. Let's call your prediction p.
      What minimizes 1/n \sum_{i=1}^n (p - y_i)^2 ?
      If you take the derivative and equate it to zero, you will see that the optimum is when p is the mean of all y_i.
      You can prove a similar result for the median if it is the absolute loss. Hope this helps.
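
      Writing out that derivative step (my own expansion of the argument above):

          \frac{\partial}{\partial p}\, \frac{1}{n}\sum_{i=1}^{n} (p - y_i)^2
            = \frac{2}{n}\sum_{i=1}^{n} (p - y_i) = 0
            \quad\Longrightarrow\quad
            p = \frac{1}{n}\sum_{i=1}^{n} y_i .

      For the absolute loss \sum_{i} |p - y_i|, a sub-gradient is \sum_{i} \mathrm{sign}(p - y_i), which vanishes when as many y_i lie above p as below it, i.e. when p is the median.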

  • @KulvinderSingh-pm7cr
    @KulvinderSingh-pm7cr 5 years ago +1

    I need a little help: I am studying learning theory and need some good-quality material for developing intuition about it.
    It would be greatly helpful if the professor or anyone could point me to some resource to learn more.
    I'd be happy if anyone can help. Thanks a lot in advance.

    • @kokonanahji9062
      @kokonanahji9062 5 years ago +2

      This series ruclips.net/channel/UCR4_akQ1HYMUcDszPQ6jh8Qplaylists might help.

  • @rodas4yt137
    @rodas4yt137 4 years ago +2

    Never seen 0 dislikes on a 10k views video though

  • @xenonmob
    @xenonmob 3 years ago +1

    put your laptops away?

  • @kareemjeiroudi1964
    @kareemjeiroudi1964 5 years ago +3

    What the heck? That's like the first 10 minutes of my lecture in Theoretical Concepts of ML. I wish my lecture was that easy.

    • @ugurkap
      @ugurkap 5 years ago +4

      Congrats.