as a student from cs at Tsinghua, I would say this is the best course in ML you can find out there
Thanks! Please send my warmest regards to Prof. Gao Huang.
@@kilianweinberger698 will do when this virus ends !
I'm incredibly grateful for your intuitive explanation of SVM, that really helped me to understand this topic.
Thank you, Prof... after your videos, I started loving ML.
I really want to put my laptop away... But I'm watching Prof Kilian's awesome lectures... So can't help it!
Brilliant lecture.
Boy this is wonderful
log(cosh(x)) is such a clever idea: asymptotically linear and locally (near x=0) quadratic, a smooth version exactly analogous to the Huber idea of mixing the L1 and L2 norms. Worth checking out the Taylor expansion (which can be thought of as a microscope for functions, telling you what polynomial a function looks like close to a point, typically 0). You will get that log(cosh(x)) = 0.5*x^2 + O(x^4), i.e. quadratic near 0.
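Both regimes are easy to check numerically. A small sketch (the rewrite log(cosh(x)) = |x| + log(1 + e^{-2|x|}) - log(2) is only there to avoid overflow for large x; it is algebraically identical):

```python
import numpy as np

def log_cosh(x):
    # Numerically stable log(cosh(x)):
    # log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
    return np.abs(x) + np.log1p(np.exp(-2.0 * np.abs(x))) - np.log(2.0)

# Near 0 it behaves like the quadratic 0.5*x^2 ...
small = 0.01
print(log_cosh(small), 0.5 * small ** 2)

# ... and for large |x| it grows linearly, like |x| - log(2)
big = 20.0
print(log_cosh(big), big - np.log(2.0))
```

The O(x^4) error term means the quadratic approximation is accurate to about x^4/12 near the origin, which is why the two printed pairs agree so closely.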
Wow! CRISP!
I loved your visualization of L1 and L2 regularization. I had seen these before, but never really understood what they meant.
I have a question here: how would we optimize the objective function while using L1 regularization? I think gradient descent would not work well, since the function is not differentiable at some very key points.
Yes, good point. SGD gets a little tricky. If you use the full gradient (summed over all samples) you can use sub-gradient descent. As long as you make sure you reduce your step size, it should converge nicely.
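A concrete sketch of that recipe (the data, λ, and step-size schedule below are made-up illustrations, not from the lecture): sub-gradient descent on an L1-regularized least-squares objective, using sign(w) as a valid sub-gradient of the L1 norm and a decaying step size.

```python
import numpy as np

# Toy regression problem (hypothetical data for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

lam = 0.1          # L1 regularization strength
w = np.zeros(5)
for t in range(1, 5001):
    grad_sq = (2.0 / len(y)) * X.T @ (X @ w - y)      # gradient of the squared loss
    subgrad_l1 = lam * np.sign(w)                     # a sub-gradient of lam*||w||_1
    w -= (0.5 / np.sqrt(t)) * (grad_sq + subgrad_l1)  # decaying step size

print(np.round(w, 2))  # close to w_true, with the nonzero entries shrunk slightly
```

Note that plain sub-gradient descent rarely produces exactly-zero coordinates; proximal methods (soft-thresholding / ISTA) are the standard fix if exact sparsity matters.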
@killian weinberger in the first 20 min of the lecture you say the derivative of the squared loss gives the mean. But shouldn't it be the bias and variance? Or the intercept or the weights?
Hi, professor. I really like your way of explaining the ML concepts. I wish there were assignments/quizzes on the related topics, where we could try out these learning algorithms and get more hands-on experience. I checked the course page but couldn't find any assignments.
Past 4780 exams are here: www.dropbox.com/s/zfr5w5bxxvizmnq/Kilian past Exams.zip?dl=0
Past 4780 Homeworks are here: www.dropbox.com/s/tbxnjzk5w67u0sp/Homeworks.zip?dl=0
Unfortunately, I cannot hand out the programming assignments from the Cornell class. There is an online version of the class (with interactive programming assignments and all that stuff), but the university does charge tuition.
www.ecornell.com/certificates/technology/machine-learning/
@@kilianweinberger698 Is there a current version of link with exams? The one above unfortunately expired.
Thanks for these amazing lectures:)
Sir, in the Plots of Common Regression Loss Functions, the x-axis should be h(Xi) - Yi, but on the course page it's showing h(Xi) * Yi.
Where were you all these years?
Hi Professor. Thank you for sharing the video. I am now using Gaussian Process Regression in the physics field. One thing I noticed is that even though a specific loss function exists for GPR, many people use root-mean-squared error as the loss function. Is there any rule for choosing the loss function and regularization?
I'm starting to understand why we optimize wTw instead of just w. wTw is a scalar value, while w is a vector. I guess it is much easier for us to use a scalar value as a constraint? Also, it would form a bigger vector space to search for the optimal w if our constraint is on wTw.
It is tricky to optimize over a vector like w. Imagine w is two dimensional: which vector is more optimal, [1,2], [2,1], or [4,0]? When you optimize w’w you get a scalar, for which minimization and maximization are well defined.
@@kilianweinberger698 thanks for the detailed explanation!
Is using MAP estimation synonymous with regularizing?
No, not exactly, but in many settings the resulting parameter estimate is identical to what you would obtain with a specific regularizer (depending on the prior). The idea of enforcing a prior can be viewed as a form of regularization.
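The correspondence mentioned above can be made explicit. A sketch of the standard derivation, assuming a linear model with Gaussian noise (variance σ²) and a zero-mean Gaussian prior on w (variance τ²):

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \, p(w \mid D)
  = \arg\min_w \, \bigl[ -\log p(D \mid w) - \log p(w) \bigr]
  = \arg\min_w \, \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(w^\top x_i - y_i\bigr)^2
    + \frac{1}{2\tau^2} \, w^\top w + \mathrm{const}
```

This is exactly ridge regression with regularization strength λ = σ²/τ²; swapping the Gaussian prior for a Laplace prior gives L1 regularization the same way.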
Does this have anything to do with ERM?
Hello Prof, if the constraint is w1^2 + w2^2 ≤ B, is B the radius squared?
Yes, here B is the squared radius.
Why does the squared loss estimate the mean while the absolute loss estimates the median? I Googled this but found no clear answer.
You can derive it pretty easily if you let your classifier be a constant predictor. Let's call your prediction p.
What minimizes 1/n \sum_{i=1}^n (p - y_i)^2?
If you take the derivative, 2/n \sum_{i=1}^n (p - y_i), and set it to zero, you get p = 1/n \sum_{i=1}^n y_i, i.e. the optimum is when p is the mean of all y_i.
You can prove a similar result for the median with the absolute loss. Hope this helps.
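If you would rather see it numerically, a quick grid search over constant predictors (a sketch with made-up labels y) shows the same thing:

```python
import numpy as np

# Tiny grid-search check (hypothetical labels) that the squared loss is
# minimized by the mean and the absolute loss by the median.
y = np.array([1.0, 2.0, 3.0, 10.0, 14.0])
p_grid = np.linspace(0.0, 15.0, 15001)

sq_loss = np.array([np.mean((p - y) ** 2) for p in p_grid])
abs_loss = np.array([np.mean(np.abs(p - y)) for p in p_grid])

best_sq = p_grid[np.argmin(sq_loss)]    # ≈ np.mean(y)   = 6.0
best_abs = p_grid[np.argmin(abs_loss)]  # ≈ np.median(y) = 3.0
print(best_sq, best_abs)
```

Note the odd number of labels is deliberate: with an even count the absolute loss is flat between the two middle values, so any point in that interval is a minimizer.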
I need a little help. I am studying learning theory and need some good-quality material for developing intuition about it.
It would be very helpful if the professor or anyone else could point me to some resources to learn more.
I'd be happy if anyone can help. Thanks a lot in advance.
this series ruclips.net/channel/UCR4_akQ1HYMUcDszPQ6jh8Q/playlists might help
Never seen 0 dislikes on a 10k views video though
put your laptops away?
What the heck? That's like the first 10 minutes of my lecture in Theoretical Concepts of ML. I wish my lecture was that easy.
Congrats.