Maestro!!! This should be the standard on YouTube, very good lectures; other universities should learn from here.
My God... this professor rocks so much! He is very clear, excellent. He makes himself perfectly understood. His examples are simple and I understand them perfectly. He's a genius!
Professor Yaser always answers a question in a more fundamental way, instead of trying to solve a problem at face value.
He is extremely good at generalizations.
54:38 "Let's say I am in 3 dimensions." Hmm, yes, you've been in 3 dimensions for a while now. :D
Not just 3 dimensions, we actually live in 3 spatial and 1 temporal :p
That sounds like a joke Yaser would actually make. I can imagine it in his voice.
best machine learning course ever.
Logistic: 24:33
Most amazing lecture I've ever seen.
Excellent!! Thanks for uploading.
great lecture, extremely clear and understandable
He is the best !
"The weight of the person, not the weight of the input" Hahaha xD
At 58:50 he says that conjugate gradient takes into account the second-order terms - isn't that rather Newton's method? Conjugate gradient is an improvement, but through different means, i.e. taking into account earlier directions of descent, whereas Newton's method makes explicit use of the second-order Taylor expansion.
Should try the online course: Learning from Data
@20:50, after he explains why you should not learn from the data before choosing a model.
My question is: what if someone has already done this to the data and uses only z(x_1^2 + x_2^2 - 0.6), and this is the data they give you?
How could you possibly know the data was originally something else... you assume the data is the data that was collected. How would you charge the correct VC dimension?
Superb!
thanks a lot for such a great lecture
1st question explanation awesome!
Looks nice in 2021.
I can understand the use of the soft threshold, but saying this is the probability without explanation is not acceptable, as the Professor said that there are other soft threshold functions that can be used. Depending on the function used, there must be a correction to get the real probability.
The real probability is what we try to best approximate/learn within the hypothesis set. The sigmoid hypothesis set lets you adjust the range over which the probability goes from 0 to 1 (the magnitude of w: a bigger w gives more of a hard transition, closer to the perceptron), the direction of the transition (the direction of w, orthogonal to the hyperplane where the probability is 0.5), and its offset from the origin (w_0).
The machine will set the weights after enough iterations so that it matches the actual probability, regardless of which soft threshold we use. Different functions will have different weights for the same data, and hence give approximately the same probability.
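If it helps, here is a minimal sketch (assuming Python with numpy; the scale values are made up, not from the lecture) of how the magnitude of the weights controls how hard or soft the sigmoid transition is:

    import numpy as np

    def sigmoid(s):
        # logistic soft threshold: maps the signal s = w.x into (0, 1)
        return 1.0 / (1.0 + np.exp(-s))

    x = np.linspace(-3, 3, 7)
    for scale in (0.5, 2.0, 10.0):   # larger |w| -> sharper, more perceptron-like transition
        print(scale, np.round(sigmoid(scale * x), 3))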
Can someone please explain to me the use of the Taylor series in gradient descent? I mean, I know you use it to approximate a function near a point, but what did he do here? How can you approximate delta E_in if it's built from 2 E_in's? What is the input here, and around what value does he try to approximate?
He uses the first-order Taylor series to derive gradient descent.
f(x) = f(w0) + (x - w0)^T ∇f(w0) + O(||x - w0||^2)
Replace f by E_in and x by w1, and you recover the gradient descent step for E_in.
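To make that concrete, here is a minimal gradient descent sketch (assuming Python with numpy; this E_in and its gradient are a toy quadratic, not the lecture's error measure):

    import numpy as np

    target = np.array([1.0, -2.0])

    def E_in(w):
        # toy in-sample error surface, minimized at w = target (illustrative only)
        return np.sum((w - target) ** 2)

    def grad_E_in(w):
        # gradient of the toy E_in; the first-order Taylor term says that
        # moving along -gradient gives the largest local decrease
        return 2.0 * (w - target)

    w = np.zeros(2)
    eta = 0.1   # learning rate
    for _ in range(100):
        w = w - eta * grad_E_in(w)
    print(w, E_in(w))   # w ends up close to target, E_in close to 0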
Lol really like how he explained the steepest descent.
I don't get it
In logistic regression, E_in(w0) is a scalar but is assumed to be a vector.
What am I missing?
Yes, it is a scalar. Where do we use it as a vector?
I thought you could only take the gradient of a vector?
The gradient is a vector. But you usually take it of a function of several variables. Still, the function itself is a scalar at every point.
Oh I just realized my error
The gradient is the vector of partial derivatives with respect to each weight, right?
Yep, exactly.
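A quick numerical sanity check (assuming Python with numpy; f here is just a made-up stand-in for E_in) that the gradient is the vector of partial derivatives, one per weight:

    import numpy as np

    def f(w):
        # example scalar function of a weight vector
        return w[0] ** 2 + 3.0 * w[0] * w[1]

    w = np.array([1.0, 2.0])
    analytic = np.array([2 * w[0] + 3 * w[1], 3 * w[0]])   # [df/dw0, df/dw1] = [8, 3]

    eps = 1e-6
    numeric = np.array([
        (f(w + np.array([eps, 0.0])) - f(w)) / eps,   # partial derivative w.r.t. w0
        (f(w + np.array([0.0, eps])) - f(w)) / eps,   # partial derivative w.r.t. w1
    ])
    print(analytic, numeric)   # both approximately [8, 3]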
What is the difference between E(in) and ||E(in)||? What do those lines stand for?
The quantity between the lines is a vector (in the lecture it is the gradient of E(in), say (3, 4)), and ||·|| is the norm (magnitude) of that vector, most commonly the L2 norm, which here equals square_root(3^2 + 4^2) = square_root(25) = 5.
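For example (assuming Python with numpy):

    import numpy as np

    g = np.array([3.0, 4.0])
    print(np.linalg.norm(g))        # 5.0  (L2 norm)
    print(np.sqrt(np.sum(g ** 2)))  # 5.0, the same computation written out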
Tip: Watch all these lectures at 1.5x.