It is amazing how you can simplify a subject just by explaining the right things in the right order.
An excellent teacher. I now understand why he received the Feynman award. Thank you so much, Prof. Mostafa. I wish universities had many more teachers like you.
Love this professor! Calm and serious, yet exciting and funny.
Yaser Abu-Mostafa for president!
This lecture made so much sense it almost made the material seem obvious. After many other sources failed to sink in, I feel like I found an oasis in the desert.
It helped me to pause the video occasionally and attempt to work out what the next step would be. I measured my understanding at any given time by how well I predicted the next slide.
Enjoyed the explanation, amazing!
"Sorry, we denied credit because lambda is less than .5" :)
Very enjoyable lecture! I really liked the build up of the network from simple perceptrons and how the back-propagation algorithm is derived. Can't wait to learn about support vector machines.
Best lecture I found on youtube about neural networks. Very clear.
Thanks for making a YouTube channel; as a young teenager it is cool that I can watch these.
Genius! It's unbelievable how well he explains.
could not be more clear and descriptive, great lecture
A thorough examination of the topic; I doubt anyone can do better than Abu-Mostafa.
Extraordinary class.
Very good explanation of neural networks.
Amazing lecturer, I will look for more of his courses :)
It always impresses me how we can apply theory to anything. If you look at the results, though, we have failed horribly at almost everything for the last three thousand years.
He is indeed a really good lecturer; he explains things clearly and makes them easy to grasp.
Egyptian professors always have the art of teaching and clarifying hard topics.
Absolutely well done and definitely keep it up!!! 👍👍👍👍👍
The intro to SGD is excellent. Thank a lot
Thank you! Very helpful! Greetings from Guatemala!
This video helped me implement a backpropagation network that solves the XOR problem.
Thanks.
Can you please share the code?
@@vigneshshetty7228 github.com/hiraditya/neural-network
datascience.stackexchange.com/questions/73725/implementing-neural-network-using-caltech-course
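Not the code in the linked repo, but for anyone who wants something to start from: a minimal, self-contained sketch of a small tanh network trained with backpropagation on XOR, loosely following the lecture's delta notation. The hidden size, learning rate, epoch count and random seed are arbitrary illustrative choices.

```python
# Minimal XOR backprop sketch (illustrative only, not the linked repository's code)
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets in {-1, +1}, so tanh outputs match the target range.
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

H = 4                                        # hidden units (arbitrary choice)
W1 = rng.normal(scale=0.5, size=(3, H))      # input (2 + bias) -> hidden
W2 = rng.normal(scale=0.5, size=(H + 1, 1))  # hidden (+ bias) -> output

eta = 0.1
for epoch in range(10000):
    for x_n, y_n in zip(X, y):               # SGD: one example at a time
        # Forward pass: x^(l) = tanh(s^(l)), with a constant 1 as the bias input
        x0 = np.concatenate(([1.0], x_n))
        s1 = W1.T @ x0
        x1 = np.concatenate(([1.0], np.tanh(s1)))
        s2 = W2.T @ x1
        out = np.tanh(s2)[0]

        # Backward pass: delta = dE/ds, with per-example error E = (out - y_n)^2
        d2 = 2.0 * (out - y_n) * (1.0 - out ** 2)          # output delta
        d1 = (1.0 - np.tanh(s1) ** 2) * (W2[1:, 0] * d2)   # hidden deltas

        # Gradient step: dE/dw_ij = x_i * delta_j
        W2 -= eta * np.outer(x1, d2)
        W1 -= eta * np.outer(x0, d1)

# Check what the network learned (may need more epochs or another seed)
for x_n, y_n in zip(X, y):
    x1 = np.concatenate(([1.0], np.tanh(W1.T @ np.concatenate(([1.0], x_n)))))
    print(x_n, "->", round(float(np.tanh(W2.T @ x1)[0]), 3), "target", y_n)
```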
Very nice talk, feels like (almost) nothing is missing :), thank you for putting this online.
To be a bit more precise about the "(almost)": I think adding a slide with the complete partial-derivative chains for two consecutive weights, w^{(L)} and w^{(L-1)}, would help show why the recursive computation works (by exposing the common \delta^{(L)} term). This could even be extended to three levels, but if I'm not mistaken that would lead to a 7-term formula, which may be confusing.
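For anyone who wants that missing slide written out, this is roughly how the two chains look in the lecture's notation (a sketch; the exact index placement may differ slightly from the slides):

\[
\frac{\partial e}{\partial w_{ij}^{(L)}}
  = \frac{\partial e}{\partial s_j^{(L)}}\,
    \frac{\partial s_j^{(L)}}{\partial w_{ij}^{(L)}}
  = \delta_j^{(L)}\, x_i^{(L-1)},
\qquad
\frac{\partial e}{\partial w_{ki}^{(L-1)}}
  = \underbrace{\Bigl(\textstyle\sum_j \delta_j^{(L)}\, w_{ij}^{(L)}\Bigr)\,
      \theta'\!\bigl(s_i^{(L-1)}\bigr)}_{\displaystyle \delta_i^{(L-1)}}\;
    x_k^{(L-2)}.
\]

The second chain reuses the \delta_j^{(L)} already computed for the last layer, which is exactly why the backward recursion avoids recomputing the whole chain for every weight.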
Very good lecture about a very intriguing technique. Thoroughly enjoyed :)
I really like the lecture. It's so well explained
This teacher is very good.
Thank you, from Tunisia.
Thank you very much.
From Morocco, haha. It's a great course, really.
From Algeria, haha.
From Punjab LOL
So clear! Great professor!
Thank you, from Algeria.
Thank you very much.
What are the prerequisites for this course?
Really starts at 4:32 into the presentation
Brilliant lecturer!
@4:10 He says neural nets aren’t the model of choice these days and people might choose SVMs, for example. This is from 2012. When did NNs make the jump back to being a great choice?
I feel this is still true in industry and academia. NNs achieve SOTA and thus grab the headlines, but they are computationally expensive and are not always deployed in production. As far as NLP goes, NNs were revitalized in 2017 thanks to the attention mechanism and again in 2018 with masked language modeling (MLM). The former allowed large models to fit on a single GPU that, with CNNs, would have required a supercomputer. The latter provided access to petabytes of labeled information. Still, people use perceptrons and regexes because they're way faster and don't require a 16 GB GPU for inference.
"It's a funny array but its a legitimate array"
That was explained well enough for even me to understand...
Factor no.7 is important in your case, just like 42 is the ultimate answer to the ultimate question of life, universe, and everything!
For the recap of lecture 9, how come he says that gradient descent requires a 'twice differentiable' function? Isn't it only the first derivative?
+Shem Leong For a local minimum to exist, not only should the 1st derivative equal zero, but also the 2nd derivative should exist and be > 0.
@@charlesaydin2966 That statement is false. I can give you a counterexample right now where a local minimum exists while neither the 1st nor the 2nd derivative exists at that point, e.g., a "V" function. Besides, Googling the assumptions for gradient descent, all sources I found state that only the 1st derivative needs to exist. 2nd-order derivatives are only needed if you want to use more sophisticated approaches such as Newton's method, etc. See:
www.stat.cmu.edu/~ryantibs/convexopt-F15/scribes/05-grad-descent-scribed.pdf
or
www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote07.html
Can you provide a proof or a reference for your claim? Because it does not make any sense. Sure, if the 1st derivative is zero and the 2nd derivative exists and is > 0, you can conclude it's a minimum, but it's completely illogical to say that a local minimum does not exist unless the 2nd derivative exists.
Yes, I think he misspoke. He was probably thinking of a second-order method.
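To make the distinction concrete (my summary of the standard statements, not a quote from the lecture): plain gradient descent only uses first derivatives, while second derivatives appear in second-order methods such as Newton's:

\[
\text{gradient descent:}\quad w \leftarrow w - \eta\,\nabla E_{\text{in}}(w),
\qquad
\text{Newton's method:}\quad w \leftarrow w - \bigl[\nabla^2 E_{\text{in}}(w)\bigr]^{-1}\nabla E_{\text{in}}(w).
\]

And for the counterexample above: f(w) = |w| has its (global) minimum at w = 0 even though f'(0) does not exist, so differentiability is an assumption made for the algorithm, not a requirement for a minimum to exist.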
This guy knows his stuff
amazing
incredible prof
Nice & clear explanation.
Dear professor, why do you never take a sip of water?
His video explanation of SGD should be put in place of the current Wikipedia page.
Really great
These lectures are very, very good... but why are neural networks, like some other topics, not included in the book "Learning from Data"?
Thank you, Prof. Yaser, for your work. It is very useful.
You can download e-chapters that cover other topics (SVMs, neural networks, ...) at book.caltech.edu/bookforum/forumdisplay.php?f=149
Great!!!!
Is it mandatory to register ?
Finally I know why the lecturer skipped explaining this.
The clarity of his lectures is amazing.
But in this one, the left graph on the 5th slide is confusing. Taking a certain sample or a subset of samples to compute the direction for the next step does not change your position in w-space. Only the surface of E_in(w) is changed. That is hard to plot, but the visualization shown is just not helping at all.
+Koral Linski I think you actually do change your position in w-space right after picking a single training example. This is clearer in the pseudocode at 1:00:00, where you update the weights before picking the next example.
+Mohamed Ezz Yes, right. Of course you change the position in w-space in a learning step. But that is not what I meant.
In slide 5 he wants to illustrate the difference between SGD and GD over all training samples. The figure only shows that there are several local minima in the error function E_in(w) for ALL SAMPLES over w-space. Considering one learning step, the choice of a subset of samples alters the shape of E_in(w), but not your position in w-space. That's why you get a different gradient, a different direction, and finally overcome small suboptimal minima.
+Koral Linski Got you. Right, you see a different E(w) surface for each batch/example.
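A minimal sketch of the difference the thread is describing (the function names here are illustrative, not the lecture's pseudocode): batch GD averages the gradient over all N examples and then moves once in w-space, while SGD moves after every single example, so each step follows the gradient of a different one-example error surface e(w; x_n, y_n), even though w itself is updated in both cases.

```python
# Illustrative sketch of batch GD vs. SGD; grad_e, batch_gd_step and sgd_epoch
# are hypothetical names, and the linear-model gradient is just an example.
import numpy as np

def grad_e(w, x_n, y_n):
    # Per-example gradient of the squared error (w.x_n - y_n)^2
    return 2.0 * (w @ x_n - y_n) * x_n

def batch_gd_step(w, X, y, eta=0.1):
    # Batch GD: average the gradient over ALL examples, then move once in w-space.
    g = np.mean([grad_e(w, x_n, y_n) for x_n, y_n in zip(X, y)], axis=0)
    return w - eta * g

def sgd_epoch(w, X, y, eta=0.1, rng=None):
    # SGD: pick one example at a time and update w immediately, so each step
    # "sees" a different single-example error surface e(w; x_n, y_n).
    rng = np.random.default_rng() if rng is None else rng
    for n in rng.permutation(len(y)):
        w = w - eta * grad_e(w, X[n], y[n])
    return w
```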
Are the lecture slides available somewhere?
+Brando Miranda yeah, they are available here: work.caltech.edu/telecourse.html
Nice O.O; very helpful for my midterm
Can anyone explain what he says on the final slide?
This is a 100 level course?
I didn't quite get that. Was it "okay" before or after the partial derivative?
helpful! UoG, Ethiopia
great
Excuse me, I have a problem understanding slide #6. If "u" and "v" are data from the user and the movie, then where are the learnable parameters?
The learnable parameters are the weights associated with each feature that the movies have and that the customers apparently prefer (based on their watch history). The algorithm then tries to predict the rating that a customer will give to a specific movie. If this rating crosses a threshold, the algorithm will suggest that movie.
Thank you very much.
👍👍👍👍👍
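Regarding the slide #6 question above: under the usual reading of that model, both the user factors u and the movie factors v are the learnable parameters, and a single observed rating r is fit with SGD roughly like this (a sketch, not the slide verbatim):

\[
\hat r = \sum_i u_i v_i,
\qquad
e(u, v) = \Bigl(\sum_i u_i v_i - r\Bigr)^{2},
\qquad
u_i \leftarrow u_i - 2\eta\,(\hat r - r)\,v_i,
\quad
v_i \leftarrow v_i - 2\eta\,(\hat r - r)\,u_i .
\]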
This video seems good. I will watch it when I find the time :)
There is no neural networks topic included in the textbook. Can anyone send me the textbook? Thanks.
Neural networks is one of the "dynamic e-chapters" at amlbook.com/
I'm a little confused about computing the error of the output of a neuron in the final layer.
Let's say I have a neuron signal "s", an activation function "f(x)", a neuron output "x" which is the signal fed through the activation function (i.e. x = f(s)), and finally the derivative of the activation function f'(x).
For this particular neuron what is the equation to find the neuron error "d"?
The error is not computed for one neuron. It is computed for the whole network (as the hypothesis model). That is E_in, a function that exists only after the final layer. For your question: dE_in/ds = (dE_in/dx) * f'(s), assuming x = f(s).
Yup
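Written out in the notation of the question above (a sketch; the concrete form assumes the lecture's squared error and tanh activation):

\[
d = \frac{\partial E_{\text{in}}}{\partial s}
  = \frac{\partial E_{\text{in}}}{\partial x}\,\frac{\partial x}{\partial s}
  = \frac{\partial E_{\text{in}}}{\partial x}\, f'(s),
\qquad\text{e.g. for } E_{\text{in}} = (x - y)^2,\ f = \tanh:\quad
d = 2\,(x - y)\,(1 - x^{2}).
\]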
I think there is an error on slide 19: when he takes theta', it should be (1 - s^2) and not (1 - x^2). Could someone confirm it, please?
No, it is not an error. theta'(s) = 1 - theta(s)^2 = 1 - x^2.
theta'(s) = 1 - theta^2(s); we know that theta(s) = x, so this equals 1 - x^2.
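For completeness, the one-line derivation behind that reply (assuming theta is tanh, as in the lecture):

\[
\theta(s) = \tanh(s), \qquad
\theta'(s) = 1 - \tanh^{2}(s) = 1 - \theta(s)^{2} = 1 - x^{2}
\quad\text{since } x = \theta(s).
\]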
On not slavishly following a biological model: "We get a plane that flies but doesn't flap its wings." :-)
greeting from AU
good
Whatever it is, the guy is studying.
20:44
The YouTube algorithm has taken my pivot from dancing cats to self-education a bit too far this time...
ok
YouTube algorithm, here I come again!
1:04:12 hahaha
It seems like there is only one student in the room, lol.
No, there was a room full of Caltech students in his (very popular) class. The other person you hear asking questions is conveying questions from the online students who were watching the videos stream.
Hooray, I'm the 666th like!!! As of now there are 666 likes on this video.
He looks like the guy from Scrubs :P
Ah......the big bang theory
If he says "ok" one more time, I'll go nuts...