Learning is entirely feasible when this guy is your teacher.
Absolutely (and there is more than one color pair at our disposal to disambiguate the two different concepts)
Does anyone else find this guy absolutely hilarious for some reason? Something about the look on his face makes you feel like he's constantly thinking "Yeah, I'm killing this lecture right now". When he whips out that smug half smile I can't help but laugh out loud. You can tell he loves teaching. Great set of lectures.
I agree, I saw wisdom in his face. #respect
actually the updating process is there in his mind, and always expressed in his face
Such a marvelous lecture!
The logical sequence, the explanation, the jokes, the intuitive PowerPoint animations, ...
I wish all my lectures were like this
Agreed, this guy is SO good; great job sprinkling in jokes to keep everyone's attention
where was the joke?
If this video is confusing to you, consider the following:
The example at 9:34 is only to show that we know something about the entire set, based on a sample. Basically, it says the bigger the sample, the closer nu is to mu (Hoeffding).
At 28:00, forget the above example. We are not trying to make a hypothesis for that example. The new values have nothing to do with above example. From this point on, we have a random data set (X) with an unknown function f. We want to know if we could make a hypothesis h to predict results. In other words: is learning feasible?
So in the new bin, the proportion of green points is a measure of how correct a hypothesis is. We do not know how many are green, so we take a sample. In this sample, we get the ratio of correct to incorrect results of the hypothesis, and this says something about the entire bin (Hoeffding). So if the sample is sufficiently big and has a lot of positive predictions, then yes, learning is feasible.
Or not? -> 33:30
okay? okay.
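If it helps, here is a minimal Python sketch of the single-bin experiment (the mu value and sample sizes are made up for illustration; in the real setting mu is invisible to the learner): draw N marbles from a bin whose true red fraction is mu, and watch the sample fraction nu settle toward mu as N grows, which is all Hoeffding is promising.

```python
import random

# Toy single-bin experiment: mu is the (normally unknown) fraction of red
# marbles in the bin; nu is the fraction observed in a random sample.
mu = 0.37  # hypothetical value, chosen just for this demo

for N in (10, 100, 10_000):
    sample = [random.random() < mu for _ in range(N)]  # True = red marble
    nu = sum(sample) / N
    print(f"N={N:6d}  nu={nu:.3f}  |nu-mu|={abs(nu - mu):.3f}")
# The deviation |nu - mu| shrinks as N grows: exactly what Hoeffding bounds.
```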
Yeah, it's so misleading. The same marbles play two different roles: first they're used for measuring the probability of getting green, and the second time they're used for checking whether a hypothesis h is correct (h(x)=f(x)) or wrong (h(x)!=f(x)). He's a good professor, but this part is confusing.
Can you please tell me how exactly mu is defined? By "picking", do we mean I pour both colors of balls into the bin and mu is the percentage of each color in there after mixing? And how do we define picking a red ball, given that balls are picked as a sample, say 9 balls at a time, with the observed fraction called nu? How is that incorporated with mu?
Well, this was for a long time the most baffling point of the entire lecture for me. However, when complemented with the book, it suddenly hit me: although we cannot explicitly compute f(x) when comparing it to g(x), since f is unknown, and hence the colors themselves are completely unknown to us, what we CAN do is view randomly picked x's as samples from a probability distribution P, where x is red with probability mu and green with probability 1 - mu. That's it.
How much must this person know, if he calls such a big concept just a simple tool. Respect.
19:00
Now, between you and me, I prefer the original formula better. Without the 2.
However, the formula with the 2s has the distinct advantage of being … true. So we have to settle for that.
Best quote ever on the Hoeffding's Inequality. :)
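For what it's worth, you can see why the 2 is needed with a tiny worked case (my own numbers, not from the lecture): take a fair bin (mu = 0.5), a sample of size N = 1, and epsilon = 0.4. Then |nu - mu| = 0.5 > epsilon with certainty, so the hypothetical bound without the 2 is violated, while the true two-sided bound survives (trivially).

```python
import math

# mu = 0.5, N = 1: nu is either 0 or 1, so |nu - mu| = 0.5 always.
eps, N = 0.4, 1
p_deviation = 1.0                          # P[|nu - mu| > eps], exactly

without_2 = math.exp(-2 * eps**2 * N)      # ~0.726: would-be bound fails
with_2 = 2 * math.exp(-2 * eps**2 * N)     # ~1.452: true bound holds

print(p_deviation > without_2)   # True -> the formula without the 2 is false
print(p_deviation <= with_2)     # True -> the formula with the 2 is ... true
```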
Professor, your lectures are so enjoyable that I look forward to "learning" :). Thank you!
Brilliant lecture. A different approach from Andrew Ng's. Loved it!!
It's way better than Andrew Ng's. I am a math major.
Abu-Mostafa is a real genius at explaining complex things in simple terms... The real game changer is Hoeffding's inequality, because it allows us to model the learning problem in a way that bounds the uncertainty independently of the unknown parameter (mu). The only thing that remains is a tradeoff between error tolerance (epsilon) and sample size (N), captured by the relation
P(|nu - mu| > epsilon) <= 2 exp(-2 epsilon^2 N)
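As a rough sketch of that tradeoff (illustrative numbers, not from the lecture): solving 2*exp(-2*epsilon^2*N) <= delta for N shows how big a sample a given tolerance requires.

```python
import math

def min_sample_size(eps: float, delta: float) -> int:
    # Smallest N with 2*exp(-2*eps^2*N) <= delta, solved from Hoeffding's bound.
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

# e.g. to get |nu - mu| <= 0.05 except with probability 0.05:
print(min_sample_size(eps=0.05, delta=0.05))   # 738 samples
```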
The Coursera ML course assumes you're an idiot to start with, teaches you little, and then proclaims you an "expert". This course assumes substantial background, teaches you things in depth, and at the end is still humble about how much knowledge it gave you. Andrew Ng's YouTube lectures recorded at Stanford are quite good though.
Very well put, I was really frustrated with the coursera videos when I found this series and the experience has been much better.
@@s9chroma210 how're you nowadays??
@@supriyamanna715 I'm doing quite well. Did this and a few other courses, which really helped!
@@s9chroma210 which Coursera course are you talking about that's "dumbing things down"?
Thanks to the professor and Caltech for sending me back to my youth! I recall myself excited by the outstanding lecturers I met then.... His speech and passion perfectly colorize the topic and, I believe, are of great help to foreign students' understanding.
56:41 ooo, that's the hypothesis!! Nothing I need more than that. Thanks Professor!!!
I have not completely understood the analogy between learning and the 1000 coins. At 43:08 we try different hypotheses (because the bins are different) and try to select the best according to the sample. At 48:52 we have the same hypothesis (because the bins are the same) and try different samples.
This lecture is perfect! I recommend complementing it with Andrew Ng's lecture 9 on learning theory from his YouTube machine learning course. This prof is VERY good at conveying the intuition, while Ng goes more deeply into the maths. They complement each other perfectly.
This is such a nice complementary course to Andrew Ng’s videos. I would seriously consider paying for quality lectures like these. Eternally grateful to Caltech for providing this one free of charge!
It would be great to see the Professor in some courses on Coursera. He is one of the best I've ever heard. Thanks!
The coin analogy for multiple bins was SO SO SO SO GOOD.
Dr. Mostafa, thank you for the interesting lectures. I watched these while running on the treadmill for many days until they made sense. I am happy that you let the world watch your great work and amazing lecture style. More students would love math if they had you for a teacher. Thank you!
Your comment about running on a treadmill for many days while trying to understand this lecture made me laugh 😂
The Q&A session is great; it answers a lot of the questions in my mind, especially since I'm not looking at other materials.
Props for the method of explaining Hoeffding's Inequality. I find that going through each element of the equation separately helped a lot in understanding it. Congratulations!
In a world of "bestest" MOOCs, this playlist stands apart.
@45:24: I got 5 heads!!! Actually a total of 7 consecutive heads before my first tails. You always think it happens to someone else...
Day 2 done. Amazing lecture. Thank You Professor Yaser and Caltech for making them open to public.
The professor here is making a subtle but very important point. He is saying that given a set of samples, some hypothesis or other **must** agree with the data. And the more hypotheses there are, the more likely it gets that one of them will agree with the data (capital M in his lecture). This is a guaranteed fact (that some hypothesis will agree with the sample); we want to make sure that the probability of accidental agreement is small. Recall the professor's coin toss example.
What do you mean by "set of samples"? I thought when we're talking about multiple bins, we're talking about the same sample but different hypotheses applied to it.
@@Satelliteua We are not talking about the same sample from each bin. The way I understand the problem is this: We have 1 bin containing all the possible data points (marbles). For each h (a possible hypothesis), we extract a random sample from the bin. Now, each h actually changes mu in the bin (since each h changes the colors of the marbles based on its conformity to f) - this is why the prof represents the problem as multiple bins. Now what happens when we pull out random samples from each bin? It might be the case that all the samples pulled out are correctly classified by h. Does that then mean that these data points were correctly classified because h tends to f? Well, not necessarily. There are two things at play here. Let's first assume we are dealing with a single h (a single bin). From this bin we pull out a sample of data and check its classification accuracy. If we keep pulling out samples, then we might get a sample where h gets it all correct. So in this case, it was just luck that h got it all correct. Now though, we are dealing with multiple h (bins), and from each bin we are pulling out different samples. So now, like the coin example, we are actually likely to find the case (the h) where we pull out all heads, even though h is not close to f. This is why as M grows, we are likely to find a wrong model. This is what I understood. Sorry if my ideas are all over the place, it's a difficult concept to put into words.
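A quick simulation makes this concrete (the coin counts match the lecture's analogy; everything else is my own scaffolding): with 1000 fair coins flipped 10 times each, some coin comes up all heads in roughly 63% of runs, even though any single coin only does so with probability 1/1024.

```python
import random

def some_coin_all_heads(num_coins: int = 1000, flips: int = 10) -> bool:
    # Does at least one of num_coins fair coins land heads on every flip?
    return any(all(random.random() < 0.5 for _ in range(flips))
               for _ in range(num_coins))

trials = 5_000
hits = sum(some_coin_all_heads() for _ in range(trials))
print(hits / trials)   # ~0.63: among many "hypotheses", a perfect-looking
                       # one shows up by luck alone
```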
@@Omar-kw5ui Actually this is quite correct!
@@Omar-kw5ui I started watching the lectures recently. Your argument is correct. But if in real life I am doing some machine learning with some hypothesis and I find that all samples are 'green' (they agree with the function's y points), then should I still abandon this particular hypothesis, saying it's just pure luck, and keep on searching for other hypotheses where most likely I will never come close? I guess he is saying that by pointing out the coin-toss person who got all 5 heads. But then which hypothesis should I settle on as my 'g'?
Superb teacher! His lectures are so clear and intuitive that they make 'learning' delightful.
Love both his voice and jokes. A brilliant professor
Thumbs up to Professor Abu-Mostafa. Fantastic professor, fantastic sense of humor.
This is an amazing lecture. Looking forward to watching the next lecture videos.
The best way to get a grasp of all these lectures is to do all the homeworks and projects. If you are a "learning machine", then start to read conference papers on machine learning... so you will be ready for research.
Where can I find conference papers?
Prof. Abu-Mostafa is the man! Super cool guy.
I cannot thank you much for this lecture. You make machine learning math a piece of cake.
Thank you Caltech and Prof
What does it mean to have a stringent tolerance (around 57:57 in the video)? Basically, what if the inequality gives 2?
he is literally the manifestation of "how to teach"
The professor teaches so well. Glad they made the video.
Professor Yaser Abu-Mostafa is amazing!
He sounds just like King Julian, the king of the lemurs, I keep on hoping he starts singing "I like to move it move it"
I've been laughing for 10 minutes straight
excellent lecture. It looks like the inequality in the final verdict is so loose that the bound on P[|E_in(g) - E_out(g)| > eps] can easily exceed 1, making it vacuous.
I was waiting for the Indian girl to ask questions here as well. Will expect her again in the future videos.
32:20 Am I understanding correctly that we can ignore the probability of the data because it is somehow hidden in mu and nu?
I did not see that mentioned anywhere. Dr. Yaser has a book describing the course content in more detail, called "Learning From Data".
The book provides genuine additional insight to the lectures and vice versa.
Thank you sir for the recordings. Now I hope I can pass my course in "Statistical Foundation of Machine Learning". :)
Thanks for the excellent lecture. Here are a couple of questions:
At 42:22: just to be more rigorous, would the P notation in this Hoeffding inequality depend on both X and y?
At 50:02: how exactly is g defined here in order for this inequality to hold? The inequality seems to require that g minimize |Ein - Eout|? But that is not intuitive. Instead, it is more intuitive to have a g that minimizes Ein (or eventually Eout), based on the definitions of Ein and Eout earlier.
For my second question, I see it now, because it is a less-than-or-equal sign there, instead of an equals sign. That inequality always holds since g is one of the h's in H. Thus whatever criterion defines g is fine, as long as g is one of the h's.
P is a selection probability assigned to X, i.e., it defines the probability of selecting certain x's. It has nothing to do with y (or Y).
28:47
How do we compare the hypothesis to the target function if we never know what the target function is?
That's what I'll like to know. Please reply if you have figured it. Thanks.
True, we do not know the target function itself, but what we do know is its value at certain points, the points in the dataset that we use to train our hypothesis. I believe the prof. refers to comparing the two functions at those points.
Don't equate "we don't know what the target function is" with "we don't know anything about the target function". As someone below already stated, we do know something about the target function, namely its values at the points that we sampled. So the more we sample, the more we get to know about the target function. Probability comes in to allow us to state, with a certain probability, that the pattern in the bin will be "sufficiently similar" (this is the epsilon definition) to the sample chosen, as long as the sample is large enough.
From the data we already have.
Thanks for the excellent lecture. Really enjoying them. Very well explained.
Awesome lecture. The 10-flip coin experiment was surprising and pretty interesting. A few observations:
1) By choosing a sufficiently big M, I can get an inequality like P[bad event] < 2, which is a no-brainer.
2) An absolute value of the "bound" is meaningless unless I know "mu", the unknown quantity. That said, practically we can overcome this with some educated guesses, and possibly the central limit theorem too.
3) I am surprised there is no talk of the central limit theorem... I was expecting something to be proved based on it. Possibly Hoeffding is related to it.
cont. the formula with M on the RHS will be used, which is very conservative. Unfortunately, most hypothesis spaces are not finite, i.e., they contain an infinite number of hypotheses, so you can't use M to measure model complexity. In that case, we resort to something called the VC dimension and derive generalization bounds based on that.
He fields questions like a total G, I love this course
I like how he says okaaaaaay !
There are 90 million Egyptians pronouncing it exactly like that XD
He didn't state Hoeffding's inequality in full generality: for the formula at 20:30 to hold, the value of nu must be bounded within a range of 1 (which it is here, since nu lies in [0,1]).
I think the original Hoeffding inequality applies when the hypothesis is specified BEFORE you see the data (e.g. a crazy hypothesis like: if it is raining, then approve the credit card, otherwise don't). However, in reality we will learn a specific hypothesis using the data (e.g. using least squares to learn the regression coefficients). In that case, the learned hypothesis is g, and you can consider it as chosen from a hypothesis space (H). If the hypothesis space is finite, of size M, then
The professor's book also says that there is a problem because we use the same data points for all hypotheses instead of generating new data for each hypothesis. So it breaks the assumption of independence of data generation. Why wasn't it mentioned in the lecture?
I like the explanation that a complex g is too tuned to the historical data and likely to get a bigger Eout. I'm wondering whether most financial models in 2009 were like that: not many people could understand them, so few economists realized the crash was coming until it was too late.
How did we sum up the RHS in Hoeffding's inequality? I mean, each of the hypotheses could have a different bound (epsilon) and hence a different exponential term. So how do they sum up to be substituted by "M" times the exponential? Also, if we keep the bound the same for each inequality, won't the number of samples "N" change? How is the exponential consistent across all the hypotheses? Am I missing something?
Hmm...a lot of questions here. You start off by defining the maximum "acceptable" deviation of your selected hypothesis. This acceptable value is epsilon. The deviation is between the in-sample error and the out-of-sample error. You cannot guarantee this, but you can ensure that the chance of exceeding this deviation is smaller than a certain probability. This is why the whole probability thing is brought into the discussion. Now g is just one of the h's in the hypothesis set. So if the in- and out-of-sample errors of g deviate by more than epsilon, this implies that (at least) one of the h's deviates by more than epsilon, so we can say that it must be the case that h1 deviates by more than epsilon or h2 deviates by more than epsilon and so on. The probability of deviation between in- and out-of-sample is independent of the number of red and green balls, i.e., it is independent of any h; that is why each h has the same bound.
Thanks a lot for sharing. This will surely be of help to me in my M.Sc.
at 36:26 min shouldn't it be epsilon instead of nu?
Thank you very much - a brilliant work
I'm getting confused towards the end. Once you have g, what use is it to compare it to every h in H? Surely, g is the best h in H, that's how it became g. Also, why does he add M to the inequality at the very end? Doesn't that just increase the value, the bigger H is? So with a large H and a subsequent large M, won't the comparison be totally redundant? I think he made that point at the end, but I don't see why he added it in, in the first place.
13:00 Isn't this 'problem' just Hume's original problem of induction?
It is indeed!
You know the output for in-sample cases (the training set). So if the output matches, the hypothesis is green for that sample. (The target function still remains unknown.)
Thanks for pointing it out! I don't know how I could have missed looking for an ML course on Coursera. The problem will be the overlap with the Cryptography one from Dan Boneh.
The more Hulft runs, the clearer we see, That more data brings us closer to certainty!
What was the conclusion of the lecture? I mean how did we prove that learning is feasible?
The main objective of learning, as laid out in the lecture, is finding a hypothesis that behaves similarly for the training data (in-sample) and test data (out-of-sample). No matter the performance of the hypothesis on the sample, if we can prove that the hypothesis performs approximately the same in-sample and out-of-sample, then we have essentially proved that learning is feasible, i.e., generalizing beyond the in-sample points is possible. The final modification to Hoeffding's formula states that with a reasonable choice of M, epsilon and N, the probability of in-sample performance deviating from out-of-sample performance can indeed be bounded to an acceptable limit, thus proving learning is feasible. The fact that we can still learn even though M is infinite in nearly all the models we come across is proved in the Theory of Generalization lecture. Thanks.
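If it helps to see that final bound with numbers plugged in (the M, eps, N values below are just illustrations), here is P[|Ein(g) - Eout(g)| > eps] <= 2*M*exp(-2*eps^2*N) as a small calculator:

```python
import math

def hoeffding_bound(M: int, eps: float, N: int) -> float:
    # Union-bound version from the lecture: 2 * M * exp(-2 * eps^2 * N).
    return 2 * M * math.exp(-2 * eps**2 * N)

print(hoeffding_bound(M=1,     eps=0.1, N=1000))  # ~4e-9: single hypothesis
print(hoeffding_bound(M=10**6, eps=0.1, N=1000))  # ~0.004: still meaningful
print(hoeffding_bound(M=10**6, eps=0.1, N=100))   # huge (>1): bound is vacuous
```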
"Is learning feasible?" here means: can we, based on observations of our in-sample data, make statements about our out-of-sample data? In other words, can we generalize observations on our selected sample to the entire population (in the bin)?
It's been an excellent lecture so far, though I am not very clear on what script H and each h mean.
I am doing the ML course on Coursera. It is a very good start for someone who wants to get going, learn what machine learning is all about, and do some exercises to get a feel for it. That said, simple derivations in calculus, which you should know from high school, are skipped and just the final formula is given, which is a little disappointing. I don't see how anyone can do machine learning without knowing basic calculus. Too much emphasis is placed on being nice.
Hello and thank you for the wonderful lectures!
I'm new in this field and I am trying to combine it with Computational Geometry. As such, my problems are unique in the sense that the training sets can (usually) be constructed at the will of the modeller. The data are always (potentially) there; it is a matter of choosing which to produce. I was wondering if there is a theoretical or practical approach to an optimized selection of training samples from the whole? Does that relate to assigning a specific (hopefully in some way optimal) P(x), e.g. a uniform distribution which takes samples uniformly from the whole? Or is random selection still good enough in this case?
Thank you in advance.
Theodore.
Umm... I'm probably missing something ;) What formula is he using to get to the 63% probability? Each coin gets 10 straight heads once in every 1,024 runs on average (if we ran it infinitely many times, the proportion should be 1 over 2 to the N, right? So 1 over 2 to the 10, i.e., once every 1,024 times roughly). So because each coin is independent, doesn't that mean the probability should be almost 100%? (1000/1024)
Ah, Ok... So even if you had 100 trillion flips each with 99.9% chances of being heads, you still have 0.01x0.01... (100 trillion times) of chances to get all tails.
For this example, you have 1/1024 chances of it being heads 10 consecutive times, so you have (1-(1/1024)) chances of it having at least one of the 10 flips being tails... That is, if you have 1 in 1024 chances of it being heads 10 straight times, you have 1023 in 1024 chances of it not being that. And if it is not that, it means that, at least, there is one tails somewhere (at least one) that would break the chain. So over 1000 repetitions, you have (1-(1/1024)) to the 1000, or (1023/1024) to the 1000, or 37% chances to get at least one tail on each set of 10. So 63% chances approx. to get 10 consecutive heads.
That being said, I still believe if the chances are 1 in 1024 to get 10/10 heads, for each 1024 attempts when the number of attempts goes towards infinity, we should get at least one of those to be 10 straight heads? So maybe it has to do with distribution? Like sometimes you can get 2 or more sets of 10 straight heads in your lot of 1000, while other times you may get none. So the chances of you finding (in a lot of 1000 tries) at least 1 set of 10 straight heads is 63%? (because they can form clusters, and sometimes you will get a group with none)
Or maybe it doesn't have to do with that? I mean, what are probabilities, really? Say you have 99.9% chances to get heads and 0.01% to get tails. You do it twice and the chances to get at least one time heads are really high, of course. But there is 1/10,000 chances of actually being tails and tails... So if you go towards infinity, you might think the distribution would be 99.9% of the time, no matter where or the order, you get heads, and 0.01% of the time you would get tails. But for N tries, there is 0.01 to the N chances of actually being all tails... So you can do it 100 trillion times, or go towards infinity, and there is still a very, very, very small, but real chance to, well, get all tails. So the chance is there, and now let's suppose it happens... Now, if that slim chance was the way events unfolded, then that option would happen forever, infinite times, and the 99.9% chances would mean nothing. You might say, well, but if we run the experiment again, now we will probably get 99.9% of the time heads. So the 99.9% vs 0.01% probability isn't wrong... But actually, this new set of samples can be concatenated to the last, as they go towards infinity and the premise is that this will happen (eventually) infinite times, and ALL times, as one single time not getting tails would break the chain...
So now we might say it is unlikely, but now think of a person seeing it, witnessing the event... Wouldn't they say chances are 100% tails?
So one important thing is that probabilities don't guarantee you will get heads and tails in a proportion of 1/1024 and 1023/1024. They really don't. A probability of 90% doesn't mean something will happen 90% of the time, but that we believe it has '9 chances out of ten to be that'. But once the drawing is made, it can happen only 70% of the time, or 2% of the time, and stay like this forever... At least that's my understanding of it after giving it some thought!
You are correct that the probability of getting 10 heads is 1/(2^10). Let's call this a. The probability of NOT getting 10 heads in 1000 flips is (1-a)^1000 and getting at least one such result is 1 - (1-a)^1000 = 62.36%
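In code, the exact computation (same numbers as above):

```python
a = 1 / 2**10          # P(one fair coin lands 10 straight heads) = 1/1024
p = 1 - (1 - a)**1000  # P(at least one of 1000 coins does)
print(p)               # 0.6236...
```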
Yeah, I know ;) I have to admit it puzzled me for a while until I figured it out (see paragraph three), though!
Thanks for the reply and clear explanation!
Really Brilliant lecture ..
Great explanation sir. I like ur accent. Your accent is really like a Ratahan accent.
Since mu depends on the probability distribution, should it not be constant for all bins, i.e., all h? And shouldn't it be nu that changes with h and the bins? Why is mu different for different bins?
Thank you Caltech and Professor. But please, can you help me? It's been decades since I touched maths, so I have to catch up. Tell me which links I must read and understand so I can get back here and follow along like you smart guys.
This means that we assume the data we have was sampled in the same way real data occurs. It does not seem a trivial assumption to me, but it makes a lot of sense to need such an assumption.
Anyway, great lecture!
Why not write the RHS of Hoeffding's inequality as min(1, 2exp(-2N eps^2)), since a probability cannot exceed 1 anyway?
beautiful so far !!
One thing I couldn't figure out, though, is how the target function and the hypothesis would agree. How does the comparison occur?
Whatever happens in the bin is hypothetical. Just assume you have chosen a hypothesis h. This will agree with the target function in some cases and differ in others over the entire set of inputs, which is possibly infinite. The main takeaway is that you can compare it on the sample, the training data, for which the value of the target function is available. Thus the essence is: if you see that the hypothesis chosen by you agrees with the values of the target function on the sample, it will probably behave the same for out-of-sample data points, within a threshold (according to Hoeffding's formula). Feel free to ask if you have further queries.
Sir, can you please explain the hypothesis and target function in the bin-of-marbles problem through some mathematical expression (as an example)?
fantastic lecture !!!! thanks a lot
Remind myself that this is just foundation, and that it is dry and Zzzz.... but must ....keep...going..
an hour later... really, the summary is that the more your model caters to a specific sample, the more prone it is to failure when it comes to the unknown. It is like a Fourier series: fitting the data too well can lead to not actually learning at all >...
Maybe you'd like Stanford's course by Andrew Ng better; Google it and check it out. I like it.
No that was the last lecture. This one wasn't really about that.
How is the probability distribution over X factored into the learning process? The marbles (sample) from the bin (space) are subject to the probability distribution. How does the probability affect learning? I only know that the multi-bin problem necessitates the modification of the plain-vanilla Hoeffding's inequality, and the multiple bins are brought about by the number of hypotheses in the hypothesis set, not by the probability distribution over the X space.
The essence of a probability distribution is that it enables you to state that the pattern you observe in your sample will, with some particular probability, reflect the pattern in the bin. Making a statement about the situation in the bin based on what you observe in your sample IS the learning statement. You cannot generalize (or learn) more than this. Suppose I select all green marbles in my sample. Can I say that all the marbles in the bin are green if the sample size is 10, 100 or 1000000? The answer is no. In fact I cannot state anything certain about the content of the bin no matter how large my sample is or how it is made up. Saying I cannot say anything certain about the bin is the same as saying I cannot learn anything certain about the bin.
I did not understand the union bound concept. My doubt is: shouldn't the upper bound (the probability that the selected hypothesis is bad) be (1/M) times (2M exp{-2e^2 N}), assuming each hypothesis is equally likely to be selected? As an analogy, consider this question:
"There are two bags containing white and black balls. In the first bag there are 8 white and 6 black balls and in the second bag there are 4 white and 7 black balls. One ball is drawn at random from any of these two bags. Find the probability of this ball being black."
In the above question assume that selecting a bad ball signifies a bad event. Thus, P(bad event)=1/2*6/14+1/2*7/11=(1/2)*(6/14+7/11). In this example, M=2.
I have the same doubt right now. Did you find your answer?
There is no probability associated with selecting hypotheses. Probability is only associated with selecting the sample data (x). The learning algorithm will (likely) examine many or all hypotheses. It chooses a particular hypothesis based on various criteria; probability has nothing to do with this selection. Your thought error is probably that you saw that a hypothesis was "selected" and assumed this was a probability concept. This is (understandably) confusing. It would probably have been better to say a hypothesis was "chosen", to stay away from probability terminology.
this is awesome man, tks.
@ 34:20, I understand that the marble is green if the target function corresponds to the hypothesis used. However, didn't the professor say the target function itself is unknown?
I think he meant that the marble is green if yhat equals y, namely the target value, instead of the underlying target function.
Is a new hypothesis h_avg, defined as an average over a subset of hypotheses in H, necessarily also in H? Or does it depend on the functional form of H?
No; H can be any set of hypotheses. The set does not have to have any form of arithmetic closure.
Something like Feynman's lectures is being attempted.
47:00 how did he obtain 63%?
For one coin to get all heads: (1/2)^10. Therefore 1-(1/2)^10 is the probability of getting at least 1 tail. The likelihood of every coin getting at least 1 tail is (1-(1/2)^10)^1000, so we subtract that result from 1 to find the probability that at least one coin got no tails.
In slide 23, why does the probability equation for g depend on all hypotheses when we pick only one out of the multiple hypotheses? Shouldn't it equal the probability of the hypothesis chosen?
g is one of the hypotheses h, so what the dependency statement says is that if something applies to g, it must therefore apply to (at least) one of the h's.
How would you know the value of E(out)?
Thank you for the lecture. It was really insightful, though it's hard for me to capture it all. I liked the questions that the students asked. Why do we have multiple bins? They were cute though haha
55:59 Overfitting!!!
awesome! thanks
Ng's lectures are hardly even within the realm of a true 'lecture'. His command of the English language is very limited and he rarely explains anything beyond the iteration of formulas and proofs. You would be better served by reading a book on the subject.
Abu-Mostafa is a TRUE teacher who walks you through the process. Of the dozens of lectures on this complex subject, his have the best compromise between content and approachability.
/opinion
Hello. What should I do if I misunderstood a lot of this talk? It may be a matter of language, because I'm not from an English-speaking country; other courses, plain programming ones, I understand, but things about machine learning specifically I don't. What should I do? Study math, or what...
Try to become smarter
shitty comment sry, this can't help me ;-)
Just a bit of humor :) Turn on the subtitles, so you eliminate the listening-comprehension problem (if that's what it is).
This is something you can do to understand this lecture better:
1. Turn the subtitles on (by clicking the subtitles button)
2. If you don't understand something, watch and think over and over again until you totally get it.
3. This (work.caltech.edu/lectures.html#lectures) contains the lecture's slides; I think it helps.
Happy learning!
Try other machine learning courses, for instance on Udacity and Coursera, then come back to this one. Also, if you haven't had calculus, linear algebra, and probability, this isn't going to make a lot of sense to you. So if you lack that math background, then go study those topics one at a time, then come back.
Brilliant, ya Yaser.
Is learning feasible? 8:07
I'm confused about what the green and red marbles mean. If you pick random marbles from the bin, aren't they random? What is learning in this context?
Ahh I had to watch it twice. Abstract representations
You pick a hypothesis function h, a proxy for the unknown target function f. For a particular value x, if h(x)=f(x) we color the marble green, otherwise red. This rule colors all the marbles. It is important to realize that WE can't see the marbles in the bin, but we do know they are colored. What we are trying to do is find an h which has the lowest number of marbles colored red, because a red marble x means h(x) is not equal to f(x). At this point probability does not come into the discussion; we are simply coloring the marbles. The marble colors are not random; the marbles you pick are. In other words, you did not pick a random red marble, you randomly happened to pick a marble that was red.
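A tiny sketch of that coloring step, with made-up stand-ins for f and h (in the real setting f is unknown and the colors are invisible to us):

```python
# Hypothetical target and hypothesis, purely for illustration.
def f(x): return x % 3 == 0   # the "unknown" target function
def h(x): return x % 2 == 0   # our candidate hypothesis

bin_of_marbles = range(1000)
colors = ["green" if h(x) == f(x) else "red" for x in bin_of_marbles]
mu = colors.count("red") / len(colors)   # fraction of disagreement with f
print(mu)   # the coloring is deterministic; randomness enters only when
            # we sample marbles from the bin
```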
Nice analogy!!
This is the right answer (1/1024) for the first question in the "coin analogy"
Many thanks to UFRJ for the translation. An excellent initiative for all of us independent learners.
Probably Approximately Correct :-o I have the book by Leslie Valiant
The Prof. is amazing. He also looks like Prince Charles.
I love this guy !
Lol, I love the lectures, but why the heck did he use "mew" (mu) and "new" (nu)? It is so confusing. Just pick two completely different-sounding names: use the sample mean x-bar and the population mean mu, lol, duh.