Lecture 02 - Is Learning Feasible?

  • Published: 8 Sep 2024
  • Is Learning Feasible? - Can we generalize from a limited sample to the entire space? Relationship between in-sample and out-of-sample. Lecture 2 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - itunes.apple.c... and on the course website - work.caltech.ed...
    Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, creativecommons...
    This lecture was recorded on April 5, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.

Comments • 232

  • @iAmTheSquidThing
    @iAmTheSquidThing 4 years ago +187

    Learning is entirely feasible when this guy is your teacher.

    • @y-revar
      @y-revar 7 months ago

      Absolutely (and there is more than one color pair at our disposal to disambiguate two different concepts)

  • @michaelnguyen8120
    @michaelnguyen8120 5 years ago +90

    Does anyone else find this guy absolutely hilarious for some reason? Something about the look on his face makes you feel like he's constantly thinking "Yeah, I'm killing this lecture right now". When he whips out that smug half smile I can't help but laugh out loud. You can tell he loves teaching. Great set of lectures.

    • @-long-
      @-long- 4 years ago +3

      I agree, I saw wisdom in his face. #respect

    • @supriyamanna715
      @supriyamanna715 2 years ago +5

      the updating process is actually happening in his mind, and it always shows on his face

  • @faruksn
    @faruksn 10 years ago +212

    19:00
    Now, between you and me, I prefer the original formula better. Without the 2.
    However, the formula with the 2s has the distinct advantage of being … true. So we have to settle for that.
    Best quote ever on the Hoeffding's Inequality. :)

  • @kora5
    @kora5 8 years ago +101

    Such a marvelous lecture!
    The logical sequence, the explanation, the jokes, the intuitive powerpoint/animation, ...
    I wish all my lectures were like this

    • @tomvonheill
      @tomvonheill 7 years ago +3

      Agreed, this guy is SO good; great job sprinkling in jokes to keep everyone's attention

    • @kezwikHD
      @kezwikHD 5 years ago

      Ohhh yes... so much better than my university.
      I get more out of this than from my own lectures, even though I don't speak that much English and am only in my first semester at my place.

    • @plekkchand
      @plekkchand 5 years ago

      where was the joke?

  • @d13tr
    @d13tr 5 years ago +35

    If this video is confusing to you, consider the following:
    The example at 9:34 is only to show that we know something about the entire set based on a sample. Basically, it says the bigger the sample, the closer nu gets to mu (Hoeffding).
    At 28:00, forget the above example. We are not trying to make a hypothesis for that example. The new values have nothing to do with the above example. From this point on, we have a random data set (X) with an unknown function f. We want to know if we can make a hypothesis h to predict results. In other words: is learning feasible?
    So in the new bin, the probability of how many points are green is a measure of how correct a hypothesis is. We do not know how many are green, so we take a sample. In this sample, we get the ratio of correct to incorrect results of the hypothesis, and this says something about the entire bin (Hoeffding). So if the sample is sufficiently big and has a lot of positive predictions, then yes, learning is feasible.
    Or not? -> 33:30
    okay? okay.

    • @AvielLivay
      @AvielLivay 3 years ago

      Yeah, it's misleading. The same marbles play two different roles: first they're used for measuring the probability of getting green, and the second time they're used for checking whether a hypothesis h is correct (h(x)=f(x)) or wrong (h(x)!=f(x)). He's a good professor but he's confusing.

    • @radicalengineer2331
      @radicalengineer2331 2 years ago

      Can you please tell me how exactly mu is defined? Is it the fraction of each color in the bin after mixing? And how do we define picking a red ball? Balls are picked in samples, like we pick 9 balls at a time and define the proportion of red in that sample as "nu", but how is that incorporated with "mu"?

    • @jorgetimes2
      @jorgetimes2 5 months ago

      Well, this was for a long time the most baffling point of the entire lecture for me. However, when complemented with the book, it suddenly hit me: although we cannot explicitly compute f(x) when comparing it to g(x), since f is unknown, and hence the colors themselves are completely unknown to us, what we CAN do is view randomly picked x's as samples from a probability distribution P, where x is red with probability mu and green with probability 1 - mu. That's it.
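
The bin picture discussed in this thread can be checked with a tiny simulation (a sketch; the bin composition mu = 0.3 and the 0.05 tolerance are illustrative choices, not values from the lecture):

```python
import random

def sample_nu(mu, n, rng):
    """Draw n marbles from a bin whose true red fraction is mu;
    return nu, the red fraction observed in the sample."""
    return sum(rng.random() < mu for _ in range(n)) / n

rng = random.Random(0)
mu = 0.3  # "unknown" bin composition (known only to the simulation)
for n in (10, 100, 10_000):
    trials = [sample_nu(mu, n, rng) for _ in range(1000)]
    # how often the sample misleads us by more than 0.05
    bad = sum(abs(nu - mu) > 0.05 for nu in trials) / len(trials)
    print(f"N={n}: P(|nu - mu| > 0.05) ~ {bad:.3f}")
```

Bigger samples track mu more and more reliably, which is exactly the point Hoeffding's inequality makes precise.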

  • @sagarrathi1
    @sagarrathi1 5 years ago +3

    How much must this person know, if he calls such a big concept just a simple tool. Respect.

  • @supriyamanna715
    @supriyamanna715 2 years ago +2

    56:41 ooo, that's the hypothesis!! Nothing I need more than that. Thanks Professor!!!

  • @thekolbaska
    @thekolbaska 11 years ago +62

    The Coursera ML course assumes you're an idiot to start with, teaches you little, and then proclaims you an "expert". This course assumes substantial background, teaches you things in depth, and at the end is still humble about how much knowledge it gave you. Andrew Ng's YouTube lectures recorded at Stanford are quite good though.

    • @s9chroma210
      @s9chroma210 4 years ago +3

      Very well put, I was really frustrated with the coursera videos when I found this series and the experience has been much better.

    • @supriyamanna715
      @supriyamanna715 2 years ago

      @@s9chroma210 how're you nowadays??

    • @s9chroma210
      @s9chroma210 2 years ago

      @@supriyamanna715 I'm doing quite well. Did this and a few other courses which really helped!

    • @FsimulatorX
      @FsimulatorX 2 years ago

      @@s9chroma210 which Coursera course are you talking about that's 'dumbing things down'?

  • @NaveenKumar-nd5ts
    @NaveenKumar-nd5ts 8 years ago +79

    Brilliant lecture. A different approach than Andrew Ng's. Loved it !!

    • @mukulkumar2316
      @mukulkumar2316 3 years ago +9

      It's way better than Andrew Ng. I am a math major.

    • @letswasteayear7908
      @letswasteayear7908 2 years ago

      The way Andrew teaches is very bland, whereas this guy is a storyteller.

  • @sharmilavelamur8342
    @sharmilavelamur8342 8 years ago +22

    Professor, your lectures are so enjoyable that I look forward to "learning" :). Thank you!

  • @UserUser-pv2wo
    @UserUser-pv2wo 8 years ago +5

    Thanks to the professor and Caltech for sending me back to my youth! I recall myself, excited by the outstanding lecturers I met then.... His speech and passion perfectly colourise the topic and, I believe, are of great help to foreign students' understanding.

  • @pinocolizziintl
    @pinocolizziintl 11 years ago +6

    It would be great to see the Professor in some courses on Coursera. He is one of the best I've ever heard. Thanks!

  • @FsimulatorX
    @FsimulatorX 2 years ago +1

    This is such a nice complementary course to Andrew Ng’s videos. I would seriously consider paying for quality lectures like these. Eternally grateful to Caltech for providing this one free of charge!

  • @kezwikHD
    @kezwikHD 5 years ago +6

    Hey, I just want to thank Caltech so much for this course.
    I am currently studying computer science in my first year, with the goal of machine learning and AI. But most of the courses I have to take are outside my interests. I understand that a lot of those courses are just the basics for some of the material in higher semesters, but since I am studying out of interest and not for the graduation, I only need the basics for the material I am interested in in higher semesters, like machine learning. All the rest is not really needed. And that is why this course really is what I want, because I can learn what my interests are and don't have to deal with all the other things that I in particular won't need.
    I even understand more in this course than I do in the lectures at my university, even though I don't speak that much English. Your professor is really good at explaining; keep that style of PDFs. Because the PDFs are why I don't get anything at my university... there is a minimum of 10 formulas and 150 words per slide, so basically every slide is just a wall of text the professor reads off of :/
    Thank you Caltech

    • @FsimulatorX
      @FsimulatorX 2 years ago +1

      How are you doing today?

    • @kezwikHD
      @kezwikHD 2 years ago +1

      @@FsimulatorX Actually I now understand why we learn all the boring basics and all the theory. Still, I agree with a lot of the things I stated above (in very poor English, as I now must confess ^^). I actually did some courses on deep learning, pattern recognition, and artificial intelligence, but to be honest, my professors don't even bother explaining the material; they just give a ton of formulas and that's it. Therefore I learn much more just researching on my own.
      However, to answer your question, I am doing quite well. I understand much more than I did back then. I am working for an international company as a student and I am about to finish my Bachelor's degree.
      Thank you for asking :)
      To anyone feeling the same way about their own university: don't give up. It gets better, and even if the lectures don't, you still learn new things (even if it is just "how to research", which by the way is the most valuable skill of all)

  • @Nbecom
    @Nbecom 11 years ago +10

    The professor here is making a subtle but very important point. He is saying that given a sample, some hypothesis or other **must** agree with the data. And the more hypotheses there are, the more likely it gets that one of them will agree with the data (capital M in his lecture). Since some hypothesis agreeing with the sample is practically guaranteed, we want to make sure that the probability of accidental agreement is small. Recall the professor's coin-toss example.

    • @Satelliteua
      @Satelliteua 4 years ago

      What do you mean by "set of samples"? I thought when we're talking about multiple bins, we're talking about the same sample but a different hypothesis applied to it.

    • @Omar-kw5ui
      @Omar-kw5ui 4 years ago +1

      ​@@Satelliteua We are not talking about the same sample from each bin.. The way I understand the problem is this way: We have 1 bin containing all the possible data points (marbles). For each h (a possible hypothesis), we extract a random sample from the bin. Now, each h actually changes mu in the bin (since each h changes the colors of the marbles based on its conformity to f) - this is why the prof represents the problems as multiple bins. Now what happens when we pull out random samples from each bin? It might be the case that all the samples pulled out are correctly classified by h. Does that then mean that these data points were correctly classified because h tends to f? Well not necessarily. There are two things at play here. Let's first assume we are dealing with a single h (a single bin). From this bin we pull out a sample of data and check its classification accuracy. If we keep pulling out samples, then we might get a sample where h gets it all correct. So in this case, it was just luck that h got it all correct. Now though, we are dealing with multiple h (bins), and from each bin we are pulling out different samples. So now, like the coin example, we are actually likely to find the case (the h) where we pull out all heads, even though h is not close to f. This is why as M grows, we are likely to find a wrong model. This is what I understood. Sorry if my ideas are all over the place, its a difficult concept to put into words.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      @@Omar-kw5ui Actually this is quite correct!
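
The "lucky hypothesis" effect described in this thread can be simulated directly (a sketch mirroring the lecture's coin analogy; M = 1000 hypotheses, samples of 10 marbles, and a worthless coin-flip bin with mu = 0.5 are the analogy's numbers, not a general recipe):

```python
import random

def some_h_looks_perfect(num_h, sample_size, rng):
    """Each of num_h hypotheses gets its own random sample of marbles.
    Every h here is equally bad (mu = 0.5, a coin flip). Return True if
    at least one h happens to classify its whole sample correctly."""
    return any(
        all(rng.random() < 0.5 for _ in range(sample_size))
        for _ in range(num_h)
    )

rng = random.Random(0)
for m in (1, 1000):
    runs = 2000
    hits = sum(some_h_looks_perfect(m, 10, rng) for _ in range(runs)) / runs
    print(f"M={m}: fraction of runs where some h looked perfect ~ {hits:.3f}")
```

With a single hypothesis a perfect-looking sample is rare (about 1/1024); with 1000 hypotheses it happens in roughly 63% of runs, even though no hypothesis tracks f at all. That is why the bound has to pay the factor M.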

  • @sreeragm8366
    @sreeragm8366 4 years ago +2

    In a world full of MOOCs, this playlist stands apart.

  • @ProdTheRs
    @ProdTheRs 4 years ago +2

    This lecture is perfect! I recommend complementing it with Andrew Ng's lecture 9 on learning theory from his YouTube machine learning course. This prof is VERY good at conveying the intuition, while Ng goes more deeply into the maths. They complement each other in a perfect way.

  • @user-wn1vz8yt9j
    @user-wn1vz8yt9j 6 years ago +6

    The Q&A session is great; it clears up a lot of questions in my mind, especially for me, not having looked at other materials.

  • @zenicv
    @zenicv 11 months ago

    Abu-Mostafa is a real genius at explaining complex things in simple terms... The real game changer is Hoeffding's inequality, because it allows us to model the learning problem in a way that bounds the uncertainty independently of the unknown parameter (mu). The only thing that remains is a tradeoff between error tolerance (epsilon) and sample size (N), captured by the relation
    P(|nu - mu| > epsilon) <= 2e^(-2 epsilon^2 N)
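
The bound referenced here, P(|nu - mu| > epsilon) <= 2e^(-2 epsilon^2 N), can be sanity-checked numerically (a sketch; mu = 0.5 and epsilon = 0.1 are illustrative choices):

```python
import math
import random

def hoeffding_bound(eps, n):
    # Hoeffding's inequality: P(|nu - mu| > eps) <= 2 * exp(-2 * eps^2 * n)
    return 2 * math.exp(-2 * eps ** 2 * n)

def empirical_prob(mu, eps, n, trials, rng):
    """Estimate P(|nu - mu| > eps) by repeated sampling from the bin."""
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        bad += abs(nu - mu) > eps
    return bad / trials

rng = random.Random(0)
for n in (100, 1000):
    print(n, empirical_prob(0.5, 0.1, n, 2000, rng), hoeffding_bound(0.1, n))
```

The empirical deviation probability sits comfortably below the bound, and both fall off rapidly as N grows; note the bound never uses mu, which is the whole point.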

  • @emrefisne9743
    @emrefisne9743 9 months ago +1

    he is literally the manifestation of "how to teach"

  • @edvaned8207
    @edvaned8207 4 years ago +2

    Many thanks to UFRJ for the translation. An excellent initiative for all of us self-directed learners.

  • @minhtamnguyen4842
    @minhtamnguyen4842 5 years ago +1

    Love both his voice and jokes. A brilliant professor

  • @PotadoTomado
    @PotadoTomado 8 years ago +4

    Prof. Abu-Mostafa is the man! Super cool guy.

  • @ProdTheRs
    @ProdTheRs 4 years ago

    The coin analogy for multiple bins was SO SO SO SO GOOD.

  • @philipralph
    @philipralph 7 years ago +7

    @45:24: I got 5 heads!!! Actually a total of 7 consecutive heads before my first tails. You always think it happens to someone else...

  • @LessTrustMoreTruth
    @LessTrustMoreTruth 11 years ago

    Thumbs up to Professor Abu-Mostafa. Fantastic professor, fantastic sense of humor.

  • @manjuhhh
    @manjuhhh 10 years ago +4

    Thank you Caltech and Prof

  • @jyotsnamasand2414
    @jyotsnamasand2414 1 year ago

    Superb teacher! His lectures are so clear and intuitive that they make 'learning' delightful.

  • @Rafaelkenjinagao
    @Rafaelkenjinagao 2 years ago

    Props for the method of explaining Hoeffding's Inequality. I find that going through each element of the equation separately facilitated understanding it a lot. Congratulations!

  • @rain531
    @rain531 2 years ago

    Day 2 done. Amazing lecture. Thank you, Professor Yaser and Caltech, for making these open to the public.

  • @MatthewRalston89
    @MatthewRalston89 4 years ago

    Dr. Mostafa, thank you for the interesting lectures. I watched these while running on the treadmill for many days until they made sense. I am happy that you let the world watch your great work and amazing lecture style. More students would love math if they had you for a teacher. Thank you!

    • @FsimulatorX
      @FsimulatorX 2 years ago

      Your comment about running on a treadmill for many days while trying to understand this lecture made me laugh 😂

  • @Shahada2012
    @Shahada2012 8 years ago +1

    The best way to get a grasp of all these lectures is to do all the homeworks and projects. If you are a "learning machine", then start to read conference papers on machine learning... so you will be ready for research.

    • @toori2l5l
      @toori2l5l 8 years ago +1

      Where can I find conference papers?

  • @msharee9
    @msharee9 10 years ago +1

    I cannot thank you enough for this lecture. You make machine learning math a piece of cake.

  • @neilbryanclosa462
    @neilbryanclosa462 7 years ago +5

    This is an amazing lecture. Looking forward to watching the next lecture videos.

  • @NhatTanDuong
    @NhatTanDuong 1 year ago

    Professor Yaser Abu-Mostafa is amazing!

  • @dissonantiacognitiva7438
    @dissonantiacognitiva7438 9 years ago +44

    He sounds just like King Julian, the king of the lemurs, I keep on hoping he starts singing "I like to move it move it"

    • @bluekeybo
      @bluekeybo 6 years ago +2

      I've been laughing for 10 minutes straight

  • @dank8981
    @dank8981 10 years ago +17

    I like how he says okaaaaaay!

    • @YousefHamza
      @YousefHamza 9 years ago +18

      There are 90 million Egyptians pronouncing it exactly like that XD

  • @ShubhamSharma-tn3wm
    @ShubhamSharma-tn3wm 3 years ago

    I was waiting for the Indian girl to ask questions here as well. I'll expect her again in future videos.

  • @ningli335
    @ningli335 7 years ago

    The professor teaches so well. Glad they made the video.

  • @thangbom4742
    @thangbom4742 5 years ago

    excellent lecture. It looks like the inequality in the final verdict is so loose that the bound on P[|E_in(g) - E_out(g)| > eps], namely 2M e^(-2 eps^2 N), can easily exceed 1.

  • @southshofosho
    @southshofosho 8 years ago +2

    He fields questions like a total G, I love this course

  • @pavel4616
    @pavel4616 8 months ago

    I haven't completely understood the analogy between learning and the 1000 coins. At 43:08 we try different hypotheses (because the bins are different) and try to select the best according to the sample. At 48:52 we have the same hypothesis (because the bins are the same) and try different samples.

  • @xhulioisufi2979
    @xhulioisufi2979 2 years ago

    Thank you sir for the recordings. Now I hope I can pass my course in "Statistical Foundation of Machine Learning". :)

  • @samchien474
    @samchien474 11 years ago

    (cont.) the formula with M on the RHS will be used, which is very conservative. Unfortunately, most hypothesis spaces are not finite, i.e., you have an infinite number of hypotheses, so you can't use M to measure model complexity. In that case, we resort to something called the VC dimension and derive generalization bounds based on it.

  • @helenlundeberg
    @helenlundeberg 9 years ago +5

    I love this guy !

  • @samchien474
    @samchien474 11 years ago

    I think the original Hoeffding's inequality applies when the hypothesis is specified BEFORE you see the data (e.g., a crazy hypothesis like: if it is raining, approve the credit card, otherwise don't). However, in reality we learn a specific hypothesis from the data (e.g., using least squares to learn the regression coefficients). In that case the learned hypothesis is g, and you can consider it as chosen from a hypothesis space (H). If the hypothesis space is finite, of size M, then

  • @ahmedelsayed121
    @ahmedelsayed121 6 years ago +2

    I did not see this mentioned anywhere: Dr. Yaser has a book describing the course content in more detail, called "Learning From Data".

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      The book provides genuine additional insight into the lectures, and vice versa.

  • @sarnathk1946
    @sarnathk1946 7 years ago

    Awesome lecture. The 10-times coin flip experiment was surprising and pretty interesting. A few observations:
    1) By choosing a sufficiently big M, I can get an inequality like P[Bad Event] < 2, which is a no-brainer.
    2) An absolute value of the "bound" is meaningless unless I know "mu", the unknown quantity. That said, practically we can overcome this with some educated guesses, and possibly the central limit theorem too.
    3) I am surprised there is no mention of the central limit theorem... I was expecting that something would be proved based on it... Possibly Hoeffding is related to it....

    • @jonsnow9246
      @jonsnow9246 6 years ago +1

      What was the conclusion of the lecture? I mean how did we prove that learning is feasible?

  • @webbertiger
    @webbertiger 11 years ago

    I like the explanation that a complex g fits the historical data too closely and is likely to get a bigger E_out. I wonder whether most financial models in 2009 were like that, and not many people could understand them, so few economists realized the crash was coming until it was too late.

  • @danielgray8053
    @danielgray8053 3 years ago +2

    Lol, I love the lectures, but why the heck did you use "mu" and "nu"? They sound identical, it's so confusing. Just pick two completely different-sounding names. Just use the sample mean x-bar and the population mean mu, lol, duh.

  • @pavel4616
    @pavel4616 9 months ago

    The professor's book also says there is a problem because we use the same data points for all hypotheses instead of generating new data for each hypothesis, so it breaks the assumption of independent data generation. Why wasn't this mentioned in the lecture?

  • @yairelblinger2891
    @yairelblinger2891 9 years ago

    This means we assume that the data we have was sampled in the same way real data occurs. That does not seem a trivial assumption to me, but it makes a lot of sense to need such an assumption.
    Anyway, great lecture!

  • @AndyLee-xq8wq
    @AndyLee-xq8wq 1 year ago

    Nice analogy!!

  • @rajkumarsaini7553
    @rajkumarsaini7553 1 year ago

    What does it mean to have a stringent tolerance (around 57:57 in the video)? Basically, what if the inequality gives a bound of 2?

  • @pavansughosh
    @pavansughosh 12 years ago

    You know the output for in-sample cases (the training set). So if the output matches, the hypothesis is green for that sample (the target function still remains unknown).

  • @WhyMe432532
    @WhyMe432532 11 years ago

    Thanks for the excellent lecture. Really enjoying them. Very well explained.

  • @lolilops54
    @lolilops54 11 years ago +1

    I'm getting confused towards the end. Once you have g, what use is it to compare it to every h in H? Surely g is the best h in H; that's how it became g. Also, why does he add M to the inequality at the very end? Doesn't that just increase the value the bigger H is? So with a large H and a subsequently large M, won't the comparison be totally redundant? I think he made that point at the end, but I don't see why he added it in the first place.

  • @avirtser
    @avirtser 9 years ago

    Thank you very much - brilliant work

  • @olugbemieric1757
    @olugbemieric1757 10 years ago

    Thanks a lot for sharing. This will surely be of help to me in my M.Sc.

  • @don2186
    @don2186 11 years ago

    I am doing the ML course on Coursera. It is a very good start for someone who wants to get going, learn what machine learning is all about, and do some exercises to get a feel for it. That said, simple derivations in calculus, which you should know from high school, are skipped and just the final formula is given, which is a little disappointing. I don't see how anyone can do machine learning without knowing basic calculus. Too much emphasis is placed on being nice.

  • @veronicanwabufo5905
    @veronicanwabufo5905 3 years ago

    It's been an excellent lecture so far, though I am not very clear on what script H and each h mean.

  • @nissarali8788
    @nissarali8788 6 years ago

    beautiful so far !!

  • @htf7
    @htf7 6 years ago

    Great explanation, sir. I like your accent; it is really like a Ratahan accent.

  • @nayanvats3424
    @nayanvats3424 4 years ago +1

    How did we sum up the RHS in Hoeffding's inequality? I mean, each of the hypotheses could have a different bound (epsilon) and hence a different exponential term. So how do they sum up to be substituted by "M" times the exponential? Also, if we keep the bound the same for each inequality, won't "N", the number of samples, change? How is the exponential consistent across all the hypotheses? Am I missing something?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      Hmm... a lot of questions here. You start off by defining what you find to be the maximum "acceptable" deviation of your selected hypothesis. This acceptable value is epsilon. The deviation is between the in-sample error and the out-of-sample error. You cannot guarantee this, but you can ensure that the chance of exceeding this deviation is smaller than a certain probability; this is why the whole probability discussion is brought in. Now g is just one of the h's in the hypothesis set. So if the in- and out-of-sample errors of g deviate by more than epsilon, this implies that (at least) one of the h's deviates by more than epsilon, so we can say that it must be the case that h1 deviates by more than epsilon, or h2 deviates by more than epsilon, and so on. The probability of deviation between in- and out-of-sample is independent of the number of red and green balls, i.e., it is independent of any h; that is why each h has the same bound.

  • @taritgoswami9793
    @taritgoswami9793 7 years ago

    Really brilliant lecture.

  • @spike345185
    @spike345185 4 years ago

    Love this guy

  • @marcogelsomini7655
    @marcogelsomini7655 2 years ago

    56:20 awesome , thx!!

  • @RD-lf3pt
    @RD-lf3pt 8 years ago +4

    Umm... I'm probably missing something ;) What formula is he using to get to 63% probability? If each coin gets 10 straight heads once every 1,024 times (say we run it infinite times... then the proportion should be 1 over 2 to the N, right? So 1 over 2 to the ten, so once every 1024 times roughly), then because the probability of each coin is independent, doesn't it mean that the probability should be almost 100%? (1000/1024)
    Ah, Ok... So even if you had 100 trillion flips each with a 99.9% chance of being heads, you still have 0.001 x 0.001... (100 trillion times) of a chance to get all tails.
    For this example, you have 1/1024 chances of it being heads 10 consecutive times, so you have (1-(1/1024)) chances of it having at least one of the 10 flips being tails... That is, if you have 1 in 1024 chances of it being heads 10 straight times, you have 1023 in 1024 chances of it not being that. And if it is not that, it means that, at least, there is one tails somewhere (at least one) that would break the chain. So over 1000 repetitions, you have (1-(1/1024)) to the 1000, or (1023/1024) to the 1000, or 37% chances to get at least one tail on each set of 10. So 63% chances approx. to get 10 consecutive heads.
    That being said, I still believe if the chances are 1 in 1024 to get 10/10 heads, for each 1024 attempts when the number of attempts goes towards infinity, we should get at least one of those to be 10 straight heads? So maybe it has to do with distribution? Like sometimes you can get 2 or more sets of 10 straight heads in your lot of 1000, while other times you may get none. So the chances of you finding (in a lot of 1000 tries) at least 1 set of 10 straight heads is 63%? (because they can form clusters, and sometimes you will get a group with none)
    Or maybe it doesn't have to do with that? I mean, what are probabilities, really? Say you have a 99.9% chance to get heads and 0.1% to get tails. You do it twice and the chances to get at least one heads are really high, of course. But there is a 1/1,000,000 chance of actually getting tails and tails... So if you go towards infinity, you might think the distribution would be: 99.9% of the time, no matter where or in what order, you get heads, and 0.1% of the time you get tails. But for N tries, there is a 0.001 to the N chance of actually getting all tails... So you can do it 100 trillion times, or go towards infinity, and there is still a very, very, very small, but real chance to, well, get all tails. So the chance is there, and now let's suppose it happens... Now, if that slim chance was the way events unfolded, then that option would have happened forever, infinite times, and the 99.9% chance would mean nothing. You might say, well, but if we run the experiment again, now we will probably get heads 99.9% of the time. So the 99.9% vs 0.1% probability isn't wrong... But actually, this new set of samples can be concatenated to the last, as they go towards infinity and the premise is that this will happen (eventually) infinite times, and ALL times, as a single time not getting tails would break the chain...
    So now we might say it is unlikely, but now think of a person seeing it, witnessing the event... Wouldn't they say chances are 100% tails?
    So one important thing is that probabilities don't guarantee you will get heads and tails in a proportion of 1/1024 and 1023/1024. They really don't. A probability of 90% doesn't mean something will happen 90% of the time, but that we believe it has "9 chances out of ten to be that". But once the drawing is made, it can happen only 70% of the time, or 2% of the time, and stay like this forever... At least that's my understanding of it after giving it some thought!

    • @fadaimammadov9316
      @fadaimammadov9316 8 years ago +13

      You are correct that the probability of getting 10 heads is 1/(2^10). Let's call this a. The probability of NOT getting 10 heads in 1000 flips is (1-a)^1000 and getting at least one such result is 1 - (1-a)^1000 = 62.36%

    • @RD-lf3pt
      @RD-lf3pt 8 years ago +3

      Yeah, I know ;) I have to admit it puzzled me for a while until I figured it out (see paragraph three), though!
      Thanks for the reply and clear explanation!
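
The arithmetic in this thread can be reproduced in a couple of lines (Python used here just as a calculator):

```python
# One coin: probability of 10 straight heads
p10 = 0.5 ** 10                 # 1/1024

# 1000 coins: probability that NONE of them gets 10 straight heads
p_none = (1 - p10) ** 1000

# ... so the probability that AT LEAST ONE coin does it
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 4))  # -> 0.6236, the ~63% from the lecture
```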

  • @jonsnow9246
    @jonsnow9246 6 years ago +10

    What was the conclusion of the lecture? I mean how did we prove that learning is feasible?

    • @desitravellers2023
      @desitravellers2023 6 years ago +42

      The main objective of learning, as laid out in the lecture, is finding a hypothesis that behaves similarly on the training data (in-sample) and the test data (out-of-sample). No matter the performance of the hypothesis on the sample, if we can show that the hypothesis performs approximately the same in-sample and out-of-sample, then we have essentially shown that learning is feasible, i.e., generalizing beyond the in-sample points is possible. The final modification of Hoeffding's formula states that with a reasonable choice of M, epsilon, and N, the probability of the in-sample performance deviating from the out-of-sample performance can indeed be bounded to an acceptable limit, thus proving learning is feasible. The fact that M is infinite in almost all models we come across, and that learning is still possible, is proved in the Theory of Generalization lecture. Thanks.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      "Is learning feasible" here means: can we, based on observations of our in-sample data, make statements about our out-of-sample data? In other words, can we generalize observations on our selected sample to the entire population (in the bin)?
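
The M/epsilon/N tradeoff discussed in this thread can be made concrete by inverting the modified bound 2M e^(-2 eps^2 N) <= delta for N (a sketch; the values eps = 0.1 and delta = 0.05 are arbitrary illustrations, not from the lecture):

```python
import math

def required_n(m, eps, delta):
    """Smallest N such that 2*M*exp(-2*eps^2*N) <= delta, i.e. the
    union-bound Hoeffding guarantees |E_in - E_out| <= eps for the
    selected g with probability at least 1 - delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

# A larger hypothesis set (bigger M) needs more data for the same guarantee,
# but only logarithmically more.
for m in (1, 100, 10_000):
    print(m, required_n(m, eps=0.1, delta=0.05))
```

Because M enters only through log(M), paying for more hypotheses is cheap in samples, which is why the bound, loose as it is, still says something useful.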

  • @brighty916
    @brighty916 6 years ago

    this is awesome man, thanks.

  • @pablogarcia-zo1um
    @pablogarcia-zo1um 7 years ago

    Fantastic lecture!!!! Thanks a lot.

  • @ArunPaji
    @ArunPaji 11 years ago +2

    Something like Feynman's lectures is being attempted.

  • @onurcanisler
    @onurcanisler 1 year ago

    *I understood the topic probably approximately correct.*

  • @Sonia1978NYC
    @Sonia1978NYC 11 years ago +1

    Probably Approximately Correct :-o I have the book by Leslie Valiant

  • @Shahada2012
    @Shahada2012 8 years ago

    Brilliant, ya Yaser.

  • @lolilops54
    @lolilops54 11 years ago

    Ahhh, thank you. Very well explained.

  • @pinocolizziintl
    @pinocolizziintl 11 years ago

    Thanks for pointing it out! I don't know how I could miss looking for an ML course on Coursera. The problem will be the overlap with the Cryptography one from Dan Boneh.

  • @GauravJain108
    @GauravJain108 5 years ago

    Just awesome!!!!

  •  11 years ago +2

    This is the right answer (1/1024) for the first question in the "coin analogy"

  • @rehantahirch
    @rehantahirch 12 years ago +1

    The Prof. is amazing. He also looks like Prince Charles

  • @thanhquocbaonguyen8379
    @thanhquocbaonguyen8379 2 years ago

    thank you for the lecture. It was really insightful, though it's hard for me to capture it all. I liked the questions that the students asked. Why do we have multiple bins? They were cute though haha

  • @dtung2008
    @dtung2008 6 years ago

    Didn't state Hoeffding's inequality correctly. The value of nu must be bounded within a range of width 1 for the formula to hold (at 20:30).

  • @VIVEKPANDEYIITB
    @VIVEKPANDEYIITB 2 years ago

    Since mu depends on the probability distribution, should it not be constant for all bins, i.e. all h? And it should be nu that changes with h and the bins. Why is mu different for different bins?

  • @izleaa
    @izleaa 11 years ago

    very nice explanation!

  • @user-wn1vz8yt9j
    @user-wn1vz8yt9j 6 years ago

    How is the probability distribution over X taken into account in the learning process? The marbles (sample) from the bin (space) are subject to the probability distribution. How does the probability affect learning? I only know that the multiple-bins problem necessitates the modification of the plain-vanilla Hoeffding inequality. The multiple bins are brought about by the number of hypotheses in the hypothesis set, not by the probability distribution over the X space.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      The essence of a probability distribution is that it enables you to state that the pattern you observe in your sample will, with some particular probability, reflect the pattern in the bin. Making a statement about the situation in the bin based on what you observe in your sample IS the learning statement. You cannot generalize (or learn) more than this. Suppose every marble in my sample is green. Can I say that all the marbles in the bin are green if the sample size is 10, 100 or 1000000? The answer is no. In fact I cannot state anything certain about the content of the bin, no matter how large my sample is or how it is made up. Saying I cannot say anything certain about the bin is the same as stating I cannot learn anything certain about the bin.
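
      The "all green marbles" point can be made quantitative (a small sketch; the 5% red fraction is just an illustrative assumption): an all-green sample never rules out red marbles in the bin, it only makes a large red fraction improbable.

```python
# If a fraction mu of the bin is red, the probability that a random
# sample of size N is nevertheless all green is (1 - mu)**N:
# it shrinks quickly with N, but it is never exactly zero.
mu = 0.05  # assumed fraction of red marbles in the bin
for N in (10, 100, 1000):
    print(N, (1 - mu) ** N)
```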

  • @theodoregalanos9272
    @theodoregalanos9272 7 years ago +1

    Hello and thank you for the wonderful lectures!
    I'm new to this field and I am trying to combine it with Computational Geometry. As such, my problems are unique in the sense that the training sets can (usually) be constructed at the will of the modeller. The data are always (potentially) there; it is a matter of choosing which to produce. I was wondering if there is a theoretical or practical approach to an optimized selection of training samples from the whole? Does that relate to assigning a specific (hopefully optimal in some way) P(x), e.g. a uniform distribution which takes samples uniformly from the whole? Or is random selection still good enough in this case?
    Thank you in advance.
    Theodore.

  • @delightfulsunny
    @delightfulsunny 10 years ago +7

    Remind myself that this is just foundation, and that it is dry and Zzzz.... but must ....keep...going..
    An hour later... really, the summary is that the more your model caters to a specific sample, the more prone it is to failure when it comes to the unknown. It is like a Fourier series: fitting too well to the data can lead to not actually learning at all >...

    • @alfonshomac
      @alfonshomac 10 years ago +1

      maybe you'd like Stanford's course better by Andrew Ng, Google it and check it out. I like it.

    • @MrCmon113
      @MrCmon113 5 years ago

      No that was the last lecture. This one wasn't really about that.

  • @esteban246
    @esteban246 11 years ago

    great teacher

  • @jonsnow9246
    @jonsnow9246 6 years ago +3

    55:59 Overfitting!!!

    • @-long-
      @-long- 4 years ago

      awesome! thanks

  • @granand
    @granand 7 years ago

    Thank you Caltech and Professor. But please can you help me: it's been decades since I touched maths, and I need to catch up. Tell me which links I must read and understand so I can get back here to follow along like you smart guys.

  • @bertrandduguesclin826
    @bertrandduguesclin826 3 years ago

    Why not write the RHS of the Hoeffding inequality as min(1, 2exp(-2Neps^2)), since a probability cannot exceed 1 anyway?
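
    The clipped form suggested here is equivalent as a bound and easy to write down (a minimal sketch; the N and eps values are illustrative). A value at the clip just means the bound is vacuous for that N:

```python
import math

def hoeffding_rhs(N, eps, M=1):
    # Clip the (union-bound) RHS at 1, since no probability exceeds 1
    return min(1.0, 2 * M * math.exp(-2 * N * eps**2))

print(hoeffding_rhs(10, 0.05))    # 1.0 -- the bound says nothing yet
print(hoeffding_rhs(5000, 0.05))  # ~2.8e-11 -- the bound is informative
```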

  • @judgeomega
    @judgeomega 11 years ago +3

    Ng's lectures are hardly even within the realm of a true 'lecture'. His command of the English language is very limited and he rarely explains anything beyond the iteration of formulas and proofs. You would be better served by reading a book on the subject.
    Abu-Mostafa is a TRUE teacher who walks you through the process. Of the dozens of lectures on this complex subject, his has the best compromise between content and approachability.
    /opinion

  • @shakesbeer00
    @shakesbeer00 6 years ago

    Thanks for the excellent lecture. Here are a couple of questions:
    At 42:22: just to be more rigorous, would the P notation in this Hoeffding inequality depend on both X and y?
    At 50:02: how exactly is g defined here in order to have this inequality hold? The inequality seems to require that g minimize |Ein - Eout|? But that is not intuitive. Instead, it is more intuitive to have g minimize Ein (or eventually Eout), based on the definitions of Ein and Eout earlier.

    • @shakesbeer00
      @shakesbeer00 6 years ago

      For my second question, I see it now, because it is a less-than-or-equal sign there, instead of an equals sign. That inequality always holds since g is one of the h in H. Thus whatever criterion is used for defining g is fine, as long as g is one of the hs.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      P is a selection probability that is assigned to X, i.e. it defines the probability of selecting certain x's. It has nothing to do with y (or Y).

  • @y-revar
    @y-revar 7 months ago

    "It's a logical thing rather than a mathematical thing"🤔

  • @DrNeelDas
    @DrNeelDas 7 years ago +1

    I did not understand the union bound concept. My doubt is: shouldn't the upper bound (the probability that the hypothesis that is selected is bad) be (1/M) times (2M exp{-2e^2 N}), assuming each hypothesis is equally likely to be selected? As an analogy, consider this question:
    "There are two bags containing white and black balls. In the first bag there are 8 white and 6 black balls, and in the second bag there are 4 white and 7 black balls. One ball is drawn at random from one of these two bags. Find the probability of this ball being black."
    In the above question, assume that drawing a black ball signifies a bad event. Thus P(bad event) = 1/2*6/14 + 1/2*7/11 = (1/2)*(6/14 + 7/11). In this example, M = 2.
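
    The averaging in this comment checks out arithmetically, but it answers a different question than the union bound, which bounds the chance that ANY bag (hypothesis) produces a bad event and therefore needs no selection probabilities. A quick check of both (sketch only):

```python
from fractions import Fraction

# Bag 1: 8 white + 6 black; bag 2: 4 white + 7 black.
# Pick a bag uniformly, then draw one ball: average the two bags.
p_black = Fraction(1, 2) * Fraction(6, 14) + Fraction(1, 2) * Fraction(7, 11)
print(p_black)  # 41/77, about 0.53

# The union bound instead ADDS the per-bag probabilities with no
# 1/2 weights: an upper bound on "a black ball could come from SOME bag".
union_bound = Fraction(6, 14) + Fraction(7, 11)
print(union_bound)  # 82/77 -- it can exceed 1, since it is only a bound
```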

    • @ronithm5340
      @ronithm5340 7 years ago

      I have the same doubt right now. Did you find your answer?

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      There is no probability associated with selecting hypotheses. Probability is only associated with selecting the sample data (x). The learning algorithm will (likely) examine many or all hypotheses. It chooses a particular hypothesis based on various criteria. Probability has nothing to do with this selection. Your thought error is probably that you saw that a hypothesis was "selected" and assumed this was a probability concept. This is (rightly) confusing. It would probably have been better to say a hypothesis was "chosen" in order to stay away from probability terminology.

  • @user-po7sf3se5r
    @user-po7sf3se5r 3 months ago

    Define the problem before?

  • @RababMuhammadALy
    @RababMuhammadALy 10 years ago +1

    very nice

  • @aviraljanveja5155
    @aviraljanveja5155 5 years ago

    Brilliant ! :D

  • @wafamribah4162
    @wafamribah4162 6 years ago +1

    One thing I couldn't figure out though is how the target function and the hypothesis would agree. How does the comparison occur?

    • @desitravellers2023
      @desitravellers2023 6 years ago

      Whatever happens in the bin is hypothetical. Just assume you have chosen a hypothesis h. This will agree with the target function in some cases and differ in others over the entire set of inputs, which is possibly infinite. The main takeaway is that you can compare it on the sample, i.e. the training data, for which the value of the target function is available. Thus the essence is: if you see that the hypothesis chosen by you agrees with the values of the target function on the sample, it will probably behave the same for out-of-sample data points within a threshold (according to Hoeffding's formula). Feel free to ask if you have further queries.

    • @adarshsingh6313
      @adarshsingh6313 6 years ago

      Sir, 1. can you please explain the hypothesis and target function in the bin-marble problem through some mathematical expression (as an example)....

  • @loganphillips1674
    @loganphillips1674 6 years ago +2

    28:47
    How do we compare the hypothesis to the target function if we never know what the target function is?

    • @hyp5094
      @hyp5094 5 years ago

      That's what I'd like to know. Please reply if you have figured it out. Thanks.

    • @prathameshmandke5966
      @prathameshmandke5966 5 years ago +5

      True, we do not know the target function itself, but what we do know is its value at certain points which are part of the dataset that we use to train our hypothesis. I believe the prof. refers to comparing the two functions at those points.

    • @roelofvuurboom5431
      @roelofvuurboom5431 3 years ago

      Don't equate "we don't know what the target function is" with "we don't know anything about the target function". As someone below already stated, we do know something about the target function, namely what its values are at the points that we sampled. So the more we sample, the more we get to know about the target function. Probability comes in to allow us to state a certain probability that the pattern in the bin will be "sufficiently similar" (this is the epsilon definition) to the sample chosen, as long as the sample is large enough.

    • @FsimulatorX
      @FsimulatorX 2 years ago

      From the data we already have.