RL Course by David Silver - Lecture 5: Model Free Control

  • Published: 30 Sep 2024
  • Reinforcement Learning Course by David Silver - Lecture 5: Model-Free Control
    Slides and more info about the course: goo.gl/vUiyjq

Comments • 179

  • @nirajabcd · 4 years ago · +73

    God bless internet. Thank you DeepMind and David Silver for making this world class course accessible to everyone.

  • @stefanweber2683 · 3 years ago · +1

    This lecture was an absolute pleasure. Thank you David!

  • @prateeksingh4078 · 3 years ago · +1

    At 20:48, how is the epsilon-greedy policy making it better? If we expand the weighting expression using the formula from the previous slide, all we're left with is the value from the left expression. Can someone explain in which cases it would be better?

  • @siddharthkotwal8823 · 8 years ago · +207

    Man David Silver's articulation is quite phenomenal.

  • @yasseraziz1287 · 3 years ago · +63

    0:00 Warm Introduction
    0:44 Outline
    1:20 Model-Free Reinforcement Learning
    2:10 Uses and Applications of Model-Free Control
    4:10 On and Off-Policy Learning
    5:40 Generalized Policy Iteration
    7:15 Generalized Policy Iteration with Monte-Carlo Evaluation
    13:30 Example of Greedy Action Selection
    15:50 Epsilon Greedy Exploration
    17:44 Epsilon Greedy Policy Improvement
    21:05 Monte-Carlo Policy Iteration
    23:40 Monte-Carlo Control
    25:37 GLIE (Greedy in the Limit with Infinite Exploration)
    28:27 GLIE Monte-Carlo Control
    35:17 Monte-Carlo Control in Blackjack
    38:46 MC vs TD Control
    41:00 Updating Action-Value Functions with Sarsa
    44:50 Sarsa Algorithm for On-Policy Control
    48:15 Convergence of Sarsa
    49:45 Windy Gridworld Example
    53:39 n-Step Sarsa
    57:20 Forward View Sarsa(lambda)
    1:00:24 Backward View Sarsa(lambda)
    1:05:10 Sarsa(lambda) Algorithm
    1:06:37 Sarsa(lambda) Gridworld Example
    1:19:50 Off-Policy Learning
    1:25:20 Importance Sampling
    1:28:50 Q-learning
    1:32:50 Off-Policy Control with Q-learning
    1:33:27 Q-learning Control Algorithm (SARSAMAX)
    1:34:10 Relationship Between DP and TD

  • @jordanburgess · 8 years ago · +187

    oo, he had a haircut :)

  • @JH-bb8in · 4 years ago · +33

    Love how the views drop as we get further into the course

  • @paveltikhonov8780 · 3 years ago · +20

    "Or learn how to fold proteins"
    I see what you did here)

  • @denizcanbay6312 · 4 years ago · +31

    It is like listening to Richard Feynman. His comprehension is phenomenal! Thanks for putting these online

  • @NganVu · 4 years ago · +60

    1:19 Introduction
    7:12 On-Policy Monte-Carlo Control
    38:46 On-Policy Temporal-Difference Learning
    1:19:51 Off-Policy Learning
    1:34:09 Summary

    • @mathavraj9378 · 4 years ago

      Thank you for the time stamps in every video

    • @hobby_coding · 3 years ago

      thank you very much

  • @LucGendrot · 4 years ago · +16

    This feels like the kind of lecture where, if you don't understand absolutely everything that was discussed you shouldn't move on until you do.

    • @arielfayol7198 · 1 month ago · +1

      I actually recommend otherwise. I watched the entire course, then attempted the homework. The first two questions were very straightforward, so I did them, then came back to watch the lectures to verify my solutions. I was understanding the concepts better while rewatching, and there is no way I would have understood this much without doing the homework, rather than expecting to understand everything first before moving on.

    • @vinith3773 · 21 days ago

      Pfft... explore/exploit repeat

  • 7 years ago · +43

    "I only get this one stream of data that we call life" 1:24:50 :D

    • @alvinphantomhive3794 · 4 years ago · +3

      yep, and only one single episode, so make sure you pick the right function approximator to deal with it. :)

    • @sohampatil6539 · 3 years ago

      @@alvinphantomhive3794 Luckily we use online learning XD

    • @alvinphantomhive3794 · 3 years ago · +1

      @@sohampatil6539 Oh yeah, that's true for most people. Anyway, it's a year-old comment, so I need to update my statement: we as humans are actually unable to choose the function approximator. It is the brain by default! Sophisticated performance, but not robust enough :D

  • @stefano3808 · 3 years ago · +7

    it's funny how it always looks like he is going to cough but he is not

  • @MattClimbs · 8 years ago · +52

    Monty Python? 24:10

  • @ck5300045 · 8 years ago · +21

    In the slide of Q-learning (1:29:11), shouldn't the "Next action chosen using the behavior policy" be denoted by A_t instead of A_{t+1}? Because it is chosen at state S_t.
    Also, shouldn't the successor action A' be drawn from state S_{t+1} instead of S_t?

    • @zhaslandochshanov4910 · 8 years ago · +4

      I think in both cases it should be S_{t+1}. The actual action made at state S_{t+1} is A_{t+1}, but the Q-table is updated using the A' action.

    • @jwu08 · 7 years ago · +2

      How can you update towards A' using R_{t+1} if R_{t+1} comes from the action taken by A_{t+1}? That was the bit I didn't understand from the slide. Don't you end up polluting your Q values with rewards that didn't actually come from actions that you took?

    • @MinhVu-fo6hd · 7 years ago · +2

      The slide is correct. If you go back to his first or second lecture, he mentioned the time notation and the state notation. Basically, R_{t+1} is the reward you get when you are at S_{t} and take action A_{t}, so R_{t+1} in the target is correct. I hope it helps.

    • @claudiocimarelli · 7 years ago · +2

      Actually I was wondering if there was an error there.
      On the 3rd and 4th bullet points, I think the action A_t+1 and the alternative A' should be drawn from the state S_t+1. This is because S_t+1 is the state you ended up acting in at the step before, from state S_t with action A_t, gaining the reward R_t+1.

    • @kiuhnmmnhuik2627 · 7 years ago · +23

      @Evan Su Yes on both counts, if I interpret the slide correctly.
      Just to be clear:
      1. we start from S = S_t
      2. we sample A from mu(.|S), the behavior policy
      3. we end up in S'
      4. we sample A' from pi(.|S'), the target policy
      5. we update q(S, A)
      6. we set S = S' and go back to 2.
      Note that:
      - the update to q(S,A) does not depend on the behavior policy, but only on the target policy. While it's true that the behavior policy determines *which* q(S,A) we'll update, once we decided we want to update a particular q(S,A), the update itself won't depend on the behavior policy but only on the target policy.
      - By setting S = S' and forgetting A', the trajectory will be independent of the target policy.
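      To make that step list concrete, here is a minimal Python sketch of the off-policy update it describes (my own illustration, not code from the lecture; the greedy target policy, the epsilon-greedy behaviour policy, and the toy dynamics are assumptions):

        import numpy as np

        n_states, n_actions = 4, 2
        gamma, alpha, epsilon = 0.9, 0.1, 0.1
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(0)

        def behaviour_action(s):
            # step 2: sample A from the epsilon-greedy behaviour policy mu(.|S)
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))

        def q_learning_update(s, a, r, s_next):
            # steps 4-5: A' comes from the greedy target policy pi(.|S'),
            # so the update ignores what the behaviour policy will do next
            td_target = r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (td_target - Q[s, a])

        s = 0
        for _ in range(100):
            a = behaviour_action(s)
            s_next = (s + a) % n_states          # made-up deterministic dynamics
            r = 1.0 if s_next == n_states - 1 else 0.0
            q_learning_update(s, a, r, s_next)
            s = s_next                           # step 6: continue from S'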

  • @DotcomL · 7 years ago · +18

    This lecture really puts it all together so well. And the questions helped a lot too!

  • @alvinphantomhive3794 · 4 years ago · +8

    Am I the only one who feels a bit uncomfortable skipping over the math (e-greedy) and pretending it's alright after only grasping the main idea from David Silver... lol

  • @AwesomeLemur · 3 years ago · +3

    "...We only got this one stream of data we call life."
    man didn't expect you to get so deeeeep

  • @willlogs · 4 years ago · +4

    I put everything on 2x speed. I put this one 0.25x :)))

    • @honestexpression6393 · 4 years ago

      I watched this at 2x too lol. Will defo have to re-visit.

    • @willlogs · 4 years ago

      @@honestexpression6393 you have fast mind!

  • @nichonifroa1 · 6 years ago · +10

    24:15: In the context of Monty P... Carlo.

  • @miumiuli9496 · 3 years ago · +5

    Currently studying this module for masters, man I wish I was born a few years earlier and went to UCL. David Silver is honestly so smart.

  • @dhfromkorra · 7 years ago · +55

    David "Gold"!

  • @AtoZshow115 · 7 months ago · +1

    What a way to introduce a lecture: "In a way, everything in the course has been leading up to this lecture", as he smiles ear to ear. I can't believe this got me hyped.

  • @karthik-ex4dm · 6 years ago · +5

    No online course, paid or free, can beat his content...
    GLIE MC looks so easy in theory but harsh in code
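    If it helps, here is a minimal sketch of GLIE Monte-Carlo control as described in the lecture (epsilon_k = 1/k, step size 1/N(s,a)); run_episode is a placeholder of my own that rolls out one episode with the supplied policy and returns a list of (state, action, reward) triples:

      import random
      from collections import defaultdict

      def glie_mc_control(run_episode, n_actions, episodes=5000, gamma=1.0, seed=0):
          rng = random.Random(seed)
          Q = defaultdict(float)                 # Q[(s, a)]
          N = defaultdict(int)                   # visit counts N(s, a)

          def policy(s, epsilon):
              if rng.random() < epsilon:
                  return rng.randrange(n_actions)
              return max(range(n_actions), key=lambda a: Q[(s, a)])

          for k in range(1, episodes + 1):
              epsilon = 1.0 / k                  # GLIE schedule: greedy in the limit
              episode = run_episode(lambda s: policy(s, epsilon))
              G = 0.0
              for s, a, r in reversed(episode):  # accumulate the return backwards
                  G = r + gamma * G
                  N[(s, a)] += 1
                  Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
          return Q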

  • @jldupont · 7 years ago · +11

    This guy totally rocks!

  • @zhaoxuanzhu4364 · 5 years ago · +13

    This series is seriously addictive. Be careful with your time management!

    • @prasannautube1 · 5 years ago · +4

      Very true. I have been looping again and again for almost a week on lectures 4 and 5.

    • @alvinphantomhive3794 · 4 years ago · +5

      i guess the addiction comes from the uncomfortable feeling of not being able to grasp the lecture information

  • @ThePaypay88 · 5 years ago · +4

    Dude is an amazing teacher and a fluent speaker, even for non-native English speakers

  • @muratcan__22 · 4 years ago · +4

    for the first time in 5 lessons, he needed to correct himself

  • @akshayshrivastava97 · 4 years ago · +3

    Another great lecture, and an exemplification of what someone who knows their field like the back of their hand sounds like.

  • @abdullahal-dujaili4252 · 8 years ago · +6

    In the slide on the e-greedy policy improvement theorem, shouldn't $q_\pi(s, \pi'(s))$ in the first equation be $v_{\pi'}(s)$ instead?

    • @zl7460 · 6 years ago

      They are equivalent by construction here.

    • @Qiang12345 · 6 years ago

      according to the textbook, policy pi is not an epsilon-greedy policy, but an epsilon-soft policy. The left part of the equation is correct.

    • @pi3ni0 · 6 years ago

      Do you mean Sutton&Barto's textbook? I didn't read the textbook, but I think putting value function in the first equation on the slide is easier to follow.

    • @edmonddantes4705 · 1 year ago

      Come on guys, pi' is a stochastic policy by definition, so there is clearly a mistake in the slides, since $q_\pi(s,\pi'(s))$ is a random variable. On the left-hand side, it should be the expectation of $q_\pi(s,\pi'(s))$. Plus, obviously the argument is incomplete, since you have to estimate the expectation of $q_{\pi'}(s,\pi'(s))$ instead of that of $q_\pi(s,\pi'(s))$. He even says this in the lecture.

    • @edmonddantes4705 · 1 year ago

      @@zl7460 Not true. pi' is not deterministic.
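      For reference, here is the chain of (in)equalities from that slide, reconstructed in LaTeX with the expectation over pi' written out explicitly as a sum (which is the point being made above):

        \begin{aligned}
        \sum_a \pi'(a|s)\, q_\pi(s,a)
          &= \frac{\epsilon}{m} \sum_a q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\
          &\ge \frac{\epsilon}{m} \sum_a q_\pi(s,a)
             + (1-\epsilon) \sum_a \frac{\pi(a|s) - \epsilon/m}{1-\epsilon}\, q_\pi(s,a) \\
          &= \sum_a \pi(a|s)\, q_\pi(s,a) = v_\pi(s),
        \end{aligned}

      where the inequality holds because the weights (pi(a|s) - epsilon/m)/(1 - epsilon) are nonnegative (pi is assumed epsilon-greedy) and sum to 1, so such a weighted sum cannot exceed the max.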

  • @SayanGHD · 4 years ago · +3

    49:26 in practice we don't worry about this and this either. :)

  • @aesclamadoca · 2 years ago · +2

    this series is phenomenal. No wonder every other lecture series references or copies it

  • @keshavg6767 · 3 years ago · +3

    God Bless this man

  • @simon.p · 6 years ago · +3

    1:30:50 Shouldn't the condition state for mu and pi be S_t+1, not S_t?

    • @ernestoyou8610 · 5 years ago · +1

      Semin Park A_t+1 should be A_t, I think this correction is better.

    • @Thvnvtos · 4 years ago

      @@ernestoyou8610 No, I think it's At+1, it's the action that you're going to follow next, so basically you sample A' from pi to update Q(St, At) and then sample At+1 from mu and follow it.

  • @jg9193 · 5 years ago · +11

    The theory is great and all, but it would be really helpful to see some code + explanations

  • @OmPrakash-vt5vr · 2 months ago

    I think there needs to be a small correction at 1:30:57: the third bullet point states that we sample At+1 from the behavior policy given state St. At first glance this looks incorrect, because At+1 must be sampled given state St+1; also, the update equations make more sense if At was sampled from the behavior policy given St.

  • @randalllionelkharkrang4047 · 1 year ago · +1

    I've been having this question about both SARSA(lambda) and TD(lambda): why do we need the summation of all the lambda weights to add up to 1?

    • @edmonddantes4705 · 1 year ago · +1

      So that the sum becomes an unbiased estimator of the discounted return in the limit :).
      Say X_1, X_2, ..., X_n, ... are samples of a random variable with finite variance and expectation mu, and consider the infinite sum over n:
      X = (1 - lambda) * sum_n lambda^(n-1) * X_n.
      Then E[X] = mu. If the geometric sequence of lambda weights summed, say, to 1/2, you would have E[X] = mu/2, which obviously wouldn't be desirable.
      Note: similar sums are very common in machine learning. Look at adaptive optimizers based on moving averages like Adam, or at exponential moving averages applied to weights during training :)
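      A quick numerical check of that weighting (a throwaway snippet of mine, with the infinite sum truncated):

        import numpy as np

        lam, N = 0.9, 2000
        weights = (1 - lam) * lam ** np.arange(N)   # (1 - lambda) * lambda^(n-1), n = 1..N
        print(weights.sum())                        # ~= 1.0: the lambda-return is a proper average
        returns = np.full(N, 5.0)                   # identical "sampled returns" with mean 5
        print(weights @ returns)                    # ~= 5.0: the weighting does not rescale the estimate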

  • @forheuristiclifeksh7836 · 1 month ago · +1

    46:45 sarsa for online

  • @simon.p · 6 years ago · +2

    @53:44 On the n-step Sarsa slide, I think there's a notation mistake. The Q function should have both S and A as inputs. Other than that, great lecture :D

    • @alvinphantomhive3794 · 4 years ago

      Which Q function are you referring to? Is it the n-step Sarsa updates? Everything seems fine to me.

    • @edmonddantes4705 · 1 year ago

      ​​@@alvinphantomhive3794 He refers to q_t^1, q_t^2, etc, and he is right that they should depend on both S_t, A_t, but the dependence is sort of obvious, so David didn't write it.

  • @DrN007 · 2 years ago

    Your description of off-policy, till Q-Learning, is overly complicated.

  • @WLDLNG · 5 years ago · +1

    I didn't understand how following a greedy policy will cause an exploration problem. Don't we explore all possible states and find the value of every state-action pair during the policy evaluation phase? Once we have explored the entire state space, how will selecting the actions with the highest q value cause an exploration problem?
    Can someone help me out?

    • @alessandrozhao · 5 years ago

      As this is model-free, I think you are not able to explore all possible states and perform the policy evaluation as in model-based RL, due to the lack of P_ss'. As a result, we are working with the action-value q, and the way to update it is to sample a bunch of episodes, and for each episode q(s) = R(s) + discount*q(s+1) + ... Then you take the average and that's your new q. Now it depends entirely on the samples you take to update q, so exploring all the states is not guaranteed. My two cents.

  • @JatanBuch · 1 year ago

    Could the voice at 1:31:03 be David Runciman's?

  • @noamzilo6730 · 5 years ago · +1

    20:52 but why wouldn't I get stuck in a local max? the theorem only assures I wouldn't make the policy any worse, not that I would be making it better

    • @edmonddantes4705 · 1 year ago

      And who said you couldn't get stuck in a suboptimal policy? You can get stuck and you will typically get stuck, that is precisely why GLIE is proposed. GLIE ensures that the exploring policy is greedy in the limit, so epsilon->0 if you are using a GLIE algorithm based on epsilon greedy policies. In other words, epsilon is ideally ADAPTIVE.

  • @souradeepkar · 5 months ago

    He's too good at explaining. Lovely :)

  • @jackcarreira6931 · 5 years ago · +1

    Why does the proof at 18:03 make sense at the end result, when it compares q_pi'(s, a') >= V_pi(s)?
    The q-function has the values of each action possible at each state.
    Taking the mean of these at each state gives us the State Value function.
    So V_pi(s) is a mean between possible actions.
    q_pi'(s, a') has the value of ONE of the actions at that state, a': chosen by the new policy pi'.
    Started with:
    q_pi'(s, a') >= V_pi(s)
    I'm interpreting this as a relation between a 'max' and a mean (as the policy tries to maximize this reward).
    But this is (almost) always true: a max greater than the mean.
    If the policies are different, this inequality cannot ensure this new policy is better.
    Is that the case? or am I missing something?

    • @jackcarreira6931 · 5 years ago · +1

      Found this: stats.stackexchange.com/a/304406/249143

    • @alvinphantomhive3794 · 4 years ago

      @@jackcarreira6931 woah, thanks so much.. i paused for like few hours to grasp these ideas

    • @edmonddantes4705 · 1 year ago

      There is an expectation missing on the left-hand side of q at the beginning of the proof. Note that pi' is not deterministic, so pi'(s) is a random variable.

  • @noamzilo6730 · 5 years ago · +1

    1:00:10 Why does it make sense to take into account small lookaheads when we do take into account large lookaheads? If we advance further into the future, it would make sense to take the real knowledge into account rather than weigh in assumed knowledge. What am I missing?

    • @vedgupta1686 · 4 years ago

      Variance man, variance

    • @edmonddantes4705 · 1 year ago

      (1) You incur variance. (2) You lose online updates. Talking about variance: when lambda goes towards 1, TD(lambda) gets close to MC and you trade variance for a less biased estimator (more variance, less bias). When lambda goes towards 0, you trade bias for a reduction in variance (more bias, less variance).

  • @robertpeharz8217 · 2 years ago

    Great lecture, as always! I'd have a question about GLIE Monte Carlo (around 30:00). The theorem says that GLIE MC control will converge to q_*. However, Sutton and Barto (page 99) mention that convergence of MC control with exploring starts (a somewhat simpler setting than GLIE) is theoretically an open problem. So, has the convergence of GLIE MC control actually been proven?

  • @shivendraiitkgp · 5 years ago

    I wrote a SARSA(0) code for solving the windy grid world at 51:00. Might be helpful for others.
    Github link - github.com/ShivendraAgrawal/RL_David_Silver/blob/master/SARSA_windy_gridworld.py
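    For anyone who prefers to see the shape of the algorithm inline, here is a minimal tabular Sarsa(0) sketch (not the linked code; the step function is a placeholder you would replace with the windy-gridworld dynamics):

      import numpy as np

      def sarsa0(step, n_states, n_actions, episodes=500,
                 alpha=0.5, gamma=1.0, epsilon=0.1, start=0, seed=0):
          """Tabular Sarsa(0); step(s, a) -> (s_next, reward, done) is environment-specific."""
          rng = np.random.default_rng(seed)
          Q = np.zeros((n_states, n_actions))

          def eps_greedy(s):
              if rng.random() < epsilon:
                  return int(rng.integers(n_actions))
              return int(np.argmax(Q[s]))

          for _ in range(episodes):
              s, a, done = start, eps_greedy(start), False
              while not done:
                  s2, r, done = step(s, a)
                  a2 = eps_greedy(s2)
                  # on-policy target: uses the action we will actually take next
                  Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2, a2]) - Q[s, a])
                  s, a = s2, a2
          return Q

      # toy chain environment, just to exercise the function (not the windy gridworld)
      def chain_step(s, a, n=10):
          s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
          return s2, -1.0, s2 == n - 1

      Q = sarsa0(chain_step, n_states=10, n_actions=2)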

  • @hardianlawi8647 · 6 years ago · +1

    What is the idea of taking the ratio between the target policy and behavioral policy in Importance Sampling?

    • @borispyakillya4777 · 6 years ago · +1

      It's a kind of similarity estimation between the policies: how much your target policy looks like the behavioral one at particular states.

    • @edmonddantes4705 · 1 year ago

      You want to evaluate the return of policy pi. Note that this is computed as an expectation, which can be approximated by sampling returns from trajectories FOLLOWING PI. In general, when sampling and approximating this expectation, you want to attach more weight to trajectories that are more common under pi than to trajectories that are rare under pi. This occurs naturally when you compute the expectation on-policy. However, if you are sampling off-policy, you might often be seeing trajectories that are common under mu but uncommon under pi, and you do not want to weight those too much. The ratio between the target policy pi and the behavior policy mu achieves precisely this. If a trajectory is common under mu but uncommon under pi, it will attach less weight to the sample of the return, since we are sampling under mu. Conversely, if some trajectories are uncommon under mu but common under pi, the ratio will increase the weighting, since now that sample return is considered to be "more representative". In summary, that ratio ASSIGNS A NEW WEIGHT TO THE EXPERIMENTAL RETURN, IT REWEIGHTS THE EXPERIMENTAL RETURN BY TAKING INTO ACCOUNT HOW IMPORTANT/REPRESENTATIVE IT WOULD BE IF WE WERE FOLLOWING PI.
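      A small sketch of that re-weighting for a single episode sampled under mu (ordinary importance sampling; the function name and the toy policies are my own):

        def is_weighted_return(states, actions, rewards, pi, mu, gamma=1.0):
            """One importance-sampled return: (prod_t pi(a_t|s_t)/mu(a_t|s_t)) * G."""
            G, w = 0.0, 1.0
            for t, (s, a) in enumerate(zip(states, actions)):
                w *= pi[s][a] / mu[s][a]   # up-weight trajectories that pi would favour
                G += (gamma ** t) * rewards[t]
            return w * G                   # unbiased but potentially very high-variance sample

        # toy example: 2 states, 2 actions, policies as nested dicts of probabilities
        pi = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
        mu = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
        print(is_weighted_return([0, 1], [0, 1], [1.0, 0.0], pi, mu))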

  • @giampierocampa4099 · 11 months ago

    Thank you for finally clearly explaining the equivalence between forward and backward view. The book is a little confusing about this, but this example at the end on the gridworld really clarified in my mind why the two things are basically the same!

  • @vivekpadman5248 · 1 year ago

    Thank you soo much sir for this great series... really learnt a lot and motivated me to work even harder

  • @Paul-rs4gd · 7 years ago · +1

    I love the lectures. Thank you.
    I have a question about the backward Sarsa(lambda) algorithm implementation. Do we really need to store the eligibility trace? It seems like nothing really happens until a reward is received (an event). When (and only when) the event occurs, we could run back along the trajectory updating the Q(s,a) values visited in the trajectory by alpha*delta*(decaying weight)? Only state/action values in the trajectory would be updated, but only state/action pairs in the trajectory could have non-zero eligibility traces anyway. Maybe what I am saying just swaps storing the eligibilities for storing the trajectory, but it seems more efficient.

    • @kiuhnmmnhuik2627 · 7 years ago · +2

      The lack of such implementation details in pseudo-code is intentional. The method you propose would be quite slow in practice when you keep receiving rewards every few steps. You could solve the problem by maintaining a set of q(s,a) whose eligibility traces are big enough.

    • @Paul-rs4gd · 7 years ago

      Thanks. I think I see what you mean. The trajectory could become quite long and my proposal traverses it every time a reward is received. So you are suggesting only keeping state/action pairs, from the trajectory, where the eligibility has not decayed below a certain threshold. In practice that would be a few of the most recently visited states in the trajectory. Older pairs would have very small eligibilities and removed from the list.

    • @kiuhnmmnhuik2627 · 7 years ago

      Yes, that's exactly what I meant! Moreover, in practice we can represent states through features and then only store eligibility traces for the features. I think this is discussed in lecture 7 of this course. More generally, instead of just remembering which state to change, we remember what changes we should make to the weight vector (of a neural net, for instance) to modify the value of the state we're interested in. In other words, we remember the gradient for influencing a particular state. The main advantages are that (1) the weights are usually way fewer than the states and (2) we can generalize over similar states.
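      For completeness, a sketch of the sparse-trace variant discussed in this thread: eligibility traces kept in a dictionary and dropped once they decay below a cutoff (step is again an environment-specific placeholder, as in the Sarsa(0) sketch further up the page):

        import random
        from collections import defaultdict

        def sarsa_lambda(step, n_actions, start, episodes=200, alpha=0.1,
                         gamma=0.99, lam=0.9, epsilon=0.1, cutoff=1e-3, seed=0):
            rng = random.Random(seed)
            Q = defaultdict(float)                      # Q[(s, a)], implicitly zero

            def eps_greedy(s):
                if rng.random() < epsilon:
                    return rng.randrange(n_actions)
                return max(range(n_actions), key=lambda a: Q[(s, a)])

            for _ in range(episodes):
                E = {}                                  # sparse eligibility traces
                s, a, done = start, eps_greedy(start), False
                while not done:
                    s2, r, done = step(s, a)
                    a2 = eps_greedy(s2)
                    delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
                    E[(s, a)] = E.get((s, a), 0.0) + 1.0    # accumulating trace
                    for key, e in list(E.items()):
                        Q[key] += alpha * delta * e
                        e *= gamma * lam
                        if e < cutoff:
                            del E[key]                  # forget pairs whose trace has decayed away
                        else:
                            E[key] = e
                    s, a = s2, a2
            return Q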

  • @오식이-r2l · 1 year ago

    prof. david silver's hair is so cute

  • @43SunSon · 4 years ago · +3

    I have to say, David Silver is slightly smarter than me :)

  • @florentinrieger5306 · 1 year ago

    This whole series is just so brilliant!

  • @yongbiaochen8978 · 6 years ago

    Does anyone have any idea how I might get the slides for this course? The URL provided is not valid now, and I've also checked the 2015 course website, but only the slides for the first two lectures can be found.

    • @hoseinfooladi6945 · 6 years ago · +1

      www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

  • @raihaan1819 · 4 years ago

    1:34:08 - Looking at the pseudocode, I don't see why Q-Learning is off-policy when the actual Q function is updated on every step... Bit confused here because this looks like an algorithm which could run online

    • @geekfun8595 · 3 years ago · +2

      It's because when updating the Q function you don't move toward the target defined by (the reward you just received + gamma*Q(next state, the action you will actually perform at the next state)); instead you move toward (the reward you just received + gamma*max over actions of Q(next state, action)).
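      In code the distinction is just the target (a tiny sketch of mine; Q here is assumed to be a list of per-state action-value lists):

        def sarsa_target(r, s2, a2, Q, gamma):
            # on-policy: evaluate the action the behaviour policy actually takes next
            return r + gamma * Q[s2][a2]

        def q_learning_target(r, s2, Q, gamma):
            # off-policy: evaluate the greedy action, regardless of what we do next
            return r + gamma * max(Q[s2])

        # either way, the update is Q[s][a] += alpha * (target - Q[s][a])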

  • @herrizaax · 3 years ago

    24:12 monty py.. carlo xD

  • @stevens.k.markrov3547 · 5 years ago

    In the SARSA algorithm, does it matter if the Q value exceeds the max reward? E.g., the reward that can be awarded after each step ranges from -1 to 1, and Q for one state is something like 1.2 or 4.5 or -2.3, etc. Does this mean my code has errors, or is it normal?

    • @vissarionchristodoulou6822 · 4 years ago · +1

      I do not know if this is still relevant to you, but the situation you are describing is absolutely normal, because Q(s,a) does not only account for the immediate reward, but is an estimate of the long term reward of being in state s and taking action a. The only thing that I would like to point out as a technicality, is that Q does not refer to states only, but rather to a pair of a state and an action.

  • @PyramidDJ · 6 years ago

    How can I run Monte Carlo sampling with policy evaluation? If I have a policy, that policy tells me what action I should take. If I use Monte Carlo, do I randomly walk through the states? How does it fit together, and where am I missing a piece?

    • @idali24 · 6 years ago

      Actually, Monte Carlo starts each episode of exploration from a random state-action pair (s, a),
      then updates the action-value Q for all pairs (s, a) appearing in the episode, followed by policy improvement for all visited states.
      After a large number of episodes, and if we are using epsilon-greedy, we guarantee infinite exploration and thus maximal policy improvement and convergence towards the optimal policy.

  • @ildaalushaj8627 · 3 years ago

    Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?

    • @edmonddantes4705 · 1 year ago

      The experimental returns that are averaged in both MC and every-visit MC in order to estimate the total return are UNBIASED SAMPLES of the real return. In other words, MC and every-visit MC are unbiased estimators. The convergence to the real returns is due to the law of large numbers. If you want to take a look at how the density looks formally, see the definition of
      \rho^\pi (it appears slightly before equation (1)) in proceedings.mlr.press/v32/silver14.pdf. That's the density you are sampling from, but in an episodic environment.

  • @sck3570 · 7 years ago · +1

    awesome

  • @FirespyWins · 7 years ago

    Wow, perfect lecture until the end. Excited to see the rest! Thank you so much for uploading this!

  • @rudolfreiter5217 · 4 years ago · +2

    He keeps on exploring the Monty Python direction!

  • @bajza6046 · 3 years ago

    i like ya cut g

  • @chamangupta4624 · 3 years ago

    very good lec

  • @dannydrinkswater · 6 years ago

    One question on the Sarsa(lambda) example: since we know the trajectory, do we update the error flow in reverse order, so that every state would have a valid update with the error flow?

    • @alvinphantomhive3794 · 4 years ago

      It's not updated in reverse order. In Sarsa(lambda), when you reach the terminal state, or any state that carries a reward, you immediately update every state-action pair in proportion to its eligibility trace and the TD error (Sarsa(lambda) algorithm, 1:05:19). And since the eligibility trace of each state-action pair decays the further you move away from it, it looks like you update Q in reverse order, but you don't.
      I guess you can take a look at the explanation of the question that was asked at 1:12:28.

  • @timzhang6962 · 8 years ago

    At 28:45, if my robot is exploring the path for a maze, how does it know what the reward is when performing a certain action or being at certain state?

    • @gregcolbourn4637 · 7 years ago

      The reward is given by the setup (e.g. the goal in a maze, or the score on the Atari game). In more practical scenarios, presumably setting the utility (reward) function is a whole other thing that needs doing (e.g. program the robot to recognise a sensor telling it it's reached the end point of a maze). The process described in the lecture is how it works out which path to take - of course it can only start weighting moves after it's actually reached the goal once.

  • @cheungtyrone3615 · 4 years ago

    I have a question about the Sarsa(\lambda) algorithm. By using eligibility traces, the policy gets updated at each step; however, it is still not completely online in that the agent needs to store all the states it has travelled through during the walk, right?
    Also, Sarsa(0) isn't actually that bad, right? The case where only one update is made occurs only when the immediate reward of each state-action is identically negative except for the goal position.
    And in off-policy learning, you are assuming that you know the exact formulation of, for example, the other agent's policy and can act according to it. However, in most cases, for example by observing human reactions to the environment, all you get is a sample drawn from the human's policy. In that case, do we just initialize our Q values for the observed state-action pairs in the samples as if the agent had experienced these samples by itself?
    In addition, how is Q-learning better than maintaining only one \varepsilon-greedy policy with decaying \varepsilon? Should the greedy policy generate a suboptimal result and the true optimal policy travel through a vastly different set of states, neither Q-learning nor \varepsilon-greedy is capable of exploring these unvisited states. So it is like Q-learning is not combining an exploratory and an exploiting policy but actually just looking ahead and sampling one more step.

    • @edmonddantes4705 · 1 year ago

      (1) Sarsa(lambda) (I am assuming the backward view from now on) does not update the policy as you write. It updates the action-value function. Moreover, we are assuming the state space S to be fixed and known, and it is always stored (in any algorithm; we are in the look-up table case). I think you mean that the eligibility traces of all the visited states in that episode are updated at every step, which is true. However, updating eligibility traces does not make the algorithm offline. Online means that you don't wait until the end of the episode to update your value function, which you don't, since you make updates on every state transition. That's the whole point of online algorithms. In every iteration of Sarsa(lambda), multiple values are updated (the ones associated with visited states) and the eligibility traces are also updated. The algorithm is fully ONLINE, since updates occur every time you transition from one state to another.

    • @edmonddantes4705 · 1 year ago

      (2) In the off-policy learning algorithms presented in the slides, it is enough to be able to sample from your behaviour policy, so what you are saying is not true. You don't need the MDP formulation, it is model-free. It is made very clear in the slides. Note that you select the states and actions that will be updated by following your behaviour policy, but the values at those points are updated towards those of your target policy.

  • @alessandrozhao · 5 years ago

    I am thinking that, since using q (action-value) is quite convenient here for model-free control, why were we not using q for model-based policy iteration as well? I feel like in principle we can use either v or q for model-based policy iteration, so I wonder, for model-based control, is there any advantage of using v over q?

    • @japneetsingh5015 · 5 years ago

      1) The dimensionality of q(s, a) is much higher than that of v(s).
      2) The optimal policy will maximize v(s) over all actions. Hence our agent will perform better in a stochastic environment.

    • @edmonddantes4705 · 1 year ago

      @@japneetsingh5015 One could do it of course, but why go for something that is higher dimensional? It means higher computational cost. Note that in the model-based case, one can calculate q* directly from v*, so it seems much better to do that at the end instead of updating q all the time.

  • @armanyekekhani663 · 4 years ago

    Magnificent!

  • @BorisBrodski · 5 years ago

    Great lecture! One question: Why do we need to store the whole matrix E(s,a)? A list of (s, a) for current episode should be enough. Right?

    • @alvinphantomhive3794 · 4 years ago

      so you can calculate the eligibility traces in proportion to each pair (s,a).

    • @edmonddantes4705 · 1 year ago · +1

      Yeah, it is enough of course :)

  • @kevinmcinerney9552 · 7 years ago

    There are 10^120 possible chess positions according to Shannon. Can sampling from each state really work? I'm reading Sutton and following these lectures, but something isn't adding up for me. Can anybody point me in the right direction...?

    • @kiuhnmmnhuik2627 · 7 years ago · +2

      When chess players think several moves ahead, they're basically doing smart sampling.

    • @yoloswaggins2161 · 7 years ago

      You just estimate it with a function approximator, getting your function approximator to generalize is a completely different tale though.

    • @zl7460 · 6 years ago

      In each episode we need to update a few states only.

  • @aswahd · 3 years ago

    38:44

  • @Pedro-rc8mu · 7 years ago

    At 17:43, why does the proof only hold if pi is eps-greedy? Why can't it be an arbitrary policy? ( I know it cant be right but I don't see what would be wrong in the proof).

    • @AbhishekSharma0318 · 6 years ago

      The second line of the proof won't follow.

    • @zl7460 · 6 years ago

      No it's because this makes the weights in 3rd line >=0. The 2nd line is from pi' not pi so it's irrelevant.

    • @edmonddantes4705 · 1 year ago

      First, choose a policy pi* that is optimal already and improve it epsilon-greedily. Now the policy becomes worse, so it is clear that the theorem does not work in this case.
      Now, where does the proof break? Abhishek claims the second line will not work, which is not true haha. The second line definitely works, since we are assuming that pi' is epsilon-greedy. One of the things that break for sure is the inequality in the third line, since the weights in the sum could now be negative (note that since pi is not necessarily epsilon-greedy, there is no guarantee that pi(a|s)-epsilon/m is nonnegative).

  • @jimbobbillybob · 5 years ago

    what does GLIE stand for?

    • @jimbobbillybob · 5 years ago · +1

      nvm, Greedy in the Limit with Infinite Exploration (GLIE)

  • @lizafan1894 · 6 years ago · +2

    Hi, I'm confused about concepts "state" and "episode", can anyone help to enlighten me . Thanks.

    • @pi3ni0 · 6 years ago · +5

      An episode is a sequence of states, actions, and rewards: S1, A1, R2, S2, A2, R3, ... For example, it can be one game of chess. From the probabilistic point of view, it's a sequence of random variables. You can sample those to estimate Q(s, a) using the Monte Carlo method or any other method. In other words, an episode is an experience of the agent.

  • @mehdimashayekhi1675 · 6 years ago

    Thanks for sharing!

  • @federicobassetto6908 · 5 years ago

    Best teacher in the world!

    • @zeMalreen · 4 years ago

      With Andrew Ng :p

  • @백성현-o2s · 5 years ago

    I luv it!!

  • @mhttxhe4959 · 6 years ago

    great

  • @jiniliu9811 · 6 years ago

    So clear!!!

  • @jjong · 7 years ago

    The inequality in the formula on page 12 is wrong. You introduced {pi(a|s) - eps/m}/(1-eps) as the weighting variable (which sums to 1), but each term is not guaranteed to be larger than 0, in which case the inequality doesn't hold.

    • @jjong · 7 years ago

      A simple counterexample can be constructed when eps is near 1, where the two terms become the "mean of q_pi(s, a)" and a "weighted sum of q_pi(s, a)" with weights pi(a|s). The latter can certainly be bigger if pi(a|s) is bigger for larger q_pi(s, a).

    • @yoloswaggins2161 · 7 years ago · +1

      But pi(a|s) is just epsilon greedy isn't it?

    • @jjong · 6 years ago · +1

      Yolo Swaggins Hey genius! Thanks for enlightening me!!

    • @zl7460 · 6 years ago

      I had the same confusion too, but yeah pi is eps-greedy so its good.

    • @edmonddantes4705 · 1 year ago

      Actually, this is precisely the reason why the proof breaks if pi is not assumed to be epsilon-greedy. I noticed when reviewing the proof, so I am pretty glad someone else noticed.

  • @woncoh1 · 5 years ago

    24:44
    44:02

  • @sultanyerumbayev1408 · 6 years ago · +2

    What does it mean to act greedily?

    • @EatMyDusty · 6 years ago

      Choose the action with the best value in S_t+1.

    • @alvinphantomhive3794 · 4 years ago

      Someone offers you a bike or a car; if you act greedily based on the prices, then you pick the car.

  • @binyaminmanela3585 · 7 years ago

    can E(A,S) be > 1?

    • @Avandale0 · 4 years ago

      yes, there was a really nice graph in the previous lecture showing the behavior of E(s,a) over time.
      Let's assume you arrive at the pair (s,a). Then E(s,a) = 1. If "a" lands you back in the same state (i.e. S_t+1 = S_t), you can arrive at the pair (s,a) again. In that case, with the accumulating-trace update E <- gamma*lambda*E + 1, we now have E(s,a) = 1 + gamma*lambda, which is > 1.
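      A tiny check of that accumulating-trace update from the lecture (the gamma and lambda values are arbitrary):

        gamma, lam = 0.99, 0.9
        E = 0.0
        E = gamma * lam * E + 1.0   # first visit to (s, a): E = 1.0
        E = gamma * lam * E + 1.0   # immediate revisit:     E = 1 + gamma*lam = 1.891 > 1
        print(E)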

  • @kubic71 · 4 years ago

    19:42
    "Max of all the Q-values has to be at least as good as any weighted sum of all actions" - That is not true, because we can have negative weight coefficients (for larger epsilons for example)
    Simple counter-example:
    Q-values = [-1, 0]
    Weights = [-1, 0]
    Max(Q-values) = 0 < dot-product(Weights, Q-Values ) = (-1)*(-1) + 0*0 = 1
    It would hold if we took a convex combination of our q-values, but if some of the coefficients are negative it is no longer a convex combination, and the statement about the max doesn't hold anymore.

    • @edmonddantes4705 · 1 year ago

      Mate, those weights are nonnegative because pi is assumed to be epsilon-greedy, so by definition pi(a|s) is greater than or equal to epsilon/m. That is the crux of the proof.

  • @thirdreplicator · 6 years ago · +1

    If you listen to the lecture at 2x for 90 minutes, then slow it down to normal speed, David Silver sounds like he is completely stoned.

  • @robinkhlng8728 · 2 years ago

    Is there any theory on the convergence of Sarsa(Lambda)?
    Does it follow from the convergence of One-step Sarsa and Sarsa(Lambda) being kind of the control equivalent of TD(Lambda) which converges?