Man David Silver's articulation is quite phenomenal.
he's a G
God bless internet. Thank you DeepMind and David Silver for making this world class course accessible to everyone.
It is like listening to Richard Feynman. His comprehension is phenomenal! Thanks for putting these online
0:00 Warm Introduction
0:44 Outline
1:20 Model-Free Reinforcement Learning
2:10 Uses and Applications of Model-Free Control
4:10 On and Off-Policy Learning
5:40 Generalized Policy Iteration
7:15 Generalized Policy Iteration with Monte-Carlo Evaluation
13:30 Example of Greedy Action Selection
15:50 Epsilon Greedy Exploration
17:44 Epsilon Greedy Policy Improvement
21:05 Monte-Carlo Policy Iteration
23:40 Monte-Carlo Control
25:37 GLIE (Greedy in the Limit with Infinite Exploration)
28:27 GLIE Monte-Carlo Control
35:17 Monte-Carlo Control in Blackjack
38:46 MC vs TD Control
41:00 Updating Action-Value Functions with Sarsa
44:50 Sarsa Algorithm for On-Policy Control
48:15 Convergence of Sarsa
49:45 Windy Gridworld Example
53:39 n-Step Sarsa
57:20 Forward View Sarsa(lambda)
1:00:24 Backward View Sarsa(lambda)
1:05:10 Sarsa(lambda) Algorithm
1:06:37 Sarsa(lambda) Gridworld Example
1:19:50 Off-Policy Learning
1:25:20 Importance Sampling
1:28:50 Q-learning
1:32:50 Off-Policy Control with Q-learning
1:33:27 Q-learning Control Algorithm (SARSAMAX)
1:34:10 Relationship Between DP and TD
1:19 Introduction
7:12 On-Policy Monte-Carlo Control
38:46 On-Policy Temporal-Difference Learning
1:19:51 Off-Policy Learning
1:34:09 Summary
Thank you for the time stamps in every video
thank you very much
Currently studying this module for masters, man I wish I was born a few years earlier and went to UCL. David Silver is honestly so smart.
Dude is an amazing teacher and a fluent speaker, easy to follow even for non-native English speakers
This feels like the kind of lecture where, if you don't understand absolutely everything that was discussed you shouldn't move on until you do.
I actually recommend otherwise. I watched the entire course, then attempted the homework. The first two questions were very straightforward, so I did them then came back to watch the lectures to verify my solution. I was understanding the concepts better while rewatching, and there is no way I would have understood this much without doing the homework rather than expecting to understand everything first before moving on
Pfft... explore/exploit repeat
This lecture really puts it all together so well. And the questions helped a lot too!
Another great lecture and an exemplification of what someone who knows their field like the back of their hand, sounds like.
Love how the views drop as we get further into the course
This could be called "exponential motivation decay theory" 😂
this series is phenomenal. No wonder every other lecture series references or copies it
What a hard way to introduce a lecture: "In a way, everything in the course has been leading up to this lecture" - as he smiles ear to ear. I can't believe this got me hyped
oo, he had a haircut :)
I was gonna say the same!
This series is seriously addictive. Be careful with your time management!
very true. i am looping again and again for almost a week, in lecture 4,5
i guess the addiction comes from the uncomfortable feeling of not being able to grasp the lecture information
This lecture was an absolute pleasure. Thank you David!
This whole series is just so brilliant!
"Or learn how to fold proteins"
I see what you did here)
Thank you for finally clearly explaining the equivalence between forward and backward view. The book is a little confusing about this, but this example at the end on the gridworld really clarified in my mind why the two things are basically the same!
In the slide of Q-learning (1:29:11), shouldn't the "Next action chosen using the behavior policy" be denoted by A_t instead of A_{t+1}? Because it is chosen at state S_t.
Also, shouldn't the successor action A' be drawn from state S_{t+1} instead of S_t?
I think in both cases it should be S_{t+1}. The actual action made at state S_{t+1} is A_{t+1}, but the Q-table is updated using the A' action.
How can you update towards A' using R_{t+1} if R_{t+1} comes from the action taken by A_{t+1}? That was the bit I didn't understand from the slide. Don't you end up polluting your Q values with rewards that didn't actually come from actions that you took?
The slide is correct. If you go back to his first or second lecture, he mentioned the time notation and the state notation. Basically, R_{t+1} is the reward you get when you are at S_t and take action A_t, so R_{t+1} in the target is correct. I hope it helps.
Actually I was wondering if there was an error there.
On the 3rd and 4th bullet points, I think the action A_{t+1} and the alternative A' should be drawn from the state S_{t+1}. This is because S_{t+1} is the state you ended up in after acting in the previous step, from state S_t with action A_t, gaining the reward R_{t+1}.
@Evan Su Yes on both counts, if I interpret the slide correctly.
Just to be clear:
1. we start from S = S_t
2. we sample A from mu(.|S), the behavior policy
3. we end up in S'
4. we sample A' from pi(.|S'), the target policy
5. we update q(S, A)
6. we set S = S' and go back to 2.
Note that:
- the update to q(S,A) does not depend on the behavior policy, but only on the target policy. While it's true that the behavior policy determines *which* q(S,A) we'll update, once we decided we want to update a particular q(S,A), the update itself won't depend on the behavior policy but only on the target policy.
- By setting S = S' and forgetting A', the trajectory will be independent of the target policy.
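Written out as an update rule (this is just my reading of the slide, with $S_{t+1}$ as the state both $A_{t+1}$ and $A'$ are drawn from, so double-check it against the video):

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$

where $A_{t+1} \sim \mu(\cdot|S_{t+1})$ is what we actually follow next and $A' \sim \pi(\cdot|S_{t+1})$ only appears inside the target. Q-learning is the special case where $\pi$ is greedy, so the target term becomes $\max_{a'} Q(S_{t+1}, a')$.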
"I only get this one stream of data that we call life" 1:24:50 :D
yep, and only 1 single episode, so make sure you pick the right function approximator to deal with it. :)
@@alvinphantomhive3794 Luckily we use online learning XD
@@sohampatil6539 Ow yea that's true for most people. Anyway, it's a year-old comment, so I need to update my statement: we as humans are actually unable to choose the func approximator. It Is the BrAiN as default! Sophisticated Performance but Not Robust enough :D
This guy totally rocks!
Wow, perfect lecture until the end. Excited to see the rest! Thank you so much for uploading this!
No online paid or free course can beat his content....
GLIE MC looks so easy in theory but harsh in code
Thank you soo much sir for this great series... really learnt a lot and motivated me to work even harder
God Bless this man
Best teacher in the world!
With Andrew Ng :p
"...We only got this one stream of data we call life."
man didn't expect you to get so deeeeep
The theory is great and all, but it would be really helpful to see some code + explanations
David "Gold"!
Horrible comment
@@trouserpython3857 Awesome nickname
@@andychoi3589 Thank you sir
24:15: In the context of Monty P... Carlo.
He's too good at explaining. Lovely :)
49:26 in practice we don't worry about this and this either. :)
Great lecture, as always! I’d have a question about GLIE Monte Carlo (around 30:00). The theorem says that GLIE MC control will converge to q_*. However, Sutton and Barto (page 99) mention that convergence of MC control with exploration starts (somewhat simpler setting than GLIE) is theoretically an open problem. So, has the convergence of GLIE MC control actually been proven?
Monty Python? 24:10
indeed :)
Haha that was funny :D :D
LOL
I think there needs to be a small correction at 1:30:57: the third bullet point states that we sample A_{t+1} from the behaviour policy given state S_t. At first glance this looks incorrect, because A_{t+1} must be sampled given state S_{t+1}; also, the update equations make more sense if A_t was sampled from the behaviour policy given S_t.
it's funny how it always looks like he is going to cough but he is not
prof. david silver's hair is so cute
At 20:48, how is the epsilon-greedy policy making it better? If we expand the weighting expression using the formula from the previous slide, all we're left with is the value from the left expression. Can someone explain in which cases it would be better?
In the slide of the e-greedy policy improvement theorem, shouldn't $q_\pi(s, \pi'(s))$ in the first equation be $v_{\pi'}(s)$ instead?
They are equivalent by construction here.
according to the textbook, policy pi is not an epsilon-greedy policy, but an epsilon-soft policy. The left part of the equation is correct.
Do you mean Sutton&Barto's textbook? I didn't read the textbook, but I think putting value function in the first equation on the slide is easier to follow.
Come on guys, pi' is a stochastic policy by definition, so there is clearly a mistake in the slides, since $q_\pi(s,\pi'(s))$ is a random variable. On the left-hand side, it should be the expectation of $q_\pi(s,\pi'(s))$. Plus, obviously, the argument is incomplete since you have to estimate the expectation of $q_{\pi'}(s,\pi'(s))$ instead of that of $q_\pi(s,\pi'(s))$. He even says this in the lecture.
@@zl7460 Not true. pi' is not deterministic.
@53:44 On the n-step Sarsa slide, I think there's a notation mistake. The Q function should have both S and A as inputs. Other than that, great lecture :D
Which Q function are you referring to? Is it the n-step Sarsa updates? Everything seems just fine to me.
@@alvinphantomhive3794 He refers to q_t^1, q_t^2, etc, and he is right that they should depend on both S_t, A_t, but the dependence is sort of obvious, so David didn't write it.
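For completeness, the n-step return those q_t^(n) refer to, with the $(S_t, A_t)$ dependence written out (from the slide as I remember it, so verify against the video):

$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$

and n-step Sarsa then updates $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha ( q_t^{(n)} - Q(S_t, A_t) )$.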
I put everything on 2x speed. I put this one 0.25x :)))
I watched this at 2x too lol. Will defo have to re-visit.
@@honestexpression6393 you have fast mind!
I love the lectures. Thank you.
I have a question about the backward Sarsa(lambda) algorithm implementation. Do we really need to store the eligibility traces? It seems like nothing really happens until a reward is received (an event). When (and only when) the event occurs, we could run back along the trajectory, updating the Q(s,a) values visited in the trajectory by alpha*delta*(decaying weight)? Only state/action values in the trajectory would be updated, but only state/action pairs in the trajectory could have non-zero eligibility traces anyway. Maybe what I am saying just swaps storing the eligibilities for storing the trajectory, but it seems more efficient.
The lack of such implementation details in pseudo-code is intentional. The method you propose would be quite slow in practice when you keep receiving rewards every few steps. You could solve the problem by maintaining a set of q(s,a) whose eligibility traces are big enough.
Thanks. I think I see what you mean. The trajectory could become quite long, and my proposal traverses it every time a reward is received. So you are suggesting only keeping state/action pairs from the trajectory where the eligibility has not decayed below a certain threshold. In practice that would be the few most recently visited states in the trajectory. Older pairs would have very small eligibilities and be removed from the list.
Yes, that's exactly what I meant! Moreover, in practice we can represent states through features and then only store eligibility traces for the features. I think this is discussed in lecture 7 of this course. More generally, instead of just remembering which state to change, we remember what changes we should make to the weight vector (of a neural net, for instance) to modify the value of the state we're interested in. In other words, we remember the gradient for influencing a particular state. The main advantages are that (1) the weights are usually way fewer than the states and (2) we can generalize over similar states.
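In case a code sketch helps, here is roughly what one step of backward-view Sarsa(lambda) looks like with the pruning idea from this thread, in Python. The names (q, trace, min_trace, ...) are placeholders I made up, not anything from the lecture:

# One step of backward-view Sarsa(lambda) with accumulating traces.
# q and trace are dicts keyed by (state, action); pairs whose trace has
# decayed below min_trace are dropped, as suggested above.
def sarsa_lambda_step(q, trace, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9, min_trace=1e-4):
    # TD error for the transition (S, A, R, S', A')
    delta = r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0)
    # Bump the trace of the pair we just visited
    trace[(s, a)] = trace.get((s, a), 0.0) + 1.0
    # Update every pair that still has a live trace, then decay its trace
    for key in list(trace):
        q[key] = q.get(key, 0.0) + alpha * delta * trace[key]
        trace[key] *= gamma * lam
        if trace[key] < min_trace:
            del trace[key]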
What is the idea of taking the ratio between the target policy and behavioral policy in Importance Sampling?
it's some kind of similarity estimation between the policies: how much your target policy looks like the behavioural one in some states.
You want to evaluate the return of policy pi. Note that this is computed as an expectation, which can be approximated by sampling returns from trajectories FOLLOWING PI. In general, when sampling and approximating this expectation, you want to attach more weight to trajectories that are more common under pi than to trajectories that are rare under pi. This occurs naturally when you compute the expectation on-policy. However, if you are sampling off-policy, you might often be seeing trajectories that are common under mu but uncommon under pi, and you do not want to weight those too much. The ratio between the target policy pi and the behavior policy mu achieves precisely this. If a trajectory is common under mu but uncommon under pi, it will attach less weight to the sample of the return, since we are sampling under mu. Conversely, if some trajectories are uncommon under mu but common under pi, the ratio will increase the weighting, since now that sample return is considered to be "more representative". In summary, that ratio ASSIGNS A NEW WEIGHT TO THE EXPERIMENTAL RETURN, IT REWEIGHTS THE EXPERIMENTAL RETURN BY TAKING INTO ACCOUNT HOW IMPORTANT/REPRESENTATIVE IT WOULD BE IF WE WERE FOLLOWING PI.
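In symbols (my transcription of the Monte-Carlo importance sampling slide, so double-check it):

$G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$

and the value is then updated towards the reweighted return, $V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi/\mu} - V(S_t))$. Each ratio is exactly the "how representative would this step be under pi" factor described above.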
1:30:50 Shouldn't the condition state for mu and pi be S_t+1, not S_t?
Semin Park A_t+1 should be A_t, I think this correction is better.
@@ernestoyou8610 No, I think it's At+1, it's the action that you're going to follow next, so basically you sample A' from pi to update Q(St, At) and then sample At+1 from mu and follow it.
I've been having this question about SARSA and TD lambda both. Why do we need summation of all lambdas to add up to 1?
So that the sum becomes an unbiased estimator of the discounted return in the limit :).
Say X_1, X_2, ..., X_n,... are samples of a random variable with finite variance and expectation mu, and consider the infinite sum on n:
X = (1-lambda) sum lambda^n X_n.
Then E[X] = mu. If the geometric sequence of lambdas summed, say, to 1/2, you would have E[X] = mu/2, which obviously wouldn't be desirable.
Note: similar sums are very common in machine learning. Look at adaptive optimizers based on moving averages like Adam, or at exponential moving averages applied to weights during training:)
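For context, the weighting being asked about is the forward-view Sarsa(lambda) return from the slide (writing it from memory, so verify):

$q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$

The $(1-\lambda)$ in front is exactly what makes the geometric weights $(1-\lambda)\lambda^{n-1}$ sum to 1.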
1:00:10 why does it make sense to take into account small lookaheads, when we do take into account large lookaheads? If we advance more into the future, it would make sense to take the real knowledge into account rather than weigh in assumed knowledge. What am I missing?
Variance man, variance
(1) You incur variance. (2) You lose online updates. Talking about variance: when lambda goes towards 1, TD(lambda) gets close to MC and you trade variance for a less biased estimator (more variance, less bias). When lambda goes towards 0, you trade bias for a reduction in variance (more bias, less variance).
20:52 but why wouldn't I get stuck in a local max? the theorem only assures I wouldn't make the policy any worse, not that I would be making it better
And who said you couldn't get stuck in a suboptimal policy? You can get stuck and you will typically get stuck, that is precisely why GLIE is proposed. GLIE ensures that the exploring policy is greedy in the limit, so epsilon->0 if you are using a GLIE algorithm based on epsilon greedy policies. In other words, epsilon is ideally ADAPTIVE.
very good lec
I didn't understand how following a greedy policy will cause an exploration problem. Don't we explore all possible states and find the value of every state-action pair during the policy evaluation phase? Once we have explored the entire state space, how will selecting the actions with the highest q value cause an exploration problem?
Can someone help me out?
As this is model-free, I think you are not able to explore all possible states and perform policy evaluation like in model-based RL, due to the lack of P_ss'. As a result, we are working with the action-value q, and the way to update it is to sample a bunch of episodes; for each episode you compute the return G_t = R_{t+1} + discount*R_{t+2} + ... for every (s, a) visited, then take the average and that's your new q. So it all comes down to the samples you take to update q, which doesn't guarantee exploring all the states. My two cents.
Why does the proof at 18:03 make sense at the end result, when it compares q_pi'(s, a') >= V_pi(s)?
The q-function has the values of each action possible at each state.
Taking the mean of these at each state gives us the State Value function.
So V_pi(s) is a mean between possible actions.
q_pi'(s, a') has the value of ONE of the actions at that state, a': chosen by the new policy pi'.
Started with:
q_pi'(s, a') >= V_pi(s)
I'm interpreting this as a relation between a 'max' and a mean (as the policy tries to maximize this reward).
But this is (almost) always true: a max greater than the mean.
If the policies are different, this inequality cannot ensure this new policy is better.
Is that the case? or am I missing something?
Found this: stats.stackexchange.com/a/304406/249143
@@jackcarreira6931 woah, thanks so much.. i paused for like few hours to grasp these ideas
There is an expectation missing on the left-hand side of q at the beginning of the proof. Note that pi' is not deterministic, so pi'(s) is a random variable.
One question about the Sarsa(lambda) example: since we know the trajectory, do we update the error flow in reverse order, so that every state would have a valid update with the error flow?
It's not updated in reverse order. In Sarsa(lambda), when you reach the terminal state, or any state that carries a reward, you immediately update every single state-action pair according to the eligibility traces and the TD error (Sarsa(lambda) algorithm 1:05:19). And since the update for each state-action pair decays via the eligibility trace the further you move from that particular state, it looks like you update Q in reverse order, but it isn't.
I guess you can take a look at the explanation of a question that was asked at 1:12:28
Great lecture! One question: Why do we need to store the whole matrix E(s,a)? A list of (s, a) for current episode should be enough. Right?
so you can calculate the eligibility traces in proportion to each (s, a) pair.
Yeah, it is enough of course :)
for the first time in 5 lessons, he needed to correct himself
I have a question about the Sarsa(\lambda) algorithm. By using eligibility traces, the policy gets updated in each step; however, it is still not completely online, in that the agent needs to store all the states it has travelled through during the walk, right?
Also, Sarsa(0) isn't actually that bad, right? The case where only one update is made occurs only when the immediate reward of each state-action is identically negative except for the goal position.
And in off-policy learning, you are assuming that you know the exact formulation of, for example, the other agent's policy and can act according to it. However, in most cases, for example by observing human reactions to the environment, all you get is a sample drawn from the human's policy. In that case, do we just initialize our Q value for the observed state-action pairs in the samples as if the agent had experienced these samples by itself?
In addition, how is Q-learning better than maintaining only one \varepsilon-greedy policy with decaying \varepsilon? Should the greedy policy generate a suboptimal result and the true optimal policy travel through a vastly different set of states, neither Q-learning nor \varepsilon-greedy is capable of exploiting these unvisited states. So it is like Q-learning is not combining an exploratory and an exploiting policy, but actually just looking ahead and sampling one more step.
(1) Sarsa(lambda) (I am assuming the backward view from now on) does not update the policy as you write. It updates the action-value function Q(s, a). Moreover, we are assuming the state space S to be fixed and known, and it is always stored (in any algorithm; we are in the look-up table case). I think you mean that the eligibility traces of all the visited state-action pairs in that episode are updated at every step, which is true. However, updating eligibility traces does not make the algorithm offline. Online means that you don't wait until the end of the episode to update your value function, which you don't, since you make updates on every state transition. That's the whole point about online algorithms. In every iteration of Sarsa(lambda), multiple values are updated (the ones associated with visited state-action pairs) and the eligibility traces are also updated. The algorithm is fully ONLINE, since updates occur every time you transition from one state to another.
(2) In the off-policy learning algorithms presented in the slides, it is enough to be able to sample from your behaviour policy, so what you are saying is not true. You don't need the MDP formulation, it is model-free. It is made very clear in the slides. Note that you select the states and actions that will be updated by following your behaviour policy, but the values at those points are updated towards those of your target policy.
am i the only one who feels a bit uncomfortable ignoring the precise math (e-greedy) and pretending it's alright just by grasping the main idea from david silver.. lol
i'm feeling the same right now lol
I am thinking that since using q (action-value) is quite convenient here to update model-free control, why we were not using q for model based policy iteration as well? I feel like in principle, we can use either v or q for model based policy iteration, so I wonder for model based control, is there any advantage of using v over q?
1) The dimensionality of q(s, a) is much higher than v(s).
2) The optimal policy will maximize v(s) over all actions. Hence our agent will perform better in a stochastic environment
@@japneetsingh5015 One could do it of course, but why go for something that is higher dimensional? It means higher computational cost. Note that in the model-based case, one can calculate q* directly from v* using the model, so it seems much better to do it at the end instead of updating q all the time.
How can I run Monte Carlo sampling with Policy Evaluation? If I have a policy, that policy tells me what action should I take. If I use monte carlo, I randomly walk through the states? How does it fit together and where I am missing a piece?
Actually, Monte Carlo starts each episode of exploration from a random state-action pair (s, a),
then updates the action-value Q for every pair (s, a) appearing in the episode, followed by policy improvement for all visited states.
After a large number of episodes, and if we are using epsilon-greedy, we guarantee infinite exploration and thus maximum policy improvement and convergence towards the optimal policy.
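If it helps to see it as code, here is a rough Python sketch of GLIE Monte-Carlo control with an epsilon-greedy policy. sample_episode is a placeholder you would write for your own MDP (it should return a list of (state, action, reward) triples for one episode following the given policy); none of these names come from the course:

import random
from collections import defaultdict

def glie_mc_control(sample_episode, actions, n_episodes=10000, gamma=1.0):
    q = defaultdict(float)      # Q(s, a) estimates
    counts = defaultdict(int)   # N(s, a) visit counts

    def epsilon_greedy(state, eps):
        # Explore with probability eps, otherwise act greedily w.r.t. Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: epsilon decays to zero
        episode = sample_episode(lambda s: epsilon_greedy(s, eps))
        g = 0.0
        # Walk the episode backwards so g is the return from each step onward
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            counts[(state, action)] += 1
            # Incremental mean: every-visit MC update towards the return
            q[(state, action)] += (g - q[(state, action)]) / counts[(state, action)]
    return q

Policy improvement is implicit here: epsilon_greedy always reads the latest Q.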
Magnificent!
Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?
The experimental returns that are averaged in both first-visit MC and every-visit MC in order to estimate the total return are UNBIASED SAMPLES of the real return. In other words, first-visit MC and every-visit MC are unbiased estimators. The convergence to the real returns is due to the central limit theorem. If you want to take a look at how the density looks formally, then see the definition of \rho^\pi (it appears slightly before equation (1)) in proceedings.mlr.press/v32/silver14.pdf. That's the density you are sampling from, but in an episodic environment.
Thanks for sharing!
Could the voice at 1:31:03 be David Runciman's?
At 28:45, if my robot is exploring the path for a maze, how does it know what the reward is when performing a certain action or being in a certain state?
The reward is given by the setup (e.g. the goal in a maze, or the score on the Atari game). In more practical scenarios, presumably setting the utility (reward) function is a whole other thing that needs doing (e.g. program the robot to recognise a sensor telling it it's reached the end point of a maze). The process described in the lecture is how it works out which path to take - of course it can only start weighting moves after it's actually reached the goal once.
1:34:08 - Looking at the pseudocode, I don't see why Q-Learning is off-policy when the actual Q function is updated on every step... Bit confused here because this looks like an algorithm which could run online
It's because when updating the Q function you don't move toward the target defined by (the reward you just received + gamma*Q(next state, action you will actually perform at the next state)); instead you move toward (the reward you just received + gamma*max over available actions of Q(next state, action)).
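A quick way to see the same thing in Python (just a sketch; q here is a dict keyed by (state, action), not anything from the lecture):

def sarsa_target(q, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap from the action the agent will actually take next
    return r + gamma * q[(s_next, a_next)]

def q_learning_target(q, r, s_next, actions, gamma=0.99):
    # Off-policy: bootstrap from the greedy action, regardless of what the
    # behaviour policy then actually does
    return r + gamma * max(q[(s_next, a)] for a in actions)

# Either way the update itself is:
#   q[(s, a)] += alpha * (target - q[(s, a)])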
In the SARSA algorithm, does it matter if the Q value exceeds the max reward? E.g., the reward that can be awarded after each step ranges from -1 to 1, and Q for one state is something like 1.2 or 4.5 or -2.3, etc. Does this mean my code has errors or is it normal?
I do not know if this is still relevant to you, but the situation you are describing is absolutely normal, because Q(s,a) does not only account for the immediate reward, but is an estimate of the long term reward of being in state s and taking action a. The only thing that I would like to point out as a technicality, is that Q does not refer to states only, but rather to a pair of a state and an action.
Anyone have any idea how I might get the slides for this course? The URL provided is not valid now, and I've also checked the course website of 2015, but only the slides of the first two chapters can be found.
www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
So clear!!!
The inequality in the formula on page 12 is wrong. You introduced {pi(a|s) - eps/m}/(1-eps) as the weighting variable (which sums to 1), but each term is not guaranteed to be larger than 0, in which case the inequality doesn't hold.
A simple counterexample arises when eps is near 1, where the terms become the "mean of q_pi(s, a)" and the "weighted sum of q_pi(s, a)" with weights pi(a|s). The latter can certainly be bigger, if pi(a|s) is bigger for larger q_pi(s, a).
But pi(a|s) is just epsilon greedy isn't it?
Yolo Swaggins Hey genius! Thanks for enlightening me!!
I had the same confusion too, but yeah pi is eps-greedy so its good.
Actually, this is precisely the reason why the proof breaks if pi is not assumed to be epsilon-greedy. I noticed when reviewing the proof, so I am pretty glad someone else noticed.
Hi, I'm confused about the concepts "state" and "episode". Can anyone help enlighten me? Thanks.
An episode is a sequence of states, actions, and rewards: S1, A1, R2, S2, A2, R3, ... For example, it can be one game of chess. From the probabilistic point of view, it's a sequence of random variables. You can sample those to estimate Q(s, a) using the Monte Carlo method or any other method. In other words, the episode is an experience of the agent.
46:45 sarsa for online
1:11:52 sarsa (lambda) grid world example
I have to say, David Silver is slightly smarter than me :)
There are 10^120 possible chess positions according to Shannon. Can sampling from each state really work? I'm reading Sutton and following these lectures, but something isn't adding up for me. Can anybody point me in the right direction...?
When chess players think several moves ahead, they're basically doing smart sampling.
You just estimate it with a function approximator, getting your function approximator to generalize is a completely different tale though.
In each episode we need to update a few states only.
19:42
"Max of all the Q-values has to be at least as good as any weighted sum of all actions" - That is not true, because we can have negative weight coefficients (for larger epsilons for example)
Simple counter-example:
Q-values = [-1, 0]
Weights = [-1, 0]
Max(Q-values) = 0 < dot-product(Weights, Q-Values ) = (-1)*(-1) + 0*0 = 1
It would hold if we took a convex combination of our q-values, but if there are some coefficients that are negative, it is an affine combination, for which the statement about the max doesn't hold anymore.
Mate, those weights are nonnegative because pi is assumed to be epsilon-greedy, so by definition pi(a|s) is greater or equal than epsilon/m. That is the crux of the proof.
At 17:43, why does the proof only hold if pi is eps-greedy? Why can't it be an arbitrary policy? (I know it can't be right, but I don't see what would be wrong in the proof.)
The second line of the proof won't follow.
No it's because this makes the weights in 3rd line >=0. The 2nd line is from pi' not pi so it's irrelevant.
First, choose a policy pi* that is optimal already and improve it epsilon-greedily. Now the policy becomes worse, so it is clear that the theorem does not work in this case.
Now, where does the proof break? Abhishek claims the second line will not work, which is not true haha. The second line definitely works, since we are assuming that pi' is epsilon-greedy. One of the things that break for sure is the inequality in the third line, since the weights in the sum could now be negative (note that since pi is not necessarily epsilon-greedy, there is no guarantee that pi(a|s)-epsilon/m is nonnegative).
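For anyone following along, here is the chain from the slide written out (from memory, so check against the video); the step this whole thread is arguing about is the inequality, which needs $\pi(a|s) \ge \epsilon/m$, i.e. $\pi$ epsilon-greedy (or at least epsilon-soft):

$q_\pi(s, \pi'(s)) = \sum_a \pi'(a|s) q_\pi(s,a) = \frac{\epsilon}{m} \sum_a q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a)$

$\ge \frac{\epsilon}{m} \sum_a q_\pi(s,a) + (1-\epsilon) \sum_a \frac{\pi(a|s) - \epsilon/m}{1-\epsilon} q_\pi(s,a) = \sum_a \pi(a|s) q_\pi(s,a) = v_\pi(s)$

The weights $(\pi(a|s) - \epsilon/m)/(1-\epsilon)$ are nonnegative and sum to 1 exactly when $\pi$ is epsilon-greedy/epsilon-soft, which is why the max dominates the weighted sum.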
what does GLIE stand for?
nvm, Greedy in the Limit with Infinite Exploration (GLIE)
He keeps on exploring the Monty Python direction!
awesome
I luv it !! >
can E(A,S) be > 1?
yes, there was a really nice graph in the previous lecture showing the behavior of E(a,s) over time.
Let's assume you arrive at the pair (s,a). Then E(s,a) = 1. If "a" enables you to land in the same state as before (i.e. S_t+1 = S_t), you can arrive at the pair (s, a) again. In that case, we will now have E(s,a) = 1 + gamma*lambda, which is > 1.
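For reference, the accumulating-trace update from the previous lecture, as I remember it (so double-check): $E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)$, which is where the $1 + \gamma\lambda$ above comes from.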
I wrote a SARSA(0) code for solving the windy grid world at 51:00. Might be helpful for others.
Github link - github.com/ShivendraAgrawal/RL_David_Silver/blob/master/SARSA_windy_gridworld.py
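And if you just want the core loop without reading the whole file, it roughly has this shape (a sketch, not the exact code from the repo; env.reset() and env.step(a) stand for whatever windy-gridworld environment you write, returning a state and a (next_state, reward, done) tuple respectively):

import random

def run_sarsa0(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    q = {}  # maps (state, action) -> value, default 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q.get((s, a), 0.0))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            # Sarsa(0): bootstrap from the action we will actually take next
            target = r + (0.0 if done else gamma * q.get((s2, a2), 0.0))
            q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
            s, a = s2, a2
    return q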
Your description of off-policy, till Q-Learning, is overly complicated.
great
i like ya cut g
24:44
44:02
What does it mean to act greedily?
Choose the action with the best value in S_t+1
Someone offers you a bike or a car; if you act greedily based on the prices, then you pick the car..
38:44
24:12 monty py.. carlo xD
If you listen to the lecture at 2x for 90 minutes, then slow it down to normal speed, David Silver sounds like he is completely stoned.
Is there any theory on the convergence of Sarsa(Lambda)?
Does it follow from the convergence of One-step Sarsa and Sarsa(Lambda) being kind of the control equivalent of TD(Lambda) which converges?