It does. But TD(0) denotes TD(lambda) with lambda = 0, which corresponds to 1-step TD (or to lambda = 0 in the eligibility-trace paradigm). In summary, the 0 refers to lambda, not to n.
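For reference, the forward-view definition from the slides is G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}; setting \lambda = 0 kills every term except n = 1, leaving G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}), which is exactly the one-step TD target.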
For TD learning, how can an agent know what the estimated (look-ahead) value for state S_{t+1} should be? I know TD is biased because it has to guess the next value and use it to update the current value, but how would it guess?
It depends on the specific environment. In a case like chess, where the environment is largely predictable, information like the number of pieces taken can be used to shape an initial value function. Otherwise, you initialise the value function arbitrarily (e.g. randomly or to zero) and let the TD updates gradually correct it from experience. Essentially, if you don't want an arbitrary initialisation, then a prior needs to be applied.
What about states that are visited very infrequently, where it takes a long time before they are first visited in a given episode? Over many episodes, that time will add up, right?
In practice, when the state space is very large there are many states that never get visited at all. For example, TD-Gammon has probably not seen most of the configurations in Backgammon and can still produce amazing results. Lectures 6, 7, 8 and 10 will partially answer your question. When there are too many states, approaches that generalise to unseen states are preferred (parametrising the value function or the policy with a neural network, for example).
Both are unbiased estimators. Assume a policy pi is fixed. The empirical discounted returns you compute each episode for every state s in MC are samples of the random variable R = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on s, i.e. samples of R | S_1 = s. Note that the value of a state s is defined as v(s) = E[R | S_1 = s], where E denotes the expectation. By the strong law of large numbers, for every state s, the average empirical return converges to the true expectation almost surely, since the samples are independent and the rewards are uniformly bounded. That's the proof.
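In the course's notation, the same argument reads: v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] with G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots, and the MC estimate V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)} is an average of independent samples of that return (one per episode for first-visit MC), so V(s) \to v_\pi(s) almost surely as N(s) \to \infty.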
For an MDP you need an action space, a state space, a reward function (or distribution), and the probability of moving to state s' from s given action a, for all s, s' in the state space and all a in the action space. Here we only observe states and sampled rewards (and, in the next lecture, actions), but not the transition probabilities. I think that is the reason why we don't have a model and this is not a model-based algorithm.
Each extra real reward step R_t you include in the target increases noise (variance) and decreases the bias. From a probabilistic standpoint, the more real data you put into the target (i.e. the more steps of actual reward), the less the bootstrapped estimate matters, which eliminates the bias in the MC limit. TD(0) sits at the other extreme: it uses only one real reward, so it has the lowest variance but the most bias, and it still converges to the true value as the estimates improve.
Because in TD you respond immediately to what just happened after you took the action, and only in a Markov environment can you assume that what just happened depends only on your action and the previous state.
I do not understand: how can MC and TD find the optimal policy without comparing all the possible policies like DP? Do they sample the majority of the policies?
As far as I understood, MC and TD converge to one and the same value estimate in an MDP; they are just different ways to get there (though MC is arguably most useful when the Markov assumption is shaky). Coming to the policy (which I believe is explained in a later lecture): as in the previous lectures, where we alternate between value and policy updates, the greedy policy derived from the current value estimate is always at least as good as the policy we started with. Hence, a better value function estimate gives us a better policy than before, and over many such alternating updates of value and policy we end up with the optimal policy and value.
What I understand is that the more episodes you sample, the more accurate the value estimate is. You can see the blackjack example at 19:50: 10,000 episodes vs 500,000 episodes give different results. DP only works for small problems, which is mentioned in lecture 2 or 3, I think. So for big problems MC and TD save you time by sampling.
I think pretty much any example you try is going to be a counterexample, because what the student proposes basically does not make sense. Try a state space with two or three states and a uniform policy with two different actions and I think it will work :). Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE states: you look at the states where you might end up depending on which actions you choose, and you express the value of s as a weighted sum of the values of those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values. This is key. In particular, if we are at state S_t and sample an action from policy pi, then R_{t+1} + gamma v_pi(S_{t+1}) is an UNBIASED sample of v_pi(S_t). This is the heart of TD learning, whose update rule uses the target R_{t+1} + gamma V(S_{t+1}) for V(S_t), where V(S_{t+1}) is now an approximation of the true value at S_{t+1}, so the sample becomes biased. However, as you know, TD still converges to the true values in the limit by standard stochastic approximation arguments; the estimates become more and more accurate and less biased. Now, imagine that instead of updating like V(S_t)...
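For anyone who wants to see that TD update rule in code, here is a minimal tabular TD(0) sketch. It assumes a hypothetical gym-style environment where env.reset() returns a state and env.step(a) returns (next_state, reward, done); all names are illustrative.

from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation:
    V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)                   # value estimates, initialised to 0
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                    # sample an action from the fixed policy
            s_next, r, done = env.step(a)    # one real transition
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrap from the current estimate
            V[s] += alpha * (target - V[s])  # move V(s) towards the TD target
            s = s_next
    return V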
Hey everybody, is this course an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come up with Udacity's course (the hefty price tag is holding me back), but I've heard a lot of good things about this course. How would you recommend studying RL? Take the Udacity course with lots of coding implementations, or study UCL's course thoroughly and implement code from the pseudocode in the slides?
Sorry for the one-month delay. You've probably either picked something else or watched this one already, but if not, I would recommend following the Practical_RL assignments on GitHub (from the Higher School of Economics, if I recall correctly). They suggest videos for each 'practical week' along with tasks, which really helps to grasp the ideas in this course, and later you can start Berkeley's great CS294-112.
I guess he asked why, on the "Dealer Showing" axis, the value function at '10' decreases compared to '9'. But again, I think this example is not really necessary for you to understand; he also knew not everyone understands the gameplay, which makes it hard to build the intuition. And there are many more examples with simpler concepts on the next slides.
In my opinion, a backup is the operation used for policy evaluation: you take information from successor states (returns or value estimates) and use it to update the value of the current state. Bootstrapping specifically means that the update target uses your current value estimate of a successor state instead of a complete sampled return.
The formula for the lambda-return G_t^lambda should contain a term at the end for the final update (end of episode) including the terminal state: you need to add lambda^(T-t-1) * G_t to the end of the return. In the case that lambda = 1, the whole expression reduces to G_t, which is the same as MC.
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
He skipped the last 10 slides. I saw them on the course website and I think they are really important, but I am having trouble understanding them. Can anybody tell me whether those slides matter?
The last 10 slides show that the backward view with eligibility traces is truly equivalent to the slower forward-view way of calculating updates, as in MC and naive (forward-view) TD(lambda). If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
Isn't the formula wrong? The last discount exponent should be T-t-1 instead of T-1; the slide has G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^(T-1) * R_T, but it should be G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^(T-t-1) * R_T.
Questions from students are of very high quality and they are one of the many reasons that make this lecture series particularly great.
yeah, especially the guy sitting close to the camera
0:53 Outline
2:10 Introduction
5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack example
26:30 Incremental Mean
29:00 Incremental MC Updates
34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example
44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence
1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning
1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Return
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward view TD(lambda) and Eligibility Trace
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
thanks
2:03 Introduction
5:04 Monte-Carlo Learning
33:56 Temporal-Difference Learning
1:23:53 TD(lambda)
He is the Andrew Ng of RL.
Could not agree more. Andrw Ng got me into ML and David Silver into RL. And that book by Sutton is gold!
No, Andrew Ng is the David Silver of ML.
@@ruslanponomariov3598 Wouldn't it be Richard Sutton, given he is considered the father of RL?
Andrew Ng has done lots of RL work himself!
@@nirajabcd very true; but isn't RL a subset of ML? I thought ML = {RL, Supervised, Unsupervised}
I love how he relates the form of the incremental mean to the meaning of RL updates.
This actually helped me a ton, I've been knee deep in Deep RL implementations that have just been handed to me and I've skirted doing it by hand or understanding the formulas. As a result, reading papers have been confusing. Him mentioning that the incremental mean will be a common pattern made about 10 things click all at once for me.
@@beepmcjeep5527 wake up
@@jaidevshah2286 😴
The lecture 💯.
The questions the students were asking 💯.
My enjoyment of the whole thing 💯.
38:29 what a great example to explain how TD is different from MC
At 1:27:47, David explains why we use the λ geometric series by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
TD(λ), in the backward view, does update all previously visited states, but each of those updates is roughly equivalent to an update done with TD(0).
The forward view is theoretical and is mentioned because it gives us some intuition about TD(λ), but it would be very inefficient to implement TD(λ) using the forward view (way too much computation), and it also requires complete, finite episodes. That's why, in practice, we only use the backward view for implementations.
If you look at lecture 5, on Sarsa(λ), there's a comparison of the way Sarsa(λ) and one-step Sarsa update the states at around 1:06:00. You can see that while Sarsa(λ) does more computation per step, it also builds a much more precise evaluation of the state-action values. I think looking at that example might give you some intuition for why TD(λ) is as efficient as TD(0) computation-wise.
@@totolamenace What you say makes sense, but computational cost means cost per iteration, and the backward view of TD(lambda) updates eligibility traces and many state values per iteration, so it is obviously more expensive than one-step TD. A different statement would be "to obtain state value estimates of similar quality, the number of required operations in both methods is roughly the same", which is about sample efficiency rather than per-step computational order anyway.
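To make the per-step cost concrete, here is a rough tabular sketch of one backward-view TD(λ) step (accumulating traces) next to one TD(0) step; V and E are plain dicts and all names are illustrative. Per step, TD(λ) touches every state with a non-zero trace, while TD(0) touches only the current state.

def td_lambda_step(V, E, s, r, s_next, done, alpha=0.1, gamma=1.0, lam=0.9):
    """One backward-view TD(lambda) step with accumulating eligibility traces."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)  # TD error
    E[s] = E.get(s, 0.0) + 1.0                        # bump the trace of the visited state
    for state, trace in list(E.items()):              # broadcast the error to all traced states
        V[state] = V.get(state, 0.0) + alpha * delta * trace
        E[state] = gamma * lam * trace                # decay every trace geometrically

def td0_step(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """One TD(0) step: only the current state is updated."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta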
When you realize every lecture corresponds to a chapter in Sutton & Barto's "Reinforcement Learning: An Introduction"
Sutton & Barto are the OGs, Silver is the GOAT.
"Introduction to Reinforcement Learning" is a Holy Bible of RL, everything you need to know is in this book. And David Silver helped us to interpret this book.
My Prof actually uses both as teaching material xD
@@_snwflk_my prof too 😂
43:35 A student asked about the goal and actions of the driving-home example. I have read the book where this example comes from. And here is my take on the question:
The actions come from a policy that is determined by the person. In this case, the policy is getting home by driving a car through particular roads. The person can use other policies to get home such as walking home and driving through other roads.
The goal of Monte Carlo or Temporal Difference is to estimate how good his policy is. Remember, his policy involves driving through particular roads. The example shows just one sample of updating the algorithms. To actually see how good his policy is, he needs to take the same route every day, obtain more data, and update the algorithms.
Great explanation, Daniel! Thanks! Which book is the example from and would you recommend to read it parallel to working through this lectures?
Thank you. The book is Reinforcement Learning: An Introduction by Sutton and Barto. I think you can download the draft with some googling.
Yeah, I would recommend reading it along with the lectures. It's a good book. I find it helpful to learn the same concepts from different sources, especially difficult ones.
You are welcome! Thanks for the very fast reply! I already have the book by Sutton and Barto, but decided to start with the review article "Deep Reinforcement Learning - An Overview" by Yuxi Li. Also very recommendable!
Thanks for the recommendation! I also definitely love to learn from different sources and in addition to lectures and readings listen to podcasts. I love this one, but it is about machine learning fundamentals, not RL: ocdevel.com/podcasts/machine-learning
Maybe you or another reader can recommend another podcast, being more specialized on RL or other more advanced topics? "TWiML & AI" seems to be quite promising... also want to try out "Machine Learning 101". Any experiences or recommendations on that?
I think the reason why looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last one: it could be the terminating state where the game is over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look into the future for the terminating state.
I have watched lecture 4 four times, and this is the clearest one. For non-English speakers, language is really an obstacle to understanding this lecture. Oh, my poor English, I only got 6.5 in IELTS Listening.
love the example to demonstrate the difference between TD and MC!!
The backup diagrams have made everything much clearer.
Thanks for the good lecture. It really helped me a lot.
I have a suggestion for improving it: English subtitles. They would make the lecture more accessible for the hearing-impaired and for non-English speakers.
Problem being that subtitles cost. Professional captioning can easily run you 400 bucks for a lecture like this, so...
I'm sure that a lot of people wouldn't mind volunteering to provide English subtitles for these lectures (at least I wouldn't). It would be a nice alternative for that :)
@@minhuang8848 lmao wtf, the algorithm already provides free (albeit error-ridden) subtitles, it's more than enough. Just enable them in the video settings.
@@ThePaypay88 They do not. Auto-generated subtitles here frequently miss a bunch of words altogether, and the questions asked by students are barely audible which auto-generated subtitles don't pick up on.
These lectures are sooo helpful! Thank you very much for posting. They are really good :).
These lectures are gold!
On the lambda-return at 1:23:56: if you think of N as being geometrically distributed with parameter lambda (the Bernoulli probability of success) and temporarily think of G^(n)_t as a (deterministic) function of n, then G^lambda_t is simply the expected value of G^(N)_t.
i.e. Let N ~ Geo(lambda) and let G^(n)_t be fixed as a function of n; then G^lambda_t = E[ G^(N)_t ].
### Correction: N ~ Geo(lambda) above should be N ~ Geo(1-lambda) instead.
It is 1:23:56, your timestamp is incorrect! But the observation is correct up to the fact that N~Geo(1-lambda) instead of Geo(lambda). Nice observation!!
@@edmonddantes4705 wow 4years later i am looking at what I wrote and literally have no idea what the heck i was talking about.. haha for a second I thought I had been hacked and chatgpt has been posting youtube comments on my behalf! 😂 time flies ¯\_(ツ)_/¯
I guess it is good to know that somebody read it carefully even after four years 😂
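Spelling the observation above out: if P(N = n) = (1-\lambda)\lambda^{n-1} for n = 1, 2, \dots (i.e. N \sim \mathrm{Geo}(1-\lambda)), then \mathbb{E}[G_t^{(N)}] = \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} G_t^{(n)} = G_t^{\lambda}, which is exactly the λ-return defined on the slide.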
24:52 I think the professor's explanation of this question is a bit misleading. The Sutton & Barto book, where the figure comes from, clearly states that the dealer has a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
I think what David showed in this lecture is the value function V under the policy of sticking only on 20 or 21. This does not say anything about the quality of that policy, just the value of every state under the said policy.
@@larryjuang5643 Liu was talking about the strategy of the dealer not the policy of the agent. I think the point Dr. Silver is trying to make in this case, is that it doesn't really matter what the strategy of the dealer is. In the book by Sutton and Barto the strategy is fixed, so that we can simulate it when writing the code for the environment.
Another meaty lecture !
This is pure treasure
haha.. "You don't need to wait til you die to update your value function.. "
Thanks, TD learning!
What an excellent teacher.
1:34:04 TCHA TCHEW
Anticipation for this moment made this excellent lecture that much better.
I can't fully understand the slide at 1:21:45. What is offline TD learning? I would have thought that offline TD should give the same result as online TD.
Thank you for these lectures. They are fantastic.
1:36:12 - 1:36:22 TD(lambda) and Monte Carlo
"You don't need to wait until you die to update your value function" killed me. xD
An audience member at 46:14 asked: "Does it learn the same behaviour? So if you have a car, and the car almost crashes, but it doesn't... When you're learning step by step, you're going to learn the behaviour of being scared of crashing, but if you're learning only from the very end, you would never learn something to encourage you to avoid your own manic behaviour."
Although prof. Silver's explanation is correct, I think that's not the answer the student was looking for. It is not true that you would never learn something to encourage you to avoid manic behaviour (as the student implies) - given enough episodes you would, since you would crash in many of them. MC does converge to the true value function (by the law of large numbers).
Thanks for the clarification. So you are saying MC would also experience episodes terminating in crashes and hence learn to avoid crashes, just like in the case of TD.
@@paulmarkus4917 Yep.
@@BGasperov Does it mean TD would be able to learn to not crash without crashing, while Monte Carlo has to crash at least once to understand that it's bad?
@@nikebless No, TD would still need to crash (experience an extremely negative reward) in order to learn not to do it. Without crashing (experiencing a negative reward) how could it know that the action that leads to crashing is bad? Perhaps it could in model-based RL but that's another story.
The answer to the question at 46:50 confuses me. I think the questioner was wondering whether MC would learn to drive safely, not whether TD would learn it. David gave the answer that TD would learn it. My answer would be: "MC would learn it after enough iterations, because sometimes you would crash". But hey, maybe I'm the one that misinterprets the question :)
I don't think the example at 43:00 was very clear...
Or at least, it wasn't explained what the policy and the return are here.
I mean, I get the example and I get the theory, but not how the two of them relate.
@1:07:00 He should've started from the graph rather than from the math.
Exactly
Great series of videos!
You get your real reward: having a day at work :D
I have to say, David Silver is slightly smarter than me.
Just slightly, yea, just like how they write "only" in the cheques
😂
@37:30, a question please. I don't understand the difference between the real return (G_t) and the ESTIMATED return (bootstrap, TD target). In lecture 2, the Bellman equation shows they are exactly the SAME, isn't it: G_t = R_{t+1} + gamma V(S_{t+1})? If the left-hand G_t is the REAL return, why is the right side of the equation an ESTIMATE now? Can someone teach me? Thank you!
Estimated return: you already have the real R_{t+1} but estimate the rest (gamma * V(S_{t+1})), so together this is an estimated return.
In Monte Carlo you are at the end of the episode, that's why you can use the real return: you have already received every real reward.
@@leopoldmuller4486 Leo, thanks for your help. Please allow me to ask more about this.
1. I agree with you that the real return in MC is G_t = R_{t+1} + gamma R_{t+2} + ...; here you do have all the rewards, since you walk through to the end. Am I right?
2. I know what you are saying: in TD you don't go to the end of the episode, so you need to estimate. But I don't understand the equation from lecture 2, G_t = R_{t+1} + gamma V(S_{t+1}): my understanding is that this V(S_{t+1}) is the real value, not an estimate. The derivation seems very clear, and the notation looks exactly the same for the estimate and the real value. Did I miss anything? Could you check lecture 2? It was very early in that video. Thank you!
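One way to untangle this: the identity G_t = R_{t+1} + \gamma G_{t+1} holds exactly for every sampled trajectory, and taking expectations gives the Bellman equation v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s], which involves the TRUE value function v_\pi and holds only in expectation, not sample by sample. A single return G_t generally does not equal R_{t+1} + \gamma v_\pi(S_{t+1}). TD goes one step further and replaces the unknown v_\pi(S_{t+1}) with the current estimate V(S_{t+1}), so the target R_{t+1} + \gamma V(S_{t+1}) is an estimate (and biased), whereas MC's target G_t is the actual sampled return.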
1:07:00 MC vs TD vs known dynamics
I cannot clearly understand why TD(\lambda) is better than n-step TD. There seems to be no big difference between the RMS error figures of n-step TD (p. 37) and of TD(\lambda) (p. 42).
I have this intuition, but it may be wrong. n-step TD and TD(lambda) both have one hyperparameter to tune. I think the advantage of TD(lambda) is that by averaging over the 1-step, 2-step, ..., n-step returns you get a bagging-like effect, so it's easier to control the bias/variance trade-off with lambda and avoid high variance.
1:10:13 How does DP bootstrap? Doesn't DP use the already calculated real returns?
DP uses the Bellman equation for synchronous backups: it looks one step ahead from the current state over all actions and successor states and immediately updates V_{k+1} using the value-function equation. Behind the equation, it uses the current estimates V_k of the successor states (not real returns) to compute V_{k+1}, and that reliance on its own previous estimates is the bootstrapping.
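Concretely, the synchronous backup from the DP lecture is V_{k+1}(s) = \sum_a \pi(a \mid s)\big(\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V_k(s')\big): the target is built from the current estimates V_k(s') of the successor states rather than from sampled returns.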
At 1:03:00, I don't understand how TD(0) gives a value of 0.75. If we used the equations, then won't the value be different?
As far as I understand, the TD(0) estimate here assumes multiple episodes (or repeated passes over them). In other words, TD bootstraps its guess of V(S_{t+1}) from an estimate built across all the episodes, so we can get the value 0.75. But I am not quite sure about that.
I think you are right: it would update V(A) to 0 based on what it had seen, if the sequences were taken in that order in a single pass of the TD algorithm. I think you can still understand the 0.75 value as a TD-like analysis after the fact, or you would get 0.75 if you repeated the same 8 sequences again and again.
This also puzzled me. I think the key is to realise that the equations shown at 1:03:00 are for the transition matrix and the reward structure rather than the value function. To find the value function, we should use a Bellman equation. Given the equation at 1:03:00 for transition matrix we see that P_{AB}=1 and the equation for reward structure shows R_{AB}=0. The Bellman equation will then give V(A) = R_{AB} + gamma * V(B). For no discounting we should set gamma = 1 and we also know that V(B) = 6/8 (I believe that's why he says V(B) is the easier one earlier in the video). Substituting then gives V(A) = 6/8.
Incidentally, a similar argument can be used to get V(B). For example, we see that P_{BT}=1 where T is terminal state. We also establish that R_{BT}=6/8 using the equation at 1:03:00, and clearly V(T)=0. Again, we get V(B) = R_{BT} + gamma * V(T) and with gamma = 1, we get the desired result.
Hopefully that makes sense!
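A tiny script to check both numbers, under the usual reading of the slide (eight undiscounted episodes; batch MC averages observed returns, batch TD amounts to the certainty-equivalence estimate described above):

# The 8 episodes from the AB example as (state, reward) pairs, gamma = 1.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch Monte Carlo: average the full return observed from each visit to a state.
returns = {'A': [], 'B': []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))       # return from this visit to episode end
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Batch TD / certainty equivalence: B always terminates, so V(B) is the mean reward at B;
# A always transitions to B with reward 0, so V(A) = 0 + V(B).
V_td = {'B': 6 / 8}
V_td['A'] = 0 + V_td['B']
print(V_td)   # {'A': 0.75, 'B': 0.75}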
Does anyone see a connection between a Non-Markov Decision Process (aka partially observable MDP) and a Hidden Markov model?
I'm falling in love with David Silver 😍😍
1:22:12 Is the offline method referring to Monte Carlo, because it needs to wait until episode termination to update the value function, while the Temporal Difference method is online because it updates the value function every time it sees a state?
Exactly, that's the definition. You might want to look at the slide on page 46; the lecture slides are in the description.
It's been ~8 years since these lectures were uploaded and I'm yet to find an online resource that can rival Dr. Silver's delivery of this material.
I didn't understand the part where he explains how MC learning doesn't depend on how large the state space is. I need to understand it through an example
Really appreciated!! Helped me understand model-free prediction.
Can someone clarify this for me? Monte Carlo learning is basically just sampling a few of all the possible paths from a state and averaging their returns to estimate the value function of that state?
What kind of policy do we have in model-free prediction?
For example, if we have a policy of walking in a straight line from a point towards a goal, then the same set of states will be repeated in every iteration and hence the rewards should be the same in every episode. How would averaging help then?
I don't think you know where the goal is (if you do, you already have the optimal policy). Other than that, your environment will usually be stochastic and your sampling of starting states should be stochastic too.
I have to correct that: as will be explained in lecture 5, acting greedily according to some policy will not lead to good exploration of the state space. This is because your trajectory is sampled from a probability distribution with much lower variance (deterministic actions restrict you to the state-transition dimension of the probability matrix, as opposed to both the state-transition dimension and the action dimension).
I have one question about "model-free". If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change? It's like a human watching a counter repeatedly counting up from 0 to 9 and then counting back down to 0, with a reward of 1 if the person predicts the next digit correctly. The person starts out thinking the state is just the number displayed, and after watching the counter for longer realises the state is more complicated than that. Consider an extreme case, where the counter repeatedly shows all permutations of some sequence longer than two in some fixed order: with the digit as the state representation, none of Monte Carlo, TD(lambda), or the version that incorporates eligibility traces would ever produce a policy better than random guessing, even if the entire period is shown as many times as needed, because the best Markov reward process they can fit has random transitions between states.
"If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change?"
Knowledge about the environment == knowledge about state-to-state probability transitions and rewards. The agent perceives states nevertheless (unless we have a POMDP in which case it's a bit more complicated).
The answer is pretty simple. You are choosing a terrible state space that doesn't represent your problem/what is happening well in your example. Choosing a suitable state space is a big part of solving reinforcement learning tasks. Imagine that I am trying to code a bot that sells and buys different cryptos and the state space is whether it rained yesterday. Then the state space is terrible as a predictor. As a human, if you are not making progress in your task of optimising rewards, it is clear that you are not understanding your state space well, so you should include more and more stuff into it until you see how your returns get better. You have to seek to understand what a suitable state space for your problem is.
Thank you for the good lecture.
One question on this video.
On the Large Random Walk Example slide, I don't get the difference between online n-step TD and offline n-step TD.
In short, what is the difference between online and offline?
I expected TD to always be online. How can TD be offline in the picture?
In this context, online means "in place". In each episode (assuming there *are* episodes), you update all V(s) by using the V(s) themselves (that's "bootstrapping"). If during the episode you update, say, V(s1) and then reuse V(s1) for updating V(s2), then you updated V(s2) not with the old V(s1) but with the update you just made to V(s1).
If you first save all the V(s) in, say, OldV(s) and then update V(s) by only using OldV(s), then you're doing it Offline. If you, instead, use the same V vector and don't worry about the update interferences/overlappings, then you're doing it Online. Online is easier and faster and works well, AFAIK.
Online updates mean the value function is updated at every step within the episode; offline updates are accumulated and applied all at once at the end of the episode.
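A sketch of that distinction for TD(0) over one episode; `trajectory` is assumed to be a list of (s, r, s_next, done) transitions and V a dict of value estimates (names illustrative):

def td0_online(V, trajectory, alpha=0.1, gamma=1.0):
    """Online: apply each update immediately, so later targets see earlier updates."""
    for s, r, s_next, done in trajectory:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

def td0_offline(V, trajectory, alpha=0.1, gamma=1.0):
    """Offline: accumulate increments against the values frozen at the start of
    the episode, then apply them all at once at the end."""
    increments = {}
    for s, r, s_next, done in trajectory:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # V unchanged within the episode
        increments[s] = increments.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    for s, inc in increments.items():
        V[s] = V.get(s, 0.0) + inc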
14:33 was very good :))
What is the difference between online and offline updates at 1:35:50?
Can you learn the dynamics of the system during MC sampling and transform the model-free problem into a model-based one?
Lecture 8
In the equation for TD(lambda), the weights have a (1-lambda) term
But that would make TD(1) have weights of 0?
When lambda = 1, the weights (1-lambda)lambda^(n-1) on the n-step returns are 0 for every finite n, and all the weight ends up on the return obtained when the agent reaches the terminal state. That final return is the actual accumulated reward until the end of the episode, i.e. a Monte Carlo estimate. Therefore, TD(1) uses the same target as the Monte Carlo method.
Be wiser than just plugging lambda = 1 into the coefficients of an infinite series. Compute the limit as lambda --> 1 instead, using the fact that the episode terminates and hence, from some point onwards, the discounted return is fixed. Use the summability of the geometric series.
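For an episode that terminates at time T, the λ-return can be written with the tail made explicit: G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t. As \lambda \to 1 the weight on every finite-n term vanishes and all the mass moves onto the final term, so G_t^{\lambda} \to G_t, the Monte Carlo return.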
Difference b/w expected return and empirical mean return? 8:28
I'm still unable to grasp the concept of expectation.
Empirical mean refers to the sample mean when you have samples available; the expected value is a property of a distribution, not of a sample. In some cases you can infer the distribution from the samples, and the empirical mean converges to the expected value.
11:00 good question, David being pissy as usual about it
Great lecture though
Does TD work in the game of Go? I understand it's very difficult to evaluate a board state without playing it out further. Perhaps TD for breadth search, but MC to evaluate the leaves.
16:35 lol: *actually* worries about breaking the casino! :D
seen movie "21"?
For the blackjack example, how do we know which state we get to if we choose to twist without knowing the transition probability?
I was thinking you'd have to write your own simulator. Did you figure out the answer a year later?
Amazing lecture, but the eligibility trace part seemed to sudden and random to me..
hahaha yeah, looks like he was just running out of time.
Other algorithms from the previous lectures (MC, TD, DP, ...) = forward view.
Eligibility traces = backward view.
It's a method that updates the values of previously visited states using the present TD error, instead of waiting for future (estimated or not) values like the other algorithms we've learned so far.
It's a super cool idea to make something online that seems hopelessly offline. I find it pretty smart, plus the motivation is very psychological.
Can someone explain how TD(1) corresponds to Monte Carlo? According to the formula, (1-lambda) will be zero for lambda = 1, and hence the estimated return should be zero for every state.
Since we have a terminal state (i.e. the episode terminates after some step k), the probability density (weight) for that G_(k) will be the sum of the geometric series from that point to n = infinity, which gives lambda^(k-1). Hence we get MC for lambda = 1. This is what the prof says on the slide at 1:28:00 ("we will assign the rest of the probability to the tail part (shaded portion) of the decaying distribution").
In the forward view, it is a limiting case, i.e. lim TD(lambda) when lambda-->1, motivated by the argument Vaishnav described. Calculate the geometric sum of the coefficients from k onwards and take limits when lambda-->1. You will see that, in the limit, all the mass concentrates on the tail of the series. In the backward view, it can be shown by direct manipulation and the proof can be found in the extra material in the slides.
Can someone explain how, when doing TD(0), we get the part of the TD target that is the estimated value of the next state, V(S_{t+1})?
I am not really sure how he got the value of V(A) and V(B) at 59:54. Can someone please explain?
He explains just after. It was to compare how MC and TD use their samples. So you have 8 samples that include B, 6 of them get a final return of 1, and 2 of them get a return of 0. With MC sampling, the estimate for V(B) is 6/8 = 0.75. For TD it is the same, because state B is always the last one in the sequence, so we are bootstrapping from the final state, with V(final)=0 by definition. The interesting one is V(A). With MC sampling, you look at the average return for any sequence that includes state A. There is only one such sequence and the total return is 0 + 0 = 0, making the MC estimate for V(A) = 0. But the TD estimate can include the existing estimate for B when the sequence transitions from A to B... I think it is a little wrong in that this only works with the TD algorithm as shown if the A:0,B:0 sequence occurs as the last sample. But I think the idea in the lecture is to ignore the algorithms and compare the two intuitive answers V(A) = 0 (for MC) or V(A) = 0.75 (for TD) to get a sense of how MC and TD differ.
This is what I was looking for! Thanks! I also thought the sequences should be in the opposite order, otherwise in the TD case it would also give V(A)=0.
This is exactly what I was looking for! I also think the same, but seeing that no-one else is pointing this out makes me think am I missing something obvious?
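If it helps, here is a rough sketch of that comparison in code (my own encoding of the AB example, gamma = 1, with a dummy terminal state 'T'; batch TD just replays the same 8 episodes until the values settle):

episodes = [[('A', 0, 'B'), ('B', 0, 'T')]] \
         + [[('B', 1, 'T')] for _ in range(6)] \
         + [[('B', 0, 'T')]]

# MC: average the observed return from every visit to each state
returns = {'A': [], 'B': []}
for ep in episodes:
    rewards = [r for _, r, _ in ep]
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
print({s: sum(g) / len(g) for s, g in returns.items()})  # {'A': 0.0, 'B': 0.75}

# Batch TD(0): sweep the same batch many times with a small step size
V = {'A': 0.0, 'B': 0.0, 'T': 0.0}
for _ in range(20_000):
    for ep in episodes:
        for s, r, s_next in ep:
            V[s] += 0.01 * (r + V[s_next] - V[s])
print(V)  # roughly {'A': 0.75, 'B': 0.75, 'T': 0.0}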
x1.0 speed is good for taking the lecture for the first time
x1.5 speed is perfect for quick reviewing this great lecture
Yunlong Song x2 mate
Does someone have more documentation on the question about the geometric decay for lambda at 1:28:00? Thanks
Let's say we are at time t and we have already visited state s three times. If the eligibility trace decays geometrically, then the decay factor is the same at each time step, so we don't need to maintain a history of the individual visits.
Let's say E = E1 + E2 + E3, where Ei is the "partial eligibility" relative to the i-th visit of state s.
If the decay is constant, we have
new(E) = new(E1) + new(E2) + new(E3) =
lambda E1 + lambda E2 + lambda E3 = lambda E
and so we don't really need to remember E1, E2 and E3 separately.
If the decay is, say, hyperbolic (i.e. 1/n after n steps), then the update is more difficult:
new(E) = new(E1) + new(E2) + new(E3) =
? E1 + ? E2 + ? E3 = ???
We can't proceed without knowing when the visits took place. Let's say the right decays are l1, l2, l3, respectively. Then we have
new(E) = l1 E1 + l2 E2 + l3 E3
Unfortunately, we need to remember the entire history of the visits. We can't update E directly as before.
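To add a code-level picture of the geometric case, here's a sketch of one step of tabular backward-view TD(lambda) with accumulating traces (names are mine, not from the slides):

def td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam):
    delta = r + gamma * V[s_next] - V[s]   # TD error for this transition
    E[s] = E.get(s, 0.0) + 1.0             # bump the trace of the state just visited
    for state in E:
        V[state] += alpha * delta * E[state]
        E[state] *= gamma * lam            # same geometric decay for every trace, so one scalar per state suffices
    return V, E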
Just out of curiosity, are you Italian?
I don't understand the question asked at 11:34 and its answer
David explained 'first-visit Monte Carlo', in which the visit count for a state is incremented only the first time you visit it in an episode, and the return is measured from that first visit. Even if you visit the state again later in the same episode, nothing more is updated until the next episode. Thus you get N(s) = 10 for, say, 10 episodes, because you are only counting first visits, and you find the expected return by averaging the returns from the first visit to state s in each episode.
However, it has been shown that every-visit MC, where the return from every occurrence of the state is used and the count is incremented each time (unlike first-visit), converges to the same value. So if you study only first-visit MC, you are not missing anything.
In other words, the value function estimate converges to the same thing for first-visit and every-visit MC.
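A small sketch of first-visit MC prediction, in case it helps (undiscounted for brevity; V and N are dicts carried across episodes, 'episode' is a list of (state, reward) pairs, all my own naming):

def first_visit_mc_update(V, N, episode):
    # compute the return following each time step (gamma = 1), working backwards
    G, returns = 0.0, []
    for s, r in reversed(episode):
        G += r
        returns.append((s, G))
    returns.reverse()
    seen = set()
    for s, G in returns:
        if s in seen:          # only the first visit in this episode contributes
            continue
        seen.add(s)
        N[s] = N.get(s, 0) + 1
        V[s] = V.get(s, 0.0) + (G - V.get(s, 0.0)) / N[s]   # incremental mean of first-visit returns
    return V, N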
Shouldn't TD(0) correspond to n=1 instead of n=0?
It does. But TD(0) denotes TD(lambda) for lambda = 0. This corresponds with 1-step TD or to lambda = 0 in the eligibility trace paradigm. In summary, the 0 corresponds with lambda, not with n.
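For reference, the n-step return being discussed (as I read it from the slides):

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

So n = 1 gives the TD(0) target R_{t+1} + gamma*V(S_{t+1}), and letting n run to the end of the episode gives the MC return; the '0' in TD(0) refers to lambda = 0, which puts all of the lambda-return weight on this n = 1 case.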
For analysis purposes, can we treat every-visit MC as first-visit MC by considering each occurrence of the state as the start of its own episode?
The only thing that will differ is the averaging factor for the chain with multiple visits to the same state.
For TD learning, how can the agent know what the estimated (look-ahead) value of state St+1 should be? I know TD is biased because it has to guess what the next value is and use it to update the current value, but how does it guess?
I think we randomly initialize V(s) at first, then we use the iterative update; that's why he said TD is sensitive to initial conditions.
Here the value function is stationary, without the time index attached
It depends on the specific environment. In a case like chess, where the environment is largely predictable, information like the number of pieces taken can be used to initialise the value estimate. Otherwise, you would start from an arbitrary value function and then use the update rules to adjust it towards the true V(s) under the dynamics p(s'|s,a). Essentially, if you don't want a random initialization, then a prior needs to be applied.
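To spell out the 'guess' in code, a minimal sketch assuming a tabular setting (the initial values here are arbitrary, which is exactly the sensitivity to initial conditions mentioned above):

from collections import defaultdict

V = defaultdict(float)                  # arbitrary initial guess: every state starts at 0.0
alpha, gamma = 0.1, 1.0

def td0_update(s, r, s_next):
    target = r + gamma * V[s_next]      # reads the *current* (possibly wrong) estimate for the next state
    V[s] += alpha * (target - V[s])     # nudges V(s) toward that bootstrapped target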
What about states which are very infrequent, so it takes a long time before they are first visited in a given episode? Over many episodes, that time will add up, right?
In practice, there are many states that could not be visited when the state space is very large. For example, they argue that TD-Gammon probably has not seen most of the configurations in Backgammon and can still produce amazing results. Lectures 6, 7, 8, 10 will certainly answer your questions partially. When there are too many states, other approaches that generalise to unseen states are preferred (parametrising the value function of the policy by a neural network, for example).
Is alpha calculated as 1/N(s) for a given state while traversing the episodes, or is it just chosen manually as a constant between 0 and 1?
I don't know what example you're looking at but it generally varies. Both options you gave are viable in different cases.
In the first case, your estimator is unbiased, in the second one, your estimator is biased. In both cases, there is convergence in the limit.
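A sketch of the two step-size choices side by side (toy code, my own naming):

def running_mean_update(value, target, n):
    # alpha = 1/N(s): exact sample mean of all targets seen so far
    return value + (target - value) / n

def constant_alpha_update(value, target, alpha=0.1):
    # constant alpha: exponentially forgets old targets; useful for non-stationary
    # problems, but the estimate keeps a trace of its initialisation and never
    # settles exactly, which is the bias mentioned above
    return value + alpha * (target - value)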
Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?
First-visit MC gives unbiased estimates (every-visit MC is slightly biased for a finite number of episodes, but both are consistent). Assume a policy pi is fixed. The discounted returns you compute for a state s with first-visit MC are i.i.d. samples of the random variable G = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on S_1 = s. Note that the value of a state s is defined by v(s) = E[G], where E denotes the expectation, so each sample has the right mean and the estimator is unbiased. By the strong law of large numbers, for every state s the average sampled return converges to the true expectation almost surely, since the samples are independent across episodes and the rewards are uniformly bounded. That's the proof.
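In lecture notation, the statement being used (as I understand it) is:

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

\frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s) \;\longrightarrow\; v_\pi(s) \quad \text{almost surely as } N(s) \to \infty

where the G^{(i)}(s) are the returns collected from (first) visits to s under the fixed policy pi; independence across episodes plus bounded rewards is what lets the strong law of large numbers apply.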
"You don't need to wait until you die to update your value function" :)
Wouldn't TD be considered a model-based algorithm since we need an immediate reward and value of state s'?
For an MDP you need an action space, a state space, a reward function (or distribution), and the probability of moving to state s' from s given action a, for all s, s' in the state space and all a in the action space. Right now we only have the rewards and the state space (and, from the next lecture, the action space), but not the transition probabilities. I think that is the reason why we don't have a model, and this is not a model-based algorithm.
Why does using (the immediate reward from the next step + the value function of the next state, i.e. the TD(0) target) mean we are relying on the MDP structure?
Each additional real reward R you include in the target adds noise but reduces bias: with more real steps, the data dominates and the bootstrapped guess matters less. TD(0) goes the other way: it uses only one real reward plus the current estimate, so it has low variance but some bias from that estimate; as the estimates improve over many updates, the bias washes out and it still converges to the true value.
Because in TD you respond instantly to what just happened after you took the action. But only in a Markov environment can you assume that what happened just now depended only on your previous state and action.
I do not understand: how can MC and TD find the optimal policy without comparing all the possible policies like DP does?
Do they sample the majority of the policies?
As far as I understood, MC and TD converge to the same value estimate in an MDP; they are just different ways to get there. No one in their right mind should ideally be using MC for estimating values in an MDP, though; I guess MC is only useful for non-MDP cases. Now, coming to the policy (which I believe is explained in a later lecture): if you remember the explanation in one of the previous lectures, where we alternate between value and policy updates, the greedy policy based on the current value estimate is always at least as good as the policy we started with. Hence, we are bound to have a better policy than before whenever we have a better value function estimate. And over multiple such alternating updates of value and policy, we end up with the optimal policy and value.
What I understand is: the more episodes you sample, the more accurate the estimate is. You can see the blackjack example at 19:50: 10,000 episodes vs 500,000 episodes give different results. DP only works for small cases, which is mentioned in lecture 2 or 3, I think. So for big problems MC and TD save you time by sampling.
1:05:00 = Monte Carlo backups
Can someone give an example of the student's question at 1:15:39?
I think pretty much any example you try is going to be a counterexample, because basically what the student proposes does not make sense. Try a state space with two or three states and a uniform policy with two different actions, and I think it will work :).
Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Now, any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE STATES. In other words, you look at the future states where you might end up depending on which actions you choose, and you express the value of a state as a weighted sum of the values at those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values. This is key. In particular, if we are at state S_t and sample an action with policy pi, R_{t+1} + gamma V(S_{t+1}) is an UNBIASED SAMPLE of V(S_t) when V(S_{t+1}) is the true value. This is the heart of TD learning, whose update rule uses the estimate R_{t+1} + gamma V(S_{t+1}) for V(S_t), where V(S_{t+1}) is now an approximation of the true value at state S_{t+1}, so the sample becomes biased. However, as you know, TD still converges to the true values in the limit, and the estimates become more and more accurate and less biased along the way.
Now, imagine that instead of updating like V(S_t)
Hey everybody,
Is this course an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come up with Udacity's course (the hefty price tag is holding me back), but I heard a lot of good things about this course. How would you recommend studying RL? Take the Udacity course with a lot of coding implementations, or study UCL's course thoroughly and implement code from the pseudocode in the slides?
Sorry for the 1-month delay. You've probably either picked something else or watched this one, but if not, I would recommend following the Practical_RL assignments on GitHub from the Higher School of Economics (if I recall correctly). There you have suggested videos for each 'practical week' and tasks for it; it really helps to grasp the ideas in this course, and later you can start Berkeley's great CS294-112.
@@Shottedsheriff Thank you very much. I've just bought a course from Udacity, I'll do CS294 and Udacity and the one you recommended me :)
if you listen carefully you can hear a subtle fart at 0:36:58
heard dat
maybe he got a reward from monte carlo
Why does the Ace count as 10 on the dealer's side? It says the Ace can be used as either 1 or 11, right?
23:56 What question is being asked?
I guess he asked why, on the "Dealer Showing" axis, the value function at the value '10' decreases compared to the value '9'. But again, I think this example is not really necessary for you to understand; he also acknowledged that not everyone knows the gameplay, which makes it hard to build intuition. And there are many other examples with simpler concepts on the next slides.
Maths is pure beauty
45:46 somebody gagged himself lol
LOL
Thanks for the nice lecture.
50:24
1:35:57
41:45
What's the difference between Backup and Bootstrapping?
In my understanding, a backup is the update itself: you take returns (or successor values) and back them up into the value of the current state, which is how policy evaluation proceeds. Bootstrapping refers to what goes into that update: using your current estimate of a successor state's value instead of waiting for the full actual return. So TD and DP bootstrap, while MC does not.
nice way to spend coronavacation
Could someone please clarify why lambda = 1 corresponds to MC? If you plug in lambda = 1, the return just goes to 0, so how can it be equal to MC?
The formula for the return G_t(lambda) should contain a term at the end for the final update (end of episode) including the terminal state: you need to add lambda^(T-t-1) * G_t to the end of the return function. In the case that lambda = 1, the (1-lambda) sum vanishes and this extra term becomes G_t, which is the same as MC.
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
Just realised from the example from 1:00:00 that I'm an MC kind of guy
He skipped the last 10 slides. I saw them on the website and I think they were really important, but I'm having trouble understanding them. Can anybody tell me whether those slides were important or not?
The last 10 slides show that eligibility traces are truly equivalent to the slow counterpart of calculating updates as we do in MC and naive TD(lambda).
If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
I have the same questions. I totally lost it when it came to backward view of TD(lambda).
1:34:05 cha-chow
IDK who this person in the background is but they cannot stop coughing and burping constantly. It's so irritating.
Last 10 minutes (backward view) need more elaboration.
Where can I download the notes of this course?
www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The class is very diverse, guessing from their voices
Sir, I want to use RL for a frequency controller in an electric vehicle. Please advise which RL algorithm would be best.
Isn't the formula wrong? The last exponent should be T-t-1 instead of T-1; the slide has G_t = R_(t+1) + gamma * R_(t+2) + ... + gamma^(T-1) * R_T, but it should have been G_t = R_(t+1) + gamma * R_(t+2) + ... + gamma^(T-t-1) * R_T.
Yeah, you are totally correct.
What is that sound at @52:14 ? hahaha
hahahaha
who is that guy in the background suffering from covid-19 five years early .. his coughing frequency is so irritating ...
you know 2020 has passed when everyone is thinking the same when they hear someone coughing.