I'm grateful that David was successfully sampled in this iteration of the universe.
Well said!
I laughed so hard 😀😀😀
I refuse to like this comment , to make sure its stays at 69 likes, however tempted I am of doing so.
Ahahaha :D
"Life is one big training set"
D. Silver, 2015
"Follow the gradient, Neo"
Predict values of your future
"you are the running mean of your friend circle".Keep updated that towards goal.
This motivates off-policy learning pretty well. We should learn from someone else's life.
Well my recall is so small.
this guy is definitely the best to explain RL
he is not a random guy...
@@crazyfrog2817 Yeah, it's interesting how random professors that know their stuff may not be good at teaching, but the best of the best always know their stuff and how to explain it
@@crazyfrog2817 he is a legend, people don't realize this
0:00 Motivations
0:35 Outline
0:35 Large-Scale Reinforcement Learning
3:55 Value Function Approximation
8:40 Types of Value Function Approximation
11:55 Which Function Approximator?
15:38 Incremental Methods
15:43 Gradient Descent
17:35 Value Function Approx. by Gradient Descent
21:35 Feature Vectors
23:23 Linear Value Function Approximation
28:43 Table Lookup Features
30:50 Incremental Prediction Algorithms
31:10 Monte-Carlo with Value Function Approximation
37:33 TD Learning with Value Function Approximation
41:56 TD(lambda) with Value Function Approximation
49:00 Control with Value Function Approximation
52:30 Action-Value Function Approximation
53:50 Linear Action-Value Function Approximation
55:20 Incremental Control Algorithms
56:20 Linear Sarsa with Coarse Coding in Mountain Car
1:04:30 Study of lambda: Should we Bootstrap?
1:06:10 Baird's Counterexample
1:06:30 Parameter Divergence in Baird's Counterexample
1:06:50 Convergence of Prediction Algorithms
1:08:00 Gradient Temporal-Difference Learning
1:09:00 Convergence of Control Algorithms
1:10:19 Batch Methods
1:12:30 Batch Reinforcement Learning
1:13:30 Least Square Prediction
1:15:25 Stochastic Gradient Descent with Experience Replay
1:17:25 Experience Replay in Deep Q-Networks (DQN)
1:24:46 DQN in Atari
1:26:00 How much does DQN help?
1:27:35 Linear Least Square Prediction (2)
1:32:29 Convergence of Linear Least Squares Prediction Algorithms
1:32:50 Least Squares Policy Iteration
1:34:15 Chain Walk Example
1:35:00 LSPI in Chain Walk: Action-Value Function
This video is yet another example that the complexity of a subject really depends on the one who is teaching it. Thank you David, for making RL so much more accessible and understandable! It is a real pleasure listening to those lectures of yours ❤
It's gonna be pretty confusing the first time you hear about it, but give it a time, try to understand some of it and then come back to watch it again, you will see how much you can comprehend it.
ditto
I feel this
@@mathavraj9378 j
done that
Yeah, iteration would be good if you don't have a perfect oracle policy at first sight.
This is by far the best course on DRL I have ever watched. I kind of got lost in the different notations in different papers, although I can code many of them up. But this video summarized the root of the problems and solutions concisely, which lets me make sense of what I have learned in the past. Thanks for the great work!
David Silver is an intelligent and humble person. He explains things very well , I'm glad that I came across his lectures
2020 and still watching this to refresh the knowledge 🙋♀️
Actually true, I was watching some other courses but gave up half way to come back to this goldmine of information and articulation.
2024
holy cow, in the last 30 minutes this lecture goes completely off the rails. The amount of concepts introduced and the number of slides shown increases exponentially towards the end :-)
Yeah, I assume that's also how TD(lambda) works. You start off slow and then learn very fast. LOL
@@ChumpyZhang yes, it's more efficient, but I guess it also carries some bias along the way
If you can learn slow, you can learn fast
@@rotviepe😂 a ling ling bro learning RL
1:38 Introduction
15:28 Incremental Methods
1:12:17 Batch Methods
It's amazing how intuitive these concepts become after watching these lectures. David Silver makes the math and theory behind RL that seems so hard to grasp impossible to forget.
It is yet another example that the complexity of a subject really depends on the one who is teaching it.
This is a loooot of knowledge that probably took over 20 years to get this complicated and genius. The question is why would someone share it for free like this! It's a GIFT!
Golden information for free. Thanks DeepMind! We need a more up-to-date course though :/
This reminds me of the quote by Lex Fridman that goes, "All machine learning is supervised learning. What differs is how that supervision is done."
Very easy to understand ! outstanding lecture! outstanding teacher!
It would be great if this video had subtitles.
I don't really care about the math in the lecture, I just enjoy the moment of listening to David talking about state-of-the-art RL and suddenly catching his idea.
It cannot get any better than this 😘
that's unfortunate
Amazing lecture. Thanks for the upload.
16:39
- gradient is the direction of descent !
+WahranRai No, Gradient is the direction of ascent and in most of gradient descent optimization we follow the direction of the negative gradient (for finding minima) math.stackexchange.com/questions/223252/why-is-gradient-the-direction-of-steepest-ascent
Note that @WahranRai said "- gradient is the direction of descent !". Note the "-".
I can never comprehend the backward views and eligibility traces, although the forward view still makes sense.
The backward view is doing online updates: you do that for every state at each time step. If you do that for every state but only apply those updates at the end of the episode, it will be equal to the updates you would do in the forward view.

In the forward view we were updating the values of states when they are seen: we collect all the n-step returns weighted by the appropriate lambda and apply them to the state that was seen, where the return is the return from that state onwards. In the backward view, we immediately update the value of the state, actually of all the states, at each step.

It is easy to see that they are equal when lambda is 0. The eligibility trace loses its recursive part and is only 1 when the actual state is the state we are going to update. And the sum of all those updates that would happen throughout the whole episode equals the forward lambda-0 return, which is actually the 1-step return, because we only care about the 1-step return in that case (all the other n-step returns are 0 because they have lambda in front of their weights).
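A minimal sketch of the backward-view updates described above, on a made-up 5-state random walk (the environment, constants, and episode count are purely illustrative, not the lecture's example):

```python
import numpy as np

# Backward-view TD(lambda), tabular case: one eligibility trace per state.
n_states, alpha, gamma, lam = 5, 0.1, 1.0, 0.8
V = np.zeros(n_states)
rng = np.random.default_rng(0)

for episode in range(200):
    E = np.zeros(n_states)               # eligibility traces
    s = 2                                # start in the middle state
    while True:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states - 1 else 0.0
        terminal = s_next in (0, n_states - 1)
        v_next = 0.0 if terminal else V[s_next]
        delta = r + gamma * v_next - V[s]   # TD error
        E *= gamma * lam                     # decay every trace ...
        E[s] += 1.0                          # ... and bump the trace of the current state
        V += alpha * delta * E               # every state is updated in proportion to its trace
        if terminal:
            break
        s = s_next

print(V)
```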
very cool idea experience replay with fixed Q learning, this seems to be related to double Q learning.
This is a great lecture!
I guess you can't really understand the details unless you do some exercises on it. I still haven't found good videos for that.
Check this repository github.com/dennybritz/reinforcement-learning if you want to learn how to implement these ideas in python.
Abhishek Sharma can not find it :(
To understand TemporalDifference, have a look at this video ruclips.net/video/w33Lplx49_A/видео.html
in particular 56:40
Link is broken
@@piyushjaininventor Remove the the ending parenthesis and the link will work.
Sometimes I think that I learn the truth from the gesture
This helped a lot
I am puzzled by some comments. I felt the lectures in general towards the end are rushed and hand-wavy. The first 4-5 lectures are really good, but after that it feels more fragmented. I realized this more when I saw the DQN slides; that is a familiar topic to me, and the way it was explained was very upside down and cryptic.
I like the convention of referring to the information about the value of an action being given to us by an "oracle". It makes me think of the old woman from the Matrix saying "Hmmmm, I would've made pacman turn left there honey".
Great course! The only negative is that each lecture is too long, so I tried to watch it at a higher speed. You can still follow it at 1.5x speed, and each lecture becomes around 1h.
Graduate level classes are usually 2 hours long.
How would you calculate the Q feature vector based on the action? Would the action considered be part of the feature vector? Otherwise how could my features (e.g. distance to the wall) depend on the considered action without explicit coding of how the environment works?
I am struggling with this same question. Maybe we can help each other think through this. It is confusing because in the gridworld or pacman examples you know the next state by the current state and action you take, but for my problem, robot control, you don't know how an action will affect your state.
One solution like you said was to add a feature for the action. I may be wrong on this, haven't yet tried, but I think that there should be one feature for each action we are considering, with corresponding weights. So when we are in a state and looking for an action to take we loop over each possible action where the Q(s,a) for a particular action will have the feature value for that action as 1, then each other action feature is 0 (ignoring them).
Another approach I am looking at is to have a different set of weights for each action. This may be less efficient but could be more expressive. Will need to do some tests once both approaches are working.
Finally I think the best approach, and the one I have the least understanding of, would be to have a neural network that produces an output for each action rather than a single value. This would seem to be the most efficient in terms of memory and CPU time. And could be a simpler implementation with a black box NN.
What are your thoughts on these?
I think perhaps that's why Powell defines the Q-value as a "Post-Decision State". For some implementations you do not need to add action as a feature. For example if you are playing chess you know immediately what the results are of taking your action (even though you can't predict the enemy movement so you don't have access to the transition function). So if your feature is "the enemy queen is in the game" YES/NO you can judge your possible actions without them appearing explicitly as features but that's because you can tell what your action is doing.
+Alec Karfonta Both of your last two suggestions are what I've seen in practice. Silver even mentioned the latter in a previous lecture. Adding actions to the feature vector would violate the standard distinction made between s and a, so while it might work, you'd be cutting against the grain of all of the literature.
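A minimal sketch of the "one weight vector per action" idea endorsed above, assuming a linear approximator and a hand-picked feature size (everything here is illustrative, not code from the lecture):

```python
import numpy as np

n_features, n_actions, alpha, gamma = 4, 3, 0.01, 0.99
W = np.zeros((n_actions, n_features))     # one weight vector per action

def q_values(features):
    """Linear Q(s, a) for every action at once: a vector of length n_actions."""
    return W @ features

def sarsa_update(f, a, r, f_next, a_next, done):
    """One incremental linear SARSA step on the weights of the action actually taken."""
    target = r if done else r + gamma * q_values(f_next)[a_next]
    td_error = target - q_values(f)[a]
    W[a] += alpha * td_error * f          # gradient of linear Q w.r.t. W[a] is just f
```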
I was finally able to get the value-function approximation working in practice on a crawling robot task. It works better and faster than the Q-Table implementation. I thought it would work better but take longer to train, it surpassed all my expectations. I have never before seen the robots move the way they are now.
First I tried the method I suggested above to have a feature in the state for each action, it would be a 1 if the action was taken in the previous time step. This almost seemed to work, they would chase the immediate reward but would not take the extra step to take actions with no immediate reward to reach a state with an immediate reward.
Then I tried the second method, where each action has its own set of weights for the state features, and this worked like a dream. Within an hour of training they had mastered the task, where the Q-Table would take overnight. So at a time step the state features are combined with each action's set of weights to find the max expected reward action. The next time step, the weights for only the previous action are updated using the delta between the expected and actual reward (immediate + future from the new state).
Now I am experimenting with more complex robots. With the old Q-Table implementation adding a limb would cause the size of the table to explode. With the value-function I have the computational capacity for pretty much any number of actions and state features. So I am trying to train the same crawling robot with just one extra identical limb. Though I seem to have a new problem. Ideally the robot should be able to control both limbs (four motors) at the same time. How do you guys think I could adapt the Q update to take four actions per time step?
Maybe instead of one update that considers all actions at once there are four separate updates for each motor. The weights would be a three-dimensional array like [motor][action][state_feature]. The motor dimension would be of size four, the action dimension three (forward, stop, reverse). At each time-step four actions would be taken, the max action for each motor. The next time-step there would be four updates based on the action taken for each motor and the same previous state values. Though I'm worried this might exacerbate the credit assignment problem.
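A rough sketch of the [motor][action][state_feature] scheme proposed above, under the assumption that each motor is updated independently with the shared reward (exactly the factorisation whose credit-assignment risk the comment worries about):

```python
import numpy as np

n_motors, n_actions, n_features = 4, 3, 8      # 4 motors, each with forward/stop/reverse
alpha, gamma = 0.01, 0.99
W = np.zeros((n_motors, n_actions, n_features))

def choose_actions(f):
    # Greedy action per motor, each motor scored independently from the shared state features.
    return [int(np.argmax(W[m] @ f)) for m in range(n_motors)]

def update(f, actions, r, f_next):
    # One Q-learning-style update per motor, all sharing the same reward signal,
    # which is where the credit-assignment worry comes from.
    for m, a in enumerate(actions):
        target = r + gamma * np.max(W[m] @ f_next)
        td_error = target - W[m, a] @ f
        W[m, a] += alpha * td_error * f
```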
The feature vector is built off of a set of observations. Take a self driving car for example, you would have sensors that detect your distance to the next object, your accelerometer, your pressure gauge, etc. and that will essentially be a generalization of your state. It doesn't matter if you're in your garage at home or in the garage at your office, if your observations are identical, your state would be the same (although your endgoal, and hence where you will find your reward, will be different).
43:10, do we need to explicitly store the eligibility trace of features for all states in a look-up table?
If you learn each Q(S,A) directly you need to store the trace for each state. When learning the parameter vector w you only need to store the trace for each feature.
1:24:00 Why doesn't using old parameters have the same effect as propagating backwards in TD-learning would? Why doesn't it reverse the temporal flow of information and therefore make the Q-values diverge?
Any ideas on how to adapt this to take multiple actions per time-step? I am working on a robot control problem, the simple crawling crate program described in the Berkeley RL lectures. The robot has two joints each with three actions; forward, hold, reverse. I can get it to work fairly well but when I add an extra limb it gets stuck in a local minimum of using just one arm and holding the other still. Ideally I would like the robot to control the torque on each motor but I do not know how I would set up the training examples for the NN. Right now the input is the angle and velocity of each joint and the output is the expected reward of each action. To get the actual target for a training example I use the Q-Value for the current step, the immediate reward plus the max expected reward from any action in the new state. So if there are four actions this Q-Value is used as the expected reward of the previous action. The other three action values in the sample are set to the current expected reward, so that only the value of the previous action is changed. I cannot think of how to adapt this to take multiple actions or output motor torques, any suggestions?
You have to discretize your action space. Use something like mgrid. Then use the index as a "scalar" control input
I would recommend DDPG for your application though
In DQN, in order to avoid chasing a non-stationary target, we fixed (R+y*maxQ(S', A', w)) for a while so that we could chase a stable target. My question is that could we do the same thing to fix R+y*maxQ(S', A') in Q-learning?
In DQN you have weights, so it is the fact that you are changing your weights all the time due to gradient descent that motivates fixing those weights in the objective for a bit: you stabilise the objective. However, in normal Q-learning the objective is not changing much every step, so intuitively it would not make sense to me.
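A minimal sketch of that weight-freezing idea, using a linear stand-in for the Q-network (the names, shapes, and feature-per-(state, action) encoding here are made up for illustration):

```python
import numpy as np

n_features, alpha, gamma = 8, 1e-3, 0.99
w = np.random.randn(n_features) * 0.01   # online parameters, updated every step
w_frozen = w.copy()                      # frozen copy w^- used inside the target

def q(features, params):
    return float(features @ params)      # stand-in for a real Q-network

def dqn_step(f_sa, r, f_next_per_action):
    global w
    # The target is computed with w_frozen, so it stays still while w keeps moving.
    target = r + gamma * max(q(f, w_frozen) for f in f_next_per_action)
    error = target - q(f_sa, w)
    w += alpha * error * f_sa            # gradient step only through the online estimate

def sync_target():
    global w_frozen
    w_frozen = w.copy()                  # every C steps, refresh the frozen copy
```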
There is probably an extension available for ur browser to increase the audio of the video....
use linux
Bottom of slide at 30:20ish. To take the dot product of those vectors shouldn't the state vector be transposed?
Currently we have an nx1 state vector and an nx1 weight vector, giving us nx1 times nx1, which won't multiply. We need 1xn times nx1 to give us a 1x1 output.
Not true. First of all, you are confusing scalar product with matrix multiplication. You are right in the sense that you obviously cannot multiply an nx1 matrix by an nx1 matrix, but that dot does not mean matrix multiplication. In mathematics, those vectors you are seeing are interpreted as vectors in the Euclidean space R^n (\mathbb{R}^n), and he is simply expressing two vectors of the Euclidean space as column vectors and writing a dot in-between to express scalar product. This is sometimes done in mathematics, physics, etc. Note that if I have two vectors x and y in R^n and I express vector x like a 1xn vector and y like an nx1 vector, the scalar product coincides with matrix multiplication. This is because in the Euclidean case, the dot product can be rewritten in terms of matrix multiplication of suitable matrices.
I will add that in general, you just need an inner product space in order to define a scalar product, and a scalar product is defined BETWEEN ELEMENTS OF THE SAME SPACE. Your space could be R^n, a space of functions like L^2, A space of sequences, etc.
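A tiny numpy check of the point above: two 1-D vectors in R^n have a well-defined scalar product, and it coincides with the 1xn-times-nx1 matrix product (the numbers are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])               # feature vector, shape (3,)
w = np.array([0.5, -1.0, 2.0])              # weight vector, shape (3,)

print(np.dot(x, w))                          # scalar product of two vectors: 4.5
print(x.reshape(1, 3) @ w.reshape(3, 1))     # same number, as a 1x1 matrix: [[4.5]]
```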
please add captions
At 43:28 he talked about the size of the eligibility traces. I don't understand why its size is the same as the parameters' size and not the size of the state space. Does anyone know the answer to this?
Let's assume we're using a linear approximation, that is: v(s,w) = f1(s) w1 + f2(s) w2 + ... fn(s) wn, where fk(s) is the k-th feature of the state s.
In this case, the gradient of v in s is just the feature vector f(s). This will make what follows more intuitive. It's clear that if, say, f1(s) is much bigger than the other features, then we'll want to focus on w1 to increase/decrease v(s,w) as much as possible.
Instead of using eligibility traces to assign credit to the states, we use eligibility traces to assign credit to the features which "summarize" the states.
Let's say we have 4 states:
1) it's raining and it's cold
2) it's raining and it's hot
3) it's sunny and it's cold
4) it's sunny and it's hot
The features are:
r = it's raining
c = it's cold
which can take values 0 (false) or 1 (true).
If we go through states 2 and 3, we'll just remember the features (r,c) instead of the states themselves:
e
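A small sketch of the idea in the (cut-off) example above, with assumed values of gamma and lambda; the traces live over the two binary features, not over the four states:

```python
import numpy as np

gamma, lam = 0.9, 0.8            # assumed values, just for illustration
x = {2: np.array([1.0, 0.0]),    # state 2: raining, hot  -> (r, c) = (1, 0)
     3: np.array([0.0, 1.0])}    # state 3: sunny, cold   -> (r, c) = (0, 1)

E = np.zeros(2)                  # one trace per FEATURE, not per state
for s in (2, 3):
    E = gamma * lam * E + x[s]   # decay, then add the gradient (= feature vector, linear case)
    print(s, E)
# After visiting 2 then 3: E = [0.72, 1.0] -- credit is assigned to "raining" and "cold",
# not to the individual states.
```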
Great Lectures.
Slides link isn't working.
Does anyone have a reference to the "Gradient Q-Learning" approach that was mentioned around 1:11:27? I couldn't seem to find it online. In particular, I am curious how this approach actually computes the gradient of the TD target (especially how it handles the max term). Perhaps it is related to Gradient TD somehow?
Tnx a lot 👍
I don't understand how they train with experience replay at 1:17. Is it that at each time step you take some state and target pairs that the model has seen before and apply stochastic gradient descent? Also, how do they improve the policy at each time step? When using the greedy policy, which function approximator do they use: the frozen one or the updating one?
I think it's each timestep. 40:30
"Every step, what we do, we sample some mini-batch from our dataset ... optimise MSE between Q-network and Q-Learning targets" 1:18:34 . Also i suppose the 'i' in the loss function equation means "iteration". So calculate the loss each iterations (timestep)
1:22:15 "any iterations 'i' " , "optimize a loss function"
46:49 this explanation though.
So is it that we don't store states explicitly unlike before where we had them in a look-up table? Or is it that we still have to store them, but now we can quickly estimate the values of similar states and add them to our table even without having to visit them?
I'm trying to understand how not storing them will work when we have to take an action using action-values. If we don't have them explicitly stored, then we evaluate all actions possible in the current state in real-time using function approximation right?
So, the point of value function approximation is that you don't have to store the Q-value lookup table for all states and actions (it may be extremely large for many problems). The approximation of the q-value can be used to "extrapolate" the values of unvisited states (that is what is called generalization). Think of it like this: you have a bunch of x-values and y-values and you want to predict approximately the value of y given a new and unseen value of x. So we try to fit a function to the known x and y values. This is exactly what supervised learning in ML does, but here we get the "training set" by interacting with the environment. Hope this helped.
@@ajaybala8967 Thanks for the reply. Truth be told I figured it out and am halfway through my skl project using that exact same thing lol. Only problem now is that it is so computationally heavy that it is the first time i actually have to consider code optimisation haha. But thanks again for the help!
No problem!
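Building on the supervised-learning analogy above, a minimal sketch: treat (feature vector, Monte-Carlo return) pairs as a regression dataset and fit a linear value function by least squares (all numbers are made up):

```python
import numpy as np

# Pretend these came from Monte-Carlo rollouts: rows of X are feature vectors x(S_t),
# entries of G are the returns observed from those states. Fitting w is plain regression.
X = np.array([[1.0, 0.2],
              [1.0, 0.8],
              [1.0, 0.5]])
G = np.array([0.0, 1.0, 0.6])

w, *_ = np.linalg.lstsq(X, G, rcond=None)   # least-squares fit of v(s, w) = x(s) . w

def v_hat(x):
    return x @ w

print(v_hat(np.array([1.0, 0.65])))          # generalises to a state we never visited
```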
1:11:01 didn't know Boris Johnson was interested in RL
Oh wow, it really does sound like him.
We update the weights in Monte Carlo with value function approximation (incremental methods) at the end of multiple episodes, right? And how does it solve the problem of space? Can someone please clarify?
The problem of space is called curse of dimensionality. A function approximator (linear features, neural net, wavelets, etc) uses way fewer trainable parameters than the number of total states (which could even be continuous). Note that by means of a neural network, you could compute a good estimate of the value of infinitely many states by only having a small number of trainable parameters. Why? Because a neural network is capable of interpolating and generalising if the architecture is chosen correctly. That's the beauty! Trainable parameters
Any idea what kind of features are used for the mountain car example? In my understanding, the user has to choose the features?
The feature vector is the output of the network (e.g. linear regression, neural network); you don't construct it manually. But you may set the size of the network output and let the network optimize the feature vector by itself.
As he said, the features consist of the current position and velocity of the car. E.g. when the car is at position 32 and accelerating at 80% of the maximum speed, then the car ends up in position x with state-value function = 8.
And that value function is used to improve the new policy and optimize future actions.
The features are coarse coding in the first pictures (i.e. there are many circles and for each circle, a feature is defined as 1 for states on the circle and 0 otherwise), and radial basis functions (probably Gaussian densities with different means) for the other image where there is handwritten text. In both cases, the function approximators are linear.
@@alvinphantomhive3794 You are confusing features with states.
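A minimal sketch of both feature constructions mentioned above for a 2-D (position, velocity) state; the centres, radius, and widths are arbitrary choices for illustration, not the ones from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(-1.2, 0.6, size=50)      # mountain-car position range
vel = rng.uniform(-0.07, 0.07, size=50)    # mountain-car velocity range
centers = np.stack([pos, vel], axis=1)     # 50 circle / RBF centres in state space

def coarse_features(state, radius=0.3):
    # Coarse coding: feature i is 1 if the state falls inside circle i, else 0.
    # (In practice you would rescale position and velocity to comparable ranges first.)
    d = np.linalg.norm(centers - state, axis=1)
    return (d < radius).astype(float)

def rbf_features(state, sigma=0.3):
    # Radial basis functions: a smooth, Gaussian-shaped version of the same idea.
    d = np.linalg.norm(centers - state, axis=1)
    return np.exp(-(d / sigma) ** 2)

# Either way, the approximator stays linear: v(s, w) = features(s) @ w
```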
For TD learning with value function approximation, using the linear value estimator as the TD target, won't that damage the algorithm..?
Depends on how well you choose those features right? Feature selection is an art. Also, the look-up table case corresponds with linear function approximation.
Why is there E_pi in the objective function?
Does the "action out" approximation around 11 minutes give us a dictionary as output?
I.e. {Action 1: 7, Action 2: 6.5,...}
If you fixed the state, yes, you can view it like that. That's the computational way of interpreting it. The mathematical way would be a vector, each component containing the action-value of the fixed state given one of the possible actions.
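A tiny sketch of that "action out" reading, with a placeholder linear layer standing in for the approximator and made-up action names:

```python
import numpy as np

n_features, n_actions = 4, 3
W = np.random.randn(n_actions, n_features) * 0.01   # placeholder "action out" network
actions = ["left", "stay", "right"]

def action_values(features):
    q = W @ features                    # mathematical view: a vector in R^{n_actions}
    return dict(zip(actions, q))        # computational view: {action: value}

print(action_values(np.ones(n_features)))
```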
I am trying to find more information about the "Coarse Coding" function approximation method online (at approx. 1:01:00). Anyone have some useful links? It appears to be quite elusive.
its been two years since you asked, did you find any?
Can someone explain the difference between gradient TD and the algorithms that were shown earlier in the lecture?
I am guessing gradient TD applies compatible function approximation in order to make updates in the real descent direction, but I would love to see a reference of what he means by gradient TD or gradient Q-learning. Again, I strongly suspect that they come from some improvements due to the policy gradient theorem by Sutton.
Hi guys, I'm wondering if feature vectors need exactly the same number of components as the number of states.
No, definitely not. The number of states is arbitrary, and a feature vector is canonically a 1-column vector with many rows. For example, the feature vector for chess could have 1 row for every point on a chess board, and the entry in that row could be a number representing what kind of chess piece is occupying that point. Opponent's rook, my queen, empty space, my opponent's knight, et cetera.
Of course not. Typically, it is ill-advised to do that, since that would imply that your number of trainable parameters is equal to the number of states, and the purpose of function approximation is precisely avoiding the curse of dimensionality of huge state spaces by using a much smaller number of trainable parameters. It's like you are back in the look-up table case. What a torture 😂
I'm struggling with the objective function for DQNs. We're trying to learn w, and we update it towards action-value functions estimated with an older parameter vector w-. That feels wrong to me; what am I missing? I can see that we bootstrap towards the future (so we observe a reward and estimate the value at s', a'), but using w- from the past, which is (hopefully) a worse estimate of the optimal parameter vector w* than the estimate I have now.
I felt the same when i heard this idea
Did you watch the lectures on regular tabular Q-learning? We did the same thing there. While it may feel wrong: 1. you can come up with a better method and publish a paper; 2. we still inject information from the real reward, thus making our predictions more accurate; that's why it all works.
@@robosergTV Thank you for your answer. Still, to this day, I feel like I'm missing something.
I studied and understood table-lookup Q-learning, and I want to highlight that it is not the TD target that bothers me (on the contrary, I am amazed by how TD learns about the return without ever observing it).
What I felt odd about is the bootstrap towards *old* Q-targets, parametrized by w-.
After all, I'm trying to improve my estimate of w, so I want to improve over w-.
Anyway, I can grasp that, as you suggest, the observation of the unbiased reward R ensures an improvement.
Cheers
@@davideabati1903 I am facing the same dilemma. Here the prof says that we fix the old target for, say, 1000 iterations. How do we ensure that the model trained during these iterations improves when the error gradients themselves point towards a worse estimate of Q*? In tabular Q-learning he said we take an alternative action as the target, and my interpretation of that was from an exploration perspective. Please tell me if I'm going in the wrong direction.
This is one of the most interesting questions in this lecture's comments. My first observation is that in the look-up table case, the Q-learning updates are the same as for the DQN objective (with corresponding indicator function features, obviously) as long as WE DON'T FREEZE THE OLD PARAMETERS. Therefore the question of why freeze the old parameters at all is very natural. My second observation is that in the look-up table case, we are not moving the objective much at every step, in the sense that we only modify one pair of state-actions, so gradient descent won't go crazy by following a constantly moving objective. Note that although only one pair is moved, the change is clearly more violent in the table look-up case. Another reason why gradient descent will not go crazy in this case is the obvious one: it converges because it coincides with Q-learning and Q-learning converges under certain conditions.
Now, given the two previous observations, I will summarise the intuition that I have. The look-up table Q-learning updates follow the Bellman optimality equation and converge under certain conditions, so in a sense, it is a proper way of making updates, of going towards the truth. However, when substituting those greedy updates by a gradient descent step, who guarantees that convergence will be retained? What is the theoretical basis? After all, a gradient descent step is smooth, it is all but greedy and violent in general (contrarily to the real Q-learning objective). My claim is that in order for convergence to be retained, gradient descent must update somehow similarly to look-up table Q-learning. However, whereas look-up table Q-learning updates one pair violently in a greedy way, a gradient descent step updates all the weights very slightly, and hence, if you want to converge towards your real objective, you probably need several of those updates in the full weight space. One update is not enough for you to know where the hell those weights are smoothly trying to go towards. Basically you need to approach that violent greedy action a bit more, since a single step might not be representative of that greedy improvement. Also, since after every gradient descent step all the weights are slightly changing, this is probably not too good for the convergence of gradient descent. Gradient descent is normally used when objectives are stable, not when objectives are literally changing every step. So it is good to wait for some steps and see where that cumulative gradient update is actually really wanting to go, and once you are convinced, you can change the weights :P.
I can't get the intuition for why the eligibility trace becomes Et = .... + gradient of ^v(s,w) or gradient of ^q(s,a,w) w.r.t. w. Can someone explain? Thank you very much in advance.
Either (1) think of it as a generalisation of the look-up table case, in which case it is clear that they coincide, since the gradients are delta functions (i.e. x(S_t) is all zeros except at the component corresponding with state S_t, where it is 1), or (2) compute it by following the same ideas as when showing that the backward view of TD(lambda) is equivalent to the forward view (this is in the extra material in the slides for lecture 4). I think the equivalence is only true in the linear case :). If you understand the proof in lecture 4, you should understand this.
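For reference, the backward-view updates with function approximation from the slides can be written as:

```latex
\delta_t = R_{t+1} + \gamma \, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\
E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) \\
\Delta \mathbf{w} = \alpha \, \delta_t \, E_t
```

In the table-lookup case, \nabla_w \hat{v}(S_t, w) = x(S_t) is an indicator vector (1 at the component for S_t, 0 elsewhere), so this reduces to the per-state traces from lecture 4.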
Does the experience set D consist of just one episode? Or multiple episodes/runs?
Multiple episodes/runs.
(Also, He said in the lecture that we randomize them while sampling, so it shouldn't depend on which run it is)
43:40 how can we apply an eligibility trace to a model where some of its weights come from something black-box?
Every time a state transition occurs, you immediately decay the eligibility traces for all the states you have visited, but at the same time you also increase the eligibility trace, only for the current state. In function approximation, each state is represented as a feature vector.
Where can I find the true value function to compare against to find the error?
The true value comes as the trajectories are unrolled: use the *function mapping* (approximation) from the last sample/episode as the target for the next. It's just like supervised learning, except now the data is correlated (it's kind of a sequence).
At 47:15, why can't the second part, \gamma \hat{v}(S',w), be left in terms of w, and a new gradient be found with respect to the full function for optimization, instead of just taking the gradient part from \hat{v}(S,w)?
I guess the gradient will not behave properly because we haven't seen the *actual effect* of being in state S'. In other words, we can't know (ahead of time) how the predicted value of the next state v(S',w) will change as a function of S' and w, since we don't yet know whether being in state S' will be useful or not, or what the direction of the gradient of v(S',w) will be, or whether it will increase or decrease with respect to S'.
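Concretely, the update in question is the semi-gradient TD(0) step from the slides: the target is treated as a fixed number, and only \hat{v}(S_t, w) is differentiated:

```latex
\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})
```

Differentiating through both \hat{v}(S_t, w) and \hat{v}(S_{t+1}, w) gives a different family of algorithms (residual-gradient methods), which is not what the TD update in the slides does.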
why does Monte Carlo evaluation converge to a local optimum
Because the target in MC learning is the real return (Gt).
This is solely due to function approximation. The value function may not be a linear combination of the features, and so, MC with linear function approximation cannot converge to the global optimum in this case. "An analysis of temporal-difference learning with function approximation" proves that MC TD methods converge to the least squares solution when this is the case.
Anyone have some good resources for learning about sourcing feature vectors (23:30)
Does anybody here know if the Q estimates or weights have to stay in a range like 0.0-1.0, or can they just be any real value?
I think it could be any value, since the weight update (delta w) equation is taken from the difference between the estimated value and the real value, multiplied by some epsilon (step-size), moving along the gradient over and over until the value converges to the global optimum.
What is E_{pi}[(v_pi(S;w) - \hat{v}(S,w))^2]? Is it the true expectation over trajectories of the stochastic process given by the MDP, i.e. S1,A1,R1, S2,A2,R2, ... ~ MDP?
v_pi does not depend on w, but it is the true expectation, yeah. Trajectories are sampled from the MDP. If you want to be more rigorous, the density is the discounted state distribution
\rho^\pi defined in Preliminaries, page 2 of
proceedings.mlr.press/v32/silver14.pdf
(Deterministic policy gradient algorithms)
The question at 1:22:00 was like gold.
someone needs to run the Independent component analysis algorithm and filter out the irritating background noise.
What if we wrap the input of neural network in a function and treat each such single function as a state? That way it would be an MDP again and the lookup table method still works in choosing the right function to call, a program can program itself by solving this MDP.
No subtitles?
I would have liked this video, but then it wouldn't be 420
For Experience Replay, how long do we hold on to old data? Do we flush it away after every update of the network? If so, then we lose valuable information from the early explorations and the network might unlearn that data. On the other hand, it's not easy to reuse the old episodes, because those 'target' values were computed based on the output of old networks that may now be outdated.
We don't flush it. It depends on what the size of the Experience Replay Memory is. It can be implemented as a cyclic buffer of a bounded size that holds the transitions observed recently: pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
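A minimal sketch of such a cyclic buffer using Python's standard library; the linked tutorial implements essentially the same structure:

```python
import random
from collections import deque

class ReplayMemory:
    """Cyclic buffer of bounded size: the oldest transitions are silently dropped when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```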
19:17 I died! Best ever physical interpretation of a mathematical equation
Another tutorial that explains Experience Replay: ruclips.net/video/AohtWrJwvgg/видео.html
37:00
wow
37:09