The transition probability of this course decays exponentially, from 1 million views (the first lesson) to 50 thousand by lesson 10. Only 5% have come out of Class 3)))
37:18 How do we calculate the value of states that have circular transitions? i.e. the value of class 3 (v(s)=4.3) depends on the value of pub (v(s)=0.8). How do we arrive at 0.8 for the pub then if the value of the pub depends on the value of class 3?
I guess this is covered in the Dynamic Programming lecture next: the values are found by iterating the Bellman equation. Note, though, that for an MRP the Bellman expectation equation is linear, so there is actually a closed form (the matrix form v = (I - gamma*P)^(-1) * R shown around 40:00); the circular dependencies are resolved by solving all the equations simultaneously. It is the Bellman *optimality* equation that has no closed form.
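For anyone who wants to see the circular dependencies resolved numerically, here is a minimal NumPy sketch of the matrix-form solution for the student MRP with gamma = 1. The transition probabilities and rewards are transcribed from the slide, so treat them as an assumption and double-check them against the video; Sleep is terminal, so v(Sleep) = 0 and it is left out of the linear system.

```python
import numpy as np

# Student MRP (non-terminal states only); rewards and transitions transcribed from the slide.
states = ["Class1", "Class2", "Class3", "Pass", "Pub", "Facebook"]
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0])
P = np.array([
    # C1   C2   C3   Pass Pub  FB    (mass going to Sleep is dropped, since v(Sleep) = 0)
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5],  # Class1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0],  # Class2 (0.2 -> Sleep)
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0],  # Class3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # Pass   (1.0 -> Sleep)
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],  # Facebook
])
gamma = 1.0

# Bellman equation in matrix form: v = R + gamma * P v  =>  v = (I - gamma * P)^(-1) R
v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
for s, val in zip(states, v):
    print(f"v({s}) = {val:.1f}")  # Class3 comes out around 4.3 and Pub around 0.8, matching the question
```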
@57:17 "[State-Value Function] V is telling how good it is to be in a particular state, [Action-Value Function] Q is telling how good it is to take a particular actions from a given state"
Not that much at the university level, actually - it's very standard and presented in an easy-to-follow manner. Also, I assume they had to have some prerequisites in math, probability and graph theory to get into this class. MDPs and the Bellman equation are needed for solving optimization problems for stochastic systems.
@@17teacmrocks I love how at the end he opens up for questions and states that he expects not everyone understood. I'm following because I read the topics before watching, and that isn't a light read at all.
Just watched the video over 5-6 hours lol, took some notes meanwhile and talked through the difficult parts of the subject with my mate. Definitely a lot of stuff covered, and happy to be able to watch it at my own pace; it would be a little overwhelming to watch and understand in 2 hours.
at 1:07:40, when he says stochastic transitions, the transitions are governed by two separate agencies. So the agent takes an action from a state, and the environment, depending on the action, puts you in a new state. Correct?
Yup, somewhat like that. So depending on your action, there is an independent randomness operating which helps decide the state you end up in. A deterministic transition would be where your action solely decides which state you end up in. Stochastic transitions add randomness to the transition from one state to the next state.
I have a question around 57:00. After an agent takes an action, should the next state not be specified? If the next state is specified, how can I take the expectation over different states?
@1:02:37 Lecturer: "Is that clear?" Me: "No it's not, there's a problem and YOU know what it is, and you know that I know that you wished nobody noticed it 😂" No matter what, these are the best lectures on RL; I really enjoy these underlying details. Thanks David Silver
Am I wrong, or should q* for a=Pub at 1:13:10 be 9.4? In the example of (s=Class 2, a=Study), the optimal action-value function is calculated as q* = R + sum(P_{ss'} * v*(s')) = -2 + 1.0*10 = 8
I think he was a bit confusing here. Isn't he using the fact that expectation is a linear operator (E[x+y] = E[x]+E[y]) and then the fact that E[E[x]] = E[x], since E[x] is a constant and the expectation of a constant is the constant itself?
@@estherderman8472 He is just using E[X+Y] = E[X + E[Y]]. In this case, X := R_{t+1} | S_t=s, Y := G_{t+1} | S_t=s, which are random variables. You are confusing it with E[X|Y], which is a random variable and not a number since it is defined as E[X|sigma(Y)], where sigma(Y) is the sigma-algebra generated by Y. Anyway, what he is doing is trivial.
It's simply a look-ahead diagram to evaluate the value of the current state, v(s). To calculate v(s), we add 2 values: (1) the reward for being in that current state, i.e., R_s, and (2) the discounted probability-weighted sum of the values of all possible future states one step later from s.
4:05 there is one problem with the Markov property: if S is (S1,S2), where S1 and S2 are two correlated random variables, then we can't simply estimate the correlation between S1 and S2 using the last realization of those variables - we need the full history (the longer, the more accurate our estimate), so we simply lose some information in this formulation unless we extend S with additional information (with the whole history or values allowing for...
Awesome lecture! But at 1:14:00, shouldn't q* for going to the pub be 9.4? q* here seems to exclude the reward of taking the action of going to the pub.
Although the lecture series overall has good explanations, it does assume the audience is somewhat knowledgeable in this specific domain. For example, what is meant by "1 sweep"? What is meant by "fix the policy", and what are the criteria for the fix? For beginners with decent comp sci knowledge, it's best to look for supplementary materials to pair with the lecture series.
at 13:33 it seems to me that the reward function is a function of both state s and time t. However, the way it is written, it only depends on state s. Then later it is used at 27:09 as a function of time t which is inconsistent with how it was defined earlier.
Markov processes do not depend on time. They depend only on your current state. If you arrive at a certain state at different times, the probability of moving to any other state will not depend on the time you arrived.
@@cparks1000000 Mate, the transition probabilities in a Markov Chain depend on time. He is making a tacit time-homogeneous assumption that simplifies many things, but the transition probabilities of a Markov Chain in general form DEPEND ON TIME. The MDP formalism is developed for time-homogeneous processes.
He is considering time-homogeneous Markov chains, so in this formalism, there is no time dependence in the rewards or transition probabilities. A good thing to ask yourself is why this formalism is so useful (?) I.e. why is the time-homogeneous assumption made, and how can you make non-homogeneous things homogeneous :)
The difference between the state-value function and the action-value function is that the state-value function is the expected total return starting from a state s until the end of the sequence while following a policy pi (i.e., sampling an action according to pi in every state), whereas for the action-value function we first commit to a particular action in the present state, move to the successor state, and then compute the expected total return while sampling the remaining actions according to pi. Is this understanding correct??
At minute 27:25, aren’t the values wrong, given that the discount factor is 0? For example, the value of the state of Class 2 should be -2 times the probability of ending up in Class 3 plus 0 times the probability of ending up in the sleep state, and this sum does not give -2. Is this correct? This comment is based on the fact that when γ=0, then G_t = R_(t+1), while the values on that slide would suggest G_t = R_t.
Yes, I think those values are wrong. I went to the comment section just to find another comment validating that. (Edited : Nevermind, they are correct. It's just that the notation is a bit confusing.)
at 54:47 there is an R_{t+1} for both the state-value and the action-value functions. It is said that R_{t+1} is specific to the action taken at the state. How can that be true, since R_{t+1} is given as soon as we leave the state? Wouldn't it simply depend on what state we are leaving?
I think there is a mistake on the slides, around minute 40:00 (Bellman equation in matrix form). The v on the right side of the equation has to be v', because it is the value of the state we end up in. The notation on the slides implies that it refers to the state we came from. Awesome lecture by the way!
This means that you didn't follow the lecture. Both v's are the same; that's the power of the time-homogeneous assumption. The value and action-value functions are time-independent.
In practice, must we program an agent to store the optimal action calculated for each state in the state's data structure (thus adhering to the Markov property), or will the agent calculate the optimal value at every step, every time?
Yes, but I'm not sure it's a mistake. Maybe he's using the convention that the max "captures" everything on its right. Who knows... I would use the parentheses.
@@kiuhnmmnhuik2627 It is a mistake! You can see he uses the correct version in other lectures, plus the Bellman Optimality Equation is pretty well-known and the max affects all the terms on the RHS. Compare with any source in the literature. You can even compute it yourself with a bit of careful work.
Yeah, well, David really needs to write a book on this subject material now. Enjoying the videos, but we really need a captured analysis. There are few books on RL; Sutton's book is nearly twenty years old (draft updated for next year), but this lecture series is easier to comprehend than the standard texts. Any book needs to have the algorithms described, to show how the theory can be applied. Well done.
I think there is a mistake at 1:25:47. q*(s,a) should be the q* of a particular s and a particular a (even though that action might lead to an uncertain s'), rather than the average of "left" and "right".
Once you've taken an action, you let the environment decide what state you are going to end up in (it's not deterministic, so given the action you could still end up in many places). Therefore you average the state values of the states you might end up in, weighted by the probability of going there.
Regarding the question at 1:34:30 on eliminating risk: is it possible to train two agents and use their combined experience to train a third one? I.e., one that is risk-averse and one that is profit-seeking yields a risk-averse, profit-seeking agent.
I have a scenario which I don't understand: assume two possible actions a1 and a2 with respective rewards Ra1=-2 and Ra2=-10. Now assume that each of these actions leads from state s to states s1 and s2 deterministically, meaning no environment randomness. Now v*(s) = max_a(R_s^a) + gamma*sum_{s' in S}(P_{ss'}^a * v*(s')) is the Bellman equation for v*. Now, the reward for action a1 is slightly higher than the one for action a2. At the same time, the v*-value of state s2 is significantly higher than that of s1. Now, how do you calculate v*(s)? Do you pick the action that gives the highest reward (a1) and take v*(s1), or do you maximize v*(s) by picking a2 and s2?
v*(s) is the maximum over actions of the immediate reward plus the expected v*(s') of the next state. So in your case it's the sum of the immediate reward and v*(s') for the action which leads to the highest (immediate reward + v*(s')) (I assumed gamma=1)
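Here is a minimal sketch of the point in the reply above, with made-up values for v*(s1) and v*(s2) (they are not from the lecture): the max in the Bellman optimality equation is taken over the whole bracket R + gamma * v*(s'), not over the immediate reward alone.

```python
gamma = 1.0
# action: (immediate reward, value v* of the state it leads to deterministically)
actions = {
    "a1": (-2.0, 1.0),    # slightly better immediate reward, worse successor (made-up v*)
    "a2": (-10.0, 20.0),  # worse immediate reward, much better successor (made-up v*)
}

q_star = {a: r + gamma * v_next for a, (r, v_next) in actions.items()}
v_star = max(q_star.values())
print(q_star, "v* =", v_star)  # a2 wins here: -10 + 20 = 10 beats -2 + 1 = -1
```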
@1:38:00 when he's talking about the principle of optimality, that repeated optimal behaviour will lead you to the optimal solution... it seems each time we're only looking ahead one step, but each look-ahead is recursive, I guess. So are we looking forward just one time step, or are we looking infinitely far forward? The idea of getting stuck in a local optimum is on my mind; I think there's something I'm missing. But yeah, my question is HOW does looking forward one time step capture the set of other possible future time steps that I might take after moving from state 1 to state 2? I am perplexed by how we can confidently name the value of nodes that are 1 degree away when their value depends on their own surrounding nodes.
It's crazy to me how much more of RL I've understood in these 2 beginning lectures vs multiple days of trying to understand what my professors have cooked up in their lecture. And David actually goes into even more depth.
If you are reading this in 2024 and wondering what value a 10-year-old course can offer, you are completely wrong. I have looked at many courses and lectures to find the best RL course to start with, and found this to be the best. Force yourself to start with this.
Can we build the machine learning program/code with this course? I am interested, but I don't know if I would learn how to implement it - is there a 'coding' part to it?
@@happydays334 No, this course doesn't teach you to write code. It covers the fundamentals and the math.
They do have the implementation, yes, using Python.
0:45 Markov Processes
13:00 Markov Reward Processes
43:14 Markov Decision Processes
1:40:18 Extensions to MDPs
Wow
Thanks
Thanks a lot!
Arriving at Q* : 1:21:50
Thank you!
Can we appreciate the audience too? So interactive, and they asked all the questions that made our foundations even better. Kudos to David for patiently answering everyone with equal dedication.
Best RL course 5 years ago, and it still is to me now. Thank you.
Fantastic. The lectures are so consistent, formal, and organized - mathematically as well as pedagogically. Optimal policy for teaching such a course!
One of the best lectures here at UCL CS :)
Indeed. One of the best lectures I have seen on a difficult topic. And his attitude is just... rare and priceless, actually. I feel the utmost respect for his pedagogical methods.
I see what you did there. nice
I totally agree. Mr David Silver is a great lecturer
This lecture series is a gem in reinforcement learning. Best material for RL as yet!!
I've heard of this course as well. Do you have a resource that provides the correct way to download the packages required for starting the course? Thanks
I would expect nothing less from a guy at the forefront of AlphaZero
I have watched lectures from MIT and Stanford and David blows them away.
This guy is an impressively good lecturer. Easily comparable to, if not better than, the best profs I’ve had so far in 7 semesters of university.
"The future is independent of the past given the future" is crazy deep when you think about it. How many people lament over past transgressions and stress over "how it should have gone down" when in reality obtaining your goals is much more along the lines of "what can I do right now in this moment"
He has an amazing ability to explain complexity in such a clear way.
WARNING: take headphones off 42:25 -> 42:35 and 55:55 -> 56:05
I wish I read this comment before that
Me too.
I guess many find this comment after hearing the noise
Hahahah, I read this comment and went searching for what it was. True - take off your headphones or be prepared for it.
Thanks
0:45 Markov Processes {S, P}. used to describe reinforcement learning environment.
3:26 Markov Property: the current state contains all the information of history.
4:35 State transition matrix: transition probability from s to s'.
7:27 Example: student decision chain.
13:00 Markov Reward Processes {S, P, R, γ}
29:10 Bellman Equation for MRPs: v(s) = R_s + γ * Σ_{s'} P_{ss'} * v(s')
39:01 Bellman Equation in Matrix Form.
43:14 Markov Decision Processes {S, A, P, R, γ}
46:15 Policies: for taking actions. π(a|s)
50:51 Value Function.
state-value function
action-value function
53:37 Bellman Expectation Equation, following a policy.
Bellman expectation equation for Vπ(s)
Bellman expectation equation for Qπ(s,a)
1:02:52 Bellman Expectation Equation in matrix form
1:04:01 Q&A
1:08:06 Optimal Value Function. q*
1:15:41 Optimal policy. π ≥ π' if v_π(s) ≥ v_π'(s), ∀s (for all states)
Theorem: for any MDP, there's an optimal policy that's ≥ all other policies.
Theorem: all optimal policies lead to the same optimal value function.
1:19:54 find the optimal policy
1:19:55 Bellman optimality equation.
v*(s) = max_a q*(s,a)
q*(s,a) = R_s^a + γ * Σ_{s'} P_{ss'}^a * v*(s')
1:29:39 how to solve bellman optimality equation? it's non-linear. use iterative solution.
value iteration (next lecture)
policy iteration (next lecture)
q-learning (future lecture)
sarsa
1:31:49 Q&A
1:40:18 Extensions to MDPs
continuous & infinite MDP
partially observable MDP
undiscounted, average reward MDP
Thanks. very helpful
This intro to MDP is seriously high-quality. Informative yet understandable.
I must say this is by far the best course, whether you want to learn deep or shallow RL. It's so intuitive and easy to understand the basics of RL concepts. The current CS285 course has great material, but for me it's hard to understand. I repeatedly come back here to clear up concepts. David Silver is a great teacher.
Thanks, Petyr Baelish. Great lecture series.
He lacks the moustache.
I'm impressed by the size of his head compared to his body.
RL is a ladder.
Nah looks more like Alex Honnold
Despite its low resolution and unclear sound, this is the best reinforcement learning lecture.
Thanks for super easy, detailed explanation!
Prof. Silver is a brilliant teacher. Really enjoying this
I looked everywhere for a comprehensible explanation of RL. This is easy to understand and awesome. Thank you!
If you're confused like I was at slide 32 (53:27), when he goes over the state-value function for the student MDP, the calculation for each node, using the one-step look-ahead from the Bellman expectation equation, is: v_pi(s) = E_pi[R_{t+1} + gamma*v_pi(S_{t+1}) | S_t = s].
For instance, for the case of the -1.3 node, we calculate, given that the policy assigns 0.5 probabilities for all actions:
0.5*(-2 + 2.7) + 0.5*(-1 + -2.3) ~= -1.3
Where the first term of the sum is the contribution from the Study node, and the second term is the contribution from the Facebook node
Could you be more specific??? I don't know how to get -1.3
It is 0.5*[-2+2.7*(1)] + 0.5*[-1-2.3*(1)] = -1.3
Thanks very much.
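A quick sanity check of the backup in this thread (the successor values 2.7 and -2.3 are read off the slide):

```python
# One-step Bellman expectation backup for the -1.3 node (Class 1) under the
# uniform 0.5/0.5 policy, gamma = 1; successor values 2.7 (Class 2) and -2.3 (Facebook).
gamma = 1.0
v_class1 = 0.5 * (-2 + gamma * 2.7) + 0.5 * (-1 + gamma * (-2.3))
print(round(v_class1, 2))  # -1.3
```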
I don't understand the difference between the state-value function and the action-value function. For the state value, the value of being in some state depends on the immediate reward in the next state. But we need an "action" to get to the next state, so isn't it essentially the action-value function?
The part from 54:30 to 1:01:00 was a lot to take in. But it's actually not that much once you get a grip on it.
First there is the *Bellman Equation* and the *Bellman Expectation Equation*
The difference is that the *Bellman Expectation Equation* also incorporates actions.
*Bellman Expectation Equation* = *Bellman Equation* + actions.
After that we basically just have 4 variations of this:
1. *Vπ* in terms of *Qπ*
2. *Qπ* in terms of *Vπ*
3. *Vπ* in terms of *itself*
4. *Qπ* in terms of *itself*
I still don't understand the calculations that give us the values for the student MDP at 53:30. It was said earlier that in an MDP there's not just one expectation value anymore, but here every node has one value?
e.g. how do we get the +2.7 value in the second class?
@@fahmikhoirurrijal9427 if you understand it now, can you please explain?
@shresthapatir6495 Idk if this a little too late but here is how you get it:
The value of the state = expected value of all actions that can be taken from it
The value of an action = reward + expected value of all states it leads to
There are two actions from that state: Sleep or Study.
Let's talk about Sleep. This action always leads to the terminal state, so the expected value of the states it leads to is 0. The reward is also 0, as stated in the slides.
Value of Sleep = reward + expected value of all states it leads to = 0 + 0 = 0
The same can be done for Study. This action always leads to the next state, which has a value of 7.4, so the expected value of the states it leads to is 7.4. The stated reward is -2.
Value of Study = reward + expected value of all states it leads to = -2 + 7.4 = 5.4
Now we have values of all actions for this state.
The value of the state is the expected value over all its actions. In this case, the policy is a 50/50 choice of actions, so each action has a 0.5 probability.
Value of state = expected value of all actions = (0.5 * value of Sleep) + (0.5 * value of Study) = (0.5 * 0) + (0.5 * 5.4) = 2.7
So the value of this state is 2.7
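If you want to check every number on that slide at once, here is a minimal iterative policy-evaluation sketch for the student MDP under the uniform random policy with gamma = 1. The transitions and rewards are transcribed from the slide, so treat them as an assumption and double-check them against the video.

```python
# Student MDP under the uniform random policy, gamma = 1 (transcribed from the slide).
# Each action maps to (reward, {successor state: probability}); "End" is the terminal state.
mdp = {
    "Class1": {"Study": (-2, {"Class2": 1.0}), "Facebook": (-1, {"FB": 1.0})},
    "Class2": {"Study": (-2, {"Class3": 1.0}), "Sleep": (0, {"End": 1.0})},
    "Class3": {"Study": (10, {"End": 1.0}),
               "Pub": (1, {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4})},
    "FB": {"Facebook": (-1, {"FB": 1.0}), "Quit": (0, {"Class1": 1.0})},
    "End": {},
}
gamma = 1.0
v = {s: 0.0 for s in mdp}

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup.
for _ in range(500):
    new_v = {}
    for s, actions in mdp.items():
        if not actions:          # terminal state keeps value 0
            new_v[s] = 0.0
            continue
        pi = 1.0 / len(actions)  # uniform random policy
        new_v[s] = sum(pi * (r + gamma * sum(p * v[s2] for s2, p in succ.items()))
                       for r, succ in actions.values())
    v = new_v

print({s: round(val, 1) for s, val in v.items()})
# roughly: Class1 -1.3, Class2 2.7, Class3 7.4, FB -2.3, End 0.0
```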
I cannot resist commenting. These presentations are much easier to learn from; I have been scratching my head for a week. Fortunately I came across your video presentations. Awesome, Dave Gold*.
My question is related to when the reward for a state-action pair is computed in a Markov Decision Process.
When showing the Student MDP on page 25 (min 45:00), the reward for (study, pub) is given immediately after the action is taken and exiting the state. This is compatible with the way we indicate the reward as R^a_s, meaning that it is defined by the action you've taken and the state you're exiting.
However, later on in the video (min 56:30) and on the slides (page 32) it looks like the reward is given when the environment pushes you to the next state, once the action has been chosen. In formulas, it seems like r depends on the action and the next state, so R^a_{s'}
To me, it makes more sense that the agent receives the reward as a function of the action and the state it'll end up in, rather than the state it is leaving. Consider a game where one can decide to attack or run: if I decide to attack and the environment decides that I should be defeated, I would like to get a negative reward immediately and not later.
So, which one is the right way of thinking? Do the equations and the plot need fixing or am I misunderstanding something?
Hi,
The initial example is an example of a Markov Reward Process. So, the reward function only depends on the state (you can check the slide for the definition of an MRP). Therefore, you get a reward at time t+1 when you exit the state at time t, no matter what the action is. The later example is a Markov Decision Process. In an MDP, your reward function definition also considers which action you take. So you take some action and exit the state at time t ("study" or "pub") and get a reward at time t+1 based on your action and where you'll end up.
@@MsCalcioo You seem to be an expert on MDPs. I have a question: "Pub" was a state in the MRP example, but in the MDP it is no longer a state. Why? This confuses me. Moreover, rewards in MDPs are shown on the arc from (s,a) to s', but in the MDP example this convention is not followed.
@@ercanatam6344 The states in the student MDP are literally the same as the states in the MRP, David just did not write the names in the circles/squares to avoid saturating the notation. What is confusing you is that he is using the words "pub" or "study" in order to describe actions as well. He means "go to pub" or "go to study". If you carry out the action "go to study" or "go to pub", you end up in the states "study"/"pub". Sometimes "go" could also mean "keep in", as in "keep in the pub", "keep studying", but I hope you get what I mean.
@@edmonddantes4705 I still don't understand the calculations that give us the values for the student MDP at 53:30. It was said earlier that in an MDP there's not just one expectation value anymore, but here every node has one value?
e.g. how do we get the +2.7 value in the second class?
@@fahmikhoirurrijal9427 For any MDP, given a fixed policy pi, every state has a value given by the state-value function. The state-value function is literally defined in the previous slide and the chosen policy pi is uniform, so I am not sure what you are missing. Anyway, he applies the formula in the previous slide (you can do Monte Carlo sampling for an approximation or apply the Bellman expectation equation). PS: By the way, any MDP with a fixed policy is an MRP.
David is an amazing teacher. Because he loves the subject. Now I love it, too. Thanks David. Thanks Deepmind.
Best introductory RL course on RUclips!
Absolutely fantastic !! Thank you Dr. Silver !
Way better than the 2021 lecture with a similar title. Never found such a detailed explanation of MDPs. Thanks!
Just to make it clear about the question asked at about 23:00:
I actually think that the sequence is finite not by an arbitrary definition but due to the fact that every loop in that scenario will eventually terminate with probability 1.
For instance, the "stay in Facebook" loop has a probability of 0.9^n of persisting for n stages, which tends to zero as n tends to infinity; thus the probability of (eventually) leaving the loop tends to 1.
Yes, he kinda dismisses the question out of hand, but the answer is more mathematically subtle than he makes it out to be.
I searched for a commentary like that, because I got confused from the answer of this question in the video. Thanks. :)
I think this can be seen from a short analysis of the transition matrix, e.g., the eigenvalues, which all have modulus less than one except for the one associated with the absorbing state "Sleep", which equals one. This means that P^n, as n goes to infinity, converges to a matrix with only zeroes except for the "to Sleep" column, whose entries are all 1.
Yes, episodic Markov processes terminate almost surely :). It is subtle. Also MDPs with a.s. absolutely bounded immediate rewards have finite discounted return (gamma < 1). But he is not very rigorous, he just wants to convey the main ideas.
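A minimal numeric check of the P^n observation above (the full student-MRP transition matrix, including the absorbing Sleep state, is transcribed from the slide, so double-check it against the video):

```python
import numpy as np

# Student MRP transition matrix (rows/cols: C1, C2, C3, Pass, Pub, FB, Sleep),
# transcribed from the slide; Sleep is absorbing.
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep
])

print(np.round(np.linalg.matrix_power(P, 500), 3))
# Every row ends up as [0, 0, 0, 0, 0, 0, 1]: from any start state,
# the chain is absorbed in Sleep with probability 1.
```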
Thank you for the question at 1:04:30 about states turning into actions / decisions, I appreciate the clarification !
Thank you for such a great set of lectures on Sutton and Barto's book. They are so apt; thank you, David, for all of them.
00:04 Introduction to Markov Decision Process
01:45 Markov Decision Processes are versatile and can be applied to various reinforcement learning problems
05:28 Markov Decision Process defines the transitions between states.
07:23 Illustration of transition probabilities in a Markov Process
11:11 Markov Decision Process dynamics and handling non-stationary probabilities
12:58 Introducing rewards and the concept of a Markov reward process
16:42 Discount factor determines present value of future rewards
18:21 Using a discount factor to account for uncertainty in the future.
21:40 Understanding Markov decision processes and their termination properties
23:34 Value of a state in a Markov Decision Process is the expected return starting from that state
27:15 Understanding different discount factors in Markov Decision Process
28:53 Bellman equation explains value function decomposition
32:19 Bellman equation defines value function
34:09 Value function must obey a specific identity to be correct
38:17 Understanding the Bellman equation using matrices and vectors.
40:04 Markov Decision Process equation simplifies value calculation
43:35 Introducing actions to Markov Decision Process
45:17 Understanding Markov Decision Process and policies.
48:42 Maximize future rewards, not current ones
50:26 Dynamics and rewards in Markov Decision Processes
53:52 Decomposition of value function in MDP
55:38 Understanding Markov Decision Process
58:59 Explaining how value function relates to itself recursively
1:00:46 Value function represents long-term rewards
1:04:17 Transition from Markov reward process to Markov decision process
1:05:55 MDP definition includes state-dependent rewards
1:09:33 q* helps determine optimal behavior in MDPs.
1:11:14 Understanding the optimal value in Markov Decision Process
1:14:34 Understanding optimal policy in MDP
1:16:29 Partial ordering of policies based on value function
1:19:57 Deterministic optimal policy ensures maximum reward
1:21:50 Understanding the Bellman optimality equation for Markov Decision Processes
1:25:04 Importance of understanding states and values in environments
1:26:44 Explaining the recursive relationship between q* values
1:30:34 Dynamic programming methods and Q learning for iterative solution
1:32:38 Handling Imperfect Models in MDP
1:36:11 The Bellman equation's intuition in real-world examples
1:37:58 Optimal dynamics can be described by breaking down trajectory into optimal decisions at each step.
1:41:20 Markov Decision Process simplifies decision-making with limited information.
At approx 22 mins somebody asks about an infinite sequence. David says this is impossible, but even if it were possible, it's interesting to observe that the sequence of rewards would sum to a finite value because it's actually just a geometric series. This was discussed in the Udacity course a bit. Loving these lectures btw. David is magnificent.
Also, since there is a sink, the probability goes to zero of generating a sequence of length n as n goes to infinity.
I think he was talking about the case where there's no reduction factor across time.
To be more precise, the discounted return converges almost surely as long as the rewards are uniformly bounded (almost surely) and gamma is strictly less than one.
@@monktastic1 That statement requires some mathematical formalisation, since one could have two different states pointing at each other with probability one, and hence arriving to any of those states would give rise to an infinite alternating sequence. However, those states could be formalised as one state and then you could consider the class of those two states to be a fixed point. There is some truth in your statement, but it requires more formalisation, it is hand-wavy.
Very cool, having this available free of charge. Thank you so much 😊🙌
Great lecture. I really liked the approach, building from MP to MRP and finally MDP. It was well connected, like listening to a story. Thank you very much!
1:41:24 did anyone else feel like an agent in a partially observable environment when the camera shut off while he was talking about partially observable environments?
@1:37:00 Here's an easier example to understand the idea behind the optimality conditions.
Let's say you want to find the shortest path from A to C and someone gives you a supposedly optimal solution which passes through B, i.e. the (supposedly) optimal path is A->B->C. If the path A->B is not the shortest path from A to B, then the total solution can't be optimal because we can find a better path A~>B and then form A~>B->C which is shorter than A->B->C.
Now let's say someone gives you a supposedly optimal policy pi. This is more tricky than the shortest path example because of recursion. You notice that pi doesn't always choose the optimal action at every given time. In other words, you compute q_pi for the policy pi and you notice that
q_pi(s,a) = R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a') <
R(a|s) + gamma sum_{s'} P(s'|s,a) max_{a'} q_pi(s',a')
for some pair (s,a).
Let pi' be the improved policy which always chooses the best action according to q_pi. We get
q_pi(s,a) = R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a') <
R(a|s) + gamma sum_{s'} P(s'|s,a) max_{a'} q_pi(s',a') =
R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi'(a'|s') q_pi(s',a')
Note that we have pi' and q_pi (not q_pi'!) in the same expression. We would like to have q_pi' instead of q_pi.
We can update q_pi with the new estimations and then repeat the maximization process. Note that at each maximization step we increase our estimation of q_pi until we converge to q_pi'. Since we're always increasing q_pi, we must have
q_pi < q_pi'
which means that pi' is strictly better than pi. I hope this makes sense. I've never read the official proof and this is just the sketch of one direction of the proof (i.e. optimality => optimal condition).
In conclusion, whenever q_pi doesn't satisfy the Bellman optimality equation (BOE), we can increase q_pi until, estimation after estimation, we converge to a q_pi' which satisfies the BOE and such that pi' > pi.
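To make the improvement step above concrete, here is a minimal sketch of the greedy step pi'(s) = argmax_a q_pi(s,a), with a made-up q_pi table (the states, actions and numbers are purely illustrative):

```python
# Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a).
# q_pi is a hypothetical table of action values under some policy pi.
q_pi = {
    "s1": {"left": 0.2, "right": 1.5},
    "s2": {"left": -0.3, "right": -1.0},
}

def greedy_policy(q):
    """Return the deterministic policy that acts greedily with respect to q."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

print(greedy_policy(q_pi))  # {'s1': 'right', 's2': 'left'}
```

Policy iteration (next lecture) alternates this greedy step with re-evaluating q for the new policy, until the Bellman optimality equation holds.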
best teacher on RL, there is no contest
Thank you very much, professor David, for these great lectures!
Exceptional lecture series so far. Very clear useful examples and explainations. Thank you very much Dr. Silver.
Fantastic Lectures. I use them along with the Sutton/Barto text book. Excellent stuff
What an amazing lecture! I had so much fun watching it and it helped clarify some things in my head about MDP.
Best course material to start reinforcement learning with. However, the math seems difficult, at least for me.
At 1:11:47, isn't optimal action-value function q*("going to pub") =1 + (0.2*6 + 0.4*8 + 0.4*10) = 9.4 instead of 8.4?
I got the same result
Yep, I got 9.4 too.
same result here
exactly
me too
Around minute 51:00, the Bellman equation for the action-value function shouldn't have the policy subscript on the expectation, as the current action is already given, so the averaging goes only over the "environment dice".
Your comment indicates lack of understanding. Of course it should have the subscript pi. G_t is the discounted return, so you need to take many more actions in the future states in order to calculate that expectation, even if your immediate action a is fixed.
Well done! At 6:05, I guess the lower matrix value should be Pn1
Same error at 39:30
Yes, it should. And he says this during the talk, so no doubt about it.
@@astaconteazamaiputin It's corrected in the slides at: www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
@@luisjalabert8366 Yes, that is an error.
19:07 The guy taking a call walking across the lecture room missing out the great content, -10 immediate reward there.
Thanks... the way he explains math is simply amazing.
I just want to point out that the recursive formulas are very easy to understand and derive if one disentangles them. For instance, let's say *s* is the current state, *a* an action and *s'* the new state after the action, i.e. s->a->s'. Then
v_pi(s) = sum_{a,s'} P(a,s'|s) [R(a|s) + gamma v_pi(s')]
where P(a,s'|s) is the probability of taking action *a* and ending up in state *s'* given that we are in state *s*.
Now we note that P(a,s'|s) = P(s'|s,a)P(a|s) = P(s'|s,a)pi(a|s). Also, sum_{a,s'} (...) is sum_a sum_{s'} (...), so, we get
v_pi(s) = sum_{a,s'} P(a,s'|s) [R(a|s) + gamma v_pi(s')]
= sum_a sum_{s'} P(s'|s,a)pi(a|s) [R(a|s) + gamma v_pi(s')]
= sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(a|s) + gamma v_pi(s')]
= sum_a pi(a|s) [ sum_{s'} P(s'|s,a) R(a|s) + sum_{s'} P(s'|s,a) gamma v_pi(s') ]
= sum_a pi(a|s) [ R(a|s) + gamma sum_{s'} P(s'|s,a) v_pi(s') ]
which is the formula @57:38.
Same thing with q_pi(s,a). We start from state *s* and do action *a*. What happens is that we end up in state *s'* and then do another action *a'*. In other words, the sequence is s->a->s'->a'. Since we know *s* and *a*, but *s'* and *a'* are random, we compute the mean of the total reward over *s'* and *a'*:
q_pi(s,a) = sum_{s',a'} P(s',a'|s,a) [R(a|s) + gamma q_pi(s',a')]
That's it!
Now note that
P(s',a'|s,a) = P(a'|s,a,s')P(s'|s,a) = P(a'|s')P(s'|s,a) = pi(a'|s')P(s'|s,a)
where we used the fact that *a'*, given *s'*, is independent of *s* and *a*.
Therefore, similarly to before, we get:
q_pi(s,a) = sum_{s',a'} P(s',a'|s,a) [R(a|s) + gamma q_pi(s',a')]
= [sum_{s',a'} P(s',a'|s,a) R(a|s) ] + [sum_{s',a'} P(s',a'|s,a) gamma q_pi(s',a')]
= R(a|s) + gamma sum_{s',a'} P(s',a'|s,a) q_pi(s',a')
= R(a|s) + gamma sum_{s'} sum_{a'} pi(a'|s')P(s'|s,a) q_pi(s',a')
= R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a')
which is the formula @59:00.
This is little more than basic probability.
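A minimal numeric sketch of those two formulas (the pair @57:38 and @59:00 follow from each other by substitution, so iterating them jointly evaluates a policy). The 2-state, 2-action MDP below is made up, purely as a placeholder:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (placeholder numbers, not the student example).
# P[a, s, s'] = P(s'|s,a), R[s, a] = R(a|s), pi[s, a] = pi(a|s).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[ 0.5, 1.0],
              [-1.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
gamma = 0.9

v = np.zeros(2)                      # v_pi(s)
for _ in range(1000):
    q = R + gamma * (P @ v).T        # q_pi(s,a) = R(a|s) + gamma sum_s' P(s'|s,a) v_pi(s')
    v = (pi * q).sum(axis=1)         # v_pi(s)   = sum_a pi(a|s) q_pi(s,a)

print("v_pi:", v)
print("q_pi:", q)
```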
One thing I'd like to discuss is whether the discount factor should be the same for both positive and negative rewards. Losses loom larger than gains, and it is fair to imagine that a risk-averse agent would not discount a negative reward.
Note that, in general, for the fixed-point convergence theorems (see lectures 4 and 5) to apply mathematically (and hence to be usable), negative rewards need to be discounted. How much discount you apply is up to you, but I am guessing you can run simulations with different discount factors and see what works better, use something like TD(lambda), where returns over different horizons are combined, etc.
On a different note, I have the impression that from a mathematical point of view, using negative or positive rewards is kind of equivalent. Indeed, let the immediate rewards be uniformly bounded from below by a negative constant that we will call -C. Then by adding C to all the rewards, you construct an MDP that only has positive rewards, and the optimal policies of the original process (the one with negative rewards) and the modified one are clearly the same; they are shared. Now you could ask: what is better for convergence? And that seems like a nontrivial question that depends on the problem, algorithm, spaces, etc.
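A small numeric check of the reward-shifting idea, under the assumption of a continuing (never-terminating) Markov reward process: adding C to every reward shifts every state value by exactly C/(1-gamma), so the ranking of states is unchanged (for episodic tasks the picture is subtler, since longer episodes collect the bonus more often). The 2-state MRP below is made up:

```python
import numpy as np

# Hypothetical continuing MRP (placeholder numbers): P[s, s'], rewards R[s], some negative.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([-2.0, 1.0])
gamma, C = 0.9, 2.0                              # shift so all rewards become non-negative

solve = lambda r: np.linalg.solve(np.eye(2) - gamma * P, r)
v_orig = solve(R)
v_shift = solve(R + C)

print(v_shift - v_orig)              # every entry equals C / (1 - gamma) = 20
print(C / (1 - gamma))
```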
There is a typo @1:22:08?? q* = 8.4? I think it should be 9.4 = 1 + 1*(0.4*10 + 0.4*8 + 0.2*6), based on the equation for q*(s,a) @1:26:55. Can someone verify this with me?
That's exactly what I found weird. I think it has to be 9.4.
@@DeathStocker It is. I am right; I checked other comments, and they said the same thing.
@@43SunSon thanks for confirming!
Yup, I just got stuck there too. Should be 9.4
I think there is a problem with the State transition matrix. First entry in the last row has a wrong subscript index. It should be {n,1} instead of {1,1}.
A student in the class mentions this
I agree
If it were {n,1}, it would not make sense, as it would be calculating the probability of going to state 1 given state n. Right?
It does make sense. If we are in state n, we can go to state 1. Unless state n is "absorbing", in this case we cannot go anywhere after reaching state n but it would still appear in the transition matrix with values 0.
I think you might be confusing states and steps. Assume you can go from state n to state 1, but not from step n back (in time) to step 1.
Why is there a period of time muted? Is this a copyright concern?
that's what i was wondering (around 35:00).
cannot reveal the top secret information
5:23: State Transition Probability Matrix: Last entry of Row n, Column 1: to be corrected as Pn1
At 11:37, David mentions that non-stationary Markov Processes can be manipulated to use the formalism presented for stationary case, by augmenting Markov states with a potentially infinite number of states. However, the definition of Markov Process at 7:19 notes that S is a finite set of states. I'd appreciate any clarifying insights!
That's an amazing question! The answer is that the new state space is indeed infinite. Assume S is a finite set of states. Let X_k be a Markov chain on S that need NOT be time-homogeneous. Define a new state space Omega = SxSx...xSxSx... (i.e. S to the power of the natural numbers, mathematically S^{\mathbb{N}}, which is INFINITE). In other words, the elements that belong to Omega are infinite tuples of the form (s_1,s_2,s_3,...,s_n,s_{n+1},...) (tuples with infinitely many ordered states).

Assume n and m are two natural numbers such that n is greater than m (without loss of generality). Define a transition probability P* on Omega via the formula P*({s_1}x{s_2}x{s_3}x...x{s_n}xSxSx...xSx... | {s_1}x{s_2}x{s_3}x...x{s_m}xSxSx...xSx...) = P(X_{m+1}=s_{m+1},...,X_{n}=s_n | X_m=s_m), where P represents the probability function of the original Markov chain on S. Note that {s_1}x{s_2}x{s_3}x...x{s_n}xSxSx...xSx... is simply the set of all tuples starting with the states s_1,s_2,...,s_n! The new transition probability P* defines a new Markov chain which is time-homogeneous.

Omega has the cardinality of the real numbers, but that is not a problem, since MDPs are often defined for infinite sets S. Actually, most papers on RL consider the case where S is continuous, but David treats the discrete case in this lecture for the sake of simplicity. I am not being very technical, since I do not know your background in math and I also don't have much time, but this can be properly formalised. If S is not finite, this can equally be done. You just need to be a bit careful with the construction of the new sigma-algebra, and that's it.
PS: In case you are not familiar with the general notation, x represents the Cartesian product, {s} represents the set containing the state s, and the symbol | means "conditioned on". I am not being precise in certain things, but I hope this helps.
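A complementary (and simpler) trick is to fold the time index itself into the state: a non-homogeneous chain X_t on S becomes a homogeneous chain on the augmented (infinite) space S x N. A hedged sketch with a made-up time-dependent transition rule:

```python
import random

# Hypothetical non-homogeneous chain on S = {0, 1}: the transition rule changes with time t.
def P_t(t, s):
    """Next-state distribution over {0, 1} at time t from state s (made-up numbers)."""
    p_stay = 0.9 if t % 2 == 0 else 0.1      # time-dependent!
    return [p_stay, 1 - p_stay] if s == 0 else [1 - p_stay, p_stay]

# Augmented chain on S x N: the state is (s, t); its transition rule never changes,
# so the augmented chain is time-homogeneous (at the price of an infinite state space).
def step(aug_state):
    s, t = aug_state
    probs = P_t(t, s)
    s_next = random.choices([0, 1], weights=probs)[0]
    return (s_next, t + 1)

state = (0, 0)
for _ in range(5):
    state = step(state)
    print(state)
```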
The transition probability of this course decays exponentially, from 1 million views (the first lesson) to 50 thousand after 10 lessons. Only 5% have made it out of Class 3 )))
37:18 How do we calculate the value of states that have circular transitions? i.e. the value of class 3 (v(s)=4.3) depends on the value of pub (v(s)=0.8). How do we arrive at 0.8 for the pub then if the value of the pub depends on the value of class 3?
I guess this is covered in the Dynamic Programming material of the next lecture. But the circularity isn't really a problem here: the Bellman equation for the MRP gives one linear equation per state, so you solve them all simultaneously, either in closed form via the matrix form at 39:01, or iteratively by dynamic programming when the state space is too large for matrix inversion. A small sketch below solves the student MRP this way.
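For concreteness, here is a hedged sketch that solves the student MRP values in one shot by treating v = R + gamma*P*v as a linear system (gamma = 1 as in that example, with the terminal Sleep state dropped since v(Sleep) = 0; the transition probabilities and rewards are my transcription of the slide, so treat them as assumptions):

```python
import numpy as np

# Student MRP from the lecture (gamma = 1), with the terminal Sleep state dropped
# since v(Sleep) = 0. Numbers transcribed from the slide -- treat them as assumptions.
# State order: [C1, C2, C3, Pass, Pub, FB]
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5],   # C1   -> C2 / Facebook
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0],   # C2   -> C3 (0.2 to Sleep drops out)
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0],   # C3   -> Pass / Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # Pass -> Sleep
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0],   # Pub  -> C1 / C2 / C3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],   # FB   -> C1 / Facebook
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0])
gamma = 1.0

# The "circular" definitions are just one linear system: v = R + gamma * P v
v = np.linalg.solve(np.eye(6) - gamma * P, R)
for name, val in zip(["C1", "C2", "C3", "Pass", "Pub", "FB"], v):
    print(f"{name}: {val:.1f}")       # C3 = 4.3 and Pub = 0.8, as on the slide
```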
@57:17 "[State-Value Function] V is telling you how good it is to be in a particular state; [Action-Value Function] Q is telling you how good it is to take a particular action from a given state."
What's the convention he's mentioning at 31:48, of the reward indexing?
"Sutton & Barto convention" and he's referring to this textbook incompleteideas.net/book/the-book-2nd.html
Those poor students who had to take all of this in in one sitting.
Not that much at the university level, actually. Very standard and presented in an easy-to-follow manner. Also, I assume they had to have some prerequisites in math, probability and graph theory to get into this class. MDPs and the Bellman equation are needed for solving optimization problems for stochastic systems.
@@17teacmrocks I love how at the end he opens up for questions and states that he expects that not everyone understood. I'm only following because I read the topics before watching; that isn't a light read at all.
Just watched the video over 5-6 hours lol, took some notes along the way and talked about the difficult parts of the subject with my mate. Definitely a lot of stuff covered, and I'm happy to be able to watch it at my own pace; it would be a little overwhelming to watch and understand in 2 hours.
The explanation of the Bellman Optimality Equation for V* was really enjoyable to watch. Thank you.
WARNING: TAKE HEADPHONES OFF 55:50 - 56:05 AND ALSO FROM 42:25 -> 42:35 VERY LOUD NOISE
shockingly clear explanations!
at 1:07:40, when he says stochastic transitions, the transitions are by two separate agencies.
So the agent takes an action from a state.
The environment, depending on the action, puts you in a new state.
Correct?
Yup, somewhat like that. So depending on your action, there is an independent randomness operating which helps decide the state you end up in. A deterministic transition would be where your action solely decides which state you end up in. Stochastic transitions add randomness to the transition from one state to the next.
I have a question around 57:00. After an agent takes an action, should the next state not be specified? If the next state is specified, how can I take the expectation over different states?
The environment decides which state you end up in; sometimes there is only one option, sometimes there are multiple options, like the pub action.
at 5:09 last element of matrix is Pn1 not P11. refer to latest slides for correction. www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Why would 59 people dislike this free gold?!
1:12:20 Shouldn't the q* value for the Pub be 9.4? It's 1 + (.2*6 + .4*8 + .4*10) because of the R+1 of going to the pub, right?
The general consensus is that you're right, it should be 9.4 which is what I got.
@@Zantorc same here! I think it should be 9.4, not 8.4
@1:02:37
Lecturer: "Is that Clear?"
Me: "No it's not, there's a problem and YOU know what it is, and you know that I know that you wish nobody notices it 😂"
No matter what, these are the best lectures on RL; I really enjoy these underlying details.
Thanks David Silver
Hahaha, it's clear right on the next slide,
my bad!
Am I wrong, or should q* of a=Pub at 1:13:10 be 9.4?? In the example of (s=Class2, a=Study), the optimal action-value function is calculated as q* = R + sum(P_state(i)*v*(state(i))) = -2 + 1.0*10 = 8
should be 9.4
There is a typo in the transition matrix subscript at 40:12; it should have been Pn1
Excellent lecture series. Thanks, David.
Which book is David referring to at 1:04:31??
Neuro-Dynamic Programming by Dimitri Bertsekas:
www.amazon.com/Neuro-Dynamic-Programming-Optimization-Neural-Computation/dp/1886529108
31:05 - That's the linearity of expectations. Iterated expectations is nesting by conditioning.
With respect to which sigma-algebra is G_{t+1} conditioned? I am missing something in the last equality
I think he was a bit confusing here. Isn't he using the fact that expectation is a linear operator (E[x+y] = E[x] + E[y]) and then the fact that E[E[x]] = E[x], since E[x] is a constant and the expectation of a constant is the constant itself?
@@anselmoufc my understanding also
@@estherderman8472 He is just using E[X+Y] = E[X + E[Y]]. In this case, X := R_{t+1} | S_t=s, Y := G_{t+1} | S_t=s, which are random variables. You are confusing it with E[X|Y], which is a random variable and not a number since it is defined as E[X|sigma(Y)], where sigma(Y) is the sigma-algebra generated by Y. Anyway, what he is doing is trivial.
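For anyone who wants that step written out, here is one way to see it (a sketch using linearity, the tower property of conditional expectation, and the Markov property, in the lecture's notation):

```latex
\begin{aligned}
v(s) &= \mathbb{E}[G_t \mid S_t = s]
      = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
     &= \mathbb{E}[R_{t+1} \mid S_t = s]
      + \gamma\,\mathbb{E}\big[\,\mathbb{E}[G_{t+1} \mid S_{t+1}]\;\big|\;S_t = s\big]
      \quad \text{(linearity + tower property + Markov)} \\
     &= \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'}\, v(s')
\end{aligned}
```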
You save my day! Thank you so much for sharing this video on youtube!
Why was the audio muted at 35:00 - 35:40? It seems to be an interesting part. Can anyone please explain what Mr. David explained during these 40 seconds?
It's simply a look-ahead diagram to evaluate the value of the current state, v(s). To calculate v(s), we add 2 values: (1) the reward for being in that current state, i.e., R_s, and (2) the discounted probability-weighted sum of the values of all possible future states one step later from s.
4:05 There is one problem with the Markov property: if S is (S1,S2), two correlated random variables, then we can't simply estimate the correlation between S1 and S2 using the last realization of those variables - we need the full history (the longer, the more accurate our estimate), so we simply lose some information in this formulation unless we extend S with additional information (with the whole history or values allowing for...
On the State Transition Matrix slide (6:19), should the matrix be the following?
P11 ..... P1n
..................
Pn1 ..... Pnn
It should. A student in that class asked about that right after he was done explaining it.
Awesome lecture! But at 1:14:00, shouldn't q* for going to the pub be 9.4? q* here seems to exclude the reward of taking the action of going to the pub.
I agree, I think it should be 9.4
Although the lecture series overall has good explanations, it does assume the audience is somewhat knowledgeable in this specific domain. For example, what is meant by "1 sweep"? What is meant by "fix the policy", and what are the criteria for the fix?
For beginners with decent comp-sci knowledge, it's best to look for supplementary materials to pair with the lecture series.
Great lecture! but volume is really low
Try increasing the volume with headphones.
Yep, I had to download it and use VLC to boost the volume.
Yeah, you can use a chrome extension to boost audio if you're on a PC. Suggestion: Volume Master Chrome Extension.
I can't thank you enough for sharing such valuable lectures.
Nah... you can always read the papers. The information is already out there.
Yea
at 13:33 it seems to me that the reward function is a function of both state s and time t. However, the way it is written, it only depends on state s. Then later it is used at 27:09 as a function of time t which is inconsistent with how it was defined earlier.
Markov processes do not depend on time. They depend only on your current state. If you arrive at a certain state at different times, the probability of moving to any other state will not depend on the time you arrived.
@@cparks1000000 Mate, the transition probabilities in a Markov chain can depend on time. He is making a tacit time-homogeneity assumption that simplifies many things, but the transition probabilities of a Markov chain in general CAN DEPEND ON TIME. The MDP formalism is developed for time-homogeneous processes.
He is considering time-homogeneous Markov chains, so in this formalism there is no time dependence in the rewards or transition probabilities. A good thing to ask yourself is why this formalism is so useful, i.e. why is the time-homogeneity assumption made, and how can you make non-homogeneous things homogeneous :)
Can the discount factor be equal to 0 or 1??? I think the brackets are incorrect
Another WARNING: VERY LOUD phone dropping sound! Take headphones off from 55:51 to 56:06
The difference between the state-value function and the action-value function: the state-value function is the expected total return starting from a state s and then following a policy pi until the end of the sequence (i.e., sampling every action according to pi), whereas the action-value function is the expected total return starting from state s, taking a particular action a first (moving to a successor state), and only then following pi for all subsequent actions. Is this understanding correct?
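For reference, the lecture's definitions can be put side by side like this, which may make the comparison clearer:

```latex
\begin{aligned}
v_\pi(s)    &= \mathbb{E}_\pi[G_t \mid S_t = s]
  && \text{(follow } \pi \text{ from state } s\text{)} \\
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
  && \text{(take action } a \text{ first, then follow } \pi\text{)} \\
v_\pi(s)    &= \sum_a \pi(a \mid s)\, q_\pi(s, a)
  && \text{(so } v_\pi \text{ is the } \pi\text{-average of } q_\pi\text{)}
\end{aligned}
```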
At minute 27:25, aren’t the values wrong, given that the discount factor is 0? For example, the value of the state of Class 2 should be -2 times the probability of ending up in Class 3 plus 0 times the probability of ending up in the sleep state, and this sum does not give -2. Is this correct?
This comment is based on the fact that when γ=0, then G_t = R_(t+1), while the values on that slide would suggest G_t = R_t.
Yeah, that value is definitely wrong. Should be -1.6. Nice catch:)
Yes, I think those values are wrong. I went to the comment section just to find another comment validating that. (Edited : Nevermind, they are correct. It's just that the notation is a bit confusing.)
At 54:47 there is an R_{t+1} in both the state-value and the action-value functions. It is said that R_{t+1} is specific to the action taken at the state. How can that be true, since R_{t+1} is given as soon as we leave the state? Wouldn't it simply depend on what state we are leaving?
At 37:18 - why is there no reward for going to the pub, i.e. why is it 4.3 and not 5.3?
At 6:05, the lower matrix value should be Pn1
I have lost many months of my thesis, wish I had these lectures earlier
I think there is a mistake on the slides, around minute 40:00 (Bellman Equation in Matrix Form). The v on the right side of the equation should be v', because it is the value of the state we end up in. As written on the slides, the notation implies it refers to the state we came from.
Awesome lecture by the way!
Not really, both v's are the same. The value function is one and the same function for every state, right?
This means that you didn't follow the lecture. Both v's are the same; that's the power of the time-homogeneity assumption. The values and action-values are time-independent.
In practice, must we program an agent to store the optimal action calculated for each state in that state's data structure (thus adhering to the Markov property), or will the agent recalculate the optimal value at every step, every time?
Can someone enable the automatic CC?
by someone you mean, the guy who uploaded the video?
@@abhi220 lol
At 1:26:36, I think it should be max[R(a,s) + gamma*sum(P(a,s,s')*V(s'))]
Yes, but I'm not sure it's a mistake. Maybe he's using the convention that the max "captures" everything on its right. Who knows... I would use the parentheses.
Yeah, it should be the same like you specified
@@kiuhnmmnhuik2627 It is a mistake! You can see he uses the correct version in other lectures, plus the Bellman Optimality Equation is pretty well-known and the max affects all the terms on the RHS. Compare with any source in the literature. You can even compute it yourself with a bit of careful work.
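For comparison, the usual statement of the Bellman optimality equations, with the max scoping over everything on the right-hand side (which seems to be what the slide intends):

```latex
\begin{aligned}
v_*(s)    &= \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a \, v_*(s') \Big) \\
q_*(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a \, \max_{a'} q_*(s', a')
\end{aligned}
```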
Yeah, well, David really now needs to write a book on this subject material. Enjoying the videos, but we really need a captured analysis. There are few books on RL; Sutton's book is nearly twenty years old (a draft update is due next year), but this lecture series is easier to comprehend than the standard texts. Any book needs to have the algorithms described, to show how the theory can be applied. Well done.
Most of the material is taken from Sutton & Barto, even most of the pictures, examples, etc, so that book you are talking about is already written.
I think there is a mistake at 1:25:47. q*(s,a) should be the q* of a certain s and a certain a (where that action a might lead to an uncertain s'), rather than the average of "left" and "right"
Once you've taken an action, you let the environment decide what state you are going to end up in (it's not deterministic, so given the action you could still end up in many places). Therefore you average the state values of the states you might end up in, weighted by the probability of going there.
Regarding the question at 1:34:30 on eliminating risk,
Is it possible to train two agents and use their combined experience to train a third one? I.e., one that is risk-averse and one that is profit-seeking yields a risk-averse, profit-seeking agent.
One thing to consider is that these attributes are usually in a trade-off: you get more of one at the expense of less of the other.
I have a scenario which I don't understand:
Assume two possible actions a1 and a2 with respective rewards Ra1 = -2 and Ra2 = -10. Now assume that these actions lead from state s to states s1 and s2, respectively, deterministically, meaning no environment randomness. Now v*(s) = max_a (R_s^a + gamma*sum_{s' in S}(P_{ss'}^a * v*(s'))) is the Bellman equation for v*.
Now, the reward for action a1 is slightly higher than the one for action a2. At the same time, the v*-value of state s2 is significantly higher than that of s1.
Now, how do you calculate v*(s)? Do you pick the action that gives the highest immediate reward (a1) and take v*(s1), or do you maximize v*(s) by picking a2 and s2?
v*(s) is the maximum over actions of the expected immediate reward plus the expected (discounted) v*(s') of the next state. So in your case it's the sum of the immediate reward and v*(s') for the action which leads to the highest (immediate reward + v*(s')) (I assumed gamma = 1).
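A tiny numeric sketch of that answer, plugging made-up values for v*(s1) and v*(s2) into the scenario above (gamma = 1, as in the reply):

```python
# Scenario from the question: deterministic transitions s -a1-> s1, s -a2-> s2.
gamma = 1.0
R = {"a1": -2.0, "a2": -10.0}
# Made-up successor values, chosen so v*(s2) is much higher than v*(s1):
v_next = {"a1": 0.0, "a2": 20.0}

# Bellman optimality: maximise (immediate reward + gamma * v* of the state you land in)
q_star = {a: R[a] + gamma * v_next[a] for a in R}
best = max(q_star, key=q_star.get)
print(q_star)      # {'a1': -2.0, 'a2': 10.0}
print(best)        # 'a2' -- the lower immediate reward wins because of the better successor
```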
@1:38:00 when he's talking about the principle of optimality, the repeated optimal behavior will lead you to the optimal solution,......it seems each time we're only looking ahead one step, but each look ahead is recursive I guess. So are we looking forward just one time step or are we looking infinitely forward? The idea of getting stuck in a local optimum is on my mind, I think there's something I'm missing.
But yeah, my question is HOW does looking forward one time step capture the set of other possible future time steps that I might take after moving from state 1 to state 2? I am perplexed by how we can confidently name the value of nodes that are 1 degree away when their value is dependent on their own surrounding nodes.
1:13:18 Can someone please tell me again how to calculate 8.4 for the Pub action? Why do I calculate 9.4 = 1 + 0.2*6 + 0.4*8 + 0.4*10?
Easy answer: you get 8.4 when you forget to also add the part where R = 1 (the immediate reward for going to the pub).