Check out the corresponding blog and other resources for this video at:
deeplizard.com/learn/video/rP4oEpQbDm4
I am glad that this series exposes some theory instead of cheesy metaphors. Well done.
just a few people can understand the value of this series.
thank you
This is just the series I needed to watch. It is so good! Well done.
I must admit that at first, seeing so many stars confused me. Watching the video confused me even more (maybe too many concepts), so I decided to read the post.
What I learned:
1. Reinforcement learning learns optimal policies.
2. What an optimal policy is.
3. What the optimal state-value function is.
4. What the optimal action-value function is.
5. What a Q-value is: calculated by the Bellman optimality equation.
My questions:
1. I think this will become clearer later, but it comes up front, so I should ask or note it. When we bring policies and value functions together: policies are probabilities, and value functions return values. How can we put them together? A sequence makes it even harder (choosing an action changes the state, so the value changes too; I feel like my brain is melting down...).
2. The optimal v_π(s) is required to hold for all states s. Is that possible, or is it good enough if it holds for most s? The same question applies to q*(s,a).
3. Reading the post, I found the description "In other words, q∗ gives the largest expected return achievable by any policy π for each possible state-action pair." This confused me. All we did is find the optimal policy, so why is there "any policy"?
4. The Bellman optimality equation. So many questions:
4.1. Does the Bellman equation calculate the Q-value?
4.2. Do s' and a' mean the next state and action?
4.3. Does it mean the expected return is the return of the action taken at time t in the state at time t, plus the max value of the next step? It seems kind of recursive; my brain is overloading...
I need those questions to be answered too.
It's an excellent video, making something so complex so easy to understand.
What does E mean in the formula? [E]xpectation?
Correct! Expected value.
Perfect, señorita!
Can optimal policy be defined not only in terms of expected return but also variance of returns? That is, lower variance is preferred over higher variance. Not sure if that has been studied.
Hm... In the traditional sense of MDPs, I'm not sure. That's an interesting thought though.
Ahh! So you seem to be suggesting that rewards are not necessarily additive! Very interesting! Your second million dollars is not as valuable as your first!
Great job, very easy to understand.
v_π(s) >= v_π'(s) for all s in S
I expected this instead: sum over all s of v_π(s) >= sum over all s of v_π'(s).
Thanks for your videos :)
I think there's a typo at 3:30 where "E[R_(t+1)...]" should be "E[R_t...]". This is the reward you get for being in the current state (e.g., a robot wakes up in current state S and is given a reward, if any, simply for waking up in that state).
If I understand this correctly, the Bellman equation only considers the current timestep and one more timestep ahead?
Looking at the equation directly, yes. However, indirectly, notice that to calculate q*(s,a), we need q*(s',a') where s' and a' are the _next_ state and action. Therefore, to calculate q*(s',a'), we'll need q*(s'',a''), where s'' and a'' are the _next next_ state and action, and so on.
@@deeplizard Thanks! OK, so it's a recursive function, I guess? It would be hard to stumble upon just one reward state if the state space were huge, I think?
@@deeplizard So is it a recursive function? Or does it learn to predict the q-value without recursion?
@@vishalpoddar Yes, it's recursive. Exactly this is what allows us to update the q-function over and over: calculate the right-hand side with our current approximation of the value function, and then update q according to that. If you iterate that process, it will converge to the optimal function (see the sketch just below this thread).
@@deeplizard Can you elaborate more on the rationale behind always looking one step ahead instead of some other number of steps? I understand that the computational demand goes up by a lot, even if you look only two steps ahead. What happens to the learning performance, all other parameters being equal?
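To make the iteration described in the reply above concrete, here is a minimal sketch of Q-value iteration on a tiny, made-up MDP. Everything in it (the two-state transition table, the reward numbers, the variable names) is an illustrative assumption, not something from the video; the point is only that the right-hand side of the Bellman optimality equation is evaluated with the current estimate of q, the result becomes the new estimate, and this repeats until the values stop changing.

import numpy as np

gamma = 0.9  # discount factor

# transitions[s][a] = list of (probability, next_state, reward) -- made-up dynamics
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

n_states, n_actions = 2, 2
q = np.zeros((n_states, n_actions))  # current approximation of q*

for sweep in range(1000):
    q_new = np.zeros_like(q)
    for s in range(n_states):
        for a in range(n_actions):
            # Bellman optimality right-hand side, using the current q:
            # expected reward plus the discounted max over next actions a'.
            q_new[s, a] = sum(
                p * (r + gamma * q[s_next].max())
                for p, s_next, r in transitions[s][a]
            )
    if np.abs(q_new - q).max() < 1e-8:  # stop once the values settle
        break
    q = q_new

print(q)                  # approximation of q*
print(q.argmax(axis=1))   # greedy (optimal) action in each state

On the one-step-lookahead question: each update only looks one step ahead, but because it bootstraps from the previous estimate of q, information about rewards many steps away propagates backwards one sweep at a time, so the same fixed point q* is reached; expanding two or more steps per update mainly trades extra computation per sweep for faster propagation.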
Excellent content once again
Your videos are short and very informative.
ABSOLUTE GEM :)
Thank you, thank you
Love these videos, keep it up!
I think there is a mistake when the narrator says that s' is the best possible next state. If I understand correctly, the expectation is over the distribution of possible next states s' (since they depend on the randomness of the environment when it is given s and a), and the "max" expression is over the best possible a'.
Why does the blog speak of "the" optimal policy? According to the definition, there might be many, or none.
According to the definition, it is not even clear whether there exists a policy that is strictly better than all other policies.
I think there is a theorem that for Markov decision processes the optimal value function exists and is comparable to all other value functions (and of course is at least as big; that's why we are doing this). This directly implies that two such optimal value functions (let's call them v and v*) have to be equal: for all states s we have v(s) ≤ v*(s) since v* is optimal, but also v*(s) ≤ v(s) since v is optimal, so v(s) = v*(s) for all states s and thus v = v*.
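The argument above can be written out compactly. A minimal formal sketch (notation mine, following the usual ordering of policies by their value functions):

% Policies are ordered by comparing their value functions at every state:
\pi \ge \pi' \quad\Longleftrightarrow\quad v_\pi(s) \ge v_{\pi'}(s) \ \ \text{for all } s \in \mathcal{S}

% If v and v^* are both optimal, each dominates the other at every state s:
v(s) \le v^*(s) \quad\text{and}\quad v^*(s) \le v(s) \quad\Longrightarrow\quad v(s) = v^*(s)

% Hence v = v^*: the optimal value function is unique, even though
% several different policies may all achieve it.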
Could someone help me understand the difference between the optimal policy and the optimal q-function? The former should be the optimal mapping which, given a state, tells me the best action to take in order to get the maximum return; the latter should be the function that, given any state and action, returns the best expected return? I am very confused.
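One way to see the relationship the question above is pointing at: the optimal q-function scores every state-action pair, and an optimal policy can be read off from it by acting greedily. A minimal sketch, assuming we already have a q-table indexed by (state, action); the numbers below are made up for illustration:

import numpy as np

# Made-up optimal q-table: q_star[s, a] = best expected return achievable
# by taking action a in state s and then acting optimally afterwards.
q_star = np.array([
    [1.0, 3.0],   # state 0: action 1 is best
    [2.5, 0.5],   # state 1: action 0 is best
])

# An optimal (deterministic) policy simply picks the highest-valued action:
def greedy_policy(state):
    return int(np.argmax(q_star[state]))

print(greedy_policy(0))      # -> 1
print(greedy_policy(1))      # -> 0

# The optimal state-value function is the best q-value in each state:
print(q_star.max(axis=1))    # -> [3.  2.5]

So the q-function answers "how good is taking this action in this state?", while the policy answers "which action should I take in this state?"; once q* is known, the greedy policy with respect to it is an optimal policy.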
amazing
Ok, not trying to be difficult, just trying to understand. So correct me if I'm wrong. But you paraphrase the Bellman equation as saying that the expected return is the reward.... plus the "maximum expected discounted return that can be achieved from any possible next state-action pair (s',a')". I don't believe this. Shouldn't it be the "expected value over the next state s' of the maximum value over the next action a'"? The point being that "expected" and "maximum" are in the wrong order and the probability distribution is over the next state.
I don’t think you’re being difficult. Your comments show that you’re actually putting your own thoughts into this stuff :)
I could’ve been more precise in my paraphrasing. Let me clarify. First, from an earlier video/blog on MDPs where we touch on transition probabilities (more in the blog than the video), we talk about how the action a that is selected from a given state s comes from a set of actions A(s) that can be taken from s. A(s) is a subset of all possible actions that can be taken in the environment, and the policy defines a probability distribution over A(s) for each s.
deeplizard.com/learn/video/my207WNoeyA
Next, from the max term used in the Bellman equation, we can see from the subscript that the maximization is occurring over all the next possible actions a’ that can be taken from s’. In other words, given that we end up transitioning to state s’, which action a’ from the set of actions that can be taken from s’ is going to yield the max return?
Your phrasing of "plus the expected value over the next state s' of the maximum value over the next action a'" works to describe this. I'm trying to think of a way to say this a bit more intuitively and update the blog with it. Maybe "plus the maximum discounted return that can be achieved from the next state s’ over all possible actions a’ in A(s’)."
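For reference, writing the Bellman optimality equation for q* out with the environment dynamics makes the ordering of the expectation and the max explicit. This is the standard expanded form, assuming the usual notation p(s', r | s, a) for the transition probabilities as in Sutton and Barto, not a new claim about the video:

% Bellman optimality equation for q*: expectation over the next state s'
% (and reward r), maximization over the next action a'.
q_*(s,a) \;=\; \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
        \;=\; \sum_{s',\,r} p(s', r \mid s, a)\,\Big[\, r + \gamma \max_{a'} q_*(s', a') \,\Big]

The max sits inside the sum over s', which matches the "expected value over the next state s' of the maximum over the next action a'" phrasing from the comment above.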
My head starts to burn (I mean it) whenever you start talking about functions and those terms at 3:28.
Make sure that you've studied the episodes in the series that come before this one, as they build up to the math that we're using in this episode. Also, to get a firmer grasp on the math, you can go at a slower pace by spending time on the written blog format of each episode here: deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv
Suggestion: It seems to me that there is some burden to show that an optimal policy exists and has the properties that you claim. A "proof" would be simple enough and would demonstrate the purpose of assuming that the set of states and the set of actions are both finite. It would also use some recursive formula like the Bellman equation, so it would serve as a preview of that equation as well.
Thanks for the suggestion. I’ll think on this further and consider writing up a proof and adding it to the blog.
I am not convinced ;) The argument mentioned is that "since the agent follows the optimal policy, the next state s' satisfies the condition that the best next action (w.r.t. the expected reward) can be taken".
However, we are conditioning on the action a here, which effectively means we are not following the optimal policy. The optimal policy samples a given s, meaning that not every possible a will lie on the optimality path.
In general, I don't understand how we get away from greedy behavior here. I'm thinking of local instead of global maxima.
{
"question": "How can we determine an action-value function is optimal?",
"choices": [
"For any state action pair, our function produces the expected reward for taking that action plus the maximum discounted return thereafter.",
"For any state action pair, our function yields the maximum future rewards.",
"For any state action pair, our function produces the reward for taking that action.",
"For any state action pair, our function yields the discounted rewards of following the optimal policy."
],
"answer": "For any state action pair, our function produces the expected reward for taking that action plus the maximum discounted return thereafter.",
"creator": "Tyler",
"creationDate": "2021-03-03T23:16:50.005Z"
}
Thanks, Tyler! Just added your question to deeplizard.com/learn/video/rP4oEpQbDm4 :)
As we progress: WHAT THE FUCK DOES E MEAN?!
Intros are too long. 24 seconds is too long.
This is very confusing... not clear at all.