Check out the corresponding blog and other resources for this video at:
deeplizard.com/learn/video/eMxOGwbdqKY
I have said it before but I feel obliged to say it again with each new video. Your videos are awesome. I really like the explanations starting from scratch and then continuously building up. Really helpful and highly appreciated
We really appreciate that, Daniel! Always happy to see your comments :)
I love the little snippets of real-life reinforcement learning at the end of the video; it keeps me inspired to continue!
"If you can't tell it simply, you don't understand it enough"
It is the simplest explanation I have found on RUclips.
Thanks a lot
hey deeplizard we need a series about RNN also. Plz choose it as your next series
RNN is dead
Your channel is a gem. I have watched so many tutorials on ML but yours is the one which I fully understand.
I'm gonna comment on every video I watch on your channel.
Thanks a lot deeplizard, you're shaping future ML engineers.
This channel deserves more viewers for sure.
i love how this video is so calm & soothing at the end then BAM WACKY RUNNING CRAZY STICK MAN
LOL!
I learned from this series what I couldn't from MIT and Stanford OpenCourseWare.
I love the series, really clear explanation. Although it would be nice to have examples in between. It stays very abstract to me now, but if it were visualized during the explanation it would land better, in my opinion. But super thanks for making this series on such complex matter!
Regarding: V(s) and Q(s,a)
I want to point out what wasn't clear to me and what needs to be emphasized in my opinion to understand the whole stuff:
Vπ(s) expresses the expected value of following policy π forever when the agent starts following it from state s.
Qπ(s,a) expresses the expected value of first taking action a from state s and then following policy π forever.
The main difference, then, is that the Q-value lets you play out a hypothetical of potentially taking a different action in the first time step than what the policy might prescribe, and then following the policy from the state the agent winds up in.
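To make that distinction concrete, here is a minimal Python sketch that estimates both quantities by averaging sampled returns. Everything in it is hypothetical and for illustration only: the env object with gym-style reset/step methods, the policy object with a sample method, and the discount rate are assumptions, not anything from the video. The only difference between the two estimates is whether the first action is forced or drawn from the policy.

GAMMA = 0.9  # made-up discount rate

def rollout_return(env, policy, start_state, first_action=None):
    # Play one episode from start_state and accumulate the discounted return G_t.
    state = env.reset(start_state)                      # hypothetical: reset into a chosen state
    action = first_action if first_action is not None else policy.sample(state)
    g, discount, done = 0.0, 1.0, False
    while not done:
        state, reward, done = env.step(action)          # hypothetical gym-style step
        g += discount * reward
        discount *= GAMMA
        if not done:
            action = policy.sample(state)               # after the first step, always follow pi
    return g

def v_estimate(env, policy, s, episodes=1000):
    # v_pi(s): expected return when starting in s and following pi from the very first step
    return sum(rollout_return(env, policy, s) for _ in range(episodes)) / episodes

def q_estimate(env, policy, s, a, episodes=1000):
    # q_pi(s, a): expected return when taking action a first, then following pi thereafter
    return sum(rollout_return(env, policy, s, first_action=a) for _ in range(episodes)) / episodes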
Hi! RL newbie here. I love your explanation. The "following policy pi forever" part is especially enlightening. But I am not sure what it means. Rather, I don't think I ever quite understand the difference between action and policy in an intuitive way. I would guess policy pi, or the probabilities, changes after every time step, but does following policy pi forever mean the probability never changes? Or does following policy pi mean always choosing the same action? I would love to hear more of your insight :)
@@yuhanyao2857 From what I understand, the policy gives a complete probability distribution from each state. You can think of it as first plugging in the state of the system, then receiving a probability distribution of actions. So what happens is this: after your model has taken a step, you get a new state, but if you plug this state into the policy function, then you get a probability distribution of the different possible actions. Just sample an action from this distribution and let your model take that action. This gives a new state, which you then plug into the policy again. In this sense, the policy completely describes how your model should act going forward. It doesn't matter that the state changes.
@@sender1496 Wow can't believe it's been 2 years since I posted this question. Thank you for your explanation!
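For anyone wondering what "sample an action from this distribution" might look like in code, here is a tiny sketch; the states, actions, and probabilities are made up, and pi is just a lookup table for illustration.

import random

# A hypothetical stochastic policy: for each state, a probability distribution over actions.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.7, "right": 0.3},
}

def sample_action(state):
    # pi(a|s): the probability of choosing action a when in state s
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Following the policy "forever" just means repeating this at every time step:
# observe the state, sample an action from pi(.|state), act, observe the next state, repeat.
print(sample_action("s0"))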
By a huge distance, you make the best tutorials in this field. Thanks a lot.
This channel saved my life! Many thanks!
This series is golden! Thank you so much for creating it!
I watched the video and read the post.
What I learned:
1. Policy and value functions are two different things.
2. Value functions come in two forms: one tells the agent how good a state is; the other tells the agent how good an action is in a certain state.
3. A policy is the probability of the agent choosing a certain action in a certain state.
4. The two value functions just confused me at first. After I read the post it got clear. In short: one is for the state, one is for the action under the state.
5. What is the Q-function? The state-action pair value function. I finally get it... thanks to the state value function.
I'm happy to hear that the blog post is acting as a supplemental learning tool to the video!
Very helpful videos. Very well explained in simple words. Thank you for creating such videos.
Really like your discussions: you get right to the heart of the matter. By the way, great video production and graphics! The cyclist is nifty!
Thank you for this series. I am so grateful to you. It's simply awesome! :)
I was done trying to learn the topic till I saw this series. The simplification, the structure, wow.
Very nice summary! I’m taking the Udacity DRL course and this video helped me understand the distinction between value function components
Glad to hear that, James! Thanks for letting me know!
Maybe I missed it, but I think you haven't described what it means "to follow a policy". In other words, how do you make use of the probability distribution over actions in any given state? You could sample from it to determine your next action. But if you do that, how does it relate to the optimal value criterion, since you won't be able to reach the global optimum then?
I wish someone would explain all of this in a non-formal way, not using any notation, just rudimentary math and visualizations. I had the same experience as when people tried to explain backpropagation and the chain rule. I now understand both concepts and how to implement them, but damn it was hard. I had to reverse engineer a simple neural network and really look at the code and step through it in order to understand what it actually does and how it works. It's frustrating that it's faster to learn it that way vs listening to people trying to explain it in an abstract way. Why do we have to introduce so many terms and abstractions you have to remember? In the end, what the algorithm does is all multiplication, addition, and data manipulation. My programmer brain is just wired differently. I'm sure that once I understand all of this, I can throw away 90% of all of the terms and math introduced here and understand how it works intuitively without remembering what MDP stands for... Same thing with NNs: I'm now inventing my own neurons, playing with different ways of making them recurrent, giving them memory and crazy features, and I know very little about math and its notations.
Nice editing!! Lovely!!
Are you using those videos at the end to give us rewards for completing each video?
If so, that's pretty meta and impressive.
🤔😅
Excellent work!!! Well defined concepts...
Thanks for the explanation, that was great, but it would be nice to discuss the probability distribution of a policy a bit more; I just did not understand that concept. The rest was great 🙏
These videos are perfect
Really Great, Thank You
Amazing content!
you are a lifesaver
Super!!Going awesome!!!
Awesome video! Really clear!
Thanks, James!
What is "E" in the formula and what does "|" symbol i.e. bar means in the formula?
I would suggest to provide appropriate meaning of symbols below each formula.
Great job, an awesome explanation. Can you please give the link to the DeepMind video snippet at the end?
ruclips.net/video/t1A3NTttvBA/видео.html
"If an agent follows policy 'pi' at time t, then pi(a|s) is the probability that At = a if St = s. This means that, at time t, under policy 'pi', the probability of taking action a in state s is 'pi'(a|s)"
At 4:00, what exactly is G_t? Is it the value of the state? And what does the E_pi[ ] return?
{
"question": "What is the difference between a state value function and an action value function?",
"choices": [
"State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pie.",
"State value function tell us the policy whereas action value function tells us the action and state.",
"Both are the same.",
"State value function tells us the state as well as action whereas action value function tell us only action for any state."
],
"answer": "State value function tells us the correct state whereas action value function tells us the correct action to be taken in the state, both for a particular policy pie.",
"creator": "whopriyam",
"creationDate": "2020-04-09T11:32:01.613Z"
}
Thanks, priyam! I changed the wording just a bit, but I've now just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)
thank you for this!
{
"question": "With respect to what the value function is defined?",
"choices": [
"Value function is defined with all of the choices.",
"Value function is defined with respect to the expected return.",
"Value function is defined with respect to specific ways of acting.",
"Value function is defined with respect to policy."
],
"answer": "Value function is defined with all of the choices.",
"creator": "marianna tzortzi",
"creationDate": "2020-11-18T14:57:58.033Z"
}
Thanks, Marianna! Just added your question to deeplizard.com/learn/video/eMxOGwbdqKY :)
@@deeplizard I am in a situation where I need to learn deep q learning in just 3 to 4 days. I have deep learning background but don't know anything in RL. And your videos gave me so much information and intuition.
What is E in the value function formulas?
Expected value
@@deeplizard So it's the expected value of the expected return right?
Thanks for your video. I watched it twice but still don't understand why there should be a big "E" for expected value @4:41. I am confused because I found other references do not have it.
Thanks for the explanatory content.
It's unnecessary but you may want to change it. On the blog page, the word 'following' was written two times under the action-value function topic in the first paragraph in the second line.
Thanks, samet! Appreciate you spotting this. I will get this fix out in the next website update!
What does the big E mean at 4:04, since the expected return is G_t?
Are there any reinforcement learning problems not solved with Markov decision processes?
Thank you!
What is the term E at 5:30 in the video? I can't find an explanation.
E used in this way means "expected value." In our specific case, the value we're referring to is the return, so we're looking at the expected return.
Thanks!
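One way to read the big E operationally: it is an average over many episodes. Here is a minimal sketch with made-up numbers, assuming we have already collected several sampled discounted returns from the same starting state s while following policy pi.

# Hypothetical discounted returns observed over several episodes,
# each starting from the same state s and following policy pi.
sampled_returns = [4.2, 3.9, 5.1, 4.4, 4.0]

# v_pi(s) = E_pi[G_t | S_t = s]  ~  the sample average of those returns
v_estimate = sum(sampled_returns) / len(sampled_returns)
print(v_estimate)  # 4.32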
I am somewhat confused about rewards though. Say we're doing something complicated like playing an Atari game. How do you program in rewards? Winning and losing the game is kinda all there is, right?
If you are doing something like Atari Breakout, then every time the agent hits a block it can get a +1 reward. It doesn't necessarily need to beat the level before it gets a reward.
@@paulgarcia2887 The answer I was looking for was that when the agent gets its +100 reward for winning, it stores the state before the winning state, what to do there, and the potential reward for that state, which creates a new state with new data, and it propagates backwards that way. Cheers though. This was a difficult thing for me to learn.
You should check out Siraj Raval; he has an explanation as well as a practical video where he is using a StarCraft game.
@@43_damodarbanaulikar71 cool man i will. cheers.
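The "propagates backwards" idea discussed above is roughly what the Q-learning update does; that algorithm is only covered later in the series, so treat this as a preview sketch with made-up states, rewards, and hyperparameters. After replaying a two-step episode many times, the +100 earned at the end leaks back into the value of the earlier state-action pair, scaled by the discount rate.

from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1          # made-up discount rate and learning rate
q = defaultdict(float)           # Q-values for (state, action) pairs, default 0

def update(state, action, reward, next_state, next_actions):
    # Q-learning backup: pull Q(s,a) toward reward + gamma * max_a' Q(s',a')
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])

# A made-up two-step episode, replayed many times: s0 --a--> s1 --b--> terminal (+100)
for _ in range(200):
    update("s0", "a", 0, "s1", ["b"])        # no immediate reward on the first step
    update("s1", "b", 100, "terminal", [])   # the win pays +100; terminal has no actions

print(round(q[("s0", "a")], 1), round(q[("s1", "b")], 1))  # roughly 90.0 and 100.0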
I am a complete beginner! Could someone please explain to me why we need both a state-value function and an action-value function? It seems to me that the latter is enough, since it can map the state, the action, and the reward for that particular pair. Could I just pick one of them?
The state-value function doesn't account for a given action, while the action-value function does. Going forward, we stick mostly with the action-value function.
@@deeplizard thanks! Great content!
3:55 I found it confusing/weird that *the expected return starting from s* is equal to *the discounted return starting from s*, since the two quantities obviously aren't always equal.
Is this an error? Or perhaps just a way of saying that $v_\pi (s)$ is equal to either *expected return starting from s* or *discounted return starting from s*?
Hey Sebastian - Maybe it would've been clearer if I had said "expected discounted return" instead of just "expected return." From episode 3 onward, as long as we don't explicitly state otherwise, "return" means "discounted return." So, the value of state s under policy pi is equivalent to the expected discounted return from starting at state s and following pi. Hope that helps!
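A quick numeric illustration of the discounted return being averaged there (the rewards and discount rate are made up): G_t weights each successive reward by an increasing power of gamma, and v_pi(s) is the expectation of that quantity over episodes started from s under pi.

GAMMA = 0.9  # made-up discount rate

# Made-up rewards received after time t in one episode: R_{t+1}, R_{t+2}, ...
rewards = [1, 0, 0, 5]

# G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
g_t = sum((GAMMA ** k) * r for k, r in enumerate(rewards))
print(g_t)  # 1 + 0 + 0 + 0.729*5 = 4.645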
Thanks for the videos. What do you mean by "following policy pi thereafter"? If I am in a state and take an action with a given policy in that state and then transition to a new state, am I still following the same policy in that new state? Shouldn't the policy change in each state? This really confuses me; sorry if my question is not clear. Are you saying that when calculating the value functions, we always use the same policy in each state?
Thanks again,
Wait I think I get it, so the policy never changes in a value calculation. The policy is essentially the Q-table, and sums up how an agent will act in an environment. This stays the same, the values are calculated...
awesome
I thought a policy was "the collection of all the actions taken together in all the states in a complete lifetime to achieve some overall reward", and that "policies" meant "many lifetimes with many different overall rewards". The optimal policy I thought to be "the policy among all policies with the highest reward".
I didn't know that a policy is defined for "a single state with different action choices".
Hey ravi - Yes, a policy is a function that maps each state in the state space to the probabilities of taking each possible action.
thank you miss deep-lizard
Amazing tutorial, I appreciate it very much. Would it be possible to lower the lizard-appearing sound in the intro? It seems louder than your voice and therefore hurts my ears........... :(
Thanks, Haneul! The intro has been modified in later episodes.
Thank you!
The videos which I watched are awesome... but most of the videos are not loading, just buffering.... I have no idea why, because all other videos on RUclips are working well.
Thanks, Debayan. Aside from an issue with internet connection/speed, I'm not sure what else could cause the issue. The videos are all uploaded in standard 1080p HD quality. You could try lowering the quality to 720p to see if that helps load the videos. Also, you can use the corresponding blogs to the videos as well:
deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv
I know this channel has very good quality content, but my issue is only with the videos of your channel; all other channels and videos are working and loading well... Is there a copy of these videos anywhere else other than RUclips?
This issue also occurred on some of your videos in the Deep Learning Neural Networks series.... not all, only some videos are having problems.... It may be due to some policies of RUclips, not an issue with your 1080p quality videos.
Great, only missing examples, especially about return, policies, and value.
just traffic.
this is scary dude... xD
Haha in what way?
@@deeplizard I like the series and all (still waiting for a new vid xD), but knowing what AI can do, and the fact that I just saw a vid about how 4 AI killed 29 humans in Japan, is just scary tbh xD
You gotta use examples, man. You gotta use examples. A bunch of notation without practical examples to tie it to makes folks tune out. That's the only way to intuitively understand the ideas. This is the fourth video and you've barely touched on any 'game'. Aren't games the 'example' applications for RL? These concepts aren't difficult. They just need to be explained by teachers who are passionate about the student experience and want them to learn. All the dedication on the student's part is absolutely worthless if the teacher or material isn't coming across. I have downloaded and deleted a bunch of textbooks because they were all filled with abstract nonsense that left me frustrated with migraines. Please find a way to break this stuff down with game or real-world examples in a language a dummy can understand. I'd be very pissed if I paid for your course and got a bunch of abstract notations thrown at my face. PLEASE PROVIDE EXAMPLES. PLEASE!!!
Save the crap notations for later and provide examples. You started with a raccoon agent in the first video. Where did he go? Why isn't he utilised for subsequent videos? Sigh