This video definitely helped me understand policy iteration and value iteration! Been reading Sutton and Barto's Intro to RL, and was having trouble fully wrapping my head around the ideas talked about in chapter 4 for dynamic programming. After watching this I see it isn't as bad as I thought - thanks a lot!
Almost everything that I was confused about was explained really well. Explaining with examples was also very effective. Thank you!
This video was so good. I can't believe for 3 years I kept avoiding reading MDP papers and that Puterman book was so difficult for me to read. Thank you so much and the world needs people like you more.
"I hope you are doing absolutely fantastic in your life!" instantly clicked like. I'm not but I appreciate the sentiment!
😇
Sir, I love your explanation.. in the whole video you cover most of the important points that I had faced difficulty with. Thank you sir for your generous support.
Great explanation. I have been confused with value iteration and policy iteration for a long time. Thank you for making such a wonderful video
Thanks...🤘🤘
Man, I cannot believe you have so few subscribers. With the kind of quality you provide, it should be in the tens of thousands
How clear your explanations are! Thank you, sir.
Bellman Equation: 6:58 of this video, Sutton's book Equation (3.14).
Bellman Optimality Equation: 12:19 of this video, equations are in the textbook p.85 (pdf page).
In the equation at 20:00, the term \sum_a \pi(a|s) disappeared because the policy is deterministic. Note that p(\cdot | s, a) turned into p(\cdot | s, \pi(s)).
22:07 of this video contains the summary of the algorithm. It is important to note that policy evaluation has a loop, but policy improvement is just one step.
Bellman equation takes a role in policy evaluation step, and Bellman optimality equation takes part in the improvement step.
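The structure the comment above describes (a policy-evaluation loop followed by a single greedy improvement step) can be sketched in a few lines of Python. The tiny 3-state MDP below is a made-up illustration, not from the video; `P[s][a]` lists `(prob, next_state, reward)` tuples:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: state 1 has a rewarding move into
# the absorbing state 2; everything else gives reward 0.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma, theta = 0.9, 1e-8
n_states, n_actions = 3, 2

policy = np.zeros(n_states, dtype=int)   # deterministic policy: state -> action
V = np.zeros(n_states)

while True:
    # Policy evaluation: loop (Bellman equation) until V converges for this policy
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Policy improvement: one greedy step (Bellman optimality equation)
    stable = True
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        best = int(np.argmax(q))
        if best != policy[s]:
            policy[s] = best
            stable = False
    if stable:
        break
```

Here the initial all-zeros policy already takes the rewarding action from state 1, so the improvement step confirms it is greedy and the loop terminates after one round.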
Great visual explanations. Thank you very much for such content. Looking forward to watch more videos on RL and DeepRL.
Hi Riturj! Question: when applying the Bellman equation V(s) = max_a (R(s,a) + γ \sum_{s'} T(s,a,s') V(s')), for the V(s') at the end of the equation, do we only look at the next state, or do we recursively go back until we reach the beginning of the maze or MDP???
Hey, we look at the next state only. In every iteration of the algorithm we update the value of all the states using the old v(s') values.
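To see how looking only one step ahead still propagates information backward, here is a hand-rolled sweep on a made-up 4-state chain (the chain and numbers are illustrative, not from the video):

```python
gamma = 1.0
# Hypothetical chain: 0 -> 1 -> 2 -> goal(3); reward 1 on entering the goal.
V = [0.0, 0.0, 0.0, 0.0]
for sweep in range(3):
    V_old = V[:]                      # every update reads only the old estimates
    V = [gamma * V_old[1],            # state 0 looks one step ahead, to state 1
         gamma * V_old[2],            # state 1 -> state 2
         1.0 + gamma * V_old[3],      # state 2 -> goal, reward 1
         0.0]                         # goal is terminal
# Each update is strictly one-step, yet after 3 sweeps the goal's value has
# propagated all the way back through the chain: V == [1.0, 1.0, 1.0, 0.0]
```

So there is no explicit recursion to the start of the MDP; repeating one-step backups over many sweeps achieves the same effect.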
@@aiinsights-riturajkaushik1618 Great that helps alot!
sir one question please answer it:
Ques. Which of the following policy iteration does:
Options are:
(A) produce random data
(B) allocate benefits (add positive value)
(C) allocate only rewards (positive rewards)
(D) allocate both positive and negative rewards
Note: Please explain the answer
Bravo! Short and clear information, very useful for exam preparation
Thanks...
Thanks for the wonderful videos on RL, please post any git snippets if possible
Hey...I forgot to do that. Thanks for reminding. I'll do that as soon as possible.
I think you do a good job of making things clear. This is pretty abstract stuff to teach.
Thank you very much for clear explanation of policy and value iteration
Thanks for the feedback. Stay tuned for new interesting videos related to RL and AI in general.
24:09 I don't know if I am getting it right, but I think near the end of the convergence step in the first loop, we have already found the best (or nearly best) action given state s. The reason is that to calculate the maximum of V(s), all actions have to be iterated and the values have to be compared. Is it still necessary to backtrack the best action again at the end using V*(s)?
Awesome video, thank you so much!! It was very helpful for understanding it perfectly. One question: what software do you use to create the slides? Thank you in advance!
Hey...thanks. I used Xournal in Ubuntu to write and converted it to pdf.
very clear and intuitive explanation. tysm
A very informative video. Thanks for creating such helpful content.
Thanks 🙂
Hi sir, thank you for this video. I have a question: how do I find the optimal policy and the corresponding state values when R(s) = -0.01?
Thank you for the concept
Why is there not a single video on this subject that actually works out an example with all the numbers /:
There are many with examples for value iteration, but not a single one for policy iteration/evaluation
sir, you have a very good future on YouTube...keep posting
Thanks. I am happy that there are people who watch these videos...😀
why did you use Vold at 22:26? As per the Markov decision process, we are not relying on past values at all
Vold is the current estimate of the value function...
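In implementation terms, "Vold" is just a frozen copy of the current estimate, taken at the start of each sweep so that every state's update reads the same snapshot. A minimal illustration with made-up numbers:

```python
gamma = 0.9
# Hypothetical 2-state loop: each state's value depends on the other state's
# old value, so a consistent snapshot is needed for a synchronous update.
r = {"a": 1.0, "b": 0.0}
other = {"a": "b", "b": "a"}
V = {"a": 0.0, "b": 0.0}
for _ in range(200):
    V_old = dict(V)   # snapshot: "Vold" = current estimate, not some past episode
    V = {s: r[s] + gamma * V_old[other[s]] for s in V}
# Fixed point: V(a) = 1 + 0.9*V(b) and V(b) = 0.9*V(a),
# i.e. V(a) = 1/0.19 ≈ 5.263 and V(b) = 0.9/0.19 ≈ 4.737
```

So "old" does not mean history from past trajectories; it is the value estimate from the previous iteration of the algorithm.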
Could you upload some examples of using the two algorithms? It's hard to visualise from just the equations, maybe show some of the iteration process. I'm having a hard time understanding from the equations alone, and my school lecture notes haven't clearly defined it either. Hope you can show some examples real soon, thanks
Yes. I am thinking about creating some example videos soon. Currently I am mostly focusing on finishing the background topics so that I can move towards more useful topics such as Q learning, Policy gradient methods etc.
@@aiinsights-riturajkaushik1618 Sure thanks
@@aiinsights-riturajkaushik1618 Yes, it would be great if you can explain DQN and PPO as well in upcoming videos.
thanks you , well explained .
A good start. If the presentation were a bit more engaging, the videos would feel livelier to watch and you'd gain more subscribers. Best wishes.
Really liked the videos, hope to get new videos soon.
Great explanation. Makes the equations easy to understand
Thanks
Your videos are very nice! I think there's a typo in the q function at 9:47 (it says "A_t =s").
Yes..you are right. That's a typo. Thanks.
thank you sir!
very much appreciated
very good sir,thankyou
Thanks for the.....stay tuned. I'll bring interesting AI stuff soon with real world applications...and top AI lab visits...
Great..!
Thumbs up for such an awesome video. I have a question: can I solve a simple game where my action depends on the state plus an external variable coming from the opponent?
Hey..thanks. If your external variable can take a discrete set of values then it must be included in the state. Everything should be in the state in this framework.
@@aiinsights-riturajkaushik1618 in that case my state space will be very big. Is there an upper limit on the number of states if I code it in Python?
@@aiinsights-riturajkaushik1618 I have a problem with 2000 states, each state consisting of 3 variables. Can I solve it with policy iteration??
2000 is not that big. You can do it.
@@aiinsights-riturajkaushik1618 thank you
Why did you stop uploading videos? You provide such great explanations!
great vid
Awesome video, Thanks
I understand the whole concept but I can't understand all the equations... I think I'll have to memorize all the equations for my theory exam..😀😀
I think I should have provided the slides...I'll try to do better in the next videos...
Can't read your handwriting
I agree! I'll try to make it clearer from next time. The channel was not well planned initially, and I did not have a good setup for writing on screen.