Great video! Can you explain that "sneaky" equation around 6:00 a bit more? Why is G_t+1 = v(S_t+1) in the expectation?
Ah, something I intentionally skipped over out of laziness, so I'll pin this comment for others.
We want to show E[G_t+1 | s^0, -> ] = E[v(S_t+1) | s^0, -> ]. So..
* E[v(S_t+1) | s^0, -> ] = E[E[G_t+1|S_t+1] | s^0, -> ] (by def of v )
* = sum over s' [E[G_t+1|S_t+1 = s']p(s' | s^0, -> ) ] (by def of an expectation - the outer one)
* = E[G_t+1| s^0, -> ] (by law of total probability)
where p(s' | s^0, -> ) = sum over r [p(s' , r| s^0, -> )]
@@Mutual_Information thanks!
@@Mutual_Information May I ask how law of total probability is used to get the last line from the previous one? Thanks!
@@xiaoweilin8184 Hm, it's just precisely what the law of total probability tells you.
sum over b [p(a|b)p(b)] = p(a)
The only difference is my expression has some extra conditioning on s^0, ->.. but that doesn't change anything.
Hope that helps
@@Mutual_Information But in your expression, the quantity to be summed is E[G_t+1|S_t+1 = s']. So do we need to write out this expectation to:
sum over g_t+1 [g_t+1*p(G_t+1 = g_t+1|S_t+1 = s')]
first?
and the whole expression becomes a double sum:
sum over s' , sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)]
exchange the sum:
sum over g_t+1 {g_t+1 * sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)]} (1)
Until this step can we use the total probability formula to the second sum:
sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)] = p(G_t+1 = g_t+1 | s^0, ->)
Put it back into (1):
sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1 | s^0, ->)] = E[G_t+1 | s^0, ->]
Is this the correct way to use the law of total probability to derive the last step from the previous one in your derivation? It seems to me there are a few more steps happening under the hood in your expressions.
Sorry there is no LaTeX in YouTube comments; it would be nicer if these were in LaTeX...
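Since the comments can't render it, here is the same chain as a LaTeX sketch (discrete sums assumed; the total-probability step uses the Markov property, which is why the extra conditioning on s^0, -> doesn't change p(g | s')):

\begin{align*}
\mathbb{E}[v_\pi(S_{t+1}) \mid s^0, \rightarrow]
 &= \mathbb{E}\big[\,\mathbb{E}[G_{t+1} \mid S_{t+1}]\,\big|\, s^0, \rightarrow\big] \\
 &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1}=s']\; p(s' \mid s^0, \rightarrow) \\
 &= \sum_{s'} \sum_{g} g\; p(G_{t+1}=g \mid S_{t+1}=s')\; p(s' \mid s^0, \rightarrow) \\
 &= \sum_{g} g \sum_{s'} p(G_{t+1}=g \mid S_{t+1}=s')\; p(s' \mid s^0, \rightarrow) \\
 &= \sum_{g} g\; p(G_{t+1}=g \mid s^0, \rightarrow) \\
 &= \mathbb{E}[G_{t+1} \mid s^0, \rightarrow]
\end{align*}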
THESE ARE THE BEST VIDEOS ON THIS TOPIC EVER, AND YOUR WAY OF EXPLAINING AND MAKING THINGS SOUND SO SIMPLE IS INCREDIBLE, THANK YOU A MILLION TIMES !!!!!!
So far the best and optimized playlist for reinforcement learning.
I agree :)
Let's read from the textbook. *He opens the book, then stares at the camera and confidently recites from memory*.
Lol I wish it was from memory! Fortunately teleprompters aren't that expensive :)
@@Mutual_Information tsss
don't ruin the good impression of you
That part of the video made me laugh out loud!!
i was looking for this comment lmao
This part of the video made me lose my focus entirely 😂
I can't express how good these videos are, thank you so much for all the time you put into making them! this is a truly special channel
Thank you, it's tailored for a particular audience. It doesn't hit for most, but for some it nails it!
@@Mutual_Information I wish this was much more popular , this series feels tailor made for me.
Kudos, good sir. Your pedagogical skill is both impressive and efficient.
Please continue to grace the world with it for the good of all of mankind.
That's very kind of you Timothy - I have no plans of stopping :)
At 15:46 you said "if that policy is greedy with respect to that value function" but I don't quite understand what you meant by that. Other than that the video is crystal clear. Thank you for these videos.
A value function gives you the numeric value of every action in every state. A policy that's greedy 'with respect to that value function' is one which, in whatever state, picks the highest value action, according to the value function. Make sense?
You saved me a lot of time with this simple, concise, and easy-to-follow video compared to others I have seen so far.
One of the best series, if not the best, at describing DRL.
Good Job !!!!
Thank you!
This is the best reinforcement learning resource available on the internet. Period.
Imagine if such great educational videos existed for all foundational topics in artificial intelligence, engineering, math, and physics. We are slowly getting there :). 3b1b py module manim has made it quite accessible to create high-quality, time efficient (for learning) educational content. It's amazing what people create. Thank you for the great videos!
I hope that there's a section of YouTube that's one day more like Wikipedia. It's a bit of a pipe dream, but I'm at least nudging this continent in that direction. FYI, I don't use manim.
best video lectures of rl on the internet
This is so well done! Explaining stuff well can be very difficult. Thanks a lot! I'm studying RL at a university course, but this was way more helpful!
Can you share your email? I'm studying RL in a university course.
*opens book* *stares into soul* LOL! I watched David Silver's excellent series before finding the MI channel. Your animations add a new level to understanding the topic adding more AHA moments for grasping RL. Thank you for making them! (Subbed & liked)
Excellent video. Even though I have been studying RL for a while, the video clarified some previously learned concepts and gave me a better understanding of the topic.
Thanks, exactly what I was going for
17:05 * open textbook *, * proceeds to not even look at it *
Man's a CHAD
Love the content so far. I would just prefer that you leave some time to breathe, like when you ask a question such as "can you find S0?" - don't answer straight away, let us think for a few seconds. Will keep digging through the playlist, thank you for all this work!
Good feedback. I’ll keep it in mind. Idk why I’m in such a rush lol
It turns out that in fact, algebra *is* fun, cool, and exciting
Damn, it really only took you 20 minutes to explain something that my professor needed two full lectures for. Thank you so much! This was so helpful
This series of videos is really nice. I would love to see you go more into the theory/proofs of why policy iteration works... as another series. Once again, really good work.
Your videos are like espresso, condensed, tasty, full bodied but you should not try to rush when watching them. There are no spare words so when you miss one, you're lost 😀Great video, I love that logical structure, rock solid!
lol you get what I'm going for! It's awesome - love the appreciation
Good to see your content back!
Oh I'm BACK!
Hello.
Thank you very much for providing such clear lectures, I hope you will continue.
I have one suggestion: when you finish explaining a set of equations, you make them disappear so fast that one doesn't have time to pause. Just take one full-note pause and that should do it. Don't overdo it, the beauty of your lectures is that they are dense, clear and concise.
Noted.. let the equations bake. Ok good point - thank you!
This is the easiest way I have seen regarding this subject. You did a pretty great job there 😂😃😃👍👍
Thank Sir No Name - I'm trying quite hard lol
Amazing explanation of the concepts! Really nice!
Thank you, I appreciate it when the harder topics land :)
Duane, at 16:20, I'm not sure why we would need to update the policy: I would think that we could just rely on updating the values of the states, again and again, until they stop changing. Following my logic, we wouldn't have to iteratively change the policy - at the very end we'd just make it "follow the highest action".
....But, I realize that these state-values were updated with the random-action policy (the 4 neighbor-states value are weighed by 0.25).
Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others? So it no longer becomes 0.25 but some other probability.
My confusion is because I am used to Q-learning where the policy is epsilon greedy.
Thank you
Hey Igor. If you are following that random-action policy example, then it just so happens you only need to apply policy improvement *once* and you're at the optimal policy. But that's not true in general. Here it is spelled out more:
* start with a random policy.
* determine its value function.
* make a slightly better policy using the pick-best-action rule. This is only slightly better than the random policy. In the general case, it is not likely to be near the optimal policy.
* determine the value function of this new, slightly improved policy.
* repeat.
If you were to follow your approach, you would only be doing one iteration. You wouldn't end up with an optimal policy.
And regarding "Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others?" Yes, I think that's fair. We are changing the probabilities in states where the highest action value is NOT selected. Though I'm not sure what you mean by "slowly"
Hope that helps!
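If it helps to see those bullets as code, here's a minimal Python sketch of policy iteration for a small tabular MDP. Nothing here is the video's code - the states, actions, and dynamics table P are placeholders you'd supply for your own problem:

def policy_iteration(states, actions, P, gamma=0.95, tol=1e-8):
    # P[(s, a)] is a list of (prob, next_state, reward) triples -- assumed given.
    policy = {s: actions[0] for s in states}   # arbitrary starting policy
    V = {s: 0.0 for s in states}
    while True:
        # policy evaluation: sweep the Bellman expectation equation until V stops moving
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # policy improvement: act greedily with respect to the value function just computed
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]) for a in actions}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:   # no state changed its action, so we're done
            return policy, V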
@@Mutual_Information thank you
Wow im not sure if I understood RL when I took the course in college or i just forgot it, but these videos made the aha moment for me for sure!
That's what I'm going for!!
I love Bellman! And I love equations!!
I didn't expect the next video so quickly, amazing stuff.
Have we been spoiled, or will this tight upload schedule continue?
It will continue.. for a short amount of time ;}
If anyone is having a hard time deriving the bellman equation, especially the part where E [ G_t+1 | S_t ] = v(S_t+1), then I have covered that in my playlist from the absolute ground zero (because I am dumb).
ruclips.net/video/4nMnc8M7U5k/видео.htmlsi=9QR8zNuLEd1QAijE
I don't see anything dumb in that video. You're getting right into the meat of probability theory and that's no easy thing!
Yes! We missed you
Missed you too!
goated videos literally all you need to understand the topic
My brain is cooked. This is my fifth time rewatching this😢
Keep it up.. amazing take at this subject
Glad you like it!
And my current plans are to certainly keep it up :)
Can you explain the calculation in the Bellman equations with an example showing which values to substitute?
Great material, thank you so much for posting. May I ask why the action generated at 20:09 for state = 60 is 1 under the uniform V = 0.4? I understand for 80, for example, we get a = 20 because 0.4*1 + 0.6*0.4 = 0.64 (which is the updated V at the next step for 80).
At state 60:
For a = 1: 0.4*0.4+0.6*0.4 = 0.4
The counterfactual for a = 40: 0.4*1 + 0.6*0 = 0.4.
I guess we grab the minimum action in case of a tie?
I'm wondering what causes the discontinuities at s= 25, 50, 75 in the gambler's problem solution. The top panel value function triangular appearance makes sense but I can't get an intuitive grasp of why the lower panel optimal policy would end up that shape.
@6:48, how did you calculate the expectation via the sum i.e. get 17.4?
It's been 11 months, but either way:
first assume s=s^(1):
r = 0: 0.12 * [0 + 0.95 * 18.1]
r = 1: 0.22 * [1 + 0.95 * 18.1]
r = 2: 0.20 * [2 + 0.95 * 18.1]
sum all of these up
then we assume s=s^(2)
r = 0: 0.09 * [0 + 0.95 * 16.2]
r = 1: 0.32 * [1 + 0.95 * 16.2]
r = 2: 0.05 * [2 + 0.95 * 16.2]
sum all of these up as well
then add the grand total of both of those.
It's the same way for taking the left action :)
Edit:
for the state value of s=s^(0) you would simply have to create the weighted sum
of the actions such that:
0.4 * 17.8 + 0.6 * 17.4 = 17.56, or 17.57 if you don't ceil/floor your intermediate results
do the calculations in excel you will get the same results :)
Nailed it!
@@denizdursun3353 Thank you, kind man. After I saw the numbers and compared them with the formulas in the video it became pretty obvious. Again, the video is brilliant, but the numeric walkthrough dramatically boosts clarity.
The whole term in case anyone needs it:
0.6 * (0.12*(0 + 0.95*18.1) + 0.22*(1 + 0.95*18.1) + 0.20*(2 + 0.95*18.1) + 0.09*(0 + 0.95*16.2) + 0.32*(1 + 0.95*16.2) + 0.05*(2 + 0.95*16.2))
+ 0.4 * (0.17*(0 + 0.95*16.5) + 0.23*(1 + 0.95*16.5) + 0.04*(2 + 0.95*16.5) + 0.34*(0 + 0.95*19.2) + 0.05*(1 + 0.95*19.2) + 0.17*(2 + 0.95*19.2))
≈ 17.57
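And if you'd rather not push those numbers through by hand, a quick Python check of the same sums (probabilities and next-state values read off the video's tables; I've labeled the actions generically since only the 0.6/0.4 policy weights matter here):

gamma = 0.95
# (prob, reward, next-state value) triples for each action's transitions
action_a = [(0.12, 0, 18.1), (0.22, 1, 18.1), (0.20, 2, 18.1),
            (0.09, 0, 16.2), (0.32, 1, 16.2), (0.05, 2, 16.2)]  # chosen with prob 0.6
action_b = [(0.17, 0, 16.5), (0.23, 1, 16.5), (0.04, 2, 16.5),
            (0.34, 0, 19.2), (0.05, 1, 19.2), (0.17, 2, 19.2)]  # chosen with prob 0.4

q_a = sum(p * (r + gamma * v) for p, r, v in action_a)  # ~17.40
q_b = sum(p * (r + gamma * v) for p, r, v in action_b)  # ~17.81
print(q_a, q_b, 0.6 * q_a + 0.4 * q_b)                  # the last value is ~17.57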
You mentioned something about the optimal policy around 7:54. What is that, and how did you relate it to the optimal state-action value?
Hi, so far I think this is a great video, however, I wanna point out that at 3:45 your illustration makes it seem like the states are a sort of one-dimensional grid, and one can go from s0 to s1, s1 to s2 etc. When you show the probabilities it becomes "obvious" that this is not the case, but this part had me confused a bit with your explanation/equation at 6:28, which I'm pretty sure should also have an s0 instead of an s. Like I said, otherwise a very good video!
Could you explain the policy improvement? I understand that choosing the action that maximizes the action-value function will lead to a better policy, but I don't understand why we could not do that in the first iteration after performing policy evaluation. Wouldn't that then be the optimal policy? What other improvements can we iteratively make to the policy?
PS: Thank you for this video series, it has helped me understand a lot!
Let's walk through it. To start, the action-value function is completely flat; all actions in all states have the same value. To do policy improvement from this moment, you must pick actions randomly (or some other arbitrary way to break ties). Now you have a random policy. Then in policy evaluation, we determine the action-values of this random policy. Next, in policy improvement, we can slightly beat it by always picking the max-action value. Ok, why isn't this the optimal strategy immediately? Well b/c it's a policy improvement step (applying the rule of picking the max-value-action) on action values of a crappy policy, the one where we just picked actions randomly!
Make sense? It takes time for the action-values to model a good policy, because we start with a bad policy.
@@Mutual_Information This makes a lot more sense now. Thank you so much for taking the time to respond!
To everyone else watching: while I have only watched the first 3 parts (will watch the rest soon), I can already tell you that this video series has piqued my interest in RL, and I am excited to dive deeper into these topics and look into how I can incorporate this into my research.
Excellent! I hope it helps you
Excellent work my friend.
Thank you! Cheers!
Nice explanation. May I ask how you gained your explanation skills - did you take a course or something? Because you just hit the right buttons with your method, so to speak, to make us understand what you are talking about. I'd love to explain things like you do man, appreciated!
Funny you say that, I feel like I have so much more to learn. I explain things poorly all the time!
What I'd say is.. start *writing* ASAP. Write about whatever interests you and write somewhere where people will give you useful feedback. It takes a while to learn what works and what doesn't.
One thing that helped me is to realize I'll be writing/educating forever. So there's no rush. This makes it more enjoyable, which means it's easier to maintain a long term habit. The long term is where quality edu emerges.
This series is just amazing. is there any deep learning series like this?
From me? No (but maybe one day). In the meantime, 3Blue1Brown has an excellent explainer. And there are others..
Incredible content, thanks a lot for your work !
Thank you Jean, and thanks for watching!
@ 13:50 for computing the value function. Is this somehow related to Gibbs sampling? Somehow it reminds me of it :)
Gotta upload a version that blends out the pause-less hand motions
Yea I don't like my old style at all. I'm re-shooting one but it's a lotta work. :/ Feedback is welcome.
@@Mutual_Information It was just a mean comment for the sake of being mean. I'd probably not put in the work to upload content you already got got covered, but donno
@@watcher8582 haha yea I looked at the video after the fact and decided it wasn't so bad I didn't need to do this one. But some of my older ones, hand motions are awful and I'll be changing some of that lol
@@Mutual_Information No I mean if you already got a video covered (or if it's well covered elsewhere), then I'd probably invest the energy into making a video with a topic not covered online yet. Don't misunderstand me, the hand motion is terrible and I put a post-it over the screen to watch the video.
I was super focused on the board, so I didn't notice any weird cutovers! I like how you go through each variable; it's very useful to have these quick reminders of what the notation represents as we're going through new concepts, so that we don't have to make extra conscious effort to decipher it and can focus on the new concept.
I'm tough and ambitious!
lol, hell yea!
Do you provide slides as a pdf anywhere ? That would be really helpful! Great video!!!
Gah, unfortunately not.. I'd like to come back and create a written version of this series. I have that for a small set of other videos, but that'll take some time for this one - don't hold your breath, sorry!
@@Mutual_Information no problem! Thanks for your videos. They are a big help!
Sorry, I may have miscalculated and am just being stupid, but I got a different result at 6:48 (17.4). Can someone write it out?
13:53 - how come the -20 cells and the -22 cells don't have the same value? They are equally far from the ending point, no?
Thanks!
If we were valuing the optimal policy, you'd be right. But we're valuing the do-something-randomly policy, which can't be valued by looking at the optimal path. You have to think about a random walk, and then the corners, in that sense, are further away.
@@Mutual_Information
Thank you very much for the quick reply!
I have a question about the 6:06 substitution, could you explain it a bit? Because v_pi does not need to be conditioned on state and action; it only depends on the state.
If you look at the pinned comment on this video, I do a breakdown of the expression. If that doesn't answer your Q, let me know
Very cool. How did you make the animations?
I use the Python plotting library called Altair. It creates beautiful static images. Then I have a personal library I use to stitch them into videos. That's also used to make the latex animation.
Amazing video!
I have a question: does the policy (pi) describe a probability distribution or the action itself?
Because I have a hard time understanding the optimal policy and the equation `a = pi(s)` when the policy is deterministic.
Also, how do we perform the policy improvement when the policy is stochastic? Do we just reassign the state-action value using the Bellman equation after a sweep of policy evaluation? Because in the beginning of the grid world problem, you said that we use pi(a|s) = 0.25 and then you change the setting to become deterministic by taking the argmax.
If the policy is deterministic, it is a function. You give it a state and it gives you an action. That's all.
If it's stochastic, it's a distribution over actions that depends on the state. If you give it a state, it gives back an entire distribution, specifying the probability for each action.
Policy improvement for a stochastic policy depends on the policy. If it's epsilon greedy, that's like a deterministic policy 'mixed' with a stochastic policy that is always a uniform distribution over the actions. In this case, policy improvement doesn't touch the stochastic part - only the deterministic part. To perform policy improvement, you set the deterministic policy to select the highest value action in every state. You can do this because presumably you just performed policy evaluation. This is the argmax operation: "select the highest value action".
Make sense?
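A tiny Python sketch of that distinction, with made-up state/action counts (none of this is from the video's code):

import numpy as np

n_states, n_actions = 5, 4
Q = np.zeros((n_states, n_actions))          # action-value table from policy evaluation

def greedy_policy(s):
    # deterministic policy: a function from state to a single action
    return int(np.argmax(Q[s]))

def epsilon_greedy_policy(s, eps=0.1):
    # stochastic policy: a function from state to a distribution over actions
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s])] += 1.0 - eps      # extra mass on the greedy action
    return probs                              # sums to 1

# Policy improvement only touches the greedy (argmax) part: once Q is re-evaluated,
# both of these policies automatically shift toward the new highest-value actions.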
hmm, let me guess, wouldn't applying a discount be interesting if the state space is so big? furthermore, wouldn't it be more interesting if we instead of discounting by a constant rate use a Gaussian distribution as our discount ?
Interesting perspective, but I don't think a strong discount removes the difficulty of a large state space. I see your perspective though - it makes it seem as though you only need to care about the most immediate states. But that's not necessarily true, because our policy is optimized for G_t at all t. If you only cared about G_0 and gamma = 0, then yes, the immediate state/action pair is all that matters and you don't care about a lot of the state space. BUT, since we also care about G_1, we have to have a policy that does well at time t=1, which means we care about states beyond the neighborhood of the states at t=0. Eventually, we could end up caring about the whole state space. If, on the other hand, some states aren't reachable from the starting state, then that would be one way in which a lot of the state space doesn't matter.
Yeahh that's it 👌👌, thanks for the fast reply!
Why is there no power (exponent) on the decay factor for the S2 state on the RHS of the state-action value function?
watched it on 0.75, great video!
🤯 My brain keeps saying "I understand" and then "But do I really?" every few seconds
Have to rewatch the algebra parts for my tiny brain, but the overall idea is very well presented!
Thank you. And fortunately the comment section is small enough that I can answer questions - feel free to ask and I'll do my best!
v_pi is not a random variable, so why do we take the expected value of it at 6:06?
Indeed it is not random.. but if you give it a random variable as input, it becomes a random variable. As a silly example, if f(x) = x^2.. and U is a uniform random variable over [0, 1], then f(U) is a random variable. It is produced by randomly sampling a value uniformly from 0, 1 and then squaring it.
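The same point in a quick numpy sketch (pushing samples of U through f gives samples of the new random variable f(U)):

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)   # samples of U ~ Uniform[0, 1]
fu = u ** 2                               # samples of the random variable f(U)
print(fu.mean())                          # ~1/3, i.e. E[f(U)] = E[U^2] = 1/3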
Can you comment on why does Gambler's problem solution differ from Kelly's criterion from one of your previous videos? Having a goal to reach 100 vs maximum growth.
Sure - they aren't optimizing the same thing. In the gambler's problem, the only thing that matters is the probability of getting to 100. In the Betting Game for KC, it's the expected growth rate. Also, KC can sometimes tell you not to bet; in the gambler's problem, you are forced to bet every time. I guess that's enough to make for the different strategies.
Great video, but I have a question. At 3:56 a probability distribution table appears for the 'right' selection. However, s0 is not in the table. After all, the agent can go from s1 to s0. Am I wrong?
This example is focused on only a small piece of the MDP. The MDP, in its entirety, describes the probability of all state-reward pairs for each state-action pair. In this example, I'm only showing the state-action pairs from s0 and, in this example, we can only transition to s1 or s2 (when choosing right). In other words, this example is more restricted than the general case. Implicit in this example is that the probability of transitioning from s0 to s0 is zero... or that the probability of transitioning from s0 to s-1 is zero when choosing right. Make sense?
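One way to picture that is as a sparse dynamics table where any (s', r) pair without an entry has probability zero - the numbers below are placeholders, not the video's:

# p(s', r | s, a) stored sparsely; anything missing has probability zero
dynamics = {
    ("s0", "right"): {("s1", 0): 0.3, ("s1", 1): 0.3, ("s2", 0): 0.2, ("s2", 2): 0.2},
    # no ("s0", r) keys under ("s0", "right"), so returning to s0 has probability 0
}
print(sum(dynamics[("s0", "right")].values()))  # each (s, a) row must sum to 1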
Ah ok MI. now it's crystal clear. BTW we're lucky to ask you questions and get replies. We may not find this possibility in the future when the channel explodes. :)
@@kafaayari Aw thanks :) we’ll see what happens. I enjoy answering the Qs and I’m gonna try to keep it up for as long as I can. So far the volume is quite manageable lol
11:24 Bellman eq for 4cases
super good
Thank you! ("Dziękujemy.")
Thank you!! A rare form of appreciation - thanks a ton :)
Is it the case that the optimal policy found by optimizing the state rewards will always be the same as the one found by optimizing the action rewards?
Great video! Thank you!
Does anyone have a currently recommended library or software for learning the applied side of reinforcement learning?
These might help?
* github.com/TianhongDai/reinforcement-learning-algorithms
* github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
Amazing video! Thank you!
Ok, the notation of E_pi [.] does not necessarily imply you are averaging over a distribution pi(A | s), right? Like in 6:43, where you took the average over the r.v's joint p(s',r | s^0,->), so what's the point of using pi ?
I mean, when I write E_pi[.] it just means that the function inside must be calculated considering that the agent followed the policy pi, but in practice I have to consider a special function (dependent on pi) for the r.v's probs. Am I correct?
At 6:43, you can see what E_pi[] means when you go from the second to the third line. It's an expectation operation, so we are expanding random variables within the expression into a weighted sum of values, where the values are all values the random variables can take, and the weights are their respective probabilities. This is what happens between the second and third line.
@@Mutual_Information I agree with you; my only source of doubt was the pi subscript, as it does not make explicit the distribution of the random variables. I'm used to thinking of the subscript of the expectation as the distribution of the random variable itself, for example E_{p_X(x)}[X] = sum_x p(x)*x, but in this context that's not the case.
@@piero8284 Oh I see what you're saying. yea pi is just suggestive. It's like saying "the expectation is with respect to the policy pi and you have to know what that means".
brilliant
Do you have code for the gamblers problem online anywhere?
I attempted to code the policy iteration algorithm for the gambler's problem, but I don't get the policy you show in this video. Instead I get a triangle with a max at 50. This does seem like a reasonable policy though, so I'm not sure if this is one of the "family of optimal policies" that Barto and Sutton reference in the text.
Oh yes I think it might be. I haven't made the code public but I think I remember the problem. Changing how you are dealing with ties! The action you pick given a tie in value makes a difference
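For anyone hitting the same thing: a plain np.argmax always returns the first maximizer, i.e. the smallest stake among tied actions, so swapping the tie-break rule changes the shape of the plotted policy. A small illustration (the action values are made up):

import numpy as np

q = np.array([0.0, 0.64, 0.64, 0.40])             # a tie between stakes 1 and 2
first_max = int(np.argmax(q))                      # 1 -> smallest tied stake
last_max = int(np.flatnonzero(q == q.max())[-1])   # 2 -> largest tied stake
print(first_max, last_max)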
I was checking whether my speed was set to 2x, because that's a lot of things to pack in.
How about creating a Discord? If you think of the mind-type you're filtering with your videos, it could make for a strong community.
I should. I'm just not on Discord myself, so I don't have familiarity with it as a platform. But I have gotten the request a few times and it seems like a wise move..
@@Mutual_Information Yannic Kilcher created a Discord to support his YouTube, and it is buzzing. Also the level of expertise is high. Yannic's used his following to accrete engineering power into LAION. I got into ML in 2015 and there were almost no online communities back then. I came back over Christmas (thanks to the ChatGPT buzz) and was delighted to find that it has taken off bigtime over Discord. Also Karpathy has an active Discord.
You are awesome!
That awkward moment when a YouTube series teaches you more practical knowledge than a $50,000 4-year degree in math.
Bravo🎉
THANKS!!! GOOD CONTENT
(11:10) v_*(s) = max_a {q_*(s,a)} - the lack of a proof for this equation in the book left me very disappointed.
I don't know very much about video stuff, but it looks like there's something off with your recording of yourself, it's pretty pixelated.
Maybe it's just that your camera isn't that good, or something else, like the lighting, your rendering settings, or bit rate in OBS. Just wanted to let you know in case you didn't already.
Thanks for the good videos.
Thanks for looking out. This was my first time uploading in 4K (despite it being recorded in 1920 x 1080) - apparently that's recommended practice. From my end, the video doesn't look bad enough to warrant a re-upload, but I'll give the settings another look on the next videos. I believe I see the pixelation you're referring to.
17:05 had me dying lmao
I heard it's similar to calculus of variations.
Love that❤
EPIC
It's very hard to follow when you call everything "this" and just highlight "this". Would be easier to follow if you replaced "this" with the name of what you're describing
E.g. "R at time step t" or "the bottom-right XYZ"
Appreciate the feedback. It's a work in progress.. going forward there's even less on screen, so the issue of focusing attention might be alleviated a bit.
Would you mind providing a timestamp for a case where this stood out? That'll help me identify similar cases in the future
The thumbnail was Manim-like. Disappointed. Did not watch the whole video.
Hi! Nice explanation! But I tried to implement the calculation from 12:46, doing the iteration (sweep), and I couldn't reach the same result. For the first part of the equation I kept the 0.25 (pi[a|s]), set p(s_prime, r | s, a) to 1 for a deterministic world, and summed (-1 + world[row, col]), where world[row, col] is the V of the next state. My result was:
[[ 0. -5.05417921 -5.47163695 -3.10743228]
[-5.05417921 -6.78698773 -6.51979625 -3.95809215]
[-5.47163695 -6.51979625 -5.86246816 -3.20514008]
[-3.10743228 -3.95809215 -3.20514008 0. ]]
The iteration was something like:
for s in the world grid: (except the goals)
for act in actions: (if falls off grid does nothing (try catch))
V(s) += 0.25 * (-1 + V[act])
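In case it helps to compare, here is a minimal sketch of iterative policy evaluation for the 4x4 gridworld as I understand the setup (random policy, reward -1 per step, gamma = 1, the two corners terminal). The two details that usually cause a mismatch: actions that would leave the grid keep the agent in place (they aren't skipped), and each sweep recomputes V from the previous sweep's values rather than accumulating into the old V with +=. With those, the values settle to the 0 / -14 / -20 / -22 pattern from the video:

import numpy as np

V = np.zeros((4, 4))
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
terminals = {(0, 0), (3, 3)}

for _ in range(10_000):
    V_new = np.zeros_like(V)
    for r in range(4):
        for c in range(4):
            if (r, c) in terminals:
                continue
            for dr, dc in actions:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < 4 and 0 <= nc < 4):
                    nr, nc = r, c                    # bumping the wall keeps you in place
                V_new[r, c] += 0.25 * (-1 + V[nr, nc])   # gamma = 1
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(np.round(V))  # approx [[0,-14,-20,-22], [-14,-18,-20,-20], [-20,-20,-18,-14], [-22,-20,-14,0]]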