I am not sure why this doesn't have more views. Thank you for the really clear and concise explanation of REINFORCE!
"Maybe you've seen it.. maybe you've read it.. but have you implemented it?" That is when you gained my like and subscribe!
Things are always so much simpler when written in code (at least for me!). Thanks!
This is the OG explanation of REINFORCE on YouTube. I hope you can come up with an entire RL series! Subscribed!
Big respect for this video! thank you
Many thanks for the nice explanation.
This is pure gold! You should produce more videos!
Your explanations are so clear and your voice is pleasant to the ear making learning more enjoyable. Thank you
this is unbelievably good. well done, sir
This is amazing. Thanks for the clear explanation and the illustrative code samples!
Great showman Andriy is not, for sure... but the explanation of the algo is spot on and crystal clear. Thank you!
Great video! Your clear explanations and pleasant voice make learning enjoyable. Looking forward to more content like this!
Subscribed 👍
intriguing start
Such a good explanation! Loved the examples + clear visuals
Thanks, I liked the minimal PyTorch implementation!
That was a great explanation / walkthrough. Thank you!
super clear, many thanks for this brilliant explanation
Thanks, this was a great explanation!
Thank you so much.
This video answered the single biggest doubt in my mind: how do you backprop through env.step()? Brilliant explanation! Thanks a lot!
Loved this.. Please make more videos
Great video
This is super good, did you have more courses/videos?
This is really beautiful and I'm a complete newbie. But the code there is looping over Rewards[1:], then Rewards[2:], then Rewards[3:], i.e. repeating most of the calculations. Dynamic programming should speed this up. Let DP[i] := sum of discounted rewards starting at time step i. Start with i = T as the base case; then for i = T-1, ..., 0: DP[i] = DP[i+1] + gamma^i * Reward[i].
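A rough sketch of that backward pass (the function name and the rewards/gamma inputs are illustrative, not the video's code; the discounting here is anchored at time step 0, matching the recurrence above):

```python
def discounted_sums(rewards, gamma):
    """DP[i] = sum over t >= i of gamma**t * rewards[t]."""
    T = len(rewards)
    dp = [0.0] * (T + 1)            # dp[T] = 0.0 is the base case
    for i in range(T - 1, -1, -1):  # i = T-1, ..., 0
        dp[i] = dp[i + 1] + gamma**i * rewards[i]
    return dp[:T]


print(discounted_sums([1.0, 1.0, 1.0], 0.9))  # ~ [2.71, 1.71, 0.81]
```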
thanks!
What dimension is log_prob? And why do we only pass one argument to the loss function? Doesn't it require both the NN output and a target value? Thanks, great video.
log_prob is just a scalar here. You can read about why the loss is like that here (see the "Score" function in both cases): mpatacchiola.github.io/blog/2021/02/08/intro-variational-inference-2.html or here: pytorch.org/docs/stable/distributions.html
Basically it's a way to work around the problem that we cannot backpropagate through the random samples (the picking of the action here). Instead we create a surrogate function that *can* be backpropagated.
I am considering making a video on policy gradient to explain the loss function in more detail.
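For concreteness, here is a minimal PyTorch sketch of that surrogate (toy network and dummy values, not the code from the video):

```python
import torch
from torch import nn
from torch.distributions import Categorical

policy_net = nn.Linear(4, 2)                  # toy policy: 4-dim state -> 2 action logits
state = torch.randn(4)
G = torch.tensor(2.7)                         # pretend discounted return (treated as a constant)

dist = Categorical(logits=policy_net(state))  # action distribution from the network output
action = dist.sample()                        # sampling itself is not differentiable
log_prob = dist.log_prob(action)              # scalar: log-probability of the sampled action
loss = -log_prob * G                          # surrogate loss; the gradient flows only through log_prob
loss.backward()
```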
Can you explain one thing, please? We compute G with a constant policy, and after that we use this array of G's to find a direction for optimization, and with every G from our array we update our policy (for example, an NN). BUT we are using rewards gathered under the initial policy to update the current policy, which already differs somewhat from the initial one. Is it right to update the current policy with information (G) that we got from the initial policy? In gradient descent, for example, we update the current NN using information obtained from the current state (from forward propagation).
Yes, very good, you are correct. I was waiting for someone to point that out! Technically we should do the update all at once. I just thought it would be a bit confusing for the viewer (since it would differ from the book's pseudocode). But you can see my old code version here, where I create an EligibilityVector (as it is called in the RL book, top of page 328): github.com/drozzy/reinforce/blob/99a56061102a82bb4a835e852307ffa9d693ac98/reinforce.py#L40 Additionally, here is a PyTorch implementation taking a single step: github.com/pytorch/examples/blob/41b035f2f8faede544174cfd82960b7b407723eb/reinforcement_learning/reinforce.py#L62 On a side note, I do think the book's pseudocode presents this rather strangely, since it makes it seem as if multiple updates are occurring, and that is what I show here. But you are right that after we update the policy the first time, it becomes a kind of off-policy learning, where we are now updating a target policy that is no longer the same as the behavior policy.
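For anyone curious, a rough sketch of the "all at once" variant: accumulate the per-step losses over the whole episode and take a single optimizer step (the toy tensors here are stand-ins, not the linked code):

```python
import torch
from torch import nn, optim
from torch.distributions import Categorical

policy = nn.Linear(4, 2)                   # toy policy: 4-dim state -> 2 action logits
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

# Pretend we already collected a 5-step episode and computed its returns G_t.
states = torch.randn(5, 4)
actions = torch.randint(0, 2, (5,))
returns = torch.randn(5)

dist = Categorical(logits=policy(states))  # batched distribution over all time steps
log_probs = dist.log_prob(actions)         # one log-probability per time step
loss = -(log_probs * returns).sum()        # accumulate, then take a single gradient step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```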
why do you need to take the log of the probability?
You don't, it's just a shorthand for \frac{\nabla \pi}{\pi}. You can look up the derivation and the answer in the Reinforcement Learning book, 2nd Ed., section "13.3 REINFORCE: Monte Carlo Policy Gradient", at the very bottom of page 327: incompleteideas.net/book/RLbook2020.pdf
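In symbols, the identity behind that shorthand and the gradient REINFORCE estimates (as in Sutton & Barto, Sec. 13.3):

```latex
\nabla_\theta J(\theta)
  \propto \mathbb{E}_\pi\!\left[ G_t \,
      \frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \right]
  = \mathbb{E}_\pi\!\left[ G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right],
\qquad \text{using } \nabla \ln \pi = \frac{\nabla \pi}{\pi}.
```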
great
Great video, thank you. Do you have any advice or links to resources on how to apply a policy gradient to a continuous action space or an environment where multiple actions must be taken for each state?
Yes. For theory, take a look at section "13.7 Policy Parameterization for Continuous Actions" in the Sutton & Barto RL book, 2nd ed.: incompleteideas.net/book/RLbook2020.pdf For implementation, Spinning Up by OpenAI has "vanilla" policy gradients; here is an example: github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/vpg/core.py#L80
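A minimal sketch of the diagonal-Gaussian parameterization that section describes (the class and dimensions here are made up for illustration, but the idea is roughly what the linked Spinning Up code does):

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """The network outputs the action mean; a learned log-std sets the spread."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = Normal(self.mu(obs), self.log_std.exp())
        action = dist.sample()
        # Sum log-probs over action dimensions so the REINFORCE loss stays a scalar.
        return action, dist.log_prob(action).sum(-1)

policy = GaussianPolicy(obs_dim=3, act_dim=2)
action, log_prob = policy(torch.randn(3))
loss = -log_prob * 1.0   # multiply by the return G, exactly as in the discrete case
loss.backward()
```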
@@AndriyDrozdyuk Great! Thanks!
Oh man....I passed my assignment
Thank you!