This is really dense, but also clears a lot up. I'll have to watch a second time.
This is much less comprehensible than the last lectures... anyway.
Yeah, John's mind works on another level. Maybe that's how you invent not only TRPO (along with Pieter from lecture 4a) but also PPO.
AWW YEAH TRUST REGION THIS IS WHAT I NEEDED
THANKS!
I think the loss function at 7:30 should have the opposite sign, because the gradient is derived for gradient ascent. So for gradient descent we should pretend that the gradient has the opposite sign. And if we derive the loss function for that, the "minus" sign will carry through. Am I right?
No, it should not be negative: it is true that we want to use gradient ascent, i.e. to move in the direction that increases our "loss" the most. The advantage term is positive for those actions whose actual value (i.e. the "q-value") was better than the expected value (or simply the "value"). So what we want to do is find the direction that increases that advantage the most, i.e. find the gradient.
You have to be careful when using frameworks, though (what John refers to as "auto-diff libraries"), and double-check whether you get the positive or the negative value (e.g. for categorical cross-entropy, where we want to use the "positive loss" but some implementations might return the negative value).
He just has it written in terms of gradient ascent, which is common in RL where we are typically trying to maximize our objective of expected total reward. But you are correct in that if we want to do gradient descent, which is the default in PyTorch or TensorFlow, we'll want to use the negative of that loss, as can be seen in this implementation: github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L234
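To make the sign issue concrete, here's a minimal sketch (not the lecture's or Spinning Up's exact code) of the vanilla policy gradient surrogate in PyTorch; logp and adv are assumed to be per-step log-probabilities and advantage estimates:

```python
import torch

def pg_loss(logp: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    """Vanilla policy gradient loss for a gradient-descent optimizer.

    logp: log pi_theta(a_t | s_t) for the sampled actions (requires grad)
    adv:  advantage estimates A_t (treated as constants)
    """
    # Objective we want to maximize: E[ log pi(a|s) * A ].
    surrogate = (logp * adv.detach()).mean()
    # torch.optim optimizers minimize, so hand them the negated objective.
    return -surrogate
```

So the slide's objective and the framework's loss only differ by that one minus sign.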
Why calculate the ratio of the new and old policy when the log prob is good enough anyway? Is it because we want to use the ratio for clipping?
So we can be more data-efficient and reuse trajectories generated under an old policy, reweighted by the ratio we compute.
It is due to the importance sampling used to compute the surrogate function. The expectation in the surrogate uses samples from the distribution parametrized by theta_old. However, we are actually interested in the value of the surrogate at other parameters theta. The ratio corrects for the probability of the samples from theta_old in the expectation.
It's because we'd like to keep the policy within the trust region, so that it doesn't diverge too far from the old policy and become useless.
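Putting those answers together, here's a rough sketch of how the ratio is typically computed and clipped (variable names are mine, not the lecture's): the log-probs under the old policy are stored when the data is collected, and exp(logp - logp_old) reweights those samples so the surrogate estimates the new policy's performance.

```python
import torch

def clipped_surrogate(logp: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, returned as a loss to minimize."""
    # Importance sampling ratio pi_theta(a|s) / pi_theta_old(a|s),
    # computed from log-probs for numerical stability.
    ratio = torch.exp(logp - logp_old)
    # Clipping keeps the update close to the old policy (a crude trust region).
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Objective to maximize; negate for a gradient-descent optimizer.
    return -torch.min(ratio * adv, clipped * adv).mean()
```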
The explanations are all over the place - put some structure into the way you explain things.
Who has the slides?
You make these things soo easy
At 14:55, when maximizing the objective function with a penalty based on the KL divergence, what if the expected value becomes negative?
Then it discourages those actions.
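To spell that out, here's a rough sketch of the KL-penalized surrogate from that part of the lecture (beta and the variable names are assumptions, not the lecture's code). The objective can certainly go negative, e.g. when the advantages are negative; gradient ascent then just lowers the probability of those actions, nothing breaks.

```python
import torch

def kl_penalized_loss(logp: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Surrogate with a KL penalty, returned as a loss to minimize."""
    ratio = torch.exp(logp - logp_old)
    surrogate = (ratio * adv).mean()
    # Sample-based estimate of KL(pi_old || pi_theta) over data from pi_old.
    approx_kl = (logp_old - logp).mean()
    # Maximize surrogate - beta * KL  <=>  minimize its negative.
    return -(surrogate - beta * approx_kl)
```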
Thanks for sharing.
awwwe yeah