Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO

  • Published: 17 Jan 2025

Comments • 22

  • @littlebigphil
    @littlebigphil 3 years ago +4

    This is really dense, but it also clears a lot up. I'll have to watch it a second time.

  • @ariel415el
    @ariel415el 4 years ago +16

    This is much less comprehensible than the previous lectures.

    • @nissim2007
      @nissim2007 4 years ago +3

      anyway

    • @gregh6586
      @gregh6586 4 years ago

      Yeah, John's mind works on another level. Maybe that's how you invent not only TRPO (along with Pieter from lecture 4a) but also PPO.

  • @dewinmoonl
    @dewinmoonl 7 years ago +6

    AWW YEAH TRUST REGION THIS IS WHAT I NEEDED
    THANKS!

  • @mansurZ01
    @mansurZ01 5 years ago +1

    I think the loss function at 7:30 should have the opposite sign, because the gradient is derived for gradient ascent. So for gradient descent we should flip the sign of the gradient, and if we derive the loss function for that, the minus sign will carry through. Am I right?

    • @gregh6586
      @gregh6586 4 years ago

      No, it should not be negative: it is true that we want to use gradient ascent, i.e. to move in the direction that increases our "loss" the most. The advantage term is positive for those actions whose actual value (i.e. the "q-value") was better than the expected value (or simply the "value"). So what we want to do is find the direction that increases that advantage the most, i.e. find the gradient.
      You have to be careful when using frameworks, though (what John refers to as autodiff libraries), and double-check whether you get the positive or the negative value (e.g. for categorical cross entropy, where we want to use the "positive loss" but some implementations might return the negative value).

    • @elliotwaite
      @elliotwaite 4 years ago

      He just has it written in terms of gradient ascent, which is common in RL, where we are typically trying to maximize our objective of expected total reward. But you are correct that if we want to do gradient descent, which is the default in PyTorch or TensorFlow, we'll want to use the negative of that loss, as can be seen in this implementation: github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L234 (a minimal sketch of the sign flip follows below).
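
      A minimal PyTorch sketch of that sign flip (the tensor names `logp` and `adv` are placeholders, not taken from the lecture or the Spinning Up code):

      ```python
      import torch

      def pg_surrogate_loss(logp: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
          # Objective we want to MAXIMIZE: E[log pi(a|s) * A(s, a)].
          # PyTorch/TensorFlow optimizers minimize, so we return the negative.
          return -(logp * adv).mean()
      ```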

  • @OliverZeigermann
    @OliverZeigermann 4 years ago +2

    Why calculate the ratio of the new and old policies even though the log prob is good enough anyway? Is it because we want to use the ratio for clipping?

    • @ppstub
      @ppstub 4 years ago +1

      So we can be more data-efficient and reuse trajectories generated under an old policy by reweighting them with the ratio we compute.

    • @abcborgess
      @abcborgess 3 years ago +1

      It is due to the importance sampling used to calculate the surrogate function. The expectation in the surrogate uses samples from the distribution parametrized by theta_old. However, I think we are actually interested in the value of the surrogate at other parameters theta. The ratio corrects for the probability of the samples from theta_old in the expectation.

    • @underlecht
      @underlecht 1 year ago

      It's because we'd like to keep the policy within a trust region, so that it doesn't diverge too much and become useless (the sketch below shows both ideas: the importance-sampling ratio and the clipping).
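
      A hedged sketch tying the replies above together (variable names are assumptions, not from the lecture): the ratio exp(logp_new - logp_old) is the importance-sampling weight that lets us reuse data collected under the old policy, and clipping it keeps the update inside a trust region, as in PPO's clipped objective.

      ```python
      import torch

      def ppo_clip_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        adv: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
          # Importance-sampling ratio pi_theta(a|s) / pi_theta_old(a|s), from log probs.
          ratio = torch.exp(logp_new - logp_old)
          # Clipping limits how far the new policy is allowed to move from the old one.
          clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
          # Negated so that minimizing this loss maximizes the clipped surrogate objective.
          return -torch.min(ratio * adv, clipped * adv).mean()
      ```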

  • @mfavaits
    @mfavaits 2 years ago +2

    The explanations are all over the place; put some structure into the way you explain things.

  • @yanwen3498
    @yanwen3498 3 years ago

    Who has the slides?

  • @kevinwu2040
    @kevinwu2040 4 years ago

    You make these things so easy.

  • @shaz7163
    @shaz7163 7 years ago +1

    At 14:55, when maximizing the objective function with a penalty based on the KL divergence, what if the expected values become negative?
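
    For reference, a rough sketch of the KL-penalized surrogate discussed around that timestamp (the variable names and the sample-based KL estimate are assumptions, not from the slides). The quantity is something we maximize, so a negative value of the overall expression is not in itself a problem; gradient ascent only cares about the direction of increase.

    ```python
    import torch

    def kl_penalized_objective(logp_new: torch.Tensor,
                               logp_old: torch.Tensor,
                               adv: torch.Tensor,
                               beta: float = 1.0) -> torch.Tensor:
        # Surrogate E[(pi_new / pi_old) * A], estimated from samples drawn under the old policy.
        ratio = torch.exp(logp_new - logp_old)
        surrogate = (ratio * adv).mean()
        # Simple sample-based estimate of KL(pi_old || pi_new).
        approx_kl = (logp_old - logp_new).mean()
        # To be MAXIMIZED; the penalty discourages large policy changes.
        return surrogate - beta * approx_kl
    ```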

  • @pablodiaz1811
    @pablodiaz1811 5 years ago

    Thanks for sharing.

  • @ProfessionalTycoons
    @ProfessionalTycoons 6 years ago

    awwwe yeah