REINFORCE: Reinforcement Learning's Most Fundamental Algorithm

  • Published: 27 Nov 2024

Comments • 37

  • @ollipasanen3570
    @ollipasanen3570 3 years ago +26

    I am not sure why this doesn't have more views. Thank you for the really clear and concise explanation of REINFORCE!

  • @yodarocco
    @yodarocco 1 year ago +3

    "Maybe you've seen it.. maybe you've read it.. but have you implemented it?" That is when you gained my like and subscribe!

  • @supersonic956
    @supersonic956 2 years ago +3

    Things are always so much simpler when written in code (at least for me!). Thanks!

  • @debarchanbasu768
    @debarchanbasu768 2 years ago +1

    This is the OG explanation of REINFORCE on RUclips. I hope you can come up with an entire RL series! Subscribed!

  • @jonathanwogerbauer2703
    @jonathanwogerbauer2703 6 months ago +2

    Big respect for this video! Thank you!

  • @yangyu441
    @yangyu441 2 months ago +1

    Many thanks for the nice explanation.

  • @pitiwatlueang6899
    @pitiwatlueang6899 7 months ago +1

    This is pure gold! You should produce more videos!

  • @beltusnkwawir2908
    @beltusnkwawir2908 2 years ago

    Your explanations are so clear and your voice is pleasant to the ear, making learning more enjoyable. Thank you!

  • @TwoSocksFoxtrot
    @TwoSocksFoxtrot 2 years ago +1

    This is unbelievably good. Well done, sir.

  • @jeonghwankim8973
    @jeonghwankim8973 10 months ago

    This is amazing. Thanks for the clear explanation and the illustrative code samples!

  • @marcin.sobocinski
    @marcin.sobocinski 2 years ago

    Great showman Andriy is not, for sure... but the explanation of the algo is spot on and crystal clear. Thank you!

  • @kalixml1523
    @kalixml1523 10 months ago

    Great video! Your clear explanations and pleasant voice make learning enjoyable. Looking forward to more content like this!
    Subscribed 👍

  • @AKUKamil
    @AKUKamil 9 months ago

    intriguing start

  • @samdonald741
    @samdonald741 2 years ago +1

    Such a good explanation! Loved the examples + clear visuals.

  • @LatelierdArmand
    @LatelierdArmand 9 months ago

    Thanks, I liked the minimal PyTorch implementation!

  • @akwstr
    @akwstr 2 years ago

    That was a great explanation / walkthrough. Thank you!

  • @antoineajsv6976
    @antoineajsv6976 3 years ago

    super clear, many thanks for this brilliant explanation

  • @anirudhthatipelli8765
    @anirudhthatipelli8765 1 year ago

    Thanks, this was a great explanation!

  • @abdelrahmanwaelhelaly1871
    @abdelrahmanwaelhelaly1871 2 years ago +1

    Thank you so much.

  • @swamikannan943
    @swamikannan943 2 years ago +1

    This video answered the single biggest doubt in my mind: how do you backprop through env.step()? Brilliant explanation! Thanks a lot!

  • @openaidalle
    @openaidalle 2 years ago

    Loved this. Please make more videos!

  • @jubaaissaoui5678
    @jubaaissaoui5678 2 years ago

    Great video

  • @abdelrahmanwaelhelaly1871
    @abdelrahmanwaelhelaly1871 2 years ago +1

    This is super good. Do you have more courses/videos?

  • @itdepends5906
    @itdepends5906 2 months ago

    This is really beautiful and I'm a complete newbie. But the code is looping over Rewards[1:], then Rewards[2:], then Rewards[3:], i.e. repeating most of the calculations. Dynamic programming should speed this up: let DP[i] := the sum of discounted rewards starting at time step i, take i = T as the base case, and for i = T-1, ..., 0 set DP[i] = DP[i+1] + gamma^i * Reward[i].
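
    A minimal sketch of that single backward pass, with illustrative names (`rewards`, `gamma`, `discounted_returns` are not from the video). It uses the standard recursion G_t = r_t + gamma * G_{t+1}, which yields the return from each time step directly rather than the gamma^i-weighted sum described above:

        def discounted_returns(rewards, gamma):
            # One backward pass instead of re-summing rewards[t:] for every t.
            returns = [0.0] * len(rewards)
            running = 0.0
            for t in reversed(range(len(rewards))):
                running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
                returns[t] = running
            return returns

        # Example: discounted_returns([1.0, 1.0, 1.0], gamma=0.99) -> [2.9701, 1.99, 1.0]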

  • @ahmedaj2000
    @ahmedaj2000 1 year ago

    thanks!

  • @hoddy2001
    @hoddy2001 1 year ago

    What dimension is log_prob? And why do we only pass one argument to the loss function? Doesn't it require both the NN output and a target value? Thanks, great video!

    • @AndriyDrozdyuk
      @AndriyDrozdyuk  1 year ago

      log_prob is just a scalar here. You can read about why the loss looks like that here (see the "Score" function in both cases): mpatacchiola.github.io/blog/2021/02/08/intro-variational-inference-2.html or here: pytorch.org/docs/stable/distributions.html
      Basically it's a way to get around the problem that we cannot backpropagate through random samples (the picking of the action here). Instead we create a surrogate function that *can* be backpropagated.
      I am considering making a video on policy gradients to explain the loss function in more detail.
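
      A minimal sketch of that surrogate, assuming a hypothetical policy_net that outputs action logits for a discrete action space (policy_net, state, and G are illustrative names, not from the video):

          from torch.distributions import Categorical

          logits = policy_net(state)          # state: observation tensor
          dist = Categorical(logits=logits)   # distribution over discrete actions
          action = dist.sample()              # sampling itself is NOT differentiable
          log_prob = dist.log_prob(action)    # scalar, differentiable w.r.t. the net

          loss = -log_prob * G                # surrogate: its gradient is -G * grad(log pi(a|s))
          loss.backward()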

  • @timanb2491
    @timanb2491 2 years ago +1

    Can you explain one thing, please: we compute the G's with a fixed policy, and then we use this array of G's to find a direction for optimization, updating our policy (for example an NN) with every G from the array. BUT we use rewards from the initial policy while updating the current policy, which already differs from the initial one. Is it right to update the current policy with information (G) that we got from the initial policy? In gradient descent, for example, we update the current NN using information that we get from its current state (from forward propagation).

    • @AndriyDrozdyuk
      @AndriyDrozdyuk  2 years ago

      Yes, very good, you are correct. I was waiting for someone to point that out! Technically we should do the update all at once; I just thought it would be a bit confusing for the viewer (since it would differ from the book's pseudocode). You can see my old code version here, where I create the EligibilityVector (as it is called in the RL book, top of page 328): github.com/drozzy/reinforce/blob/99a56061102a82bb4a835e852307ffa9d693ac98/reinforce.py#L40 and here is a PyTorch implementation taking a single step: github.com/pytorch/examples/blob/41b035f2f8faede544174cfd82960b7b407723eb/reinforcement_learning/reinforce.py#L62
      On a side note, I do think the book's pseudocode shows it rather strangely, since it makes it seem as if multiple updates are occurring, and that is what I show here. But you are right that after we update the policy the first time it becomes kind of off-policy learning, where we are now updating a target policy that is no longer the same as the behavior policy.
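
      For completeness, a sketch of the "all at once" variant: accumulate one loss over the whole episode and take a single optimizer step, so every term uses the same behavior policy (log_probs, returns, and optimizer are assumed to come from the episode rollout; names are illustrative):

          import torch

          # log_probs: list of scalar tensors saved while acting with the behavior policy
          # returns:   list of discounted returns G_t for the same time steps
          loss = torch.stack([-lp * G for lp, G in zip(log_probs, returns)]).sum()

          optimizer.zero_grad()
          loss.backward()
          optimizer.step()   # a single update for the whole episode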

  • @MaartinLPDA
    @MaartinLPDA 2 years ago +2

    Why do you need to take the log of the probability?

    • @AndriyDrozdyuk
      @AndriyDrozdyuk  2 years ago +2

      You don't; it's just a shorthand for \frac{\nabla \pi}{\pi}. You can look up the derivation and the answer in the Reinforcement Learning book, 2nd ed., section "13.3 REINFORCE: Monte Carlo Policy Gradient", at the very bottom of page 327: incompleteideas.net/book/RLbook2020.pdf
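
      For reference, the identity behind that shorthand, written out in the book's notation:

          \nabla_\theta \ln \pi(A_t \mid S_t, \theta)
              = \frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}

      which is why the book's pseudocode update takes the form
      \theta \leftarrow \theta + \alpha \, \gamma^t G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta).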

  • @Ishaheennabi
    @Ishaheennabi 8 months ago

    great

  • @AndrewGarrisonTheHuman
    @AndrewGarrisonTheHuman 2 years ago

    Great video, thank you. Do you have any advice or links to resources on how to apply a policy gradient to a continuous action space or an environment where multiple actions must be taken for each state?

    • @AndriyDrozdyuk
      @AndriyDrozdyuk  2 years ago +1

      Yes, for the theory take a look at section "13.7 Policy Parameterization for Continuous Actions" in Sutton & Barto's RL book, 2nd ed.: incompleteideas.net/book/RLbook2020.pdf, and for an implementation, Spinning Up by OpenAI has "vanilla" policy gradients; here is an example: github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/vpg/core.py#L80 (a small Gaussian-policy sketch follows at the end of this thread).

    • @AndrewGarrisonTheHuman
      @AndrewGarrisonTheHuman 2 years ago

      @@AndriyDrozdyuk Great! Thanks!
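
      A minimal sketch of a Gaussian policy in the spirit of section 13.7 and the Spinning Up code, assuming a hypothetical GaussianPolicy module (all names are illustrative). The same -log_prob * G surrogate applies, with log-probs summed over action dimensions, which also covers multiple simultaneous actions per state by treating them as one action vector:

          import torch
          import torch.nn as nn
          from torch.distributions import Normal

          class GaussianPolicy(nn.Module):
              def __init__(self, obs_dim, act_dim):
                  super().__init__()
                  # State-dependent mean; a single learned log-std
                  # (a common simplification; the book makes sigma state-dependent too)
                  self.mu = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                          nn.Linear(64, act_dim))
                  self.log_std = nn.Parameter(torch.zeros(act_dim))

              def forward(self, state):
                  dist = Normal(self.mu(state), self.log_std.exp())
                  action = dist.sample()                    # continuous action vector
                  log_prob = dist.log_prob(action).sum(-1)  # sum over action dimensions
                  return action, log_prob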

  • @blatogh1277
    @blatogh1277 2 years ago

    Oh man... I passed my assignment.

  • @swjidswi3733
    @swjidswi3733 4 months ago

    Thank you!