DeepMind x UCL RL Lecture Series - Policy-Gradient and Actor-Critic methods [9/13]

  • Published: 3 Feb 2025

Comments •

  • @jedrzej8442
    @jedrzej8442 10 months ago +1

    Another great lecture! Every year this lecture gets more and more polished, thank you Hado and DeepMind! :)

  • @vslaykovsky
    @vslaykovsky 2 years ago +9

    I find it amusing that after building up a strong mathematical foundation behind policy-gradient methods, we end up with Cacla, which is essentially a heuristic driven by some reasonable intuition.
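
    For reference, a minimal sketch of the Cacla-style update being referred to (tabular, scalar actions; the step sizes and exploration noise are illustrative, not taken from the lecture):

```python
import numpy as np

# Sketch of a Cacla-style actor-critic update (van Hasselt & Wiering):
# the critic learns state values with TD(0); the actor's deterministic
# output is pulled toward the executed action only when the TD error is
# positive, i.e. only the *sign* of the TD error is used.

def cacla_update(actor, critic, s, a, r, s_next,
                 gamma=0.99, alpha_actor=0.01, alpha_critic=0.1):
    delta = r + gamma * critic[s_next] - critic[s]   # TD error
    critic[s] += alpha_critic * delta                # critic: TD(0) step
    if delta > 0:                                    # actor: only on improvement
        actor[s] += alpha_actor * (a - actor[s])     # move toward executed action
    return delta

# Usage: explore with Gaussian noise around the actor's output.
rng = np.random.default_rng(0)
actor, critic = np.zeros(5), np.zeros(5)             # 5 states, scalar actions
s, s_next, r = 0, 1, 1.0
a = actor[s] + 0.1 * rng.standard_normal()            # exploratory action
cacla_update(actor, critic, s, a, r, s_next)
```

    The "heuristic" remark comes from the fact that only the sign of the TD error decides whether the actor moves toward the executed action; the magnitude of the improvement is discarded.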

  • @blueeagle6890
    @blueeagle6890 3 years ago +3

    Great presentation. Thanks! Here’s a question: It is stated in the presentation (time: 1:03:49) that the correlation between an action at time T+N and rewards obtained at an earlier time T is zero (and the corresponding terms in the summation are deleted). Is this true? It seems to be based on the assumption that a reward does not depend on future actions. However, the policy gradient equation is about correlation, not causality, and correlation does not require causality. A reward at time T may be correlated with the action taken at time T, which may be correlated with the state at time T+N, which is correlated with the action at time T+N. So there might be a correlation between the action at time T+N and the reward at time T. Here’s another, more intuitive example: An episode may contain multiple repeating identical or substantially similar states and actions before it is terminated (for instance, a game where the player and ball are in the same situation multiple times during an episode). Therefore, an action selected at time T+N is likely to be correlated with, and indicative of, a reward obtained at the earlier time T, if there is a repeating identical scenario at time T and time T+N. I would be grateful for any clarification. Thanks!

    • @timr8222
      @timr8222 2 years ago

      You have to take the expectation; the term is only zero in expectation, not for individual trajectories.
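
      A worked version of this argument, roughly in the lecture's notation (H_{t+n} is shorthand here for the history up to and including S_{t+n}): for a reward R_t obtained before the action A_{t+n}, the tower rule gives

```latex
\begin{align*}
\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(A_{t+n}\mid S_{t+n})\, R_t\right]
  &= \mathbb{E}\!\left[ R_t\, \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(A_{t+n}\mid S_{t+n}) \,\middle|\, H_{t+n} \right] \right]
     \qquad\text{($R_t$ is determined by $H_{t+n}$)} \\
  &= \mathbb{E}\!\left[ R_t \sum_a \pi_\theta(a\mid S_{t+n})\, \nabla_\theta \log \pi_\theta(a\mid S_{t+n}) \right] \\
  &= \mathbb{E}\!\left[ R_t\, \nabla_\theta \sum_a \pi_\theta(a\mid S_{t+n}) \right]
   = \mathbb{E}\!\left[ R_t\, \nabla_\theta 1 \right] = 0 .
\end{align*}
```

      So the sample-level correlation the question points to can indeed be nonzero, but the expected product vanishes because the score has zero mean conditioned on the state.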

  • @Adam-yh6oe
    @Adam-yh6oe 3 years ago +4

    Amazing lectures Hado!

  • @DFM-b1n
    @DFM-b1n 4 months ago

    Around 1:02:00, why don’t we take the rewards into account when computing p(tau)? In my opinion we should, but presumably, since the probabilities P(R|S,A) do not depend on theta, they have no effect on the gradient, so the equalities on the slide are still true.
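
    That reading is correct; a quick check (the factorisation below is the standard MDP one, written here for reference rather than copied from the slide):

```latex
\begin{align*}
p_\theta(\tau) &= p(S_0) \prod_{t \ge 0} \pi_\theta(A_t \mid S_t)\, p(S_{t+1}, R_{t+1} \mid S_t, A_t), \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(A_t \mid S_t)
   + \underbrace{\sum_{t \ge 0} \nabla_\theta \log p(S_{t+1}, R_{t+1} \mid S_t, A_t)}_{=\,0,\ \text{independent of } \theta} .
\end{align*}
```

    Whether or not the rewards are written into p(tau), their terms drop out of the score, so the equalities on the slide hold either way.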

  • @johntanchongmin
    @johntanchongmin 2 years ago +2

    Great lecture! Very clear and very understandable! Thanks so much:)

  • @k-bala-vignesh
    @k-bala-vignesh 3 years ago +2

    very useful, thank you

  • @prostohodim
    @prostohodim 2 years ago

    Hi! Could you please explain the part at ruclips.net/video/y3oqOjHilio/видео.html? I think we should drop the expectation on the right side, because we already sum over the probabilities. Tell me if I’m wrong.

    • @prostohodim
      @prostohodim 2 years ago

      @a17benayad got it, thanks

  • @anirudhsilverking5761
    @anirudhsilverking5761 3 years ago +1

    Subtitles please!