DeepMind x UCL RL Lecture Series - Multi-step & Off Policy [11/13]

  • Published: 9 Feb 2025
  • Research Scientist Hado van Hasselt discusses multi-step and off-policy algorithms, including various techniques for variance reduction.
    Slides: dpmd.ai/offpolicy
    Full video lecture series: dpmd.ai/DeepMi...

Comments • 11

  • @bertobertoberto242
    @bertobertoberto242 2 years ago +2

    Sorry, but isn't that V-trace just a "clip the ratio"? Isn't that a common thing in DL, like gradient clipping to avoid exploding gradients, or in WGAN? Or am I missing something?
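
    To make the comparison concrete, below is a minimal NumPy sketch (names and signature are illustrative, not the IMPALA implementation) of what V-trace does with the ratios: the importance weights pi(a|s)/mu(a|s) are truncated via min(rho_bar, .) and min(c_bar, .) inside the value target, so it is the weight on each off-policy correction that gets capped, not a gradient.

      import numpy as np

      def vtrace_targets(rewards, values, bootstrap_value, rhos,
                         gamma=0.99, rho_bar=1.0, c_bar=1.0):
          # rhos[t] = pi(a_t|s_t) / mu(a_t|s_t): raw importance ratios.
          # V-trace truncates these weights inside the value target;
          # it does not clip gradients.
          rewards = np.asarray(rewards, dtype=float)
          values = np.asarray(values, dtype=float)
          rhos = np.asarray(rhos, dtype=float)
          clipped_rhos = np.minimum(rho_bar, rhos)
          clipped_cs = np.minimum(c_bar, rhos)

          values_plus = np.append(values, bootstrap_value)
          deltas = clipped_rhos * (rewards + gamma * values_plus[1:] - values_plus[:-1])

          # Recursion: v_t - V(s_t) = delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1}))
          acc = 0.0
          vs_minus_v = np.zeros(len(rewards))
          for t in reversed(range(len(rewards))):
              acc = deltas[t] + gamma * clipped_cs[t] * acc
              vs_minus_v[t] = acc
          return values + vs_minus_v  # V-trace targets v_t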

  • @Saurabhsingh-cl7px
    @Saurabhsingh-cl7px 3 years ago +1

    How did you calculate the variance at 58:00? E[x^2] - E[x]^2?

    • @Karthik-lq4gn
      @Karthik-lq4gn 2 years ago +5

      That's the formula for variance.
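
      A quick numerical check of that identity, Var[X] = E[X^2] - (E[X])^2 (a minimal NumPy snippet, using the population-variance convention):

        import numpy as np

        # Numerical check of Var[X] = E[X^2] - (E[X])^2.
        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

        print(np.mean(x**2) - np.mean(x)**2)  # ~9.0 (= 3^2)
        print(np.var(x))                      # ~9.0, same convention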

  • @EngIlya
    @EngIlya 3 years ago +1

    What software/hardware is used for drawing at 26:04?

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +5

      I'm using the Notes app on an iPad Pro (with an Apple Pencil). Presumably one can do something similar with other apps/tablets.

  • @haliteabudureyimu638
    @haliteabudureyimu638 3 years ago +2

    How do we use per-decision importance weighting and the control variates technique in practice? For example, in actor-critic off-policy learning settings using a replay buffer, or in learning from demonstrations? We don't know the target policy in practice, so how can we get the value of $\rho$?

    • @sender1496
      @sender1496 1 year ago

      I'm also confused by this. Of course, the Q-values depend on the policy we use. It feels like they just pretend that the current Q-values are the greedy-policy ones (if the greedy policy is the target) and estimate the target policy as the greedy policy with respect to the current values, but I'm very surprised that this isn't mentioned. Or maybe I'm misunderstanding something.

    • @sender1496
      @sender1496 1 year ago

      Actually, if you think about it, it is usually the case that the current value estimate is used in formulas where there really should be an exact value; for example, all the algorithms derived from dynamic programming (bootstrapping, etc.). So I guess it makes sense that it happens here as well: we have an exact theoretical formula showing how to rescale our samples, but we use it with the current value estimates instead of the theoretically exact values to get an algorithm. For example, if our target policy is the greedy policy (being a fixed point of its own value function), then pi(s) should just be the action with the largest Q-value under our current Q-value estimate (which is different from the fixed-point Q-values).
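
      On the $\rho$ question above: the target policy is something we define, so pi(a|s) is known exactly; only the behaviour probability b(a|s) needs to be logged with the transition (e.g. stored in the replay buffer). A minimal sketch, assuming a greedy target policy defined from the current Q estimates (the function and its arguments are illustrative):

        import numpy as np

        def per_decision_rho(q_values, action, behaviour_prob):
            # pi is chosen by us: here, greedy w.r.t. the *current* Q estimates,
            # as suggested in the comments above. b(a|s) must be stored with
            # the transition (e.g. in the replay buffer).
            greedy_action = int(np.argmax(q_values))
            pi_a = 1.0 if action == greedy_action else 0.0
            return pi_a / behaviour_prob

        # Example: Q(s, .) ~ [0.1, 0.5, 0.2]; behaviour took action 1 with prob 0.25
        rho = per_decision_rho(np.array([0.1, 0.5, 0.2]), action=1, behaviour_prob=0.25)
        # rho == 4.0; for a non-greedy action rho would be 0.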

  • @wcoenen
    @wcoenen 3 years ago +3

    Video editing issue after 3:35

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +1

      Whoops :) Thanks for noting, will try to get it fixed.