DeepMind x UCL RL Lecture Series - Multi-step & Off Policy [11/13]

  • Published: 9 Feb 2025
  • Research Scientist Hado van Hasselt discusses multi-step and off-policy algorithms, including various techniques for variance reduction.
    Slides: dpmd.ai/offpolicy
    Full video lecture series: dpmd.ai/DeepMi...

Comments • 11

  • @bertobertoberto242
    @bertobertoberto242 2 years ago +2

    Sorry, but isn't that V-trace just a "clip the ratio"? Isn't that a common thing in DL, like gradient clipping to avoid exploding gradients, or in WGAN? Or am I missing something?
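
    To make the comparison concrete, below is a minimal NumPy sketch (names and signature are illustrative, not the IMPALA implementation) of what V-trace does with the ratios: the importance weights pi(a|s)/mu(a|s) are truncated via min(rho_bar, .) and min(c_bar, .) inside the value target, so it is the weight on each off-policy correction that gets capped, not a gradient.

      import numpy as np

      def vtrace_targets(rewards, values, bootstrap_value, rhos,
                         gamma=0.99, rho_bar=1.0, c_bar=1.0):
          # rhos[t] = pi(a_t|s_t) / mu(a_t|s_t): raw importance ratios.
          # V-trace truncates these weights inside the value target;
          # it does not clip gradients.
          rewards = np.asarray(rewards, dtype=float)
          values = np.asarray(values, dtype=float)
          rhos = np.asarray(rhos, dtype=float)
          clipped_rhos = np.minimum(rho_bar, rhos)
          clipped_cs = np.minimum(c_bar, rhos)

          values_plus = np.append(values, bootstrap_value)
          deltas = clipped_rhos * (rewards + gamma * values_plus[1:] - values_plus[:-1])

          # Recursion: v_t - V(s_t) = delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1}))
          acc = 0.0
          vs_minus_v = np.zeros(len(rewards))
          for t in reversed(range(len(rewards))):
              acc = deltas[t] + gamma * clipped_cs[t] * acc
              vs_minus_v[t] = acc
          return values + vs_minus_v  # V-trace targets v_t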

  • @Saurabhsingh-cl7px
    @Saurabhsingh-cl7px 3 years ago +1

    How did you calculate the variance at 58:00? E[x^2] - E[x]^2?

    • @Karthik-lq4gn
      @Karthik-lq4gn 2 years ago +5

      That's the formula for variance.
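
      A quick numerical check of that identity, Var[X] = E[X^2] - (E[X])^2 (a minimal NumPy snippet, using the population-variance convention):

        import numpy as np

        # Numerical check of Var[X] = E[X^2] - (E[X])^2.
        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

        print(np.mean(x**2) - np.mean(x)**2)  # ~9.0 (= 3^2)
        print(np.var(x))                      # ~9.0, same convention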

  • @EngIlya
    @EngIlya 3 years ago +1

    What software/hardware is used for drawing at 26:04?

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +5

      I'm using the Notes app on an iPad Pro (with an Apple Pencil). Presumably one can do something similar with other apps/tablets.

  • @haliteabudureyimu638
    @haliteabudureyimu638 3 years ago +2

    How do we use per-decision importance weighting and the control variates technique in practice? For example, in actor-critic off-policy learning settings using a replay buffer, or in learning from demonstrations? We don't know the target policy in practice, so how can we get the value of $\rho$?

    • @sender1496
      @sender1496 1 year ago

      I'm also confused by this. Of course, the Q-values depend on the policy we use. It feels like they just pretend that the current Q-values are the greedy-policy ones (if the greedy policy is the target) and estimate the target policy as the greedy policy with respect to the current values, but I'm very surprised that this isn't mentioned. Or maybe I'm misunderstanding something.

    • @sender1496
      @sender1496 1 year ago

      Actually, if you think about it, it is usually the case that the current value estimate is used in formulas where there really should be an exact value; for example, all the algorithms derived from dynamic programming (bootstrapping, etc.). So I guess it makes sense that it happens here as well: we have an exact theoretical formula showing how to rescale our samples, but we use it with the current value estimates instead of the theoretically exact values to get an algorithm. For example, if our target policy is the greedy policy (being a fixed point of its own value function), then pi(s) should just be the action with the largest Q-value under our current Q-value estimate (which is different from the fixed-point Q-values).
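
      On the $\rho$ question above: the target policy is something we define, so pi(a|s) is known exactly; only the behaviour probability b(a|s) needs to be logged with the transition (e.g. stored in the replay buffer). A minimal sketch, assuming a greedy target policy defined from the current Q estimates (the function and its arguments are illustrative):

        import numpy as np

        def per_decision_rho(q_values, action, behaviour_prob):
            # pi is chosen by us: here, greedy w.r.t. the *current* Q estimates,
            # as suggested in the comments above. b(a|s) must be stored with
            # the transition (e.g. in the replay buffer).
            greedy_action = int(np.argmax(q_values))
            pi_a = 1.0 if action == greedy_action else 0.0
            return pi_a / behaviour_prob

        # Example: Q(s, .) ~ [0.1, 0.5, 0.2]; behaviour took action 1 with prob 0.25
        rho = per_decision_rho(np.array([0.1, 0.5, 0.2]), action=1, behaviour_prob=0.25)
        # rho == 4.0; for a non-greedy action rho would be 0.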

  • @wcoenen
    @wcoenen 3 years ago +3

    Video editing issue after 3:35

    • @hadovanhasselt7357
      @hadovanhasselt7357 3 years ago +1

      Whoops :) Thanks for noting, will try to get it fixed.