This is really dense, but also clears a lot up. I'll have to watch a second time.
This is much less comprehensible than the last lectures... anyway.
Yeah, John's mind works on another level. Maybe that's how you invent not only TRPO (along with Pieter from lecture 4a) but also PPO.
AWW YEAH TRUST REGION THIS IS WHAT I NEEDED
THANKS!
I think the loss function at 7:30 should have the opposite sign, because the gradient is derived for gradient ascent. So for gradient descent we should pretend that the gradient has the opposite sign. And if we derive the loss function for that, the "minus" sign will carry through. Am I right?
No, it should not be negative: it is true that we want to use gradient ascent, i.e. to move in the direction that increases our "loss" the most. The advantage term is positive for those actions whose actual value (i.e. the "q-value") was better than the expected value (or simply the "value"). So what we want to do is find the direction that increases that advantage the most, i.e. find the gradient.
You have to be careful when using frameworks, though (what John refers to as "auto-diff libraries"), and double-check whether you get the positive or the negative value (e.g. for categorical cross-entropy, where we want to use the "positive loss" but some implementations might return the negative value).
He just has it written in terms of gradient ascent, which is common in RL where we are typically trying to maximize our objective of expected total reward. But you are correct in that if we want to do gradient descent, which is the default in PyTorch or TensorFlow, we'll want to use the negative of that loss, as can be seen in this implementation: github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L234
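To make the sign issue concrete, here's a minimal sketch (not the lecture's or Spinning Up's exact code) of the vanilla policy gradient surrogate in PyTorch; logp and adv are assumed to be per-step log-probabilities and advantage estimates:

```python
import torch

def pg_loss(logp: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    """Vanilla policy gradient loss for a gradient-descent optimizer.

    logp: log pi_theta(a_t | s_t) for the sampled actions (requires grad)
    adv:  advantage estimates A_t (treated as constants)
    """
    # Objective we want to maximize: E[ log pi(a|s) * A ].
    surrogate = (logp * adv.detach()).mean()
    # torch.optim optimizers minimize, so hand them the negated objective.
    return -surrogate
```

So the slide's objective and the framework's loss only differ by that one minus sign.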
Why calculate the ratio of the new and old policy when the log prob is good enough anyway? Is it because we want to use the ratio for clipping?
So we can be more data-efficient and reuse trajectories generated under an old policy, reweighted by the ratio we compute.
It is due to the importance sampling used to compute the surrogate function. The expectation in the surrogate uses samples from the distribution parametrized by theta_old. However, we are actually interested in the value of the surrogate at other parameters theta. The ratio corrects for the probability of the samples from theta_old in the expectation.
It's because we'd like to keep the policy within the trust region, so that it doesn't diverge too far from the old policy and become useless.
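Putting those answers together, here's a rough sketch of how the ratio is typically computed and clipped (variable names are mine, not the lecture's): the log-probs under the old policy are stored when the data is collected, and exp(logp - logp_old) reweights those samples so the surrogate estimates the new policy's performance.

```python
import torch

def clipped_surrogate(logp: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, returned as a loss to minimize."""
    # Importance sampling ratio pi_theta(a|s) / pi_theta_old(a|s),
    # computed from log-probs for numerical stability.
    ratio = torch.exp(logp - logp_old)
    # Clipping keeps the update close to the old policy (a crude trust region).
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Objective to maximize; negate for a gradient-descent optimizer.
    return -torch.min(ratio * adv, clipped * adv).mean()
```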
The explanations are all over the place - put some structure into the way you explain things.
Who has the slides?
You make these things soo easy
At 14:55, when maximizing the objective function with a penalty based on the KL divergence, what if the expected value becomes negative?
Then it discourages those actions.
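To spell that out, here's a rough sketch of the KL-penalized surrogate from that part of the lecture (beta and the variable names are assumptions, not the lecture's code). The objective can certainly go negative, e.g. when the advantages are negative; gradient ascent then just lowers the probability of those actions, nothing breaks.

```python
import torch

def kl_penalized_loss(logp: torch.Tensor,
                      logp_old: torch.Tensor,
                      adv: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Surrogate with a KL penalty, returned as a loss to minimize."""
    ratio = torch.exp(logp - logp_old)
    surrogate = (ratio * adv).mean()
    # Sample-based estimate of KL(pi_old || pi_theta) over data from pi_old.
    approx_kl = (logp_old - logp).mean()
    # Maximize surrogate - beta * KL  <=>  minimize its negative.
    return -(surrogate - beta * approx_kl)
```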
Thanks for sharing.
awwwe yeah