I am not sure why this doesn't have more views. Thank you for the really clear and concise explanation of REINFORCE!
"Maybe you've seen it.. maybe you've read it.. but have you implemented it?" That is when you gained my like and subscribe!
Things are always so much simpler when written in code (at least for me!). Thanks!
This is the OG explanation of REINFORCE on YouTube. I hope you can come up with an entire RL series! Subscribed!
Big respect for this video! thank you
Many thanks for the nice explanation.
This is pure gold! You should produce more videos!
Your explanations are so clear and your voice is pleasant to the ear making learning more enjoyable. Thank you
this is unbelievably good. well done, sir
This is amazing. Thanks for the clear explanation and the illustrative code samples!
Great showman Andriy is not, for sure... but the explanation of the algo is spot on and crystal clear. Thank you!
Great video! Your clear explanations and pleasant voice make learning enjoyable. Looking forward to more content like this!
Subscribed 👍
intriguing start
Such a good explanation! Loved the examples + clear visuals
Thanks, I liked the minimal PyTorch implementation!
That was a great explanation / walkthrough. Thank you!
super clear, many thanks for this brilliant explanation
Thanks, this was a great explanation!
Thank you so much.
This video answered the single biggest doubt in my mind: how do you backprop through env.step()? Brilliant explanation! Thanks a lot!
Loved this.. Please make more videos
Great video
This is super good, did you have more courses/videos?
This is really beautiful and I'm a complete newbie. But the code there is looping over Rewards[1:], then Rewards[2:], then Rewards[3:], i.e. repeating most of the calculations. Dynamic programming should speed this up. Let DP[i] := sum of discounted rewards starting at time step i. Start with i = T as the base case; then for i = T-1, ..., 0: DP[i] = DP[i+1] + gamma^i * Reward[i].
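A rough sketch of that backward pass (the function name and the rewards/gamma inputs are illustrative, not the video's code; the discounting here is anchored at time step 0, matching the recurrence above):

```python
def discounted_sums(rewards, gamma):
    """DP[i] = sum over t >= i of gamma**t * rewards[t]."""
    T = len(rewards)
    dp = [0.0] * (T + 1)            # dp[T] = 0.0 is the base case
    for i in range(T - 1, -1, -1):  # i = T-1, ..., 0
        dp[i] = dp[i + 1] + gamma**i * rewards[i]
    return dp[:T]


print(discounted_sums([1.0, 1.0, 1.0], 0.9))  # ~ [2.71, 1.71, 0.81]
```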
thanks!
What dimension is log_prob? And why do we only pass one argument to the loss function? Doesn't it require both the NN output and a target value? Thanks, great video.
log_prob is just a scalar here. You can read about why the loss is like that here (see the "Score" function in both cases): mpatacchiola.github.io/blog/2021/02/08/intro-variational-inference-2.html or here: pytorch.org/docs/stable/distributions.html
Basically it's a way to work around the problem that we cannot backpropagate through the random samples (the picking of the action here). Instead we create a surrogate function that *can* be backpropagated.
I am considering making a video on policy gradient to explain the loss function in more detail.
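For concreteness, here is a minimal PyTorch sketch of that surrogate (toy network and dummy values, not the code from the video):

```python
import torch
from torch import nn
from torch.distributions import Categorical

policy_net = nn.Linear(4, 2)                  # toy policy: 4-dim state -> 2 action logits
state = torch.randn(4)
G = torch.tensor(2.7)                         # pretend discounted return (treated as a constant)

dist = Categorical(logits=policy_net(state))  # action distribution from the network output
action = dist.sample()                        # sampling itself is not differentiable
log_prob = dist.log_prob(action)              # scalar: log-probability of the sampled action
loss = -log_prob * G                          # surrogate loss; the gradient flows only through log_prob
loss.backward()
```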
Can you explain one thing, please? We compute G with a constant policy, and after that we use this array of G's to find a direction for optimization, and with every G from our array we update our policy (for example, an NN). BUT we are using rewards gathered under the initial policy to update the current policy, which already differs somewhat from the initial one. Is it right to update the current policy with information (G) that we got from the initial policy? In gradient descent, for example, we update the current NN using information obtained from the current state (from forward propagation).
Yes, very good, you are correct. I was waiting for someone to point that out! Technically we should do the update all at once. I just thought it would be a bit confusing for the viewer (since it would differ from the book's pseudocode). But you can see my old code version here, where I create an EligibilityVector (as it is called in the RL book, top of page 328): github.com/drozzy/reinforce/blob/99a56061102a82bb4a835e852307ffa9d693ac98/reinforce.py#L40 Additionally, here is a PyTorch implementation taking a single step: github.com/pytorch/examples/blob/41b035f2f8faede544174cfd82960b7b407723eb/reinforcement_learning/reinforce.py#L62 On a side note, I do think the book's pseudocode presents this rather strangely, since it makes it seem as if multiple updates are occurring, and that is what I show here. But you are right that after we update the policy the first time, it becomes a kind of off-policy learning, where we are now updating a target policy that is no longer the same as the behavior policy.
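For anyone curious, a rough sketch of the "all at once" variant: accumulate the per-step losses over the whole episode and take a single optimizer step (the toy tensors here are stand-ins, not the linked code):

```python
import torch
from torch import nn, optim
from torch.distributions import Categorical

policy = nn.Linear(4, 2)                   # toy policy: 4-dim state -> 2 action logits
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

# Pretend we already collected a 5-step episode and computed its returns G_t.
states = torch.randn(5, 4)
actions = torch.randint(0, 2, (5,))
returns = torch.randn(5)

dist = Categorical(logits=policy(states))  # batched distribution over all time steps
log_probs = dist.log_prob(actions)         # one log-probability per time step
loss = -(log_probs * returns).sum()        # accumulate, then take a single gradient step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```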
why do you need to take the log of the probability?
You don't, it's just a shorthand for \frac{\nabla \pi}{\pi}. You can look up the derivation and the answer in the Reinforcement Learning book, 2nd Ed., section "13.3 REINFORCE: Monte Carlo Policy Gradient", at the very bottom of page 327: incompleteideas.net/book/RLbook2020.pdf
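In symbols, the identity behind that shorthand and the gradient REINFORCE estimates (as in Sutton & Barto, Sec. 13.3):

```latex
\nabla_\theta J(\theta)
  \propto \mathbb{E}_\pi\!\left[ G_t \,
      \frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \right]
  = \mathbb{E}_\pi\!\left[ G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right],
\qquad \text{using } \nabla \ln \pi = \frac{\nabla \pi}{\pi}.
```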
great
Great video, thank you. Do you have any advice or links to resources on how to apply a policy gradient to a continuous action space or an environment where multiple actions must be taken for each state?
Yes. For theory, take a look at section "13.7 Policy Parameterization for Continuous Actions" in the Sutton & Barto RL book, 2nd ed.: incompleteideas.net/book/RLbook2020.pdf For implementation, Spinning Up by OpenAI has "vanilla" policy gradients; here is an example: github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/vpg/core.py#L80
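A minimal sketch of the diagonal-Gaussian parameterization that section describes (the class and dimensions here are made up for illustration, but the idea is roughly what the linked Spinning Up code does):

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """The network outputs the action mean; a learned log-std sets the spread."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = Normal(self.mu(obs), self.log_std.exp())
        action = dist.sample()
        # Sum log-probs over action dimensions so the REINFORCE loss stays a scalar.
        return action, dist.log_prob(action).sum(-1)

policy = GaussianPolicy(obs_dim=3, act_dim=2)
action, log_prob = policy(torch.randn(3))
loss = -log_prob * 1.0   # multiply by the return G, exactly as in the discrete case
loss.backward()
```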
@@AndriyDrozdyuk Great! Thanks!
Oh man....I passed my assignment
Thank you!