Great video! Walking through the first few iterations of VI on a gridworld problem helped me understand the algorithm much better!
My bald teacher will talk about this for 2 hours and I won’t understand anything. This helps a lot
lmfaooo
I searched many, many times to find this solution, and finally I found it. Thank you.
this is a better explanation than my teacher from MIT, thanks
you're not from MIT lol
Best video on the topic I have seen so far, to the point and well explained! Kudos to you brother!
nice explanation
great video!! thanks!!
fantastic video, man. I was so confused for some reason when my lecturer was talking about it. It's not supposed to be hard, I guess; I just didn't get how exactly it worked. This video helped fill in the details.
great explanation ! thanks.
Nice job, thanks
Thank you so much! My professor explained this part a bit too fast so I got confused, but this makes a lot of sense!
Helped me learn it. Thank you.
This is helpful, thank you
I see the values at V3 are for gamma only, shouldn't they be for gamma squared?
Great video but how can we use policy iteration for a MDP when the state space grows considerably with each action? I know there’s various methods of approximation for policy iteration but I just haven’t been able to find anything, do you have any resources on this?
For v1, would the two terminal states not be 0.8, since you have to multiply by the probability to get the expected value?
Remember, we're taking the sum, and as all the probabilities add to 1, we get 0.8x1 + 0.1x1 + 0.1x1 = 1; same for -1.
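A one-line check of the sum in the reply above (the 0.8/0.1/0.1 noise model is my assumption about the video's example): whichever way the noise sends the agent, if every outcome is worth +1, the expected value is the full +1, not 0.8.

```python
# Transition probabilities for a single action: 0.8 intended, 0.1 each side.
probs = [0.8, 0.1, 0.1]
# If every possible outcome has value +1, the expectation is the full +1.
expected = sum(p * 1.0 for p in probs)
print(expected)
```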
@@aidan6957 thanks, can you send me the timestamp since I no longer have it?
9:06 Why, when iterating V2, are the values of all the other squares 0? Shouldn't the squares near the terminal states have non-zero values?
Can you provide an example of policy iteration too?
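Since the comment above asks for one, here is a minimal tabular policy-iteration sketch on what I assume is the same 4x3 gridworld as the video: wall at (2,2), terminals (4,3) = +1 and (4,2) = -1, gamma = 0.9, living reward 0, and 0.8/0.1/0.1 action noise. None of these specifics come from the video verbatim, and this tabular version is exactly the kind that stops scaling once the state space grows.

```python
GAMMA = 0.9
WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
# action -> [(move, probability)]: 80% intended, 10% each perpendicular
MOVES = {
    'U': [((0, 1), 0.8), ((-1, 0), 0.1), ((1, 0), 0.1)],
    'D': [((0, -1), 0.8), ((-1, 0), 0.1), ((1, 0), 0.1)],
    'L': [((-1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)],
    'R': [((1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)],
}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) != WALL and (x, y) not in TERMINALS]

def step(s, move):
    """Resulting square; bumping into the wall or the edge keeps you in place."""
    nxt = (s[0] + move[0], s[1] + move[1])
    inside = 1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3 and nxt != WALL
    return nxt if inside else s

def q(s, a, V):
    """Expected discounted value of taking action a in s (living reward 0)."""
    return sum(p * (0.0 + GAMMA * V[step(s, move)]) for move, p in MOVES[a])

def policy_iteration(rounds=50, eval_sweeps=60):
    V = {s: 0.0 for s in STATES}
    V.update(TERMINALS)               # terminal values fixed at +1 / -1
    policy = {s: 'U' for s in STATES}
    for _ in range(rounds):
        for _ in range(eval_sweeps):  # iterative policy evaluation
            for s in STATES:
                V[s] = q(s, policy[s], V)
        improved = {s: max('UDLR', key=lambda a: q(s, a, V)) for s in STATES}
        if improved == policy:        # policy stable -> done
            break
        policy = improved
    return policy, V

policy, V = policy_iteration()
print(policy[(3, 3)], round(V[(3, 3)], 2))  # the action at (3,3) comes out 'R'
```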
If you also suffer from the vague explanation in GT's ML course, here comes Upenn to rescue you!
Literally why I'm here. CS7641 has been pretty good so far but the RL section was honestly crap in the lectures IMO.
Yes! Finally found such a video! Yay!
the state value function Bellman equation includes the policy action probability at the beginning of the equation which you did not consider in your equation. any reason why?
Thank God, get RL videos from an Indian....
In the second iteration, V = 0.09?
nice
why would you substitute the value of +1 in the equation in green? the formula says it should be V(S') and not the reward value!!!
According to the Bellman equation, I got the value 0.8 * (0.72 + 0.9 * 1) + 0.1 * (0.72 + 0.9 * 0) + 0.1 * (0.72 + 0.9 * 0) = 1.62. Please correct me where I went wrong.
The living reward is 0, not 0.72. 0.72 is the V at time 2 for grid square (3,3). Use the 0.72 value to update grid squares (2,3) and (3,2) at time step 3.
Please can you explain how you got the 0.78 in V3?
do you understand? :( if yes, please explain it to me
Thanks for the video. In v3, how do you get 0.52 and 0.43?
Instead of starting in square (3, 3), you start in squares (3, 2) and (2, 3). After that, you do the same calculations used to get 0.78. The optimal action in square (3, 2) would be to go up, so the equation will look like: 0.8[0 + 0.9(0.72)] + 0.1[0 + 0.9(0)] + 0.1[0 + 0.9(-1)] = 0.43. The optimal action in square (2, 3) would be to go right, so the equation will look like: 0.8[0 + 0.9(0.72)] + 0.1[0 + 0.9(0)] + 0.1[0 + 0.9(0)] = 0.52.
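A quick Python check of the two equations in the reply above (gamma = 0.9 and V2(3,3) = 0.72 are taken from the thread; the grid coordinates are the commenter's):

```python
gamma, v2_33 = 0.9, 0.72   # discount and the V2 value of square (3,3)

# (3,2), going up: 0.8 to (3,3); 0.1 bump (stay, value 0); 0.1 into the -1 terminal
v3_32 = 0.8*(0 + gamma*v2_33) + 0.1*(0 + gamma*0) + 0.1*(0 + gamma*(-1))

# (2,3), going right: 0.8 to (3,3); both 0.1 outcomes land on value-0 squares
v3_23 = 0.8*(0 + gamma*v2_33) + 0.1*(0 + gamma*0) + 0.1*(0 + gamma*0)

print(round(v3_32, 2), round(v3_23, 2))  # 0.43 0.52
```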
@@HonduranHunk How can we calculate to get 0.78? Please help me, sir.
In V2, why is it that there is no value for (2,3)? Doesn't the presence of -1 give it a value of 0.09? I am confused there.
lol same. any confirmations?
I think there is a value V for (2,3) at V2-- it is 0. You get that value taking the "left" action and bumping into the wall, thereby avoiding the -1 terminal state. What action could you take that would result in a value of .09?
Remember you are taking the max of action values. So for (2,3), the max action is to move left, which may result in (2, 3), (3, 3), or (1, 3). The values are all 0.
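To make the reply above concrete, a tiny sketch (gamma = 0.9 is assumed; V1 is the grid where only the terminals hold +1/-1, so every square reachable in one noisy step from (2,3) is still worth 0):

```python
gamma = 0.9
v1 = {(1, 3): 0.0, (2, 3): 0.0, (3, 3): 0.0}  # V1 of squares reachable from (2,3)

# "Left" from (2,3): 0.8 to (1,3); the two 0.1 perpendicular outcomes bump the
# top edge and the (2,2) wall, so the agent stays at (2,3) for both of them.
q_left = (0.8*(0 + gamma*v1[(1, 3)])
          + 0.1*(0 + gamma*v1[(2, 3)])
          + 0.1*(0 + gamma*v1[(2, 3)]))
print(q_left)  # 0.0
```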
Good!
quick question: at 6:10, is R(s, a, s_prime) always 0 in the example?
yes, it is fixed at zero
@@cssanchit except in terminal states
🙏🙏🏿
Can't understand how it is 0.52
Is it just me or are the subscripts in the wrong directions?
There shouldn’t be any value for the terminal state… my god…
Hi, thank you for the explanation. Can you please explain how you got 0.78 for (3,3) in the 3rd iteration (V3)? According to the Bellman equation, I got the value 0.8 * (0.72 + 0.9 * 1) + 0.1 * (0.72 + 0.9 * 0) + 0.1 * (0.72 + 0.9 * 0) = 1.62. Please correct me where I went wrong. Assignment due tomorrow :(
+1
@@maiadeutsch4424 Thank you so much for the detailed explanation. This was really helpful. I was not considering the agent's own discounted value when going towards the wall and coming back.
@@maiadeutsch4424 we don't need to multiply by 0.1??
@@maiadeutsch4424 hey! nice explanation, but can you tell me if we will get a table of the probabilities like 0.8, 0.1, 0.3, etc. for going right, left, up, and vice versa?
@@maiadeutsch4424 hey there nice explanation, but for the cases with 10% chance it should be 0.1*(0 + 0.9*V_previter).
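Piecing the replies above together, here is the V3(3,3) update they describe (gamma = 0.9, V2(3,3) = 0.72, V2(3,2) = 0, and the +1 terminal sitting to the right are assumptions drawn from the thread):

```python
gamma, v2_33, v2_32 = 0.9, 0.72, 0.0
v3_33 = (0.8*(0 + gamma*1.0)       # right: 80% into the +1 terminal
         + 0.1*(0 + gamma*v2_33)   # up: bounce off the top wall, stay at (3,3)
         + 0.1*(0 + gamma*v2_32))  # down: 10% into (3,2), still worth 0
print(round(v3_33, 2))  # 0.78
```

The 0.1 "up" term is the agent's own discounted value mentioned above: bumping the wall leaves it in (3,3), which already carries 0.72 from the previous sweep.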
Seriously, people can't explain this in an easy way. Same for this video.
For the first iteration, you do not need to calculate the terminal states and assign +1, -1 to them. It's wrong! We have things like terminal states in grid world; use them.
what do you mean?