It does. But TD(0) denotes TD(lambda) with lambda = 0, which corresponds to 1-step TD (or to lambda = 0 in the eligibility-trace paradigm). In summary, the 0 refers to lambda, not to n.
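For reference, the forward-view definition from the slides is G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}; setting \lambda = 0 kills every term except n = 1, leaving G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}), which is exactly the one-step TD target.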
For TD learning, how can an agent know what the estimated (look-ahead) value for state S_{t+1} should be? I know TD is biased because it has to guess the next value and use it to update the current value, but how would it guess?
It depends on the specific environment. In a case like chess, where the environment is largely predictable, information like the number of pieces taken can be used to shape an initial value function. Otherwise, you initialise the value function arbitrarily (e.g. randomly or to zero) and let the TD updates gradually correct it from experience. Essentially, if you don't want an arbitrary initialisation, then a prior needs to be applied.
What about states that are visited very infrequently, where it takes a long time before they are first visited in a given episode? Over many episodes, that time will add up, right?
In practice, when the state space is very large there are many states that never get visited at all. For example, TD-Gammon has probably not seen most of the configurations in Backgammon and can still produce amazing results. Lectures 6, 7, 8 and 10 will partially answer your question. When there are too many states, approaches that generalise to unseen states are preferred (parametrising the value function or the policy with a neural network, for example).
Both are unbiased estimators. Assume a policy pi is fixed. The empirical discounted returns you compute each episode for every state s in MC are samples of the random variable R = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on s, i.e. samples of R | S_1 = s. Note that the value of a state s is defined as v(s) = E[R | S_1 = s], where E denotes the expectation. By the strong law of large numbers, for every state s, the average empirical return converges to the true expectation almost surely, since the samples are independent and the rewards are uniformly bounded. That's the proof.
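In the course's notation, the same argument reads: v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] with G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots, and the MC estimate V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)} is an average of independent samples of that return (one per episode for first-visit MC), so V(s) \to v_\pi(s) almost surely as N(s) \to \infty.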
For an MDP you need an action space, a state space, a reward function (or distribution), and the probability of moving to state s' from s given action a, for all s, s' in the state space and all a in the action space. Here we only observe states and sampled rewards (and, in the next lecture, actions), but not the transition probabilities. I think that is the reason why we don't have a model and this is not a model-based algorithm.
Each extra real reward step R_t you include in the target increases noise (variance) and decreases the bias. From a probabilistic standpoint, the more real data you put into the target (i.e. the more steps of actual reward), the less the bootstrapped estimate matters, which eliminates the bias in the MC limit. TD(0) sits at the other extreme: it uses only one real reward, so it has the lowest variance but the most bias, and it still converges to the true value as the estimates improve.
Because in TD you respond immediately to what just happened after you took the action, and only in a Markov environment can you assume that what just happened depends only on your action and the previous state.
I do not understand: how can MC and TD find the optimal policy without comparing all the possible policies like DP? Do they sample the majority of the policies?
As far as I understood, MC and TD converge to one and the same value estimate in an MDP; they are just different ways to get there (though MC is arguably most useful when the Markov assumption is shaky). Coming to the policy (which I believe is explained in a later lecture): as in the previous lectures, where we alternate between value and policy updates, the greedy policy derived from the current value estimate is always at least as good as the policy we started with. Hence, a better value function estimate gives us a better policy than before, and over many such alternating updates of value and policy we end up with the optimal policy and value.
What I understand is that the more episodes you sample, the more accurate the value estimate is. You can see the blackjack example at 19:50: 10,000 episodes vs 500,000 episodes give different results. DP only works for small problems, which is mentioned in lecture 2 or 3, I think. So for big problems MC and TD save you time by sampling.
I think pretty much any example you try is going to be a counterexample, because what the student proposes basically does not make sense. Try a state space with two or three states and a uniform policy with two different actions and I think it will work :). Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE states: you look at the states where you might end up depending on which actions you choose, and you express the value of s as a weighted sum of the values of those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values. This is key. In particular, if we are at state S_t and sample an action from policy pi, then R_{t+1} + gamma v_pi(S_{t+1}) is an UNBIASED sample of v_pi(S_t). This is the heart of TD learning, whose update rule uses the target R_{t+1} + gamma V(S_{t+1}) for V(S_t), where V(S_{t+1}) is now an approximation of the true value at S_{t+1}, so the sample becomes biased. However, as you know, TD still converges to the true values in the limit by standard stochastic approximation arguments; the estimates become more and more accurate and less biased. Now, imagine that instead of updating like V(S_t)...
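For anyone who wants to see that TD update rule in code, here is a minimal tabular TD(0) sketch. It assumes a hypothetical gym-style environment where env.reset() returns a state and env.step(a) returns (next_state, reward, done); all names are illustrative.

from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation:
    V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)                   # value estimates, initialised to 0
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                    # sample an action from the fixed policy
            s_next, r, done = env.step(a)    # one real transition
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrap from the current estimate
            V[s] += alpha * (target - V[s])  # move V(s) towards the TD target
            s = s_next
    return V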
Hey everybody, is this course an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come up with Udacity's course (the hefty price tag is holding me back), but I've heard a lot of good things about this course. How would you recommend studying RL? Take the Udacity course with lots of coding implementations, or study UCL's course thoroughly and implement code from the pseudocode in the slides?
Sorry for the one-month delay. You've probably either picked something else or watched this one already, but if not, I would recommend following the Practical_RL assignments on GitHub (from the Higher School of Economics, if I recall correctly). They suggest videos for each 'practical week' along with tasks, which really helps to grasp the ideas in this course, and later you can start Berkeley's great CS294-112.
I guess he asked why, on the "Dealer Showing" axis, the value function at '10' decreases compared to '9'. But again, I think this example is not really necessary for you to understand; he also knew not everyone understands the gameplay, which makes it hard to build the intuition. And there are many more examples with simpler concepts on the next slides.
In my opinion, a backup is the operation used for policy evaluation: you take information from successor states (returns or value estimates) and use it to update the value of the current state. Bootstrapping specifically means that the update target uses your current value estimate of a successor state instead of a complete sampled return.
The formula for the lambda-return G_t^lambda should contain a term at the end for the final update (end of episode) including the terminal state: you need to add lambda^(T-t-1) * G_t to the end of the return. In the case that lambda = 1, the whole expression reduces to G_t, which is the same as MC.
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
He skipped the last 10 slides. I saw them on the course website and I think they are really important, but I am having trouble understanding them. Can anybody tell me whether those slides matter?
The last 10 slides show that the backward view with eligibility traces is truly equivalent to the slower forward-view way of calculating updates, as in MC and naive (forward-view) TD(lambda). If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
Isn't the formula wrong? The last discount exponent should be T-t-1 instead of T-1; the slide has G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^(T-1) * R_T, but it should be G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^(T-t-1) * R_T.
Questions from students are of very high quality and they are one of the many reasons that make this lecture series particularly great.
yeah, especially the guy sitting close to the camera
0:53 Outline
2:10 Introduction
5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack example
26:30 Incremental Mean
29:00 Incremental MC Updates
34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example
44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence
1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning
1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Return
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward view TD(lambda) and Eligibility Trace
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
thanks
2:03 Introduction
5:04 Monte-Carlo Learning
33:56 Temporal-Difference Learning
1:23:53 TD(lambda)
He is the Andrew Ng of RL.
Could not agree more. Andrw Ng got me into ML and David Silver into RL. And that book by Sutton is gold!
No, Andrew Ng is the David Silver of ML.
@@ruslanponomariov3598 Wouldn't it be Richard Sutton, given he is considered the father of RL?
Andrew Ng has done lots of RL work himself!
@@nirajabcd very true; but isn't RL a subset of ML? I thought ML = {RL, Supervised, Unsupervised}
I love how he relates the form of the incremental mean to the meaning of RL updates.
This actually helped me a ton, I've been knee deep in Deep RL implementations that have just been handed to me and I've skirted doing it by hand or understanding the formulas. As a result, reading papers have been confusing. Him mentioning that the incremental mean will be a common pattern made about 10 things click all at once for me.
@@beepmcjeep5527 wake up
@@jaidevshah2286 😴
The lecture 💯.
The questions the students were asking 💯.
My enjoyment of the whole thing 💯.
38:29 what a great example to explain how TD is different from MC
At 1:27:47, David explains why we use the λ geometric series by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
TD(λ), in the backward view, does update all previously visited states, but each of those updates is roughly equivalent to an update done with TD(0).
The forward view is theoretical and is mentioned because it gives us some intuition about TD(λ), but it would be very inefficient to implement TD(λ) using the forward view (way too much computation), and it also requires complete, finite episodes. That's why, in practice, we only use the backward view for implementations.
If you look at lecture 5, on Sarsa(λ), there's a comparison of the way Sarsa(λ) and one-step Sarsa update the states at around 1:06:00. You can see that while Sarsa(λ) does more computation per step, it also builds a much more precise evaluation of the state-action values. I think looking at that example might give you some intuition for why TD(λ) is as efficient as TD(0) computation-wise.
@@totolamenace What you say makes sense, but computational cost means cost per iteration, and the backward view of TD(lambda) updates eligibility traces and many state values per iteration, so it is obviously more expensive than one-step TD. A different statement would be "to obtain state value estimates of similar quality, the number of required operations in both methods is roughly the same", which is about sample efficiency rather than per-step computational order anyway.
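To make the per-step cost concrete, here is a rough tabular sketch of one backward-view TD(λ) step (accumulating traces) next to one TD(0) step; V and E are plain dicts and all names are illustrative. Per step, TD(λ) touches every state with a non-zero trace, while TD(0) touches only the current state.

def td_lambda_step(V, E, s, r, s_next, done, alpha=0.1, gamma=1.0, lam=0.9):
    """One backward-view TD(lambda) step with accumulating eligibility traces."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)  # TD error
    E[s] = E.get(s, 0.0) + 1.0                        # bump the trace of the visited state
    for state, trace in list(E.items()):              # broadcast the error to all traced states
        V[state] = V.get(state, 0.0) + alpha * delta * trace
        E[state] = gamma * lam * trace                # decay every trace geometrically

def td0_step(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """One TD(0) step: only the current state is updated."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta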
When you realize every lecture corresponds to a chapter in Sutton & Barto's "Reinforcement Learning: An Introduction"
Sutton & Barto are the OGs, Silver is the GOAT.
"Introduction to Reinforcement Learning" is a Holy Bible of RL, everything you need to know is in this book. And David Silver helped us to interpret this book.
My Prof actually uses both as teaching material xD
@@_snwflk_my prof too 😂
43:35 A student asked about the goal and actions of the driving-home example. I have read the book where this example comes from. And here is my take on the question:
The actions come from a policy that is determined by the person. In this case, the policy is getting home by driving a car through particular roads. The person can use other policies to get home such as walking home and driving through other roads.
The goal of Monte Carlo or Temporal Difference is to estimate how good his policy is. Remember, his policy involves driving through particular roads. The example shows just one sample of updating the algorithms. To actually see how good his policy is, he needs to take the same route every day, obtain more data, and update the algorithms.
Great explanation, Daniel! Thanks! Which book is the example from and would you recommend to read it parallel to working through this lectures?
Thank you. The book is Reinforcement Learning: An Introduction by Sutton and Barto. I think you can download the draft with some googling.
Yeah, I would recommend reading it along with the lectures. It's a good book. I find it helpful to learn the same concepts from different sources, especially difficult ones.
You are welcome! Thanks for the very fast reply! I already have the book by Sutton and Barto, but decided to start with the review article "Deep Reinforcement Learning - An Overview" by Yuxi Li. Also very recommendable!
Thanks for the recommendation! I also definitely love to learn from different sources and in addition to lectures and readings listen to podcasts. I love this one, but it is about machine learning fundamentals, not RL: ocdevel.com/podcasts/machine-learning
Maybe you or another reader can recommend another podcast, being more specialized on RL or other more advanced topics? "TWiML & AI" seems to be quite promising... also want to try out "Machine Learning 101". Any experiences or recommendations on that?
I think the reason why looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last one: it could be the terminating state where the game is over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look into the future for the terminating state.
I have watched lecture 4 four times, and this is the clearest one. For non-English speakers, language is really an obstacle to understanding this lecture. Oh, my poor English, I only got 6.5 in IELTS Listening.
love the example to demonstrate the difference between TD and MC!!
The backup diagrams have made everything much clearer.
Thanks for the good lecture. It really helped me a lot.
I have a suggestion for improving it: English subtitles. They would make the lecture more accessible for the hearing-impaired and for non-English speakers.
Problem being that subtitles cost. Professional captioning can easily run you 400 bucks for a lecture like this, so...
I'm sure that a lot of people wouldn't mind volunteering to provide English subtitles for these lectures (at least I wouldn't). It would be a nice alternative for that :)
@@minhuang8848 lmao wtf, the algorithm already provides free (albeit error-ridden) subtitles, it's more than enough. Just enable them in the video settings.
@@ThePaypay88 They do not. Auto-generated subtitles here frequently miss a bunch of words altogether, and the questions asked by students are barely audible which auto-generated subtitles don't pick up on.
These lectures are sooo helpful! Thank you very much for posting. They are really good :).
These lectures are gold!
On the lambda-return at 1:23:56: if you think of N as being geometrically distributed with parameter lambda (the Bernoulli probability of success) and temporarily think of G^(n)_t as a (deterministic) function of n, then G^lambda_t is simply the expected value of G^(N)_t.
i.e. Let N ~ Geo(lambda) and let G^(n)_t be fixed as a function of n; then G^lambda_t = E[ G^(N)_t ].
### Correction: N ~ Geo(lambda) above should be N ~ Geo(1-lambda) instead.
It is 1:23:56, your timestamp is incorrect! But the observation is correct up to the fact that N~Geo(1-lambda) instead of Geo(lambda). Nice observation!!
@@edmonddantes4705 wow 4years later i am looking at what I wrote and literally have no idea what the heck i was talking about.. haha for a second I thought I had been hacked and chatgpt has been posting youtube comments on my behalf! 😂 time flies ¯\_(ツ)_/¯
I guess it is good to know that somebody read it carefully even after four years 😂
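Spelling the observation above out: if P(N = n) = (1-\lambda)\lambda^{n-1} for n = 1, 2, \dots (i.e. N \sim \mathrm{Geo}(1-\lambda)), then \mathbb{E}[G_t^{(N)}] = \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} G_t^{(n)} = G_t^{\lambda}, which is exactly the λ-return defined on the slide.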
24:52 I think the professor's explanation of this question is a bit misleading. The Sutton & Barto book, where the figure comes from, clearly states that the dealer has a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
I think what David showed in this lecture is the value function V under the policy of sticking only on 20 or 21. This does not say anything about the quality of that policy, just the value of every state under the said policy.
@@larryjuang5643 Liu was talking about the strategy of the dealer not the policy of the agent. I think the point Dr. Silver is trying to make in this case, is that it doesn't really matter what the strategy of the dealer is. In the book by Sutton and Barto the strategy is fixed, so that we can simulate it when writing the code for the environment.
Another meaty lecture !
This is pure treasure
haha.. "You don't need to wait til you die to update your value function.. "
Thanks, TD learning!
What an excellent teacher.
1:34:04 TCHA TCHEW
Anticipation for this moment made this excellent lecture that much better.
I can't fully understand the slide at 1:21:45. What is offline TD learning? I would have thought that offline TD should give the same result as online TD.
Thank you for these lectures. They are fantastic.
1:36:12 - 1:36:22 TD(lambda) and Monte Carlo
"You don't need to wait until you die to update your value function" killed me. xD
An audience member at 46:14 asked: "Does it learn the same behaviour? So if you have a car, and the car almost crashes, but it doesn't... When you're learning step by step, you're going to learn the behaviour of being scared of crashing, but if you're learning only from the very end, you would never learn something to encourage you to avoid your own manic behaviour."
Although prof. Silver's explanation is correct, I think that's not the answer the student was looking for. It is not true that you would never learn something to encourage you to avoid manic behaviour (as the student implies) - given enough episodes you would, since you would crash in many of them. MC does converge to the true value function (by the law of large numbers).
Thanks for the clarification. So you are saying MC would also experience episodes terminating in crashes and hence learn to avoid crashes, just like in the case of TD.
@@paulmarkus4917 Yep.
@@BGasperov Does it mean TD would be able to learn to not crash without crashing, while Monte Carlo has to crash at least once to understand that it's bad?
@@nikebless No, TD would still need to crash (experience an extremely negative reward) in order to learn not to do it. Without crashing (experiencing a negative reward) how could it know that the action that leads to crashing is bad? Perhaps it could in model-based RL but that's another story.
The answer to the question at 46:50 confuses me. I think the questioner was wondering whether MC would learn to drive safely, not whether TD would learn it. David gave the answer that TD would learn it. My answer would be: "MC would learn it after enough iterations, because sometimes you would crash". But hey, maybe I'm the one that misinterprets the question :)
I don't think the example at 43:00 was very clear...
Or at least, it wasn't explained what the policy and the return are here.
I mean, I get the example and I get the theory, but not how the two of them relate.
@1:07:00 He should've started from the graph rather than from the math.
Exactly
Great series of videos!
You get your real reward: having a day at work :D
I have to say, David Silver is slightly smarter than me.
Just slightly, yea, just like how they write "only" in the cheques
😂
@37:30, a question please. I don't understand the difference between the real return (G_t) and the ESTIMATED return (bootstrap, TD target). In lecture 2, the Bellman equation shows they are exactly the SAME, isn't it: G_t = R_{t+1} + gamma V(S_{t+1})? If the left-hand G_t is the REAL return, why is the right side of the equation an ESTIMATE now? Can someone teach me? Thank you!
Estimated return: you already have the real R_{t+1} but estimate the rest (gamma * V(S_{t+1})), so together this is an estimated return.
In Monte Carlo you are at the end of the episode, that's why you can use the real return: you have already received every real reward.
@@leopoldmuller4486 Leo, thanks for your help. Please allow me to ask more about this.
1. I agree with you that the real return in MC is G_t = R_{t+1} + gamma R_{t+2} + ...; here you do have all the rewards, since you walk through to the end. Am I right?
2. I know what you are saying: in TD you don't go to the end of the episode, so you need to estimate. But I don't understand the equation from lecture 2, G_t = R_{t+1} + gamma V(S_{t+1}): my understanding is that this V(S_{t+1}) is the real value, not an estimate. The derivation seems very clear, and the notation looks exactly the same for the estimate and the real value. Did I miss anything? Could you check lecture 2? It was very early in that video. Thank you!
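One way to untangle this: the identity G_t = R_{t+1} + \gamma G_{t+1} holds exactly for every sampled trajectory, and taking expectations gives the Bellman equation v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s], which involves the TRUE value function v_\pi and holds only in expectation, not sample by sample. A single return G_t generally does not equal R_{t+1} + \gamma v_\pi(S_{t+1}). TD goes one step further and replaces the unknown v_\pi(S_{t+1}) with the current estimate V(S_{t+1}), so the target R_{t+1} + \gamma V(S_{t+1}) is an estimate (and biased), whereas MC's target G_t is the actual sampled return.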
1:07:00 MC vs TD vs known dynamics
I cannot clearly understand why TD(\lambda) is better than n-step TD. There seems to be no big difference between the RMS error figures of n-step TD (p. 37) and of TD(\lambda) (p. 42).
I have this intuition, but it may be wrong. n-step TD and TD(lambda) both have one hyperparameter to tune. I think the advantage of TD(lambda) is that by averaging over the 1-step, 2-step, ..., n-step returns you get a bagging-like effect, so it's easier to control the bias/variance trade-off with lambda and avoid high variance.
1:10:13 How does DP bootstrap? Doesn't DP use the already calculated real returns?
DP uses the Bellman equation for synchronous backups: it looks one step ahead from the current state over all actions and successor states and immediately updates V_{k+1} using the value-function equation. Behind the equation, it uses the current estimates V_k of the successor states (not real returns) to compute V_{k+1}, and that reliance on its own previous estimates is the bootstrapping.
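Concretely, the synchronous backup from the DP lecture is V_{k+1}(s) = \sum_a \pi(a \mid s)\big(\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V_k(s')\big): the target is built from the current estimates V_k(s') of the successor states rather than from sampled returns.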
At 1:03:00, I don't understand how TD(0) gives a value of 0.75. If we used the equations, then won't the value be different?
As far as I understand, the TD(0) estimate here assumes multiple episodes (or repeated passes over them). In other words, TD bootstraps its guess of V(S_{t+1}) from an estimate built across all the episodes, so we can get the value 0.75. But I am not quite sure about that.
I think you are right: it would update V(A) to 0 based on what it had seen, if the sequences were taken in that order in a single pass of the TD algorithm. I think you can still understand the 0.75 value as a TD-like analysis after the fact, or you would get 0.75 if you repeated the same 8 sequences again and again.
This also puzzled me. I think the key is to realise that the equations shown at 1:03:00 are for the transition matrix and the reward structure rather than the value function. To find the value function, we should use a Bellman equation. Given the equation at 1:03:00 for transition matrix we see that P_{AB}=1 and the equation for reward structure shows R_{AB}=0. The Bellman equation will then give V(A) = R_{AB} + gamma * V(B). For no discounting we should set gamma = 1 and we also know that V(B) = 6/8 (I believe that's why he says V(B) is the easier one earlier in the video). Substituting then gives V(A) = 6/8.
Incidentally, a similar argument can be used to get V(B). For example, we see that P_{BT}=1 where T is terminal state. We also establish that R_{BT}=6/8 using the equation at 1:03:00, and clearly V(T)=0. Again, we get V(B) = R_{BT} + gamma * V(T) and with gamma = 1, we get the desired result.
Hopefully that makes sense!
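A tiny script to check both numbers, under the usual reading of the slide (eight undiscounted episodes; batch MC averages observed returns, batch TD amounts to the certainty-equivalence estimate described above):

# The 8 episodes from the AB example as (state, reward) pairs, gamma = 1.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch Monte Carlo: average the full return observed from each visit to a state.
returns = {'A': [], 'B': []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))       # return from this visit to episode end
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Batch TD / certainty equivalence: B always terminates, so V(B) is the mean reward at B;
# A always transitions to B with reward 0, so V(A) = 0 + V(B).
V_td = {'B': 6 / 8}
V_td['A'] = 0 + V_td['B']
print(V_td)   # {'A': 0.75, 'B': 0.75}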
Does anyone see a connection between a Non-Markov Decision Process (aka partially observable MDP) and a Hidden Markov model?
I'm falling in love with David Silver 😍😍
1:22:12 Is the offline method referring to Monte Carlo, because it needs to wait until episode termination to update the value function, while the Temporal Difference method is online because it updates the value function every time it sees a state?
Exactly, that's the definition. You might want to look at the slide on page 46; the lecture slides are in the description.
It's been ~8 years since these lectures were uploaded and I'm yet to find an online resource that can rival Dr. Silver's delivery of this material.
I didn't understand the part where he explains how MC learning doesn't depend on how large the state space is. I need to understand it through an example
Really appreciated!! Helped me understand model-free prediction.
Can someone clarify this for me? Monte Carlo learning is basically just sampling a few of all the possible paths from a state and averaging their returns to estimate the value function of that state?
What kind of policy do we have in model-free prediction?
For example, if we have a policy of walking in a straight line from a point towards a goal, then the same set of states will be repeated in every iteration and hence the rewards should be the same in every episode. How would averaging help then?
I don't think you know where the goal is (if you do, you already have the optimal policy). Other than that, your environment will usually be stochastic and your sampling of starting states should be stochastic too.
I have to correct that: as will be explained in lecture 5, acting greedily according to some policy will not lead to good exploration of the state space. This is because your trajectory is sampled from a probability distribution with much lower variance (deterministic actions restrict you to the state-transition dimension of the probability matrix, as opposed to both the state-transition dimension and the action dimension).
I have one question about "model-free". If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change? It's like a human watching a counter repeatedly counting up from 0 to 9 and then counting back down to 0, with a reward of 1 if the person predicts the next digit correctly. The person starts out thinking the state is just the number displayed, and after watching the counter for longer realises the state is more complicated than that. Consider an extreme case, where the counter repeatedly shows all permutations of some sequence longer than two in some fixed order: with the digit as the state representation, none of Monte Carlo, TD(lambda), or the version that incorporates eligibility traces would ever produce a policy better than random guessing, even if the entire period is shown as many times as needed, because the best Markov reward process they can fit has random transitions between states.
"If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change?"
Knowledge about the environment == knowledge about state-to-state probability transitions and rewards. The agent perceives states nevertheless (unless we have a POMDP in which case it's a bit more complicated).
The answer is pretty simple. You are choosing a terrible state space that doesn't represent your problem/what is happening well in your example. Choosing a suitable state space is a big part of solving reinforcement learning tasks. Imagine that I am trying to code a bot that sells and buys different cryptos and the state space is whether it rained yesterday. Then the state space is terrible as a predictor. As a human, if you are not making progress in your task of optimising rewards, it is clear that you are not understanding your state space well, so you should include more and more stuff into it until you see how your returns get better. You have to seek to understand what a suitable state space for your problem is.
Thank you for the good lecture.
One question on this video.
On the Large Random Walk Example slide, I don't get the difference between online n-step TD and offline n-step TD.
In short, what is the difference between online and offline?
I expected TD to always be online. How can TD be offline in the picture?
In this context, online means "in place". In each episode (assuming there *are* episodes), you update all V(s) by using the V(s) themselves (that's "bootstrapping"). If during the episode you update, say, V(s1) and then reuse V(s1) for updating V(s2), then you updated V(s2) not with the old V(s1) but with the update you just made to V(s1).
If you first save all the V(s) in, say, OldV(s) and then update V(s) by only using OldV(s), then you're doing it Offline. If you, instead, use the same V vector and don't worry about the update interferences/overlappings, then you're doing it Online. Online is easier and faster and works well, AFAIK.
Online updates mean the value function is updated at every step within the episode; offline updates are accumulated and applied all at once at the end of the episode.
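A sketch of that distinction for TD(0) over one episode; `trajectory` is assumed to be a list of (s, r, s_next, done) transitions and V a dict of value estimates (names illustrative):

def td0_online(V, trajectory, alpha=0.1, gamma=1.0):
    """Online: apply each update immediately, so later targets see earlier updates."""
    for s, r, s_next, done in trajectory:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

def td0_offline(V, trajectory, alpha=0.1, gamma=1.0):
    """Offline: accumulate increments against the values frozen at the start of
    the episode, then apply them all at once at the end."""
    increments = {}
    for s, r, s_next, done in trajectory:
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # V unchanged within the episode
        increments[s] = increments.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    for s, inc in increments.items():
        V[s] = V.get(s, 0.0) + inc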
14:33 was very good :))
What is the difference between online and offline updates at 1:35:50?
Can you learn the dynamics of the system during MC sampling and transform the model-free problem into a model-based one?
Lecture 8
In the equation for TD(lambda), the weights have a (1-lambda) term
But that would make TD(1) have weights of 0?
When lambda = 1, the weights (1-lambda)lambda^(n-1) on the n-step returns are 0 for every finite n, and all the weight ends up on the return obtained when the agent reaches the terminal state. That final return is the actual accumulated reward until the end of the episode, i.e. a Monte Carlo estimate. Therefore, TD(1) uses the same target as the Monte Carlo method.
Be wiser than just plugging lambda = 1 into the coefficients of an infinite series. Compute the limit as lambda --> 1 instead, using the fact that the episode terminates and hence, from some point onwards, the discounted return is fixed. Use the summability of the geometric series.
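For an episode that terminates at time T, the λ-return can be written with the tail made explicit: G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t. As \lambda \to 1 the weight on every finite-n term vanishes and all the mass moves onto the final term, so G_t^{\lambda} \to G_t, the Monte Carlo return.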
Difference b/w expected return and empirical mean return? 8:28
I'm still unable to grasp the concept of expectation.
Empirical mean refers to the sample mean when you have samples available; the expected value is a property of a distribution, not of a sample. In some cases you can infer the distribution from the samples, and the empirical mean converges to the expected value.
11:00 good question, David being pissy as usual about it
Great lecture though
Does TD work in the game of Go? I understand it's very difficult to evaluate a board state without playing it out further. Perhaps TD for breadth search, but MC to evaluate the leaves.
16:35 lol: *actually* worries about breaking the casino! :D
seen movie "21"?
For the blackjack example, how do we know which state we get to if we choose to twist without knowing the transition probability?
I was thinking you'd have to write your own simulator. Did you figure out the answer a year later?
Amazing lecture, but the eligibility trace part seemed to sudden and random to me..
hahaha yeah, looks like he was just running out of time.
Other algorithms from the previous lectures (MC, TD, DP, ...) = forward view.
Eligibility traces = backward view.
It's a method that updates the values of previously visited states using the present TD error, instead of waiting for future (estimated or not) values like the other algorithms we've learned so far.
It's a super cool idea to make something online that seems hopelessly offline. I find it pretty smart, plus the motivation is very psychological.
Can someone explain how TD(1) corresponds to Monte Carlo? According to the formula, (1-lambda) will be zero for lambda = 1, and hence the estimated return should be zero for every state.
Since we have a terminal state (i.e. the episode terminates after some step k), the probability density (weight) for that G_(k) will be the sum of the geometric series from that point to n = infinity, which gives lambda^(k-1). Hence we get MC for lambda = 1. This is what the prof says on the slide at 1:28:00 ("we will assign the rest of the probability to the tail part (shaded portion) of the decaying distribution").
In the forward view, it is a limiting case, i.e. lim TD(lambda) when lambda-->1, motivated by the argument Vaishnav described. Calculate the geometric sum of the coefficients from k onwards and take limits when lambda-->1. You will see that, in the limit, all the mass concentrates on the tail of the series. In the backward view, it can be shown by direct manipulation and the proof can be found in the extra material in the slides.
Can someone explain how, when doing TD(0), we get the part of the TD target that is the estimated value of the next state, V(S_{t+1})?
I am not really sure how he got the value of V(A) and V(B) at 59:54. Can someone please explain?
He explains just after. It was to compare how MC and TD use their samples. So you have 8 samples that include B, 6 of them get a final return of 1, and 2 of them get a return of 0. With MC sampling, the estimate for V(B) is 6/8 = 0.75. For TD it is the same, because state B is always the last one in the sequence, so we are bootstrapping from the final state, with V(final)=0 by definition. The interesting one is V(A). With MC sampling, you look at the average return for any sequence that includes state A. There is only one such sequence and the total return is 0 + 0 = 0, making the MC estimate for V(A) = 0. But the TD estimate can include the existing estimate for B when the sequence transitions from A to B... I think it is a little wrong in that this only works with the TD algorithm as shown if the A:0,B:0 sequence occurs as the last sample. But I think the idea in the lecture is to ignore the algorithms and compare the two intuitive answers V(A) = 0 (for MC) or V(A) = 0.75 (for TD) to get a sense of how MC and TD differ.
This is what I was looking for! Thanks! I also thought the sequences should be in the opposite order, otherwise in the TD case it would also give V(A)=0.
This is exactly what I was looking for! I also think the same, but seeing that no-one else is pointing this out makes me think am I missing something obvious?
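If it helps, here is a rough sketch of that comparison in code (my own encoding of the AB example, gamma = 1, with a dummy terminal state 'T'; batch TD just replays the same 8 episodes until the values settle):

episodes = [[('A', 0, 'B'), ('B', 0, 'T')]] \
         + [[('B', 1, 'T')] for _ in range(6)] \
         + [[('B', 0, 'T')]]

# MC: average the observed return from every visit to each state
returns = {'A': [], 'B': []}
for ep in episodes:
    rewards = [r for _, r, _ in ep]
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
print({s: sum(g) / len(g) for s, g in returns.items()})  # {'A': 0.0, 'B': 0.75}

# Batch TD(0): sweep the same batch many times with a small step size
V = {'A': 0.0, 'B': 0.0, 'T': 0.0}
for _ in range(20_000):
    for ep in episodes:
        for s, r, s_next in ep:
            V[s] += 0.01 * (r + V[s_next] - V[s])
print(V)  # roughly {'A': 0.75, 'B': 0.75, 'T': 0.0}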
x1.0 speed is good for taking the lecture for the first time
x1.5 speed is perfect for quick reviewing this great lecture
Yunlong Song x2 mate
Does someone have more documentation on the question about the geometric decay for lambda at 1:28:00? Thanks
Let's say we are at time t and we have already visited state s three times. If the eligibility trace decays geometrically, then the decay factor is the same at each time step, so we don't need to maintain a history of the individual visits.
Let's say E = E1 + E2 + E3, where Ei is the "partial eligibility" relative to the i-th visit of state s.
If the decay is constant, we have
new(E) = new(E1) + new(E2) + new(E3) =
lambda E1 + lambda E2 + lambda E3 = lambda E
and so we don't really need to remember E1, E2 and E3 separately.
If the decay is, say, hyperbolic (i.e. 1/n after n steps), then the update is more difficult:
new(E) = new(E1) + new(E2) + new(E3) =
? E1 + ? E2 + ? E3 = ???
We can't proceed without knowing when the visits took place. Let's say the right decays are l1, l2, l3, respectively. Then we have
new(E) = l1 E1 + l2 E2 + l3 E3
Unfortunately, we need to remember the entire history of the visits. We can't update E directly as before.
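To add a code-level picture of the geometric case, here's a sketch of one step of tabular backward-view TD(lambda) with accumulating traces (names are mine, not from the slides):

def td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam):
    delta = r + gamma * V[s_next] - V[s]   # TD error for this transition
    E[s] = E.get(s, 0.0) + 1.0             # bump the trace of the state just visited
    for state in E:
        V[state] += alpha * delta * E[state]
        E[state] *= gamma * lam            # same geometric decay for every trace, so one scalar per state suffices
    return V, E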
Just out of curiosity, are you Italian?
I don't understand the question asked at 11:34 and its answer
David explained 'first-visit Monte Carlo', in which the visit count for a state is incremented only the first time you visit it in an episode, and the return is measured from that first visit. Even if you visit the state again later in the same episode, nothing more is updated until the next episode. Thus you get N(s) = 10 for, say, 10 episodes, because you are only counting first visits, and you find the expected return by averaging the returns from the first visit to state s in each episode.
However, it has been shown that every-visit MC, where the return from every occurrence of the state is used and the count is incremented each time (unlike first-visit), converges to the same value. So if you study only first-visit MC, you are not missing anything.
In other words, the value function estimate converges to the same thing for first-visit and every-visit MC.
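A small sketch of first-visit MC prediction, in case it helps (undiscounted for brevity; V and N are dicts carried across episodes, 'episode' is a list of (state, reward) pairs, all my own naming):

def first_visit_mc_update(V, N, episode):
    # compute the return following each time step (gamma = 1), working backwards
    G, returns = 0.0, []
    for s, r in reversed(episode):
        G += r
        returns.append((s, G))
    returns.reverse()
    seen = set()
    for s, G in returns:
        if s in seen:          # only the first visit in this episode contributes
            continue
        seen.add(s)
        N[s] = N.get(s, 0) + 1
        V[s] = V.get(s, 0.0) + (G - V.get(s, 0.0)) / N[s]   # incremental mean of first-visit returns
    return V, N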
Shouldn't TD(0) correspond to n=1 instead of n=0?
It does. But TD(0) denotes TD(lambda) for lambda = 0. This corresponds with 1-step TD or to lambda = 0 in the eligibility trace paradigm. In summary, the 0 corresponds with lambda, not with n.
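For reference, the n-step return being discussed (as I read it from the slides):

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

So n = 1 gives the TD(0) target R_{t+1} + gamma*V(S_{t+1}), and letting n run to the end of the episode gives the MC return; the '0' in TD(0) refers to lambda = 0, which puts all of the lambda-return weight on this n = 1 case.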
For analysis purposes, can we treat every-visit MC as first-visit MC by considering each occurrence of the state as the start of its own episode?
The only thing that will differ is the averaging factor for the chain with multiple visits to the same state.
For TD learning, how can the agent know what the estimated (look-ahead) value of state St+1 should be? I know TD is biased because it has to guess what the next value is and use it to update the current value, but how does it guess?
I think we randomly initialize V(s) at first, then we use the iterative update; that's why he said TD is sensitive to initial conditions.
Here the value function is stationary, without the time index attached
It depends on the specific environment. In a case like chess, where the environment is largely predictable, information like the number of pieces taken can be used to initialise the value estimate. Otherwise, you would start from an arbitrary value function and then use the update rules to adjust it towards the true V(s) under the dynamics p(s'|s,a). Essentially, if you don't want a random initialization, then a prior needs to be applied.
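To spell out the 'guess' in code, a minimal sketch assuming a tabular setting (the initial values here are arbitrary, which is exactly the sensitivity to initial conditions mentioned above):

from collections import defaultdict

V = defaultdict(float)                  # arbitrary initial guess: every state starts at 0.0
alpha, gamma = 0.1, 1.0

def td0_update(s, r, s_next):
    target = r + gamma * V[s_next]      # reads the *current* (possibly wrong) estimate for the next state
    V[s] += alpha * (target - V[s])     # nudges V(s) toward that bootstrapped target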
What about states which are very infrequent, so it takes a long time before they are first visited in a given episode? Over many episodes, that time will add up, right?
In practice, there are many states that could not be visited when the state space is very large. For example, they argue that TD-Gammon probably has not seen most of the configurations in Backgammon and can still produce amazing results. Lectures 6, 7, 8, 10 will certainly answer your questions partially. When there are too many states, other approaches that generalise to unseen states are preferred (parametrising the value function of the policy by a neural network, for example).
Is alpha calculated as 1/N(s) for a given state while traversing the episodes, or is it just chosen manually as a constant between 0 and 1?
I don't know what example you're looking at but it generally varies. Both options you gave are viable in different cases.
In the first case, your estimator is unbiased, in the second one, your estimator is biased. In both cases, there is convergence in the limit.
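A sketch of the two step-size choices side by side (toy code, my own naming):

def running_mean_update(value, target, n):
    # alpha = 1/N(s): exact sample mean of all targets seen so far
    return value + (target - value) / n

def constant_alpha_update(value, target, alpha=0.1):
    # constant alpha: exponentially forgets old targets; useful for non-stationary
    # problems, but the estimate keeps a trace of its initialisation and never
    # settles exactly, which is the bias mentioned above
    return value + alpha * (target - value)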
Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?
First-visit MC gives unbiased estimates (every-visit MC is slightly biased for a finite number of episodes, but both are consistent). Assume a policy pi is fixed. The discounted returns you compute for a state s with first-visit MC are i.i.d. samples of the random variable G = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on S_1 = s. Note that the value of a state s is defined by v(s) = E[G], where E denotes the expectation, so each sample has the right mean and the estimator is unbiased. By the strong law of large numbers, for every state s the average sampled return converges to the true expectation almost surely, since the samples are independent across episodes and the rewards are uniformly bounded. That's the proof.
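In lecture notation, the statement being used (as I understand it) is:

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

\frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s) \;\longrightarrow\; v_\pi(s) \quad \text{almost surely as } N(s) \to \infty

where the G^{(i)}(s) are the returns collected from (first) visits to s under the fixed policy pi; independence across episodes plus bounded rewards is what lets the strong law of large numbers apply.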
"You don't need to wait until you die to update your value function" :)
Wouldn't TD be considered a model-based algorithm since we need an immediate reward and value of state s'?
For an MDP you need an action space, a state space, a reward function (or distribution), and the probability of moving to state s' from s given action a, for all s, s' in the state space and all a in the action space. Right now we only have the rewards and the state space (and, from the next lecture, the action space), but not the transition probabilities. I think that is the reason why we don't have a model, and this is not a model-based algorithm.
Why does using (the immediate reward from the next step + the value function of the next state, i.e. the TD(0) target) mean we are relying on the MDP structure?
Each additional real reward R you include in the target adds noise but reduces bias: with more real steps, the data dominates and the bootstrapped guess matters less. TD(0) goes the other way: it uses only one real reward plus the current estimate, so it has low variance but some bias from that estimate; as the estimates improve over many updates, the bias washes out and it still converges to the true value.
Because in TD you respond instantly to what just happened after you took the action. But only in a Markov environment can you assume that what happened just now depended only on your previous state and action.
I do not understand: how can MC and TD find the optimal policy without comparing all the possible policies like DP does?
Do they sample the majority of the policies?
As far as I understood, MC and TD converge to the same value estimate in an MDP; they are just different ways to get there. No one in their right mind should ideally be using MC for estimating values in an MDP, though; I guess MC is only useful for non-MDP cases. Now, coming to the policy (which I believe is explained in a later lecture): if you remember the explanation in one of the previous lectures, where we alternate between value and policy updates, the greedy policy based on the current value estimate is always at least as good as the policy we started with. Hence, we are bound to have a better policy than before whenever we have a better value function estimate. And over multiple such alternating updates of value and policy, we end up with the optimal policy and value.
What I understand is: the more episodes you sample, the more accurate the estimate is. You can see the blackjack example at 19:50: 10,000 episodes vs 500,000 episodes give different results. DP only works for small cases, which is mentioned in lecture 2 or 3, I think. So for big problems MC and TD save you time by sampling.
1:05:00 = Monte Carlo backups
Can someone give an example of the student's question at 1:15:39?
I think pretty much any example you try is going to be a counterexample, because basically what the student proposes does not make sense. Try a state space with two or three states and a uniform policy with two different actions, and I think it will work :).
Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Now, any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE STATES. In other words, you look at the future states where you might end up depending on which actions you choose, and you express the value of a state as a weighted sum of the values at those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values. This is key. In particular, if we are at state S_t and sample an action with policy pi, R_{t+1} + gamma V(S_{t+1}) is an UNBIASED SAMPLE of V(S_t) when V(S_{t+1}) is the true value. This is the heart of TD learning, whose update rule uses the estimate R_{t+1} + gamma V(S_{t+1}) for V(S_t), where V(S_{t+1}) is now an approximation of the true value at state S_{t+1}, so the sample becomes biased. However, as you know, TD still converges to the true values in the limit, and the estimates become more and more accurate and less biased along the way.
Now, imagine that instead of updating like V(S_t)
Hey everybody,
Is this course an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come up with Udacity's course (the hefty price tag is holding me back), but I heard a lot of good things about this course. How would you recommend studying RL? Take the Udacity course with a lot of coding implementations, or study UCL's course thoroughly and implement code from the pseudocode in the slides?
Sorry for the 1-month delay. You've probably either picked something else or watched this one, but if not, I would recommend following the Practical_RL assignments on GitHub from the Higher School of Economics (if I recall correctly). There you have suggested videos for each 'practical week' and tasks for it; it really helps to grasp the ideas in this course, and later you can start Berkeley's great CS294-112.
@@Shottedsheriff Thank you very much. I've just bought a course from Udacity, I'll do CS294 and Udacity and the one you recommended me :)
if you listen carefully you can hear a subtle fart at 0:36:58
heard dat
maybe he got a reward from monte carlo
Why does the Ace count as 10 on the dealer's side? It says the Ace can be used as either 1 or 11, right?
23:56 What question is being asked?
I guess he asked why, on the "Dealer Showing" axis, the value function at the value '10' decreases compared to the value '9'. But again, I think this example is not really necessary for you to understand; he also acknowledged that not everyone knows the gameplay, which makes it hard to build intuition. And there are many other examples with simpler concepts on the next slides.
Maths is pure beauty
45:46 somebody gagged himself lol
LOL
Thanks for the nice lecture.
50:24
1:35:57
41:45
What's the difference between Backup and Bootstrapping?
In my understanding, a backup is the update itself: you take returns (or successor values) and back them up into the value of the current state, which is how policy evaluation proceeds. Bootstrapping refers to what goes into that update: using your current estimate of a successor state's value instead of waiting for the full actual return. So TD and DP bootstrap, while MC does not.
nice way to spend coronavacation
Could someone please clarify why lambda = 1 corresponds to MC? If you plug in lambda = 1, the return just goes to 0, so how can it be equal to MC?
The formula for the return G_t(lambda) should contain a term at the end for the final update (end of episode) including the terminal state: you need to add lambda^(T-t-1) * G_t to the end of the return function. In the case that lambda = 1, the (1-lambda) sum vanishes and this extra term becomes G_t, which is the same as MC.
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
Just realised from the example from 1:00:00 that I'm an MC kind of guy
He skipped the last 10 slides. I saw them on the website and I think they were really important, but I'm having trouble understanding them. Can anybody tell me whether those slides were important or not?
The last 10 slides show that eligibility traces are truly equivalent to the slow counterpart of calculating updates as we do in MC and naive TD(lambda).
If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
I have the same questions. I totally lost it when it came to backward view of TD(lambda).
1:34:05 cha-chow
IDK who this person in the background is but they cannot stop coughing and burping constantly. It's so irritating.
Last 10 minutes (backward view) need more elaboration.
Where can I download the notes of this course?
www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The class is very diverse, guessing from their voices
Sir, I want to use RL for a frequency controller in an electric vehicle. Please advise which RL algorithm would be best.
Isn't the formula wrong? The last exponent should be T-t-1 instead of T-1; the slide has G_t = R_(t+1) + gamma * R_(t+2) + ... + gamma^(T-1) * R_T, but it should have been G_t = R_(t+1) + gamma * R_(t+2) + ... + gamma^(T-t-1) * R_T.
Yeah, you are totally correct.
What is that sound at @52:14 ? hahaha
hahahaha
who is that guy in the background suffering from covid-19 five years early .. his coughing frequency is so irritating ...
you know 2020 has passed when everyone is thinking the same when they hear someone coughing.