People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are among the 10% who came this far.
Me in high school trying to make a Rocket League bot: :0
I finally came to this stage of learning after watching his videos over and over. He explains everything very well, but RL differs from other ML and it takes time to learn and get used to.
Bro thanks for the encouragement
Brother Ashsat! This was the most timely comment ever on YouTube. I was watching these lectures and I felt brain dead since they are pretty long. This is a great encouragement. May God bless you for this!
my best advice is to apply each algorithm in the Sutton and Barto textbook to problems in OpenAI Gym to help understand all this.....
you can do it, you got this....
Oh, I can't concentrate without seeing David.
Exactly the same here.
Me too. This is much better than 2018 one
How will I understand the expected return without him walking to a spot and summing up along the path after it?
Turn on subtitles. Helps a lot.
For those confused:
- whenever he speaks of the u vector he's talking about the theta vector (slides don't match).
- At 1:02:00 he's talking about slide 4.
- At 1:16:35 he says Vhat but slides show Vv
- He refers to Q in Natural Policy Gradient, which is actually Gtheta in the slides
- At 1:30:30 the slide should be 41 (the last slide), not the Natural Actor-Critic slide
Also, when he starts talking about DPG at 1:26:10 it helps a lot to have a look at his original paper (proceedings.mlr.press/v32/silver14.pdf) pages 4 and 5 in particular.
I think the DPG slides he is actually referring to are not available online.
@50:00 he mentions G_t, but the slides show v_t, right?
@@MiottoGuilherme yes. G_t as in Sutton/Barto's book, i.e. future, discounted reward.
@@illuminatic7 Unfortunately that link doesn't work anymore, here is an alternative: hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
This is what I call commitment: David Silver explored a not-showing-his-face policy, received less reward, and then switched back to the previous lectures' optimal policy.
Nothing like learning from this "one stream of data called life."
What a treasure of a comment
And we experienced a different state, realized we got much less reward from it, and updated our value function! Then David adjusts his policy to our value function, like actor-critic? Or is that a bit of a stretch? I do think there's some link between his value function and ours here: he wants us to do well, because he's a legend!
Genius!
Can't discount face value!
ahhh... where did u go david.. i loved your moderated gesturing
I have the same question too. His gestures are really helpful in learning the course!
He probably forgot to turn on the cam capture :)
Lecture 7 is optional
I think Lecture 10 is optional. Lecture 7 seems rather important.
This course should be called: "But wait, there's an even better algorithm!"
lol entire machine learning is like that
That, my friend, is the core principle of any field of engineering. That's how computers got from a room-sized contraption to a handheld device: because somebody said, wait, there is an even better way of doing this.
And it turns out that this is the best course to learn RL, even after 6 years.
really? What were you able to do with this information?
I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.
3:24 Introduction
26:39 Finite Difference Policy Gradient
33:38 Monte-Carlo Policy Gradient
52:55 Actor-Critic Policy Gradient
1:30 Outline
3:25 Policy-Based Reinforcement Learning
7:40 Value-Based and Policy-Based RL
10:15 Advantages of Policy Based RL
14:10 Example: Rock-Paper-Scissors
16:00 Example: Aliased Gridworld
20:45 Policy Objective Function
23:55 Policy Optimization
26:40 Policy Gradient
28:30 Computing Gradients by Finite Differences
30:30 Training AIBO to Walk by Finite Difference Policy Gradient
33:40 Score Function
36:45 Softmax Policy
39:28 Gaussian Policy
41:30 One-Step MDPs
46:35 Policy Gradient Theorem
48:30 Monte-Carlo Policy Gradient (REINFORCE)
51:05 Puck World Example
53:00 Reducing Variance Using a Critic
56:00 Estimating the Action-Value Function
57:10 Action-Value Actor-Critic
1:05:04 Bias in Actor-Critic Algorithms
1:05:30 Compatible Function Approximation
1:06:00 Proof of Compatible Function Approximation Theorem
1:06:33 Reducing Variance using a Baseline
1:12:05 Estimating the Advantage Function
1:17:00 Critics at Different Time-Scales
1:18:30 Actors at Different Time-Scales
1:21:38 Policy Gradient with Eligibility Traces
1:23:50 Alternative Policy Gradient Directions
1:26:08 Natural Policy Gradient
1:30:05 Natural Actor-Critic
Damn. It was a lot easier understanding it with the gestures
he could describe his gestures in the subtitles
Starts at 1:25.
Actor critic at 52:55.
thx
It would have been great if it was possible to recreate David in this lecture based on his voice using some combination of RL frameworks.
I wanted to see the AIBO training :(
Me too. If you look at the Paper by Nate Kohl and Peter Stone where they describe it, they reference a web page for the videos. And surprisingly it is still online. You find it at www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
@@felixt1250 not anymore :'(
@@tchz I think it is still there but you have to copy and paste the link.
This lecture was immensely difficult to get owing to David's absence and the mismatch of the slides
It is unfortunate that exactly this episode is without David on screen. It is again quite a complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.
I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.
Unfortunately the slides do not fit what is said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.
These lectures are a gift. Thanks
By far the best video about policy gradient methods on YouTube
First time in my life I had to DECREASE the speed of the video and not increase it... man, he talks REALLY fast, while at the same time showing new slides filled with equations
“No matter how ridiculous the odds may seem, within us resides the power to overcome these challenges and achieve something beautiful. That one day we look back at where we started, and be amazed by how far we’ve come.” -Technoblade
I started this series a month ago in summer break, I even did the Easy21 assignment, and now I have finally learned what I wanted when I started this series, i.e. the actor-critic method. Time to do some Gymnasium envs.
Just to make sure: at 36:22, the purpose of the likelihood ratio trick is that the gradient of the objective function gets converted into an expectation again? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".
I'm not sure about it either
That's exactly right. Once you convert it into an expectation, you can approximate it by sampling, so that trick is very practical.
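For anyone who wants it spelled out, here is the identity in question, as I read it from the slides:

\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)

Multiplying and dividing by \pi_\theta(s,a) puts the policy probability back in front of the gradient term, so a sum over actions weighted by \pi_\theta(s,a) is again an expectation under \pi_\theta, and an expectation is exactly the thing you can estimate from sampled (s, a, r) tuples.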
It took me a while to realize that the policy function pi(s, a) is alternately used as the probability of taking a certain action in state s, and as the action proper (a notation overload that comes from the Sutton book). I think specific notation for each instance would avoid a lot of confusion.
At 45:36 I think the notation he is describing is different from that shown in these slides. I think his "capital R" is the small "r" for us. And the "curly R" is the "Rs,a" for us.
Also u is theta.
Yes, fully agree. I believe this is important so to reiterate the small correction: the lowercase r is a random reward, the actual reward that agent/we experience, while the curly uppercase R is the reward from the MDP (Markov Decision Process).
We miss you David
Thank you for creating the video John, this is really great!
This has good sound quality, but is missing the nice body language..
Made me cry after a very long time :( given the professor's absence and the slide mismatch
man, without the gestures it's not the same, the lecture is not the same…...
You are fantastic David. Thanks for the tutorial.
Is there any particular reason that, in the basic TD(0) QAC pseudocode (1:00:00), we don't update the Q weights first before doing the theta gradient update?
I think you can start with an arbitrary value for the weights,
since they will also be adjusted in proportion to the TD error and get better as the iterations go on.
Super good question. I am guessing the reason is computational right? You want to reuse the computation you did for Q_w(s,a) when computing delta instead of computing it again with new weights when doing the gradient ascent update of the policy parameters (theta). However, what you propose seems more solid, just more costly.
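For concreteness, here is a rough Python sketch of one QAC episode in the spirit of the 1:00:00 slide, assuming a linear critic Q_w(s,a) = phi(s,a)^T w and a linear-softmax actor; env, features, and the update order (critic after actor, as the question discusses) are my own illustrative choices, not David's exact pseudocode:

import numpy as np

def qac_episode(env, features, n_actions, theta, w, alpha=0.01, beta=0.01, gamma=0.99):
    # features(s, a) -> feature vector phi(s, a); hypothetical helper
    def softmax_probs(s):
        prefs = np.array([features(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()                          # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    s = env.reset()
    a = np.random.choice(n_actions, p=softmax_probs(s))
    done = False
    while not done:
        s2, r, done = env.step(a)                     # hypothetical env API
        a2 = np.random.choice(n_actions, p=softmax_probs(s2)) if not done else None
        q_sa = features(s, a) @ w                     # linear critic Q_w(s, a)
        q_next = features(s2, a2) @ w if not done else 0.0
        delta = r + gamma * q_next - q_sa             # TD(0) error
        # score function of a softmax policy: phi(s,a) - sum_b pi(s,b) phi(s,b)
        probs = softmax_probs(s)
        score = features(s, a) - sum(probs[b] * features(s, b) for b in range(n_actions))
        theta = theta + alpha * score * q_sa          # actor step uses the old critic weights
        w = w + beta * delta * features(s, a)         # critic step afterwards, reusing q_sa
        s, a = s2, a2
    return theta, w

Swapping the last two updates (critic first, actor second) is the variant the question proposes; it just means recomputing Q_w(s,a) with the fresh weights before the actor step.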
He teaches much better than Hado van Hasselt. Makes it much easier.
Thanks for updating lectures :)
I sort of got stuck on this lecture because the video wasn't available :P Now I have no excuse for not finishing the course !
Thought this was a real video! I was wrong! David keeps referring to equations on the slides, but the audio and slides are not synced! It's confusing sometimes! But still better than just audio!
The slides are perfectly synced with the audio most of the time, but the slides on "compatible function approximation" are not in the right order and the slides on "deterministic policy gradient" are missing.
36:20, could anyone please explain what kind of expectation we are computing (I only see the gradients)? And why is the expectation of the right-hand side easier to compute than that of the left-hand side?
You want to maximise J at 43:25, which is the expected immediate reward. Note that thanks to the computation at 36:20, the gradient of J at 43:25 becomes an expectation. The expectation is computed over the full state-action space of the MDP under the policy pi_\theta. Note that without the term pi_\theta(s,a) in the sum, that thing would not be an expectation anymore, so you COULD NOT APPROXIMATE IT BY SAMPLING.
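To write the 43:25 slide out explicitly (a sketch using the slide's notation, with d(s) the state distribution under the policy):

\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]

The right-hand side is an average under the policy, so you can run the policy, observe (s, a, r), and use \nabla_\theta \log \pi_\theta(s,a)\, r as an unbiased stochastic estimate of the gradient. The left-hand side, written as a sum over all states and actions, has no such sample-based estimator.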
The slides are outdated, judging by David's speech; he apparently changed notation and added a few slides in the last 30 minutes or so.
Can someone explain how he got the score function from the maximum likelihood expression at 22:39?
I am 2 years late but this might help someone else😅😅
It is simple differentiation
The gradient of log f(a) with respect to a is grad f(a) / f(a).
He just did this backwards: grad pi = pi * grad log pi.
I love how there is always someone moaning or chewing food near the camera/microphone.
Hado van Hasselt gives basically the same lecture here: ruclips.net/video/bRfUxQs6xIM/видео.html. I still like David's lecture much more, but perhaps this other lecture can fill some of the gaps that appeared with David's disappearance.
Wish to see David in person.
While I don't know how generalizable the solution to this specific problem in an adversarial game would be, I can't help but wonder how these Policy Gradient Methods could solve it. The problem I am considering is one where the agent is out-matched, out-classed, or an "underdog" with less range, damage, or resources than its opponent in an adversarial game where it is known that the opponent's vulnerability increases with time or proximity.
Think of Rocky Balboa vs Apollo Creed in Rocky 2 (where Rocky draws punches for many rounds to tire Apollo and then throws a train of left punches to secure the knockout), being pursued by a larger vessel in water or space (where the opponent has longer-range artillery or railguns but less maneuverability due to its greater size), eliminating a gunman in a foxhole with a manually thrown grenade, or besieging a castle.
If we assume that the agent can only win these games by concentrating all the actions that actually give measurable or estimable reward in the last few sequences of actions in the small fraction of possible episodes that reach the goal, how would any of these Policy Gradient Methods be able to find a winning solution?
Given that all actions for many steps from the initial state would require receiving consistent negative rewards (either through glancing blows with punches for many rounds, evasive actions like maneuvering the agent's ship to dodge or incur nonvital damage from the longer-range attacks, or simply losing large portions of an army to get from the field to the castle walls and ascend them), I imagine the solution would have to be some bidirectional search with some nonlinear step between minimizing negative rewards from the initial state and maximizing positive reward from the goal.
But can any of these Policy Gradient Methods ever capture such strategies if they are model-free (what if they have to be online or in partially observable environments as well)? It seems that TD(lambda) with both forward and backward views might be able to, but would the critical states of transitioning between min-negative and max-positive reward be lost in a "smoothing out" over all action-sequence steps, or never found given the nonlinearity between the negative and positive rewards? What if the requisite transitions were also the most dangerous for the underdog agent (i.e. t_100 rewards: -100, +0; t_101 rewards: -1000, +5)?
If the environment is partially observable, and there really is no real benefit in strictly following the min-negative reward, given that the only true reward that matters is surviving and eliminating the opponent, some stochasticity would be required in action selection on the forward view to explore states that are locally non-optimal for the min-negative reward but required for ever experiencing the global terminal reward state; but this stochasticity may not be affordable on the backward view, where the concentration of limited resource use cannot be wasted.
I guess the only viable method is if the network captured, in the feature vector, a function of the opponent's vulnerability in terms of time, resources exhausted, and/or proximity, but what still remains is this concern of increased danger for the agent as it gets closer to the goal. I realize that one could bound the negative reward minimization from zero damage to "anything short of death", but normalizing that with the positive rewards at the final steps of the game or episode would be interesting to understand. In this strategy it seems odd for an algorithm at certain states to effectively be "saying" things like:
"Yes! You just got punched in the face 27 times in a row! (+2700 reward)";
"Congratulations! 2/3s of your ship has lost cabin pressure! (+6600 reward)";
"You have one functional leg, one functional arm, and suffering acute exsanguination! (+10,000 reward)"
"Infantry death rate increases 200x! (+200,000 reward)".
Any thoughts?
did you figure it out yet? It's been a year; hopefully you've had time to sit down and make actual progress towards creating this?
Thank you many times, dear Karolina. Cheers
1:00:54, this man got a -1 reward and restarted a new episode.
1:00:55 " Ugnkhhh.... "
The lectures were going great until someone decided not to show David's gestures. God I was learning so much just from his gestures.
What is the log policy exactly? Is it just the log of the policy's output for some state-action pair?
Is there perhaps a link to the videos of AIBOs running? (supposed to be shown at 31:55)
@@oDaRkDeMoNo Thank you!
I need David! It's hard to understand some pronouns without seeing him.
Fantastic lectures
When adding the baseline, there is an error. The gradient is zero when multiplying by the baseline because the function B(s) does not depend on theta. But then he uses B(s) = V^{\pi_\theta}(s), which does depend on theta. :( So this is at most a motivation rather than a mathematical proof.
No error. That gradient is not hitting the baseline B, so it does not matter that B depends on theta. The gradient inside the sum is zero because the policy coefficients sum to one for fixed s.
This is a well-known classical thing anyway. It was originally proven in Sutton's paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation".
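For reference, the slide's computation in one line (the gradient acts only on \pi_\theta inside the inner sum, not on B(s) or on the state distribution):

\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) B(s)\right] = \sum_s d^{\pi_\theta}(s) B(s) \sum_a \nabla_\theta \pi_\theta(s,a) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_a \pi_\theta(s,a) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta 1 = 0

So subtracting any state-dependent baseline, including B(s) = V^{\pi_\theta}(s), leaves the expected gradient unchanged while (hopefully) shrinking its variance.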
Following this lecture is like learning math by listening to a podcast.
Not sure if that's supposed to be good or not lol
Sometimes David's words and the slides don't correspond to each other, and I don't know what to do: listen to David or read the slides. For example at 1:29:55, when he speaks about the deterministic policy gradient theorem.
David has a paper on DPG, which he mentioned was published "last year", in 2014, and later a DDPG one. Just check them out.
51:58 - "you get this very nice smooth learning curve... but learning is slow because rewards are high variance"
Any idea why the learning curve is smooth despite high variance of returns? We use returns directly in gradient formula, so intuitively I'd guess they'd affect behavior of the learning curve as well.
I mean, look at the scale, it is massive. I bet if you zoom in, it is not going to be very smooth. Let's say we have an absorbing MDP with pretty long trajectories and we calculate the mean returns by applying MC. By the central limit theorem, the mean experimental returns converge to the real returns, but it will take many iterations due to the high variance of those returns. The smoothness you would see when zooming out (when looking at how the mean returns converge) would be due to the central limit theorem. Note that I am simply drawing a parallel. In the case of MC policy gradient, that smoothness is due to its convergence properties, which rely on the fact that the MC returns are unbiased samples of the real returns, but the curve is very bumpy when you zoom in, precisely due to the variance.
Why is there not a single implementation in MATLAB?
Could anyone please explain the slide at 45:51? In particular, I don't understand how the big $R_{s,a}$ becomes just $r$ when we compress the gradient into the expectation E. What is the difference between the big R and the small one?
r is the immediate reward understood as a RANDOM VARIABLE. This is useful because we want to compute the expectation of r along the state space generated by the MDP given a fixed policy pi. This is a measure of how good our policy is. R_{s,a} is the expectation of r given that you are at state s and carry out action a, i.e. R_{s,a} = E[r | s,a].
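In symbols, the step on the slide is just the tower property of conditional expectation:

\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a}\right] = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \mathbb{E}[r \mid s,a]\right] = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]

so once everything sits inside the expectation over states and actions, the curly \mathcal{R}_{s,a} can be replaced by the sampled reward r.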
Does he talk about REINFORCE in this talk/lecture? If yes when?
Here: 48:30
It's a pity to see only the slides compared to the previous lectures; the change of format makes it very hard to follow
Thanks for updating the lectures. I have some problems understanding the state-action feature vector \phi(s, a). I know about the features of the environment mentioned in the last lecture, which could be some kind of observation of the environment, but how should I understand this state-action feature vector?
The state-action features in the last lecture and this lecture are different, since in the last lecture they were used to approximate the VALUE Q of a particular state-action pair, and in this lecture they are used to approximate a POLICY PI. State-action features filter important information about the state and action used to approximate the state-value function or maybe the policy, depending on the context.
Say we are in a 2D grid world. The possible actions are up, down, left and right. Every time I move up, I get +1 reward, every time I move down, left or right, I get 0 reward. Define two features as (1,0) if I choose to go up, and (0,1) otherwise. Note that now I can compute the value function EXACTLY as a linear combination of my features, since they contain all the relevant information. My optimal policy is also a linear combination of those features only.
PS: you are asking about the linear case, but for me the most interesting case is the nonlinear one.
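A throwaway sketch of that toy example, just to make the linearity concrete (the feature map and rewards are the ones described above; the specific theta is made up):

import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def phi(action):
    # (1, 0) if we go up, (0, 1) otherwise, as in the example above
    return np.array([1.0, 0.0]) if action == "up" else np.array([0.0, 1.0])

# the immediate reward is exactly linear in the features: r = phi(a) @ w with w = [1, 0]
w = np.array([1.0, 0.0])
for a in ACTIONS:
    print(a, phi(a) @ w)       # up -> 1.0, everything else -> 0.0

# a softmax policy over the same features; with a large first component of theta
# it puts almost all probability on "up", which is the optimal policy here
theta = np.array([5.0, 0.0])
prefs = np.array([phi(a) @ theta for a in ACTIONS])
probs = np.exp(prefs - prefs.max()) / np.exp(prefs - prefs.max()).sum()
print(dict(zip(ACTIONS, probs.round(3))))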
@@edmonddantes4705 Wow that is response to 6 year old question. Thanks for taking time.
@@guptamberhaha I do it to practise!
How does he get the score function at 37:41?
I've seen this question a few places around the net so I answered it here: math.stackexchange.com/questions/2013050/log-of-softmax-function-derivative/2340848#2340848
Thank you, very elaborate answer!
So, how do you get a score function for a deep NN?
@@blairfraser8005 could you do it for dummies? I don't understand why you put the terms inside logs.
Our goal is to get a score function by taking the gradient of softmax. It looks like a difficult problem so I need to break it down into a simpler form. The first way to break it down is to separate the numerator and denominator using the log identity: log(x/y) = log(x) - log(y). Now I can apply the gradient to the left and right side independently. I also know that anytime I see something in the form e^x there is a good chance I can simplify and get at the guts of the exponent by taking the log of it. That helps simplify the left side. Next, the right side also takes advantage of a log property - namely that the gradient of the log of f(x) can be written in the form of gradient of f(x) / f(x). This is just the chain rule from calculus. Now the gradients of both the left and right sides are easier.
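Putting those steps together for the linear softmax policy \pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta} from the slides (a deep-NN policy follows the same recipe, just with non-linear preferences in place of \phi(s,a)^\top \theta):

\nabla_\theta \log \pi_\theta(s,a) = \nabla_\theta \left[ \phi(s,a)^\top \theta - \log \sum_b e^{\phi(s,b)^\top \theta} \right] = \phi(s,a) - \sum_b \pi_\theta(s,b)\, \phi(s,b) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]

which is exactly the "feature minus the average feature" score function on the softmax policy slide.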
Will these policy gradient methods work better than the previous methods based on generalized policy iteration (MC, TD and SARSA)?
I don't understand why he says that value-based methods can't work with stochastic policies. By definition epsilon-greedy is stochastic. If we find two actions with the same value, we could simply have a stochastic policy assigning probability 1/2 to both. And thus, value-function-based methods could also solve the aliased-gridworld example around 20:00.
The convergence theorems in CONTROL require epsilon --> 0. If you read papers, you will often see assumptions of GLIE type (greedy in the limit with infinite exploration), which go towards a deterministic policy. David also mentions this (lecture 5 I think).
Hello Karolina,
Is there any real video for this class?
Isn't the state value function 'useless' to an agent considering he 'chooses' actions but can't 'choose' his state?
Where did you go David :-(
I found Prof. Silver brilliant, but the concepts in this lecture are by and large not explained concretely; it is mostly an illustration of the book. Moreover, the earlier lectures showed where on the slide the professor was pointing, and that is missing here too.
This really helps. Thanks
I am guessing the slides shown in the video are slightly different from the ones they used in the lecture.
28:52 So J(theta) is the average reward your agent gets following policy theta, and pi(theta, a) is the probability of taking action a under policy theta?
J(theta) is the objective function; check the 22:30 slide.
In line "Sample a ~ pi_theta" of the actor-critic algorithm around 58:00.
From what I understand, pi_theta(s, a) = P[a | s, theta], but I don't clearly understand how we can pick an action a given s and theta. Do we have to calculate phi(s, a) * theta for all possible actions a at state s, and then choose an action according to their probabilities?
If yes, how can we take an action in continuous action domains?
If no, how can we pick an action then?
Something like this
a = np.random.choice(action_space, 1, p=action_probability_distribution)
See docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html
In continuous action domains, pi_theta(s,a) could be a Gaussian for fixed s (just an example). In discrete action spaces, for every state s, there is a probability of every action given by pi_theta(s,a). They sum to one of course.
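A small sketch of both cases, assuming a linear-softmax actor for discrete actions and a Gaussian policy with a linear mean for continuous actions (the feature shapes and parameter values are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

# discrete actions: softmax over action preferences phi(s,a)^T theta
def sample_discrete(phi_s, theta):
    # phi_s: array of shape (n_actions, n_features), one feature vector per action
    prefs = phi_s @ theta
    prefs -= prefs.max()                              # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs), probs

# continuous actions: Gaussian policy a ~ N(mu_theta(s), sigma^2)
def sample_continuous(phi_s, theta, sigma=0.5):
    # phi_s: a single feature vector for state s; the mean is linear in the features
    mu = phi_s @ theta
    return rng.normal(mu, sigma)

# usage with made-up numbers
phi_discrete = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
a, probs = sample_discrete(phi_discrete, np.array([2.0, -1.0]))
print(a, probs)
print(sample_continuous(np.array([0.3, 0.7]), np.array([1.0, 2.0])))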
Thank you for the lecture. I was wondering if you are constrained to use the same state-action feature vectors for actor and critics? The weights are, of course, different, but does \phi(s,a) need to be the same? (57:18)
As far as my understanding goes, the feature vectors of the actor and the critic are completely different. The feature vector of the critic is more like a representation of the state and action spaces, as you have seen in the Value Function Approximation lecture. For the actor, the feature vectors parameterise the probabilities of taking an action in a given state (e.g. through a softmax), rather than being probabilities themselves.
Of course they don't have to be the same. The state-action value features are stuff that approximate the state-action value function well, and the policy features are stuff that approximate a general useful policy well. For example, look at what compatible function approximation imposes in order to flow along the real gradient 1:05:52. How are you going to achieve that condition with the same features?
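The two conditions from the 1:05:52 slide, for reference: the critic's features must be "compatible" with the actor,

\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a),

and w must minimise the mean-squared error \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a) - Q_w(s,a))^2\right]. Under those two conditions the policy gradient computed with Q_w equals the true policy gradient; note that the compatible critic features are the actor's score function, not the actor's raw features.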
I feel like for the last 15 minutes the slides and what he says are not in sync anymore. :(
David disappeared, but CC subtitles are coming!!
Around 1:00:00, in the action-value actor-critic algorithm, to update w he used \beta * \delta * feature. Why is he taking the feature here? In model-free evaluation he used the eligibility trace, but why the feature here?
He is using linear function approximation for Q. It is a choice. Not sure why you are bothered that much by that.
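Concretely, with a linear critic Q_w(s,a) = \phi(s,a)^\top w we have \nabla_w Q_w(s,a) = \phi(s,a), so the semi-gradient TD(0) update of the critic is w \leftarrow w + \beta\, \delta\, \phi(s,a). The feature shows up for the same reason it did in the function approximation lecture; an eligibility trace would replace \phi(s,a) only in the TD(\lambda) variant of the critic.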
First time making it this far. But is it just me or did a lot of the notation change?
Also, he seems to be speaking in one notation while the screen is showing something else.
rock paper scissors problem:
would it not be a better strategy to try to fool the opponent into thinking we are following a policy other than the random play so that we can exploit the consequences of his decisions?
Whatever strategy you come up with cannot beat the uniform random strategy - that's why it is considered optimal.
In real life it could be good, but theoretically of course not, since it is not a Nash Equilibrium. It can be exploited. Watch lecture 10.
At 1:03:49, shouldn't the action be sampled from \pi_\theta(s', a)?
Came with high hopes from last video....
Without the video it's hard to tell what he is pointing to
Is the slide about deterministic policy gradient at the end the one with compatible function approximation (like in the middle of the presentation)? :S
Luckily there is the paper from Silver online :)
Very good videos. With the video would have been better this one, but thanks anyways.
Not really, the slide you are referring to does not have the gradient of the Q-Function in the equations, which is the main point of what he is talking about.
It helps a lot to have a look at the original paper (pages 4 and 5 in particular) to understand his explanation of DPG which can be found here: proceedings.mlr.press/v32/silver14.pdf
Why do all lecture videos on Policy Gradient Methods use the exact same set of slides lol
1:07:28 What does Silver mean when he says : "We can reduce the variance without changing the expectation"
There are several ways to reduce the variance, but reducing variance by using a "critic" as at 53:02 may keep changing and updating the expected value as we go.
So this slide shows a way to reduce the variance without changing the expectation.
The idea is that subtracting a "baseline function B(s)" from the policy gradient can do the job.
The expectation shown above works out, after a few algebra steps, to "B(s)" multiplied by the gradient of "the policy probabilities, which sum to 1".
And the gradient of a constant (1) equals zero, so the whole baseline term has expectation zero.
That means you can use the baseline function B(s) as a trick to control the variance without changing the expectation: the baseline does not affect the expectation, since its contribution to the expectation is exactly zero.
Sorry if the explanation is not straightforward and a bit complicated lol
\nabla log pi(s,a) A(s,a) and \nabla log pi(s,a) Q(s,a) have the same expectation over the MDP. However, which one has the larger variance? V[X] = E[X^2] - E[X]^2. Obviously E[X]^2 is the same for both. However, which expectation is larger, that of |\nabla log pi(s,a)|^2 |A(s,a)|^2 or that of |\nabla log pi(s,a)|^2 |Q(s,a)|^2? Note that A just centers Q, so typically its square is smaller.
Hi, guys, how can I get the v_t in 50:53?
Since we have rewards of the entire episode, we can calculate the returns, Monte Carlo way. Here, v_t is more like G_t. v_t = R_(t+1) + gamma*R_(t+2)..... + gamma^(T-1)*R_T
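A tiny helper for computing those returns from a recorded episode (a sketch; rewards[t] here is taken to mean R_(t+1)):

def mc_returns(rewards, gamma):
    # v_t (a.k.a. G_t) = R_(t+1) + gamma*R_(t+2) + ..., computed backwards in one pass
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. mc_returns([0, 0, 1], gamma=0.5) -> [0.25, 0.5, 1.0]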
Thank you!
32:38 - AIBO Training Video Links: www.cs.utexas.edu/~AustinVilla/?p=research/learned_walk
Does anyone know about the reading material David mentions in the previous class?
I guess it is this one: "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
What does phi(s) mean at 1:18:05 ?
So when can we determine that there is state aliasing?
Basically when you feel like your features are not representing the MDP very well. The solution is changing the features or improving them.
Someone forgot to hit Record!!
What is state aliasing in reinforcement learning?
It's like when two different states are represented with the same features, or encoded using the same encoding. Though they are different (and have different rewards), due to the aliasing they appear the same, so it gets difficult for the algorithm or approximator to differentiate between them.
A little mismatch between the voice and slides...
How about the non-MDP? Does anyone have experience with that?
Minh Vu A non-MDP can sometimes be artificially converted so that MDP methods can be used to solve it, like a quasi-Markov chain
I lost it after he started talking about bias and reducing variance in actor-critic algorithms, after 1:05:03
v: critic param, u: actor param
Someone should superimpose their gestures on top of this video
slide: www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
31:40 "...until your graduate students collapse it..." LoL
24 dislikes? I cannot believe you can watch David and hit dislike at the end. Some people are really strange
you can't watch actually
this lecture was difficult
I'm looking for the robot.. 32:49
whoa that was fast
there should be more real examples ... the Udacity course is even heavier.
Lecture quality went downhill quick
Ohh math! Student confused lol.