People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are among the 10% who came this far.
Me in high school trying to make a Rocket League bot: :0
I finally came to this stage of learning after watching his videos over and over. He explains everything very well, but RL differs from other ML and it takes time to learn and get used to.
Bro thanks for the encouragement
Brother Ashsat! This was the most timely comment ever on YouTube. I was watching these lectures and I felt brain dead since they are pretty long. This is a great encouragement. May God bless you for this!
my best advice is to apply each algorithm in the Sutton and Barto textbook to problems in OpenAI Gym to help understand all this.....
you can do it, you got this....
Oh, I can't concentrate without seeing David.
Exactly the same here.
Me too. This is much better than 2018 one
How will I understand the expected return without him walking to a spot and summing up along the path after it?
Turn on subtitles. Helps a lot.
For those confused:
- whenever he speaks of the u vector he's talking about the theta vector (slides don't match).
- At 1:02:00 he's talking about slide 4.
- At 1:16:35 he says Vhat but slides show Vv
- He refers to Q in Natural Policy Gradient, which is actually Gtheta in the slides
- At 1:30:30 the slide should be 41 (the last slide), not the Natural Actor-Critic slide
Also, when he starts talking about DPG at 1:26:10 it helps a lot to have a look at his original paper (proceedings.mlr.press/v32/silver14.pdf) pages 4 and 5 in particular.
I think the DPG slides he is actually referring to are not available online.
@50:00 he mentions G_t, but the slides show v_t, right?
@@MiottoGuilherme yes. G_t as in Sutton/Barto's book, i.e. future, discounted reward.
@@illuminatic7 Unfortunately that link doesn't work anymore, here is an alternative: hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
This is what I call commitment: David Silver explored a not-showing-his-face policy, received less reward, and then switched back to the previous lectures' optimal policy.
Nothing like learning from this "one stream of data called life."
What a treasure of a comment
And we experienced a different state, realized we got much less reward from it, and updated our value function! Then David adjusts his policy to our value function, like actor-critic? Or is that a bit of a stretch? I do think there's some link between his value function and ours here: he wants us to do well, because he's a legend!
Genius!
Can't discount face value!
ahhh... where did u go david.. i loved your moderated gesturing
I have the same question too. His gestures are really helpful in learning the course!
He probably forgot to turn on the cam capture :)
Lecture 7 is optional
I think Lecture 10 is optional. Lecture 7 seems rather important.
This course should be called: "But wait, there's an even better algorithm!"
lol entire machine learning is like that
That, my friend, is the core principle of any field of engineering. That's how computers got from a room-sized contraption to a handheld device: because somebody said, wait, there is an even better way of doing this.
And it turns out that this is the best course to learn RL, even after 6 years.
really? What were you able to do with this information?
I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.
3:24 Introduction
26:39 Finite Difference Policy Gradient
33:38 Monte-Carlo Policy Gradient
52:55 Actor-Critic Policy Gradient
1:30 Outline
3:25 Policy-Based Reinforcement Learning
7:40 Value-Based and Policy-Based RL
10:15 Advantages of Policy Based RL
14:10 Example: Rock-Paper-Scissors
16:00 Example: Aliased Gridworld
20:45 Policy Objective Function
23:55 Policy Optimization
26:40 Policy Gradient
28:30 Computing Gradients by Finite Differences
30:30 Training AIBO to Walk by Finite Difference Policy Gradient
33:40 Score Function
36:45 Softmax Policy
39:28 Gaussian Policy
41:30 One-Step MDPs
46:35 Policy Gradient Theorem
48:30 Monte-Carlo Policy Gradient (REINFORCE)
51:05 Puck World Example
53:00 Reducing Variance Using a Critic
56:00 Estimating the Action-Value Function
57:10 Action-Value Actor-Critic
1:05:04 Bias in Actor-Critic Algorithms
1:05:30 Compatible Function Approximation
1:06:00 Proof of Compatible Function Approximation Theorem
1:06:33 Reducing Variance using a Baseline
1:12:05 Estimating the Advantage Function
1:17:00 Critics at Different Time-Scales
1:18:30 Actors at Different Time-Scales
1:21:38 Policy Gradient with Eligibility Traces
1:23:50 Alternative Policy Gradient Directions
1:26:08 Natural Policy Gradient
1:30:05 Natural Actor-Critic
Damn. It was a lot easier understanding it with the gestures
he could describe his gestures in the subtitles
Starts at 1:25.
Actor critic at 52:55.
thx
It would have been great if it was possible to recreate David in this lecture based on his voice using some combination of RL frameworks.
I wanted to see the AIBO training :(
Me too. If you look at the Paper by Nate Kohl and Peter Stone where they describe it, they reference a web page for the videos. And surprisingly it is still online. You find it at www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
@@felixt1250 not anymore :'(
@@tchz I think it is still there but you have to copy and paste the link.
This lecture was immensely difficult to get owing to David's absence and the mismatch of the slides
It is unfortunate that exactly this episode is without David on screen. It is again quite a complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.
I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.
Unfortunately the slides do not fit what is said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.
These lectures are a gift. Thanks
By far the best video about policy gradient methods on YouTube
First time in my life I had to DECREASE the speed of the video and not increase it... man, he talks REALLY fast, while at the same time showing new slides filled with equations
“No matter how ridiculous the odds may seem, within us resides the power to overcome these challenges and achieve something beautiful. That one day we look back at where we started, and be amazed by how far we’ve come.” -Technoblade
I started this series a month ago in summer break, I even did the Easy21 assignment, and now I have finally learned what I wanted when I started this series, i.e. the actor-critic method. Time to do some Gymnasium envs.
Just to make sure: at 36:22, the purpose of the likelihood ratio trick is that the gradient of the objective function gets converted into an expectation again? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".
I'm not sure about it either
That's exactly right. Once you convert it into an expectation, you can approximate it by sampling, so that trick is very practical.
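For anyone who wants it spelled out, here is the identity in question, as I read it from the slides:

\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)

Multiplying and dividing by \pi_\theta(s,a) puts the policy probability back in front of the gradient term, so a sum over actions weighted by \pi_\theta(s,a) is again an expectation under \pi_\theta, and an expectation is exactly the thing you can estimate from sampled (s, a, r) tuples.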
It took me a while to realize that the policy function pi(s, a) is alternately used as the probability of taking a certain action in state s, and as the action proper (a notation overload that comes from the Sutton book). I think specific notation for each instance would avoid a lot of confusion.
At 45:36 I think the notation he is describing is different from that shown in these slides. I think his "capital R" is the small "r" for us. And the "curly R" is the "Rs,a" for us.
Also u is theta.
Yes, fully agree. I believe this is important so to reiterate the small correction: the lowercase r is a random reward, the actual reward that agent/we experience, while the curly uppercase R is the reward from the MDP (Markov Decision Process).
We miss you David
Thank you for creating the video John, this is really great!
This has good sound quality, but is missing the nice body language..
Made me cry after a very long time :( given the professor's absence and the slide mismatch
man, without the gestures it's not the same, the lecture is not the same…...
You are fantastic David. Thanks for the tutorial.
Is there any particular reason that, in the basic TD(0) QAC pseudocode (1:00:00), we don't update the Q weights first before doing the theta gradient update?
I think you can start with an arbitrary value for the weights,
since they will also be adjusted in proportion to the TD error and get better as the iterations go on.
Super good question. I am guessing the reason is computational right? You want to reuse the computation you did for Q_w(s,a) when computing delta instead of computing it again with new weights when doing the gradient ascent update of the policy parameters (theta). However, what you propose seems more solid, just more costly.
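For concreteness, here is a rough Python sketch of one QAC episode in the spirit of the 1:00:00 slide, assuming a linear critic Q_w(s,a) = phi(s,a)^T w and a linear-softmax actor; env, features, and the update order (critic after actor, as the question discusses) are my own illustrative choices, not David's exact pseudocode:

import numpy as np

def qac_episode(env, features, n_actions, theta, w, alpha=0.01, beta=0.01, gamma=0.99):
    # features(s, a) -> feature vector phi(s, a); hypothetical helper
    def softmax_probs(s):
        prefs = np.array([features(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()                          # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    s = env.reset()
    a = np.random.choice(n_actions, p=softmax_probs(s))
    done = False
    while not done:
        s2, r, done = env.step(a)                     # hypothetical env API
        a2 = np.random.choice(n_actions, p=softmax_probs(s2)) if not done else None
        q_sa = features(s, a) @ w                     # linear critic Q_w(s, a)
        q_next = features(s2, a2) @ w if not done else 0.0
        delta = r + gamma * q_next - q_sa             # TD(0) error
        # score function of a softmax policy: phi(s,a) - sum_b pi(s,b) phi(s,b)
        probs = softmax_probs(s)
        score = features(s, a) - sum(probs[b] * features(s, b) for b in range(n_actions))
        theta = theta + alpha * score * q_sa          # actor step uses the old critic weights
        w = w + beta * delta * features(s, a)         # critic step afterwards, reusing q_sa
        s, a = s2, a2
    return theta, w

Swapping the last two updates (critic first, actor second) is the variant the question proposes; it just means recomputing Q_w(s,a) with the fresh weights before the actor step.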
He teaches much better than Hado van Hasselt. Makes it much easier.
Thanks for updating lectures :)
I sort of got stuck on this lecture because the video wasn't available :P Now I have no excuse for not finishing the course !
Thought this was a real video! I was wrong! David keeps referring to equations on the slides, but the audio and slides are not synced! It's confusing sometimes! But still better than just audio!
The slides are perfectly synced with the audio most of the time, but the slides on "compatible function approximation" are not in the right order and the slides on "deterministic policy gradient" are missing.
36:20, could anyone please explain what kind of expectation we are computing (I only see the gradients)? And why is the expectation of the right-hand side easier to compute than that of the left-hand side?
You want to maximise J at 43:25, which is the expected immediate reward. Note that thanks to the computation at 36:20, the gradient of J at 43:25 becomes an expectation. The expectation is computed over the full state-action space of the MDP under the policy pi_\theta. Note that without the term pi_\theta(s,a) in the sum, that thing would not be an expectation anymore, so you COULD NOT APPROXIMATE IT BY SAMPLING.
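To write the 43:25 slide out explicitly (a sketch using the slide's notation, with d(s) the state distribution under the policy):

\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]

The right-hand side is an average under the policy, so you can run the policy, observe (s, a, r), and use \nabla_\theta \log \pi_\theta(s,a)\, r as an unbiased stochastic estimate of the gradient. The left-hand side, written as a sum over all states and actions, has no such sample-based estimator.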
The slides are outdated, judging by David's speech; he apparently changed notation and added a few slides in the last 30 minutes or so.
Can someone explain how he got the score function from the maximum likelihood expression at 22:39?
I am 2 years late but this might help someone else😅😅
It is simple differentiation
The gradient of log f(a) with respect to a is grad f(a) / f(a).
He just did this backwards: grad pi = pi * grad log pi.
I love how there is always someone moaning or chewing food near the camera/microphone.
Hado van Hasselt gives basically the same lecture here: ruclips.net/video/bRfUxQs6xIM/видео.html. I still like David's lecture much more, but perhaps this other lecture can fill some of the gaps that appeared with David's disappearance.
Wish to see David in person.
While I don't know how generalizable the solution to this specific problem in an adversarial game would be, I can't help but wonder how these Policy Gradient Methods could solve it. The problem I am considering is one where the agent is out-matched, out-classed, or an "underdog" with less range, damage, or resources than its opponent in an adversarial game where it is known that the opponent's vulnerability increases with time or proximity.
Think of Rocky Balboa vs Apollo Creed in Rocky 2 (where Rocky draws punches for many rounds to tire Apollo and then throws a train of left punches to secure the knockout), being pursued by a larger vessel in water or space (where the opponent has longer-range artillery or railguns but less maneuverability due to its greater size), eliminating a gunman in a foxhole with a manually thrown grenade, or besieging a castle.
If we assume that the agent can only win these games by concentrating all the actions that actually give measurable or estimable reward in the last few sequences of actions in the small fraction of possible episodes that reach the goal, how would any of these Policy Gradient Methods be able to find a winning solution?
Given that all actions for many steps from the initial state would require receiving consistent negative rewards (either through glancing blows with punches for many rounds, evasive actions like maneuvering the agent's ship to dodge or incur nonvital damage from the longer-range attacks, or simply losing large portions of an army to get from the field to the castle walls and ascend them), I imagine the solution would have to be some bidirectional search with some nonlinear step between minimizing negative rewards from the initial state and maximizing positive reward from the goal.
But can any of these Policy Gradient Methods ever capture such strategies if they are model-free (what if they have to be online or in partially observable environments as well)? It seems that TD(lambda) with both forward and backward views might be able to, but would the critical states of transitioning between min-negative and max-positive reward be lost in a "smoothing out" over all action-sequence steps, or never found given the nonlinearity between the negative and positive rewards? What if the requisite transitions were also the most dangerous for the underdog agent (i.e. t_100 rewards: -100, +0; t_101 rewards: -1000, +5)?
If the environment is partially observable, and there really is no real benefit in strictly following the min-negative reward, given that the only true reward that matters is surviving and eliminating the opponent, some stochasticity would be required in action selection on the forward view to explore states that are locally non-optimal for the min-negative reward but required for ever experiencing the global terminal reward state; but this stochasticity may not be affordable on the backward view, where the concentration of limited resource use cannot be wasted.
I guess the only viable method is if the network captured, in the feature vector, a function of the opponent's vulnerability in terms of time, resources exhausted, and/or proximity, but what still remains is this concern of increased danger for the agent as it gets closer to the goal. I realize that one could bound the negative reward minimization from zero damage to "anything short of death", but normalizing that with the positive rewards at the final steps of the game or episode would be interesting to understand. In this strategy it seems odd for an algorithm at certain states to effectively be "saying" things like:
"Yes! You just got punched in the face 27 times in a row! (+2700 reward)";
"Congratulations! 2/3s of your ship has lost cabin pressure! (+6600 reward)";
"You have one functional leg, one functional arm, and suffering acute exsanguination! (+10,000 reward)"
"Infantry death rate increases 200x! (+200,000 reward)".
Any thoughts?
did you figure it out yet? It's been a year; hopefully you've had time to sit down and make actual progress towards creating this?
Thank you many times, dear Karolina. Cheers
1:00:54, this man got a -1 reward and restarted a new episode.
1:00:55 " Ugnkhhh.... "
The lectures were going great until someone decided not to show David's gestures. God I was learning so much just from his gestures.
What is the log policy exactly? Is it just the log of the policy's output for some state-action pair?
Is there perhaps a link to the videos of AIBOs running? (supposed to be shown at 31:55)
@@oDaRkDeMoNo Thank you!
I need David! It's hard to understand some pronouns without seeing him.
Fantastic lectures
When adding the baseline, there is an error. The gradient is zero when multiplying by the baseline because the function B(s) does not depend on theta. But then he uses B(s) = V^{\pi_\theta}(s), which does depend on theta. :( So this is at most a motivation rather than a mathematical proof.
No error. That gradient is not hitting the baseline B, so it does not matter that B depends on theta. The gradient inside the sum is zero because the policy coefficients sum to one for fixed s.
This is a well-known classical thing anyway. It was originally proven in Sutton's paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation".
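For reference, the slide's computation in one line (the gradient acts only on \pi_\theta inside the inner sum, not on B(s) or on the state distribution):

\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) B(s)\right] = \sum_s d^{\pi_\theta}(s) B(s) \sum_a \nabla_\theta \pi_\theta(s,a) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_a \pi_\theta(s,a) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta 1 = 0

So subtracting any state-dependent baseline, including B(s) = V^{\pi_\theta}(s), leaves the expected gradient unchanged while (hopefully) shrinking its variance.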
Following this lecture is like learning math by listening to a podcast.
Not sure if that's supposed to be good or not lol
Sometimes David's words and the slides don't correspond to each other, and I don't know what to do: listen to David or read the slides. For example at 1:29:55, when he speaks about the deterministic policy gradient theorem.
David has a paper on DPG, which he mentioned was published "last year", in 2014, and later a DDPG one. Just check them out.
51:58 - "you get this very nice smooth learning curve... but learning is slow because rewards are high variance"
Any idea why the learning curve is smooth despite high variance of returns? We use returns directly in gradient formula, so intuitively I'd guess they'd affect behavior of the learning curve as well.
I mean, look at the scale, it is massive. I bet if you zoom in, it is not going to be very smooth. Let's say we have an absorbing MDP with pretty long trajectories and we calculate the mean returns by applying MC. By the central limit theorem, the mean experimental returns converge to the real returns, but it will take many iterations due to the high variance of those returns. The smoothness you would see when zooming out (when looking at how the mean returns converge) would be due to the central limit theorem. Note that I am simply drawing a parallel. In the case of MC policy gradient, that smoothness is due to its convergence properties, which rely on the fact that the MC returns are unbiased samples of the real returns, but the curve is very bumpy when you zoom in, precisely due to the variance.
Why is there not a single implementation in MATLAB?
Could anyone please explain the slide at 45:51? In particular, I don't understand how the big $R_{s,a}$ becomes just $r$ when we compress the gradient into the expectation E. What is the difference between the big R and the small one?
r is the immediate reward understood as a RANDOM VARIABLE. This is useful because we want to compute the expectation of r along the state space generated by the MDP given a fixed policy pi. This is a measure of how good our policy is. R_{s,a} is the expectation of r given that you are at state s and carry out action a, i.e. R_{s,a} = E[r | s,a].
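In symbols, the step on the slide is just the tower property of conditional expectation:

\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a}\right] = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, \mathbb{E}[r \mid s,a]\right] = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]

so once everything sits inside the expectation over states and actions, the curly \mathcal{R}_{s,a} can be replaced by the sampled reward r.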
Does he talk about REINFORCE in this talk/lecture? If yes when?
Here: 48:30
It's a pity to see only the slides compared to the previous lectures; the change of format makes it very hard to follow
Thanks for updating the lectures. I have some problems understanding the state-action feature vector \phi(s, a). I know about the features of the environment mentioned in the last lecture, which could be some kind of observation of the environment, but how should I understand this state-action feature vector?
The state-action features in the last lecture and this lecture are different, since in the last lecture they were used to approximate the VALUE Q of a particular state-action pair, and in this lecture they are used to approximate a POLICY PI. State-action features filter important information about the state and action used to approximate the state-value function or maybe the policy, depending on the context.
Say we are in a 2D grid world. The possible actions are up, down, left and right. Every time I move up, I get +1 reward, every time I move down, left or right, I get 0 reward. Define two features as (1,0) if I choose to go up, and (0,1) otherwise. Note that now I can compute the value function EXACTLY as a linear combination of my features, since they contain all the relevant information. My optimal policy is also a linear combination of those features only.
PS: you are asking about the linear case, but for me the most interesting case is the nonlinear one.
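A throwaway sketch of that toy example, just to make the linearity concrete (the feature map and rewards are the ones described above; the specific theta is made up):

import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def phi(action):
    # (1, 0) if we go up, (0, 1) otherwise, as in the example above
    return np.array([1.0, 0.0]) if action == "up" else np.array([0.0, 1.0])

# the immediate reward is exactly linear in the features: r = phi(a) @ w with w = [1, 0]
w = np.array([1.0, 0.0])
for a in ACTIONS:
    print(a, phi(a) @ w)       # up -> 1.0, everything else -> 0.0

# a softmax policy over the same features; with a large first component of theta
# it puts almost all probability on "up", which is the optimal policy here
theta = np.array([5.0, 0.0])
prefs = np.array([phi(a) @ theta for a in ACTIONS])
probs = np.exp(prefs - prefs.max()) / np.exp(prefs - prefs.max()).sum()
print(dict(zip(ACTIONS, probs.round(3))))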
@@edmonddantes4705 Wow that is response to 6 year old question. Thanks for taking time.
@@guptamberhaha I do it to practise!
How does he get the score function at 37:41?
I've seen this question a few places around the net so I answered it here: math.stackexchange.com/questions/2013050/log-of-softmax-function-derivative/2340848#2340848
Thank you, very elaborate answer!
So, how do you get a score function for a deep NN?
@@blairfraser8005 could you do it for dummies? I don't understand why you put the terms inside logs.
Our goal is to get a score function by taking the gradient of softmax. It looks like a difficult problem so I need to break it down into a simpler form. The first way to break it down is to separate the numerator and denominator using the log identity: log(x/y) = log(x) - log(y). Now I can apply the gradient to the left and right side independently. I also know that anytime I see something in the form e^x there is a good chance I can simplify and get at the guts of the exponent by taking the log of it. That helps simplify the left side. Next, the right side also takes advantage of a log property - namely that the gradient of the log of f(x) can be written in the form of gradient of f(x) / f(x). This is just the chain rule from calculus. Now the gradients of both the left and right sides are easier.
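Putting those steps together for the linear softmax policy \pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta} from the slides (a deep-NN policy follows the same recipe, just with non-linear preferences in place of \phi(s,a)^\top \theta):

\nabla_\theta \log \pi_\theta(s,a) = \nabla_\theta \left[ \phi(s,a)^\top \theta - \log \sum_b e^{\phi(s,b)^\top \theta} \right] = \phi(s,a) - \sum_b \pi_\theta(s,b)\, \phi(s,b) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]

which is exactly the "feature minus the average feature" score function on the softmax policy slide.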
Will these policy gradient methods work better than the previous methods based on generalized policy iteration (MC, TD and SARSA)?
I don't understand why he says that value-based methods can't work with stochastic policies. By definition epsilon-greedy is stochastic. If we find two actions with the same value, we could simply have a stochastic policy assigning probability 1/2 to both. And thus, value-function-based methods could also solve the aliased-gridworld example around 20:00.
The convergence theorems in CONTROL require epsilon --> 0. If you read papers, you will often see assumptions of GLIE type (greedy in the limit with infinite exploration), which go towards a deterministic policy. David also mentions this (lecture 5 I think).
Hello Karolina,
Is there any real video for this class?
Isn't the state value function 'useless' to an agent considering he 'chooses' actions but can't 'choose' his state?
Where did you go David :-(
I found Prof. Silver brilliant, but the concepts in this lecture are by and large not explained concretely; it is mostly an illustration of the book. Moreover, the earlier lectures showed where on the slide the professor was pointing, and that is missing here too.
This really helps. Thanks
I am guessing the slides shown in the video are slightly different from the ones they used in the lecture.
28:52 So J(theta) is the average reward your agent gets following policy theta, and pi(theta, a) is the probability of taking action a under policy theta?
J(theta) is the objective function; check the 22:30 slide.
In line "Sample a ~ pi_theta" of the actor-critic algorithm around 58:00.
From what I understand, pi_theta(s, a) = P[a | s, theta], but I don't clearly understand how we can pick an action a given s and theta. Do we have to calculate phi(s, a) * theta for all possible actions a at state s, and then choose an action according to their probabilities?
If yes, how can we take an action in continuous action domains?
If no, how can we pick an action then?
Something like this
a = np.random.choice(action_space, 1, p=action_probability_distribution)
See docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html
In continuous action domains, pi_theta(s,a) could be a Gaussian for fixed s (just an example). In discrete action spaces, for every state s, there is a probability of every action given by pi_theta(s,a). They sum to one of course.
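A small sketch of both cases, assuming a linear-softmax actor for discrete actions and a Gaussian policy with a linear mean for continuous actions (the feature shapes and parameter values are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)

# discrete actions: softmax over action preferences phi(s,a)^T theta
def sample_discrete(phi_s, theta):
    # phi_s: array of shape (n_actions, n_features), one feature vector per action
    prefs = phi_s @ theta
    prefs -= prefs.max()                              # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs), probs

# continuous actions: Gaussian policy a ~ N(mu_theta(s), sigma^2)
def sample_continuous(phi_s, theta, sigma=0.5):
    # phi_s: a single feature vector for state s; the mean is linear in the features
    mu = phi_s @ theta
    return rng.normal(mu, sigma)

# usage with made-up numbers
phi_discrete = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
a, probs = sample_discrete(phi_discrete, np.array([2.0, -1.0]))
print(a, probs)
print(sample_continuous(np.array([0.3, 0.7]), np.array([1.0, 2.0])))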
Thank you for the lecture. I was wondering if you are constrained to use the same state-action feature vectors for actor and critics? The weights are, of course, different, but does \phi(s,a) need to be the same? (57:18)
As far as my understanding goes, the feature vectors of the actor and the critic are completely different. The feature vector of the critic is more like a representation of the state and action spaces, as you have seen in the Value Function Approximation lecture. For the actor, the feature vectors parameterise the probabilities of taking an action in a given state (e.g. through a softmax), rather than being probabilities themselves.
Of course they don't have to be the same. The state-action value features are stuff that approximate the state-action value function well, and the policy features are stuff that approximate a general useful policy well. For example, look at what compatible function approximation imposes in order to flow along the real gradient 1:05:52. How are you going to achieve that condition with the same features?
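The two conditions from the 1:05:52 slide, for reference: the critic's features must be "compatible" with the actor,

\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a),

and w must minimise the mean-squared error \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a) - Q_w(s,a))^2\right]. Under those two conditions the policy gradient computed with Q_w equals the true policy gradient; note that the compatible critic features are the actor's score function, not the actor's raw features.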
I feel like for the last 15 minutes the slides and what he says are not in sync anymore. :(
David disappeared, but CC subtitles are coming!!
Around 1:00:00, in the action-value actor-critic algorithm, to update w he used \beta * \delta * feature. Why is he taking the feature here? In model-free evaluation he used the eligibility trace, but why the feature here?
He is using linear function approximation for Q. It is a choice. Not sure why you are bothered that much by that.
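Concretely, with a linear critic Q_w(s,a) = \phi(s,a)^\top w we have \nabla_w Q_w(s,a) = \phi(s,a), so the semi-gradient TD(0) update of the critic is w \leftarrow w + \beta\, \delta\, \phi(s,a). The feature shows up for the same reason it did in the function approximation lecture; an eligibility trace would replace \phi(s,a) only in the TD(\lambda) variant of the critic.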
First time making it this far. But is it just me or did a lot of the notation change?
Also, he seems to be speaking in one notation while the screen is showing something else.
rock paper scissors problem:
would it not be a better strategy to try to fool the opponent into thinking we are following a policy other than the random play so that we can exploit the consequences of his decisions?
Whatever strategy you come up with cannot beat the uniform random strategy - that's why it is considered optimal.
In real life it could be good, but theoretically of course not, since it is not a Nash Equilibrium. It can be exploited. Watch lecture 10.
At 1:03:49, shouldn't the action be sampled from \pi_\theta(s', a)?
Came with high hopes from last video....
Without the video it's hard to tell what he is pointing to
Is the slide about deterministic policy gradient at the end the one with compatible function approximation (like in the middle of the presentation)? :S
Luckily there is the paper from Silver online :)
Very good videos. With the video would have been better this one, but thanks anyways.
Not really, the slide you are referring to does not have the gradient of the Q-Function in the equations, which is the main point of what he is talking about.
It helps a lot to have a look at the original paper (pages 4 and 5 in particular) to understand his explanation of DPG which can be found here: proceedings.mlr.press/v32/silver14.pdf
Why do all lecture videos on Policy Gradient Methods use the exact same set of slides lol
1:07:28 What does Silver mean when he says : "We can reduce the variance without changing the expectation"
There are several ways to reduce the variance, but reducing variance by using a "critic" as at 53:02 may keep changing and updating the expected value as we go.
So this slide shows a way to reduce the variance without changing the expectation.
The idea is that subtracting a "baseline function B(s)" from the policy gradient can do the job.
The expectation shown above works out, after a few algebra steps, to "B(s)" multiplied by the gradient of "the policy probabilities, which sum to 1".
And the gradient of a constant (1) equals zero, so the whole baseline term has expectation zero.
That means you can use the baseline function B(s) as a trick to control the variance without changing the expectation: the baseline does not affect the expectation, since its contribution to the expectation is exactly zero.
Sorry if the explanation is not straightforward and a bit complicated lol
\nabla log pi(s,a) A(s,a) and \nabla log pi(s,a) Q(s,a) have the same expectation over the MDP. However, which one has the larger variance? V[X] = E[X^2] - E[X]^2. Obviously E[X]^2 is the same for both. However, which expectation is larger, that of |\nabla log pi(s,a)|^2 |A(s,a)|^2 or that of |\nabla log pi(s,a)|^2 |Q(s,a)|^2? Note that A just centers Q, so typically its square is smaller.
Hi, guys, how can I get the v_t in 50:53?
Since we have rewards of the entire episode, we can calculate the returns, Monte Carlo way. Here, v_t is more like G_t. v_t = R_(t+1) + gamma*R_(t+2)..... + gamma^(T-1)*R_T
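A tiny helper for computing those returns from a recorded episode (a sketch; rewards[t] here is taken to mean R_(t+1)):

def mc_returns(rewards, gamma):
    # v_t (a.k.a. G_t) = R_(t+1) + gamma*R_(t+2) + ..., computed backwards in one pass
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. mc_returns([0, 0, 1], gamma=0.5) -> [0.25, 0.5, 1.0]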
Thank you!
32:38 - AIBO Training Video Links: www.cs.utexas.edu/~AustinVilla/?p=research/learned_walk
Does anyone know about the reading material David mentions in the previous class?
I guess it is this one: "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
What does phi(s) mean at 1:18:05 ?
So when can we determine that there is state aliasing?
Basically when you feel like your features are not representing the MDP very well. The solution is changing the features or improving them.
Someone forgot to hit Record!!
What is state aliasing in reinforcement learning?
It's like when two different states are represented with the same features, or encoded using the same encoding. Though they are different (and have different rewards), due to the aliasing they appear the same, so it gets difficult for the algorithm or approximator to differentiate between them.
A little mismatch between the voice and slides...
How about the non-MDP? Does anyone have experience with that?
Minh Vu A non-MDP can sometimes be artificially converted so that MDP methods can be used to solve it, like a quasi-Markov chain
I lost it after he started talking about bias and reducing variance in actor-critic algorithms, after 1:05:03
v: critic param, u: actor param
Someone should superimpose their gestures on top of this video
slide: www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
31:40 "...until your graduate students collapse it..." LoL
24 dislikes? I cannot believe you can watch David and hit dislike at the end. Some people are really strange
you can't watch actually
this lecture was difficult
I'm looking for the robot.. 32:49
whoa that was fast
there should be more real examples ... the Udacity course is even heavier.
Lecture quality went downhill quick
Ohh math! Student confused lol.