RL Course by David Silver - Lecture 6: Value Function Approximation

Поделиться
HTML-код
  • Опубликовано: 11 янв 2025

Комментарии •

  • @TheAIEpiphany
    @TheAIEpiphany 3 года назад +157

    I'm grateful that David was successfully sampled in this iteration of the universe.

  • @Chr0nalis
    @Chr0nalis 8 лет назад +316

    "Life is one big training set"
    D. Silver, 2015

    • @feebdaed
      @feebdaed 8 лет назад +53

      "Follow the gradient, Neo"

    • @luoryan3355
      @luoryan3355 7 лет назад +4

      Predict values of your future

    • @TrainWithSweatAndBlood
      @TrainWithSweatAndBlood 6 лет назад +2

      "you are the running mean of your friend circle".Keep updated that towards goal.

    • @kotamori1907
      @kotamori1907 6 лет назад +15

      This motivates off-policy learning pretty well. We should learn from someone else's life.

    • @ProfessionalTycoons
      @ProfessionalTycoons 6 лет назад +2

      Well my recall is so small.

  • @proximo08
    @proximo08 7 лет назад +95

    this guy is definitely the best to explain RL

    • @crazyfrog2817
      @crazyfrog2817 6 лет назад +9

      he is not a random guy...

    • @andrewgrebenisan6141
      @andrewgrebenisan6141 5 лет назад +7

      @@crazyfrog2817 Yeah, it's interesting how random professors that know their stuff may not be good at teaching, but the best of the best always know their stuff and how to explain it

    • @ishansrt
      @ishansrt 5 лет назад +7

      @@crazyfrog2817 he is a legend, people don't realize this

  • @florentinrieger5306
    @florentinrieger5306 Год назад +8

    This video is yet another example that the complexity of a subject really depends on the one who is teaching it. Thank you David, for making RL so much more accessible and understandable! It is a real pleasure listening to those lectures of yours ❤

  • @yasseraziz1287
    @yasseraziz1287 4 года назад +55

    0:00 Motivations
    0:35 Outline
    0:35 Large-Scale Reinforcement Learning
    3:55 Value Function Approximation
    8:40 Types of Value Function Approximation
    11:55 Which Function Approximator?
    15:38 Incremental Methods
    15:43 Gradient Descent
    17:35 Value Function Approx. by Gradient Descent
    21:35 Feature Vectors
    23:23 Linear Value Function Approximation
    28:43 Table Lookup Features
    30:50 Incremental Prediction Algorithms
    31:10 Monte-Carlo with Value Function Approximation
    37:33 TD Learning with Value Function Approximation
    41:56 TD(lambda) with Value Function Approximation
    49:00 Control with Value Function Approximation
    52:30 Action-Value Function Approximation
    53:50 Linear Action-Value Function Approximation
    55:20 Incremental Control Algorithms
    56:20 Linear Sarsa with Coarse Coding in Mountain Car
    1:04:30 Study of lambda: Should we Bootstrap?
    1:06:10 Baird's Counterexample
    1:06:30 Parameter Divergence in Baird's Counterexample
    1:06:50 Convergence of Prediction Algorithms
    1:08:00 Gradient Temporal-Difference Learning
    1:09:00 Convergence of Control Algorithms
    1:10:19 Batch Methods
    1:12:30 Batch Reinforcement Learning
    1:13:30 Least Square Prediction
    1:15:25 Stochastic Gradient Descent with Experience Replay
    1:17:25 Experience Replay in Deep Q-Networks (DQN)
    1:24:46 DQN in Atari
    1:26:00 How much does DQN help?
    1:27:35 Linear Least Square Prediction (2)
    1:32:29 Convergence of Linear Least Squares Prediction Algorithms
    1:32:50 Least Squares Policy Iteration
    1:34:15 Chain Walk Example
    1:35:00 LSPI in Chain Walk: Action-Value Function

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 7 лет назад +132

    It's gonna be pretty confusing the first time you hear about it, but give it a time, try to understand some of it and then come back to watch it again, you will see how much you can comprehend it.

    • @nirajabcd
      @nirajabcd 4 года назад +2

      ditto

    • @mathavraj9378
      @mathavraj9378 4 года назад

      I feel this

    • @Chdy120
      @Chdy120 3 года назад

      @@mathavraj9378 j

    • @chamangupta4624
      @chamangupta4624 3 года назад

      done that

    • @lex6709
      @lex6709 2 года назад +1

      yeah iteration would be good if not perfect oracle policy at first sight

  • @junzhu107
    @junzhu107 4 года назад +17

    This is by far the best course on DRL I have ever watched. I kind of lost on the different notations in different papers although I can code many of them up. But this video summarized the root of the problems and solutions concisely which makes me be able to make sense what I have learned in the past. Thanks for the great work!

  • @shyom1d676
    @shyom1d676 2 года назад +3

    David Silver is an intelligent and humble person. He explains things very well , I'm glad that I came across his lectures

  • @NganVu
    @NganVu 4 года назад +32

    1:38 Introduction
    15:28 Incremental Methods
    1:12:17 Batch Methods

  • @LunahLiu
    @LunahLiu 4 года назад +37

    2020 and still watching this to refresh the knowledge 🙋‍♀️

    • @KayzeeFPS
      @KayzeeFPS 4 года назад +6

      Actually true, I was watching some other courses but gave up half way to come back to this goldmine of information and articulation.

    • @OmPrakash-vt5vr
      @OmPrakash-vt5vr 6 месяцев назад +1

      2024

  • @aidankennedy6973
    @aidankennedy6973 3 года назад +3

    It's amazing how intuitive these concepts become after watching these lectures. David Silver makes the math and theory behind RL that seems so hard to grasp impossible to forget.

    • @florentinrieger5306
      @florentinrieger5306 Год назад

      It is yet another example that the complexity of a subject really depends on the one who is teaching it.

  • @martagutierrez3858
    @martagutierrez3858 7 лет назад +21

    holy cow, in the last 30 minutes this lecture goes completely off the rails. The amount of concepts introduced and the number of slides shown increases exponentially towards the end :-)

    • @ChumpyZhang
      @ChumpyZhang 5 лет назад +8

      Yeah, I assume that's also how TD(lambda) works. You start at slow and then learn very fast. LOL

    • @alvinphantomhive3794
      @alvinphantomhive3794 5 лет назад +3

      ​@@ChumpyZhang yes it more effecient, but i guess it also carrying some bias alongs the way

    • @rotviepe
      @rotviepe Год назад +2

      If you can learn slow, you can learn fast

    • @le0nz
      @le0nz 2 месяца назад +1

      ​@@rotviepe😂 a ling ling bro learning RL

  • @willlogs
    @willlogs 4 года назад +2

    This is a loooot a of knowledge that probably took over 20 years to get this complicated and genius. The question is why would someone share it for free like this! It's a GIFT!

  • @robosergTV
    @robosergTV 6 лет назад +20

    golden information for free. Thanks DeepMind! We need more up to date course though :/

  • @trantandat2699
    @trantandat2699 4 года назад +1

    Very easy to understand ! outstanding lecture! outstanding teacher!

  • @ThuongPham-wg4bc
    @ThuongPham-wg4bc Год назад

    I don't really care about the math in the lecture, I just enjoy the moment listening David talking about state of the art RL and suddenly catch his idea.

  • @existenence
    @existenence 4 года назад +11

    This reminds me of the quote by Lex Fridman that goes, "All machine learning is supervised learning. What differs is how that supervision is done."

  • @SadmanSakibEnan
    @SadmanSakibEnan 6 лет назад +2

    Amazing lecture. Thanks for the upload.

  • @karthik-ex4dm
    @karthik-ex4dm 6 лет назад +3

    It cannot get any better than this 😘

    • @jg9193
      @jg9193 6 лет назад

      that's unfortunate

  • @kaspar8292
    @kaspar8292 4 года назад

    1:24:00 Why doesn't using old parameters have the same effect as propagating backwards in TD-learning would? Why doesn't it reverse the temporal flow of information and therefore make the Q-values diverge?

  • @TheKovosh
    @TheKovosh 5 лет назад +18

    it would be great, if this video had subtitels

  • @WahranRai
    @WahranRai 9 лет назад +5

    16:39
    - gradient is the direction of descent !

    • @ArulSelvam4
      @ArulSelvam4 8 лет назад +16

      +WahranRai No, Gradient is the direction of ascent and in most of gradient descent optimization we follow the direction of the negative gradient (for finding minima) math.stackexchange.com/questions/223252/why-is-gradient-the-direction-of-steepest-ascent

    • @kiuhnmmnhuik2627
      @kiuhnmmnhuik2627 7 лет назад +4

      Note that @WahranRai said "- gradient is the direction of descent !". Note the "-".

  • @ProfessionalTycoons
    @ProfessionalTycoons 6 лет назад +1

    very cool idea experience replay with fixed Q learning, this seems to be related to double Q learning.

  • @akshayrana9351
    @akshayrana9351 5 лет назад +2

    I can never comprehend the backward views and eligibility traces, although the forward view still makes sense.

    • @muratcan__22
      @muratcan__22 5 лет назад +2

      Backward view is doing online updates. You do that for every state in each time step. If you do that for every state and only apply those updates at the end of the episode it will be equal to the updates you would do in your forward view. In forward view we were updating the values of the states when they are seen. We were collecting all step returns weighted by appropriate lambda and applying to the state that is seen. The return is the return from that state onwards in this case. In backward view, we immediatly update the value of the state, actually all the states in each steps. It is easy to see that they are equal when lambda is 0. Eligibility trace loses its recursive part and it is only 1 when the actual state is the state we are going to update. And all sum of those updates that would happen throughout the whole episode is equal to the forward 0 lambda return which is actually 1 step return because we only care about the 1 step return in that case ( all the other step returns are 0 because they have lambda in front of their weights).

  • @abeaumont10
    @abeaumont10 5 лет назад

    This is a great lecture!

  • @timzhang6962
    @timzhang6962 8 лет назад +2

    43:10, do we need to explicitly store the eligibility trace of features for all states in a look-up table?

    • @notthispunk
      @notthispunk 8 лет назад +7

      If you learn each Q(S,A) directly you need to store the trace for each state. When learning the parameter vector w you only need to store the trace for each feature.

  • @rohanbanerjee1924
    @rohanbanerjee1924 2 года назад

    Does anyone have a reference to the "Gradient Q-Learning" approach that was mentioned around 1:11:27? I couldn't seem to find it online. In particular, I am curious how this approach actually computes the gradient of the TD target (especially how it handles the max term). Perhaps it is related to Gradient TD somehow?

  • @danieltjones01
    @danieltjones01 4 года назад

    Bottom of slide at 30:20ish. To take the dot product of those vectors shouldn't the state vector be transposed?
    Currently we have nx1 state vector and a nx1 weight vector. Giving is nx1,nx1; which won't multiply? We need 1xn,nx1 to give us a 1x1 output.

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      Not true. First of all, you are confusing scalar product with matrix multiplication. You are right in the sense that you obviously cannot multiply an nx1 matrix by an nx1 matrix, but that dot does not mean matrix multiplication. In mathematics, those vectors you are seeing are interpreted as vectors in the Euclidean space R^n (\mathbb{R}^n), and he is simply expressing two vectors of the Euclidean space as column vectors and writing a dot in-between to express scalar product. This is sometimes done in mathematics, physics, etc. Note that if I have two vectors x and y in R^n and I express vector x like a 1xn vector and y like an nx1 vector, the scalar product coincides with matrix multiplication. This is because in the Euclidean case, the dot product can be rewritten in terms of matrix multiplication of suitable matrices.
      I will add that in general, you just need an inner product space in order to define a scalar product, and a scalar product is defined BETWEEN ELEMENTS OF THE SAME SPACE. Your space could be R^n, a space of functions like L^2, A space of sequences, etc.

  • @MinhVu-fo6hd
    @MinhVu-fo6hd 7 лет назад +1

    at 43:28, he talked about the size of the eligibility traces. I dont' understand why its size is the same at the parameters' size but not the size of the state space. Does anyone know the answer for this?

    • @kiuhnmmnhuik2627
      @kiuhnmmnhuik2627 7 лет назад +3

      Let's assume we're using a linear approximation, that is: v(s,w) = f1(s) w1 + f2(s) w2 + ... fn(s) wn, where fk(s) is the k-th feature of the state s.
      In this case, the gradient of v in s is just the feature vector f(s). This will make what follows more intuitive. It's clear that if, say, f1(s) is much bigger than the other features, then we'll want to focus on w1 to increase/decrease v(s,w) as much as possible.
      Instead of using eligibility traces to assign credit to the states, we use eligibility traces to assign credit to the features which "summarize" the states.
      Let's say we have 4 states:
      1) it's raining and it's cold
      2) it's raining and it's hot
      3) it's sunny and it's cold
      4) it's sunny and it's hot
      The features are:
      r = it's raining
      c = it's cold
      which can take values 0 (false) or 1 (true).
      If we go through states 2 and 3, will just remember the features (r,c) instead of the states themselves:
      e

  • @guptamber
    @guptamber Год назад

    I am puzzled by some comments. I felt lectures in general towards the end are rushed and hand wavy. First 4-5 lectures are really good but after that it feels more fragmented. I could realize this more when I saw DQN slides, that is familiar topic to me and the way it was explained was very upside down and cryptic way.

  • @riccardoandreetta9520
    @riccardoandreetta9520 8 лет назад +13

    I guess you can't really understand the details, unless you make some exercises about it. Still didn't find good videos about this.

    • @AbhishekSharma0318
      @AbhishekSharma0318 7 лет назад +11

      Check this repository github.com/dennybritz/reinforcement-learning if you want to learn how to implement these ideas in python.

    • @luecu
      @luecu 6 лет назад

      Abhishek Sharma can not find it :(

    • @IgorAherne
      @IgorAherne 6 лет назад +1

      To understand TemporalDifference, have a look at this video ruclips.net/video/w33Lplx49_A/видео.html
      in particular 56:40

    • @piyushjaininventor
      @piyushjaininventor 6 лет назад

      Link is broken

    • @matsapo1
      @matsapo1 6 лет назад

      @@piyushjaininventor Remove the the ending parenthesis and the link will work.

  • @rilyoire
    @rilyoire Год назад

    This helped a lot

  • @sampathkumar1010
    @sampathkumar1010 7 лет назад

    At 47:15, why cant the second part \gamma^v(S',w) be left in terms of w, and and a new gradient be found with respect to the full function for optimization, instead of just taking the gradient part from V(S,w)

    • @sam41619
      @sam41619 7 лет назад

      I guess the gradient will not behave properly because we haven't seen the *actual effect* of being in state S', In other words, we can't know (before time) that how the predicted value of next state v(S',w) will change as a function of state S' and w... Since we don't yet know if being in state S' will be useful or not OR what will be the direction of gradient of v(S',w).. or will it increase or decrease with respect to S'

  • @willbrenton8482
    @willbrenton8482 3 года назад

    Anyone have some good resources for learning about sourcing feature vectors (23:30)

  • @chengjing9029
    @chengjing9029 3 года назад +1

    Sometimes I think that I learn the truth from the gesture

  • @notthispunk
    @notthispunk 8 лет назад +3

    How would you calculate the Q feature vector based on the action? Would the action considered be part of the feature vector? Otherwise how could my features (e.g. distance to the wall) depend on the considered action without explicit coding of how the environment works?

    • @KarfontaAlec
      @KarfontaAlec 8 лет назад +2

      I am struggling with this same question. Maybe we can help each other think through this. It is confusing because in the gridworld or pacman examples you know the next state by the current state and action you take, but for my problem, robot control, you don't know how an action will affect your state.
      One solution like you said was to add a feature for the action. I may be wrong on this, haven't yet tried, but I think that there should be one feature for each action we are considering, with corresponding weights. So when we are in a state and looking for an action to take we loop over each possible action where the Q(s,a) for a particular action will have the feature value for that action as 1, then each other action feature is 0 (ignoring them).
      Another approach I am looking at is to have a different set of weights for each action. This may be less efficient but could be more expressive. Will need to do some tests once both approaches are working.
      Finally I think the best approach, and the one I have the least understanding of, would be to have a neural network that produces an output for each action rather than a single value. This would seem to be the most efficient in terms of memory and CPU time. And could be a simpler implementation with a black box NN.
      What are your thoughts on these?

    • @CarrKnight
      @CarrKnight 8 лет назад

      I think perhaps that's why Powell defines the Q-value as a "Post-Decision State". For some implementations you do not need to add action as a feature. For example if you are playing chess you know immediately what the results are of taking your action (even though you can't predict the enemy movement so you don't have access to the transition function). So if your feature is "the enemy queen is in the game" YES/NO you can judge your possible actions without them appearing explicitly as features but that's because you can tell what your action is doing.

    • @neverfox
      @neverfox 8 лет назад

      +Alec Karfonta Both of your last two suggestions are what I've seen in practice. Silver even mentioned the latter in a previous lecture. Adding actions to the feature vector would violate the standard distinction made between s and a, so while it might work, you'd be cutting against the grain of all of the literature.

    • @KarfontaAlec
      @KarfontaAlec 8 лет назад

      I was finally able to get the value-function approximation working in practice on a crawling robot task. It works better and faster than the Q-Table implementation. I thought it would work better but take longer to train, it surpassed all my expectations. I have never before seen the robots move the way they are now.
      First I tried the method I suggested above to have a feature in the state for each action, it would be a 1 if the action was taken in the previous time step. This almost seemed to work, they would chase the immediate reward but would not take the extra step to take actions with no immediate reward to reach a state with an immediate reward.
      Then I tried the second method, where each action has it's own set of weights for the state features and this worked like a dream. Within an hour of training they had mastered the task, where the Q-Table would take overnight. So at a time step the state features are combined with each action's set of weights to find the max expected reward action. The next time step the weights for only the previous action are updated using the delta from expected and actual reward (immediate + future from the new state).
      Now I am experimenting with more complex robots. With the old Q-Table implementation adding a limb would cause the size of the table to explode. With the value-function I have the computational capacity for pretty much any number of actions and state features. So I am trying to train the same crawling robot with just one extra identical limb. Though I seem to have a new problem. Ideally the robot should be able to control both limbs (four motors) at the same time. How do you guys think I could adapt the Q update to take four actions per time step?
      Maybe instead of one update that considers all actions at once there are four separate updates for each motor. The weights would be a three-dimensional array like [motor][action][state_feature]. The motor dimension would be of size four, the action dimension thee (forward, stop, reverse). At each time-step four actions would be taken, the max action for each motor. The next time-step there would be four updates based on the action taken for each motor and the same previous state values. Though I'm worried this might exacerbate the credit assignment problem.

    • @Isaac668
      @Isaac668 7 лет назад

      The feature vector is built off of a set of observations. Take a self driving car for example, you would have sensors that detect your distance to the next object, your accelerometer, your pressure gauge, etc. and that will essentially be a generalization of your state. It doesn't matter if you're in your garage at home or in the garage at your office, if your observations are identical, your state would be the same (although your endgoal, and hence where you will find your reward, will be different).

  • @QDownGames
    @QDownGames 7 лет назад

    I am trying to find more information about the "Coarse Coding" function approximation method online (at approx. 1:01:00). Anyone have some useful links? It appears to be quite elusive.

  • @yatint9665
    @yatint9665 4 года назад

    There is probably an extension available for ur browser to increase the audio of the video....

  • @Zeeblog
    @Zeeblog 3 года назад

    I like the convention of referring to the information about the value of an action being given to us by an "oracle". It makes me think of the old woman from the Matrix saying "Hmmmm, I would've made pacman turn left there honey".

  • @yota07
    @yota07 6 лет назад +2

    Great course! the only negative is that each lecture is too long, so I tried to watch it at a higher speed. You can still watch it at speed x1.5 and it becomes around 1h each course.

    • @manishandu
      @manishandu 5 лет назад +1

      Graduate level classes are usually 2 hours long.

  • @KarfontaAlec
    @KarfontaAlec 7 лет назад +2

    Any ideas on how to adapt this to take multiple actions per time-step? I am working on a robot control problem, the simple crawling crate program described in the Berkeley RL lectures. The robot has two joints each with three actions; forward, hold, reverse. I can get it to work fairly well but when I add an extra limb it gets stuck in a local minimum of using just one arm and holding the other still. Ideally I would like the robot to control the torque on each motor but I do not know how I would set up the training examples for the NN. Right now the input is the angle and velocity of each joint and the output is the expected reward of each action. To get the actual target for a training example I use the Q-Value for the current step, the immediate reward plus the max expected reward from any action in the new state. So if there are four actions this Q-Value is used as the expected reward of the previous action. The other three action values in the sample are set to the current expected reward, so that only the value of the previous action is changed. I cannot think of how to adapt this to take multiple actions or output motor torques, any suggestions?

    • @sanjanavijayshankar5508
      @sanjanavijayshankar5508 4 года назад

      You have to discretize your action space. Use something like mgrid. Then use the index as a "scalar" control input

    • @sanjanavijayshankar5508
      @sanjanavijayshankar5508 4 года назад

      I would recommend DDPG for your application though

  • @zijianxu8595
    @zijianxu8595 5 лет назад

    In DQN, in order to avoid chasing a non-stationary target, we fixed (R+y*maxQ(S', A', w)) for a while so that we could chase a stable target. My question is that could we do the same thing to fix R+y*maxQ(S', A') in Q-learning?

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      In DQN you have weights, so it is the fact that you are changing your weights all the time due to gradient descent what motivates fixing those weights in the objective for a bit. You stabilise the objective. However, in normal Q-learning, the objective is not changing much every step, so intuitively to me it would not make sense.

  • @mohammadkhodadad1067
    @mohammadkhodadad1067 5 лет назад

    43:40 how can we apply eligibility trace on a model which some of its weights are from something black box ??

    • @alvinphantomhive3794
      @alvinphantomhive3794 5 лет назад

      every state transition occur, you immediately decay the eligibility traces for the entire state you had visited , but at the same time you also increase the eligibility traces but only for the current state. In function approximation, each state represented as the features vector.

  • @silwate
    @silwate 3 года назад +1

    please add captions

  • @emrahe468
    @emrahe468 6 лет назад

    The question at 1:22:00 was like gold.

  • @shaz7163
    @shaz7163 7 лет назад +1

    I don't understand how they train with experience replay in 1.17 . Is it like at each time step you take some state and target pairs that the model seen before and apply stochastic Gradient descent ? . Also how the improve policy at each time step . When using greedy policy which function approximator they use is it frozen one or the updating one ?

    • @jeromeblt1432
      @jeromeblt1432 6 лет назад

      I think it's each timestep. 40:30

    • @jeromeblt1432
      @jeromeblt1432 6 лет назад

      "Every step, what we do, we sample some mini-batch from our dataset ... optimise MSE between Q-network and Q-Learning targets" 1:18:34 . Also i suppose the 'i' in the loss function equation means "iteration". So calculate the loss each iterations (timestep)

    • @jeromeblt1432
      @jeromeblt1432 6 лет назад

      1:22:15 "any iterations 'i' " , "optimize a loss function"

  • @dantealexis7835
    @dantealexis7835 6 лет назад

    So is it that we don't store states explicitly unlike before where we had them in a look-up table? Or is it that we still have to store them, but now we can quickly estimate the values of similar states and add them to our table even without having to visit them?
    I'm trying to understand how not storing them will work when we have to take an action using action-values. If we don't have them explicitly stored, then we evaluate all actions possible in the current state in real-time using function approximation right?

    • @ajaybala8967
      @ajaybala8967 6 лет назад

      So, the point of Value Function Approximation is that you don't have to store the Q-value lookup table for all states and actions. (it may be extremely large for many problems). The approximation of the q-value can be used to "extrapolate" values of unvisited states and their values (that is what is called generalization). Think of it like this: you have a bunch of x -values and y-values and you want to predict approximately the value of x given a new and unseen value for y. So, we try to fit a function for known x and y values. This is exactly what supervised learning in ML does, but here we get the "training set" by interacting with the environment. Hope this helped.

    • @dantealexis7835
      @dantealexis7835 6 лет назад

      @@ajaybala8967 Thanks for the reply. Truth be told I figured it out and am halfway through my skl project using that exact same thing lol. Only problem now is that it is so computationally heavy that it is the first time i actually have to consider code optimisation haha. But thanks again for the help!

    • @ajaybala8967
      @ajaybala8967 6 лет назад

      No problem!

  • @adityagupta120
    @adityagupta120 6 лет назад

    Great Lectures.
    Slides link isn't working.

  • @focusonlife3242
    @focusonlife3242 3 года назад

    Tnx a lot 👍

  • @sanjanavijayshankar5508
    @sanjanavijayshankar5508 4 года назад +1

    46:49 this explanation though.

  • @davideabati1903
    @davideabati1903 7 лет назад

    I'm struggling with the objective function for DQNs. We're trying to learn w and we update them towards action value functions estimated with an older parameter vector w-. That feels wrong to me, what am I missing? I can see that we bootstrap towards the future (so we observe a reward and estimate the value at s', a'), but using w- from the past which is a worst estimate of the optimal parameter vector w* (hopefully) than the estimate I have now.

    • @sam41619
      @sam41619 7 лет назад

      I felt the same when i heard this idea

    • @robosergTV
      @robosergTV 6 лет назад +1

      did you watch leactures of regular tabular q learning? We did same thing there. While it may feel wrong 1. you can come up with a better method and publish a paper 2. we still inject information of the real reward, thus making our predictions more accurate, that's why it all works.

    • @davideabati1903
      @davideabati1903 6 лет назад

      @@robosergTV Thank you for your answer. Still, to this day, I feel like I'm missing something.
      I studied and understood table-lookup Q-learning, and I want to highlight that is not the TD target that bothers me (on the contrary, I am amazed by how TD learns about the return without ever observing it).
      What I felt odd about is the bootstrap towards *old* Q-targets, parametrized by w-.
      After all, I'm trying to improve my estimate of w, so I want to improve over w-.
      Anyway, I can grasp that, as you suggest, the observation of the unbiased reward R ensures an improvement.
      Cheers

    • @amnagrawal
      @amnagrawal 4 года назад

      @@davideabati1903 I am facing the same dilemma. Over here prof says that we fix the old target for 1000 (say) iterations. How do we ensure that the model trained during these iterations is improved when the error gradients themselves point to the worst estimate of Q*. In tabular Q-learning he told we take an alternate action as the target. Where my interpretation of it was from exploration perspective. Please tell me if I'm going in the wrong direction.

    • @edmonddantes4705
      @edmonddantes4705 Год назад +2

      This is one of the most interesting questions in this lecture's comments. My first observation is that in the look-up table case, the Q-learning updates are the same as for the DQN objective (with corresponding indicator function features, obviously) as long as WE DON'T FREEZE THE OLD PARAMETERS. Therefore the question of why freeze the old parameters at all is very natural. My second observation is that in the look-up table case, we are not moving the objective much at every step, in the sense that we only modify one pair of state-actions, so gradient descent won't go crazy by following a constantly moving objective. Note that although only one pair is moved, the change is clearly more violent in the table look-up case. Another reason why gradient descent will not go crazy in this case is the obvious one: it converges because it coincides with Q-learning and Q-learning converges under certain conditions.
      Now, given the two previous observations, I will summarise the intuition that I have. The look-up table Q-learning updates follow the Bellman optimality equation and converge under certain conditions, so in a sense, it is a proper way of making updates, of going towards the truth. However, when substituting those greedy updates by a gradient descent step, who guarantees that convergence will be retained? What is the theoretical basis? After all, a gradient descent step is smooth, it is all but greedy and violent in general (contrarily to the real Q-learning objective). My claim is that in order for convergence to be retained, gradient descent must update somehow similarly to look-up table Q-learning. However, whereas look-up table Q-learning updates one pair violently in a greedy way, a gradient descent step updates all the weights very slightly, and hence, if you want to converge towards your real objective, you probably need several of those updates in the full weight space. One update is not enough for you to know where the hell those weights are smoothly trying to go towards. Basically you need to approach that violent greedy action a bit more, since a single step might not be representative of that greedy improvement. Also, since after every gradient descent step, all the weights are slightly changing, this is probably not to good for convergence of gradient descent. Gradient descent is normally used when objectives are stable, not when objectives are literally changing every step. So it is good to wait for some steps and see where that cumulative gradient update is actually really wanting to go, and once you are convinced, you can change the weights :P.

  • @TheMikiwar
    @TheMikiwar 4 года назад +2

    1:11:01 didn't know Boris Johnson was interested in RL

    • @toanoopie34
      @toanoopie34 4 года назад

      Oh wow, it really does sound like him.

  • @timdockhorn5834
    @timdockhorn5834 6 лет назад

    Any idea what kind of features are used for the mountain car example? In my understanding, the user has to choose the features?

    • @exitudio
      @exitudio 6 лет назад

      Feature Vector is the output of the network (eg. linear regression, neural network) You don't do it manually. But you may set a size of the network output and let the network optimizes the Feature Vector by itself.

    • @alvinphantomhive3794
      @alvinphantomhive3794 5 лет назад +1

      as he said, the features consist the current position and velocity of the car. eg: when the car on 32 position and accelarate 80% from the maximum speed, then the car ends up in x position with the state-value function = 8.
      and that value function is use to improve the new policy and optimize future action.

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      The features are coarse coding in the first pictures (i.e. there are many circles and for each circle, a feature is defined as 1 for states on the circle and 0 otherwise), and radial basis functions (probably Gaussian densities with different means) for the other image where there is handwritten text. In both cases, the function approximators are linear.

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      ​@@alvinphantomhive3794 You are confusing features with states.

  • @shivinsaluja6085
    @shivinsaluja6085 6 лет назад

    We update the value of weights in Monte Carlo with Value function approximation(Incremental methods) at the end of multiple episodes right? And How does it solve the problem of space. Can someone please clarify?

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      The problem of space is called curse of dimensionality. A function approximator (linear features, neural net, wavelets, etc) uses way fewer trainable parameters than the number of total states (which could even be continuous). Note that by means of a neural network, you could compute a good estimate of the value of infinitely many states by only having a small number of trainable parameters. Why? Because a neural network is capable of interpolating and generalising if the architecture is chosen correctly. That's the beauty! Trainable parameters

  • @danieltjones01
    @danieltjones01 4 года назад

    Does the "action out" approximation around 11 minutes give us a dictionary as output?
    I.e. {Action 1: 7, Action 2: 6.5,...}

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      If you fixed the state, yes, you can view it like that. That's the computational way of interpreting it. The mathematical way would be a vector, each component containing the action-value of the fixed state given one of the possible actions.

  • @ishansrt
    @ishansrt 5 лет назад

    19:17 I died! Best ever physical interpretation of a mathematical equation

  • @rohansinghvi3061
    @rohansinghvi3061 4 года назад

    For TD learning with value function approximation, using the linear value estimator as the TD target, won't that damage the algorithm..?

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      Depends on how well you choose those features right? Feature selection is an art. Also, the look-up table case corresponds with linear function approximation.

  • @SSS.8320
    @SSS.8320 5 лет назад

    I can't get the intuition why eligibility trace becomes Et = .... + gradient of ^v(s,w) or gradient of ^q(s,a,w) w.r.t. to w. Can someone explain? Thank you very much in advance.

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      Either (1) think of it as a generalisation of the look-up table case, in which case it is clear that they coincide, since the gradients are delta functions (i.e. x(S_t) is all zeros except at the component corresponding with state S_t, where it is 1), or (2) compute it by following the same ideas as when showing that the backward view of TD(lambda) is equivalent to the forward view (this is in the extra material in the slides for lecture 4). I think the equivalence is only true in the linear case :). If you understand the proof in lecture 4, you should understand this.

  • @manolislykos
    @manolislykos 4 года назад

    Can someone explain the difference between gradient TD and the algorithms that was shown before in the lecture?

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      I am guessing gradient TD applies compatible function approximation in order to make updates in the real descent direction, but I would love to see a reference of what he means by gradient TD or gradient Q-learning. Again, I strongly suspect that they come from some improvements due to the policy gradient theorem by Sutton.

  • @judenchd
    @judenchd 6 лет назад

    Why there is E_pi in the objective function?

  • @a90440654
    @a90440654 5 лет назад

    Hi, guys. Im wondering if Feature Vectors need exactly same number of columns as number of states.

    • @luisbanegassaybe6685
      @luisbanegassaybe6685 5 лет назад +2

      No definitely not. The number of states is arbitrary, and a feature vector is canonically a 1 column vector with many rows. For example, the feature vector for chess could have 1 row for every point on a chess board, and the entry to that row could be a number representing what kind chess piece is occupying that point. Opponent’s rook, my queen, empty space, my opponent’s knight, et cetera.

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      Of course not. Typically, it is ill-advised to do that, since that would imply that your number of trainable parameters is equal to the number of states, and the purpose of function approximation is precisely avoiding the curse of dimensionality of huge state spaces by using a much smaller number of trainable parameters. It's like you are back in the look-up table case. What a torture 😂

  • @renney77
    @renney77 6 лет назад

    Does anybody here know if the Q estimates or weights have to stay in a range like 0.0-1.0?? or they can just be any real value?

    • @alvinphantomhive3794
      @alvinphantomhive3794 5 лет назад

      i think it could be any value, since
      the "weight update(delta w)" equation taken from the difference between the estimate value and the real value, and being multiplied by some epselon(step-size) towards the gradient over and over until the value converged to the global optimum

  • @brandomiranda6703
    @brandomiranda6703 5 лет назад

    what is the E_{pi}[ (v_pi(S;w) - v^hat(S,;w))^2], is it the true expecation of trajectories of the stoachsitc process given by the MDP? i.e. s1a1r1,s2a2r3,... ~MDP

    • @edmonddantes4705
      @edmonddantes4705 Год назад

      v_pi does not depend on w, but it is the true expectation, yeah. Trajectories are sampled from the MDP. If you want to be more rigorous, the density is the discounted state distribution
      ho^\pi defined in Preliminaries, page 2 of
      proceedings.mlr.press/v32/silver14.pdf
      (Deterministic policy gradient algorithms)

  • @eason2694
    @eason2694 7 лет назад +1

    why does Monte Carlo evaluation converge to a local optimum

    • @williamtan3857
      @williamtan3857 7 лет назад

      Because the target in MC learning is the real rewards (Gt).

    • @siddhartha3457
      @siddhartha3457 7 лет назад +3

      This is solely due to function approximation. The value function may not be a linear combination of the features, and so, MC with linear function approximation cannot converge to the global optimum in this case. "An analysis of temporal-difference learning with function approximation" proves that MC TD methods converge to the least squares solution when this is the case.

  • @simonhaxby830
    @simonhaxby830 8 лет назад

    Does the experience set D consist of just one episode? Or multiple episodes/runs?

    • @NithinVasisth
      @NithinVasisth 8 лет назад

      Multiple episodes/runs.
      (Also, He said in the lecture that we randomize them while sampling, so it shouldn't depend on which run it is)

  • @yuxinzhang9403
    @yuxinzhang9403 3 года назад

    What if we wrap the input of neural network in a function and treat each such single function as a state? That way it would be an MDP again and the lookup table method still works in choosing the right function to call, a program can program itself by solving this MDP.

  • @ImtithalSaeed
    @ImtithalSaeed 7 лет назад

    from where can I find the true value function to compare to find the error??

    • @sam41619
      @sam41619 7 лет назад +1

      the true value comes as the trajectories are unrolled, make the *function mapping* (approximation) of last sample/episode as the target for next, its just like supervised learning, except now the data is correlated (is kind of a sequence)

  • @ErwinDSouza
    @ErwinDSouza 5 лет назад

    For Experience Replay, how long do we hold on to old data? Do we flush it away after every update of the network? If so, then we lose valuable information from the early explorations and the network might unlearn that data. On the other hand, it's not easy to reuse the old episodes, because those 'target' values were computed based on the output of old networks that may now be outdated.

    • @AdamCajf
      @AdamCajf 4 года назад

      We don't flush it. It depends on what the size of the Experience Replay Memory is. It can be implemented as a cyclic buffer of a bounded size that holds the transitions observed recently: pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

  • @levin6975
    @levin6975 4 года назад

    37:00

  • @feishawnshanks775
    @feishawnshanks775 6 лет назад

    No subtitles?

  • @chane1950
    @chane1950 4 года назад

    someone needs to run the Independent component analysis algorithm and filter out the irritating background noise.

  • @akshayrana9351
    @akshayrana9351 4 года назад

    Another tutorial that explains Experience Replay: ruclips.net/video/AohtWrJwvgg/видео.html

  • @johnhefele5432
    @johnhefele5432 6 лет назад +1

    I would have liked this video, but then it wouldn't be 420

  • @woncoh1
    @woncoh1 5 лет назад

    37:09

  • @karolinadrabent2088
    @karolinadrabent2088 2 года назад

    wow