RL Course by David Silver - Lecture 2: Markov Decision Process

  • Published: 29 Sep 2024
  • Reinforcement Learning Course by David Silver - Lecture 2: Markov Decision Process
    Slides and more info about the course: goo.gl/vUiyjq

Comments • 478

  • @BRUNNSPARK
    @BRUNNSPARK 5 months ago +24

    If you are reading this in 2024 and wondering what value a 10-year-old course can offer, you are completely wrong. I have looked at many courses and lectures to find the best starting RL course, and this is the best. Force yourself to start with this.

    • @happydays334
      @happydays334 4 months ago

      Can we build the machine learning program/code with this course? I am interested, but I don't know whether I would learn how to implement the 'coding' part of it.

  • @dudley5424
    @dudley5424 8 years ago +319

    Thanks, Petyr Baelish. Great lecture series.

    • @Chr0nalis
      @Chr0nalis 8 years ago +8

      He lacks the moustache.

    • @rickyhan7023
      @rickyhan7023 6 years ago +5

      I'm impressed by the size of his head compared to his body.

    • @falcon20243
      @falcon20243 6 years ago +44

      RL is a ladder.

    • @NilavraPathak
      @NilavraPathak 5 years ago +2

      Nah looks more like Alex Honnold

  • @drq3098
    @drq3098 6 years ago

    Great lectures! Thank you! Can you please check: at 1:17:00, isn't the relation on policies total, because a partial order does not always have a maximal element?

    • @iAmUzaar
      @iAmUzaar 4 years ago +3

      I know this is an old comment, but here goes. Having a maximal element is not what distinguishes a partial order from a total order; being able to apply the comparison relation to any two elements is. Here we define the comparison relation as v_π(s) ≥ v_π'(s) for all s. However, it may be the case that there exist s*, s' such that v_π(s*) ≥ v_π'(s*) and v_π'(s') ≥ v_π(s'), which means the relation cannot be applied to the two policies: they are incomparable. Therefore the value function defines only a partial order.
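
      A tiny illustration of that point (toy numbers, not from the lecture): two hypothetical value functions each win on a different state, so neither policy dominates the other and the ordering is only partial.

        # The ordering "pi >= pi'" means v_pi(s) >= v_pi'(s) for ALL states s.
        v1 = {"s1": 5.0, "s2": 3.0, "s3": 1.0}   # hypothetical v_pi
        v2 = {"s1": 4.0, "s2": 4.0, "s3": 1.0}   # hypothetical v_pi'

        def dominates(va, vb):
            return all(va[s] >= vb[s] for s in va)

        print(dominates(v1, v2), dominates(v2, v1))  # False False -> the two policies are incomparable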

  • @subimalsengupta3726
    @subimalsengupta3726 6 years ago +3

    The transition probability matrix should have P(n,1) at the start of the nth row

    • @Bromine97
      @Bromine97 3 years ago

      Yes, I believe it should have P(n,n) in the nth row too.

  • @a.m.howlader609
    @a.m.howlader609 6 years ago +1

    How can I design the transition probability matrix for different states? Is it random? Please help. Thanks

    • @BGasperov
      @BGasperov 5 years ago +1

      Not sure I understood you well, but the transition dynamics are defined by the environment (they are part of the MRP/MDP tuple).

  • @43SunSon
    @43SunSon 4 years ago +1

    I have a question @37:04: where is the R=+1 from "state 4.3" to "state 0.8"? It should be -2 + 1 + 0.6*10 + 0.4*0.8. There are 2 possible outcomes, so why only consider the reward of "4.3 to 10"? Could anyone help me with this?

    • @jont7207
      @jont7207 4 years ago +1

      By the definition of the Bellman equation, you use v(s') (the number written inside the circles) rather than the reward R. Hence the v(s) of 4.3 is -2 + (0.4*0.8 + 0.6*10), where 0.8 is the v(s') of transitioning to the "0.8" state, i.e. the pub state.

    • @43SunSon
      @43SunSon 4 years ago

      @@jont7207 Thanks, but that doesn't answer my question. Could you advise me further?
      1. In the equation at the bottom @34:20 I am talking about Rs, not v(s') or v(s). I was asking why we don't consider the Rs from 4.3 to 0.8 but only from 4.3 to 10; in other words, shouldn't it be +1 + (-2) + (0.4*0.8 + 0.6*10)?
      2. Later in the video, the pub is not a state @1:07:35; maybe that's the reason?

    • @jont7207
      @jont7207 4 years ago +1

      @@43SunSon 1. Rs is the immediate reward just for being in state Class 3, which is -2. The +1 and +10 are rewards of successor states, Rs'. Also, the +1 and +10 are already accounted for in the second component of the Bellman equation, so if we did what you suggest, we would double-count those rewards.
      2. The pub being an action here is just an example for the MDP, because in an MDP the actions taken matter. Interestingly, the pub action can land you in several different states, which makes a better example than landing in just one state. The pub being a state in the earlier example is just arbitrary, I think.

    • @43SunSon
      @43SunSon 4 years ago

      @@jont7207 I almost got it. Could you confirm my understanding below? Thanks, buddy.
      1. Rs = -2 here is the immediate reward, but the +1 (near the 0.8 state) is NOT the immediate reward of leaving state "4.3" but of leaving "state 0.8". Yes or no?
      2. You said "the +1 and +10 are already accounted for in the second component". You mean the (0.4*0.8 + 0.6*10)? Yes or no?

    • @jont7207
      @jont7207 4 years ago +1

      @@43SunSon 1) Yes. 2) Yes. Keep in mind that 0.8 is the value function of that state, which already includes its +1 reward; the same goes for the 0.6*10 component. Remember v(s) = Rs + Σ Pss' * v(s'). For the "10" state, v(s') of its successor (the terminal state) is zero, so v(s) is just the immediate reward 10, which is why Rs = v(s) for the "10" state.
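
      A quick numeric check of the 4.3 node discussed above (numbers read off the slide at 37:04, gamma = 1), just to make the accounting explicit:

        # v(Class 3) = R_s + sum_s' P(s'|s) * v(s'); the +1 and +10 rewards already live inside v(s').
        v = -2.0 + 0.4 * 0.8 + 0.6 * 10   # 0.4 -> pub state (value 0.8), 0.6 -> pass state (value 10)
        print(round(v, 1))                # 4.3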

  • @NganVu
    @NganVu 4 years ago +151

    0:45 Markov Processes
    13:00 Markov Reward Processes
    43:14 Markov Decision Processes
    1:40:18 Extensions to MDPs

  • @adbocode
    @adbocode 7 years ago +292

    WARNING: take headphones off 42:25 -> 42:35 and 55:55 -> 56:05

    • @Stu49583
      @Stu49583 6 years ago +30

      I wish I read this comment before that

    • @shrishabharadwaj4811
      @shrishabharadwaj4811 6 years ago +3

      Me too.

    • @huahua71
      @huahua71 6 years ago +4

      I guess many find this comment after hearing the noise

    • @chaitragn3379
      @chaitragn3379 5 years ago +4

      Hahaha, I read this comment and went searching for what it was. It's true: take off your headphones or be prepared for it.

    • @prateekagarawal5311
      @prateekagarawal5311 5 years ago

      Thanks

  • @VinBhaskara_
    @VinBhaskara_ 6 years ago +212

    Fantastic. The lectures are so consistent, formal, and organized - mathematically as well as pedagogically. Optimal policy for teaching such a course!

    • @huaweijian2210
      @huaweijian2210 4 years ago +1

      One of the best lectures here at UCL CS :)

    • @astaconteazamaiputin
      @astaconteazamaiputin 4 years ago +4

      Indeed. One of the best lectures I have seen on a difficult topic. And his attitude is just... rare and priceless, actually. I feel the utmost respect for his pedagogical methods.

    • @mathavraj9378
      @mathavraj9378 4 years ago +2

      I see what you did there. nice

    • @matthiasho3216
      @matthiasho3216 2 years ago +2

      I totally agree. Mr David Silver is a great lecturer

  • @FreestateofOkondor
    @FreestateofOkondor 1 year ago +27

    It's crazy to me how much more of RL I've understood in these 2 beginning lectures vs multiple days of trying to understand what my professors have cooked up in their lecture. And David actually goes into even more depth.

  • @jacksong8748
    @jacksong8748 4 years ago +20

    1:41:24 did anyone else feel like an agent in a partially observable environment when the camera shut off while he was talking about partially observable environments?

  • @muzammilnxs
    @muzammilnxs 7 years ago +118

    This lecture series is a gem in reinforcement learning. Best material for RL as yet!!

    • @shivannawal7687
      @shivannawal7687 6 years ago +1

      I've heard of this course as well. Do you have a resource that provides the correct way to download the packages required for starting the course ? Thanks

    • @eyeofhorus1301
      @eyeofhorus1301 5 years ago +3

      I would expect nothing less from a guy at the forefront of AlphaZero

  • @mathavraj9378
    @mathavraj9378 4 years ago +102

    can we appreciate the audience too? so interactive and they asked all questions that made our foundation even better. Kudos to David for patiently answering everyone with equal dedication

  • @stevetang5131
    @stevetang5131 3 years ago +22

    It was the best RL course 5 years ago, and it still is to me. Thank you.

  • @xxgimpl0rdxx22
    @xxgimpl0rdxx22 4 years ago +52

    Those poor students who had to take all of this in in one sitting.

    • @17teacmrocks
      @17teacmrocks 4 years ago +7

      Not that much at the university level, actually; it's very standard and presented in an easy-to-follow manner. Also, I assume they had to have some prerequisites in math, probability, and graph theory to get into this class. MDPs and the Bellman equation are needed for solving optimization problems for stochastic systems.

    • @SergioGonzalez-gu1bk
      @SergioGonzalez-gu1bk 3 years ago +2

      ​@@17teacmrocks I love how at the end he starts questions and states that he is expecting that not everyone understood. I'm following because I read the topics before watching, that isn't a light read at all.

    • @Samuel-wl4fw
      @Samuel-wl4fw 3 years ago +1

      Just watched the video over 5-6 hours lol, took some notes meanwhile and talked about the difficult parts of the subject with my mate. Definitely a lot of stuff covered, and happy to be able to watch it in my own tempo, would be a little overwhelming to watch and understand in 2 hours.

  • @akarshrastogi3682
    @akarshrastogi3682 5 years ago +28

    David pointlessly pointing at himself from 23:08 - lol

    • @abhi220
      @abhi220 4 years ago +5

      So funny you pointed it out!

    • @mihailtal7550
      @mihailtal7550 4 years ago +14

      I keep missing what's really important in those lectures. Thank God we have people like you to guide us.

  • @43SunSon
    @43SunSon 4 years ago +6

    Is there a typo @1:22:08? q* = 8.4? I think it should be 9.4 = 1 + 1*(0.4*10 + 0.4*8 + 0.2*6), based on the equation for q*(s,a) @1:26:55. Can someone verify this with me?

    • @DeathStocker
      @DeathStocker 4 years ago

      That's exactly what I found weird. I think it has to be 9.4.

    • @43SunSon
      @43SunSon 4 years ago +1

      @@DeathStocker it is. I am right, i checked other comments, they said the same thing.

    • @DeathStocker
      @DeathStocker 4 years ago

      @@43SunSon thanks for confirming!

    • @kaspar8292
      @kaspar8292 4 years ago

      Yup, I just got stuck there too. Should be 9.4.

  • @adi331
    @adi331 2 years ago +6

    The part from 54:30 to 1:01:00 was a lot to take in. But it's actually not that much once you get a grip on it.
    First there is the *Bellman Equation* and the *Bellman Expectation Equation*
    The difference is that the *Bellman Expectation Equation* also incorporates actions.
    *Bellman Expectation Equation* = *Bellman Equation* + actions.
    After that we basically just have 4 variations of this:
    1. *Vπ* in terms of *Qπ*
    2. *Qπ* in terms of *Vπ*
    3. *Vπ* in terms of *itself*
    4. *Qπ* in terms of *itself*

    • @fahmikhoirurrijal9427
      @fahmikhoirurrijal9427 10 months ago +1

      I still don't understand the calculations behind the values for the student MDP at 53:30. It was said earlier that in an MDP there isn't just one expected value anymore, but here every node has one value?
      E.g., how do we get the +2.7 value for the second class?

    • @shresthapatir6495
      @shresthapatir6495 4 months ago

      @@fahmikhoirurrijal9427 If you understand it now, can you please explain?

    • @theahmedmustafa
      @theahmedmustafa 3 months ago

      ​ @shresthapatir6495 Idk if this a little too late but here is how you get it:
      The value of a state = the expected value of the actions that can be taken from it.
      The value of an action = its reward + the expected value of the states it leads to.
      There are two actions from that state: Sleep or Study.
      Let's take Sleep first. This action always leads to the terminal state, so the expected value of the states it leads to is 0. The reward is also 0, as stated on the slides.
      Value of Sleep = reward + expected value of the states it leads to = 0 + 0 = 0.
      The same can be done for Study. This action always leads to the next state, which has a value of 7.4, so the expected value of the states it leads to is 7.4. The stated reward is -2.
      Value of Study = reward + expected value of the states it leads to = -2 + 7.4 = 5.4.
      Now we have the values of all actions for this state.
      The value of the state is the expected value over its actions. In this case, the policy is a 50/50 choice of actions, so each action has probability 0.5.
      Value of state = expected value over actions = (0.5 * value of Sleep) + (0.5 * value of Study) = (0.5 * 0) + (0.5 * 5.4) = 2.7.
      So the value of this state is 2.7.
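
      The same explanation as a minimal Python sketch (values from the slide at 53:30, uniform 50/50 policy, gamma = 1):

        # v(s) = sum_a pi(a|s) * ( R(s,a) + sum_s' P(s'|s,a) * v(s') )
        q_sleep = 0.0 + 0.0       # reward 0, always leads to the terminal state (value 0)
        q_study = -2.0 + 7.4      # reward -2, always leads to the next class (value 7.4)
        v_class2 = 0.5 * q_sleep + 0.5 * q_study
        print(v_class2)           # 2.7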

  • @ogsconnect1312
    @ogsconnect1312 6 years ago +38

    Well done! At 6:05, I guess the lower matrix value should be Pn1

    • @luisjalabert8366
      @luisjalabert8366 5 years ago +2

      Same error at 39:30

    • @astaconteazamaiputin
      @astaconteazamaiputin 4 years ago

      Yes, it should. And he says this during the talk, so no doubt about it.

    • @adi331
      @adi331 3 years ago +1

      @@astaconteazamaiputin It's corrected in the slides at: www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf

    • @mohamedmusa7149
      @mohamedmusa7149 2 years ago

      @@luisjalabert8366 Yes, that is an error.

  • @MrBeatronic
    @MrBeatronic 8 years ago +43

    At 1:11:47, isn't optimal action-value function q*("going to pub") =1 + (0.2*6 + 0.4*8 + 0.4*10) = 9.4 instead of 8.4?
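
    For what it's worth, a one-line check (reward +1 for the pub action, gamma = 1, successor optimal values 6, 8, 10 as on the slide):

      q_star_pub = 1 + (0.2 * 6 + 0.4 * 8 + 0.4 * 10)
      print(round(q_star_pub, 1))   # 9.4, so the 8.4 on the slide seems to drop the +1 reward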

  • @olenapanasovska9900
    @olenapanasovska9900 3 years ago +3

    The transition probability of this course drops exponentially, from 1 million views (the first lesson) to 50 thousand over 10 lessons. Only 5% have made it out of Class 3 )))

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 4 years ago +13

    He has an amazing ability to explain complexity in such a clear way.

  • @valken666
    @valken666 8 years ago +24

    Great lecture! but volume is really low

    • @halibite
      @halibite 7 years ago +1

      Try increasing the volume with headphones.

    • @timothdev
      @timothdev 5 years ago +1

      Yep, I had to download it and use VLC to boost the volume.

    • @abhijeetsharma5715
      @abhijeetsharma5715 2 years ago

      Yeah, you can use a chrome extension to boost audio if you're on a PC. Suggestion: Volume Master Chrome Extension.

  • @existenence
    @existenence 4 years ago +2

    Which book is David referring to at 1:04:31??

    • @JohnHAdams-vo2pk
      @JohnHAdams-vo2pk 4 years ago +1

      Neuro-Dynamic Programming by Dimitri Bertsekas:
      www.amazon.com/Neuro-Dynamic-Programming-Optimization-Neural-Computation/dp/1886529108

  • @zhongchuxiong
    @zhongchuxiong 1 year ago +12

    0:45 Markov Processes {S, P}: used to describe the reinforcement learning environment.
    3:26 Markov Property: the current state contains all the relevant information from the history.
    4:35 State transition matrix: transition probabilities from s to s'.
    7:27 Example: student decision chain.
    13:00 Markov Reward Processes {S, P, R, γ}
    29:10 Bellman Equation for MRPs: v(s) = R_s + γ * Σ_s' P_ss' * v(s')
    39:01 Bellman Equation in matrix form.
    43:14 Markov Decision Processes {S, A, P, R, γ}
    46:15 Policies, for choosing actions: π(a|s)
    50:51 Value functions:
    state-value function
    action-value function
    53:37 Bellman Expectation Equation, following a policy:
    Bellman expectation equation for Vπ(s)
    Bellman expectation equation for Qπ(s,a)
    1:02:52 Bellman Expectation Equation in matrix form
    1:04:01 Q&A
    1:08:06 Optimal value function: q*
    1:15:41 Optimal policy: π ≥ π' if vπ(s) ≥ vπ'(s), ∀s (for all states)
    Theorem: for any MDP, there is an optimal policy that is ≥ all other policies.
    Theorem: all optimal policies achieve the same optimal value function.
    1:19:54 Finding the optimal policy.
    1:19:55 Bellman optimality equation:
    v*(s) = max_a q*(s,a)
    q*(s,a) = R_s^a + γ * Σ_s' P_ss'^a * v*(s')
    1:29:39 How to solve the Bellman optimality equation? It is non-linear, so use iterative solutions:
    value iteration (next lecture)
    policy iteration (next lecture)
    Q-learning (future lecture)
    SARSA
    1:31:49 Q&A
    1:40:18 Extensions to MDPs:
    continuous & infinite MDPs
    partially observable MDPs
    undiscounted, average-reward MDPs
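
    Since the matrix form at 39:01 keeps coming up in the comments, here is a minimal numpy sketch of it; the 3-state chain below is made up, only the formula v = (I - γP)^-1 R is from the lecture.

      import numpy as np

      # Hypothetical 3-state MRP (not the student example); rows of P sum to 1.
      P = np.array([[0.0, 0.9, 0.1],
                    [0.5, 0.0, 0.5],
                    [0.0, 0.0, 1.0]])    # last state is absorbing (terminal)
      R = np.array([-1.0, -2.0, 0.0])    # immediate reward for leaving each state
      gamma = 0.9

      # Bellman equation in matrix form: v = R + gamma * P v  =>  v = (I - gamma*P)^-1 R
      v = np.linalg.solve(np.eye(3) - gamma * P, R)
      print(v.round(2))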

  • @hobby_coding
    @hobby_coding 4 years ago +5

    i remember a time when someone could sneeze and raise 0 suspicions 19:49

  • @consecuencias.imprevistas
    @consecuencias.imprevistas 8 years ago +16

    If you're confused like I was at slide 32 (53:27) when he goes over the state-value function for the student MDP, the calculation for each node, using the look-ahead step from Bellman Equation is: v_pi(s) = E_pi[R_t+1 + gamma*v(S_t+1)|S_t = s].
    For instance, for the case of the -1.3 node, we calculate, given that the policy assigns 0.5 probabilities for all actions:
    0.5*(-2 + 2.7) + 0.5*(-1 + -2.3) ~= -1.3
    Where the first term of the sum is the contribution from the Study node, and the second term is the contribution from the Facebook node

    • @wonchulkim3008
      @wonchulkim3008 7 years ago

      Could you be more specific??? I don't know how to get -1.3

    • @MinhVu-fo6hd
      @MinhVu-fo6hd 7 years ago

      It is 0.5*[-2+2.7*(1)] + 0.5*[-1-2.3*(1)] = -1.3

    • @seongmukgang9843
      @seongmukgang9843 7 years ago

      Thanks very much.

    • @wahabfiles6260
      @wahabfiles6260 4 years ago

      I don't understand the difference between the state-value function and the action-value function. In the state-value function, the value of being in some state depends on the immediate reward at the next state. But we need an "action" to get to the next state, so isn't it essentially the action-value function?

  • @0mane0
    @0mane0 4 years ago +4

    Just to make it clear regarding the question asked at about 23:00:
    I actually think that the sequence is finite not by an arbitrary definition but because every loop in that scenario eventually terminates with probability 1.
    For instance, the "stay on Facebook" loop has a probability of 0.9^n of persisting for n stages, which tends to zero as n tends to infinity; thus the probability of (eventually) leaving the loop tends to 1.
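
    A quick numeric illustration of that argument (0.9 being the Facebook self-loop probability on the slide):

      # P(still stuck after n steps) = 0.9**n, so P(eventually leaving) -> 1.
      for n in (10, 50, 100, 500):
          print(n, 1 - 0.9 ** n)    # 0.65..., 0.994..., 0.99997..., ~1.0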

    • @andriesrosseau5048
      @andriesrosseau5048 4 years ago +2

      Yes, he kinda dismisses the question out of hand, but the answer is more mathematically subtle than he makes it out to be.

    • @DonMalchimo
      @DonMalchimo 2 years ago

      I was looking for a comment like this, because the answer to this question in the video confused me. Thanks. :)
      I think this can also be seen from a short analysis of the transition matrix, e.g., the eigenvalues, which are all strictly between zero and one except for the one associated with the "Sleep" state, which is one. This means that P^n, as n goes to infinity, converges to a matrix of zeroes except for the "to Sleep" column, which is all ones.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Yes, episodic Markov processes terminate almost surely :). It is subtle. Also MDPs with a.s. absolutely bounded immediate rewards have finite discounted return (gamma < 1). But he is not very rigorous, he just wants to convey the main ideas.

  • @SimonLimygoogleplus
    @SimonLimygoogleplus 8 years ago +10

    Why is there a period of time muted? Is this a copyright concern?

    • @alexnichol3138
      @alexnichol3138 7 years ago +1

      that's what i was wondering (around 35:00).

    • @paulgarcia2887
      @paulgarcia2887 4 years ago +1

      cannot reveal the top secret information

  • @BobTheZealot
    @BobTheZealot 7 years ago +9

    Those mic knocks hurt cuz they are so much louder than his voice

  • @andyyuan97
    @andyyuan97 8 years ago +16

    Can someone enable the automatic CC?

    • @abhi220
      @abhi220 7 years ago +12

      by someone you mean, the guy who uploaded the video?

    • @wahabfiles6260
      @wahabfiles6260 4 years ago

      @@abhi220 lol

  • @sreenivasanac
    @sreenivasanac 6 years ago +14

    WARNING: TAKE HEADPHONES OFF 55:50 - 56:05 AND ALSO FROM 42:25 -> 42:35 VERY LOUD NOISE

  • @kaspar8292
    @kaspar8292 4 years ago +2

    Am I wrong, or should q* of a=Pub at 1:13:10 be 9.4? In the example of (s=Class 2, a=Study), the optimal action-value function is calculated as q* = R + sum(P(s') * v*(s')) = -2 + 1.0*10 = 8.

  • @whaloob
    @whaloob 4 years ago +4

    This guy is an impressively good lecturer. Easily comparable to, if not better than, the best profs I’ve had so far in 7 semesters of university.

  • @Moustafa865
    @Moustafa865 7 years ago +9

    I think there is a problem with the State transition matrix. First entry in the last row has a wrong subscript index. It should be {n,1} instead of {1,1}.

    • @Alley00Cat
      @Alley00Cat 7 years ago +2

      A student in the class mentions this

    • @Avivalious
      @Avivalious 6 years ago

      I agree

    • @nittur1008
      @nittur1008 6 years ago

      If it were {n,1}, would it even make sense? It would be the probability of going to state 1 given state n, right?

    • @Morpho32
      @Morpho32 6 years ago

      It does make sense. If we are in state n, we can go to state 1. Unless state n is "absorbing", in this case we cannot go anywhere after reaching state n but it would still appear in the transition matrix with values 0.

    • @HellYeah223
      @HellYeah223 6 years ago

      I think you might be confusing states and steps. Assume you can go from state n to state 1, but not from step n back (in time) to step 1.

  • @jordan4890
    @jordan4890 4 years ago +6

    "The future is independent of the past given the future" is crazy deep when you think about it. How many people lament over past transgressions and stress over "how it should have gone down" when in reality obtaining your goals is much more along the lines of "what can I do right now in this moment"

  • @zerox2536
    @zerox2536 1 month ago +1

    Can the discount factor be equal to 0 or 1? I think the brackets are incorrect.

  • @poojanpatel2437
    @poojanpatel2437 6 years ago +2

    At 5:09 the first element of the last row of the matrix should be Pn1, not P11. Refer to the latest slides for the correction: www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf

  • @alfcnz
    @alfcnz 6 years ago +3

    What's the convention he's mentioning at 31:48, of the reward indexing?

    • @alfcnz
      @alfcnz 6 years ago +1

      "Sutton & Barto convention" and he's referring to this textbook incompleteideas.net/book/the-book-2nd.html

  • @astaconteazamaiputin
    @astaconteazamaiputin 4 years ago +4

    Thank you for the question at 1:04:30 about states turning into actions / decisions, I appreciate the clarification !

  • @willlogs
    @willlogs 4 years ago +2

    Why would 59 people dislike this free gold?!

  • @jont7207
    @jont7207 4 years ago +3

    37:18 How do we calculate the value of states that have circular transitions? i.e. the value of class 3 (v(s)=4.3) depends on the value of pub (v(s)=0.8). How do we arrive at 0.8 for the pub then if the value of the pub depends on the value of class 3?

    • @jont7207
      @jont7207 4 years ago +2

      I guess this is covered in the dynamic programming of the next lecture. It can be solved iteratively; for the Bellman expectation equation of an MRP there is also the closed-form matrix solution v = (I - γP)^-1 R, but the Bellman optimality equation has no closed form, hence the iterative dynamic programming.
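
      A minimal sketch of that iterative idea on a made-up two-state loop (not the student chain): repeatedly applying the Bellman backup resolves the circular dependency because the updates converge to the unique fixed point.

        # Hypothetical MRP: states "a" and "b" point at each other and can also terminate.
        R = {"a": -2.0, "b": 1.0}
        P = {"a": {"b": 0.4, "end": 0.6}, "b": {"a": 0.5, "end": 0.5}}
        v = {"a": 0.0, "b": 0.0, "end": 0.0}            # arbitrary initial guess

        for _ in range(100):                            # repeated synchronous Bellman backups
            v = {s: R.get(s, 0.0) + sum(p * v[s2] for s2, p in P.get(s, {}).items())
                 for s in v}
        print(round(v["a"], 3), round(v["b"], 3))       # converges to v(a) = -2.0, v(b) = 0.0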

  • @yibingevelynwei278
    @yibingevelynwei278 4 years ago +3

    1:13:18 Can someone please tell me again how 8.4 is calculated for the Pub action? Why do I get 9.4 = 1 + 0.2*6 + 0.4*8 + 0.4*10?

    • @harrivayrynen
      @harrivayrynen 4 years ago

      Easy answer: you get 8.4 when you forget to also add the part where R = +1.

  • @kiuhnmmnhuik2627
    @kiuhnmmnhuik2627 7 years ago +2

    @1:37:00 Here's an easier example to understand the idea behind the optimality conditions.
    Let's say you want to find the shortest path from A to C and someone gives you a supposedly optimal solution which passes through B, i.e. the (supposedly) optimal path is A->B->C. If the path A->B is not the shortest path from A to B, then the total solution can't be optimal because we can find a better path A~>B and then form A~>B->C which is shorter than A->B->C.
    Now let's say someone gives you a supposedly optimal policy pi. This is more tricky than the shortest path example because of recursion. You notice that pi doesn't always choose the optimal action at every given time. In other words, you compute q_pi for the policy pi and you notice that
    q_pi(s,a) = R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a') <
    R(a|s) + gamma sum_{s'} P(s'|s,a) max_{a'} q_pi(s',a')
    for some pair (s,a).
    Let pi' be the improved policy which always chooses the best action according to q_pi. We get
    q_pi(s,a) = R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a') <
    R(a|s) + gamma sum_{s'} P(s'|s,a) max_{a'} q_pi(s',a') =
    R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi'(a'|s') q_pi(s',a')
    Note that we have pi' and q_pi (not q_pi'!) in the same expression. We would like to have q_pi' instead of q_pi.
    We can update q_pi with the new estimations and then repeat the maximization process. Note that at each maximization step we increase our estimation of q_pi until we converge to q_pi'. Since we're always increasing q_pi, we must have
    q_pi < q_pi'
    which means that pi' is strictly better than pi. I hope this makes sense. I've never read the official proof and this is just the sketch of one direction of the proof (i.e. optimality => optimal condition).
    In conclusion, whenever q_pi doesn't satisfy the Bellman optimality equation (BOE), we can increase q_pi until, estimation after estimation, we converge to a q_pi' which satisfies the BOE and such that pi' > pi.
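
    A compact sketch of the improvement step being argued for here, acting greedily with respect to q_pi; the q-table below is made up purely for illustration.

      # Hypothetical q_pi(s, a) values for two states and two actions.
      q_pi = {"s1": {"left": 1.0, "right": 2.5},
              "s2": {"left": 0.3, "right": -1.0}}

      # Greedy improvement: pi'(s) = argmax_a q_pi(s, a).
      pi_improved = {s: max(actions, key=actions.get) for s, actions in q_pi.items()}
      print(pi_improved)   # {'s1': 'right', 's2': 'left'}
      # Re-evaluating q under pi' and repeating this step is policy iteration, which converges to q*.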

  • @lawlietlight2592
    @lawlietlight2592 2 years ago +1

    There is a typo in the transition matrix subscript at 40:12; it should have been Pn1.

  • @angelachikaebirim1043
    @angelachikaebirim1043 5 years ago +5

    Prof. Silver is a brilliant teacher. Really enjoying this

  • @Chanakya06
    @Chanakya06 3 years ago +8

    Best course material to start Reinforcement learning. However the math seems to be difficult at least for me.

  • @Enderman-en3dv
    @Enderman-en3dv 4 years ago +2

    1:12:20 Shouldn't the q* value for the Pub be 9.4? It's 1 + (.2*6 + .4*8 + .4*10) because of the R+1 of going to the pub, right?

    • @Zantorc
      @Zantorc 4 years ago

      The general consensus is that you're right, it should be 9.4 which is what I got.

    • @43SunSon
      @43SunSon 4 years ago

      @@Zantorc same here! I think should be 9.4, not 8.4

  • @tylersnard
    @tylersnard 4 years ago +4

    I looked everywhere for a comprehensible explanation of RL. This is easy to understand and awesome. Thank you!

  • @FedericoBaldassarre
    @FedericoBaldassarre 6 years ago +5

    My question is related to when the reward for a state-action pair is computed in a Markov Decision Process.
    When showing the Student MDP on page 25 (min 45:00), the reward for (study, pub) is given immediately after the action is taken and exiting the state. This is compatible with the way we indicate the reward as R^a_s, meaning that it is defined by the action you've taken and the state you're exiting.
    However, later on in the video (min 56:30) and on the slides (page 32) it looks like the reward is given when the environment pushes you to the next state, once the action has been chosen. In formulas, it seems like r depends on the action and the next state, so R^a_{s'}
    To me, it makes more sense that the agent receives the reward as a function of the action and the state it'll end up in, rather than the state it is leaving. Consider a game where one can decide to attack or run: if I decide to attack and the environment decides that I should be defeated, I would like to get a negative reward immediately and not later.
    So, which one is the right way of thinking? Do the equations and the plot need fixing or am I misunderstanding something?

    • @MsCalcioo
      @MsCalcioo 5 years ago +5

      Hi,
      The initial example is an example of Markov Reward Process. So, the reward function only depends on the state (you can check the slide for the definition of MRP). Therefore, you get a reward at time t+1 when you exit the state at time t no matter what the action is. The later example is Markov Decision Process. In MDP, your reward function definition also considers which action you take. So you take some action and exit the state at time t ("study" or "pub") and get a reward at time t+1 based on your actions and where you'll end up with.

    • @ercanatam6344
      @ercanatam6344 2 years ago

      @@MsCalcioo You seem to be an expert on MDPs. I have a question: "Pub" was a state in the MRP example, but in the MDP it is no longer a state. Why? This confuses me. Moreover, rewards in MDPs are shown on the arc from (s,a) to s', but in the MDP example this convention is not followed.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      @@ercanatam6344 The states in the student MDP are literally the same as the states in the MRP, David just did not write the names in the circles/squares to avoid saturating the notation. What is confusing you is that he is using the words "pub" or "study" in order to describe actions as well. He means "go to pub" or "go to study". If you carry out the action "go to study" or "go to pub", you end up in the states "study"/"pub". Sometimes "go" could also mean "keep in", as in "keep in the pub", "keep studying", but I hope you get what I mean.

    • @fahmikhoirurrijal9427
      @fahmikhoirurrijal9427 10 months ago

      @@edmonddantes4705 I still don't understand the calculations behind the values for the student MDP at 53:30. It was said earlier that in an MDP there isn't just one expected value anymore, but here every node has one value?
      E.g., how do we get the +2.7 value for the second class?

    • @edmonddantes4705
      @edmonddantes4705 10 months ago

      @@fahmikhoirurrijal9427 For any MDP, given a fixed policy pi, every state has a value given by the state-value function. The state-value function is literally defined in the previous slide and the chosen policy pi is uniform, so I am not sure what you are missing. Anyway, he applies the formula in the previous slide (you can do Monte Carlo sampling for an approximation or apply the Bellman expectation equation). PS: By the way, any MDP with a fixed policy is an MRP.

  • @opalpearl3051
    @opalpearl3051 1 year ago +2

    I have watched lectures from MIT and Stanford and David blows them away.

  • @rl-rc7kb
    @rl-rc7kb 7 years ago +16

    Unfortunately, important part has been muted. :(

    • @Qiang12345
      @Qiang12345 7 years ago +1

      yako taki muted part is not hard to understand

    • @snaawflake
      @snaawflake 7 years ago +2

      NADS IQ its not called reverse engineering.

    • @jerryc5716
      @jerryc5716 5 years ago

      @@snaawflake Can we reverse the polarity?

  • @BruinChang
    @BruinChang 2 years ago +1

    I have a question around 57:00. After an agent takes an action, should the next state not be specified? If the next state is specified, how can I take the expectation over different states?

    • @kaierliang
      @kaierliang 2 years ago +1

      The environment decides which state you end up in. Sometimes there is only one option, sometimes there are multiple options, like with the pub action.

  • @krishcoolguy1
    @krishcoolguy1 6 years ago +2

    Why was the audio muted at 35:00 - 35:40? It seems to be an interesting part. Can anyone please explain what Mr. David explained during those 40 seconds?

    • @kevingatimu8319
      @kevingatimu8319 5 years ago +2

      It's simply a look-ahead diagram to evaluate the value of the current state, v(s). To calculate v(s), we add 2 values: (1) the reward for being in that current state, i.e., R_s, and (2) the discounted probability-weighted sum of the values of all possible future states one step later from s.

  • @RomarioGildasFokoTiomela
    @RomarioGildasFokoTiomela 7 months ago +1

    Absolutely fantastic !! Thank you Dr. Silver !

  • @subhambhowmick8044
    @subhambhowmick8044 5 years ago +2

    On the "State Transition Matrix" slide at 6:19,
    should the matrix be:
    P11.....P1n
    ..................
    Pn1......Pnn

    • @leandronw1
      @leandronw1 4 years ago

      It should. A student in that class asked about that right after he was done explaining it.

  • @hanwang2561
    @hanwang2561 4 years ago +2

    At 6:05, the lower matrix value should be Pn1

  • @ichbins161
    @ichbins161 3 years ago +2

    At 37:18, why is there no reward for going to the pub? I.e., why is it 4.3 and not 5.3?

  • @zhenyueqin6910
    @zhenyueqin6910 6 years ago +16

    Another WARNING: VERY LOUD phone dropping sound! Take headphones off from 55:51 to 56:06

  • @elio3101
    @elio3101 6 years ago +8

    Great lecture. I really liked the approach, building from MP to MRP and finally MDP. It was well connected, like listening to a story. Thank you very much!

  • @WWEMaster
    @WWEMaster 9 months ago

    when did Christian Bale start teaching??😆😆

  • @Jacob-wp4km
    @Jacob-wp4km 6 years ago +3

    This intro to MDP is seriously high-quality. Informative yet understandable.

  • @VladislavProkhorov-sr2mf
    @VladislavProkhorov-sr2mf 7 years ago +3

    at 1:26:36 I think it should be max[R(a,s) + lambda*sum(P(a,s,s')*V(s'))]

    • @kiuhnmmnhuik2627
      @kiuhnmmnhuik2627 7 years ago

      Yes, but I'm not sure it's a mistake. Maybe he's using the convention that the max "captures" everything on its right. Who knows... I would use the parentheses.

    • @damianwysokinski3285
      @damianwysokinski3285 2 years ago

      Yeah, it should be the same like you specified

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      ​​@@kiuhnmmnhuik2627 It is a mistake! You can see he uses the correct version in other lectures, plus the Bellman Optimality Equation is pretty well-known and the max affects all the terms on the RHS. Compare with any source in the literature. You can even compute it yourself with a bit of careful work.

  • @thepablo22
    @thepablo22 3 years ago +1

    What is the meaning of the E? (minute 23:19)

    • @Alexihem
      @Alexihem 3 years ago +1

      It is the expected value of the total discounted reward, given a state s. The reason is that we are dealing with a stochastic process, so every sampled trajectory can be different and we need to average over their returns.

  • @_jiwi2674
    @_jiwi2674 3 years ago +3

    audio is too low

  • @ПавелСеменов-з7у
    @ПавелСеменов-з7у 4 years ago +1

    the overall explanation is ok, while the student diagram is a bit confusing

  • @juleswombat5309
    @juleswombat5309 7 years ago +2

    Yeah, well, David really now needs to write a book on this subject material. Enjoying the videos, but we really need a captured analysis. There are few books on RL; Sutton's book is nearly twenty years old (the draft has been updated for next year), but this lecture series is easier to comprehend than the standard texts. Any book needs to have the algorithms described, to show how the theory can be applied. Well done.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Most of the material is taken from Sutton & Barto, even most of the pictures, examples, etc, so that book you are talking about is already written.

  • @achyuthvishwamithra
    @achyuthvishwamithra 3 years ago +1

    In practice, must we program an agent to store the optimal action calculated for each state in the state's data structure (thus adhering to the Markov property), or will the agent calculate the optimal value at every step, every time?

  • @jamesr141
    @jamesr141 2 years ago +1

    David is an amazing teacher. Because he loves the subject. Now I love it, too. Thanks David. Thanks Deepmind.

  • @kiuhnmmnhuik2627
    @kiuhnmmnhuik2627 7 years ago +5

    I just want to point out that the recursive formulas are very easy to understand and derive if one disentangle them. For instance, let's say *s* is the current state, *a* an action and *s'* the new state after the action, i.e. s->a->s'. Then
    v_pi(s) = sum_{a,s'} P(a,s'|s) [R(a|s) + gamma v_pi(s')]
    where P(a,s'|s) is the probability of taking action *a* and ending up in state *s'* given that we are in state *s*.
    Now we note that P(a,s'|s) = P(s'|s,a)P(a|s) = P(s'|s,a)pi(a|s). Also, sum_{a,s'} (...) is sum_a sum_{s'} (...), so, we get
    v_pi(s) = sum_{a,s'} P(a,s'|s) [R(a|s) + gamma v_pi(s')]
    = sum_a sum_{s'} P(s'|s,a)pi(a|s) [R(a|s) + gamma v_pi(s')]
    = sum_a pi(a|s) sum_{s'} P(s'|s,a) [R(a|s) + gamma v_pi(s')]
    = sum_a pi(a|s) [ sum_{s'} P(s'|s,a) R(a|s) + sum_{s'} P(s'|s,a) gamma v_pi(s') ]
    = sum_a pi(a|s) [ R(a|s) + gamma sum_{s'} P(s'|s,a) v_pi(s') ]
    which is the formula @57:38.
    Same thing with q_pi(s,a). We start from state *s* and do action *a*. What happens is that we end up in state *s'* and then do another action *a'*. In other words, the sequence is s->a->s'->a'. Since we know *s* and *a*, but *s'* and *a'* are random, we compute the mean of the total reward over *s'* and *a'*:
    q_pi(s,a) = sum_{s',a'} P(s',a'|s,a) [R(a|s) + gamma q_pi(s',a')]
    That's it!
    Now note that
    P(s',a'|s,a) = P(a'|s,a,s')P(s'|s,a) = P(a'|s')P(s'|s,a) = pi(a'|s')P(s'|s,a)
    where we used the fact that *a'*, given *s'*, is independent of *s* and *a*.
    Therefore, similarly to before, we get:
    q_pi(s,a) = sum_{s',a'} P(s',a'|s,a) [R(a|s) + gamma q_pi(s',a')]
    = [sum_{s',a'} P(s',a'|s,a) R(a|s) ] + [sum_{s',a'} P(s',a'|s,a) gamma q_pi(s',a')]
    = R(a|s) + gamma sum_{s',a'} P(s',a'|s,a) q_pi(s',a')
    = R(a|s) + gamma sum_{s'} sum_{a'} pi(a'|s')P(s'|s,a) q_pi(s',a')
    = R(a|s) + gamma sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q_pi(s',a')
    which is the formula @59:00.
    This is little more than basic probability.
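
    For anyone who prefers code to sums, here is a direct transcription of those two backups; all the containers (pi[s][a], R[s][a], P[s][a][s'], and the current estimates V, Q) are hypothetical placeholders, and each function computes one application of the right-hand side.

      def v_backup(s, pi, R, P, V, gamma):
          # v_pi(s) = sum_a pi(a|s) [ R(a|s) + gamma * sum_s' P(s'|s,a) v_pi(s') ]
          return sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in P[s][a]))
                     for a in pi[s])

      def q_backup(s, a, pi, R, P, Q, gamma):
          # q_pi(s,a) = R(a|s) + gamma * sum_s' P(s'|s,a) sum_a' pi(a'|s') q_pi(s',a')
          return R[s][a] + gamma * sum(P[s][a][s2] * sum(pi[s2][a2] * Q[s2][a2] for a2 in pi[s2])
                                       for s2 in P[s][a])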

  • @mangaart3366
    @mangaart3366 2 years ago +1

    What an amazing lecture! I had so much fun watching it and it helped clarify some things in my head about MDP.

  • @rameshmaddali6208
    @rameshmaddali6208 5 years ago +1

    I can't resist commenting. These presentations are much easier to learn from; I had been scratching my head for a week. Fortunately I found your video presentations. Awesome, Dave Gold*.

  • @yoloswaggins2161
    @yoloswaggins2161 7 years ago +8

    Got my crew of Markov reward values, G unit.

  • @周华-i1l
    @周华-i1l 7 years ago +2

    I don't know why the "pub" status stays in MRP but disappear in MDP. Why?

    • @scorpiocorps
      @scorpiocorps 7 years ago +2

      You're right, that's not clear at first sight.
      I believe that's because in MDP, we have 'actions' and after choosing an action, environment can be in different states with some probability. So, when we take 'Pub' action following a policy, we don't know exactly which state we end up in.
      In contrast, MRP has no concept of actions, so you just end up in a state with some probability and get the associated reward.
      EDIT: There's a discussion of this in 1:04:30

    • @monishravulapally9097
      @monishravulapally9097 3 years ago

      Yes, an MRP is more like looking at the possible values and the probabilities of each scenario, whereas an MDP is about taking an action, i.e. the optimal one.

  • @natureoflife5490
    @natureoflife5490 5 years ago +1

    Great lecture; however, it would be easier to follow if there were auto-generated subtitles.

  • @alessandrobottero6544
    @alessandrobottero6544 3 years ago +1

    At minute 27:25, aren’t the values wrong, given that the discount factor is 0? For example, the value of the state of Class 2 should be -2 times the probability of ending up in Class 3 plus 0 times the probability of ending up in the sleep state, and this sum does not give -2. Is this correct?
    This comment is based on the fact that when γ=0, then G_t = R_(t+1), while the values on that slide would suggest G_t = R_t.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      Yeah, that value is definitely wrong. Should be -1.6. Nice catch:)

  • @robinkhlng8728
    @robinkhlng8728 3 years ago +1

    Awesome lecture! But at 1:14:00, shouldn't q* for going to the pub be 9.4? q* here seems to exclude the reward of taking the action of going to the pub.

  • @limsiryuean5040
    @limsiryuean5040 21 days ago

    Although the lecture series overall has good explanations, it does assume the audience is somewhat knowledgeable in this specific domain. For example, what does "1 sweep" mean? What does "fix the policy" mean, and what are the criteria for the fix?
    For beginners with decent comp-sci knowledge, it's best to look for supplementary materials to pair with the lecture series.

  • @nanchen1554
    @nanchen1554 21 days ago

    The book he refers to is Neuro-Dynamic Programming by Dimitri P. Bertsekas?

  • @patrickadjei9676
    @patrickadjei9676 21 days ago

    At ruclips.net/video/lfHX2hHRMVQ/видео.html , he seems to be multiplying the policy probability by the immediate reward as well? Is that correct? It is not consistent with his previous calculations.

  • @maayowa
    @maayowa 2 months ago

    This might be somewhat trivial, but in the explanation of the student reward problem, can anyone explain what the negative and positive rewards indicate, i.e. R=-2 and R=+10? He explained that R refers to the reward associated with a state, and that the agent receives it when it takes any action. I imagine in the case of Class 3 the agent has reached the "goal", but I can't quite make that out for the pub. I might be overthinking this.

  • @reinventing_the_circle
    @reinventing_the_circle 1 month ago

    Went to the pub, ended up sleeping, 3 out of 5 stars for inaccurate markov process

  • @TahlilAhmedchowdhury
    @TahlilAhmedchowdhury 5 years ago +1

    I must say this is by far the best course, whether you want to learn deep or shallow RL. It's so intuitive and makes the basic RL concepts easy to understand. The current CS285 course has great material, but for me it's hard to understand. I repeatedly come back here to clear up concepts. David Silver is a great teacher.

  • @JohnHAdams-vo2pk
    @JohnHAdams-vo2pk 4 years ago +1

    Fantastic Lectures. I use them along with the Sutton/Barto text book. Excellent stuff

  • @Davidicus000
    @Davidicus000 2 months ago

    when you climb over the maths wall, be sure you do not leave the drawbridge down, otherwise too many people will share in the wealth of this knowledge.

  • @Hasan-sn2gb
    @Hasan-sn2gb 5 months ago

    Very good course! I feel the audience is not ready for this course, judging by the low-quality questions.

  • @1060320
    @1060320 4 years ago +1

    At 1:07:40, when he says stochastic transitions, the transitions involve two separate agencies.
    So the agent takes an action from a state,
    and the environment, depending on the action, puts you in a new state.
    Correct?

    • @indianbeatboxersunited4311
      @indianbeatboxersunited4311 4 years ago +1

      Yup somewhat like that. So depending on your action, there is an independent randomness operating which helps in deciding the state you end up in. Deterministic transition would be where your action solely decides which state you end up in. Stochastic transitions adds randomness to the transition from one state to the next state

  • @sevgicanpolat2841
    @sevgicanpolat2841 7 months ago

    In 1:23:26 shouldn't the value of Facebook state be like 5 if we always go backwards to calculate q values?

  • @notanape5415
    @notanape5415 4 years ago +1

    This video is sponsored by Apple.

  • @piyushgupta809
    @piyushgupta809 7 years ago +1

    I don't understand one point: he is using other state values (technically future ones) to calculate the 7.4 value, but an MDP says the value of a state = immediate reward + discounted value of future states.
    How did he calculate -1.3, 2.7, -2.3 in the first place?
    Thanks in advance

    • @pariveshplayson
      @pariveshplayson 6 years ago

      piyush gupta one can solve for 1 variable and arrive at the value by doing LHS=RHS

    • @santiagosanchezcorrea5043
      @santiagosanchezcorrea5043 6 years ago

      Can you show me how to solve it? I can't understand this part. I tried to find one value with the Bellman equation, but I can't isolate any variable. Thanks.
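
      One way to see it: under the fixed 50/50 policy the Bellman expectation equations are just simultaneous linear equations, so all the values can be solved for at once. The numbers below are my reconstruction of the student example (gamma = 1), so double-check the rewards against the slide.

        import numpy as np

        # States: Facebook, Class1, Class2, Class3 (the terminal Sleep state has value 0 and is dropped).
        # Policy-averaged immediate rewards and transition probabilities under the 50/50 policy.
        R_pi = np.array([-0.5, -1.5, -1.0, 5.5])
        P_pi = np.array([[0.5, 0.5, 0.0, 0.0],    # Facebook -> Facebook or Class1
                         [0.5, 0.0, 0.5, 0.0],    # Class1   -> Facebook or Class2
                         [0.0, 0.0, 0.0, 0.5],    # Class2   -> Class3 or terminal
                         [0.0, 0.1, 0.2, 0.2]])   # Class3   -> pub branch (C1/C2/C3) or terminal

        # Solve v = R_pi + P_pi v  <=>  (I - P_pi) v = R_pi
        v = np.linalg.solve(np.eye(4) - P_pi, R_pi)
        print(v.round(1))   # [-2.3 -1.3  2.7  7.4]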

  • @SuperMyckie
    @SuperMyckie 2 years ago

    Fantastic series... but I need help here: do you get the reward from exiting a state or from entering a new state? Does the reward used in the calculations come from the state being exited or from the next state?

  • @kevinmcinerney9552
    @kevinmcinerney9552 7 years ago +2

    At approx. 22 mins somebody asks about an infinite sequence. David says this is impossible, but even if it were possible, it's interesting to observe that the sequence of rewards would sum to a finite value, because it's actually just a geometric series. This was discussed a bit in the Udacity course. Loving these lectures btw. David is magnificent.

    • @JaysonSunshine
      @JaysonSunshine 7 years ago

      Also, since there is a sink, the probability goes to zero of generating a sequence of length n as n goes to infinity.

    • @monktastic1
      @monktastic1 6 years ago +3

      I think he was talking about the case where there's no reduction factor across time.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      To be more precise, the discounted return converges almost surely as long as the rewards are uniformly bounded (almost surely) and gamma is strictly less than one.

    • @edmonddantes4705
      @edmonddantes4705 1 year ago

      @@monktastic1 That statement requires some mathematical formalisation, since one could have two different states pointing at each other with probability one, and hence arriving to any of those states would give rise to an infinite alternating sequence. However, those states could be formalised as one state and then you could consider the class of those two states to be a fixed point. There is some truth in your statement, but it requires more formalisation, it is hand-wavy.

  • @buwburbhbjeqbb6790
    @buwburbhbjeqbb6790 1 year ago

    The point where the audio goes off is really annoying, especially when you are trying to understand a concept. At the very least, subtitles should have been provided for that part.

  • @dexmoe
    @dexmoe 6 years ago +2

    42:30 be careful
    edit: and 56:00

  • @twoplustwo5
    @twoplustwo5 6 years ago +2

    it's like D&D

  • @ludmiladeute6353
    @ludmiladeute6353 3 years ago +1

    Thank you very much, professor David, for these great lectures!

  • @neriyacohen7805
    @neriyacohen7805 2 years ago

    1:12:07 Am I the only one seeing min-flow here? With nodes being states and weighted connections being actions?

  • @josephsantarcangelo9310
    @josephsantarcangelo9310 4 years ago +1

    Does anyone know the value of v(pub) at 1:02:33?

  • @peterkimemiah9669
    @peterkimemiah9669 7 months ago

    I have lost many months on my thesis; I wish I had found these lectures earlier.

  • @preethiabi2008
    @preethiabi2008 1 year ago +1

    Thank you professor