Bellman Equations, Dynamic Programming, Generalized Policy Iteration | Reinforcement Learning Part 2

  • Published: 31 May 2024
  • The machine learning consultancy: truetheta.io
    Want to work together? See here: truetheta.io/about/#want-to-w...
    Part two of a six-part series on Reinforcement Learning. We discuss the Bellman Equations, Dynamic Programming and Generalized Policy Iteration.
    SOCIAL MEDIA
    LinkedIn : / dj-rich-90b91753
    Twitter : / duanejrich
    Github: github.com/Duane321
    Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
    SOURCES
    [1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
    [2] H. Hasselt, et al. RL Lecture Series, Deepmind and UCL, 2021, • DeepMind x UCL | Deep ...
    SOURCE NOTES
    The video covers the topics of Chapters 3 and 4 of [1]. The whole series teaches from [1]. [2] was a useful secondary resource.
    TIMESTAMP
    0:00 What We'll Learn
    1:09 Review of Previous Topics
    2:46 Definition of Dynamic Programming
    3:05 Discovering the Bellman Equation
    7:13 Bellman Optimality
    8:41 A Grid View of the Bellman Equations
    11:24 Policy Evaluation
    13:58 Policy Improvement
    15:55 Generalized Policy Iteration
    17:55 A Beautiful View of GPI
    18:14 The Gambler's Problem
    20:42 Watch the Next Video!

Comments • 152

  • @mbeloch97
    @mbeloch97 1 year ago +7

    Great video! Can you explain that "sneaky" equation around 6:00? Why is G_t+1 = v(S_t+1) in the expectation?

    • @Mutual_Information
      @Mutual_Information  1 year ago +11

      Ah, something I intentionally skipped over out of laziness, so I'll pin this comment for others.
      We want to show E[G_t+1 | s^0, -> ] = E[v(S_t+1) | s^0, -> ]. So..
      * E[v(S_t+1) | s^0, -> ] = E[E[G_t+1|S_t+1] | s^0, -> ] (by def of v )
      * = sum over s' [E[G_t+1|S_t+1 = s']p(s' | s^0, -> ) ] (by def of an expectation - the outer one)
      * = E[G_t+1| s^0, -> ] (by law of total probability)
      where p(s' | s^0, -> ) = sum over r [p(s' , r| s^0, -> )]
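      Typeset in LaTeX, the same three steps read (writing the conditioning on the state s^0 and the "right" action as s^0, \rightarrow, as in the video):
      \begin{aligned}
      \mathbb{E}\big[v_\pi(S_{t+1}) \mid s^0, \rightarrow\big]
        &= \mathbb{E}\big[\,\mathbb{E}[G_{t+1} \mid S_{t+1}]\,\big|\, s^0, \rightarrow\big] && \text{(definition of } v_\pi\text{)}\\
        &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1}=s']\; p(s' \mid s^0, \rightarrow) && \text{(expanding the outer expectation)}\\
        &= \mathbb{E}[G_{t+1} \mid s^0, \rightarrow] && \text{(law of total probability)}
      \end{aligned}
      \quad\text{where } p(s' \mid s^0, \rightarrow) = \sum_r p(s', r \mid s^0, \rightarrow).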

    • @mbeloch97
      @mbeloch97 1 year ago +1

      @@Mutual_Information thanks!

    • @xiaoweilin8184
      @xiaoweilin8184 1 year ago

      @@Mutual_Information May I ask how law of total probability is used to get the last line from the previous one? Thanks!

    • @Mutual_Information
      @Mutual_Information  1 year ago +2

      @@xiaoweilin8184 Hm, it's just precisely what the law of total probability tells you.
      sum over b [p(a|b)p(b)] = p(a)
      The only difference is my expression has some extra conditioning on s^0, ->.. but that doesn't change anything.
      Hope that helps

    • @xiaoweilin8184
      @xiaoweilin8184 1 year ago +1

      @@Mutual_Information But in your expression, the quantity to be summed is E[G_t+1|S_t+1 = s']. So do we need to write out this expectation to:
      sum over g_t+1 [g_t+1*p(G_t+1 = g_t+1|S_t+1 = s')]
      first?
      and the whole expression becomes a double sum:
      sum over s' , sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)]
      exchange the sum:
      sum over g_t+1 {g_t+1 * sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)]} (1)
      Until this step can we use the total probability formula to the second sum:
      sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)] = p(G_t+1 = g_t+1 | s^0, ->)
      Put it back into (1):
      sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1 | s^0, ->)] = E[G_t+1 | s^0, ->]
      Is this the correct way to use the law of total probability to derive the last step from the previous one in your derivation? It seems to me there are a few more steps hidden under the hood in your expressions.
      Sorry there is no LaTeX in YouTube comments, it would be nicer if this were in LaTeX...
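      Since LaTeX isn't available in the comments, here is that double-sum argument typeset. One detail worth making explicit: replacing p(G_{t+1}=g | S_{t+1}=s', s^0, \rightarrow) with p(G_{t+1}=g | S_{t+1}=s') relies on the Markov property.
      \begin{aligned}
      \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1}=s']\; p(s' \mid s^0, \rightarrow)
        &= \sum_{s'} \sum_{g} g\, p(G_{t+1}=g \mid S_{t+1}=s')\, p(s' \mid s^0, \rightarrow)\\
        &= \sum_{g} g \sum_{s'} p(G_{t+1}=g \mid S_{t+1}=s')\, p(s' \mid s^0, \rightarrow)\\
        &= \sum_{g} g\, p(G_{t+1}=g \mid s^0, \rightarrow)
         = \mathbb{E}[G_{t+1} \mid s^0, \rightarrow].
      \end{aligned}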

  • @TheRealExecuter22
    @TheRealExecuter22 8 months ago +13

    I can't express how good these videos are, thank you so much for all the time you put into making them! This is a truly special channel.

    • @Mutual_Information
      @Mutual_Information  8 months ago +5

      Thank you, it's tailored for a particular audience. It doesn't hit for most, but for some it nails it!

  • @mCoding
    @mCoding 1 year ago +60

    Let's read from the textbook. *He opens the book, then stares at the camera and confidently recites from memory*.

    • @Mutual_Information
      @Mutual_Information  1 year ago +19

      Lol I wish it was from memory! Fortunately teleprompters aren't that expensive :)

    • @NoNameAtAll2
      @NoNameAtAll2 1 year ago +3

      @@Mutual_Information tsss
      don't ruin the good impression of you

    • @BenjaminLiraLuttges
      @BenjaminLiraLuttges 1 year ago +5

      That part of the video made me laugh out loud!!

    • @pandie4555
      @pandie4555 9 months ago

      i was looking for this comment lmao

    • @bean217
      @bean217 4 months ago

      This part of the video made me lose my focus entirely 😂

  • @hypershadow9226
    @hypershadow9226 4 months ago +2

    At 15:46 you said "if that policy is greedy with respect to that value function", but I don't quite understand what you meant by that. Other than that the video is crystal clear. Thank you for these videos.

    • @Mutual_Information
      @Mutual_Information  3 months ago +2

      A value function gives you the numeric value of every action in every state. A policy that's greedy 'with respect to that value function' is one which, in whatever state, picks the highest value action, according to the value function. Make sense?
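      As a tiny illustration in code (a hypothetical q table; the numbers are just borrowed from the earlier example in the video), "greedy with respect to q" simply means:

      # q[s][a] holds the action values; the greedy policy picks the argmax action in each state
      q = {"s0": {"left": 17.8, "right": 17.4}}  # illustrative numbers only

      def greedy_action(q, s):
          return max(q[s], key=q[s].get)

      print(greedy_action(q, "s0"))  # -> 'left'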

  • @katchdawgs914
    @katchdawgs914 1 year ago +3

    This series of videos is really nice. I would love to see you go more into the theory/proofs of why policy iteration works... as another series. Once again, really good work.

  • @rajatjaiswal100
    @rajatjaiswal100 1 month ago +1

    You saved me a lot of time with this simple, concise and easy-to-follow video compared to the others I have seen so far.

  • @timothytyree5211
    @timothytyree5211 1 year ago +13

    Kudos, good sir. Your pedagogical skill is both impressive and efficient.
    Please continue to grace the world with it for the good of all of mankind.

    • @Mutual_Information
      @Mutual_Information  1 year ago +2

      That's very kind of you Timothy - I have no plans of stopping :)

  • @vesk4000
    @vesk4000 1 month ago +3

    This is so well done! Explaining stuff well can be very difficult. Thanks a lot! I'm studying RL in a university course, but this was way more helpful!

    • @avinashsharma8913
      @avinashsharma8913 6 days ago

      Can you share your email? I'm also studying RL in a university course.

  • @usonian11
    @usonian11 1 year ago +3

    Excellent video. Even though I have been studying RL for a while, the video clarified some previously learned concepts and gave me a better understanding of the topic.

  • @manudasmd
    @manudasmd 7 months ago +1

    This is the best reinforcement learning resource available on the internet, period.

  • @valterszakrevskis
    @valterszakrevskis 1 year ago +10

    Imagine if such great educational videos existed for all foundational topics in artificial intelligence, engineering, math, and physics. We are slowly getting there :). 3b1b's Python module manim has made it quite accessible to create high-quality, time-efficient (for learning) educational content. It's amazing what people create. Thank you for the great videos!

    • @Mutual_Information
      @Mutual_Information  1 year ago +3

      I hope that there's a section of YouTube that's one day more like Wikipedia. It's a bit of a pipe dream, but I'm at least nudging this continent in that direction. FYI, I don't use manim.

  • @arturprzybysz6614
    @arturprzybysz6614 1 year ago +3

    Good to see your content back!

  • @bonettimauricio
    @bonettimauricio 10 months ago +1

    Amazing explanation of the concepts! Really nice!

    • @Mutual_Information
      @Mutual_Information  10 months ago

      Thank you, I appreciate it when the harder topics land :)

  • @marcin.sobocinski
    @marcin.sobocinski 1 year ago +4

    Your videos are like espresso: condensed, tasty, full-bodied, but you should not try to rush when watching them. There are no spare words, so when you miss one, you're lost 😀 Great video, I love that logical structure, rock solid!

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      lol you get what I'm going for! It's awesome - love the appreciation

  • @jeandescamps4962
    @jeandescamps4962 1 year ago +1

    Incredible content, thanks a lot for your work !

  • @dmitriigalkin3445
    @dmitriigalkin3445 1 month ago

    Amazing video! Thank you!

  • @aakashswami8143
    @aakashswami8143 23 days ago +1

    Amazing video!

  • @AlisonStuff
    @AlisonStuff 1 year ago +1

    I love Bellman! And I love equations!!

  • @oj0024
    @oj0024 1 year ago +6

    I didn't expect the next video so quickly, amazing stuff.
    Have we been spoiled, or will this tight upload schedule continue?

  • @hassaniftikhar5564
    @hassaniftikhar5564 4 months ago +1

    Best video lectures on RL on the internet.

  • @raminessalat9803
    @raminessalat9803 9 months ago +1

    Wow, I'm not sure if I never understood RL when I took the course in college or I just forgot it, but these videos made the aha moment happen for me for sure!

  • @yuktikaura
    @yuktikaura 1 year ago +1

    Keep it up... amazing take on this subject.

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Glad you like it!
      And my current plans are to certainly keep it up :)

  • @pedrocastilho6789
    @pedrocastilho6789 1 year ago +1

    Yes! We missed you

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 1 year ago +1

    Excellent work my friend.

  • @sayounara94
    @sayounara94 7 months ago

    Is it the case that the optimal policy found by optimizing the state rewards will always be the same as the one found by optimizing the action rewards?

  • @40NoNameFound-100-years-ago
    @40NoNameFound-100-years-ago 1 year ago +2

    This is the easiest take on this subject I have seen. You did a pretty great job there 😂😃😃👍👍

  • @karthage3637
    @karthage3637 3 months ago +1

    Love the content so far. I would just prefer that you leave some time to breathe: when you ask a question like "can you find S0?", don't answer straight away, let us think for a few seconds. Will keep digging into the playlist, thank you for all this work!

    • @Mutual_Information
      @Mutual_Information  3 months ago

      Good feedback. I’ll keep it in mind. Idk why I’m in such a rush lol

  • @himm2003
    @himm2003 1 month ago

    You mentioned something about the optimal policy around 7:54. What is that, and how did you relate it to the optimal state-action value?

  • @sathvikkalikivaya10
    @sathvikkalikivaya10 1 year ago +1

    This series is just amazing. Is there any deep learning series like this?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      From me? No (but maybe one day). In the meantime, 3Blue1Brown has an excellent explainer. And there are others..

  • @EnesDeumic
    @EnesDeumic 1 year ago +1

    Hello.
    Thank you very much for providing such clear lectures, I hope you will continue.
    I have one suggestion: when you finish explaining a set of equations, you make them disappear so fast that one doesn't have time to pause. Just take one full-note pause and that should do it. Don't overdo it, though; the beauty of your lectures is that they are dense, clear and concise.

  • @rafas4307
    @rafas4307 1 year ago +2

    super good

  • @sidnath7336
    @sidnath7336 1 year ago +3

    @6:48, how did you calculate the expectation via the sum i.e. get 17.4?

    • @denizdursun3353
      @denizdursun3353 4 months ago +2

      It's been 11 months, but either way:
      First assume s = s^(1):
      r = 0: 0.12 * [0 + 0.95 * 18.1]
      r = 1: 0.22 * [1 + 0.95 * 18.1]
      r = 2: 0.20 * [2 + 0.95 * 18.1]
      Sum all of these up.
      Then assume s = s^(2):
      r = 0: 0.09 * [0 + 0.95 * 16.2]
      r = 1: 0.32 * [1 + 0.95 * 16.2]
      r = 2: 0.05 * [2 + 0.95 * 16.2]
      Sum all of these up as well, then add the grand totals of both together.
      It's the same way for the left action :)
      Edit:
      For the state value of s = s^(0) you would simply take the weighted sum over the actions, such that:
      0.4 * 17.8 + 0.6 * 17.4 = 17.56, or 17.57 if you don't round your intermediate results.
      Do the calculations in Excel and you will get the same results :)
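      For anyone who wants to check these numbers, a quick Python sketch (the probabilities and state values are read off the video's example, so treat them as approximate):

      gamma = 0.95
      # (probability, reward, value of next state) for each outcome of p(s', r | s^(0), action)
      outcomes = [(0.12, 0, 18.1), (0.22, 1, 18.1), (0.20, 2, 18.1),   # s' = s^(1)
                  (0.09, 0, 16.2), (0.32, 1, 16.2), (0.05, 2, 16.2)]   # s' = s^(2)
      q = sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
      print(round(q, 1))                      # -> 17.4
      print(round(0.4 * 17.8 + 0.6 * q, 2))   # state value of s^(0), ~17.56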

    • @Mutual_Information
      @Mutual_Information  4 months ago +1

      Nailed it!

  • @ManuThaisseril
    @ManuThaisseril 11 months ago +1

    I have a question about the substitution at 6:06, could you explain it a bit? Because v_pi does not need to be conditioned on state and action, it only depends on the state.

    • @Mutual_Information
      @Mutual_Information  11 months ago

      If you look at the pinned comment on this video, I do a breakdown of the expression. If that doesn't answer your Q, let me know

  • @milostean8615
    @milostean8615 1 year ago +1

    brilliant

  • @XandraDave
    @XandraDave 6 months ago +1

    It turns out that in fact, algebra *is* fun, cool, and exciting

  • @envynoir
    @envynoir 2 months ago

    THANKS!!! GOOD CONTENT

  • @samudralatejaswini
    @samudralatejaswini 28 days ago

    Can you explain the calculation in the Bellman equations with an example showing which values to substitute?

  • @kafaayari
    @kafaayari 1 year ago +1

    Great video, but I have a question. At 3:56 a probability distribution table appears for choosing right. However, s0 is not in the table. After all, the agent can go from s1 to s0. Am I wrong?

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      This example is focused only on a small piece of the MDP. The MDP, in its entirety, describes the probability of all state, reward pairs for each state, action pair. In this example, I'm only showing the state-action pairs from s0, and we can only transition to s1 or s2 (when choosing right). In other words, this example is more restricted than the general case. Implicit in this example is that the probability of transitioning from s0 to s0 is zero... or the probability of transitioning from s0 to s-1 is zero when choosing right. Make sense?

    • @kafaayari
      @kafaayari 1 year ago +1

      Ah OK, MI, now it's crystal clear. BTW we're lucky to be able to ask you questions and get replies. We may not have this opportunity in the future when the channel explodes. :)

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      @@kafaayari Aw thanks :) we’ll see what happens. I enjoy answering the Qs and I’m gonna try to keep it up for as long as I can. So far the volume is quite manageable lol

  • @luken476
    @luken476 1 year ago +1

    Does anyone have a currently recommended library or software package for learning the applied side of reinforcement learning?

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      These might help?
      * github.com/TianhongDai/reinforcement-learning-algorithms
      * github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch

  • @yli6050
    @yli6050 1 year ago +1

    Bravo🎉

  • @ReconFX
    @ReconFX 2 months ago +1

    Hi, so far I think this is a great video. However, I want to point out that at 3:45 your illustration makes it seem like the states are a sort of one-dimensional grid, and one can go from s0 to s1, s1 to s2, etc. When you show the probabilities it becomes "obvious" that this is not the case, but this part had me a bit confused by your explanation/equation at 6:28, which I'm pretty sure should also have an s0 instead of an s. Like I said, otherwise a very good video!

  • @kiffeeify
    @kiffeeify 10 months ago

    At 13:50, for computing the value function: is this somehow related to Gibbs sampling? It somehow reminds me of it :)

  • @lfccardona
    @lfccardona 2 months ago +1

    You are awesome!

  • @tmorid3
    @tmorid3 9 months ago +1

    13:53 - how come the -20 cells and the -22 cells don't have the same value? They are equally far from the end point, no?
    Thanks!

    • @Mutual_Information
      @Mutual_Information  9 months ago +1

      If we were valuing the optimal policy, you'd be right. But we're valuing the do-something-randomly policy, which can't be valued by looking at the optimal path. You have to think about a random walk, and in that sense the corners are further away.

    • @tmorid3
      @tmorid3 9 months ago

      @@Mutual_Information
      Thank you very much for the quick reply!

  • @MarkoTintor
    @MarkoTintor 1 year ago +1

    Can you comment on why the Gambler's Problem solution differs from the Kelly criterion from one of your previous videos? Having a goal of reaching 100 vs maximizing growth.

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Sure - they aren't optimizing the same thing. In the gambler's problem, the only thing that matters is the probability of getting to 100. In the Betting Game for the KC, it's the expected growth rate. Also, the KC can sometimes tell you not to bet; in the gambler's problem, you are forced to bet every time. I guess that's enough to make for the different strategies.

  • @attilasarkany6123
    @attilasarkany6123 11 months ago +1

    Sorry, I may be miscalculating and being stupid, but I got a different result at 6:48 (17.4). Could someone write it down?

  • @marcin.sobocinski
    @marcin.sobocinski 1 year ago +1

    Thank you.

  • @piero8284
    @piero8284 8 months ago

    OK, the notation E_pi[.] does not necessarily imply you are averaging over the distribution pi(A | s), right? Like at 6:43, where you took the average over the r.v.'s joint distribution p(s', r | s^0, ->), so what's the point of using pi?

    • @piero8284
      @piero8284 8 months ago

      I mean, when I write E_pi[.] it just means that the function inside must be calculated assuming that the agent followed the policy pi, but in practice I have to use a particular distribution (dependent on pi) for the r.v.'s probabilities. Am I correct?

    • @Mutual_Information
      @Mutual_Information  8 months ago

      At 6:43, you can see what E_pi[] means when you go from the second to the third line. It's an expectation operation, so we are expanding the random variables within the expression into a weighted sum of values... where the values are all the values the random variables can take, and the weights are their respective probabilities. This is what happens between the second and third lines.

    • @piero8284
      @piero8284 8 months ago

      ​@@Mutual_Information I agree with you, my only source of doubt was from the pi subscript, as it does not make explicit the distribution of the random variables, I'm used to think of the subscript of the expectation meaning the distribution of the random variable itself, for example E_{p_X(x)}[X] = sum_x p(x)*x, but in this context it's not the case.

    • @Mutual_Information
      @Mutual_Information  8 months ago

      @@piero8284 Oh I see what you're saying. yea pi is just suggestive. It's like saying "the expectation is with respect to the policy pi and you have to know what that means".

  • @dwi4773
    @dwi4773 5 months ago

    Watched it at 0.75x speed, great video!

  • @IgorAherne
    @IgorAherne 1 year ago

    Duane, at 16:20, I'm not sure why we would need to update the policy: I would think that we could just rely on updating the values of the states, again and again, until they stop changing. Following my logic, we wouldn't have to iteratively change the policy - at the very end we'd just make it "follow the highest action".
    ....But, I realize that these state-values were updated with the random-action policy (the 4 neighbor-states value are weighed by 0.25).
    Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others? So it no longer becomes 0.25 but some other probability.
    My confusion is because I am used to Q-learning where the policy is epsilon greedy.
    Thank you

    • @Mutual_Information
      @Mutual_Information  1 year ago +3

      Hey Igor. If you are following that random-action policy example, then it just so happens you only need to apply policy improvement *once* and you're at the optimal policy. But that's not true in general. Here it is spelled out more:
      * start with a random policy.
      * determine its value function.
      * make a slightly better policy using the pick-best-action rule. This is only slightly better than the random policy. In the general case, it is not likely to be near the optimal policy.
      * determine the value function of this new, slightly improved policy.
      * repeat.
      If you were to do your approach, you would only be doing 1 iteration. You wouldn't end up with an optimal policy.
      And regarding "Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others?" Yes, I think that's fair. We are changing the probabilities in states where the highest-value action is NOT selected. Though I'm not sure what you mean by "slowly".
      Hope that helps!
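      Here is roughly what that loop looks like in code (a generic policy-iteration sketch, not the video's implementation; the transition-model interface p[(s, a)] -> list of (prob, next_state, reward) triples is an assumption made for illustration):

      def policy_iteration(states, actions, p, gamma=0.95, tol=1e-6):
          # Start with an arbitrary deterministic policy and zero values.
          policy = {s: actions[0] for s in states}
          V = {s: 0.0 for s in states}
          while True:
              # --- policy evaluation: sweep until the values of the current policy stabilize
              while True:
                  delta = 0.0
                  for s in states:
                      v = sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[(s, policy[s])])
                      delta = max(delta, abs(v - V[s]))
                      V[s] = v
                  if delta < tol:
                      break
              # --- policy improvement: act greedily with respect to the new value function
              stable = True
              for s in states:
                  best = max(actions, key=lambda a: sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[(s, a)]))
                  if best != policy[s]:
                      policy[s], stable = best, False
              if stable:        # no state changed its action -> the policy is optimal
                  return policy, V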

    • @IgorAherne
      @IgorAherne 1 year ago

      @@Mutual_Information thank you

  • @ManuThaisseril
    @ManuThaisseril 11 months ago +1

    v_pi is not a random variable, so why do we take its expected value at 6:06?

    • @Mutual_Information
      @Mutual_Information  11 months ago

      Indeed it is not random.. but if you give it a random variable as input, it becomes a random variable. As a silly example, if f(x) = x^2.. and U is a uniform random variable over [0, 1], then f(U) is a random variable. It is produced by randomly sampling a value uniformly from 0, 1 and then squaring it.

  • @mehmeterenbulut6076
    @mehmeterenbulut6076 5 months ago +1

    Nice explanation. May I ask how you gained your explanation skills? Did you take a course or something? Because you just hit the right buttons with your method, so to speak, to make us understand what you are talking about. I'd love to explain things like you do, man, appreciated!

    • @Mutual_Information
      @Mutual_Information  5 months ago

      Funny you say that, I feel like I have so much more to learn. I explain things poorly all the time!
      What I'd say is... start *writing* ASAP. Write about whatever interests you, and write somewhere where people will give you useful feedback. It takes a while to learn what works and what doesn't.
      One thing that helped me is realizing I'll be writing/educating forever. So there's no rush. This makes it more enjoyable, which means it's easier to maintain a long-term habit. The long term is where quality education emerges.

  • @curryeater259
    @curryeater259 1 year ago +1

    Very cool. How did you make the animations?

    • @Mutual_Information
      @Mutual_Information  1 year ago +3

      I use the Python plotting library called Altair. It creates beautiful static images. Then I have a personal library I use to stitch them into videos. That's also used to make the LaTeX animations.

  • @supercobra1746
    @supercobra1746 1 year ago +3

    I'm tough and ambitious!

  • @nathanzorndorf8214
    @nathanzorndorf8214 8 months ago +1

    Do you provide the slides as a PDF anywhere? That would be really helpful! Great video!!!

    • @Mutual_Information
      @Mutual_Information  8 months ago +1

      Gah, unfortunately not... I'd like to come back and create a written version of this series. I have that for a small set of other videos, but it'll take some time for this one - don't hold your breath, sorry!

    • @nathanzorndorf8214
      @nathanzorndorf8214 8 months ago

      @@Mutual_Information no problem! Thanks for your videos. They are a big help!

  • @nathanzorndorf8214
    @nathanzorndorf8214 5 months ago

    Do you have code for the gamblers problem online anywhere?

    • @nathanzorndorf8214
      @nathanzorndorf8214 5 months ago +1

      I attempted to code the policy iteration algorithm for the gambler's problem, but I don't get the policy you show in this video. Instead I get a triangle with a max at 50. This does seem like a reasonable policy though, so I'm not sure if this is one of the "family of optimal policies" that Barto and Sutton reference in the text.

    • @Mutual_Information
      @Mutual_Information  5 months ago +1

      Oh yes, I think it might be. I haven't made the code public, but I think I remember the problem: try changing how you are dealing with ties! The action you pick given a tie in value makes a difference.
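      In case it helps anyone else stuck on the same thing, here is a minimal value-iteration sketch for the gambler's problem (p_h = 0.4, positive stakes only; not the author's code). The tie-breaking lives in the argmax: np.argmax picks the smallest maximizing stake (after rounding away tiny numerical differences), which tends to reproduce the spiky policy from the book, while preferring the largest tied stake tends to give a triangle-shaped policy like the one described above.

      import numpy as np

      p_h, goal = 0.4, 100
      V = np.zeros(goal + 1)
      V[goal] = 1.0  # value = probability of eventually reaching the goal

      # Value iteration over capital levels 1..99
      while True:
          delta = 0.0
          for s in range(1, goal):
              q = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in range(1, min(s, goal - s) + 1)]
              best = max(q)
              delta = max(delta, abs(best - V[s]))
              V[s] = best
          if delta < 1e-9:
              break

      # Greedy policy; rounding before argmax controls how ties between stakes are broken
      policy = {}
      for s in range(1, goal):
          q = np.round([p_h * V[s + a] + (1 - p_h) * V[s - a] for a in range(1, min(s, goal - s) + 1)], 5)
          policy[s] = 1 + int(np.argmax(q))  # smallest stake among the (near-)ties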

  • @earthykibbles
    @earthykibbles 1 month ago

    My brain is cooked. This is my fifth time rewatching this😢

  • @aryamohan7533
    @aryamohan7533 3 months ago +1

    Could you explain the policy improvement step? I understand that choosing the action that maximizes the action value function will lead to a better policy, but I don't understand why we could not do that in the first iteration, right after performing policy evaluation. Wouldn't that then be the optimal policy? What other improvements can we iteratively make to the policy?
    PS: Thank you for this video series, it has helped me understand a lot!

    • @Mutual_Information
      @Mutual_Information  3 months ago

      Let's walk through it. To start, the action-value function is completely flat; all actions in all states have the same value. To do policy improvement from this moment, you must pick actions randomly (or in some other arbitrary way to break ties). Now you have a random policy. Then in policy evaluation, we determine the action values of this random policy. Next, in policy improvement, we can slightly beat it by always picking the max-value action. OK, why isn't this the optimal strategy immediately? Well, because it's a policy improvement step (applying the rule of picking the max-value action) on the action values of a crappy policy, the one where we just picked actions randomly!
      Make sense? It takes time for the action values to model a good policy, because we start with a bad policy.

    • @aryamohan7533
      @aryamohan7533 3 months ago +1

      @@Mutual_Information This makes a lot more sense now. Thank you so much for taking the time to respond!
      To everyone else watching: while I have only watched the first 3 parts (will watch the rest soon), I can already tell you that this video series has piqued my interest in RL, and I am excited to dive deeper into these topics and look into how I can incorporate this into my research.

    • @Mutual_Information
      @Mutual_Information  3 months ago

      Excellent! I hope it helps you

  • @NicolasChanCSY
    @NicolasChanCSY 1 year ago +3

    🤯 My brain keeps saying "I understand" and then "But do I really?" every few seconds.
    I'll have to rewatch the algebra parts for my tiny brain, but the overall idea is very well presented!

    • @Mutual_Information
      @Mutual_Information  1 year ago

      Thank you. And fortunately the comment section is small enough that I can answer questions - feel free to ask and I'll do my best!

  • @yassinesafraoui
    @yassinesafraoui 1 year ago +2

    Hmm, let me guess: wouldn't applying a discount be interesting if the state space is very big? Furthermore, wouldn't it be more interesting if, instead of discounting by a constant rate, we used a Gaussian distribution as our discount?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      Interesting perspective, but I don't think a strong discount removes the difficulty of a large state space. I see your perspective though - it makes it seem as though you only need to care about the most immediate states. But that's not necessarily true, because our policy is optimized for G_t at all t. If you only cared about G_0 and gamma = 0, then yes, the immediate state/action pair is all that matters and you don't care about a lot of the state space. BUT, since we also care about G_1, we have to have a policy that does well at time t=1, which means we care about states beyond the neighborhood of the states at t=0. Eventually, we could end up caring about the whole state space. If, on the other hand, some states aren't reachable from the starting state, then that would be one way in which a lot of the state space doesn't matter.

    • @yassinesafraoui
      @yassinesafraoui 1 year ago

      Yeahh that's it 👌👌, thanks for the fast reply!

  • @sharmakartikeya
    @sharmakartikeya 5 months ago +2

    If anyone is having a hard time deriving the Bellman equation, especially the part where E[G_t+1 | S_t+1] = v(S_t+1), then I have covered that in my playlist from absolute ground zero (because I am dumb).
    ruclips.net/video/4nMnc8M7U5k/видео.htmlsi=9QR8zNuLEd1QAijE

    • @Mutual_Information
      @Mutual_Information  5 months ago

      I don't see anything dumb in that video. You're getting right into the meat of probability theory and that's no easy thing!

  • @watcher8582
    @watcher8582 7 months ago +1

    Gotta upload a version that blends out the pause-less hand motions

    • @Mutual_Information
      @Mutual_Information  7 months ago +1

      Yea I don't like my old style at all. I'm re-shooting one but it's a lotta work. :/ Feedback is welcome.

    • @watcher8582
      @watcher8582 7 months ago

      @@Mutual_Information It was just a mean comment for the sake of being mean. I'd probably not put in the work to re-upload content you already got covered, but dunno.

    • @Mutual_Information
      @Mutual_Information  7 months ago

      @@watcher8582 Haha, yeah, I looked at the video after the fact and decided it wasn't so bad, so I don't need to redo this one. But for some of my older ones the hand motions are awful, and I'll be changing some of that lol

    • @watcher8582
      @watcher8582 7 months ago

      @@Mutual_Information No, I mean if you already have a video covered (or if it's well covered elsewhere), then I'd probably invest the energy into making a video on a topic not covered online yet. Don't misunderstand me, the hand motion is terrible and I put a post-it over the screen to watch the video.

    • @sayounara94
      @sayounara94 7 months ago

      I was so focused on the board that I didn't notice any weird cutovers! I like how you go through each variable; it's very useful to have these quick reminders of what the notation represents as we're going through new concepts, so that we don't have to make a conscious effort to decipher it and can focus on the new concept.

  • @JoeyFknD
    @JoeyFknD 11 months ago +4

    That awkward moment when a YouTube series teaches you more practical knowledge than a $50,000 4-year degree in math.

  • @kjs333333
    @kjs333333 8 months ago +1

    EPIC

  • @piero8284
    @piero8284 8 months ago +1

    (11:10) v_*(s) = max_a q_*(s,a) - the fact that there's no proof of this equation in the book made me very disappointed.
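    For what it's worth, here is a short sketch of the argument (not the book's proof). For any policy \pi and state s,
    v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a) \;\le\; \max_a q_\pi(s,a) \;\le\; \max_a q_*(s,a),
    so in particular v_*(s) \le \max_a q_*(s,a). Conversely, the policy that picks \arg\max_a q_*(s,a) in state s and then follows an optimal policy achieves exactly \max_a q_*(s,a), so v_*(s) \ge \max_a q_*(s,a). Together, v_*(s) = \max_a q_*(s,a).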

  • @pi5549
    @pi5549 1 year ago +1

    How about creating a Discord? If you think of the mind-type you're filtering with your videos, it could make for a strong community.

    • @Mutual_Information
      @Mutual_Information  1 year ago

      I should. I'm just not on Discord myself, so I don't have familiarity with it as a platform. But I have gotten the request a few times and it seems like a wise move..

    • @pi5549
      @pi5549 1 year ago

      @@Mutual_Information Yannic Kilcher created a Discord to support his YouTube channel, and it is buzzing. Also, the level of expertise is high. Yannic's used his following to accrete engineering power into LAION. I got into ML in 2015 and there were almost no online communities back then. I came back over Christmas (thanks to the ChatGPT buzz) and was delighted to find that it has taken off big time over Discord. Also, Karpathy has an active Discord.

  • @LeviFinkelstein
    @LeviFinkelstein 1 year ago +1

    I don't know very much about video stuff, but it looks like there's something off with your recording of yourself; it's pretty pixelated.
    Maybe it's just that your camera isn't that good, or something else, like the lighting, your rendering settings, or the bit rate in OBS. Just wanted to let you know in case you didn't already.
    Thanks for the good videos.

    • @Mutual_Information
      @Mutual_Information  1 year ago

      Thanks for looking out. This was my first time uploading in 4K (despite it being recorded in 1920 x 1080) - apparently that's recommended practice. From my end, the video doesn't look bad enough to warrant a re-upload, but I'll give the settings another look for the next videos. I believe I see the pixelation you're referring to.

  • @gigantopithecus8254
    @gigantopithecus8254 3 months ago

    I heard it's similar to the calculus of variations.

  • @sorn6813
    @sorn6813 11 months ago +1

    It's very hard to follow when you call everything "this" and just highlight "this". Would be easier to follow if you replaced "this" with the name of what you're describing

    • @sorn6813
      @sorn6813 11 months ago

      E.g. "R at time step t" or "the bottom-right XYZ"

    • @Mutual_Information
      @Mutual_Information  11 months ago

      Appreciate the feedback. It's a work in progress.. going forward there's even less on screen so focusing attention might be alleviated a bit.
      Would you mind providing a timestamp for a case where this stood out? That'll help me identify similar cases in the future

  • @anondsml
    @anondsml 2 months ago

    You move too fast, almost as if you're nervous.

  • @vishalpaudel
    @vishalpaudel 3 months ago

    The thumbnail was Manim-like. Disappointed. Did not watch the whole video.

  • @juliomastrodomenico5188
    @juliomastrodomenico5188 1 year ago +2

    Hi! Nice explanation! But I tried to implement the calculation from 12:46, i.e. doing the iteration (sweep), and I couldn't reach the same result. For the first part of the equation I kept the 0.25 ( pi[a|s] ), set p(s_prime, r | s, a) to 1 for a deterministic world, and summed ( -1 + world[row, col] ), where world[row, col] is the V_next_state. My result was:
    [[ 0. -5.05417921 -5.47163695 -3.10743228]
    [-5.05417921 -6.78698773 -6.51979625 -3.95809215]
    [-5.47163695 -6.51979625 -5.86246816 -3.20514008]
    [-3.10743228 -3.95809215 -3.20514008 0. ]]
    The iteration was something like:
    for s in the world grid: (except the goals)
        for act in actions: (if falls off grid does nothing (try catch))
            V(s) += 0.25 * (-1 + V[act])
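    For comparison, here is a minimal working sketch of that sweep (not the commenter's or the video's code; it assumes the book's 4x4 example with terminal corners and a reward of -1 per step). Two common pitfalls are accumulating into V(s) across sweeps instead of recomputing it each sweep, and how off-grid moves are handled, which in this example leave the state unchanged.

    import numpy as np

    V = np.zeros((4, 4))
    terminals = {(0, 0), (3, 3)}
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    for _ in range(1000):  # enough synchronous sweeps to converge
        V_new = V.copy()
        for r in range(4):
            for c in range(4):
                if (r, c) in terminals:
                    continue
                total = 0.0
                for dr, dc in actions:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < 4 and 0 <= nc < 4):
                        nr, nc = r, c  # moves off the grid leave the state unchanged
                    total += 0.25 * (-1 + V[nr, nc])
                V_new[r, c] = total  # assign, don't accumulate across sweeps
        V = V_new

    print(np.round(V, 1))  # should approach the -14 / -18 / -20 / -22 pattern from the video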

  • @user-co6pu8zv3v
    @user-co6pu8zv3v 7 months ago +1

    Great video! Thank you!