
Monte Carlo And Off-Policy Methods | Reinforcement Learning Part 3

  • Published: 4 Aug 2024
  • The machine learning consultancy: truetheta.io
    Join my email list to get educational and useful articles (and nothing else!): mailchi.mp/truetheta/true-the...
    Want to work together? See here: truetheta.io/about/#want-to-w...
    Part three of a six-part series on Reinforcement Learning. It covers the Monte Carlo approach to solving a Markov Decision Process with mere samples. At the end, we touch on off-policy methods, which enable RL when the data was generated by a different agent.
    SOCIAL MEDIA
    LinkedIn : / dj-rich-90b91753
    Twitter : / duanejrich
    Github: github.com/Duane321
    Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
    SOURCES
    [1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction (2nd Ed.). MIT Press, 2018.
    [2] H. van Hasselt, et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL | Deep ...
    SOURCE NOTES
    The video covers topics from chapters 5 and 7 from [1]. The whole series teaches from [1]. [2] has been a useful secondary resource.
    TIMESTAMPS
    0:00 What We'll Learn
    0:33 Review of Previous Topics
    2:50 Monte Carlo Methods
    3:35 Model-Free vs Model-Based Methods
    4:59 Monte Carlo Evaluation
    9:30 MC Evaluation Example
    11:48 MC Control
    13:01 The Exploration-Exploitation Trade-Off
    15:01 The Rules of Blackjack and its MDP
    16:55 Constant-alpha MC Applied to Blackjack
    21:55 Off-Policy Methods
    24:32 Off-Policy Blackjack
    26:43 Watch the next video!
    NOTES
    Link to Constant-alpha MC applied to Blackjack: github.com/Duane321/mutual_in...
    The Off-Policy method you see at 25:00 is different from the rule you'll see in the textbook at eq 7.9 (which becomes MC as n goes to infinity). That's because the textbook shows weighted IS and I'm showing plain (high-variance) IS.
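    For concreteness, here is a minimal sketch (not the video's actual code) of the difference between plain IS and weighted IS for off-policy MC evaluation of a single state-action pair. The helper names and the `episodes` structure, a list of (rho, return) pairs, are assumptions made for illustration.

```python
# Minimal sketch: plain vs. weighted importance sampling (IS) for
# off-policy Monte Carlo evaluation of one (state, action) pair.
# `episodes` is assumed to hold (rho, G) pairs, where rho is the product
# of pi(a|s)/b(a|s) over the steps after the first visit and G is the
# return observed under the behavior policy b.

def plain_is_estimate(episodes):
    """Plain (high-variance) IS: unbiased average of rho * G."""
    return sum(rho * G for rho, G in episodes) / len(episodes)

def weighted_is_estimate(episodes):
    """Weighted IS: normalize by the summed ratios; lower variance, biased for finite n."""
    total = sum(rho for rho, _ in episodes)
    if total == 0.0:
        return 0.0  # no episode was consistent with the target policy
    return sum(rho * G for rho, G in episodes) / total

# Made-up numbers, just to show the call pattern:
episodes = [(2.0, 1.0), (0.0, -1.0), (0.5, 1.0)]
print(plain_is_estimate(episodes))     # ~0.83
print(weighted_is_estimate(episodes))  # 1.0
```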

Comments • 67

  • @hannahnelson4569 • 15 hours ago

    The off policy thing was mind blowing!

  • @shahadalrawi6744 • 1 year ago +18

    This is beyond great.
    I can't thank you enough for the effort and clarity in this series. This is gold.

  • @PromptStreamer • 1 year ago +7

    These videos genuinely help me learn. A lot of the time studying math that’s above your head doesn’t have any tiny cumulative value, you’re just out of your league. But in these videos I often feel like I get the general idea of what he’s saying, even if I can’t work out all the details on my own yet. It’s something you can actually watch relaxed, like hearing a podcast, but walk away having learned something. I’m watching this in a hospital waiting room and it’s gripping. After watching his softmax video I was able to read through a paper I saw linked on twitter and sure enough, they mentioned the softmax, and my eyes lit up for a second. These are really high quality videos.

  • @aadi.p4159 • 1 year ago +1

    Keep 'em coming, man. This is one of the most well-produced videos I've seen on this topic!

  • @minefacex • 1 year ago +4

    I love how your videos are so understandable, but mathematically concise and clear at the same time! You also have amazing animations and figures. Good job and thank you!

  • @andrewkovachik2062 • 1 year ago +13

    Your video on importance sampling was so useful and well made I'm sticking around for this whole series even though I don't expect I'll need any time soon

  • @timothytyree5211 • 1 year ago +2

    Fantastic video series! I am looking forward to your next video, good sir.

  • @jacksonstenger • 1 year ago +4

    This is really great information, thanks for taking the time to make these videos

  • @architasrivastava218 • 1 year ago +4

    I have been doing a specialization in AI for the last 2 years at my college. I wish my teachers had explained it in such a clear way.

  • @moranreznik • 1 year ago +3

    I wish every math book in the world was written by you.

    • @Mutual_Information • 1 year ago +2

      lol that's very nice of you, but that sounds like an awful lot of work :)

  • @imanmossavat9383 • 1 year ago +1

    Excellent series.

  • @hamade7997 • 1 year ago +1

    This is really quite excellent, thank you.

  • @tomsojer7524 • 6 months ago

    I am so grateful for this series man, it helped me pass my exam. Thank you so much man. I'm waiting for more of your videos

    • @Mutual_Information • 6 months ago +1

      Awesome - exactly what I was going for. And I'm working on the next video now..

  • @bonettimauricio • 1 year ago

    Thanks for sharing this content, really amazing!

  • @marcin.sobocinski • 1 year ago +4

    Thank you.

  • @DKFBJ • 5 months ago

    This is excellent - Highly appreciated.
    Thank you very much.
    Have a great week,
    Kind regards

  • @codesmell786 • 9 months ago +1

    Best video. I have seen many, but this one is the best... great work!

  • @qiguosun129 • 1 year ago +2

    With all due respect, your lecture is more vivid than what the DeepMind teacher explained.

    • @Mutual_Information • 1 year ago +1

      Thank you! Their lecture series is great. I just put more of an emphasis on visualizing the mechanics and compressing the subject.

    • @qiguosun129 • 1 year ago

      @Mutual_Information Yes, that helps a lot for understanding the underlying mechanism.

  • @dimitrispapapdimitriou6364 • 7 months ago

    This is very well made. Thank you!

  • @user-co6pu8zv3v • 9 months ago +1

    This is great! excellent. Thank you!

  • @DARWINANDRESBENAVIDESMIRANDA • 1 year ago +2

    Such a great explanation, notation, and video production!!

  • @jiaqint961 • 1 year ago +1

    Thank you for your videos; they are very comprehensive and well explained.

  • @rickycarollo6410 • 6 months ago +1

    Amazing stuff, thanks!

  • @user-ed7ze8sx9c • 8 months ago

    Super awesome video series, and I have thoroughly enjoyed it so far! I do want to ask what tool(s) you used to create the visualizations and animations for the plots in the video. If you can provide the answer, it would be a great help for documentation I am currently working on! Again, super awesome video, and I am glad people like you put so much effort into communicating and simplifying these complicated topics in a really fun and very descriptive manner.

  • @melihozcan8676 • 8 months ago +1

    Don’t expect to understand these videos by only watching. They are like concentrated juices (without sugar/chemicals added hehe): you can't just drink them, it’ll overload your body… Water must be added, which in this context is time and effort. Everybody already has some vague idea about reinforcement learning: give rewards/punishments & repeat. Nevertheless, this high-level understanding is only adequate for people from other fields: like Justin Trudeau knowing the basics of quantum computers (which is impressive, actually).
    I would like to thank Mutual Information for this series! The connections between topics and the amount of detail (math) are very well established. Such quality content is really rare. If you also make similar series on ML or similar topics, count me in!

    • @Mutual_Information • 8 months ago

      Wow, that's very kind of you! Thank you for noticing what I was aiming for here... and I'm going to use that line, "concentrated juices" - that's a good analogy!

    • @melihozcan8676 • 8 months ago +1

      Thank you as well, @Mutual_Information; apparently good lectures lead to good analogies! I am honored!

  • @catcoder12 • 11 months ago +2

    26:03 Did we get a better estimate because the behavioural policy chooses hit/stick with equal probability, so we "explore" more of the suboptimal states compared to an on-policy method where we always greedily choose the most optimal action? Am I right?

    • @Mutual_Information • 11 months ago

      It could be something like that... I can't confidently say. It could also be noise from the simulation I did. I'd have to re-run it a lot to know whether it's a real effect. I don't suspect it is... in general, off-policy makes for strictly worse learning.

  • @ice-skully2651 • 1 year ago +1

    Great quality, sir! The material is well presented. Do you have a social media account I could follow you on?

  • @fallognet • 6 months ago +1

    Hey! Thank you so much for your videos, they are great and very useful!
    I still have a question though: when you are showing the off-policy version of the constant-alpha MC algorithm (25:10), why is the behaviour policy b never updated to generate the new trajectories? (We would like the new trajectories to take into account our improvements to the policy and the decision making, right?)
    Thank you again, Sir!

    • @Mutual_Information • 6 months ago

      Good question! It's because it's off-policy. That's defined as the case where the behavior policy is fixed and given to you (imagine someone just handed you data and said it was generated by XYZ code or agent). Then we're using that data / fixed behavior policy to update the Q-table, which gives us another policy, pi. Think of it as the 'policy recommended according to the data collected by the given behavior policy.'
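      For concreteness, here is a minimal sketch of that loop, assuming a uniform-random behavior policy, a greedy target policy, plain IS, and a (state, action, reward) episode format; none of these names come from the video's notebook.

```python
import random
from collections import defaultdict

# Sketch of the idea above: the behavior policy b is fixed and only
# generates data; the Q-table (and therefore the greedy target policy pi)
# is the thing being updated. Constant-alpha update with plain IS.

ALPHA, GAMMA = 0.1, 1.0
Q = defaultdict(float)  # Q[(state, action)]

def b(state, actions):
    """Fixed behavior policy: uniformly random. Never updated."""
    return random.choice(actions)

def pi(state, actions):
    """Target policy: greedy with respect to the current Q-table."""
    return max(actions, key=lambda a: Q[(state, a)])

def update_from_episode(episode, actions):
    """episode: list of (state, action, reward) triples generated by b."""
    G, rho = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = reward + GAMMA * G
        # rho covers only the steps after t, which is the IS correction
        # appropriate for the action value Q(s_t, a_t).
        Q[(state, action)] += ALPHA * (rho * G - Q[(state, action)])
        # Fold in step t's ratio for use at earlier timesteps.
        pi_prob = 1.0 if action == pi(state, actions) else 0.0
        rho *= pi_prob / (1.0 / len(actions))
        if rho == 0.0:
            break  # every earlier update would be weighted by zero
```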

  • @user-kz6jr6gw7t • 5 months ago +1

    Thank you so much! But at 25:13, since the target policy is derived after the data are sampled by the behavior policy, is there an iterative process to update rho, then get a new target policy, and so on?

    • @Mutual_Information • 5 months ago

      Yeah, you're thinking about it right. The target policy is the thing getting updated. The behavior policy is a fixed, given function. So rho changes as the target policy changes. Intuitively, rho is adjusting for the fact that the target and behavior policies 'hang out' in different regions of the state space. So, as the target policy changes where it hangs out, rho needs to change how it adjusts.
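      In code, rho for the tail of an episode is just a product of per-step probability ratios; because pi sits in the numerator, it has to be recomputed whenever the target policy changes. A tiny sketch with hypothetical pi_prob / b_prob callables:

```python
def importance_ratio(tail, pi_prob, b_prob):
    """rho for one episode tail: product of pi(a|s) / b(a|s) over its steps.

    tail    : list of (state, action) pairs from timestep t to the end
    pi_prob : callable (state, action) -> probability under the target policy
    b_prob  : callable (state, action) -> probability under the behavior policy
    """
    rho = 1.0
    for state, action in tail:
        rho *= pi_prob(state, action) / b_prob(state, action)
    return rho
```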

    • @user-kz6jr6gw7t • 5 months ago

      Thanks a lot for the further clarification. That really helps! @Mutual_Information

  • @EdupugantiAadityaaeb • 9 months ago +1

    what an explanation

  • @IMAdiministrator • 8 months ago

    I have a question about the Blackjack example. Why don't the stick graphs, with and without an ace, have similar results? You stick with whatever you have anyway, so it seems a bit odd to have different state-values between the graphs.

  • @glebmokeev6312 • 9 months ago

    12:14 Why is the policy deterministic if we have probability > 0 of taking either of the two actions?

  • @Prathamsapraa • 10 months ago +1

    I loved it. Can you make coding videos on this?

    • @Mutual_Information • 10 months ago

      I included some notebooks in the description. That's probably as far as I'll go. Just got other topics I'd like to get to.

  • @ChocolateMilkCultLeader • 1 year ago +2

    I love your videos. Would love to connect with you further

  • @florianvogt1638 • 8 months ago

    I am curious: in most pseudocode algorithms for off-policy MC control, the order in which we go over the states after generating an episode is reversed, that is, we start from T-1 and go to t=0. However, you start at t=0 and go to T-1. I wonder if both approaches are really equivalent?

    • @IMAdiministrator • 8 months ago

      Judging from the RL book, IMHO he altered the off-policy MC control algorithm (section 5.7) into this method, which initially multiplies all the ratios from the start to the terminal state and then gradually strips ratios off as t progresses toward the terminal state; hence he can step forward in time. Alpha is supposed to be a ratio between the weight of the current state and a cumulative sum, whose value is between 0 and 1 according to the method, but the cumulative sum needs to be calculated backward. To calculate alpha forward, you first need all the cumulative sums in the episode. The cumulative sums can be gathered from the importance sampling ratios across all timesteps of an episode, and then the weight of the current state is gradually stripped from the cumulative sums to calculate alpha forward.
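      A small sketch of the two orderings being discussed, with hypothetical names; step_ratios[t] stands for pi(a_t|s_t) / b(a_t|s_t) for one recorded episode:

```python
import numpy as np

def tail_ratios_backward(step_ratios):
    """rho_{t:T-1} for every t, built backward with a running product."""
    rhos, running = np.ones(len(step_ratios)), 1.0
    for t in reversed(range(len(step_ratios))):
        running *= step_ratios[t]
        rhos[t] = running
    return rhos

def tail_ratios_forward(step_ratios):
    """Forward version: full-episode product divided by the prefix product.

    Only valid when no earlier ratio is exactly zero (e.g., a greedy target
    policy that disagrees with the recorded behavior action)."""
    full, prefix, rhos = np.prod(step_ratios), 1.0, []
    for r in step_ratios:
        rhos.append(full / prefix)
        prefix *= r
    return np.array(rhos)

ratios = [0.5, 2.0, 1.5]
print(tail_ratios_backward(ratios))  # [1.5, 3.0, 1.5]
print(tail_ratios_forward(ratios))   # same values
```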

  • @faysoufox • 1 year ago +3

    I understood the first two videos well, but in this one you spend time talking about fine points of the model without spending enough time explaining the model itself to actually understand it. Still, thank you for your videos, which seem to be good introductions.

    • @Mutual_Information • 1 year ago

      Ah, sorry it's not landing :/ But maybe I can help. Is there something specific you don't understand? Maybe I can clarify it here.

    • @faysoufox • 1 year ago +2

      @Mutual_Information Thank you for the offer. I was actually watching your videos more for fun; it's not like I need to be able to do RL things tomorrow. If I want to understand it in detail, I'll read the book you based your videos on.

  • @049_revanfauzialgifari6 • 5 months ago

    How do you evaluate reinforcement learning results? I know precision, recall, mAP, etc., but I don't think they can be used in this scenario, CMIIW.

  • @PromptStreamer • 1 year ago

    Can you please start a discord server? Would be wonderful to discuss the video content somewhere. Thx

  • @alexchen879 • 1 year ago

    Could you please publish your source code?

    • @Mutual_Information • 1 year ago

      There's a link to a notebook in the description. It covers some of the code, but not everything.
      If there's a specific question you have, I can try to answer it here. Maybe that'll fill that gap.

  • @catcoder12 • 11 months ago +4

    You teach great but I feel you speak a little too fast.

    • @Mutual_Information • 11 months ago +1

      Good to know, I'm still getting calibrated. I've spoken *way* too fast before and sometimes too slow. Finding that sweet spot..

  • @user-wn9jq3zn6u • 3 months ago

    Anyone's brain explode like mine?