Monte Carlo And Off-Policy Methods | Reinforcement Learning Part 3

  • Published: 1 Dec 2024

Comments • 79

  • @shahadalrawi6744
    @shahadalrawi6744 1 year ago +19

    This is beyond great.
    I can't thank you enough for the effort and clarity in this series. This is gold.

  • @PromptStreamer
    @PromptStreamer 1 year ago +12

    These videos genuinely help me learn. A lot of the time studying math that’s above your head doesn’t have any tiny cumulative value, you’re just out of your league. But in these videos I often feel like I get the general idea of what he’s saying, even if I can’t work out all the details on my own yet. It’s something you can actually watch relaxed, like hearing a podcast, but walk away having learned something. I’m watching this in a hospital waiting room and it’s gripping. After watching his softmax video I was able to read through a paper I saw linked on twitter and sure enough, they mentioned the softmax, and my eyes lit up for a second. These are really high quality videos.

    • @Mutual_Information
      @Mutual_Information  2 months ago +1

      This is very nice to read, and I'm glad it had a positive effect

  • @JaishreeramCoder
    @JaishreeramCoder 3 months ago +1

    Best series for someone who wants to know about reinforcement learning

  • @Tracing0029
    @Tracing0029 1 year ago +4

    I love how your videos are so understandable, but mathematically concise and clear at the same time! You also have amazing animations and figures. Good job and thank you!

  • @andrewkovachik2062
    @andrewkovachik2062 2 years ago +14

    Your video on importance sampling was so useful and well made that I'm sticking around for this whole series, even though I don't expect I'll need it any time soon.

  • @tomsojer7524
    @tomsojer7524 10 months ago

    I am so grateful for this series man, it helped me pass my exam. Thank you so much man. I'm waiting for more of your videos

    • @Mutual_Information
      @Mutual_Information  10 months ago +1

      Awesome - exactly what I was going for. And I'm working on the next video now..

  • @hannahnelson4569
    @hannahnelson4569 4 months ago

    The off policy thing was mind blowing!

  • @melihozcan8676
    @melihozcan8676 1 year ago +2

    Don't expect to understand these videos by only watching. They are like concentrated juices (without sugar/chemicals added hehe): you can't just drink them, it'll overload your body… Water must be added, which in this context means time and effort. Everybody already has some vague idea about reinforcement learning: give rewards/punishment & repeat. Nevertheless, this high-level understanding is only adequate for people from other areas, like Justin Trudeau knowing the basics of quantum computers (which is impressive, actually).
    I would like to thank Mutual Information for this series! The connections between topics and the amount of detail (math) are very well established. Such quality content is really sparse. If you also make similar series on ML or related topics, count me in!

    • @Mutual_Information
      @Mutual_Information  1 year ago

      Wow, that's very kind of you! Thank you for noticing what I was aiming for here... and I'm going to use that line "concentrated juices" - that's a good analogy!

    • @melihozcan8676
      @melihozcan8676 1 year ago +1

      Thank you as well @@Mutual_Information, apparently good lectures lead to good analogies! I am honored!

  • @moranreznik
    @moranreznik 1 year ago +3

    I wish every math book in the world was written by you.

    • @Mutual_Information
      @Mutual_Information  1 year ago +2

      lol that's very nice of you, but that sounds like an awful lot of work :)

  • @timothytyree5211
    @timothytyree5211 2 years ago +2

    Fantastic video series! I am looking forward to your next video, good sir.

  • @codesmell786
    @codesmell786 1 year ago +1

    Best video. I have seen many, but this one is the best... great work.

  • @jacksonstenger
    @jacksonstenger 2 years ago +4

    This is really great information, thanks for taking the time to make these videos

  • @qiguosun129
    @qiguosun129 1 year ago +3

    With all due respect, your lecture is more vivid than what the DeepMind teacher explained.

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Thank you! Their lecture series is great. I just put more of an emphasis on visualizing the mechanics and compressing the subject.

    • @qiguosun129
      @qiguosun129 1 year ago

      @@Mutual_Information Yes, that helps a lot for understanding the underlying mechanism.

  • @architasrivastava218
    @architasrivastava218 1 year ago +4

    I have been doing a specialization in AI for the last 2 years at my college. I wish my teachers had explained it in such a clear way.

  • @DKFBJ
    @DKFBJ 9 months ago

    This is excellent - Highly appreciated.
    Thank you very much.
    Have a great week,
    Kind regards

  • @aadi.p4159
    @aadi.p4159 2 years ago +1

    Keep 'em coming, man. This is one of the most well-produced videos I've seen on this topic!

  • @keep-it-simple-shu
    @keep-it-simple-shu 4 days ago

    broo... you're a savior

    • @keep-it-simple-shu
      @keep-it-simple-shu 4 days ago

      not only very helpful, but also inspiring, i'm intrigued

  • @DARWINANDRESBENAVIDESMIRANDA
    @DARWINANDRESBENAVIDESMIRANDA 2 years ago +2

    Such a great explanation, notation, and video production!!

  • @marcin.sobocinski
    @marcin.sobocinski 2 years ago +4

    Thank you.

  • @Mewgu_studio
    @Mewgu_studio 1 year ago +1

    Thank you for your videos, they are very comprehensive and well explained.

  • @hamade7997
    @hamade7997 1 year ago +1

    This is really quite excellent, thank you.

  • @dimitrispapapdimitriou6364
    @dimitrispapapdimitriou6364 11 months ago

    This is very well made. Thank you!

  • @НиколайНовичков-е1э

    This is great! Excellent. Thank you!

  • @imanmossavat9383
    @imanmossavat9383 1 year ago +1

    Excellent series.

  • @bonettimauricio
    @bonettimauricio 1 year ago

    Thanks for sharing this content, really amazing!

  • @catcoder12
    @catcoder12 1 year ago +2

    26:03 We got a better estimate because the behavioural policy chooses hit/stick with equal probability, so we "explore" more of the suboptimal states compared to an on-policy method where we greedily always choose the most optimal action? Am I right?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      It could be something like that... I can't confidently say. It could also be the noise of the simulation I did. I'd have to re-run it a lot to know whether it's a real effect. I don't suspect it is... in general, off-policy makes for strictly worse learning.

  • @urpaps
    @urpaps 1 month ago

    Gem of a video❤

  • @rickycarollo6410
    @rickycarollo6410 10 months ago +1

    Amazing stuff, thanks!

  • @HaozheJiang
    @HaozheJiang 9 months ago +1

    Thank you so much! But at 25:13, since the target policy is derived after the data are sampled by the behavior policy, is there an iterative process to update rho, then get a new target policy, and so on?

    • @Mutual_Information
      @Mutual_Information  9 months ago

      Yea, you're thinking about it right. The target policy is the thing getting updated. The behavior policy is a fixed, given function. So rho changes as the target policy changes. Intuition: rho is adjusting for the fact that the target and behavior policies 'hang out' in different regions of the state space. So, as the target policy changes where it hangs out, rho needs to change how it adjusts. (There's a sketch of this below.)

    • @HaozheJiang
      @HaozheJiang 9 months ago

      Thanks a lot for the further clarification. That really helps! @@Mutual_Information
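
      A minimal sketch of that iteration in Python, assuming hypothetical pi and b functions that return action probabilities (the names are illustrative, not from the video): rho is just a product of per-step probability ratios over an episode, so it gets recomputed whenever pi changes while b stays fixed.

        import numpy as np

        def importance_ratio(episode, pi, b):
            # episode: list of (state, action) pairs generated under the behavior policy b
            # pi, b:   functions (state, action) -> probability of taking that action
            # Recomputed from scratch each time pi is updated; b never changes.
            return np.prod([pi(s, a) / b(s, a) for s, a in episode])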

  • @EdupugantiAadityaaeb
    @EdupugantiAadityaaeb 1 year ago +1

    what an explanation

  • @BhargavJoshi-i4x
    @BhargavJoshi-i4x 1 year ago

    Super awesome video series, and I have thoroughly enjoyed it so far! I do want to ask what tool(s) you used to create the visualizations and add animations for the plots in the video. If you can provide the answer, it would be a great help for documentation I am currently working on! Again, super awesome video, and I am glad people like you put so much effort into communicating and simplifying these complicated topics in a really fun and very descriptive manner.

  • @itdepends5906
    @itdepends5906 2 months ago +1

    Isn't the equation introduced at 7:51 a circular reference? Finding that part hard to follow. But thanks for all the videos, they're great

    • @Mutual_Information
      @Mutual_Information  2 months ago

      Ah I see how it's confusing. The arrow is there to suggest it's an operation, like what a computer would do. In the same way, it's like specifying the count sequence with "x ← x + 1". (See the sketch below.)

    • @itdepends5906
      @itdepends5906 2 months ago

      @@Mutual_Information OH lmao my bad. Probably because the equation was written in nice mathematical notation, it didn't occur to me to think of it like that. Thanks again 🙏
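
      For anyone else tripped up by the same notation, here is a minimal sketch of how that arrow reads as an in-place update rather than an equation, using the constant-alpha Monte Carlo value update as the example (variable names are illustrative).

        # "V(s) <- V(s) + alpha * (G - V(s))" is an instruction, not an identity:
        # overwrite the old estimate with one nudged toward the observed return G.
        alpha = 0.1                       # step size
        V = {"some_state": 0.0}           # state -> value estimate
        s, G = "some_state", 1.0          # a visited state and the return that followed it

        V[s] = V[s] + alpha * (G - V[s])  # the same arrow, written as an assignment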

  • @faysoufox
    @faysoufox 2 years ago +6

    I understood the first two videos well, but in this one you spend time talking about fine points of the model without spending enough time explaining the model itself to actually understand it. Still, thank you for your videos, which seem to be good introductions.

    • @Mutual_Information
      @Mutual_Information  2 years ago

      Ah sorry it's not landing :/ But maybe I can help. Is there something specific you don't understand? Maybe I can clarify it here.

    • @faysoufox
      @faysoufox 2 years ago +2

      @@Mutual_Information thank you for the offer; I was actually watching your videos more for fun, it's not like I need to be able to do RL things tomorrow. If I want to understand the details, I'll read the book you based your videos on.

  • @fallognet
    @fallognet 10 months ago +1

    Hey! Thank you so much for your videos, they are great and very useful!
    I still have a question though: when you are showing the off-policy version of the constant-alpha MC algorithm (25:10), why is the behaviour policy b never updated to generate the new trajectories? (We would like the new trajectories to take into account our improvements to the policy and the decision making, right?)
    Thank you again, Sir!

    • @Mutual_Information
      @Mutual_Information  10 months ago

      Good question! It's because it's off-policy. That's defined as the case where the behavior policy is fixed and given to you (imagine someone just handed you data and said it was generated by XYZ code or agent). Then we're using that data / fixed behavior policy to update the Q-table, which gives us another policy, pi. Think of it as the 'policy recommended according to the data collected by the given behavior policy.' (See the sketch below.)
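
      A compressed sketch of that separation, under assumed names (Q as a nested dict of action values, and a uniform b standing in for whatever fixed policy generated the data): b is only ever evaluated, while pi is re-derived greedily from the current Q-table.

        def b(state, action, n_actions=2):
            # Behavior policy: fixed and given; it generated the data and is never updated.
            return 1.0 / n_actions

        def pi(Q, state):
            # Target policy: greedy w.r.t. the current Q-table, so it changes as Q is updated.
            return max(Q[state], key=Q[state].get)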

  • @IMAdiministrator
    @IMAdiministrator 1 year ago

    I have a question about the Blackjack example. Why don't the stick graphs for having and not having a usable ace look similar? You stick with whatever you have anyway, so it seems a bit odd to have different state-values between the graphs.

  • @ice-skully2651
    @ice-skully2651 1 year ago +1

    Great quality, sir! The material is well presented. Do you have a social media account I could follow you on?

  • @Prathamsapraa
    @Prathamsapraa 1 year ago +1

    I loved it, can you make coding videos about this?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      I included some notebooks in the description. That's probably as far as I'll go. Just got other topics I'd like to get to.

  • @florianvogt1638
    @florianvogt1638 1 year ago

    I am curious: in most pseudocode algorithms for off-policy MC control, the order in which we go over the states after generating an episode is reversed, that is, we start from T-1 and go to t=0. However, you start at t=0 and go to T-1. I wonder if both approaches are really equivalent?

    • @IMAdiministrator
      @IMAdiministrator 1 year ago

      Judging from the RL book, IMHO he altered the off-policy MC control algorithm (section 5.7) into this method: it initially multiplies together all the ratios from the start to the terminal state, and then gradually strips ratios off as t progresses toward the terminal state, which is what lets him move forward in time. Alpha is supposed to be a ratio between the weight of the current state and the cumulative sum (a value between 0 and 1 according to the method), but the cumulative sum needs to be calculated backward. To calculate alpha going forward, you first need all the cumulative sums for the episode; these can be gathered from the importance sampling ratios across all timesteps. Then you gradually strip the current state's weight from the cumulative sums to calculate alpha forward. (Both orderings are sketched below.)
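
      A small self-contained check of that equivalence, with illustrative per-step ratios (not the video's code): the quantity used at step t is the product of the ratios that come after t, and it can be built incrementally going backward or precomputed as suffix products and read off going forward.

        import numpy as np

        ratios = np.array([1.0, 2.0, 0.5, 2.0])   # pi(a_t|s_t) / b(a_t|s_t) for one episode
        T = len(ratios)

        # Backward, as in the book's pseudocode: grow the tail product step by step.
        tail_backward = np.empty(T)
        W = 1.0
        for t in range(T - 1, -1, -1):
            tail_backward[t] = W                   # product of the ratios for steps after t
            W *= ratios[t]

        # Forward: precompute all suffix products once, then walk t = 0 .. T-1.
        suffix = np.append(np.cumprod(ratios[::-1])[::-1], 1.0)
        tail_forward = suffix[1:]                  # the same numbers, visited in forward order

        assert np.allclose(tail_backward, tail_forward)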

  • @049_revanfauzialgifari6
    @049_revanfauzialgifari6 9 months ago

    How do you evaluate reinforcement learning results? I know precision, recall, mAP, etc., but I don't think those can be used in this scenario, CMIIW.

  • @glebmokeev6312
    @glebmokeev6312 1 year ago

    12:14 Why is the policy deterministic if we have probability > 0 of taking either of two actions?

    • @lm-gu1ki
      @lm-gu1ki 2 months ago

      The video said the *environment* is deterministic, not the policy. That is, given state s and action a, you know with certainty what the new state will be.

  • @PromptStreamer
    @PromptStreamer 1 year ago

    Can you please start a discord server? Would be wonderful to discuss the video content somewhere. Thx

  • @alexchen879
    @alexchen879 1 year ago

    Could you please publish your source code?

    • @Mutual_Information
      @Mutual_Information  1 year ago

      There's a link to a notebook in the description. It covers some of the code, but not everything.
      If there's a specific question you have, I can try to answer it here. Maybe that'll fill the gap.

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 2 years ago +2

    I love your videos. Would love to connect with you further

  • @erdevel
    @erdevel 1 month ago

    I've been watching his videos for 2 hours straight (including all the other videos) and not understanding anything. I don't understand why people are commenting "great" when it was just talk and talk... so frustrating, and wtf, he speaks too fast, like he is rapping or something.

  • @刘春峰-w2e
    @刘春峰-w2e 7 months ago +1

    Anyone's brain explode like mine?

  • @catcoder12
    @catcoder12 1 year ago +5

    You teach great, but I feel you speak a little too fast.

    • @Mutual_Information
      @Mutual_Information  1 year ago +1

      Good to know, I'm still getting calibrated. I've spoken *way* too fast before and sometimes too slow. Finding that sweet spot..

  • @lm-gu1ki
    @lm-gu1ki 2 months ago +1

    With a deterministic target policy (25:00), wouldn't you throw away almost all your learning? Such a target policy assigns 0 probability to all but one action in a given state. The behavior policy needs to be quite lucky to hit that single action, so most of the time your importance ratio will be 0. Perhaps this problem is less severe if the behavior and target policies are quite similar -- but that happens only if the behavior policy is near-greedy relative to q values derived from it. Typically, the dataset you started from was not generated by such a good policy, otherwise you wouldn't need to do RL in the first place.

    • @Mutual_Information
      @Mutual_Information  2 months ago +1

      That is a *very* astute observation! Yes, there are issues with learning with fully deterministic policies. This is because they are never randomizing over actions, and so that creates permanent blind spots, as your intuition suggests.
      But here's what you can do (and actually, this is almost the standard practice). You collect data under a fully randomized, often uniform policy - that's the behavior policy. Then, in a separate stage, you train a deterministic policy on that off-policy data and deploy it (allowing it to learn online from there, or keeping the policy fixed). This kills some of the attractive adaptivity of an RL agent, but it's nonetheless done in practice. (See the sketch below.)
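
      A one-line way to see the original point, assuming a greedy target policy read off a Q-table (hypothetical helper, not from the video): the per-step ratio is nonzero only when the logged action happens to match the greedy one, and a single zero wipes out the rest of that trajectory's contribution.

        def step_ratio(Q, state, action, b_prob):
            # Q:      dict state -> {action: value}; the greedy choice defines the target policy
            # b_prob: probability the behavior policy gave to the logged action in this state
            greedy = max(Q[state], key=Q[state].get)
            return (1.0 / b_prob) if action == greedy else 0.0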