This is beyond great.
I can't thank you enough for the effort and clarity in this series. This is gold.
You thanked me plenty! Glad you enjoy it
These videos genuinely help me learn. A lot of the time studying math that’s above your head doesn’t have any tiny cumulative value, you’re just out of your league. But in these videos I often feel like I get the general idea of what he’s saying, even if I can’t work out all the details on my own yet. It’s something you can actually watch relaxed, like hearing a podcast, but walk away having learned something. I’m watching this in a hospital waiting room and it’s gripping. After watching his softmax video I was able to read through a paper I saw linked on twitter and sure enough, they mentioned the softmax, and my eyes lit up for a second. These are really high quality videos.
This is very nice to read, and I'm glad it had a positive effect
Best series for someone who wants to know about reinforcement learning
I love how your videos are so understandable, but mathematically concise and clear at the same time! You also have amazing animations and figures. Good job and thank you!
Thank you Balazs!
Your video on importance sampling was so useful and well made that I'm sticking around for this whole series, even though I don't expect I'll need it any time soon.
The whole series!? You're a champ dude - thank you!
I am so grateful for this series man, it helped me pass my exam. Thank you so much man. I'm waiting for more of your videos
Awesome - exactly what I was going for. And I'm working on the next video now..
The off policy thing was mind blowing!
Don’t expect to understand these videos by only watching. They are like concentrated juices (without sugar/chemicals added hehe): you can't just drink them, it’ll overload your body… Water must be added, which in this context means time and effort. Everybody already has some vague idea about reinforcement learning: give rewards/punishment & repeat. Nevertheless, this high-level understanding is only adequate for people from other fields, like Justin Trudeau knowing the basics of quantum computers (which is impressive, actually).
I would like to thank Mutual Information for this series! The connections between topics and the amount of detail (math) are very well judged. Such quality content is really rare. If you also make similar series on ML or related topics, count me in!
Wow, that's very kind of you! Thank you for noticing what I was aiming for here... and I'm going to use that line "concentrated juices" - that's a good analogy!
Thank you as well, @@Mutual_Information - apparently good lectures lead to good analogies! I am honored!
I wish every math book in the world was written by you.
lol that's very nice of you, but that sounds like an awful lot of work :)
Fantastic video series! I am looking forward to your next video, good sir.
Best video. I have seen many, but this one is the best... great work
Means a lot, I appreciate hearing it
This is really great information, thanks for taking the time to make these videos
With all due respect, your lecture is more vivid than the DeepMind lecturer's explanation.
Thank you! Their lecture series is great. I just put more of an emphasis on visualizing the mechanics and compressing the subject.
@@Mutual_Information Yes, that helps a lot for understanding the underlying mechanism.
I have been doing a specialization in AI for the last two years at my college. I wish my teachers had explained it to me in such a clear way.
This is excellent - Highly appreciated.
Thank you very much.
Have a great week,
Kind regards
Keep 'em coming, man. This is one of the most well-produced videos I've seen on this topic!
broo... you're a savior
Not only very helpful, but also inspiring; I'm intrigued
Such a great explanation, notation, and video production!!
Thank you Darwin!
Thank you.
Thank you for your videos; they are very comprehensive and well explained.
Glad they helped!
This is really quite excellent, thank you.
This is very well made. Thank you!
This is great! excellent. Thank you!
Excellent series.
Thanks for sharing this content, really amazing!
26:03 Did we get a better estimate because the behavioural policy chooses hit/stick with equal probability, so we "explore" more of the suboptimal states compared to an on-policy method where we greedily always choose the most optimal action? Am I right?
It could be something like that... I can't confidently say. It could also be the noise of the simulation I did. I'd have to re-run it a lot to know it's a real effect. I don't suspect it is... in general, off-policy makes for strictly worse learning.
Gem of a video❤
amazing stuff thanks !
Thank you so much! But at 25:13, since the target policy is derived after the data are sampled by the behavior policy, is there an iterative process to update rho, then get a new target policy, and so on?
Yea, you're thinking about it right. The target policy is the thing getting updated. The behavior policy is a fixed, given function. So rho changes as the target policy changes. Intuitively, rho is adjusting for the fact that the target and behavior policies 'hang out' in different regions of the state space. So, as the target policy changes where it hangs out, rho needs to change how it adjusts.
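To make that concrete, here's a rough sketch of rho in code (illustrative function names, not the notebook's exact code): it's just the product of target-to-behavior probability ratios over the visited state-action pairs, so it changes whenever the target policy changes.

```python
def importance_ratio(episode, target_prob, behavior_prob):
    """rho for one episode: product of pi(a|s) / b(a|s) over its (state, action) pairs.

    episode: list of (state, action) pairs generated under the behavior policy
    target_prob(s, a), behavior_prob(s, a): probability each policy assigns to action a in state s
    (hypothetical function names, just for illustration)
    """
    rho = 1.0
    for s, a in episode:
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho
```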
Thanks a lot for the further clarification. That really helps! @@Mutual_Information
what an explanation
I like it too
Super awesome video series, and I have thoroughly enjoyed it so far! I do want to ask what tool(s) you used to perform the visualization and add animations to the plots in the video. If you can provide the answer, it would be a great help for documentation I am currently working on! Again, super awesome video, and I am glad people like you put so much effort into communicating and simplifying these complicated topics in a really fun and very descriptive manner.
Isn't the equation introduced at 7:51 a circular reference? Finding that part hard to follow. But thanks for all the videos, they're great
Ah I see how it's confusing. The arrow is there to suggest it's an operation, like what a computer would do. In the same way, it's like specifying the count sequence with "x ← x + 1".
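In code terms, that arrow is just an assignment. A toy sketch (placeholder names, not the video's code):

```python
# "x <- x + 1" read as an equation looks circular, but read as an
# instruction it just means: overwrite x with its old value plus 1.
x = 0
x = x + 1   # x is now 1

# An update of the same shape (alpha, G, v are placeholder names,
# not necessarily the exact equation at 7:51): overwrite the old
# estimate with a value nudged toward the return G.
alpha, G, v = 0.1, 5.0, 3.0
v = v + alpha * (G - v)   # v moves a fraction alpha of the way toward G
```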
@@Mutual_Information OH lmao my bad. Probably because the equation was written in nice mathematical notation that it didn't occur to me to think like that. Thanks again 🙏
I understood the first two videos well, but in this one, you spend time talking about fine points of the model without spending enough time explaining the model itself to actually understand it. Still, thank you for your videos, which seem to be good introductions.
Ah sorry it's not landing :/ But maybe I can help. Is there something specific you don't understand and maybe I can clarify it here?
@@Mutual_Information thank you for the offer; I was actually watching your videos more for fun, it's not like I need to be able to do RL things tomorrow. If I want to understand it in detail, I'll read the book you based your videos on.
Hey! Thank you so much for your videos, they are great and very useful!
I still have a question though: when you are showing the off-policy version of the constant-alpha MC algorithm (25:10), why is the behaviour policy b never updated to generate the new trajectories (we would like the new trajectories to take into account our improvements to the policy and the decision making, right?)
Thank you again, Sir!
Good question! It's because it's off-policy. That's defined as the case where the behavior policy is fixed and given to you (imagine someone just handed you data and said it was generated by XYZ code or agent). Then we're using that data / fixed behavior policy to update the Q-table, which gives us another policy, pi. Think of it as the policy recommended by the data collected under the given behavior policy.
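A rough sketch of that relationship in code (illustrative names, not the notebook's code): b stays fixed, its data updates the Q-table, and pi just acts greedily with respect to Q.

```python
from collections import defaultdict

Q = defaultdict(float)              # Q[(state, action)] value estimates

def pi(state, actions):
    """Target policy recommended by the data: greedy w.r.t. the current Q-table."""
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, G, rho, alpha=0.1):
    """One constant-alpha style update from a return G generated under the
    fixed behavior policy, reweighted by the importance ratio rho
    (one common form of the update, shown for illustration)."""
    Q[(state, action)] += alpha * rho * (G - Q[(state, action)])
```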
I have a question about the blackjack example. Why don't the stick graphs with and without an ace have similar results? You stick with whatever you have anyway, so it seems a bit odd to have different state-values between the graphs.
Great quality, sir! The material is well presented. Do you have a social media account I could follow you on?
Yea, Twitter: @DuaneJRich
I loved it. Can you make coding videos regarding this?
I included some notebooks in the description. That's probably as far as I'll go. Just got other topics I'd like to get to.
I am curious: in most pseudocode algorithms for off-policy MC control, the order in which we go over the states after generating an episode is reversed, that is, we start from t = T-1 and go to t = 0. However, you start at t = 0 and go up to T-1. I wonder if both approaches are really equivalent?
Judging from the RL book, IMHO he altered the off-policy MC control method (section 5.7) into one that initially multiplies all the per-step ratios from the start to the terminal state and then gradually strips ratios off as t progresses toward the terminal state, which is how he can move forward in time. Alpha is supposed to be a ratio between the weight of the current state and the cumulative sum, with a value between 0 and 1 according to that method, but the cumulative sum needs to be calculated backward. To calculate alpha forward you first need all the cumulative sums in the episode; they can be gathered from the importance sampling ratios across all timesteps, and then you gradually strip the weight of the current state from the cumulative sums to calculate alpha forward.
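For example, with made-up per-step ratios (not the video's numbers), computing the tail products either way gives the same result:

```python
import numpy as np

per_step = np.array([0.5, 2.0, 1.0, 0.25])   # made-up ratios pi(A_t|S_t)/b(A_t|S_t) for one episode
T = len(per_step)

# Backward, as in most pseudocode: accumulate the product from the end.
rho_backward = np.ones(T)
acc = 1.0
for t in reversed(range(T)):
    acc *= per_step[t]
    rho_backward[t] = acc

# Forward, as described above: start from the full product and strip ratios off
# (this assumes no per-step ratio is exactly 0).
full_product = per_step.prod()
rho_forward = np.ones(T)
stripped = 1.0
for t in range(T):
    rho_forward[t] = full_product / stripped
    stripped *= per_step[t]

assert np.allclose(rho_backward, rho_forward)  # same tail products rho_{t:T-1} either way
```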
How do you evaluate reinforcement learning results? I know precision, recall, mAP, etc., but I don't think they can be used in this scenario, CMIIW.
12:14 Why is the policy deterministic if we have probability > 0 of taking either of two actions?
The video said the *environment* is deterministic, not the policy. That is, given state s and action a, you know with certainty what the new state will be.
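A tiny illustration of that distinction (toy functions, not the video's code):

```python
import random

def step(state, action):
    """Deterministic environment: the same (state, action) always yields the same next state."""
    return state + 1 if action == "hit" else state

def policy(state):
    """The policy can still be stochastic in that deterministic world."""
    return "hit" if random.random() < 0.5 else "stick"
```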
Can you please start a discord server? Would be wonderful to discuss the video content somewhere. Thx
Could you please publish your source code?
There's a link to a notebook in the description. It covers some of the code, but not everything.
If there's a specific question you have, I can try to answer it here. Maybe that'll fill that gap.
I love your videos. Would love to connect with you further
Watching his videos for 2 hours straight (including all the other videos) and not understanding anything. I don't understand why people are commenting 'great' when it was just talk and talk... so frustrating, and wtf, he speaks too fast, like he's rapping or something.
Anyone's brain explode like mine?
You teach great but I feel you speak a little too fast.
Good to know, I'm still getting calibrated. I've spoken *way* too fast before and sometimes too slow. Finding that sweet spot..
With a deterministic target policy (25:00), wouldn't you throw away almost all your learning? Such a target policy assigns 0 probability to all but one action in a given state. The behavior policy needs to be quite lucky to hit that single action, so most of the time your importance ratio will be 0. Perhaps this problem is less severe if the behavior and target policies are quite similar -- but that happens only if the behavior policy is near-greedy relative to the q values derived from it. Typically, the dataset you started from was not generated by such a good policy, otherwise you wouldn't need to do RL in the first place.
That is a *very* astute observation! Yes, there are issues with learning with fully deterministic policies. This is because they never randomize over actions, and so that creates permanent blind spots, as your intuition suggests.
But here's what you can do (and actually, this is almost the standard practice). You collect data under a fully randomized, often uniform policy - that's the behavior policy. Then, in a separate stage, you train a deterministic policy on that off-policy data and deploy it (allowing it to learn online from there, or keeping the policy fixed). This kills some of the attractive adaptivity of an RL agent, but it's nonetheless done in practice.
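A minimal sketch of that two-stage recipe (all names here are illustrative, and the "fit Q from the batch" step is a placeholder for whatever offline learner you choose):

```python
import random

ACTIONS = ["hit", "stick"]          # assumed action set, blackjack-style

def behavior_policy(state):
    """Stage 1 behavior policy: uniform random, so every action has nonzero
    probability in every state and the batch has no permanent blind spots."""
    return random.choice(ACTIONS)

def collect_batch(reset, step, num_episodes):
    """Stage 1: generate a fixed batch of trajectories under the behavior policy.
    `reset()` and `step(state, action) -> (next_state, reward, done)` stand in
    for an environment interface."""
    batch = []
    for _ in range(num_episodes):
        state, done, trajectory = reset(), False, []
        while not done:
            action = behavior_policy(state)
            next_state, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
        batch.append(trajectory)
    return batch

def deploy(Q):
    """Stage 2 output: a deterministic policy, greedy w.r.t. the Q fitted
    offline from the batch; from here it can stay fixed or keep learning online."""
    return lambda state: max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
```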