"Who needs theorems when you've got hopes?" - words to live by.
Amen
That animation updating the estimates and showing the path the ball -- err "Car" -- took was spectacular. Great work as always!
Thank you my man! When reading the text, this was the example that convinced me it needed to be animated
You're really criminally underrated, should have hundreds of thousands of views at least
Ha thank you, means a lot to hear that. In my view, I still have a lot of wrinkles in what I'm producing. It's OK to not have a massive audience while I try to figure out how to make great videos. But eventually, they'll be really great and I think the attention will come then.
Best Teacher ever of RL.
Thanks Duane, loving these videos. They're a big help for my group of undergrads who are interested in getting into RL research!
Oh that's awesome! Getting my vids into classrooms is the ideal case - thank you for passing it along
One of the best on YouTube! Thanks!
The best video. People should watch these!
Your explanations are brilliant, thanks for making these videos
Thanks Lukas, happy to do it
Your videos are so fricking good! Thank you for such quality content on YT many of us appreciate it. I'm sure the channel will blow up in the future!!
I hope you're right. Thank you!
Thanks!
amazing series. really appreciate your work!
Thanks guy!
Just a bit of "surfing" on the very broad topic (like just mentioning a "deadly triad" without any hints as to how to deal with it) ;), but the mountain car animation is just wonderful! Thank you for the code! It's always a pleasure to watch such well-prepared videos :D
Thank you Marcin - great to see you back here! Yea, sometimes surfing is the best I can afford :)
Glad to hear the code is appreciated. For those who are *really* curious about the details, the code can fill in the gaps
Deadly triad reminds me of CAP theorem from databases. "You can only keep two of consistency, availability, and partition tolerance." (Consistency = data is consistent between partitions, availability = data is able to be retrieved, partition tolerance = partitions that aren't turned off being able to uphold consistency and availability, even when another partition becomes unavailable.)
- Function approximation = availability bc we're able to have a model that we can pass inputs to and it generalizes across all input-output mappings,
- Off-policy training = partition tolerance bc the exploration policy and (attempted) optimal policy diverge and they're essentially two partitions that are trying to maintain coherent communication/maintain modularity so they can communicate,
- Bootstrapping = consistency bc we're trying to shortcut what our expected value is but we have to trade off exploring every possible path with sampling to get a good enough but not perfect expected value.
Admittedly I feel like I'm stretching a bit here but I feel like it fits somehow and I just haven't found the exact way yet. It feels like there has to be a foundation for them to be stable on and when all three of the deadly triad are present it's like that Spider-Man meme where they're pointing to each other for who will be the stable foundation. If none of those three are the foundation, then what is? 🤔 It feels like the only way is to invert flow and try to predict what oneself will do/predict how to scaffold intermediate reward, rather than try to calculate the final answer (and rather than having an algorithm that has an exploration policy based on its ability to predict the final reward rather than intermediate. I may be misunderstanding this though based on what you said about MCTS vs Q-learning(?). I'm not sure if the predicting how to scaffold part is equivalent to Q-learning. I'm still learning sorry haha.). I think that's pretty much what predictive coding in the brain is. Not sure how to break it down correctly into subproblems though so that we can do "while surprise.exists(): build". Maybe one thing is that humans have more punctuated phases of adjusting value and policy in wake and sleep. Wake = learn about environment, REM = learn about one's brain itself. Curious if anyone has any thoughts on the CAP theorem comparison or any of the other stuff.
Thanks so much for the great video(s)! They help me learn a lot and help really get to the essence of the concepts. And are really clear and concise. And are entertaining.
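For readers following the deadly triad discussion above, here is a minimal numeric sketch of how the three ingredients can combine into divergence. It uses the well-known two-state setup with a single weight `w`, where the first state has estimated value `w` and the second has `2w`, and every update is applied to the transition from the first state to the second with reward 0 (an off-policy update distribution). The function name, step size, and step count are illustrative assumptions, not taken from the video's code.

```python
def off_policy_td_weights(w0=1.0, alpha=0.1, gamma=0.99, steps=50):
    """Repeat a semi-gradient TD(0) update on one transition only.

    Features: x(s1) = 1, x(s2) = 2, so v(s1) = w and v(s2) = 2w.
    Reward is 0. Only the s1 -> s2 transition is ever updated,
    which is what makes the update distribution off-policy.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # Bootstrapped target uses the current estimate of v(s2) = 2w.
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # Gradient of v(s1) with respect to w is the feature x(s1) = 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


if __name__ == "__main__":
    diverging = off_policy_td_weights(gamma=0.99)
    shrinking = off_policy_td_weights(gamma=0.4)
    print("gamma=0.99 final w:", diverging[-1])
    print("gamma=0.40 final w:", shrinking[-1])
```

Each step multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so the weight grows without bound whenever `gamma > 0.5` and shrinks toward 0 otherwise. Removing any one ingredient (updating the second state too, using a non-bootstrapped target, or giving each state its own weight) breaks this feedback loop in this toy case.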
Just 2 days before the exam😍😍
These videos are great! I really did not like the formatting of Barto and Sutton (e.g. definitions in the middle of paragraphs), but you've done an awesome job of extracting and presenting the most valuable concepts
Thank you for appreciating it! Barto and Sutton is a big bite, so I was intending to ease the digestion with these videos.
Great playlist! It would have been cool to include how long each training run took
Change the title to RL with DJ featuring Lake Moraine. 😂😂😂😂 . The green screen is actually really useful. Once again, grateful for these videos. You are making content that can be binge watched with a notebook 😂😂😂😂
Thanks Siddharth - the show is a work in progress, and I've actually managed to pull off some progress :)
Can you help motivate the need for the proto points? You already had a complete encoding of the state with just 2 dimensions: (position, velocity). Encoding the state in 1200 dimensions seems like overparameterization/redundancy. I assume there's a practical reason such as "dividing up the state space into 1200 discretized regions then learning the optimal behavior per region" but I can't wrap my head around why that would be necessary.
This confusion carries over into Part 6 where proto points come up again, but now we only have two.
You got me! A length 1200 feature vector is indeed overkill. What I'm doing is, I don't want the representational capacity of the parameterization I chose to be a limiting factor. So I go over the top and I'm doing effectively exactly what you describe: "dividing up the state space into 1200 discretized regions" and learning the value in each region, almost independently (but not *actually* independently).
In practice, we'd take a lot more care to choose a parsimonious parameterization that would be more sample efficient (assuming we chose the parameterization wisely). But doing so requires machinery I'd rather not use, e.g. a neural network. By picking something simple, I was avoiding the headache of our more powerful tool, but you saw its ugly symptom.
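To make the "1200 regions, almost independently" idea concrete, here is a rough sketch of a normalized radial basis featurization over a grid of proto points. The 40 x 30 grid, the state bounds, the Gaussian widths, and all function names are assumptions for illustration only; the actual code linked in the description may differ.

```python
import numpy as np

# Assumed mountain car state bounds (position, velocity).
POS_RANGE = (-1.2, 0.5)
VEL_RANGE = (-0.07, 0.07)


def make_proto_points(n_pos=40, n_vel=30):
    """Lay proto points on a regular grid: 40 * 30 = 1200 of them."""
    pos = np.linspace(POS_RANGE[0], POS_RANGE[1], n_pos)
    vel = np.linspace(VEL_RANGE[0], VEL_RANGE[1], n_vel)
    grid = np.stack(np.meshgrid(pos, vel, indexing="ij"), axis=-1)
    # Assumed width per dimension: one grid spacing.
    widths = np.array([pos[1] - pos[0], vel[1] - vel[0]])
    return grid.reshape(-1, 2), widths


def nrb_features(state, protos, widths):
    """Gaussian bump per proto point, normalized to sum to 1."""
    z = (np.asarray(state, dtype=float) - protos) / widths
    log_act = -0.5 * np.sum(z ** 2, axis=1)
    act = np.exp(log_act - log_act.max())  # subtract max for stability
    return act / act.sum()


protos, widths = make_proto_points()
phi = nrb_features((-0.5, 0.0), protos, widths)

# Linear value estimate: one weight per proto point.
w = np.zeros(len(protos))
value = w @ phi
```

Because neighbouring bumps overlap, an update at one state also nudges the values of nearby regions, which is the "not *actually* independently" part: each weight mostly describes its own region, with some generalization to its neighbours.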
Thanks for confirming! And with a speedy response time too! Thanks for making this series!!
@@danielawesome12 Happy to - love it when people check out the RL series
Excellent course!
Great video! Very illustrative
Thank you for great series!
BTW - changing the background to completely dark makes it easier to concentrate on the content
Yea, now that I've changed to the green screen, I think it's much better. We're a new channel now!
Looking forward to the Part 6 video. Any idea when it will be out?
Working on it as we speak! I have a lot of non-YT stuff going on as well, so I've been delayed.
Let's say.. 3 weeks?
Thanks a lot for the great content. May I know when the final video will be released?
It will be about a month from now. It may help to turn notifications on :)
Hi, thanks. The TD value is not exactly the Bellman update (TV). In the case of off-policy learning, it may pick up an unimportant sample, and the new update can point in the wrong direction, which may cause divergence. The update is then projected into the feature space (with a linear approximator, let's say), and the projected Bellman error is minimized. Am I right?
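As a small check on the "projected Bellman" part of the question above, the sketch below computes the linear TD(0) fixed point on a toy 3-state chain and confirms that it satisfies Φw = ΠT(Φw), where T is the Bellman operator and Π is the projection onto the feature space weighted by the state distribution. The chain, rewards, and features are made up for illustration, and the weighting is the on-policy stationary distribution; this does not reproduce the off-policy case, where such a fixed point need not be reached.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state Markov chain under a fixed policy.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
r = np.array([0.0, 0.0, 1.0])      # expected reward per state
d = np.array([0.25, 0.50, 0.25])   # stationary distribution of P
D = np.diag(d)

# Two features for three states, so the true values cannot
# generally be represented exactly.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

# TD(0) fixed point: solve A w = b.
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
w_td = np.linalg.solve(A, b)

# Projection onto span(Phi), weighted by d.
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

v_hat = Phi @ w_td
bellman_target = r + gamma * P @ v_hat   # T applied to v_hat
projected = Pi @ bellman_target          # Pi T v_hat

print("v_hat        :", v_hat)
print("T v_hat      :", bellman_target)
print("Pi T v_hat   :", projected)
```

The Bellman target `T v_hat` generally lies outside the feature space, so it differs from `v_hat`, but projecting it back gives `v_hat` again. That is the sense in which linear TD drives the projected Bellman error, rather than the Bellman error itself, to zero.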
Sir, can you provide the code for these classes? The theory is really great, but I am having trouble with the implementation.
One more playlist, please
Code links in the description :)
And another playlist lol... I'm tired
Thank you. It really helps me a lot.
Awesome, happy to hear it
If I may suggest a future video topic, how about a deep dive into Mercer's theorem and how it is applicable to support vector machines?
Mercer's theorem.. not a bad idea. That would probably get wrapped in a broader conversation with the kernel trick, and SVMs would get mentioned there. Added it to the queue!
>tfw irl all data is spread in multiple excel files throughout the company with no structure whatsoever.
Thank you very much for sharing this amazing content, I have a question. I think the obvious choices for features in the mountain car example are distance and velocity. I don't understand why you (or the book that used tile coding) chose to use a normalized radial basis to convert these 2 features into 1225 (35²) features. My understanding of function approximation was that its main goal was to shrink a huge state space into a smaller one. I get the impression that this solution expands the state space.
The essential *information* is distance and velocity, but how you featurize those into a model is a different story. Let's say we didn't use a NRB, what's the alternative? E.g. if you do something linear, you'll quickly see do-able actions over the state space can never produce a sequence that'll get out of the valley.
@@Mutual_Information I will experiment with a polynomial expression combining position and velocity instead and check if it converges to the optimal solution. The NRB solution is great. I do not have a standard procedure for feature selection and do not even know if one exists. If you know any literature about it, please let me know. Again, thanks for this content!
@@bonettimauricio There's a section of Sutton's textbook that's devoted to how to featurize the state space, in case you're interested
Hi DJ, I'm teaching myself AI with a view to a career change. What level of education/degree is this course equivalent to? Cheers, awesome work!
Hard to say. The textbook this is based on is used in Master's programs and some undergrad classes.
Wonderful!
I am waiting for your policy gradient video to use in my class! Are you going to release it any time soon?👀🙏
Yes, I've finished shooting it. Just in the editing phase now. It'll be posted in about a week
oh my god my head aches. do I really have to understand everything deeply with no prior knowledge of RL? I am just taking things as they are and don't really understand everything and I am worried
No that's a good point in fact. I don't remember the details either. Eventually you just abstract stuff away e.g. "Ok function approx just groups states together and that's bad but you gotta do it".. stuff like that.
@@Mutual_Information THANK YOU!!
Thank you so much
ur my GOAT i hope u know :)
Thank you for the great explanations and animations! Helped me a lot with passing the Advanced Machine Learning course! (Passed with an 8, this is approximately an A in the US grade system) Is there any way I can donate €5 to your PayPal? I wasn't able to do this through Patreon/YouTube as they both require a credit card, which I don't have. (Credit cards are not that common in the Netherlands, especially not for students)
That's very kind of you! I don't actually have PayPal, so I'm not sure how this transfer would work. But that's ok - there's no need! One thing that I would appreciate much more than the money is if you recommend this channel to someone in your class. Word of mouth is a big deal for a channel like this :)
@@Mutual_Information I already recommended your channel in the Teams channel of the university course :) at the start of 2023. I'll also share your channel with some friends of mine. Looking forward to part 6!
@@datsplit2571 you're a hero! Thank you!!
The end looks like tabu search x)
finally :-)