Aww, man, now I can't win every game of "set the video compression settings more optimally than your opponent" anymore. 😔
I know this video is a year old, but I have to say it was extremely well crafted. At no point did I ever feel lost while watching. I actually came here from watching two other recent videos, both related to #SOME. I really enjoy how you don't use a ton of buzzwords or read out textbook definitions and pass them off as "explanations".
Hey, this is a great video!
I noticed you made a mistake in the spreadsheet regarding the UCB calculations: the parentheses in the square root calculation are placed differently from where they should be.
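In case it helps anyone checking their own spreadsheet: I'm assuming the video uses the classic UCB1 formula (the exact exploration constant may differ), and the detail that matters is that the whole log(parent visits) / child visits ratio sits inside the square root. A minimal sketch in Python:

```python
import math

def ucb1(child_value_sum: float, child_visits: int,
         parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Classic UCB1 score for one child node in MCTS.

    Note that the entire log(parent_visits) / child_visits ratio sits
    inside the square root -- that's the parenthesis placement the
    comment above is pointing at.
    """
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploitation = child_value_sum / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration
```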
This video is gold.
Are you sure that the input to the dynamics network is the actual game state and not the representation network's output?
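For what it's worth, in the MuZero paper the dynamics network does take the representation network's output (the hidden state) plus an action, not the raw game state. A rough sketch of the unrolled inference, with h, g, and f as placeholder callables for the three networks (illustrative names, not the video's code):

```python
def plan(observation, actions, h, g, f):
    """Unroll MuZero's learned model along a sequence of actions.

    h, g, f stand in for the representation, dynamics, and prediction
    networks respectively (placeholder names for illustration only).
    """
    hidden_state = h(observation)        # representation: observation -> hidden state
    outputs = []
    for action in actions:
        policy, value = f(hidden_state)  # prediction reads the hidden state
        # The dynamics network takes the *hidden* state plus an action,
        # never the raw game state, and returns the next hidden state + reward.
        hidden_state, reward = g(hidden_state, action)
        outputs.append((policy, value, reward))
    return outputs
```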
Thank you 😊
Ok, so I know this video is old, but how do you train a network like this to know which states are valid, and which states are winning, losing, or draw states?
At 20:40 or so, the video shows the process of turning gameplay into training data. In the very first round of training, the goal is to get the agent to know what the valid moves are, so the goal policy is all legal moves (the game engine told us these at each step) with more or less equal probability (the tree search has no idea what is good or bad). We also want to start teaching the representation and dynamics networks "how good" a state or action is. For board games (with a win or loss at the end only), we take the reward from the end of the game (again, provided by the game engine) and feed it to earlier states; this effectively pretends that each play along the way was "winning" or "losing".
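To make that concrete, here's a rough sketch of how one finished game could be turned into training targets for that very first round. This is just my reading of it in Python, not the video's actual code, and the function and variable names are made up:

```python
def first_round_targets(trajectory, final_reward, num_actions):
    """Turn one finished game into (policy_target, value_target) pairs.

    trajectory  : list of (observation, legal_moves, player) tuples
                  recorded from the game engine during play
    final_reward: +1 / -1 / 0 from the game engine, as seen by player 0
    Assumes a two-player, zero-sum board game with reward only at the end.
    """
    examples = []
    for observation, legal_moves, player in trajectory:
        # Policy target: equal probability over the legal moves,
        # zero everywhere else (the tree search knows nothing yet).
        policy_target = [0.0] * num_actions
        for move in legal_moves:
            policy_target[move] = 1.0 / len(legal_moves)

        # Value target: the end-of-game result, flipped so it is
        # always from the perspective of the player to move.
        value_target = final_reward if player == 0 else -final_reward

        examples.append((observation, policy_target, value_target))
    return examples
```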
Let's suppose we played randomly for 1000 games and generated that training data. We train the agent, it starts to learn what moves are "good", and we then use *that* agent to play some more. This time, the agent's tree search will be better (it has learned a little bit about how to win), so the policies that are produced for training are a bit more focused towards the good moves. Play another 1000 games, and then re-train the agent on that data (I think the MuZero authors sample training data with a skew towards later generations [when the agent was better], but I could be misremembering); there's a rough sketch of that outer loop below.
I realize that was a long response, but in short: The game engine tells us about wins/losses and legal moves and we use some clever tricks to turn that into training data to teach the model how to look ahead better.
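And if it helps, here's roughly what that outer self-play/re-train loop looks like. Again, this is only a sketch: sample_skewed, play_one_game, and train are placeholder names, not anything from the video or the paper:

```python
import random

def sample_skewed(buffer, batch_size=256):
    """Sample a minibatch, biased towards more recent (stronger) games."""
    # Weight each example by its position in the buffer, so newer
    # examples are proportionally more likely to be drawn.
    weights = list(range(1, len(buffer) + 1))
    return random.choices(buffer, weights=weights, k=batch_size)

def training_loop(num_generations, games_per_generation, play_one_game, train):
    """Alternate self-play and re-training, as described above.

    play_one_game(agent) -> list of training examples from one game
    train(agent, batch)  -> an updated agent
    Both are placeholders for whatever implementation you have.
    """
    agent = None  # generation 0: effectively random play
    replay_buffer = []
    for _ in range(num_generations):
        for _ in range(games_per_generation):
            replay_buffer.extend(play_one_game(agent))      # self-play
        agent = train(agent, sample_skewed(replay_buffer))  # re-train
    return agent
```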
What amazing content, thank you so much for sharing it!!! 👏
👏🏻