Couple of mistakes I noticed:
3:02 The numbers in the bottom left of the tree are incorrect and should have been toggled.
12:42 I say "root node" but should have said "leaf node"
21:36 I say "root node" but should have said "leaf node"
3:02 You say that after every play the signs of the numbers in the representation should flip, but in the lower leftmost nodes this doesn't happen.
@@dsadsa4336 Good catch! Those should have changed as well.
Watched many MCTS videos and this is the only one I've been looking for. Brief + actual code. Subbed!
What a great explanation. I especially like the visualization of the MCTS algorithm!
This visualization of MCTS finally crystallized it for me. Thanks!
I would really love an in-depth video about the machine learning code in this project.
Great video!
Best explanation so far, with the code available too. It's still hard to understand everything when I try to put it in place for real (UCB & prior are tricky). You reproduced a step-by-step debugger on your blog page; I cannot believe you did this by hand.
Awesome.
Thanks.
It took me too much time to realize that "prior" means "priority"
@@CedricPuig Pretty sure "prior" comes from "a priori", i.e. what we believe the probabilities to be before we make observations with the MCTS. After doing the tree search and getting the probability distribution representing the visit counts, it would be posterior probabilities, from "a posteriori".
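To make the a priori / a posteriori distinction concrete, here is a tiny numerical sketch (the numbers and variable names are made up, not from the video's code): the prior comes out of the policy network before any search, and the "posterior" is just the visit counts normalized after the search.

```python
import numpy as np

# Prior: what the policy network believes before any search.
priors = np.array([0.25, 0.25, 0.25, 0.25])   # hypothetical network output

# After running MCTS, suppose the four children were visited this often.
visit_counts = np.array([1, 10, 7, 2])

# "Posterior": the improved policy implied by the search itself.
posterior = visit_counts / visit_counts.sum()
print(posterior)   # [0.05 0.5  0.35 0.1 ]
```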
Been struggling to wrap my head around MCTS, this really helped, thanks man.
This was the video I needed to get an intuition on MCTS. I had read so much and watched a bunch of other videos but they all repeated the same textbook buzzwords without actually explaining the concepts. What you taught me that nobody else did (at least not well enough for me to understand what they were talking about), is that you need to iteratively run the algorithm a bunch of times ("number of simulations"). Before that MCTS was absolute nonsense, but now it all seems obvious. Great vid, shame you don't have that many others.
Fantastic video. Very clear. Thank you for making it.
The best video that I've seen about AlphaZero and Monte Carlo tree search algorithms.
Thank you so much ❤❤❤
Awesome video! First one I found that explains how things work together and not just the components separately.
Excellent work on this.
Incredibly good explanation!
Thanks Josh. I finished the lectures from David Silver and I wanted to understand MCTS a little better, especially when there is another player involved who will make opposing moves and you have to assume he will also make intelligent moves. To predict his moves you need to query a model (as in model-based RL) that knows the dynamics of the game and maybe even covers the behaviour of the other player; or, if the model doesn't cover the opposing player's behaviour, let him move according to the same policy you are using for player 1. But then you need to flip the input bits that represent the state of your board (like a Go board) for every step down in the search tree. Now I'm still wondering how a model-based agent with MCTS for planning is implemented: does the trained model include the opposing player's behaviour, or are his moves also "rolled out" at the same time as player 1's moves during the construction of the MC search tree?
Best explanation. Great job, thank you.
The more I look the more questions I have and you seem to know what you are talking about in the detail I need! This simple example is great for pinning down things before I start programming.
It is obvious really, but many states appear several times in the tree, and therefore additional time and memory is being expended to do the same thing. E.g. if stones are put in positions 1, 2, 3 then the same state will arise if they were put down in the order 3, 2, 1. This is also a problem for pure MCTS. Would it be worth keeping all nodes in a dictionary and, if a node already exists, using that instead of creating another? The only change I can think of is that the parent property would have to become a list of parents. Note you do not seem to have stored the node's parent, so I do not know how you do the backup.
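For what it's worth, a minimal sketch of the dictionary idea described above might look like this (the Node fields and get_or_create helper are hypothetical, not the video's code):

```python
# Key each node by an immutable encoding of its state so transpositions
# (placing at 1,2,3 vs 3,2,1) map to the same Node object.
class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0
        self.parents = []          # a list, since several move orders can reach this state

nodes = {}  # transposition table: state tuple -> Node

def get_or_create(state, prior, parent=None):
    key = tuple(state)
    node = nodes.get(key)
    if node is None:
        node = Node(state, prior)
        nodes[key] = node
    if parent is not None and parent not in node.parents:
        node.parents.append(parent)
    return node
```

The backup step would then have to walk every parent in the list rather than a single parent pointer, which is the main complication this change introduces.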
Finally I understand AlphaZero. Thanks! 🙏
Wow, I've been looking for this video for a long time. Thank you so much, this is so clear.
good explanation and great slides... well done!
Clearly explained, thank you.
Question: given enough time, can MCTS (Monte Carlo Tree Search) find the best solution?
The problem with MCTS is that it chooses the child node with the highest probability of having a solution.
As long as those probabilities don't change, MCTS will choose the same node, no matter how many iterations you perform. That means some leaves (terminal nodes) are unreachable.
If the best solution happens to be in an unreachable leaf, MCTS will never find it.
The problem is how you calculate the probabilities of the child nodes.
There are formulas like UCT, but I don't know if such formulas are really correct.
The probabilities need to change in such a way as to allow MCTS to traverse every single node and every single leaf of the tree.
I don't think UCT does that.
That means some nodes and leaves are unreachable.
"The problem about MCTS is that it chooses the child node with the highest probability of having a solution."
Its not shown in this video, but irl, MCTS would use "epsilon greedy". Something like epsilon for 5% (ie pick child randomly) vs 95% greedy. in this way it ensures good exploration of the state space.
"The probabilites need to change in such a way to allow MCTS to traverse every single node and every single leaf of the tree."
Brute force is not possible for real AI problems (like Go game). Thats why we have these kind of AI algos. If you can visit all nodes, then it would be some standard CS algo, not AI algo. Yes, theoretically there could exist better solutions than deduced by MCTS and its not fool proof. But its an acceptable compromise.
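If you did want the epsilon-greedy selection described in this comment, a minimal sketch might look like the following (this is not the video's code, which relies on the UCB term for exploration rather than epsilon-greedy):

```python
import random

EPSILON = 0.05  # 5% of the time, explore a random child

def select_child(children, scores):
    """children: list of child nodes; scores: their greedy scores."""
    if random.random() < EPSILON:
        return random.choice(children)                      # explore
    best = max(range(len(children)), key=lambda i: scores[i])
    return children[best]                                   # exploit
```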
awesome simulation and explanation!
Awesome breakdown, thanks man
What visualization library are you using for the graph state with forward and backward state transitions?
This is a great explanation. Thanks, you helped me a lot
Great explanation, thanks!
Could you please explain why it is easier to feed the neural network when we toggle the values and keep them with respect to the player about to move?
Great video. I really got a clear understanding of Monte Carlo Tree Search.
Can I ask how you do the slides? It is not just PowerPoint, right? I see you run the simulation of the code in the slides. That is awesome.
Thanks! The animations are just HTML/Javascript/CSS. You can see them in action at: joshvarty.github.io/AlphaZero/
Just wondering how you can decide which is the leaf node in an NLP task?
Brilliant, Thanks dude!
This is really well made! Thank you for the good video!
Nice video, great explanation. Thank you.
Thank you for the effort.
I thought the idea of MCTS was to select nodes randomly and average out values so that you don't have to search the full tree, but you said that when all values are equal it just selects the first state available. How does this work? Won't it just end up exploring the whole game tree or getting stuck? In the beginning it doesn't know which moves are valuable so they all receive the same weight; it selects the first available move, plays through until it receives a reward, and then distributes the reward back up the tree. If it resulted in a losing game, then the initial move would receive a lower value and the rest would all still be equal, meaning it would then search the next available move from the beginning, moving its way through the entire tree. Or if the search of the first move resulted in a win, it would receive a higher value and then that move would continually be explored.
I have a limited knowledge of this but explaining it might help me learn as well.
Your second point about getting stuck is the easier question: the UCB prevents this, as it starts prioritizing nodes that haven't been visited enough after a while. As the number of total nodes visited increases and the current node has still only been visited once or zero times, the UCB score will continue to increase, which makes it more probable to be picked the next time. This is all assuming the prior is not always 0, which I don't quite understand myself why that couldn't be the case, but I suppose the randomness of the neural network or adding some constant solves this.
About moving through the entire game tree: this might be the case eventually, but it's not something you need to do. After one of these trees has been created for N simulations, you've already trained your policy function to try to mimic the tree for that root state and a huge chunk of simulated descendant states. It's the policy function which you will save and use to play the game with. The goal is to make the neural network learn from the MCTS tree so that you can then discard the tree and not have to run these expensive simulations.
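To see why an ignored child eventually gets picked, here is a small sketch of a PUCT-style score in the spirit of the AlphaGo Zero paper (the constant and prior values here are made up): the exploration term keeps growing with the parent's visit count while the neglected child's own count stays at zero.

```python
import math

def ucb_score(parent_visits, child_visits, child_value_sum, prior, c_puct=1.25):
    # Exploitation: the child's mean value so far (0 if never visited).
    q = child_value_sum / child_visits if child_visits > 0 else 0.0
    # Exploration: grows with the parent's visits, shrinks as this child gets visited.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# A child that keeps getting ignored scores higher and higher as the
# rest of the tree is explored, so it is eventually selected.
for parent_visits in (1, 10, 100, 1000):
    print(parent_visits, round(ucb_score(parent_visits, 0, 0.0, prior=0.1), 3))
```

Note that in this common formulation only the parent's visit count sits under the square root; the (1 + child visits) term stays outside it, in the denominator.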
At 22:25, shouldn't the child.visit_count+1 be under the square root?
May I ask which visualization/simulation tool you used for the MCTS algorithm (14:42)? Thanks in advance!
No tool, I just used HTML/JS/SVG. The blog post in the description has the visualizations in it.
this is really awesome
I still don't understand why at 20:33 the value of [-1 0 0 0] is zeroed. Also its visit count is 2, but the children's visit counts are 1, 0, 0. OK, at least I managed to figure out how the -0.3 turned out: (0 + 0.5 + 0.5 + 0.5) / (2 + 1 + 1 + 1) = 1.5 / 5 = 0.3 (don't ask why).
Why does everyone talk about AlphaZero, which is something we don't know too much about, compared to LC0, which is an open-source engine that's just like AlphaZero (or even better by now)?
Give my girl LC0 some love! :D
LC0 is just a chess-specific architecture that's based on the A0 architecture.
@@Dalroc In theory. Since we don't have (I think) the source code of A0, I think LC0 was created from scratch.
Great video
this is gold! Thanks!
Can you make a video on how to play this model on Kaggle or locally?
The best explanation!!! thanks :)
Great video!
I know that the video was posted along time ago but maybe somebody will see this and answer.
I thought that MCTS did depth-first search, i.e. went all the way to a terminal node. The way it is programmed here it uses breadth-first search, which means it takes a long time for it to get any real information, particularly in games like Connect 4 and chess where there is only a non-zero reward at the end of the game. Am I wrong?
In traditional MCTS there is an additional step where you perform “rollout” to determine the value of a given position. Often this would involve multiple rounds of playing random moves until the game terminates. Then you would aggregate the outcomes of these random games (all having started at the original position) as some measure of approximate “value” of the original position.
AlphaZero does not use rollout. Instead we use the value network to estimate the value of a given position.
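A rough sketch of that difference (the game and value_net objects are hypothetical stand-ins, not the video's code): classic MCTS estimates a leaf's value by playing random moves to the end, while AlphaZero replaces that with a single forward pass through the value network.

```python
import random

def rollout_value(leaf_state, game):
    """Classic MCTS rollout: play random legal moves until the game ends."""
    state = leaf_state
    while not game.is_terminal(state):
        state = game.apply(state, random.choice(game.legal_moves(state)))
    return game.outcome(state)   # e.g. +1 / 0 / -1 for the player to move at the leaf

def network_value(leaf_state, value_net):
    """AlphaZero-style: one forward pass through the value network instead."""
    return value_net.predict(leaf_state)   # a scalar estimate in [-1, 1]
```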
@@JoshVarty Thank you for your quick reply. I thought that AlphaZero still used rollout, but rather than using random moves it used the value function (or the action probabilities) to guide its course. I cannot find it now, but I read somewhere that they actually used a smaller network that could produce predictions much faster than the main network to guide the rollout. However, while writing this I did some Googling and your understanding seems correct, so that idea must relate to something else. I have another question, but rather than bury it here I will make another comment.
My question is: does the policy network have to have a result for every possible move? For example in chess or Go, there must be billions of possible moves; what would the network look like then?
No, it does not have to produce a result for every possible move.
At 21:22 you can see the tree we have explored. We have not explored all possible moves (See 2:20 for the full tree). We use the value network to "estimate" the value of a given state without fully (and more accurately) evaluating every possible game that could arise from that state.
At 21:22 we have performed only 5 rounds of MCTS simulation. The algorithm tells us that the best move to play would be the very first position. This is wrong! The best move would have been the second or third position. This is because we have only used 5 rounds of MCTS simulation. In general, increasing the number of MCTS rounds allows us to search deeper into the tree (and therefore get better predictions), though we will never fully explore it.
@@JoshVarty Yeah, but how would you actually implement the policy network for an uncertain number of moves? In the example it's obvious that there are 4 moves that can be played, so the policy network needs 4 output nodes, one for each action. But in chess, for example, there is an unimaginably big number of moves that can be played. If you wanted to make a policy network for chess, how many outputs would there be?
@@nagycsapat931 Before we talk about the policy network, we should first talk about neural networks in general to make sure you understand how they work. One of the most commonly used image datasets is called MNIST. It contains ~60,000 images of handwritten digits (0 - 9). Each image is 28x28.
When we train a neural network on this dataset, we show it a 28x28 image and get its prediction (i.e. is this a 1? Is this a 2? etc.). We look at the output and then adjust the weights slightly (via SGD) to make it slightly more likely to make the correct prediction in the future. We do this many, many times. After we have trained this neural network, we show it images that it has never seen before and measure how accurately it predicts the correct digit. This is very important: we have not trained it on every single 28x28 image. We have only shown it ~60,000 examples, but we are happy to see that it generalizes to images it has never seen before.
For the policy network (7:10) we have an input of 4 numbers (the current state of the game) and it outputs 4 numbers (the confidence the network has that we should place a piece at each position). We train this network by showing it many different games of self-play and trying to make it match the probabilities generated by MCTS.
We do not show the policy network every single possible move. We hope that by showing it many, many possible moves during training it will generalize and be able to create good predictions for boards it has never seen before.
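For a board this small, a minimal sketch of such a policy network might look like the following (assuming PyTorch; the hidden size and example numbers are arbitrary and not from the video's code):

```python
import torch
import torch.nn as nn

class TinyPolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 32),   # 4 inputs: the board state (-1, 0, +1 per square)
            nn.ReLU(),
            nn.Linear(32, 4),   # 4 outputs: one logit per possible move
        )

    def forward(self, board):
        return torch.log_softmax(self.net(board), dim=-1)

# Training target: the move probabilities produced by MCTS for this board.
policy_net = TinyPolicyNet()
board = torch.tensor([[0.0, -1.0, 1.0, 0.0]])
mcts_probs = torch.tensor([[0.05, 0.50, 0.35, 0.10]])
loss = -(mcts_probs * policy_net(board)).sum()   # cross-entropy against MCTS visit distribution
loss.backward()
```

For chess or Go you would keep the same idea but enlarge the output layer to cover every move the rules can ever allow (e.g. one output per board intersection for Go), masking out the moves that are illegal in the current position.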
@@JoshVarty So my understanding is that the value network's input nodes represent the state (for example: 1st input node: 0, 2nd input node: -1, 3rd input node: 1, 4th input node: 0), i.e. (0, -1, 1, 0), and for these numbers it outputs the likelihood of winning in this position (otherwise called the "value"). This value between -1 and 1 is the single output node of the value network. The policy network's input nodes, on the other hand, are the same as the value network's (0, -1, 1, 0), but there is not just 1 output node, there are 4. The values of those 4 output nodes represent the likelihood that we should play each move (and also help improve our MCTS).
@@nagycsapat931 I think your understanding is correct. For games like Go (19x19 board) there are (at most) 361 possible moves on each turn, so you might imagine a policy network with 361 outputs. So there aren't millions or billions of moves to consider, but once you start looking ahead with MCTS, this search space does explode.
You've raised an interesting point. The approaches we're discussing here work reasonably well for games like Tic-Tac-Toe, Chess and even Go. That said, they don't work that well for games like StarCraft or Dota which have a much, much larger set of possible moves at every frame.
Great job!!! It helped me a lot!!
great video. Thank you very much
New Roslyn series please!!!
Great video! What is your running IDE?
Thanks! I use VS Code mostly. (In this video the code isn’t actually from an IDE. It’s just HTML/JavaScript that I made to look like VS Code).
There’s a link in the description that will take you to the blog post with the code.
Hey, great video. Could you tell us what library you used for the visualization?
Unfortunately I didn't use a library. I had to animate it with vanilla Javascript and SVG objects.
One thing I don't understand: why don't you assume that the opponent will play the best move, instead of propagating the (weighted?) mean up the tree? Let's say you play chess and you have a forced mate; wouldn't it be most logical to go for that mate even if another branch on average gives you better value predictions? (A good example might be to sacrifice most of your pieces to deliver a forced mate, Morphy-style.)
Also, when you train the net, do you take only the root nodes (which take in the most information), or do you train all state/policy pairs you have from the MCTS session?
> One thing I don't understand: why don't you assume that the opponent will play the best move,
I think the fundamental problem is: "What is the best move?" It's easy for us to write a program that can identify mate-in-one moves. But what about situations where there is no forced mate? How do we find the best move?
> wouldn't it be most logical to go for that mate even if another branch on average gives you better value predictions?
This situation should not happen. A mate-in-one move will be valued as 1.0 (the highest possible value). No other move can have a higher value. We want exactly the properties you're describing: If sacrificing pieces leads to victory, then that path should have a high value! We don't encode any information about the value of a queen, bishop etc.
> Also, when you train the net, do you take only the root nodes (which take in the most information), or do you train all state/policy pairs you have from the MCTS session?
We only record and train with moves that were actually played in the game.
@@JoshVarty > I think the fundamental problem is: "What is the best move?" It's easy for us to write a program that can identify mate-in-one moves. But what about situations where there is no forced mate? How do we find the best move?
Let me be clearer. After you have the UCB score for each candidate move from the board position, you will choose the best one. I'm talking about the phase where you expand a sub-tree from each candidate move. When you collapse it, the formulation uses the mean of all leaf node values corresponding to the candidate moves. Why not use argmax instead?
> This situation should not happen. A mate-in-one move will be valued as 1.0 (the highest possible value). No other move can have a higher value. We want exactly the properties you're describing: If sacrificing pieces leads to victory, then that path should have a high value! We don't encode any information about the value of a queen, bishop etc.
For mate-in-1 situations you can just brute-force it via sampling, because you'll have a reasonable chance of landing on the winning move. I'm talking about a situation where you have a forced mate in 10, where there is only a single move per opponent move that leads to that mate, otherwise you'll end up in a worse position. Let's say, for example, the alternative is a pawn move that gives you more control of the center but doesn't leave you with a possible mating sequence. The latter will have the higher average value score, but the best move is taking the forced-mate route, assuming you can infer all the moves in the sequence. Isn't taking the argmax for computing the candidate move value better? (See the sketch after this comment.)
> We only record and train with moves that were actually played in the game.
Doesn't that make the training prohibitively inefficient? Creating an entire tree just to get a single perfect sample? Or is that the only way to create an engine capable of beating Stockfish: running 150 TPUs on self-play games to create a sufficient number of positions for training? You'll be losing hundreds to thousands of data points per tree. If I want to train such an algorithm on my puny RTX 2060 I'll need to use a significant chunk of each tree, otherwise it will take me months to train an algorithm that can beat me, let alone compete with GMs and top-tier engines.
great video though
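On the mean-vs-argmax question above, here is a small sketch of the two backup rules side by side (the node fields are hypothetical, not the video's code; AlphaZero's published formulation keeps a running mean, as in the first function):

```python
def backup_mean(path, value):
    """AlphaZero-style backup: every node on the path accumulates the leaf value,
    and Q is the running mean (value_sum / visit_count)."""
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value            # flip perspective for the parent's player

def backup_max(path, value):
    """The argmax alternative the comment asks about: each node keeps only the
    best value seen through it (a minimax-flavoured backup)."""
    for node in reversed(path):
        node.visit_count += 1
        node.best_value = max(getattr(node, "best_value", -1.0), value)
        value = -value
```

Averaging makes the search robust to noisy value estimates from the network, at the cost of diluting rare-but-forcing lines like a long forced mate; a max backup trusts a single optimistic estimate, which can amplify network errors.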
What's the tool you are using to visualize building of the graph while the code is running?
No tool, I just manually made them as SVG drawings and animated them with Javascript.
Sick sick sick
At 4:30, you create a training set by simulating many games. However, that particular training set has the second sample ([-1 0 0 0], -1) with an incorrect label: the position is drawn for the second player, but the label is -1.
How do you address an issue of generating the training data with wrong labels?
I think the label is correct. The board is seen from the perspective of Player 2. Player 2 lost the game so we assign that board a value of -1.
From what I've seen RL is pretty finicky and probably doesn't handle noisy labels very well. It's important to make sure your labels are correct.
@@JoshVarty The player lost the game, but the position is NOT lost. He can draw the game.
@@0114mercury You're correct that the position is not 100% lost. However, the game in which we saw that position led to a loss.
The idea here is that if we play thousands of games and mark positions that lead to a loss with -1 and positions that lead to a win with +1, we will slowly get a value network that can value positions accurately. We just have to play lots and lots and lots of games for this to happen.
If the position [-1,0,0,0] is a good position, other instances of self-play will value it as +1 and in aggregate our network will learn to value this position with a positive value.
If the position [-1,0,0,0] is a bad position, other instances of self-play will value it as -1 and in aggregate our network will learn to value this position with a negative value.
Does that make sense?
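A minimal sketch of that labelling step (assuming, as in the video, each stored board is oriented to the player about to move, so the value target flips sign every move; the function and argument names are made up):

```python
def label_game(boards_per_move, final_result):
    """final_result: +1 if the player who moved first won, -1 if they lost, 0 for a draw.
    Each board is stored from the perspective of the player about to move,
    so the value target alternates sign from one move to the next."""
    training_examples = []
    value = final_result
    for board in boards_per_move:
        training_examples.append((board, value))
        value = -value
    return training_examples

# Over many self-play games the same position shows up with different labels;
# the value network ends up learning something close to their average.
```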
@@JoshVarty Yeah, OK. I was thinking along the same lines: a winning position will, on average, get assigned a winning score more often...
Can't see the code, could you share it please?
6:37 #player 2's reward is 0, isn't it?
Why you only let num_sims be 5 in the demo is a bit baffling; you should've let it do a full run.