This is a ridiculously good explanation for MCTS. Keep teaching, you're really good at it
Best explanation for MCTS
The visualization makes this algorithm so easy to understand! Thanks for your great work, professor!
I watched this vid like 10 times to make my Monte Carlo agent. Thanks man
You're very welcome!
You have a phenomenal talent for teaching. Clarity, pace, and conciseness were brilliant. Thanks.
Honestly, the best explanation for MCTS so far. This helped greatly, thank you ☺
After reading about this method, the rollout was confusing. The verbosity of your example completely eliminates that confusion. Thank you.
The explanation of why the algorithm tries unvisited nodes first (the → ∞ case) helped me a lot. Nobody else emphasised this so simply. Thanks
Oh my god, this video is so great, the explanation is so clear, all with flowcharts, pseudocode, and visualization. Thank you for providing this wonderful learning material
Thank you for your explanation. Your video was the only source where I could understand this algorithm. It really helps with writing my seminar paper
You're very welcome! Glad it was of use.
This is the best step-by-step explanation of MCTS! You saved my day! Thanks!
Amazing class, thank you so much for it! You made a very understandable class in under 16 minutes, that's impressive.
I came here after watching many vids on MCTS, and none of them were able to make me understand the concept, but this video is the best
At the beginning, your root node's visit number is 0, but you expand it instead of doing a rollout. I know it is the right way, but then you should alter your algorithm: on the left board you said "rollout if the n_i value is 0".
You can do this by saying the root node has the visit value 1 from the start.
Don't you expand and then roll out just like he did, meaning that the visits would be 0?
Beautifully explained
5:25 - It's not the smartest idea to take them in numerical order; it will somewhat bias the algo. I think a better policy would be a uniformly random one, or to put some priors over the actions using some heuristic, or better yet an ML model (maybe a policy DNN as in the case of AlphaGo), in case you know some specifics of the problem you're trying to solve with MCTS.
It's also maybe worth mentioning that the expansion threshold doesn't necessarily have to be 1; it's a hyperparameter of MCTS. So maybe you'll have to visit the node 40 times before expanding it.
Finally, in order to pick the best action at the end, sometimes you don't want to choose it by looking at the average reward of the child states but simply by the visit count (a quick sketch of both policies follows below) - but I guess this is problem-specific.
Loved the walk-through, thank you John! Great pedagogical methods you've got.
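P.S. A minimal sketch in Python of the two final-selection policies I mean (the node attributes `visits` and `total_value` are just illustrative names, and I'm assuming every child has been visited at least once):

```python
# Pick the final move from the root's children - a sketch, not the video's code.
def best_child_by_value(children):
    # "max child": highest average reward, as in the video
    return max(children, key=lambda c: c.total_value / c.visits)

def best_child_by_visits(children):
    # "robust child": most-visited child, often preferred in practice
    return max(children, key=lambda c: c.visits)
```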
Thanks for the insights!
Sir, thank you so much for the explanations. Even someone dumb like me now feels ready to apply MCTS to my tic-tac-toe game! If teachers like you had taught me when I was a kid, my life would definitely be different. Just can't thank you enough for this lecture!
One of the best explanations of MCTS
By far, the simplest explanation of MCTS. Thank you so much.
Thank you for the video, you explained the individual steps very well. That helps me a lot; above all, the expansion step was the one I didn't understand until now.
You're very welcome! It took me a while to work out how the expansion step works, since the standard explanation offered is a bit unclear on this point.
Thank you so much, because this is the first source I've found that gives a real worked example.
So once again, thank you so much. I think I'll now be able to implement it in C for my Connect 4 :)
Thank you so much! The explanations are very clear, you make it sound like it's easy! What's with the human calculator, by the way?
You did a great job explaining the process clearly.
What you did is admirable sir, great thanks for the explanation. Saved my day.
Great explanation of the algorithm, showing all the elements which are needed to get a basic understanding of MCTS. Thank you.
By far the best explanation I've seen. Thank you.
Didn't understand why after backpropagation on S5, S0 became 44 if it was 30 and S2 = 24. I thought we should add S0 = 30 plus S2 = 24 so S0 would be 54!
By the way, best explanation EVER! Thanks a lot for sharing.
Pedro Coelho I believe we just backpropagate the value we obtained after the rollout. So if we got 14, we just add that 14 to each of the parents all the way up to the root.
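To make the arithmetic concrete (going by the numbers in the video as I read them): the rollout below S5 returns 14, so every node on the path back to the root gets +14 and one extra visit - S5 (freshly expanded) goes from 0 to 14, S2 from 10 to 24, and S0 from 30 to 44. S0's total isn't the sum of its children's totals; it's the sum of all rollout values that have passed through it.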
Incredibly clear, after just one watch I now understand MCTS
Great video. Explaining things clearly is hard. You make it look easy John.
Thank you for such a clear explanation of MCTS.
Great video. It was well explained and easy to follow. I'd love to see other videos related to deep learning if you have them.
Thank you so much for the motivation this gives me to not give up, its a lot easier to believe in myself after understanding these concepts!
The best explanation I can find.
Your explanations are much better than the Wikipedia page. Congrats! I thought MCTS was only applicable to games in which you have an opponent; glad to see that's not the case, and I will apply it to my problem. Thanks!
The best explanation of MCTS ever. And it looks like the Wikipedia article is not correct.
Wow, this is liquid gold. Thanks John.
Nice to hear from a fellow Brit and not MIT OCW! Big up America too, though.
Thanks a lot for the lecture, it was really helpful :)
You're very welcome!
Great video! At 15:03, is t_0 supposed to be 54 not 44?
Thank you so much! I'm working on my FYP and this is the best explanation of MCTS out there!
Thank you Prof! The worked example makes the absolute difference!
This is pure gold! Thank you infinitely
The best explanation. Thank you!
Thanks, the best explanation of MCTS. I'd like to understand how it would work in a game with multiple players.
First of all, thank you for the excellent lecture and explanation! The algorithm is very clear and understandable from the way it's presented.
My question is: in the end you mention that what this gives us is that we should take action a2 over a1. Why can't we take it a step further and also infer that, for example, a5 is better than a6 (assuming we have visited both of them more times than shown in the example)? Why do we use a whole search tree for just one decision, and then scrap it and rebuild it anew, instead of making more sequential decisions using the same tree?
Your videos are so helpful. Thank you for posting these online!
Very good presentation. Thank you! I really needed clarification on whether to increment visit count BEFORE or AFTER backpropagation. Thanks for this :)
Truly incredible explanation.
This is great, thank you! But why is the value found at the terminal node *added* to the nodes above? Is that simply so we can get the average expected value of a certain state by dividing by the number of times it's been visited?
Thank you! Explained it better than my professor ever did
What a great explanation! I've heard other explanations, but this is the condensed one I'll refer to in the future!
Very clear explanation of the algorithm and the video is also well made! Thank you!
The content is so exciting even the camera shares the feeling.
Thank you so much for this video, really helped me a lot.
I just have one question : when expanding, why do you only roll out from one of the children? Wouldn't it give a more accurate score for the parent if we simulate all its children?
Good question. In general, MCTS is trying to make the best decision it can from the limited number of rollouts that it can make in the time available. Therefore, every time it does a rollout, it updates the whole tree and then uses this updated tree to decide where to sample next - it could be that sampling only a single child will tell us that the node above is a poor choice, and so we will start sampling elsewhere in the tree.
A monument to clarity. Thank you very much.
Very nice explanation! Thank you!
I have a question: what should I do if I am at a leaf node that should be expanded, but the state of this node is terminal?
Best possible explanation...
Very clear lecture. Great explanation! Thank you!
Thanks a lot for sharing this video, Sir. It was really very helpful.
Could you explain what is behind the rollout (simulation) in your example? Take an example like tic-tac-toe.
According to your flowchart, shouldn't you do a rollout immediately from S0 since it is a leaf node and it has never been visited?
In more specific applications such as chess, during the rollout phase can/should you weight the possible actions by guesses for how good they are (like moves that take a piece, put the opponent in check, etc.) instead of doing it purely randomly? Would that improve the algorithm on average?
Yes. Indeed, most advanced implementations of MCTS have some form of heuristics to favour "good" moves during the rollout. AlphaGo, for instance, used neural networks both to favour good moves during the rollouts and to estimate the terminal value at the end of the rollout (since playing the rollout to the end would take too long).
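As a rough sketch of that idea in Python (the `score_move` heuristic is hypothetical - e.g. it could score captures and checks higher - and isn't taken from the video or from AlphaGo):

```python
import random

def biased_rollout_move(state, legal_moves, score_move):
    # Weight each legal move by a heuristic score, then sample, so the
    # rollout is biased towards "good" moves but still randomised.
    # Weights must be non-negative and not all zero for random.choices.
    weights = [score_move(state, move) for move in legal_moves]
    return random.choices(legal_moves, weights=weights, k=1)[0]
```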
Sorry, I have a question: when we do a rollout from a leaf node, do we start backpropagating the value from that leaf node up to the root, or from the node where the rollout ended up to the root?
I really appreciate the algorithm trace.
Question: given enough time, can MCTS (Monte Carlo Tree Search) find the best solution?
The problem about MCTS is that it chooses the child node with the highest probability of having a solution.
As long as those probabilities don't change, MCTS will choose the same node, no matter how many iterations you perform. That means some leaves (terminal nodes) are unreachable.
If the best solution happens to be in an unreachable leaf, MCTS will never find it.
Great explanation, thanks! A quick question about how the algorithm is constructed for the leaf nodes: it does a rollout on the first visit, on the second visit it expands the tree and rolls out, and on any subsequent visit it just traverses using the UCB1 formula. Is my understanding correct?
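To make my understanding concrete, this is roughly how I picture the leaf-node step (a sketch only - the node attributes and the rollout/backpropagate helpers are my own assumptions, not taken from the video):

```python
def handle_leaf(node, rollout, backpropagate):
    # First visit: roll out straight from this leaf.
    if node.visits == 0:
        backpropagate(node, rollout(node.state))
    # Later visits: expand the leaf, then roll out from one new child.
    else:
        node.expand()
        child = node.children[0]  # which new child to pick is a policy choice
        backpropagate(child, rollout(child.state))
```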
I've been searching for ages for a proper explanation of MCTS, and this is the best of all. Thanks for taking the time to make this! Just one question, however: how did you get the value of the leaf nodes on rollout? I see that you randomized until a terminal state. Do you use an evaluation function there?
The idea is that the rollout should end in a state that really is terminal - the end of the game. For example, if we're playing tic-tac-toe, the terminal state is a win for one player or a draw because there are no moves left. In my example run, I'm imagining a game with a more sophisticated scoring system than tic-tac-toe, so the score that I'm using is the reward of the terminal state to the player that's making the decision. Hope this helps!
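For tic-tac-toe, for instance, the reward at the true terminal state could be as simple as this sketch (my own toy convention, not taken from the video):

```python
def terminal_reward(winner, player):
    # winner is 'X', 'O' or None for a draw; player is the side making the decision.
    if winner is None:
        return 0                            # draw
    return 1 if winner == player else -1    # win or loss
```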
John Levine Thanks for the reply. However, in the expansion phase we're adding possible moves we can make, but before that, how can we simulate where the enemy will place his move?
John Levine If I elaborate: in the expansion phase, our current state is the state where the AI placed the stone. In a turn-based game, the nodes added to S2 in the example in the video should be the opponent's moves. How do you simulate that?
In a 2-player game the opponent will choose moves that minimise the first player's score, while also choosing moves that keep the exploration term high. So the easiest way to do this is to put a negative sign in front of the first term, and then return the move that gives the maximum value for this new formula.
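A rough sketch of that sign flip in Python (the exploration constant and attribute names are illustrative, and unvisited children are assumed to be handled separately with the usual "infinity" rule):

```python
import math

def uct_value(child, parent_visits, c=math.sqrt(2), opponents_turn=False):
    exploit = child.total_value / child.visits   # average reward for player one
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    # At the opponent's decision nodes, negate the exploitation term and
    # still take the maximum, as described above.
    return (-exploit if opponents_turn else exploit) + explore
```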
Hey, I want to learn monte carlo tree search, but I also want to feel like I'm going deaf in one ear sporadically... this video ---> "I got you"
It's a little weird that we don't account for the first rollout of S1 in its children (we can't because we don't keep the path visited). It looks like once we get to a child of S1 we "forget" that first rollout.
Thank you sir for the great explanation!
A really cool explanation of MCTS. Just some pointers: I never really understood what V_i stood for (I understood it towards the end of the video); it would have been better if you had replaced it with simply t/n in the formula
Hello, thank you for the explanation, it made things a lot clearer. I do have one question: do we save the path taken during the rollouts in order to ensure that during subsequent rollouts we don't take the same path?
Rollouts are discarded and only their outcome affects following iterations (source DOI: 10.1109/TCIAIG.2012.2186810, p5-9)
Why is there the logarithm and the square root in the UCB formula? What would have changed if there had been the N/n_i directly?
Amazing explanation. Thank you. But I'm not sure how to get the terminal value of a leaf state. Is it the average score you get after you do random actions from this point?
On what basis do you give a value to the unvisited nodes, or do you just generate them randomly?
Straightforward explanation
amazing walkthrough, thank you
Thank you for helping me in my AI studies... you're great.
6:16 - how do we compute the value of the terminal state? (For instance, you said v = 20.) I'm trying to apply MCTS to an easy case like tic-tac-toe ^^
Thank you very much! My game of Othello is fully working now! :)
According to the flowchart, shouldn't the start node itself be considered as a leaf node at the start?
Super clear explanation!!! 🌟🌟🌟🌟🌟
Thank you for the explanation. It was clear!
Very good and easy to follow explanation!
Thank you! Glad you liked it.
Great video! However, unless I am mistaken, the t (reward) value for s_0 should be (24 + 20)/2, the average value of the child nodes of the tree.
Question: the professor says that after the algorithm has run out of iterations and budget, pick the child with the highest score; most textbooks I consulted say to pick the child with the highest number of VISITS, because picking the one with the highest score leads to bias.
Great explanation. You are slow and clear. It might be nice to mount the camera. Question: What is the point of back propagating the t value of the initial state (S-zero)?
Thanks Ben! This was our first video - we mounted the camera for the next one and then used an external microphone, which helped too. You're right that there's no real point in calculating the score of S0, except to find out what the estimated value of the game is to the agent - if you have a choice to play or not to play, you might decline if the estimated value is negative. But if you have to play anyway, you can omit this calculation.
@@johnlevine2909 If I may, I will offer a practical take on the question of propagating all the way up. From pure intuition I would imagine the backpropagation code to be cleaner if it is allowed to propagate all the way up to the root node. Sure, it's useless in most cases, but I imagine we can skip a subjectively ugly if-statement checking whether the current node is the root (and, practically, a computational step). I'm thinking something as simple as
propagate(t) => { this.n++; this.t += t; if (this.parent != null) this.parent.propagate(t); }
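In Python the same idea might look like this minimal sketch (class and attribute names are mine, not the video's):

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.n = 0      # visit count
        self.t = 0.0    # total value

    def propagate(self, t):
        # Add the rollout value and one visit, then recurse up to the root.
        self.n += 1
        self.t += t
        if self.parent is not None:
            self.parent.propagate(t)
```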
Great video! Helped me a lot, thanks.
If S2 is better than S1, should we choose S3 or S4? Is there a rule for that in the MCTS algorithm, or does it differ between implementations?
Lifesaver, great run-through
Very helpful video! But I'm still confused when I consider stochastic games such as Klondike. In those games, one action doesn't correspond to only a single next state. Am I missing or misunderstanding something? Can anyone help me?
Great lecture and vivid example! Thanks a lot! ^_^
Thanks for the great explanation! I have one question though: If we now decide after four iterations to choose action a_2 as you showed in the video, then the next four iterations would start in s_2 to determine the next action. For this, do we use the results of the calculations from previous iterations or do we "throw everything away" and start with all values initialized to 0?
Let's say we do select a_2 for execution. So we perform a_2 and go into state s_2, as you say. s_2 now becomes the new root node, and we need to decide what action we should take next. As you also say, we already have some results stored in the tree which are useful to us, so it makes sense to start the next iteration using the sub-tree with s_2 as the root node.
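A minimal sketch of that reuse (the `children` dict keyed by action is my own assumption about the data structure, not the video's):

```python
def advance_root(root, action_taken):
    # Keep the statistics already gathered under the child we actually
    # moved to, make it the new root, and let the rest of the old tree go.
    new_root = root.children[action_taken]
    new_root.parent = None
    return new_root
```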
Very nice explanation, but I still don't get what the rollout phase does. Is it a continuation of choosing random states until the algorithm hits some terminal criterion (with the subsequent evaluation)?
Great explanation!! Thank you a lot!! Keep up the good work!
What's the difference between a leaf node and a terminal state?
Really great lecture. Continue the good work :)
Thank you for the great lecture.
John Levine and Josh Starmer are carrying me so hard my feet aren't even touching the ground
Why is the value of V after the rollout of S1 20, while in the case of S2 it is 10? Where are these values coming from?
They are seemingly arbitrary values for a terminal state in this setting. It might make more sense to relate this to chess, where the value of such a rollout is either +1 (win), 0 (draw), or -1 (loss), which is then propagated upwards.
I have a question about the UCB1 formula: what exactly is N? Is it the number of visits of the parent node, or the number of visits of the root node?
The root node was always the parent node in all the calculated examples, so I'm not 100% sure about that.
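My current understanding of the formula, for reference (I believe N is the parent's visit count in standard UCT, which just happens to coincide with the root's count in the video's examples - please correct me if that's wrong; the names are mine):

```python
import math

def ucb1(child_total, child_visits, parent_visits, c=math.sqrt(2)):
    if child_visits == 0:
        return math.inf               # unvisited children are tried first
    return (child_total / child_visits
            + c * math.sqrt(math.log(parent_visits) / child_visits))
```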