Alpha Zero and Monte Carlo Tree Search

  • Published: 6 Nov 2024

Comments • 90

  • @JoshVarty
    @JoshVarty  3 years ago +37

    Couple of mistakes I noticed:
    3:02 The numbers in the bottom left of the tree are incorrect and should have been toggled.
    12:42 I say "root node" but should have said "leaf node"
    21:36 I say "root node" but should have said "leaf node"

    • @dsadsa4336
      @dsadsa4336 3 years ago +1

      3:02 you say that after every play the signs of the numbers in the representation should flip but in the lower leftmost nodes this doesn't happen

    • @JoshVarty
      @JoshVarty  3 years ago +1

      @@dsadsa4336 Good catch! Those should have changed as well.

  • @bensonmiakoun7674
    @bensonmiakoun7674 4 years ago +29

    Watched many MCTS videos and this is the only one I've been looking for. Brief + actual code. Subbed!

  • @wedenigt
    @wedenigt 3 years ago +22

    What a great explanation. I especially like the visualization of the MCTS algorithm!

  • @leeschmalz6365
    @leeschmalz6365 2 years ago +4

    This visualization of MCTS finally crystallized it for me. Thanks!

  • @jackrubiralta1925
    @jackrubiralta1925 3 years ago +8

    I would really love an in-depth video about the machine learning code in this project.
    Great video!

  • @CedricPuig
    @CedricPuig 3 years ago +4

    Best explanation so far, with the code available too. Still hard to understand everything anyway when I try to put it in place for real (UCB & prior are tricky). You reproduced a step-by-step debugger on your blog page; I cannot believe you did this by hand.
    Awesome.
    Thanks.

    • @CedricPuig
      @CedricPuig 3 years ago

      It took me too much time to realize that "prior" means "priority"

    • @DerIntergalaktische
      @DerIntergalaktische 2 years ago +1

      @@CedricPuig Pretty sure "prior" comes from "a priori", i.e. what we believe the probabilities to be before we make observations with the MCTS. After doing the tree search and getting the probability distribution representing the visit counts, those would be posterior probabilities, from "a posteriori".
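
      To make "prior" concrete, here is a minimal sketch (not the video's exact code; `Node` and `expand` are illustrative names) of how AlphaZero-style MCTS typically stores the policy network's output probability on each child at expansion time, before any visits happen:

```python
import numpy as np

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): the policy network's a priori probability
        self.visit_count = 0      # visit counts later give the posterior distribution
        self.value_sum = 0.0
        self.children = {}

def expand(node, action_probs):
    """Attach one child per legal move, storing the network's prior for that move."""
    for action, prob in enumerate(action_probs):
        if prob > 0:              # illegal moves are assumed to be masked to 0
            node.children[action] = Node(prior=prob)

root = Node(prior=1.0)
expand(root, np.array([0.4, 0.3, 0.2, 0.1]))  # e.g. a 4-slot board
```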

  • @AtoZshow115
    @AtoZshow115 1 month ago

    Been struggling to wrap my head around MCTS; this really helped, thanks man

  • @victorvandermilte5857
    @victorvandermilte5857 1 year ago +1

    This was the video I needed to get an intuition on MCTS. I had read so much and watched a bunch of other videos but they all repeated the same textbook buzzwords without actually explaining the concepts. What you taught me that nobody else did (at least not well enough for me to understand what they were talking about), is that you need to iteratively run the algorithm a bunch of times ("number of simulations"). Before that MCTS was absolute nonsense, but now it all seems obvious. Great vid, shame you don't have that many others.

  • @PaulAaronCooney
    @PaulAaronCooney 3 years ago +3

    Fantastic video. Very clear. Thank you for making it.

  • @godlyradmehr2004
    @godlyradmehr2004 6 months ago

    The best video that I've seen about AlphaZero and Monte Carlo tree search algorithms
    Thank you so much ❤❤❤

  • @DerIntergalaktische
    @DerIntergalaktische 2 years ago

    Awesome video! First one I found that explains how things work together and not just the components separately.

  • @jeffwads
    @jeffwads 3 years ago +3

    Excellent work on this.

  • @mrbasketball1991
    @mrbasketball1991 4 years ago +2

    Incredibly good explanation!

  • @zdzdfzde
    @zdzdfzde 1 year ago +1

    Thanks Josh. I finished the lectures from David Silver and I wanted to understand MCTS a little better, especially when there is another player involved who will make opposing moves and you have to assume he will also make intelligent moves. To predict his moves you need to query a model (as in model-based RL) that knows the dynamics of the game and maybe even covers the behaviour of the other player; or, if the model doesn't cover the behaviour of the opposing player, let him move according to the same policy you are using for player 1. But then you need to flip the input bits that represent the state of your board (like a Go board) for every step down the search tree. Now I'm still wondering how a model-based agent with MCTS for planning is implemented: does the trained model include the opposing player's behaviour, or are his moves also "rolled out" at the same time as player 1's moves during the construction of the MC search tree?
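
    For reference, a minimal sketch of the perspective flip being described (the convention the video uses: pieces are +1/-1 and the board is always presented from the player to move, so the signs toggle at every ply):

```python
import numpy as np

def to_canonical(board, player):
    """Return the board from the perspective of `player`.

    With pieces encoded as +1 (player 1) and -1 (player 2), multiplying by the
    current player flips the signs so the player to move always sees their own
    pieces as +1. Stepping one ply down the search tree just negates the board.
    """
    return board * player

board = np.array([1, -1, 0, 0])        # player 1 on slot 0, player 2 on slot 1
print(to_canonical(board, player=1))   # [ 1 -1  0  0]
print(to_canonical(board, player=-1))  # [-1  1  0  0]  the opponent's view, one ply deeper
```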

  • @drayg0n806
    @drayg0n806 1 year ago +1

    best explanation. great job. thank you

  • @alanjohnstone8766
    @alanjohnstone8766 1 year ago

    The more I look the more questions I have, and you seem to know what you are talking about in the detail I need! This simple example is great for pinning things down before I start programming.
    It is obvious really, but many states appear several times in the tree, and therefore additional time and memory is expended doing the same thing, e.g. if stones are put in positions 1, 2, 3 then the same state arises as if they were put down in the order 3, 2, 1. This is also a problem for pure MCTS. Would it be worth keeping all nodes in a dictionary and, if a node already exists, using it instead of creating another? The only change I can think of is that the parent property would have to become a list of parents. Note you do not seem to have stored the node's parent, so I do not know how you do the backup.
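
    One possible sketch of the transposition-table idea described above (illustrative only, not part of the video's code): states are keyed by their board contents, an existing node is reused when the same state is reached again, and `parents` is kept as a list so backup can follow every path.

```python
class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}
        self.parents = []                  # several move orders can reach the same state

transposition_table = {}                   # key: hashable board, value: shared Node

def get_or_create(board, prior, parent=None):
    """Reuse an existing node for this board if we have one, otherwise create it."""
    key = tuple(board)                     # tuples are hashable; raw lists/arrays are not
    node = transposition_table.get(key)
    if node is None:
        node = Node(prior)
        transposition_table[key] = node
    if parent is not None:
        node.parents.append(parent)
    return node
```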

  • @kimchi_taco
    @kimchi_taco 1 year ago

    Finally I understand AlphaZero. Thx! 🙏

  • @avo_k
    @avo_k 3 years ago

    wow, I've been looking for that video for a long time, thank you so much this is so clear

  • @scasci3929
    @scasci3929 4 years ago +2

    good explanation and great slides... well done!

  • @melamparithy
    @melamparithy 3 years ago +1

    Clearly explained, thank you.

  • @hcm9999
    @hcm9999 1 month ago

    Question: given enough time, can MCTS (Monte Carlo Tree Search) find the best solution?
    The problem with MCTS is that it chooses the child node with the highest probability of having a solution.
    As long as those probabilities don't change, MCTS will choose the same node no matter how many iterations you perform. That means some leaves (terminal nodes) are unreachable.
    If the best solution happens to be in an unreachable leaf, MCTS will never find it.
    The problem is how you calculate the probabilities of the child nodes.
    There are formulas like UCT, but I don't know if such formulas are really correct.
    The probabilities need to change in such a way as to allow MCTS to traverse every single node and every single leaf of the tree.
    I don't think UCT does that.
    That means some nodes and leaves are unreachable.

    • @kkkkjjjj4517
      @kkkkjjjj4517 1 month ago +1

      "The problem with MCTS is that it chooses the child node with the highest probability of having a solution."
      It's not shown in this video, but in real implementations MCTS would use something like "epsilon greedy": epsilon of 5% (i.e. pick a child randomly) vs 95% greedy. In this way it ensures good exploration of the state space.
      "The probabilities need to change in such a way as to allow MCTS to traverse every single node and every single leaf of the tree."
      Brute force is not possible for real AI problems (like the game of Go). That's why we have these kinds of AI algos. If you could visit all nodes, it would be some standard CS algo, not an AI algo. Yes, theoretically there could exist better solutions than those deduced by MCTS, and it's not foolproof. But it's an acceptable compromise.

  • @pramodzorze
    @pramodzorze 2 years ago

    awesome simulation and explanation!

  • @Drawland2012
    @Drawland2012 10 months ago

    Awesome breakdown, thanks man

  • @sozenoz
    @sozenoz 2 years ago +3

    What visualization library are you using for the graph state with forward and backward state transitions?

  • @yusufb7560
    @yusufb7560 1 year ago

    This is a great explanation. Thanks, you helped me a lot

  • @carlosaguayo
    @carlosaguayo 4 years ago +1

    Great explanation, thanks!

  • @abhaychandra2624
    @abhaychandra2624 4 months ago

    Could you please explain why it is easier to feed the neural network when we toggle the values and keep them with respect to the player about to move?

  • @aungpaing7305
    @aungpaing7305 3 years ago

    Great video. I really got a clear understanding of Monte Carlo Tree Search.
    Can I ask how you make the slides? It's not just PowerPoint, right? I see you run the simulation of the code in the slides. That is awesome.

    • @JoshVarty
      @JoshVarty  3 years ago +1

      Thanks! The animations are just HTML/Javascript/CSS. You can see them in action at: joshvarty.github.io/AlphaZero/

  • @presence5834
    @presence5834 2 months ago

    Just wondering how you can decide which is the leaf node in an NLP task?

  • @elirand1986
    @elirand1986 4 years ago +1

    Brilliant, Thanks dude!

  • @AlessandroOrlandi83
    @AlessandroOrlandi83 3 years ago

    This is really well made! Thank you for the good video!

  • @NashGRS
    @NashGRS 3 years ago

    Nice video, great explanation. Thank you.

  • @ouahidlakhal5934
    @ouahidlakhal5934 2 years ago

    Thank you for the effort.

  • @mitchb9766
    @mitchb9766 2 years ago

    I thought the idea of MCTS was to select nodes randomly and average out values so that you don't have to search the full tree, but you said that when all values are equal it just selects the first available state. How does this work? Won't it just end up exploring the whole game tree, or getting stuck? In the beginning it doesn't know which moves are valuable, so they all receive the same weight. It selects the first available move, plays through until it receives a reward, and then distributes the reward back up the tree. If that resulted in a losing game, the initial move would receive a lower value while the rest would all still be equal, meaning it would then search the next available move from the beginning, working its way through the entire tree. Or, if the search of the first move resulted in a win, it would receive a higher value and that move would continually be explored.

    • @henrikf8777
      @henrikf8777 1 year ago +1

      I have limited knowledge of this, but explaining it might help me learn as well.
      Your second point about getting stuck is the easier question: the UCB prevents this, as after a while it starts prioritizing nodes that haven't been visited enough. As the total number of visits increases while the current node has still only been visited once or zero times, its UCB score will keep increasing, which makes it more likely to be picked next time. This all assumes the prior is not always 0, and I don't get myself why that couldn't be the case, but I suppose the randomness of the neural network or adding some constant solves this.
      About moving through the entire game tree: this might be the case eventually, but it's not something you need to do. After one of these trees has been created over N simulations, you've already trained your policy function to try to mimic the tree for that root state and a huge chunk of simulated descendant states. It's the policy function which you will save and use to play the game with. The goal is to make the neural network learn from the MCTS tree so that you can then discard the tree and not have to run these expensive simulations.
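
      To make the exploration mechanism above concrete, here is a minimal sketch of the classic UCB1/UCT selection rule (a simplification; AlphaZero itself uses a prior-weighted variant): the exploration bonus grows with the parent's visit count, so a child that keeps being ignored is eventually selected.

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=1.4):
    """Classic UCT score: average value (exploitation) plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")                  # unvisited children get tried first
    exploitation = child_value_sum / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

# A child visited once with a poor result still overtakes the others eventually,
# because its exploration bonus keeps growing as the parent's visit count grows.
for parent_visits in (10, 100, 1000):
    print(parent_visits, round(ucb1(-1.0, 1, parent_visits), 3))
```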

  • @MLPvipiao
    @MLPvipiao 2 years ago

    At 22:25, shouldn't the child.visit_count+1 be under the square root?
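
    For comparison, this is a sketch of the PUCT-style score commonly used in AlphaZero-style implementations (an assumed form, not necessarily the video's exact code): only the parent's visit count sits under the square root, while `child.visit_count + 1` stays in the denominator outside it.

```python
import math

def ucb_score(parent_visits, child_visits, child_value, child_prior, c_puct=1.0):
    """PUCT-style score as commonly written for AlphaZero-like MCTS (assumed form)."""
    prior_score = c_puct * child_prior * math.sqrt(parent_visits) / (child_visits + 1)
    value_score = child_value if child_visits > 0 else 0.0
    return value_score + prior_score

# An unvisited child with a strong prior vs. a visited child with a weak prior:
print(ucb_score(parent_visits=5, child_visits=0, child_value=0.0, child_prior=0.5))
print(ucb_score(parent_visits=5, child_visits=2, child_value=0.3, child_prior=0.1))
```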

  • @oneJL
    @oneJL 2 years ago +1

    May I ask which visualization/simulation tool you used for the MCTS algorithm (14:42)? Thanks in advance!

    • @JoshVarty
      @JoshVarty  2 years ago

      No tool, I just used HTML/JS/SVG. The blog post in the description has the visualizations in it.

  • @ehmad11
    @ehmad11 3 years ago +1

    this is really awesome

  • @cblbopotka3915
    @cblbopotka3915 2 years ago

    I still don't understand why at 20:33 the value of [-1 0 0 0] is zeroed. Also its visit count is 2, but the children's visit counts are 1, 0, 0. OK, at least I managed to figure out how the -0.3 turned out: (0 + 0.5 + 0.5 + 0.5) / (2 + 1 + 1 + 1) = 1.5 / 5 = 0.3 (don't ask why)

  • @-_Nuke_-
    @-_Nuke_- 11 months ago

    WHY DOES EVERYONE TALK ABOUT ALPHA ZERO? It's something we don't know too much about, COMPARED to LC0, which is an open-source engine that is exactly like (or by now even better than) Alpha Zero.
    Give my girl LC0 some love! :D

    • @Dalroc
      @Dalroc 5 months ago

      LC0 is just a chess specific architecture that's specifically based on the A0 architecture.

    • @-_Nuke_-
      @-_Nuke_- 5 months ago

      @@Dalroc In theory. Because we don't know (I think) the source code of A0, I think LC0 was created from scratch.

  • @Nadzap
    @Nadzap 1 year ago

    Great video

  • @najmamathema6738
    @najmamathema6738 2 years ago

    this is gold! Thanks!

  • @ChoMinsung-n1h
    @ChoMinsung-n1h 3 months ago

    Can you make a video on how to play this model on Kaggle or locally?

  • @ליאתשיף
    @ליאתשיף 3 years ago

    The best explanation!!! thanks :)

  • @fredfred9847
    @fredfred9847 2 years ago

    Great video!

  • @alanjohnstone8766
    @alanjohnstone8766 1 year ago

    I know that the video was posted a long time ago, but maybe somebody will see this and answer.
    I thought that MCTS did depth-first search, i.e. went all the way to a terminal node. The way it is programmed here it uses breadth-first search, which means it takes a long time to get any real information, particularly in games like Connect 4, chess, etc. where there is only a non-zero reward at the end of the game. Am I wrong?

    • @JoshVarty
      @JoshVarty  1 year ago

      In traditional MCTS there is an additional step where you perform “rollout” to determine the value of a given position. Often this would involve multiple rounds of playing random moves until the game terminates. Then you would aggregate the outcomes of these random games (all having started at the original position) as some measure of approximate “value” of the original position.
      AlphaZero does not use rollout. Instead we use the value network to estimate the value of a given position.
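
      A minimal sketch of the difference described here (illustrative names; the `state` interface, `random_playout` and `value_net` are assumptions, not the video's code):

```python
import random

def random_playout(state):
    """Traditional MCTS rollout: play random legal moves until the game ends."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.outcome()             # e.g. +1 / 0 / -1 from the original player's view

def alphazero_evaluate(state, value_net):
    """AlphaZero skips the rollout and asks the value network for an estimate."""
    return value_net(state.encode())   # a scalar in [-1, 1]
```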

    • @alanjohnstone8766
      @alanjohnstone8766 1 year ago

      @@JoshVarty Thank you for your quick reply. I thought that AlphaZero still used rollout but rather than using random moves it used the value function (or the action probabilities) to guide its course. I cannot find it now but I read somewhere that they actually used a smaller network that could produce predictions much faster than the main network to guide the rollout. However while writing this I did some Googling and your understanding seems correct and that idea must relate to something else. I have another question but rather than bury it here I will make another comment

  • @nagycsapat931
    @nagycsapat931 3 years ago

    My question is: does the policy network have to have a result for every possible move? For example, in chess or Go there must be billions of possible moves; what would the network look like then?

    • @JoshVarty
      @JoshVarty  3 years ago

      No, it does not have to get a result for every possible move.
      At 21:22 you can see the tree we have explored. We have not explored all possible moves (see 2:20 for the full tree). We use the value network to "estimate" the value of a given state without fully (and more accurately) evaluating every possible game that could arise from that state.
      At 21:22 we have performed only 5 rounds of MCTS simulation. The algorithm tells us that the best move to play would be the very first position. This is wrong! The best position would have been in the second or third positions. This is because we have only used 5 rounds of MCTS simulation. In general, increasing the number of MCTS rounds allows us to search deeper into the tree (and therefore get better predictions), though we will never fully explore it.
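
      A minimal sketch of the step being described (hypothetical names): run a fixed number of MCTS simulations from the root, then play the move whose child was visited the most. More simulations mean deeper search and usually better moves.

```python
def select_move(root, num_simulations, run_one_simulation):
    """Run `num_simulations` MCTS rounds, then return the most-visited child's action.

    `run_one_simulation` is assumed to perform one select/expand/evaluate/backup pass.
    """
    for _ in range(num_simulations):
        run_one_simulation(root)
    return max(root.children, key=lambda action: root.children[action].visit_count)
```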

    • @nagycsapat931
      @nagycsapat931 3 years ago

      @@JoshVarty Yeah, but how would you actually implement the policy network for an uncertain number of moves? In the example it's obvious that there are 4 moves that can be played, so the policy network needs to have 4 output nodes, one for each action. But in chess, for example, there is an unimaginably big number of moves that can be played. If you wanted to make a policy network for chess, how many outputs would there be?

    • @JoshVarty
      @JoshVarty  3 years ago

      @@nagycsapat931 Before we talk about the policy network, we should first talk about neural networks in general to make sure you understand how they work. One of the most commonly used image datasets is called MNIST. It contains ~60,000 images of handwritten digits (0-9). Each image is 28x28.
      When we train a neural network on this dataset, we show it a 28x28 image and get its prediction (i.e. Is this a 1? Is this a 2? etc.). We look at the output and then adjust the weights slightly (via SGD) to make it slightly more likely to make the correct prediction in the future. We do this many, many times. After we have trained this neural network, we show it images that it has never seen before and measure how accurately it predicts the correct digit. This is very important: we have not trained it on every single 28x28 image. We have only shown it ~60,000 examples, but we are happy to see that it generalizes to images it has never seen before.
      For the policy network (7:10) we have an input of 4 numbers (the current state of the game) and it outputs 4 numbers (the confidence the network has that we should place a piece at each position). We train this network by showing it many different games of self-play and trying to make it match the probabilities generated by MCTS.
      We do not show the policy network every single possible move. We hope that by showing it many, many possible moves during training it will generalize and be able to create good predictions for boards it has never seen before.
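
      As a purely illustrative sketch (PyTorch, not the project's actual model) of the shape being described, with 4 board values in, 4 move probabilities out, plus a value head:

```python
import torch
import torch.nn as nn

class TinyPolicyValueNet(nn.Module):
    """4 board positions in; 4 move probabilities and 1 value out (illustrative only)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(4, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, 4)   # one logit per possible move
        self.value_head = nn.Linear(hidden, 1)    # scalar value, squashed to [-1, 1]

    def forward(self, board):
        x = self.body(board)
        policy = torch.softmax(self.policy_head(x), dim=-1)
        value = torch.tanh(self.value_head(x))
        return policy, value

net = TinyPolicyValueNet()
policy, value = net(torch.tensor([[0.0, -1.0, 1.0, 0.0]]))  # an example board state
print(policy, value.item())
```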

    • @nagycsapat931
      @nagycsapat931 3 years ago

      @@JoshVarty So my understanding is that the value network's input nodes represent the state (for example: 1st input node: 0, 2nd input node: -1, 3rd input node: 1, 4th input node: 0), i.e. (0, -1, 1, 0), and for these numbers it outputs the likelihood of winning in this position (otherwise called the "value"). This value between -1 and 1 is the single output node of the value network. The policy network's input nodes, on the other hand, are the same as the value network's (0, -1, 1, 0), but there is not just 1 output node; there are 4. Those 4 output nodes' values represent the likelihood that we should play each move (and also help improve our MCTS).

    • @JoshVarty
      @JoshVarty  3 years ago

      ​@@nagycsapat931 I think your understanding is correct. For games like Go (19x19 board) there are (at most) 361 possible moves on each turn, so you might imagine a policy network with 361 outputs. So there aren't millions or billions of moves to consider, but once you start looking ahead with MCTS, this search space does explode.
      You've raised an interesting point. The approaches we're discussing here work reasonably well for games like Tic-Tac-Toe, Chess and even Go. That said, they don't work that well for games like StarCraft or Dota which have a much, much larger set of possible moves at every frame.

  • @karolszymczyk8170
    @karolszymczyk8170 3 years ago

    Great job!!! It helped me a lot!!

  • @andreasbauer3039
    @andreasbauer3039 4 years ago

    great video. Thank you very much

  • @Felipedacruz443
    @Felipedacruz443 10 months ago

    New Roslyn series please!!!

  • @shuaishuai2009
    @shuaishuai2009 2 years ago

    Great video! What is your running IDE?

    • @JoshVarty
      @JoshVarty  2 years ago

      Thanks! I use VS Code mostly. (In this video the code isn’t actually from an IDE. It’s just HTML/JavaScript that I made to look like VS Code).
      There’s a link in the description that will take you to the blog post with the code.

  • @lanvukusic
    @lanvukusic 3 years ago

    Hey, great video. Could you tell us what library you used for the visualization?

    • @JoshVarty
      @JoshVarty  3 years ago +3

      Unfortunately I didn't use a library. I had to animate it with vanilla Javascript and SVG objects.

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 2 years ago

    One thing I don't understand: why don't you assume that the opponent will play the best move, instead of propagating the (weighted?) mean up the tree? Let's say you play chess and you have a forced mate; wouldn't it be most logical to go for that mate even if another branch on average gives you better value predictions? (A good example might be sacrificing most of your pieces to deliver a forced mate, Morphy-style.)
    Also, when you train the net, do you take only the root nodes (which take in the most information), or do you train on all state/policy pairs you have from the MCTS session?

    • @JoshVarty
      @JoshVarty  2 years ago

      > One thing I don't understand. Why you don't assume that the opponent will play the best move,
      I think the fundamental problem is: "What is the best move?" It's easy for us to write a program that can identify mate-in-one moves. But what about situations where there is no forced mate? How do we find the best move?
      > wouldn't it be the most logical to go for that mate even if another branch on average gives you better value predictions?
      This situation should not happen. A mate-in-one move will be valued as 1.0 (the highest possible value). No other move can have a higher value. We want exactly the properties you're describing: If sacrificing pieces leads to victory, then that path should have a high value! We don't encode any information about the value of a queen, bishop etc.
      > Also, when you train the net, do you take only the root nodes (which take in the most information), or do you train all state/policy pairs you have from the MCTS session?
      We only record and train with moves that were actually played in the game.
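
      A minimal sketch of what "record and train with moves that were actually played" typically looks like in AlphaZero-style self-play (hypothetical `game` and `run_mcts` interfaces; the perspective convention is an assumption): each position reached in the game is stored with the MCTS visit distribution, and once the game ends, every stored position is labelled with the outcome from the point of view of the player who was to move.

```python
def play_one_game(game, run_mcts):
    """Collect (state, mcts_policy, value) training tuples from one self-play game."""
    history = []                                   # (canonical_state, policy) per move played
    state, player = game.initial_state(), 1
    while not game.is_terminal(state):
        policy = run_mcts(state, player)           # visit-count distribution over moves
        history.append((game.canonical(state, player), policy))
        move = policy.argmax()                     # or sample, for extra exploration
        state, player = game.apply(state, move), -player
    outcome = game.outcome(state)                  # +1 / 0 / -1 from player 1's view
    # Flip the sign on alternate plies so each label is from the mover's perspective.
    return [(s, p, outcome * (-1) ** i) for i, (s, p) in enumerate(history)]
```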

    • @vladimirtchuiev2218
      @vladimirtchuiev2218 2 years ago

      @@JoshVarty > I think the fundamental problem is: "What is the best move?" It's easy for us to write a program that can identify mate-in-one moves. But what about situations where there is no forced mate? How do we find the best move?
      Let me be clearer. After you have the UCB score for each candidate move from the board position, you will choose the best one. I'm talking about the phase where you expand a sub-tree from each candidate move. When you collapse it, the formulation uses the mean of all leaf node values corresponding to the candidate moves. Why not use argmax instead?
      > This situation should not happen. A mate-in-one move will be valued as 1.0 (the highest possible value). No other move can have a higher value. We want exactly the properties you're describing: If sacrificing pieces leads to victory, then that path should have a high value! We don't encode any information about the value of a queen, bishop etc.
      For mate-in-1 situations you can just brute-force it via sampling, because you'll have a reasonable chance of landing on the winning move. I'm talking about a situation where you have a forced mate in 10, where there is only a single move per opponent move that leads to that mate; otherwise you'll end up in a worse position. Let's say, for example, the alternative is a pawn move that gives you more control of the center but doesn't leave you with a possible mating sequence. The latter will have the higher average value score, but the best move is taking the forced-mate route, assuming you can infer all the moves in the sequence. Isn't taking the argmax for computing the candidate move value better?
      > We only record and train with moves that were actually played in the game.
      Doesn't that make the training prohibitively inefficient? Creating an entire tree just to get a single perfect sample? Or is that the only way to create an engine capable of beating Stockfish, by running 150 TPUs on self-play games to create a sufficient number of positions for training? You'll be losing hundreds to thousands of data points per tree. If I want to train such an algorithm on my puny RTX 2060 I'll need to use a significant chunk of each tree, otherwise it will take me months to train an algorithm that can beat me, let alone compete with GMs and top-tier engines.

  • @devilsolution9781
    @devilsolution9781 8 months ago

    great video though

  • @Falstad88
    @Falstad88 3 years ago

    What's the tool you are using to visualize building of the graph while the code is running?

    • @JoshVarty
      @JoshVarty  3 years ago

      No tool, I just manually made them as SVG drawings and animated them with Javascript.

  • @ShaunGan
    @ShaunGan 3 years ago

    Sick sick sick

  • @0114mercury
    @0114mercury 3 years ago

    At 4:30, you create a training set by simulating many games. However, that particular training set has the second sample ([-1 0 0 0], -1) with an incorrect label: the position is drawn for the second player, but the label is -1.
    How do you address the issue of generating training data with wrong labels?

    • @JoshVarty
      @JoshVarty  3 years ago

      I think the label is correct. The board is seen from the perspective of Player 2. Player 2 lost the game so we assign that board a value of -1.
      From what I've seen RL is pretty finicky and probably doesn't handle noisy labels very well. It's important to make sure your labels are correct.

    • @0114mercury
      @0114mercury 3 years ago

      @@JoshVarty The player lost the game, but the position is NOT lost. He can draw the game.

    • @JoshVarty
      @JoshVarty  3 years ago +1

      @@0114mercury You're correct that the position is not 100% lost. However, the game in which we saw that position led to a loss.
      The idea here is that if we play thousands of games and mark positions that lead to a loss with -1 and positions that lead to a win with +1, we will slowly get a value network that can value positions accurately. We just have to play lots and lots and lots of games for this to happen.
      If the position [-1,0,0,0] is a good position, other instances of self-play will value it as +1 and in aggregate our network will learn to value this position with a positive value.
      If the position [-1,0,0,0] is a bad position, other instances of self-play will value it as -1 and in aggregate our network will learn to value this position with a negative value.
      Does that make sense?

    • @0114mercury
      @0114mercury 3 years ago

      @@JoshVarty Yeah, OK, I was thinking along the same lines: a winning position will, on average, be assigned a winning score more often...

  • @darkmatter9583
    @darkmatter9583 1 month ago

    Can't see the code, could you share it please?

  • @oneJL
    @oneJL 2 years ago

    6:37 Player 2's reward is 0, isn't it?

  • @devilsolution9781
    @devilsolution9781 8 months ago

    Why you only let num_sims be 5 in the demo is a bit baffling; you should've let it do a full run.