7:24 Your explanation of MCTS is not correct. For one instance of a simulation: it picks the top move recommended by the network (greedy) most of the time, with random moves some of the time (epsilon). Then it walks into that move and repeats the same thing. It does this to completion. Then it backs up and keeps track of the win vs. visit ratio for each state, as shown in the picture. It repeats this whole process 1600 times. As it performs these walkthroughs it trains the networks and updates the values. So eventually, the more often you see a state, the more it will statistically converge to the optimal value. MCTS runs to completion; it's not a depth-pruning algorithm. Temporal Difference stops somewhere in the middle, and that was not used in AGZ. The MCTS algorithm is discussed by David Silver in his lecture #8 towards the end.
I checked the paper and you are indeed correct! The MCTS they use doesn't always play out every thread until the very end of the game (they use value thresholds for early stopping), but I did misinterpret the meaning of the '1600 simulations'. Thanks for pointing this out!
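To make that loop concrete, here is a minimal sketch of one simulation pass in the spirit described above: walk down the tree, evaluate the leaf with the network, and back the result up into visit/value statistics, repeated ~1600 times per move. The `Node`, `game`, and `net.evaluate` names are hypothetical helpers, not DeepMind's code, and the epsilon-greedy walk is the simplification from the comment rather than the paper's exact selection rule.

```python
import random

# Minimal sketch of the simulation loop discussed above (hypothetical names,
# not DeepMind's code). Each node keeps visit counts N and total value W;
# one "simulation" walks to a leaf, evaluates it with the network, and backs
# the result up the path it took.

class Node:
    def __init__(self, prior):
        self.prior = prior          # P(s, a) from the policy head
        self.visits = 0             # N(s, a)
        self.total_value = 0.0      # W(s, a)
        self.children = {}          # move -> Node

    def value(self):                # Q(s, a) = W(s, a) / N(s, a)
        return self.total_value / self.visits if self.visits else 0.0

def run_simulations(root, game, net, n_simulations=1600, epsilon=0.1):
    for _ in range(n_simulations):
        node, state, path = root, game.copy(), [root]
        # 1) Selection: walk down, mostly greedy on Q, sometimes random (epsilon).
        while node.children:
            if random.random() < epsilon:
                move = random.choice(list(node.children))
            else:
                move = max(node.children, key=lambda m: node.children[m].value())
            state.play(move)
            node = node.children[move]
            path.append(node)
        # 2) Expansion/evaluation: the policy head gives priors for new children,
        #    the value head gives an estimate of the leaf position.
        priors, value = net.evaluate(state)   # assumed: dict of move -> prior, scalar in [-1, 1]
        for move, p in priors.items():
            node.children[move] = Node(prior=p)
        # 3) Backup: update the visit/value statistics along the walked path.
        for n in reversed(path):
            n.visits += 1
            n.total_value += value
            value = -value          # flip perspective for the opponent one ply up
```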
Do I understand correctly: with an average depth of ~300 moves and no early stopping, that would be ~1600*300 network queries just for the first move?
So it plays a game out, and if it wins, the moves it picked are trained towards 1 and the value towards 1? And if it loses, everything towards 0?
AGZ doesn't really use classical MCTS, since it doesn't use rollouts; it doesn't play the game out to the end.
So I am going down the search tree (based on an algorithm that takes exploration and other things into account) until I reach a leaf node.
I put the leaf node position into my neural net and get a policy and an evaluation as a result. The policy adds new leaf nodes under my current position,
and the value function gives me an evaluation of my current position.
Is this correct?
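That matches how the paper describes one simulation. The "algorithm that takes exploration and other things into account" is a PUCT-style rule: pick the child that maximizes Q(s,a) plus an exploration bonus proportional to the prior and to how rarely that child has been visited. A tiny self-contained sketch, with illustrative names of my own:

```python
import math

# PUCT-style selection used to walk down the tree: pick the child maximizing
# Q(s,a) + U(s,a), where U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
# 'children' maps moves to (prior P, visit count N, mean value Q); names are illustrative.

def select_move(children, c_puct=1.0):
    total_visits = sum(n for _, n, _ in children.values())
    def puct(stats):
        prior, visits, q = stats
        u = c_puct * prior * math.sqrt(total_visits) / (1 + visits)
        return q + u
    return max(children, key=lambda move: puct(children[move]))

# Example: the rarely-visited move with a high prior can win on its exploration
# bonus even though its mean value Q is slightly lower.
children = {"D4": (0.6, 10, 0.52), "Q16": (0.3, 1, 0.48)}
print(select_move(children))   # -> "Q16"
```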
This is a valuable explanation; this channel is a great discovery.
noranta4 thanks man, just started two weeks ago ;) More vids coming up :p
I've been programming board game engines for 25 years and I've followed the development of CNNs to play go quite closely. This video is a really good description of the AlphaGo Zero paper, with very clear explanations. Well, the explanation of MCTS was completely wrong, but other than that this video was great. I'll make sure to check out more from this channel.
Your explanation skills are fantastic! I like how he has an outline at the beginning of his video; a very simple thing, yet very effective when it comes to teaching a subject, and yet so few educational videos do that.
If I were to figure out the paper by myself, it would have taken me personally ~2x longer.
Subscribed.
We humans run simulations in our heads all the time, because sometimes simple intuitions are not enough... So I guess it isn't surprising that the inclusion of Monte Carlo Tree Search always drastically improves performance, no matter how good the value function estimates are, even with the help of deep learning... The question is how to search more efficiently, and also how to build an efficient model...
Thank you! This is one of the clearest and most concise explanations of any paper I've found thus far.
Clearest and most informative video I've seen on AlphaGo. Thanks!
Awesome explanation! (And your greenscreen work looks great!)
Best explanation I've found of AlphaGo Zero
You explained technical stuff very clearly. Thanks Arxiv Insights
This is the best video regarding the AlphaGo paper. Just amazing!!!
It's very clear, thank you! I can't wait to discover the other videos :)
Excellent explanation, thanks!! I'm going to make my own 9x9 AlphaGo Zero version.
I couldn't find it on your GitHub, friend.
Thank you for taking the time to explain it so well. It's still difficult for me since I'm not familiar with the subject yet, but you really did a good job of showing it clearly!
Fantastic explanation! Few people balance simplicity with thoroughness as well as you do.
That's the goal indeed, thx for the feedback :)
The part I don't understand is how they dispense with rollouts in MCTS. It seems like this is the only way to get a ground-truth value (by reaching a terminal state) which can then be propagated back up the chain. If you reach a non-terminal state, you're backpropagating a value from the value network, which won't have useful values until it can be trained on useful values from the tree search. It seems like it's pulling itself up by its bootstraps. Is it the case that the true values come from the odd time that a simulation reaches a terminal state? Or am I missing something fundamental?
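For what it's worth, the way the paper handles this is that a leaf is scored by the actual game result only if it happens to be terminal, and by the value head otherwise; the value head itself is then trained to predict the final outcomes z of finished self-play games, which is where the ground truth gradually enters. A rough sketch of that rule, with names of my own:

```python
# Rough sketch of the leaf evaluation rule (illustrative names, not DeepMind's code):
# terminal positions get the real outcome, everything else gets the value-head
# estimate, and the value head is separately regressed toward the self-play results z.

def evaluate_leaf(state, net):
    if state.is_terminal():
        return state.outcome()       # actual result, +1 win / -1 loss for the player to move
    _, value = net.evaluate(state)   # value-head estimate in [-1, 1]
    return value
```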
Brilliant - thanks for this! Really enjoyed watching and I think it takes away all the right information from the paper.
Just a quick point: is there any chance you could quieten down the background music for your next video? It was slightly distracting and I think it detracted a bit from your great explanation!
Merry Christmas!
Thanks a lot, great to hear :) And for the background music: I got the same feedback from a few different people! This was my first video (every other video you'll find on my channel has this fixed) :p
Does anybody know if the shape of the output layer changes for every phase of the game? In the video, he explains that the network produces a probability distribution over possible moves, and the number of possible moves is dynamic. Does that mean the output layer's dimension is also dynamic? If so, how is it achieved? Can anyone help me understand? Thanks!
No, the output layer shape is static. You need to zero out the illegal moves from the output and then renormalize the probabilities to sum to 1.
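A minimal sketch of that masking step, assuming a NumPy policy vector of length 19*19 + 1 (every board point plus the pass move) and a boolean mask of legal moves:

```python
import numpy as np

# The network always emits a fixed-size policy vector (361 board points + 1 pass
# move = 362 entries); illegal moves are zeroed out and the rest renormalized.
def mask_policy(policy, legal_mask):
    masked = np.where(legal_mask, policy, 0.0)
    total = masked.sum()
    if total == 0.0:                         # degenerate case: uniform over legal moves
        masked = legal_mask.astype(float)
        total = masked.sum()
    return masked / total

# Toy usage: a random 362-entry policy with only the first 100 moves legal.
policy = np.random.dirichlet(np.ones(362))
legal = np.arange(362) < 100
print(mask_policy(policy, legal).sum())      # ~1.0
```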
Xander, you look like "The Hoff" (David Hasselhoff), and that's a great look!
Germans love David Hasselhoff
Damn great video! Carry on! It makes it very easy to get into these advanced subjects :)
This is cool, but after the third random jumpscare sound I couldn't pay attention to what you were saying--all I could think about was when the next one would be. Gave up halfway through since it was stressing me out
I am wondering what the output "policy vector" is like in the neural network
7:27 You said "certain depth"; did you mean "certain width"? Btw, I'd say this is one of the very best channels on the DL topic I've ever seen! Thanks so much!
Thanks for the great explanation! I'm still wondering how AlphaGo Zero learns that certain moves are obviously bad, like playing in the corner for example, without playing a game to the end?
Thank you, finally I found a good video on this paper.
A little question on the AlphaGo Zero MCTS. The Monte Carlo aspect of the AlphaGo Zero MCTS seems to be gone, AFAIK. I can't see random numbers or random choices in that MCTS. It seems to be replaced by the CNN calculating the probability of a board position leading to victory. What's your take on it?
10 Nov 2022 - I just discovered this video and your channel. Fantastic explanation of what is, granted, a difficult subject to even tackle. Did you mention what kind of computer hardware the newest AlphaGo system uses? I assume it's a mainframe of some type. Also, I wonder if the system can decide in advance to play a peaceful game or a highly combative game? I have seen games where there were very few prisoners taken off the board, otherwise called a peaceful game. Still, there is a winner nonetheless. Anyway, bravo for an excellent video.
This helps a lot for those who need insights into machine learning trends :)
How is the value representation trained?
Great summary of the paper! Thank you :)
Excellent video, thanks for making it!
the transition 'dhkk' hits hard
You should open this up to community captioning
Done! Good suggestion :)
Loving your channel!
Thank you for this great explanation!
Wow, I see some mistakes, and I haven't watched too many of your videos yet, but I find this channel is definitely underrated.
Thank you for this valuable explanation!
I just want to make a request: could you highlight more clearly the parts of the images you are talking about, e.g. in the diagrams you show? This would make it easier to follow!
Excellent explanation, thanks
Can't wait for this thing to perform in SC2
So what did you think?
Excellent video
Did you guys teach AlphaGo how to beat security systems yet? And take over the stock market and all the nuclear launch codes?
Great explanation. Thank you!
The history (extra 7 layers) is also used to identify ko (kind of similar to threefold repetition in Chess)
Maaaan, you are doing great work, keep it up
I'm confused, as your explanation contradicts the points you mentioned in the introduction
Excellent Presentation
GREAT WORK
Thanks for the impressive explanation! Where can I find the source code?
Wow, this is really a great explanation
Thank you very much for the effort to educate the uninformed about the mechanisms behind AlphaGo, which so many take for a "beast", a "terminator-like machine"... To simplify, I would say that the Go champion effectively played against 50,000 professional Go players, so he had no chance to win at all, just as Kasparov failed to win against 50,000 human amateur players in the last decade. For me it was the massively parallel processes and recursive functions that beat the champion; technically beaten, this is definitely NOT intelligence but MASSIVELY PARALLEL processes clustered on thousands of CPUs and GPUs (floating point operations).
that's some giga chad jaw u have there
Is it hard to implement this algorithm by myself? Could I create a superhuman Go player on, say, a 7x7 board with just my laptop?
How big could I make the board using just a normal laptop?
There are a ton of open-source implementations on GitHub: github.com/topics/alphago-zero, but I know that many people are having issues reproducing the full strength of DeepMind's version. I don't know if the 'interesting game mechanics' of Go also emerge on a small board like 7x7, but I would guess that you can definitely train a decent model on a laptop for such a small game state. Additionally, you could also apply the algorithm to chess, which has a much smaller branching factor so it's easier to train, although again I think that in order to get decent results you would have to throw some cloud computing power into the mix :)
7:37 "... to play about 1600 simulations for every single board evaluation ... " I have a question. How do they do this? Even if it's not 19*19*17, let's say it's just 19*19*2 around 700, there would be (2**700) "board evaluation"s (maybe less if there are some illegal board states, but not much less). How could they even just play one simulation for so many "board evaluation"s? Guess I'm missing something...
To build out the search tree, potential actions are sampled from the policy network (both for white and black moves) (so this hugely constrains the tree rollout to the most-likely / best moves). And then they also do pruning according to the value network, so whenever a part of the search tree results in a very low chance of winning (below some threshold according to the value network) it is discarded and not explored further. Combining these two approaches they build out the search tree and finally decide what move to play at the top of the tree by looking at the average value estimates for each of the tree's branches.
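One detail worth adding from the paper: the final move at the root is picked from the accumulated search statistics, essentially in proportion to the exponentiated visit counts N(s,a)^(1/τ), where the temperature τ trades off exploring different moves against always playing the most-visited one. A small sketch with illustrative names:

```python
import numpy as np

# Turn the root search statistics into a move choice, following the
# pi(a|s) ∝ N(s,a)^(1/tau) rule from the paper (variable names are mine):
# a very low temperature plays the most-visited move, tau = 1 samples in
# proportion to visit counts (used for exploration early in self-play games).

def choose_move(visit_counts, tau=1.0):
    moves = list(visit_counts)
    counts = np.array([visit_counts[m] for m in moves], dtype=float)
    if tau < 1e-3:                       # effectively greedy
        return moves[int(np.argmax(counts))]
    probs = counts ** (1.0 / tau)
    probs /= probs.sum()
    return moves[int(np.random.choice(len(moves), p=probs))]

# Example: after 1600 simulations the root visit counts might look like this.
print(choose_move({"D4": 900, "Q16": 500, "C3": 200}, tau=1.0))
```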
OK. So how do you play, and what is the idea behind it?
What if, instead of self-play training, the AI is trained on data from the matches of a previously trained AlphaZero AI?
You could use that to speed up training in the beginning for version 2.0, but eventually performance will saturate and you won't do better. And if you're building a version 2.0, you're hoping to do better than 1.0, so bootstrapping on gameplay that is worse than what you want to achieve doesn't really make sense. Similarly, AlphaGo Zero got better than AlphaGo by NOT bootstrapping on human games...
11:39 Two of the four "very popular moves that have stood for thousands of years" were dismissed by AlphaGo after 50 hours of training.
Do AlphaFold 2 when the paper comes out :)
Yo, this video is dope. It's super fire. Just letting you know I'm a dan player.
I want to know more about this. I wish you were my school teacher.
So not really AI, just number crunching using statistical analysis (Monte Carlo tree).
Well, it uses deep neural nets (a value estimate + a policy net) plus self-play training (Reinforcement Learning) to make the Monte Carlo Tree Search tractable on the exponentially scaling Go search space. So yes, it's number crunching, but that's what AI is all about...
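To make the "value estimate + policy net" part concrete: in AlphaGo Zero both outputs are heads on a single convolutional trunk that reads the stacked board planes. A toy PyTorch-style sketch, far smaller than the real residual network and with layer sizes chosen just for illustration:

```python
import torch
import torch.nn as nn

# Toy sketch of the two-headed network mentioned above (much smaller than the
# real residual net; layer sizes are illustrative, not the paper's).
class PolicyValueNet(nn.Module):
    def __init__(self, board=19, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1),   # 361 points + pass
        )
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),                        # scalar in [-1, 1]
        )

    def forward(self, x):                 # x: (batch, 17, 19, 19) board planes
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)

# Policy logits over 362 moves and a scalar value estimate for a random position:
p, v = PolicyValueNet()(torch.randn(1, 17, 19, 19))
```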
my brain feels sexually abused after watching this video...
🤣🤣
Go play with yourself, so you can learn Go! XD
Very nice video! Put the microphone closer to you so we don't have that annoying reverb, please.
Interesting video :-)
Danger Will Robinson!!!
distracting music
I think it's just fine
I could barely notice it
Keep on! It's great.
It's too difficult to explain how great AlphaGo is to people who don't know how to play weiqi.
So really it's just one big-ass flow chart: if this, then that.
Anybody know what they use a move history for? Aside from blowing up learning and computation many, many times over? Seems like nonsense.
Ihor Menshykov yeah, I had the same thought at first. Apparently including the history lets the network learn a form of attention over the important/active parts of the game. But I agree that theoretically it shouldn't really be necessary... See the Reddit Q&A for more details!
Good info, but you should consider losing those annoying and jarring scene-transition guitar strums + kick drum sounds. They detract very much from the presentation.
dougdevine27 haha, very true, this was my first video :p I removed them in all my other content ;) Unfortunately, once a video is uploaded, YouTube doesn't let you change anything anymore...
well, basically like humans acquire skills ... from scratch.
cool
Epic👌👌👌👌
*Guitar Noise*
Nice video. But your sound effects and music are VERY loud. Maybe normalize a bit?
🔥🧠🔥
You're...gwern? What?
exactly! Is he?
+1 sub
😊
Get rid of that sound effect. It’s weird and jarring. Do a graphical transition instead.
You explain shi@ too complicated
So big spreadsheet and not AI - got it.
I am not getting a single thing out of this
Sorry, badly made.