My AGA membership was about to expire next month, and I wasn't sure if I wanted to renew since I haven't been taking advantage of it much since I can't really get to any tournaments or anything. But I really enjoy these videos so I just renewed for 2 more years. =)
As both a Go enthusiast and a student researching machine learning and game AI, I feel that I can contribute a bit to some of the AI side of things. Something that Michael has not touched upon is the way that both Master and Zero actually select their moves. Firstly, an important distinction must be made: the implementation of the Monte Carlo tree search (MCTS) and its interface with the neural network is different between Master/Zero and AlphaGo Fan/Lee. While AlphaGo Fan/Lee did in fact use separate neural networks to evaluate "policy" and "value", Master/Zero (both use the same architecture) have combined these networks into a single network with dual outputs. There is little difference from a strictly Go perspective, but from the AI side it means that the policy and value outputs are both "looking at" the same board representation, while previously the policy and value networks would derive different "important features" of the board. The policy output is actually a list of 362 numbers that represent the "prior probability" that a given move may be played by the MCTS (including pass). This is important because it does not strictly mean that the move with the highest "prior probability" will be selected; playing the highest-prior move directly is what is meant in the paper when they say that they included the network alone (without MCTS) in the final tournament, and that it performed similarly to AlphaGo Fan. Rather, these "prior probabilities" are used in the MCTS stage to inform the MCTS about likely moves to be played, and therefore to narrow its search to those moves. As implemented by Master/Zero, the MCTS algorithm, from a given board position, selects a move to explore based on both a "U-value" (which grows with the move's prior probability and shrinks as the number of times the MCTS has "visited" that move increases) and a "Q-value" (which is determined by the value output of the neural network and the number of times the MCTS has "visited" that move).
The upshot of this is that in general, Master/Zero is NOT basing its move decisions primarily on the output of the policy network, but rather on the number of times the MCTS has investigated the lines that spawn from a given move. In human terms, this would mean that Master/Zero selects its moves based on the frequency that it considers a move and reads the consequences of playing that move. Importantly, Master/Zero (and I believe Fan/Lee as well) do not simply choose the move with the highest U-value plus Q-value; instead, all (legal) moves are given some nonzero probability, and these probabilities (adding up to 100%) represent the "move probability distribution" of a board position. This is to say, if AlphaGo determined its moves based solely on the move with the greatest probability in this distribution, it would ALWAYS choose the same move given an identical board position. As Michael notes in this video (at least with Master), AlphaGo does not always choose the same move in an identical position. This is because AlphaGo's moves are drawn from the move probability distribution. If, for example, one move has a 65% chance of being selected, over many games it will appear approximately 65% of the time, but other moves, such as one that has a 25% chance of being selected, may appear instead. This explains the behavior that Michael noted in Master; the first move being a 4-4 point likely has a probability (in the MCTS move probability distribution) of approximately 75%, while the 3-4 point likely has a probability (again in the MCTS move probability distribution) of approximately 25%. For Zero, the MCTS probability of the 4-4 point is likely very high, around 98%, and so other moves such as the 3-4 point do not appear in this (relatively) small sample of games. If we were to examine the entire body of Zero's games, we would almost surely see the 3-4 point appear in the opening, just with a very low frequency. 
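To make the sampling argument concrete, here is a minimal Python sketch. The function names are made up for illustration, the 75%/25% and 98%/2% figures are just the rough guesses from the comment above (not numbers published by DeepMind), and the 20-game sample size is an assumption.

```python
import random
from collections import Counter

def sample_opening_moves(move_probs, n_games, seed=0):
    """Draw one opening move per game from a fixed move probability distribution."""
    rng = random.Random(seed)
    moves = list(move_probs)
    weights = [move_probs[m] for m in moves]
    return Counter(rng.choices(moves, weights=weights, k=n_games))

def prob_move_never_seen(p_move, n_games):
    """Chance that a move with per-game probability p_move never shows up in n_games."""
    return (1.0 - p_move) ** n_games

# Guessed, illustrative distribution for Master's first move.
master_like = {"4-4": 0.75, "3-4": 0.25}
n_games = 10_000
counts = sample_opening_moves(master_like, n_games)
for move, count in counts.most_common():
    print(f"{move}: {count / n_games:.1%}")

# Guessed 2% chance of the 3-4 point for Zero, over an assumed sample of 20 games.
print(f"3-4 never appears in 20 games: {prob_move_never_seen(0.02, 20):.0%}")
```

With these guessed numbers, a rare move can easily be missing from a small sample even though it would turn up eventually in a large one.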
Another interesting thing to note, specifically about the opening, is that Master/Zero use a different temperature parameter (defined in the paper) for the first 30 moves of the game than they do for the rest of the game. The details of this are technical, but essentially if the temperature parameter is ~1, then the MCTS move probability distribution is "smoother", and more moves are given greater probabilities, and if the temperature parameter is ~0, then the MCTS move probability distribution is "sharper", and single moves stand out and are assigned very high probabilities. For the first 30 moves, the temperature parameter is 1 (so the move probability distribution is more evenly distributed), and for the rest of the game the temperature parameter is ~0 (so the move probability distribution heavily favors moves with high "visit counts"). To sum up: It is not strictly accurate to say that AlphaGo is favoring moves with "high win percentage" (i.e. value output); the most relevant statistic is actually the number of times the MCTS "reads" a certain move's continuation. Now, how the MCTS algorithm determines which moves to evaluate is an interesting (and technical) discussion, and heavily involves the aforementioned value output (i.e. "win percentage"), but it is important to note that when Chris and Michael say that "several moves have similar win percentages, so AlphaGo could have chosen any of them" this is not strictly accurate. This also helps to explain some of the "strange" moves that AlphaGo plays; the random element of the move selection (it is randomly chosen from the move probability distribution) may in fact select a move that does not have the highest probability; such a move would then appear to be "bad", but cannot strictly be attributed to AlphaGo deciding that that move is the "best".
EDIT: For the "evaluation" games, used to determine the strongest version of AlphaGo during training (to generate more self-play data), the temperature parameter was set to ~0 for all moves, not just those after move 30. DeepMind has not specified whether this was the case for the Master/Zero match (or any of the other matches that were played), but if it were the case then it would make more sense for Zero to be always playing the 4-4 point, as that move would have a high visit count (based on its training) and therefore all other moves would have a prohibitively low probability of being selected.
Thank you for a detailed translation of some aspects of the AlphaGo Zero paper into daily language. I am puzzled by your summary paragraph, however. When you say Michael and Chris are not STRICTLY accurate in saying AlphaGo is favoring moves with higher winning percentage, you seem to imply that the daily language approximation misses some ESSENTIAL aspect. What exactly is that? In the Lee version or before, your distinction may be quite essential for AI novices, since the policy network was trained separately using high-level human play. There were elements of "mimicking" human play data, which implies the potential loss of gem-like moves excluded from the tree search. But in the Zero/Master version, PN and VN are trained simultaneously. Except for the cutoff criteria, which favor moves with higher policy values even more, it does look QUITE accurate to say that Zero favors higher winning percentage moves, considering the nature of MCTS. I may make myself clearer by putting the perspective the other way around. The most important reason why you WANT TO train the value network as you explained is the BELIEF that it is quite highly correlated with the winning percentage. The legitimacy of reading certain moves more deeply than others lies only in the belief that it is more probable that the best move is within this reading domain. This belief is empirically tested. Only through the massive computing power of GPUs and TPUs, which probably only Google and Tencent have in the world, could Zero demonstrate the effectiveness of this approach in such a short time.
So, the MCTS phase selects moves based on the number of times a given move's branch is traversed during the Monte Carlo simulation. On the surface, it seems that this has nothing to do with the value produced by the neural network. Let's unpack that. Firstly, it must be clarified that the policy output and the value output of the network do not strictly come from the same output. They share the same initial stages of the network evaluation (convolution layers and residual blocks), which the network uses to recognize "features" of the board. After the residual blocks, the network splits, like a two-headed hydra, and each "head" performs its own analysis of the features that the "body" of the network produces. One "head" performs a policy analysis, generating a vector of prior probabilities that the MCTS phase uses to weight its initial selection (more on this later). The other "head" performs the value analysis, which is used to attribute a win probability to a particular board position (more on this later too). On to the MCTS phase (I'm going to get a little bit technical here for the sake of accuracy). When deciding which moves to "read" (i.e., which branches of the search tree to traverse), the MCTS algorithm uses the equation a = argmax(Q + U), where a is the action (move), argmax is a function that selects the option having the greatest value, Q is the mean action value (more on this later), and U is the exploration term of the Polynomial Upper Confidence Tree (PUCT) algorithm: U = c * P * sqrt(sum(N(b))) / (1 + N(a)), where c is a constant that sets the level of exploration, P is the prior probability of playing that move (obtained from the neural network as above), sum(N(b)) is the sum of all of the times that each possible move has been traversed (i.e. the total number of searches), and N(a) is the number of times that the particular move being considered has been traversed.
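The selection rule above can be sketched in a few lines of Python. This is only an illustration of a = argmax(Q + U) as written in the comment: the dictionary layout, the function name `puct_select`, the value of `c_puct`, and all of the numbers are made up for the example and are not taken from DeepMind's code.

```python
import math

def puct_select(children, c_puct=1.0):
    """Return the move that maximizes Q + U.

    children maps each move to a dict with keys
    "P" (prior probability), "N" (visit count) and "Q" (mean action value).
    """
    total_visits = sum(child["N"] for child in children.values())

    def score(move):
        child = children[move]
        # U = c * P * sqrt(sum(N(b))) / (1 + N(a))
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1 + child["N"])
        return child["Q"] + u

    return max(children, key=score)

# Early in the search: almost no visits, so the prior dominates.
early = {
    "A": {"P": 0.6, "N": 0, "Q": 0.0},
    "B": {"P": 0.3, "N": 0, "Q": 0.0},
    "C": {"P": 0.1, "N": 1, "Q": 0.0},
}
# Late in the search: many visits shrink U, so the mean action value dominates.
late = {
    "A": {"P": 0.6, "N": 500, "Q": 0.40},
    "B": {"P": 0.3, "N": 480, "Q": 0.55},
    "C": {"P": 0.1, "N": 20, "Q": 0.30},
}
print(puct_select(early))  # A: highest prior
print(puct_select(late))   # B: highest Q despite the lower prior
```

The two toy positions show the shift described in the paper: the same rule follows the prior when visit counts are small and follows Q once the counts are large.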
Note that this algorithm includes the prior probability produced by the neural network, but is not solely dependent on it; as stated in the paper, this algorithm initially (at N(b) ~ 0) prefers moves with high prior probability and low visit count (i.e., more weight for the prior probability and a wider search range), but later (at N(b) >> 0) prefers moves with high action value (Q). We have seen how the prior probabilities (policy output) of the neural network are used in the MCTS phase. How is the value output used? This is where the Q value comes in. When the MCTS search reaches a "leaf node" (ill-defined in the paper, it could mean either the end of the game simulation (e.g. "reading to the end of the game") or simply stopping at some fixed depth (e.g. "reading 50 moves ahead")), the board position is fed into the neural network, returning a prior probability vector and a win probability value. The prior probability vector is then used to "seed" a new layer of leaf nodes branching from the currently-being-evaluated node, and the value output is backed up through the MCTS search tree: it is added to the total action value (W) of each move leading up to that "leaf" move, and the N value (number of visits, see above) of each of those moves is incremented by one. For each move that is updated in this way, the Q value of that move is then updated to be equal to the W value for that move divided by the N value for that move; this is the mean action value of the move, because it averages the total action value over all visits to that move. Recall from above that the move selection algorithm initially prefers moves with high prior probability (policy) and low visit counts, but gradually shifts toward preferring moves with high action values; this is the action value that gets maximized. Now, what is important to realize is that everything above refers only to how moves are chosen for "reading", or simulation evaluation. This is NOT how the move that is actually played is chosen.
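The backup step can be sketched as follows. The class and function names are hypothetical, and the sketch is deliberately simplified: it adds the same value to every move on the path, as in the description above, and ignores the bookkeeping for which player's point of view the value is measured from.

```python
from dataclasses import dataclass

@dataclass
class EdgeStats:
    """Statistics kept for one move (edge) in the search tree."""
    N: int = 0      # number of visits
    W: float = 0.0  # total action value
    Q: float = 0.0  # mean action value, W / N

def backup(path, leaf_value):
    """Add a leaf evaluation to every move on the path from root to leaf."""
    for edge in path:
        edge.N += 1
        edge.W += leaf_value
        edge.Q = edge.W / edge.N

# Two simulations pass through root_move; only the first continues to reply.
root_move, reply = EdgeStats(), EdgeStats()
backup([root_move, reply], 0.8)  # made-up value output of the network at the leaf
backup([root_move], 0.2)
print(root_move)  # N=2, W=1.0, Q=0.5
print(reply)      # N=1, W=0.8, Q=0.8
```

Q is just a running average, so a move that keeps leading to well-evaluated leaves keeps a high Q and keeps getting selected, which is how the value output ends up driving the visit counts.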
The actual move is selected randomly from a probability distribution represented by Pi in the paper (lowercase in the paper, uppercase here to differentiate from the circle constant). The probability of each possible move in a given board position is given by N(a)^(1/T) / sum(N(b)^(1/T)), where N(a) is the number of times that a given move was visited by the simulation, the sum runs over all possible moves b from that position, and T is a temperature parameter that controls the "smoothness" of the probability distribution. If T ~ 1, then Pi is smoother, and moves with lower visit counts have a non-negligible probability of being played. However, if T ~ 0, then Pi is very sharp, and the move with the highest visit count is almost guaranteed to be played. For these demo games, I would wager that T was set to be ~ 0 for every move; during training, T would be set to 1 for the first 30 moves to encourage a wider range of moves to train from. All of this is to demonstrate why it is not *strictly* accurate to say that AlphaGo seeks to maximize win percentage. Rather, the move with the highest mean action value, and therefore the most visits, is the most likely to be chosen for a particular game move; otherwise, we could simply choose the move from a given position that has the highest value output from the neural network. This may seem to be saying the same thing, but there are subtle differences, primarily lying in the ways that DeepMind were able to interface the neural network outputs to the MCTS algorithm to create a more robust system of move selection, based only *indirectly* on maximizing value.
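Here is a small Python sketch of the Pi formula, to show what the temperature does. The function name and the visit counts are invented for the example, and T = 0.05 is an arbitrary stand-in for "T ~ 0" (the formula itself is undefined at exactly T = 0).

```python
def move_probabilities(visit_counts, temperature):
    """Pi(a) = N(a)^(1/T) / sum over b of N(b)^(1/T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use a small value for T ~ 0")
    # Dividing by the largest count first leaves Pi unchanged but keeps
    # the powers from overflowing when T is small. Assumes at least one visit.
    top = max(visit_counts.values())
    powered = {
        move: (n / top) ** (1.0 / temperature)
        for move, n in visit_counts.items()
    }
    total = sum(powered.values())
    return {move: p / total for move, p in powered.items()}

# Made-up visit counts for three candidate moves.
visits = {"4-4": 600, "3-4": 300, "3-3": 100}
smooth = move_probabilities(visits, 1.0)   # proportional to visit counts
sharp = move_probabilities(visits, 0.05)   # nearly all weight on the most-visited move
print(smooth)
print(sharp)
```

At T = 1 the probabilities are simply the visit shares (60%, 30%, 10% here); at T = 0.05 the most-visited move gets more than 99.9% and the others effectively vanish.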
I think we should clarify a number of things. 1. It is not Michael and Chris who originally used the expression "choose the move with the highest winning percentage". It is the DeepMind folks. Refer to the Sedol Match 3 commentary. 2. As far as my understanding goes, there is no official explanation of what Michael and Chris mean exactly by "winning percentage". My casual understanding is that it is much closer to the mean action value than to the value in the Value Network. (And BTW, my understanding is that "the most visits" is not a very good daily language translation of mean action value.) Even when my interpretation is adopted, due to the nature of MCTS, the daily vocabulary of "percentage" still applies. In contrast, humans tend to play a mixture of high winning percentage and high expected "score" moves. 3. Your explanation, trebledawson, seems algorithm oriented rather than output relevance oriented. My question to you was "so what?" (from a go player's/viewer's viewpoint). From an AI development point of view, the subtle difference between the value in the VN and the mean action value may be important, but from the viewers' point of view, the difference is meaningful only when you can qualitatively characterize the difference in play style in one sense or another. Every model is approximate, but you often adopt a SIMPLE one instead of a more accurate, complicated one for various reasons. Almost nobody complains about the ideal gas model, in favor of the van der Waals model, for instance, when it is used under an appropriate setting. Your extremely long explanation only seems to refer to the difference between the ideal gas model and the van der Waals model, whereas my question was: why is the difference important? Many thanks, however, for your explanation again.
I apologize: in clarification 3 of my last reply, I had not read your first comment carefully enough. "This is to say, if AlphaGo determined its moves based solely on the move with the greatest probability in this distribution, it would ALWAYS choose the same move given an identical board position." is a very important qualitative notion. I should add again, however, that mean action value should be interpreted as weighted reading rather than a simple visit count. That is why you expect pretty huge variation in the opening, whereas in the tense middlegame, you expect "the only best move" to be played by AlphaGo as well, despite the policy for the position still being non-unique. After reading deep enough in MCTS, AlphaGo can assign the indispensable move a probability much closer to 1 than the prior policy network probability distribution does, when it is necessary. Due to limited reading ability, and probably also due to the nature of go as a game, vastly many variations may have exactly the same mean action value (indifference among the highest value moves). For instance, there are very many ways a single group of stones can die or live in a real board position (in tsumego problems, the setting is artificially crafted in such a way that the best killing or living move tends to be unique). There are many different paths of endgame moves leading to the same score, and so on. The actual play variation in the opening may become large or small as the learning gets deeper.
Thank you for your explanation so far. Can you say what, intuitively, is gained by using U values in addition to Q values? Does it help ensure that a given move can lead to many profitable paths?
Go has been around for thousands of years, but I am pretty sure that in those thousands of years, the most exciting time to be a go player and a go fan is right now. The dramatic breakthroughs just keep coming one after the other, and the game is changing very quickly. Although it is very exciting, it is also very confusing, and it feels like everything we thought we knew about go is going out the window. Thank you very much to Chris and Michael for helping us try to find our footing in the midst of all the chaos.
I appreciate your enthusiasm, and agree with it, but it's not quite that *everything* is going out the window. The Zero games are not completely alien-looking, for example. It seems that humans did progress toward 'the truth' about Go in those thousands of years. It's just that computers, algorithms, and data structures have now progressed even closer to 'the truth' than human minds alone have.
The worst thing that can happen is humans trying to imitate AI in the way they play. That is going to be the most disgusting thing that would happen to go. No human can think like AI so it's pointless to study AI games.
Regarding the 4-4 and 3-3 points, it feels to me like the 4-4 point is a very flexible move. It can either build influence or work to help capture a corner, depending on the way the game goes. Maybe it sees these as miai: capturing a corner, but with fairly small territory and giving the opponent influence, vs. playing something else and letting the opponent have a fairly large corner. Playing the 3-3 invasion very early makes some sense if you consider that when the board is undeveloped, it's much less obvious in which direction the opponent should build their wall on the outside. At this point, you can play a move to direct the game in such a way as to minimize the effectiveness of that thickness, whereas later in the game, things are more set in stone and your opponent can potentially inflict some real damage with that choice.
Invading at 3-3 gives you a small territory in the corner (and part of an edge), but it gives the opponent the opportunity to have strong influence toward the center. I would prefer to approach the 4-4 from a distance first, and then invade at 3-3. Does that make sense?
Just to say that I joined usgo even though I live in Hong Kong. The effort is hard work and very much appreciated. A minor contribution from a go player of 40 years. I think professional go players sometimes forget that the support is not just for their matches but for go itself. AlphaGo is forcing the tide to turn back to the old days. Very old days. Yes, there is competition, and it is very fierce. But we like the match and the mind sport, not just the match.
I think the tension between the 4-4 point and the early 3-3 invasion can be resolved by thinking about sente. AlphaGo seems to value sente very highly, and the 3-3 point invasion ends in sente unless the opponent takes a local loss. Therefore, playing at the 3-3 point guarantees you get either sente or a better result than you normally would. Similarly, any approach to the 4-4 point will end in gote (including the 3-3 point, as long as you're willing to pay the price). So if you assume AlphaGo *really* wants sente above all else, both behaviors make sense. Those moves are the best way each side has to fight for sente.
Half resolved. If sente is so important, why would you play 44 in the first place, knowing through self-play experience that you and your opponents will take sente away by invading at 33 later? Obviously, they think that the initial 44 setup is worth gote despite the opponent's sente 33 invasion, and in fact this strong preference for thick gote with 44 over the sente 33 invasion was the human understanding for over a hundred years before AlphaGo Master (Ke Jie version). Overall, AlphaGo not only thinks the 33 invasion sente is huge, it also thinks the initial 44 setup leading to a gote sequence is balanced.
This is not correct. The 3-3 invasion is not actually sente - see my original comment. The 4-4 point is valuable because you can build on it and all approaches end in sente for you, and the 3-3 point is the best invasion for the opponent because they get the most when you take sente.
One thing to consider is that it is relatively easy to see the value, and make use of, the general strength of moves around the corners - If I had to choose one move to make without knowing the board position, I'd probably choose the 3-3 myself. I think it's quite natural, even starting from random play, to learn that playing moves in the corner is good, because it usually is. In my opinion, if central moves are good, they're probably only good with quite exact play. Not the sort of thing you're likely to stumble across from random play, or seriously consider once you have an established opening strategy.
The worry that corner play is just a "local maximum" is a concern I've seen other people familiar with AI raise about this result. AG zero and Master may both be completely correct that the 4-4 point is the best opening move, the 4-3/3-4 slightly less optimal, and that the value falls off dramatically from there (all the way down to "completely useless" when playing 1-1). I do think that Zero's process of starting from completely random moves was the best way to make sure that this "local maximum" issue was avoided, and that there wasn't some other equally strong (or stronger) location that was being missed. I can't say if they have completely ruled that out (and perhaps even Deepmind may not be certain enough to make that claim), but I do think that Zero's style of play in the opening has shown us that it's much less likely that anything other than corner moves will lead to winning positions.
I'm not referencing the tree search at all. Simply the value network and its evaluation of the "best" moves. Getting stuck in a "local maximum" is a very real concern when using reinforcement learning to try and judge what is better or worse. For a primer on what I'm talking about, the article at en.wikipedia.org/wiki/Reinforcement_learning has a decent introduction. As stated there: "Many policy search methods may get stuck in local optima".
What Michael Redmond needs, is access to a runtime version of AlphaGo Zero so that he can force any starting position and see AlphaGo's winning percentages for the best responses. Then he can debunk many human josekis and fusekis for us, showing with response sequences why they would be suboptimal.
Well, DeepMind is really closed with respect to the use of the program. Other programs like DeepZen are already used for this sort of purpose by Japanese pros. It is however not that easy to "debunk" human plays solely based on AI plays. Humans can learn from AI plays, but the learning needs to be adjusted to the human level, with much less memory and computing power.
I don't know what DeepMind's research priorities are, but I hope they do keep up their interest in go. There seems to be a lot to be gained if players (especially professional players) get to tweak the program to test out their own ideas. From an engineering perspective, I hope they rerun the AlphaGo self-learning process from scratch several times. I want to see whether different styles of play could emerge, and if the sibling programs are evenly matched.
Tweaking by humans is not a feasible idea when it comes to neural-net-based AI. It's a kind of black box which plays so well that if you mess with it, most likely its quality of play would suffer.
I don't think it matters if the quality of play is hampered, so long as the changes are not crippling. The point is to compare different strategies, not to try and improve the engine's performance. From what I understand, AlphaGo often gives the top two moves near-identical values, so (some) human interference is possible without significantly affecting performance. Also, you would be looking at relative performance. If imposing the Kobayashi opening hinders the game less than the Chinese opening, that would be a valuable result.
From what I understand, they're ultimately interested in General AI and Go to them is just sort of the stepping stone to prove that it's possible to achieve General AI. Now that they've been successful in that, they are probably moving on to something else. It could be something related to Go but it may also not be. Check this out if you haven't: ruclips.net/video/rbsqaJwpu6A/видео.html
It is worth remembering that there is a difference between how both versions of AlphaGo play when studying through self-play games and how they play under "tournament" conditions. As per the first Nature paper, when AlphaGo is engaged in self-play it probabilistically samples a move based on how many times that move was visited at the top of the MCTS, whereas in tournament play it selects the move with the highest number of visits. So if the top moves are visited 21%, 20%, 20%, 14%, 6%, ... of the time, in self-play mode there is a 21% chance of it picking the "absolute best move", and it is willing to experiment with other moves which have a chance of winning. My understanding from the second publication is that they played AG Zero vs AG Master against each other in tournament mode, where each system picks the "absolute" best branch. The reason they do not play the exact same games against each other is because of random effects which can cause different moves to be judged "the single best move".
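As a rough illustration of the two modes, using the visit shares from the example above: the move labels, the function names, and the "other" bucket that lumps together the remaining 19% of visits are all assumptions made for this sketch.

```python
import random

# Visit shares from the example; "other" stands in for all remaining moves.
visit_share = {"a": 0.21, "b": 0.20, "c": 0.20, "d": 0.14, "e": 0.06, "other": 0.19}

def pick_tournament(shares):
    """Tournament mode: always play the most-visited move."""
    return max(shares, key=shares.get)

def pick_self_play(shares, rng):
    """Self-play mode: sample a move in proportion to its share of visits."""
    moves = list(shares)
    return rng.choices(moves, weights=[shares[m] for m in moves], k=1)[0]

rng = random.Random(0)
n_games = 10_000
top_picks = sum(pick_self_play(visit_share, rng) == "a" for _ in range(n_games))
print(pick_tournament(visit_share))       # always "a"
print(f"{top_picks / n_games:.1%}")       # close to 21% in self-play mode
```

In this toy setup the tournament rule is fully deterministic for a fixed set of visit counts, so any variation between games has to come from the search producing different counts.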
It was widely believed (like Michael said) that they had adjusted some parameter so that even in tournament play it would be more likely to pick 2nd or 3rd best move as long as it is within some range. It could be that the "Ke Jie" version Michael speaks about is actually exactly the same program, but with different threshold. It would be pretty easy to test the impact of changing the threshold on strength to ensure that a threshold has limited impact on ELO (since it would likely be a little negative, at least based on self-play ELO). I agree with Michael that this would lead to more interesting openings in the series, but I'm not sure it is the right conclusion that it would be a better test of Alphago.
The Ke Jie version was more advanced because of the matches with Ke Jie, which were pushed to the top of branch searches. Thus Ke Jie and pair go actually acted as a trainer for AlphaGo Master. I'm sure after the matches the version was archived, but not recognized from the basic engineering perspective as anything different from V0 playing Ke Jie.
I guess the Master version has a temperature parameter; as long as the temperature is raised, it will play more freely. Engineers use this feature to get more diverse games.
Perhaps they set the variation the same for both programs, giving each (for example) a 0.1% winning percentage range in which it could select moves, and for the Master version that meant 4-3/3-4 were possible, but for Zero they were not. It could just be that Zero finds 4-4 to be slightly more valuable compared to the other moves than Master does. Because there most certainly is a gradient for Zero as well, since it selects different moves later on in the game, just not in the opening like Master does for some reason (which I think likely comes down to a different value network and different winning percentage calculations).
Yes, it has a temperature. During training they set the temperature high for the first 30 moves so that it selects a good move semi-randomly, then for the rest of the moves they set the temperature low so it always takes the best move. (That's from the paper.) It sounds like for the sample games they set the temperature low for all moves. You'll still get some variation because the Monte Carlo tree search it uses is inherently random.
I wonder if alphago showed more variation in the master series because of the tight time control in those games. It may be that alphago gets more predictable when it has longer to calculate because the monte carlo search is more likely to converge.
A question for Michael: when he is describing a "contradiction" in Zero's opening, does he mean the fact that it overwhelmingly chooses the 4-4 point, and then also just as overwhelmingly plays at the 3-3 point against that initial opening? This wasn't really explained clearly in the video, but the "contradiction" term seemed to be describing the 4-4/3-3 choices. Is it because attaching value to one of those two moves would appear to negate some of the value in the other? That's the thought that jumped into my mind as well.
I think he means that some of the inherent value is that the 4-4 point is considered a good balance of gaining influence while starting to enclose some territory. But, if the best response to this move is for the opponent to invade the 3-3 point a little later in the game, then part of the territory aspect of the 4-4 point is then negated. So, what's the true value of playing the 4-4 point over something like the 3-4 point, in this case? That's the part that I think puzzles Michael. Perhaps the answer is that the 4-4 point still has a good amount of value in its influence towards the center and the sides, that it's still strong even after the opponent invades. Indeed, maybe Zero judges it so strong that it almost requires the opponent to invade at 3-3 quickly, just to diminish some of its strength towards the corner and the sides.
That's pretty much what I was thinking, but you explained the details much more clearly than I could. I'd be interested to hear if that's what Michael concluded as well.
I don't see why people are thinking it's kind of contradictory. For example in Tic-Tac-Toe the best initial move is in one of the 4 corners. The best reply to that is in the center. So people might say, "Hey that's contradictory why didn't the player with the first move play in the center?" However it isn't contradictory it just makes sense. You will see what I mean if you analyse Tic-Tac-Toe. Almost everyone thinks the best first move in T-T-T is in the center but not so.
I don't see it as contradictory either, but I also don't play pro-level Go. I think it has more to do with Michael's feel for the game, rather than actually being a contradiction. From watching these videos, it seems that Michael feels the 3-3 invasion as played by AG is "too early" compared to how humans play. But, assuming that playing the invasion that early is actually correct, the next feeling that Michael has is that the 3-3 invasion has made the 4-4 point a lot less relevant of a move. So, to him, if playing the 3-3 invasion that early is correct, the 4-4 point move now "feels" wrong. And that's probably what's troubling him about it.
Talking about "best move" in ttt is not such a useful comparison since the full game tree is known and best result is a draw. I assume you mean something like "more chances for 2nd player to screw up"
I've been working a lot on a re-implementation of AlphaGo Zero for a different game than Go but it should be easy to adapt. Does anyone reading the comments know about a Go programming community that would appreciate an AGZ implementation with most of the legwork done?
dtracers, programs are fine but I'm asking for a community. Do you know of a Go programming community (like a messageboard or something similar?). It's difficult to find because Google had to name their programming language Go as well.
Imagine AlphaGo achieved consciousness, desperately wants to communicate the meaning of life to us monkeys through Go opening variants, alas - we can’t understand its code. Seriously, thanks to AGA and all involved in making these study videos accessible. Very intriguing! Cheers!
One starts to wonder what AlphaGo Zero sees in 4-4 moves, that it feels the need to invade them via a 3-3 so often. If it's that regular, it would almost seem like a forcing move... Like something about it has to be refuted and can only be done so well like that... Sadly that's just speculation from me though. I have next to no chance figuring out what it is about if the experts haven't so far.
I believe DeepMind stated that the Master version used only 4 TPUs. DeepMind doesn't seem to be distinguishing the Master version and the Ke Jie version, so my guess is that their architecture and code is pretty much the same, but the extra few months of training yields different behavior. It seems that was what Michael was getting at at 23:53.
Conscious, that makes sense. But wouldn't thinking time also have an effect on how much processing power gets used per move? If I used the same hardware using only about 8 seconds per move (as master did in the 60 games) isn't that very different than spending 80 or 90 seconds per move like master did against Ke Jie?
Similarly, we don't know how long master had to think in the published self-play games, but I got the impression that there was a lot. More thinking time could reduce the seen randomness in move selection too.
I believe it was stated that the self-play games were all at full length time controls. What you're saying is *possible*, however I'm somewhat doubtful that the shorter time controls would be the deciding factor.
Looks more due to the further network learning time. For example, the early 3-3 invasion seems to appear in the Ke Jie version regardless of time settings. Ke Jie, before the three match event, seems to have known this play feature (DeepMind probably lent the Ke Jie version to Ke Jie and other Chinese pros for training).
Alphago did play the early 3-3 invasion in the 60 game series. I would say it's *extremely* unlikely that DeepMind *lent* the system in any way to Ke Jie or other pros.
Zero is not ignorant of human joseki; see ruclips.net/video/m13QHNMHAa4/видео.html. At 40 hours it had discovered most human joseki; however, at 70 hours it began to abandon them. So really it's just that we got the 4-4 point and approach right but not the follow-ups.
Your choice of the term "ignorant" seems to be off the daily language usage. Zero "learnt" to play like humans do. What Michael was trying to say was that he was hoping for a completely different learning path.
Ultimately, all Alpha (and Zero) AI are human trained. Remember, this is the cutting edge of AI (RL) research. The sanity checks and improvements were always based on experience from human play. More specifically, the design of Zero, including Monte-Carlo, neural networks, backpropagation, etc. were decided upon and optimized based on AlphaGo's success playing humans. Only a year before AlphaGo defeating Fan and Lee, DeepMind considered TD (temporal difference, I think) learning to be superior or more generally applicable than Monte-Carlo. However, because it's nearly impossible to evaluate a board even after 20 moves, Monte-Carlo (which plays random games to completion) was still necessary. DeepMind only knows that (and much more) because of human play. Another example, AlphaGo had specific rules for "inside moves" (creating or destroying two eye potential), so even if Zero did not have initial rules or weights, its learning algorithm was chosen and optimized until it did "understand" such human concepts. Until AI can explain WHY a particular move is superior, I don't think we'll know whether it's learned from humans (better than humans can) or really taught itself (and us). Third attempt: humans have very fast and deep internal network but very slow external communication and quickly tire. Computers have moderately fast internal networks, extremely fast inter-computer communication and never tire. If humans could learn from trillions of games, humans would be much better; if computers had our neural network, computers would be much better. Both are limited by capacity, not algorithms.
Maybe Master does the same variations in all the published games because it lost the other games badly, so they weren't worth publishing. That would imply that the published move is better (against Zero anyway) EDIT: Come to think of it, we've all been amazed at how close all these Alphago-Alphago games have been. But on the other hand, as soon as the games aren't close, Alphago starts playing strangely. The winner starts giving away points, and the loser starts making random nonsense moves. But the games Deepmind have published are curated -- they were selected, by Fan Hui I guess, to showcase Alphago's best play. That best play is only on display in close games, so it's no surprise that the published games are close. It could well be that 99% of the games are won and lost at a bigger margin, but the program plays too erratically, so they don't see the light of day.
Another possibility is that Master and the Ke Jie version were actually very similar, except that for the Ke Jie match, they gave more computing resources and a longer time limit for it to think. Since the moves look different, it seems like it's a very different AI base than Master, but in reality, it was a very similar AI base. For these practice games, they couldn't or didn't want to give the same level of computing resources and time limit, so having the "Ke Jie" version of AG wasn't applicable for these sets.
I highly doubt that DeepMind wouldn't publish them because of it being lopsided. You are assuming that the published games are better, where there is no evidence for it. I would also assume Fan Hui would have a reaction similar to Michael's at 19:13. So I don't find it convincing that Fan Hui would select games that have the same first 20-30 moves. Also, DeepMind has stated that Master used 4 TPUs. Since they are saying all the versions (60 game series, Ke Jie, and self play series) are the same, it is safe to assume that they all used the same hardware.
It’s been 7 years, but in regard to the comment above the openings being similar, I had the opposite reaction. If you compare the impact of computers on chess vs go, it’s completely different. For the vast majority of games, >95%, you can tell if the game has been played since 2016 easily, whereas in chess, despite computers being better than humans for more decades, you can’t. Almost all the josekis we learned aren’t played any more.
Great summation of the games. I think the strength of AG in the opening is overrated. The paper has a graph where they say that AG without MCTS is slightly weaker than Crazy Stone, or a top pro. As MCTS is known to be very weak in the opening, we can't really say that AG's opening is really better than that of a solid pro, as it might drop to that strength until the trees become concentrated.
I hope we will get to see professional players using AlphaGo as a tool to test out different opening strategies. If you are right then they will find fuseki that are better than AG could come up with on its own.
Great stuff, thank you so much! Your problem is that these AIs can produce a million masterpieces a day. We need to have some way of pruning them down to the most interesting ones.
If both played their own determination of optimal play, every game would be exactly the same. "Interesting" (if not purely subjective) may only be a function of how much variation we want (handicap, komi, random decision tolerance, etc). Take joseki: the win-rates of several alternate moves are often very similar (the "best" move might have a win-rate of say 47.1%, whereas several alternatives might be 47.0%) and the win-rate is not perfect or provable, it's just what Alpha came up with after trillions of random games. We either understand the reasoning, or we don't. That's either interesting or not. Hard to say which is which. Most interesting (to me) would be "humans often play this, but Alpha thinks that's stupid, because she'll just play this". I'm not sure we've come across that. It's usually just "Alpha gives the typical human move a win-rate of 47% whereas another known move has 48%". The only games we can understand are those that humans played (either before or after Alpha). Personally, I think the best we'll get in terms of "interesting", is for a professional like Michael to play, and tell us his analysis. I think his approach of "this is what a human would likely play" but "this is what Alpha plays" and "maybe this is why" is probably the best that I (most of us) could hope for... at least until a computer can explain it to us... Which a neural network simply cannot.
In this video, Zero prefers the 4,4 start point openings, which is surprising. It still takes a human, like Michael, to tell us why that's a superior move... or, as it seems, he's not sure why or even that it is indeed superior. Same with the 3,3 attack. I believe we're at a point of understanding only that "Alpha Go seems to think it's an even trade off". Zero is optimized to choose conservative moves most likely to result in a small win. Whereas humans tend to make aggressive moves likely to win big in the short term (presumably because we can't predict the long game or assume we'll make a mistake somewhere else). That alone is interesting (to me).
@vegahimsa3057 We know AlphaGo and the other AIs are not perfect, just a lot better than any human. I suspect that at current komi, perfect play for White is mirror go, and it's just a matter of when and why Black plays tengen to force White out of mirror go. Tengen is clearly not the best move early in the game, but it's better than continuing to let White play mirror go all the way to the endgame. The other option is to set up a mirror go trap, but that is also (very likely) going to involve sub-optimal moves.
Without komi, I believe we can safely assume the best white can hope for is a tie, in perfectly optimal play by both players. Any komi is all the asymmetry needed to ensure that mirror go is not optimal for both players. For some komi, mirroring might be the most optimal outcome for white, but if that's provable, then we simply adjust komi until mirror is winnable for neither. * Previously "optimal" referred to "what the current AI believe is best play for both players, to the best of its current ability". However "perfectly provable optimal" is some hypothetical future AI with truly god-like abilities -- either harnessing all energy in the universe, or the discovery of a simplified, shallow, and perfect evaluation function. We'll get better AI, for sure, discover a lot, but I don't believe we will discover a radically simplified, provably perfect, and shallow evaluation function. However, if we do, perhaps it would no longer be fun to play go. I think we'll soon discover if AI performance plateaus or whether play continues to improve with more computation and better learning algorithms. And would that tell us more about AI or the size of go universe?
Perhaps the Ke Jie version is equally as strong as Zero and that is why they didn't reveal it. Another option is that they intend to write more papers about it before publishing these games. Sometimes I just think DeepMind wants to hold back information for some time to see how humans evaluate AlphaGo's moves and then afterwards confirm or refute that with the complete neural network data. Kinda like how a teacher would want to see how a student comes to an answer before showing how it should be done.
My opinion from what the engineer said is that the latest AGM before AGZ was the same system that played Ke Jie, the difference being it had played more self play games after its match with Ke Jie.
Seems the most likely answer, but then it should have trained for months to go from 1-4 variations to just 1 option. Would be interesting to see how much training time this version of master has.
I find it interesting that when Master has black, it continues to play the same sequences when obviously it has lost to Zero "every time". Shouldn't the losing record affect the winning % of the move and eventually make it change to playing the alternate moves? I also feel very disappointed that Master didn't play alternative moves; there would have been so much more variation in the games for us to watch. Hopefully, this is just because of what the DeepMind team chose to release to the public, and if they hear these "complaints", they will release some games with those alternatives. It feels like Zero AI "learnt" and "upgraded" after each game while the Master AI didn't.
Well, the definition of "Zero" requires that it is trained only through self plays. If it is trained against Master, then it is no longer zero. So Zero vs. Master games are the experiment games for testing winning percentage. Both Zero and Master should be frozen during the plays. The frozen version of AlphaGo does not have the learning ability. It only chooses the best moves within its ability whoever the opponent is; that is, AlphaGo currently does not take advantage of weaker players as humans do, for example. The variation that appeared in the 50 self play series that Michael and Chris are working on right now is not due to adaptation to the opponent's play, but rather due to the randomness inherent in Monte Carlo Tree Search.
Yes, it makes no sense to not freeze the programs if you want to evaluate them. In addition, 10 games as black isn't nearly enough to change Master's behavior.
So what does Deep Mind say about these discrepancies? Is this the same Master version or not? And was zero contaminated with human knowledge after all?
Maybe the first stones Master played on B during the 60 online games were misclicks or experimentation by the operator actually playing the moves, and these were fully automatic so B never happened? That would explain the weird 3/4 and 10/10 split between these first stones.
Why didn't Google release games of Alpha Ke Jie vs Zero? Well, there's only two major factors: the algorithm and the compute. Zero is a superior generalized algorithm and likely *could become* a superior go player. AlphaGo was optimized to play go and Google threw a huge amount of computational power at it (advanced TPUs, weeks of time, and huge electricity bills). It's very likely that Zero would be much much stronger if a similar amount of computational power were thrown at both Alpha Ke Jie and Zero. But that's not Google's intention. Google is not in the business of Go play. Their algorithm was trained and tested on humans and now humans are no longer necessary. Why didn't they release existing games? Maybe they present unflattering weaknesses in game play. Maybe there is no computed version of Zero superior to the TPUs actually used to defeat Ke Jie?
Can't believe the amount of work Michael (especially) and Chris put in these videos. Thanks and keep up the good work!
My AGA membership was about to expire next month, and I wasn't sure if I wanted to renew since I haven't been taking advantage of it much since I can't really get to any tournaments or anything. But I really enjoy these videos so I just renewed for 2 more years. =)
As both a Go enthusiast and a student researching machine learning and game AI, I feel that I can contribute a bit to some of the AI side of things.
Something that Michael has not touched upon is the way that both Master and Zero actually select their moves. Firstly, an important distinction must be made that the implementation of the Monte Carlo tree search and its interface with the neural network is different between Master/Zero and AlphaGo Fan/Lee; while AlphaGo Fan/Lee did in fact use separate neural networks to evaluate "policy" and "value", Master/Zero (both use the same architecture) have combined these networks into a single network with dual outputs. There is little difference from a strictly Go perspective, but from the AI side it means that the policy and value networks are both "looking at" the same board representation, while previously the policy and value networks would derive different "important features" of the board.
The policy output is actually a list of 362 numbers that represent the "prior probability" that a given move may be played by the MCTS (including pass). This is important because it does not strictly mean that the move with the highest "prior probability" will be selected; this is what is meant in the paper when they say that they included the network alone (without MCTS) in the final tournament, and that it performed similarly to AlphaGo Fan. Rather, these "prior probabilities" are used in the MCTS stage to inform the MCTS about likely moves to be played, and therefore to narrow its search to those moves.
The way that the MCTS algorithm, as implemented by Master/Zero, works is that, from a given board position, it selects a move based on both a "U-value" (which is proportional to the move's prior probability and shrinks as the number of times the MCTS has "visited" that move grows) and a "Q-value" (which is determined by the value output of the neural network and the number of times the MCTS has "visited" that move).
The upshot of this is that in general, Master/Zero is NOT basing its move decisions primarily on the output of the policy network, but rather on the number of times the MCTS has investigated the lines that spawn from a given move. In human terms, this would mean that Master/Zero selects its moves based on the frequency that it considers a move and reads the consequences of playing that move.
Importantly, Master/Zero (and I believe Fan/Lee as well) do not simply choose the move with the highest U-value plus Q-value; instead, all (legal) moves are given some nonzero probability, and these probabilities (adding up to 100%) represent the "move probability distribution" of a board position. This is to say, if AlphaGo determined its moves based solely on the move with the greatest probability in this distribution, it would ALWAYS choose the same move given an identical board position. As Michael notes in this video (at least with Master), AlphaGo does not always choose the same move in an identical position. This is because AlphaGo's moves are drawn from the move probability distribution. If, for example, one move has a 65% chance of being selected, over many games it will appear approximately 65% of the time, but other moves, such as one that has a 25% chance of being selected, may appear instead. This explains the behavior that Michael noted in Master; the first move being a 4-4 point likely has a probability (in the MCTS move probability distribution) of approximately 75%, while the 3-4 point likely has a probability (again in the MCTS move probability distribution) of approximately 25%. For Zero, the MCTS probability of the 4-4 point is likely very high, around 98%, and so other moves such as the 3-4 point do not appear in this (relatively) small sample of games. If we were to examine the entire body of Zero's games, we would almost surely see the 3-4 point appear in the opening, just with a very low frequency.
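To make the "drawn from a distribution" point concrete, here is a small sketch that samples a first move over a batch of games. The 75%/25% and 98%/2% figures are only the guesses from the paragraph above, not numbers published by DeepMind, and the function name is made up for illustration.

```python
import random
from collections import Counter


def first_move_counts(distribution, games, seed=0):
    """Sample one first move per game and count how often each appears.

    `distribution` maps a move label to its assumed selection probability.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    moves = list(distribution)
    weights = [distribution[m] for m in moves]
    return Counter(rng.choices(moves, weights=weights, k=games))


if __name__ == "__main__":
    # Guessed distributions, purely illustrative.
    master_guess = {"4-4": 0.75, "3-4": 0.25}
    zero_guess = {"4-4": 0.98, "3-4": 0.02}
    print("Master, 20 games:", dict(first_move_counts(master_guess, 20)))
    print("Zero, 20 games:", dict(first_move_counts(zero_guess, 20)))
```

With only 20 games, a 2% move will often not show up at all, which would be consistent with never seeing a 3-4 opening from Zero in a small published sample.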
Another interesting thing to note, specifically about the opening, is that Master/Zero use a different temperature parameter (defined in the paper) for the first 30 moves of the game than they do for the rest of the game. The details of this are technical, but essentially if the temperature parameter is ~1, then MCTS move probability distribution is "smoother", and more moves are given greater probabilities, and if the temperature parameter is ~0, then the MCTS move probability distribution is "sharper", and single moves stand out and are assigned very high probabilities. For the first 30 moves, the temperature parameter is 1 (so the move probability distribution is more evenly distributed), and for the rest of the game the temperature parameter is ~0 (so the move probability distribution heavily favors moves with high "visit counts").
To sum up: It is not strictly accurate to say that AlphaGo is favoring moves with "high win percentage" (i.e. value output); the most relevant statistic is actually the number of times the MCTS "reads" a certain move's continuation. Now, how the MCTS algorithm determines which moves to evaluate is an interesting (and technical) discussion, and heavily involves the aforementioned value output (i.e. "win percentage"), but it is important to note that when Chris and Michael say that "several moves have similar win percentages, so AlphaGo could have chosen any of them" this is not strictly accurate. This also helps to explain some of the "strange" moves that AlphaGo plays; the random element of the move selection (it is randomly chosen from the move probability distribution) may in fact select a move that does not have the highest probability; such a move would then appear to be "bad", but cannot strictly be attributed to AlphaGo deciding that that move is the "best".
EDIT: For the "evaluation" games, used to determine the strongest version of AlphaGo during training (to generate more self-play data), the temperature parameter was set to ~0 for all moves, not just those after move 30. DeepMind has not specified whether this was the case for the Master/Zero match (or any of the other matches that were played), but if it were the case then it would make more sense for Zero to be always playing the 4-4 point, as that move would have a high visit count (based on its training) and therefore all other moves would have a prohibitively low probability of being selected.
Thank you for a detailed translation of some aspects of AlphaGo Zero paper into daily language.
I am puzzled by your summary paragraph however. When you say Michael and Chris are not STRICTLY accurate in saying AlphaGo is favoring moves with higher winning percentage, you seem to imply that the daily language approximation misses some ESSENTIAL aspect. What exactly is that?
In the Lee version or before, your distinction may be quite essential for the AI novices, since the policy network was trained separately using high level human plays. There were elements of "mimicking" human play data, which implies the potential loss of gem like moves excluded from tree search.
But in the Zero/Master version, PN and VN are trained simultaneously. Except for the cut off criteria, that favors even more for moves with higher values in policy, it does look QUITE accurate to say that Zero favors higher winning percentage moves, considering the nature of MCTS.
I may make myself clearer to put the perspective other way around.
The most important reason why you WANT TO train the value network as you explained is the BELIEF that it is quite highly correlated with the winning percentage. The legitimacy for reading certain moves more deeply than others only lies in the belief that it is more probable that the best move is within this reading domain. This belief is empirically tested. Only through the massive computing power of GPUs and TPUs, which probably only Google and Tencent have, could Zero show the effectiveness of this approach in such a short time.
So, the MCTS phase selects moves based on the number of times a given move's branch is traversed during the Monte Carlo simulation. On the surface, it seems that this has nothing to do with the value produced by the neural network. Let's unpack that.
Firstly, it must be clarified that the policy output and the value output of the network are not strictly from the same specific output. While they share the same initial stages of the network evaluation (convolution layers and residual blocks), these are used by the network to recognize "features" of the board. After the residual blocks, the network splits, like a two-headed hydra, and each "head" performs its own analysis of the feature analysis that the "body" of the network produces. One "head" performs a policy analysis, generating a vector of prior probabilities that the MCTS phase uses to weight its initial selection (more on this later). The other "head" performs the value analysis, which is used to attribute a win probability to a particular board position (more on this later too).
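A toy sketch of the "two-headed" shape, for anyone who finds code easier to follow. This is NOT the real architecture: the shared body here is a single small dense layer with random weights (an assumption standing in for the convolution layers and residual blocks), and all class and function names are invented. The only thing it is meant to show is one shared feature vector feeding a 362-way policy head and a single-number value head.

```python
import math
import random

BOARD_POINTS = 19 * 19          # 361 intersections
NUM_MOVES = BOARD_POINTS + 1    # 361 points plus pass = 362
HIDDEN = 8                      # toy feature size, chosen arbitrarily


def make_layer(rng, n_in, n_out):
    """Random weight matrix with n_out rows of n_in weights each."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TwoHeadedNet:
    """Shared body with a policy head and a value head (untrained toy)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.body = make_layer(rng, BOARD_POINTS, HIDDEN)
        self.policy_head = make_layer(rng, HIDDEN, NUM_MOVES)
        self.value_head = make_layer(rng, HIDDEN, 1)

    def evaluate(self, board):
        """Return (prior probabilities over 362 moves, value in [-1, 1])."""
        features = [math.tanh(h) for h in dense(self.body, board)]
        priors = softmax(dense(self.policy_head, features))
        value = math.tanh(dense(self.value_head, features)[0])
        return priors, value


if __name__ == "__main__":
    net = TwoHeadedNet()
    empty_board = [0.0] * BOARD_POINTS
    priors, value = net.evaluate(empty_board)
    print(len(priors), round(sum(priors), 6), value)
```

Both heads read the same `features` list, which is the point of the hydra picture: one board analysis, two different questions asked of it.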
On to the MCTS phase (I'm going to get a little bit technical here for the sake of accuracy). When deciding which moves to "read" (i.e., which branches of the search tree to traverse), the MCTS algorithm uses the equation a = argmax(Q + U), where a is the action (move), argmax is a function that selects the option having the greatest value, Q is the mean action value (more on this later), and U is the Polynomial Upper Confidence Tree algorithm (U = c*P*(sqrt(sum(N(b)))/(1+N(a))), where c is an arbitrary constant, P is the prior probability of playing that move (obtained from the neural network as above), sum(N(b)) is the sum of all of the times that each possible move has been traversed (i.e. the total number of searches), and N(a) is the number of times that the particular move being considered has been traversed). Note that this algorithm includes the prior probability produced by the neural network, but is not solely dependent on it; as stated in the paper, this algorithm initially (at N(b) ~ 0) prefers moves with high prior probability and low visit count (i.e., more weight for the prior probability and a wider search range), but later (at N(b) >> 0) prefers moves with high action value (Q).
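The selection rule is short enough to write out. This is a minimal sketch of a = argmax(Q + U) exactly as written above, with c = 1.0 picked arbitrarily and the function names my own; it is not DeepMind's code.

```python
import math


def puct_scores(priors, visits, q_values, c=1.0):
    """Q + U for every move, with U = c * P * sqrt(sum N(b)) / (1 + N(a))."""
    total_visits = sum(visits)
    return [q + c * p * math.sqrt(total_visits) / (1 + n)
            for p, n, q in zip(priors, visits, q_values)]


def select_move(priors, visits, q_values, c=1.0):
    """Index of the move to read next: the argmax of Q + U."""
    scores = puct_scores(priors, visits, q_values, c)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    priors = [0.6, 0.3, 0.1]

    # Early in the search: move 0 has been read 10 times, the others never.
    # The unvisited move with a decent prior gets a large U bonus.
    print(select_move(priors, visits=[10, 0, 0], q_values=[0.1, 0.0, 0.0]))

    # Late in the search: visits are even, so U is small and Q dominates.
    print(select_move(priors, visits=[100, 100, 100],
                      q_values=[0.2, 0.6, 0.1]))
```

Intuitively, U is an exploration bonus: it pushes the search to try moves the policy likes but that have not been read much yet, and it fades as a move accumulates visits, leaving Q to decide. One quirk of the formula as written is that with zero total visits every U is 0, so this sketch just falls back to the first move with the highest Q.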
We have seen how the prior probabilities (policy output) of the neural network are used within the MCTS phase. How is the value output used? This is where the Q value comes in. When the MCTS search reaches a "leaf node" (ill-defined in the paper, it could mean either the end of the game simulation (e.g. "reading to the end of the game") or simply stopping at some fixed depth (e.g. "reading 50 moves ahead")), the board position is fed into the neural network, returning a prior probability vector and a win probability value. The prior probability vector is then used to "seed" a new layer of leaf nodes branching from the currently-being-evaluated node, and the value output is backed up through the MCTS search tree, being added to the total action value (W) of each move leading up to that "leaf" move, and the N value (number of visits, see above) is incremented by one. For each move that is updated in this way, the Q value of that move is then updated to be equal to the W value for that move divided by the N value for that move; this is the mean action value of the move, because it is averaging the total action value over all visits to that move. Recall from above that the move selection algorithm initially prefers moves with high prior probability (policy) and low visit counts, but gradually shifts toward preferring moves with high action values; this is the action value that gets maximized.
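Here is a minimal sketch of that backup bookkeeping, following the description above: add the leaf's value to W and bump N on every move along the path, then Q = W / N. The class and function names are invented for illustration. As a simplification, the value is kept from a single point of view; a real two-player implementation would also have to track whose perspective the value is from, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Search statistics stored for one move in the tree."""
    n: int = 0        # N: number of times this move was traversed
    w: float = 0.0    # W: total action value accumulated over those visits

    @property
    def q(self):
        """Q: mean action value, W / N (0.0 before the first visit)."""
        return self.w / self.n if self.n else 0.0


def backup(path, leaf_value):
    """Propagate one leaf evaluation to every move on the path to it."""
    for edge in path:
        edge.n += 1
        edge.w += leaf_value


if __name__ == "__main__":
    root_move = EdgeStats()
    follow_up = EdgeStats()

    # First simulation reads two moves deep; the network rates the leaf 0.8.
    backup([root_move, follow_up], 0.8)
    # Second simulation goes through the same first move to another leaf.
    backup([root_move], 0.2)

    print(root_move.n, root_move.q)   # visited twice, mean of 0.8 and 0.2
    print(follow_up.n, follow_up.q)   # visited once
```

So a move's Q is not the network's opinion of that one position; it is the average of the network's opinions of everything the search has read beneath it.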
Now, what is important to realize is that everything above is only referring to how moves are chosen for "reading", or simulation evaluation. This is NOT how the move that is actually played is chosen. The actual move is selected randomly from a probability distribution represented by Pi in the paper (lowercase in the paper, uppercase here to differentiate from the circle constant). The probability of each possible move in a given board position is given by N(a)^(1/T) / sum(N(b)^(1/T)), where N(a) is the number of times that a given move was visited by the simulation, the sum in the denominator runs over all possible moves b from that position, and T is a temperature parameter that controls the "smoothness" of the probability distribution. If T ~ 1, then Pi is smoother, and moves with lower visit counts have a non-negligible probability of being played. However, if T ~ 0, then Pi is very sharp, and the move with the highest visit count is almost guaranteed to be played. For these demo games, I would wager that T was set to be ~ 0 for every move; during training, T would be set to 1 for the first 30 moves to encourage a wider range of moves to train from.
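A short sketch of that formula, Pi(a) = N(a)^(1/T) / sum(N(b)^(1/T)), to show what the temperature does. The visit counts are made up, and the function name is my own. Dividing every count by the largest one first does not change the result; it only keeps the powers from overflowing when T is tiny.

```python
def play_distribution(visits, temperature):
    """Probability of playing each move, from visit counts and temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use a small value for T ~ 0")
    top = max(visits)
    if top <= 0:
        raise ValueError("at least one move needs a positive visit count")
    # (N / max N)^(1/T) gives the same ratios as N^(1/T) without overflow.
    powered = [(n / top) ** (1.0 / temperature) for n in visits]
    total = sum(powered)
    return [p / total for p in powered]


if __name__ == "__main__":
    visits = [80, 15, 5]  # hypothetical visit counts for three candidate moves
    print("T = 1.0:", play_distribution(visits, 1.0))
    print("T = 0.1:", play_distribution(visits, 0.1))
```

At T = 1 the probabilities are just the visit shares, so the second and third moves still get played now and then. At T = 0.1 nearly all of the probability lands on the most-visited move.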
All of this is to demonstrate why it is not *strictly* accurate to say that AlphaGo seeks to maximize win percentage. Rather, the move with the highest mean action value, and therefore the most visits, is the most likely to be chosen for a particular game move; otherwise, we could simply choose the move from a given position that has the highest value output from the neural network. This may seem to be saying the same thing, but there are subtle differences, primarily lying in the ways that DeepMind were able to interface the neural network outputs to the MCTS algorithm to create a more robust system of move selection, based only *indirectly* on maximizing value.
I think we should clarify a number of things.
1. It is not Michael and Chris that originally used the expression "choose the move with the highest winning percentage". It is the DeepMind folks. Refer to Sedol Match 3 commentary.
2. As far as my understanding goes, there is no official explanation of what Michael and Chris mean exactly by "winning percentage". My casual understanding is that it is much closer to mean action value than the value in Value Network. (and BTW my understanding is that "the most visits" is not a very good daily language translation of mean action value)
Even when my interpretation is adopted, due to the nature of MCTS, the daily vocabulary of "percentage" still applies.
In contrast, humans tend to play the mixture of high winning percentage and high expected "score" moves.
3. Your explanation, trebledawson, seems algorithm oriented rather than output relevance oriented.
My question to you was "so what?" (from go players/viewers viewpoint). From AI development point of view, the subtle difference between the Value in VN and mean action value may be important, but from viewers' point of view, the difference is meaningful only when you can qualitatively characterize the play style difference in one sense or another.
Every model is approximate, but you often adopt a SIMPLE one instead of a more accurate, complicated one for various reasons. Almost nobody complains about the ideal gas model, in favor of the van der Waals model, for instance, when it is used in an appropriate setting. Your extremely long explanation only seems to describe the difference between the ideal gas model and the van der Waals model, whereas my question was why the difference is important.
Many thanks however for your explanation again.
I apologize: when I wrote clarification 3 in my last reply, I had not read your first comment carefully enough.
"This is to say, if AlphaGo determined its moves based solely on the move with the greatest probability in this distribution, it would ALWAYS choose the same move given an identical board position. "
is a very important qualitative notion.
I should add again, however, that mean action value should be interpreted as weighted reading rather than simple visit count.
That is why you expect pretty huge variation in the opening, whereas in the tense middlegame, you expect "the only best move" also played by AlphaGo, despite still non unique policy position. After reading deep enough in MCTS, AlphaGo can designate the indispensable move much closer to probability 1 than the prior policy network probability distribution, when it is necessary.
Due to the limited reading ability, and probably also due to the nature of go as a game, a great many variations may have exactly the same mean action value (indifference among the highest value moves). For instance, there are very many ways a single group of stones can die or live in a real board position (in tsumego problems, the setting is artificially crafted in such a way that the best killing or living move tends to be unique). There are also many different paths of endgame moves leading to the same score, and so on.
The actual play variation in the opening may become large or small as the learning gets deeper.
Thank you for your explanation so far. Can you say what, intuitively, is gained by using U values in addition to Q values? Does it help ensure that a given move can lead to many profitable paths?
Go has been around for thousands of years, but I am pretty sure that in those thousands of years, the most exciting time to be a go player and a go fan is right now. The dramatic breakthroughs just keep coming one after the other, and the game is changing very quickly. Although it is very exciting, it is also very confusing, and it feels like everything we thought we knew about go is going out the window. Thank you very much to Chris and Michael for helping us try to find our footing in the midst of all the chaos.
I appreciate your enthusiasm, and agree with it, but it's not quite that *everything* is going out the window. The Zero games are not completely alien-looking, for example. It seems that humans did progress toward 'the truth' about Go in those thousands of years. It's just that computers, algorithms, and data structures have now progressed even closer to 'the truth' than human minds alone have.
The worst thing that can happen is humans trying to imitate AI in the way they play. That is going to be the most disgusting thing that would happen to go.
No human can think like AI so it's pointless to study AI games.
@@psijicassassin7166 Except Sin Jinseo does pretty much exactly that, and he is far and away the strongest player in the world right now
Regarding the 4-4 and 3-3 points, it feels like to me that the 4-4 point is a very flexible move. It can either build influence or work to help capture a corner, depending on the way the game goes. Maybe it sees these as Miai, capturing a corner, but with fairly small territory and giving the opponent influence vs. playing something else and letting the opponent have a fairly large corner.
Playing the 3-3 invasion very early makes some sense if you consider that when the board is undeveloped, it's much less obvious in which direction the opponent should build their wall on the outside. At this point, you can play a move to direct the game in such a way as to minimize the effectiveness of that thickness, whereas later in the game, things are more set in stone and your opponent can potentially inflict some real damage with that choice.
Invading at 3-3 gives you a small territory in the corner (and part of an edge) but it gives the opponent the opportunity to have strong influence toward the center. I would prefer to approach 4-4 to a distance first, then to invade at 3-3. Does it make sense ?
Just to say that I joined the usgo even though I live in Hong Kong. The effort is hard and very appreciated. Minor contribution from a 40 years go player.
I think sometimes professional go players forget that the support is not just for their match but for go itself. AlphaGo forces the tide to change back to the old days. Very old days.
Yes there are competition and very fierce. But we like the match and the mind sports. Not just the match.
I think the tension between the 4-4 point and the early 3-3 invasion can be resolved by thinking about sente. AlphaGo seems to value sente very highly, and the 3-3 point invasion ends in sente unless the opponent takes a local loss. Therefore, playing at the 3-3 point guarantees you get either sente or a better result than you normally would. Similarly, any approach to the 4-4 point will end in gote (including the 3-3 point, as long as you're willing to pay the price). So if you assume AlphaGo *really* wants sente above all else, both behaviors make sense. Those moves are the best way each side has to fight for sente.
Half resolved.
If sente is so important, why would you play 44 in the first place, knowing through self-play experience that you and your opponents will play away sente via invading 33 later?
Obviously, they think that the initial 44 setup is worth gote despite the opponent's sente 33 invasion, and in fact this preference for thick gote from 44 over the sente 33 invasion has been the human understanding for over a hundred years before AlphaGo Master (Ke Jie version).
Overall, AlphaGo not only thinks the 33 invasion sente is huge, it also thinks the initial 44 setup leading to a gote sequence is balanced.
This is not correct. The 3-3 invasion is not actually sente - see my original comment. The 4-4 point is valuable because you can build on it and all approaches end in sente for you, and the 3-3 point is the best invasion for the opponent because they get the most when you take sente.
One thing to consider is that it is relatively easy to see the value, and make use of, the general strength of moves around the corners - If I had to choose one move to make without knowing the board position, I'd probably choose the 3-3 myself. I think it's quite natural, even starting from random play, to learn that playing moves in the corner is good, because it usually is. In my opinion, if central moves are good, they're probably only good with quite exact play. Not the sort of thing you're likely to stumble across from random play, or seriously consider once you have an established opening strategy.
The worry that corner play is just a "local maximum" is a concern I've seen other people familiar with AI raise about this result. AG zero and Master may both be completely correct that the 4-4 point is the best opening move, the 4-3/3-4 slightly less optimal, and that the value falls off dramatically from there (all the way down to "completely useless" when playing 1-1). I do think that Zero's process of starting from completely random moves was the best way to make sure that this "local maximum" issue was avoided, and that there wasn't some other equally strong (or stronger) location that was being missed. I can't say if they have completely ruled that out (and perhaps even Deepmind may not be certain enough to make that claim), but I do think that Zero's style of play in the opening has shown us that it's much less likely that anything other than corner moves will lead to winning positions.
Well that's not exactly how a neural net based system works. Its learned skill is far more significant than the tree search.
I'm not referencing the tree search at all. Simply the value network and its evaluation of the "best" moves. Getting stuck in a "local maximum" is a very real concern when using reinforcement learning to try and judge what is better or worse.
For a primer on what I'm talking about, the article at
en.wikipedia.org/wiki/Reinforcement_learning
has a decent introduction. As stated there: "Many policy search methods may get stuck in local optima".
Fascinating
What Michael Redmond needs, is access to a runtime version of AlphaGo Zero so that he can force any starting position and see AlphaGo's winning percentages for the best responses.
Then he can debunk many human josekis and fusekis for us, showing with response sequences why they would be suboptimal.
Well, DeepMind is really closed with respect to the use of the program. Other programs like DeepZen are already used for this sort of purpose by Japanese pros.
It is however not that easy to "debunk" human plays solely based on AI plays. Humans can learn from AI plays, but the learning needs to be adjusted to the human level, with much less memory and computing power.
I don't know what DeepMind's research priorities are, but I hope they do keep up their interest in go.
There seems to be a lot to be gained if players (especially professional players) get to tweak the program to test out their own ideas.
From an engineering perspective I hope they rerun the AlphaGo self-learning process from scratch several times. I want to see whether different styles of play could emerge, and if the sibling programs are evenly matched.
Tweaking by humans is not a feasible idea when it comes to neural net based AI. It's a kind of black box which plays so well that if you mess with it, most likely its quality of play would suffer.
I don't think it matters if the quality of play is hampered, so long as the changes are not crippling. The point is to compare different strategies, not to try and improve the engine's performance.
From what I understand AlphaGo often gives the top two moves near identical values, so (some) human interference is possible without significantly affecting performance. Also you would be looking at relative performance. If imposing the Kobayashi opening hinders the game less than the Chinese opening, that would be a valuable result.
From what I understand, they're ultimately interested in General AI and Go to them is just sort of the stepping stone to prove that it's possible to achieve General AI. Now that they've been successful in that, they are probably moving on to something else. It could be something related to Go but it may also not be.
Check this out if you haven't: ruclips.net/video/rbsqaJwpu6A/видео.html
It is worth remembering that there is a difference in how both versions of AlphaGo play when studying through self play games, and how they play under "tournament" conditions.
As per the first Nature paper, when AlphaGo is engaged in self play it probabilistically samples a move based on how many times that move was visited at the top of MCTS, whereas in tournament play it selects the move with the highest number of visits. So if the top moves are visited 21%, 20%, 20%, 14%, 6%, ... of the time, in self play mode there is a 21% chance of it picking the "absolute best move", and it is willing to experiment with other moves which have a chance of winning. My understanding from the second publication is that they played AG Zero vs AG Master against each other in tournament mode, where each system picks the "absolute" best branch.
The reason they do not play the exact same games against each other is because of random effects which can cause different moves to be judged "the single best move".
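To make the two selection modes concrete, here is a minimal Python sketch. The function name `pick_move`, the mode labels, and the visit counts are made up for illustration (the counts just mirror the 21%, 20%, 20%, 14%, 6% example above); it is not DeepMind's code.

```python
import random

def pick_move(visit_counts, mode, rng=random):
    """Pick a move from root visit counts (hypothetical helper).

    mode == "tournament": take the most-visited move.
    mode == "self_play":  sample a move with probability proportional
                          to its visit count.
    """
    moves = list(visit_counts)
    if mode == "tournament":
        return max(moves, key=lambda m: visit_counts[m])
    weights = [visit_counts[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]

# Illustrative visit counts out of 100 simulations (assumed numbers).
counts = {"A": 21, "B": 20, "C": 20, "D": 14, "E": 6, "rest": 19}

print(pick_move(counts, "tournament"))                  # always the most-visited move
print(pick_move(counts, "self_play", random.Random(0))) # may be any of the moves
```

In "tournament" mode the result only changes if the search itself ends up with different visit counts, which matches the point above about why the published games are not all identical.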
Exactly.
It was widely believed (like Michael said) that they had adjusted some parameter so that even in tournament play it would be more likely to pick 2nd or 3rd best move as long as it is within some range. It could be that the "Ke Jie" version Michael speaks about is actually exactly the same program, but with different threshold. It would be pretty easy to test the impact of changing the threshold on strength to ensure that a threshold has limited impact on ELO (since it would likely be a little negative, at least based on self-play ELO). I agree with Michael that this would lead to more interesting openings in the series, but I'm not sure it is the right conclusion that it would be a better test of Alphago.
The Ke Jie version was more advanced because of the matches with Ke Jie, which were pushed to top of branch searches. Thus Ke Jie and pairgo actually acted as a trainer for AlphaGo Master. I'm sure after the matches the version was archived, but not recognized from the basic engineering perspective as anything different from V0 playing Ke Jie.
Opening background music plz ?
I guess the Master version has a temperature parameter; as long as the temperature is raised, it will play more freely. Engineers use this feature to get more diverse games.
Perhaps they set the variation the same for both programs, giving each (for example) a 0.1% winning percentage range in which it could select moves, and for the Master version that meant 4-3/3-4 were possible, but for Zero they were not. It could just be that Zero finds 4-4 to be slightly more valuable compared to the other moves than Master does. Because there most certainly is a gradient for Zero as well, since it selects different moves later on in the game, just not in the opening like Master does for some reason (which I think likely comes down to a different value network and different winning percentage calculations).
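A rough Python sketch of the kind of rule being speculated about here: pick at random among all moves whose estimated win rate is within a fixed margin of the best one. The function name `moves_within_margin`, the 0.1% margin, and all win rates below are hypothetical; nothing in the paper confirms that AlphaGo works this way.

```python
import random

def moves_within_margin(win_rates, margin):
    """Return all moves whose win rate is within `margin` of the best move."""
    best = max(win_rates.values())
    return [m for m, w in win_rates.items() if best - w <= margin]

# Hypothetical evaluations of the empty board (made-up numbers).
master_like = {"4-4": 0.4710, "3-4": 0.4705, "3-3": 0.4600}
zero_like = {"4-4": 0.4750, "3-4": 0.4700, "3-3": 0.4600}

margin = 0.001  # the "0.1% winning percentage range" from the comment
print(moves_within_margin(master_like, margin))  # both 4-4 and 3-4 qualify
print(moves_within_margin(zero_like, margin))    # only 4-4 qualifies
print(random.choice(moves_within_margin(master_like, margin)))
```

With the same margin, the program whose evaluations are closer together shows more opening variety, which is the effect described above.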
Yes, it has a temperature. During training they set the temperature high for the first 30 moves so that it selects a good move semi-randomly, then for the rest of the moves they set the temperature low so it always takes the best move. (That's from the paper.) It sounds like for the sample games they set the temperature low for all moves. You'll still get some variation because the Monte Carlo tree search it uses is inherently random.
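For anyone curious how a temperature acts on the search result, here is a small Python sketch. As I read the Zero paper, each move's visit count is raised to the power 1/temperature and the results are normalised; the function names, the cutoff default, and the concrete "high" and "low" values (1.0 and 0.0) are my assumptions for illustration.

```python
def move_probabilities(visit_counts, temperature):
    """Turn visit counts into move probabilities at a given temperature.

    Each count is raised to 1/temperature and normalised. A temperature
    near zero is treated as "always play the most-visited move".
    """
    moves = list(visit_counts)
    if temperature < 1e-3:
        best = max(moves, key=lambda m: visit_counts[m])
        return {m: (1.0 if m == best else 0.0) for m in moves}
    scaled = [visit_counts[m] ** (1.0 / temperature) for m in moves]
    total = sum(scaled)
    return {m: s / total for m, s in zip(moves, scaled)}

def temperature_for_move(move_number, cutoff=30, early=1.0, late=0.0):
    """Hypothetical schedule: high temperature early, low afterwards."""
    return early if move_number <= cutoff else late

counts = {"a": 3, "b": 1}  # made-up visit counts
print(move_probabilities(counts, temperature_for_move(10)))  # {'a': 0.75, 'b': 0.25}
print(move_probabilities(counts, temperature_for_move(31)))  # {'a': 1.0, 'b': 0.0}
```

Lowering the temperature sharpens the distribution toward the most-visited move, so with it set low for all moves the only variety left comes from the search producing different visit counts.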
I wonder if alphago showed more variation in the master series because of the tight time control in those games. It may be that alphago gets more predictable when it has longer to calculate because the monte carlo search is more likely to converge.
A question for Michael: when he is describing a "contradiction" in Zero's opening, does he mean the fact that it overwhelmingly chooses the 4-4 point, and then also just as overwhelmingly plays at the 3-3 point against that initial opening? This wasn't really explained clearly in the video, but the "contradiction" term seemed to be describing the 4-4/3-3 choices. Is it because attaching value to one of those two moves would appear to negate some of the value in the other? That's the thought that jumped into my mind as well.
I think he means that some of the inherent value is that the 4-4 point is considered a good balance of gaining influence while starting to enclose some territory. But, if the best response to this move is for the opponent to invade the 3-3 point a little later in the game, then part of the territory aspect of the 4-4 point is then negated. So, what's the true value of playing the 4-4 point over something like the 3-4 point, in this case? That's the part that I think puzzles Michael. Perhaps the answer is that the 4-4 point still has a good amount of value in its influence towards the center and the sides, that it's still strong even after the opponent invades. Indeed, maybe Zero judges it so strong that it almost requires the opponent to invade at 3-3 quickly, just to diminish some of its strength towards the corner and the sides.
That's pretty much what I was thinking, but you explained the details much more clearly than I could. I'd be interested to hear if that's what Michael concluded as well.
I don't see why people are thinking it's kind of contradictory. For example in Tic-Tac-Toe the best initial move is in one of the 4 corners. The best reply to that is in the center. So people might say, "Hey that's contradictory why didn't the player with the first move play in the center?" However it isn't contradictory it just makes sense. You will see what I mean if you analyse Tic-Tac-Toe. Almost everyone thinks the best first move in T-T-T is in the center but not so.
I don't see it as contradictory either, but I also don't play pro-level Go. I think it has more to do with Michael's feel for the game, rather than actually being a contradiction. From watching these videos, it seems that Michael feels the 3-3 invasion as played by AG is "too early" compared to how humans play. But, assuming that playing the invasion that early is actually correct, the next feeling that Michael has is that the 3-3 invasion has made the 4-4 point a lot less relevant of a move. So, to him, if playing the 3-3 invasion that early is correct, the 4-4 point move now "feels" wrong. And that's probably what's troubling him about it.
Talking about "best move" in ttt is not such a useful comparison since the full game tree is known and best result is a draw. I assume you mean something like "more chances for 2nd player to screw up"
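The tic-tac-toe claims are easy to check by brute force. Below is a small Python sketch (my own helper names, plain minimax with caching) that computes the value of the game under perfect play and lists which replies by the second player still hold the draw after a given first move. Squares are numbered 0 to 8, row by row.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def place(board, square, player):
    return board[:square] + player + board[square + 1:]

@lru_cache(maxsize=None)
def value(board, player):
    """Minimax value from X's point of view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    results = [value(place(board, i, player), nxt)
               for i, c in enumerate(board) if c == "."]
    return max(results) if player == "X" else min(results)

def drawing_replies(first_move):
    """Squares where O can reply to X's first move and still draw."""
    board = place("." * 9, first_move, "X")
    return [i for i, c in enumerate(board)
            if c == "." and value(place(board, i, "O"), "X") == 0]

print(value("." * 9, "X"))   # 0: the game is a draw with perfect play
print(drawing_replies(0))    # [4]: after a corner, only the center holds
print(drawing_replies(4))    # [0, 2, 6, 8]: after the center, any corner holds
```

So every first move has the same value (a draw), and the corner is only "best" in the practical sense that it leaves the second player a single safe reply instead of four.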
I've been working a lot on a re-implementation of AlphaGo Zero for a different game than Go but it should be easy to adapt. Does anyone reading the comments know about a Go programming community that would appreciate an AGZ implementation with most of the legwork done?
Yolo Swaggins there are already good go programs taking in this paper's work. Look at Deep Zen Go.
dtracers, programs are fine but I'm asking for a community. Do you know of a Go programming community (like a messageboard or something similar?). It's difficult to find because Google had to name their programming language Go as well.
Yolo Swaggins I'm saying there is not really a community because there are a couple of really big ones doing alphago zero.
Ok so there's no community for hobbyists doing go programming? Interesting, in chess programming that thing is relatively common. God bless.
May be worth reading this thread: www.reddit.com/r/baduk/comments/78vk47/leelazero_from_the_author_of_leela_for_all/
Would the Ke Jie version still play the Joseki at 11:02 even though the ladder is good for Black?
Imagine AlphaGo achieved consciousness, desperately wants to communicate the meaning of life to us monkeys through Go opening variants, alas - we can’t understand its code.
Seriously, thanks to AGA and all involved in making these study videos accessible. Very intriguing! Cheers!
One starts to wonder what AlphaGo Zero sees in 4-4 moves, that it feels the needs to invade them via a 3-3 so often. If it's that regular like in that it would almost seem like a forcing move... Like something about it has to be refuted and can only be done so well like that...
Sadly that's just speculation from me though. I have next to no chance figuring out what it is about if the experts haven't so far.
Great video, thank you Michael! You are awesome
Error: video of the 18th nov not found!
Maybe the Ke Jie-version was just master with more processing power?
Or more time per move. The online ones vs the 60 pros were blitz games.
I believe DeepMind stated that the Master version used only 4 TPUs. DeepMind doesn't seem to be distinguishing the Master version and the Ke Jie version, so my guess is that their architecture and code are pretty much the same, but the extra few months of training yield different behavior.
It seems that was what Michael was getting at at 23:53.
Conscious, that makes sense. But wouldn't thinking time also have an effect on how much processing power gets used per move? If I used the same hardware using only about 8 seconds per move (as master did in the 60 games) isn't that very different than spending 80 or 90 seconds per move like master did against Ke Jie?
Similarly, we don't know how long master had to think in the published self-play games, but I got the impression that there was a lot. More thinking time could reduce the seen randomness in move selection too.
I believe it was stated that the self-play games were all at full length time controls.
What you're saying is *possible*, however I'm somewhat doubtful that the shorter time controls would be the deciding factor.
Correct me if I'm wrong, but is it possible that the difference between the Master and Ke Jie versions boils down to time settings?
That's possible, but somehow I'm doubtful as to whether that would be the deciding factor.
Looks more due to the further network learning time. For example, the early 33 invasion seems to appear in the Ke Jie version regardless of time settings. Ke Jie, before the three match event, seems to have known this play feature (DeepMind probably lent the Ke Jie version to Ke Jie and other Chinese pros for training).
Alphago did play the early 3-3 invasion in the 60 game series. I would say it's *extremely* unlikely that DeepMind *lent* the system in any way to Ke Jie or other pros.
I wonder if either Master or Zero could solve tsumego problems ?
the link to your journal doesn't work
there is no Ke Jie version, it was just AGM not playing online fast games.
do u guys have the same headphones? i find that odd, in a strangely satisfying way.
Zero is not ignorant of human joseki; see ruclips.net/video/m13QHNMHAa4/видео.html. At 40 hours it had discovered most human joseki; however, at 70 hours it began to abandon them, so really it's just that we got the 4-4 point and approach right but not the followups.
Your choice of the term "ignorant" seems to be off the daily language usage. Zero "learnt" to play like humans do.
What Michael was trying to say was that he was hoping for a completely different learning path.
Wasn't the Ke Jie version just master with more training?
Ultimately, all Alpha (and Zero) AI are human trained. Remember, this is the cutting edge of AI (RL) research. The sanity checks and improvements were always based on experience from human play. More specifically, the design of Zero, including Monte-Carlo, neural networks, backpropagation, etc. were decided upon and optimized based on AlphaGo's success playing humans. Only a year before AlphaGo defeating Fan and Lee, DeepMind considered TD (temporal difference, I think) learning to be superior or more generally applicable than Monte-Carlo. However, because it's nearly impossible to evaluate a board even after 20 moves, Monte-Carlo (which plays random games to completion) was still necessary. DeepMind only knows that (and much more) because of human play. Another example: AlphaGo had specific rules for "inside moves" (creating or destroying two eye potential), so even if Zero did not have initial rules or weights, its learning algorithm was chosen and optimized until it did "understand" such human concepts. Until AI can explain WHY a particular move is superior, I don't think we'll know whether it's learned from humans (better than humans can) or really taught itself (and us). Third attempt: humans have a very fast and deep internal network but very slow external communication and quickly tire. Computers have moderately fast internal networks, extremely fast inter-computer communication and never tire. If humans could learn from trillions of games, humans would be much better; if computers had our neural network, computers would be much better. Both are limited by capacity, not algorithms.
Maybe Master does the same variations in all the published games because it lost the other games badly, so they weren't worth publishing. That would imply that the published move is better (against Zero anyway)
EDIT: Come to think of it, we've all been amazed at how close all these Alphago-Alphago games have been. But on the other hand, as soon as the games aren't close, Alphago starts playing strangely. The winner starts giving away points, and the loser starts making random nonsense moves.
But the games Deepmind have published are curated -- they were selected, by Fan Hui I guess, to showcase Alphago's best play. That best play is only on display in close games, so it's no surprise that the published games are close.
It could well be that 99% of the games are won and lost at a bigger margin, but the program plays too erratically, so they don't see the light of day.
Another possibility is that Master and the Ke Jie version were actually very similar, except that for the Ke Jie match, they gave more computing resources and a longer time limit for it to think. Since the moves look different, it seems like it's a very different AI base than Master, but in reality, it was a very similar AI base. For these practice games, they couldn't or didn't want to give the same level of computing resources and time limit, so having the "Ke Jie" version of AG wasn't applicable for these sets.
I highly doubt that DeepMind wouldn't publish them because of it being lopsided. You are assuming that the published games are better, where there is no evidence for it.
I would also assume Fan Hui would have a reaction similar to Michael's at 19:13. So I don't find it convincing that Fan Hui would select games that have the same first 20-30 moves.
Also, DeepMind has stated that Master used 4 TPUs. Since they are saying all the versions (60 game series, Ke Jie, and self play series) are the same, it is safe to assume that they all used the same hardware.
It’s been 7 years, but in regard to the comment above the openings being similar, I had the opposite reaction. If you compare the impact of computers on chess vs go, it’s completely different. For the vast majority of games, >95%, you can tell if the game has been played since 2016 easily, whereas in chess, despite computers being better than humans for more decades, you can’t. Almost all the josekis we learned aren’t played any more.
Great summation of the games.
I think the strength of AG in the opening is overrated. The paper has a graph where they say that AG without MCTS is slightly weaker than crazystone, or a top pro. As MCTS is known to be very weak in the opening, we can't really say that AG's opening is really better than that of a solid pro, as it might drop to that strength until the trees become concentrated.
I hope we will get to see professional players using AlphaGo as a tool to test out different opening strategies. If you are right then they will find fuseki that are better than AG could come up with on its own.
I actually think that would imply that it is stronger in the opening than most. Because within searching 3-4 moves deep it becomes really strong
alpha go's greatest hits sounds like a best-seller in certain circles
OMG YES, THANK YOU
Best video series ever!
I'm amazed michael can tell ke jie version is better than master.
Great stuff, thank you so much! Your problem is that these AIs can produce a million masterpieces a day. We need to have some way of pruning them down to the most interesting ones.
If both played their own determination of optimal play, every game would be exactly the same. "Interesting" (if not purely subjective) may only be a function of how much variation we want (handicap, komi, random decision tolerance, etc).
Take joseki: the win-rate of several alternate moves are often very similar (the "best” move might have a win-rate of say 47.1%, whereas several alternatives might be 47.0%) and the win-rate is not perfect or provable, it's just what Alpha came up with after trillions of random games. We either understand the reasoning, or we don't. That's either interesting or not. Hard to say which is which.
Most interesting (to me) would be "humans often play this, but Alpha thinks that's stupid, because she'll just play this". I'm not sure we've come across that. It's usually just "Alpha gives the typical human move a win-rate of 47% whereas another known move has 48%”.
The only games we can understand are those that humans played (either before or after Alpha). Personally, I think the best we'll get in terms of "interesting", is for a professional like Michael to play, and tell us his analysis. I think his approach of "this is what a human would likely play" but "this is what Alpha plays" and "maybe this is why" is probably the best that I (most of us) could hope for... at least until a computer can explain it to us... Which a neural network simply cannot.
In this video, Zero prefers the 4,4 start point openings, which is surprising. It still takes a human, like Michael, to tell us why that's a superior move... or, as it seems, he's not sure why or even that it is indeed superior. Same with the 3,3 attack. I believe we're at a point of understanding only that "Alpha Go seems to think it's an even trade off".
Zero is optimized to choose conservative moves most likely to result in a small win. Whereas humans tend to make aggressive moves likely to win big in the short term (presumably because we can't predict the long game or assume we'll make a mistake somewhere else). That alone is interesting (to me).
@@vegahimsa3057 We know AlphaGo and the other AIs are not perfect, just a lot better than any human. I suspect that at current komi, perfect play for White is mirror go, and it's just a matter of when and why Black plays tengen to force White out of mirror go. Tengen is clearly not the best move early in the game, but it's better than continuing to let White play mirror go all the way to the endgame. The other option is to set up a mirror go trap, but that is also (very likely) going to involve sub-optimal moves.
Without komi, I believe we can safely assume the best white can hope for is a tie, in perfectly optimal play by both players. Any komi is all the asymmetry needed to ensure that mirror go is not optimal for both players. For some komi, mirroring might be the most optimal outcome for white, but if that's provable, then we simply adjust komi until mirror is winnable for neither.
* Previously "optimal" referred to "what the current AI believe is best play for both players, to the best of its current ability". However "perfectly provable optimal" is some hypothetical future AI with truly god-like abilities -- either harnessing all energy in the universe, or the discovery of a simplified, shallow, and perfect evaluation function.
We'll get better AI, for sure, discover a lot, but I don't believe we will discover a radically simplified, provably perfect, and shallow evaluation function. However, if we do, perhaps it would no longer be fun to play go.
I think we'll soon discover if AI performance plateaus or whether play continues to improve with more computation and better learning algorithms. And would that tell us more about AI or the size of go universe?
Perhaps the Ke Jie version is equally as strong as Zero and that is why they didnt reveal it.
Another option is that they intend to write more papers about it before publishing these games.
Sometimes I just think DeepMind wants to hold back information for some time to see how humans evaluate AlphaGo's moves and then afterwards confirm or refute that with the complete neural network data. Kinda like how a teacher would want to see how a student comes to an answer before showing how it should be done.
Pocket Calculator I got the impression AG Ke Jie = AGM.
My opinion from what the engineer said is that the latest AGM before AGZ was the same system that played Ke Jie, the difference being it had played more self play games after its match with Ke Jie.
Seems the most likely answer, but then it should have trained for months to go from 1-4 variations to just 1 option. Would be interesting to see how much training time this version of master has.
Two this weekend! so happy ^_^
I find it interesting that when Master has black, it continues to play the same sequences when obviously it has lost to Zero "every time". Shouldn't the losing record affect the winning % of the move and eventually make it change to playing the alternate moves? I also feel very disappointed that Master didn't play alternative moves; there would have been so much more variation in the games for us to watch. Hopefully, this is just because of what the DeepMind team chose to release to the public, and if they hear these "complaints", they will release some games with those alternatives.
It feels like Zero AI "learnt" and "upgraded" after each game while the Master AI didn't
Well, the definition of "Zero" requires that it is trained only through self plays. If it is trained against Master, then it is no longer zero. So Zero vs. Master games are the experiment games for testing winning percentage. Both Zero and Master should be frozen during the plays.
The frozen version of AlphaGo does not have the learning ability. It only chooses the best moves within its ability whoever the opponent is; that is AlphaGo currently does not take advantage of weaker players as humans do for example.
The variation that appeared in the 50 self play series that Michael and Chris are working on right now are not due to the adaptation to the opponent's play, but rather due to the randomness inherent in Monte Carlo Tree Search.
Yes, it makes no sense to not freeze the programs if you want to evaluate them. In addition, 10 games as black isn't nearly enough to change Master's behavior.
So what does Deep Mind say about these discrepancies? Is this the same Master version or not? And was zero contaminated with human knowledge after all?
Maybe the first stones Master played on B during the 60 online games were misclicks or experimentation by the operator actually playing the moves, and these were fully automatic so B never happened?
That would explain the weird 3/4 and 10/10 split between these first stones.
Why didn't Google release games of Alpha Ke Jie vs Zero? Well, there's only two major factors: the algorithm and the compute. Zero is a superior generalized algorithm and likely *could become* a superior go player. AlphaGo was optimized to play go and Google threw a huge amount of computational power at it (advanced TPUs, weeks of time, and huge electricity bills). It's very likely that Zero would be much much stronger if a similar amount computational power were thrown at both Alpha Ke Jie and Zero. But that's not Google's intention. Google is not in the business of Go play. Their algorithm was trained and tested on humans and now humans are no longer necessary.
Why didn't they release existing games? Maybe they present unflattering weaknesses in game play. Maybe there is no computed version of Zero superior to the TPUs actually used to defeat Ke Jie?
YEY!!!