This is the best ReBeL video I've watched. I'm very impressed by your insights into the method. Then I realized you are one of the authors ...
Lol
I like how you explain it so concisely and precisely
Amazing ideas, Noam, and great explanations. Thank you.
I just read the "Superhuman AI for multiplayer poker" article by this Noam Brown guy and became interested in the "Supplementary Materials" he published for the project, but barely understood them.
Then I opened up YouTube to introduce myself to Monte Carlo CFR and found this... Thank you in advance already!
I was just recommended this vid. Whoever this Noam Brown guy is, he's good
Extremely impressive, thank you for making the video (together with the paper). Congratulations!
I wonder why you only submitted to a CS conference, and not to a real scientific journal like Nature or Science? I think the quality of the work certainly warrants a publication that is seen by a large number of scientists, and that's gone through proper peer review (not only the CS conference reviews).
I believe your methods could be very exciting for the interface between deterministic and indeterministic processes in physics (classical and quantum mechanical boundary). They are in a way perfect information (classical physics) and imperfect information (quantum, see Bell's inequality).
Thanks! I think in order to turn it into a Nature or Science paper we would need experimental results on Go. That would take a lot of engineering work so we decided it wasn't quite worth it.
Great video Noam! I always love to see your videos in my feed. "With a ReBeL yell I want more... more...more" 😜
Ty for this!
Great presentation, thanks for the effort. Now it is time to read the paper :)
33:30 Why is the input dimension of Texas Holdem 1326?
In Texas Holdem each player has two hidden cards, so there are 52 choose 2 = 1326 hand combos
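In case anyone wants to verify that count, a quick sanity check in Python (not from the paper):

```python
from math import comb

# Two hidden cards out of a 52-card deck: the number of possible private hands,
# and therefore the number of infostates per public state for one player.
print(comb(52, 2))  # 1326
```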
Amazing! thank you for this
Amazing, thank you for sharing
Hi Noam, I think this field of research is fascinating and it's very impressive to see the strides you took and are taking in the realm of imperfect information games.
Previously, Pluribus cost very little computationally, despite playing at a 6-person table. One of the most impressive stats about Pluribus is that it only took ~$150 and relatively modest computation to train and run (is that because of the lack of deep neural nets?). In comparison, with the implementation of deep reinforcement learning and self-play without information abstraction, how does ReBeL compare in terms of cost and computational requirements? Thanks
Thanks! ReBeL is much more computationally expensive. The exact resources are mentioned in the paper, but it's definitely far more expensive than what can be done on a home computer. However, it appears to scale well, and it's possible that future research will bring the computational cost down.
Wow! This is pretty cool! Was there any reason for not testing ReBeL in 6 player poker games?
I think it would work fine in 6-player and we thought about doing it, but there aren't many benchmark agents for 6-player so it would be hard to evaluate the bot.
In theory public belief states (PBSs) might not have unique values outside of two-player zero-sum games but I don't think that would be an issue for 6-player poker.
@毛凡 it would probably work for any number of players but it becomes increasingly difficult to evaluate as you add more players.
BRILLIANT IDEAS.
Very cool! Your pro was Dong Kim, who was able to play from home according to your SI. Does that mean you have an online platform where one could play? Is there any chance to get access? Do you need special hardware to run the final version, or is a standard workstation sufficient? Would love to get beaten in chess and HU Texas hold'em :)
We do, but it's not publicly accessible unfortunately.
Thank you for the video! I'm an undergraduate who wants to learn a lot more about RL, and this was really great. I was actually curious whether ReBeL could be directly applied to a game like Pokemon. I'm not sure how familiar you are with battling mechanics in Pokemon, but there is a lot of imperfect information (each opponent Pokemon has hidden moves and hidden stats, which can later be inferred based on observable damage numbers and moves used) as well as damage rolls, crits, etc. In particular, unlike poker or RPS where there is a set turn order (for RPS we reformulated it to be this way), in Pokemon the move order is determined by the speed stat of the Pokemon in play, which is sometimes hidden from the players if not previously observed. Would you still be able to use ReBeL in this case, and would it have to be reformulated in any way?
Also, I just had a couple quick clarifying questions about belief infostates. You mentioned in the paper that the input of the value network is the following size:
1(agent index) + 1(acting agent) + 1(pot) + 5(board) + 2 × 1326(infostate beliefs)
Why is there a separate value for "agent index" and "acting agent"? Also, are infostate beliefs just the set of probabilities for every possible player hand (which we update based on player actions)? Why don't we include information such as amount of money each player put in the pot and bets made during that current round?
Thank you again so much for the paper and video, I appreciate your time to help people like me learn more!
I'm not that familiar with Pokemon so it's hard for me to say, but one issue might be the amount of hidden information. At the end of the talk I mention that the input to the NN is proportional to the number of infostates in a public state. It sounds like that can be quite large in Pokemon.
There's a separate value for agent index vs acting agent because we need to know whose turn it is and who we want the value for. Those could be different.
We don't need to keep track of prior bets and other information like that for the same reason that in chess you don't need to keep track of all the prior moves, you just need the current board configuration (plus some additional info to prevent repeating the same position). The info we track is sufficient for computing the EV for each player.
And yes, infostate beliefs are the probability for each hand.
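As a concrete illustration of that input layout, here is a minimal sketch that packs a PBS into a flat vector of size 1 + 1 + 1 + 5 + 2 × 1326 = 2660. The card indexing, pot normalization, and padding value for undealt board cards are my own assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

NUM_HANDS = 1326  # 52 choose 2 two-card combos per player

def encode_pbs(agent_index, acting_agent, pot, board_cards, beliefs_p1, beliefs_p2):
    """Pack a public belief state into a flat vector of length
    1 + 1 + 1 + 5 + 2 * 1326 = 2660 (the layout quoted in the question above)."""
    board = np.full(5, -1.0)                 # assumed padding for undealt cards
    board[:len(board_cards)] = board_cards   # assumed card indices in 0..51
    return np.concatenate([
        [agent_index, acting_agent, pot],    # whose value we want, whose turn, pot size
        board,
        beliefs_p1,                          # P1's probability for each of the 1326 hands
        beliefs_p2,                          # P2's probability for each of the 1326 hands
    ])

uniform = np.full(NUM_HANDS, 1.0 / NUM_HANDS)
x = encode_pbs(0, 1, 0.5, [12, 25, 38], uniform, uniform)
print(x.shape)  # (2660,)
```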
The value vector computed with CFR is added as a training example. When solving the last round, these values are completely accurate if it was solved to 0% exploitability. So over time it will get better at solving the round before the last, then the round before that. Something I don't understand is that you don't solve until the start of the next round. How does the value network improve at estimating the leaf nodes when you are giving it training data that consists only of public belief states at the start of a betting round?
The values of belief states at the end of a round can be computed by averaging over the values of belief states at the start of the next round. As we collect more data for values at the start of rounds, we also train the network to predict the values at the end of rounds.
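A tiny sketch of that relationship, assuming uniform chance over the possible next public cards (the real computation weights each outcome by its chance probability and accounts for card removal):

```python
import numpy as np

def end_of_round_values(start_of_next_round_values):
    """Value of a PBS at the end of a betting round, taken as the average of the
    PBS values at the start of the next round (one row per possible next board
    card). Uniform chance is assumed here for simplicity."""
    return np.mean(start_of_next_round_values, axis=0)

# e.g. 49 possible turn cards from the public point of view (52 minus 3 flop cards),
# each leading to a start-of-next-round PBS value vector over 2 * 1326 infostates
per_card_values = np.random.rand(49, 2 * 1326)
print(end_of_round_values(per_card_values).shape)  # (2652,)
```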
I have a question about Algorithm 1 on page 1 of the paper. As I understand it, the function COMPUTEEV computes the expected value of the PBS Br based on the output of the value network (computed in the SETLEAFVALUES function). However, you add (Br, v(Br)) to the value net data, which means using the output of a model to train itself (did I understand that right?). CFR-D in section 5.1 also uses the output of the value network to compute the new policy for the next iterations. I thought CFR-D had to use ground-truth values to compute the policy for a new iteration, but it seems to use the output of an untrained value network. So where is the ground truth for the value network at the leaf nodes? Thank you so much, Noam.
I'm not sure I understood your question. For some subgames, there will be leaf nodes at the end of the game (e.g., if the leaf node is for a fold action, or if you are on the river and both players have called). In those cases, you will receive a ground truth reward from the environment. Otherwise, you use the value network at the leaf node. But eventually the ground truth rewards will "bubble up" and the value network will become increasingly accurate.
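In code, that leaf-value rule might look like the sketch below; the `node` and `value_net` interfaces are hypothetical and only meant to show where ground truth enters.

```python
def leaf_value(node, value_net):
    """Leaf values during subgame solving: terminal leaves (a fold, or a river
    showdown after both players call) use the true payoff from the environment;
    other leaves fall back on the learned value network. Over training, the exact
    terminal values gradually 'bubble up' into earlier rounds.
    `node` and `value_net` are hypothetical interfaces, not the paper's code."""
    if node.is_terminal():
        return node.payoff()                        # ground-truth reward
    return value_net(node.public_belief_state())    # learned estimate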
@@noambrown8362 so could you check my understanding of your algorithm at each iteration t (from t_warm -> T):
Step 1: UPDATEPOLICY(G, policy at t-1). This step computes the policy at time t by applying CFR-D. CFR-D performs regret matching using the value at each leaf node (the ground truth if the node is at the end of the game, or the value network's output at an intermediate node). So it updates the new policy based on the regret values.
Step 2: Update the average policy. This step just computes the new average policy from the current average policy and the policy at time t computed in step 1.
Step 3: G
@@noambrown8362 I think I understand how to train the value network after reading the conversation between you and @brianbrewer974. Take a poker game for example: first we train the value network with ground-truth rewards (the river round), so the network learns to predict well on the river, almost exactly matching the ground truth. Then we can use its output to compute expected values for the turn round and run CFR-D for T iterations to get a proper value v(Br) at a PBS Br. Now we add (Br, v(Br)) to the training data so that the network also predicts well on the turn (maybe that's why you call this a "bootstrapping" training strategy). Repeat this process for the flop and earlier rounds, and the value network will be trained on PBSs Br from all rounds. ReBeL finishes and we have a well-trained value network. Is my understanding correct, Mr. Noam?
As I watch this video, I'm also watching some online poker games. There is a player on one of the tables named 'NBrown'. Hmmmm......
Hi Noam, I was wondering what the computational requirements of training ReBeL were compared to the Libratus model. I assume this is significantly more expensive considering the continuous nature of the game state space, and the fact that the game state now contains at least as much information as Libratus's game space. Could you improve my understanding of this? Thanks!
Yes, ReBeL is much more expensive than the Libratus approach. This is the cost of generality. We talk about compute usage a bit in the paper.
Great job, Noam! But what about the Prisoner's Dilemma? I see that you study non-cooperative zero-sum games, so maybe it is not your subject at all. But if all AI algorithms concentrate only on competing, and not on cooperating, will we have any win-win games in humanity's future? Are altruistic genes just rudiments for a future homo artifices?
Does the exploitability issue when playing humans you mentioned at 26:19 also apply to Libratus or only to this new algorithm? Or does this issue only apply to ReBeL because you are using value networks?
It was *mostly* not an issue in Libratus. Libratus used two kinds of search: upon reaching the turn for the first time, it used unsafe search. Thereafter, it used safe search. The exploitability issue was in theory relevant when using unsafe search, but in practice wasn't really an issue in poker. (The reason it wasn't an issue in practice is complicated. Basically because the search occurred immediately after the turn card came out, the unpredictability of the turn card made it difficult for a human to actually exploit the bot's strategy.)
You might wonder why we used unsafe search at all if we had safe search. The safe search techniques that existed before ReBeL added extra constraints to the search strategy that in practice could make it do worse than unsafe search. The nice thing about ReBeL's search is that it's unexploitable just like past safe search algorithms, but it doesn't have their extra constraints so ReBeL's search in practice performs much better.
@@noambrown8362 Thank you for the response. Conceptually, is it correct to think of the goal of the technique you employed of randomly selecting iterations in the search as making the strategies robust against every type of potential counterstrategy (even bad ones) in terms of long-run expectation/unexploitability ?
@@grabaka884 yes that's a good intuition for it. The real reason you need to select a random iteration has to do with the details of CFR. But understanding it would require going into the details of the CFR decomposition algorithm, which would take a while.
For the turn endgame hold'em experiment where you trained with random values: was this something like each of the possible hands gets a random probability between 0 and 1? I was able to train a river net to a training loss of 0.013 and a validation loss of 0.018 with 500k samples (each solved to an exploitability of 0.1% of the pot) using DeepStack's pseudorandom range-generating function and having the board be an input to the network (represented as 52 zeros or ones depending on whether the card is on the board). The inputs were also the players' ranges rather than bucket probabilities. I also tested the network in situations where both players had 100% of possible hands (a distribution not represented in the data), and it had a loss of about 0.003 in those cases (albeit a small sample). For some reason, that is an easier distribution than the pseudorandom one.
I'm guessing that if you assign random probabilities to every hand, each hand strength is going to have a similar number of combos in most samples. For instance, if there are 100 hands that make top pair, most samples on that board are going to have around 50 combinations of top pair. It may not be able to generalize in situations where a hand strength is very unlikely or very likely. I've come to the conclusion that buckets were not necessary in DeepStack or Supremus, but the algorithm for generating the players' ranges is.
Yes, it was each possible hand with a random probability between 0 and 1 (and then normalized). What you're describing is an interesting result. I'm guessing the validation set was drawn from the same pseudorandom distribution used for training? What if you try solving the turn endgame that we tested in the ReBeL paper (where we assume uniform ranges at the start of the turn)? I'd be interested to see what kind of exploitability numbers you get. That would give you a sense of how good the network is outside of its training distribution.
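A minimal sketch of that range-generation scheme (independent uniform weights, impossible hands zeroed, then normalized); the card indexing and blocking logic are assumptions for illustration, not DeepStack's or ReBeL's actual code:

```python
import numpy as np
from itertools import combinations

HANDS = list(combinations(range(52), 2))   # 1326 two-card combos, cards as 0..51

def random_range(board, rng):
    """Give every hand an independent uniform [0, 1] weight, zero out hands that
    share a card with the board, and normalize to a probability vector."""
    weights = rng.random(len(HANDS))
    for i, hand in enumerate(HANDS):
        if set(hand) & set(board):
            weights[i] = 0.0
    return weights / weights.sum()

rng = np.random.default_rng(0)
board = [12, 25, 38, 44]                   # an arbitrary turn board
p1_range = random_range(board, rng)
p2_range = random_range(board, rng)
print(p1_range.sum(), (p1_range > 0).sum())  # 1.0, and 1326 minus the blocked combos
```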
@@noambrown8362
Yes, the validation set was created identically to the training set.
It's not clear to me how to measure exploitability when using a value net. When solving to the end of the game, finding a best response is fairly straightforward and similar to a CFR iteration. When using a value net at the leaf nodes of the subgame, both players' ranges will impact the result of the network (adding strong hands to your own range will change the expectation of your weak hands; this isn't true at terminal nodes at the end of the game). Do you measure exploitability in just the first round, or do you solve the second round (by passing the reach probabilities from the solution of the first round) and measure exploitability in the full game?
@@brianbrewer974 You should first solve the turn using the river value network, and then "freeze" that turn strategy for each player. To calculate the best response value against P1, you run CFR on the *full* turn endgame (both turn and river without any neural value network) where P1's turn strategy is fixed at the frozen strategy (but not P1's river strategy, and not P2's turn/river strategy). You can do the same thing but mirrored to get the best response value against P2. Add these up to get exploitability.
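A skeleton of that evaluation recipe, with hypothetical solver helpers (`solve_with_value_net`, `solve_full_endgame`) standing in for whatever solver you use; it shows the structure of the procedure, not real API calls:

```python
def exploitability(turn_pbs, river_value_net):
    """1) Solve the turn using the river value net and freeze both players' turn
       strategies.
    2) Best response vs P1: re-solve the *full* turn endgame (turn + river, no
       value net) with only P1's turn strategy locked; P2's value in that solve
       is the best-response value against P1.
    3) Mirror for P2, then add the two best-response values.
    solve_with_value_net / solve_full_endgame are hypothetical helpers."""
    frozen = solve_with_value_net(turn_pbs, river_value_net)

    br_vs_p1 = solve_full_endgame(turn_pbs, locked={"p1_turn": frozen["p1_turn"]})["p2_value"]
    br_vs_p2 = solve_full_endgame(turn_pbs, locked={"p2_turn": frozen["p2_turn"]})["p1_value"]
    return br_vs_p1 + br_vs_p2
```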
@@noambrown8362 Sorry for the slow response. I created an end-of-turn network with 1M subgames and its Huber loss was 0.00239. I solved the turn endgame hold'em subgame described in the ReBeL paper. Locking the OOP player's strategy in a commercial solver caused him to lose about 50 chips, or 2.5% of the pot. I haven't tested the IP player, as copying the strategies was quite time-consuming. I've noticed that the combo selection in the strategy (from the neural net) is super specific. It's much more likely to play pure strategies with specific hands than the solution from solving to the end of the game. This may be a weakness from the outputs not being bucketed.
@@brianbrewer974 It's not really possible to say what the exploitability is without measuring how much both players lose. The OOP player might be losing 2.5% simply because the OOP player might be at a disadvantage given the ranges. Can you let me know what the numbers look like for IP too? For comparison, ReBeL is exploitable by about 0.25% of the size of the pot.
Why didn’t you mention AlphaStar?
While this work tries to incorporate techniques successful in deep RL, notably the use of value function approximation to truncate search, into solving imperfect info games, it's not clear that it can scale to scenarios where the underlying state space is already large. In the toy poker game, one has a state space of cardinality 104. I suppose this issue is addressed with the final comments on recon chess.
The state space of the whole game is extremely large: 10^161 decision points, which is (very roughly) comparable to the size of Go. When converted into the belief space representation, Texas holdem can be viewed as a continuous state space with more than 2,651 dimensions. Go has 0 dimensions in the belief representation.
The issue with scaling is with increasing amounts of hidden information, *not* with an increasing size in the state space.
-Actually, since you don't know either your cards or your opponent's in the belief space representation (they are not common knowledge), isn't the public belief state of size 52 * 51 * 50 * 49 / 2 or similar?-
Oh, I see, what we care about is the size of the info states, and not the PBS as a whole. I suppose contrary to your comment about 6 player poker games, this approach might already be intractable for 3 player games, since the info state for a single player would have > 10^6 dimensions.
Unless you are suggesting one might be able to handle them adequately separately?
(in general, I suppose factorising a state space is a common technique to reduce dimensionality, whether learned or manually)
I'm quite interested, with regard to these limitations, in exploring whether learned representations over infostates, possibly "common representations" in the sense that they are shared by all players, could be used to overcome them. I suppose that agents that have solved imperfect-info games with high-dimensional/continuous info states like Dota 2/StarCraft already do this (approximating over hidden knowledge) implicitly.
There does seem to be some crossover point where humans also can't handle high-dimensional info states with high precision, since developing specific techniques was not required* in those cases to beat humans. Imperfect info is, however, only one part of those games, rather than being the whole point.
(*with the caveat that AlphaStar uses population-based training with exploiter agents to reduce the main agent's exploitability, but this is not an algorithmic approach, and certainly not in the same vein as the work done in poker AI. AlphaStar's approach addresses the problem of data distribution, rather than imperfect info per se.)
@@jonabirdd 6-player poker shouldn't be much of a problem for ReBeL (except the fact that PBS values in theory might not be unique). It's the number of infostates in a public state that is the limiting factor, not the size of the infostates. In 3-player poker there are 3 * 1,326 infostates per public state, whereas the size of infostates is about 10^6.
You can approximate the equilibrium of any game regardless of the amount of hidden information by not doing search at all. This is what AlphaStar does in StarCraft 2 and OpenAI Five does in Dota 2. But search makes a *huge* difference in games where it has been used. The fact that AlphaStar was not able to achieve superhuman performance in StarCraft 2 and is highly exploitable in the game despite using massive amounts of compute suggests that search might be important in that game as well. The problem is that nobody knows how to do search effectively in RTS games yet (which is why it wasn't used in StarCraft / Dota).
Is there an implementation of ReBeL / DREAM on GitHub?
Yes! Though not for Texas holdem specifically.
ReBeL: github.com/facebookresearch/rebel
DREAM: github.com/EricSteinberger/DREAM
Hello! I am wondering how exactly a subgame is constructed from a PBS. At first I thought that you would just sample an infostate for each player using the infostate probabilities, and then construct a subgame in the discrete representation using those infostates, but then I realized this doesn't work because some infostates may be incompatible (e.g. both players cannot hold the same card in poker). Thanks in advance!
In the full game of poker, there is an initial chance node whose outcomes lead to players being dealt different poker hands (with each poker hand having equal probability). A subgame rooted at a PBS is constructed in a similar way, but with the initial chance node having outcome probabilities based on the probabilities of the PBS rather than being uniform. In other words, at the start of the subgame, each player is "dealt" a hand with probability equal to the belief for that hand (this gets a bit more complicated in poker compared to, say, Liar's Dice because players can't be dealt the same card, so you have to account for this "card removal", but it's not a very important conceptual detail).
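A toy sketch of that root chance node, using rejection sampling to handle card removal (my own simplification for illustration, not the paper's exact construction):

```python
import numpy as np
from itertools import combinations

HANDS = list(combinations(range(52), 2))    # 1326 two-card combos

def sample_root_deal(beliefs_p1, beliefs_p2, rng):
    """Sample the subgame's initial chance outcome: each player is 'dealt' a hand
    with probability equal to their belief, re-drawing whenever the two hands
    share a card (card removal)."""
    while True:
        h1 = HANDS[rng.choice(len(HANDS), p=beliefs_p1)]
        h2 = HANDS[rng.choice(len(HANDS), p=beliefs_p2)]
        if not set(h1) & set(h2):
            return h1, h2

rng = np.random.default_rng(0)
uniform = np.full(len(HANDS), 1.0 / len(HANDS))
print(sample_root_deal(uniform, uniform, rng))
```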
I think this technique would be applicable to trading card games like Magic or Yu-Gi-Oh, but would it struggle with different matchups or decks that are not equal? And if so, do you think there are ways of addressing that?
I don't see why it would struggle with unequal decks, especially if it's trained on all decks rather than on a specific pair of decks.
This seems like a cool idea outside the realm of poker. I'm gonna go implement it. Wish me luck
@@noambrown8362
I am not sure if you are still looking at comments or not. But I gave it my best shot and got it published in AJCAI this year (yt won't let me link sadly)
Thank you for being an awesome inspiration! I have learned so much.
In the paper the following is said: "At the start of the game, a depth-limited subgame rooted at the initial PBS βr is generated. This subgame is solved (i.e., a Nash equilibrium is approximated) by running T iterations of an iterative equilibrium-finding algorithm in the discrete representation of the game." But I'm a bit confused as to how you would solve the discrete representation of the game when you start in the public belief state. Do you sample an infostate from the public belief state and run CFR on this, or do you do something else?
This is a bit tricky. You can view it as sampling a history at the root of the subgame on each iteration (you don't just sample an infostate, because an infostate only describes one player's information). Now you might ask, how do you sample a history if you only have the probability distribution over infostates for each player? It turns out the probability distribution over infostates for each player uniquely defines a probability distribution over histories.
You could alternatively just always view a PBS as a probability distribution over histories, but if each player has 1,000 possible hands then there are 2*1,000 infostates, whereas there are 1,000^2 histories, so representing a PBS using probabilities over infostates is more compact.
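To make the counting concrete, here is a small sketch with 1,000 hands per player, mirroring the numbers above: the PBS stores 2 × 1,000 belief values, while the distribution over histories it induces has 1,000 × 1,000 entries (in poker you would additionally zero out hand pairs that share a card and renormalize):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000                                   # possible hands per player, as in the example above

beliefs_p1 = rng.dirichlet(np.ones(N))      # 1,000 numbers
beliefs_p2 = rng.dirichlet(np.ones(N))      # 1,000 numbers

# The induced distribution over histories (hand pairs) is the outer product;
# in poker, incompatible pairs would also be zeroed out and the matrix renormalized.
history_probs = np.outer(beliefs_p1, beliefs_p2)

print(beliefs_p1.size + beliefs_p2.size)    # 2,000 values stored as infostate beliefs
print(history_probs.size)                   # 1,000,000 values if histories were stored directly
```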
@@noambrown8362 I see, but then how would you apply CFR to a PBS, since normally you feed CFR a history and reach probabilities? Do you sample/keep track of a history then? It's nice that a probability distribution over infostates uniquely defines a probability distribution over histories, but I presume a mapping from one to the other can't be explicitly written down.
@@ハェフィシェフ Yes you sample a history from the PBS and then solve the subgame just as you normally would solve an imperfect-information sequential game in the discrete representation.
What is the difference from MuZero? I'm not an engineer, just a programmer who learns ML on my own. In my view you're doing kind of the same thing MuZero does: trying to come up with a policy model that the other agents use. Sorry for asking. I should have just read the paper :D
MuZero also can't play poker. It can play perfect-information games without having to know the rules. That's impressive, but a separate contribution from ReBeL.
@@noambrown8362 Thank you for your answer
The proof of Theorem 3 is not very convincing. It looks like you are proving that the strategy obtained by Algorithm 1 is an equilibrium in a game where players know the past strategies and can compute beliefs. This is because the value at leaf nodes is obtained as a function of beta^t, which in turn is a function of the strategies. Furthermore, according to the proof, the policy averaged over all iterations and the policy that stops at a random iteration are equivalent. However, you seem to imply that one of them works while the other does not. There seems to be a discrepancy here; it would be great if you could clarify this.
In zero-sum games, going from the case where the strategies are common knowledge to the case where they are not common knowledge is a tricky long-standing problem that has been addressed only in some special cases. That is why I feel skeptical about Theorem 3 and I think it should be handled more carefully.
Hi Dhruva, Theorem 3 (and all of Section 6) is in my opinion the most interesting, counter-intuitive, and under-appreciated part of the paper, so I'm glad you highlighted it! Theorem 3 doesn't say that we need to know the opponent's actual strategy. Instead, it says that if we assume that the opponent's strategy is sampled from an iteration of CFR, then the strategy that we play is (in expectation) a Nash equilibrium strategy. It's still a Nash equilibrium strategy regardless of what the opponent actually plays (and indeed the opponent will almost certainly play a strategy different from what we assume).
To be clear, it is *not* theoretically sound to run CFR in a subgame and then assume the opponent played that average strategy. However, it *is* theoretically sound to sample a random iteration and assume the opponent played the strategy from that random iteration. The policy you play (in expectation) in future subgames will be different in the latter case than in the former case.
Hi Noam, thanks a lot for clarifying the statement of the theorem. That is more or less how I understood it and I am inclined to agree with it. My confusion however is with the proof of the theorem. If the proof of this theorem is correct, then it is indeed a major development in game theory. It would be great if you can make the proof more mathematically rigorous.
Specifically, these are my concerns:
1) As you just said, assuming the opponent played the average strategy is not sound while sampling a random iteration and assuming the strategy from this iteration is sound. If you look at the arguments used in the proof of theorem 3, there is nothing that distinguishes these two methods. In fact, the arguments clearly say that both of them are equivalent ("one can equivalently pick a random iteration..." in the fourth paragraph for instance). Which part of the proof suggests that randomly choosing iteration t is sound whereas the average strategy is not? In the proof arguments, where is the significance of this added layer of randomizing over iterations?
2) In the last line of paragraph 3, you state "the average policy over all iterations is (approximate) Nash equilibrium." I think one needs to be more specific about the game in which this is a Nash equilibrium. Is it the game in which players know the strategies or they do not know the strategies? To prove the theorem, we would like this to be the game in which players do not know the strategies. But since the beliefs beta^t (which depend on strategies) are being used at the leaf nodes, I am not sure if that is the case.
Hi @@DhruvaKartik, the key is that in paragraph 3 we say that the subgame is solved using CFR-D ("Solving Imperfect Information Games Using Decomposition" by Burch et al. AAAI-14). The theorem makes a lot more sense once you understand the CFR-D algorithm.
Once you understand CFR-D, the proof can be summarized as follows:
1) CFR-D computes an average policy that is a Nash equilibrium for the full game.
2) Playing the average of T policies is equivalent to sampling one of the T policies and playing that policy.
3) Thus, rather than play the average policy produced by CFR-D, one could equivalently sample a random iteration and play that policy (and this sampling is done for every recursive call of CFR-D).
4) That is exactly what ReBeL does, but it also uses a value network to estimate the value of some subgames rather than solving them recursively.
Feel free to email me if you'd like to discuss this further. Understanding CFR-D can be tricky but the proof will make a lot more sense then.
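Point 2 in that summary can be checked numerically with a toy example: averaging T policies yields the same action distribution, in expectation, as sampling one of the T iterations uniformly and playing that iteration's policy. A quick sketch with made-up policies, not tied to any actual CFR run:

```python
import numpy as np

rng = np.random.default_rng(0)

T, num_actions = 1_000, 3
policies = rng.dirichlet(np.ones(num_actions), size=T)   # one toy policy per "CFR iteration"

average_policy = policies.mean(axis=0)

# Sample an iteration uniformly at random many times and average the policies
# actually played: in expectation this matches the average policy.
sampled_mean = policies[rng.integers(T, size=50_000)].mean(axis=0)

print(average_policy)
print(sampled_mean)   # approximately equal to average_policy
```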
Thanks a lot for your time @@noambrown8362. I will read more about CFR-D and send you an email if necessary. Looking forward to seeing more results on ReBeL in future conferences.
@@noambrown8362 Could Theorem 3 be applied to MCCFR?
1) Could you solve a subgame with MCCFR and sample a random iteration?
2) If so, due to the external sampling of MCCFR, could you simply use the average strategy of the final iteration (or even the current strategy of the final iteration), rather than sampling a random iteration?
My intuition is that the last iteration of MCCFR is really similar to doing CFR and sampling among the last few iterations. I'm not really sure how to prove it, and it may be wrong.
cool i guess ...
jk :P