AI beats us at another game: STRATEGO | DeepNash paper explained
- Published: 7 Jul 2024
- DeepMind made an expert-level Stratego bot. We explain how they program an unexploitable AI player, and we go into more detail on their model-free Reinforcement Learning method and how they approach a Nash equilibrium with Regularized Nash Dynamics.
► Sponsor: NVIDIA: 👉 nvda.ws/3HpWbzX Sign up for the GTC spring 2023 for FREE!
Google Form to enter DLI credits giveaway: forms.gle/DMPc4G22tnqbMWGCA
Check out our daily #MachineLearning Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
📜 DeepNash paper: Perolat, J., De Vylder, B., Hennes, D., Tarassov, E., Strub, F., de Boer, V., Muller, P., Connor, J.T., Burch, N., Anthony, T. and McAleer, S., 2022. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623), pp.990-996. www.science.org/doi/pdf/10.11...
📖DeepNash blog post (DeepMind): www.deepmind.com/blog/masteri...
💻 Open-source implementation of the DeepNash algorithm on a GPU-accelerated game that can run on consumer hardware: github.com/baskuit/R-NaD
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Edvard Grødem, Vignesh Valliappan, Mutual Information, Mike Ton
Outline:
00:00 DeepNash from DeepMind
01:05 NVIDIA - GTC 2023 [Sponsor]
02:26 RL is hard for Stratego
03:35 How Stratego works
04:43 Why RL for solving it
06:22 Model-free RL - Nash equilibrium
07:45 Technical details of DeepNash
08:22 DeepNash architecture
10:13 R-NaD: Regularized Nash Dynamics explained
13:46 Finetuning
15:04 Results
15:54 Bluffing
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
RUclips: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : Cru - Yung Logos
Video editing: Nils Trost
😍
Great video, very in-depth. And thank you for including my implementation in the description!
This paper has gone mostly unnoticed by the broader RL community, even though it should have myriad applications.
Also, I think it's a good idea for researchers to review the regularization code in my repo/OpenSpiel.
In my opinion, it differs from what is detailed in the paper.
Another awesome video, Letitia! After hearing about DeepNash some months ago, I tried to read the paper, but ultimately failed to absorb much of it. Your explanation here makes it a lot clearer.
The Methods section of the paper indicates that the discretization parameter during the play-phase of the game was n=16. The actions are sorted from highest-probability to lowest, and then each action's probability is rounded *up* to the nearest multiple of 1/n, discarding the remaining weights once a sum of 1 is reached.
My hypothesis on why discretization is used: it ensures that at most only the top-n moves are considered. They may have empirically found that whenever the agent made moves outside the top-n, it was more often than not a mistake.
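To make the scheme described above concrete, here is a small Python sketch of that discretization step as I understand it from the comment (sort actions by probability, round each weight up to the nearest multiple of 1/n, stop once a total of 1 is reached); it is an illustration, not DeepNash's actual code:

```python
import numpy as np

def discretize_policy(probs, n=16):
    """Round sorted action probabilities up to multiples of 1/n,
    discarding the remaining weights once the rounded mass sums to 1."""
    order = np.argsort(probs)[::-1]   # highest-probability actions first
    out = np.zeros_like(probs)
    remaining = n                     # probability budget in units of 1/n
    for i in order:
        units = min(int(np.ceil(probs[i] * n)), remaining)
        out[i] = units / n
        remaining -= units
        if remaining == 0:
            break                     # remaining (low-probability) actions get 0
    return out

policy = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(discretize_policy(policy))      # only the top actions keep any weight
```

Note how the lowest-probability action is dropped entirely: its weight never fits into the leftover budget, which matches the "at most the top-n moves" intuition above.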
Ah, it's a ranking, now I got it, thanks! 🫱
Another hypothesis is that discretization was motivated by memory considerations. In some MCTS implementations, the policy distribution can be the dominant contributor to each node's memory footprint, and thus to the entire system's memory. Without discretization, the policy distribution requires something like 4*N bytes, if using float32 representations. With discretization, each weight can be packed into log_2(n)=4 bits, and so you can pack all n of them into n*log_2(n) = 64 bits, or just an 8-byte word. You would need the index-information as well, which can be represented as a bit-mask of N bits, giving you a total footprint of N/8 + 8 bytes, which is about 32x less than the naive 4*N byte representation.
@@dshin83 It's almost certainly because of memory. Giving a bigger decision space should theoretically eventually lead to a better outcome, although it would make the training and memory requirements significantly greater.
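The arithmetic in the memory hypothesis above can be checked with a short sketch (the constants n=16, float32, and the N-bit index mask are taken from that comment, not from the paper's actual storage layout):

```python
import math

def naive_footprint_bytes(N):
    """One float32 probability for each of the N possible actions."""
    return 4 * N

def packed_footprint_bytes(N, n=16):
    """Discretized storage: n weights of log2(n) bits each,
    plus an N-bit mask marking which actions received weight."""
    weight_bits = n * int(math.log2(n))   # 16 * 4 = 64 bits = 8 bytes
    mask_bits = N                         # one bit per possible action
    return mask_bits / 8 + weight_bits / 8

N = 1024
print(naive_footprint_bytes(N))           # 4096 bytes
print(packed_footprint_bytes(N))          # 136 bytes
```

For large N the ratio tends to 4N / (N/8) = 32, matching the "about 32x less" estimate above.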
nice. may use this for titan's battalion
I am a learner of RL. To my understanding, for Stratego the action output is a huge vector for each state; even worse, the length of this vector would virtually need to be variable. So if actions are spit out by a U-Net, it has to be a fixed-size vector processed via softmax. This makes the probability of invalid moves a very small non-zero value, which might still get picked by an ε-greedy-like process when exploring the state space. To avoid this problem, the authors may have done the discretization hack to only keep actions with larger probability. This is just a WILD GUESS; I need to read this interesting paper.
Thanks for your thoughts. 🤔 Shouldn't thresholding alone do the trick already? Why also discretize?
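On the invalid-move point raised above: a common workaround (illustrative sketch, not DeepNash's actual implementation) is to mask illegal actions by setting their logits to -inf before the softmax, so they get exactly zero probability instead of a tiny non-zero one:

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax over a fixed-size action vector where illegal actions
    are forced to exactly zero probability via -inf logits."""
    masked = np.where(legal_mask, logits, -np.inf)
    shifted = masked - masked.max()   # subtract max for numerical stability
    exp = np.exp(shifted)             # exp(-inf) == 0, so illegal moves vanish
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
legal = np.array([True, False, True, True])
print(masked_softmax(logits, legal))  # second entry is exactly 0
```

With masking, even an ε-greedy-style sampler can never draw an illegal move, since its probability is exactly zero rather than merely small.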
First of all: I am writing with the same voice you had while making this video. Get well soon.
Secondly: Why doesn't this year's NVIDIA conference do an RTX giveaway like in past years?
Thirdly: This is the most interesting topic; it made me study DL and change my field of work. I get very excited about every game solution. Poker and Go were very impressive. I remember a poker bot played 2 sessions against 10 top players, and most of them remarked on its "balanced game" and "unexpected bluffs". I totally get it: when the solution search space is not limited by the human brain and prior knowledge, gaming agents can come up with ideas that no human would. This is inspiring. I am waiting for the gamification of real problems, like energy-grid optimisation at a global scale (for example, for all of Europe), or working with researchers and policy makers on global warming / economic crisis mitigation. Leave games for fun.
1. Thanks! 🤧
2. They do RTX giveaways, but not with my small channel. Go to Yannic's or Louis's channel for fat RTX cards. ;)
3. I do not think you have to wait if you are talking about game-RL for real problems. DeepMind did matrix factorization, plasma stabilisation for fusion reactors, AlphaFold2, etc.
It's just that we still need to wait until this model-free RL method they presented for Stratego gets applied to something more interesting. :)
@@AICoffeeBreak Who is Louis? I only know you and Yannic, so Louis must be less popular.
@@harumambaru Then check out What's AI. :)
Get well soon, Ms Coffee Bean! 🤗
Thanks! 🤧
Could anyone suggest a good resource to read/learn about game theory? I would like to gain intuition for why these learning dynamics work.
👀
I still have no idea how the AI learned to use information well enough to bluff; you'd have to understand what your opponent would play, knowing you're chasing with an unrevealed piece near enemy territory. I'm also still confused as to how this is unexploitable. If red saw that blue (DeepNash) took the deep way in to chase the 8, he could have called the bluff, because there's no way blue would risk their marshal that easily; they should know it might be a spy trap. So arguably a smart human may have outplayed DeepNash, unless it actually was the marshal and DeepNash double-bluffed. In the end it's all mind games, and I don't know how DeepNash can maneuver through them, because scenarios like that one seem like a 50/50 if both players understand the risks.
But can it do even more complex turn-based strategy games like Civ?
🤫, or you'll wake up the DeepMind scientists and they'll target Civilization games next.
Now honestly, I expect it to be played by RL agents in the future, just give it a few years.
@@AICoffeeBreak Too bad DeepMind doesn't tend to release much; if they did, these kinds of games might finally end up with actually good AI.
Softmax can approximate any categorical distribution whose support matches the softmax's support... in other words, it is mathematically impossible for a softmax to output a probability of exactly 0 or 1 (as it uses exponentials), so it's fair to just discretize the outputs at deployment time (something similar is done in LLMs, called top-k sampling or beam search).
I understand that for 0 and 1, but why do it for 0.7168 too (clamp it to 0.7)? 🫠
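For comparison with the LLM technique mentioned above, here is a minimal sketch of top-k truncation (illustrative, not any specific library's implementation). Note the contrast with DeepNash's scheme: top-k zeroes out the tail and renormalizes, but does not round the surviving probabilities to a grid like 1/16:

```python
import numpy as np

def top_k_filter(probs, k=3):
    """Keep only the k most probable actions, zero out the rest,
    and renormalize, as in top-k sampling for LLM decoding."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]   # indices of the k largest weights
    out[top] = probs[top]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_k_filter(probs, k=3))   # tail mass redistributed over the top 3
```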
AI beats us at another game: STRATEGO
Next:
- Car moved faster than the fastest human ran.
- Crane lifted more weight than the strongest man.
...
We are lost. 😱
I don't think the interesting thing here is that the bot performs well at the game, but rather the methodology they used to overcome the challenges this specific board game poses, e.g. the incomplete information and the huge number of possible states.
🤣