AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning

  • Published: 21 Oct 2024

Comments • 74

  • @minhuang8848
    @minhuang8848 5 years ago +35

    "Stop publishing in nature"
    So fucking much, yes. Walling off academic literature is bad enough as it is, doing the same for cutting edge comp-sci research is just ridiculous. Great video!

  • @bbalban
    @bbalban 6 days ago +1

    Very nice description of the paper, thanks for creating this video.

  • @marijnstollenga1601
    @marijnstollenga1601 5 years ago +26

    Totally agree about nature!

  • @kyuhyoungchoi
    @kyuhyoungchoi 5 years ago +6

    Uploading reviews for 4 days in a row. Just amazing. You are the hero.

  • @kristoferkrus
    @kristoferkrus 5 years ago +15

    Nice video! 11:59 Smart to provide the value network with "hidden" knowledge (since it doesn't affect the actions anyway). Kind of like a human player watching a replay of the game he just played, without any fog of war (i.e. with full visibility), after it has finished, in order to learn the opponent's strategy and take advantage of it in the next game, I guess.
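
A minimal sketch of that idea, sometimes called an "asymmetric" actor-critic: only the value estimate sees unfogged information, because it is used purely as a training target. This is an illustration in PyTorch with invented dimensions, not AlphaStar's actual architecture.

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Policy acts from the fogged observation only; the value head may also
    see hidden (opponent) information, since values are only used as training
    targets and never influence which action is taken."""
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, n_actions))
        self.value = nn.Sequential(nn.Linear(obs_dim + hidden_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 1))

    def forward(self, obs, hidden_info):
        logits = self.policy(obs)                                  # fogged view only
        value = self.value(torch.cat([obs, hidden_info], dim=-1))  # full information
        return logits, value
```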

  • @filipgara3444
    @filipgara3444 4 years ago +4

    In just over 2 years, machine learning has evolved so much

  • @giovanniminelli5590
    @giovanniminelli5590 2 years ago

    Wonderful explanation of the network pipeline, thank you!!
    Just a note about the league training (as I understood it from the paper):
    Main exploiters play only against the main agents to find weaknesses (not against each other or the rest of the league).
    League exploiters play against past versions of everyone in the league; they get better and then, over time, they replicate, putting a frozen clone into the league that will eventually be an opponent of the main agents (they don't play directly against the mains). They also reinitialize themselves to the supervised baseline.
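
For readers who want that structure at a glance, here is a rough, hypothetical sketch of the matchmaking scheme in Python. The real league uses prioritized fictitious self-play weights rather than the uniform sampling shown here.

```python
import copy
import random

class League:
    def __init__(self, supervised_baseline):
        self.baseline = supervised_baseline
        self.mains = [copy.deepcopy(supervised_baseline)]
        self.past_players = [copy.deepcopy(supervised_baseline)]  # frozen snapshots

    def pick_opponent(self, role):
        if role == "main":
            # mains mix self-play with games against frozen league members
            return random.choice(self.mains + self.past_players)
        if role == "main_exploiter":
            # main exploiters only target the current main agents
            return random.choice(self.mains)
        if role == "league_exploiter":
            # league exploiters play past versions of everyone in the league
            return random.choice(self.past_players)
        raise ValueError(role)

    def snapshot(self, player, reset_to_baseline=False):
        # freeze a copy into the league; exploiters then restart
        # from the supervised baseline
        self.past_players.append(copy.deepcopy(player))
        return copy.deepcopy(self.baseline) if reset_to_baseline else player
```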

  • @PasseScience
    @PasseScience 5 years ago +3

    Thx a lot. What is a little unclear to me is what exactly is trained in the big architecture we see? I mean, for AlphaZero the net is trained for 2 things: the value head to predict the outcome closest to the actual outcome, and the policy head to predict the next-move distribution closest to what the search concludes as the next move. In the big diagram here at 10:25, what exactly is trained and for what?

    • @YannicKilcher
      @YannicKilcher  5 years ago +1

      The same things are trained: The value head to predict the outcome of a state and the policy head to predict the action. The difference here is that the policy head is split into multiple individual units (everything on the top, right of the value head). At first, this is trained to imitate humans, then it is trained using reinforcement learning.
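
As a toy illustration of "same things trained, different shape", one might write the heads and the two training stages like this (PyTorch, invented dimensions and loss weights; the real policy is autoregressive over action type, selected units, target, and so on):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Heads(nn.Module):
    def __init__(self, core_dim, n_action_types, n_targets):
        super().__init__()
        self.value_head = nn.Linear(core_dim, 1)          # predicts win/loss
        self.action_type = nn.Linear(core_dim, n_action_types)
        self.target = nn.Linear(core_dim, n_targets)      # one of several policy sub-heads

def imitation_loss(heads, core_out, human_action, human_target):
    # stage 1: supervised learning, match the human's recorded actions
    return (F.cross_entropy(heads.action_type(core_out), human_action)
            + F.cross_entropy(heads.target(core_out), human_target))

def rl_loss(heads, core_out, taken_action, advantage, game_return):
    # stage 2: actor-critic; policy gradient weighted by advantage,
    # value head regressed toward the observed return
    logp = F.log_softmax(heads.action_type(core_out), dim=-1)
    pg = -(logp.gather(-1, taken_action.unsqueeze(-1)).squeeze(-1) * advantage).mean()
    v = heads.value_head(core_out).squeeze(-1)
    return pg + F.mse_loss(v, game_return)
```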

    • @PasseScience
      @PasseScience 5 years ago

      @@YannicKilcher So I can just see it as one big neural network that is trained in a standard way as a whole, and that just has a complex internal topology in which we can identify subnets, but without separate sub-training targets? (Some other DeepMind work has subparts with their own training targets, for example an autoencoder that should compress the input and is trained for its own auto-encoding purpose completely independently of the whole thing, but here that's not the case? Just subparts of a big net trained together with the big net as a whole?) By the way, the AlphaStar engine is supposed to be stochastic; do you know what the sources of randomness are here?

  • @connor-shorten
    @connor-shorten 5 years ago +2

    I am also curious about how the LSTM layer is used in RL agents to encode memory. Could you instead modify the input features to account for the history of the actions (make it Markovian)? E.g. whether it has started a building outside the camera view?

    • @DaulphinKiller
      @DaulphinKiller 5 years ago

      I was wondering about the same thing. For the game of go, that is sort of what they did with their seven history planes right? However I think the problem is that in starcraft the correlation length between the actions to be taken and the history varies greatly so that you'd have to store practically all of the history with its many timesteps. This would make for huge input volumes, leading to models that are not only impractical but probably also impossible to train.
      So it makes sense then that they have to be smarter about it and dynamically encode such memory through LSTMs. The above is just a guess, and I'm happy to be corrected if wrong.

    • @YannicKilcher
      @YannicKilcher  5 years ago

      Yes, this is definitely a possibility and is done wherever possible. But usually, there's still some information that is too complicated or implicit to encode directly and that's where you want the model to learn what to remember. I guess this comes back to the old paradigm that deep learning is often a better feature engineer than a human 🤷‍♀️

    • @YannicKilcher
      @YannicKilcher  5 years ago +1

      I think in Go, the recent history is important because the legality of some moves depends on it, like some capture-recapture cycles. In Atari, the last 4 or so frames are included because the game engine sometimes flickers and doesn't show your avatar or some opponents. In both cases, this is enough to make the state fully Markovian. In StarCraft, you're absolutely right, the amount of history you'd need to include to achieve the same thing would be huge, so the solution is a mixture of explicit (as input) and implicit (lstm) history encoding.
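
As a schematic contrast (made-up sizes), the explicit route stacks the last k observations into the input, while the implicit route lets an LSTM decide what to carry forward in its hidden state:

```python
import torch
import torch.nn as nn

k, obs_dim, hid = 4, 32, 64

# Explicit history: Markovian input by construction, but input size grows linearly with k.
explicit_net = nn.Linear(k * obs_dim, hid)
stacked = torch.randn(1, k * obs_dim)          # last k observations concatenated
out = explicit_net(stacked)

# Implicit history: the LSTM learns what to remember in its hidden state.
lstm = nn.LSTM(obs_dim, hid, batch_first=True)
obs_seq = torch.randn(1, 100, obs_dim)         # a long episode
out_seq, (h, c) = lstm(obs_seq)                # h summarizes the whole history so far
```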

    • @kristoferkrus
      @kristoferkrus 5 years ago

      Why would it be interesting to make it Markovian? What would be the benefit of doing so? (On the other hand, isn't an LSTM already Markovian, i.e. the next internal state of the LSTM is independent of any history, given its current internal state? And if you conditioned it on former actions, wouldn't that effectively make it non-Markovian?)

  • @michael-nef
    @michael-nef 4 years ago +3

    From what I've heard, LSTMs can generally be replaced by 1D convnets. I'm not really sure why this is the case, but do you think it would be appropriate to replace the initial LSTM with a convnet in this case?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I have never heard of replacing LSTMs with 1d convnets. Do you have any reference to that?

    • @michael-nef
      @michael-nef 4 years ago

      @@YannicKilcher It's quite possible I'm wrong, but: towardsdatascience.com/how-to-use-convolutional-neural-networks-for-time-series-classification-56b1b0a07a57

    • @kristoferkrus
      @kristoferkrus 4 years ago +1

      I guess essentially WaveNet is a 1D convnet for generating sound, which people previously attempted to do with LSTMs, therefore for sound generation you can perhaps say that they replaced an LSTM with a 1D convnet?
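
That is roughly the idea: in a WaveNet-style stack, dilated causal 1D convolutions stand in for recurrence, and stacking dilations grows the receptive field exponentially. A minimal sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1D convolution that only looks at past timesteps (left padding)."""
    def __init__(self, ch_in, ch_out, kernel_size, dilation=1):
        super().__init__(ch_in, ch_out, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                       # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return super().forward(x)

# Each extra dilated layer doubles how far back the network can "see",
# which is how WaveNet covers long histories without recurrence.
net = nn.Sequential(CausalConv1d(16, 32, 3, dilation=1), nn.ReLU(),
                    CausalConv1d(32, 32, 3, dilation=2), nn.ReLU(),
                    CausalConv1d(32, 32, 3, dilation=4))
y = net(torch.randn(8, 16, 100))                # (batch=8, 16 features, 100 timesteps)
```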

  • @jk-ml5fb
    @jk-ml5fb 8 days ago +1

    What kind of hardware are they running the trained AI on?

  • @spenceraidukaitis2031
    @spenceraidukaitis2031 4 years ago +2

    Hey Yannic! I know it's been a while since you posted this video, but you explain things very well. Thank you so much for making these. One question: for this study, what do you see the real-world opportunities being for this kind of learning? It seems like chaining many networks to do a more general task would apply to the real world in many different ways. What are your thoughts?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      The problem is that you need a lot of data, which you usually only get from a simulator, so it's a bit unclear how to get to the real world using that.

    • @avidrucker
      @avidrucker 3 years ago

      @@YannicKilcher are there any new developments on this front? Such as using ML to create class tutors? (Physics, comp sci, spoken languages, etc.)

  • @sirusThu
    @sirusThu 5 years ago +3

    Great tutorial. Will they have a problem with vanishing gradients if they play for a long time, because they are using an LSTM? Or does it probably just remember the last few steps that count?

    • @YannicKilcher
      @YannicKilcher  5 years ago +4

      The long time-horizon is overcome by the actor-critic framework: The reward is densified using the value function. In addition, the agent sometimes gets an auxiliary reward for following a pre-specified build order.
      So the LSTM isn't really meant to remember things for that long - in fact, it is usually impossible to do backprop through more than a handful of steps, so it's more geared towards putting the current decision in context of what happened recently (the last few seconds).
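
Schematically, "densifying" with the value function means bootstrapping targets from the critic a few steps ahead instead of waiting for the final win/loss signal. A generic n-step sketch, not AlphaStar's actual estimator (which uses V-trace and UPGO):

```python
import torch

def n_step_advantages(rewards, values, gamma=0.99, n=5):
    """Generic n-step TD advantages: instead of waiting for the terminal
    win/loss, bootstrap from the value prediction n steps ahead."""
    T = len(rewards)
    advantages = torch.zeros(T)
    for t in range(T):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:                              # bootstrap with the critic's estimate
            g += gamma ** (end - t) * values[end]
        advantages[t] = g - values[t]
    return advantages

# usage: per-step rewards are mostly zero, plus occasional build-order bonuses
rewards = torch.zeros(10); rewards[-1] = 1.0     # win at the very end
values = torch.linspace(0.1, 0.9, 10)            # critic's predictions along the way
adv = n_step_advantages(rewards, values)
```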

    • @sirusThu
      @sirusThu 5 years ago

      @@YannicKilcher Thank you very much for the explanation, it is clear now :). Great channel btw :)

    • @DaulphinKiller
      @DaulphinKiller 5 years ago

      @@YannicKilcher If so, then could this also be achieved by replacing LSTMs by history planes in the input volume, like they did in the game of go (where the seven last time steps were stored)?

    • @kristoferkrus
      @kristoferkrus 5 years ago

      @@YannicKilcher Do you mean it's impossible because of memory constraints of the GPU? I thought that in theory that could be remedied by using synthetic gradients or memory-efficient backpropagation through time, but maybe it's still difficult in practice for some reason?

    • @andrewferguson6901
      @andrewferguson6901 5 years ago +2

      I don't know much about AI, but I'm pretty good at StarCraft, and from watching some of the games the AI definitely struggles to remember things that it should already know. Some players were able to confuse the AI by hiding units in the fog of war, then revealing them, then hiding them again, and in some games the AI would send flying units over anti-air structures repeatedly, even though it should know that those structures won't move and will kill its air units every time.

  • @JavierBoncompte
    @JavierBoncompte 2 years ago

    Thank you, very good video!

  • @connor-shorten
    @connor-shorten 5 years ago +4

    Thanks for this!!

  • @linanthony7892
    @linanthony7892 5 years ago +1

    Thank you for explaining !

  • @Luck_x_Luck
    @Luck_x_Luck 4 years ago +10

    how to make it artificial?
    step 1: give it a hat
    step 2: ???
    step 3: profit

  • @greyreynyn
    @greyreynyn 5 years ago +6

    Thanks for this, but I wouldn't mind a longer more in-depth explanation. Seemed like some spots were just scratching the surface of concepts. But thank you, still a great primer.

  • @skyacaniadev2229
    @skyacaniadev2229 5 years ago +1

    Great video, thanks!

  • @OpreanMircea
    @OpreanMircea 5 years ago +2

    2:45 lol, yeah, helicopters, I know where this guy is coming from

  • @benlowe8111
    @benlowe8111 4 years ago +1

    Great overview!

  • @ClueMan5000
    @ClueMan5000 5 years ago +3

    Selecting units outside of vision is to compensate for the fact that the AlphaStar AI doesn't use hotkeys.

    • @Jaime_Protein_Cannister
      @Jaime_Protein_Cannister 3 years ago

      Lol, "doesn't use hotkeys": instead it is able to select anything instantaneously and knows everything about friendly units and buildings instantaneously, yes, "compensation".
      For instance, when unit production ends, the program is informed about it instantly, so it does not need to check up on production the way a human does.
      Humans deal with mouse input; the program can select any pattern of units in a single action after a 200 ms delay.
      Even at a 200 ms delay, it's incomparable to a human. Watching the state of the game is mostly a single-threaded activity. We have the concept of a "mental checklist" or "action anchoring":
      players monitor the game in a sort of loop, Supply -> Production -> Upgrades -> Minimap -> Creep, etc., whereas the program has all of the information at all times.
      Anchoring one action to the next to collect information in a timely fashion takes up the majority of a player's focus and APM. Noticing something within 200 ms rivals godlike play for a human.
      This has never been a "fair" exercise. A player doesn't get to practice a thousand years' worth of games and doesn't have pixel-perfect mind-control over the game.
      This is nothing but a proof of concept that a machine can learn some complicated tasks.

  • @shanestrunk8140
    @shanestrunk8140 5 years ago +2

    Nice to see an LSTM being used in an RL application. Markovian approaches seem obviously flawed for learning complicated tasks without excessive feature engineering.

    • @YannicKilcher
      @YannicKilcher  5 years ago +2

      Yes, LSTMs are common in these types of RL applications. See for example papers on DeepMind Lab environments or similar things.

    • @kristoferkrus
      @kristoferkrus 5 years ago

      Why do you think so about Markovian approaches (I assume you count LSTMs as a kind of Markov chain)?

    • @jonathanballoch
      @jonathanballoch 3 years ago +1

      RNNs are pretty common in RL

  • @vralev
    @vralev 5 years ago +3

    One part that is missing here is the pseudo-rewards they talked about, which seems to have had a very big role. They don't give any examples, but pseudo-rewards are human-generated with intent to guide agents towards common sense or micro-tactics. I personally wonder how the agents learned to do blink micro so well when they had trouble in much easier tasks. You can easily hand-code some blink micro, but it will be very hard for a neural net to come up with it. I want to know if this is a result of pseudo-reward such as limiting the available units to blink stalkers and pre-setting mandatory blink actions per time. That would explain a lot.

    • @skyacaniadev2229
      @skyacaniadev2229 5 years ago

      There should be an estimator to predict the win rate, and I believe that was the reward system AlphaStar was using.

    • @YannicKilcher
      @YannicKilcher  5 years ago

      I think actions like blink-micro could be picked up in the first stage, when the agent tries to imitate humans. Of course, you're right, there are pseudo-rewards given, especially when the agent sticks to some pre-defined build order (though that's not always the case). It looks like this is - above all - a giant engineering effort.

  • @mickmickymick6927
    @mickmickymick6927 3 years ago +1

    Re: the bot seeing units off-camera. Humans can also remember units off-camera, would the agent forget the off-camera units if it couldn't see them?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Depends on how it's implemented. It can learn to remember them.

    • @mickmickymick6927
      @mickmickymick6927 3 years ago

      @@YannicKilcher Yes, but you were saying that the bot can see units while the units are off-camera. I meant that perhaps this is because the bot would otherwise completely forget about those units' existence, unlike a human, who would remember that they were there even though they couldn't see them.

  • @robosergTV
    @robosergTV 5 years ago +1

    thanks!

  • @florin.lupascu
    @florin.lupascu 5 years ago +10

    Play it at x1.5 speed.

  • @JurekOK
    @JurekOK 4 years ago +1

    TBH, this looks like classic engineering with a little bit of deep learning thrown in. If league training is their main contribution, then the science here is meh.
    This work has also influenced how pro StarCraft players play: these days it is common not to strive for a perfect eco build, but rather to oversaturate in anticipation of losses to harassment in the early and mid-game. This is something that AlphaStar introduced, as it was previously considered bad play. Wow. Hence, this was a fantastic demonstration of modern technology and where we are heading: AI better than most humans at most things that were until now considered "cognitive".

    • @WhitePum4
      @WhitePum4 2 years ago

      This is just straight-up inefficient, though. It is literally more cost-effective to just prepare the minimal defense required to fend off the harassment that would be killing off the workers in the first place, and this can even be timed pretty accurately by scouting the enemy at certain intervals, based on how long the tech you are scouting for takes to build, and checking the gas to see how much has been mined. The work they did here is like a baby version of the AI that could actually exist. I can't wait until they learn to use logic concepts as part of their learning algorithm to determine which move is best and what yields the best results. I personally see the professionals using fairly aggressive strategies and threats, when this game has a strong defender's-advantage element to it. Even the AI doesn't know how to utilise it correctly.

  • @youransun9198
    @youransun9198 2 years ago +1

    "stop publishing in Nature"

  • @skyacaniadev2229
    @skyacaniadev2229 5 years ago

    35:00, so Zealots are just for humans...

    • @YannicKilcher
      @YannicKilcher  5 years ago

      On the contrary, it seems the AI builds them quite a bit compared to the other units.

  • @mns4183
    @mns4183 4 years ago

    Did you pay for it? But nice info. Remnants of the last century. Rant over.

  • @a1xon
    @a1xon 5 years ago +1

    Squeezing watchtime to the max

  • @dannygjk
    @dannygjk 5 years ago +1

    "Can't print it, can't download it." You can screen shot it. Problem solved.

  • @Go6etoBiggerThanLife
    @Go6etoBiggerThanLife 4 years ago

    Get to the point quicker! Thanks 😃

  • @DanBurgaud
    @DanBurgaud 5 years ago +1

    AlphaStar was limited to just 200~300 APM... IMO that is too restrictive...
    THIS is supposed to be an AI that can do things faster than humans, so LET IT!

    • @adamyong6766
      @adamyong6766 5 years ago +2

      I think the goal is for the AI to develop a strategy rather than brute-forcing its way to victory. Pro players have high APM but low EPM. AlphaStar has the same APM as EPM, so in reality it's not slow.

    • @DanBurgaud
      @DanBurgaud 5 years ago

      @@adamyong6766 Using that same logic, perhaps self-driving AI cars should be limited to as few actions per minute as humans, so they can drive safely?

    • @maximkazhenkov11
      @maximkazhenkov11 5 years ago +4

      @@DanBurgaud Self driving cars are end products, so getting the AI to drive well by all means is the end goal. AlphaStar is just a research tool, and the researchers who built it aren't interested in AIs that crush human players using godlike micro.

    • @YannicKilcher
      @YannicKilcher  5 years ago +3

      That's a valid criticism, but I think DeepMind does this to make it more interesting. Given almost infinite APM, the AI could just out-micro any opponent. Because of that, it would also not need to develop particularly interesting strategic knowledge, which is a large part of what we're ultimately interested in.

    • @dannygjk
      @dannygjk 5 years ago +1

      @@DanBurgaud Seriously?! Self driving cars? Have you forgotten about safety issues?