ERRATA (from the authors):
- KL balancing (prior vs posterior within the KL) is different from beta VAEs (reconstruction vs KL)
- The vectors of categoricals can in theory represent 32^32 different images so their capacity is quite large
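A quick sanity check of that capacity claim (32 independent categorical variables with 32 classes each):

```python
# 32 independent categorical latent variables, each choosing one of
# 32 classes, give 32**32 distinct discrete latent codes in total.
num_vars, num_classes = 32, 32
capacity = num_classes ** num_vars
print(capacity)  # 32**32 == 2**160, roughly 1.46e48 distinct codes
```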
Why is the "straight-through estimator" biased?
Is your channel becoming the open review platform you wished for? 👀🚀
I knew you were quick, but damn, this is faster than FedEx Express in favorable weather conditions
It's almost like Hafner et al watched your video and built v3 to rectify your criticisms. Transferability to other problems - check, less hyperparameters - check, more generalizable loss function - check. Would really love to see a video like this going over v3. Been having a hell of a time wrapping my head around it, but this video is still helping a ton. Thanks Yannic!!
took me a few attempts to go through this one but hot digiddy, this model building is a truly interesting approach and might get us closer to complex behavior
"oh dream" at 31:20 killed me ayo was not expecting that. you can hear the smirk in his voice
"and it's also completely not cheated" lmaooooooooooooo
This is so cool!!!
I was just learning about the original Dreamer 2 weeks ago, and seeing now that v2 is really a step forward. Guess I'll forget the older one and try to implement this.
Thanks!
Exactly what I was looking for keep up the good work
Very exciting work where we are learning the dynamical laws of a stochastic world.
Yannic I fucking appreciate you so much. Thank you for supporting the whole AI community in such a cool way. People like you make this movement possible!
In games like pinball, perhaps it would help to look not only at the frames themselves, but also the difference with the previous frame? It would be pretty much zero everywhere except for the ball, the flippers and the score at the top.
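A toy sketch of that frame-differencing idea (synthetic grayscale frames as NumPy arrays, a hypothetical 2x2 "ball" being the only moving object):

```python
import numpy as np

# Two synthetic 64x64 grayscale frames: static background,
# a small 2x2 "ball" that moves between frames.
prev = np.zeros((64, 64), dtype=np.float32)
curr = np.zeros((64, 64), dtype=np.float32)
prev[10:12, 10:12] = 1.0   # ball at old position
curr[10:12, 12:14] = 1.0   # ball at new position

# The difference image is zero everywhere except where the ball
# moved: 4 pixels vacated plus 4 pixels newly occupied.
diff = curr - prev
moving_pixels = np.count_nonzero(diff)
print(moving_pixels, diff.size)  # 8 nonzero pixels out of 4096
```

Feeding such a sparse difference channel alongside the raw frame could make the moving objects far easier to pick out.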
That stop_grad description of a straight-through estimator is such a pytorch way to look at it. In a different framework, you might just assign a custom gradient to whatever function you're calling without having to add and subtract the same thing.
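The add-and-subtract trick being described can be sketched in plain NumPy. Here `stop_grad` is just a placeholder identity function; under a real autodiff framework it would block the gradient, so the backward pass sees only the soft probabilities while the forward pass yields the hard sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)

# Forward: draw a hard one-hot sample from the categorical.
idx = rng.choice(len(probs), p=probs)
hard = np.eye(len(probs))[idx]

# Straight-through: hard + probs - stop_grad(probs).
# In the forward pass the two probs terms cancel, so the value is
# exactly the hard sample; under autodiff, gradients would flow
# through `probs` only, since the stop_grad copy contributes none.
stop_grad = lambda x: x.copy()   # placeholder: identity in forward
st_sample = hard + probs - stop_grad(probs)

print(st_sample)  # numerically equal to the hard one-hot sample
```

The estimator is biased precisely because the backward pass pretends the sampling step was the identity on `probs`, which it is not.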
Another great video!! Thank you.
I think this is very applicable beyond Atari; think self-driving or robotic control. Sure, those real-world systems don't have such a clear notion of a reward to train on as a game, and most self-driving data is sparser on collisions than you'd like. But I think training a random agent in a simulator with plenty of info about crashes and driving into ditches, and signals like perhaps 'perplexity toward your behavior' from other agents in the simulator, would allow for some pretty good pretrained agents that could be fine-tuned on real-world data. The general idea of 'predicting the next token in a sequence of hidden states' is a very powerful one even if they didn't invent it, and the details of how categorical variables help with that are a pretty cool novelty I think (why not both, though, I wonder, given that we know both categorical and continuous variables can be relevant models of reality).
And yeah, I'm not entirely sold on the image reconstruction loss either; something GAN-like seems more appropriate, more inclined to focus on semantic accuracy in reconstruction rather than pixel-level detail. But I suppose getting stuff to actually work is hard enough as is without throwing GANs into the mix. Also agreed that the hyperparameters look a little daunting...
It seems that in recent years some model-based learning algorithms have been used in simulators such as CARLA, and I agree about the hyperparameters, there really are too many!
How do you mean GAN-like? If you used a GAN, wouldn't you need to optimise over the latent space of the GAN for each observation to find the closest z that fits it? Or do you mean something else?
12:50 Sorry Jurgen XD. Anywhere there is an RNN, two papers down the line there will be a transformer.
.
22:12 In one of my personal experiments, I spent hours searching the internet for nice ways to represent multimodal Gaussians. Eventually I settled on categorical. I am happy to see that I was on the right path.
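A toy illustration of why a categorical over bins can capture multimodality where a single Gaussian cannot (bin positions and probabilities here are made up for the example):

```python
import numpy as np

# Discretize a 1-D value into 32 bins and put mass on two
# well-separated modes.
bin_centers = np.linspace(-3.0, 3.0, 32)
probs = np.zeros(32)
probs[4] = 0.5    # mode near -2.2
probs[27] = 0.5   # mode near +2.2

# The distribution's mean lands between the modes, in a region the
# categorical assigns zero probability. A single Gaussian fit would
# center its mass exactly there; the categorical keeps both modes.
mean = (probs * bin_centers).sum()
print(mean)
```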
.
27:15 Are we using the concatenated vector to predict z-hat, or just the last sampled discretized vector?
.
31:25 Pet mouse?
.
44:15 Oh yes, the classic stop-gradient trick. I found more luck using @tf.custom_gradient to define exactly how I wanted the backward pass to look. It gives a bit more flexibility in certain scenarios.
.
54:23 Another thing about the video pinball is that the game is just badly designed in the first place. The correlation between user input and game score is very weak. There is a lot of random chance and times where the game just plays by itself.
How can you review Dreamer if you never sleep to make these videos? 🤔
Didn’t know you were a Minecraft speed-run fan ;)
Really nice explanation, thanks! One other reason it does so badly on "Video Pinball" is that it's a game with almost no interaction. For example, in the video you show, how many times did the paddles touch the ball? Three? The rest was all done "automatically" by the environment, with actions being taken during that time without any effect on the result (which the agent could misinterpret).
Would be nice to see whether the world model learnt in that case is bad (because it's difficult to learn) or whether it's "good" but hard for the RL algorithm to make use of.
Hi! Loving your reviews, really learning a lot. One question on this model: any idea why they wouldn't use this architecture to perform some sort of MuZero-style planning, given it already has the ability to predict next states?
I was thinking that the general framework of a generic categorical state space and stochastic process generalizes extremely cleanly. The reinforcement learning is much cleaner in the simple state space. If you took the framework to chess or a robotics application, I think you'd get some weird, cool results.
@Robert w I think the point is to compress the state space and solve an efficient simplification of the game. Real life applications tend to have huge dimensionality issues which make "solving the game" intractable.
The solution of simplifying the world and solving the simplification seems like a relatively cheap and flexible solution.
@Robert w @GreenManorite
It seems like another trade-off situation, and it also seems like most settings with interesting or difficult dynamics to learn can be approached with “good enough” representations. Oversimplifying only matters when it becomes limiting, but my sense is robotics has the opposite problem, overwhelming problem space and slow learning because IRL. If you don’t have 4B years of evolution of skeletomuscularneuro, what ya gonna do?
@@oncedidactic Robotics comes to mind because "dreaming" (training on the simplification) moves a big chunk of learning out of real life. Transfer learning for related tasks in the same environment should be feasible.
Very detailed and interesting, thanks!
How does the vector of categoricals differ from vector quantization?
Isn't 8:00 a Markov chain, given a hidden Markov model? I was confused when he said non-Markovian; I thought that qualified.
RNNs are Markovian
LSTMs are not Markov. They're not strictly dependent on only the immediately previous state.
Good explanation bro, thanks for the video
Anyone know how you can backprop through the sampling? Edit: You just backprop both with and without sampling
How does it deal with RND/changes? Would it need multiple prediction trees and a method to switch between them (if using in a functional/application setting)?
Ah... hmmm... it seems it is trying to do this?
In the paper, why was the imagination horizon H = 15 chosen?
Does "one" world model cover all 57 games, or do we need 57 world models, each covering one Atari game?
Brilliant stuff :)
Great video. We can't be that sure, can we, about how well the agent fares across scenarios? It could be that pinball dynamics are harder to learn, which is why Dreamer did poorly, rather than any volatility in pixel arrangements compared to other scenarios. Probably can't project much from the fact that the latent is discrete either... After all, it is the latent (of a CNN?). For example, chess is highly susceptible to changes in notation (or pixel arrangements etc.) affecting the evaluation and outcome. I don't see why Dreamer can't learn chess.
Can reinforcement learning be applied to single-object tracking?
Great Explanation !!!
Why is the "straight-through estimator" biased?
Is this the same idea behind MuZero? What's the difference?
The dream is over, and the nightmare begins ...
I'd like to see it on env's like procgen. I wish there was more footage of it.
Thank you.
This guy is a good comedian
nice video!
It's so frustrating to run this code. After running the Pong program on a 1650 Ti (4 GB video card) for 500,000 steps, the reward was still -19 to -21. Upon closer inspection, the original ran 200 million steps, and it seems to change noticeably only after a few million steps. Anyone have the same results as me?
Then I tried a P100; the model still doesn't seem to converge. And do you know how to switch from Pong to other Atari games? Are the hyperparameters the same as for Pong? Thanks.
Yes, me too.
I would imagine the ball position and velocity in video pinball, which is the only thing that really matters, would be pretty hard to project far into the future due to the chaotic nature of its motion. That would be my guess as to why this model fails in this game.
This guy is amazing
31:16 1 IN 7 TRILLION CHANCES FOR YOU TO GET THIS JOKE
??
Next video be like: Minecraft speedrunner vs assassin AI
underrated comment
@@PaganPegasus Thanks for reminding me of it
Damn this heavy
Yikes. Googled Model-Free vs Model-Based, then resumed at 2:22. FML^^
LOL ive totally done that before
Think of this as if it were us, though. The meta. We experience the world how we do because we learnt to experience it this way
This does not look like actor-critic; the actor loss is REINFORCE. An actor in actor-critic architectures follows the gradient of the critic.
perfect thank you
When x = z, r_0..r_{n-1} = 0, r_n = 1, then it is AlphaGo. Does that mean AlphaGo is Dreamer Go?
Well, not quite AlphaGo, since it's missing the element of tree search. In fact, I'd say the learned world model is about the only similarity?
Apply this to self driving cars - BOOM!
@Robert w Why nonsense?
BOOM! because it will crash and explode
@@softerseltzer not if every car is connected, trained first in simulation, then applied. But that is not likely.
@@CandidDate You can try and implement it in CARLA.
@@softerseltzer CARLA = 100% electric self-driving CARs in LA? There's a city in China that has 100% self-driving, so it is possible. But Americans love their FREEDOM. And what about motorcycles? ha
smh, yannic talking dirty about my boy the GRU
Correction: this is actually called planning
No, it is Model-Based Reinforcement Learning, which combines planning and learning in a varied number of ways depending on the architecture
@@willrazen what is it called when you use planning?
@@graham8316 The planning part is called planning, but OP implied everything was just planning, which is incorrect
@@willrazen thanks!
Hello wonderful person
Hi there
Dear fellow scholars
Top of the morning to ya laddies
Which one is ya favorite
Who is the last one?
Hi there of course
That first "hello" is reserved for another kind of channel. :D
hii 😄
lol this was TOO quick