Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)

  • Published: 21 Oct 2024

Comments • 90

  • @YannicKilcher
    @YannicKilcher  3 years ago +19

    ERRATA (from the authors):
    - KL balancing (prior vs posterior within the KL) is different from beta VAEs (reconstruction vs KL)
    - The vectors of categoricals can in theory represent 32^32 different images so their capacity is quite large
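
    Illustration (not the authors' code): a minimal PyTorch sketch of KL balancing as described in the errata, mixing two stop-gradient copies of the same KL so the prior is pulled toward the posterior faster than the posterior is regularized toward the prior. The function name, logit shapes, and the alpha=0.8 mixing weight are assumptions.

    ```python
    import torch
    import torch.distributions as td

    def balanced_kl(post_logits, prior_logits, alpha=0.8):
        # One KL trains the prior toward a frozen posterior; the other
        # regularizes the posterior toward a frozen prior. The balancing
        # happens *inside* the KL term, unlike beta-VAEs, which reweight
        # reconstruction vs. KL.
        post = td.Categorical(logits=post_logits)
        prior = td.Categorical(logits=prior_logits)
        post_sg = td.Categorical(logits=post_logits.detach())
        prior_sg = td.Categorical(logits=prior_logits.detach())
        return (alpha * td.kl_divergence(post_sg, prior)
                + (1 - alpha) * td.kl_divergence(post, prior_sg))
    ```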

    • @martinmoder5900
      @martinmoder5900 3 years ago

      Why is the "straight-through estimator" biased?

    • @norik1616
      @norik1616 3 years ago +3

      Is your channel becoming the open review platform you wished for? 👀🚀

  • @morkovija
    @morkovija 3 years ago +36

    I knew you were quick, but damn, this is faster than FedEx Express in favorable weather conditions

  • @zramsey11
    @zramsey11 1 month ago

    It's almost like Hafner et al. watched your video and built v3 to rectify your criticisms. Transferability to other problems - check, fewer hyperparameters - check, more generalizable loss function - check. Would really love to see a video like this going over v3. Been having a hell of a time wrapping my head around it, but this video is still helping a ton. Thanks Yannic!!

  • @morkovija
    @morkovija 3 years ago +12

    Took me a few attempts to get through this one, but hot diggity, this model-building is a truly interesting approach and might get us closer to complex behavior

  • @yeokc8757
    @yeokc8757 1 year ago +1

    "oh dream" at 31:20 killed me ayo was not expecting that. you can hear the smirk in his voice

    • @yeokc8757
      @yeokc8757 1 year ago +1

      "and it's also completely not cheated" lmaooooooooooooo

  • @voaneves
    @voaneves 3 years ago +1

    This is so cool!!!
    I was just learning about the original Dreamer 2 weeks ago, and seeing the v2 now, it's really a step forward. Guess I'll forget the older one and try to implement this.
    Thanks!

  • @josephgeorge8054
    @josephgeorge8054 3 years ago +3

    Exactly what I was looking for, keep up the good work!

  • @rachittibrewal7202
    @rachittibrewal7202 3 years ago +1

    Very exciting work where we are learning the dynamical laws of a stochastic world.

  • @itsfabiolous
    @itsfabiolous 2 years ago

    Yannic I fucking appreciate you so much. Thank you for supporting the whole AI community in such a cool way. People like you make this movement possible!

  • @EctoMorpheus
    @EctoMorpheus 3 years ago +4

    In games like pinball, perhaps it would help to look not only at the frames themselves, but also at the difference from the previous frame? It would be pretty much zero everywhere except for the ball, the flippers, and the score at the top.
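
    (Illustration, not from the video: a minimal sketch of that frame-difference input; the function name and shapes are hypothetical.)

    ```python
    import torch

    def frame_delta(frames):
        # frames: (T, C, H, W) stack of consecutive observations.
        # The difference is near zero everywhere except the moving parts
        # (ball, flippers, score counter) the comment mentions.
        return frames[1:] - frames[:-1]
    ```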

  • @EctoMorpheus
    @EctoMorpheus 3 years ago

    That stop_grad description of a straight-through estimator is such a PyTorch way to look at it. In a different framework, you might just assign a custom gradient to whatever function you're calling, without having to add and subtract the same thing.
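
    (Illustration: the add-and-subtract stop-gradient trick as a minimal PyTorch sketch. The forward pass sees the hard one-hot sample, while gradients flow through the softmax probabilities as if the sampling were identity; function name and shapes are assumptions.)

    ```python
    import torch
    import torch.nn.functional as F

    def straight_through_sample(logits):
        # Sample a hard one-hot vector, then add and subtract the soft
        # probabilities so backprop treats the sampling step as identity
        # on the probabilities (biased, but low-variance).
        probs = F.softmax(logits, dim=-1)
        index = torch.distributions.Categorical(probs=probs).sample()
        hard = F.one_hot(index, num_classes=logits.shape[-1]).float()
        return hard + probs - probs.detach()
    ```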

  • @krishnachaitanyadon
    @krishnachaitanyadon 3 years ago +1

    Another great video!! Thank you.

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 3 years ago +3

    I think this is very applicable beyond Atari; think self-driving or robotic control. Sure, those real-world systems don't have such a clear notion of a reward to train on as a game, and most self-driving data is sparser on collisions than you'd like. But I think training a random agent in a simulator with plenty of info about crashes and driving into ditches, and signals like perhaps 'perplexity to your behavior' from other agents in the simulator, would allow for some pretty good pretrained agents that could be finetuned on real-world data. The general idea of 'predicting the next token in a sequence of hidden states' is a very powerful one even if they didn't invent it, and the details of how categorical variables help with that is a pretty cool novelty I think (why not both, though, I wonder, given that we know both categorical and continuous can be relevant real-world models of reality).
    And yeah, I'm not entirely down with the image reconstruction loss either; something GAN-like seems more appropriate, being more inclined to focus on semantic accuracy in reconstruction rather than pixel-level detail. But I suppose getting stuff to actually work is hard enough as is without throwing GANs into the mix. Also agreed that the hyperparameters look a little daunting...

    • @shyboy523
      @shyboy523 3 years ago

      It seems that in recent years some model-based learning algorithms have been used in simulators such as CARLA. And I agree about the hyperparameters, there really are too many!

    • @ハェフィシェフ
      @ハェフィシェフ 2 years ago

      How do you mean GAN-like? If you used a GAN, wouldn't you need to optimise over the GAN's latent space for each observation to find the closest z that fits it? Or do you mean something else?

  • @herp_derpingson
    @herp_derpingson 3 years ago +1

    12:50 Sorry Jürgen XD. Anywhere there is an RNN, two papers down the line there will be a transformer.

    22:12 In one of my personal experiments, I spent hours searching the internet for nice ways to represent multimodal Gaussians. Eventually I settled on categorical. I am happy to see that I was on the right path.

    27:15 Are we using the concatenated vector to predict ẑ, or just the last sampled discretized vector?

    31:25 Pet mouse?

    44:15 Oh yes, the classic stop-gradient trick. I found more luck using @tf.custom_gradient to define exactly how I wanted the backward pass to look (see the sketch below). It gives a bit more flexibility in certain scenarios.

    54:23 Another thing about Video Pinball is that the game is just badly designed in the first place. The correlation between user input and game score is very weak. There is a lot of random chance, and times when the game just plays by itself.
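
    (Illustration of the @tf.custom_gradient approach mentioned at 44:15: a hypothetical straight-through round, not code from the paper.)

    ```python
    import tensorflow as tf

    @tf.custom_gradient
    def st_round(x):
        # Forward: hard rounding. Backward: pass the gradient through
        # unchanged, with no add-and-subtract of a stop_gradient term.
        def grad(dy):
            return dy
        return tf.round(x), grad
    ```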

  • @mgostIH
    @mgostIH 3 years ago +14

    How can you review Dreamer if you never sleep to make these videos? 🤔

  • @andres_pq
    @andres_pq 3 years ago +3

    Didn’t know you were a Minecraft speed-run fan ;)

  • @JackSPk
    @JackSPk 3 years ago +3

    Really nice explanation, thanks! One other reason why it does so badly on that "Video Pinball" is that it's a game with almost no interaction. For example, in the video you show, how many times did the "paddles" touch the ball? 3? The rest was all done "automatically" by the environment, with actions also being taken in that time without any change in the result (which the agent could misinterpret).

    • @JackSPk
      @JackSPk 3 years ago

      Would be nice to see if the world model learnt in that case is bad (because it's difficult to learn) or if it's "good" but hard for the RL algorithm to make use of.

  • @bensimonjoules4402
    @bensimonjoules4402 1 year ago +1

    Hi! Loving your reviews, really learning a lot. One question on this model: any idea why they wouldn't use this architecture to perform some sort of MuZero-style planning, given it already has the ability to predict next states?

  • @GreenManorite
    @GreenManorite 3 years ago +2

    I was thinking that the general framework of a generic categorical state space and stochastic process applies extremely cleanly. The reinforcement learning is much cleaner in the simple state space. If you took the framework to chess and a robotics application, I think you'd get some weird, cool results.

    • @GreenManorite
      @GreenManorite 3 years ago +1

      @Robert w I think the point is to compress the state space and solve an efficient simplification of the game. Real life applications tend to have huge dimensionality issues which make "solving the game" intractable.
      The solution of simplifying the world and solving the simplification seems like a relatively cheap and flexible solution.

    • @oncedidactic
      @oncedidactic 3 years ago +1

      @Robert w @GreenManorite
      It seems like another trade-off situation, and it also seems like most settings with interesting or difficult dynamics to learn can be approached with “good enough” representations. Oversimplifying only matters when it becomes limiting, but my sense is robotics has the opposite problem, overwhelming problem space and slow learning because IRL. If you don’t have 4B years of evolution of skeletomuscularneuro, what ya gonna do?

    • @GreenManorite
      @GreenManorite 3 years ago

      @@oncedidactic Robotics comes to mind because "dreaming" (training on the simplification) moves a big chunk of learning out of real life. Transfer learning for related tasks in the same environment should be feasible.

  • @codenie749
    @codenie749 3 years ago

    Very detailed and interesting, thanks!

  • @rayrayray-l5l
    @rayrayray-l5l 3 years ago +1

    How do the vector categoricals differ from vector quantization?

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 3 years ago +3

    Isn't 8:00 a Markov chain given a hidden Markov model? I was confused when he said non-Markovian; I thought that qualified.

  • @BeerGecko
    @BeerGecko 3 years ago

    Good explanation bro, thanks for the video

  • @graham8316
    @graham8316 1 year ago

    Anyone know how you can backprop through the sampling? Edit: You just backprop both with and without sampling

  • @TechyBen
    @TechyBen 3 years ago

    How does it deal with RND/changes? Would it need multiple prediction trees and a method to switch between them (if using in a functional/application setting)?
    Ah... hmmm... it seems it is trying to do this?

  • @alexanderyau6347
    @alexanderyau6347 2 years ago

    In the paper, why was the imagination horizon H = 15 chosen?

  • @jeffreylim5920
    @jeffreylim5920 2 years ago

    Does "One" world model covers all 57 games or we need 57 world models each covering one atari game?

  • @jaakjpn
    @jaakjpn 3 years ago +2

    Brilliant stuff :)

  • @XecutionStyle
    @XecutionStyle 3 years ago

    Great video. We can't be that sure, can we, about how well the agent fares across scenarios? It could be that pinball dynamics are simply harder to learn, which is why Dreamer did poorly, rather than any volatility in pixel arrangements compared to other scenarios. We probably can't project much from the fact that the latent is discrete either... after all, it is the latent (of a CNN?). For example, chess is highly susceptible to a change in notation (or pixel arrangements etc.) affecting the evaluation and outcome. I don't see why Dreamer can't learn chess.

  • @amirhamidi7013
    @amirhamidi7013 3 years ago

    Can reinforcement learning be applied to single-object tracking?

  • @xinghaozong8977
    @xinghaozong8977 3 years ago

    Great Explanation !!!

  • @kidscheung9226
    @kidscheung9226 3 years ago

    Is this the same idea behind MuZero? What's the difference?

  • @dawwdd
    @dawwdd 3 years ago +7

    The dream is over, and the nightmare begins ...

  • @Daniel-ih4zh
    @Daniel-ih4zh 3 years ago

    I'd like to see it on envs like Procgen. I wish there was more footage of it.

  • @notsure7132
    @notsure7132 3 years ago

    Thank you.

  • @pratik245
    @pratik245 2 years ago

    This guy is a good comedian

  • @hannesstark5024
    @hannesstark5024 3 years ago +1

    nice video!

  • @shyboy523
    @shyboy523 3 years ago +1

    It's so frustrating to run this code. After running the Pong program on a 1650 Ti (4 GB VRAM) for 500,000 steps, the reward was still -19 to -21. On closer inspection, the original ran for 200 million steps, and it only seems to change much after a few million steps. Anyone have the same results as me?

    • @shyboy523
      @shyboy523 3 years ago +1

      Then I tried a P100; the model also seems unable to converge. And do you know how to switch from Pong to other Atari games? Are the hyperparameters the same as for Pong? Thanks.

    • @刘家林
      @刘家林 3 years ago

      Yes, me too.

  • @victorrielly4588
    @victorrielly4588 3 years ago

    I would imagine the ball position and velocity in video pinball, which is the only thing that really matters, would be pretty hard to project far into the future due to the chaotic nature of its motion. That would be my guess as to why this model fails in this game.

  • @brandonyano4103
    @brandonyano4103 3 years ago

    This guy is amazing

  • @mgostIH
    @mgostIH 3 years ago +3

    31:16 1 IN 7 TRILLION CHANCES FOR YOU TO GET THIS JOKE

    • @siyn007
      @siyn007 3 years ago

      ??

    • @f14-werto
      @f14-werto 3 years ago +2

      Next video be like: Minecraft speedrunner vs assassin AI

    • @PaganPegasus
      @PaganPegasus 2 years ago

      underrated comment

    • @mgostIH
      @mgostIH 2 years ago

      @@PaganPegasus Thanks for reminding me of it

  • @akarshrastogi3682
    @akarshrastogi3682 2 years ago

    Damn, this is heavy

  • @Phenix66
    @Phenix66 3 years ago +2

    Yikes. Googled Model-Free vs Model-Based, then resumed at 2:22. FML^^

    • @cerebralm
      @cerebralm 3 years ago +1

      LOL I've totally done that before

  • @grave0x
    @grave0x 1 year ago

    Think of this as if it were us, though. The meta. We experience the world the way we do because we learned to experience it this way.

  • @JTMoustache
    @JTMoustache 3 years ago

    This does not look like actor-critic; the actor loss is REINFORCE. An actor in actor-critic architectures follows the gradient of the critic.
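
    (For reference, a minimal sketch of a REINFORCE-style actor loss with a value baseline; the names and the entropy weight are assumptions, and DreamerV2 actually mixes this with gradients backpropagated through the dynamics.)

    ```python
    import torch

    def reinforce_actor_loss(log_probs, returns, values, entropy, eta=1e-3):
        # REINFORCE: weight action log-probabilities by the advantage.
        # The critic only supplies the baseline here; the actor does not
        # follow the critic's gradient as in classic actor-critic.
        advantage = (returns - values).detach()
        return -(log_probs * advantage).mean() - eta * entropy.mean()
    ```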

  • @christianleininger2954
    @christianleininger2954 3 years ago

    perfect thank you

  • @yusunliu4858
    @yusunliu4858 3 years ago

    When x = z, r_0..r_{n-1} = 0, r_n = 1, then it is AlphaGo. Does that mean AlphaGo is Dreamer Go?

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 years ago

      Well, not quite AlphaGo, since it's missing the element of tree search. In fact, I'd say the learned world model is about the only similarity?

  • @CandidDate
    @CandidDate 3 years ago

    Apply this to self driving cars - BOOM!

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 years ago

      @Robert w Why nonsense?

    • @softerseltzer
      @softerseltzer 3 years ago +2

      BOOM! because it will crash and explode

    • @CandidDate
      @CandidDate 3 years ago

      @@softerseltzer not if every car is connected, trained first in simulation, then applied. But that is not likely.

    • @softerseltzer
      @softerseltzer 3 years ago

      @@CandidDate You can try and implement it in CARLA.

    • @CandidDate
      @CandidDate 3 years ago

      @@softerseltzer CARLA = 100% electric self-driving CARs in LA? There's a city in China that has 100% self-driving, so it is possible. But Americans love their FREEDOM. And what about motorcycles? ha

  • @PaganPegasus
    @PaganPegasus 2 years ago

    smh, yannic talking dirty about my boy the GRU

  • @MrGeoffreyNL
    @MrGeoffreyNL 3 years ago +4

    Correction: this is actually called planning

    • @willrazen
      @willrazen 3 years ago

      No, it is model-based reinforcement learning, which combines planning and learning in a variety of ways depending on the architecture

    • @graham8316
      @graham8316 1 year ago

      @@willrazen what is it called when you use planning?

    • @willrazen
      @willrazen 1 year ago +1

      @@graham8316 The planning part is called planning, but OP implied everything was just planning, which is incorrect

    • @graham8316
      @graham8316 1 year ago +1

      @@willrazen thanks!

  • @mathematicalninja2756
    @mathematicalninja2756 3 years ago +2

    Hello wonderful person
    Hi there
    Dear fellow scholars
    Top of the morning to ya laddies
    Which one is ya favorite

  • @satvik9677
    @satvik9677 3 years ago +1

    hii 😄

  • @billykotsos4642
    @billykotsos4642 3 years ago

    lol this was TOO quick