Tweaking Custom Environment Rewards - Reinforcement Learning with Stable Baselines 3 (P.4)

  • Published: 13 Sep 2024
  • Helping our reinforcement learning algorithm to learn better by tweaking the environment rewards.
    Text-based tutorial and sample code: pythonprogramm...
    Neural Networks from Scratch book: nnfs.io
    Channel membership: / @sentdex
    Discord: / discord
    Reddit: / sentdex
    Support the content: pythonprogramm...
    Twitter: / sentdex
    Instagram: / sentdex
    Facebook: / pythonprogramming.net
    Twitch: / sentdex

Comments • 42

  • @Zenosama26
    @Zenosama26 1 year ago +1

    Thank you so much for this series. The best part was your laughter in these videos. Of course I learned a lot, but you are an amazing person too.

  • @aybber
    @aybber 2 years ago +22

    The agent literally realizing that life is suffering and deciding to commit suicide every time he's reincarnated.
    Harrison: *I think it's kinda funny*

  • @MrShastaa
    @MrShastaa 2 years ago +21

    Nice series. I actually built and trained a very similar env, and I can share some lessons that I think will make yours better (see the sketch just below this list):
    - Have a -1*distance reward on each time step, +50 on apple catch, and -EPISODE_LENGTH*MAX_distance on hitting a wall. That way the agent is incentivized to move, not to "delete itself", and to catch the apple.
    - Have a max episode length (say 1000 steps).
    - Remove the cv2 code and move it to render(); it slows down your training enormously!
    - Train on CPU (I know this one is counterintuitive); the NN is too small to benefit from a GPU and you will notice an improvement in training speed.
    - The agent has no proper knowledge of its past positions; I would add a fixed-length vector (say size 30x2, one entry for each square of the snake).
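
    A hedged sketch of the reward scheme suggested above, written as a helper that a gym-style step() could call. The constants, names, and distance measure are assumptions for illustration, not the tutorial's exact code:

    import numpy as np

    GRID_SIZE = 20                               # assumed board size
    EPISODE_LENGTH = 1000                        # hard cap on steps per episode
    APPLE_REWARD = 50
    MAX_DISTANCE = np.sqrt(2) * GRID_SIZE        # largest possible head-to-apple distance

    def shaped_reward(ate_apple, died, dist_to_apple):
        if died:
            # Huge penalty so ending the episode early is never the cheapest option.
            return -EPISODE_LENGTH * MAX_DISTANCE
        if ate_apple:
            return APPLE_REWARD
        # Small per-step cost scaled by distance nudges the agent toward the apple.
        return -1.0 * dist_to_apple

    # In step(), also truncate long episodes (done = died or step_count >= EPISODE_LENGTH)
    # and keep the cv2 drawing code in render(), not in step().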

    • @sentdex
      @sentdex  2 years ago +2

      thanks for sharing!

    • @chuksojiugwo9093
      @chuksojiugwo9093 2 years ago +2

      Ouassim Fari: Could you please share your code?

    • @phantomBlurrrrr
      @phantomBlurrrrr 8 months ago

      If you constrain the episode length to 1000 steps, doesn't that mean the agent only has 1000 steps to reach the end goal? In Sentdex's example the goal is 30 apples eaten, so wouldn't it be impossible to reach that goal under the 1000-step-per-episode constraint, assuming terrible RNG and apples placed very far from the snake?

  • @ArMeD217
    @ArMeD217 2 years ago +1

    I watched the whole Stable Baselines playlist and I feel I've learned a lot. Nicely done videos, keep up the good quality. Thank you.

  • @naimhoues1185
    @naimhoues1185 2 years ago +6

    Thanks for this great series! I just wanted to mention that when I went through the Stable Baselines3 documentation, I discovered that there is a parameter called deterministic in the model.predict() method. It appears that you should set this parameter to True at inference time to follow a deterministic policy.
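
    For reference, a minimal inference loop with that flag set. Here env stands in for the custom snake environment from the series, and the saved-model name is made up:

    from stable_baselines3 import PPO

    model = PPO.load("ppo_snake")            # hypothetical saved model
    obs = env.reset()
    done = False
    while not done:
        # deterministic=True takes the highest-probability action instead of sampling.
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)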

  • @juhotuho10
    @juhotuho10 2 years ago +2

    I ended up tweaking my data and rewards and got a model with a 100-game average length of 84; the best I saw was 149, with a model trained for 5M steps using the same PPO setup. It took me about a week to figure out good parameters and reward systems, but I learned a lot.
    For the data, I just scanned from the snake's head for the closest "danger block", with values from -1 to 4 for each direction: -1 means imminent danger, and if the danger is over 6 blocks away (value 4) we don't really care to indicate that it's further.
    Then the apple's X and Y position relative to the head (-1 if it's to the left, 1 if it's to the right, and 0 if it's on the same row/column),
    and the same type of relative data for the snake's tail position and the snake's middle part, so that the snake roughly knows the direction of its body relative to its head regardless of length.
    When you give the snake wayyy too much unorganized data, like the move it did 30 steps ago, it really has a hard time grasping the relationships. It also had a hard time getting to the apples with only the raw relative x and y values and would just spin in circles next to the apple, but having the value switch between 1, 0 and -1 seemed to fix that quite well.
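
    A rough sketch of the features described above; the helper names and exact encoding are guesses at the commenter's design, not code from the video:

    import numpy as np

    def sign_features(head, target):
        # -1 if the target is left/up of the head, 1 if right/down, 0 if aligned.
        return [int(np.sign(target[0] - head[0])),
                int(np.sign(target[1] - head[1]))]

    def danger_scan(head, blocked, direction, max_look=6):
        # Distance to the nearest wall/body cell in `direction`, mapped to -1..4:
        # -1 = immediately adjacent, 4 = six or more cells away.
        dx, dy = direction
        for dist in range(1, max_look + 1):
            if (head[0] + dx * dist, head[1] + dy * dist) in blocked:
                return dist - 2
        return max_look - 2

    def build_observation(snake_body, apple, blocked):
        head = snake_body[0]
        mid = snake_body[len(snake_body) // 2]
        tail = snake_body[-1]
        directions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        obs = (sign_features(head, apple)
               + sign_features(head, mid)
               + sign_features(head, tail)
               + [danger_scan(head, blocked, d) for d in directions])
        return np.array(obs, dtype=np.float32)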

    • @TiestoCrack
      @TiestoCrack 1 year ago

      Hey, could you explain in more detail how you changed the rewards? Thanks!

  • @ggaarr485
    @ggaarr485 1 year ago +2

    I believe line 108 of snakeenvp4 (self.reward = -10) should be self.total_reward = -10. self.reward is never used by the model.

  • @AakashKumar-gt9ip
    @AakashKumar-gt9ip 2 years ago +5

    If you're interested, one strategy that produced fairly good results is starting with a small world and increasing the world size as the AI learns.

    • @sentdex
      @sentdex  2 years ago +1

      Yeah, I imagine that'd work quite well to dramatically reduce the episodes required to learn that getting the apple is good. I like this idea!
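
      A minimal sketch of that "start small, grow the world" idea as a Stable Baselines3 callback. The grid_size attribute and the schedule are assumptions about the custom env, not code from the video:

      from stable_baselines3.common.callbacks import BaseCallback

      class GrowWorldCallback(BaseCallback):
          def __init__(self, start_size=6, max_size=20, grow_every=200_000, verbose=0):
              super().__init__(verbose)
              self.size = start_size
              self.max_size = max_size
              self.grow_every = grow_every
              self.next_grow = grow_every

          def _on_step(self) -> bool:
              # Enlarge the board every `grow_every` timesteps until max_size is reached.
              if self.size < self.max_size and self.num_timesteps >= self.next_grow:
                  self.size += 2
                  self.next_grow += self.grow_every
                  self.training_env.set_attr("grid_size", self.size)
              return True

      # usage: model.learn(5_000_000, callback=GrowWorldCallback())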

  • @sthardy2805
    @sthardy2805 2 years ago +3

    Thanks for this series! I'm a professional statistician who in his spare time is trying to pick up machine learning/artificial intelligence and this series has encouraged me to really put in some effort in trying to learn this stuff. One question I have for you -
    How would you deal with a dynamic action space? In one of my first personal projects I'm trying to apply this to, I need the model to learn what to do where there are different "states" with different action spaces. I've started looking into this and the best answer I've come up with so far is action masking (which seems horribly inefficient). Any chance you could do a video on a simple case where you'd need to deal with the action space changing depending on your observation, or "state" observed?
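
    One way to handle state-dependent action validity is action masking via the sb3-contrib package (MaskablePPO). In this sketch, my_custom_env and its valid_action_mask() method are hypothetical placeholders for your own environment:

    import numpy as np
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker

    def mask_fn(env) -> np.ndarray:
        # Boolean array over the full, fixed-size action space; invalid actions are False.
        return env.valid_action_mask()

    env = ActionMasker(my_custom_env, mask_fn)
    model = MaskablePPO("MlpPolicy", env, verbose=1)
    model.learn(1_000_000)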

    • @ArMeD217
      @ArMeD217 2 years ago +1

      Why would the action space size, and accordingly the meaning of the actions, actually change? To me it seems that you are trying to learn multiple different environments, each defined by its observation state and each with its own specific goal. If that's the case, I believe you would need a separate agent for each goal (associated with an observation state and an action space). I find it really interesting and would like to know more about your idea.

  • @easyBob100
    @easyBob100 2 years ago +3

    A one-hot encoded vector of the board with 1s representing the snake body (so it knows its body's location), a 2D vector pointing to the apple, and maybe a few values to represent whether the wall is to the left/right/in front of the snake's head. EDIT: Also maybe the current direction of the snake.

    • @HellTriX
      @HellTriX 2 years ago

      When the snake turns, it would no longer be a vector; it would turn into a matrix.

    • @easyBob100
      @easyBob100 2 years ago

      @@HellTriX You can represent a 2D array as a 1D array. The whole map would be represented as a one-hot encoding of the snake's position.
      It may also be good to start the snake with random lengths for training.
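
      A small sketch of that flattened one-hot board observation; the grid size, the extra apple/direction features, and all names are illustrative choices, not the video's code:

      import numpy as np

      GRID = 20

      def encode_observation(snake_body, apple, direction):
          """snake_body: list of (x, y) cells, head first; direction: int 0-3."""
          board = np.zeros((GRID, GRID), dtype=np.float32)
          for x, y in snake_body:
              board[y, x] = 1.0                          # 1s mark the snake's body
          head_x, head_y = snake_body[0]
          apple_vec = np.array([apple[0] - head_x, apple[1] - head_y],
                               dtype=np.float32) / GRID  # 2D vector pointing to the apple
          dir_onehot = np.eye(4, dtype=np.float32)[direction]
          # Flatten everything into one 1D vector for an MLP policy.
          return np.concatenate([board.ravel(), apple_vec, dir_onehot])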

  • @d3c0deFPV
    @d3c0deFPV 2 years ago +1

    Thanks for making these. Really helpful for getting my head around ML concepts. Maybe a suggestion on the title though, as I thought this was part 3 for a moment. "Tweaking Custom Environment Rewards - Reinforcement Learning with Stable-Baselines3 - P.4". Not a big deal since you make playlists though. ;)

    • @sentdex
      @sentdex  2 years ago +1

      Tbh I meant to have those in the titles; it's tough with a library that has numbers at the end too, haha. Ty for the reminder!

  • @chandanakuntala4294
    @chandanakuntala4294 2 years ago +1

    I just love your videos!! I am still waiting for your Neural Networks from Scratch series to be completed :( . Can you please give an update on whether you'll continue it or not?

    • @dzivba
      @dzivba 2 years ago

      I think he kind of finished them by writing the book.

  • @serhatgvn
    @serhatgvn 2 years ago

    As always, great tutorial.

  • @zhezhu7554
    @zhezhu7554 2 years ago +1

    Nice video! I copied your code and trained on 16 cores + a 3090; it only takes half an hour to reach 1M steps. The snake seems to have learned to get closer to the apple and eat it, but it still often bites itself, so I tried manually adding some features to the observation object. That seems to make convergence much slower. I will keep trying and testing.

  • @ahmedyamany5065
    @ahmedyamany5065 2 years ago

    So interesting, bro. I hope you add an adversarial agent; it would be even more interesting.

  • @microgamawave
    @microgamawave 2 years ago +1

    Could you make a video about gait recognition biometrics in Python, i.e. a model that recognizes you from your walk?

  • @davidrusca2
    @davidrusca2 2 years ago

    Thanks for the series! It made it easy to understand how to work with these single-player environments.
    How would you adapt this to multiplayer? I've seen David Foster's SIMPLE for multiplayer (learning by playing against itself), but he uses custom neural networks, and I feel like these baselines should kinda work there too. 🤔

  • @fawadkhan8905
    @fawadkhan8905 11 months ago

    Thanks for making this tutorial. I am looking to this tutorial to help me with Human-Robot Collaboration object handover tasks. Could you please shed some light on it? I'm eagerly waiting for @all to jump in, if you could!

  • @akashgarg5770
    @akashgarg5770 2 years ago

    Hi, thanks for the amazing video. Could you please make a tutorial on hyperparameter tuning?

  • @phantomBlurrrrr
    @phantomBlurrrrr 8 months ago

    Hey, if we cut out the code where it just waits a millisecond, I mean the cv2.waitKey call, all that changes is that it will run faster, right? Because you said "I think the agent can run fast enough", so the agent doesn't actually need time to think, right?

  • @ganzovaska
    @ganzovaska 2 years ago

    I would also give reward points when the snake is in the same line as the apple (row and column treated separately) at that moment, if that's possible.
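
    A tiny sketch of that alignment bonus; the bonus value and names are arbitrary:

    def alignment_bonus(head, apple, bonus=0.5):
        # Extra reward when the head shares a row or a column with the apple.
        reward = 0.0
        if head[0] == apple[0]:
            reward += bonus
        if head[1] == apple[1]:
            reward += bonus
        return reward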

  • @MiroKrotky
    @MiroKrotky 2 years ago

    Very interesting to see an ML application, thanks.

  • @nbamek899
    @nbamek899 2 years ago

    Please make an RL lesson on stock trading or forex.

  • @melih6826
    @melih6826 1 year ago

    Hey sentdex, could you solve this problem with Hindsight Experience Replay? Every apple eaten is +1 and 0 otherwise; this problem is a candidate for a sparse-reward environment.
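
    For reference, HER in Stable Baselines3 is wired in as a replay buffer for an off-policy algorithm (DQN/SAC/TD3), not PPO, and it needs a goal-conditioned env (Dict observations with "observation"/"achieved_goal"/"desired_goal" keys plus a compute_reward() method), so the snake env would have to be reworked first. The sketch below is illustrative, with goal_env as a placeholder:

    from stable_baselines3 import DQN, HerReplayBuffer

    model = DQN(
        "MultiInputPolicy",
        goal_env,                        # placeholder for a goal-conditioned snake env
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(
            n_sampled_goal=4,            # relabel each transition with 4 alternative goals
            goal_selection_strategy="future",
        ),
        verbose=1,
    )
    model.learn(1_000_000)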

  • @JPy90
    @JPy90 8 months ago

    What a nice hoodie!

    • @JPy90
      @JPy90 8 months ago

      I mean, best coat ever!

  • @friedrichwilhelmhufnagel3577
    @friedrichwilhelmhufnagel3577 2 years ago

    Hello, is this part of your book?

  • @XyXy-ku6yv
    @XyXy-ku6yv 2 years ago

    Hey! I joined your Discord server and needed help, but no one answered, so I want to ask you directly. I'm building your RC car robotics project and ran into an issue. I'm on episode 8, the one with user control, and while I was using Tkinter it says "No $DISPLAY name and no display variable". Can you give me an answer?

  • @niklasdamm6900
    @niklasdamm6900 1 year ago

    20.12.22 22:00

  • @dolodestinations7628
    @dolodestinations7628 2 years ago

    Hey man! Love your videos! I’ve watched so many. Do you have any videos on the following? If not, would you make some? Or point me in the right direction please?
    1) LSTM
    2) ARIMA model
    3) Fourier Series Model
    Thank you!

  • @mollikaakther6459
    @mollikaakther6459 2 years ago +1

    First ig