Tweaking Custom Environment Rewards - Reinforcement Learning with Stable Baselines 3 (P.4)
- Published: Sep 13, 2024
- Helping our reinforcement learning algorithm to learn better by tweaking the environment rewards.
Text-based tutorial and sample code: pythonprogramm...
Neural Networks from Scratch book: nnfs.io
Channel membership: / @sentdex
Discord: / discord
Reddit: / sentdex
Support the content: pythonprogramm...
Twitter: / sentdex
Instagram: / sentdex
Facebook: / pythonprogramming.net
Twitch: / sentdex
Thank you so much for this series. The best part was your laughter in these videos. Of course I learned a lot, but you are an amazing person too.
The agent literally realizing that life is suffering and deciding to commit suicide every time he's reincarnated
Harrison: *I think it's kinda funny*
Nice series. I actually built and trained a very similar env, and I can share some learnings that I think will make yours better:
- Use a reward of -1 * distance on each time step, +50 on catching the apple, and -EPISODE_LENGTH * MAX_DISTANCE on hitting the wall. That way the agent is incentivized to move, not to "delete itself", and to catch the apple.
- Have a maximum episode length (say 1000 steps).
- Remove the cv2 code and move it to render; it slows down your training enormously!
- Train on CPU (I know this one is counterintuitive). The NN is too small to benefit from a GPU, and you will notice that training gets faster.
- The agent has no proper knowledge of its past positions; I would use a fixed-length vector (say of size 30x2, one entry for each square of the snake).
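The reward scheme in the first bullet can be sketched roughly as below. This is only an illustration of the commenter's idea, not the code from the video: `EPISODE_LENGTH`, `BOARD_SIZE`, `MAX_DISTANCE`, and the function name `shaped_reward` are assumptions chosen for the example.

```python
import math

# Hypothetical constants; the comment only suggests "say 1000 steps".
EPISODE_LENGTH = 1000
BOARD_SIZE = 500  # assumed square board, in pixels
MAX_DISTANCE = math.hypot(BOARD_SIZE, BOARD_SIZE)  # board diagonal


def shaped_reward(head, apple, ate_apple, hit_wall):
    """Return the reward for a single time step."""
    if hit_wall:
        # At least as bad as the worst possible sum of per-step distance
        # penalties, so ending the episode early never pays off.
        return -EPISODE_LENGTH * MAX_DISTANCE
    # Dense penalty: the farther from the apple, the worse the step.
    reward = -math.dist(head, apple)
    if ate_apple:
        reward += 50
    return reward
```

Because the wall penalty is tied to the episode length, it only makes sense together with the maximum episode length from the second bullet.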
thanks for sharing!
Ouassim Fari: Could you please share your code?
If you constrain the episode length to 1000 steps, doesn't that mean the agent only has 1000 steps to reach the end goal? In sentdex's example the goal is 30 apples eaten, so wouldn't it be impossible to reach that goal with the 1000-step limit per episode, assuming terrible RNG where apples are placed very far from the snake?
I watched all the playlist on stable baselines and I feel I've learned a lot. Nicely done videos, keep up such a good quality. Thank you
Thanks for this great series! I just wanted to mention that when I went through the Stable Baselines3 documentation, I discovered that there is a parameter called deterministic in the model.predict() method. It appears that you should set this parameter to True at inference time to follow a deterministic policy.
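For a discrete action space like the snake's, the difference between the two modes comes down to sampling from the policy's action probabilities versus taking the most likely action. The sketch below illustrates that idea in plain Python; `select_action` is a hypothetical helper for illustration, not part of Stable Baselines3.

```python
import random


def select_action(action_probs, deterministic=False, rng=random):
    """Pick an action index from a list of action probabilities."""
    indices = range(len(action_probs))
    if deterministic:
        # Always take the most probable action.
        return max(indices, key=action_probs.__getitem__)
    # Otherwise sample, so less likely actions are still tried sometimes.
    return rng.choices(indices, weights=action_probs, k=1)[0]
```

With the library itself, the equivalent switch is the flag the comment describes: `model.predict(obs, deterministic=True)`.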
Ended up tweaking my data and rewards and got a model with a 100-game average length of 84; the best I saw was 149, with a model trained for 5M steps using the same PPO model. It took me about a week to figure out good parameters and reward systems, but I learned a lot.
For the data I just scanned from the snake's head for the closest "danger block", with values from -1 to 4: -1 means imminent danger for all directions, and if the danger is over 6 blocks away (value 4) then we don't really care to indicate that it's further.
The apple's X and Y position relative to the head (-1 if it's to the left, 1 if it's to the right, and 0 if it's on the same row),
and the same type of relative data for the position of the snake's tail and of the snake's middle part, so that the snake knows roughly the direction of its body relative to its head regardless of length.
When you give the snake wayyyy too much unorganized data, like the move that it did 30 steps ago, it really has a hard time grasping the relationship. It also had a hard time getting to the apples with only absolute relative x and y and would just spin in circles next to the apple, but just having the value switch between 1, 0 and -1 seemed to fix it quite well.
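The sign-based relative features described above might look something like this. It is a minimal sketch under assumptions: the snake is taken to be a list of (x, y) cells with the head first, the helper names are made up for the example, and the danger scan is left out because the comment does not fully specify its scale.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def relative_direction(head, target):
    """Direction of target from head, per axis, as -1 / 0 / 1."""
    return sign(target[0] - head[0]), sign(target[1] - head[1])


def build_observation(snake, apple):
    """snake is a list of (x, y) cells, head first (assumed layout)."""
    head = snake[0]
    middle = snake[len(snake) // 2]
    tail = snake[-1]
    return [
        *relative_direction(head, apple),
        *relative_direction(head, middle),
        *relative_direction(head, tail),
    ]
```

The observation stays six values long no matter how long the snake grows, which is the property the commenter was after.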
Hey, could you explain more how you changed the rewards? thx!
I believe line 108 of snakeenvp4 (self.reward = -10) should be self.total_reward = -10. self.reward is never used by the model.
If you're interested, one strategy that produced fairly good results is starting with a small world and increasing the world size as the AI learns.
Yeah I imagine that'd work quite well to dramatically reduce the episodes required to learn that getting apple is good. I like this idea!
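One way to sketch this growing-world idea is a small schedule that enlarges the board once the agent's recent scores clear a threshold. Everything here is an assumption for illustration: the function name, the threshold, the step size, and the maximum size are not from the comment.

```python
def next_world_size(current_size, recent_scores, target_score=5, step=2, max_size=50):
    """Grow the board once the recent average score clears target_score."""
    if not recent_scores:
        return current_size
    average = sum(recent_scores) / len(recent_scores)
    if average >= target_score:
        # The agent copes with this size; make the world a bit bigger.
        return min(current_size + step, max_size)
    return current_size
```

A training loop would call this between batches of episodes and rebuild the environment whenever the returned size changes.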
Thanks for this series! I'm a professional statistician who in his spare time is trying to pick up machine learning/artificial intelligence and this series has encouraged me to really put in some effort in trying to learn this stuff. One question I have for you -
How would you deal with a dynamic action space? In one of my first personal projects I'm trying to apply this to, I need the model to learn what to do where there are different "states" with different action spaces. I've started looking into this and the best answer I've come up with so far is action masking (which seems horribly inefficient). Any chance you could do a video on a simple case where you'd need to deal with the action space changing depending on your observation, or "state" observed?
Why should the action space size and, accordingly, the meaning of the actions actually change? To me it seems that you are trying to learn multiple different environments, each defined by its observation state, and each with its own specific goal. If that's the case, I believe you would need a separate agent for each goal (associated with an observation state and an action space). I find it really interesting and would like to know more about your idea.
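For reference, the action masking mentioned in the question usually amounts to keeping one fixed action space and zeroing out the probability of actions that are unavailable in the current state. A minimal sketch of that idea, with a hypothetical helper name and plain Python instead of any particular RL library:

```python
import math


def masked_probabilities(logits, valid_mask):
    """Softmax over logits where invalid actions get probability 0."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, which becomes exp(-inf) == 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

The mask itself is cheap to compute per step, since it is just one boolean per action.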
One-hot encoded vector of the board with 1s representing the snake body (so it knows its body's location). A 2D vector pointing to the apple. And maybe a few values to represent whether the wall is to the left/right/in front of the snake head. EDIT: Also maybe the current direction of the snake.
When the snake turns, it would no longer be a vector; it would turn into a matrix.
@HellTriX You can represent a 2D array as a 1D array. The whole map would be represented as a one-hot encoding of the snake's position.
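The flattening described in this reply can be sketched as follows: the board is an occupancy grid (what the comments call a one-hot encoding) stored row by row in a single list. The function name and the (x, y) cell convention are assumptions for the example.

```python
def encode_board(snake, width, height):
    """Flatten a width x height board into a 1D list of 0s and 1s."""
    board = [0] * (width * height)
    for x, y in snake:
        # Row-major layout: each row of the grid follows the previous one.
        board[y * width + x] = 1
    return board
```

The length of the vector depends only on the board size, so it stays fixed when the snake turns or grows.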
It may also be good to start the snake with random lengths for training.
Thanks for making these. Really helpful for getting my head around ML concepts. Maybe a suggestion on the title though, as I thought this was part 3 for a moment. "Tweaking Custom Environment Rewards - Reinforcement Learning with Stable-Baselines3 - P.4". Not a big deal since you make playlists though. ;)
Tbh I meant to have those in the titles, tough with a library with numbers at the end too haha ty for reminder
I just love your videos!! I am still waiting for your neural networks from scratch series to complete :( . Can you please give updates whether you'll continue or not?
I think he kind of finished them by writing the book.
As always, great tutorial.
nice video
I copied your code and trained on 16 cores + a 3090; it only takes half an hour for 1M steps.
The snake seems to have learned to get closer to the apple and eat it,
but it still often bites itself,
so I tried manually adding some features to the observation.
That seems to make it converge much more slowly.
I will keep trying and testing.
So interesting bro, I hope you add an adversarial agent, it will be more interesting
You can make a video about gait recognition biometrics in python
recognized you from your walk model
Thanks for the series! Made it easy to understand how to work with these single-player environments.
How would you adapt this to multiplayer? I've seen David Foster's SIMPLE for multiplayer (learning by playing against itself), but he uses custom neural networks and I feel like these baselines should kinda work there too. 🤔
Thanks for making this tutorial. I'm looking to it for help with Human-Robot Collaboration object handover tasks. Could you please shed some light on that? I'm eagerly waiting for @all to jump in, if you could!
Hi, thanks for the amazing video. Could you please make a tutorial on hyperparameter tuning?
Hey, if we cut out the code where it just waits a millisecond, I mean the one where it says cv2.waitKey, it will just run faster, right? Because you said, "I think the agent can run fast enough", and the agent doesn't actually need time to think, right?
I would also add reward points if the snake is in the same row or column as the apple at that moment, if that is possible.
Very interesting seeing ML application thx
Please make a RL lesson on stocks trading or forex.
Hey sentdex, could you solve this problem with Hindsight Experience Replay? Every apple eat is +1 and 0 otherwise. This problem is a candidate for a sparse reward environment
What a great hoodie!
I mean, best coat ever!
Hello, is this part of your book?
Hey! I joined your Discord server and needed help, but no one answered, so I want to ask you directly. I'm building your RC car robotics project and ran into an issue. I'm on episode 8, where you add user control, and while I was using Tkinter it said "No $DISPLAY name and no display variable". Can you give me an answer?
20.12.22 22:00
Hey man! Love your videos! I’ve watched so many. Do you have any videos on the following? If not, would you make some? Or point me in the right direction please?
1) LSTM
2) ARIMA model
3) Fourier Series Model
Thank you!
First ig