Q Learning Algorithm and Agent - Reinforcement Learning p.2
- Published: 28 Jan 2025
- Welcome to part 2 of the reinforcement learning tutorial series, specifically with Q-Learning. We've built our Q-Table which contains all of our possible discrete states. Next we need a way to update the Q-Values (value per possible action per unique state), which we will do using the Q-Learning algorithm!
Text-based tutorial and sample code: pythonprogramm...
#reinforcementlearning #machinelearning #python
A tip for anyone using large numbers like at 2:09 (episodes etc.) in Python 3.6+ (I think). You can write integers using underscores to separate parts just like you would use commas, so 25000 becomes 25_000. This has saved me a lot of time because I don't have to count the digits to figure out if I have a million iterations or just a hundred thousand :D
Wow I so rarely learn new stuff like this now. Very cool! Thanks for sharing this.
Thanks Riaz for that great tip.
Nice trick! But I just end up using scientific notation for large numbers with 0's, like 25e3 for 25,000, 1e-99 etc.
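A quick sketch of both notations side by side (note that 25e3 is a float, so cast it if you need an integer count):

    EPISODES = 25_000                 # underscores are purely visual separators (Python 3.6+)
    print(EPISODES == 25000)          # True
    EPISODES_SCI = int(25e3)          # scientific notation gives a float, so cast before using it as a count
    print(EPISODES_SCI == EPISODES)   # True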
For anyone doing this in the new gymnasium library, here is what you need to adapt in the code. The .render() method call is no longer necessary; you now pass the render mode to the constructor, e.g. env = gym.make("MountainCar-v0", render_mode='human'). So to selectively render or not render, you need to create the environment with or without render_mode each episode. The return type of the step function also changed: it now returns termination and truncation as separate variables, so you can call new_state, reward, termination, truncation, _ = env.step(action) and then done = termination or truncation.
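A minimal sketch of that gymnasium-style loop (gymnasium assumed to be installed; the render flag name is just illustrative):

    import gymnasium as gym

    render_this_episode = True   # illustrative flag
    env = gym.make("MountainCar-v0", render_mode="human" if render_this_episode else None)

    state, info = env.reset()    # reset now returns (observation, info)
    done = False
    while not done:
        action = env.action_space.sample()
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated   # combine the two end-of-episode signals
    env.close()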
hey also can u help with index out of bounds error
Don't forget to cover the improvements to DQN (double, dueling, noisy nets for exploration, prioritized experience replay, and so on). Wish you luck with this tutorial, it's great.
watched this tutorial two to three times and understood what u were trying to convey
keep up the good work!!!
Thanks, dude, for the tutorial. It's funny to see that each successive episode has fewer views :D
The video is great and helps people get the gist of Q-learning quickly. I just never expected I would start to laugh while watching. I hope I can give a presentation like sentdex did one day, so relaxed and easy, and make the audience feel relaxed and easy too. Thanks man!
Hey Man - great channel, thanks for sharing!
We can add the epsilon (exploitation vs. exploration) and a percentage of success to the code (sentdex does it at 27:07)
Under the other constants (the ones in CAPS):

epsilon = 0.001   # exploration vs exploitation: 1.0 is fully random, 0 is pure policy
counter = 1

while not done:
    if np.random.rand() < epsilon:   # exploration option
        action = env.action_space.sample()
    else:                            # exploitation if we are not exploring
        action = np.argmax(q_table[discrete_state])

    # ......... all other code .........

    elif new_state[0] >= env.goal_position:
        print(f'we made it on episode {episode}')
        q_table[discrete_state + (action,)] = 0
        counter += 1
        percentage = counter / (episode * 0.01)
        print('success rate: ', percentage)

if epsilon > 0.01:
    epsilon = epsilon * epsilon_decay
Hope this helps others!
Wish you luck.
Thanks for highlighting when he uses epsilon! I was so confused why we defined it but didn't do anything with it.
It took me a little bit to understand the get_discrete_state() function. Essentially what's happening is we have already decided to bucket the observation values into 20 discrete buckets between the highest and lowest possible values. The discrete_os_win_size is the step between each bucket value. By computing (state - env.observation_space.low) we get some value between zero and (high - low). Then, by dividing by discrete_os_win_size, we get the i'th bucket the observation falls into. By returning .astype(np.int) it can be used as an index for the q_table. The index is the unique position/velocity bucket.
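A small standalone sketch of that bucketing (the bounds are roughly MountainCar's position/velocity limits, written out here just for illustration):

    import numpy as np

    low = np.array([-1.2, -0.07])
    high = np.array([0.6, 0.07])
    discrete_os_win_size = (high - low) / [20, 20]

    def get_discrete_state(state):
        # shift so the lowest observation maps to bucket 0, then divide by the bucket width
        discrete_state = (state - low) / discrete_os_win_size
        return tuple(discrete_state.astype(int))

    print(get_discrete_state(np.array([-0.5, 0.001])))   # the buckets: position 7, velocity 10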
Love the tip that you can use underscores to semantically separate numbers by thousands! I deal with big numbers all the time! Great tip!
Might be helpful (I'm at about 8:50 right now) to periodically pull up an image of the Q update function to remind us of which part of the Q function we are working on at a given time (e.g., "now we're working on current Q as can be seen here" or "this is where max future Q is"). That way, we can match the code to the function in our heads.
Oops, looks like you did it a few seconds later. Still a handy tool for sure.
It's been 2 days since the last video!! Can't wait for the next episodes, loving this series, it came just in time.
I can't wait for this series! Thanks!
This is a lot like how ants move at random to search for food, but once they find a source,
they leave pheromone trails for other ants to exploit that information.
The way you explain the concept was really good;
you made it easy for me to understand.
normies: watching the video and learning
Me an intellectual: *Notices a new cup in every video* Damnnnn yet another amazing cup :)
Thank you, man, you saved my life haha. You explained this tutorial very clearly!
Was wondering whether or not you were gonna actually use epsilon for choosing actions , glad I watched until the end XD
For anyone getting the following error:
discrete_state = (np.array(state, dtype=object) - env.observation_space.low) / discrete_os_win_size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'dict' and 'float'
It's because 'state' is a tuple of an ndarray and a dictionary. Replace this line: '(np.array(state, dtype=object) - env.observation_space.low)' with this line: '(np.array(state[0], dtype=object) - env.observation_space.low)'.
Here, state[0] is simply accessing the ndarray. I don't fully understand the purpose of the dictionary yet. However, the output will not be the same as that in the video for which index you reach on your Q-table
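A short sketch of the two common ways to handle the new reset return value (newer gym/gymnasium versions return an (observation, info) tuple; this reuses the tutorial's env and get_discrete_state names):

    # option 1: unpack the tuple explicitly
    state, info = env.reset()
    discrete_state = get_discrete_state(state)

    # option 2: index the observation directly
    discrete_state = get_discrete_state(env.reset()[0])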
Before this video came out I ended up finishing the code for this demo using a shotgun approach for changing the weights. It ended up taking only around 80,000 iterations though, so there's that.
So today I skimmed through the book by Geron and wished for reinforcement learning tutorials.
Harrison is new Aladdin. 😀
How is the epsilon affecting the model? Even though we are changing the value of epsilon, what effect does it have on any of the parameters of the model?
It is just inserting random actions, which helps you build a better q-table: basically it goes to places it normally wouldn't go and fills in the q-values of those places (if I understand it correctly).
I think there's an error as the epsilon term has been declared and it is decaying, but the agent does not use it for explore-exploit tradeoff. To fix this issue, I would do this in the loop:
while not done:
    if np.random.uniform(low=0, high=1) < epsilon:
        action = np.random.choice(env.action_space.n)   # explore
    else:
        action = np.argmax(q_table[pos][vel])            # exploit

    new_state, reward, done, _ = env.step(action)
    new_pos, new_vel = get_discrete_state(new_state)
    if render:
        env.render()
It worked so well without epsilon due to your optimistic initialization of the Q table. Values between -2 and 0 are way higher than the actual correct values, so argmax selects unexplored actions, i.e., implicit exploration.
I think that's the whole point of this:
STEP 1 - explore new possibilities randomly until we reach the goal
STEP 2 - optimize the way we get there
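A tiny illustration of that optimistic-initialization effect (hypothetical numbers): every step pays -1, so the true Q-values end up well below zero, while fresh entries drawn from [-2, 0) look comparatively attractive to argmax until they get updated:

    import numpy as np

    np.random.seed(0)
    optimistic_row = np.random.uniform(low=-2, high=0, size=3)   # like a fresh q_table row
    updated_row = optimistic_row.copy()
    updated_row[1] = -25.0             # pretend action 1 has been explored and found mediocre
    print(np.argmax(optimistic_row))   # whichever initial entry happens to be largest
    print(np.argmax(updated_row))      # never action 1: the unexplored entries still look better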
You have finally done it, Harrison.
25:54 that's the sound of a true nut if I've ever heard one
dude you're so good! so helpful. thanks!
I love your collection of mugs
Thanks for these videos. They helped me with a task for my AI class :)
Mine just never learns to beat it I’m so confused
same man
you have to put env.reset() inside for episode in range(episodes)
Did anybody get "ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part" and solve it?
Because env.reset now returns two values, not one, you can't pass it directly to the function. Instead I did this: observation, info = env.reset(), then sent the observation to the function.
Will someone help me fix this error:
File "qLearning.py", line 22, in
dicrete_state = get_discrete_state(env.reset())
File "qLearning.py", line 19, in get_discrete_state
discrete_state = (state - env.observation_space.low) / discrete_os_win_size
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It takes place around 4:26 in the video.
I watched up to 2:08 at the time of asking this question.
If the rewards are always -1 except for 0 at the finish, how does the discount factor impact the Q-table or the agent? Aren't we exclusively interested in the future reward, which is always the long-term reward? What happens if we use 1 as the discount factor?
I don't really understand the get_discrete_state function, can you please help me?
More specifically: discrete_state = (state - env.observation_space.low) / discrete_os_win_size
Did you solve it? I have the same problem in this line. The report is TypeError: unsupported operand type(s) for -: 'dict' and 'float'
@@chaodonghuang9130 If you print(env.reset()) you'll see that env.reset() produces a tuple containing something like "(array([-0.5732655, 0. ], dtype=float32), {})". You want the array out of that tuple so you can subtract env.observation_space.low from it. We can get it by referencing its index position: if you test print(env.reset()[0]) you'll see that it prints only the array [-0.5732655, 0. ]. To get past this error you'll want to change your code to something like discrete_state = get_discrete_state(env.reset()[0]). Also, .astype(np.int) is deprecated; you'll want to just use .astype(int).
5:20 Why did the 0th entry come out as max? It was supposed to be the 1st, right?
SamanwayGhatak He restarted Python, which generated a new set of random values.
12:15 You say that 0 is the reward for completing things, "nothing, just punishment". So if it's the reward, then why are you using it as a Q-value?? I thought that you would use it IN THE FORMULA to calculate the actual Q-value. But instead you just straight override the Q-value with the reward = 0. Could you or somebody clarify what is happening here?
Punishment and reward are all relative. If all punishment is -1 but the reward is 0, it's functionally the same as all punishment being 0 and the reward being 1.
My guess on why this has worked well without epsilon is that you have been punishing "suboptimal" choices for the agent in each state until the first random sequence led it to the first successful completion and kickstarted the learning.
DON'T LEAVE AROUND THE 24 MINUTE MARK. Addendum @ 26:39
Getting an error at : elif new_state[0] >= env.goal_position: 'TimeLimit' object has no attribute 'goal_position' … any help would be appreciated
For anyone else that has the problem - here is the solution : elif new_state[0] >= env.unwrapped.goal_position:
Thank you!
@@petermills1397 Thank you ! Finally found a solution that works !
I do not understand how the epsilon is going to have an effect in the code you wrote.
Is anyone else getting an out of bounds error? I copied and pasted the code from the site, but every time the second value in new_discrete_state is something like -60 or -70, so it is out of bounds of the q_table.
yeah, found any fix?
@@utkarshsharma6434 I assume this bug is caused by the gym updates that took place over the years. I have fixed the issue by doing the following:
Step 1: Rewrite get_discrete_state() by removing the [0]

def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))

Step 2: Before the while loop, add a [0]

discrete_state = get_discrete_state(env.reset()[0])
@@SirDerRosen yeah thanks that got fixed
Apparently the gym update has changed lots of stuff
Where is the epsilon value used? I see it being decremented by the decay value but it is never used in any calculations?
Hey, check my comment above.
Action 0: push the cart to the left
Action 1: don't push the cart (no acceleration)
Action 2: push the cart to the right
Is there a parameter displaying the amount of force executed to the cart? Is the amount of force executed to the cart a random number? Thank you!
I believe the force is always the same; you only decide whether you push left, nowhere, or right. The amount of force applied is more or less the number of consecutive steps you repeat the command. So, for instance, soft left would be [1, 0, 1, 0, 1, ...] and strong left would be [1, 1, 1, 1, 1, ...].
I think your agent discovered inertia and gravity! The pattern is making the most of the inertia by swinging to the other side!
You should use tqdm for progress bar in future videos. It is pure Python and super useful.
We use tqdm all the time...pretty sure even here in this series
Doing Great...but maybe you could use PyCharm for AutoUpdates. I know you know about it but I don't know why you don't use it.
Would it not be a good idea to do a numpy.random.seed so you can better compare your runs when tweaking parameters?
20,000 episodes and it still has not reached the top of the hill; is the code hardware dependent?
It may depend on the initial Q table. Remember that it was generated randomly. He should have fixed the random seed to get something that is reproducible.
did you update the action?
I know it's been 1 year; this is just for the other people who may read this.
@@erictheawesomest Well I did, but still never reached the top
I was following the tutorial and got to 5:00 when I had different values. I was freaking out, thinking I had typed something wrong. It took me a bit until I realized they were random values.
Update: Actually, why is the discrete_state sometimes (6,10) and sometimes (7,10)?
I have the same question actually
Are we going all the way to PPO or A3C, or maybe something with "TensorFlow Probability", during this RL series? It's asking a lot, I know.
We'll see. I'd like to cover quite a bit.
Awesome. I'm with you all the way.
Thanks so much for your efforts
I have used the same code but the agent just doesn't solve it. Using the parameters in the video it doesn't even solve it once in the 20000 episodes. I experimented with the parameters and the best I got is an avg reward of around -198 after 20000 iterations. Maybe this has something to do with the version of gym I was using?
same
@sentdex As always, great video! However, at 2:15, the discount rate is better understood as the opposite of how you said it. Per Wikipedia, "it has the effect of valuing rewards received EARLIER higher than those received LATER". This is consistent with complexity theory, where we want to get started quickly; then, once we get on the right path, we start delaying gratification/reward so we increase the time we spend exploring, forcing ourselves to look for a better result.
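A quick hedged sketch of what the discount does to a MountainCar-style reward stream (-1 per step, 0 at the goal); the episode length here is just an example:

    DISCOUNT = 0.95

    rewards = [-1] * 99 + [0]   # a hypothetical 100-step episode
    discounted_return = sum((DISCOUNT ** t) * r for t, r in enumerate(rewards))
    print(discounted_return)    # roughly -19.9, versus -99 undiscounted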
Great tutorial, just shouldn't there be a termination condition somewhere in the outer loop?
How would someone replay a specific episode? I took the frame count of every "winner" and stored the fastest copy of the q_table into a variable, then reran the rendering loop (see code below), but the frame count it originally beat it in always comes in lower than the replay's frame count.
The "replay" at the end of the program I set up:
print("Playing best episode: ", fastest_episode[0], " with ", fastest_episode[1], " frames")
done = False
render = True
env = gym.make('MountainCar-v0', render_mode='human')
discrete_state = get_discrete_state(env.reset()[0])
frame = 0
while not done:
frame += 1
env.render()
action = np.argmax(fastest_episode[2][discrete_state])
new_state, reward, done, truncated, _ = env.step(action)
done = done or truncated
if done:
break
new_discrete_state = get_discrete_state(new_state)
discrete_state = new_discrete_state
print("final frame: ", frame)
env.close()
I figured it out, rather than using the q_table, I saved all the actions into a list and then replayed the actions like so:
print("Playing best episode: ", fastest_episode[0], " with ", fastest_episode[1], " frames")
done = False
render = True
env = gym.make('MountainCar-v0', render_mode='human')
env.reset()
frame = 0
env.render()
for a in fastest_episode[2]:
frame += 1
new_state, reward, done, truncated, _ = env.step(a)
done = done or truncated
if done:
break
I am getting an error "ImportError: sys.meta_path is None, Python is likely shutting down" after running the script, Please help
you have to use env.close() at the end
@@nadjibbendaoud2719 that doesn't help still getting the same error!
At 8:55, can anyone explain to me why adding the (action, ) will access the exact q-value of that state?
How do you create q table for two agents in an environment whose actions are based on a coin-flip and both have opposite goals?
Thanks for the vids
Cool to see RL course, but
line 78, in
elif new_state[0] >= env.goal_position:
AttributeError: 'TimeLimit' object has no attribute 'goal_position'
Exception ignored in:
Would need to see your full code. I expect you made a typo somewhere compared to mine. come join discord.gg/sentdex. I don't edit in/out coding in my videos, so it definitely works as seen in the videos :P Probably just a simple mistake.
Same
@@sentdex
EDIT:: For this game goal_position is 0.5 so you can replace it with that value and it works.
I just copy/pasted from the full code on your website:
# objective is to get the cart to the flag.
# for now, let's just move randomly:

import gym
import numpy as np

env = gym.make("MountainCar-v0")

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
SHOW_EVERY = 3000

DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE

# Exploration settings
epsilon = 1  # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(np.int))  # we use this tuple to look up the 3 Q values for the available actions in the q-table

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)

        if episode % SHOW_EVERY == 0:
            env.render()

        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q

        # Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

env.close()
----------------------
Same error:
Traceback (most recent call last):
File "C:\Users\David\Desktop\mine\aaSUMMER_PROJ_2019\q_learn_sentdex - Copy.py", line 78, in
elif new_state[0] >= env.goal_position:
AttributeError: 'TimeLimit' object has no attribute 'goal_position'
---------------------
The game starts and then this pops up after it runs one time.
pip install --upgrade gym
@@douglasferreira3506 Thanks a lot, you've solved a big problem for me!!!
I don't understand, how does it learn to climb the hill for the first time if the reward is always -1? Is it purely random at first?
Why does introducing exploration make it find the first solution much later (considering that no positive reward is given until it makes it to the top for the first time)? Is it that sticking to a policy, even though it is a random one (as the Q table is randomly filled at the beginning), is better than changing the policy randomly (as it does while exploring), at least for this case? Thank you.
is he implementing epsilon greedy q-learning exactly in this tutorial?
Why discrete_state = (state - env.observation_space.low) / ...?
Why not state / discrete_os_win_size?
The state will always be between high and low; I don't get why we need to subtract the low bound.
Indeed! Why subtract env.observation_space.low?
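A tiny worked example of why the low bound is subtracted first: MountainCar positions run from about -1.2 to 0.6, so dividing the raw state by the bucket width would give negative indices (the numbers here are just for illustration):

    import numpy as np

    low, high = np.array([-1.2, -0.07]), np.array([0.6, 0.07])
    win = (high - low) / [20, 20]

    state = np.array([-1.0, 0.001])
    print((state / win).astype(int))           # something like [-11  0] -- negative, useless as an index
    print(((state - low) / win).astype(int))   # something like [ 2 10] -- a valid bucket index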
in the while loop, we never set done = True. How does it ever break out of the loop to a next episode then?
The step function will set it to True once the goal is accomplished (or when the environment's 200-step time limit runs out).
It is really nice of you that you are putting in so much effort,
but I didn't really understand what you were trying to convey.
It was just not clear.
Hello,
why not have the discount be 1? I don't really get the necessity of having a decaying reward over the backpropagation.
They are all variables to be tweaked and tested. Have at it!
I think you haven't actually implemented the exploration - exploitation process while taking the action, even though you specified the epsilon value and decay.
Yes! I believe it was not implemented as well, the epsilon and its decaying were defined but were never used in the process...
I don't understand: how did your loop not break on episodes 0, 1, 2, and so on, and still print "we made it on episode:"? Because when it completes the goal the episode ends, right? done == True.
Did you solve it? I don't understand.
@@zheli189 I ran the same code, and it always printed that it reached the goal on each episode, but after every epoch it reached the goal quicker and in a more optimised way (obviously).
I can't believe it; it's some kind of illusion in the video. How can it reach the next episode without breaking out of the while loop, while at the same time the condition to print "reached goal" becomes true?
Why is mine not learning nearly as fast or as consistently as his? I even downloaded his code to run on my computer (a MacBook Air) and it didn't work. Is this hardware dependent?
The window shows (not responding), so I thought it crashed, but help on Discord taught me I just had to wait. It only updates every 3000 episodes.
9:57 I thought it was always this: Q[state, action] = Q[state, action] + lr * (reward + gamma * np.max(Q[new_state]) - Q[state, action])
The book gives the formula you posted as well; I don't know where he gets his formula from (he says his source is Wikipedia, but Wikipedia also shows this one...).
In fact the two formulas are the same. I was confused by this too. But:
Q[state, action] + lr * (reward + gamma * np.max(Q[new_state]) - Q[state, action])   (multiply lr by the content of the parentheses)
Q[state, action] + lr * reward + lr * gamma * np.max(Q[new_state]) - lr * Q[state, action]   (reorder the terms and add new parentheses)
(Q[state, action] - lr * Q[state, action]) + (lr * reward + lr * gamma * np.max(Q[new_state]))   (in the first parentheses, Q[state, action] is a common factor, so)
Q[state, action] * (1 - lr) + (lr * reward + lr * gamma * np.max(Q[new_state]))   (in the second parentheses, lr is a common factor, so)
Q[state, action] * (1 - lr) + lr * (reward + gamma * np.max(Q[new_state]))
This is the version of the formula in the video.
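A quick numeric sanity check (arbitrary illustrative numbers) that the two forms really give the same result:

    lr, gamma = 0.1, 0.95
    q_sa, reward, max_future_q = -1.5, -1.0, -0.8

    form_video = (1 - lr) * q_sa + lr * (reward + gamma * max_future_q)
    form_book = q_sa + lr * (reward + gamma * max_future_q - q_sa)
    print(abs(form_video - form_book) < 1e-12)   # True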
Subscribed.
Why do you reset the q-value to 0 if the agent reaches the goal? I don't think setting the q-value to zero indicates the reward was given.
I have the same question. Is it maybe because we arbitrarily set the max value of Q to zero when initializing/randomizing our Q-table?
PS: Great tutorial, sentdex, thanks for your effort.
How exactly does the epsilon make the agent explore random states, when in the code it is not used to take a random action?
Well there is a correction in the video. In the code he has used the epsilon to take a random action but here in the video he has just skipped that part.
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It doesn't work, because why should it. Why am I, again, the only person for which it doesn't work, who also thinks it shouldn't work, but it should work and does work for everyone else. And the debugger doesn't work.
EDIT: I reimplemented everything myself and I will have a video about this, since this one needs a refresher.
Every now and then a bug appears in your code
This is the error that I get when I execute np.argmax(q_table[discrete_state]) or print(q_table[discrete_state])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Can you tell what the logic is behind choosing 25000 iterations, and why can't I choose 1000? What effect does it have?
I'm just waiting until he combines q learning and neural networks
Lol, was about to ask why we never used epsilon xD
hahhaha
If you want to see how many "moves" it takes to finish each time, I've added these few lines to the code:

moves = 0        # just above the line below in his code
while not done:
    moves += 1   # add this counter in the while loop

Then, down where you print:

mov.append(moves)
print(f"we made it on episode {episode} in {moves} Moves best so far is {min(mov)} moves MA20 on moves = {sum(mov[-20:])/20}")

You'll also need to initialize a list, probably near the variables:

mov = []
I feel like there is a show in you just reviewing agents as a sports commentator. If you were to include the distance travelled as a property of the fitness would it try to optimise travelling as little distance as possible?
lmfao. Just wait til we get back to doing something like SC2. Totally commentate that :)
Distance traveled basically IS a property. You have 200 frames or until completion. Each step is a -1. If you can do it in fewer steps, that's fewer -1's, therefore more ideal for a higher Q.
@@sentdex ahh although that doesnt take into account your velocity?
well the fastest way to completion is @ higher velocity, no?
@@sentdex yeah. I suppose my point is more that steps != Distance, I was suggesting trying to minimise the number of times it ramps up and down by measuring the distance it travels. Your steps are just ticks right? Its trying to complete as fast as possible not in as little distance as possible. What do I know, I write in a language named after coffee
shouldnt we be using epsilon somewhere ?
oh nvm
24:54 Is there a way to change the amount of time it gets?
There is a huge precision loss in epsilon; any way to make it work?
Epsilon: 0.8, decay: 0.00016032064128256515
results in 0.76793..., which is not even close to the real value; Python lost 0.03 somewhere. Any solution to these numerical problems?
How many coffee mugs do you have? I see a unique one per every video!
He mentioned a lot of videos ago that he has a collection of mugs and that supporting his channel contributes to growing it.
Can you use the 'with' statement with the envs?
In what way?
Why does my system take a lot of time to run it, and why are the frames slow, like everything runs in slow motion? Is it because of the library upgrade of Gym to 0.26.2? Yours looks very smooth when it's running. I have a 10th Gen Core i5 machine with 8 GB of RAM.
Add env.metadata['render_fps'] = XXXX after you initiate the environment; it'll speed it all up as well.
Could you please explain this line: discrete_state = (state - env.observation_space.low) / discrete_os_win_size. Why are we subtracting env.observation_space.low (we could choose high as well), and why divide by discrete_os_win_size?
I am getting an error: AttributeError: module 'time' has no attribute 'clock'
on the command env.render(). Please help me out.
Had the same problem... use a virtual environment to downgrade to Python 3.7 and it should work.
Can we assign this to GPU? Can you show how?
It shows me an error:
'TimeLimit' object has no attribute 'goal_position'
Yo yo this is the CPU song!!
good videos but seriously use vim (still run python by setting up f9)
If any element of the state is equal to its high value, wouldn't the corresponding discrete_state be 20? If so, that's outside the size of the q_table. I think discrete_os_win_size should be calculated by dividing by DISCRETE_OS_SIZE - 1.
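One hedged alternative (reusing the tutorial's env, DISCRETE_OS_SIZE and discrete_os_win_size names, which are assumed to be defined) is to clip the index so an observation exactly at the high bound still lands in the last bucket:

    import numpy as np

    def get_discrete_state(state):
        discrete_state = (state - env.observation_space.low) / discrete_os_win_size
        # clip so a value exactly at the high bound maps to index 19 instead of 20
        clipped = np.clip(discrete_state.astype(int), 0, np.array(DISCRETE_OS_SIZE) - 1)
        return tuple(clipped)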
Why do we have to define the "for episode in range(EPISODES)" loop? Normally, as long as the task is not done, "while not done" is True and the agent should keep trying to reach the goal position. But it does not! Why?
Thank you in advance.
running gym 0.26.1. Code would not break episodes at 200 steps. The addition of the truncated parameter from env.step() changed the logic required to reproduce the model. Created a truncated = False variable under the done = False variable and used "while not done and not truncated:" and "if not done and not truncated" to replicate the 200 step behavior.
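A minimal sketch of that loop shape under gym 0.26+ (assuming env = gym.make("MountainCar-v0"); the placeholder policy and the comments are illustrative):

    done, truncated = False, False
    while not done and not truncated:
        action = env.action_space.sample()                   # placeholder policy
        new_state, reward, done, truncated, _ = env.step(action)
        if not done and not truncated:
            pass                                             # ordinary Q-table update would go here
        elif new_state[0] >= env.unwrapped.goal_position:    # goal reached, not just the 200-step cutoff
            print("reached the goal")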
15:20 so, at this stage my AI is just going the wrong way :(
mine too
To anyone currently stuck, try observation,info = env.reset() and discrete_state = get_discrete_state(observation) instead of directly passing env.reset() into get_discrete_state.
also double q learning is better, and i'm trying to understand zhiqing's solution on scoreboard (gave up on the maxqn stuff next to it...someone help....)
Fortunately VS code came out. Sublime is for the martyrs.