Q Learning Algorithm and Agent - Reinforcement Learning p.2
- Published: 28 Jan 2025
- Welcome to part 2 of the reinforcement learning tutorial series, specifically with Q-Learning. We've built our Q-Table which contains all of our possible discrete states. Next we need a way to update the Q-Values (value per possible action per unique state), which we will do using the Q-Learning algorithm!
Text-based tutorial and sample code: pythonprogramm...
#reinforcementlearning #machinelearning #python
A tip for anyone using large numbers like at 2:09 (episodes etc.) in Python 3.6+ (I think). You can write integers using underscores to separate parts just like you would use commas, so 25000 becomes 25_000. This has saved me a lot of time because I don't have to count the digits to figure out if I have a million iterations or just a hundred thousand :D
Wow I so rarely learn new stuff like this now. Very cool! Thanks for sharing this.
Thanks Riaz for that great tip.
Nice trick! But I just end up using scientific notation for large numbers with 0's, like 25e3 for 25,000, 1e-99 etc.
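A quick sketch of both notations side by side (note that 25e3 is a float, so cast it if you need an integer count):

    EPISODES = 25_000                 # underscores are purely visual separators (Python 3.6+)
    print(EPISODES == 25000)          # True
    EPISODES_SCI = int(25e3)          # scientific notation gives a float, so cast before using it as a count
    print(EPISODES_SCI == EPISODES)   # True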
For anyone doing this in the new gymnasium library, here is what you need to adapt in the code. The .render() method call is no longer necessary; you now pass the render mode to the constructor, e.g. env = gym.make("MountainCar-v0", render_mode='human'). So to selectively render or not render, you need to create the environment with or without render_mode each episode. The return type of the step function also changed: it now returns termination and truncation as separate variables, so you can call new_state, reward, termination, truncation, _ = env.step(action) and then done = termination or truncation.
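A minimal sketch of that gymnasium-style loop (gymnasium assumed to be installed; the render flag name is just illustrative):

    import gymnasium as gym

    render_this_episode = True   # illustrative flag
    env = gym.make("MountainCar-v0", render_mode="human" if render_this_episode else None)

    state, info = env.reset()    # reset now returns (observation, info)
    done = False
    while not done:
        action = env.action_space.sample()
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated   # combine the two end-of-episode signals
    env.close()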
hey also can u help with index out of bounds error
Don't forget to cover the improvements to DQN (double, dueling, noisy nets for exploration, prioritized experience replay, and so on). Wish you luck with this tutorial, it's great.
watched this tutorial two to three times and understood what u were trying to convey
keep up the good work!!!
Thanks, dude, for the tutorial. It's funny to see that each successive episode has fewer views :D
The video is great and helps people get the gist of Q-learning quickly. I just never expected I would start to laugh while watching. I hope I can give a presentation like sentdex did one day, so relaxed and easy, and make the audience feel relaxed and easy too. Thanks man!
Hey Man - great channel, thanks for sharing!
We can add the epsilon (exploitation vs. exploration) and a percentage of success to the code (sentdex does it at 27:07)
Under the other constants (the ones in CAPS):

epsilon = 0.001   # exploration vs exploitation: 1.0 is fully random, 0 is pure policy
counter = 1

while not done:
    if np.random.rand() < epsilon:   # exploration option
        action = env.action_space.sample()
    else:                            # exploitation if we are not exploring
        action = np.argmax(q_table[discrete_state])

    # ......... all other code .........

    elif new_state[0] >= env.goal_position:
        print(f'we made it on episode {episode}')
        q_table[discrete_state + (action,)] = 0
        counter += 1
        percentage = counter / (episode * 0.01)
        print('success rate: ', percentage)

if epsilon > 0.01:
    epsilon = epsilon * epsilon_decay
Hope this helps others!
Wish you luck.
Thanks for highlighting when he uses epsilon! I was so confused why we defined it but didn't do anything with it.
It took me a little bit to understand the get_discrete_state() function. Essentially what's happening is we have already decided to bucket the observation values into 20 discrete buckets between the highest and lowest possible values. The discrete_os_win_size is the step between each bucket value. By computing (state - env.observation_space.low) we get some value between zero and (high - low). Then, by dividing by discrete_os_win_size, we get the i'th bucket the observation falls into. By returning .astype(np.int) it can be used as an index for the q_table. The index is the unique position/velocity bucket.
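A small standalone sketch of that bucketing (the bounds are roughly MountainCar's position/velocity limits, written out here just for illustration):

    import numpy as np

    low = np.array([-1.2, -0.07])
    high = np.array([0.6, 0.07])
    discrete_os_win_size = (high - low) / [20, 20]

    def get_discrete_state(state):
        # shift so the lowest observation maps to bucket 0, then divide by the bucket width
        discrete_state = (state - low) / discrete_os_win_size
        return tuple(discrete_state.astype(int))

    print(get_discrete_state(np.array([-0.5, 0.001])))   # the buckets: position 7, velocity 10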
Love the tip that you can use underscores to semantically separate numbers by thousands! I deal with big numbers all the time! Great tip!
Might be helpful (I'm at about 8:50 right now) to periodically pull up an image of the Q update function to remind us of which part of the Q function we are working on at a given time (e.g., "now we're working on current Q as can be seen here" or "this is where max future Q is"). That way, we can match the code to the function in our heads.
Oops, looks like you did it a few seconds later. Still a handy tool for sure.
It's been 2 days since the last video!! Can't wait for the next episodes, loving this series, it came just in time.
I can't wait for this series! Thanks!
This is a lot like how ants move at random to search for food, but once they find a source,
they leave pheromone trails for other ants to exploit that information.
The way you explain the concept was really good;
you made it easy for me to understand.
normies: watching the video and learning
Me an intellectual: *Notices a new cup in every video* Damnnnn yet another amazing cup :)
Thank you, man, you saved my life haha. You explained this tutorial very clearly!
Was wondering whether or not you were gonna actually use epsilon for choosing actions , glad I watched until the end XD
For anyone getting the following error:
discrete_state = (np.array(state, dtype=object) - env.observation_space.low) / discrete_os_win_size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'dict' and 'float'
It's because 'state' is a tuple of an ndarray and a dictionary. Replace this line: '(np.array(state, dtype=object) - env.observation_space.low)' with this line: '(np.array(state[0], dtype=object) - env.observation_space.low)'.
Here, state[0] is simply accessing the ndarray. I don't fully understand the purpose of the dictionary yet. However, the output will not be the same as that in the video for which index you reach on your Q-table
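A short sketch of the two common ways to handle the new reset return value (newer gym/gymnasium versions return an (observation, info) tuple; this reuses the tutorial's env and get_discrete_state names):

    # option 1: unpack the tuple explicitly
    state, info = env.reset()
    discrete_state = get_discrete_state(state)

    # option 2: index the observation directly
    discrete_state = get_discrete_state(env.reset()[0])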
Before this video came out I ended up finishing the code for this demo using a shotgun approach for changing the weights. It ended up taking only around 80,000 iterations though, so there's that.
So today I skimmed through the book by Geron and wished for reinforcement learning tutorials.
Harrison is new Aladdin. 😀
How is the epsilon affecting the model? Even though we are changing the value of epsilon, what effect does it have on any of the parameters of the model?
It is just inserting random actions, which helps you build a better q-table: basically it goes to places it normally wouldn't go and fills in the q-values of those places (if I understand it correctly).
I think there's an error as the epsilon term has been declared and it is decaying, but the agent does not use it for explore-exploit tradeoff. To fix this issue, I would do this in the loop:
while not done:
    if np.random.uniform(low=0, high=1) < epsilon:
        action = np.random.choice(env.action_space.n)   # explore
    else:
        action = np.argmax(q_table[pos][vel])            # exploit

    new_state, reward, done, _ = env.step(action)
    new_pos, new_vel = get_discrete_state(new_state)
    if render:
        env.render()
It worked so well without epsilon due to your optimistic initialization of the Q table. Values between -2 and 0 are way higher than the actual correct values, so argmax selects unexplored actions, i.e., implicit exploration.
I think that's the whole point of this:
STEP 1 - explore new possibilities randomly until we reach the goal
STEP 2 - optimize the way we get there
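A tiny illustration of that optimistic-initialization effect (hypothetical numbers): every step pays -1, so the true Q-values end up well below zero, while fresh entries drawn from [-2, 0) look comparatively attractive to argmax until they get updated:

    import numpy as np

    np.random.seed(0)
    optimistic_row = np.random.uniform(low=-2, high=0, size=3)   # like a fresh q_table row
    updated_row = optimistic_row.copy()
    updated_row[1] = -25.0             # pretend action 1 has been explored and found mediocre
    print(np.argmax(optimistic_row))   # whichever initial entry happens to be largest
    print(np.argmax(updated_row))      # never action 1: the unexplored entries still look better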
You have finally done it, Harrison.
25:54 that's the sound of a true nut if I've ever heard one
dude you're so good! so helpful. thanks!
I love your collection of mugs
Thanks for these videos. They helped me with a task for my AI class :)
Mine just never learns to beat it I’m so confused
same man
you have to put env.reset() inside for episode in range(episodes)
Did anybody get "ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part" and solve it?
Because env.reset now returns two values, not one, you can't pass it directly to the function. Instead I did this: observation, info = env.reset(), then sent the observation to the function.
Will someone help me fix this error:
File "qLearning.py", line 22, in
dicrete_state = get_discrete_state(env.reset())
File "qLearning.py", line 19, in get_discrete_state
discrete_state = (state - env.observation_space.low) / discrete_os_win_size
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It takes place around 4:26 in the video.
I watched up to 2:08 at the time of asking this question.
If the rewards are always -1 except for 0 at the finish, how does the discount factor impact the Q-table or the agent? Aren't we exclusively interested in the future reward, which is always the long-term reward? What happens if we use 1 as the discount factor?
I don't really understand the get_discrete_state function, can you please help me?
More specifically: discrete_state = (state - env.observation_space.low) / discrete_os_win_size
Did you solve it? I have the same problem in this line. The report is TypeError: unsupported operand type(s) for -: 'dict' and 'float'
@@chaodonghuang9130 If you print(env.reset()) you'll see that env.reset() produces a tuple containing something like "(array([-0.5732655, 0. ], dtype=float32), {})". You want the array out of that tuple so you can subtract env.observation_space.low from it. We can get it by referencing its index position: if you test print(env.reset()[0]) you'll see that it prints only the array [-0.5732655, 0. ]. To get past this error you'll want to change your code to something like discrete_state = get_discrete_state(env.reset()[0]). Also, .astype(np.int) is deprecated; you'll want to just use .astype(int).
5:20 Why did the 0th entry come out as max? It was supposed to be the 1st, right?
SamanwayGhatak He restarted Python, which generated a new set of random values.
12:15 You say that 0 is the reward for completing things, "nothing, just punishment". So if it's the reward, then why are you using it as a Q-value?? I thought that you would use it IN THE FORMULA to calculate the actual Q-value. But instead you just straight override the Q-value with the reward = 0. Could you or somebody clarify what is happening here?
Punishment and reward are all relative. If all punishment is -1 but the reward is 0, it's functionally the same as all punishment being 0 and the reward being 1.
My guess on why this has worked well without epsilon is that you have been punishing "suboptimal" choices for the agent in each state until the first random sequence led it to the first successful completion and kickstarted the learning.
DON'T LEAVE AROUND THE 24 MINUTE MARK. Addendum @ 26:39
Getting an error at : elif new_state[0] >= env.goal_position: 'TimeLimit' object has no attribute 'goal_position' … any help would be appreciated
For anyone else that has the problem - here is the solution : elif new_state[0] >= env.unwrapped.goal_position:
Thank you!
@@petermills1397 Thank you ! Finally found a solution that works !
I do not understand how the epsilon is going to have an effect in the code you wrote.
Is anyone else getting an out of bounds error? I copied and pasted the code from the site, but every time the second value in new_discrete_state is something like -60 or -70, so it is out of bounds of the q_table.
yeah, found any fix?
@@utkarshsharma6434 I assume this bug is caused by the gym updates that took place over the years. I have fixed the issue by doing the following:
Step 1: Rewrite get_discrete_state() by removing the [0]

def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))

Step 2: Before the while loop, add a [0]

discrete_state = get_discrete_state(env.reset()[0])
@@SirDerRosen yeah thanks that got fixed
Apparently the gym update has changed lots of stuff
Where is the epsilon value used? I see it being decremented by the decay value but it is never used in any calculations?
Hey, check my comment above.
Action 0: push the cart to the left
Action 1: don't push the cart (no acceleration)
Action 2: push the cart to the right
Is there a parameter displaying the amount of force executed to the cart? Is the amount of force executed to the cart a random number? Thank you!
I believe the force is always the same; you only decide whether you push left, nowhere, or right. The amount of force applied is more or less the number of consecutive steps you repeat the command. So, for instance, soft left would be [1, 0, 1, 0, 1, ...] and strong left would be [1, 1, 1, 1, 1, ...].
I think your agent discovered inertia and gravity! The pattern is making the most of the inertia by swinging to the other side!
You should use tqdm for progress bar in future videos. It is pure Python and super useful.
We use tqdm all the time...pretty sure even here in this series
Doing Great...but maybe you could use PyCharm for AutoUpdates. I know you know about it but I don't know why you don't use it.
Would it not be a good idea to do a numpy.random.seed so you can better compare your runs when tweaking parameters?
20,000 episodes and it still has not reached the top of the hill; is the code hardware dependent?
It may depend on the initial Q table. Remember that it was generated randomly. He should have fixed the random seed to get something that is reproducible.
did you update the action?
I know it's been 1 year; this is just for the other people who may read this.
@@erictheawesomest Well I did, but still never reached the top
I was following the tutorial and got to 5:00 when I had different values. I was freaking out, thinking I had typed something wrong. It took me a bit until I realized they were random values.
Update: Actually, why is the discrete_state sometimes (6,10) and sometimes (7,10)?
I have the same question actually
Are we going all the way to PPO or A3C, or maybe something with "TensorFlow Probability", during this RL series? It's asking a lot, I know.
We'll see. I'd like to cover quite a bit.
Awesome. I'm with you all the way.
Thanks so much for your efforts
I have used the same code but the agent just doesn't solve it. Using the parameters in the video it doesn't even solve it once in the 20000 episodes. I experimented with the parameters and the best I got is an avg reward of around -198 after 20000 iterations. Maybe this has something to do with the version of gym I was using?
same
@sentdex As always, great video! However, at 2:15, the discount rate is better understood as the opposite of how you said it. Per Wikipedia, "it has the effect of valuing rewards received EARLIER higher than those received LATER". This is consistent with complexity theory, where we want to get started quickly; then, once we get on the right path, we start delaying gratification/reward so we increase the time we spend exploring, forcing ourselves to look for a better result.
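A quick hedged sketch of what the discount does to a MountainCar-style reward stream (-1 per step, 0 at the goal); the episode length here is just an example:

    DISCOUNT = 0.95

    rewards = [-1] * 99 + [0]   # a hypothetical 100-step episode
    discounted_return = sum((DISCOUNT ** t) * r for t, r in enumerate(rewards))
    print(discounted_return)    # roughly -19.9, versus -99 undiscounted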
Great tutorial, just shouldn't there be a termination condition somewhere in the outer loop?
How would someone replay a specific episode? I took the frame count of every "winner" and stored the fastest copy of the q_table into a variable, then reran the rendering loop (see code below), but the frame count it originally beat it in always comes in lower than the replay's frame count.
The "replay" at the end of the program I set up:
print("Playing best episode: ", fastest_episode[0], " with ", fastest_episode[1], " frames")
done = False
render = True
env = gym.make('MountainCar-v0', render_mode='human')
discrete_state = get_discrete_state(env.reset()[0])
frame = 0
while not done:
frame += 1
env.render()
action = np.argmax(fastest_episode[2][discrete_state])
new_state, reward, done, truncated, _ = env.step(action)
done = done or truncated
if done:
break
new_discrete_state = get_discrete_state(new_state)
discrete_state = new_discrete_state
print("final frame: ", frame)
env.close()
I figured it out, rather than using the q_table, I saved all the actions into a list and then replayed the actions like so:
print("Playing best episode: ", fastest_episode[0], " with ", fastest_episode[1], " frames")
done = False
render = True
env = gym.make('MountainCar-v0', render_mode='human')
env.reset()
frame = 0
env.render()
for a in fastest_episode[2]:
frame += 1
new_state, reward, done, truncated, _ = env.step(a)
done = done or truncated
if done:
break
I am getting an error "ImportError: sys.meta_path is None, Python is likely shutting down" after running the script, Please help
you have to use env.close() at the end
@@nadjibbendaoud2719 that doesn't help still getting the same error!
At 8:55, can anyone explain to me why adding the (action, ) will access the exact q-value of that state?
How do you create q table for two agents in an environment whose actions are based on a coin-flip and both have opposite goals?
Thanks for the vids
Cool to see RL course, but
line 78, in
elif new_state[0] >= env.goal_position:
AttributeError: 'TimeLimit' object has no attribute 'goal_position'
Exception ignored in:
Would need to see your full code. I expect you made a typo somewhere compared to mine. come join discord.gg/sentdex. I don't edit in/out coding in my videos, so it definitely works as seen in the videos :P Probably just a simple mistake.
Same
@@sentdex
EDIT:: For this game goal_position is 0.5 so you can replace it with that value and it works.
I just copy/pasted from the full code on your website:
# objective is to get the cart to the flag.
# for now, let's just move randomly:

import gym
import numpy as np

env = gym.make("MountainCar-v0")

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
SHOW_EVERY = 3000

DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE

# Exploration settings
epsilon = 1  # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(np.int))  # we use this tuple to look up the 3 Q values for the available actions in the q-table

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False

    while not done:
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)

        if episode % SHOW_EVERY == 0:
            env.render()

        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])

            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]

            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q

        # Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

env.close()
----------------------
Same error:
Traceback (most recent call last):
File "C:\Users\David\Desktop\mine\aaSUMMER_PROJ_2019\q_learn_sentdex - Copy.py", line 78, in
elif new_state[0] >= env.goal_position:
AttributeError: 'TimeLimit' object has no attribute 'goal_position'
---------------------
The game starts and then this pops up after it runs one time.
pip install --upgrade gym
@@douglasferreira3506 Thanks a lot, you've solved a big problem for me!!!
I don't understand, how does it learn to climb the hill for the first time if the reward is always -1? Is it purely random at first?
Why does introducing exploration make it find the first solution much later (considering that no positive reward is given until it makes it to the top for the first time)? Is it that sticking to a policy, even though it is a random one (as the Q table is randomly filled at the beginning), is better than changing the policy randomly (as it does while exploring), at least for this case? Thank you.
is he implementing epsilon greedy q-learning exactly in this tutorial?
Why discrete_state = (state - env.observation_space.low) / ...?
Why not state / discrete_os_win_size?
The state will always be between high and low; I don't get why we need to subtract the low bound.
Indeed! Why subtract env.observation_space.low?
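A tiny worked example of why the low bound is subtracted first: MountainCar positions run from about -1.2 to 0.6, so dividing the raw state by the bucket width would give negative indices (the numbers here are just for illustration):

    import numpy as np

    low, high = np.array([-1.2, -0.07]), np.array([0.6, 0.07])
    win = (high - low) / [20, 20]

    state = np.array([-1.0, 0.001])
    print((state / win).astype(int))           # something like [-11  0] -- negative, useless as an index
    print(((state - low) / win).astype(int))   # something like [ 2 10] -- a valid bucket index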
in the while loop, we never set done = True. How does it ever break out of the loop to a next episode then?
The step function will set it to True once the goal is accomplished (or when the environment's 200-step time limit runs out).
It is really nice of you that you are putting in so much effort,
but I didn't really understand what you were trying to convey.
It was just not clear.
Hello,
why not have the discount be 1? I don't really get the necessity of having a decaying reward over the backpropagation.
They are all variables to be tweaked and tested. Have at it!
I think you haven't actually implemented the exploration - exploitation process while taking the action, even though you specified the epsilon value and decay.
Yes! I believe it was not implemented as well, the epsilon and its decaying were defined but were never used in the process...
I don't understand: how did your loop not break on episodes 0, 1, 2, and so on, and still print "we made it on episode:"? Because when it completes the goal the episode ends, right? done == True.
Did you solve it? I don't understand.
@@zheli189 I ran the same code, and it always printed that it reached the goal on each episode, but after every epoch it reached the goal quicker and in a more optimised way (obviously).
I can't believe it; it's some kind of illusion in the video. How can it reach the next episode without breaking out of the while loop, while at the same time the condition to print "reached goal" becomes true?
Why is mine not learning nearly as fast or as consistently as his? I even downloaded his code to run on my computer (a MacBook Air) and it didn't work. Is this hardware dependent?
The window shows (not responding), so I thought it crashed, but help on Discord taught me I just had to wait. It only updates every 3000 episodes.
9:57 I thought it was always this: Q[state, action] = Q[state, action] + lr * (reward + gamma * np.max(Q[new_state]) - Q[state, action])
The book gives the formula you posted as well; I don't know where he gets his formula from (he says his source is Wikipedia, but Wikipedia also shows this one...).
In fact the two formulas are the same. I was confused by this too. But:
Q[state, action] + lr * (reward + gamma * np.max(Q[new_state]) - Q[state, action])   (multiply lr by the content of the parentheses)
Q[state, action] + lr * reward + lr * gamma * np.max(Q[new_state]) - lr * Q[state, action]   (reorder the terms and add new parentheses)
(Q[state, action] - lr * Q[state, action]) + (lr * reward + lr * gamma * np.max(Q[new_state]))   (in the first parentheses, Q[state, action] is a common factor, so)
Q[state, action] * (1 - lr) + (lr * reward + lr * gamma * np.max(Q[new_state]))   (in the second parentheses, lr is a common factor, so)
Q[state, action] * (1 - lr) + lr * (reward + gamma * np.max(Q[new_state]))
This is the version of the formula in the video.
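A quick numeric sanity check (arbitrary illustrative numbers) that the two forms really give the same result:

    lr, gamma = 0.1, 0.95
    q_sa, reward, max_future_q = -1.5, -1.0, -0.8

    form_video = (1 - lr) * q_sa + lr * (reward + gamma * max_future_q)
    form_book = q_sa + lr * (reward + gamma * max_future_q - q_sa)
    print(abs(form_video - form_book) < 1e-12)   # True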
Subscribed.
Why do you reset the q-value to 0 if the agent reaches the goal? I don't think setting the q-value to zero indicates the reward was given.
I have the same question. Is it maybe because we arbitrarily set the max value of Q to zero when initializing/randomizing our Q-table?
PS: Great tutorial, sentdex, thanks for your effort.
How exactly does the epsilon make the agent explore random states, when in the code it is not used to take a random action?
Well there is a correction in the video. In the code he has used the epsilon to take a random action but here in the video he has just skipped that part.
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It doesn't work, because why should it. Why am I, again, the only person for which it doesn't work, who also thinks it shouldn't work, but it should work and does work for everyone else. And the debugger doesn't work.
EDIT: I reimplemented everything myself and I will have a video about this, since this one needs a refresher.
Every now and then a bug appears in your code
This is the error that I get when I execute np.argmax(q_table[discrete_state]) or print(q_table[discrete_state])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Can you tell what the logic is behind choosing 25000 iterations, and why can't I choose 1000? What effect does it have?
I'm just waiting until he combines q learning and neural networks
Lol, was about to ask why we never used epsilon xD
hahhaha
If you want to see how many "moves" it takes to finish each time, I've added these few lines to the code:

moves = 0        # just above the line below in his code
while not done:
    moves += 1   # add this counter in the while loop

Then, down where you print:

mov.append(moves)
print(f"we made it on episode {episode} in {moves} Moves best so far is {min(mov)} moves MA20 on moves = {sum(mov[-20:])/20}")

You'll also need to initialize a list, probably near the variables:

mov = []
I feel like there is a show in you just reviewing agents as a sports commentator. If you were to include the distance travelled as a property of the fitness would it try to optimise travelling as little distance as possible?
lmfao. Just wait til we get back to doing something like SC2. Totally commentate that :)
Distance traveled basically IS a property. You have 200 frames or until completion. Each step is a -1. If you can do it in fewer steps, that's fewer -1's, therefore more ideal for a higher Q.
@@sentdex ahh although that doesnt take into account your velocity?
well the fastest way to completion is @ higher velocity, no?
@@sentdex yeah. I suppose my point is more that steps != Distance, I was suggesting trying to minimise the number of times it ramps up and down by measuring the distance it travels. Your steps are just ticks right? Its trying to complete as fast as possible not in as little distance as possible. What do I know, I write in a language named after coffee
shouldnt we be using epsilon somewhere ?
oh nvm
24:54 Is there a way to change the amount of time it gets?
There is a huge precision loss in epsilon; any way to make it work?
Epsilon: 0.8, decay: 0.00016032064128256515
results in 0.76793..., which is not even close to the real value; Python lost 0.03 somewhere. Any solution to these numerical problems?
How many coffee mugs do you have? I see a unique one per every video!
He mentioned a lot of videos ago that he has a collection of mugs and that supporting his channel contributes to growing it.
Can you use the 'with' statement with the envs?
In what way?
Why does my system take a lot of time to run it, and why are the frames slow, like everything runs in slow motion? Is it because of the library upgrade of Gym to 0.26.2? Yours looks very smooth when it's running. I have a 10th Gen Core i5 machine with 8 GB of RAM.
Add env.metadata['render_fps'] = XXXX after you initiate the environment; it'll speed it all up as well.
Could you please explain this line: discrete_state = (state - env.observation_space.low) / discrete_os_win_size. Why are we subtracting env.observation_space.low (we could choose high as well), and why divide by discrete_os_win_size?
I am getting an error: AttributeError: module 'time' has no attribute 'clock'
on the command env.render(). Please help me out.
Had the same problem... use a virtual environment to downgrade to Python 3.7 and it should work.
Can we assign this to GPU? Can you show how?
It shows me an error:
'TimeLimit' object has no attribute 'goal_position'
Yo yo this is the CPU song!!
good videos but seriously use vim (still run python by setting up f9)
If any element of the state is equal to its high value, wouldn't the corresponding discrete_state be 20? If so, that's outside the size of the q_table. I think discrete_os_win_size should be calculated by dividing by DISCRETE_OS_SIZE - 1.
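One hedged alternative (reusing the tutorial's env, DISCRETE_OS_SIZE and discrete_os_win_size names, which are assumed to be defined) is to clip the index so an observation exactly at the high bound still lands in the last bucket:

    import numpy as np

    def get_discrete_state(state):
        discrete_state = (state - env.observation_space.low) / discrete_os_win_size
        # clip so a value exactly at the high bound maps to index 19 instead of 20
        clipped = np.clip(discrete_state.astype(int), 0, np.array(DISCRETE_OS_SIZE) - 1)
        return tuple(clipped)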
Why do we have to define the "for episode in range(EPISODES)" loop? Normally, as long as the task is not done, "while not done" is True and the agent should keep trying to reach the goal position. But it does not! Why?
Thank you in advance.
running gym 0.26.1. Code would not break episodes at 200 steps. The addition of the truncated parameter from env.step() changed the logic required to reproduce the model. Created a truncated = False variable under the done = False variable and used "while not done and not truncated:" and "if not done and not truncated" to replicate the 200 step behavior.
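A minimal sketch of that loop shape under gym 0.26+ (assuming env = gym.make("MountainCar-v0"); the placeholder policy and the comments are illustrative):

    done, truncated = False, False
    while not done and not truncated:
        action = env.action_space.sample()                   # placeholder policy
        new_state, reward, done, truncated, _ = env.step(action)
        if not done and not truncated:
            pass                                             # ordinary Q-table update would go here
        elif new_state[0] >= env.unwrapped.goal_position:    # goal reached, not just the 200-step cutoff
            print("reached the goal")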
15:20 so, at this stage my AI is just going the wrong way :(
mine too
To anyone currently stuck, try observation,info = env.reset() and discrete_state = get_discrete_state(observation) instead of directly passing env.reset() into get_discrete_state.
also double q learning is better, and i'm trying to understand zhiqing's solution on scoreboard (gave up on the maxqn stuff next to it...someone help....)
Fortunately VS code came out. Sublime is for the martyrs.