Train Q-learning Agent with Python - Reinforcement Learning Code Project

  • Published: 25 Jul 2024
  • 💡Enroll to gain access to the full course:
    deeplizard.com/course/rlcpailzrd
    Welcome back to this series on reinforcement learning! As promised, in this video, we're going to write the code to implement our first reinforcement learning algorithm. Specifically, we'll use Python to implement the Q-learning algorithm to train an agent to play OpenAI Gym's Frozen Lake game that we introduced in the previous video. Let's get to it! (A condensed sketch of the training loop appears at the end of this description.)
    Sources:
    Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto
    incompleteideas.net/book/RLboo...
    Playing Atari with Deep Reinforcement Learning by DeepMind Technologies
    www.cs.toronto.edu/~vmnih/doc...
    Thomas Simonini's Frozen Lake Q-learning implementation
    github.com/simoninithomas/Dee...
    OpenAI Gym:
    gym.openai.com/docs/
    TED Talk: • The Rise of Artificial...
    🕒🦎 VIDEO SECTIONS 🦎🕒
    00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
    00:30 Help deeplizard add video timestamps - See example in the description
    08:29 Collective Intelligence and the DEEPLIZARD HIVEMIND
    💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
    👋 Hey, we're Chris and Mandy, the creators of deeplizard!
    👉 Check out the website for more learning material:
    🔗 deeplizard.com
    💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
    🔗 deeplizard.com/resources
    🧠 Support collective intelligence, join the deeplizard hivemind:
    🔗 deeplizard.com/hivemind
    🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
    👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
    🔗 neurohacker.com/shop?rfsn=648...
    👀 CHECK OUT OUR VLOG:
    🔗 / deeplizardvlog
    ❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
    Tammy
    Mano Prime
    Ling Li
    🚀 Boost collective intelligence by sharing this video on social media!
    👀 Follow deeplizard:
    Our vlog: / deeplizardvlog
    Facebook: / deeplizard
    Instagram: / deeplizard
    Twitter: / deeplizard
    Patreon: / deeplizard
    YouTube: / deeplizard
    🎓 Deep Learning with deeplizard:
    Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
    Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
    Learn TensorFlow - deeplizard.com/course/tfcpailzrd
    Learn PyTorch - deeplizard.com/course/ptcpailzrd
    Natural Language Processing - deeplizard.com/course/txtcpai...
    Reinforcement Learning - deeplizard.com/course/rlcpailzrd
    Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
    🎓 Other Courses:
    DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
    Deep Learning Deployment - deeplizard.com/learn/video/SI...
    Data Science - deeplizard.com/learn/video/d1...
    Trading - deeplizard.com/learn/video/Zp...
    🛒 Check out products deeplizard recommends on Amazon:
    🔗 amazon.com/shop/deeplizard
    🎵 deeplizard uses music by Kevin MacLeod
    🔗 / @incompetech_kmac
    ❤️ Please use the knowledge gained from deeplizard content for good, not evil.
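    A condensed sketch of the training loop the video builds is included here for reference. It is not the video's exact code (the full, maintained version is on deeplizard.com); the hyperparameter values are illustrative, and it uses the older Gym API the video was recorded with, where reset() returns a plain state and step() returns four values. See the comments below for the changes newer Gym/Gymnasium releases require.

    import numpy as np
    import gym

    env = gym.make("FrozenLake-v0")
    q_table = np.zeros((env.observation_space.n, env.action_space.n))

    num_episodes = 10000
    max_steps_per_episode = 100
    learning_rate = 0.1            # illustrative values; tune as discussed in the comments
    discount_rate = 0.99
    exploration_rate = 1
    max_exploration_rate = 1
    min_exploration_rate = 0.01
    exploration_decay_rate = 0.001

    rewards_all_episodes = []

    for episode in range(num_episodes):
        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode):
            # Exploration-exploitation trade-off (epsilon-greedy)
            if np.random.uniform(0, 1) > exploration_rate:
                action = np.argmax(q_table[state, :])
            else:
                action = env.action_space.sample()

            new_state, reward, done, info = env.step(action)

            # Q-learning update: weighted average of the old value and the learned value
            q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
                learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

            state = new_state
            rewards_current_episode += reward
            if done:
                break

        # Exponentially decay the exploration rate after each episode
        exploration_rate = min_exploration_rate + \
            (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
        rewards_all_episodes.append(rewards_current_episode)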

Comments • 224

  • @deeplizard
    @deeplizard  5 years ago +7

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/HGeI30uATws

  • @datascience_with_yetty
    @datascience_with_yetty 5 years ago +107

    I must really, really say that I have never found any tutorial that explains things both theoretically and code-wise like you do. You're a GEM. You inspire me; keep up the great work.

    • @deeplizard
      @deeplizard  5 years ago +3

      Thank you so much, Sanni!

  • @tryhardnoob1140
    @tryhardnoob1140 4 years ago +7

    This is exactly what I needed to see. I feel like too many tutorials either fail to give enough explanation, or spend an hour explaining basic programming concepts.

  • @tanismar2979
    @tanismar2979 3 years ago +17

    I needed a course that goes into enough detail to understand what's going on beyond an introductory overview of RL, but not so much that it would be a playlist of 15 videos of an hour and a half each. This course strikes the perfect balance between speed and depth, with great explanations and very useful resources. I think it deserves many more views, and I expect it will get them soon!

    • @gvcallen
      @gvcallen 1 year ago

      Exactly the same. Great videos.

  • @noahturner5212
    @noahturner5212 5 years ago +8

    Definitely the most clear and by far the best tutorial on q learning out there. Thank you!!

  • @christianjt7018
    @christianjt7018 5 years ago +5

    This tutorial is incredibly clear and is the best tutorial I have found about RL on the internet. I have learned a lot; thanks for the effort in creating this and sharing the knowledge.

  • @abdoulayely8284
    @abdoulayely8284 5 years ago +5

    Your way of explaining makes these "obscure" concepts relatively easy to grasp. I went through various resources, but your videos are the best I have found about reinforcement learning for starters. I am looking forward to seeing new content uploaded. Thanks a million.

    • @deeplizard
      @deeplizard  5 years ago

      Thank you, Abdoulaye! Glad you're here!

  • @omkarjadhav13
    @omkarjadhav13 3 years ago +2

    Amazing work!! Can't believe you are explaining RL in such an easy way. Thank you so much!

  • @aditjain3897
    @aditjain3897 3 years ago +1

    Just amazing, the way you explained the concepts is just brilliant. Thank you so much.

  • @Alchemist10241
    @Alchemist10241 3 years ago

    This video summarizes all of the previous lessons. Thanks for the practicality 😎

  • @dmitriys4279
    @dmitriys4279 5 years ago +1

    Thank you for the awesome explanation!!! It's the greatest tutorial I have ever seen about Q-learning.

  • @DanielWeikert
    @DanielWeikert 5 years ago +1

    It's amazing how you guys are able to explain complex topics step by step so well. I am really grateful for that. Your videos are interesting and fun and I really learn a lot

  • @psychodrums8138
    @psychodrums8138 1 year ago +1

    What a beautiful explanation of the code!!!! GREAT!

  • @chamangupta4624
    @chamangupta4624 3 years ago

    There is no better 8-minute video on RL than this; I searched for a month.
    Thanks very much.

  • @CesarAugusto-vu2ev
    @CesarAugusto-vu2ev 5 years ago +1

    You are the best!!! EXCELLENT Explanation, EXCELLENT video! Thank you very much, I became a fan of your work!

  • @iAndrewMontanai
    @iAndrewMontanai 4 years ago +1

    I very rarely give likes to non-music videos, but here I've got to make an exception. These videos deserve thousands; you're awesome ^^

  • @tong9977
    @tong9977 5 years ago +1

    This is very good. Excellent explanation. Thank you.

  • @paedrufernando2351
    @paedrufernando2351 4 years ago

    Why are you such a gem of a person... keep it up... really, hats off.

  • @varunbansal590
    @varunbansal590 3 years ago +1

    You made it so simple, thanks!
    One of the best resources for RL.

  • @timonstwedder3201
    @timonstwedder3201 5 years ago +1

    Great explanation! Thank you

  • @antoinelunaire9462
    @antoinelunaire9462 4 years ago

    Thanks a lot for the great effort. A piece of advice for anyone trying to run the code: visit the blog and copy/paste the explained code, paying attention to the spacing, to avoid mistakes. It gave me somewhat different numbers, but with the same behavior.

  • @Marcaunon
    @Marcaunon 5 years ago +1

    Excellent video series!

  • @zsa208
    @zsa208 3 months ago +1

    Very detailed and easy-to-understand tutorial.

  • @maryannsalva3462
    @maryannsalva3462 4 years ago

    That was really amusing! Great videos and presentation. 😍😘

  • @TheMyrkiriad
    @TheMyrkiriad 3 years ago +7

    After a couple of changes, I managed 75.6%, also converging much faster, after only 2000 episodes. Since there is randomness in this game, figures can change quite a bit from one run to the next, so I gave the best score that came up after several runs.

    • @sreeharshaparuchuri756
      @sreeharshaparuchuri756 1 year ago +2

      Hey, that's cool to hear. Would you mind me asking what you changed? Did you expect the agent to converge that much faster or faster in general wrt the tutorial?

  • @mateusbalotin7247
    @mateusbalotin7247 2 years ago

    Thank you for the video, great!

  • @rashikakhandelwal702
    @rashikakhandelwal702 2 years ago +1

    Thank you for doing this great work. I love the videos :)

  • @housseynenadour2233
    @housseynenadour2233 3 years ago +2

    Thank you for your lesson, so helpful. For this game, even if we take exploration_rate = 0, we can reach the globally optimal policy very quickly (with an average reward per 1000 episodes of 0.7) without getting stuck in a locally optimal policy.

  • @absimaldata
    @absimaldata 3 years ago +8

    I don't know what's wrong with the rest of the world; everyone is making mugged-up, nonsensical tutorials.
    But this one is by far the clearest and best explanation so far.

  • @absimaldata
    @absimaldata 3 years ago +2

    Best tutorial so far.

  • @abdurrakib5324
    @abdurrakib5324 4 years ago

    Great tutorial. It would be really helpful if you could make a tutorial on implementing the same game in a probabilistic setting based on an MDP.

  • @ericchu3226
    @ericchu3226 2 years ago +1

    Very nice series...

  • @sametozabaci4633
    @sametozabaci4633 5 years ago +15

    Thank you very much for your great explanation. You really know how to teach. I respect that deeply.
    That is a very good example for beginners, but it's a prebuilt environment. I think one of the key points is writing a proper reward function. In this example there is no reward function; everything is done by the environment object. Will you cover writing proper reward functions or building environments for RL in the next videos?
    Again, thanks for your excellent work.

    • @deeplizard
      @deeplizard  5 years ago +4

      Hey samet - You're welcome! I've thought about this as well. I haven't yet explored building and developing an environment from scratch, but I may consider doing this in a future video. Thanks for the suggestion!

    • @sametozabaci4633
      @sametozabaci4633 5 years ago +1

      @@deeplizard Thanks. Waiting for the upcoming videos.

    • @korbinianblanz3734
      @korbinianblanz3734 4 years ago

      Exactly my thought. This series is absolutely amazing and helped me a lot, but I miss this point. Is there a @deeplizard video out there now on this topic? If not, can you suggest any valuable resources on implementing your own reward function? Thanks!

    • @trishantroy6256
      @trishantroy6256 4 years ago

      @@deeplizard Thanks a lot for this video series!
      Regarding the reward functions query, we can write the reward functions through code ourselves since we know the state of the agent, right? Is this correct?

  • @tariqshah2767
    @tariqshah2767 4 years ago +2

    Wow, really impressed.

  • @egrinant2
    @egrinant2 5 years ago +1

    Hi, I commented on the poll before and it doesn't let me post another comment to reply to you. Languages we use: mostly C# and JavaScript, SQL, PHP, rarely Python. However, I am very interested in ML. I love your videos; I started following you about 2 weeks ago and I am amazed at the quality of the content. I've read more than a dozen books about ML and tried to follow other people's videos/courses, and I can say that your channel explains the most important concepts much more clearly than anyone else. Keep up the good work!

    • @deeplizard
      @deeplizard  5 years ago +1

      Hey Edgar - Thanks for sharing your experiences! Really glad to hear you're finding such value in the content, and we're happy that you're here!

  • @hanserj169
    @hanserj169 4 years ago

    Amazing explanation. I was wondering if you have the code implementation for a continuous task as well.

  • @Gleidson5608
    @Gleidson5608 4 years ago +1

    Thanks for this series. I have learned so much from it, and I'm learning a lot of things with you.

  • @portiseremacunix
    @portiseremacunix 3 years ago +2

    I can finally understand how easy it is to implement Q-learning!

  • @niveyoga3242
    @niveyoga3242 4 years ago +1

    Thank you!

  • @marcin.sobocinski
    @marcin.sobocinski 2 years ago +1

    I know the course is quite dated (2018), but just wanted to say that the sample code is great. I am just doing another heavily charged course online and the code is so cryptic and unnecessarily complicated that I have to search for other sources of knowledge. Happy to find a good one here :D

  • @yash-vh9tk
    @yash-vh9tk 4 years ago

    The more we explore, the better the results I am seeing.

  • @RandomShowerThoughts
    @RandomShowerThoughts 5 years ago +1

    After this video I'm going to become a Patreon supporter! This was an absolutely amazing series up to this point. Can't wait to finish.

    • @deeplizard
      @deeplizard  5 years ago

      Thank you so much, Farhanking! Really happy that you're enjoying the series!

    • @RandomShowerThoughts
      @RandomShowerThoughts 5 years ago

      deeplizard really am. I doubt you remember, but I requested this series back in September or something, so it's awesome that you guys actually did it. I was away from machine learning for a while; I came back and saw you guys were also gone for a couple of months. Great to see you back though.

    • @deeplizard
      @deeplizard  5 years ago +2

      I do remember :D When I saw your comment, your user name and profile photo refreshed my memory that you were previously going through the Keras series and had asked about plans for RL. New episodes are still being added to the RL series, and the next one should be out within a couple of days!

    • @RandomShowerThoughts
      @RandomShowerThoughts 5 years ago +1

      deeplizard wow, that's insane! Ha, it's really been a while. Looking forward to it.

  • @chrisfreiling5089
    @chrisfreiling5089 5 years ago +3

    Thanks for these videos! Since this is such a small problem, I would bet (after you changed exploration_decay_rate =0.001) you have already found the optimum policy with about 10,000 episodes. Assuming this hunch is correct, the only way to improve the score further is to decrease the final exploration_rate. So I tried just setting exploration_rate=0 for the last 1000 steps. I think it does a little better. There is a lot of luck, but I would guess about 73% avg.

    • @deeplizard
      @deeplizard  5 years ago

      Ah, nice! Thanks for sharing your approach and results!
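    A minimal way to try the approach described in this thread (pure exploitation for the final episodes), assuming the episode loop and variable names from the video:

    # Inside the episode loop, before the action is chosen (a sketch, not the video's code):
    if episode >= num_episodes - 1000:
        exploration_rate = 0   # act fully greedily for the last 1,000 episodes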

  • @mohammadelghandour1614
    @mohammadelghandour1614 4 years ago

    Thanks for the amazing work. I have a question: for a state like "S", which is the initial state for the agent, there are only 2 action choices, moving right or moving down. The same also applies to states found on the edges (3 action choices) and corners (2 action choices). Should the Q-values of those state-action pairs be zeros? I just cannot locate them in the final Q-table. I know there are 23 zeros in the final Q-table. Are those for state-action pairs that lead to "H" and for the cases mentioned above when only 2 or 3 action choices are available?

  • @davidak_de
    @davidak_de 5 days ago

    To improve the score, I first increased max_steps_per_episode to 256, which increased the final score from 67 to 77. Then I tried to increase num_episodes, but the end score got worse over time. It was at 79 once. I played around with exploration decay and settled on min_exploration_rate at 0.0001 and exploration_decay_rate at 0.00075. That way I got to 83.4.

  • @hello3141
    @hello3141 5 years ago +3

    Once again, very nice. As you noted at 0:45, the initial exploration decay rate was off by an order of magnitude. As you suggest, a good exercise is to assess why this is a problem. My 2-cents: if the exploration decay rate is too large, the 2nd term in the “exploration-rate update” is ≈ 0 (because the exponential term is ≈ 0). The impact is that subsequent epsilon-greedy searches get stuck in an “exploitation” mode since the exploration rate converges to "min_exploration_rate" (little or no exploration occurs). This is the same point you made later in the video, so again, nice work.

    • @deeplizard
      @deeplizard  5 years ago +1

      Thanks for answering the challenge, Bruce! 🎉Yes, the lower the exploration decay rate, the longer the agent will be able to explore. With 0.01 as the decay rate, the agent was only able to explore for a relatively short amount of time until it went into full exploitation mode without having a chance to fully explore and learn about the environment. Decreasing the decay rate to 0.001 allowed the agent to explore for longer and learn more.

    • @richarddeananderson82
      @richarddeananderson82 5 years ago +2

      @@deeplizard It is true that a larger exploration decay rate (0.01 instead of 0.001) makes the learning algorithm move to "exploitation" faster.
      That being said, the issue here lies mainly in the "exploration-exploitation trade-off" part, together with the fact that the q_table gets non-zero values starting from the "higher" states and propagating backward to the start state (because there is only a reward in the last state, where the frisbee is).
      And choosing the max Q-value from the table for a state ( action = np.argmax(q_table[state,:]) ) where all the Q-values are 0 will ALWAYS return action=0, or "Left", so at the start (in the greedy case) our buddy here gets stuck on the first row walking left, unless he "luckily" slips and goes down. So you may end up with a Q-table full of zeros every now and then.
      To avoid this, I suggest this minor change to the code in the "exploration-exploitation trade-off" part:
      if exploration_rate_threshold > exploration_rate and not np.all(q_table[state, :] == 0):
          # Greedy policy, only if the Q-values for this state are NOT all 0
          action = np.argmax(q_table[state, :])
      else:
          # Explore
          action = env.action_space.sample()
      In that case, if the greedy policy is chosen but all of the Q-values for the state are zero, it chooses a random action instead of always going for action 0, which is not fair.
      With that you can leave exploration_decay_rate = 0.01 and get similar results.
      One last thing: this behavior would be much clearer if the game were deterministic (no slipping on the ice), since the slipperiness adds randomness that helps hide the phenomenon.
      Best regards and keep it up!
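    For reference, the decay step being discussed is applied once per episode; a small sketch using the video's variable names (the values shown are illustrative):

    import numpy as np

    max_exploration_rate, min_exploration_rate = 1.0, 0.01
    exploration_decay_rate = 0.001          # 0.01 in the first run shown in the video
    episode = 1000                          # example episode index

    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    # With a decay rate of 0.01 the decaying term drops below 1% of its starting value
    # after roughly ln(100)/0.01 ≈ 460 episodes; with 0.001 that takes ≈ 4,600 episodes,
    # leaving the agent far more time to explore before it settles into exploitation.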

  • @DEEPAKSV99
    @DEEPAKSV99 4 years ago +4

    It's true that it has been almost 2 years since this video was uploaded, but since I am viewing it just now, let me try to answer your question on "why does a higher value of exploration_decay_rate (0.01) give poorer results (compared to 0.001)?":
    When I tried running the code with 0.01 on my PC, all I got at the end was a q_table with all elements zero, and all the rewards were zero as well.
    My intuition: I think this happened because a high exploration_decay_rate makes the decay happen almost instantly, so the model almost always prefers exploitation over exploration; and when exploitation happens over a q_table of only zeros, the agent keeps looping around the same path and may never reach the goal, so the q_table and rewards never update.

    • @sebastianjost
      @sebastianjost 4 years ago +2

      That sounds very much correct, and this is a common problem:
      the agent never gets enough reward to learn where to go.
      That's why more complex reward functions can be useful. In this case, adding a penalty for falling into a hole should help a lot.
      But since the environment is prebuilt, I don't know if that's possible.

  • @iTube4U
    @iTube4U 1 year ago +3

    If you try to follow along with the code but this line doesn't work:
    ---> q_table[state, action]
    first change state = env.reset() to --> state = env.reset()[0]
    and unpack the extra value returned by the step function, new_state, reward, terminated, truncated, info = env.step(action),
    if it complains about the number of values to unpack (use done = terminated or truncated where the old code used done).
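    Putting those changes together, a minimal sketch of the loop adapted to the current Gymnasium API (the package name and call signatures here reflect current Gymnasium releases rather than the video's code; variable names follow the video):

    import gymnasium as gym        # maintained successor of the gym package used in the video

    env = gym.make("FrozenLake-v1")
    state, info = env.reset()      # reset() now returns (observation, info)
    done = False
    while not done:
        action = env.action_space.sample()
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated   # the old single done flag is now two flags
        state = new_state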

  • @arturasdruteika2628
    @arturasdruteika2628 4 years ago

    Hi, why do we need to set done = False when later we get the done value from env.step(action)?

  • @sergiveramartinez2685
    @sergiveramartinez2685 3 years ago +1

    Hi everyone. I'm applying RL to get the link weights of different network topologies. Since I need a value at the output and not a classification (probability), the model will be of the regression type. The problem is: how can I get N different values, where N is the number of links? Thank you so much.

  • @TheRealAfroRick
    @TheRealAfroRick 4 years ago +1

    Love the tutorials, but the equation for the q_table update appears different from that of the Bellman equation: q[state, action] = q[state, action] + learning_rate * (reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])
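    The two forms are the same update rearranged, so either can be used; a quick numerical check (numbers are illustrative):

    # (1 - lr) * q + lr * target  is identical to  q + lr * (target - q)
    lr, q, target = 0.1, 0.5, 1.0
    assert abs(((1 - lr) * q + lr * target) - (q + lr * (target - q))) < 1e-12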

  • @masudcseku
    @masudcseku 3 years ago +1

    Great explanation. I tried the same code; however, the rewards are not increasing properly. After 5000 episodes, they suddenly started to decrease.

  • @farhanfaiyazkhan8916
    @farhanfaiyazkhan8916 28 days ago

    A larger exploration_decay_rate makes epsilon shrink faster, so we get more exploitation steps before we have properly explored the environment; hence the inconsistency.

  • @crykrafter
    @crykrafter 5 years ago +1

    Thank you for your great explanation. I really understood everything you explained, but I'm still interested in how you'd apply this method to a dynamic game like Pong, where you can't give the agent all the possible states because it's too complex.

    • @deeplizard
      @deeplizard  5 years ago

      You're welcome, CRYKrafter! I'm glad to hear you're understanding everything.
      In regards to your question/interest, this would be where _deep_ reinforcement learning will come into play. Deep RL integrates neural networks into reinforcement learning. We'll learn about deep Q-learning in an upcoming video, which can be used for larger more complex state spaces.

    • @crykrafter
      @crykrafter 5 years ago +1

      Thanks, I'm looking forward to this series.

  • @henoknigatu7121
    @henoknigatu7121 1 year ago

    I have tried the algorithm on FrozenLake-v1 in the same way you implemented it and printed out the reward for each step; the rewards for every state (including the holes) and action are 0.0 except for the goal state. How would the agent learn from these rewards?

  • @Mirandorl
    @Mirandorl 5 years ago

    How does the agent know what actions are available to it? It seems they are abstracted away in this "action space". Where is up / down / left / right actually defined as an action and made available to the agent please?

  • @mmanuel98
    @mmanuel98 3 years ago

    I'm currently implementing reinforcement learning for another situation. How will the output from this (the q-table) be used for the policy? So that if the situation changes, the policy gained from the training can be used (that is the same "game" but different environment).

  • @mohanadahmed2819
    @mohanadahmed2819 4 years ago

    Great video series. What changes from one episode to the next? The only thing that I can see is exploration_rate_threshold, which is a random number. If this is correct, then should we re-seed the random number generator at every episode to guarantee non-identical episode results?

  • @faizanriasat2632
    @faizanriasat2632 1 year ago

    What if selecting a random action from the sample gives an action that is prohibited in that state?

  • @sankalpramesh5478
    @sankalpramesh5478 3 years ago

    Does the map of frozen lake remain the same or does it get randomly changed?

  • @TheMekkyfly02
    @TheMekkyfly02 3 years ago

    Please, I need a little help!!! I am getting a "tuple not in range" error when I run the code to print the rewards_per_thousand episode

  • @Zelmann1
    @Zelmann1 1 year ago +1

    Very good tutorial. I noticed even with 50,000 episodes the rewards the agent gets per 1000 episodes plateaus at about 69-70%....implying there is a maximum amount that the agent can learn. Wondering what can be done to get it to 90-something percent.

  • @linvincent840
    @linvincent840 4 years ago

    I think we should remove the term "discount_rate * np.max(q_table[new_state, :])" from the Bellman update when our agent reaches the goal, because in that case we don't have a new state anymore.
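    A sketch of that terminal-state handling, assuming the variables from the video's inner loop (with numpy imported as np):

    # Drop the bootstrap term when new_state is terminal (goal or hole)
    target = reward if done else reward + discount_rate * np.max(q_table[new_state, :])
    q_table[state, action] = (1 - learning_rate) * q_table[state, action] + learning_rate * target
    # In Frozen Lake this changes nothing numerically: rows of q_table for terminal states
    # are never updated (the episode ends there), so they stay all-zero and the bootstrap
    # term already contributes 0 whenever new_state is a hole or the goal.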

  • @user-uk4zk1st6f
    @user-uk4zk1st6f 4 years ago

    Can anyone tell me why, if I reduce epsilon *linearly*, the agent doesn't learn? Using the formula from the video works just fine, but I would like to be able to use linear decay, because it helps to calculate how many steps I want to explore...
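    For comparison, a common linear schedule looks roughly like the sketch below (an assumption, not code from the video); whether the agent still learns depends on keeping epsilon high long enough for it to stumble onto the goal a few times:

    max_exploration_rate, min_exploration_rate = 1.0, 0.01   # illustrative values
    anneal_episodes = 5000                                   # assumed annealing horizon
    frac = min(1.0, episode / anneal_episodes)
    exploration_rate = max_exploration_rate - frac * (max_exploration_rate - min_exploration_rate)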

  • @happyduck70
    @happyduck70 1 year ago

    A question about the reward that the environment returns after taking env.step(action): is this the reward for ending up in the next state, or is it linked to taking the action from the current state? If it follows the rules for Q-learning, I would expect the reward to be linked to ending up in a state.

  • @pepe6666
    @pepe6666 5 years ago +2

    hoooray exponential decay. the champion's decay

  • @interweb3401
    @interweb3401 1 year ago +1

    BEST INSTRUCTOR EVER, BEST INSTRUCTOR EVERRRRRR

  • @TheHunnycool
    @TheHunnycool 5 years ago +2

    Why are there large Q-values for the left and up actions for the starting state in the first row of the q_table, when we know it can only move right or down?
    Also a large value for left in the 5th row for [F]HFH?
    By the way, you are the best teacher EVER!!!! Really, thank you for these lectures.

    • @deeplizard
      @deeplizard  5 years ago +2

      You're welcome, Himanshu. Glad you're enjoying the content!
      The answer to this question is very similar to what we previously discussed in your prior comments. Although the agent can choose to move left from the starting state, the ice is slippery, so the ice may make the agent slip into a state other than the one it chose to move toward. If the agent chose to move left but instead slipped right, for example, then the Q-value associated with moving left from the starting state would end up positive, since the agent instead slipped right and eventually gained a positive reward. This explains why the Q-value associated with an action that seems unavailable can be non-zero.

  • @SamerSallam92
    @SamerSallam92 5 years ago +3

    Thank you very much
    Great course and blog
    I guess there is a missing round bracket at the end of this line of code in your blog post:
    # Update Q-table for Q(s,a)
    q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
    learning_rate * (reward + discount_rate * np.max(q_table[new_state, :])
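    With the closing parenthesis restored, the update line reads (same names as the blog post):

    q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
        learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))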

  • @jeffshen7058
    @jeffshen7058 5 years ago +3

    I tried using a linear decay rate but that didn't work too well. I also tried only exploring for the initial 7000 episodes, then only exploiting for the last 3000 episodes, and that seemed to do quite well. With a learning rate of 0.3 and a discount rate of 0.98, I got to 76% using this strategy. Would it be possible to implement grid search to automate (to some extent) the hyperparameter tuning? Thanks for the effort you put into these excellent videos.

    • @deeplizard
      @deeplizard  5 years ago

      Thanks for sharing your experiments and results! I haven't tried grid search with this, so I'm not positive of the results it would yield.
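    A simple grid search is easy to bolt on if the training loop is wrapped in a function; run_training below is a hypothetical wrapper (not part of the video's code) that takes hyperparameters and returns the average reward over the final 1,000 episodes:

    from itertools import product

    best_params, best_score = None, -1.0
    for lr, dr in product([0.05, 0.1, 0.3], [0.9, 0.98, 0.99]):
        score = run_training(learning_rate=lr, discount_rate=dr)   # hypothetical helper
        if score > best_score:
            best_params, best_score = (lr, dr), score
    print("best (learning_rate, discount_rate):", best_params, "avg reward:", best_score)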

  • @ka1zoku-o
    @ka1zoku-o 1 year ago

    DOUBT: Usually while training neural networks, we initialize the weights to random real numbers. Here we're initializing the Q-values to zero. Is there any specific reason for the difference in treatment between the two situations when initializing?

  • @maximilianwittmann2400
    @maximilianwittmann2400 4 years ago +1

    Dear deeplizard-team,
    first of all I want to congratulate you for having put together an amazing Reinforcement Learning playlist. Your theoretical explanations are very precise, you walk us very smoothly through the source code, and your videos are both entertaining and rich on great content! By watching your videos, you can tell that you are very passionate about AI and on sharing your knowledge with fellow tech enthusiasts. Thank you for your hard work and dedication.
    I have one question though: I am unsure where exactly we specify the positive or negative rewards in the Python code? I followed your explanation and understood that the agent is basing its decisions for each state-action-pair on the q-table-values depending on the exploration-exploitation-tradeoff. But where exactly in the source code do we actually tell the agent that by stepping onto the fields with letter H, it will receive minus 1 points and for landing on F-letter fields it is "safe"? Is this information specified in the env.step-function and thus already imported from OpenAI gym's environment? I look forward to your reply.
    Thanks!

    • @deeplizard
      @deeplizard  4 years ago +1

      Hey Maximilian - You're so welcome! Thank you for the kind words, and we're very happy you're finding so much value in the content.
      In regards to your question, your assumption is correct that the reward function is defined by the OpenAI Gym environment. The link below is Gym's wiki page for the Frozen Lake environment, which gives an overview of the environment, including how the rewards are defined.
      github.com/openai/gym/wiki/FrozenLake-v0
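    One way to see those rewards directly is to inspect the environment's transition table; the toy-text environments keep it in an attribute named P (an implementation detail rather than a documented API, so treat this as an assumption):

    import gym                          # or: import gymnasium as gym
    env = gym.make("FrozenLake-v1")     # FrozenLake-v0 at the time of the video
    # P[state][action] is a list of (probability, next_state, reward, done) tuples.
    # For example, action 2 (right) from state 14, the square just left of the goal:
    print(env.unwrapped.P[14][2])
    # Only transitions that land on the goal carry reward 1.0; every other transition,
    # including falling into a hole, has reward 0.0.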

  • @moritzpainz1839
    @moritzpainz1839 4 years ago +4

    Why don't you publish your courses on Udemy or Coursera?
    I would be happy to support you there :D
    KEEP IT UP

  • @ilyeshouhou9998
    @ilyeshouhou9998 3 years ago

    I tried using this method on a maze-solving agent with gym_maze, but when the action to be taken is an exploitation action I get this error: "TypeError: Invalid type for action, only 'int' and 'str' are allowed".
    When I print the action it is indeed an integer!! Any help would be great, thanks.

  • @ka1zoku-o
    @ka1zoku-o 1 year ago

    DOUBT: You discussed loss functions earlier, but there's no mention of a loss function in the code?? I understand how the Bellman equations work toward optimal solutions over time (which is also what happens in the Bellman-Ford algorithm for single-source shortest paths in graphs). But is it still fine to leave out a loss function??
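    One way to relate the two, sketched below: the tabular update is itself a gradient step on the squared TD error with the target held fixed, so no separate loss needs to be coded until Q is approximated by a neural network (deep Q-learning, later in the series). Numbers are illustrative:

    # L = 0.5 * (target - q)**2, with target treated as a constant
    #   dL/dq = -(target - q)
    #   q - learning_rate * dL/dq = q + learning_rate * (target - q), i.e. the tabular update
    lr, q, target = 0.1, 0.2, 1.0
    assert q - lr * -(target - q) == q + lr * (target - q)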

  • @picumtg5631
    @picumtg5631 2 years ago

    I tried this Q-learning with another simple game, but somehow it just can't get the reward right. Are there common mistakes beginners make?

  • @junhanouyang6593
    @junhanouyang6593 3 years ago

    I copied all your code and ran it, but the result barely passes 6%. If I remove the randomness my result works fine. Does anyone know what the problem may be?

  • @ninatenpas633
    @ninatenpas633 4 years ago +1

    love you

  • @thehungpham4898
    @thehungpham4898 5 years ago +3

    Hey, I get an error when I update the q_table:
    "IndexError: index 4 is out of bounds for axis 0 with size 4"
    Can somebody help me?

    • @GtaRockt
      @GtaRockt 5 years ago +1

      Size 4 means there are indices 0, 1, 2, 3
      you always start counting at zero :)

    • @SuperOnlyP
      @SuperOnlyP 4 years ago

      It should be q_table[state, action], not q_table[action, state], in the formula. That will fix the error.

  • @fishgoesbah7881
    @fishgoesbah7881 3 years ago

    Does anyone else keep getting KeyError "Cannot call env.step() before calling reset()"? :(( I don't know how to fix this; I did call reset() via state = env.reset()

  • @harshadevapriyankarabandar5456
    @harshadevapriyankarabandar5456 5 years ago +1

    That explanation... Wow

    • @deeplizard
      @deeplizard  5 years ago +1

      I hope that's a good wow :D

    • @harshadevapriyankarabandar5456
      @harshadevapriyankarabandar5456 5 years ago +1

      @@deeplizard Definitely, that's a good wow for your good explanation... keep it going... it's very helpful.

  • @VJZ-YT
    @VJZ-YT 2 years ago +1

    I noticed that my agent keeps facing left (0) because env.action_space.sample() keeps guessing and never reaches the reward, so it always returns zero and the Q-table always equals 0. Even as I lower the exploration rate decay it doesn't matter: not once over 50,000 episodes did my agent reach the reward. Keep in mind, I have set exploration to 100% as there is nothing to exploit in the Q-table. I am at a loss.

  • @tinyentropy
    @tinyentropy 5 years ago +1

    Would you be so kind as to comment on the tools you are using for creating these beautiful slides with animations from the Jupyter notebooks as base material? :)

    • @deeplizard
      @deeplizard  5 years ago +1

      Hey tinyentropy - I'm using Camtasia here. It's a video editing software. Glad you're liking the style :)

    • @tinyentropy
      @tinyentropy 5 years ago +1

      deeplizard Wow! So you make all these animations manually? Amazing. I thought you would have some scripts to assist you by turning high-level descriptions of animations into the actual ones. I totally like your style.

    • @deeplizard
      @deeplizard  5 years ago

      Yes, manually! Thank you!

    • @tinyentropy
      @tinyentropy 5 years ago

      deeplizard By the way. I am kind of curious about your accent. Where do you produce these videos? :)

  • @sphynxusa
    @sphynxusa 5 years ago +1

    I have more of a general question. You mentioned in one of your videos that there are 3 types of machine learning: 1) Supervised, 2) Unsupervised and 3) Semi-Supervised. Would Reinforcement Learning be considered a 4th type?

    • @deeplizard
      @deeplizard  5 years ago +1

      Yes, it would!

    • @pepe6666
      @pepe6666 5 years ago

      @@deeplizard Why? It seems unsupervised because there are no humans telling it anything.

    • @NityaStriker
      @NityaStriker 4 years ago

      The environment provides rewards to the agent. That could be a reason it isn't unsupervised. 🤔

  • @jorgeguzmanjmga
    @jorgeguzmanjmga 1 year ago +1

    Is the code working? (I checked the page and am still getting errors, like I can't use ' state = env.reset()[0] '.)

    • @kevinlizarazu-ampuero493
      @kevinlizarazu-ampuero493 5 months ago

      Per the new gym library, you end up unpacking it every time you reset:
      state, _ = env.reset()

  • @l.o.2963
    @l.o.2963 3 years ago

    Why is the alpha parameter needed? Due to Bellman equations, the q-value is guaranteed to converge because the mapping is a contraction. Wouldn't the alpha parameter make the convergence slower?

    • @l.o.2963
      @l.o.2963 3 years ago

      And you rock, amazing videos. I am very happy I found this

  • @pedrofrasao4102
    @pedrofrasao4102 2 months ago

    I'm having difficulty averaging the rewards per episode, as the averages are always at 0. Can anyone help me with this?

  • @dulminarenuka8819
    @dulminarenuka8819 5 years ago +1

    Thank you very much. Can't wait for the next one. Can you make it quick? :-)

    • @deeplizard
      @deeplizard  5 years ago

      You're welcome, Dulmina! Hope to have the next one done within a few days 😆

  • @deematashman5460
    @deematashman5460 1 year ago

    Hi.. thank you for the great explanation. I have a problem in the code, I followed the same code you provided and I get a 0 for all the average rewards for all the episodes! Did anyone face the same problem? Thanks!

    • @jeffduncan9771
      @jeffduncan9771 1 year ago

      Oddly, I ran into the exact same error. I debugged and found my problem:
      For the line:
      exploration_rate = min_exploration_rate + \
      (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
      I had instead changed the * to a + (this is incorrect):
      exploration_rate = min_exploration_rate + \
      (max_exploration_rate - min_exploration_rate) + np.exp(-exploration_decay_rate*episode)
      I don't completely understand the math error (my misspent youth not understanding natural logs), but maybe your error is the same or similar. Good luck.

  • @izyaanimthiaz2082
    @izyaanimthiaz2082 1 year ago +1

    I'm getting this error: "only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices" in the code q_table[state, action] = q_table[state, action] * (1 - learning_rate)..., directly copied from the website. Can someone help me solve this error?

    • @deeplizard
      @deeplizard  1 year ago +1

      This is due to changes released in later versions of Gym (now migrated to Gymnasium). Code is now updated in the corresponding lecture notes on deeplizard.com.

  • @bharathvarma9729
    @bharathvarma9729 4 years ago

    Where did she declare q_table?

  • @iworeushankaonce
    @iworeushankaonce 4 years ago

    Did anyone try out the code themselves? I tried to re-write the code and then even copied and pasted it, but the agent doesn't learn/evolve over time anyway, even though the Q-table looks almost identical to yours.

  • @raghavgoel581
    @raghavgoel581 5 years ago +8

    I am getting a decrease in the reward after running the code for 10k episodes. Any ideas why?

    • @antoinelunaire9462
      @antoinelunaire9462 4 years ago

      Yes, me too. I believe the code shown is not the final code; for example, in the Q-table update there is a parenthesis missing at the end.

    • @antoinelunaire9462
      @antoinelunaire9462 4 years ago

      Visit the blog and try to copy the code and fix the spacing accordingly; it somewhat worked for me but gives different numbers.

    • @kushagrak4903
      @kushagrak4903 3 years ago

      I changed the minimum exploration rate from 0.1 to 0.001. Btw, if you solved it differently please share, bro.

  • @DanielWeikert
    @DanielWeikert 5 years ago

    I was not able to divide by num_episodes/1000 => 10. I got the error message
    "array split does not result in an equal division"
    For some reason my array consists of only 400497 elements, so I had to adjust the number. Don't know why.
    One question for my understanding: we only discount the Q-value that we look up from our new_state when updating, so basically only one step into the future, not all future steps? Thanks

    • @deeplizard
      @deeplizard  5 years ago

      Hey Daniel - It sounds like you may have a misconfiguration somewhere. Since num_episodes = 10,000, len(rewards_all_episodes) should also be 10,000. We're splitting those 10,000 rewards from rewards_all_episodes into sub-arrays. We should get num_episodes / 1000 = 10 sub-arrays, each of length 10,000 / 10 = 1,000.
      Double check your code to be sure you're not missing something somewhere, or that you don't have something running outside of a loop that should be inside, and vice versa.
      In regards to your question: yes, when we update the Q-value, we're doing it iteratively using the fundamental property that the optimal Q-function must satisfy (the Bellman equation), which makes use of only the next optimal state-action pair.

    • @Create1st
      @Create1st 5 years ago

      Hi Daniel, I think the problem may have been that the code section starting from the comment "# Calculate and print the average reward per thousand episodes" needs to be outside the "for episode in range(num_episodes):" block. I initially had my indentation levels wrong and got the same problem

    • @kaushalprasadhial9335
      @kaushalprasadhial9335 5 years ago

      I know it's late to comment, but one solution is to use np.array_split instead of np.split.

    • @siming9374
      @siming9374 5 years ago

      Lol, it's not late! Finally found the solution, lol, thx :)
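    Putting the replies in this thread together, a robust version of the per-thousand averaging (variable names assumed from the video; np.array_split also tolerates totals that do not divide evenly):

    import numpy as np

    chunks = np.array_split(np.array(rewards_all_episodes), num_episodes // 1000)
    count = 1000
    for chunk in chunks:
        print(count, ":", chunk.sum() / len(chunk))
        count += 1000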

  • @Volpix28
    @Volpix28 1 year ago +2

    Thanks for the series, it's great!
    Unfortunately it seems that you cannot use FrozenLake-v0 anymore, and therefore some things have changed.
    First, there is now one extra return value from the step() function, so you need to assign the "truncated" return as well.
    Second, there seems to be something new with the state. You cannot update the q_table with
    q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
    learning_rate * (reward + discount_rate * np.max(q_table[new_state,:]))
    anymore, since state is now a tuple (e.g. (0, {'prob': 1}) ).
    I don't know if this is what should be in state, but you cannot address q_table[state, action] anymore. I tried using state[0] instead because I thought this would be the right element, but it did not work either. Does someone know what happened here?

    • @deeplizard
      @deeplizard  1 year ago +2

      Hey Volpix - Thank you! Yes you're right, due to changes in the library, there are updates needed to the code. The updates are included in the Text section of the corresponding lesson pages on deeplizard.com for each video. You can find the updates in the text along with a description of the updates in the Updates section at the bottom of the page. Here is the link for this corresponding lesson:
      deeplizard.com/learn/video/QK_PP_2KgGE

    • @roseiruby
      @roseiruby 1 year ago

      @@deeplizard thanks a lot!

  • @prathampandey9898
    @prathampandey9898 2 years ago +1

    Hello, the code is showing an error in the "q_table" formula and the "exploration_rate" formula because there is a " \" just after the "+", which I also think is wrong syntax. Can anyone explain?

    • @deeplizard
      @deeplizard  2 years ago

      The " \ " character is used in Python to allow a long line of code to be split into two lines:
      stackoverflow.com/questions/38125328/what-does-a-backslash-by-itself-mean-in-python

    • @prathampandey9898
      @prathampandey9898 2 years ago +1

      Thanks for the reply. I didn't expect such a quick reply. These are the best reinforcement learning lectures ON THE ENTIRE INTERNET. Keep up the GREAT work.
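    As a follow-up to the backslash explanation above, the same statement can also be split by wrapping the right-hand side in parentheses, which avoids copy/paste problems with the trailing backslash (same names as the video's code):

    q_table[state, action] = (q_table[state, action] * (1 - learning_rate)
                              + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :])))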

  • @Chillos100
    @Chillos100 3 years ago +1

    Thank you for these great tutorials, I'm really loving them. Can you help me out? I'm getting an error at the split function --> np.split(np.array(rewards_all_episodes), num_episodes/1000). What am I doing wrong?
    This is the error message: TypeError Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
    864 try:
    --> 865 len(indices_or_sections)
    866 except TypeError:
    TypeError: object of type 'float' has no len()

    • @cassianocampes
      @cassianocampes 3 years ago +1

      I had the same problem, and the solution is here: stackoverflow.com/questions/14406567/paritition-array-into-n-chunks-with-numpy
      You simply change np.split() to np.array_split(); that works.

    • @Chillos100
      @Chillos100 3 years ago

      @@cassianocampes thanks man!

  • @Alchemist10241
    @Alchemist10241 3 years ago +1

    My tiny little Frozen Lake AI doesn't learn anything; the average reward isn't increasing, even though the code is the same. Anyway, thanks for your practical examples.

  • @vvviren
    @vvviren 5 years ago +1

    NOBODY:
    LITERALLY NOBODY:
    deeplizard: Don't be shy. I wanna hear from you.

    • @deeplizard
      @deeplizard  5 years ago +2

      Lol errbody being shy 🤭

    • @Then00tz
      @Then00tz 5 years ago

      @@deeplizard I'm more of a listener. Good job though, you guys rock!

  • @pepe6666
    @pepe6666 5 years ago +1

    What I don't get is how you can have an expected reward for anything unless you're at the second-to-last state before game over. Nothing else gives a reward! So how could most of those rows for each state have anything in them at all?

    • @NityaStriker
      @NityaStriker 4 years ago +1

      When you reach the goal for the first time, the second-to-last state gains a value in the q-table. After that, if the agent reaches that second-to-last state again, then the state the agent was on before reaching it gains some value of its own in the q-table. i.e., In the formula, the "reward" variable increases the value of only the second-to-last state, while the "q_table[new_state, action]" term is what increases the value of all other states in the q-table. This occurs slowly, and only when the number of episodes is large enough can all viable state-action pairs have some value in the q-table.
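    A tiny numerical illustration of that backward spread (illustrative values, not the video's code), using a single path of states s0 -> s1 -> goal:

    import numpy as np

    lr, gamma = 0.5, 0.9
    q = np.zeros(2)                 # q[1]: state next to the goal, q[0]: the state before it
    # First successful episode: only the final transition earns reward 1
    q[1] = q[1] + lr * (1 + gamma * 0 - q[1])        # q[1] becomes 0.5
    # A later episode passing through s0 -> s1 now bootstraps off q[1]
    q[0] = q[0] + lr * (0 + gamma * q[1] - q[0])     # q[0] becomes 0.225
    print(q)                                         # values spread backward one step per visit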