Q-Learning Explained - A Reinforcement Learning Technique

  • Published: 29 Nov 2024

Comments • 86

  • @absimaldata
    @absimaldata 3 years ago +10

    Why are you so clear in explaining? I mean, why do others fail to deliver tutorials with the clarity that you do? I don't know what's wrong with everyone. Omg, you are impressive.

  • @justinheehaw
    @justinheehaw 3 years ago +5

    I gave up when I saw 1:29 for the first time (because I'm not so good at math and English).
    But when I came back again today and watched the entire video, I found this video to be the most well-explained one, especially the Q-table section.

  • @richardkessler2171
    @richardkessler2171 5 years ago +12

    One of the best series I've viewed on RL. Really great job teaching the content without boring the audience. Also...really enjoy the closing snippets that keep me excited to see the end. Excellent!

    • @deeplizard
      @deeplizard  5 years ago

      Thank you, Richard! Really happy to hear that!

  • @tingnews7273
    @tingnews7273 6 years ago +17

    What I learned:
    1. Q-learning: learning the optimal policy in an MDP.
    2. How Q-learning works: learning the Q-values for each state-action pair.
    3. Value iteration: Q-learning iteratively updates the Q-values (this will become clearer later, I think).
    4. Q-table: stores the Q-value for every state-action pair (see the sketch below).
    5. Exploration: exploring the environment to find out information about it.
    6. Exploitation: exploiting the information that is already known (tip: epsilon-greedy).
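
    A minimal sketch of points 2-4 above, assuming the 3x3 lizard grid is flattened into 9 states with 4 moves (up, down, left, right); the names n_states, n_actions, alpha, and gamma are illustrative, not taken from the video:

    import numpy as np

    n_states, n_actions = 9, 4                   # 3x3 lizard grid, 4 possible moves
    alpha, gamma = 0.1, 0.99                     # learning rate and discount factor
    q_table = np.zeros((n_states, n_actions))    # one Q-value per state-action pair

    def q_update(state, action, reward, next_state):
        # One value-iteration step: move Q(s, a) toward the reward plus
        # the discounted best Q-value achievable from the next state.
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])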

  • @obensustam3574
    @obensustam3574 10 months ago +3

    Very good content, I watched the videos in this playlist to prepare for my exam. Thank you 😊

  • @pawelczar
    @pawelczar 6 years ago +37

    This whole series is great! I love the way you explain all the math concepts and questions, and I'm more than happy that you didn't stop there but introduced a practical example. Can't wait for the next episodes :D

    • @deeplizard
      @deeplizard  6 years ago +1

      Thanks, pawelczar! Glad you're liking it!

  • @abcqer555
    @abcqer555 6 years ago +12

    Hi Lizard People,
    I feel so fortunate to have come across your channel. Your lessons/videos are very clear, concise, well produced, entertaining, and I am excited for all the videos that will be coming out. Keep up the fantastic work!

    • @deeplizard
      @deeplizard  6 years ago +1

      Hey Paul - Thank you! We're glad you're here!

  • @Asmutiwari
    @Asmutiwari 4 years ago +6

    This series is so informative!! I wish you could make videos on dynamic navigation techniques using DRL.

  • @mohammadmohi8561
    @mohammadmohi8561 3 years ago +1

    You're an AI; you explained all these hard concepts so nicely and easily. Thank you so much.

  • @adamhendry945
    @adamhendry945 4 years ago +1

    PHENOMENAL! Your videos are THE BEST! Can you PLEASE PLEASE PLEASE do a series on Actor-Critic methods!!

  • @guineteherve9751
    @guineteherve9751 1 year ago +1

    Your work is simply incredible. Thank you!

  • @davidli9872
    @davidli9872 1 year ago +9

    Are you here after the Reuters article on OpenAI's Q*?

  • @hazzaldo
    @hazzaldo 5 years ago +1

    Brilliant video. One of the best RL teaching series/materials I've come across anywhere on the internet (if not the best). I look forward to watching the rest of the series. On this video, I have 3 questions:
    1- Just to clarify, is there a difference between the Q-function and the optimal Q-function? If so, is the difference that when a Q-function performs Q-value iteration and eventually converges on the optimal Q-values, it is then called the optimal Q-function?
    2- What does the capital `E` signify in the Bellman optimality equation?
    3- So far I have only learnt the definition of a "policy". Putting it into practice, given the scenario in this video (the lizard navigating an environment), where does the policy come into play here? Re-phrasing the question, what part of this scenario is the policy?
    Many thanks

    • @deeplizard
      @deeplizard  5 years ago +1

      Thanks, hazzaldo!
      1. Your assumption is correct.
      2. E is the notation for "expected value."
      3. Recall that a policy is a function that maps a given state to the corresponding probabilities of selecting each possible action from that state. The goal is for the lizard to navigate the environment in such a way that will yield the most return. Once it learns this "optimal navigation," it will have learned the optimal policy.
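
      For reference, a common way to write the Bellman optimality equation being discussed, where E denotes the expected value and s', a' are the next state and action:

      q_*(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \big]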

    • @hazzaldo
      @hazzaldo 5 years ago

      ​@@deeplizard TY very much. I do have another question, that I left in the Exploration vs Exploitation video part of this series. If you ever get the time, would really appreciate any clarification on it. Many thanks again for the answer to my question and this great series.

  • @arkadipbasu828
    @arkadipbasu828 2 years ago +1

    Super explanation. Thanks from India

  • @arefeshghi
    @arefeshghi 4 years ago +3

    Good balance of exploration and exploitation will bring good results in life too! We are all lizards! :)

  • @davidkhassias4876
    @davidkhassias4876 4 years ago

    Can't wait for coming episodes, because this series is amazing! And they/you helped me a lot. Thank you so much!

  • @asdfasdfuhf
    @asdfasdfuhf 4 years ago +10

    This was an exciting video, finally, we are getting to the good stuff.

  • @xiaojiang2610
    @xiaojiang2610 4 years ago +1

    Better than my engineering teacher.

  • @DreadFox_official
    @DreadFox_official 1 year ago +1

    Hey, I loved your video. Thank you so much

  • @yelircaasi
    @yelircaasi 2 years ago +1

    Really nice video, thanks for the clear explanations!

  • @Ayushsingh-zw3yk
    @Ayushsingh-zw3yk 7 months ago +1

    nice explanation deeplizard

  • @shoaibalyaan
    @shoaibalyaan 4 years ago +1

    AMAZING SERIES! Absolutely loved it!

  • @arnabjana2620
    @arnabjana2620 3 years ago

    {
    "question": "What is optimal Q-value for a policy?",
    "choices": [
    "Expected return for the reward at time (t+1) and maximum discounted reward thereafter for a state-action pair.",
    "It gives the optimal policy for the optimal expected return for an agent for each state-action pair.",
    "It is the reward for the action 'a' taken in state 's' at time 't'.",
    "Maximum accumulated reward by following the policy from time (t+1)."
    ],
    "answer": "Expected return for the reward at time (t+1) and maximum discounted reward thereafter.",
    "creator": "Arnab",
    "creationDate": "2021-08-03T08:12:26.884Z"
    }

  • @cedrichung6820
    @cedrichung6820 3 years ago +1

    How are you so good at explaining😍😍😍😍

  • @neogarciagarcia443
    @neogarciagarcia443 5 years ago +1

    exploration of reinforcement learning is going fine !

  • @ashabrar2435
    @ashabrar2435 3 years ago +1

    {
    "question": "Q table is defined as _______________ and _______________",
    "choices": [
    "action and state",
    "action and agent",
    "state and environment",
    "environment and action"
    ],
    "answer": "action and state",
    "creator": "Hivemind",
    "creationDate": "2021-01-03T20:01:41.577Z"
    }

    • @deeplizard
      @deeplizard  3 years ago

      Thanks, ash! Just added your question to deeplizard.com/learn/video/qhRNvCVVJaA :)

  • @rosameliacarioni1022
    @rosameliacarioni1022 3 years ago +2

    Thanks so muuuuch !

  • @MohsinKhan-ve1hn
    @MohsinKhan-ve1hn 5 years ago +3

    Your voice is great

  • @shashankdhananjaya9923
    @shashankdhananjaya9923 4 years ago +1

    Awesome explanation. I like this

  • @NoNTr1v1aL
    @NoNTr1v1aL 3 years ago +1

    Amazing video!

  • @patite3103
    @patite3103 3 years ago

    Your videos are awesome! Please correct the corresponding quiz, since the answer seems incorrect to me. Could you do a video explaining the first three steps and how the Q-table updates? This would really help in understanding how the update works. Thank you!

  • @michaelscott8572
    @michaelscott8572 4 years ago

    Thanks for the good explanation and all your work. A little hint, if I may: don't explain the words using the same words (exploitation and exploration).

  • @SugamMaheshwari
    @SugamMaheshwari 4 years ago +3

    Your voice is just amazing 😍😍😍😍😍

  • @namitaa
    @namitaa 4 years ago

    you saved my life bro

  • @mateusbalotin7247
    @mateusbalotin7247 3 years ago

    Thank you!

  • @louerleseigneur4532
    @louerleseigneur4532 4 years ago +1

    Thank you, thank you
    hats off

  • @sontapaa11jokulainen94
    @sontapaa11jokulainen94 4 years ago

    Is the exploration vs. exploitation part only used during training, or does it also happen when actually using the learned Q-table? And can the policy be "take the action which has the largest Q-value and sometimes explore" (i.e., can that be an example of a policy in this case)? Since the policy is just the probability of taking some action in a state, can the policy just be written as "take the action which has the largest Q-value" (as an example of pure exploitation)?
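
    The policy described in words here ("take the action with the largest Q-value, and sometimes explore") is commonly written as an epsilon-greedy rule. A minimal sketch, reusing a q_table array like the one sketched earlier; epsilon is an assumed exploration rate, not a value from the video:

    import numpy as np

    def epsilon_greedy_action(q_table, state, epsilon=0.1):
        # With probability epsilon, explore: pick a random action.
        if np.random.random() < epsilon:
            return np.random.randint(q_table.shape[1])
        # Otherwise exploit: pick the action with the largest Q-value for this state.
        return int(np.argmax(q_table[state]))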

  • @tallwaters9708
    @tallwaters9708 2 years ago

    I'll tell you what I really don't get: it seems the equation only updates the Q-table based on the current and next state, but the Bellman equation seems to imply that all future states are considered. Is there some recursion going on?
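
    The recursion sits inside the update target itself: the next state's Q-value is already an estimate of everything that follows it, so each one-step update bootstraps on that estimate. Written out with learning rate \alpha:

    Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]

    Over many updates, reward information propagates backward through the table one step at a time, which is how later states end up influencing earlier ones even though a single update only looks one step ahead.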

  • @madhesh18
    @madhesh18 4 years ago

    Really good work

  • @yashas9974
    @yashas9974 3 years ago

    Link to the talk that appeared at the end of the video?

  • @TheOfficialJeppezon
    @TheOfficialJeppezon 4 years ago

    You say that Q-learning tries to find the best policy. However, I thought Q-learning is an off-policy algorithm. I also have trouble understanding the on-policy/off-policy concept.

  • @krajkumar6
    @krajkumar6 3 years ago

    Hey @deeplizard,
    Many thanks for this video. I'm reading 'Reinforcement Learning: An Introduction, Second Edition' by Richard S. Sutton and Andrew G. Barto, and I'd like to know whether the Q-learning technique described here is the same as the dynamic programming explained in the book?

    • @krajkumar6
      @krajkumar6 2 years ago

      It is a temporal-difference learning technique.

  • @mauriziovassallo5499
    @mauriziovassallo5499 5 years ago +1

    Very clear :)

  • @rursus8354
    @rursus8354 3 years ago

    Won't a square become empty when the cricket(s) is(are) eaten?

  • @iAndrewMontanai
    @iAndrewMontanai 5 years ago +1

    What should I do in the case of continuous tasks? Like in Flappy Bird (if it's continuous, but anyway), I guess the Q-table would be infinite here, or would just have a big fixed size to save memory. Can you give some recommendations or explain, please? I want to start an implementation, but I don't know what the Q-table should look like in this case and how to interact with it correctly (and I hope there will be no other surprises lol).

    • @deeplizard
      @deeplizard  5 years ago +2

      Yes, a Q-table would not be feasible for this task. Keep going in the series, and you will see how you can use Deep Q-Learning for these tasks. Essentially, you substitute a neural network for the Q-table.
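
      A minimal sketch of that substitution, assuming PyTorch; the layer sizes and class name are illustrative, not from the series. The network maps a state vector to one Q-value per action, playing the role the Q-table played before:

      import torch.nn as nn

      class QNetwork(nn.Module):
          def __init__(self, state_dim, n_actions):
              super().__init__()
              # Small fully connected network: state in, one Q-value per action out.
              self.net = nn.Sequential(
                  nn.Linear(state_dim, 64),
                  nn.ReLU(),
                  nn.Linear(64, n_actions),
              )

          def forward(self, state):
              return self.net(state)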

  • @saumyachaturvedi9065
    @saumyachaturvedi9065 9 months ago +2

    I guess crickets make sound, so the lizard could take that as input as well when choosing its path.

  • @nodstradamus
    @nodstradamus 5 years ago

    Thanks for the video, it was useful. But for me it would have been even more useful if you'd explained the gamma value (i.e., the discount factor) in the formula as well.

    • @deeplizard
      @deeplizard  5 years ago

      Hey Aleistar - You're welcome! We first introduce the discount rate (gamma) a couple of episodes back where we learned about expected return. Check out the video/blog where it is introduced and defined here:
      deeplizard.com/learn/video/a-SnJtmBtyA
      Let me know if this helps!

  • @MrGenbu
    @MrGenbu 5 years ago

    Hi, I wanted to ask a question: when the agent is trained on this lizard example on a 3x3 board,
    if we placed it on a 6x6 board, could it still perform, or is this another kind of reinforcement learning?

    • @deeplizard
      @deeplizard  5 years ago +1

      This technique would still work with a 6x6 board.

    • @MrGenbu
      @MrGenbu 5 years ago

      @@deeplizard But we need to train it first to generate the Q-table, right?
      I mean, it cannot be trained and then run on a different board; even using a 3x3 board with different reward placements will not work?
      Like in regression, you fit a line and then use it as you like, but here you cannot, because the states should be the same? Isn't that so?

    • @deeplizard
      @deeplizard  5 years ago +1

      Yes, I thought you were asking in general if the Q-learning with value iteration technique would work on a 6x6 board. If you changed the board, then you would need to change and initialize the Q-table as well before training starts.

    • @MrGenbu
      @MrGenbu 5 years ago +2

      @@deeplizard So this kind of agent is environment-specific.
      Did you watch the OpenAI hide-and-seek agents?
      They seem to adapt to a new environment without training.
      I see this kind of agent as limited in its use, since it cannot be used in the real world, as it needs to be trained on every new environment.
      I am a newbie, so I am really just asking to get a clearer answer. If you have seen the OpenAI video, I would like to know which type of reinforcement learning can adapt to new environments.

  • @davidak_de
    @davidak_de 4 months ago

    Q-Star Lizard Gang 2024

  • @deeplizard
    @deeplizard  6 years ago +2

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/qhRNvCVVJaA

  • @adwaitnaik4003
    @adwaitnaik4003 4 years ago

    The channel name is creepy, but the explanation is amazing...

  • @XxGabberlordxX
    @XxGabberlordxX 5 years ago

    Hello,
    can someone please explain to me why there are 6 empty states?

    • @deeplizard
      @deeplizard  5 years ago

      The six empty states are arbitrary. Think about a video game where some actions will cause you to gain points, some actions will cause you to lose points or lose the game, and some actions will have no immediate effect on your score. With the lizard game example, we have a similar set up where moving to a tile with crickets will gain points, moving to a tile with a bird will lose points/lose the game, and moving to an empty tile has no immediate effect on our score.

    • @XxGabberlordxX
      @XxGabberlordxX 5 years ago

      @@deeplizard Hey ty for the answer! When I take a look at the picture I still don't get why there are 6 empty tiles. I don't count 6 empty tiles 🤔

    • @deeplizard
      @deeplizard  5 years ago

      In the photo, the lizard is on one of the empty tiles. The lizard is the agent, and she is free to move to any tile.

    • @XxGabberlordxX
      @XxGabberlordxX 5 years ago +1

      @@deeplizard Wow that was so obvious. Ty for the help :) Now i get it. Nice video and have a nice day :)

  • @shreyasrajanna7361
    @shreyasrajanna7361 6 years ago +2

    Where is the next video?

    • @deeplizard
      @deeplizard  6 years ago +1

      It is being developed! Aiming to add a new video to this series every 3-4 days.

    • @shreyasrajanna7361
      @shreyasrajanna7361 6 years ago +1

      @@deeplizard Your videos are really good.

    • @shreyasrajanna7361
      @shreyasrajanna7361 6 years ago

      Can you make videos on Kaggle projects, or build a project that may make learning even more interesting?

    • @deeplizard
      @deeplizard  6 years ago +1

      We may do some Kaggle videos in the future. We do have the following two series that show practical deep learning projects in both Keras and TensorFlow.js.
      deeplizard.com/learn/playlist/PLZbbT5o_s2xr83l8w44N_g3pygvajLrJ-
      deeplizard.com/learn/playlist/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL

  • @EarlWallaceNYC
    @EarlWallaceNYC 3 years ago

    O' the puns, ... exploit vs explore

  • @megasage
    @megasage 3 years ago

    The sound at 00:30 that I hear in every video is quite disturbing 😅

  • @muhammadsohailnisar6600
    @muhammadsohailnisar6600 4 years ago +1

    Please remove the sound played with the logo at the start of the video. The sound is very unpleasant, especially when one listens on headphones.

  • @pututp
    @pututp 3 years ago

    I am too stupid to understand the video.. My bad..

  • @MohdDanish-bh1ok
    @MohdDanish-bh1ok 5 years ago

    Luv u babe.