Why are you so clear in explaining? I mean, why do others fail to deliver tutorials with such clarity the way you do? I don't know what's wrong with everyone. You are impressive.
I gave up when I saw 1:29 for the first time (because I'm not so good at math and English).
But when I came back today and watched the entire video, I found it the most well-explained one, especially the Q-table section.
One of the best series I've viewed on RL. Really great job teaching the content without boring the audience. Also...really enjoy the closing snippets that keep me excited to see the end. Excellent!
Thank you, Richard! Really happy to hear that!
What I learned:
1. Q-learning: learning the optimal policy in an MDP.
2. How Q-learning works: by learning the Q-value for each state-action pair.
3. Value iteration: Q-learning iteratively updates the Q-values (this will become clearer later in the series, I think).
4. Q-table: stores the Q-value for every state-action pair.
5. Exploration: exploring the environment to find out information about it.
6. Exploitation: exploiting the information that is already known about the environment (tip: epsilon-greedy; see the sketch below).
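Here is a tiny sketch of points 4-6 as I understand them. The 3x3 grid, the four movement actions, and epsilon value are my own illustrative choices, not taken verbatim from the video:

```python
import random

# Hypothetical lizard-game setup: each grid tile is a state, moves are actions.
states = [(row, col) for row in range(3) for col in range(3)]
actions = ["up", "down", "left", "right"]

# Q-table: one Q-value per state-action pair, initialized to zero before training.
q_table = {(s, a): 0.0 for s in states for a in actions}

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: q_table[(state, a)])       # exploitation

print(choose_action((0, 0)))
```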
Very good content. I watched the videos in this playlist to prepare for my exam. Thank you 😊
This whole series is great! I love the way you explain all the math concepts and questions, and I'm even happier that you didn't stop there but introduced a practical example. Can't wait for the next episodes :D
Thanks, pawelczar! Glad you're liking it!
Hi Lizard People,
I feel so fortunate to have come across your channel. Your lessons/videos are very clear, concise, well produced, entertaining, and I am excited for all the videos that will be coming out. Keep up the fantastic work!
Hey Paul - Thank you! We're glad you're here!
This series is so informative!! I wish you could make videos on dynamic navigation techniques using DRL.
You're an AI; you explained all these hard concepts so easily. Thank you so much.
PHENOMENAL! Your videos are THE BEST! Can you PLEASE PLEASE PLEASE do a series on Actor-Critic methods!!
Your work is simply incredible. Thank you!
Are you here after Reuters' article on OpenAI's Q*?
Yeap 😂 hi 👋
No, what's that OpenAI shit?
Brilliant video. One of the best RL teaching series/materials I've come across anywhere on the internet (if not the best). Look forward to watching the rest of the series. On this video, I have 3 questions:
1- Just to clarify, is there a difference between the Q-function and the optimal Q-function? If so, is the difference that when a Q-function performs Q-value iteration and eventually converges on the optimal Q-values, it is then called the optimal Q-function?
2- What does the capital `E` signify in the Bellman Optimality equation?
3- So far I have only learnt the definition of a "policy". Putting it into practice, given the scenario in this video (the lizard navigating an environment), where does the policy come into play? Rephrasing the question: what part of this scenario is the policy?
Many thanks
Thanks, hazzaldo!
1. Your assumption is correct.
2. E is the notation for "expected value."
3. Recall that a policy is a function that maps a given state to the corresponding probabilities of selecting each possible action from that state. The goal is for the lizard to navigate the environment in such a way that will yield the most return. Once it learns this "optimal navigation," it will have learned the optimal policy.
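For question 2, here is the Bellman optimality equation for the Q-function in its standard textbook form (as in Sutton and Barto; the video's notation may differ slightly), where E is the expected-value operator:

$$
q_*(s, a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \;\middle|\; S_t = s,\, A_t = a \right]
$$

That is, the optimal value of a state-action pair is the expected immediate reward plus the discounted value of the best action available from the next state.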
@@deeplizard Thank you very much. I do have another question that I left on the Exploration vs. Exploitation video of this series. If you ever get the time, I would really appreciate any clarification on it. Many thanks again for the answer to my question and for this great series.
Super explanation. Thanks from India
Good balance of exploration and exploitation will bring good results in life too! We are all lizards! :)
Can't wait for the coming episodes, because this series is amazing! It has helped me a lot. Thank you so much!
This was an exciting video, finally, we are getting to the good stuff.
Better than my engineering teacher.
Hey, I loved your video. Thank you so much
Really nice video, thanks for the clear explanations!
nice explanation deeplizard
AMAZING SERIES! Absolutely loved it!
{
"question": "What is optimal Q-value for a policy?",
"choices": [
"Expected return for the reward at time (t+1) and maximum discounted reward thereafter for a state-action pair.",
"It gives the optimal policy for the optimal expected return for an agent for each state-action pair.",
"It is the reward for the action 'a' taken in state 's' at time 't'.",
"Maximum accumulated reward by following the policy from time (t+1)."
],
"answer": "Expected return for the reward at time (t+1) and maximum discounted reward thereafter.",
"creator": "Arnab",
"creationDate": "2021-08-03T08:12:26.884Z"
}
How are you so good at explaining😍😍😍😍
exploration of reinforcement learning is going fine !
{
"question": "Q table is defined as _______________ and _______________",
"choices": [
"action and state",
"action and agent",
"state and environment",
"environment and action"
],
"answer": "action and state",
"creator": "Hivemind",
"creationDate": "2021-01-03T20:01:41.577Z"
}
Thanks, ash! Just added your question to deeplizard.com/learn/video/qhRNvCVVJaA :)
Thanks so muuuuch !
Your voice is great
Awesome explanation. I like this
Amazing video!
Your videos are awesome! Please correct the corresponding quiz, since the answer looks incorrect to me. Could you do a video explaining the first three steps and how the Q-table updates? That would really help in understanding how the update works. Thank you!
Thanks for the good explanation and all your work. A little hint if I may: Don't explain the words using the same words: Exploitation and Exploration
Your voice is just amazing 😍😍😍😍😍
you saved my life bro
Thank you!
merci merci
hats off
Is the exploration vs. exploitation part only used during training, or does it also happen when actually using the learned Q-table? And can the policy be "take the action which has the largest Q-value, and sometimes explore" (i.e., can that be an example of a policy in this case)? Since the policy is just the probability of taking some action in a state, can the policy just be written as "take the action which has the largest Q-value" (as an example of pure exploitation)?
I'll tell you what I really don't get: it seems the equation only updates the Q-table based on the current and next state, but the Bellman equation seems to imply that all future states are considered. Is there some recursion going on?
Really good work
Link to the talk that appeared at the end of the video?
You say that Q-learning tries to find the best policy. However, I thought Q-learning is an off-policy algorithm. I also have trouble understanding the on-policy/off-policy concept.
Hey @deeplizard,
Many thanks for this video. I'm reading 'Reinforcement Learning: An Introduction, Second Edition' by Richard S. Sutton and Andrew G. Barto, and I'd like to know whether the Q-learning technique described here is the same as the dynamic programming explained in the book?
It is a temporal-difference learning technique.
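For reference, the textbook Q-learning update (as in Sutton and Barto; not necessarily the exact notation used in this video) is a temporal-difference update:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
$$

It bootstraps from the estimate at the next state that was actually observed, rather than sweeping over a full model of the environment the way dynamic programming does.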
Very clear :)
Won't a square become empty when the cricket(s) is(are) eaten?
What should I do in the case of continuous tasks? Like in Flappy Bird (if it's continuous, but anyway), I guess the Q-table would be infinite here, or would need some big fixed size to save memory. Can you give some recommendations or explain, please? I want to start an implementation, but I don't know how the Q-table should look in this case and how to interact with it correctly (and I hope there will be no other surprises lol).
Yes, a Q-table would not be feasible for this task. Keep going in the series, and you will see how you can use Deep Q-Learning for these tasks. Essentially, you replace the Q-table with a neural network.
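As a rough sketch of that idea (the network shape, state encoding, and action count below are my own illustrative assumptions, not taken from the series): instead of looking up Q(s, a) in a table, a network takes the state as input and outputs one estimated Q-value per action.

```python
import torch
import torch.nn as nn

# Illustrative only: the state is encoded as a small feature vector,
# and there are 2 actions (e.g., flap / don't flap).
STATE_DIM = 4
N_ACTIONS = 2

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),  # one Q-value estimate per action
)

state = torch.randn(1, STATE_DIM)        # a single observed state
q_values = q_net(state)                  # shape: (1, N_ACTIONS)
greedy_action = q_values.argmax(dim=1)   # exploit: pick the highest-valued action
print(q_values, greedy_action)
```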
I guess crickets make sound, so the lizard could take that as input as well to choose its path.
🦗🎶🦎
Thanks for the video, it was useful. But for me it would have been even more useful if you'd explained the gamma value (i.e., the discount factor) in the formula as well.
Hey Aleistar - You're welcome! We first introduce the discount rate (gamma) a couple of episodes back where we learned about expected return. Check out the video/blog where it is introduced and defined here:
deeplizard.com/learn/video/a-SnJtmBtyA
Let me know if this helps!
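For reference, the discount rate gamma appears in the definition of the discounted return. In standard notation (as in Sutton and Barto, with gamma between 0 and 1):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

so rewards received further in the future are weighted less heavily.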
Hi, I wanted to ask a question: when the agent is trained on this lizard example on this 3x3 board,
if we placed it on a 6x6 board, can it still perform, or is that another kind of reinforcement learning?
This technique would still work with a 6x6 board.
@@deeplizard But we need to train it first to generate the Q-table, right?
I mean, it cannot be trained on one board and then run on a different board; even using a 3x3 board with different reward placements will not work?
Like with regression, you fit a line and then use it wherever you like, but here you cannot, because the states should be the same, isn't that right?
Yes, I thought you were asking in general if the Q-learning with value iteration technique would work on a 6x6 board. If you changed the board, then you would need to change and initialize the Q-table as well before training starts.
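A minimal sketch of what "change and initialize the Q-table" could look like for a square board of arbitrary size (the zero-initialization and the four-action assumption are my own illustrative choices):

```python
import numpy as np

def init_q_table(board_size: int, n_actions: int = 4) -> np.ndarray:
    """Create a fresh Q-table for a board_size x board_size grid.

    Each cell of the grid is a state; rows index states, columns index actions.
    All Q-values start at zero before training.
    """
    n_states = board_size * board_size
    return np.zeros((n_states, n_actions))

q_table_3x3 = init_q_table(3)   # shape (9, 4)  -- a 3x3 board like the video's example
q_table_6x6 = init_q_table(6)   # shape (36, 4) -- a new table is needed for a 6x6 board
print(q_table_3x3.shape, q_table_6x6.shape)
```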
@@deeplizard So this kind of agent is environment-specific.
Did you watch the OpenAI hide-and-seek agents?
They seem to adapt to a new environment without training.
I see this kind of agent as limited in its use, since it cannot be used in the real world if it needs to be trained on every new environment.
I am a newbie, so I am really just asking to get a clearer answer. If you have seen the OpenAI video, I would like to know which type of reinforcement learning can adapt to new environments.
Q-Star Lizard Gang 2024
Check out the corresponding blog and other resources for this video at:
deeplizard.com/learn/video/qhRNvCVVJaA
channel name is creepy but explanation is amazing...
👻
@@deeplizard :)
Hello,
can someone please explain to me why there are 6 empty states?
The six empty states are arbitrary. Think about a video game where some actions will cause you to gain points, some actions will cause you to lose points or lose the game, and some actions will have no immediate effect on your score. With the lizard game example, we have a similar setup where moving to a tile with crickets will gain points, moving to a tile with a bird will lose points/lose the game, and moving to an empty tile has no immediate effect on our score.
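As a small illustration of that reward structure (the numeric values here are placeholders I chose to match the description above, not the exact ones used in the video):

```python
# Hypothetical reward signal for the lizard game; values are illustrative.
REWARDS = {
    "crickets": +1,   # gaining points
    "bird": -10,      # losing points / losing the game
    "empty": 0,       # no immediate effect on the score
}

def reward_for(tile_type: str) -> int:
    return REWARDS[tile_type]

print(reward_for("empty"))  # 0 -- stepping on an empty tile changes nothing immediately
```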
@@deeplizard Hey ty for the answer! When I take a look at the picture I still don't get why there are 6 empty tiles. I don't count 6 empty tiles 🤔
In the photo, the lizard is on one of the empty tiles. The lizard is the agent, and she is free to move to any tile.
@@deeplizard Wow that was so obvious. Ty for the help :) Now i get it. Nice video and have a nice day :)
Where is the next video?
It is being developed! Aiming to add a new video to this series every 3-4 days.
@@deeplizard Your videos are really good.
Can you make videos on Kaggle projects, or build a project that would make learning even more interesting?
We may do some Kaggle videos in the future. We do have the following two series that show practical deep learning projects in both Keras and TensorFlow.js.
deeplizard.com/learn/playlist/PLZbbT5o_s2xr83l8w44N_g3pygvajLrJ-
deeplizard.com/learn/playlist/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL
O' the puns, ... exploit vs explore
The sound at 00:30 that I hear in every video is quite disturbing 😅
Please remove the sound played with the logo at the start of the video. The sound is very unpleasant, especially when listening on headphones.
I am too stupid to understand the video.. My bad..
Luv u babe.