Training a Deep Q-Network with Fixed Q-targets - Reinforcement Learning

  • Published: 16 Oct 2024

Comments • 122

  • @deeplizard
    @deeplizard  5 years ago +10

    Check out the blog for this video!
    deeplizard.com/learn/video/xVkPh9E9GfE

  • @AvZNaV
    @AvZNaV 5 years ago +146

    This channel is criminally underrated.

    • @sphynxusa
      @sphynxusa 5 years ago +6

      It's hard to see the flower for the forest. Eventually, they will find her and others will soon follow.

  • @pushkarparanjpe
    @pushkarparanjpe 3 years ago +3

    Your style of gently concluding each video by leading into a speaker's talk is an interesting touch :)

  • @omkarlubal6799
    @omkarlubal6799 8 months ago +1

    Truly the best course on reinforcement learning out there!

  • @fkaraman
    @fkaraman 6 months ago +1

    This series of videos saved a few months of time.

  • @MrDonald911
    @MrDonald911 5 years ago +25

    Please upload the next video as soon as you can :D I can't wait to apply this to my little projects. And thanks btw for your way of explaining!

  • @hangchen
    @hangchen 5 years ago +6

    This is the only tutorial playlist I've gone through entirely from beginning to end, and I'm dying for the next video update. Thanks a lot deeplizard! Your voice, your knowledge, and the way you explain a concept make you a different level of a person, or rather a lizard, a deep one!!

    • @deeplizard
      @deeplizard  5 years ago

      Thank you, Hang! 🦎🦎🦎

  • @reuttal2633
    @reuttal2633 4 months ago +1

    Amazing series of videos. Thank you so much

  • @cyborgx1156
    @cyborgx1156 3 years ago

    The clip at the end of the videos really helps

  • @AkshatSharma-qx9wh
    @AkshatSharma-qx9wh 2 years ago +1

    Just amazing!! I am at a loss for words!

  • @benjamindeporte3806
    @benjamindeporte3806 3 years ago +1

    Excellent series of videos. Many thanks.

  • @ugurkaraaslan9285
    @ugurkaraaslan9285 5 years ago +3

    Excellent tutorial 👏 Looking forward to watching the coding part! Thanks a lot 👍

  • @Javimac92
    @Javimac92 5 years ago +1

    Thank you very much deeplizard. I found your videos on DQN techniques very clear and easy to understand!

  • @mariushoppe8880
    @mariushoppe8880 5 years ago +9

    Hey Deeplizard, I am still having a bit of trouble understanding the problems of not having fixed Q-targets. With the tail-chasing problem, even though we are moving both the Q function and the Q* function, which is chased by our Q function, don't they still move in the direction of the optimal solution, since we are using gradient descent to minimize our TD error? I can't think of an example of how chasing the tail becomes a problem (since we are moving in the direction of steepest descent, therefore making our policy better, right?). Any intel on this would be greatly appreciated, since I've been scratching my head over this for quite some time now... Anyways, love your videos! Keep it up :)

    • @deeplizard
      @deeplizard  5 years ago +7

      Hey Marius - The paper below, by the individuals who coined fixed Q-targets, is the most concrete explanation I could find for _why_ the additional target network adds stability. Let me know if this provides any further insight or helps at all. Here is a relevant snippet:
      The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q^ and use Q^ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
      web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
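
A minimal sketch of the cloning schedule that snippet describes, assuming a PyTorch setup; the network shapes, the TARGET_UPDATE interval (the paper's C), and the bare training loop are illustrative stand-ins, not the series' actual code:

    import copy
    import torch

    # Illustrative policy network; any module mapping states to Q-values works here.
    policy_net = torch.nn.Sequential(
        torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
    target_net = copy.deepcopy(policy_net)  # clone Q to obtain the target network Q^
    target_net.eval()                       # the target network is never trained directly

    TARGET_UPDATE = 10  # "C" in the paper: number of updates between clones

    for step in range(1000):
        # ... act, store the experience, compute the loss against target_net,
        # and take one optimization step on policy_net ...
        if step % TARGET_UPDATE == 0:
            # every C updates, copy the policy network's weights into the target network
            target_net.load_state_dict(policy_net.state_dict())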

  • @kevin5k2008
    @kevin5k2008 5 years ago +3

    Great series. Looking forward to your code demo for DQN. If I may suggest, it would be great if you could continue your examples from the previous RL FrozenLake example. =)

  • @jadechip
    @jadechip 5 years ago +1

    This helped me form an intuition surrounding fixed Q-targets, thank you!

  • @chronicfantastic
    @chronicfantastic 5 years ago +1

    Great video! Thanks, looking forward to the rest of the series

  • @gourabchanda2102
    @gourabchanda2102 3 years ago

    Assignment to be submitted!! deeplizard to the rescue!!!

  • @jeffshen7058
    @jeffshen7058 5 years ago +1

    Thanks for all the videos. Great animations, great explanations, great content, great little jokes. No filler, just quality content. I love it. Please keep the videos coming! Looking forward to a video about implementing DQNs in PyTorch.
    If you don't mind me asking, where did you learn all this? What is your (if any) formal educational background? Are there channels apart from this one that you'd recommend (regardless of whether it is for DL stuff or not)? Sorry for all the questions, just curious. Maybe a Q&A in the future?

    • @deeplizard
      @deeplizard  5 years ago +3

      You're so welcome, Jeff! Really glad you're enjoying the content!
      I learned DL through self-led study and personal projects. My formal education was in mathematics and econ/finance with some programming/tech mixed in as well. As for other channels, I recommend Jeremy Howard's channel for his fast.ai course, as well as Andrew Ng's deeplearning.ai channel.

  • @kayaonur5657
    @kayaonur5657 5 years ago +1

    This channel is priceless. Looking forward to the next vid!

  • @fesqvw7522
    @fesqvw7522 5 years ago +1

    This series is amazing!!!!! You are really great at explaining. KEEP IT GOING!!

  • @kasperbuskpedersen
    @kasperbuskpedersen 5 years ago +1

    Thank you so much for this playlist. What would make it really perfect would be a last episode implementing deep Q-learning with Keras / PyTorch like you did for FrozenLake :)

    • @deeplizard
      @deeplizard  5 years ago

      Yes, that will come in the following videos :D

  • @Meo646
    @Meo646 5 years ago +1

    Thank you very much for your great series. Helped me a lot with my project. Greetz from Germany! :-)

  • @dimigian
    @dimigian 5 years ago +1

    Excellent job as always. Looking forward to the next chapter/episode :)

  • @keyangke
    @keyangke 4 years ago

    I've got a question about the target network technique: the loss function does not converge, but the network still makes the agent work. Why is that?
    btw, best AI channel ever!

  • @gandalfdegrey
    @gandalfdegrey 3 years ago +2

    Hi, I hope you guys are still there to answer my question:
    It still seems to me like it's chasing its own tail, but this time, instead of the target moving away at each update, it just moves away after x updates. So what's the difference?

  • @arkasaha4412
    @arkasaha4412 5 years ago +1

    Great video! I am facing some issues understanding how the process begins. Like, in the beginning, both networks have random weights. The target network spits out random Q-values, right? So how exactly are we optimizing to get better?

  • @parthpiyushprasad709
    @parthpiyushprasad709 4 years ago

    Okay. So I understand a lot but still have quite a few doubts.
    1) So the thing we are ultimately trying to perfect is the actions the agent takes, right? So when we explore, that's well and good.
    But when we want to exploit, we do a pass with the policy network, right?
    2) Is the Q-value for a given state-action pair the expected reward? If so, aren't we relying on this to be right, when it is really only an expectation?
    3) Do we store all states in the experience memory, both the ones that we explore and the ones we exploit?
    Thanks so much deeplizard! I love your videos so much even though I don't get much of what you are saying...

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      The network is an approximation, which means it isn't exactly accurate; it'll adjust the weights based on the loss until convergence

  • @caigvaar
    @caigvaar 4 years ago +1

    Thanks a lot! It was really helpful

  • @dariosangiovanni1502
    @dariosangiovanni1502 5 years ago +1

    Hey guys, thanks for this channel! What about Actor Critic methods? Will they be in the next videos?

  • @tameraburouk1352
    @tameraburouk1352 5 years ago +1

    Hello, I would like to thank you for your deep teaching method; it's very clear, explaining things step by step. May I ask when you will start the next videos? Mainly the code part.

    • @deeplizard
      @deeplizard  5 years ago +1

      Thanks, Tamer! I'm hoping to start releasing new RL videos to this series within the next few weeks!

  • @sharifrezaie5438
    @sharifrezaie5438 4 years ago

    Just love it. Concise and insightful.

  • @harshadevapriyankarabandar5456
    @harshadevapriyankarabandar5456 5 years ago +8

    Waiting for the DQN code project

  • @tingnews7273
    @tingnews7273 5 years ago

    I didn't get totally clear on the whole process, but I did get this idea clearly, because it's just like a GAN. Done.

  • @amrindersingh5768
    @amrindersingh5768 5 years ago +1

    Love the way you explain things

  • @sshuva56
    @sshuva56 5 years ago

    Hi! Your previous videos made me understand the training process of a deep Q-network. These are really helpful. I am working with a multi-agent environment that needs to be trained via DQN. In the case of a multi-agent system, how will the DQN train its agents?

  • @manuelkarner8746
    @manuelkarner8746 3 years ago +1

    so before, the network had "no time" = just one try to adapt to the target & correct its error, but now we give it some time = many tries, because the target stays fixed for a certain amount of time -- very cool.
    I wonder how all of this can actually work lol :)

  • @vibekdutta6539
    @vibekdutta6539 5 years ago +2

    Awesome content as always! Love this channel!

  • @tomw4688
    @tomw4688 4 years ago

    When doing the gradient descent backprop, where does the reward value fit into all this? I only understand that the network is trained using the calculated loss (e.g. output Q-value minus target Q-value). Thank you.

  • @АлексейТучак-м4ч
    @АлексейТучак-м4ч 5 years ago

    I think that reading and writing experience could be controlled by a special algorithm that makes sure the agent remembers mostly valuable things.
    Such a memory censor could use experience uniqueness as its value, which in turn could be assessed as the component orthogonal to already stored system state vectors, or by some mutual information measure.

  • @christianjt7018
    @christianjt7018 5 years ago +1

    Can you please make a code implementation for this algorithm?
    Your tutorial series has been awesome.

    • @deeplizard
      @deeplizard  5 years ago

      Hey Chris - Yes, a code project implementing a deep Q-network will be coming in the next videos! I've been on pause with this series momentarily, but it will continue!

  • @sarvagyagupta1744
    @sarvagyagupta1744 5 years ago

    So let me get this right. We use the 2nd network as some sort of reference to properly see how efficient our system is? Something like, if we learn how to drive, we wouldn't know how well we drive without the reference of other drivers? Or how well we play a sport? Am I correct here?

  • @Αναξ
    @Αναξ 5 years ago +4

    I deeply love your voice and your content. Hope you are not an agent.

  • @adrianoavelar1482
    @adrianoavelar1482 4 years ago +1

    Perfect!

  • @amoghgadagi8445
    @amoghgadagi8445 1 year ago

    Since the policy network is updated every step and the target network is updated only after a certain number of steps, won't this affect the training of the agent? As the agent continuously tries to get closer to the older weights of the target network, how will the learning happen? Please correct me if I am wrong.
    The concern is that the agent may struggle to learn effectively if it constantly tries to match the older weights of the target network.

  • @shayanahmadi6024
    @shayanahmadi6024 1 year ago

    Can you make a video about adding a critic network to this model?

  • @BenyNotNice
    @BenyNotNice 4 years ago +1

    I'm not sure the previous algorithm was really chasing its own tail...
    we added the reward obtained to our maximum future Q, so the neural network was getting new information after all.
    Can anyone enlighten me please?

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      The target network is fixed for a while, so its weights are constant and your Q-network can converge toward those values; once they converge, the target network is then updated

  • @aburehan6965
    @aburehan6965 4 years ago +2

    ...the way the lizard in the logo is chasing its own tail.

  • @chidambarjoshi3470
    @chidambarjoshi3470 9 months ago

    Does cloning the policy network and making it a target network mean this is the A3C algorithm?

  • @Nissearne12
    @Nissearne12 11 months ago

    ❤ Ahh cool. Sounds reasonable to have a cloned, frozen target network. How often do we update our cloned frozen target network's weights, once every 100 samples or …? Very nice tutorial 👍🥰

  • @chuanjiang6931
    @chuanjiang6931 4 months ago

    Is the concept the same as the Actor-Critic method? Is the 2nd network the critic?

  • @vanvan3134
    @vanvan3134 4 years ago

    Which network do we use to produce real-time actions by choosing the max Q-value? Is it the target Q-network?

  • @krishbaisoya9863
    @krishbaisoya9863 3 years ago +2

    🔥🔥👍

  • @amrindersingh5768
    @amrindersingh5768 5 years ago

    Please start a series on NLP.

  • @girish7914
    @girish7914 5 years ago

    Can you please show us a demo of how to create a basic startup using DQN... an Android startup, and how we can apply these things to solve real-life problems?

  • @ManarMagdy-yc2tm
    @ManarMagdy-yc2tm 4 months ago

    Where can I find examples using DQN?

  • @moritzpainz1839
    @moritzpainz1839 4 years ago

    Well, when we start the training, the weights of the target network are initialized randomly, meaning the optimal Q-value which we calculate using this network is a random number too, right? So why does it work, and where am I wrong?

    • @gorkemvids4839
      @gorkemvids4839 4 years ago

      The target network's output is the same as or very near the policy network's, because the weights and biases are cloned often.

  • @Aer0xander
    @Aer0xander 5 years ago +1

    Really appreciate these videos! Could you make a video about ONNX or a tutorial about ONNX.js? Thanks :D

    • @deeplizard
      @deeplizard  5 years ago +1

      I haven't played with ONNX yet, but I'll put it on my radar. Thanks for the suggestion! In the meantime, we do have other NN JavaScript tutorials available in the TensorFlow.js series if you're interested in more DL JavaScript resources!
      deeplizard.com/learn/playlist/PLZbbT5o_s2xr83l8w44N_g3pygvajLrJ-

    • @Aer0xander
      @Aer0xander 5 years ago

      @@deeplizard I just checked those out; that's why I was interested in ONNX.js 😄 They claim it's much faster and uses WebAssembly for the CPU, though TF.js will also add that next year

  • @daviddisbrow2222
    @daviddisbrow2222 5 years ago

    At the very start of the process, the replay memory is empty. Does the algorithm have some initial steps to load the replay memory? That is, is optimization delayed N timesteps until the N-sized replay memory is full?

    • @deeplizard
      @deeplizard  5 years ago

      No, the algorithm can start to optimize right after we store the first experience in replay memory. The only thing that might come up here is that if our batch size is greater than the number of experiences in memory at the start, then we'll have to account for how to handle that scenario in code.
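
A minimal replay-memory sketch with that kind of batch-size guard; the class and method names here are illustrative rather than the exact implementation shown later in the series:

    import random
    from collections import deque, namedtuple

    Experience = namedtuple('Experience', ('state', 'action', 'reward', 'next_state'))

    class ReplayMemory:
        def __init__(self, capacity):
            self.memory = deque(maxlen=capacity)  # oldest experiences drop out once full

        def push(self, *args):
            self.memory.append(Experience(*args))

        def can_provide_sample(self, batch_size):
            # only sample once at least one full batch has been stored
            return len(self.memory) >= batch_size

        def sample(self, batch_size):
            return random.sample(list(self.memory), batch_size)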

  • @snowboyyuhui
    @snowboyyuhui 5 years ago

    Great series. When's the next video?

  • @maciejrogala5928
    @maciejrogala5928 4 years ago

    Why will the Q-value and target Q-value move in the same direction? We compute the Q-value when we input s; on the other hand, the target Q-value is computed by inputting s', so couldn't there be a scenario where, for example, the Q-value gets bigger and the target gets smaller?
    Or is the point that we just don't want the target to move at all? Why do we assume that the target will always move in the same direction if the input is different? Theoretically, if we want to distinguish cats vs. dogs, and we want the dog probability to go down, it doesn't matter for the cat input - the output stays the same (and does not behave like the output for the dog). Here the dog is s and the cat is s'...
    Maybe there is a proof for that?

  • @AvZNaV
    @AvZNaV 5 years ago +1

    Hey, just a heads up, the link to the blog in the description is leading to a 404.

    • @deeplizard
      @deeplizard  5 years ago +1

      Hey AvZ - Thank you! Really appreciate the heads up! Good looking out! 🙏
      Finishing up the blog now. It should be up soon. This also reminds me that we need to put up a custom 404 page. 😊

    • @deeplizard
      @deeplizard  5 years ago +1

      It's up now :)

  • @luischinchilla-garcia5640
    @luischinchilla-garcia5640 5 years ago

    So in this approach, you don't need the entire history of each episode? This would be an online method since you only need the next step instead?

  • @chadjensenster
    @chadjensenster 5 years ago

    Why run a second pass? You have twice the computations through your network. Why is it not possible to get the states from the tuple directly and find the error?

    • @deeplizard
      @deeplizard  5 years ago

      We do indeed get the next state directly from the tuple; however, what we need is the maximum Q-value for that next state among all possible next actions. To get that Q-value, we have to do a (second) pass through the network. Does this help?
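
A sketch of that second pass with toy shapes; the networks and tensors below are stand-ins, and the series' own PyTorch code arrives in later episodes:

    import torch

    # Toy batch: 3 experiences, 4-dimensional states, 2 possible actions.
    policy_net  = torch.nn.Linear(4, 2)
    target_net  = torch.nn.Linear(4, 2)
    states      = torch.randn(3, 4)
    actions     = torch.tensor([0, 1, 0])
    rewards     = torch.randn(3)
    next_states = torch.randn(3, 4)
    gamma = 0.999  # discount factor (illustrative value)

    # First pass: Q-values the policy network gives to the actions that were actually taken.
    current_q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Second pass: maximum Q-value over all actions for each next state, from the target network.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values

    # Bellman target (terminal next states would typically have their max term zeroed out).
    target_q = rewards + gamma * max_next_q

    loss = torch.nn.functional.mse_loss(current_q, target_q)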

  • @marzmohammadi8739
    @marzmohammadi8739 2 years ago

    Does anybody have the link to the TED Talk shown at the end of this video?

  • @m.zubairislam3405
    @m.zubairislam3405 5 years ago

    Hello, respected sir! I read your blogs and watch your videos about RL. I want to build an ITS (intelligent tutoring system).
    Example
    Bot: what is 1 + 1?
    User: 3 (wrong)
    Bot: picks the next question according to whether the user's answer was correct or incorrect.
    It should recommend questions to the user according to the user's ability.
    Sir, please guide me on that.

  • @anishbatra8215
    @anishbatra8215 5 years ago

    When are you going to continue this series?

    • @deeplizard
      @deeplizard  5 years ago +1

      Hoping to get this RL series continuing within the next few weeks!

  • @shubhamwasnik5118
    @shubhamwasnik5118 5 years ago +1

    Please upload the next video; we are waiting for your videos

  • @akankshamaurya8920
    @akankshamaurya8920 5 years ago +3

    Whoever is disliking needs to leave the planet.

  • @aminezitoun3725
    @aminezitoun3725 5 years ago

    Hey, don't you think that using two neural networks is going to be slow though? It would take ages to train the agent, especially on a slow PC like mine xD

    • @deeplizard
      @deeplizard  5 years ago +1

      Hey Amine - The training itself is only occurring on one network, the policy network. The target network just gets a forward pass but is not having to go through SGD, backprop, etc.

    • @aminezitoun3725
      @aminezitoun3725 5 years ago

      @@deeplizard Aight, thanks for responding! One question though xD It's not about the video; I just wanted to ask what you think about AI and whether it's really going to take over the world xD I'm really confused
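
To make the point above concrete, a short illustrative sketch (hypothetical names): the optimizer only ever sees the policy network's parameters, and the target network is used strictly for gradient-free forward passes, so it adds no backprop cost.

    import torch

    policy_net = torch.nn.Linear(4, 2)
    target_net = torch.nn.Linear(4, 2)
    target_net.load_state_dict(policy_net.state_dict())  # start the clone in sync

    # Only the policy network is trained; the target network never touches the optimizer.
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

    # The target network only does forward passes, with gradient tracking disabled,
    # so there is no extra backprop cost for keeping it around.
    with torch.no_grad():
        next_q = target_net(torch.randn(3, 4))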

  • @LightningSpeedtop
    @LightningSpeedtop 4 years ago

    @deeplizard I'm really hoping you still answer questions. I can see that you're immediately sampling from your memory after the first time step; wouldn't that just return the first experience?

    • @deeplizard
      @deeplizard  4 years ago

      In the early stages, we are unable to sample a batch from memory if memory is not yet large enough to return a sample. You will see exactly how this is implemented in code in the following episodes. Specifically, it is explained in the video and blog below in the Replay Memory section > function can_provide_sample().
      deeplizard.com/learn/video/PyQNfsGUnQA

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      deeplizard so does this mean it'll do that for the remaining time steps until the batch is full, and then it'll sample from the batch?

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      deeplizard and also, when we are calculating our loss function and backpropagating, do we keep calculating until we reach convergence for every state?

  • @saurabhkhodake
    @saurabhkhodake 5 years ago

    When can we expect the next video?

    • @deeplizard
      @deeplizard  5 years ago +1

      Hoping to get this RL series continuing within the next few weeks!

    • @saurabhkhodake
      @saurabhkhodake 5 years ago

      @@deeplizard yayyy.... Can't wait

  • @gayathrirangu5488
    @gayathrirangu5488 1 year ago

    Can someone please explain why this works? Having a fixed target gives stability, no doubt.

  • @grigorijdudnik575
    @grigorijdudnik575 3 years ago +1

    {
    "question": "Why do we need to copy the policy network as a target network every x timesteps?",
    "choices": [
    "Using the same network for calculating both output values and target values could introduce instability into the learning process.",
    "Because separating the networks is more computation-efficient.",
    "Values in a single network change more slowly than in two of them.",
    "Using two different networks can protect us from the vanishing gradient problem."
    ],
    "answer": "Using the same network for calculating both output values and target values could introduce instability into the learning process.",
    "creator": "Grigorij",
    "creationDate": "2021-03-04T10:29:50.707Z"
    }

    • @deeplizard
      @deeplizard  3 years ago

      Thanks, Grigorij! Just added your question to deeplizard.com/learn/video/xVkPh9E9GfE :)

  • @Небудьбараном-к1м
    @Небудьбараном-к1м 4 years ago

    Does anyone know why adding CNNs to this ruins the performance??? Even one conv layer just smashes the agent into failure...

  • @hanserj169
    @hanserj169 4 years ago

    I didn't get why having the same weights and the same network is a problem.

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      You're trying to reach convergence. If you use the same network, the weights are constantly changing, which means the target Q will also be changing, making the loss calculation far less meaningful since the two Qs will never converge

    • @hanserj169
      @hanserj169 4 years ago

      @@LightningSpeedtop Perfect explanation. Thanks, I got it now.

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      Hanser J anytime. I also have an issue, if you can help: at the first time step, we're already sampling from the memory; wouldn't that return what we just put in?

    • @hanserj169
      @hanserj169 4 years ago

      @@LightningSpeedtop I suppose it does, but I guess this only happens at the very first steps, until the replay memory reaches the established batch size or gets full. After that, when the memory is full, the oldest stored experience is the one that gets replaced, then the second oldest, then the third, and so on.

    • @LightningSpeedtop
      @LightningSpeedtop 4 years ago

      Hanser J thank you, and one more thing: when we compute the loss and backpropagate, do we keep doing it until convergence for a given state, or do we only do it once and move on to the next state?
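
On the two questions in this thread: the targets are held still by computing them from the frozen target network, and in a typical DQN loop only one gradient step is taken per sampled batch rather than iterating to convergence for each state. A sketch under those assumptions, with illustrative names:

    import torch
    import torch.nn.functional as F

    def train_step(policy_net, target_net, optimizer, batch, gamma=0.999):
        states, actions, rewards, next_states = batch

        # Q(s, a) from the policy network for the actions that were actually taken.
        current_q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Targets come from the frozen target network, so they don't shift with this update.
        with torch.no_grad():
            target_q = rewards + gamma * target_net(next_states).max(dim=1).values

        # One gradient step per sampled batch, then training moves on to the next time step.
        loss = F.mse_loss(current_q, target_q)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()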

  • @yokrysty
    @yokrysty 5 years ago +1

    I'm so sad that this series isn't continuing. Is this channel dead?

    • @deeplizard
      @deeplizard  5 years ago +1

      No, we're alive! We've had some things prevent us from uploading recently, but we're aiming to have new content released within the next couple weeks! This series will continue.

    • @yokrysty
      @yokrysty 5 years ago +3

      @@deeplizard I'm happy to hear that. Keep up the good work. Thank you for your reply.