This is the best explanation of PPO on the net hands down
Easily the best explanation of PPO I've ever seen. Most papers and lectures get too tangled up in the probabilistic principles and advanced mathematical derivations and completely lose sight of what these models are doing in high-level terms.
This guy actually knows what he's talking about. Excellent video.
He is actually much better than Siraj Raval.
@@ahmadayazamin3313 what kind of scandal?
@@ahmadayazamin3313 I don't know about it; just watch one or two of his videos demonstrating RL for trading.
He seems to know what he's talking about.
He (Siraj) never explained any of what he was saying, and that was why I stopped watching him. He just rushed through the context and the results, explaining nothing.
@@oracletrading3000 'How to predict the stock market in 5 minutes'? More like how to expose oneself as a fraud and end a career in that time.
As someone who works in the RL field: you did a very good job.
12:19
The min operator also undoes bad policy updates:
IF the advantage is positive but the probability of taking that action decreased, the min operator selects the unclipped objective here to undo the bad update.
IF the advantage is negative but the probability of taking that action increased, the min operator also selects the unclipped objective to undo the bad update, just as mentioned in the video.
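A quick numeric sketch of the two cases described above (the ratio and advantage values are made up; epsilon = 0.2 is just the paper's default):

```python
epsilon = 0.2

def surrogate_term(ratio, advantage):
    # min( r*A, clip(r, 1-eps, 1+eps)*A ) as in the PPO objective
    clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

# Case 1: A > 0 but the action became less likely (ratio < 1 - epsilon):
print(surrogate_term(0.6, +2.0))   # 1.2  -> unclipped 0.6*2.0 selected over clipped 0.8*2.0
# Case 2: A < 0 but the action became more likely (ratio > 1 + epsilon):
print(surrogate_term(1.5, -2.0))   # -3.0 -> unclipped 1.5*-2.0 selected over clipped 1.2*-2.0
```

In both cases the min operator keeps the unclipped term, so the gradient can still pull the policy back toward the old one.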
I'm loving this RL series. Keep it up!
Keep it up. Brevity is the soul of wit, it is indeed a skill to summarize the crux of a concept in such lucid way..!
I actually understood your explanation cover to cover on first view and thought the 19 minutes felt more like 5.
Outstanding work.
One view, no pauses?! Not willing to be mean, but how can you be sure you've truly understood and weren't conquering Mount Stupid the whole time?
By far the best explanation on RUclips.
The value you provide in these videos is insane !
Thank you very much for guiding our learning process ;)
Explained so well and it was intuitive as well. I learnt more from this video than all the articles I found in the internet. Great job.
Excellent video! Wonderful resource for anyone participating in AWS DeepRacer competitions.
It is a long video, no doubt, but once you finish watching it you realize it was much better than actually reading the paper. Thanks, man!
Amazing! This was the best explanation of PPO I have seen so far
one of the best overview of PPO, clean.
Thank you for including links for learning more on the description.
Wonderful, this is the first video i've seen on this channel. I suspect it won't be the last!
Thank you for the video, it is very helpful. The key concepts have been explained in just 20min, bravo. I would like to see more videos from your channel. Thank you.
best explanation of PPO I've found. Thanks
Great explanation with enough details! Thumbs up for all the free knowledge on the internet!
I watched this video more than 5 times and this was the best video about PPO. Thank you for making great videos like this and keep up the good work. P.S.: Your explanation was even simpler than that of the algorithm's creator, Schulman. :)
Fantastic review of policy gradients, and PPO as well! Best place for a refresh
Coming back to this after thoroughly understanding Q-learning and looking into the advantage function in another network, this explanation is FAST. I wonder who could understand all that is happening without background knowledge.
Well, for AI/ML some background info is needed. If you're taking multivariable calculus, it's assumed you already know calculus. For those who already work in machine learning, this video is amazing. If I didn't get something, I can research what he's talking about because he's using the proper technical terms, not dumbing it down. It's a wake-up call for what I need to know to be knowledgeable. Great video.
This topic is so far from my comprehension, and yet you got me to understand it within 3 minutes
Great breakdown of PPO. You've simplified a lot of complex concepts to make them understandable! Hahaha... and you can't beat an octopus slap!!!
Much cleaner than deep learning boot camp explanation
I watched all your videos today, great works! Love them!
Fantastic intuitive explanation, thank you.
Great breakdown and links for additional resources
Thank you so much for this video! This is way more insightful and intuitive than simply reading the papers!
This video was very well done, I definitely got a lot of value out of it. Thank you for your work!
Dude, your channel is awesome! So glad I found it!
This video is absolutely amazing!!
Great video, but I have a couple of doubts:
1. In PPO, how does changing the objective help in restricting the updates to the policy? Wouldn’t it make more sense to restrict the gradient so that we don’t update the policy too much in one go?
2. In PPO, when A
16:46 How do you make use of the GPU in PPO? For me, it was hard to implement experience collection with CUDA in PyTorch; actually, it seems even OpenAI didn't use the GPU in the collection process.
You need to vectorize the environments. PPO2 now does this (see twitter.com/openai/status/931226402048811008?lang=en). In my experience GPU usage is still very low during collection, however.
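A rough sketch of the vectorized-collection idea mentioned in this reply (this assumes gymnasium's SyncVectorEnv API and a hypothetical policy network called `policy_net`; it is not the actual PPO2 implementation):

```python
import gymnasium as gym
import torch

num_envs = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(num_envs)]
)

obs, _ = envs.reset(seed=0)
for _ in range(128):
    # One forward pass scores all num_envs observations at once, so the
    # policy sees a batch per step instead of a single state.
    obs_t = torch.as_tensor(obs, dtype=torch.float32)      # shape: (num_envs, obs_dim)
    with torch.no_grad():
        logits = policy_net(obs_t)                          # hypothetical policy network
        actions = torch.distributions.Categorical(logits=logits).sample()
    obs, rewards, terminated, truncated, _ = envs.step(actions.numpy())
```

Even with batching, most of the wall-clock time during collection is spent stepping the environments on the CPU, which is consistent with the low GPU usage mentioned above.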
Thanks for sharing this! I may be misunderstanding something, but it seems like there might be a mistake in the description. Specifically, the claim in 12:50 that "this is the only region where the unclipped part... has a lower value than the clipped version".
I think this claim might be wrong, because there could be another case where the unclipped version would be selected:
For example, if the ratio is, e.g., 0.5 (and we assume epsilon is 0.2), the unclipped ratio would be smaller than its clipped version (which would be 0.8), and the unclipped term would be selected.
Is that not the case?
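A quick numeric check of the case raised in this comment (a positive advantage is assumed here, since the sign of A determines which term is smaller):

```python
epsilon = 0.2
A = 1.0      # assumed positive advantage for this check
ratio = 0.5  # the new policy made this action much less likely

unclipped = ratio * A                                     # 0.5
clipped = max(min(ratio, 1 + epsilon), 1 - epsilon) * A   # clip(0.5, 0.8, 1.2) * A = 0.8
print(min(unclipped, clipped))                            # 0.5 -> the unclipped term is selected
```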
Thank you for such a clear explanation. I was able to watch this at 2x speed and follow everything, which is a testament to your clarity. It really helped that you tied PPO back to previous work (e.g TRPO)
Thank you for helping me understand PPO faster; good explanation with useful resources included.
I don't know if I'll get answers here but I have some questions:
1) Why are we taking the "min" in the loss function?
2) Are we using 1 in 1-e and 1+e because the reward we give for each positive action is 1? My question here is about the scaling factor.
Third video in a row. Really enjoy your work. Keep it up! And thank you!!!
I watch Siraj Raval for the motivation, but I watch Arxiv Insights for the explanations
Excellent algorithm and explanation!
Amazing explanation. Keep up the good work.
I love how you take the formula apart an look at it step by step. Great work!
Great Video. Excellent intro to this topic .
At 10:27 he says that the first part in the min() expression is the "default policy gradient objective", but I do not really see how, since the objective function is usually J = E[R_t]. Does anyone understand this?
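One way to see the connection (a sketch in the paper's notation, not taken from the video): at the old parameters the ratio objective and the familiar log-probability objective have identical gradients, and the log-probability form is the standard estimator of the gradient of J = E[R_t]:

```latex
\nabla_\theta \left.\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\right|_{\theta_\text{old}}
  = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)\big|_{\theta_\text{old}}}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
  = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_\text{old}}
\quad\Longrightarrow\quad
\nabla_\theta\, \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]\Big|_{\theta_\text{old}}
  = \nabla_\theta\, \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]\Big|_{\theta_\text{old}}
```

So the two objectives agree to first order around the old parameters, which is why the ratio term can be read as the "default" policy gradient objective.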
Congrats! You have a special skill to explain AI.
This is really great! Keep up the good work!
Thank you for the clean explanation
3:54 to 4:10 or so, why does that section remind me of the method used for ray marching in image rendering?
Great explanation and references
I have some questions! Taking a quick step back to the Policy Gradient Loss for a sec, we had:
Loss = E ( [log prob] * advantage )
If my understanding is correct, then we actually have two neural networks here. One that calculates the probabilities of each action (this is the policy network we are trying to optimise), and an entirely different neural network that tries to guess the value of being in the current state. Q1 - does the value network simply learn off mean-squared-error by minimising ([actual discounted reward] - [value net prediction])^2? Is there no way to use policy gradient methods without running 2 networks?
Q2 - How do we actually calculate the discounted reward for a neural network where only the probabilities of each action are taken? For example, if at time step 0, our NN produces:
Act 1 : 20%
Act 2 : 30%
Act 4 : 50%
I can only take one of these actions to end up in a new state. Do we take the highest one? Or do we, for each trajectory, randomly pick one based on their probability of being chosen? Do we do this for every time step t = 1 to T?
After the trajectory of T timesteps, we get one 'actual' value for G, which is attributed to the timestep at time t = 0. Does this mean we can only perform gradient descent on this single observation? If we do a minibatch, do we need multiple trajectories, say 50, each of length T, and then do gradient descent on the 50 where only the G value for t = 0 of each has been calculated?
My apologies for the questions, hopefully they make sense and I'm just looking to confirm my understanding :)
Hi, really good questions!
Q1: you can train a policy gradient method without using a value function by just training the policy network, but using a value function to estimate the expected return from the current state tends to make things much more stable.
Q2: You're correct that this might seem a bit weird, but indeed you have to probabilistically sample an action at each timestep and then play out the episode along that specific path in state space. However, on average and over time, each action will in fact get selected according to its probability. So stochastically, every action gets played!
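A minimal sketch of the two-network setup discussed in this thread, assuming placeholder PyTorch modules called `policy_net` and `value_net` (names and shapes are illustrative, not from the video):

```python
import torch
import torch.nn.functional as F

def select_action(policy_net, state):
    # Q2: sample one action per step according to the policy's probabilities
    logits = policy_net(state)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action)

def value_loss(value_net, states, discounted_returns):
    # Q1: the value network is typically regressed onto the observed
    # discounted returns with a mean-squared-error loss
    predictions = value_net(states).squeeze(-1)
    return F.mse_loss(predictions, discounted_returns)
```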
Hats off, mate. This is fantastic.
12:56 The real power of clipping is that it automatically ignores outlier samples. Not decreasing their influence, but totally ignoring them! This is because the gradient of outlier samples is 0.
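A small PyTorch check of that claim (the numbers are made up; epsilon = 0.2): when the ratio lies outside the clip range and the clipped branch is the one selected by min(), no gradient flows back to the ratio.

```python
import torch

ratio = torch.tensor(1.5, requires_grad=True)   # far outside [0.8, 1.2]
advantage = torch.tensor(1.0)

clipped = torch.clamp(ratio, 0.8, 1.2) * advantage   # 1.2, and clamp has zero gradient here
surrogate = torch.min(ratio * advantage, clipped)    # min selects the clipped branch (1.2)
surrogate.backward()
print(ratio.grad)  # tensor(0.) -> this sample contributes no gradient
```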
Thanks buddy, really appreciated!
Thank you so much. I'm watching this video for the 10th time :-D
Great video! I had a question though: at 6:50, the objective function, which you called a loss, is actually the function that we'd want to maximize, right? I mean, calling it a loss gave me the idea that we should minimize it. Correct me if I am wrong, please.
Yes, we are trying to maximise the advantage. It is called "loss" simply because it has the same function as (true) loss functions in other domains. It might get tricky when you implement a multi-head neural network with Actor-Critic-Methods where you combine different loss functions (GAE for the actor, lambda returns for the critic, entropy for exploration) as you have to make sure which "loss" you aim to maximise and which to minimise.
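A small sketch of the sign bookkeeping described in this reply (the coefficients c1 and c2 are illustrative, and `policy_objective`, `value_loss`, `entropy` are placeholders for whatever estimators you use):

```python
def combined_loss(policy_objective, value_loss, entropy, c1=0.5, c2=0.01):
    # The clipped policy objective and the entropy bonus are maximized,
    # the value-function error is minimized, so the single scalar handed
    # to a minimizing optimizer flips the signs of the first and last terms.
    return -policy_objective + c1 * value_loss - c2 * entropy
```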
Just thought I would let you know that I just shared your video with my cohort in the Udacity Reinforcement Learning Nanodegree. We are going through PPO now and this video is relevant and timely - especially w.r.t. the clip region explanations. Any ideas on how to convert the outputs from a discrete to a continuous action space?
Sounds great, thx for sharing! Well as I mentioned, the PPO policy head outputs the parameters of a Gaussian distribution (so means and variances) for each action. At runtime, you can then sample from these distributions to get continuous output values and use the reparametrization trick to backpropagate gradients through this non-differentiable block --> check out my video on Variational Autoencoders for all the details on this!
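A minimal sketch of such a Gaussian policy head with the reparametrization trick, assuming some upstream feature extractor produces `features` (the layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features):
        dist = torch.distributions.Normal(self.mean(features), self.log_std.exp())
        # rsample() is the reparametrized sample, so gradients flow back
        # through the sampled continuous action
        action = dist.rsample()
        return action, dist.log_prob(action).sum(-1)
```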
I really love your video, professional and informative, thanks.
keep up the good work , sir. thanks for this awesome explanation
I need to dive deeper into this.
9:56
"Looks surprising simple,,, .. right?"
.
.
.
:(
i felt that
Thanks for this video! Really good teaching skills! :)
great video for ppo! thanks a lot for you work!
Amazing explanation.
Also. I just noticed that at 9:12 the seal slaps the guy with an octopus o.O
Pretty good explanation and very understandable thanks!
Love your videos!!
Do we actually need machine learning for single-agent situations? It seems that if we don't have any adversarial factors, then we only need path planning, which should be doable much better by a SAT/SMT solver or variations of A*. To me it seems that RL in cases like moving a cube with a hand is the same as "let's throw a bunch of neural networks at our problem and wait until it invents an approximation of multi-dimensional A* for us". And it still doesn't provide any guarantees that it will use the best trajectories (while path planning algorithms do).
The problem is that path planning requires you to have access to a somewhat accurate forward model of the environment, and it requires quite a lot of computation at runtime, since you need a significant number of sampled forward trajectories in order to get decent performance. A trained policy network avoids both those constraints.
But I do agree that current RL methods are far from optimal. The biggest problem from my point of view is that we currently have no idea how to do meaningful abstraction/generalization. What works is overfitting on a dense sampling of data from the problem space, but things like transfer learning / one-shot generalization are very big problems right now and we'll need some radically new approaches to tackle those!
@@ArxivInsights [I wanted to write that it would be interesting to try using ML to find the rules and then use solver to achieve the goal, but then I recalled a project called AIRIS that does exactly that.]
more, just more videos. so well explained.
This is gold!!
really good explanation!
Subscribed because the topic's so cool!
*Great* video. *Great* explanation!
I'm really confused about what the epsilon is and why it's there. Epsilon is generally used to refer to "a very small number" used to give very small bounds on things. So clip(x, 1-e, 1+e) is basically just 1 right? Why isn't the objective just min(r(θ)A, 1) ?
Epsilon is not that small. You can think of epsilon as a hyperparameter specifying how much the new policy is allowed to differ from the old policy: if you want PPO to update the policy more radically you increase epsilon, and if you want smaller updates you decrease it (closer to 0, but not so close that you could replace clip(x, 1-e, 1+e) with 1).
In PPO, this epsilon value is something like 0.2, so you're clipping r(θ) to within the range [0.8, 1.2] and then multiplying that value with the advantage estimate.
But as I explain in the video, the final result after the min() operator depends on the sign of A (positive or negative). So, for A>0 the 1-e clip doesn't matter, since whenever r(θ) becomes smaller than 0.8, its unclipped version will still get returned by the min() operator. The analogous argument holds for A<0.
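A minimal sketch of that clipped objective with epsilon = 0.2, written for a batch of ratios and advantages (tensor names are illustrative):

```python
import torch

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # This is an objective to *maximize*; negate it if you hand it to an
    # optimizer as a loss.
    return torch.min(unclipped, clipped).mean()
```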
This really encapsulates the brilliance of the method: it's a dynamic regulator, like a differential gear, keeping the policy converging on regimes that are "proximal" to ones previously shown to be optimal.
@@DavidSaintloth like bounded sampling in some GA: don't sample too far from the current best/working parameter region.
Thank you! Learned a lot from this!
simply the best ever
A comment about the PPO paper, not this video: there's a minor typo in Eq. (10, 11).
The terms in the exponent should read T-t-1 rather than T-t+1.
Would you agree?
Great explanation.
There is something I didn't understand: if you clip r, then how do you do backpropagation? The gradient will just be zero in the case r > 1+epsilon or r < 1-epsilon.
It was a great explanation!
Please do a video on Soft Actor-Critic and Maximum Entropy RL! That would be amazing!
Very nice videos. FYI: Please watch at 0.75 speed for better understanding, LOL!
very good explanations
Hi Andrew,
Can you please make a video explaining OpenAI's transformer model, Google's BERT, and OpenAI's GPT & GPT-2 models?
I can't seem to wrap my head around them.
2:59 Are you sure you wrote the difference between on-policy and off-policy correctly? On-policy means the agent chooses the next step using the same policy that it is currently updating. Off-policy means that we choose steps according to one policy (an exploratory behaviour policy) and learn a completely different policy, the target policy. I think that's the difference; the other differences mentioned are not valid.
Great video as usual! Just a suggestion, maybe instead of diving directly into deep RL you can make videos (shorter if you don't have much time to devote) on simpler RL algorithms like DQN, Q-learning. That way someone who wants to know more about RL can build up the knowledge through more vanilla stuff. I admire your style and content like many others and would love to see it grow more :)
You're totally right that this is not easy to step into if you are new to RL, but I feel like there are tons and tons of good introduction resources out there for the simpler stuff.
I'm really trying to focus on the niche of people that already have this background and want to go a bit further :p
And I think that makes this channel amazing. Practically following the state of the art is a fantastic concept. As I am just curious about AI as a hobby and do science in a different sector, I am glad that I don't need to go through tons of articles by myself; you show me the direction where I should look to stay in the picture. Thank you.
I was about to say what @Arxiv Insights is saying. There's an amazing book called Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, which introduces all the basic concepts you're talking about. The second edition of this book has been made publicly available as a PDF and will be available on Amazon next month (don't quote me on the release date please :P).
I'm doing beginner level videos on my channel and would love your feedback... ruclips.net/channel/UCrRTWfso9OS3D09-QSLA5jg
Sure, thanks for the videos :)
Amazing work :D
At 2:54 you talked about online vs offline learning, but the screen shows some comparisons between off-policy and on-policy learning. Otherwise cool video!
Hi, I'm really interested in your videos and I found that they are really helpful for people who are learning RL in particular and AI in general. Your way of presenting the notions and key ideas behind these algorithms is amazing. It's sad to find that you haven't updated your videos for 2 years. Do you have any other channel or anything else I can learn from you? Please let me know :( It would be my pleasure.
I came here to say this
Thank you so much. Very helpful
log(probabilities) will be negative, right? So if we take a bad action, the advantage function is negative, so Lpg = negative * negative = positive. So Lpg blows up when we take bad actions. L represents the objective and not the loss function, right?
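A quick numeric illustration of the sign question above (the values are made up):

```python
import math

log_prob = math.log(0.1)   # log of a probability is always <= 0, here about -2.3
advantage = -1.5           # a "bad" action: negative advantage

term = log_prob * advantage   # (-2.3) * (-1.5) = about +3.45
# L_pg is indeed an objective to *maximize*: the derivative of the term with
# respect to log_prob is the advantage (-1.5), so gradient ascent pushes
# log_prob down, i.e. makes the bad action less likely.
print(term)
```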
A big thanks for the video :)
great content keep up man
Sherlock xx
YOU ARE AWESOME!
Very well explained, thank you
Can you do a video about DDPG?
Also, the PPO I know simply uses the discounted future returns in the loss. Is the variant using the advantage instead the standard one?
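A minimal sketch of the difference being asked about: raw discounted returns versus a simple baseline-subtracted advantage estimate (the PPO paper itself uses GAE, which is a refinement of this; `values` is assumed to be a tensor of value-network predictions):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def advantages(rewards, values, gamma=0.99):
    # Subtracting the value baseline reduces variance without changing
    # the expected gradient direction.
    return discounted_returns(rewards, gamma) - values
```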
How do I tackle a moving-target problem using methods from RL, where I have more than one reward, 3 possible actions to take, and of course a state that includes many factors / sources of information about the environment?
Your help is highly appreciated.
Thank you, it's really a good explanation
Hey @Arxiv Insights, I have a basic question regarding Reinforcement Learning and would really appreciate your help. What is the basic difference between Reinforcement Learning, Deep Learning and Deep Reinforcement Learning? Does basic Reinforcement Learning take advantage of neural networks to find the best solution and therefore use Deep Learning? Thank you very much in advance; I'm trying to get an overview and understand all the differences for my Master's thesis at the moment...
Why do we take the log of the policy in the loss??
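One short way to see where the log comes from (a standard likelihood-ratio sketch for a discrete action space, not quoted from the video):

```latex
\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\right]
  = \nabla_\theta \sum_a \pi_\theta(a)\, R(a)
  = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, R(a)
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a)\, R(a)\right]
```

using the identity that the gradient of pi equals pi times the gradient of log pi. The log turns the gradient of an expectation into an expectation of a gradient, which can be estimated from sampled actions.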