This is the best explanation of PPO on the net hands down
Easily the best explanation of PPO I've ever seen. Most papers and lectures get too tangled up in the probabilistic principles and advanced mathematical derivations and completely lose sight of what these models are doing in high-level terms.
This guy actually knows what he's talking about. Excellent video.
He is actually much better than Siraj Raval.
@@ahmadayazamin3313 what kind of scandal?
@@ahmadayazamin3313 I don't know about it; just watch one or two of his videos demonstrating RL for trading.
He seems to know what he's talking about.
He (Siraj) never explained any of what he was saying, and that was why I stopped watching him. He just rushed through the context and the results, explaining nothing.
@@oracletrading3000 'How to predict the stock market in 5 minutes'? More like how to expose oneself as a fraud and end a career in that time.
As someone who works in the RL field: you did a very good job.
12:19
The min operator also undoes bad policy updates:
IF the advantage is positive but the probability of taking that action decreased, the min operator selects the unclipped objective here to undo the bad update.
IF the advantage is negative but the probability of taking that action increased, the min operator also selects the unclipped objective to undo the bad update, just as mentioned in the video.
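A quick numeric sketch of the two cases described above (the ratio and advantage values are made up; epsilon = 0.2 is just the paper's default):

```python
epsilon = 0.2

def surrogate_term(ratio, advantage):
    # min( r*A, clip(r, 1-eps, 1+eps)*A ) as in the PPO objective
    clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

# Case 1: A > 0 but the action became less likely (ratio < 1 - epsilon):
print(surrogate_term(0.6, +2.0))   # 1.2  -> unclipped 0.6*2.0 selected over clipped 0.8*2.0
# Case 2: A < 0 but the action became more likely (ratio > 1 + epsilon):
print(surrogate_term(1.5, -2.0))   # -3.0 -> unclipped 1.5*-2.0 selected over clipped 1.2*-2.0
```

In both cases the min operator keeps the unclipped term, so the gradient can still pull the policy back toward the old one.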
I'm loving this RL series. Keep it up!
Keep it up. Brevity is the soul of wit, it is indeed a skill to summarize the crux of a concept in such lucid way..!
I actually understood your explanation cover to cover on first view and thought the 19 minutes felt more like 5.
Outstanding work.
One view, no pauses?! Not willing to be mean, but how can you be sure you've truly understood and weren't conquering Mount Stupid the whole time?
By far the best explanation on RUclips.
The value you provide in these videos is insane !
Thank you very much for guiding our learning process ;)
Explained so well and it was intuitive as well. I learnt more from this video than all the articles I found in the internet. Great job.
Excellent video! Wonderful resource for anyone participating in AWS DeepRacer competitions.
It is a long video, no doubt, but once you finish watching it you realize it was much better than actually reading the paper. Thanks, man!
Amazing! This was the best explanation of PPO I have seen so far
one of the best overview of PPO, clean.
Thank you for including links for learning more on the description.
Wonderful, this is the first video i've seen on this channel. I suspect it won't be the last!
Thank you for the video, it is very helpful. The key concepts have been explained in just 20min, bravo. I would like to see more videos from your channel. Thank you.
best explanation of PPO I've found. Thanks
Great explanation with enough details! Thumbs up for all the free knowledge on the internet!
I watched this video more than 5 times and this was the best video about PPO. Thank you for making great videos like this and keep up the good work. P.S.: Your explanation was even simpler than that of the algorithm's creator, Schulman. :)
Fantastic review of policy gradients, and PPO as well! Best place for a refresh
Coming back to this after thoroughly understanding Q-learning and looking into the advantage function in another network, this explanation is FAST. I wonder who could understand all that is happening without background knowledge.
Well, for AI/ML some background info is needed. If you're taking multivariable calculus, it's assumed you already know calculus. For those who already work in machine learning, this video is amazing. If I didn't get something, I can research what he's talking about because he's using the proper technical terms, not dumbing it down. It's a wake-up call for what I need to know to be knowledgeable. Great video.
This topic is so far from my comprehension, and yet you got me to understand it within 3 minutes
Great breakdown of PPO. You've simplified a lot of complex concepts to make them understandable! Hahaha... and you can't beat an octopus slap!!!
Much cleaner than deep learning boot camp explanation
I watched all your videos today, great works! Love them!
Fantastic intuitive explanation, thank you.
Great breakdown and links for additional resources
Thank you so much for this video! This is way more insightful and intuitive than simply reading the papers!
This video was very well done, I definitely got a lot of value out of it. Thank you for your work!
Dude, your channel is awesome! So glad I found it!
This video is absolutely amazing!!
Great video, but I have a couple of doubts:
1. In PPO, how does changing the objective help in restricting the updates to the policy? Wouldn’t it make more sense to restrict the gradient so that we don’t update the policy too much in one go?
2. In PPO, when A
16:46 How do you make use of the GPU in PPO? For me, it was hard to implement experience collection with CUDA in PyTorch; actually, it seems even OpenAI didn't use the GPU in the collection process.
You need to vectorize the environments. PPO2 now does this (see twitter.com/openai/status/931226402048811008?lang=en). In my experience GPU usage is still very low during collection, however.
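A rough sketch of the vectorized-collection idea mentioned in this reply (this assumes gymnasium's SyncVectorEnv API and a hypothetical policy network called `policy_net`; it is not the actual PPO2 implementation):

```python
import gymnasium as gym
import torch

num_envs = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(num_envs)]
)

obs, _ = envs.reset(seed=0)
for _ in range(128):
    # One forward pass scores all num_envs observations at once, so the
    # policy sees a batch per step instead of a single state.
    obs_t = torch.as_tensor(obs, dtype=torch.float32)      # shape: (num_envs, obs_dim)
    with torch.no_grad():
        logits = policy_net(obs_t)                          # hypothetical policy network
        actions = torch.distributions.Categorical(logits=logits).sample()
    obs, rewards, terminated, truncated, _ = envs.step(actions.numpy())
```

Even with batching, most of the wall-clock time during collection is spent stepping the environments on the CPU, which is consistent with the low GPU usage mentioned above.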
Thanks for sharing this! I may be misunderstanding something, but it seems like there might be a mistake in the description. Specifically, the claim in 12:50 that "this is the only region where the unclipped part... has a lower value than the clipped version".
I think this claim might be wrong, because there could be another case where the unclipped version would be selected:
For example, if the ratio is, e.g., 0.5 (and we assume epsilon is 0.2), the unclipped ratio would be smaller than its clipped version (which would be 0.8), and the unclipped term would be selected.
Is that not the case?
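A quick numeric check of the case raised in this comment (a positive advantage is assumed here, since the sign of A determines which term is smaller):

```python
epsilon = 0.2
A = 1.0      # assumed positive advantage for this check
ratio = 0.5  # the new policy made this action much less likely

unclipped = ratio * A                                     # 0.5
clipped = max(min(ratio, 1 + epsilon), 1 - epsilon) * A   # clip(0.5, 0.8, 1.2) * A = 0.8
print(min(unclipped, clipped))                            # 0.5 -> the unclipped term is selected
```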
Thank you for such a clear explanation. I was able to watch this at 2x speed and follow everything, which is a testament to your clarity. It really helped that you tied PPO back to previous work (e.g TRPO)
Thank you for helping me understand PPO faster; good explanation with useful resources included.
I don't know if I'll get answers here but I have some questions:
1) Why are we taking the "min" in the loss function?
2) Are we using 1 in 1-e and 1+e because the reward we give for each positive action is 1? My question here is about the scaling factor.
Third video in a row. Really enjoy your work. Keep it up! And thank you!!!
I watch Siraj Raval for the motivation, but I watch Arxiv Insights for the explanations
Excellent algorithm and explanation!
Amazing explanation. Keep up the good work.
I love how you take the formula apart an look at it step by step. Great work!
Great Video. Excellent intro to this topic .
At 10:27 he says that the first part in the min() expression is the "default policy gradient objective", but I do not really see how, since the objective function is usually J = E[R_t]. Does anyone understand this?
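One way to see the connection (a sketch in the paper's notation, not taken from the video): at the old parameters the ratio objective and the familiar log-probability objective have identical gradients, and the log-probability form is the standard estimator of the gradient of J = E[R_t]:

```latex
\nabla_\theta \left.\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\right|_{\theta_\text{old}}
  = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)\big|_{\theta_\text{old}}}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
  = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta_\text{old}}
\quad\Longrightarrow\quad
\nabla_\theta\, \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]\Big|_{\theta_\text{old}}
  = \nabla_\theta\, \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]\Big|_{\theta_\text{old}}
```

So the two objectives agree to first order around the old parameters, which is why the ratio term can be read as the "default" policy gradient objective.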
Congrats! You have a special skill to explain AI.
This is really great! Keep up the good work!
Thank you for the clean explanation
3:54 to 4:10 or so, why does that section remind me of the method used for ray marching in image rendering?
Great explanation and references
I have some questions! Taking a quick step back to the Policy Gradient Loss for a sec, we had:
Loss = E ( [log prob] * advantage )
If my understanding is correct, then we actually have two neural networks here. One that calculates the probabilities of each action (this is the policy network we are trying to optimise), and an entirely different neural network that tries to guess the value of being in the current state. Q1 - does the value network simply learn off mean-squared-error by minimising ([actual discounted reward] - [value net prediction])^2? Is there no way to use policy gradient methods without running 2 networks?
Q2 - How do we actually calculate the discounted reward for a neural network where only the probabilities of each action are taken? For example, if at time step 0, our NN produces:
Act 1 : 20%
Act 2 : 30%
Act 4 : 50%
I can only take one of these actions to end up in a new state. Do we take the highest one? Or do we, for each trajectory, randomly pick one based on their probability of being chosen? Do we do this for every time step t = 1 to T?
After the trajectory of T timesteps, we get one 'actual' value for G, which is attributed to the timestep at time t = 0. Does this mean we can only perform gradient descent on this single observation? If we do a minibatch, do we need multiple trajectories, say 50, each of length T, and then do gradient descent on the 50 where only the G value for t = 0 of each has been calculated?
My apologies for the questions, hopefully they make sense and I'm just looking to confirm my understanding :)
Hi, really good questions!
Q1: you can train a policy gradient method without using a value function by just training the policy network, but using a value function to estimate the expected return from the current state tends to make things much more stable.
Q2: You're correct that this might seem a bit weird, but indeed you have to probabilistically sample an action at each timestep and then play out the episode along that specific path in state space. However, on average and over time, each action will in fact get selected according to its probability. So stochastically, every action gets played!
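A minimal sketch of the two-network setup discussed in this thread, assuming placeholder PyTorch modules called `policy_net` and `value_net` (names and shapes are illustrative, not from the video):

```python
import torch
import torch.nn.functional as F

def select_action(policy_net, state):
    # Q2: sample one action per step according to the policy's probabilities
    logits = policy_net(state)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action)

def value_loss(value_net, states, discounted_returns):
    # Q1: the value network is typically regressed onto the observed
    # discounted returns with a mean-squared-error loss
    predictions = value_net(states).squeeze(-1)
    return F.mse_loss(predictions, discounted_returns)
```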
Hats off, mate. This is fantastic.
12:56 The real power of clipping is that it automatically ignores outlier samples. Not decreasing their influence, but totally ignoring them! This is because the gradient of outlier samples is 0.
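A small PyTorch check of that claim (the numbers are made up; epsilon = 0.2): when the ratio lies outside the clip range and the clipped branch is the one selected by min(), no gradient flows back to the ratio.

```python
import torch

ratio = torch.tensor(1.5, requires_grad=True)   # far outside [0.8, 1.2]
advantage = torch.tensor(1.0)

clipped = torch.clamp(ratio, 0.8, 1.2) * advantage   # 1.2, and clamp has zero gradient here
surrogate = torch.min(ratio * advantage, clipped)    # min selects the clipped branch (1.2)
surrogate.backward()
print(ratio.grad)  # tensor(0.) -> this sample contributes no gradient
```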
Thanks buddy, really appreciated!
Thank you so much. I'm watching this video for the 10th time :-D
Great video! I had a question though: at 6:50, the objective function, which you called a loss, is actually the function that we'd want to maximize, right? I mean, calling it a loss gave me the idea that we should minimize it. Correct me if I am wrong, please.
Yes, we are trying to maximise the advantage. It is called "loss" simply because it has the same function as (true) loss functions in other domains. It might get tricky when you implement a multi-head neural network with Actor-Critic-Methods where you combine different loss functions (GAE for the actor, lambda returns for the critic, entropy for exploration) as you have to make sure which "loss" you aim to maximise and which to minimise.
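A small sketch of the sign bookkeeping described in this reply (the coefficients c1 and c2 are illustrative, and `policy_objective`, `value_loss`, `entropy` are placeholders for whatever estimators you use):

```python
def combined_loss(policy_objective, value_loss, entropy, c1=0.5, c2=0.01):
    # The clipped policy objective and the entropy bonus are maximized,
    # the value-function error is minimized, so the single scalar handed
    # to a minimizing optimizer flips the signs of the first and last terms.
    return -policy_objective + c1 * value_loss - c2 * entropy
```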
Just thought I would let you know that I just shared your video with my cohort in the Udacity Reinforcement Learning Nanodegree. We are going through PPO now and this video is relevant and timely - especially w.r.t. the clip region explanations. Any ideas on how to convert the outputs from a discrete to a continuous action space?
Sounds great, thx for sharing! Well as I mentioned, the PPO policy head outputs the parameters of a Gaussian distribution (so means and variances) for each action. At runtime, you can then sample from these distributions to get continuous output values and use the reparametrization trick to backpropagate gradients through this non-differentiable block --> check out my video on Variational Autoencoders for all the details on this!
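A minimal sketch of such a Gaussian policy head with the reparametrization trick, assuming some upstream feature extractor produces `features` (the layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features):
        dist = torch.distributions.Normal(self.mean(features), self.log_std.exp())
        # rsample() is the reparametrized sample, so gradients flow back
        # through the sampled continuous action
        action = dist.rsample()
        return action, dist.log_prob(action).sum(-1)
```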
I really love your video, professional and informative, thanks.
keep up the good work , sir. thanks for this awesome explanation
I need to dive deeper into this.
9:56
"Looks surprising simple,,, .. right?"
.
.
.
:(
i felt that
Thanks for this video! Really good teaching skills! :)
great video for ppo! thanks a lot for you work!
Amazing explanation.
Also. I just noticed that at 9:12 the seal slaps the guy with an octopus o.O
Pretty good explanation and very understandable thanks!
Love your videos!!
Do we actually need machine learning for single-agent situations? It seems that if we don't have any adversarial factors, then we only need path planning, which should be doable much better by a SAT/SMT solver or variations of A*. To me it seems that RL in cases like moving a cube with a hand is the same as "let's throw a bunch of neural networks at our problem and wait until it invents an approximation of multi-dimensional A* for us". And it still doesn't provide any guarantees that it will use the best trajectories (while path planning algorithms do).
The problem is that path planning requires you to have access to a somewhat accurate forward model of the environment, and it requires quite a lot of computation at runtime, since you need a significant number of sampled forward trajectories in order to get decent performance. A trained policy network avoids both those constraints.
But I do agree that current RL methods are far from optimal. The biggest problem from my point of view is that we currently have no idea how to do meaningful abstraction/generalization. What works is overfitting on a dense sampling of data from the problem space, but things like transfer learning / one-shot generalization are very big problems right now and we'll need some radically new approaches to tackle those!
@@ArxivInsights [I wanted to write that it would be interesting to try using ML to find the rules and then use solver to achieve the goal, but then I recalled a project called AIRIS that does exactly that.]
more, just more videos. so well explained.
This is gold!!
really good explanation!
Subscribed because the topic's so cool!
*Great* video. *Great* explanation!
I'm really confused about what the epsilon is and why it's there. Epsilon is generally used to refer to "a very small number" used to give very small bounds on things. So clip(x, 1-e, 1+e) is basically just 1 right? Why isn't the objective just min(r(θ)A, 1) ?
Epsilon is not that small. You can think of epsilon as a hyperparameter specifying how much the new policy is allowed to differ from the old policy: if you want PPO to update the policy more radically you increase epsilon, and if you want smaller updates you decrease it (closer to 0, but not so close that you could replace clip(x, 1-e, 1+e) with 1).
In PPO, this epsilon value is something like 0.2, so you're clipping r(θ) to within the range [0.8, 1.2] and then multiplying that value with the advantage estimate.
But as I explain in the video, the final result after the min() operator depends on the sign of A (positive or negative). So, for A>0 the 1-e clip doesn't matter, since whenever r(θ) becomes smaller than 0.8, its unclipped version will still get returned by the min() operator. The analogous argument holds for A<0.
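A minimal sketch of that clipped objective with epsilon = 0.2, written for a batch of ratios and advantages (tensor names are illustrative):

```python
import torch

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # This is an objective to *maximize*; negate it if you hand it to an
    # optimizer as a loss.
    return torch.min(unclipped, clipped).mean()
```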
This really encapsulates the brilliance of the method: it's a dynamic regulator, like a differential gear, keeping the policy converging on regimes that are "proximal" to ones previously shown to be optimal.
@@DavidSaintloth like bounded sampling in some GA: don't sample too far from the current best/working parameter region.
Thank you! Learned a lot from this!
simply the best ever
A comment about the PPO paper, not this video: there's a minor typo in Eq. (10, 11).
The terms in the exponent should read T-t-1 rather than T-t+1.
Would you agree?
Great explanation.
There is something I didn't understand: if you clip r, then how do you do backpropagation? The gradient will just be zero in the case r > 1+epsilon or r < 1-epsilon.
It was a great explanation!
Please do a video on Soft Actor-Critic and Maximum Entropy RL! That would be amazing!
Very nice videos. FYI: Please watch at 0.75 speed for better understanding, LOL!
very good explanations
Hi Andrew,
Can you please make a video explaining OpenAI's transformer model, Google's BERT, and OpenAI's GPT & GPT-2 models?
I can't seem to wrap my head around them.
2:59 Are you sure you wrote the difference between on-policy and off-policy correctly? On-policy means the agent chooses the next step using the same policy that it is currently updating. Off-policy means that we choose steps according to one policy (an exploratory behaviour policy) and learn a completely different policy, the target policy. I think that's the difference; the other differences mentioned are not valid.
Great video as usual! Just a suggestion, maybe instead of diving directly into deep RL you can make videos (shorter if you don't have much time to devote) on simpler RL algorithms like DQN, Q-learning. That way someone who wants to know more about RL can build up the knowledge through more vanilla stuff. I admire your style and content like many others and would love to see it grow more :)
You're totally right that this is not easy to step into if you are new to RL, but I feel like there are tons and tons of good introduction resources out there for the simpler stuff.
I'm really trying to focus on the niche of people that already have this background and want to go a bit further :p
And I think that makes this channel amazing. Practically following the state of the art is a fantastic concept. As I am just curious about AI as a hobby and do science in a different sector, I am glad that I don't need to go through tons of articles by myself; you show me the direction where I should look to stay in the picture. Thank you.
I was about to say what @Arxiv Insights is saying. There's an amazing book called Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, which introduces all the basic concepts you're talking about. The second edition of this book has been made publicly available as a PDF and will be available on Amazon next month (don't quote me on the release date please :P).
I'm doing beginner level videos on my channel and would love your feedback... ruclips.net/channel/UCrRTWfso9OS3D09-QSLA5jg
Sure, thanks for the videos :)
Amazing work :D
At 2:54 you talked about online vs offline learning, but the screen shows some comparisons between off-policy and on-policy learning. Otherwise cool video!
Hi, I'm really interested in your videos and I found that they are really helpful for people who are learning RL in particular and AI in general. Your way of presenting the notions and key ideas behind these algorithms is amazing. It's sad to find that you haven't updated your videos for 2 years. Do you have any other channel or anything else I can learn from you? Please let me know :( It would be my pleasure.
I came here to say this
Thank you so much. Very helpful
log(probabilities) will be negative, right? So if we take a bad action, the advantage function is negative, so Lpg = negative * negative = positive. So Lpg blows up when we take bad actions. L represents the objective and not the loss function, right?
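A quick numeric illustration of the sign question above (the values are made up):

```python
import math

log_prob = math.log(0.1)   # log of a probability is always <= 0, here about -2.3
advantage = -1.5           # a "bad" action: negative advantage

term = log_prob * advantage   # (-2.3) * (-1.5) = about +3.45
# L_pg is indeed an objective to *maximize*: the derivative of the term with
# respect to log_prob is the advantage (-1.5), so gradient ascent pushes
# log_prob down, i.e. makes the bad action less likely.
print(term)
```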
A big thanks for the video :)
great content keep up man
Sherlock xx
YOU ARE AWESOME!
Very well explained, thank you
Can you do a video about DDPG?
Also, the PPO I know simply uses the discounted future returns in the loss. Is the variant using the advantage instead the standard one?
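A minimal sketch of the difference being asked about: raw discounted returns versus a simple baseline-subtracted advantage estimate (the PPO paper itself uses GAE, which is a refinement of this; `values` is assumed to be a tensor of value-network predictions):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def advantages(rewards, values, gamma=0.99):
    # Subtracting the value baseline reduces variance without changing
    # the expected gradient direction.
    return discounted_returns(rewards, gamma) - values
```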
How do I tackle a moving-target problem using methods from RL, where I have more than one reward, 3 possible actions to take, and of course a state that includes many factors / sources of information about the environment?
Your help is highly appreciated.
Thank you, it's really a good explanation
Hey @Arxiv Insights, I have a basic question regarding Reinforcement Learning and would really appreciate your help. What is the basic difference between Reinforcement Learning, Deep Learning and Deep Reinforcement Learning? Does basic Reinforcement Learning take advantage of neural networks to find the best solution and therefore use Deep Learning? Thank you very much in advance; I'm trying to get an overview and understand all the differences for my Master's thesis at the moment...
Why do we take the log of the policy in the loss??
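One short way to see where the log comes from (a standard likelihood-ratio sketch for a discrete action space, not quoted from the video):

```latex
\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\right]
  = \nabla_\theta \sum_a \pi_\theta(a)\, R(a)
  = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, R(a)
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a)\, R(a)\right]
```

using the identity that the gradient of pi equals pi times the gradient of log pi. The log turns the gradient of an expectation into an expectation of a gradient, which can be estimated from sampled actions.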