Thank you so much for this talk, Olivier. It helped me a lot figuring out the theory of the paper before I implemented the algorithm. I hope you keep up making this kind of content. The community will surely appreciate it.
Thank you Prof. Sigaud. Compact, but also very illustrative presentation of SAC.
Olivier, thank you very much for detailed explanation.
OMG, thank you so much for explaining the actor update part, it was so confusing in the paper, you made it so clear!!
I'm glad it helped you
Thank you for this tutorial. It’s really clear and helpful 🎉.
A very useful video, thank you
Very good explanation, thanks
Great presentation and slides! I was reading the Spinning Up RL explanation of it and had a hard time wrapping my head around it; this made it way clearer! Thank you!
The slides you link in the description seem to be a little different from the ones used in the video (the linked version has 15 pages)
Thank you very much. You are right about the slides, they correspond to the previous version, I will fix this asap, probably tomorrow.
The slides are now OK
@@OlivierSigaud Thank you for the quick reply! :D
Thank you so much for the detailed explanation
You are welcome ;)
I was struggling to find a neat and concise explanation for the SAC algorithm and your video is spot on. Thank you very much and carry on the good work. I would be interested in what you have to say about the recent development of Black-box algorithms in the field of RL.
Thanks for the nice comment about my videos. Which kind of black-box algorithms do you have in mind?
@@OlivierSigaud Gradient-free methods and genetic algorithms in particular. I hear they are a really strong alternative to gradient-based methods.
@@abdowaraiet2169 I confirm that Evolution Strategies, in particular, are strong alternatives to Deep RL. I hope I will be able to build a set of videos on this topic next year. If you want to know more meanwhile, search for "deep neuroevolution" and "CEM-RL" on the web.
Thank you for summing this all up for us :)
Thanks. Very interesting and useful.
Thanks a lot Olivier, please keep up the great work!
Quick question regarding the final loss of the policy: is the target critic used, or only the local one? The original papers don't clarify whether they use the target or the other one for the policy loss.
Another question, also regarding the policy loss: is the input of the critic s_(t+1) or s_(t)? In the original paper, the critic in the policy loss takes s_(t) as input.
To compute the policy loss, it makes more sense to use the current critic rather than the target critic, as the former is more up-to-date than the latter (the latter is a slow tracker). And you should use s_t rather than s_{t+1}.
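For concreteness, here is a minimal PyTorch-style sketch of that policy update (assumed names, not the exact code from the paper or the video; actor.sample is assumed to return a reparameterized action and its log-probability):

import torch

def policy_loss(actor, critic_1, critic_2, states, alpha):
    # reparameterized sample so gradients flow through the action
    actions, log_probs = actor.sample(states)
    # query the current (non-target) critics at s_t with the fresh action
    q_min = torch.min(critic_1(states, actions), critic_2(states, actions))
    # SAC maximizes Q - alpha * log_pi, so we minimize the opposite
    return (alpha * log_probs - q_min).mean()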
Thank you for the answer, Dr.! @@OlivierSigaud
I don't understand how the entropy on slide 19 can be negative: H = -|A|, with |A| the number of actions, for example 1, so H = -1?? And what about loss(alpha)? It looks like a linear function of alpha, so alpha will always increase, or am I missing something?
\bar{H} in slide 19 is the target entropy. In a continuous action space, your target entropy should be minus the dimensionality of the action space, computed e.g. with: -np.prod(env.action_space.shape).astype(np.float32)
You can have a look at this paper, end of section 3: arxiv.org/pdf/2209.10081.pdf
The loss on the alpha optimizer is entropy_coef_loss = -(log_entropy_coef * (action_logprobs + target_entropy)).mean(), so it is not trivially linear in alpha
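Putting the two points together, a small PyTorch-style sketch of the entropy coefficient update (assumed variable names, env is assumed to be a gym environment, lr is arbitrary):

import numpy as np
import torch

# target entropy = minus the dimensionality of the action space
target_entropy = -np.prod(env.action_space.shape).astype(np.float32)
log_entropy_coef = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_entropy_coef], lr=3e-4)

def update_entropy_coef(action_logprobs):
    # action_logprobs: log pi(a_t|s_t) on the current batch, detached from the actor graph
    entropy_coef_loss = -(log_entropy_coef * (action_logprobs + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    entropy_coef_loss.backward()
    alpha_optimizer.step()
    return log_entropy_coef.exp().item()  # the alpha value used in the other losses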
@@OlivierSigaud Well, I understand, it's like a regulator that increases or decreases alpha depending on the current entropy, but why is the entropy for continuous actions negative, or is it just a name?
@@OlivierSigaud I set something like this for discrete actions: H = tf.reduce_sum(y_pi1*logpi, axis=-1)
and lossa = -self.alphav*(H - tf.math.log(pt)*0.98); let's see what happens
very useful, thanks
Thank you all for the positive feedback about our video. It is very rewarding to read that people find it useful.
Can you explain how to write the action space if the action bounds are not symmetric, i.e. not like [-1, 1]?
This raises no particular issue. You just define your gym Box space with the bounds you have
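For instance (hypothetical bounds, assuming the classic gym API):

import numpy as np
from gym import spaces

# an action space with asymmetric bounds [30, 50]
action_space = spaces.Box(low=30.0, high=50.0, shape=(1,), dtype=np.float32)
print(action_space.sample())  # always lies in [30, 50]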
@@OlivierSigaud What if I have my own environment, and the action bounds are something like 30-50? I think it will raise an issue because the output of the actor network is a tanh output, which is in [-1, 1]. So how will the actor get trained? Please let me know.
@@tanujajoshi1901 There are two options. Either you remove the tanh and just use a Gaussian policy instead of a squashed Gaussian policy. If you do this, you count on the spread of the Gaussian to cover 30-50: it should tune the center of your Gaussian to be ~40 and the spread to represent ~10 on each side. But with low probability it will suggest values beyond 30-50, which you will have to clip. The other option is to keep the tanh, so the output of your actor is in ]-1,1[, and then rescale this output affinely to [30,50] (something like 10x + 40, where x is the tanh output). I think this option is better.
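A toy sketch of the second option, with the 30-50 bounds from your example (hypothetical helper, not library code):

import numpy as np

def rescale_action(x, low=30.0, high=50.0):
    # x is the tanh output of the actor, in ]-1, 1[
    return low + (x + 1.0) * (high - low) / 2.0  # equivalent to 10x + 40 here

print(rescale_action(np.tanh(0.0)))  # 40.0, the middle of the range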
@@OlivierSigaud OK Thanks
I have one more question... Why do we need to apply exp to the 'standard deviation' output of the actor network?
audio SACs
Unfortunately, I have to agree. I will do my best to provide a new version in the coming months.
@@OlivierSigaud Don't worry, it's not so bad: the important thing is the quality of the lesson, and that's already great ;)