Thank you so much for this talk, Olivier. It helped me a lot figuring out the theory of the paper before I implemented the algorithm. I hope you keep up making this kind of content. The community will surely appreciate it.
Thank you Prof. Sigaud. Compact, but also very illustrative presentation of SAC.
Olivier, thank you very much for detailed explanation.
OMG, thank you so much for explaining the actor update part, it was so confusing in the paper, you made it so clear!!
I'm glad it helped you
Thank you for this tutorial. It’s really clear and helpful 🎉.
A very useful video, thank you
Very good explanation, thanks
Great presentation and slides! I was reading the Spinning Up RL explanation of it and had a hard time wrapping my head around it; this made it way clearer! Thank you!
The slides you link in the description seem to be a little different from the ones used in the video (the linked version has 15 pages)
Thank you very much. You are right about the slides, they correspond to the previous version, I will fix this asap, probably tomorrow.
The slides are now OK
@@OlivierSigaud Thank you for the quick reply! :D
Thank you so much for the detailed explanation
You are welcome ;)
I was struggling to find a neat and concise explanation for the SAC algorithm and your video is spot on. Thank you very much and carry on the good work. I would be interested in what you have to say about the recent development of Black-box algorithms in the field of RL.
Thanks for the nice comment about my videos. Which kind of black-box algorithms do you have in mind?
@@OlivierSigaud Gradient-free methods and genetic algorithms in particular. I hear they are a really strong alternative to gradient-based methods.
@@abdowaraiet2169 I confirm that Evolution Strategies, in particular, are strong alternatives to Deep RL. I hope I will be able to build a set of videos on this topic next year. If you want to know more meanwhile, search for "deep neuroevolution" and "CEM-RL" on the web.
Thank you for summing this all up for us :)
Thanks. Very interesting and useful.
Thanks a lot Olivier, please keep up the great work!
Quick question regarding the final loss of the policy: is the target critic used, or only the local one? The original papers don't clarify whether they use the target or the other one for the policy loss.
Another question, also regarding the policy loss: is the input of the critic s_(t+1) or s_(t)? In the original paper, the critic in the policy loss takes s_(t) as input.
To compute the policy loss, it makes more sense to use the current critic rather than the target critic, as the former is more up-to-date than the latter (the latter is a slow tracker). And you should use s_t rather than s_{t+1}.
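For concreteness, here is a minimal PyTorch-style sketch of that policy update (assumed names, not the exact code from the paper or the video; actor.sample is assumed to return a reparameterized action and its log-probability):

import torch

def policy_loss(actor, critic_1, critic_2, states, alpha):
    # reparameterized sample so gradients flow through the action
    actions, log_probs = actor.sample(states)
    # query the current (non-target) critics at s_t with the fresh action
    q_min = torch.min(critic_1(states, actions), critic_2(states, actions))
    # SAC maximizes Q - alpha * log_pi, so we minimize the opposite
    return (alpha * log_probs - q_min).mean()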
Thank you for the answer, Dr.! @@OlivierSigaud
I don't understand how the entropy on slide 19 can be negative: H = -|A|, with |A| the number of actions, for example 1, so H = -1?? And what about loss(alpha)? It looks like a linear function of alpha, so alpha will always increase, or am I missing something?
\bar{H} in slide 19 is the target entropy. In a continuous action space, your target entropy should be minus the dimensionality of the action space, computed e.g. with: -np.prod(env.action_space.shape).astype(np.float32)
You can have a look at this paper, end of section 3: arxiv.org/pdf/2209.10081.pdf
The loss on the alpha optimizer is entropy_coef_loss = -(log_entropy_coef * (action_logprobs + target_entropy)).mean(), so it is not trivially linear in alpha
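Putting the two points together, a small PyTorch-style sketch of the entropy coefficient update (assumed variable names, env is assumed to be a gym environment, lr is arbitrary):

import numpy as np
import torch

# target entropy = minus the dimensionality of the action space
target_entropy = -np.prod(env.action_space.shape).astype(np.float32)
log_entropy_coef = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_entropy_coef], lr=3e-4)

def update_entropy_coef(action_logprobs):
    # action_logprobs: log pi(a_t|s_t) on the current batch, detached from the actor graph
    entropy_coef_loss = -(log_entropy_coef * (action_logprobs + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    entropy_coef_loss.backward()
    alpha_optimizer.step()
    return log_entropy_coef.exp().item()  # the alpha value used in the other losses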
@@OlivierSigaud Well, I understand, it's like a regulator that increases or decreases alpha depending on the current entropy, but why is the entropy for continuous actions negative, or is it just a name?
@@OlivierSigaud I set something like this for discrete actions: H = tf.reduce_sum(y_pi1*logpi, axis=-1)
and lossa = -self.alphav*(H - tf.math.log(pt)*0.98); let's see what happens
very useful, thanks
Thank you all for the positive feedback about our video. It is very rewarding to read that people find it useful.
Can you explain how to write the action space if the action bounds are not symmetric, i.e. not like [-1, 1]?
This raises no particular issue. You just define your gym Box space with the bounds you have
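For instance (hypothetical bounds, assuming the classic gym API):

import numpy as np
from gym import spaces

# an action space with asymmetric bounds [30, 50]
action_space = spaces.Box(low=30.0, high=50.0, shape=(1,), dtype=np.float32)
print(action_space.sample())  # always lies in [30, 50]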
@@OlivierSigaud What if I have my own environment, and the action bounds are something like 30-50? I think it will raise an issue because the output of the actor network is a tanh output, which is in [-1, 1]. So how will the actor get trained? Please let me know.
@@tanujajoshi1901 There are two options. Either you remove the tanh and just use a Gaussian policy instead of a squashed Gaussian policy. If you do this, you count on the spread of the Gaussian to cover 30-50: it should tune the center of your Gaussian to be ~40 and the spread to represent ~10 on each side. But with low probability it will suggest values beyond 30-50, which you will have to clip. The other option is to keep the tanh, so the output of your actor is in ]-1,1[, and then rescale this output affinely to [30,50] (something like 10x + 40, where x is the tanh output). I think this option is better.
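A toy sketch of the second option, with the 30-50 bounds from your example (hypothetical helper, not library code):

import numpy as np

def rescale_action(x, low=30.0, high=50.0):
    # x is the tanh output of the actor, in ]-1, 1[
    return low + (x + 1.0) * (high - low) / 2.0  # equivalent to 10x + 40 here

print(rescale_action(np.tanh(0.0)))  # 40.0, the middle of the range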
@@OlivierSigaud OK Thanks
I have one more question... Why do we need to apply exp to the 'standard deviation' output of the actor network?
audio SACs
Unfortunately, I have to agree. I will do my best to provide a new version in the coming months.
@@OlivierSigaud Don't worry, it's not so bad: the important thing is the quality of the lesson, and that's already great ;)