Reinforcement Learning from Human Feedback: From Zero to ChatGPT

  • Published: 5 Aug 2024
  • In this talk, we will cover the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT. Most of the talk will be an overview of the interconnected ML models and will cover the basics of Natural Language Processing and RL that one needs to understand how RLHF is used on large language models. It will conclude with open questions in RLHF.
    RLHF Blogpost: huggingface.co/blog/rlhf
    The Deep RL Course: hf.co/deep-rl-course
    Slides from this talk: docs.google.com/presentation/...
    Nathan Twitter: / natolambert
    Thomas Twitter: / thomassimonini
    Nathan Lambert is a Research Scientist at HuggingFace. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and Roberto Calandra at Meta AI Research. He was lucky to intern at Facebook AI and DeepMind during his PhD. Nathan was awarded the UC Berkeley EECS Demetri Angelakos Memorial Achievement Award for Altruism for his efforts to better community norms.
  • Entertainment

Comments • 44

  • @nadinelinterman268
    @nadinelinterman268 a year ago +3

    A wonderfully talented presentation that was easy to listen to. Thank you very much.

  • @1südtiroltechnik
    @1südtiroltechnik a year ago

    Now I get it, thank you for the video!

  • @muskduh
    @muskduh a year ago

    Thanks for the video.

  • @stevenjordan5795
    @stevenjordan5795 a year ago

    Would you recommend RLHF for Thoroughbred handicapping?

  • @teddysalas3590
    @teddysalas3590 a year ago

    I only came to know today that there is a reinforcement learning course on Hugging Face. Will there still be a reinforcement learning course on Hugging Face after September?

  • @akeshagarwal794
    @akeshagarwal794 10 months ago

    Can we fine-tune an encoder-decoder model like T5 with RLHF? If we can, please link some source code.
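
Encoder-decoder models can indeed be tuned with RLHF. Below is a rough sketch using the trl library's AutoModelForSeq2SeqLMWithValueHead and PPOTrainer; the class names and trainer arguments follow older trl releases and may have changed, so treat them as assumptions to check against the current trl documentation. The reward is a placeholder where a trained reward model would normally be queried.

```python
# Rough sketch: PPO-based RLHF on an encoder-decoder (T5) with trl.
# API names follow older trl releases and may differ in current versions.
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForSeq2SeqLMWithValueHead

model_name = "t5-small"  # placeholder checkpoint
config = PPOConfig(model_name=model_name, learning_rate=1e-5,
                   batch_size=1, mini_batch_size=1)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)      # policy + value head
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL term

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = "summarize: RLHF tunes a language model against a learned reward."
query_ids = tokenizer(query, return_tensors="pt").input_ids[0]

# Generate a response with the current policy.
response_ids = ppo_trainer.generate(query_ids, max_new_tokens=32)[0]

# Placeholder reward: in a real setup this scalar comes from a trained
# reward model scoring the (query, response) pair.
reward = torch.tensor(1.0)

# One PPO optimization step on this (query, response, reward) triple.
stats = ppo_trainer.step([query_ids], [response_ids], [reward])
```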

  • @ashishj2358
    @ashishj2358 8 months ago

    Offline RL is quite unstable more often than not. PPO is a simple and excellent algorithm that, if tuned well, achieves really great results. Even after many papers proposing other approaches like DPO, OpenAI has stuck with PPO.
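
For context on the PPO-vs-DPO comparison above: DPO drops the separate reward model and the PPO rollout loop and instead optimizes a classification-style loss directly on preference pairs. A minimal sketch of the DPO loss, with illustrative variable names (each log-probability is summed over the tokens of a response):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed log-probabilities of a response under
    the trainable policy or the frozen reference model. The loss pushes the
    policy to rank the chosen response above the rejected one, measured
    relative to the reference, with no explicit reward model or PPO rollout.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -9.0]), torch.tensor([-14.5, -10.5]))
```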

  • @mike_skinner
    @mike_skinner a year ago +9

    I asked ChatGPT questions about a skill I am an expert in. I am from England, but it said that the companies I worked with were American, and it even named the (false) US states they were in. It came up with names of people I knew, but I think it thought England was a state of the US.

    • @burgermind802
      @burgermind802 a year ago +3

      ChatGPT was probably trained on text disproportionately based in the United States, so it is biased to be American-centric. It doesn't know things, it just guesses semantically, which often, but not always, happens to be true.

    • @Silly.Old.Sisyphus
      @Silly.Old.Sisyphus a year ago

      The reason it talked complete garbage to you is that it doesn't have a clue about what it's saying - let alone what you said - because its "language" model has nothing to do with language, but is effectively probabilistic statistical regurgitation of fragments of things that have been said before (its "training" data).

  • @abhinandanwadhwa2605
    @abhinandanwadhwa2605 a year ago

    How could we add RLHF to our own LLM model?

  • @nlarchive
    @nlarchive a year ago +3

    Great job! We need people to create content until AI does the content XD

  • @serkhetreo2489
    @serkhetreo2489 a year ago

    Please, what is the link for the Discord?

  • @user-ch3gs7el5k
    @user-ch3gs7el5k 8 months ago

    Wonderful presentation. Question: why would the reward function reward gibberish? I thought the reward function was sophisticated enough to only reward human-like speech.
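
One likely answer, hedged since it depends on the exact setup: the reward model is only a learned proxy for human preferences, so a policy optimized hard against it can find gibberish that happens to score well (reward hacking). That is why RLHF pipelines typically optimize the reward-model score minus a KL penalty toward the original model, roughly as in this sketch (the names and the exact shaping scheme shown are illustrative):

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Per-token rewards of the kind optimized in RLHF-style PPO.

    rm_score: scalar score from the learned reward model for the full response.
    policy_logprobs / ref_logprobs: per-token log-probs of the generated tokens
    under the current policy and the frozen reference (e.g. SFT) model.
    The KL-style penalty discourages drifting into text the reference model
    finds extremely unlikely, such as gibberish that happens to fool the RM.
    """
    kl = policy_logprobs - ref_logprobs       # per-token approximate KL contribution
    rewards = -beta * kl                      # penalty applied at every generated token
    rewards[-1] = rewards[-1] + rm_score      # RM score credited at the final token
    return rewards

# Dummy usage for a 4-token response.
r = shaped_rewards(torch.tensor(0.8),
                   policy_logprobs=torch.tensor([-1.2, -0.7, -2.0, -0.5]),
                   ref_logprobs=torch.tensor([-1.0, -0.9, -1.5, -0.6]))
```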

  • @meghanarb7724
    @meghanarb7724 a year ago

    Is ChatGPT safe from hacking or cyber threats?

  • @johntanchongmin
    @johntanchongmin a year ago +11

    I liked the content. I personally do not agree with using RLHF as the main tool for learning because it is just too costly to use human feedback, but perhaps combining this with self-supervised learning could help scale the utility of human annotation to a wider domain.
    I also wonder how robust RLHF is to outliers that are not within the training distribution. Perhaps the key is in more generalizable structural retrieval, i.e. making sure the output is coherent according to a knowledge graph in memory, rather than using human feedback as a reward for the output text.

    • @mohammadaqib4275
      @mohammadaqib4275 11 months ago

      Hey, do you have any hands-on experience with RLHF? I have a few questions to ask.

    • @johntanchongmin
      @johntanchongmin 11 months ago

      @@mohammadaqib4275 Thanks for the question. I personally do not perform my own RLHF - it is too costly. In my opinion, just doing the SFT step may already be enough for most use cases, and I typically just do that (a minimal SFT sketch follows this thread).

    • @mohammadaqib4275
      @mohammadaqib4275 11 months ago

      Any resources in the form of a blog or guided project that you can suggest?

    • @johntanchongmin
      @johntanchongmin 11 months ago

      @@mohammadaqib4275 You can try some of the Hugging Face or Weights & Biases implementations. Maybe take a look at StackLlama, which does RLHF on Llama (the non-2 version).

    • @present-bk2dh
      @present-bk2dh 8 months ago

      @@mohammadaqib4275 Did you find any?
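
Following up on the suggestion in this thread that plain supervised fine-tuning (SFT) on demonstrations covers many use cases, here is a minimal SFT sketch with the Hugging Face transformers Trainer; the model checkpoint, data file, and hyperparameters are placeholders to adapt:

```python
# Minimal SFT sketch: "gpt2" and demonstrations.txt are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One demonstration (prompt followed by the desired response) per line of text.
dataset = load_dataset("text", data_files={"train": "demonstrations.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()
```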

  • @RD-AI-ROBOTICS
    @RD-AI-ROBOTICS 11 months ago

    You mentioned the ChatGPT paper that was going to be released the next day, i.e. 14 Dec 2022, but I have not been able to find it. Can anyone please point me to it? Thanks.

    • @parthshah4339
      @parthshah4339 11 months ago

      He was joking.

    • @FlipTheTables
      @FlipTheTables 10 months ago

      I ran across a video where someone was doing a breakdown of how it was built.

  • @IoannisNousias
    @IoannisNousias 8 months ago

    Given that the reward model is differentiable, why not just backprop through it, in a self-supervised manner rather than RL? Think of the discriminator of a GAN providing gradients to the generator: here the discriminator is the reward model, which can be frozen, and the generator is the LLM. You could still use the KL regularizer to keep it in the desired optimization regime.
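
A hedged sketch of that GAN-style idea, and the usual catch: the reward model scores discrete token ids, and the sampling step that produces them is not differentiable, so gradients only flow if the reward model is fed something "soft". Everything below, including the module names, is hypothetical:

```python
import torch
import torch.nn.functional as F

def soft_reward(policy_logits, rm_embedding, rm_scorer):
    """Hypothetical GAN-style objective: backprop a frozen reward model's score
    into the policy by feeding it expected ("soft") token embeddings instead of
    sampled token ids.

    policy_logits: (seq_len, vocab) logits from the LLM being tuned.
    rm_embedding:  (vocab, d) frozen embedding matrix of the reward model.
    rm_scorer:     frozen module mapping (seq_len, d) embeddings to a scalar score.
    """
    probs = F.softmax(policy_logits, dim=-1)   # differentiable distribution over tokens
    soft_embeds = probs @ rm_embedding         # expected embedding at each position
    return rm_scorer(soft_embeds)              # gradients flow back into policy_logits

# The catch: at generation time the LLM emits discrete token ids, and sampling
# them is non-differentiable; the "soft" inputs above also no longer match what
# the reward model saw during training. RLHF sidesteps both issues by treating
# the RM score as a scalar reward for an RL update (plus the KL regularizer)
# instead of backpropagating through the RM.
```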

  • @KeqiChen-ds2co
    @KeqiChen-ds2co a year ago

    The Discord invitation is invalid now T_T

  • @nitroyetevn
    @nitroyetevn a year ago +3

    I find the constant "If you have any questions, post them in the comments" ticker kind of distracting. Maybe just display it every 2 minutes or every 5 minutes.

  • @st3ppenwolf
    @st3ppenwolf a year ago +1

    I thought I was going to see code. Good talk though

  • @erener7897
    @erener7897 5 months ago +1

    Thank you for the lecture! But I personally found it very hard to follow: the voice is monotone and the material is not catchy or interactive, so it was hard not to fall asleep. Nevertheless, you are doing important work, keep going!

  • @dlal8042
    @dlal8042 9 months ago

    It would be better if you could show a small implementation.

  • @pariveshplayson
    @pariveshplayson 6 months ago

    Theta should be the learnable policy of the language model, not the reward model. By this point, the parameters of the reward function have already been learned.
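
For reference, in the way the RLHF objective is usually written, θ parameterizes the policy (the language model being tuned) while the already-trained reward model is frozen with its own parameters φ:

```latex
% Standard RLHF policy objective: \theta is the policy, \phi the frozen reward model.
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
  \big[\, r_{\phi}(x, y) \,\big]
  \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```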

  • @weiwuWonderfulLife
    @weiwuWonderfulLife a year ago +4

    The "write a comment" banner at the bottom of this video is so annoying.

  • @preston748159263
    @preston748159263 9 months ago

    “People are here to learn about language models.”
    The zeitgeist of language as an insight into human behavior may be functional, but it does not give us a better understanding. Language is too far away from what is going on. Human 1 has to interpret a perception, reason about it, incorporate it into their own mental model, articulate it, and communicate it; human 2 / the AI model then has to perceive it, reason with it, incorporate it, and produce an action. There is too much room for error, and it is a bold strategy because it is so far removed; it is simply convenient because we have a lot of text data. RL is unique, though, because it is based on operant conditioning, so there is some opportunity here that is not being taken advantage of.
    We need to go back to a cognitive perspective on AI; that is the only way to better replicate and model human cognition, and we certainly cannot create robust AI if we do not first understand our underlying cognitive mechanisms.

    • @preston748159263
      @preston748159263 9 months ago

      TL;DR: RLHF seems to be a band-aid on a large wound caused by language-based AI.

  • @indramal
    @indramal a year ago +3

    Oops, I missed it.

  • @antonderoest9462
    @antonderoest9462 a year ago +3

    Please use a better microphone and add some damping to your room.

  • @Shalaginov_com
    @Shalaginov_com a year ago +38

    The first 17 minutes are a waste of time.

  • @dontwannabefound
    @dontwannabefound 9 months ago +1

    Too much hand-waving.