Reinforcement Learning from Human Feedback: From Zero to ChatGPT
- Published: 5 Aug 2024
- In this talk, we cover the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT. Most of the talk is an overview of the interconnected ML models, covering the basics of Natural Language Processing and RL that one needs to understand how RLHF is applied to large language models. It concludes with open questions in RLHF.
RLHF Blogpost: huggingface.co/blog/rlhf
The Deep RL Course: hf.co/deep-rl-course
Slides from this talk: docs.google.com/presentation/...
Nathan Twitter: / natolambert
Thomas Twitter: / thomassimonini
Nathan Lambert is a Research Scientist at HuggingFace. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and by Roberto Calandra at Meta AI Research. He was lucky to intern at Facebook AI and DeepMind during his Ph.D. Nathan was awarded the UC Berkeley EECS Demetri Angelakos Memorial Achievement Award for Altruism for his efforts to better community norms.
A wonderfully delivered presentation that was easy to listen to. Thank you very much.
Now I get it, thank you for the video!
Thanks for the video.
Would you recommend RLHF for Thoroughbred handicapping?
I only found out today that there is a reinforcement learning course on Hugging Face. Will the reinforcement learning course still be available after September?
Can we fine-tune an encoder-decoder model like T5 with RLHF? If we can, please link some source code.
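For reference, the TRL library does support encoder-decoder models through a value-head wrapper. Below is a minimal sketch of a single PPO step on T5, written against the older TRL 0.x interface (argument names have shifted across TRL versions, so treat this as a starting point and check the current docs); the constant reward is a stand-in for a trained reward model's score.

```python
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForSeq2SeqLMWithValueHead

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Policy (with a value head added on top of T5) and a frozen reference copy.
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

query = tokenizer("summarize: RLHF aligns models with human preferences.",
                  return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, max_new_tokens=20)[0]

# Stand-in scalar reward; in practice this comes from a trained reward model.
reward = torch.tensor(1.0)
stats = ppo_trainer.step([query], [response], [reward])
```

A worked end-to-end example of this pattern (on a decoder-only model) is the StackLlama writeup mentioned further down in the comments.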
Offline RL is quite unstable more often than not. PPO is a simple and excellent algorithm that, when tuned well, achieves really great results. Even after many papers proposing other approaches like DPO, OpenAI has stuck with PPO.
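For readers who haven't seen DPO: it replaces the reward model plus PPO loop with a single classification-style loss on preference pairs. A minimal sketch of that loss, following Rafailov et al. (2023), with variable names chosen here for clarity:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of an entire response
    under the trainable policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the preferred response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The appeal is that no separate reward model or on-policy sampling is needed; the trade-off debated above is whether this matches well-tuned PPO in quality.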
I asked ChatGPT questions about a skill I am an expert in. I am from England, but it said that the companies I worked with were American, and it even named the (nonexistent) US states they were in. It came up with names of people I knew, but I think it assumed England was a state of the US.
Probably ChatGPT was trained on text disproportionately based in the United States, so it is biased toward being American-centric. It doesn't know things; it just guesses semantically, which often, but not always, also happens to be true.
The reason it talked complete garbage to you is that it doesn't have a clue about what it's saying, let alone what you said: its "language" model has nothing to do with language, but is effectively probabilistic statistical regurgitation of fragments of things that have been said before (its "training" data).
How could we add RLHF to our own LLM?
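At a high level, per the talk: start from your supervised fine-tuned model, train a reward model on human preference pairs, then optimize the policy with PPO against that reward model. The distinctive RLHF ingredient is the reward model, so here is a minimal sketch of pairwise preference training (the backbone choice and class names here are illustrative, not a prescribed recipe):

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    """A language-model backbone with a scalar head that scores a response."""
    def __init__(self, backbone="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # One scalar reward per sequence, read off the first token's state.
        return self.score(hidden[:, 0]).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry pairwise loss: the human-preferred response should
    # score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

Each training example is a (prompt + chosen response, prompt + rejected response) pair; the loss only cares about the score difference, which is why a scalar head suffices.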
great job! we need people to create content until AI does the content XD
Please, what is the link for the Discord?
Wonderful presentation. Question: why would the reward function reward gibberish? I thought the reward function was sophisticated enough to only reward human-like speech.
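The short answer from the talk (and the linked RLHF blog post): the reward model is only a learned proxy for human preference, so the policy can drift into out-of-distribution text that scores highly without being human-like, a failure mode often called reward hacking or over-optimization. The standard mitigation is to subtract a KL penalty against the frozen initial model. A minimal sketch of that reward shaping for a single sequence (variable names are mine):

```python
def kl_shaped_reward(reward_model_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Per-token reward used in RLHF-style PPO.

    policy_logprobs / ref_logprobs: 1-D tensors of per-token log-probs of the
    generated tokens under the policy and the frozen reference model.
    """
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token        # penalize drifting from the reference
    rewards[-1] += reward_model_score     # scalar preference score at the end
    return rewards
```

Gibberish may fool the reward model, but it is extremely unlikely under the reference model, so the KL term makes it costly.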
Is ChatGPT safe from hacking and cyber threats?
I liked the content. I personally do not agree with using RLHF as the main tool for learning, because it is just too costly to use human feedback, but perhaps combining it with self-supervised learning could help scale the utility of human annotation to a wider domain.
I also wonder how robust RLHF is to outliers that are not within the training distribution. Perhaps the key is more generalizable structural retrieval, i.e. making sure the output is coherent according to a knowledge graph in memory, rather than using human feedback as a reward for the output text.
Hey, do you have any hands-on RLHF material? I have a few questions to ask.
@@mohammadaqib4275 Thanks for the question. I personally do not perform my own RLHF; it is too costly. In my opinion, just doing the SFT step may already be enough for most use cases, and I typically just do that.
Any resources in the form of a blog or guided project that you can suggest?
@@mohammadaqib4275 You can try some of the Hugging Face or Weights & Biases implementations. Maybe take a look at StackLlama, which does RLHF on Llama (the non-2 version).
@@mohammadaqib4275 did you find any?
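For anyone following this thread, the SFT-only approach suggested above is the easiest place to start. A minimal sketch using TRL's SFTTrainer, mirroring its quickstart (the imdb dataset and opt-350m checkpoint are just stand-ins, and argument names have moved between TRL versions, so check the current docs):

```python
from datasets import load_dataset
from trl import SFTTrainer

# Stand-in dataset; replace with your own instruction/demonstration data.
dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    model="facebook/opt-350m",    # any causal LM checkpoint
    train_dataset=dataset,
    dataset_text_field="text",    # column holding the raw training text
    max_seq_length=512,
)
trainer.train()
```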
You mentioned the ChatGPT paper that was going to be released tomorrow, i.e. 14 Dec '22, but I have not been able to find it. Can anyone please point me to it? Thanks.
he was joking
I ran across a video where someone was doing a breakdown of how it was built.
Given that the reward model is differentiable, why not just backprop through it in a self-supervised manner rather than using RL? Think of the discriminator of a GAN providing gradients to the generator: here the discriminator is the reward model, which can be frozen, and the generator is the LLM. Still use the KL regularizer to keep it in the desired optimization regime.
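One missing step in this analogy: sampling discrete tokens breaks the gradient path from the reward model back to the generator, which is a large part of why RL is used at all. A continuous relaxation such as Gumbel-softmax is one workaround. Here is a speculative sketch of the proposed GAN-style update under that relaxation; all module names (`generator`, `reward_model`, `embedding`, `ref_logprobs`) are hypothetical placeholders, not an established API:

```python
import torch.nn.functional as F

def gan_style_step(generator, reward_model, embedding, input_ids,
                   ref_logprobs, beta=0.1, tau=1.0):
    # generator: maps input ids to vocabulary logits, shape (batch, seq, vocab)
    # reward_model: frozen; accepts soft input embeddings, returns (batch,) scores
    # embedding: the frozen reward model's input embedding layer
    # ref_logprobs: full-vocabulary log-probs from a frozen reference LM
    logits = generator(input_ids)
    # Relax the discrete sample so gradients can flow through it, then feed
    # *soft* token embeddings into the frozen reward model.
    soft_tokens = F.gumbel_softmax(logits, tau=tau)    # (batch, seq, vocab)
    soft_embeds = soft_tokens @ embedding.weight       # (batch, seq, hidden)
    score = reward_model(inputs_embeds=soft_embeds)

    # KL regularizer against the frozen reference, as the comment suggests.
    logprobs = F.log_softmax(logits, dim=-1)
    kl = (logprobs.exp() * (logprobs - ref_logprobs)).sum(-1).mean()

    loss = -score.mean() + beta * kl
    loss.backward()
    return loss.item()
```

A practical worry with this scheme is that soft embedding mixtures are off-distribution for a reward model trained only on real token sequences, which may make it easy to exploit.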
The Discord invitation is invalid now T_T
I find the constant "If you have any questions, post them in the comments" ticker kind of distracting. Maybe just display it every 2 minutes or every 5 minutes.
I thought I was going to see code. Good talk though
Thank you for the lecture! But I personally found it very hard to follow. The voice is so monotonic and the material is not catchy or interactive; it was hard not to fall asleep. Nevertheless, you are doing important work, keep going!
It would be better if you could show a small implementation.
Theta should be the learnable policy of the language model, not the reward model. By this point, the parameters of the reward function have already been learned.
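For clarity, the RL-stage objective the comment is referring to is usually written as follows (the symbols here are the standard ones from the RLHF literature, chosen for this note): only the policy parameters \(\theta\) are optimized, while the reward model \(r_\phi\) and the reference model \(\pi_{\mathrm{ref}}\) are frozen.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right)
```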
The "write a comment" banner at the bottom of this video is so annoying.
“People are here to learn about language models.”
The zeitgeist of language as insight into human behavior may be functional, but it does not provide us with a better understanding. Language is too far away from what's going on. Human 1 has to interpret perception, reason, incorporate it into their own mental model, articulate it, and communicate it; human 2 or the AI model has to perceive it, reason with it, incorporate it, and then produce an action. There is too much room for error, and it's a bold strategy because it's so far removed. It's simply convenient because we have a lot of text data. RL is unique, though, because it is based on operant conditioning, so there is some opportunity here that is not being taken advantage of.
We need to go back to a cognitive perspective on AI; that's the only way we can better replicate and model human cognition, and we certainly can't create robust AI if we don't first understand our underlying cognitive mechanisms.
TL;DR: RLHF seems to be a band-aid on a large wound caused by language-based AI.
Oops, I missed it
you can still watch it lol
@@doulaishamrashikhasan8425 But then I can't discuss it with others live, like in the online session.
Please use a better microphone and add some damping to your room.
The first 17 minutes are a waste of time
Thank you bro
Thanks
Thanks for saving time
Too much hand waving