I just binged this playlist at 1 am. Absolutely worth it. You deserve more views.
agreed
PLEASE COME BACK!! You are an amazing teacher!
All of your videos are amazing, please upload more
Welcome back!
Hope to see more of these videos.
Please come back, your videos are great!
Amazing content! Please keep them coming!
Joel, excellent explanation and talk! Thank you!
Super helpful - thank you for this series!
Helped me a lot, can't wait to see more
🎯 Key Takeaways for quick navigation:
00:00 🤖 Reinforcement learning improves large language models like ChatGPT.
00:25 🃏 Large language models face issues like bias, errors, and quality.
01:11 📊 Training data quality impacts results; removing bad jokes might help.
01:55 🧩 Training on both good and bad jokes improves language models.
02:38 🔄 Language models are policies, reinforcement learning uses policy gradient.
03:08 🎯 Reinforcement Learning from Human Feedback (RLHF) faces challenges in data acquisition.
03:35 🤔 RLHF theory: Language model might already know jokes' boundary.
04:18 🏆 A reward network is trained to predict human ratings for the model's output.
04:47 🔄 Reward network is a modified language model for predicting ratings.
05:14 📝 Approach: Humans write text, train reward network, refine model with RL.
05:57 ⚖️ Systems convert comparisons to ratings for reward network training.
06:11 😄 RLHF successfully improves language models, including humor.
Made with HARPA AI
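For anyone curious about the comparison-to-ratings step at 05:57: one common way to turn pairwise human preferences into scalar rewards is the Bradley-Terry model, where the probability that one output beats another is a sigmoid of their reward difference. Here's a minimal toy sketch in plain Python (the item names and data are made up for illustration; a real reward network would be a neural net scoring text, not a lookup table):

```python
import math

# Hypothetical toy data: each pair means "humans preferred the first joke".
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]

rewards = {"A": 0.0, "B": 0.0, "C": 0.0}  # learned scalar ratings
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(200):
    for winner, loser in comparisons:
        # P(winner beats loser) under the Bradley-Terry model
        p = sigmoid(rewards[winner] - rewards[loser])
        # Gradient of -log(p) pushes the winner's reward up, the loser's down
        g = 1.0 - p
        rewards[winner] += lr * g
        rewards[loser] -= lr * g

# The learned rewards now rank the items consistently with the comparisons
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)
```

The point is that humans never assign absolute scores; the consistent ratings fall out of the pairwise wins and losses, and those ratings are what the reward network is then trained to predict.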
ok everything makes sense now, thx
Great content!!
Good teaching.
You are the Best
come back :(
How long does it take to train a reward network? And how reliable would it be?
Who is this guy? He made all the complexity so simple with his words. Does anyone know this gentleman's name?