Timestamps:
0:00 Introduction
0:33 Background on RLHF (Reinforcement Learning from Human Feedback)
5:45 Back to basics: REINFORCE
8:44 From PPO (Proximal Policy Optimization) to REINFORCE
23:27 Results of the new optimization method, RLOO
32:35 Conclusions
34:05 Q&A