RLOO: A Cost-Efficient Optimization for Learning from Human Feedback in LLMs

  • Published: 8 Sep 2024

Comments • 1

  • @BuzzRobot  1 month ago +6

    Timestamps:
    0:00 Introduction
    0:33 Background on RLHF (Reinforcement Learning from Human Feedback)
    5:45 Back to basics: REINFORCE
    8:44 From PPO (Proximal Policy Optimization) to REINFORCE
    23:27 Results of the new optimization method, RLOO
    32:35 Conclusions
    34:05 Q&A