Reinforcement learning 10 DeepSeekR1 = CoT + RL(GRPO)

  • Published: Feb 7, 2025

Comments • 2

  • @TheTruthOfAI 8 days ago

    Finally, a video that walks through the GRPO notation and decomposes it properly, unlike 99.9% of the other videos that talk about DeepSeek-R1. This is the one that truly highlights the reward/policy forward.

  • @vietchuxuan8789 10 days ago

    Thank you, these are some good notes.