Great vid! Btw: the state-value function V(s) takes only the state as input and averages the expected return over the actions the policy can take. To get the optimal policy out of the state-value function, you have to do a one-step lookahead over V(s_next) and pick the action with the maximum expected return. The action-value function Q(s, a) takes in the state and the action and outputs the expected cumulative reward.
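In case it helps anyone, here is a rough sketch of that difference in Python. The 3-state MDP, the `P` table, `GAMMA`, and all function names are made up for illustration and are not from the video. It evaluates V under a uniform random policy, computes Q from V with a one-step lookahead, and then picks the greedy action. Note that acting greedily on the V of a random policy is only one policy-improvement step; you would need the optimal V for the result to be guaranteed optimal.

```python
# Hypothetical toy MDP: P[state][action] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},  # absorbing state
}


def q_from_v(V, s, a):
    """Action value: takes state AND action, one-step lookahead over V(s_next)."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(policy, sweeps=500):
    """State value: takes only the state, averages Q over the policy's actions."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: sum(policy[s][a] * q_from_v(V, s, a) for a in P[s]) for s in P}
    return V


def greedy_action(V, s):
    """Pick the action whose lookahead value is largest."""
    return max(P[s], key=lambda a: q_from_v(V, s, a))


uniform = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
V = evaluate_policy(uniform)
greedy = {s: greedy_action(V, s) for s in (0, 1)}
print(V, greedy)
```

The point is that V alone does not tell you which action to take; you need the transition model (or Q directly) to turn it into a policy.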
The only video one needs to understand how RLHF actually works. Great demonstration, sir, thanks a lot.
I was hoping you would start with Dear Fellow Scholars hahaha
Great video, thanks for making it!!
You need to zoom in on the sections of the paper so we can see things clearly... Just showing an unreadable font size isn't helping...
Interesting, Thanks
Is the key part of InstructGPT its value policy, which takes the prompt and the answers as input?
For RLHF I think yes!
awesome