Great video. Deffo think model size is what is hampering meaningful results.
Yeah, I’m going to try ablating with bigger models. This is a key question.
Waiting for this!
Very good content, as usual.
Many thanks, Loki.
Top-notch content. Thank you.
Cheers, many thanks
Very good explanation of current RL approaches, a SOTA video on the topic. Ideas for improvement: you could do some kind of presentation giving more insight into RL and fine-tuning. I remember you did something like that in the past, but maybe an updated version covering the DeepSeek approaches. Perhaps the stages of how to build a model and then fine-tune it so it is able to reason, in a no-code presentation format.
Good points, yeah, I'll aim to do that in the follow-on.
@TrelisResearch There was some news that Berkeley researchers recreated the "aha moment" for $30; you could also do a video on that (just sharing video ideas, not demanding them lol).
Your in-depth content surpasses that of many other YouTubers. A tutorial demonstrating computer-use model training using reinforcement learning and a simulated UI would be highly valuable. Would a GRPO approach be suitable for this image-inclusive data? Finally, to enhance the reasoning process, could we incorporate a "tool_call" tag enabling LLMs to utilize tools during reasoning, rather than solely in the answer phase?
That’s a cool idea and I’ll add it to my list of potential ideas.
Yeah, you can add tools. It does make evaluation a bit harder because there can now be stochasticity in the tool, but broadly it's a good idea.
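For illustration, here is a minimal sketch of how tool calls during reasoning could work at inference time. The <tool_call>/<tool_result> tag format, the tool registry, and the `generate(text, stop)` interface are all assumptions made for this example, not a fixed standard or any particular library's API.

```python
import json

# Hypothetical tool registry; the calculator is just for illustration.
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # never eval untrusted input in practice
}

def run_reasoning_with_tools(generate, prompt, max_tool_calls=5):
    """Sketch of an inference loop that lets the model call tools mid-reasoning.

    `generate(text, stop)` is an assumed interface: it continues generation from
    `text` and halts when the stop string is produced (stop string assumed
    excluded from the returned completion).
    """
    text = prompt
    for _ in range(max_tool_calls):
        completion = generate(text, stop="</tool_call>")
        text += completion
        if "<tool_call>" not in completion:
            return text  # model finished its reasoning without (further) tool use
        # Parse the JSON payload of the last tool call and execute the tool.
        payload = completion.rsplit("<tool_call>", 1)[-1]
        call = json.loads(payload)
        result = TOOLS[call["name"]](call["arguments"])
        # Feed the result back so the model can keep reasoning with it.
        text += f"</tool_call><tool_result>{result}</tool_result>"
    return text
```

Because the tool output (and any randomness inside it) gets appended to the trace, two rollouts of the same prompt can diverge, which is the evaluation stochasticity mentioned above.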
Nice, man, much needed after DeepSeek. I'm going to watch it and do the hands-on work. Hey, do you have any jobs for an AI engineer, or maybe someone in your network does? Please let me know, I want to do remote work.
Check trelis.com for developer collaborations, which are the pathway to joining the Trelis team.
Can you please make a video on applying RL to vision LLMs?
Good idea. Will add it to my list.
Could ORPO’s balance of cross-entropy and odds ratios make it a more stable alternative to PPO-based RLHF? Also, does the beta parameter generalize across models, or does it require fine-tuning?
I'll talk more about this in the next video, but the cross-entropy term serves a role similar to the KL divergence in GRPO or PPO (it keeps the model grounded towards the original weights).
Beta does not generalise all that well in my experience and needs tuning, somewhere between 0.2 and 0.5.
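For reference, a minimal sketch of the ORPO objective from the ORPO paper (Hong et al., 2024): the SFT cross-entropy term is what anchors the model (the role the KL term plays in PPO/GRPO), and beta is the weight on the odds-ratio term that needs per-model tuning as noted above. The function signature and the use of length-normalised (mean per-token) log-probs are assumptions for illustration, not a specific library's implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, beta=0.2):
    """Sketch of the ORPO objective: L = L_SFT + beta * L_OR.

    chosen_logps / rejected_logps: length-normalised (mean per-token) log-probs
    of the chosen and rejected responses under the policy, shape (batch,).
    """
    # Standard SFT loss: negative log-likelihood of the chosen response.
    sft_loss = -chosen_logps

    # log odds(y|x) = log p - log(1 - p), computed stably from log-probs.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio term: pushes the odds of the chosen response above the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    return (sft_loss + beta * or_loss).mean()
```

Raising beta pushes harder on the preference margin relative to staying close to the supervised behaviour, which is why values that work on one model can destabilise another.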