Strawberry

  • Published: 30 Jan 2025

Comments • 36

  • @user-pt1kj5uw3b 4 months ago +6

    Finally someone who knows what they're talking about. It's been frustrating seeing people act like they have any idea what's going on behind the scenes with minimal domain knowledge, who have never read the publicly available research. Not to mention that I don't think we've even scratched the surface of quality chain of thought.

  • @aazzrwadrf 4 months ago +3

    Thanks for doing these! Really helps me get a grip on what's going on as someone without a rigorous background in ML.

  • @JamesRBentley 4 months ago

    God tier breakdown - I drew a lot of very similar conclusions - the only bit I missed when first experimenting with o1 was the synthetic training data angle of it.

  • @markozege 4 months ago

    Great coverage, thank you for doing these HuPo!

  • @synchro-dentally1965 4 months ago +3

    Summary: 1:47:36

    • @wolpumba4099 4 months ago

      *OpenAI's Strawberry Model: Combining Pre-Training, RL, and Inference-Time Compute*
      * *1:47:32* *Mixed Reception:* OpenAI's Strawberry model has received mixed reactions, with some praising its innovation while others consider it merely an enhanced version of existing techniques like Chain of Thought.
      * *1:48:27* *Combining Paradigms:* The model's potential strength lies in its combination of Reinforcement Learning (RL) and large-scale transformer pre-training, enabling it to leverage both extensive knowledge and refined reasoning capabilities.
      * *1:48:59* *RL Post-Training:* Strawberry utilizes RL to improve its reasoning through a "self-play" process, particularly in domains like math and coding where answers can be objectively verified (a toy sketch of this loop follows after the summary).
      * *1:49:45* *Inference-Time Compute:* The model incorporates inference-time compute, allowing it to generate internal reasoning steps without explicit prompting, potentially rendering techniques like "think step by step" obsolete.
      * *1:50:27* *Production-Grade RL:* Karpathy (former Tesla AI Director) described Strawberry as a production-grade implementation of RL on an LLM, showcasing promising benchmark performance.
      * *1:50:38* *Small Model Efficiency:* The model's reported small size further amplifies its impressive performance, suggesting high efficiency.
      * *1:51:11* *Pathway to Superintelligence:* The integration of RL opens up possibilities for continuous improvement through self-play, potentially accelerating progress towards more advanced AI systems.
      * *1:52:16* *Hiding Chain of Thought (Theory):* One theory suggests OpenAI might be intentionally concealing the model's internal reasoning (Chain of Thought) to prevent others from easily replicating its capabilities and potentially building powerful open-source alternatives.
      * *1:52:53* *Google's Parallel Efforts:* It's highly likely that Google is pursuing similar research directions, given the known potential of combining RL with LLMs.
      * *1:54:01* *Reward Limitations:* A key challenge is defining appropriate rewards for RL in areas where there isn't a single correct answer, such as philosophical arguments.
      I used gemini-1.5-pro-exp-0801 on rocketrecap dot com to summarize the transcript.
      Cost (if I didn't use the free tier): $0.05
      Input tokens: 12895
      Output tokens: 461
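      To make the RL-post-training / verifiable-rewards point above concrete, here is a minimal toy sketch. Everything in it (the arithmetic problems, the ANSWER: format, the rejection-sampling flavour of the loop) is my own illustrative assumption, not OpenAI's actual pipeline; a real system would feed the verified traces (or their advantages) into a PPO/REINFORCE-style update rather than just collecting them.

```python
# Toy sketch of "RL on verifiable answers": sample several reasoning traces per
# problem, score each trace 1/0 with an exact-match verifier, and keep only the
# rewarded traces as data for the next policy update.

import random
import re

def verifier(problem: dict, completion: str) -> float:
    """Reward is objective: 1.0 iff the stated final answer matches the known one."""
    match = re.search(r"ANSWER:\s*(-?\d+)", completion)
    return 1.0 if match and int(match.group(1)) == problem["answer"] else 0.0

def sample_completion(problem: dict) -> str:
    """Stand-in for sampling from a policy; a real model would generate a chain of thought."""
    a, b = problem["a"], problem["b"]
    guess = a + b + random.choice([0, 0, 0, 1, -1])  # imperfect "reasoning"
    return f"Let me add {a} and {b} step by step. ANSWER: {guess}"

def rl_round(problems, samples_per_problem=8):
    """One round: sample, verify, keep rewarded traces for the next policy update."""
    kept = []
    for p in problems:
        for _ in range(samples_per_problem):
            completion = sample_completion(p)
            if verifier(p, completion) > 0.0:
                kept.append((p, completion))
    return kept  # in a real setup these would feed a PPO/REINFORCE step

problems = [{"a": random.randint(1, 50), "b": random.randint(1, 50)} for _ in range(20)]
for p in problems:
    p["answer"] = p["a"] + p["b"]

verified = rl_round(problems)
print(f"kept {len(verified)} verified traces out of {8 * len(problems)} samples")
```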

  • @sterlingcrispin 4 months ago

    They may not have invented all the techniques here, but publishing a paper and shipping a system at scale are, like you said, very different things, and the latter deserves its own praise and hype. It’s a huge accomplishment.

  • @Nio-t8o 4 months ago +8

    Bro, you explain all these “complex” concepts in such an easy way, I’m so glad I discovered your channel. Keep on going!

  • @user-pt1kj5uw3b 4 months ago +1

    Ok, further in and I think you've nailed it. I was watching an interview with Eric Steinberger a couple of days ago, who studied closely under Noam Brown, one of the primary researchers behind this project, and he confirmed in a roundabout way a lot of what you are saying. The video is "Founder Eric Steinberger on Magic’s Counterintuitive Approach to Pursuing AGI." If you can read between the lines of what he says in the interview, there is a lot of info about what might be happening in the private conversations of some of the top researchers.

  • @artoemkozlov 4 months ago

    Amazing! And thank you for sharing your own takes. I think they're quite reasonable.

  • @srimallya 4 months ago

    Let’s reframe the premise.
    The context is the self.
    We learn the value of the self with a value network.
    The token is a discretisation of the self.
    The policy learns self-transitions as a function of time.
    MCTS does self-play within the bounds of exploration vs. cost, based on value.
    PPO/REINFORCE compresses the search into the policy.
    The self is a function of the environment.
    Therefore we map self-transitions as a proxy for the dynamics of the environment.
    Now, there is another DQN that guides MCTS for turn-by-turn self-play.
    Then we can feed the gradient back to the policy for generalisation.
    At generation time, we can run a fixed number of turn-by-turn searches to expand the original context, which gives back the final answer.
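    A rough, self-contained toy of that loop (the digit-sum task, node structure and rollout value are my own stand-ins, not any known implementation): MCTS expands self-transitions (token choices), a value estimate bounds exploration vs. cost, and the visit counts the search produces are what a PPO/REINFORCE step would then compress back into the policy (that last step is only noted in a comment, not implemented).

```python
# Toy MCTS over token choices, guided by a rollout "value estimate".

import math
import random

TOKENS = list(range(10))   # toy vocabulary: digits 0-9
TARGET, LENGTH = 21, 4     # toy "task": pick 4 tokens whose sum is 21

class Node:
    def __init__(self, state):
        self.state = state          # tuple of tokens chosen so far (the "self" trajectory)
        self.children = {}          # token -> Node
        self.visits = 0
        self.value_sum = 0.0

def value_estimate(state):
    """Stand-in for a learned value network: a random rollout to a terminal state."""
    s = list(state)
    while len(s) < LENGTH:
        s.append(random.choice(TOKENS))
    return 1.0 if sum(s) == TARGET else 0.0

def ucb(parent, child, c=1.4):
    """Exploration vs. exploitation trade-off on each self-transition."""
    if child.visits == 0:
        return float("inf")
    return child.value_sum / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root, simulations=2000):
    for _ in range(simulations):
        node, path = root, [root]
        # selection / expansion down to a leaf
        while len(node.state) < LENGTH:
            untried = [t for t in TOKENS if t not in node.children]
            if untried:
                t = random.choice(untried)
                node.children[t] = Node(node.state + (t,))
                node = node.children[t]
                path.append(node)
                break
            parent = node
            node = max(node.children.values(), key=lambda ch: ucb(parent, ch))
            path.append(node)
        reward = value_estimate(node.state)
        for n in path:              # backpropagate the value signal
            n.visits += 1
            n.value_sum += reward
    # Follow the most-visited children for the final answer; this visit distribution is
    # what a PPO/REINFORCE step would "compress" back into the policy.
    best, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
        best.append(node.state[-1])
    return best

seq = mcts(Node(()))
print("searched sequence:", seq, "sum:", sum(seq), "target:", TARGET)
```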

  • @user-pt1kj5uw3b 4 months ago

    Agreed about them taking credit for techniques that have been around for years. It would be nice if "open" ai cited their sources...

  • @_XoR_ 4 months ago

    They did exactly what I wanted to experiment with. I doubt they also use GFlowNets in the bigger model to optimize the chain of thought as a whole, not just one step at a time.

  • @tomharmon2000 4 months ago

    🧠🧠🧠

  • @abdulbasitbello2381 4 months ago

    I love the fact that you are patient with the questions in the chat. You're awesome for taking your time and explaining these. I'm a data engineer getting into AI engineering, and you're making the papers and lingo less intimidating.🎉 Thanks 🙏🏽

  • @zzzzzzz8473 4 months ago +3

    Incredible overview as always, connecting the related papers and concepts together.
    47:35 Excellent point that RLHF is a misleading name, because it lacks everything that RL is good at: it's purely human feedback that doesn't correlate with real improvements. I think even the LLM arena suffers from this, where people select which output *LOOKS* more intelligent rather than which output *IS* more accurate, pushing the models toward being "confident marketing salesmen" rather than the "scientist" we actually want.
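    A toy illustration of that gap (entirely my own construction, not from the video): a verifiable reward checks the final answer itself, while a preference-style signal only records which of two outputs a rater picked, so a rater biased toward confident wording rewards exactly the "confident salesman" behaviour described above.

```python
# Toy contrast between a verifiable reward and a (deliberately biased) preference signal.

def verifiable_reward(output: str, correct_answer: str) -> float:
    """1.0 iff the output ends with the known correct answer."""
    return 1.0 if output.strip().endswith(correct_answer) else 0.0

def preference_pick(output_a: str, output_b: str) -> str:
    """A stand-in human rater that prefers confident-sounding text, ignoring correctness."""
    confident_words = ("definitely", "clearly", "obviously")
    score = lambda s: sum(w in s.lower() for w in confident_words)
    return "A" if score(output_a) >= score(output_b) else "B"

a = "It is definitely and obviously 5"   # confident but wrong
b = "Working through it carefully: 4"    # hedged but correct

print("verifier rewards:", verifiable_reward(a, "4"), verifiable_reward(b, "4"))  # 0.0 1.0
print("preference rater picks:", preference_pick(a, b))                          # A
```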

  • @JungHeeyun-t3x 4 months ago +1

    23:52 "grrrr"

  • @michelians1148 4 months ago +4

    Don't waste your time waiting for the content, he just repeats "testing on youtube" for almost two hours 😮‍💨

  • @ScottzPlaylists 4 months ago +1

    👍I think you've figured out how it works❗
    Has anyone put together an open version of this process and training yet❓ Probably not, but is there at least a good start❓
    What's the closest thing to it that's already done❓

  • @clearmindstudios 4 months ago

    U rock dude ❤🎉 dropping insights at a followable pace

  • @deter3 4 months ago

    You're a super genius!!

  • @ArmaanSultaan 4 months ago +1

    Loved this video. Absolute best analysis of this new LLM; it easily made sense to me. I have one question though. You said that when we scale this RL-based LLM system, it might also improve in other fields that are less reasoning-oriented and more creative (like literature). What exactly does that look like, and how can that happen? Thanks in advance, and thanks for this video.

    • @ArmaanSultaan 4 months ago

      Never mind, you explain this later in the video.

  • @fromscratch4109 4 months ago

    Thank you, I was a little iffy at first, but with the question and answer section I learned so much more😊

  • @Dht1kna 4 months ago

    I assumed o1-mini is 4o-mini, which I assume is around ~30B.

  • @hanyanglee9018 4 months ago

    If CoT is the solution, why not use some form of diffusion directly? Is it that difficult to implement diffusion for text?

  • @therobotocracy 4 months ago

    What do you work on as a day job? You seem like a game programmer, or someone who works with graphics?

  • @JungHeeyun-t3x 4 months ago

    I think the inference model and the wrap-up model are two different models; the tone is different.

  • @太郎田中-i2b 4 months ago

    Is it safe to assume that optimizing to solve math problems using tokens is harder than Go due to the indirect nature?
    Apologies if it's already mentioned. I've only watched half the stream.

  • @Elikatie25 4 months ago

    1:42 Horn

  • @vishalrajput9856 4 months ago

    The ARC-AGI results are out; it doesn't perform better than GPT-4o.

  • @TheKumarAshwin 4 months ago

    Ok

  • @TheGuillotineKing 4 months ago

    This is just LangChain but 1000x more expensive

  • @xx1slimeball 4 months ago

    gr8 stream !! ☜(゚ヮ゚☜)

  • @ktb1381 4 months ago

    Aren't you being a little hard on them, talking about sinister secrecy and so forth? Aren't they just trying to actually make money? Right now I believe they're losing money just in general operation. They've got to figure out how to make some money, otherwise they won't exist much longer.