We're in a spot where a serious person can seriously say "it's SIMPLY the model talking to itself until it solves the problem," and we enthusiasts shrug and move along. What a time to be alive.
But there is so much more to problem-solving than recursive iteration, isn't there? Humans solve problems using hypermodalities. Bodily sensations, sounds, smells, the gut microbiome, and emotional states all impact how we think. Then there are the more or less understood “a-ha!” moments or trial-and-error lucky guesses where intuitive judgment makes the call. We also have subconscious processing during sleep tackling the most difficult problems we are stuck on, accompanied by cerebrospinal fluid flushing over our brain tissue. Then there are hungover days when creativity takes the lead for some (e.g., Hemingway). Good luck trying to introduce a central nervous system depressant like alcohol into an LLM and then getting the best out of it, lol. I can only imagine how difficult it is to capture all these nuances in current or future LLM architectures. It almost seems like we need something else to augment LLMs with.
Very interesting summary, thanks a lot. My intuition is that evaluation/testing is where we can grow; that's the low-hanging fruit.
Stream of Search + Let's Verify Step by Step has looked the most likely to me. It might be that they just put their heads down and worked really hard to solve the collapse problems and optimize generalizability.
Regardless, amazing overview, thanks a bunch for sharing
Such a good overview. Thank you for the insights; quite instructive and accessible.
Thank you so much for such an informative video 🙏🙏.
Thanks for creating this video
I find this ridiculous and remarkably improbable. Did you see the missed space in the example CoT from o1? That matches Sam Altman's laid-back writing style; he's clearly writing all the CoT at test time by hand.
This is fantastic work❤!
For search it is important to search over ideas: not letters or tokens or words or sentences or paragraphs, but ideas. So an LLM needs to be able to output a token that says it has finished laying out an idea, and thus a new idea can begin at that point. If an LLM is constantly interrupted at the lower levels, it can never fully finish the idea. That would also help battle the combinatorial explosion that makes search at lower levels intractable. It's like a human chess player who only considers a few candidate moves vs. a brute-force algorithm that considers millions of moves that lead nowhere.
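To make that concrete, here is a minimal toy sketch of idea-level beam search where the model branches only at an end-of-idea boundary. The `<eoi>` token, `generate_idea()`, and `score()` are hypothetical stand-ins for a trained delimiter, an LLM decoding call, and a learned verifier; this is a sketch of the idea, not anyone's actual system.

```python
import heapq
import random

END_OF_IDEA = " <eoi> "  # hypothetical idea-boundary token

def generate_idea(prefix):
    # stand-in: a real system would decode from the LLM until <eoi> is emitted
    return f"idea-{random.randint(0, 99)}"

def score(candidate):
    # stand-in: a real system would use a learned verifier / value model
    return random.random()

def idea_beam_search(problem, beam_width=4, samples_per_idea=8, max_ideas=5):
    beam = [(0.0, problem)]
    for _ in range(max_ideas):
        expansions = []
        for _, prefix in beam:
            for _ in range(samples_per_idea):
                # branch only at idea boundaries, never mid-idea
                cand = prefix + generate_idea(prefix) + END_OF_IDEA
                expansions.append((score(cand), cand))
        # prune at the idea level; this is what keeps the tree tractable,
        # like the chess player who only considers a few candidate moves
        beam = heapq.nlargest(beam_width, expansions, key=lambda e: e[0])
    _, best = max(beam, key=lambda e: e[0])
    return best

print(idea_beam_search("Problem: ... "))
```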
Agreed. Lots of choices though in how to actually build that. Need steps that cause tangible progress.
That is awesome. It saved me lots of time. I am trying to use some of these techniques for the AIMO Kaggle contest. If anyone is interested drop me a message.
Did he mention that they use reasoning tokens?
Oh no I forgot to mention that! In my notation the reasoning token is how you know to move from z to y. It's kind of implied by the color changing from green to red.
Brilliant!
I think saying it doesn't follow from expert examples is a stretch. They could have helped finetune the CoT mechanism by having people write out their thought processes while solving problems, especially for math and coding. Edit: I see it addressed at 20:30.
Yeah I agree that there are expert examples somewhere in the training procedure. Wanted to emphasize that these play less of a role than I would have assumed before diving into this area (if you believe the OAI comments).
@@DistortedV12 I think to achieve scale, the data has to be generated by the model itself via a step-by-step prompt, and the correctness of the solution has to be easily verified. For example, the AIME problems have an integer solution between 0 and 999. One can then use process and advantage rewards on such a dataset.
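As a minimal sketch of that self-generation loop, assuming a hypothetical sample_cot() stand-in for a step-by-step prompt to the model (the answer check can be exact precisely because AIME answers are integers in 0-999):

```python
import re

def sample_cot(problem):
    # hypothetical stand-in for "think step by step" sampling from the model
    return "... reasoning steps ... Final answer: 42"

def extract_answer(cot):
    m = re.search(r"Final answer:\s*(\d{1,3})\b", cot)
    return int(m.group(1)) if m else None

def collect_verified_data(problems, answers, k=16):
    data = []
    for problem, gold in zip(problems, answers):
        for _ in range(k):
            cot = sample_cot(problem)
            # keep only chains whose final answer checks out exactly;
            # these become positives for reward modeling / RL
            if extract_answer(cot) == gold:
                data.append((problem, cot))
    return data
```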
Goat
Thinking LLMs from Meta, LLM-Berry, and the ARC-AGI paper from MIT on test-time training. Can someone, ideally Noam Brown (or an LLM otherwise), comment on how these are related to what is discussed here?
* Thinking LLMs is quite related. It uses an LLM as the verifier (I was emphasizing automatic verifiers in this talk).
* LLM-Berry is an effort to do an MCTS-style search on existing Llama models without learning.
* The ARC-AGI paper that came out today seems really neat! They do SGD at test time, so it's pretty different from these methods, which only do CoT at test time.
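To make that contrast concrete, here is a rough sketch of what test-time training looks like, under stated assumptions: `augment` and `task_loss` are hypothetical stand-ins the caller supplies, and the dict layout of `test_task` is assumed for illustration; this is not the paper's code.

```python
import copy
import torch

def predict_with_ttt(model, test_task, augment, task_loss, steps=10, lr=1e-3):
    """Fine-tune a copy of the model on augmented views of the test task
    itself, then predict: the rough shape of test-time training, in
    contrast to methods that spend test-time compute only on CoT decoding."""
    tuned = copy.deepcopy(model)                      # leave base weights alone
    opt = torch.optim.SGD(tuned.parameters(), lr=lr)
    for _ in range(steps):
        x, y = augment(test_task)                     # e.g., re-permuted demo pairs
        loss = task_loss(tuned(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tuned(test_task["query"])                  # assumed dict layout
```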
@@srush_nlp Thank you so much for responding to my questions! Great talk; I liked how you pointed out the core problem so other researchers can focus their efforts.
Test-time compute capability is still constrained by the data used for the RL training, which is harder to curate. You can give a D student an infinite amount of time on an exam and he is certainly not going to get an A.
Depends entirely on the verifier and the test.
But synthetic data can relax this constraint. Just have increasingly capable models create more synthetic data to allow further reinforcement learning, and so on; the loop I mean is sketched below.
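Concretely, that loop is something like expert iteration / STaR-style training. A sketch with hypothetical `model.sample()` / `model.finetune()` APIs; note that, as the replies point out, everything hinges on the verifier:

```python
def self_improvement_loop(model, problems, verifier, rounds=3, k=16):
    # Sketch of the generate -> verify -> retrain loop described above;
    # model.sample() and model.finetune() are hypothetical APIs.
    for _ in range(rounds):
        verified = []
        for p in problems:
            traces = [model.sample(p) for _ in range(k)]
            # the verifier decides which synthetic traces are
            # trustworthy enough to train on
            verified += [(p, t) for t in traces if verifier(p, t)]
        model = model.finetune(verified)  # RL or SFT on verified traces
    return model
```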
@@haiderameer9473 No it can't, as it's still combinatorics at work; D -> A remains a challenge. No amount of recursive repetition in one domain, over even a seemingly infinite window of time, will make you an expert in another domain that you know little about.
Has to be process reward
Yeah, it definitely seems like that is part of the equation. The question is whether that is everything.