Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)

  • Published: 29 Sep 2024

Comments • 88

  • @marshallmcluhan33
    @marshallmcluhan33 5 months ago +67

    I'd give this paper an A+ but it's too Meta.

    • @Jvo_Rien
      @Jvo_Rien 5 months ago +1

      they use the wildcard * instead of +

    • @marshallmcluhan33
      @marshallmcluhan33 5 months ago

      @@Jvo_Rien Oh I could give it a gold star ⭐

    • @snippletrap
      @snippletrap 5 months ago +4

      FAIR

    • @marshallmcluhan33
      @marshallmcluhan33 5 months ago

      @@snippletrap Project Aria sees what you did there.

    • @moisestorresgarcia8179
      @moisestorresgarcia8179 5 months ago +1

      What a metaverse you laid for us right there 😂

  • @rogerc7960
    @rogerc7960 5 months ago +9

    I'll have to think about it a little bit more.

  • @chrisspencer6502
    @chrisspencer6502 5 months ago

    Please do a video on Lucene vector search

  • @DeepThinker193
    @DeepThinker193 5 months ago +2

    If I ever wanted an LLM to come even close to reasoning like a human being, I'd do it like this. Basically, I would treat it like the mind of a child.
    1) I'd place the model in a robotic body with all the sensory mechanisms for sight, smell, touch, taste and hearing.
    2) I'd give it a constant thought process, just as humans constantly think and reinforce.
    3) I'd then train that model in person as if it were a child, by talking to it and reasoning with it daily, using demonstrations, until it gains more and more sense of this physical world and where everything in it stands. Rinse and repeat until it can reason better than me.
    Mark my words, the way they're going about training the models by stuffing endless data down their throats will never achieve AGI. It's so inefficient, in fact, that they're now looking at friggen nuclear power stations to power the models, which I find ridiculous but entertaining nonetheless. We as human beings don't need that much power for our reasoning processes.

    • @drdca8263
      @drdca8263 5 months ago +2

      How much data do you imagine the system you describe would involve?
      Like, how many bytes per second of sensory input are you thinking?
      You seem to contrast this with the large quantities of data used to train big models today, but it isn’t clear to me that this would involve less training data.
      (Also, the human brain seems to have lots of information already built in, where it isn’t clear how that information would get into the model in the setup you describe)

    • @DeepThinker193
      @DeepThinker193 5 months ago

      @@drdca8263 What I'd imagine is that the robot model would have to be plugged in/tethered to a larger mainframe computer at all times, because battery tech simply isn't good enough yet, and also I have no idea how much compute they can fit in the body after adding all the motors for movement etc.
      The idea is for the robot model to focus on 'efficiency' in logic, reasoning and the physics of our reality, rather than cramming in more and more data and hoping for the best.
      Think about it. There's a reason why smart animals have to gradually learn about their environment to survive. It's why we are always constantly thinking. One can also think of the mind as our virtual environment and the body as our physical reality. Both our virtual and physical models are intertwined; we simply cannot reason well without one or the other.
      Nature uses this method because slow, gradual steps are more efficient and effective than taking in all the data at once. I would argue that, similar to our brains which, as you say, "already have some data built in," LLMs also already have data built in. It's just a matter of focusing on their efficiency in logic and reasoning.

  • @yichunchen4370
    @yichunchen4370 5 months ago +25

    The key thing in A star is the heuristic: it cannot overestimate the cost to the goal (otherwise you risk missing the optimal plan), but it should be as large as possible at the same time, to fail the other nodes in the frontier fast and get a shorter search trace. This is hard, since finding a good heuristic is always a non-trivial manual task in A star.
    This model, I think, is just using the encoder to better understand the problem context, and the trace to understand the search strategy, and it implicitly generates a very complex heuristic within it which outperforms the hand-crafted deterministic heuristic. That modeled heuristic does not have to be A star, since it is not chained to the A star restrictions, but can be better (shorter). A star is just its teacher. It can use some of its self-generated data to "eat its own p..." because in this scenario it is very easy to judge which is a good job and which is not, so it only "eats the good p..."
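    For reference, a minimal sketch of plain A* on a 4-connected grid with the admissible Manhattan heuristic (my own illustration, not code from the paper); the expansion count plays the role of the "search trace" length the paper tries to shorten:

    import heapq

    def manhattan(a, b):
        # Admissible on a 4-connected grid: it never overestimates the true cost-to-go.
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def astar(start, goal, walls, width, height):
        # Returns (optimal_path, expansions); `expansions` is the search-trace length.
        frontier = [(manhattan(start, goal), 0, start, [start])]   # (f, g, node, path)
        best_g = {start: 0}
        expansions = 0
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if g > best_g.get(node, float("inf")):
                continue                                           # stale heap entry
            expansions += 1
            if node == goal:
                return path, expansions
            x, y = node
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in walls:
                    continue
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + manhattan(nxt, goal), ng, nxt, path + [nxt]))
        return None, expansions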

    • @clray123
      @clray123 5 months ago +6

      Or maybe it just got lucky because of the stochasticity, and the lucky guesses got cherry-picked by the authors of the paper in desperation to make their experiments look worthwhile and publication-worthy.

    • @andreaterlizzi
      @andreaterlizzi 4 months ago

      @@clray123 Surely the results are a bit inflated, and I doubt the method would really scale to indefinitely large problem instances (they go OOD). Still, the results are interesting, and show that transformers can learn good heuristics for search problems if given nice inductive biases (yayyy, inductive biases work, what a surprise). I guess a better approach would be to just train the thing to estimate a good heuristic, which is what NNs are very good at doing, while the A* algorithm acts as a complex "token decoding" strategy. That would single-handedly guarantee that plans are optimal under unrestrictive assumptions (since you are still executing A*), and on average grant better (shorter) plans since the NN is learning the heuristic. But I guess this wouldn't suggest that LLMs will eventually be able to plan (which I think they won't), but rather that actual symbolic reasoning and algorithmic reasoning are needed.
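      A rough sketch of that "NN estimates h, A* decodes" idea (my own illustration; `net` and `encode` are hypothetical, a trained cost-to-go regressor plus a problem-specific featurizer, not anything from the paper):

      import torch
      import torch.nn as nn

      class LearnedHeuristic:
          # Wraps a small regression net as the h(.) of a textbook A*.
          # A* only stays provably optimal if the net never overestimates the true cost-to-go.
          def __init__(self, net: nn.Module, encode, goal):
              self.net = net.eval()
              self.encode = encode      # encode(node, goal) -> feature tensor
              self.goal = goal
              self.cache = {}

          def __call__(self, node):
              if node not in self.cache:
                  with torch.no_grad():
                      # clamp at zero: a negative "cost-to-go" estimate is never useful
                      self.cache[node] = max(0.0, float(self.net(self.encode(node, self.goal))))
              return self.cache[node]

      Plug an instance in wherever a grid A* would call its heuristic, e.g. h = LearnedHeuristic(net, encode, goal) and then h(node) in place of manhattan(node, goal).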

  • @rockapedra1130
    @rockapedra1130 5 months ago +22

    Yay! Paper reviews! Love these best. Yannic is a superb teacher!

  • @ericadar
    @ericadar 5 months ago +14

    @YannicKilcher can you review DeepMind's mixture-of-depths paper (arXiv 2404.02258)? Do you think it's a substantive improvement in compute allocation by token?

    • @MaJetiGizzle
      @MaJetiGizzle 5 months ago

      It’s done by leveraging routers/more intelligent routing, but yes. It is a substantial improvement in compute usage and allocation.

  • @clray123
    @clray123 5 months ago +3

    " optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps"
    Well, since A* optimally solves previously unseen Sokoban puzzles 100% of the time, it sounds like their Searchformer found a broken version of the algorithm which generates shorter search dynamics but, because of that, does not work as it should 6.3% of the time. Maybe they should put that version of the achievement in the abstract...

  • @unclecode
    @unclecode 5 months ago +6

    Meta-cognition, or "thinking about thinking," has become a subject of AI research, which I find quite impressive and always wonder when this happens. Remarkably, within the same transformer architecture and the current large language model (LLM) paradigm, this is now achievable. I believe that if we can represent any problem through a sequential linguistic presentation (or find an encoder/decoder), then there's likely a transformer that can be trained to find perhaps an optimal solution. Sounds like a new definition of "computability"!! An interesting avenue to explore would be applying this to an NP or NP-hard problem, like the Traveling Salesman Problem, to see if a transformer model could uncover an optimal solution. If I were in my twenties, this might have been a paper I'd work on :D

    • @jimwagner7817
      @jimwagner7817 5 months ago

      Absolutely, I love this design space. Do you have some papers you can recommend?

    • @antonioILbig
      @antonioILbig 5 months ago +1

      That would be interesting! Heuristics serve exactly this purpose: to make something computationally infeasible feasible, though only within some specific domain/problem distribution.
      So applying LLMs to this purpose could lead to good heuristics that are useful in the downstream task without handcrafting them.

    • @unclecode
      @unclecode 5 months ago

      @@antonioILbig

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 29 days ago +1

    This paper should be called "how to implement A* in the most expensive way possible". I would be surprised if a transformer network (which is, remember people, just a bunch of stacked NNs with backprop goodness) couldn't learn it. Transformers should be able, if massaged and trained correctly, to do basically any type of ML curve fitting we've discovered, if you can shove it into their context window (which is just a big vector that gets shoved through a bunch of NNs). Learning A*, a very short algo, plus some data processing, seems very reasonable.

  • @agenticmark
    @agenticmark 5 months ago +10

    It's so awesome when we get synergies like this. I've been writing game AIs for years; using transformers instead of A* pathfinding is pretty amazing.
    It's crazy how everything in the ML stack seems to generalize to things it was not designed to work on!

  • @ericadar
    @ericadar 5 months ago +4

    What if you could show that the trace lengths produced by the dynamic bootstrap are often shorter than the minimum trace length produced by A*? Would that be convincing evidence that they produced a novel heuristic that is more efficient than A* at finding optimal plans, or would you still think it's just solving the subtask of more efficient deterministic path search?

  • @PaganPegasus
    @PaganPegasus 5 months ago

    I'd argue that learning how to "nondeterministically tie-break" makes the algorithm no longer the same as A*, because it essentially introduces a second heuristic into A* for tie-breaking, which is no longer vanilla A*.

  • @Dan-hw9iu
    @Dan-hw9iu 5 months ago +6

    Here's a potential response to address your criticism: Consider the A* algorithm, and let B = "Least trace result of A* wrapped in a for loop." Algorithms A* and B are quite distinct, by definition. Ditto for datasets produced by them, in general. Algorithm B is guaranteed to produce results compatible with A*. But that's not relevant. Especially when training on B produces substantially better models than A*. Different algorithms, dataset properties, and model performance -- both definitionally and empirically. I think the authors' conclusion that their best model learned _something_ outside of A* is pretty reasonable and supported. Great A/B tests! ;-)
    Please let me know if I'm wrong here. I only watched your (excellent) video, Yannic. I didn't read the full paper. Cheers, and thanks for all that you do.

  • @antonioILbig
    @antonioILbig 5 months ago +1

    It'd be really interesting to see an AI-powered A* algorithm in the sense that its internal heuristics are substituted by an AI model of some kind, for instance replacing the tie-breaking heuristic with a model. In this way we could really see how much A* can be improved by AI, without doubts about whether it's still A* 😅
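    Something like this, as a sketch (my own illustration, not the paper's method): keep f = g + h as the primary key so the search is still provably A*, and let a model (here a generic `tie_score` callable, purely hypothetical) only order nodes that have equal f:

    import heapq
    import itertools

    def astar_with_learned_tiebreak(start, goal, neighbors, h, tie_score):
        # Plain A*, except nodes with equal f = g + h are ordered by tie_score(node, goal),
        # e.g. a small learned model. Because f stays the primary key, the returned plan
        # is still optimal whenever h is admissible; only the trace length can change.
        counter = itertools.count()        # last-resort key so heap entries always compare
        frontier = [(h(start, goal), tie_score(start, goal), next(counter), 0, start)]
        best_g, parent = {start: 0}, {start: None}
        expansions = 0
        while frontier:
            f, _, _, g, node = heapq.heappop(frontier)
            if g > best_g.get(node, float("inf")):
                continue                   # stale heap entry
            expansions += 1
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1], expansions
            for nxt, step_cost in neighbors(node):
                ng = g + step_cost
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt], parent[nxt] = ng, node
                    heapq.heappush(frontier,
                                   (ng + h(nxt, goal), tie_score(nxt, goal), next(counter), ng, nxt))
        return None, expansions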

  • @antonioILbig
    @antonioILbig 5 months ago

    What happens if an LLM is trained to solve a problem for which an algorithmic solution doesn't exist, like the halting problem?

  • @clray123
    @clray123 5 months ago +1

    I would only be somewhat impressed if the training done on mazes somehow automagically generalized into performance on Sokoban or vice versa. I don't recall them even attempting to demonstrate such a thing. Other than that, it just seems like another "overfit the dataset, report success, get published" paper.

  • @dansplain2393
    @dansplain2393 5 months ago

    I’d like to see the learned model compared to a simple improvement on the A* heuristic that incorporates the world, such as one that tie-breaks using the number of walls in the possible Manhattan paths to the destination. Why walk through a forest if there’s another route through an open field?
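    One concrete (hypothetical) reading of that tie-break, not anything from the paper: count the walls inside the bounding box spanned by the node and the goal, as a rough proxy for how obstructed the Manhattan paths are, and use it only as a secondary key (e.g. tie_score=lambda n, g: wall_density_tiebreak(n, g, walls) in the sketch a few comments up), so f = g + h stays primary and optimality is untouched:

    def wall_density_tiebreak(node, goal, walls):
        # Secondary key only: number of walls in the axis-aligned box between node and goal.
        x0, x1 = sorted((node[0], goal[0]))
        y0, y1 = sorted((node[1], goal[1]))
        return sum(1 for wx, wy in walls if x0 <= wx <= x1 and y0 <= wy <= y1)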

  • @clray123
    @clray123 5 months ago +4

    LLMs don't run algorithms, they imitate algorithms' outputs by approximate copying. Poorly. One could also say LLMs are not reasoning. Or not intelligent. But hey, that could ruin your AI investment.

    • @maheshprabhu
      @maheshprabhu 5 months ago +2

      Why do you say that? We have no idea how to define intelligence. If anything, it might just be statistical extrapolation just like how LLMs operate.

    • @clray123
      @clray123 5 months ago

      @@maheshprabhu A human does not need to read a trillion tokens in order to learn language. A human is also capable of learning new concepts, planning, strategizing, optimizing, doing logic and arithmetic, and avoiding contradictions, all while using just 20 Watts of power, and a lot of it self-taught.
      Therefore I claim the LLMs are not intelligent... they are fakes. And that our brains do not run transformers.
      Funnily enough, when computers were first introduced, people were claiming that a brain was "just like a computer," and when mechanical automata were introduced our body was "just a sort of biological machine." Well, we should have realized by now that our brains are "a bit" more capable than that.

    • @clray123
      @clray123 5 months ago +2

      What we have with LLMs right now is exactly what Searle described in his "Chinese room" thought experiment. We have a laboriously trained algorithm with a huge database which reproduces the inputs and outputs of (any) language. But there is plenty of evidence, in the form of the errors that it makes, that it does not in fact internally understand any of the language which it processes, and that it is able to reproduce traces of reasoning but does not reason.

    • @antonioILbig
      @antonioILbig 5 months ago

      @@maheshprabhu That could be the case! We actually don't know.
      For this same reason I wouldn't say that these things are intelligent, because that would mean we know what intelligence is. So the best I can say is: these things are not intelligent.
      I'm not sure of the validity of this reasoning, but this is how I feel about the "intelligence" discussion.
      If something looked really intelligent and we had absolutely no idea of its functioning, I would say it is intelligent.

    • @clray123
      @clray123 5 months ago

      @@antonioILbig Actually, we do know it. We do know that intelligent beings, namely us humans, are capable of running algorithms of arbitrary time/space complexity, given enough resources. We are Turing-complete, LLMs are not.
      We know the algorithmic complexity of generating the next token by an LLM. It is mathematically impossible for an LLM to "learn" any algorithm which has a higher complexity than that required for the next token generation.
      What is possible is for the LLM to memorize the lookup function for mapping the algorithm's inputs to outputs, for a certain fixed-size range of inputs. And that is what the LLMs of today are doing. It is an insult to actual intelligence to call that sort of mapping intelligent.
      Show me an LLM which, after training on sequences of digits of the number PI, has obtained the knowledge of how to generate the n-th digit of PI, for any n > length of training dataset. But wait, the "smartest" LLMs out there even have a problem with counting the number of digits in their input. Try it yourself. There are rather simple limitations in these algorithms which are quite obviously not present in the "algorithms" that execute in our heads.

  • @andikacandra4996
    @andikacandra4996 5 months ago

    I think it's good to train the positional embedding

  • @Mehrdadkh87
    @Mehrdadkh87 5 months ago

    Isn’t it related to Q*?

  • @VeganeDeDroite
    @VeganeDeDroite 5 months ago

    For equal performance, how many training tokens do we need with the "solution only" version compared to the "teaching" version?

  • @tiagotiagot
    @tiagotiagot 5 months ago

    Sounds like the model might have learned a tie-breaker algorithm that statistically beats A* with naive stochastic tie-breaking? Am I understanding this right?
    Are we approaching the P=NP threshold?

  • @tiagotiagot
    @tiagotiagot 5 months ago

    Even with forking and backtracking, A* is still somewhat in the land of linear thinking. When are we gonna see something like this bootstrapped from something more abstract, like a freeform graph variant of the Wavefunction Collapse algorithm?

  • @sagetmaster4
    @sagetmaster4 5 months ago

    I wonder if we'll end up stitching together many extremely purpose-built, very small and efficient LLMs that are called upon for very specific functions. Less mixture-of-experts and more very specialized "dumb" organs plus one or more larger, more generalized executive models.

  • @Gauloi007
    @Gauloi007 5 months ago

    The argument about it just predicting the next token doesn't make much sense; it's more about whether we are just memorizing all the possible answers (which is probably impossible) or creating a denser and denser model of reality...

  • @GNARGNARHEAD
    @GNARGNARHEAD 5 months ago

    Good explanation. I can't help but want both known and unknown obstructions in the training data; that way, what is optimal changes based on forethought.

  • @chanel454879876354
    @chanel454879876354 5 months ago

    Interesting concept. It could be used as a tool by an agent. If the agent can generate the environment, then it can train the tool subnet and utilize this trained tool to solve the corresponding tasks.

  • @OperationDarkside
    @OperationDarkside 5 months ago

    Would be cool if this could be generalized and applied to real world scenarios

  • @jamescamacho3403
    @jamescamacho3403 5 months ago

    It'd be interesting to apply this to proof solving.

  • @clray123
    @clray123 5 months ago

    The "key conclusion" that it is easier to "teach" the LLM to find the correct solution if you provide it a trajectory toward it is not "really interesting", it is trivial. You are artificially constraining the search space, guiding the LLM to generate the correct next token conditioned on the predecesssors. In an extreme case, you are spoon-feeding it the entire solution and only asking for the final step. It's kind of also showing how little success we can expect from the unsupervised "data mining" based training alone, and it explains why LLMs got so much more "clever" when trained on more reasonable datasets (e.g. training with textbooks rather than Internet crap and what not).
    While it is a plausible and rather easy to see result, it is not very optimistic for those who imagine they can solve all the world's future problems by brute-forcing them with tons of data and GPUs (well, they sorta can, but it gets really expensive really quick, already for toy problems... which again is an already known characteristic of any brute-force search algorithm... well, duh).

  • @EinsteinNewtonify
    @EinsteinNewtonify 5 months ago

    🎉🎉🎉🎉🎉

  • @LukePluto
    @LukePluto 5 months ago +3

    great review! really interesting paper, hopefully some of the misleading claims get cleaned up during review. surprised there was no discussion of Q*

    • @maloxi1472
      @maloxi1472 5 months ago +2

      Unsurprising. Hype isn't research.

  • @chipotle2417
    @chipotle2417 5 months ago +1

    Exactly what Y. LeCun said on the Lex Fridman podcast some months ago. It's undoubtedly a further step toward AGI.

  • @alexkubiesa9073
    @alexkubiesa9073 5 months ago

    I feel like a simpler way to ensure short execution traces in the SearchFormer model would be, when generating the training data, to just run A* search a few times with random tie-breaking and pick the example with the shortest trace. No need to train one SearchFormer model to generate data to train another model.
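    A sketch of that baseline (my own illustration, reusing the `manhattan` and `astar_with_learned_tiebreak` helpers from the sketches further up; `grid_neighbors` is a hypothetical helper returning in-bounds, non-wall neighbours as (cell, step_cost) pairs):

    import random

    def shortest_trace_example(start, goal, walls, width, height, tries=8, seed=0):
        # Run A* several times with random tie-breaking among equal-f nodes and keep
        # the run with the fewest expansions, i.e. the shortest search trace.
        rng = random.Random(seed)
        best = None
        for _ in range(tries):
            plan, expansions = astar_with_learned_tiebreak(
                start, goal,
                neighbors=lambda n: grid_neighbors(n, walls, width, height),
                h=manhattan,
                tie_score=lambda n, g: rng.random(),   # random order among ties
            )
            if plan is not None and (best is None or expansions < best[1]):
                best = (plan, expansions)
        return best    # (optimal plan, shortest trace found) or None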

    • @tst3rt
      @tst3rt 5 months ago

      A* selects the path with the minimum heuristic value, which is not necessarily the shortest one; see Russell and Norvig for an example

    • @alexkubiesa9073
      @alexkubiesa9073 5 months ago

      @@tst3rt According to the video, as long as the heuristic is admissible then the path output by A* is optimal.
      But regardless, whatever criterion we apply to filter the data generated by the first SearchFormer, we could have applied to the process of generating the original training data.

  • @hasko_not_the_pirate
    @hasko_not_the_pirate 5 months ago

    It’s like, who cares about math, let’s intuit our way through it. It’s valid research, but ultimately it’s bad engineering.

    • @jks234
      @jks234 5 months ago +3

      ?
      That’s all of deep learning.
      Neural networks are just set up to “potentially contain all branches needed to act properly in all situations”.
      This approach is also not “learning on purpose”. What AI is teaching us in general, I think, is that explainability is a luxury, and often we cannot use it to make the most complex and useful solutions.

    • @hasko_not_the_pirate
      @hasko_not_the_pirate 5 months ago

      Clearly, deep learning and LLMs have brought capabilities we haven’t been able to implement in other ways, so that’s good.
      This paper is valid research in the sense that it’s interesting to explore what is and isn’t possible.
      It is “bad engineering” though in the sense that instead of teaching the model to use an algorithm that’s known to be optimal, they taught it to emulate it in a very inefficient way.

    • @TheRohr
      @TheRohr 5 months ago +1

      @@hasko_not_the_pirate Nah, it's not about efficiency here, but about the question of whether a neural model can learn capabilities that are encoded in human-designed algorithms. Still, of course, it makes sense to use the algorithm directly in production. ^^

  • @JorgetePanete
    @JorgetePanete 5 months ago

    I call overfitting

    • @brendawilliams8062
      @brendawilliams8062 16 days ago

      How? By not shortening a string? It appears it’s viewed as forever an infinity wrestler

  • @lucasbeyer2985
    @lucasbeyer2985 5 months ago +1

    Ahhh please don’t start to do vocal fry, leave that to the valley girls!

  • @otrqffaimajg
    @otrqffaimajg 5 months ago +4

    Please stop burning my eyes

  • @sean_vikoren
    @sean_vikoren 5 months ago +2

    The statement about beating A* caught my attention, since A* is one of the few algorithms that we have an optimality proof for.
    It seemed a bit confusing, in the mathematically-not-possible sense.

    • @JorgetePanete
      @JorgetePanete 5 months ago

      I think the part of replacing traces might make it overfit

    • @clray123
      @clray123 5 months ago +2

      They are not claiming that the generated plan is "more optimal" than A* in terms of number of steps to execute by following the plan, just that the same optimal plan was generated using fewer planning steps.

    • @sean_vikoren
      @sean_vikoren 5 months ago

      @@JorgetePanete The whole paper is self-falsifying.
      But, if you can't read and understand the proof for A*'s optimality, you should probably not talk about it at all.
      That 'paper' is so sloppy, it's not even wrong.
      It actually makes me wonder if we should go back to beating children with sticks for being idiots.

    • @sean_vikoren
      @sean_vikoren 5 months ago

      @@clray123 Read it again. That is exactly what they are claiming. They are grade zero idiots.

    • @sean_vikoren
      @sean_vikoren 5 months ago

      @@clray123 They are moving their unqualified lips.

  • @hll97fr16
    @hll97fr16 5 months ago

    Isn't this result kind of related to what we have noticed with Tree of Thought or Chain of Thought: that when we ask an LLM to give us a plan of its response, we obtain a better final response in the end? 😊

  • @simonkotchou9644
    @simonkotchou9644 5 months ago

    Always hype when a new meta paper comes out

    • @wurstelei1356
      @wurstelei1356 5 months ago

      Taking 100k training steps instead of 1M is pretty decent, even though I didn't get everything.

  • @444haluk
    @444haluk 5 months ago +1

    What a stupid idea😂😂😂