This is at least 1 step short of the real big deal, which is to use this not to simply update your AI parameters like this, but to enable/disable other networks. For example, imagine if ChatGPT could just ignore all the Japanese translation parts for this reply… this would mean the core could be much smaller and focused on high-level knowledge that triggers/enables submodels as needed - a much more efficient approach to the whole problem, with the extra bonus that now you can also start reusing and parallelizing the development of each "module."
It’s hard to call it AGI when the entire premise of this technique requires fine tuning on similar tasks beforehand. They’re just priming it to solve similar tasks.
this is still too specific and unlike how actual reasoning works - it only works because you get to see examples first. true reasoning can explore uncharted territory, and idk if we can make benchmarks for that
Instead of "just erases all of that" before going to the next problem, should it send the data somewhere because maybe it's more useful than other synthetic data, or maybe send just the problem and the solution? Only when it doesn't break privacy etc. Just my amateur thoughts, likely way-off.
Nah, it's right on point. We can store the LoRA as a document connected to the problem it solves, and next time the LLM recognizes the same problem it will pull the LoRA and use it. That's gonna be the next thing added to knowledge graphs. You'll see.
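Rough sketch of what that adapter cache could look like (my own guess, not anything from the paper; run_ttt and apply_adapter are hypothetical stand-ins for the actual fine-tuning/loading code):

```python
import hashlib
import json
import os

ADAPTER_DIR = "adapter_cache"  # hypothetical on-disk store for saved LoRA adapters

def task_signature(examples):
    """Hash the task's demonstration pairs so the same puzzle maps to the same key."""
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def get_adapter(model, examples, run_ttt, apply_adapter):
    """Reuse a cached adapter if this task was seen before; otherwise run TTT and save it.

    run_ttt(model, examples, out_dir) and apply_adapter(model, adapter_dir) are
    stand-ins for whatever training/loading machinery is actually used.
    """
    path = os.path.join(ADAPTER_DIR, task_signature(examples))
    if not os.path.isdir(path):
        run_ttt(model, examples, out_dir=path)   # expensive: test-time training writes the adapter
    return apply_adapter(model, path)            # cheap: next time, just load the saved patch
```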
The 'average' human score on ARC is widely overblown. They even claim 'kids' do well. Dig into the actual tests, and the very limited pool they use: it's scary how small the pool is, and it's not the test suite used for AI... I do these tests well, and I was shocked at the claims of even 60%... 60% of Mechanical Turk users is not 60% of the general public, nor 60% of kids. I'm convinced the ARC human score is skewed too high (and thus harder to achieve and pay out)
Achieving AGI... it will run on a datacenter, so who will decide if it needs to be powered down? Would that be like mansla....er? What will it do to block its powerdown? Would it blackmail you or switch off traffic lights and end you?
1. How does the LLM decide that it needs to perform test-time training? i.e. how does it know it does not have the capacity to respond accurately to the new task? 2. The model generates its own training data: there is no guarantee that the generated training data has no errors (bad labels)
In this case, the leave-one-out version is sound, it will not produce an inaccurate label, because it is using the original label. The other data augmentation strategies may or may not be valid for a given task. They appear to be based on assuming certain symmetries about the problem. Which, the validity kind of depends on what class of tasks you want it to be able to recognize… But, because the entire collection of inputs and outputs are transformed in the same way, I think it should generally be fine? If e.g. you flip everything in some task, this corresponds to a different task which has everything flipped, and, as long as there is enough in the examples to know to be able to distinguish between the task and the flipped task, I think this should be fine?
This is interesting. So we could also take a small model that's solid in logic and common sense but doesn't have a lot of specific knowledge about a domain, give it some articles, books or what have you to "fine tune" on in real time, and let the model generate more data from those materials (it's like "reflect on the knowledge you just learned"), and it can give better answers than, say, the current Perplexity approach, which is purely context-based? Essentially this is System 2 partially happening in model weights vs. purely in context.
I think what should be worked on first is getting a LLM to understand the "environment", to build a physical model of the whole rather than just looking at the parts or correlations between those parts.
Great video on 'Test Time Training'! 🤖✨ I appreciate how you simplified the complex concepts, making them accessible to a wider audience. However, I think it would have been helpful to dive a bit deeper into the potential limitations or challenges of this approach. For instance, how might 'Test Time Training' affect the overall efficiency of AI models in real-world applications? It’s exciting to see AI evolving, but understanding the drawbacks is just as important. Looking forward to your next video and more discussions on this topic! 🚀
If you want a rubric for AGI: when an AI can solve the bicycle stability conundrum, it will have discovered something that humans have perplexingly been unable to solve for themselves. Unlike the Collatz conjecture or other mathematical problems, this is physical and should have a solution, or at least a cogent explanation.
12:55 They already beat the average human score, doesn't this count as AGI according to this test? Or do they need to beat the best human score? It would be ASI at that point.
One of the biggest problems with AI is it doesn't learn from its mistakes. Generally, the most effective way for humans to learn is trial and error. And the error doesn't come from internal chain of thought but rather the outside world demonstrating our error. AI doesn't do this. I've noticed that in "discussions" with AI, I can point out a flaw and it will sometimes integrate that into a new solution. Test time training seems to somewhat solve this problem, but it's not integrated. The problem is that it doesn't actually "learn" from this mistake. It does not integrate this lesson into its "intelligence". It merely uses it momentarily and then forgets it. If AI has already combed all known knowledge from the internet, the next source is obviously from both real time information and from information it gains from interacting with the world. I don't believe that we'll have true AGI until it can integrate lessons it learns from interaction with an external source into its structure.
I can see how this approach might be applied to creating a super coder LLM. Get an AI to generate billions of programs, some will compile and produce an output, some will generate compile or runtime errors. Use these as synthetic data to fine tune an LLM aimed at leet coding and solving new unseen software problems.
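Something like this toy labeling loop, maybe (just my own sketch of the idea; in practice the candidate programs would come from a model and you'd sandbox the execution):

```python
import traceback

def label_program(source: str) -> dict:
    """Run a candidate program and record the outcome as a synthetic training label."""
    record = {"source": source, "status": "ok", "error": ""}
    try:
        code = compile(source, "<candidate>", "exec")   # catches syntax errors
        exec(code, {"__name__": "__candidate__"})       # catches runtime errors (no sandboxing here!)
    except Exception:
        record["status"] = "error"
        record["error"] = traceback.format_exc(limit=1)
    return record

# Example: these (source, outcome) pairs would become the fine-tuning data.
samples = [
    "print(sum(range(10)))",        # runs fine -> positive example
    "print(1 / 0)",                 # compiles but raises -> runtime-error example
    "def broken(:\n    pass",       # doesn't even compile -> syntax-error example
]
dataset = [label_program(s) for s in samples]
for row in dataset:
    print(row["status"], "-", row["source"].splitlines()[0])
```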
Okay, so now AI is able to learn to solve a new problem. This was one missing feature for AGI no one quite knew how to tackle, but here is the breakthrough. Now, how can I do the same stuff at home?
Mon: AI is doomed, Tues: AGI is here! .... My god.
AI news is basically crypto trading
AGI isn't here, just hype for investors.
yeah all these youtubers are just spitting out every piece of news they see 🤢
The only thing scarier than AI companies are AI YouTubers
2010s "bitcoin to moon"
2020s "AI to moon"
The concept in "Mastering the AI Money Game" book completely explains market. Trend is changing with AI
thanks for sharing that man
Inward
If the entirety of the internet is not enough data to improve your model, then maybe it's an architecture problem. And this looks to be a novel approach to try to overcome that, awesome!
The aim here is to create an AGI.
Using more data during training does not move a model towards AGI at all. It increases the knowledge base and knowledge is not intelligence, it's just information. Put another way, you can no more create intelligence using data stored in binary code than you can by carving 1s and 0s onto rocks. Intelligence is the ability to understand, not recall stored information, and this requires the ability to form useful models and make predictions based on those models.
It is an architecture problem. Does your brain need the entire internet to train? Nope. It's more capable and super energy efficient. They just need to research a new layout
"MAYBE" 😂
Exactly. It's funny how no one talks about architecture. Only scaling, energy and data
It's a bit more complicated. Your brain is sort of pretrained on billions of years of labeled data (reproduced or not). It's only finetuned on your experiences in your life. And even those are very abundant and diverse. And you got billions of extremely well labeled datapoints by watching people's emotions as you grew up.
Let's not pretend humans magically get this intelligent. A lot of us don't even get intelligent at all. And if we lived in isolation our whole life we would not be nearly as smart as even GPT-3, no matter how much internet we read.
I don't think we really have an equivalent until we create a model with online learning vs. pre-training only. It needs to be able to continue to update the network with every interaction.
It feels like this is a step in that direction… right?
Chill out, that's how you get skynet, hell naah
@adolphgracius9996 😅
@@mikeschwarz4588 id agree with that
connecting AGI to the internet
what could go wrong
This relies on a set of transformation rules that preserve the underlying concept. Coming up with such a set is far from trivial. They tricked the benchmark with a hand-crafted set, but this does not scale.
I was wondering how they "generated" similar test-set examples without knowing exactly how each one is solved. Seems like overfitting.
This is true, and it feels like cheating. At the same time the idea of fine-tuning the model on the fly based on the inputs sounds very promising and I bet someone will come up with a more generalized way of doing that.
@@MaciejTrebacz If the state was held, maybe? If it's not catastrophic to the rest of the model's abilities. Still not understanding what the use of it is if the weights are reset after inference.
I think the reason why you have to open source it is so they can see if you cheat. BTW OpenAI o1 and Claude shouldn't be on the list for that reason.
@@GrindThisGame Oh, they know. For example, they know how you can rotate it. Embarrassing hacking, 0 value for this work. This benchmark does not work, although the idea is nice. The last winner was some stupid search and heuristics if I recall correctly, zero "intelligence". This idea clearly works only for toy problems.
While I'm excited about the idea of TTT, the whole idea of the benchmark is that it's trying to test the ability to generalize - specifically the ability to solve a problem you haven't seen before and therefore don't know the domain of. What this method is doing is generating new test data by applying geometric transformations - but that's because the writers of this paper knew that those puzzles remain solvable under those geometric transformations! They're thereby leveraging domain knowledge to give the LLM an edge. I'm not saying I don't think it's a valuable approach - just that this paper more exposes a weakness in the assumptions of the benchmark rather than definitively demonstrating that this approach leads to generalization.
Thanks for explaining this. I was indeed wondering how the model can produce correct synthetic data if it doesn't know the solution to the question already.
here's a conversation i just had with chatgpt:
i have a simple problem here: peter has 8 apples. sue has twice as many apples as peter. how many apples does sue have? i want you to list all the things that could be changed in this problem such that it remains the same problem.
chatgpt:
Sure, let's explore the variables that can be modified while keeping the core structure of the problem intact:
Names of the People: Change "Peter" and "Sue" to any other names.
Example: John has 8 apples. Mary has twice as many apples as John.
Number of Apples: Change the number of apples Peter has, and adjust Sue's apples accordingly.
Example: Peter has 10 apples. Sue has twice as many apples as Peter.
Item: Change the apples to any other item.
Example: Peter has 8 oranges. Sue has twice as many oranges as Peter.
Ratio: Change the ratio from twice to another multiplicative relationship.
Example: Peter has 8 apples. Sue has three times as many apples as Peter.
Context: Change the context or setting in which the problem takes place.
Example: Peter found 8 seashells. Sue found twice as many seashells as Peter.
Time Element: Add or change the time element without altering the fundamental relationship.
Example: Yesterday, Peter collected 8 apples. Today, Sue collected twice as many apples as Peter.
By adjusting these elements, the basic structure of the problem-one person having a certain number of items and another person having a multiple of that number-remains unchanged.
Anything else you'd like to explore?
@@HansKonrad-ln1cg Just because I don't understand the patterns I see doesn't mean I can solve something I never solved, even with similar examples. Sooner or later I will spot the pattern.
The benchmark doesn't assume anything; it has no weakness. If you come up with anything that solves the private, closed set, you are done. There are no other rules. You can put in as much domain knowledge as you think you might have, no problem. Your solution doesn't even have to be intelligent or AI or smart or anything. It just turns out that it is not so easy.
Yeah, they used specific sets of geometric transforms to game the system as described in the appendix, hardly a "general intelligence" approach it seems:
Table 3: We provide the augmentations used in our TTT procedure with their function signature and description.
Rotate(90): Rotates a grid 90 degrees.
Rotate(270): Rotates a grid -90 degrees.
Rotate(180): Rotates a grid 180 degrees.
Flip(0): Flips a grid horizontally.
Flip(1): Flips a grid vertically.
Reflect(0, reverse=True): Flips a grid horizontally and prepends it to the left of the original grid.
Reflect(1, reverse=True): Flips a grid vertically and prepends it above the original grid.
Reflect(0, reverse=False): Flips a grid horizontally and appends it to the right of the original grid.
Reflect(1, reverse=False): Flips a grid vertically and appends it to the left of the original grid.
RandomTranslateXY(): Shifts a grid randomly in both horizontal and vertical directions. The maximum shift size is 4.
Transpose(): Reflects a grid on the diagonal.
IncreaseResolution(2): Upscales the grid by interleaving elements in both horizontal and vertical directions.
IncreaseHeight(2): Upscales the grid by interleaving elements in the vertical direction.
IncreaseWidth(2): Upscales the grid by interleaving elements in the horizontal direction.
Chain([Rotate(90), IncreaseResolution(2)]): Sequential application of Rotate(90) and IncreaseResolution(2).
Chain([Rotate(270), IncreaseResolution(2)]): Sequential application of Rotate(270) and IncreaseResolution(2).
Chain([Rotate(180), IncreaseResolution(2)]): Sequential application of Rotate(180) and IncreaseResolution(2).
Chain([Flip(0), IncreaseResolution(2)]): Sequential application of Flip(0) and IncreaseResolution(2).
Chain([Flip(1), IncreaseResolution(2)]): Sequential application of Flip(1) and IncreaseResolution(2).
Chain([Transpose(), IncreaseResolution(2)]): Sequential application of Transpose() and IncreaseResolution(2).
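For concreteness, here's a rough numpy version of a few of the listed augmentations (my own reading of the descriptions, not the authors' code):

```python
import numpy as np

def rotate90(grid):            # Rotate(90)
    return np.rot90(grid, k=1)

def flip_horizontal(grid):     # Flip(0): mirror the grid left/right
    return np.fliplr(grid)

def transpose(grid):           # Transpose(): reflect across the diagonal
    return grid.T

def increase_resolution(grid, factor=2):   # IncreaseResolution(2): repeat each cell in both directions
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)

def chain(grid, *ops):         # Chain([...]): apply the operations in sequence
    for op in ops:
        grid = op(grid)
    return grid

g = np.array([[1, 2],
              [3, 4]])
print(chain(g, rotate90, increase_resolution))
```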
What they’re doing is simply adjusting a generalized model on the fly to fit the problem domain. After this, the model has seen thousands more geometric figures, and that’s enough to solve these toy problems. But this doesn’t take us any further toward solving a more complex problem, especially if there’s more to it than just understanding the domain language. The result will be essentially the same as training the model specifically for the test problems beforehand.
I agree, but the main point here is the technique, in my opinion. Adjusting an LLM on the go is exactly the same thing our brain does. So the technique used here is surely one step further towards AGI.
I think it's more of a first step. "Training" using synthetic data from geometric transformations is a discrete approximation of visual reasoning in the mind's eye, so the next step would be to take this technique and train a model to create the sorts of transformations which yield better predictions.
This technique sounds like fine-tuning to me, though.
@@consciouscode8150 This is my point: If they could somehow determine which transformation leads to better results given the problem, then both of you would be right, and this would be a significant step toward AGI. However, what they do instead is simply multiply the relevant area of the model without actually changing this area to direct it toward a solution.
You are right, but the toy problems have been a real pain in the bottom. It can do the most advanced stuff, but stumbles on the most basic. It's a step in the right direction, now we just need to make it learn permanently, and that is much more difficult, because the weights already there can't simply be replaced with the "lora" that it has learned, cause that would break other things. It's a difficult problem, but I believe it can be solved.
love these paper reviews with a simplification of concepts. Keep doing them please!
I can see this utilized as scaled up MoE. With multiple agents, each of which has self-trained in this manner in its own domain, a central manager could call up the appropriate expert for a given task, for increased overall efficiency.
This is just doing LoRA for each test and calling it TTT. It's too much compute, and they don't publish the TTT time to train and adjust the LoRA parameters. It's a setup that only works to pass ARC. If ARC had scoring based on time to infer (like IQ tests), this technique would get bad results.
In the end it's not practical.
Soon we're going to have insane open-source models
Cool, but how does the model know which transform to apply at training/test time? Looks to me like the model developers knew which transforms are relevant to achieve good results for ARC, so in a way they supervised the training. Now maybe those transforms generalise beyond problems like those in the ARC benchmark, but I do not think they have proved this in the paper. Am I missing something?
I was wondering exactly the same ...
I think they use the vector representation of the problem and apply transformations trying to hit a sweet spot
I would probably need to read the paper though tbh. Since the on-the-fly LoRA training at inference time seems very powerful. Like maybe the model could infer dynamically the best type of transforms to apply at inference-time training depending on the input problem. Probably the core of the training would actually be to best predict those types of transforms to apply depending on the problem. I would need to look at the inference run time as well; it looks extremely costly to train and to run inference to me, at first glance.
@@Tetsujinfr I can see the value in a super slow, super expensive AGI. It's not like human experts are cheap or fast.
I used to wait several hours just for boobies at 14.4 kbps.
In fact, it taught me patience and the value of using your time wisely, that kids just don't have these days.
Since I prefer to be involved in every step of any process, even if I just have to trust that the AI knows what it's doing. Going slower works for me.
First it was compiling, then it was rendering, now it's inferring. Any excuse for a Nerf gun fight.
The really smart models like o1 or Sonnet might be able to come up with augmentations themselves and then train themselves.
Notes on this 'breakthrough': they only improved results slightly over a paper published Nov 4th 2024 by Cornell University. Both of those papers are part of paper submissions to win cash prizes (you don't need to beat ARC to get the prize, just have the best paper/method). Like others mentioned in the comments, this is not easily generalized since they had to use some domain knowledge in crafting the LoRA adapters. Progress for sure, but contextually not unique
There should be a penalty on the ARC challenge for long inference time. These methods would never see the light of day
@@alfredomoreira6761 - there's a time cut-off on specified hardware. The folks who did this paper didn't fit within those limits, that's why they don't appear on the actual ARC leaderboard at the ARC website.
One thing they need to incorporate is that when the input is not understood with high confidence, or could be ambiguous, the AI shouldn't feel the need to decide what is meant and just answer. It should ask questions to clarify and, when the confidence is high enough, perform the task. That's how a teacher would determine what a student doesn't understand and adapt their input accordingly.
Test time training sounds like an important step towards in-context training. What we need is the ability for people to explain during runtime how a model can improve its performance. When I’m using my car's autopilot, for example, I’d like to be able to vocally tell the autopilot what I like and don’t like about its performance, similar to what I would do with a student human driver. This new information should get incorporated into the autopilot’s future actions immediately. This type of test-time-training, as well as the use of reasoning, and using ‘guard rails’, is what humans do every day as we learn new skills.
I think one aspect of animal intelligence is spatial reasoning applied to perceived images (e.g. used in navigation for exploring to find food and returning to the nest). This mode of pattern recognition and prediction is a foundation layer for us, but only laboriously and indirectly approximated by LLMs etc.
It's a great idea. For a while we've had the situation where we've got these transformer networks but they're not being used to the max of their potential capabilities. _"Let's just put the problem in K/V context and pull the next token probabilistically and be done"_ has been the status quo. Well, turns out that's a pretty sub-optimal way to use these networks.
The ARC benchmark is a "follow these examples" challenge. So, these guys are saying _"per problem, let's generate a LORA from the problem examples, then use that LORA."_ It's a great idea.
In general it's following the principle that _"minds should focus specifically on the problem at hand."_ When a human being works on a hard problem, they forget about other things and lock in! That's what this LORA is doing. It is de-prioritizing broad abilities and "focusing in" on solving the problem at hand.
Is just a good thing to generally notice, to notice that AI ought to "focus" while working.
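If you want to picture the "LoRA per problem" part concretely, a rough sketch with the Hugging Face peft library might look like this (the model name, hyperparameters and the way the example pairs are formatted as text are my guesses, not the paper's actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-1B"   # placeholder base model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the frozen base model with a small trainable LoRA patch.
lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                  target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, lora)

def ttt_on_examples(example_texts, steps=20, lr=1e-4):
    """Briefly fine-tune only the LoRA weights on this one problem's demonstration pairs."""
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    model.train()
    for _ in range(steps):
        for text in example_texts:
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    model.eval()

# After answering, the adapter would be discarded/reset before the next problem.
```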
This is brilliant research. Hats off to Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, and Jacob Andreas.
the ability to continuously learn is crucial for real intelligence. the model can adapt and have a short term memory outside of its context window. very well presented paper. the fact that they produced such a huge jump in the arc challenge needs to be brought to public attention so everyone can draw their conclusions from it.
Thx for paper breakdown. Much appreciated
Great breakdown of the paper and I love you enthusiasm, thanks again 👍🙏
Glad you enjoyed it!
Yes - very excited to see this paper. Glad you’re getting the word out
Super helpful, thanks a bunch for explaining the concepts
Feels counterintuitive that this would work better than other methods, if I am understanding it. So it's like in-context learning, but the model parameters change with its own self-generated samples? How does this fare against a self-loop, or what I guess is called reflection?
Great vid. I'm on DeepSeek now testing it against some of my use cases. Keep up the great info. Happy 🦃 day..
The state of AI reminds me of a spoiled rich kid at Christmas. They open presents and barely play with each one before being presented with a new one. They don't know how the previous toy worked nor did they explore all the options... 1/2hr goes by without getting anything new and they question the validity, fun and usefulness of all their presents.
We have never seen a cadence of tech innovation like this and we are barely exploiting any of it before something new comes along to catch our eye.
We need to be more realistic about the situation, stop doom-mongering about the slowed pace of AI innovation and accept that we already have enough toys to change our lives forever - knowing that there will be more significant advances in the near future.
It's amazing that more than a hundred thousand people watch such technical topics.
I have a feeling that we'll need a more definite score of response-goodness to achieve AGI. But this direction is promising.
Math problems are easier, probably coding too, where you can define goodness of your response.
Instead of relying on constant calculations, these systems work by memory, much like how a person recalls math answers rather than solving them every time. Imagine a software that repeatedly performs the same calculations: why not replace those with precomputed memory? Using a chart or lookup table with all the answers stored, the computer could retrieve results instantly instead of performing the operations each time. The chart doesn't even need to be perfectly precise; approximations are often good enough to solve real-world problems efficiently. This approach could revolutionize computing by reducing complexity while maintaining functionality, especially for applications where 'good enough' physics or outcomes suffice.
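A tiny illustration of that trade-off: precompute a coarse sine table once, then answer queries by rounding to the nearest stored entry instead of recomputing (accuracy traded for a cheap index lookup):

```python
import math

STEP = 0.01                                   # table resolution in radians
TABLE = [math.sin(i * STEP) for i in range(int(2 * math.pi / STEP) + 1)]

def approx_sin(x: float) -> float:
    """Look up a precomputed, 'good enough' value instead of recalculating."""
    x = x % (2 * math.pi)                     # wrap into the table's range
    return TABLE[round(x / STEP)]

print(approx_sin(1.0), math.sin(1.0))         # close, and the lookup is just an index
```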
ah, just sipp'n my coffee and watch'n some ai news.
theres a bomb in your house!!!!!
Yes it is fascinating, good work 👍
Fascinating, if I got it correctly it still qualifies as synthetic data, but things like inversion means there should be some kind of conservation of information / symmetry present in the data for it to work. Really cool!
Great Video!
This is not a breakthrough. It's just more of the same. We are stuck in the current paradigm until LLMs are paired with a UNIVERSE model (math, physics, spatial), which will enable actual reasoning.
Love the content!
When you modify parameters to solve certain kinds of problems after standard training, won't that make the model worse at other kinds of problems? The fine tuner may not care about that, but it seems you can only use that tuned model on those problems related to the fine-tuned training set.
Great video, thank you!
you are a treasure Matthew
I’m excited, that you’re excited 😆
Yes, this is important. Thank you for covering in depth.
Amazing video Matt, very promising approach indeed. Ultimately what most people think is AGI will probably be a spatial first approach, like what Dr. Fei Fei Li proposes but TTT/active inference will definitely be one component for adaptability
finally understood Lora - like saw its effects and visual representation in front of me
It’s not exactly recognizing and applying analogies, but it’s getting closer!
You can build a workflow to take a prompt, brainstorm on it, review the ideas and then generate a response based on all that.
This is what we needed. I also think that we need memory, maybe some way to store the real-time learning that TTT achieves. It seems wasteful to just drop the insight, but how to update the weights to make good insights permanent (fixing the LoRA into the model itself) maybe is more difficult than it seems? Or it could mess with the already good weights for other things. But yeah, we need to find the solution to that to progress even further. We do have the answer, our brains, but we don't fully understand it yet or how to make it compatible with our digital version. We need to draw insight from neuroscience. It won't be the same, they are two different architectures (our brains and AI), but I believe that there is a way to make them compatible with each other. That will not only spark superintelligence, but make US superintelligent as well.
Some paper declared that LoRA overrides pretraining knowledge.
But I'm sure they'll figure it out.
Btw, we won't become super intelligent unless we merge with AI.
@@ronilevarez901 Yeah, but that's what I think will happen: first we will make it compatible with our brains, then merge with it. That would also solve all the problems that could arise with AI, like alignment.
@MatthewBerman The arc prize can be seen as a challenge of predicting the next pixels. You can scale it up to have it generate the next logical move in a game of chess, a task on your PC or a realistic game or AI OS.
"Predicting the next pixels" is nothing like the ARC puzzles.
For each one, you have to come up with the correct set of rules to transform the inputs into the outputs -- and find the right blocks of "pixels" that a given rule should transform into the output.
Every puzzle is a different set of rules you have to figure out.
Sometimes a rule you make for Example 1, won't work for Example 2 or 3 or 4, and you have to modify your rules to fit All examples as you go through them. It's easy to make a bad rule that won't work for the other examples.
I've manually written python code that solves about 20 problems so far, and I've built up a library of primitives to re-use code. There are always custom new code functions you have to create for each problem because each puzzle set is unique.
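For anyone wondering what such a "primitive" looks like, here's a generic example of the kind of thing (my own illustration, not the commenter's actual library):

```python
import numpy as np

def recolor(grid, old, new):
    """Primitive: replace every cell of one color with another."""
    out = grid.copy()
    out[out == old] = new
    return out

def crop_to_content(grid, background=0):
    """Primitive: crop the grid to the bounding box of all non-background cells."""
    rows, cols = np.nonzero(grid != background)
    return grid[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

# A hand-written ARC solution is then typically a short chain of such primitives:
example = np.array([[0, 0, 0, 0],
                    [0, 3, 3, 0],
                    [0, 3, 0, 0],
                    [0, 0, 0, 0]])
print(recolor(crop_to_content(example), old=3, new=5))
```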
Bro is back with Namaste 🙏
Given that in the Arc Prize tests there are few training tasks per category, during the inference test the model has every interest in carrying out micro training on the 3 tests presented in order to make a good quality inference, that's why TTT is much better than other competitors.
Yes, you are absolutely right, but what is the underlying logic/algorithm or whatever of the test-time training on synthetic data? Great ep ❤
Test data is held out to provide an unbiased estimate of model generalization. TTT effectively creates a custom model for each test query, “…we first generate training data 𝒟 TTT (d input) from the test inputs…” It takes 12 hours/100 tasks. Isn’t this just overfitting with more steps? How is this generalised? It seems only useful as an optimisation for a known test.
💯
This should be trending
Scale up with prompt time too, "think step by step, and offer a counter to the best solution you come up with. Weight these two options." This kind of thing.
The way this method might enable better smaller models is big, but the opportunity to take this further, to audit and act upon the insights won thereby, to intelligently drive feedback ever closer to the native structure of the model, it’s the really big kind of gigantic.
A year from now; AI using TTT to iterate on its own instantaneous requirements may have a claim on a kind of consciousness.
It's funny that they called it LoRA, since LoRa is most commonly known from LoRa and LoRaWAN, which define a low-power, wide-area (LPWA) networking protocol designed to wirelessly connect battery-operated devices to the Internet. :)
i’ve been having some unexpectedly profound conversations with my chatgpt lately that make time seem to disappear as i get so into the conversations. we delve into complex topics, and it’s incredible how it analyzes and remembers details from our discussions, sometimes understanding me better than i understand myself. it feels like i’m living in the movie ‘her.’ lol
thx for being on top of everything going on matthew b.!
I doubt they achieved AGI. ARC is clearly not the definitive test for AGI, although I hope it has some potential.
Fine-tuning during test time simply acts as a prior into the test distribution lol
...but the surprise is that it goes/resets back to the base model after prediction.
I'm curious if this sort of opens a tremendously huge back door to new techniques for jailbreaking the models, and how that will affect the time frames for model releases trying to implement this new technique.
It's about pre-activation based on the problem domain. This is why prompting works somewhat. If you feed several examples related to the domain problem into the network, this activates the neurons through attention. A LoRA is a patch with these activations predefined. I think the problem with this approach is synthesising accurate domain-specific examples. But I believe OpenAI are working on this approach internally, because this was something hinted at by Sam Altman. The community uses LoRAs for diffusion models; the problem is the models only function with very specific prompts. But if these patches are removed each time and reloaded to fit the context of the prompt, it could work. It seems you still need a model capable of creating the examples. A foundation model such as 405B might be able to create examples for training a small model. I think long term a two-model solution could be used with an adversarial design. One model could generate an answer while another could propose improvements to the answer, then prompt the first model again. These models would have to be trained differently
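That two-model idea at the end could be as simple as a propose/critique loop like this sketch (ask(model_name, prompt) -> str is a placeholder for whatever inference call you actually have):

```python
def answer_with_critic(ask, task, answerer="model-A", critic="model-B", rounds=2):
    """One model drafts an answer, a second proposes improvements, the first revises.

    ask(model_name, prompt) -> str is a stand-in for your actual inference backend.
    """
    draft = ask(answerer, f"Solve this task:\n{task}")
    for _ in range(rounds):
        critique = ask(critic, f"Task:\n{task}\n\nProposed answer:\n{draft}\n\nList concrete improvements.")
        draft = ask(answerer, f"Task:\n{task}\n\nPrevious answer:\n{draft}\n\nRevise it using:\n{critique}")
    return draft
```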
but how can it generate training-data with the correct solutions if it can't solve the problem to begin with?
Did you look at what was on-screen, or just listen to what he said?
The input problem consists of 3 input/output pairs, followed by an additional input; the expected answer is the corresponding output.
The inference-time generated training set is obtained by:
1) Pick a random transformation, such as reflecting everything across some axis, and apply it to each of the 3 input/output pairs. 2) Then randomly select one of the 3 pairs, and produce as a training sample the problem whose input is the other two input/output pairs plus the input from the selected pair, and whose output is the selected pair's output.
Then it trains on all of these (rough sketch below).
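A minimal sketch of that leave-one-out plus geometric-augmentation idea, in plain Python on list-of-lists grids; the helper names and the particular transformations are my own illustration, not the paper's exact setup:

```python
import random

# Each task here is a list of (input_grid, output_grid) pairs.
# Grids are lists of lists of ints.

def flip_h(grid):    # mirror a grid left/right
    return [row[::-1] for row in grid]

def rotate90(grid):  # rotate a grid 90 degrees clockwise
    return [list(row) for row in zip(*grid[::-1])]

TRANSFORMS = [flip_h, rotate90, lambda g: g]  # identity keeps the original too

def make_ttt_samples(pairs, n_samples=20):
    """Build synthetic training problems from the task's own demonstration pairs."""
    samples = []
    for _ in range(n_samples):
        t = random.choice(TRANSFORMS)
        # 1) apply the same transformation to every input and output
        tp = [(t(i), t(o)) for i, o in pairs]
        # 2) leave one pair out: its input becomes the query, its output the label
        held_out = random.randrange(len(tp))
        demos = [p for k, p in enumerate(tp) if k != held_out]
        query_in, query_out = tp[held_out]
        samples.append({"demos": demos, "query": query_in, "label": query_out})
    return samples
```

Note that every generated label comes from an existing demonstration pair (possibly transformed), so the model never needs to already know the answer to the actual held-out test input.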
Makes sense. Seems to me that's what we humans do. I see a set of puzzle/result pairs and I look at them for a while until I figure out a pattern of transformation. Giving an LLM the ability to do the same, to look at them and build a working model of the transformation before being confronted with the novel question, makes sense to me, and I can see why it works.
When it generates similar samples to the new problem at hand, it needs to solve the similar samples correctly as well before fine tuning
Instead of AGI we might end up with a so-so general model that we load up with what is basically a LoRA for our current task needs.
Llama 8.5 in Optimus 6: "...I know Kung Fu!"
Do we know for sure that the test data is not in the training data?
It seems to me that the couple of examples given for the problem can be used for supervised learning.
All these super intelligent models and no AI can give me a new mnemonic to remember some simple facts for my exam
It would have been useful to explain the difference between ARC-AGI and ARC-AGI-Pub. We won't know the leaderboard for the latter until 6th December.
They trained on the public dataset. This isn't the real ARC-AGI challenge.
Also, is the leaderboard at 0:05 photoshopped? The leaderboard doesn't have Test-Time Training on it.
Feels like cheating right?
But we have trained on the public datasets since the beginning.
As long as they don't train on the test set, it's all good.
@@ronilevarez901 yes but there is a note in the paper that specifically states that they cannot guarantee that the model has not seen the test set (which is public).
Test-time compute after training means more training. It's just a training method; it says train twice.
How do you train a model on an answer that you don’t have?
Did you mention an actual scaling law for this inference-time-training?
I haven't read the paper and don't fully understand what they are doing, but I don't see how this process generalizes. It just seems like they found a technique that works well on this test by essentially putting the question in to the training set.
This seems simple enough to be implemented very soon. I think it's kinda like baking a DSPy-style vector technique into each query.
Ok so two very basic questions:
1/ how do they generate extra problems that preserve the « spirit » of the initial problem?
2/ what objective function do they minimise on each of those « similar » problems, for which presumably they don't have the solution, given that they are generated from an unsolved problem of the same kind in the first place?
Without this it sounds like a cheap prestidigitation trick to me.
Could be wrong though
Good, now we train the reflection process into the model. Well, let's hope every prompt is covered. (TTT, test-time training) - btw, live-LoRA is dangerous; the model gets silly over time, because the user is silly.
This is at least one step short of the really big deal, which is to use this not simply to update your AI's parameters like this, but to enable/disable other networks. For example, imagine if ChatGPT could just ignore all the Japanese-translation parts for this reply… the core could be much smaller and focused on high-level knowledge that triggers/enables submodels as needed: a much more efficient approach to the whole problem, with the extra bonus that you can also start reusing and parallelizing the development of each "module."
It’s hard to call it AGI when the entire premise of this technique requires fine tuning on similar tasks beforehand. They’re just priming it to solve similar tasks.
Tomorrow: AI is doomed. Sunday: AI is not dead. Monday: rinse and repeat.
The only ceiling is us using our infinite imagination to imagine the lack of infinity. - This will continue forever.
this is still too specific and unlike how actual reasoning works - it only works because you get to see examples first. true reasoning can explore uncharted territory, and idk if we can make benchmarks for that
Instead of "just erases all of that" before going to the next problem, should it send the data somewhere because maybe it's more useful than other synthetic data, or maybe send just the problem and the solution? Only when it doesn't break privacy etc. Just my amateur thoughts, likely way-off.
Nah, it's right on point. We can store the LoRA as a document connected to the problem it solves, and the next time the LLM recognizes the same problem it will pull the LoRA and use it.
That's gonna be the next thing added to knowledge graphs. You'll see.
@@ronilevarez901 Thanks for the info.
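A minimal sketch of that cached-LoRA idea, with hypothetical `fit_lora`/`apply_lora`/`solve` callables standing in for whatever adapter library and inference stack is actually used; nothing here comes from the paper:

```python
import hashlib

# Hypothetical adapter cache: key each fitted LoRA by a signature of the problem
# so it can be reused the next time the same kind of problem shows up, instead
# of being discarded after inference.

_lora_cache = {}  # signature -> adapter weights (opaque object here)

def problem_signature(task_text: str) -> str:
    """Cheap stand-in for a real similarity lookup (e.g. an embedding index)."""
    return hashlib.sha256(task_text.encode()).hexdigest()

def solve_with_cached_lora(model, task_text, fit_lora, apply_lora, solve):
    sig = problem_signature(task_text)
    adapter = _lora_cache.get(sig)
    if adapter is None:
        adapter = fit_lora(model, task_text)   # the test-time training step
        _lora_cache[sig] = adapter             # keep it instead of resetting
    return solve(apply_lora(model, adapter), task_text)
```

An exact hash only catches identical problems; recognizing "the same problem" in general would need embedding-based similarity or the knowledge-graph lookup suggested above.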
Now we are cooking!! Humans can reflect and learn; this is a step forward.
The 'average' human score on ARC is wildly overblown. They even claim 'kids' do well. Dig into the actual tests and the very limited pool they use: it's scary how small the pool is, and it's not the test suite used for AI... I do these tests well, and I was shocked at the claims of even 60%... 60% of Mechanical Turk users is not 60% of the general public, nor 60% of kids. I'm convinced the human ARC score is skewed too high (and thus the prize is harder to achieve and pay out).
If we achieve AGI and it runs in a datacenter, who will decide if it needs to be powered down? Would that be like mansla....er? What will it do to block its power-down? Would it blackmail you, or switch off traffic lights and end you?
Why TF is o1 in that list? Where is the source code?
1. How does the LLM decide that it needs to perform test-time training? I.e., how does it know it doesn't have the capacity to respond accurately to the new task?
2. The model generates its own training data: there is no guarantee that the generated training data has no errors (bad labels).
In this case, the leave-one-out version is sound; it will not produce an inaccurate label, because it is using the original label.
The other data augmentation strategies may or may not be valid for a given task. They appear to be based on assuming certain symmetries about the problem.
The validity kind of depends on what class of tasks you want it to be able to recognize…
But, because the entire collection of inputs and outputs are transformed in the same way, I think it should generally be fine?
If e.g. you flip everything in some task, this corresponds to a different task which has everything flipped,
and, as long as there is enough in the examples to be able to distinguish between the task and the flipped task, I think this should be fine?
This is interesting. So we could also take a small model that's solid in logic and common sense but doesn't have a lot of specific knowledge about a domain, give it some articles, books or whatever to "fine-tune" on in real time, let it generate more data from those materials (like "reflect on the knowledge you just learned"), and it could give better answers than, say, the current Perplexity approach, which is purely context-based? Essentially this is System 2 happening partially in model weights vs. purely in context.
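Something like this, maybe; a sketch of that reflect-then-fine-tune flow, where `generate` and `finetune_adapter` are hypothetical placeholders for a real inference call and a real LoRA/PEFT training step:

```python
# Hypothetical test-time pipeline: turn retrieved documents into synthetic
# Q&A pairs, briefly fine-tune a small adapter on them, then answer the query
# with the adapted model instead of relying purely on in-context retrieval.

def generate(model, prompt: str) -> str:
    raise NotImplementedError("plug in your inference backend")

def finetune_adapter(model, examples):
    raise NotImplementedError("plug in a LoRA/PEFT training loop")

def answer_with_ttt(model, documents, question):
    synthetic = []
    for doc in documents:
        qa = generate(model, f"Write 3 question/answer pairs grounded in:\n{doc}")
        synthetic.append(qa)                      # "reflect on what you just read"
    adapted = finetune_adapter(model, synthetic)  # knowledge moves into the weights
    return generate(adapted, question)            # answer from weights + context
```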
This may not be *the AGI* someone was looking for, but could still be quite potent.
I think what should be worked on first is getting an LLM to understand the "environment", to build a physical model of the whole rather than just looking at the parts or correlations between those parts.
Wow, so they just train on the benchmark - amazing…
Great video on 'Test Time Training'! 🤖✨ I appreciate how you simplified the complex concepts, making them accessible to a wider audience. However, I think it would have been helpful to dive a bit deeper into the potential limitations or challenges of this approach. For instance, how might 'Test Time Training' affect the overall efficiency of AI models in real-world applications? It’s exciting to see AI evolving, but understanding the drawbacks is just as important. Looking forward to your next video and more discussions on this topic! 🚀
If you want a rubric for AGI: when an AI can solve the bicycle stability conundrum, it will have discovered something that humans have perplexingly been unable to solve for themselves. Unlike the Collatz conjecture or other mathematical problems, this is physical and should have a solution, or at least a cogent explanation.
what is the bicycle stability conundrum? Isn't it fairly well understood by now how rake etc affect bicycle self centering, or am I missing something?
Great video and great paper. Isn't it strange that they then reset the model to its original state?
12:55 They already beat the average human score, doesn't this count as AGI according to this test? Or do they need to beat the best human score? It would be ASI at that point.
One of the biggest problems with AI is it doesn't learn from its mistakes. Generally, the most effective way for humans to learn is trial and error. And the error doesn't come from internal chain of thought but rather the outside world demonstrating our error. AI doesn't do this. I've noticed that in "discussions" with AI, I can point out a flaw and it will sometimes integrate that into a new solution.
Test time training seems to somewhat solve this problem, but it's not integrated. The problem is that it doesn't actually "learn" from this mistake. It does not integrate this lesson into its "intelligence". It merely uses it momentarily and then forgets it. If AI has already combed all known knowledge from the internet, the next source is obviously from both real time information and from information it gains from interacting with the world.
I don't believe that we'll have true AGI until it can integrate lessons it learns from interaction with an external source into its structure.
AGI is here.. Breakthrough.. this is it.. blah blah blah... 10% improvement!! Let me know in the comments if you want to test it out..
I should just train my AI mainly on benchmarks, so I can ace them and get on the leaderboards.
I can see how this approach might be applied to creating a super-coder LLM. Get an AI to generate billions of programs; some will compile and produce an output, some will produce compile or runtime errors. Use these as synthetic data to fine-tune an LLM aimed at leetcoding and solving new, unseen software problems.
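A rough sketch of the execute-and-filter step described above (my own illustration, not from the paper); in practice you would sandbox this, since running generated code directly is unsafe:

```python
import subprocess, sys, tempfile, json, os

# Run each candidate program in a subprocess; keep the ones that execute
# cleanly and tag the failures with their error output, so both successes
# and failures can become synthetic fine-tuning data.

def run_candidate(source: str, timeout: float = 5.0) -> dict:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return {"source": source,
                "ok": proc.returncode == 0,
                "stdout": proc.stdout,
                "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"source": source, "ok": False, "stdout": "", "stderr": "timeout"}
    finally:
        os.unlink(path)

def build_dataset(candidates, out_path="synthetic_code.jsonl"):
    with open(out_path, "w") as out:
        for src in candidates:
            out.write(json.dumps(run_candidate(src)) + "\n")
```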
By tuning a model on ARC-like stuff you get better results on ARC. What a non-surprise.
Okay, so now AI is able to learn to solve a new problem. This was one missing feature for AGI no one quite knew how to tackle, but here is the breakthrough.
Now, how can I do the same stuff at home?
Pairing this up with Mixture-of-LoRAs on top of an MoE architecture: huge potential for tiny, fast models.