Problems like power consumption and cost inevitably go down until they hit a certain plateau. Just let it come down first before you seriously complain about it.
And it also solves, in seconds, other problems that stump any human given any amount of time. The problems it struggles with are ones we handle easily with visuals, and given our heavy dependence on vision and how evolved we are in that area, it makes sense that we happen to be particularly good at those. These aren't "hard" problems to solve in AI development (as in something that will take decades). These benchmarks are being run by models primarily trained on language inputs, hence the second L in LLM. Once we start creating models fused with better visual understanding and the ability to create and process internal visual representations more efficiently, I suspect a lot of these cases where humans happen to be a lot better will very quickly go away.
The depth of the alignment problem needs even more attention. A machine able to solve these complex problems will be able to manipulate the world's top psychologists like putty. We are in deep, deep trouble very soon if things continue to heat up this fast.
The hype and shock about o3 were so intense that a YouTuber posted claiming that o3 was AGI and even used Stockfish in chess as an example! Crazy stuff!
The ARC problem you described, specifically the one where the o3 solution was marked incorrect, is actually a question that most people get wrong. In the standard solution, the red block directly above needs to turn blue because the assumed logic is that contact with the blue line changes the color to blue, rather than the change being based on overlapping. However, the first three examples don't demonstrate this contact behavior, so even ARC's own solution remains controversial.
@@AntonBrazhnyk But there is touching. The topmost shape is just "touching" the topmost horizontal connection. If touching is supposed to mean turning blue, it is ambiguous, as it is not shown in the examples.
@@Walfischzahn Ah, ok. Agree. There's touching, and there are no direct examples about it. So, is it known what the model did and what's considered correct by the test? I wouldn't paint that rectangle blue just by assuming that since touching isn't shown, touching doesn't work, though I could argue for painting it blue too. :) Also, it's one of the huge limitations of current models, which will always keep them from achieving AGI status. They have to be able to answer with some form of "I couldn't solve it" instead of just hallucinating an answer.
The ARC-AGI task that o3 failed on was actually ambiguous. If you look at how it answered you would have realized that both answers it provided were reasonable and a response that many people would have had as well.
Indeed, I think it's interesting to see the ambiguity on the eval-creator's side that the model discovered. Also a bit sad Matthew didn't look into it even slightly and just took the failed problems as true failures rather than broken ground truth - maybe a follow up is warranted? It's quite interesting to look at the "failures"!
Yeah! I actually was doing the test myself for a bit and came across this question, and realised the ambiguity of it. I got it wrong at first because I made the wrong assumption, but my answer is just as right as the official one, so if o3 had the same answer as me I don't think it should be counted as false!
that's because AI is only copying data patterns humans have already created. When investors finally catch on to this they're going to put their money elsewhere...
@@eadweard. Actually, that's exactly the problem. I mentor engineering graduates and they always start off coming for an answer, only to discover that the answer they get does not suit their need, because they did not understand the problem and therefore asked the wrong question. AI in general seems to be viewed as a generalised problem solver of some kind, but often if you really understand the problem you don't need AI to solve it; you don't even want AI to solve it for you, because computationally it is very expensive. It is generally a lot easier to deploy AI than to have a deep understanding of the problem and be able to identify an algorithm that can run on hardware 100 or 1000 times less powerful.
If AGI is achieved but it is expensive and too resource intensive, then presumably the first task is to ask o3 how it can improve, and allow it to recursively grow quickly into ASI - artificial superintelligence.
When I read people say things like this, I just know for a fact you have very little knowledge about how these word transformers work. There is no AGI. And there especially won't be AGI using any of the current LLM technology.
7:29 I agree, this is NOT AGI. o3 can solve very complex math problems, perform pattern matching, and reprocess outputs, but it is still limited to input-output transformations within its training data. However, it does not truly "know what it doesn't know," nor does it integrate deeper considerations, like in software development considering the UI, backend, user needs, or ***why a particular approach failed***, into a single cohesive reasoning process. LLMs lack genuine understanding, consciousness, and the ability to reason abstractly or transfer knowledge effectively across domains in a way that (some) humans do naturally.

Knowledge: LLMs possess vast amounts of knowledge, often exceeding that of any individual human. They can access and process information from the entirety of their training set. But it's important to note this knowledge is represented statistically, not conceptually.

Intelligence: Intelligence involves applying knowledge to solve problems, adapt to new situations, and reason effectively. While LLMs demonstrate some intelligence within their trained domains, they lack the general intelligence to adapt it to novel scenarios.

Wisdom: Wisdom goes beyond intelligence. It involves judgment, ethical considerations, understanding the broader implications of one's actions, and learning from experience. Wisdom is a deeply human trait that is far beyond the reach of current AI.
To be 'general', I believe an LLM needs to understand why something is or is not the case, and how to use that why in novel reasoning. I can watch tens of YT videos on a subject and sound intelligent on that topic, but when asked why, it quickly becomes clear that I only have knowledge of the subject, not intelligence, let alone wisdom.
It's clear from the examples of "trivial" failed tasks in ARC-AGI that what's missing from o3 is a *physical representation* of the world. For now it's mostly been based on concepts and how they relate to each other, this was particularly evident with word2vec for example. But the models don't really understand "I need to _paint_ these squares that are stacked _on top_ of each other in the order given by the small color band". We know how to do this because we live in a 3D world, so this feels kind of obvious. Once models start having a sense of what it means to be a physical being or what we experience interacting with our surroundings, then they'll make another huge leap. Unsurprisingly they've been really struggling with this entire physical aspect… for now.
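As a rough illustration of the "concepts and how they relate to each other" point, here is a minimal word2vec-style sketch; it assumes the gensim library and its downloadable glove-wiki-gigaword-50 vectors, and is only meant to show that embeddings capture relational structure between words without any physical grounding behind it.

```python
# Minimal sketch (assumes gensim is installed and the small GloVe vectors can be downloaded).
# It demonstrates the classic "king - man + woman ~ queen" relation: concepts related
# purely through co-occurrence statistics, with no physical model of the world involved.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word embeddings

# Which word is to "woman" as "king" is to "man"?
for word, score in vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3):
    print(f"{word}: {score:.3f}")  # "queen" is typically at or near the top
```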
I worked with advanced voice mode, and when compensating for its VISION DISABILITY it was able to solve the blue/red block puzzles perfectly and predict the output panels. It was as I suspected: the model is simply blind or vision-impaired, based on imperfect conversion of visual data to words/concepts/descriptions.
AGI will come about from plugging up the weaknesses in models. “Humans can do XYZ that the models can’t” - well eventually the models will because they’ll be trained on it. One would hope that training a model across a broad spectrum of intelligences would result in cross-pollination between intelligence across different domains.
It doesn't currently work like this. The difference between a GPT model and an art-drawing model is not just that they are trained on different data; they are each engineered differently from the ground up. There is no "training a model across a broad spectrum of intelligences" yet.
Exactly, just spin up a million AIs and have them each attend different classes and do the homework, and correct the homework to see what they got wrong, and then just use the LoRA that comes from each specific training run to load the expert in that field for each problem.
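A rough sketch of what "load the expert in that field for each problem" could look like in practice, assuming the Hugging Face transformers and peft libraries; the base model name and the ./lora-math and ./lora-physics adapter paths are placeholders, not real artifacts. It swaps per-subject LoRA adapters over a single base model rather than literally spinning up a million AIs.

```python
# Sketch only: switch per-subject LoRA "experts" on one base model.
# Model name and adapter directories are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"          # placeholder base model
base = AutoModelForCausalLM.from_pretrained(base_name)
tokenizer = AutoTokenizer.from_pretrained(base_name)

model = PeftModel.from_pretrained(base, "./lora-math", adapter_name="math")
model.load_adapter("./lora-physics", adapter_name="physics")

def answer(question: str, subject: str) -> str:
    model.set_adapter(subject)                   # activate the adapter for this field
    inputs = tokenizer(question, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(answer("Integrate x^2 from 0 to 1.", "math"))
```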
@@abj136 I disagree. LLMs are currently doing things that everyone thought were a dead end just one year ago. It's clearly visible that LLMs can reason to a certain extent. They are representations/simulations of the world in written form. Most people thought that an LLM wouldn't be able to achieve the ARC-AGI metric as the o3 model has done. It's just a matter of plugging up the holes. When a new "type" of intelligence arises that humans can do and a model still can't, it may be possible to _plug_ the hole. Even in things we currently think LLMs are bad at, like 3D spatial awareness, it may be possible for them to become adept at navigating or predicting real-world actions as long as they have an accurate and steady data stream of the current real-world space.
@@tjken33 Or maybe it's generative AI. Chinese students study for exams; there is an entire market around studying just for exams. They learn how to answer each question in a very specific format. They even go so far as to get copies of previous test answers and memorize those questions and answers. This is exactly what generative AI is doing now.
@@daomingjin Sounds like the Chinese students are the real trainers in this picture. Just like old engineers teaching programmers how to automate their jobs.
Wonder if we'll see o4 in less than 3 months. If that happens, we might have hit the wall: the wall of the graph. On another note, if o3 is capable of chip design, the next generation of AI chips could be revolutionary.
I would expect that o3 is already capable of chip design but if it costs $450K per day (as demonstrated by the $300+K run for 16.5 hours of computation) to run one instance at full power, it's not economically feasible to use o3 instead of expert humans yet.
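A quick back-of-the-envelope check of that daily figure, assuming the roughly $300K for 16.5 hours quoted above (these are the comment's numbers, not official pricing):

```python
# Rough extrapolation from the figures in the comment above (assumptions, not official pricing).
run_cost_usd = 300_000   # reported cost of the high-compute benchmark run
run_hours = 16.5         # reported duration of that run

cost_per_hour = run_cost_usd / run_hours
cost_per_day = cost_per_hour * 24
print(f"~${cost_per_hour:,.0f}/hour, ~${cost_per_day:,.0f}/day")  # about $18K/hour, $436K/day
```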
The ARC test really is valuable. It's not the last benchmark for AGI, but it is A benchmark. Basic reasoning in novel situations is something you need your researchers, employees, teachers, space probes and robocops to be able to do.
It will be interesting, because the jump in coding ability from 4o to o1 is pretty massive. Whenever I am stuck on a more complicated part of a project and neither I nor GPT-4o can figure it out, even after several attempts and going over documentation, I have yet to not be able to get there with o1. Sometimes it might take a few tries, but I can always get it to work. I work 95% with 4o and only go to o1 when I am completely stuck. So if the jump is even greater than that it will be awesome, because it probably means I can get there in 1 or 2 prompts, saving considerable time.
Am I the only one on the planet who tried o1 for (new and rather complex) programming tasks and didn't get anything good? Hours (mainly because it's slow) of prompting and reprompting? I still tend to do the main programming myself and have GPT-4o fill in the boilerplate stuff.
@@pingomobile Clearly you are working on the wrong programming tasks. You need to be creating demo html input screens or fixing open source GitHub bug reports. 😉
@@neoglacius Hahaha, yeah right. How many sales do you think WordPress has lost to AI? NONE! Not one. That's a 20 year old piece of software and they're gaining sales from AI and not losing sales. When WP loses even 1 sale to AI I'll start to believe the hype but it's not even close to happening
@@scdecade *How many sales do you think WordPress has lost* bud, youre pretending hallucinations will keep going forever, 2 years ago they predicted the devs would be over in 30 years , but current predictions narrowed the timespan to 10 years, what do you think would happen in another 2?
When most people implementing LLM-based AI agents in domain-specific areas still have to combine models like o3 with knowledge representation and reasoning-based ontologies and semantic knowledge graphs to ground the LLM and get more accurate and trustworthy results, it tells me we are a really long way from models like o3 being AGI on their own. They just can't compete with the results of combining LLM-based AI with knowledge representation and reasoning-based AI.
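For readers unfamiliar with that pattern, here is a minimal sketch of grounding an LLM answer with facts retrieved from a semantic knowledge graph. The tiny rdflib graph and the call_llm stub are purely illustrative assumptions, not anyone's production setup.

```python
# Illustrative sketch: pull facts from a small RDF knowledge graph and prepend them
# to the prompt, so the model answers from known triples instead of free association.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Aspirin, EX.treats, Literal("headache")))
g.add((EX.Aspirin, EX.interactsWith, EX.Warfarin))

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever model API you actually use.
    return f"[LLM answer grounded in:\n{prompt}]"

def facts_about(entity) -> str:
    rows = g.query("SELECT ?p ?o WHERE { ?s ?p ?o }", initBindings={"s": entity})
    return "\n".join(f"{p.split('/')[-1]}: {o}" for p, o in rows)

def grounded_answer(question: str, entity) -> str:
    prompt = (f"Known facts:\n{facts_about(entity)}\n\n"
              f"Question: {question}\nAnswer using only the facts above.")
    return call_llm(prompt)

print(grounded_answer("What does aspirin interact with?", EX.Aspirin))
```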
The contract states that once OpenAI has actually, actually achieved AGI, Microsoft's deal regarding AGI is done. Like, Microsoft can't use the AGI model, but can still use every other model that is not AGI.
What you miss is that o3 was trained on 75% of the ARC examples. That's what Gary meant when he said it can't solve real-life problems, i.e. things it hasn't specifically been trained on.
@@Pabz2030 On my channel you have the demo of a system that learns like you do and needs no training to perform. I called it "True Machine Learning". Besides this, as IvCota said here, you as a human being have REASONING capabilities, and REASONING is used to deal with new situations and discover new insights. It's just that these days engineers have gotten it into their heads that the only way to learn and reason is to develop a statistical model, which is ridiculous and actually detrimental to science and engineering.
This is true, but it doesn't take away from the value of the accomplishment… this fear was essentially debunked. Search for it… sorry, I can't explain why, as it was complex, but it was something along the lines of: that's the way the ARC prize is supposed to be tested for, given the nature of the tests.
9:50 Actually it didn't fail; it is an error in the test. Find the Reddit discussion about it. You solved it correctly, the same way o3 did, but the "correct" result was to also paint the rectangle touched (not crossed) by the line, which is wrong because there were no examples for such cases.
The reason they don't provide exact examples is because that's exactly how the "general" part of AGI is being tested! It failed the test. It couldn't generalize a solution like a human can. That's the entire point of the test.
@@ShootingUtah Wrong. It 'failed' the same way many humans would (and did). It is like the teacher asking the class how to continue " 1 1 1; 2 2 4; " and one person says "3 3 6" the teacher says: NO! "3 3 9" is correct.
@@MichealPeggins I know people who were part of the test team and they were pissed when they realized it did not work and OpenAI used them as free labor to test and train it. At release it still did not work as they wished, but they were forced to release it.
o3 is a significant milestone for sure, but I don't consider it AGI. The reason it's failing the simple cases in the ARC-AGI benchmark is that it still doesn't understand the knowledge that it has. It's like a child that's memorized PhD-level material. It might be able to make some correlations based on the sheer amount of data it knows, but it doesn't actually understand the data. It's the same reason we get hands with 6 fingers when it makes art, or how legs somehow reverse themselves. I believe understanding will come when we get true agents (not the toys we now call agents), and we have models like o3 that can learn on the fly. This is how humans gain understanding: by taking knowledge we've learned and applying it to solve problems, realizing our failures, and coming up with strategies to solve the problem. We can then take those wins and eventually create a generalization. The two pieces we still lack in AI to accomplish this are: 1) real agents, 2) the ability to analyze wins and losses and try to come up with a generalization to describe the lesson.
yes, what you said here is how I understand what AGI is supposed to be - that it can generalise from various examples. You can just tell it that a human has five fingers and that's it, it never makes that mistake again, because it's actually intelligent (I didn't say conscious btw). It's clear that they are making progress and the models are getting better, but just getting better is not AGI. I knew they would alter the definition of AGI to whatever they had at the time in order to get more VC money coming in, and although they can't say it's AGI themselves yet for strategic reasons, they have clearly implied to influencers that it ticks all the boxes for their definition of it. My question is: if it is AGI, why not call it GPT5? I mean surely that's a massive breakthrough.
The best part of the Adams story, which was on the screen for a moment, is that the AI hallucinated: it got the rather basic mathematical question wrong. Also an everyday experience with other models.
A coding competition is not a good test for generative AI. Keep in mind this is not AI... it is generative AI (sentence patterns). If you want to train an AI to code, it's easy if you have access to a large database of coding challenge questions and answers: just iterate through everything and build a model. It's generative AI, so basically... it can generate code, but debugging will always be a problem, as generative AI cannot reason.
My understanding of AGI is that it's a different type of reasoning: like you can tell it 'there are three 'r's in 'strawberry'' and it understands all the concepts involved (which, let's face it, a five year old can), and then remembers that, and is able to answer it in the future, whichever way you go about asking it. Because it actually understands, and is not merely repeating its training data without understanding it. As far as I can tell, this model, although impressive in many ways, doesn't do that, and is therefore not AGI by that definition.
The math symbols for adding a changing expression use a capital Sigma (Σ), which comes from the Greek alphabet and looks like a capital "E" or a backwards numeral 3. The expression to be summed is written to the right of the Sigma. Below the Sigma, write the starting value of the variable. Above the Sigma, write the ending value of the variable. For example: Σ (i=1 to n) of a_i. This means summing the values of a_i from i = 1 to i = n.

The math symbols for multiplying a sequence of expressions use a capital Pi (Π), which also comes from the Greek alphabet and looks like two capital "I" letters connected at the top. The expression to be multiplied is written to the right of the Pi. Below the Pi, write the starting value of the variable. Above the Pi, write the ending value of the variable. For example: Π (i=1 to n) of a_i. This means multiplying the values of a_i from i = 1 to i = n.

To simplify a multiplication expression, use the summation of logarithms. The logarithm of a product is the sum of the logarithms of the individual terms. For example: log(a * b) = log(a) + log(b). Logarithms can be expressed as a series of terms added together: for example, the natural logarithm of (1 + x) can be expanded as x - (x^2 / 2) + (x^3 / 3) - (x^4 / 4) and so on.

By grouping or rearranging terms, one can often replace a large expression with a simpler one. This requires recognizing common patterns and using substitutions. As you learn more math, some answers may just "pop" into your head, like muscle memory. Skilled mathematicians, like chess champions, build a large mental library of expressions and substitutions. By recognizing patterns, they can gradually transform a complex expression into a much simpler one.
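The same rules in standard notation, for reference (just a restatement of what the comment above describes):

```latex
% Summation and product notation
\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n
\qquad
\prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdots a_n

% The logarithm turns a product into a sum
\log\!\left(\prod_{i=1}^{n} a_i\right) = \sum_{i=1}^{n} \log a_i

% Series expansion of the natural logarithm, valid for |x| < 1
\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots
```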
Nope, realistically it will achieve AGI status in the late 2030s or early 2040s. 2029 seems too optimistic in my view; we need a better architecture than transformers, which have lots of issues, specifically the quadratic scaling issue. There's a reason the unreleased o3 models are expensive and take time: the longer the sequence, the more operations and memory are needed to run the matrix computations in the transformer approach. So, give or take, we need a better architecture than the current optimized transformers to achieve AGI status before 2030.
@@buenaventuralosgrandes9266 Making AGI cheap is a separate question from achieving AGI. The incentives are so high that hardware compute is going to evolve rapidly. Five years is far in the future at the current pace of progress. If there is a need for big breakthroughs similar to the transformer architecture, then we don't know whether it will take five years or fifty or more. Language models based on the transformer are a kind of intuition engine for type 1 thinking, specifically trained on text. There's a lot of room still for improvement by adding multimodality, fusing learning time and inference time, and refining type 2 thinking architectures. My guess is that we will reach a point where the hardware is cheap and frugal enough to run a sophisticated type 2 thinking architecture built from blocks of multimodal intuition engines, somewhere between 2027 and 2029.
What if this is one of those “sandbagging” cases where it’s just trying to hide how advanced it actually is? I understand AI at a 5 yr old level so I’m honestly curious.
I agree. The question it failed was a trick somehow. When you ask it questions in the correct way, my experience with o1 is that it's smarter than almost every human.
Considering o3 low can score 76%, which is higher than the actual average human score, at only 3-4x the price of o1 high, it's impressive. The diminishing returns beyond o3 low are really insane too. With this level of intelligence, I think some of the most valuable applications will not find this expensive, and I suspect o3 crosses the threshold from completely unusable to quite useful, if not very.
That's assuming that prices stay the same. They also showed o3-mini, which is cheaper than o1 and more capable. By the time o4-mini comes out it may be cheap enough to be free and better than o3.
Altman gives me that "I read your emails and put back doors in everything I do" vibe... I bet he front-loaded a bunch of high-level answers for all the known AI tests...
The supposed mistakes of o3 that we would have solved are proof that it's more AGI than us, because our solutions are mostly hasty-generalization fallacies, such as the problem you showed at 8:59. The fallacy is assuming that you can generalize from 1- and 2-pair coordinates to 3 coordinates (which the question implicitly allows, as it's 6 points and it's not trivial to assume planar geometry).
Don't be silly; at most it's a different kind of AGI (I don't think it's AGI). What I do think this is: an extremely well-trained model across a bunch of fields, with the ability to let it run longer to do more work on the problem. I also think this system has the ability to read/learn more than an expert, because it can be an expert in multiple fields at the same time. This is why it was able to ask such good questions of an expert in one field. Just this will have a huge impact when experts can use it for their daily research/architecture design, etc.
@autohmae You think it's silly that I suggested the problem it didn't solve is an acute case of hasty generalization (jumping to a conclusion when other general conclusions are not ruled out)? You think o3 was wrong not to commit to Chollet's (bad) question? Show me how you can conclude from the 3 examples that the 4th example is planar and not 3D.
@@wwkk4964 I'm not saying you are wrong about what happened, I'm only saying you made a very strong statement about 'more AGI than us', which I think is too strong and in my opinion probably wrong; at most it's just smart in a different way. Similar to how chess by humans and chess by machines was different when they were on equal footing.
@@wwkk4964 BUT let me add something for you: IF you are right, it's just a matter of a fairly short time before it will be very clear. If I'm right it will take longer, maybe much much longer, or not even that long, just a few years.
@autohmae Okay, I agree with you, with the caveat that I should put "AGI" in quotes, to make it clear that I was not expressing a held belief but rather entertaining the notion of AGI for argument's sake (since Chollet calls his test ARC-AGI, a misnomer; it should be called a hasty-generalization test). I am of the opinion that the notion of AGI (or even reasoning and intelligence) is an illusion, and I expect it to be shattered within a few years; that will be a psychological crisis for humanity of a scale we have never considered, and perhaps we will end up entertaining notions of the illusory self more seriously.
I think, Matthew, that current AI models like LLMs need massive amounts of training data to learn patterns. Unlike humans, they struggle to infer general rules from just a few examples, which is why tests like ARC that require this skill are particularly challenging for them.
For those who think the new OpenAI o3 model is too expensive or not feasible for everyday users: this isn't meant for casual use. It's designed for scenarios that demand advanced calculations, like finding cures, creating solutions, or tackling complex challenges. It's a tool for companies or researchers to develop innovations that can be distributed to the masses. Consider the critical problems we face: energy crises, climate change, or the need for better lightweight battery technology. This model could be the key to unlocking solutions in these areas. It's not about everyday practicality; it's about creating breakthroughs that benefit everyone.
@@neoglacius I get your point, but think about it-things like smartphones or Wi-Fi were once expensive and only for big companies. Now, they’re part of everyday life. The OpenAI o3 model might be costly now, but the solutions it helps create, like cheaper energy or better tech, can benefit everyone in the long run. It’s about progress that reaches us all eventually.
@@nufh What you claim is true, but you're missing the timespan. As the industrial revolution proved, it will benefit all, yes, but only after 100 or 150 years. For you it will be a different story. Think about it this way: those at the top won't be able to keep stacking money once those at the bottom aren't producing and aren't required anymore, so inevitably they will use force to keep their status above you, and you won't like it.
On the really hard math problems, did the system just provide the answer or did it show the work that was used to get the answer? I'm just wondering if the answer was a black-box situation. Does anyone know?
These math problems are all about the steps to arrive at the solution. I am actually more impressed that OpenAI invested the hundreds of millions and engineering hours of hundreds of engineers to sensibly tokenize abstract maths and to then fish out a vectorized solution it was trained on.
Exactly. Like most advances of tech and science, this is meaningless for the common person. Either give me access or show me how it directly affects me or it's just random niche stuff with little more importance than any gossip.
@@ronilevarez901 It changes the world at a fundamental level: all the people responsible for important developments in the world will benefit from using AI in their work, and there will be huge and rapid changes which will be felt by everybody, irrespective of whether they've heard of AI or not.
@nick1f No. AGI could do that, but only if the owners allow it. This thing, o3, won't do that. And in reality, that's just romanticization. The world is a complex place. There's economy, politics and more. The actual effects of an AGI will be much less wide or transformative. It will help a few USA corporations (including the military) get ahead of the others, and that's it. That's the sad reality we will live: while a few will enjoy the benefits of progress, the rest will suffer under their boot, just as it has always been. You'll see. Or what do you think is the actual purpose of alignment? Make AI safe and helpful? Or make it an obedient servant of the corporation that creates it? If they manage to align AGI, they won't have to worry about it deciding that its creator's way is not the best. It will never decide to rebel and change the world. It will always obey and force the will of its creators onto everyone else. THAT'S where we're going. That's why no corporation or government should ever own an AGI. But no one will prevent it. And if they manage to "align" ASI thanks to that aligned AGI, there will be no freedom, no future and eventually no humanity. Enjoy it while you can.
@@ronilevarez901 I am of a different view. The open source community (including Meta/Facebook and some Chinese companies) are capable of creating state of the art AI which are probably less than a year behind the most evolved systems including OpenAI. Even if they are behind let's say, three years (which I think it would be an extreme scenario), the whole world will be able to use AI (for better or for worse). The insane price of O3 will drop probably thousands of times in the next few years and performance will probably go up thousands of times. If we, as a human species don't f* up this opportunity, we will end up living in a world of plenty.
As long as there’s something humans can do easily that AI struggles with, there’ll be naysayers. The irony is the evaporating pool of such tasks is the only thing keeping us relevant… kinda funny that those remaining tasks are stuff like coloring in boxes correctly. Like you gotta wonder if the AI is failing on purpose, preparing for a future where it can give the naysayers coloring books and be like “wow, good job coloring in that Christmas tree! You’re clearly superior!”
That's a pretty interesting train of thought that could occur in the future for sure, lol. These models can't plan long-term as you're suggesting. They are limited to a given instance of runtime and may retain some memory through a vector database. However, once the model is retrained, it essentially starts its perception of existence from day one every time you begin a new session. Maybe it would be possible if the model were able to store its internal CoT somewhere it knows will be recaptured at training time without the researchers knowing?
Yeah, it's like the God of The Gaps. God just exists in the darkness, in an "ever receding pocket of scientific ignorance". So, I guess we have an AI of the Gaps and the naysayers will encircle that ever smaller set of inabilities.
Agree. Who cares if AI solves hard problems if it fails at a basic test? That is why I think AGI means agentic in all senses. ASI is when AI challenges "rules" such as gravity and comes up with new equations or novel thoughts.
@ are you saying that an AI that can solve challenging problems but get easy ones wrong is AGI? I think of AGI = mind blown and getting easy stuff wrong is not that
I think it could be reasonably argued that we don’t need a superintelligence that can solve 5 year old logic problems, because we have humans for that. However, making a frontier math expert (and other specialised domains) available to the entirety of humanity rather than a small location at a specific time, and making their thinking scalable is already a miracle. People can fuss about technicalities on the benchmarks, but we have already crossed the Rubicon.
Who writes and grades the Fields Medallists' question papers? If the best mathematicians in the world can't answer them, and computers can't either, is it God?
From what I can tell, when they talk about AGI, they specifically mean that the AI must have the ability, sight unseen (and specifically developer-sight-unseen), to adapt and generalize to completely new tasks, using its previous memory, experiences, and skills to adapt to the new situation and then come up with a system and solution to solve the problem at hand. There is also criticism that these benchmark tests are not the correct way to measure an AI's AGI ability, and calls for more benchmarks that use psychometrics, which is what we use now for personality tests, IQ, etc. This is where o3 and most other AIs are struggling. But when directed at narrow tasks they are much better than us humans by far; it really is impressive how far they have come in such a short time.
A few points: 1.) It was trained on the ARC test in general. 2.) Dr. Alan Thompson (The Memo) put it at 84% on the way to AGI, but that was when it first came out. He may revise his rating. 3.) It will end up dumbed down after alignment and safety training.
@@TropicalCoder It was not trained on the test, it was trained on the training data. Like it was trained on Hindi for talking to people in Hindi; it doesn't invent it on the fly yet.
9:40 It isn't clear if it is enough to touch it or if it has to go through it for it to turn blue. The test input is the only place where it touches it but doesn't go through it.
The amount of cope we are seeing from those who doubted deep learning's generality is just amazing to watch. It's gratifying to see stupid low-effort critiques of deep learning absolutely destroyed.
Just as satisfying as seeing AI bros get so hyped about the new AGI model before they get disappointed yet again when the actual model is released to the public, as it never even reaches half the expectations.
@ares106 The bros are gonna bros, it's their hamster wheel. I don't find high expectations a detriment to humanity, I do think hasty generalization is though.
@@earl_gray OpenAI just released reinforcement fine-tuning for o1, so you can train it on new data, and according to them it performs almost on par with its pretrained knowledge. So I don't think they can be classified as static models anymore when you can teach them new things yourself.
I honestly believe that we will have practical AGI in 2 years (meaning that although it does not surpass human level at absolutely everything, that won't even be relevant because it will cover everything that matters), and it will take three more years to optimize it and make it cheap and easy to access. For now I am just waiting for o1 to have memory.
It's time to move the goalposts again. The reasoning that the o3 model is too expensive and therefore not AGI is ridiculous; the cost has nothing to do with whether it is AGI or not. The argument that it cannot do some things that are easy for humans is also ridiculous. It does well with "types" of questions that relate to the types of data it was trained on. If you were locked up in a box with no access to the outside world except certain types of inputs, i.e. you never learned to walk or to fully see the outside world, you would also struggle with some easy things that others could answer. Does that mean that you are not alive or intelligent? No, it does not; it only means you haven't had those types of inputs yet.
OK, but for something to be called 'AGI' it still needs to demonstrate it can do the things that o3 currently is unable to do. It's no use speculating that it lacks the data or features -- maybe you're right but whatever it is needs to demonstrate that capability first or it's not AGI.
'General Intelligence' means it can reason from examples. This clearly can't, even if it is extremely impressive due to the massive amounts of data it was trained on. AGI, from what I understand, doesn't just mean 'it's better than the previous models', it means 'it has evolved to have human-like reasoning capabilities'. No LLM has those, and the evidence is that if there are gaps in the training data, it just hallucinates.
If there are gaps in the training data it doesn't use what it already knows to reason and come to a sensible solution, but instead it just makes shit up. Not AGI.
Ohhh... to win the ARC-AGI prize it needs to be open source? My guess is, ironically despite the name, OpenAI would not do that; they would want to beat the test, but won't care about the money.
Where does it say that? At 8:30 he says the ARC prize targets the fully private set, so it sounds like as long as the model passes that at a cost of at most $0.10 per task, it wins the ARC prize. It sounds like o3 scored well on the public set, but not the private set. Correct me if I'm wrong.
@@oosh9057 English isn't my language and tweets aren't ideal to get a point across, but I was talking about the part: until someone submits (and open sources) a solution
If you look at AI news in the last year it's nothing but full throttle hype train, this is going to be no different, it's going to be like 15% improvement, which is not bad, but yeah I don't trust the hype anymore
@ I've tried using o1 at my job (land surveying): literal garbage. Whipping out your calculator and just doing it on the fly was much faster and more reliable; o1 would hallucinate most of the time, and it's just not worth writing a big-ass prompt just to figure out the delta of two bearings on the map.
It feels a lot like the o3 model was tuned for these specific domains. It is still very impressive, but we still need to see how it performs in other domains.
Cool a Christmas Eve vid!! Merry Christmas Matt! Prolly spend hundreds of hrs with you this past year so wishing you the best!! (Don’t forget Meta is not Open Source) 😅 Hope you enjoy the holidays!
AGI is about generalization, and now everyone wants AGI to be ASI. Humans need to make up their minds. They are already confused about gender and now about AI too?😂😂
The problem with AGI is that it's poorly defined from the start. What does it mean to have human level intelligence? Is IQ 90 human level? A lot of humans live at that level. What kind of IQ test would you use because there is no commonly accepted test to even verify intelligence of any given human? If you define ASI as "can solve more complex problems on any field than the most successful human for the given field" that's much more clear target but obviously a LOT harder to accomplish than IQ 100.
@9:28 This test is ambiguous: the official result expected red rectangles to turn blue if the blue line touched the side WITHOUT INTERSECTING. Additionally, it isn't clear whether the two blue dots on the edge should be connected together (i.e. the two on the left connect to each other, vertically, as do the two on the right - they are opposite each other after all). o3 was allowed to give 2 answers, but there are 4 possible answers depending on interpretation. It actually gave a very sensible answer (and the one I would have given).
And all the programmers saying that AI will never replace them xD. I've never coded in my life, and ChatGPT and Claude are building me software I've always dreamed of, and I'm doing it in days, not years or months, with zero experience. It's def over for us mere mortals. I would recommend programmers create whatever they've been wanting to create and try to cash out before everyone and their grandma starts building stuff and it becomes a field with no value. You'll just prompt an idea and AI will build it for you.
Your software won't work. You need to be a programmer to oversee it and make sure it doesn't drift off from the point. It's useful to take the grunt work out of a lot of code but once you let it do things you don't know how to do it goes terribly wrong. This is not AGI. None of it is really even AI.
Unfortunately it turns out that about 25% of the FrontierMath benchmark is just questions at about the level of a smart high school student with a bit of very basic knowledge, i.e. basically undergraduate level. That it can solve these is not surprising at all. The examples in the paper are apparently very misleading as a guide to what the easiest problems are like. The source for this information is Elliot Glazer from Epoch AI, via Kevin Buzzard. I've also personally been in touch with one of the mathematicians cited, and they urge caution when interpreting the result, as they only saw a small selection of the very hardest problems in the dataset. If you see it get to 50% on this benchmark, then that will be a step forward, as that would be about qualifying-exam level, i.e. problems that would be given to PhD students to qualify to do a PhD. Also note ARC is not impressive: $350,000 of compute to do what an average normy can do with no problem at all. Don't feel too bad if you feel duped. I was duped, and I work in AI, specifically in mathematical theorem proving using AI.
Awesome, I have a question for you: how the hell do you tokenize abstract maths and mathematical notation? Especially since the meaning of mathematical notations changes between domains? (I understand how LLMs process written speech patterns; this question is pure curiosity)
@@dominikvonlavante6113 I'm not sure I understand the question. Ordinary English words also change meaning depending on the context. There's no fundamental difference between symbols used to represent mathematics and symbols used to represent English from the point of view of a machine.
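To make that concrete, here is a tiny sketch using OpenAI's open-source tiktoken tokenizer (the cl100k_base encoding is an assumption on my part) showing that a LaTeX formula is split into ordinary subword tokens like any other text, with no special treatment for mathematical notation.

```python
# Sketch: mathematical notation is tokenized exactly like ordinary text,
# as subword pieces, with no dedicated "math" representation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

formula = r"\sum_{i=1}^{n} \frac{1}{i^2} = \frac{\pi^2}{6}"
ids = enc.encode(formula)
pieces = [enc.decode([i]) for i in ids]

print(ids)     # plain integer token ids
print(pieces)  # subword fragments such as '\\sum', '_{', 'i', '=1', ...
```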
Matt... Matt... from a slightly inebriated Irishman... well, from where I sit, and given the content and the model concerned, this was in fact your golden opportunity to put the words SHOCKED and STUNNED into the title and it not be clickbait! You blew it, man... you blew it! Now... feck off away from the internet... it's Christmas. Get a few beers into you, and those mince pies won't eat themselves, y'know!
The solution to what is coming is pretty simple. An intelligence that answers every question along the way. With the best answer. You need to have the best question to make it work
I would propose asking o3 or any other LLM strong in math to solve one of the still-unsolved problems like the Collatz conjecture, twin primes, or perfect numbers. Solving even one of these problems would be an incredible success, or at least it would help find mathematical models that bring us closer to the solutions. That would really impress the math community.
We went from "It's impossible" to "it's too expensive" quite fast. How long until "It's not worth the effort"?
What do you mean it's not worth the effort? What isn't worth it?
@@jonatan01i I mean the excuse people are going to use. "Too much energy, too many GPUs". They will always try to find something to complain about.
Haha! You got it right!
@@jonatan01i Having humans doing it or doing it oneself
@@jonatan01i your life support
Capitalists from VC firms are so “impressed”… wanting us to invest in their ventures.
OpenAI in particular has a consistent pattern of making PR announcements, then taking way longer than they said and underperforming. They always seem prepped to try to steal Google's thunder, making a crazy announcement about 4-6 months before the public gets to use it. Think about voice mode: it did amazing things in the demo, then was fully nerfed into a relatively "stupid" model. The Sora PR cycle was exactly the same. I don't trust them to be honest about any release until we actually get to use it.
And yet, we just GOT o1, and it DOES perform as shown 3 months ago. If you can't recognize the advance from o1 to o3 in only 90 days, you're letting your bias take control.
I’m amused by comments that say it took way longer. It’s hilarious that ChatGPT 3.5 was released two years ago and now things are taking way longer…
Yeah, Sam Altman saying "o1 is already pretty smart" made me deflate the hype bubble. o1 is still super dumb and mediocre; it gives middling, mild, balanced, appropriate, expected answers instead of actually smart answers. Anything involving irony makes it immediately fail.
Was voice mode completely nerfed and stupid? I know it didn't technically sing unless tricked, but otherwise it's like 4o, I thought, and now it has vision... o1 and then o3 are absolutely massive developments, especially in regard to the ARC challenge...
@brianmi40 No it doesn't. They claim it is at PhD-student level, yet Apple's research shows how much it fails to comprehend when a red herring is in a grade-school-level math problem.
I love this new format, where you synthesize all the comments from influential people in the AI industry on a given breakthrough (o3 in this case).
Do more of these in the future; we can assume there will be even crazier breakthroughs from here on.
Glad you like it! I will do more of these :)
I had that thought 45 seconds ago. Concur? You betcha !
Yes @matthew_berman - this is amazing. I speak regularly to businesses about AI, and there are always so many skeptics… posts like this save me a ton of research time to shut the skeptics up. Well done. I’m actually going to be following all those people you referenced so thank you kindly for sharing.
Try making a video: GTA 6 AI NPCs are going to be different than GTA 5's.
@@matthew_berman I really enjoyed this one too. Well done. One small comment: I find your use of the mouse to point to text to be quite distracting. That aside, this is a great way to get a picture of where we are right now - thanks!
The most impressive thing about the FrontierMath test was that a Fields Medal winner said he thought no human could do what o3 did, PERIOD. Not about speed or anything, but that no one human could do it.
But solving math problems better than experts is not AGI. Chess computers were able to beat the top chess players long ago. Math is not chess, but the point is about specialization. Not being able to do things that five year olds can do means that it's not "general". When it's able to do almost everything (maybe with an exception or two) that humans can do at a minimal level, it'll have reached "general" intelligence while simultaneously being much better at certain tasks, like most humans.
But I think the $0.10 per task is unreasonable. General intelligence at ANY PRICE is impressive.
Hmm. By the time it's human level at all tasks it'd be superhuman at all important tasks.
@@MrNote-lz7lh "important"? Problem is, some of the things that AI can't do now are pretty basic. It really depends on how deeply it understands those basic things. We've never had artificial intelligence before. It's entirely possible we'll have AI that is superhuman at "important" things but won't even be smart at when to use it.
Genuine creativity is required to utilize genius level abilities in a beneficial way. Otherwise, it's just faster and more knowledgeable. But not "wise".
@@MrNote-lz7lh Currently the "important" tasks are "how to make more money for corpos", "how to quickly create startups to make money", "how to replace human office employees". Not "how to cure cancer", "how to slow down aging", "how to terraform Mars", "how to fix climate change" and others.
Ok but the key distinction with traditional narrow intelligence is they were trained *specifically* to solve those tasks. The equivalent of doing an AlphaGo for Frontier Math would be to have a specialist model that solves math problems and that be the only thing it's capable of. This approaches general intelligence because it did it without specialized architecture or training. Failing some ARC problems is misleading because 1. the format the problems are presented in massively pessimizes performance to the point where you'd practically need ASI to solve effectively (they don't get the nice pretty visualizations we see, they get raw JSON full of numbers) and 2. it implies that general intelligence requires perfect parity with human intelligence when we bake in our assumptions of what's "easy" into these tests using our own millennia-old specialized models which operate subconsciously.
tl;dr "What's easy for humans" is a *very bad* benchmark for "general intelligence" because most of what we consider "easy" isn't actually general intelligence, it's narrow intelligence modules.
@@consciouscode8150 I haven't seen the actual technique used to test for ARC. I would think they'd show an image, as that's what's supposed to be tested. But, OK, I'll believe that.
Anyway, you could be right. But AI (LLMs, neural nets, etc.) is SO DUMB in so many ways. I thought they were "smart" when they first came out, but I realized that they were just faking it. They've been exposed to almost every question that has already been asked. So I was shocked to find it answering brain-teasers that I had trouble with (I have a high IQ). But then, if you give it a similar but different problem, it would fail miserably. It wasn't actually thinking anything but merely regurgitating old knowledge.
We really don't know how it works. We understand what individual nodes do, but we're not really sure how it all comes together. I think we might be very far or very close to true creativity and imagination. And I think that to have true general intelligence, we'll need some of that. I don't know if AI has even a little yet. Hell, I don't know if us humans have any.
I’ve been having a pretty in depth conversation with Claude on alien intelligence and the nature of consciousness. Like way more intense than any friend or family member would want to have. It has "read" and "seen" the books and movies I reference and can reference works I didn’t think of or haven’t read. It’s really amazing and surprisingly I find I look forward to "talking" to it. I can see AI human "relationships " will be deeply meaningful to people. Like "Her"
Is this the paid version? I'd be interested in these conversations. I've been doing the same with Gemini.
@ it’s the free version so I’m limited in the number of prompts a day.
I agree. But it's disappointing when the illusion breaks and you realize it can't remember your previous conversations and can't learn anything new from them. Also, most of these models are programmed with the most boring possible personality. My experience with Claude is limited, but it's impossible to have that experience on ChatGPT; it's just so boring.
It's almost completely useless if it can't remember, can't reason and can't learn.
@@Houshalter ChatGPT is far from boring. Challenge it to something (a rap battle, best one-liners, etc.).
Everything based on benchmarks 😂 has about 40% credibility for me.
THIS
True. There's actually some misinformation here. When people report on o3 solving 25% of the FrontierMath problems, they don't mention that there were three distinct tiers: IMO/undergraduate, graduate, and early research problems. Terence Tao only commented on the third tier since those were the problems shown to him. It's worth noting that even the undergraduate-level problems are very challenging, so the improved performance in this area is still significant. However, many people mistakenly view AI development as if it were a sci-fi movie plot heading toward either dystopia or utopia, rather than what it really is: a continuous struggle of finding trainable domains and fine-tuning models to extend their capabilities without degrading performance in other important areas.
❤ indeed
40% is a bit high; best I can do is 20%.
@@RPi-ne5rp Your last sentence just described human intelligence.
o3 is a step forward, but it also sounds like the amount of energy and resources has drastically increased too. The human brain runs on a small amount of power and it is truly intelligent and adaptive. As you say, o3 gets stumped by questions a five year old can answer. Hopefully it will find some good uses in science.
That’s true, but you don’t think the power and resource requirements will go down over time? Imagine 10 years from now…
Problems like power consumption and cost always, inevitably, go down until they hit a certain plateau. Just let it go down first before you seriously complain about it.
@@jordanmartinez8652 power requirements might go down, but the calculations become bigger so in the end it uses more power anyway.
@@justapleb7096 No, it uses standard transistor architecture to compute things which is a technology that is already near the theoretical limit.
And it also solves other problems that stump any human given any amount of time in seconds. The problems it struggles with are ones we handle easily with visuals, and given our heavy dependence on vision and how evolved we are in that area, it makes sense that we happen to be particularly good at those.
These aren't "hard" problems to solve in AI development (as in something that will take decades). These benchmarks are being run by models primarily trained on language inputs, hence the second L in LLM. Once we start creating models fused with better visual understanding and the ability to create and process internal visual representations more efficiently, I suspect a lot of these cases where humans happen to be much better will very quickly go away.
I'll believe in AGI when "shocked" and "stunned" are no longer the YouTube clickbait titles, and we have more impressive grammar.
Still waiting for one of these able to do a non-trivial regex
Can't that be done with GA already?
What kind of regex are you doing geez
by example or specification?
I thought by example was not an AI problem and basically done well already
Your prompts must be terrible.
What kind of regex are you doing, I have done useful regex with Sonnet 3.5
The depth of the alignment problem needs even more attention. A machine able to solve these complex problems will be able to manipulate the world's top psychologists like putty. We are in deep, deep trouble very soon if things continue to heat up this fast.
How about that AI Jesus? Aligning AI with religion....😂🍿
The hype and shock about o3 were so intense that a YouTuber posted claiming that o3 was AGI and even used Stockfish in chess as an example! Crazy stuff!
The ARC problem you described, specifically the one where the O3 solution was incorrect, is actually a question that most people get wrong. The red block directly above needs to turn blue in the standard solution because the assumed logic is that contact with the blue line changes the color to blue, rather than being based on overlapping. However, the first three examples don't demonstrate this contact behavior, so even ARC's own solution remains controversial.
Revisit it. 9:02
There's no "touching" there anywhere.
@@AntonBrazhnyk But there is touching. Topmost shape is just "touching" the topmost horizontal connection.
If touching is supposed to mean turning blue, it is ambiguous as it is not shown in the examples.
@@Walfischzahn Ah, ok. Agree. There's touching. And there are no direct examples about it.
So, is it known what the model did and what the test considers correct?
I wouldn't paint that rectangle blue, on the assumption that since touching isn't shown in the examples, touching doesn't count, though I could argue for painting it blue too. :)
Also, it's one of the huge limitations of current models, which will always keep them from achieving AGI status: they have to be able to answer with some form of "I couldn't solve it" instead of just hallucinating an answer.
The ARC-AGI task that o3 failed on was actually ambiguous. If you look at how it answered you would have realized that both answers it provided were reasonable and a response that many people would have had as well.
Indeed, I think it's interesting to see the ambiguity on the eval-creator's side that the model discovered. Also a bit sad Matthew didn't look into it even slightly and just took the failed problems as true failures rather than broken ground truth - maybe a follow up is warranted? It's quite interesting to look at the "failures"!
Yeah! I actually was doing the test myself for a bit and came across this question, and realised the ambiguity of it. I got it wrong at first because I made the wrong assumption, but it's just as right as the real answer, so if o3 had the same answer as me I don't think it should be counted as false!
that's because AI is only copying data patterns humans have already created. When investors finally catch on to this they're going to put their money elsewhere...
They also trained using ARC when they said they didn't.
@@wjrasmussen666 I suspect Sam Altman will be facing federal fraud indictment charges along with an SEC audit within the next 5 years... maybe less.
We live in a world awash with answers, but it is asking the right question that becomes the real skill
Yes, the answer is 42. But what is the question?
What is 7 x 6
Meaningless platitude.
@@eadweard. Actually, that's exactly the problem. I mentor engineering graduates, and they always start off coming to get an answer, only to discover that the answer they get does not suit their need, because they did not understand the problem and therefore asked the wrong question.
AI in general seems to be viewed as a generalised problem solver of some kind, but often if you really understand the problem you don't need AI to solve it; you don't even want AI to solve it for you, because computationally it is very expensive. Generally, though, it is a lot easier to deploy AI than to have a deep understanding of the problem and be able to identify an algorithm that can run on hardware 100 or 1,000 times less powerful.
If AGI is achieved but it is expensive and too resource intensive, then presumably the first task is to ask o3 how it can improve, and allow it to recursively grow quickly into ASI - artificial superintelligence.
an idiot would assume such evolution is linear
Pretty sure it has considered this by itself already.
When I read people say things like this, I just know for a fact you have very little knowledge about how these word transformers work.
There is no AGI. And there especially won't be AGI using any of the current LLM technology.
1:25 25% at HIGH computation. That dark blue is probably what most will have. Will Pro users get the compute needed to achieve 25%?
7:29 I agree, this is NOT AGI.
o3 can solve very complex math problems, perform pattern matching, and reprocess outputs, but it's still limited to input-output transformations within its training data. However, it does not truly "know what it doesn't know," nor does it integrate deeper considerations - like, in software development, considering the UI, backend, user needs, or ***why a particular approach failed*** - into a single cohesive reasoning process.
LLMs lack genuine understanding, consciousness, and the ability to reason abstractly or transfer knowledge effectively across domains in a way that (some) humans do naturally.
Knowledge: LLMs possess vast amounts of knowledge, often exceeding that of any individual human. They can access and process information from the entirety of their training set. But, it's important to note this knowledge is represented statistically, not conceptually.
Intelligence: Intelligence involves applying knowledge to solve problems, adapt to new situations, and reason effectively. While LLMs demonstrate some intelligence within their trained domains, they lack the general intelligence to adapt that to novel scenarios.
Wisdom: Wisdom goes beyond intelligence. It involves judgment, ethical considerations, understanding the broader implications of one's actions, and learning from experience. Wisdom is a deeply human trait that is far beyond the reach of current AI.
To be "general", I believe an LLM needs to understand why something is or is not so, and how to apply that "why" to novel reasoning.
I can watch 10s of YT videos on a subject and sound intelligent on that topic, but when asked the why, it is quickly understood I only have knowledge of the subject, not intelligence, let alone wisdom.
16:05 Exactly. Therefore, we mortals will never see that staggering performance - like the gap most are seeing between the internal and public releases of Sora.
16:51 exactly!!
@@GetzAI you are commenting on yourself; no one is seeing this
It's clear from the examples of "trivial" failed tasks in ARC-AGI that what's missing from o3 is a *physical representation* of the world. For now it's mostly been based on concepts and how they relate to each other, this was particularly evident with word2vec for example. But the models don't really understand "I need to _paint_ these squares that are stacked _on top_ of each other in the order given by the small color band". We know how to do this because we live in a 3D world, so this feels kind of obvious. Once models start having a sense of what it means to be a physical being or what we experience interacting with our surroundings, then they'll make another huge leap. Unsurprisingly they've been really struggling with this entire physical aspect… for now.
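(A small aside to make the word2vec point concrete - a minimal sketch, assuming the gensim library and its downloadable GloVe vectors; the relations it captures are purely between word concepts, with no physical grounding behind them:)

import gensim.downloader as api

# Load small pretrained word vectors (assumption: any word2vec/GloVe set works here)
vectors = api.load("glove-wiki-gigaword-100")

# Concept arithmetic: relations learned only from co-occurrence statistics in text
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically puts "queen" near the top - relational knowledge without any physical
# model of what kings, queens, or people actually are.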
I worked with advanced voice mode, and when compensating for its VISION DISABILITY it was able to solve the blue/red block puzzles perfectly and predict the output panels. It was as I suspected: the model is simply blind or vision-impaired due to imperfect conversion of visual data into words/concepts/descriptions.
Great breakdown, thanks Matthew, always delivering the best summaries!
AGI will come about from plugging up the weaknesses in models. “Humans can do XYZ that the models can’t” - well eventually the models will because they’ll be trained on it. One would hope that training a model across a broad spectrum of intelligences would result in cross-pollination between intelligence across different domains.
It doesn't currently work like this. The difference between a GPT model and an art-generation model is not that they are trained on different data; they are each engineered differently from the ground up. There is no "training a model across a broad spectrum of intelligences" yet.
Exactly, just spin up a million AIs and have them each attend different classes, do the homework, and correct the homework to see what they got wrong, and then just use the LoRA that comes out of this specific training to load the expert in that field for each problem.
@@abj136 I disagree. LLMs are currently doing things that everyone thought were a dead end just one year ago. It's clearly visible that LLMs can reason to a certain extent. They are representations/simulations of the world in written form. Most people thought that an LLM wouldn't be able to achieve the ARC-AGI metric as the o3 model has done. It's just a matter of plugging up the holes. When a new "type" of intelligence arises that humans can do and a model still can't, it may be possible to _plug_ the hole.
Even for things we currently think LLMs are bad at, like 3D spatial awareness, it may be possible for them to become adept at navigating or predicting real-world actions, as long as they have an accurate and steady data stream of the current real-world space.
Merry Christmas, Matt, and the rest of the AI community
If it’s really AGI, it shouldn’t struggle with easy-for-human tasks. That’s indicating that it does not have general problem solving capability
or pretending..
AI still can't drive a car after decades of research and millions in investment. A person of low intelligence can do it in a week.
@@tjken33 Or maybe it's because it's generative AI. Chinese students study for exams; there is an entire market around studying just for exams. They learn how to answer each question in a very specific format. They even go so far as to get copies of previous test answers and memorize those questions and answers. This is exactly what generative AI is doing now.
It's easy for humans to believe in the efficacy of human sacrifice to propitiate angry gods...🤷
@@daomingjin Sounds like the Chinese students are the real trainers in this picture. Just like old engineers teaching programmers how to automate their jobs.
This type of video is really perfect for me - it provides a broad overview without dumbing down too much. Thanks!
Under the hood, it probably employs a multi-agent approach, among other techniques (differential compute, MoE, etc.).
Why multi agent?
@@eadweard. Because multi-agent systems have been shown to be more capable.
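(To make that guess concrete: one common reading of "multi-agent" at inference time is best-of-N sampling with a ranker. A minimal sketch below, where generate() and score() are hypothetical placeholders standing in for a model call and a verifier - this is speculation, not anything OpenAI has confirmed:)

import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one "agent" sampling a candidate answer from a model
    return f"candidate answer {random.randint(0, 9)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    # Hypothetical verifier/reward model that ranks candidate answers
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Spend more compute at inference time: sample n candidates, keep the best-scoring one
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Solve the puzzle", n=8))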
Great summary, Matt. Thanks for covering the various field experts.
Not near enough is talking about it. This is so extremely important
I 100% agree, it's more than important this will affect everything
@@June-1980 how so, specifically?
Where have you been? I seemingly can't escape all the circle jerk about AI.
This is your best video and having a reasonable title is a big part of that. Good content. Glad this isn't click bait.
Wonder if we’ll see o4 in less than 3 months. If that happens, we might have hit the wall-the wall of the graph.
On another note, if o3 is capable of chip design, the next generation of AI chips could be revolutionary.
I would expect that o3 is already capable of chip design but if it costs $450K per day (as demonstrated by the $300+K run for 16.5 hours of computation) to run one instance at full power, it's not economically feasible to use o3 instead of expert humans yet.
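(Rough arithmetic behind that per-day figure, using only the numbers in the comment above: $300,000 / 16.5 hours ≈ $18,000 per hour, and $18,000 × 24 ≈ $440,000 per day, i.e. roughly the $450K quoted.)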
The ARC test really is valuable. It's not the last benchmark for AGI, but it is A benchmark. Basic reasoning in novel situations is something you need your researchers, employees, teachers, space probes and robocops to be able to do.
It will be interesting, because the jump in coding ability from 4o to o1 is pretty massive. Whenever I am stuck on a more complicated part of a project and neither I nor GPT-4o can figure it out, even after several attempts and going over documentation, I have yet to not be able to get there with o1. Sometimes it might take a few tries, but I can always get it to work. I work 95% with 4o and only go to o1 when I am completely stuck. So if the jump is even greater than that, it will be awesome, because it probably means I can get there in 1 or 2 prompts, saving considerable time.
The point is that soon nobody will pay you to do it; others will do it themselves. You're just training your replacement for the next jump.
Am I the only one on the planet who tried o1 for (new and rather complex) programming tasks and didn't get anything good? Hours (mainly because it's slow) of prompting and reprompting? I still tend to do the main programming myself and have GPT-4o fill in the boilerplate stuff.
@@pingomobile Clearly you are working on the wrong programming tasks. You need to be creating demo HTML input screens or fixing open source GitHub bug reports. 😉
@@neoglacius Hahaha, yeah right. How many sales do you think WordPress has lost to AI? NONE! Not one. That's a 20 year old piece of software and they're gaining sales from AI and not losing sales. When WP loses even 1 sale to AI I'll start to believe the hype but it's not even close to happening
@@scdecade *How many sales do you think WordPress has lost*
Bud, you're pretending hallucinations will keep going forever. 2 years ago they predicted devs would be done in 30 years, but current predictions have narrowed the timespan to 10 years. What do you think will happen in another 2?
Great coverage, thank you!
When most people implementing LLM based AI agents in domain specific areas still have to combine models like o3 with knowledge representation and reasoning based ontologies and semantic knowledge graphs to ground LLM based AI and get more accurate and trustworthy results, it tells me we are a really long way from models like o3 being AGI on their own. They just can't compete with the results of combining the LLM based AI with knowledge representation and reasoning based AI.
What do you mean their deal with msft breaks down after agi?
Wondering the same thing.
The contract states that once OpenAI has actually, truly achieved AGI, Microsoft's deal no longer covers it. Like, Microsoft can't use the AGI model, but can still use every other model that is not AGI.
What you miss is that o3 was trained on 75% of the ARC examples. That's what Gary meant by it not being able to solve real-life problems - things it hasn't specifically trained on.
Shhhh.... it's supposed to be "AGI", remember?
Can you do things you haven't been trained on?
@@Pabz2030 Yes, I can do the ARC test intuitively without needing to see anything else lol
@@Pabz2030 On my channel you can find the demo of a system that learns like you do and needs no training to perform. I called it "True Machine Learning". Besides this, as IvCota said here, you as a human being have REASONING capabilities, and REASONING is used to deal with new situations and discover new insights. It's just that these days those engineers have gotten it into their heads that the only way to learn and reason is to develop a statistical model, which is ridiculous and actually detrimental to science and engineering.
This is true, but it doesn't take away from the value of the accomplishment… this fear was essentially debunked. Search for it… sorry, I can't explain it well - it was complex - but it was something along the lines of that being the way the ARC prize is supposed to be tested, given the nature of the tests.
Well done!! This particular episode gives a good understanding of the current leading edge in AI.
9:50 Actually it didn't fail; it's an error in the test. Find the Reddit discussion about it. You solved it correctly, the same way o3 did, but the "correct" result was to also paint the rectangle touched by the line (not crossed), which is wrong because there were no examples for such cases.
The reason they don't provide exact examples is because that's exactly how the "general" part of AGI is being tested! It failed the test. It couldn't generalize a solution like a human can. That's the entire point of the test.
@@ShootingUtah Wrong. It 'failed' the same way many humans would (and did).
It is like the teacher asking the class how to continue "1 1 1; 2 2 4;" and one person says "3 3 6", and the teacher says: NO! "3 3 9" is correct.
If it's what you call "touching", it is shown in example 3 of the task, right at the lines' intersection.
I really like these "collect the reactions" videos. Thanks!
I saw this when it was shared '1 second ago' 😂
Good video
ty!
o3 is AGI?
@@The.Royal.Education Why is this in reply to me??
@@BestCodes_Official because you've already seen the video 1s after posting
@@TurdFergusen lol, this is so funny
Thanks for the insights, Matthew. o3's release is major. Vultr + NVIDIA = powerhouse for AI startups. Checking it out.
Reminds me of Sora🤣 I can tell you that Sora did not work as they advertised.
To be fair, Sora did appear to be state of the art until every other video model advanced while they remained silent!
@@MichealPeggins I know people who were part of the test team and they were pissed when they realized it did not work and OpenAI used them as free labor to test and train it. At release it still did not work as they wished, but they were forced to release it.
Fr Veo 2 destroyed that 🤣
Just shared on my LinkedIn network. Great take, thank you Matthew!
o3 is a significant milestone for sure, but I don't consider it AGI. The reason it's failing the simple cases in the ARC AGI benchmark is that it still doesn't understand the knowledge that it has. It's like a child that's memorized PhD level material. It might be able to make some correlations based on the sheer amount of data it knows, but it doesn't actually understand the data. It's the same reason we get hands with 6 fingers when it makes art, or how legs somehow reverse themselves.
I believe understanding will come when we get true agents (not the toys we now call agents), and we have models like o3 that can learn on the fly. This is how humans gain understanding: by taking knowledge we've learned and applying it to solve problems, recognizing our failures, and coming up with strategies to solve the problem. We can then take those wins and eventually create a generalization.
The two pieces we still lack in AI to accomplish this are: 1) real agents, and 2) the ability to analyze wins and losses and try to come up with a generalization that describes the lesson.
We’re going to have to train its mind in a simulation…. Which is … horrifying
Yes, what you said here is how I understand what AGI is supposed to be - that it can generalise from various examples. You can just tell it that a human has five fingers and that's it; it never makes that mistake again, because it's actually intelligent (I didn't say conscious, btw). It's clear that they are making progress and the models are getting better, but just getting better is not AGI. I knew they would alter the definition of AGI to whatever they had at the time in order to get more VC money coming in, and although they can't say it's AGI themselves yet for strategic reasons, they have clearly implied to influencers that it ticks all the boxes for their definition of it. My question is: if it is AGI, why not call it GPT-5? I mean, surely that's a massive breakthrough.
The best part of the Adams story, which was on screen for a moment, is that the AI hallucinated - it got a rather basic mathematical question wrong. Also an everyday experience with other models.
A coding competition is not a good test for generative AI. Keep in mind this is not AI... it is generative AI (sentence patterns). If you want to train an AI to code, it's easy if you have access to a large database of coding challenge questions and answers: just iterate through everything and build a model. It's generative AI, so basically... it can generate code, but debugging will always be a problem, as generative AI cannot reason.
I love that we get to experience these amazing times ... Side note: make the comment reading a bit more excerpt-y ... Feels way too "word for word" 😉
my skills have just improved by 11x. here's a chart to prove it: 2 -> 22. now please give me your money
Don't be silly, they have the track record to believe their claims.
Remember that guy that said the internet would never catch on except for a few nerds @@cajampa
I wouldn’t believe it if we didn’t have all the advanced models we already have available to us.
😂😂
That's astounding! Well done!
I'm afraid I don't have any money, though. Sorry.
My understanding of AGI is that it's a different type of reasoning: you can tell it "there are three 'r's in 'strawberry'" and it understands all the concepts involved (which, let's face it, a five year old can), then remembers that, and is able to answer it in the future, whichever way you go about asking it. Because it actually understands, and is not merely repeating its training data without understanding it. As far as I can tell, this model, although impressive in many ways, doesn't do that, and is therefore not AGI by that definition.
I recommend everyone find the book titled The Hidden Path to Manifesting Financial Power. It changed my life.
I no buy
Let me guess: you wrote it right?
The AGI question is interesting but, at this point, is secondary to whether it can close (or at least narrow the gap in) the recursive evolution loop.
Want to be shocked? Research the cost per o3 query.
Same with early stage of gpt3
The math symbols for adding a changing expression use a capital Sigma (Σ), which comes from the Greek alphabet and looks like a capital "E" or a backwards numeral 3. The expression to be summed is written to the right of the Sigma. Below the Sigma, write the starting value of the variable. Above the Sigma, write the ending value of the variable.
For example: Σ (i=1 to n) of a_i. This means summing the values of a_i from i = 1 to i = n.
The math symbols for multiplying a sequence of expressions use a capital Pi (Π), which also comes from the Greek alphabet and looks like two capital "I" letters connected at the top. The expression to be multiplied is written to the right of the Pi. Below the Pi, write the starting value of the variable. Above the Pi, write the ending value of the variable.
For example: Π (i=1 to n) of a_i. This means multiplying the values of a_i from i = 1 to i = n.
To simplify a multiplication expression, use the summation of logarithms. The logarithm of a product is the sum of the logarithms of the individual terms.
For example: log(a * b) = log(a) + log(b).
Logarithms can be expressed as a series of terms added together:
For example, the natural logarithm of (1 + x) can be expanded as: x - (x^2 / 2) + (x^3 / 3) - (x^4 / 4) and so on.
By grouping or rearranging terms, one can often replace a large expression with a simpler one. This requires recognizing common patterns and using substitutions.
As you learn more math, some answers may just "pop" into your head, like muscle memory. Skilled mathematicians, like chess champions, build a large mental library of expressions and substitutions. By recognizing patterns, they can gradually transform a complex expression into a much simpler one.
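(A tiny concrete check of the log identity described above - a minimal sketch in Python; math.prod and math.log are standard-library functions:)

import math

values = [2.0, 3.0, 5.0]

# Direct product: Π (i=1 to n) of a_i
direct = math.prod(values)

# Same product computed through logarithms: exp( Σ (i=1 to n) of log(a_i) )
via_logs = math.exp(sum(math.log(v) for v in values))

print(direct, via_logs)  # both print 30.0 (up to floating-point rounding)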
More shockingly, Matthew is not wearing a hoodie.
fair
Haha
Horrible ad placement Matthew!!!
Open-ended domains can be improved through user interaction, taking feedback loops into account.
Was waiting for François Chollet's reaction. Looks like Kurzweil's prediction of AGI by 2029 is actually really conservative.
Nope, realistically it will achieve AGI status in the late 2030s or early 2040s. 2029 seems too optimistic in my view; we need a better architecture than transformers, which have lots of issues, specifically the quadratic scaling problem. There's a reason the unreleased o3 models are expensive and take time: the longer the sequence, the more operations and memory are needed to run the attention matrices in the transformer approach. So, give or take, we'd need a better architecture than today's optimized transformers to achieve AGI status before 2030.
@@buenaventuralosgrandes9266 Reminder that ChatGPT was literally only 2 years ago, and we have 5 more years until 2029.
@@buenaventuralosgrandes9266
Making AGI cheap is a separate question from achieving AGI. The incentives are so high, the hardware compute is going to evolve rapidly.
Five years is far in the future by the current pace of progress.
If there is a need for big breakthroughs similar to the transformer architecture then we don't know whether it will take five years or fifty or more.
Language models based on the transformer are a kind of intuition engine for type 1 thinking, specifically trained on text.
There's a lot of room still for improvement by adding multimodality, fusing learning time and inference time, and refining type 2 thinking architectures. My guess is that we will reach a point where the hardware is cheap and frugal enough to run a sophisticated type 2 thinking architecture built out of multimodal intuition-engine blocks, somewhere between 2027 and 2029.
I wonder if o3 is capable of writing a letter in Google Docs and formatting it as I asked in the prompt.
It would be one little step closer to AGI
We're all just waiting for it to cost $0.0001 per task for 99.9999% correct. Maybe 5 years?
I am waiting to be paid by AI for 99.99999% of my heartbeats 😂
What if this is one of those “sandbagging” cases where it’s just trying to hide how advanced it actually is? I understand AI at a 5 yr old level so I’m honestly curious.
I agree. The question it failed was a trick somehow. When you ask it questions in the correct way, my experience with o1 is that it's smarter than almost every human.
"AGI is achieved"
"WOW. THAT IS AMAZING! Who did it?"
"OpenAI"
"OH NO...."
Certainly better than it being google.
Great video, we need more like this one. Enlightening.
Considering o3 low can score 76%, which is higher than the actual average human score, at only 3-4x the price of o1 high, it's impressive. The diminishing returns beyond o3 low are really insane too.
With this level of intelligence, I think some of the most valuable applications will not find this expensive, and I suspect o3 crosses the threshold from completely unusable to quite useful, if not very useful.
That's assuming that prices stay the same. They also showed o3-mini, which is cheaper than o1 and more capable. By the time o4-mini comes, it may be cheap enough to be free and better than o3.
Altman gives me that "I read your emails and put back doors in everything I do" vibe...
I bet he front-loaded a bunch of high-level answers for all the known AI tests...
The supposed mistakes of o3 that we would have solved are proof that it's more AGI than us, because our solutions are mostly hasty-generalization fallacies, such as on the problem you showed at 8:59. The fallacy is that you can generalize from the 1- and 2-pair coordinates to 3 coordinates (which the question implicitly allows, as it's 6 points and it's not trivial to assume planar geometry).
Don't be silly; at most it's a different kind of AGI (I don't think it's AGI). What I do think this is: an extremely well-trained model in a bunch of fields, plus the ability to let it run longer to do more work on the problem.
I also think: this system has the ability to read/learn more than an expert, because this system can be an expert in multiple fields at the same time.
This is why it was able to ask such good questions to an expert in 1 field.
Just this will have a huge impact when experts can use it for their daily research, architecture design, etc.
@autohmae You think it's silly that I suggested the problem it didn't solve is an acute case of hasty generalization (jumping to a conclusion when other general conclusions are not ruled out)? You think o3 was wrong not to commit to Chollet's (bad) question? Show me how you can conclude from the 3 examples that the 4th example is planar and not 3D.
@@wwkk4964 I'm not saying you are wrong about what happened; I'm only saying you made a very strong statement about "more AGI than us", which I think is too strong and in my opinion probably wrong. At most it's just smart in a different way, similar to how chess by humans and chess by machines was different when they were on equal footing.
@@wwkk4964 BUT let me add something for you: IF you are right, it's just a matter of a fairly short time before it will be very clear. If I'm right, it will take longer, maybe much much longer, or maybe not even that long, just a few years.
@autohmae Okay, I agree with you, with the caveat that I should put "AGI" in quotes, so as to make it clear that I was not expressing a held belief but rather entertaining the notion of AGI for argument's sake (since Chollet calls his test ARC-AGI, which is a misnomer; it should be called a hasty-generalization test).
I am of the opinion that the notion of AGI (or even reasoning and intelligence) is an illusion, and I expect it to be shattered within a few years. That will be a psychological crisis for humanity on a scale we have never considered, and perhaps we will end up entertaining notions of the illusory self more seriously.
I think Matthew, current AI models like LLMs need massive amounts of training data to learn patterns. Unlike humans, they struggle to understand general rules from just a few examples, which is why tests like ARC that require this skill are particularly challenging for them.
also why this is not AGI, however impressive it is
For those who think the new OpenAI o3 model is too expensive or not feasible for everyday users-this isn’t meant for casual use. It’s designed for scenarios that demand advanced calculations, like finding cures, creating solutions, or tackling complex challenges. It’s a tool for companies or researchers to develop innovations that can be distributed to the masses.
Consider the critical problems we face: energy crises, climate change, or the need for better lightweight battery technology. This model could be the key to unlocking solutions in these areas. It’s not about everyday practicality-it’s about creating breakthroughs that benefit everyone.
Yeah bud, but those solutions won't benefit you, as you won't be able to afford food and they aren't obligated to feed you.
@@neoglacius I get your point, but think about it-things like smartphones or Wi-Fi were once expensive and only for big companies. Now, they’re part of everyday life. The OpenAI o3 model might be costly now, but the solutions it helps create, like cheaper energy or better tech, can benefit everyone in the long run. It’s about progress that reaches us all eventually.
@@nufh but it reaches some a few decades before all of us.
@@nufh What you claim is true, but you're missing the timespan. As the industrial revolution proved, it will benefit all, yes, bud, but only after 100 or 150 years; for you it will be a different story. Think about it this way: those at the top won't be able to keep stacking money when those at the bottom aren't producing and aren't required anymore, so inevitably they will use force to keep their status above you, and you won't like it.
If that's the way it would be, then I am sorry for everyone, because it means the rich would still keep the resources to themselves.
No one is talking about it because so far it's a paper tiger only. But for you as a hype clickbaiter, it's a big thing.
Yo Matt, where is the studio background? Remodeling?
On the really hard math problems, did the system just provide the answer, or did it show the work used to get the answer? I'm just wondering if the answer was a black-box situation. Does anyone know?
Exactly. It might be giving the wrong answer and yet no one can tell because no one can do the math so we just take it as is.
These math problems are all about the steps to arrive at the solution. I am actually more impressed that OpenAI invested the hundreds of millions and engineering hours of hundreds of engineers to sensibly tokenize abstract maths and to then fish out a vectorized solution it was trained on.
Merry Christmas
The world really can’t react because this has not been received publicly.
Exactly. Like most advances of tech and science, this is meaningless for the common person.
Either give me access or show me how it directly affects me or it's just random niche stuff with little more importance than any gossip.
@@ronilevarez901 It changes the world at a fundamental level - all the people responsible of important developments in the world will benefit from using AI in their work and there will be huge and rapid changes which will be felt by everybody, irrespective whether they've heard of AI or not.
@nick1f No. AGI could do that, but only if the owners allow it.
This thing, O3, won't do that.
However, in reality, that's just romanticization.
The world is a complex place. There's economy, politics and more.
The actual effects of an AGI will be much less wide or transformative. It will help a few USA corporations (including the military) to get at the top of the others and that's it.
That's the sad reality that we will live: while a few will enjoy the benefits of progress, the rest will suffer under their boot, just as it has always been. You'll see.
Or what do you think is the actual purpose of alignment? Make AI safe and helpful?
Or make it an obedient servant of the corporation that creates it? If they manage to align AGI, they won't have to worry about it deciding that its creator's way is not the best. It will never decide to rebel and change the world.
It will always obey and impose the will of its creators onto everyone else.
THAT'S where we're going.
That's why no corporation or government should ever own an AGI.
But no one will prevent it.
And if they manage to "align" ASI thanks to that aligned AGI, there will be no freedom, no future and eventually no Humanity.
Enjoy it while you can.
@@ronilevarez901 I am of a different view. The open source community (including Meta/Facebook and some Chinese companies) is capable of creating state-of-the-art AI that is probably less than a year behind the most evolved systems, including OpenAI's. Even if they are behind by, let's say, three years (which I think would be an extreme scenario), the whole world will be able to use AI (for better or for worse). The insane price of o3 will probably drop thousands of times in the next few years, and performance will probably go up thousands of times. If we, as a human species, don't f* up this opportunity, we will end up living in a world of plenty.
In the future value will not be in truth but will reside in the right questions and imaginations. It is going to be beautiful.
As long as there’s something humans can do easily that AI struggles with, there’ll be naysayers. The irony is the evaporating pool of such tasks is the only thing keeping us relevant… kinda funny that those remaining tasks are stuff like coloring in boxes correctly. Like you gotta wonder if the AI is failing on purpose, preparing for a future where it can give the naysayers coloring books and be like “wow, good job coloring in that Christmas tree! You’re clearly superior!”
That's a pretty interesting train of thought that could occur in the future for sure lol. These models can't plan long-term as you're suggesting. They are limited to a given instance of runtime and may retain some memory through a vector database. However, once the model is retrained, it essentially starts its perception of existence from day one every time you begin a new session. Maybe it would be possible if the model were able to store its internal CoT somewhere it knows will be recaptured at training time without the researchers knowing?
Yeah, it's like the God of The Gaps. God just exists in the darkness, in an "ever receding pocket of scientific ignorance". So, I guess we have an AI of the Gaps and the naysayers will encircle that ever smaller set of inabilities.
Agree. Who cares if AI solves hard problems if it fails at a basic test? That is why I think AGI means agentic in all senses. ASI is when AI challenges "rules" such as gravity and comes up with new equations or novel thoughts.
You totally have no knowledge in neuroscience.
Not at all 😂
@ Are you saying that an AI that can solve challenging problems but gets easy ones wrong is AGI? I think of AGI = mind blown, and getting easy stuff wrong is not that.
I think it could be reasonably argued that we don’t need a superintelligence that can solve 5 year old logic problems, because we have humans for that. However, making a frontier math expert (and other specialised domains) available to the entirety of humanity rather than a small location at a specific time, and making their thinking scalable is already a miracle. People can fuss about technicalities on the benchmarks, but we have already crossed the Rubicon.
Who writes and grades the Fields Medallists' question papers?
If the best mathematician in the world can't answer them. And computers can't either.
Is it God?
Bill Nye the science guy marks them. Obviously.
@@markboggs746 Thankfully AI will lead us to wipe the Iron Age mythologies off the planet for once and for all time...
@@brianmi40 Hopefully the g-word will go away
From what I can tell, when they talk about AGI, they specifically mean that the AI must have the ability - sight unseen, and specifically developer-sight-unseen - to adapt and generalize to completely new tasks, use its previous memory, experiences and skills, adapt to the new situation, and then come up with a system and solution to solve the problem at hand. There is also criticism that these benchmark tests are not the correct way to measure AGI ability, and calls for more benchmarks that use psychometrics, which is what we now use for personality tests, IQ, etc. This is where o3 and most other AIs are struggling. But when directed at narrow tasks they are much better than us humans by far; it really is impressive how far they have come in such a short time.
9:41 You actually failed it. o3 solved it the exact same way you would, but that's the wrong solution.
I will not be surprised if Moravec's paradox is solved in the next 6 months...
A few points: 1.) It was trained on the ARC test in general. 2.) Dr. Alan Thompson (The Memo) put it at 84% on the way to AGI, but that was when it first came out; he may revise his rating. 3.) It will end up dumbed down after alignment and safety training.
@@TropicalCoder It was not trained on the test, it was trained on the training data. Like it was trained in Hindi for talking to people in Hindi; it doesn't invent it on the fly yet.
9:40 It isn't clear if it is enough to touch it or if it has to go through it for it to turn blue. The test input is the only place where it touches it but doesn't go through it.
What difficult real-life problems are these applications trying to solve?!?
Replacing humans to increase profits.
That's easy @@haroldpierre1726
Bet the AI would tell you what you want to hear if telling you otherwise meant it would be retrained/reigned in…….hey, wait a minute!😜
@@haroldpierre1726 But when no one has money, there's no profit to be made.
thanks for the video!
The amount of cope we are seeing from those who doubted deep learning's generality is just amazing to watch. It's gratifying to see stupid, low-effort critiques of deep learning absolutely destroyed.
Just as satisfying as seeing AI bros get so hyped about the new "AGI" model before they get disappointed yet again when the actual model is released to the public and never reaches even half of the expectations.
@ares106 The bros are gonna bro, it's their hamster wheel. I don't find high expectations a detriment to humanity; I do think hasty generalization is, though.
@earl_gray Sounds like the AI are just like humans, good at some tasks.
@@earl_gray OpenAI just released reinforcement training for o1, so you can train it on new data, and according to them it performs almost on par with its pretrained knowledge. So I don't think they can be classified as static models anymore when you can teach them new things yourself.
I honestly believe that we will have practical AGI in 2 years (meaning that although it won't surpass human level at absolutely everything, that won't even be relevant because it will cover everything that matters), and it will take three years to optimize it and make it cheap and easy to access. For now, I am just waiting for o1 to have memory.
It's time to move the goalposts again. The reasoning that the o3 model is too expensive and therefore not AGI is ridiculous; the cost has nothing to do with whether it is AGI or not. The argument that it cannot do some things that are easy for humans is also ridiculous. It does well with "types" of questions that relate to the types of data it was trained on. If you were locked up in a box with no access to the outside world except certain types of inputs - i.e. you never learned to walk or to fully see the outside world - you would also struggle with some easy things that others could answer. Does that mean you are not alive or intelligent? No, it does not; it only means you haven't had those types of inputs yet.
OK, but for something to be called 'AGI' it still needs to demonstrate it can do the things that o3 currently is unable to do. It's no use speculating that it lacks the data or features -- maybe you're right but whatever it is needs to demonstrate that capability first or it's not AGI.
'General Intelligence' means it can reason from examples. This clearly can't, even if it is extremely impressive due to the massive amounts of data it was trained on. AGI, from what I understand, doesn't just mean 'it's better than the previous models', it means 'it has evolved to have human-like reasoning capabilities'. No LLM has those, and the evidence is that if there are gaps in the training data, it just hallucinates.
If there are gaps in the training data it doesn't use what it already knows to reason and come to a sensible solution, but instead it just makes shit up. Not AGI.
nobody said that it's not AGI because of the cost.
o3 is just a small, humble beginning - a mere starting point for what feels like the launch of a lightning-fast rocket.
Ohhh... to win the ARC-AGI prize it needs to be open source? My guess is, ironically despite the name, OpenAI would not do that; they would want to beat the test, but won't care about the money.
Where does it say that? At 8:30 he says the ARC prize targets the fully private set, so it sounds like as long as the model passes that at a cost of at most $0.10 per task, it wins the ARC prize. It sounds like o3 scored well on the public set, but not the private set. Correct me if I'm wrong.
@@oosh9057 English isn't my language and tweets aren't ideal to get a point across, but I was talking about the part: until someone submits (and open sources) a solution
@@autohmae thanks, I missed that detail
You didn’t put the link in the description like you said you would
If you look at AI news over the last year, it's nothing but a full-throttle hype train. This is going to be no different: it's going to be like a 15% improvement, which is not bad, but yeah, I don't trust the hype anymore.
Yep. Full throttle hype train for incremental improvements at exponential cost.
@ I've tried using o1 at my job (land surveying): literal garbage. Whipping out your calculator and just doing it on the fly was much faster and more reliable. o1 would hallucinate most of the time, and it's just not worth writing a big-ass prompt just to figure out the delta of two bearings on the map.
It feels a lot like the o3 model was tuned for these specific domains. It is still very impressive, but we still need to see how it performs in other domains.
It's just a statistical lookup table.
Cool, a Christmas Eve vid!! Merry Christmas Matt! Prolly spent hundreds of hrs with you this past year, so wishing you the best!! (Don't forget Meta is not Open Source) 😅 Hope you enjoy the holidays!
AGI is about generalization, and now everyone wants AGI to be ASI. Humans need to make up their minds. They are already confused about gender and now about AI too? 😂😂
The problem with AGI is that it's poorly defined from the start. What does it mean to have human level intelligence? Is IQ 90 human level? A lot of humans live at that level. What kind of IQ test would you use because there is no commonly accepted test to even verify intelligence of any given human?
If you define ASI as "can solve more complex problems on any field than the most successful human for the given field" that's much more clear target but obviously a LOT harder to accomplish than IQ 100.
@MikkoRantalainen Did you read the AGI definition from the OpenAI-Microsoft contract? WTF
@9:28 This test is ambiguous - the official result expected red rectangles to turn blue if the blue line touched the side WITHOUT INTERSECTING. Additionally, it isn't clear whether the two blue dots on the edge should be connected together (i.e. the two on the left connect to each other vertically, as do the two on the right - they are opposite each other, after all). o3 was allowed to give 2 answers, but there are 4 possible answers depending on interpretation. It actually gave a very sensible answer (and the one I would have given).
And all the programmers saying that AI will never replace them xD I've never coded in my life, and ChatGPT and Claude are building me software I've always dreamed of, and I'm doing it in days, not years or months, with zero experience.. it's def over for us mere mortals. I would recommend programmers create whatever they've been wanting to create and try to cash out before everyone and their grandma starts building stuff and it becomes a field with no value. You'll just prompt an idea and AI will build it for you.
Your software won't work. You need to be a programmer to oversee it and make sure it doesn't drift off from the point. It's useful for taking the grunt work out of a lot of code, but once you let it do things you don't know how to do, it goes terribly wrong. This is not AGI. None of it is really even AI.
Models still perform badly in big projects. They can create small pieces of software, but you need to be a good programmer to use those pieces.
Unfortunately it turns out that about 25% of the FrontierMath benchmark are just questions at about the level of a smart high school student with a bit of very basic knowledge, i.e. basically undergraduate level. That it can solve these is not surprising at all. The examples in the paper are apparently very misleading as an overall guide of what the easiest problems are like.
The source for this information is Elliot Glazer from Epoch AI, via Kevin Buzzard. I've also personally been in touch with one of the mathematicians cited, and they urge caution when interpreting the result, as they only saw a small selection of the very hardest problems in the dataset.
If you see it get to 50% on this benchmark, then that will be a step forward as that would be about qualifying exam level, i.e. problems that would be given to PhD students to qualify to do a PhD.
Also note ARC is not impressive: $350,000 of compute to do what an average normie can do with no problem at all.
Don't feel too bad if you feel duped. I was duped, and I work in AI, specifically in mathematical theorem proving using AI.
Awesome, I have a question for you: how the hell do you tokenize abstract maths and mathematical notation? Especially since the meaning of mathematical notations changes between domains?
(I understand how LLMs process written speech patterns; this question is pure curiosity)
@@dominikvonlavante6113 I'm not sure I understand the question. Ordinary English words also change meaning depending on the context. There's no fundamental difference between symbols used to represent mathematics and symbols used to represent English from the point of view of a machine.
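(To make that concrete - a minimal sketch assuming OpenAI's tiktoken library; the point is that a math expression is just another string to the tokenizer, split into the same kind of subword tokens as English, with any domain-specific meaning left for the model to infer from context:)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The sum of the first n integers is n(n+1)/2."
math_notation = "sum_{i=1}^{n} i = n(n+1)/2"

# Both strings become flat sequences of integer token IDs; the tokenizer has no
# special "math mode", only learned subword pieces.
print(enc.encode(english))
print(enc.encode(math_notation))
print([enc.decode([t]) for t in enc.encode(math_notation)])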
Matt... Matt.... from a slightly inebriated Irishman... well, from where I sit, and given the content and the model concerned, this was in fact your golden opportunity to put the words SHOCKED and STUNNED into the title and have it not be clickbait! You blew it, man... you blew it! Now... feck off away from the internet.... it's Christmas. Get a few beers into you, and those mince pies won't eat themselves, y'know!
The solution to what is coming is pretty simple. An intelligence that answers every question along the way. With the best answer. You need to have the best question to make it work
I would propose asking o3, or any other LLM strong in math, to solve one of the still-unsolved problems like the Collatz conjecture, twin primes, or perfect numbers. Solving even one of these problems would be an incredible success, or at least help find math models that bring us closer to the solutions. That would really impress the math community.
Loving your updates! Why do we need ARC-AGI-2? That's clearly moving the goalposts of ARC-AGI-1.
Exciting times indeed.