Hi everyone, thank you so much for the feedback! I couldn't have expected this kind of attention on my video in the first 48 hours. I've taken some of your suggestions in the comments and have created a Part 2: ruclips.net/video/a8QvnIAGjPA/видео.html Please consider watching!
it was funny ! the thing is to keep up with the technolgys ad current innovations being deployed as it should not be hard to emulate these neural networks wth the open sourced models ! the aim is to train the local models as best you can at the highest point you capability but keep aware that technolgy needs to adavce to hande these heavy tensor calcualtions hence local model will be able to perform these tasks without the eed of outseide intervenetion so get an early start ! or it will be a nightmre of stuy to catch up : it has taken me a full year of constant python etc doing this training and implementation to keep up and get ahead ! .. that gap is widening @ Just expect to be able to host a 24b or 70b local within the next two years ! , a full genrative model ! ( so you could host multiple mini 7b agents at the same time ! hence a power full system ! ( agentic ! )
It is worth noting that GPT has probably 'read' ever thousands of answer books on Jackson, As well as all of Stack Exchange, as well as several study guides on Jackson in particular. So if you want to really test GPT ability you probably need to create novel questions that will not be found in any textbook or online form.
Note 2: I haven’t looked through the answers, but in cases where GPT knows what the answer should be, it will make up crap to fill in the middle. I’ve asked it many PhD level math questions where it has hallucinated its way to a “plausible” answer.
@@KyleKabasares_PhD that's not what he was saying. You were providing GPT 1o the answers, so of course it would give you the right answers since you provided them for it. To know if it truly solves PhD questions, you shoudn't give it questions like :"prove that this formula is verified" but rather " what is the formula for ... ?"
@@omatbaydaui722 I understood what he was saying. I’ve verified it doing problems correctly from start to finish (non-Jackson) without knowing the answer in advance! But in those cases I actually did the problem unlike here, so I’m planning on revisiting the problems in this video.
It could have made several errors in its derivation but displayed the right answer since it knew the desired result. Ai is notorious for simple mistakes and contradictory statements. It’s still impressive but also as you admitted you didn’t really check it in depth and gave it problems where the end result was given.
I just tried o1 with some fairly simple integrals, which it got badly wrong and I had to guide it to the correct answer. So I'd advise checking every step in the derivation.
You have to remember that this book was probably directly in the chatGPT training data, so this may not be a valid measure of novel problem solving ability
100% this. It's given the answer to work towards. I do not have enough knowledge in this area to prove that it came to it's conclusions incorrectly, but it's a well known quirk.
The issue with this "test" is that the solutions to one of the most famous physics books are certainly in it's training data. Give it a problem for which there is no known solution. Or at a minimum give it a problem from an unknown text book. Find an old physics or math book from 100 years ago that has no digital copies. Then ask it questions from that and see how it does.
@@amcmillion3 yes I did it and it’s not very accurate , I feed in 3 jee advance questions and out of which it could only answer 1 correctly , 1 he did it wrong even with hints and 1 he had to solve wrong first and than with hints it was able to solve it
If they were JEE Mains level questions, than solving 1/3 would put it at the same level as those qualifying in top 1000 for the exams. FYI, The highest marks in JEE Mains were usually around 33-35%. I’d wager that would be folks with l an IQ level of ~130+, that is pretty damn good for GenAI. On the normal distribution curve of IQ where 100 is the population average, 130 should at least yield 1 or 2 sigma of confidence level to the statement “GenAI has definitively exceeded the average of the human intelligence level”
Testing it with a book that is “infamous” probably isnt a great benchmark considering that it would mean that there is a considerable database related to that specific book it could read from. If you could test it on a novel problem that would be better
In general, ML models shouldn’t memorize the training data. A lot of effort is put into ensuring the model learns how to do the problem rather than memorizing.
They told me AI would replace hard labourers and Fast food workers first leaving us more time to think so I went to college now I'm college and I'm the first one being replaced.
yeah it's everyone. from labourers to physicists, Ai could do everything much more effectively. the biggest surprise was creativity, that Ai could create art.
Just don’t be a data analyst and if you want to be a computer scientist get out of school and get into the languages and start building plenty of budding industries right now
I haven’t watched the solving yet, but immediately I would like to point out that choosing problems which have known solutions may mean that the model has already seen (and simply memorized) the correct answer! A better test is to ask it an impossible problem or one that solutions don’t exist for and then try to see if it’s generated solution is correct.
Absolutely. If you simply Google the first 15 words of Problem 1, the very first result is a pdf document with a detailed, step-by-step solution. If anything, assuming the steps provided by o1 are correct, it just demonstrates it's decent at summarising search results... The same goes for programming. A lot of people get easily impressed when GPT "writes" a 50-line script that's basically within the first 3-4 StackOverflow posts after a Google search. I mean, yeah, I won't deny it's really convenient that the tool can save you a few clicks, but saying that it has an understanding of the answer it's giving you is (as of today) still a stretch.
if you know how ai works the way they are trained is lossful they don't have access to word for word of every bit of their training info if they did these models would be terabytes upon terabytes in size and would be extremely slow.
@@o_o9039 I know how they work, and I'm not saying the model has all the information stored in its parameters, but it's no secret GPT can indeed search the web and summarize its findings. Copilot (based on GPT4) openly provides sources for almost everything it spits out.
Instead of telling it the answers, try asking it to find them. When I did this, it got the first one to an infinite sum but didn't reduce the infinite sum to the final answer: pretty good! For the second one, it had an extra factor of 1/pi that it made up. For the third it completely disregarded the angular dependence of the scattering and failed.
i asked it to calculate some stuff (quantum mechanics) for me and it also did some difficult step without explanation. i asked it to prove that step and it gave me a proof containing 1 mistake, but i wasn't sure and asked about that step, then it realized it was wrong, explained exactly why it was wrong, fixed it, and remade the calculation corrrectly
The model has most likely been trained on these problems and their solutions, since they've been around on the internet for a long time. So it isn't really a good test of its abilities since it has just memorized the solutions. That being said, I also tried it with some problems from the 2024 International Math Olympiad, and it was able to get at least two (out of six) of them correct consistently. I only tried problems where the answer was a single number/expression, going through proofs would be much more work. The model's knowledge cutoff should be in October 2023 so it shouldn't be possible for it to have seen these problems before. It's still hard to tell since OpenAI isn't being very transparent with their methodology, but if the model is actually able to solve novel IMO level problems it has never seen before, color me impressed.
The key is having the answer before hand so it can guess from both ends and connect them. Ask it to evaluate a parameterized surface integral even with wolfram plugins and it will make mistakes.
If you want to know, the steps are simply contextual changes, it is essentially a gpt that has the instructions self-set, and the output of its thinking steps is the instructions it is providing itself at each step. it works because by shifting context at each step rather than only a single context of the original message and response, it is able to aproach problems itteratively from different 'perspectives'
The model's performance is undoubtedly impressive, but if it was trained on this book (which seems likely), it's not truly generalizing to new data. For a fair assessment of its capabilities, models should be tested on novel, unforeseen problems, rather than those for which the answers are already available. In practice, models are typically evaluated on fresh data to gauge how well they can generalize. To accurately measure performance at this level, problems should be novel and manually verified, even if that takes considerable time (1.5 weeks or more).
I believe the book does not have the answers to the problems, so even if it was trained with the book it shouldnt help it to solve problems. Still it is possible that it just took the answers from some physics subreddit post and just pasted it
@@juanperez-lh9mt It doesn't look it up, but the questions and answers are likely part of the training data for the model, which means that they're, in a way, already imprinted into the AI's "memory". Now, it surely still has to do some work to figure it out, but sometimes when you see something like "Recognize that _____ = _____" in some completely unrecognizable way, it's probably something the AI has just summoned from "memory".
@@tachyonindustries It's not prevalent enough in the training data that we can say it recognizes the problem, it just recognizes the type of problem and reasons from there
@@ancientmodis LLMs do not look up in their databases. The database is used to train the model, all they learned is probabilities to figure out the most probable continuation to your prompt. It does not look anything up. LLMs do not even have access to look up the raw data they have been trained on.... That is fundamentally NOT how they work.
There is a problem with the test: Since the answer ,,show that.." is given, the AI will always show the correct answer, the reasoning might be flawed. It would be better to cut out the correct answer from the problem and see what AI will answer then.
This applies to humans completing the problem as well, and there was an effort made to check the steps. I agree, it might be interesting to see if it could though (although if it succeeds, will likely express it in an different form which may be hard to verify).
Prior versions had bogus steps that didn’t really follow legitimate steps, and units were often fouled up. Definitely deserves to be looked at deeper to see if that has improved.
" OpenAI's new AI model, "o1," has achieved a significant milestone by scoring around 120 on the Norway Mensa IQ test, far surpassing previous models. In a recent test, it got 25 out of 35 questions right, which is notably better than most humans. A critical factor in these tests is ensuring the AI doesn't benefit from pre-existing training data. To address this, custom questions were created that had never been publicly available, and o1 still performed impressively"
100%. Perplexity pointed me to at least 6 pdf versions available for free online. There are also lots of study notes online available for this text. Although I have no idea if it is memorizing answers.
@@lolilollolilol7773 I agree, it's definitely not just remembering the answer. But is it remembering steps to solving the problems that it found in online study examples? I don't know. But my own testing makes me think it is a big step up in capablity.
Its funny how good it is at some things and how terrible it is at others things still, seems its abilities are heavily dependent on whether examples of the problem were included in its training data. I've asked it to create a 32 bit crc algorithm and it did it perfectly, however when asking it to create considerably more trivial 3 bit crc algorithm (which is uncommon and quite useless), it failed miserably and in fact produced multiple wrong result that got worse and worse as i pointed out the flaws.
I just filmed a part 2 where it involves some problems where the answer is not known in advance, and problems that I'm confident it did not have access to previously: ruclips.net/video/a8QvnIAGjPA/видео.html
i am doing a phd in ml related field. Setting fair benchmarks and tests in the current day is quite hard considering the shear scale of data top models are trained on. And using a famous physique text book isnt really a good attempt. model o1 reasoning is a massive step up though for sure, i think it could do a similar blind test like this very soon.
*is 9.11 bigger than 9.9* [Thought for 4 seconds] "Yes, 9.11 is greater than 9.9." *Are you sure?* [Thought for 17 seconds] "You're right, and I apologize for the mistake in my previous response. Actually, 9.9 is greater than 9.11. Here's why: 9.9 can be expressed as 9.90 or 9.900. 9.11 remains 9.11 or 9.110. When comparing: 9.900 (which is 9.9) is greater than 9.110 (which is 9.11). Therefore, 9.9 is greater than 9.11." I suppose we're going to have to make sure ChatGPT re-evaluates their answer every single time
I just did it and it got it right first try: No, 9.11 is not bigger than 9.9; in fact, 9.11 is less than 9.9. Here's the explanation: 9.11 equals 9 plus 0.11. 9.9 equals 9 plus 0.9. Since 0.11 is less than 0.9, it follows that: 9 + 0.11 9+0.11 (which is 9.11) is less than 9 + 0.9 9+0.9 (which is 9.9). Therefore: 9.11 < 9.9
This is scary. But you have to try with novel problems that the AI has never seen before. Chatgpt has been for sure trained with the Jackson book! Nevertheless, the reasoning capabilities are astonishing. A new era has begun.
" Chatgpt has been for sure trained with the Jackson book!" This is such an oft-repeated nonsense statement though. Just because a problem might be in its training set, the model will not be significantly or any better answering that exact problem than any other problem in the same category. It's, like. Do you remember every homework math equation you have solved in your life? Would you be any better at solving a problem you have already encountered once 10 years before vs a similar novel one? No, of course not, unless you have superhuman memory where you keep an exact copy of everything you've done ever. Similarly, these models don't memorize. They synthesize. They are learning models, not search engines or "neually indexed databases" or whatever.
@@sid.h Ai remembers patterns, not particular problems. And indeed if some pattern is missing AI will miss it, if pattern is well represented AI will solve it well. Better architecture of neural network remembers more and remembers and solves corner cases better. This is what we see in chess networks such as Leela Chess Zero.
As a person doing a bachelor's in EE, this will be very useful for me. Like many, I only wonder what'll happen in the future when it gets even more advanced? Maybe take my reduced earnings and live off the land somewhere. Away from this.
If it is on the internet, it's in its training data. You would need to find questions that it has not been trained on. This is why benchmarking is so hard
It still impressive the model can accurately comprehend which part of its training data deals with the problem in question. There are human beings who haven’t mastered this skill lmao
Stop the downplaying. These types of problems are impossible to solve without reasoning. Simple pattern recognition doesn't make this possible. This cope needs to stop
@@xorqwerty8276Star Wars Universe but more humanoid bots on our planet , and billions of them being like gods building anything and everything they imagine. Earth is surrounded by a giant dome that extracts/enhances light from the sun combined with technology that speeds up how fast plants or trees grow, we have a combo of biological machines that have become humans too and are interbreeding half humans half machines. The sun is all we need to survive now. Millions of unique new species emerge. (10 years is like millions of years if true ASI comes in a year from now) Even 2 years could be very wtf lol 😂
as a high schooler who had taken part in aime, o1 is really impressive. aime problems get so much harder when it gets to the latter half. so 83% (o1) compared to 13%(gpt4o) is huge. the latter solve possibly only solve the first two which are not challenging at all
not maybe, for sure. i know people dont have to be all experts in exactly what the black box of deep learning is doing but holy people are so dumb... i wonder if they don't think that IF what they think is true, meaning the models being this great, that in 1 month we wouldn't have to get new discoveries in all science fields... which will not come because the current AI is 100% data capped. its just memorization of PDFs and manifold recalling
This is a fair point! I have gone ahead and uploaded a Part 2 using problems I'm confident it had not seen before and that I have detailed answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
Solutions are publicly available and most probably in training datasets already. LLMs are good at what they already learned, but even not 100% accurate there.
"to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve advent of code programming problems published december 2023)"
This is true for humans as well. I have worked in aerospace for a major company for many years. When I have to solve a difficult engineering problem I first search for for a so-called a "subject matter expert" in the field and it's quite likely that he or she will know the answer.
Incredible. It would be interesting to see what happens if you give it to solve an incorrect result. Will it show that your result is incorrect and instead give the correct one ?
This was an interesting test. I still think it's funny when people say these models don't understand. Anyone who's used them enough understands that they do understand. One nice thing is that you can ask follow up questions as well and ask why something is like that, or ask it to try things in a slightly different way if you want it done differently.
I dunno about latest models, but ChatGPT 3.5 does NOT "understand" anything. It feeds you fake references, and when you repeatedly tell it it is doing so, it will say "sorry" and continue to feed you fake references. That is not its fault--it is not "replying" or "responding" or dong anything a living being is doing. If you give it a training set containing PhD Level physics problems, sure it can solve those problems. That is just predicting output from a training data.
@@woosterjeeves This isn't GPT 3.5 though, and that specific model you mentioned was released back in November of 2022, the first public release of ChatGPT. In the video, you can see it's process of reasoning. ChatGPT doesn't use fake references if it's able to break it down and be able to express why and how it conducts it's problem solving and reasoning. Also to "That is just predicting output from a training data", one, how is that different from learning? Isn't that the point of teachers, to help you predict and reason the output from the input of questions and data? Two, this is just a preview, not the full model, and it is able to do extremely difficult problems like these, explain the reasoning, the process, and give the right answer. We are slowly gravitating towards such a world where such an excuse of prediction of data will no longer be viable to argue about. The model is able to understand. The model is able to think with it's data. It's putting formulas and answers together from it's data, to reason and to form intelligent answers and responses when in contrast, the same problems make the most qualified PhDs scratch their heads. Reminder, these questions take around 1.5 weeks as said to solve ONE problem, GPT-o1 does it in less than 2 minutes.
@@人人人人人人人人 Sure. I am still flummoxed why someone would add "understanding" to a prediction model. If you think prediction (from training data) is equal to understanding, then algorithms are already "understanding". Why hype this one? OTOH, if you think there is something qualitatively different, then we can talk about that. But you cannot claim both. Are chess computers "understanding" because they can do moves that make super GMS scratch their heads? If so, then the argument is already over. I am only cautioning against use of common-term words ("understanding") which makes one think in terms of sentience. A language model has no such thing. Does this mean AI will never reach sentience? I never said that--just that the video does not do it for me. I am totally clueless why others are impressed about this model's "understanding", the same way I would be if someone said Alpha-Zero (the chess AI) understands chess. That is all.
to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve some advent of code programming problems published december 2023)
@plonkermaster5244 Your statement it's half true, LLMs need to have existing information to propperly work. However, unless the problem presented needs an actual new theory with previous research and a never seen formula, LLMs can recognize the formulas needed to solve the problem. Good observation.
Not bad for a model smaller than the o1 and based on GPT-4, to be honest I don't know how I'll be able to test upcoming versions like the ones based on GPT-5. I can't wait to use this on university projects, there are so many things I need to go looking for experts for relatively "easy" tasks.
There’s so much potential in the pipeline. Imagine the o1 techniques applied to image/video generation. Bye-bye obviously fake images, and hello “indiscernible from reality” images. Also, once o1 is layered on top of GPT-5, we’re likely talking “competing with or beating best-in-the-world level scientists/thought leaders” in different fields. This will fuel more investment into compute farms to create even MORE powerful AI, and multiple instances can run simultaneously to solve problems that would take humanity millennia to solve otherwise. Including AI researching how to improve AI in a self-improving recursive loop that will only stop upon reaching the physical boundaries of the universe.
The way this model was trained was it took physics problems just like that and used a model like gpt4 to create reasoning chains until it could actually derive the correct answer. So it's not surprising. It can already solve textbooks that are well solved already because the answer is very objective and once you get a solid reasoning chain to get to the answer, you can simply train the model on that
@@Patrick-vv3ig no you didn't and if I had a penny for every shitty undergrad bragging about how they went through hard books without understanding a single thing I'd be rich
And they wrote how this was just a step of many like that to come. In 5-10 years the world may be changed fundamentally, 20 years it’ll be hard to recognize
I asked it to find how much the earth would have to be compressed to become opaque to neutrinos: It took it 39 seconds to say 26 km diameter. Totally fascinating how it got there...(01Preview)
ChatGPT is truly amazing i wonder what this technology will be like in 10 years i think schools should really use this technology and allow it because it’s not like it’s going away tomorrow. I also think this technology makes it impossible to be ignorant
If the solutions exist on the internet is it really solving it? Or just analyzing and printing the answers . A true test could be a creating a unique problem with no known solutions
Great video and interesting commentary. It’s interesting you think this might be a good study aid or a tool… however I just watched you take longer to check the answers than the model took to solve them… and your an actual subject matter expert… and as you correctly pointed out, this is just a preview of the full model capabilities. This new type of model will not help experts, but replace them. They will eclipse not only human level knowledge, but human level speed. This is not a tool. It’s disruption personified. With something this good (and as the saying goes, this is as bad as they will ever be as they will only improve from this time onwards) what purpose will it serve to complete university study for 3 years, only to try and find employment in a career that no longer requires humans. Amazing.
It's a machine, like cotton gin, the steam engine, the locomotive, etc. All advance of technology has displaced people from some jobs into others. And yet we are still here. What's the alternative? Structure society to be less productive and less efficient in order to keep people employed in obsolete jobs? That will just slow the growth of the economy and cause a lower standard of living, leading to poverty and hunger as the world population keeps multiplying. It's going to put people out of work, we will be ok. Becoming a Luddite is not going to change anything.
@@AlfarrisiMuammar I am glad you are thinking about it now. 1) Truck drivers replaced wagon drivers (not horses.) There are many more truck drivers now. 2) The standard of living for both truck drivers AND horses is higher than ever. Same thing goes for automobiles and horses.
7:36 You're right. It's not obvious. Proving that identity is the main crux of the the problem and the model just skips over it. My professor would give me a 0 for that entire problem if I did it this way. This is the reason you shouldn't give it "prove it" problems where you've already given it the answer.
Maybe the follow on question is to ask it to explain that part? I think the takeaway here for me, is that we're 2 years into this resolution, and the basic tech is already at this level. How do we think about the world if we extrapolate another 5-10 years? What do we want and how will this inevitable tech change whats possible?
The first time i watched a video like this was from sixty symbols where they also tried to solve physics problems using the original vanilla Chatgpt 3.5. They didn't get anywhere close to this level. I think the progress is reallty accelerating. I also think that inference time compute is a very real thing and the guys at openai have solved it with this new model in a fundamental way for sure. I think there will be other ways to implement system 2 thinking but i think that using reasoning tokens for accomplishing this is maybe the best and coherent way to go forward. I truly think that with o1, we have the first complete architecture for AGI.
I mean it gave you step by step how it was able to solve those problems and gives you its insights into how it’s thinking. That is just wild beyond imagination.
I tested to solve a sudoku and it failed. It either give wrong results or change up the original question Still did much better than 4o when I tested it a few months ago
You're welcome! I have made a part 2 using new questions that I'm confident it didn't have access to beforehand: ruclips.net/video/a8QvnIAGjPA/видео.html
@lolilollolilol7773 LLMs literally predict the next word based on probability. If the answer isn’t in the training data it can’t answer the question. It doesn’t have reasoning skills.
@@lewie8136they recognize patterns like we do... We don't really think, we also predict things based on the patters we see... We just named it thinking.
The first one is the easy one? Yet at the same time you're amazed that it solved it in 122 seconds, while you mention that it generally takes others 1.5 week.
it was messing up on symbolab generated Mclauren & taylor series problems for me last night, but its a massive improvement from last time i tried to get it to do math a couple years ago
This should make you seriously question the way we do education. If human value isn't in solving problems that are hard, yet clearly defined, then why teach that? You teach it because you need to know it to solve higher level problems. But maybe we no longer need to also train the skill of doing the calculations. So long as you understand the concept properly you can move on without spending a week pushing through the math. That's going to be very hard for some people to accept.
Understanding the concept, unfortunately, typically requires dozens of practical experiences. This us why teaching math starting with a calculator leads to lesser learning than introducing a calculator following basic practice
I think giving it problems where it asks to arrive at a specific solution (shown in the problem itself) is not a good way to evaluate it. I bet the results would be very different if you just asked it to solve the problem by itself.
I wouldn’t be surprised. The true limit of LLMs are problems with no real known solutions. The advancements still do not change the (oversimplified here) model of fitting a curve. OpenAI can do another round of RL and CoT on these specific problems, but all it takes is another set of problem that it really hasn’t encountered that well. It still suffers from the generic flipping an image of a dog and having the AI shit itself.
@@user-jm8fj7ez8s Yup; and that's why even Sam Altman admitted it's "more impressive on your first use" [than when you use it a lot], and that it's "not AGI". But this is already INCREDIBLY useful; think about it like this: even if it can't automate most jobs, what % of humans ACTUALLY work on problems that have 0 prior data in terms of how to solve them? It's only less than maybe 1% if we're being generous. This means the AI will soon be able to eclipse the regular work 99% of humans do, without any further breakthroughs. And THAT is the mindblowing part! (this was generated by O1-mini).
@@user-jm8fj7ez8syeah it does not reason from first principles. Unlike humans, it doesnt explicitly operate through spatial & relational reasoning from the ground up. They just had o1 trained by asking it millions of questions, letting it think, and then reinforcing the reasoning that led to the right answers. So basically o1 will know the most accurate and efficient chain of reasoning for familiar questions. Will this eventually turn into a super intelligent reasoning engine when scaled up? Nobody knows really but I personally doubt
Just as a comment: it looks impressive. However, to truly judge how good the model is, one (unfortunately 😬) needs to read the proofs line by line and examine the arguments in depth. From my experience with GPT-4, the proofs often look good, but they sometimes contain flaws when examined more closely.
Just finished recording a video where I do that more or less with some problems I have the answer to and am pretty sure the problem didn't exist on the internet!
An "ai president" as long as there isn't a person telling it how to think could be the best thing for any country. I would still give it a few years before doing so tho and make sure it's main objective is to do the best for the country.
i think current o1-preview is far more capable to govern than any human. Of course, it would need some readjustments like a more continous existence, without resetting itself, and a virtually infinite context window so it can always take into account everything that has ever happened in the past
But one moment, how can you verify that the GPT did not know about this problem before and is only recreating it from his own knowledge base? You need to give something that you are 100 sure that it doesn't know. For that the best way to do it is to ask if it actually knows the solution directly if GPT4o knows the solution, it is likely that the o1 knows it.
I don't know if it's seen these problems before, but it was tested repeatedly using newly made-up logical and reasoning problems and it solved them. I showed it some work that was unpublished (but actually valid and verifiable) so I knew it hadn't seen it, and it's response was the same as I would expect from someone who was very experienced in the field and seeing it for the first time. So it definitely can reason (in its own way) on its own without already knowing the answer. I highly recommend the "Sparks of AGI" paper or lecture that goes into this in detail.
I gave ChatGPT a couple of master’s degree computer science problems, both solutions were wrong. I had to tell her they were wrong, she apologized and corrected herself. It wasn’t reliable.
i gave it multiple undergrad problems from algorithms and data structures, its certainly usefull but 90% of the time fails first try, then maybe in about half of those times you can instruct it how to fix the solution. Im talking about the 4o model. I find it useful for finding dumb mistakes, like missing some boundary conditions or just forgetting an i++ and stuff like that
If you want to test the actual knowledge of the model then use textbooks that were compiled with questions created after the knowledge cutoff of the model. This doesn't reflect its actual knowledge but prior knowledge of the model.
I think a student who can't check the answer for correctness may get his ‘points’, but if the professor asks, the gaps in his understanding will quickly become apparent.
It's a fair point, I've gone ahead and filmed and recorded a part 2 that involves problems I'm confident it hadn't seen before: ruclips.net/video/a8QvnIAGjPA/видео.html
I’m worried that this doesn’t show anything that somewhere and it’s training data. It has the answer and it’s memorized how to explain the answer, but not the underlying logic.
I don't think you can be, but the fact that it tried one approach and then backtracked and did another is pretty good evidence it's not just a regurgitated answer.
That is a valid point, this is why I have gone ahead and made a part 2 using problems that I'm confident aren't floating around on the internet: ruclips.net/video/a8QvnIAGjPA/видео.html
I would have done the test of giving the answer with some error, for example an extra factor of 2, or an arctan instead of arcsin, and see if it gets the true answer anyway and recognizes the incorrect input. That would make a very convincing test.
@@hipotures No, you don't get it. The standards have been raised. The hyper intelligent.... are going to be on steroids. I know i am. Imagine if someone at 18 with an iq of 145+ with AI tools at their disposal? now imagine a decade of this progress and the new generation coming in. We're going to see hyper geniuses.
Generally I would love to see some problems where you do not need to prove solution you know in advance, So not "Show that" but what is ..... I wonder if those proofs are actually flawles or they just look convincing
Well, until you yourself don't know whether what it did was correct, we can't say for sure. It is surprising nonetheless, yes, however if you had given these problems to the earlier version, it would have also arrived at the conclusion required, i think. It would have just done some mumbo jumbo and forcefully arrived at the conclusion, no matter what it got wrong in the process. This time around though, it looks like it actually did all the things correctly in its "reasoning" process.
Hey man! You should do a video with scores, like, you will do 5 tests, and allow 5-shot for each problem to each model. And then see out of 5 what's the score. Do this for GPT4o vs O1 preview, you can also do O1 vs Claude sonnet! Like a "LLM's Face Off"
I actually did a stream like that last night! Gave o1, 4o, Gemini Advanced, Claude Sonnet 3.5, Grok 2, and LLama 3.1 a college math exam! ruclips.net/user/liveGdN4MFxLQUU?si=flPSFIxx85Uqyoz7
@@marcusrosales3344True, but the money doesn’t lie. Until the bubble bursts, very smart people have bet tens of billions of dollars on it being game-changing. And notice how the goalposts keep moving back? “It’s ONLY getting 60% on the bar” is a far cry from 5 years ago when AI could only put out gibberish.
10:12 Not a physics student, but looking at the solution it seems like it ignored the first part, took the M expression for granted, and only applied the approximations. Which I believe are much much easier than getting the expression for M itself :D
I think you're right! Sorry I missed that, I was a bit flustered and tired to be honest haha, but I'm working on a new video that will involve problems that it should have never seen before!
@@KyleKabasares_PhD No worries at all, the solution it provided felt a bit too short for such a problem. I will definitely watch the next video since as others pointed out, having it solve unknown problems will be much more useful and interesting.
It ABSOLUTELY was. Jackson is such a common book used in grad EM. This video has almost no substance, there's no verification on the accuracy of the logic. Guy also said he didn't know if it was correct.
As Andrej Karpathy recently said in an interview, ideal training data for a reasoning model would include step-by-step reasoning. (like how we teach children in school) t's a bit amazing that bulk internet data has enough of this embedded reasoning to get us the current results. OpenAI is using q*star to refine their synthetic data, no doubt with successful step-by-step reasoning in that data. This will take a couple years to reach the next model (that's how long it takes to train a new model) but it's start of a virtuous cycle, where ever capable models refine future synthetic data.
Thank you for your comment, this step-by-step reasoning approach is definitely a game-changer! I have also just uploaded a part 2 to my channel if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
@@japiye I am not sure but I do know it was trained on those types of problems so it’s not truly deriving those problems cold did you notice it would pull numbers out of nowhere. It’s still really impressive and a very useful model I think we skeptical that it’s really the equivalent of a physics grad student, if you watch ai explained video it gets common sense problems wrong
@@japiye As it probabilistically selects the next word, it will select different words compared to what is has seen. This is what makes the model generate new sentences, but it is able to evaluate it's chain of thought which leads to the correct one or a better result. As the problems are found online and the jackson problems are well known in the field for many years previously, it must be in it's training set.
Hi everyone, thank you so much for the feedback! I couldn't have expected this kind of attention on my video in the first 48 hours. I've taken some of your suggestions in the comments and have created a Part 2: ruclips.net/video/a8QvnIAGjPA/видео.html
Please consider watching!
Open ai say The real Ai o1 version of will be out before the end of 2024.
UNIVERSAL BASIC ICOME 2025
it was funny !
the thing is to keep up with the technolgys ad current innovations being deployed as it should not be hard to emulate these neural networks wth the open sourced models ! the aim is to train the local models as best you can at the highest point you capability but keep aware that technolgy needs to adavce to hande these heavy tensor calcualtions hence local model will be able to perform these tasks without the eed of outseide intervenetion so get an early start !
or it will be a nightmre of stuy to catch up : it has taken me a full year of constant python etc doing this training and implementation to keep up and get ahead ! .. that gap is widening @
Just expect to be able to host a 24b or 70b local within the next two years ! , a full genrative model ! ( so you could host multiple mini 7b agents at the same time ! hence a power full system ! ( agentic ! )
How much did open ai pay you to make this ad?
@@AlfarrisiMuammar I can't wait I'm still intrepid for GPT 5
It is worth noting that GPT has probably 'read' ever thousands of answer books on Jackson, As well as all of Stack Exchange, as well as several study guides on Jackson in particular. So if you want to really test GPT ability you probably need to create novel questions that will not be found in any textbook or online form.
Exactly, problems that solve students are already done somewhere on the internet, is just about googling it and copy paste the solution.
It's the same issue with AI being "great at programming" because it's extensively trained on leetcode problems.
@@taragnor being good at leetcode is not even being good at programming.
@@gabrielbarrantes6946 its doesnt have access to the internet
@@gabrielbarrantes6946 said the web dev.
Note 2: I haven’t looked through the answers, but in cases where GPT knows what the answer should be, it will make up crap to fill in the middle. I’ve asked it many PhD level math questions where it has hallucinated its way to a “plausible” answer.
I'm planning on making a follow up video on comparing my approach to solving this problem with ChatGPT's! Thanks for pointing that out
@@KyleKabasares_PhD that's not what he was saying. You were providing GPT 1o the answers, so of course it would give you the right answers since you provided them for it. To know if it truly solves PhD questions, you shoudn't give it questions like :"prove that this formula is verified" but rather " what is the formula for ... ?"
@@omatbaydaui722 I understood what he was saying. I’ve verified it doing problems correctly from start to finish (non-Jackson) without knowing the answer in advance! But in those cases I actually did the problem unlike here, so I’m planning on revisiting the problems in this video.
Crazy part is that this isn’t even the full model, which is even better
Yeah, it's not even a beta, it's a preview. And it's still using the last gen model. They're coming out with a new model pretty soon.
To us who's math students😂😂❤
@@CommentGuard717 yeah, imagine the GPT 5 implemented not preview version. Thats gonna be fucking wild and its not that far from now
@alvaroluffy1 well, they are now working on chatgpt 6.0
@@Ken-vy7zu shut up you know nothing, stop making up things, they are still working on gpt-5 you realize that right?
This is the worst this technology will ever be….
That's an incredible truth
That's a terrifying truth
Eh it might hit a wall though.
@@thegeeeeeeeeeeI’m here from the future, your comment aged poorly
It might stagnate tho
It could have made several errors in its derivation but displayed the right answer since it knew the desired result. Ai is notorious for simple mistakes and contradictory statements. It’s still impressive but also as you admitted you didn’t really check it in depth and gave it problems where the end result was given.
I just tried o1 with some fairly simple integrals, which it got badly wrong and I had to guide it to the correct answer. So I'd advise checking every step in the derivation.
You have to remember that this book was probably directly in the chatGPT training data, so this may not be a valid measure of novel problem solving ability
to be fair, you don’t seem to have actually checked the model responses, there could have been mistakes or hallucinations throughout
Can you point out any specific hallucinations on this video?
100% this. It's given the answer to work towards. I do not have enough knowledge in this area to prove that it came to it's conclusions incorrectly, but it's a well known quirk.
The issue with this "test" is that the solutions to one of the most famous physics books are certainly in it's training data. Give it a problem for which there is no known solution. Or at a minimum give it a problem from an unknown text book. Find an old physics or math book from 100 years ago that has no digital copies. Then ask it questions from that and see how it does.
Exactly
@@amcmillion3 yes I did it and it’s not very accurate , I feed in 3 jee advance questions and out of which it could only answer 1 correctly , 1 he did it wrong even with hints and 1 he had to solve wrong first and than with hints it was able to solve it
best of all, make your own
If they were JEE Mains level questions, than solving 1/3 would put it at the same level as those qualifying in top 1000 for the exams. FYI, The highest marks in JEE Mains were usually around 33-35%. I’d wager that would be folks with l an IQ level of ~130+, that is pretty damn good for GenAI. On the normal distribution curve of IQ where 100 is the population average, 130 should at least yield 1 or 2 sigma of confidence level to the statement “GenAI has definitively exceeded the average of the human intelligence level”
@@pratikpaharia nah
Testing it with a book that is “infamous” probably isnt a great benchmark considering that it would mean that there is a considerable database related to that specific book it could read from. If you could test it on a novel problem that would be better
if its a famous problem, isn't there a good chance the solution was already in the training data?
In general, ML models shouldn’t memorize the training data. A lot of effort is put into ensuring the model learns how to do the problem rather than memorizing.
it was backtracking and double checking though?
Exactly. But similar data as the training data yields high accuracy
They told me AI would replace hard labourers and Fast food workers first leaving us more time to think so I went to college now I'm college and I'm the first one being replaced.
Don't worry, everyone will be replaced in 3-5 years💀
@@phen-themoogle7651 💯
yeah it's everyone. from labourers to physicists, Ai could do everything much more effectively.
the biggest surprise was creativity, that Ai could create art.
Just don’t be a data analyst and if you want to be a computer scientist get out of school and get into the languages and start building plenty of budding industries right now
@@avijit849 ai is not creative and I don't believe it ever will be
I haven’t watched the solving yet, but immediately I would like to point out that choosing problems which have known solutions may mean that the model has already seen (and simply memorized) the correct answer!
A better test is to ask it an impossible problem or one that solutions don’t exist for and then try to see if it’s generated solution is correct.
Absolutely. If you simply Google the first 15 words of Problem 1, the very first result is a pdf document with a detailed, step-by-step solution. If anything, assuming the steps provided by o1 are correct, it just demonstrates it's decent at summarising search results...
The same goes for programming. A lot of people get easily impressed when GPT "writes" a 50-line script that's basically within the first 3-4 StackOverflow posts after a Google search. I mean, yeah, I won't deny it's really convenient that the tool can save you a few clicks, but saying that it has an understanding of the answer it's giving you is (as of today) still a stretch.
if you know how ai works the way they are trained is lossful they don't have access to word for word of every bit of their training info if they did these models would be terabytes upon terabytes in size and would be extremely slow.
@@o_o9039 I know how they work, and I'm not saying the model has all the information stored in its parameters, but it's no secret GPT can indeed search the web and summarize its findings. Copilot (based on GPT4) openly provides sources for almost everything it spits out.
@@pripyaatHow to know if it cheated?
@@pripyaatjeez. even worss than I thought
Instead of telling it the answers, try asking it to find them. When I did this, it got the first one to an infinite sum but didn't reduce the infinite sum to the final answer: pretty good! For the second one, it had an extra factor of 1/pi that it made up. For the third it completely disregarded the angular dependence of the scattering and failed.
i asked it to calculate some stuff (quantum mechanics) for me and it also did some difficult step without explanation. i asked it to prove that step and it gave me a proof containing 1 mistake, but i wasn't sure and asked about that step, then it realized it was wrong, explained exactly why it was wrong, fixed it, and remade the calculation corrrectly
And remember this is a truncated version of the model. It's full version is much better at problems like this
What? 😮
The model has most likely been trained on these problems and their solutions, since they've been around on the internet for a long time. So it isn't really a good test of its abilities since it has just memorized the solutions. That being said, I also tried it with some problems from the 2024 International Math Olympiad, and it was able to get at least two (out of six) of them correct consistently. I only tried problems where the answer was a single number/expression, going through proofs would be much more work. The model's knowledge cutoff should be in October 2023 so it shouldn't be possible for it to have seen these problems before. It's still hard to tell since OpenAI isn't being very transparent with their methodology, but if the model is actually able to solve novel IMO level problems it has never seen before, color me impressed.
I test ,AND correct answer for me o1 2024 final with alternatives
gpt4o has the same training data and can not sole it? so ...
The key is having the answer before hand so it can guess from both ends and connect them. Ask it to evaluate a parameterized surface integral even with wolfram plugins and it will make mistakes.
If you want to know, the steps are simply contextual changes, it is essentially a gpt that has the instructions self-set, and the output of its thinking steps is the instructions it is providing itself at each step. it works because by shifting context at each step rather than only a single context of the original message and response, it is able to aproach problems itteratively from different 'perspectives'
This channel is the reason why I'm not reading fluid mechanics rn
Can’t tell if I should say thank you or I’m sorry lol
Fluid mechanics is extremely fun🤲
Just a a few years ago no one ever imagined bots thinking...😭
I certainly didn't!
The model's performance is undoubtedly impressive, but if it was trained on this book (which seems likely), it's not truly generalizing to new data. For a fair assessment of its capabilities, models should be tested on novel, unforeseen problems, rather than those for which the answers are already available. In practice, models are typically evaluated on fresh data to gauge how well they can generalize. To accurately measure performance at this level, problems should be novel and manually verified, even if that takes considerable time (1.5 weeks or more).
I believe the book does not have the answers to the problems, so even if it was trained with the book it shouldnt help it to solve problems. Still it is possible that it just took the answers from some physics subreddit post and just pasted it
It backtracked on its own answers double checking so I doubt it already knew the answer off it being trained off the book
IT BACKTRACKED ITSELF THOUGH????
Not to mention, universities have and still run research where they create brand new tests solely for having AI take them
It was trained on this data lol
Yes, but it's not like it just looks it up. It actually thinks whatever that means for a machine
@@juanperez-lh9mtIt's a language model, it LITERALLY looks it up.
doesn't make this less impressive
@@juanperez-lh9mt It doesn't look it up, but the questions and answers are likely part of the training data for the model, which means that they're, in a way, already imprinted into the AI's "memory". Now, it surely still has to do some work to figure it out, but sometimes when you see something like "Recognize that _____ = _____" in some completely unrecognizable way, it's probably something the AI has just summoned from "memory".
@@tachyonindustries It's not prevalent enough in the training data that we can say it recognizes the problem, it just recognizes the type of problem and reasons from there
@@ancientmodis LLMs do not look up in their databases.
The database is used to train the model, all they learned is probabilities to figure out the most probable continuation to your prompt.
It does not look anything up. LLMs do not even have access to look up the raw data they have been trained on....
That is fundamentally NOT how they work.
O1 was trained on all Internet, including that book
So were all of us.
@@HedgeFundCIO the difference is we can think, but it can only answer. Its a great tool!! but not think on its own
Actually we dont know if its thinks because we dont know how we think. This is a philosical debate in ai community over the years.@@roro-v3z
@@roro-v3z almost like you didn't see it go through problems step by step to get to an answer..... It can indeed reason on it's own now
@@Hosea405 yes it did but on training data, but it won't have new ideas that have not been trained
There is a problem with the test:
Since the answer ,,show that.." is given, the AI will always show the correct answer, the reasoning might be flawed. It would be better to cut out the correct answer from the problem and see what AI will answer then.
This applies to humans completing the problem as well, and there was an effort made to check the steps.
I agree, it might be interesting to see if it could though (although if it succeeds, will likely express it in an different form which may be hard to verify).
i agree with you, specially taking into account that it may just be bluffing and we would have no idea
Prior versions had bogus steps that didn’t really follow legitimate steps, and units were often fouled up. Definitely deserves to be looked at deeper to see if that has improved.
" OpenAI's new AI model, "o1," has achieved a significant milestone by scoring around 120 on the Norway Mensa IQ test, far surpassing previous models. In a recent test, it got 25 out of 35 questions right, which is notably better than most humans. A critical factor in these tests is ensuring the AI doesn't benefit from pre-existing training data. To address this, custom questions were created that had never been publicly available, and o1 still performed impressively"
So it’s already smarter than 90% of the global human population, and it knows everything on the internet.
Hahaha I love it, we are truly living in the future guys, appreciate it!
Since that book is older than the model, I wonder if it appeared in its training data.
100%. Perplexity pointed me to at least 6 pdf versions available for free online. There are also lots of study notes
online available for this text. Although I have no idea if it is memorizing answers.
@@Analyse_US it looks like it actually tries to solve the problems.
@@lolilollolilol7773 I agree, it's definitely not just remembering the answer. But is it remembering steps to solving the problems that it found in online study examples? I don't know. But my own testing makes me think it is a big step up in capablity.
@@Analyse_US AI memorizes patterns. If pattern is similar but exercise is different AI will solve it.
Its funny how good it is at some things and how terrible it is at others things still, seems its abilities are heavily dependent on whether examples of the problem were included in its training data. I've asked it to create a 32 bit crc algorithm and it did it perfectly, however when asking it to create considerably more trivial 3 bit crc algorithm (which is uncommon and quite useless), it failed miserably and in fact produced multiple wrong result that got worse and worse as i pointed out the flaws.
It might hallucinate since it knows the answers. So one would need to check all the calculations.
I just filmed a part 2 where it involves some problems where the answer is not known in advance, and problems that I'm confident it did not have access to previously: ruclips.net/video/a8QvnIAGjPA/видео.html
i am doing a phd in ml related field.
Setting fair benchmarks and tests in the current day is quite hard considering the shear scale of data top models are trained on.
And using a famous physique text book isnt really a good attempt.
model o1 reasoning is a massive step up though for sure, i think it could do a similar blind test like this very soon.
Now ask it if 9.11 is bigger than 9.9
it worked for me 👍
bruh it said 9.11 is bigger than 9.9
*is 9.11 bigger than 9.9*
[Thought for 4 seconds]
"Yes, 9.11 is greater than 9.9."
*Are you sure?*
[Thought for 17 seconds]
"You're right, and I apologize for the mistake in my previous response. Actually, 9.9 is greater than 9.11. Here's why:
9.9 can be expressed as 9.90 or 9.900.
9.11 remains 9.11 or 9.110.
When comparing:
9.900 (which is 9.9) is greater than 9.110 (which is 9.11).
Therefore, 9.9 is greater than 9.11."
I suppose we're going to have to make sure ChatGPT re-evaluates their answer every single time
I just did it and it got it right first try:
No, 9.11 is not bigger than 9.9; in fact, 9.11 is less than 9.9. Here's the explanation:
9.11 equals 9 plus 0.11.
9.9 equals 9 plus 0.9.
Since 0.11 is less than 0.9, it follows that:
9
+
0.11
9+0.11 (which is 9.11) is less than
9
+
0.9
9+0.9 (which is 9.9).
Therefore:
9.11 < 9.9
9.11 is bigger than 9.9 when it comes to version numbers.
This is scary. But you have to try with novel problems that the AI has never seen before. Chatgpt has been for sure trained with the Jackson book!
Nevertheless, the reasoning capabilities are astonishing.
A new era has begun.
" Chatgpt has been for sure trained with the Jackson book!"
This is such an oft-repeated nonsense statement though. Just because a problem might be in its training set, the model will not be significantly or any better answering that exact problem than any other problem in the same category.
It's, like. Do you remember every homework math equation you have solved in your life?
Would you be any better at solving a problem you have already encountered once 10 years before vs a similar novel one? No, of course not, unless you have superhuman memory where you keep an exact copy of everything you've done ever.
Similarly, these models don't memorize. They synthesize. They are learning models, not search engines or "neually indexed databases" or whatever.
@@sid.h Ai remembers patterns, not particular problems. And indeed if some pattern is missing AI will miss it, if pattern is well represented AI will solve it well. Better architecture of neural network remembers more and remembers and solves corner cases better. This is what we see in chess networks such as Leela Chess Zero.
I don't understand anything about physics and advanced mathematics, but this video just made me excited for the future again!
As a person doing a bachelor's in EE, this will be very useful for me. Like many, I only wonder what'll happen in the future when it gets even more advanced?
Maybe take my reduced earnings and live off the land somewhere. Away from this.
If it is on the internet, it's in its training data. You would need to find questions that it has not been trained on. This is why benchmarking is so hard
It still impressive the model can accurately comprehend which part of its training data deals with the problem in question.
There are human beings who haven’t mastered this skill lmao
Stop the downplaying. These types of problems are impossible to solve without reasoning. Simple pattern recognition doesn't make this possible.
This cope needs to stop
Imagine 10 years from now
@@xorqwerty8276Star Wars Universe but more humanoid bots on our planet , and billions of them being like gods building anything and everything they imagine. Earth is surrounded by a giant dome that extracts/enhances light from the sun combined with technology that speeds up how fast plants or trees grow, we have a combo of biological machines that have become humans too and are interbreeding half humans half machines. The sun is all we need to survive now. Millions of unique new species emerge.
(10 years is like millions of years if true ASI comes in a year from now)
Even 2 years could be very wtf lol 😂
In less than 3 years lots of knowledge workers will be displaced by AI.
as a high schooler who had taken part in aime, o1 is really impressive. aime problems get so much harder when it gets to the latter half. so 83% (o1) compared to 13%(gpt4o) is huge. the latter solve possibly only solve the first two which are not challenging at all
Open ai say The real Ai o1 version of will be out before the end of 2024.
really? where?
They said about 1 month but will probably be end of 2024 as you say, o1-preview is not the full version there is the "full" o1 that is better yeah
i’ve been watching lots of videos on o1 and i’ve not had a wow moment yet…. but this was it
God, if only I had this back in 2003 when I completed my physics degree. I would have saved myself so much pain and suffering.
Since it is an infamous book, how do we know that it really solved the problems by reasoning and is not just trained on the existing solutions?
Is there the answers at the back of this book?
@@hxlbac No, but an Instructor's Solutions Manual online as PDF and several other sample solutuons.
Maybe it was in its training dataset, would be interesting for you to test something it could not have seen during training
not maybe, for sure. i know people dont have to be all experts in exactly what the black box of deep learning is doing but holy people are so dumb... i wonder if they don't think that IF what they think is true, meaning the models being this great, that in 1 month we wouldn't have to get new discoveries in all science fields...
which will not come because the current AI is 100% data capped. its just memorization of PDFs and manifold recalling
This is a fair point! I have gone ahead and uploaded a Part 2 using problems I'm confident it had not seen before and that I have detailed answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
Solutions are publicly available and most probably in training datasets already. LLMs are good at what they already learned, but even not 100% accurate there.
"to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve advent of code programming problems published december 2023)"
This is true for humans as well. I have worked in aerospace for a major company for many years. When I have to solve a difficult engineering problem I first search for for a so-called a "subject matter expert" in the field and it's quite likely that he or she will know the answer.
Incredible. It would be interesting to see what happens if you give it to solve an incorrect result. Will it show that your result is incorrect and instead give the correct one ?
This was an interesting test. I still think it's funny when people say these models don't understand.
Anyone who's used them enough understands that they do understand.
One nice thing is that you can ask follow up questions as well and ask why something is like that, or ask it to try things in a slightly different way if you want it done differently.
I dunno about latest models, but ChatGPT 3.5 does NOT "understand" anything. It feeds you fake references, and when you repeatedly tell it it is doing so, it will say "sorry" and continue to feed you fake references. That is not its fault--it is not "replying" or "responding" or dong anything a living being is doing. If you give it a training set containing PhD Level physics problems, sure it can solve those problems. That is just predicting output from a training data.
@@woosterjeeves This isn't GPT 3.5 though, and that specific model you mentioned was released back in November of 2022, the first public release of ChatGPT. In the video, you can see it's process of reasoning. ChatGPT doesn't use fake references if it's able to break it down and be able to express why and how it conducts it's problem solving and reasoning. Also to "That is just predicting output from a training data", one, how is that different from learning? Isn't that the point of teachers, to help you predict and reason the output from the input of questions and data? Two, this is just a preview, not the full model, and it is able to do extremely difficult problems like these, explain the reasoning, the process, and give the right answer. We are slowly gravitating towards such a world where such an excuse of prediction of data will no longer be viable to argue about. The model is able to understand. The model is able to think with it's data. It's putting formulas and answers together from it's data, to reason and to form intelligent answers and responses when in contrast, the same problems make the most qualified PhDs scratch their heads. Reminder, these questions take around 1.5 weeks as said to solve ONE problem, GPT-o1 does it in less than 2 minutes.
@@人人人人人人人人 Sure. I am still flummoxed why someone would add "understanding" to a prediction model. If you think prediction (from training data) is equal to understanding, then algorithms are already "understanding". Why hype this one? OTOH, if you think there is something qualitatively different, then we can talk about that. But you cannot claim both.
Are chess computers "understanding" because they can do moves that make super GMS scratch their heads? If so, then the argument is already over. I am only cautioning against use of common-term words ("understanding") which makes one think in terms of sentience. A language model has no such thing.
Does this mean AI will never reach sentience? I never said that--just that the video does not do it for me. I am totally clueless why others are impressed about this model's "understanding", the same way I would be if someone said Alpha-Zero (the chess AI) understands chess. That is all.
Please refer to the chinese room problem.
@@woosterjeeves If you’ve only used 3.5 then I’m not surprised that’s your opinion 😂
the problems are known by the llm already, it has been trained on the issue it dident come to a conclusion through reasoning
to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve some advent of code programming problems published december 2023)
@plonkermaster5244 Your statement it's half true, LLMs need to have existing information to propperly work. However, unless the problem presented needs an actual new theory with previous research and a never seen formula, LLMs can recognize the formulas needed to solve the problem. Good observation.
Lol, it's not the case whatsoever, keep coping though.
@@matiasluna514to be fair, we as humans need to do that as well haha
@@I_INFNITY_I LLMs do not do reasoning. They just give the appearance of doing so. It's one of the most researched topics of LLMs
Not bad for a model smaller than the o1 and based on GPT-4, to be honest I don't know how I'll be able to test upcoming versions like the ones based on GPT-5.
I can't wait to use this on university projects, there are so many things I need to go looking for experts for relatively "easy" tasks.
There’s so much potential in the pipeline. Imagine the o1 techniques applied to image/video generation. Bye-bye obviously fake images, and hello “indiscernible from reality” images.
Also, once o1 is layered on top of GPT-5, we’re likely talking “competing with or beating best-in-the-world level scientists/thought leaders” in different fields. This will fuel more investment into compute farms to create even MORE powerful AI, and multiple instances can run simultaneously to solve problems that would take humanity millennia to solve otherwise. Including AI researching how to improve AI in a self-improving recursive loop that will only stop upon reaching the physical boundaries of the universe.
The way this model was trained was it took physics problems just like that and used a model like gpt4 to create reasoning chains until it could actually derive the correct answer. So it's not surprising. It can already solve textbooks that are well solved already because the answer is very objective and once you get a solid reasoning chain to get to the answer, you can simply train the model on that
"PhD-level". Our undergraduate theoretical physics course in electrodynamics used Jackson lol
smells like a clickbait title you know
Definitely non-standard in the US.
You used it as a vague reference book but you never really read through it.
@@andreaaurigemma2782 Of course I did.
@@Patrick-vv3ig no you didn't and if I had a penny for every shitty undergrad bragging about how they went through hard books without understanding a single thing I'd be rich
And they wrote how this was just a step of many like that to come. In 5-10 years the world may be changed fundamentally, 20 years it’ll be hard to recognize
The real question is can it solve Indian entrance exam questions or not?
You show up in 2005 with this tool and they'd call it AGI
I asked it to find how much the earth would have to be compressed to become opaque to neutrinos: It took it 39 seconds to say 26 km diameter. Totally fascinating how it got there...(01Preview)
The correct answer is ~300 meters. It told me 360 meters.
ChatGPT is truly amazing i wonder what this technology will be like in 10 years i think schools should really use this technology and allow it because it’s not like it’s going away tomorrow. I also think this technology makes it impossible to be ignorant
Lol that was my reaction last year with gpt4 but with programing
If the solutions exist on the internet is it really solving it? Or just analyzing and printing the answers . A true test could be a creating a unique problem with no known solutions
how would you know if it was correct or not if there was no known solution?
@@dieg9054 Maybe he means a problem that isn't posted on the internet since ChatGPT gets its solutions from the downloaded internet data.
That isn't how LLMs work, unless it is a wildly popular problem the small details it learns from the internet gets lost in the neural web
Great video and interesting commentary. It’s interesting you think this might be a good study aid or a tool… however I just watched you take longer to check the answers than the model took to solve them… and your an actual subject matter expert… and as you correctly pointed out, this is just a preview of the full model capabilities. This new type of model will not help experts, but replace them. They will eclipse not only human level knowledge, but human level speed. This is not a tool. It’s disruption personified. With something this good (and as the saying goes, this is as bad as they will ever be as they will only improve from this time onwards) what purpose will it serve to complete university study for 3 years, only to try and find employment in a career that no longer requires humans. Amazing.
It's a machine, like cotton gin, the steam engine, the locomotive, etc. All advance of technology has displaced people from some jobs into others. And yet we are still here. What's the alternative? Structure society to be less productive and less efficient in order to keep people employed in obsolete jobs? That will just slow the growth of the economy and cause a lower standard of living, leading to poverty and hunger as the world population keeps multiplying. It's going to put people out of work, we will be ok. Becoming a Luddite is not going to change anything.
@@msromike123Cars replace horses So will humans suffer the same fate as horses?
@@AlfarrisiMuammar I am glad you are thinking about it now. 1) Truck drivers replaced wagon drivers (not horses.) There are many more truck drivers now. 2) The standard of living for both truck drivers AND horses is higher than ever. Same thing goes for automobiles and horses.
I suspect this was trained on the Jackson book.
7:36 You're right. It's not obvious. Proving that identity is the main crux of the the problem and the model just skips over it. My professor would give me a 0 for that entire problem if I did it this way. This is the reason you shouldn't give it "prove it" problems where you've already given it the answer.
Maybe the follow on question is to ask it to explain that part? I think the takeaway here for me, is that we're 2 years into this resolution, and the basic tech is already at this level. How do we think about the world if we extrapolate another 5-10 years? What do we want and how will this inevitable tech change whats possible?
The first time i watched a video like this was from sixty symbols where they also tried to solve physics problems using the original vanilla Chatgpt 3.5. They didn't get anywhere close to this level. I think the progress is reallty accelerating. I also think that inference time compute is a very real thing and the guys at openai have solved it with this new model in a fundamental way for sure. I think there will be other ways to implement system 2 thinking but i think that using reasoning tokens for accomplishing this is maybe the best and coherent way to go forward. I truly think that with o1, we have the first complete architecture for AGI.
Tip to past questions in: ask ChatGPT 4o to transcribe the picture
the changing my approach part was kinda scary ngl
GPT is most probably trained on the answers since it is a well known book
Maybe, but it showed its work
I mean it gave you step by step how it was able to solve those problems and gives you its insights into how it’s thinking. That is just wild beyond imagination.
I tested to solve a sudoku and it failed. It either give wrong results or change up the original question
Still did much better than 4o when I tested it a few months ago
This is a great to see the model's real ability. Thank you for the demonstration.
You're welcome! I have made a part 2 using new questions that I'm confident it didn't have access to beforehand: ruclips.net/video/a8QvnIAGjPA/видео.html
You have to give it your own problem. The book is part of its training data. That is why it just knew the sum.
Even if that was the case, the simple fact that it worked out the path to the solution is impressive. But you are likely wrong.
@lolilollolilol7773
LLMs literally predict the next word based on probability. If the answer isn’t in the training data it can’t answer the question. It doesn’t have reasoning skills.
But they do have reasoning skills ,
@@Lucid.28 No they dont.
@@lewie8136they recognize patterns like we do... We don't really think, we also predict things based on the patters we see... We just named it thinking.
FYI you should put new problems in new chats to avoid polluting the context window
The first one is the easy one? Yet at the same time you're amazed that it solved it in 122 seconds, while you mention that it generally takes others 1.5 week.
he shouldve clarified that jackson problems can take around 10 hours to 10 days. that question probably takes a couple of days to do but not 10 days
it was messing up on symbolab generated Mclauren & taylor series problems for me last night, but its a massive improvement from last time i tried to get it to do math a couple years ago
Oh interesting, I'm working on another video that involve problems that shouldn't exist on the internet that my professors created themselves.
This should make you seriously question the way we do education. If human value isn't in solving problems that are hard, yet clearly defined, then why teach that? You teach it because you need to know it to solve higher level problems. But maybe we no longer need to also train the skill of doing the calculations. So long as you understand the concept properly you can move on without spending a week pushing through the math. That's going to be very hard for some people to accept.
Understanding the concept, unfortunately, typically requires dozens of practical experiences. This us why teaching math starting with a calculator leads to lesser learning than introducing a calculator following basic practice
@@JaredCooney very true. But I think students will be doing less of it and learning more. We’ve seen this pattern before.
Chat we're cooked
I think giving it problems where it asks to arrive at a specific solution (shown in the problem itself) is not a good way to evaluate it.
I bet the results would be very different if you just asked it to solve the problem by itself.
o1 will get better in the coming months and these problems with be easy for it perhaps.
I wouldn’t be surprised. The true limit of LLMs are problems with no real known solutions. The advancements still do not change the (oversimplified here) model of fitting a curve.
OpenAI can do another round of RL and CoT on these specific problems, but all it takes is another set of problem that it really hasn’t encountered that well. It still suffers from the generic flipping an image of a dog and having the AI shit itself.
@@user-jm8fj7ez8s Yup; and that's why even Sam Altman admitted it's "more impressive on your first use" [than when you use it a lot], and that it's "not AGI". But this is already INCREDIBLY useful; think about it like this: even if it can't automate most jobs, what % of humans ACTUALLY work on problems that have 0 prior data in terms of how to solve them? It's only less than maybe 1% if we're being generous. This means the AI will soon be able to eclipse the regular work 99% of humans do, without any further breakthroughs. And THAT is the mindblowing part! (this was generated by O1-mini).
@@user-jm8fj7ez8syeah it does not reason from first principles. Unlike humans, it doesnt explicitly operate through spatial & relational reasoning from the ground up.
They just had o1 trained by asking it millions of questions, letting it think, and then reinforcing the reasoning that led to the right answers. So basically o1 will know the most accurate and efficient chain of reasoning for familiar questions.
Will this eventually turn into a super intelligent reasoning engine when scaled up? Nobody knows really but I personally doubt
Just as a comment: it looks impressive. However, to truly judge how good the model is, one (unfortunately 😬) needs to read the proofs line by line and examine the arguments in depth. From my experience with GPT-4, the proofs often look good, but they sometimes contain flaws when examined more closely.
Just finished recording a video where I do that more or less with some problems I have the answer to and am pretty sure the problem didn't exist on the internet!
Here is part if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
We need AI to replace politicians, ASAP. The 'presidential debate' was a travesty.
best realization I've heard in weeks
An "ai president" as long as there isn't a person telling it how to think could be the best thing for any country. I would still give it a few years before doing so tho and make sure it's main objective is to do the best for the country.
I support this idea 1000%
i think current o1-preview is far more capable to govern than any human. Of course, it would need some readjustments like a more continous existence, without resetting itself, and a virtually infinite context window so it can always take into account everything that has ever happened in the past
You have no clou how ai works 😂
what if this book was in the training base?
Yes, probably. But this is only o1 model, every 6 months openAI release new model. What do you think, an o9 can do?
You need the solutions book, if it exists. But the model backtracked, so it tried several methods.
nothing ... gpt4o had the same training data and fails to sole it
But one moment, how can you verify that the GPT did not know about this problem before and is only recreating it from his own knowledge base? You need to give something that you are 100 sure that it doesn't know. For that the best way to do it is to ask if it actually knows the solution directly if GPT4o knows the solution, it is likely that the o1 knows it.
This is a good point! I just recorded a Part 2 with some new problems that I believe it didn't have in its knowledge-base. Will be uploading shortly!
I don't know if it's seen these problems before, but it was tested repeatedly using newly made-up logical and reasoning problems and it solved them. I showed it some work that was unpublished (but actually valid and verifiable) so I knew it hadn't seen it, and it's response was the same as I would expect from someone who was very experienced in the field and seeing it for the first time. So it definitely can reason (in its own way) on its own without already knowing the answer. I highly recommend the "Sparks of AGI" paper or lecture that goes into this in detail.
I gave ChatGPT a couple of master’s degree computer science problems, both solutions were wrong. I had to tell her they were wrong, she apologized and corrected herself. It wasn’t reliable.
The free model or paid model?
@@magicalgibus3006 I used model 4 and 4o, free though.
@@jekyll366 Habibi he obviously tests the state-of-the-art model o1..
i gave it multiple undergrad problems from algorithms and data structures, its certainly usefull but 90% of the time fails first try, then maybe in about half of those times you can instruct it how to fix the solution. Im talking about the 4o model. I find it useful for finding dumb mistakes, like missing some boundary conditions or just forgetting an i++ and stuff like that
@@jekyll366 Then your assessment was useless, we are talking about O1 preview.
So much cope in the comments
If you want to test the actual knowledge of the model then use textbooks that were compiled with questions created after the knowledge cutoff of the model. This doesn't reflect its actual knowledge but prior knowledge of the model.
I think a student who can't check the answer for correctness may get his ‘points’, but if the professor asks, the gaps in his understanding will quickly become apparent.
But surely the model has already been trained on that textbook?
It's a fair point, I've gone ahead and filmed and recorded a part 2 that involves problems I'm confident it hadn't seen before: ruclips.net/video/a8QvnIAGjPA/видео.html
I’m worried that this doesn’t show anything that somewhere and it’s training data. It has the answer and it’s memorized how to explain the answer, but not the underlying logic.
Most of the time if it can give you the answer, it is enough.
the way so is trained is lossful the ai doesn't have access to word for word all of the info they were trained on it doesn't have anything "memorized"
@@o_o9039regardless, it synthethizes training data to a big extent, so in some sense, it is "memorized"
How can you be sure that the model has not actually been learned by this book?
I don't think you can be, but the fact that it tried one approach and then backtracked and did another is pretty good evidence it's not just a regurgitated answer.
That is a valid point, this is why I have gone ahead and made a part 2 using problems that I'm confident aren't floating around on the internet: ruclips.net/video/a8QvnIAGjPA/видео.html
I would have done the test of giving the answer with some error, for example an extra factor of 2, or an arctan instead of arcsin, and see if it gets the true answer anyway and recognizes the incorrect input. That would make a very convincing test.
OMFG, another year.... everyone going to have a PHD.
Or no one, because why do something that a machine does better?
@@hipotures No, you don't get it. The standards have been raised. The hyper intelligent.... are going to be on steroids. I know i am.
Imagine if someone at 18 with an iq of 145+ with AI tools at their disposal? now imagine a decade of this progress and the new generation coming in.
We're going to see hyper geniuses.
Generally I would love to see some problems where you do not need to prove solution you know in advance, So not "Show that" but what is ..... I wonder if those proofs are actually flawles or they just look convincing
Well, until you yourself don't know whether what it did was correct, we can't say for sure. It is surprising nonetheless, yes, however if you had given these problems to the earlier version, it would have also arrived at the conclusion required, i think. It would have just done some mumbo jumbo and forcefully arrived at the conclusion, no matter what it got wrong in the process. This time around though, it looks like it actually did all the things correctly in its "reasoning" process.
Hey man! You should do a video with scores, like, you will do 5 tests, and allow 5-shot for each problem to each model. And then see out of 5 what's the score. Do this for GPT4o vs O1 preview, you can also do O1 vs Claude sonnet!
Like a "LLM's Face Off"
I actually did a stream like that last night! Gave o1, 4o, Gemini Advanced, Claude Sonnet 3.5, Grok 2, and LLama 3.1 a college math exam! ruclips.net/user/liveGdN4MFxLQUU?si=flPSFIxx85Uqyoz7
All i see is that in a few years AI will be able to do everything and most of us will be obsolete
Keep in mind these companies lie A LOT! Like the bar exam, it tests in the 60th percentile with the initially hidden caveats in place
@@marcusrosales3344True, but the money doesn’t lie. Until the bubble bursts, very smart people have bet tens of billions of dollars on it being game-changing. And notice how the goalposts keep moving back? “It’s ONLY getting 60% on the bar” is a far cry from 5 years ago when AI could only put out gibberish.
10:12 Not a physics student, but looking at the solution it seems like it ignored the first part, took the M expression for granted, and only applied the approximations. Which I believe are much much easier than getting the expression for M itself :D
I think you're right! Sorry I missed that, I was a bit flustered and tired to be honest haha, but I'm working on a new video that will involve problems that it should have never seen before!
@@KyleKabasares_PhD No worries at all, the solution it provided felt a bit too short for such a problem. I will definitely watch the next video since as others pointed out, having it solve unknown problems will be much more useful and interesting.
The video is now out if you are interested! ruclips.net/video/a8QvnIAGjPA/видео.html
Great Video. Can it be the case that solutions were part of training data of this model as earlier GPTs had a lot of books as training data?
It is possible but hard to know
It ABSOLUTELY was. Jackson is such a common book used in grad EM.
This video has almost no substance, there's no verification on the accuracy of the logic. Guy also said he didn't know if it was correct.
As Andrej Karpathy recently said in an interview, ideal training data for a reasoning model would include step-by-step reasoning. (like how we teach children in school) t's a bit amazing that bulk internet data has enough of this embedded reasoning to get us the current results.
OpenAI is using q*star to refine their synthetic data, no doubt with successful step-by-step reasoning in that data. This will take a couple years to reach the next model (that's how long it takes to train a new model) but it's start of a virtuous cycle, where ever capable models refine future synthetic data.
Thank you for your comment, this step-by-step reasoning approach is definitely a game-changer! I have also just uploaded a part 2 to my channel if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
It would be cool if one of the creators of these problems could get paid to make original ones on the spot and feed it to o1.
This book is probably in its training data
so why did it try different approaches and not just the correct one?
@@japiye I am not sure but I do know it was trained on those types of problems so it’s not truly deriving those problems cold did you notice it would pull numbers out of nowhere. It’s still really impressive and a very useful model I think we skeptical that it’s really the equivalent of a physics grad student, if you watch ai explained video it gets common sense problems wrong
@@Mayeverycreaturefindhappiness Still impressive
@@trueuniverse690 yes
@@japiye As it probabilistically selects the next word, it will select different words compared to what is has seen. This is what makes the model generate new sentences, but it is able to evaluate it's chain of thought which leads to the correct one or a better result. As the problems are found online and the jackson problems are well known in the field for many years previously, it must be in it's training set.
What happens when the full orion model drops soon? This is like half as "smart"
Your reaction clips bout to go viral bro 🚀