Hi everyone, thank you so much for the feedback! I couldn't have expected this kind of attention on my video in the first 48 hours. I've taken some of your suggestions in the comments and have created a Part 2: ruclips.net/video/a8QvnIAGjPA/видео.html Please consider watching!
it was funny ! the thing is to keep up with the technolgys ad current innovations being deployed as it should not be hard to emulate these neural networks wth the open sourced models ! the aim is to train the local models as best you can at the highest point you capability but keep aware that technolgy needs to adavce to hande these heavy tensor calcualtions hence local model will be able to perform these tasks without the eed of outseide intervenetion so get an early start ! or it will be a nightmre of stuy to catch up : it has taken me a full year of constant python etc doing this training and implementation to keep up and get ahead ! .. that gap is widening @ Just expect to be able to host a 24b or 70b local within the next two years ! , a full genrative model ! ( so you could host multiple mini 7b agents at the same time ! hence a power full system ! ( agentic ! )
It is worth noting that GPT has probably 'read' ever thousands of answer books on Jackson, As well as all of Stack Exchange, as well as several study guides on Jackson in particular. So if you want to really test GPT ability you probably need to create novel questions that will not be found in any textbook or online form.
Note 2: I haven’t looked through the answers, but in cases where GPT knows what the answer should be, it will make up crap to fill in the middle. I’ve asked it many PhD level math questions where it has hallucinated its way to a “plausible” answer.
@@KyleKabasares_PhD that's not what he was saying. You were providing GPT 1o the answers, so of course it would give you the right answers since you provided them for it. To know if it truly solves PhD questions, you shoudn't give it questions like :"prove that this formula is verified" but rather " what is the formula for ... ?"
@@omatbaydaui722 I understood what he was saying. I’ve verified it doing problems correctly from start to finish (non-Jackson) without knowing the answer in advance! But in those cases I actually did the problem unlike here, so I’m planning on revisiting the problems in this video.
I just tried o1 with some fairly simple integrals, which it got badly wrong and I had to guide it to the correct answer. So I'd advise checking every step in the derivation.
The issue with this "test" is that the solutions to one of the most famous physics books are certainly in it's training data. Give it a problem for which there is no known solution. Or at a minimum give it a problem from an unknown text book. Find an old physics or math book from 100 years ago that has no digital copies. Then ask it questions from that and see how it does.
@@amcmillion3 yes I did it and it’s not very accurate , I feed in 3 jee advance questions and out of which it could only answer 1 correctly , 1 he did it wrong even with hints and 1 he had to solve wrong first and than with hints it was able to solve it
If they were JEE Mains level questions, than solving 1/3 would put it at the same level as those qualifying in top 1000 for the exams. FYI, The highest marks in JEE Mains were usually around 33-35%. I’d wager that would be folks with l an IQ level of ~130+, that is pretty damn good for GenAI. On the normal distribution curve of IQ where 100 is the population average, 130 should at least yield 1 or 2 sigma of confidence level to the statement “GenAI has definitively exceeded the average of the human intelligence level”
They told me AI would replace hard labourers and Fast food workers first leaving us more time to think so I went to college now I'm college and I'm the first one being replaced.
yeah it's everyone. from labourers to physicists, Ai could do everything much more effectively. the biggest surprise was creativity, that Ai could create art.
Just don’t be a data analyst and if you want to be a computer scientist get out of school and get into the languages and start building plenty of budding industries right now
100% this. It's given the answer to work towards. I do not have enough knowledge in this area to prove that it came to it's conclusions incorrectly, but it's a well known quirk.
Testing it with a book that is “infamous” probably isnt a great benchmark considering that it would mean that there is a considerable database related to that specific book it could read from. If you could test it on a novel problem that would be better
You have to remember that this book was probably directly in the chatGPT training data, so this may not be a valid measure of novel problem solving ability
In general, ML models shouldn’t memorize the training data. A lot of effort is put into ensuring the model learns how to do the problem rather than memorizing.
I haven’t watched the solving yet, but immediately I would like to point out that choosing problems which have known solutions may mean that the model has already seen (and simply memorized) the correct answer! A better test is to ask it an impossible problem or one that solutions don’t exist for and then try to see if it’s generated solution is correct.
Absolutely. If you simply Google the first 15 words of Problem 1, the very first result is a pdf document with a detailed, step-by-step solution. If anything, assuming the steps provided by o1 are correct, it just demonstrates it's decent at summarising search results... The same goes for programming. A lot of people get easily impressed when GPT "writes" a 50-line script that's basically within the first 3-4 StackOverflow posts after a Google search. I mean, yeah, I won't deny it's really convenient that the tool can save you a few clicks, but saying that it has an understanding of the answer it's giving you is (as of today) still a stretch.
if you know how ai works the way they are trained is lossful they don't have access to word for word of every bit of their training info if they did these models would be terabytes upon terabytes in size and would be extremely slow.
@@o_o9039 I know how they work, and I'm not saying the model has all the information stored in its parameters, but it's no secret GPT can indeed search the web and summarize its findings. Copilot (based on GPT4) openly provides sources for almost everything it spits out.
i asked it to calculate some stuff (quantum mechanics) for me and it also did some difficult step without explanation. i asked it to prove that step and it gave me a proof containing 1 mistake, but i wasn't sure and asked about that step, then it realized it was wrong, explained exactly why it was wrong, fixed it, and remade the calculation corrrectly
The key is having the answer before hand so it can guess from both ends and connect them. Ask it to evaluate a parameterized surface integral even with wolfram plugins and it will make mistakes.
Instead of telling it the answers, try asking it to find them. When I did this, it got the first one to an infinite sum but didn't reduce the infinite sum to the final answer: pretty good! For the second one, it had an extra factor of 1/pi that it made up. For the third it completely disregarded the angular dependence of the scattering and failed.
The model has most likely been trained on these problems and their solutions, since they've been around on the internet for a long time. So it isn't really a good test of its abilities since it has just memorized the solutions. That being said, I also tried it with some problems from the 2024 International Math Olympiad, and it was able to get at least two (out of six) of them correct consistently. I only tried problems where the answer was a single number/expression, going through proofs would be much more work. The model's knowledge cutoff should be in October 2023 so it shouldn't be possible for it to have seen these problems before. It's still hard to tell since OpenAI isn't being very transparent with their methodology, but if the model is actually able to solve novel IMO level problems it has never seen before, color me impressed.
The model's performance is undoubtedly impressive, but if it was trained on this book (which seems likely), it's not truly generalizing to new data. For a fair assessment of its capabilities, models should be tested on novel, unforeseen problems, rather than those for which the answers are already available. In practice, models are typically evaluated on fresh data to gauge how well they can generalize. To accurately measure performance at this level, problems should be novel and manually verified, even if that takes considerable time (1.5 weeks or more).
I believe the book does not have the answers to the problems, so even if it was trained with the book it shouldnt help it to solve problems. Still it is possible that it just took the answers from some physics subreddit post and just pasted it
" OpenAI's new AI model, "o1," has achieved a significant milestone by scoring around 120 on the Norway Mensa IQ test, far surpassing previous models. In a recent test, it got 25 out of 35 questions right, which is notably better than most humans. A critical factor in these tests is ensuring the AI doesn't benefit from pre-existing training data. To address this, custom questions were created that had never been publicly available, and o1 still performed impressively"
There is a problem with the test: Since the answer ,,show that.." is given, the AI will always show the correct answer, the reasoning might be flawed. It would be better to cut out the correct answer from the problem and see what AI will answer then.
This applies to humans completing the problem as well, and there was an effort made to check the steps. I agree, it might be interesting to see if it could though (although if it succeeds, will likely express it in an different form which may be hard to verify).
Prior versions had bogus steps that didn’t really follow legitimate steps, and units were often fouled up. Definitely deserves to be looked at deeper to see if that has improved.
If you want to know, the steps are simply contextual changes, it is essentially a gpt that has the instructions self-set, and the output of its thinking steps is the instructions it is providing itself at each step. it works because by shifting context at each step rather than only a single context of the original message and response, it is able to aproach problems itteratively from different 'perspectives'
I just filmed a part 2 where it involves some problems where the answer is not known in advance, and problems that I'm confident it did not have access to previously: ruclips.net/video/a8QvnIAGjPA/видео.html
As a person doing a bachelor's in EE, this will be very useful for me. Like many, I only wonder what'll happen in the future when it gets even more advanced? Maybe take my reduced earnings and live off the land somewhere. Away from this.
*is 9.11 bigger than 9.9* [Thought for 4 seconds] "Yes, 9.11 is greater than 9.9." *Are you sure?* [Thought for 17 seconds] "You're right, and I apologize for the mistake in my previous response. Actually, 9.9 is greater than 9.11. Here's why: 9.9 can be expressed as 9.90 or 9.900. 9.11 remains 9.11 or 9.110. When comparing: 9.900 (which is 9.9) is greater than 9.110 (which is 9.11). Therefore, 9.9 is greater than 9.11." I suppose we're going to have to make sure ChatGPT re-evaluates their answer every single time
I just did it and it got it right first try: No, 9.11 is not bigger than 9.9; in fact, 9.11 is less than 9.9. Here's the explanation: 9.11 equals 9 plus 0.11. 9.9 equals 9 plus 0.9. Since 0.11 is less than 0.9, it follows that: 9 + 0.11 9+0.11 (which is 9.11) is less than 9 + 0.9 9+0.9 (which is 9.9). Therefore: 9.11 < 9.9
Its funny how good it is at some things and how terrible it is at others things still, seems its abilities are heavily dependent on whether examples of the problem were included in its training data. I've asked it to create a 32 bit crc algorithm and it did it perfectly, however when asking it to create considerably more trivial 3 bit crc algorithm (which is uncommon and quite useless), it failed miserably and in fact produced multiple wrong result that got worse and worse as i pointed out the flaws.
This is scary. But you have to try with novel problems that the AI has never seen before. Chatgpt has been for sure trained with the Jackson book! Nevertheless, the reasoning capabilities are astonishing. A new era has begun.
" Chatgpt has been for sure trained with the Jackson book!" This is such an oft-repeated nonsense statement though. Just because a problem might be in its training set, the model will not be significantly or any better answering that exact problem than any other problem in the same category. It's, like. Do you remember every homework math equation you have solved in your life? Would you be any better at solving a problem you have already encountered once 10 years before vs a similar novel one? No, of course not, unless you have superhuman memory where you keep an exact copy of everything you've done ever. Similarly, these models don't memorize. They synthesize. They are learning models, not search engines or "neually indexed databases" or whatever.
@@sid.h Ai remembers patterns, not particular problems. And indeed if some pattern is missing AI will miss it, if pattern is well represented AI will solve it well. Better architecture of neural network remembers more and remembers and solves corner cases better. This is what we see in chess networks such as Leela Chess Zero.
@@xorqwerty8276Star Wars Universe but more humanoid bots on our planet , and billions of them being like gods building anything and everything they imagine. Earth is surrounded by a giant dome that extracts/enhances light from the sun combined with technology that speeds up how fast plants or trees grow, we have a combo of biological machines that have become humans too and are interbreeding half humans half machines. The sun is all we need to survive now. Millions of unique new species emerge. (10 years is like millions of years if true ASI comes in a year from now) Even 2 years could be very wtf lol 😂
i am doing a phd in ml related field. Setting fair benchmarks and tests in the current day is quite hard considering the shear scale of data top models are trained on. And using a famous physique text book isnt really a good attempt. model o1 reasoning is a massive step up though for sure, i think it could do a similar blind test like this very soon.
If it is on the internet, it's in its training data. You would need to find questions that it has not been trained on. This is why benchmarking is so hard
It still impressive the model can accurately comprehend which part of its training data deals with the problem in question. There are human beings who haven’t mastered this skill lmao
Stop the downplaying. These types of problems are impossible to solve without reasoning. Simple pattern recognition doesn't make this possible. This cope needs to stop
as a high schooler who had taken part in aime, o1 is really impressive. aime problems get so much harder when it gets to the latter half. so 83% (o1) compared to 13%(gpt4o) is huge. the latter solve possibly only solve the first two which are not challenging at all
100%. Perplexity pointed me to at least 6 pdf versions available for free online. There are also lots of study notes online available for this text. Although I have no idea if it is memorizing answers.
@@lolilollolilol7773 I agree, it's definitely not just remembering the answer. But is it remembering steps to solving the problems that it found in online study examples? I don't know. But my own testing makes me think it is a big step up in capablity.
This was an interesting test. I still think it's funny when people say these models don't understand. Anyone who's used them enough understands that they do understand. One nice thing is that you can ask follow up questions as well and ask why something is like that, or ask it to try things in a slightly different way if you want it done differently.
I dunno about latest models, but ChatGPT 3.5 does NOT "understand" anything. It feeds you fake references, and when you repeatedly tell it it is doing so, it will say "sorry" and continue to feed you fake references. That is not its fault--it is not "replying" or "responding" or dong anything a living being is doing. If you give it a training set containing PhD Level physics problems, sure it can solve those problems. That is just predicting output from a training data.
@@woosterjeeves This isn't GPT 3.5 though, and that specific model you mentioned was released back in November of 2022, the first public release of ChatGPT. In the video, you can see it's process of reasoning. ChatGPT doesn't use fake references if it's able to break it down and be able to express why and how it conducts it's problem solving and reasoning. Also to "That is just predicting output from a training data", one, how is that different from learning? Isn't that the point of teachers, to help you predict and reason the output from the input of questions and data? Two, this is just a preview, not the full model, and it is able to do extremely difficult problems like these, explain the reasoning, the process, and give the right answer. We are slowly gravitating towards such a world where such an excuse of prediction of data will no longer be viable to argue about. The model is able to understand. The model is able to think with it's data. It's putting formulas and answers together from it's data, to reason and to form intelligent answers and responses when in contrast, the same problems make the most qualified PhDs scratch their heads. Reminder, these questions take around 1.5 weeks as said to solve ONE problem, GPT-o1 does it in less than 2 minutes.
@@人人人人人人人人 Sure. I am still flummoxed why someone would add "understanding" to a prediction model. If you think prediction (from training data) is equal to understanding, then algorithms are already "understanding". Why hype this one? OTOH, if you think there is something qualitatively different, then we can talk about that. But you cannot claim both. Are chess computers "understanding" because they can do moves that make super GMS scratch their heads? If so, then the argument is already over. I am only cautioning against use of common-term words ("understanding") which makes one think in terms of sentience. A language model has no such thing. Does this mean AI will never reach sentience? I never said that--just that the video does not do it for me. I am totally clueless why others are impressed about this model's "understanding", the same way I would be if someone said Alpha-Zero (the chess AI) understands chess. That is all.
not maybe, for sure. i know people dont have to be all experts in exactly what the black box of deep learning is doing but holy people are so dumb... i wonder if they don't think that IF what they think is true, meaning the models being this great, that in 1 month we wouldn't have to get new discoveries in all science fields... which will not come because the current AI is 100% data capped. its just memorization of PDFs and manifold recalling
This is a fair point! I have gone ahead and uploaded a Part 2 using problems I'm confident it had not seen before and that I have detailed answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
@@Patrick-vv3ig no you didn't and if I had a penny for every shitty undergrad bragging about how they went through hard books without understanding a single thing I'd be rich
The first time i watched a video like this was from sixty symbols where they also tried to solve physics problems using the original vanilla Chatgpt 3.5. They didn't get anywhere close to this level. I think the progress is reallty accelerating. I also think that inference time compute is a very real thing and the guys at openai have solved it with this new model in a fundamental way for sure. I think there will be other ways to implement system 2 thinking but i think that using reasoning tokens for accomplishing this is maybe the best and coherent way to go forward. I truly think that with o1, we have the first complete architecture for AGI.
Great video and interesting commentary. It’s interesting you think this might be a good study aid or a tool… however I just watched you take longer to check the answers than the model took to solve them… and your an actual subject matter expert… and as you correctly pointed out, this is just a preview of the full model capabilities. This new type of model will not help experts, but replace them. They will eclipse not only human level knowledge, but human level speed. This is not a tool. It’s disruption personified. With something this good (and as the saying goes, this is as bad as they will ever be as they will only improve from this time onwards) what purpose will it serve to complete university study for 3 years, only to try and find employment in a career that no longer requires humans. Amazing.
It's a machine, like cotton gin, the steam engine, the locomotive, etc. All advance of technology has displaced people from some jobs into others. And yet we are still here. What's the alternative? Structure society to be less productive and less efficient in order to keep people employed in obsolete jobs? That will just slow the growth of the economy and cause a lower standard of living, leading to poverty and hunger as the world population keeps multiplying. It's going to put people out of work, we will be ok. Becoming a Luddite is not going to change anything.
@@AlfarrisiMuammar I am glad you are thinking about it now. 1) Truck drivers replaced wagon drivers (not horses.) There are many more truck drivers now. 2) The standard of living for both truck drivers AND horses is higher than ever. Same thing goes for automobiles and horses.
Not bad for a model smaller than the o1 and based on GPT-4, to be honest I don't know how I'll be able to test upcoming versions like the ones based on GPT-5. I can't wait to use this on university projects, there are so many things I need to go looking for experts for relatively "easy" tasks.
There’s so much potential in the pipeline. Imagine the o1 techniques applied to image/video generation. Bye-bye obviously fake images, and hello “indiscernible from reality” images. Also, once o1 is layered on top of GPT-5, we’re likely talking “competing with or beating best-in-the-world level scientists/thought leaders” in different fields. This will fuel more investment into compute farms to create even MORE powerful AI, and multiple instances can run simultaneously to solve problems that would take humanity millennia to solve otherwise. Including AI researching how to improve AI in a self-improving recursive loop that will only stop upon reaching the physical boundaries of the universe.
I mean it gave you step by step how it was able to solve those problems and gives you its insights into how it’s thinking. That is just wild beyond imagination.
@lolilollolilol7773 LLMs literally predict the next word based on probability. If the answer isn’t in the training data it can’t answer the question. It doesn’t have reasoning skills.
@@lewie8136they recognize patterns like we do... We don't really think, we also predict things based on the patters we see... We just named it thinking.
The way this model was trained was it took physics problems just like that and used a model like gpt4 to create reasoning chains until it could actually derive the correct answer. So it's not surprising. It can already solve textbooks that are well solved already because the answer is very objective and once you get a solid reasoning chain to get to the answer, you can simply train the model on that
An "ai president" as long as there isn't a person telling it how to think could be the best thing for any country. I would still give it a few years before doing so tho and make sure it's main objective is to do the best for the country.
i think current o1-preview is far more capable to govern than any human. Of course, it would need some readjustments like a more continous existence, without resetting itself, and a virtually infinite context window so it can always take into account everything that has ever happened in the past
I asked it to find how much the earth would have to be compressed to become opaque to neutrinos: It took it 39 seconds to say 26 km diameter. Totally fascinating how it got there...(01Preview)
And they wrote how this was just a step of many like that to come. In 5-10 years the world may be changed fundamentally, 20 years it’ll be hard to recognize
If the solutions exist on the internet is it really solving it? Or just analyzing and printing the answers . A true test could be a creating a unique problem with no known solutions
You're welcome! I have made a part 2 using new questions that I'm confident it didn't have access to beforehand: ruclips.net/video/a8QvnIAGjPA/видео.html
to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve some advent of code programming problems published december 2023)
@plonkermaster5244 Your statement it's half true, LLMs need to have existing information to propperly work. However, unless the problem presented needs an actual new theory with previous research and a never seen formula, LLMs can recognize the formulas needed to solve the problem. Good observation.
Solutions are publicly available and most probably in training datasets already. LLMs are good at what they already learned, but even not 100% accurate there.
"to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve advent of code programming problems published december 2023)"
The first one is the easy one? Yet at the same time you're amazed that it solved it in 122 seconds, while you mention that it generally takes others 1.5 week.
ChatGPT is truly amazing i wonder what this technology will be like in 10 years i think schools should really use this technology and allow it because it’s not like it’s going away tomorrow. I also think this technology makes it impossible to be ignorant
@@japiye I am not sure but I do know it was trained on those types of problems so it’s not truly deriving those problems cold did you notice it would pull numbers out of nowhere. It’s still really impressive and a very useful model I think we skeptical that it’s really the equivalent of a physics grad student, if you watch ai explained video it gets common sense problems wrong
@@japiye As it probabilistically selects the next word, it will select different words compared to what is has seen. This is what makes the model generate new sentences, but it is able to evaluate it's chain of thought which leads to the correct one or a better result. As the problems are found online and the jackson problems are well known in the field for many years previously, it must be in it's training set.
Just as a comment: it looks impressive. However, to truly judge how good the model is, one (unfortunately 😬) needs to read the proofs line by line and examine the arguments in depth. From my experience with GPT-4, the proofs often look good, but they sometimes contain flaws when examined more closely.
Just finished recording a video where I do that more or less with some problems I have the answer to and am pretty sure the problem didn't exist on the internet!
Incredible. It would be interesting to see what happens if you give it to solve an incorrect result. Will it show that your result is incorrect and instead give the correct one ?
I appreciate the view! I am working on creating a Part 2 that will involve open-ended questions that I was given as a graduate student in school that I don't believe come from any textbooks, so stay tuned for that!
It's a fair point, I've gone ahead and filmed and recorded a part 2 that involves problems I'm confident it hadn't seen before: ruclips.net/video/a8QvnIAGjPA/видео.html
Hey man! You should do a video with scores, like, you will do 5 tests, and allow 5-shot for each problem to each model. And then see out of 5 what's the score. Do this for GPT4o vs O1 preview, you can also do O1 vs Claude sonnet! Like a "LLM's Face Off"
I actually did a stream like that last night! Gave o1, 4o, Gemini Advanced, Claude Sonnet 3.5, Grok 2, and LLama 3.1 a college math exam! ruclips.net/user/liveGdN4MFxLQUU?si=flPSFIxx85Uqyoz7
This should make you seriously question the way we do education. If human value isn't in solving problems that are hard, yet clearly defined, then why teach that? You teach it because you need to know it to solve higher level problems. But maybe we no longer need to also train the skill of doing the calculations. So long as you understand the concept properly you can move on without spending a week pushing through the math. That's going to be very hard for some people to accept.
Understanding the concept, unfortunately, typically requires dozens of practical experiences. This us why teaching math starting with a calculator leads to lesser learning than introducing a calculator following basic practice
@@hipotures No, you don't get it. The standards have been raised. The hyper intelligent.... are going to be on steroids. I know i am. Imagine if someone at 18 with an iq of 145+ with AI tools at their disposal? now imagine a decade of this progress and the new generation coming in. We're going to see hyper geniuses.
it was messing up on symbolab generated Mclauren & taylor series problems for me last night, but its a massive improvement from last time i tried to get it to do math a couple years ago
I will admit, even though this is cool and impressive It likely was trained off of this. I recommend trying to create some novel problem yourself and then testing it but the GPQA benchmark they used is completely Google proof and is not on the Internet it was created by PhD‘s, who were in physics, chemistry, and biology and it outperformed them in answering the tests that doesn’t mean its as good as them that just means it’s good at answering questions that PhD’s in those fields would struggle with I know nothing about physics. I came here for the AI stuff
the fact that it tired different apraoches, failed and tried others does not proof your point, but i lnow how hard it is for humans to accept that a machine can match their intelligence but i guess this time there is no excaping it..
This is a good point! I've gone ahead and made a part 2 using newer problems that I don't believe it had access to in its training set and that I do have the answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
Realistically, it probably does have that math problem in its data. However, unless they specifically overfitted their data with that one question hundreds of times, it's not likely the ai model simply remembered the question or something.
Thank you so much! I've gone ahead and created a part 2 based on the feedback to this video, I hope you will consider watching it! ruclips.net/video/a8QvnIAGjPA/видео.html
As Andrej Karpathy recently said in an interview, ideal training data for a reasoning model would include step-by-step reasoning. (like how we teach children in school) t's a bit amazing that bulk internet data has enough of this embedded reasoning to get us the current results. OpenAI is using q*star to refine their synthetic data, no doubt with successful step-by-step reasoning in that data. This will take a couple years to reach the next model (that's how long it takes to train a new model) but it's start of a virtuous cycle, where ever capable models refine future synthetic data.
Thank you for your comment, this step-by-step reasoning approach is definitely a game-changer! I have also just uploaded a part 2 to my channel if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
I tested to solve a sudoku and it failed. It either give wrong results or change up the original question Still did much better than 4o when I tested it a few months ago
I wouldn’t be surprised. The true limit of LLMs are problems with no real known solutions. The advancements still do not change the (oversimplified here) model of fitting a curve. OpenAI can do another round of RL and CoT on these specific problems, but all it takes is another set of problem that it really hasn’t encountered that well. It still suffers from the generic flipping an image of a dog and having the AI shit itself.
@@user-jm8fj7ez8s Yup; and that's why even Sam Altman admitted it's "more impressive on your first use" [than when you use it a lot], and that it's "not AGI". But this is already INCREDIBLY useful; think about it like this: even if it can't automate most jobs, what % of humans ACTUALLY work on problems that have 0 prior data in terms of how to solve them? It's only less than maybe 1% if we're being generous. This means the AI will soon be able to eclipse the regular work 99% of humans do, without any further breakthroughs. And THAT is the mindblowing part! (this was generated by O1-mini).
@@user-jm8fj7ez8syeah it does not reason from first principles. Unlike humans, it doesnt explicitly operate through spatial & relational reasoning from the ground up. They just had o1 trained by asking it millions of questions, letting it think, and then reinforcing the reasoning that led to the right answers. So basically o1 will know the most accurate and efficient chain of reasoning for familiar questions. Will this eventually turn into a super intelligent reasoning engine when scaled up? Nobody knows really but I personally doubt
@@marcusrosales3344True, but the money doesn’t lie. Until the bubble bursts, very smart people have bet tens of billions of dollars on it being game-changing. And notice how the goalposts keep moving back? “It’s ONLY getting 60% on the bar” is a far cry from 5 years ago when AI could only put out gibberish.
Well, you said it yourself, this book is *very* well known for its problems, and by students. I.E. it was in the training set. You need to prompt it with your own problems and watch how it fails miserably. Take something like Yann Lecun's (infamous) geodesics problem for instance, idk. Also, when it has the solutions, it will try to get the solution at the end of the proof no matter what, and will sneak in errors in order to get to this result, so be very carefull.
yea but you could set up a special prompt that synthesizes new original problems and automate it, then they would have millions of original problems to work with for training a new model, im sure they will find a way to make it work with all the smart people they have at OpenAI for instance.
@@hypersonicmonkeybrains3418 For this to even work, you would need to generate the correct solutions to train on, alongside the generated "original problems", which is impossible, since the whole point of training it is to be able to generate those solutions.
I think giving it problems where it asks to arrive at a specific solution (shown in the problem itself) is not a good way to evaluate it. I bet the results would be very different if you just asked it to solve the problem by itself.
But one moment, how can you verify that the GPT did not know about this problem before and is only recreating it from his own knowledge base? You need to give something that you are 100 sure that it doesn't know. For that the best way to do it is to ask if it actually knows the solution directly if GPT4o knows the solution, it is likely that the o1 knows it.
I don't know if it's seen these problems before, but it was tested repeatedly using newly made-up logical and reasoning problems and it solved them. I showed it some work that was unpublished (but actually valid and verifiable) so I knew it hadn't seen it, and it's response was the same as I would expect from someone who was very experienced in the field and seeing it for the first time. So it definitely can reason (in its own way) on its own without already knowing the answer. I highly recommend the "Sparks of AGI" paper or lecture that goes into this in detail.
I gave ChatGPT a couple of master’s degree computer science problems, both solutions were wrong. I had to tell her they were wrong, she apologized and corrected herself. It wasn’t reliable.
i gave it multiple undergrad problems from algorithms and data structures, its certainly usefull but 90% of the time fails first try, then maybe in about half of those times you can instruct it how to fix the solution. Im talking about the 4o model. I find it useful for finding dumb mistakes, like missing some boundary conditions or just forgetting an i++ and stuff like that
I would have done the test of giving the answer with some error, for example an extra factor of 2, or an arctan instead of arcsin, and see if it gets the true answer anyway and recognizes the incorrect input. That would make a very convincing test.
Well, until you yourself don't know whether what it did was correct, we can't say for sure. It is surprising nonetheless, yes, however if you had given these problems to the earlier version, it would have also arrived at the conclusion required, i think. It would have just done some mumbo jumbo and forcefully arrived at the conclusion, no matter what it got wrong in the process. This time around though, it looks like it actually did all the things correctly in its "reasoning" process.
The model can do some amazing things, but since its results are thought-wise so far separated from the input question (as opposed to the regular GPT models which just start answering immediately), yet this model can still get some weird elementary things wrong, I'm not sure that I can ever use it for anything where I am unable to verify if its solution either works or is better than one I came up with. I think we're at a stage where it is becoming increasingly impossible for people to even evaluate the capabilities of these models. Their ceiling is immense, but when you have it thinking for a few minutes, it can make subtle mistakes at any step. OpenAI is also not exposing the thoughts to users, so it becomes impossible to read through all of it to decide whether it checks out. It's gonna take me forever to develop intuition for what sorts of things this thing can and can't do realiably and thus be trusted with.
Thank you for your feedback, I've gone ahead and recorded and uploaded a Part 2 where I test o1 with questions I have the answers to and questions that I'm confident it didn't have access to in its training. ruclips.net/video/a8QvnIAGjPA/видео.html
If you want to test the actual knowledge of the model then use textbooks that were compiled with questions created after the knowledge cutoff of the model. This doesn't reflect its actual knowledge but prior knowledge of the model.
Hi everyone, thank you so much for the feedback! I couldn't have expected this kind of attention on my video in the first 48 hours. I've taken some of your suggestions in the comments and have created a Part 2: ruclips.net/video/a8QvnIAGjPA/видео.html
Please consider watching!
Open ai say The real Ai o1 version of will be out before the end of 2024.
UNIVERSAL BASIC ICOME 2025
it was funny !
the thing is to keep up with the technolgys ad current innovations being deployed as it should not be hard to emulate these neural networks wth the open sourced models ! the aim is to train the local models as best you can at the highest point you capability but keep aware that technolgy needs to adavce to hande these heavy tensor calcualtions hence local model will be able to perform these tasks without the eed of outseide intervenetion so get an early start !
or it will be a nightmre of stuy to catch up : it has taken me a full year of constant python etc doing this training and implementation to keep up and get ahead ! .. that gap is widening @
Just expect to be able to host a 24b or 70b local within the next two years ! , a full genrative model ! ( so you could host multiple mini 7b agents at the same time ! hence a power full system ! ( agentic ! )
How much did open ai pay you to make this ad?
@@AlfarrisiMuammar I can't wait I'm still intrepid for GPT 5
It is worth noting that GPT has probably 'read' ever thousands of answer books on Jackson, As well as all of Stack Exchange, as well as several study guides on Jackson in particular. So if you want to really test GPT ability you probably need to create novel questions that will not be found in any textbook or online form.
Exactly, problems that solve students are already done somewhere on the internet, is just about googling it and copy paste the solution.
It's the same issue with AI being "great at programming" because it's extensively trained on leetcode problems.
@@taragnor being good at leetcode is not even being good at programming.
@@gabrielbarrantes6946 its doesnt have access to the internet
@@gabrielbarrantes6946 said the web dev.
Note 2: I haven’t looked through the answers, but in cases where GPT knows what the answer should be, it will make up crap to fill in the middle. I’ve asked it many PhD level math questions where it has hallucinated its way to a “plausible” answer.
I'm planning on making a follow up video on comparing my approach to solving this problem with ChatGPT's! Thanks for pointing that out
@@KyleKabasares_PhD that's not what he was saying. You were providing GPT 1o the answers, so of course it would give you the right answers since you provided them for it. To know if it truly solves PhD questions, you shoudn't give it questions like :"prove that this formula is verified" but rather " what is the formula for ... ?"
@@omatbaydaui722 I understood what he was saying. I’ve verified it doing problems correctly from start to finish (non-Jackson) without knowing the answer in advance! But in those cases I actually did the problem unlike here, so I’m planning on revisiting the problems in this video.
Crazy part is that this isn’t even the full model, which is even better
Yeah, it's not even a beta, it's a preview. And it's still using the last gen model. They're coming out with a new model pretty soon.
To us who's math students😂😂❤
@@CommentGuard717 yeah, imagine the GPT 5 implemented not preview version. Thats gonna be fucking wild and its not that far from now
@alvaroluffy1 well, they are now working on chatgpt 6.0
@@Ken-vy7zu shut up you know nothing, stop making up things, they are still working on gpt-5 you realize that right?
I just tried o1 with some fairly simple integrals, which it got badly wrong and I had to guide it to the correct answer. So I'd advise checking every step in the derivation.
The issue with this "test" is that the solutions to one of the most famous physics books are certainly in it's training data. Give it a problem for which there is no known solution. Or at a minimum give it a problem from an unknown text book. Find an old physics or math book from 100 years ago that has no digital copies. Then ask it questions from that and see how it does.
Exactly
@@amcmillion3 yes I did it and it’s not very accurate , I feed in 3 jee advance questions and out of which it could only answer 1 correctly , 1 he did it wrong even with hints and 1 he had to solve wrong first and than with hints it was able to solve it
best of all, make your own
If they were JEE Mains level questions, than solving 1/3 would put it at the same level as those qualifying in top 1000 for the exams. FYI, The highest marks in JEE Mains were usually around 33-35%. I’d wager that would be folks with l an IQ level of ~130+, that is pretty damn good for GenAI. On the normal distribution curve of IQ where 100 is the population average, 130 should at least yield 1 or 2 sigma of confidence level to the statement “GenAI has definitively exceeded the average of the human intelligence level”
@@pratikpaharia nah
They told me AI would replace hard labourers and Fast food workers first leaving us more time to think so I went to college now I'm college and I'm the first one being replaced.
Don't worry, everyone will be replaced in 3-5 years💀
@@phen-themoogle7651 💯
yeah it's everyone. from labourers to physicists, Ai could do everything much more effectively.
the biggest surprise was creativity, that Ai could create art.
Just don’t be a data analyst and if you want to be a computer scientist get out of school and get into the languages and start building plenty of budding industries right now
@@avijit849 ai is not creative and I don't believe it ever will be
to be fair, you don’t seem to have actually checked the model responses, there could have been mistakes or hallucinations throughout
Can you point out any specific hallucinations on this video?
100% this. It's given the answer to work towards. I do not have enough knowledge in this area to prove that it came to it's conclusions incorrectly, but it's a well known quirk.
Testing it with a book that is “infamous” probably isnt a great benchmark considering that it would mean that there is a considerable database related to that specific book it could read from. If you could test it on a novel problem that would be better
You have to remember that this book was probably directly in the chatGPT training data, so this may not be a valid measure of novel problem solving ability
This is the worst this technology will ever be….
That's an incredible truth
That's a terrifying truth
Eh it might hit a wall though.
@@thegeeeeeeeeeeI’m here from the future, your comment aged poorly
It might stagnate tho
if its a famous problem, isn't there a good chance the solution was already in the training data?
In general, ML models shouldn’t memorize the training data. A lot of effort is put into ensuring the model learns how to do the problem rather than memorizing.
I haven’t watched the solving yet, but immediately I would like to point out that choosing problems which have known solutions may mean that the model has already seen (and simply memorized) the correct answer!
A better test is to ask it an impossible problem or one that solutions don’t exist for and then try to see if it’s generated solution is correct.
Absolutely. If you simply Google the first 15 words of Problem 1, the very first result is a pdf document with a detailed, step-by-step solution. If anything, assuming the steps provided by o1 are correct, it just demonstrates it's decent at summarising search results...
The same goes for programming. A lot of people get easily impressed when GPT "writes" a 50-line script that's basically within the first 3-4 StackOverflow posts after a Google search. I mean, yeah, I won't deny it's really convenient that the tool can save you a few clicks, but saying that it has an understanding of the answer it's giving you is (as of today) still a stretch.
if you know how ai works the way they are trained is lossful they don't have access to word for word of every bit of their training info if they did these models would be terabytes upon terabytes in size and would be extremely slow.
@@o_o9039 I know how they work, and I'm not saying the model has all the information stored in its parameters, but it's no secret GPT can indeed search the web and summarize its findings. Copilot (based on GPT4) openly provides sources for almost everything it spits out.
@@pripyaatHow to know if it cheated?
@@pripyaatjeez. even worss than I thought
i asked it to calculate some stuff (quantum mechanics) for me and it also did some difficult step without explanation. i asked it to prove that step and it gave me a proof containing 1 mistake, but i wasn't sure and asked about that step, then it realized it was wrong, explained exactly why it was wrong, fixed it, and remade the calculation corrrectly
Just a a few years ago no one ever imagined bots thinking...😭
I certainly didn't!
The key is having the answer before hand so it can guess from both ends and connect them. Ask it to evaluate a parameterized surface integral even with wolfram plugins and it will make mistakes.
And remember this is a truncated version of the model. It's full version is much better at problems like this
What? 😮
Instead of telling it the answers, try asking it to find them. When I did this, it got the first one to an infinite sum but didn't reduce the infinite sum to the final answer: pretty good! For the second one, it had an extra factor of 1/pi that it made up. For the third it completely disregarded the angular dependence of the scattering and failed.
I don't understand anything about physics and advanced mathematics, but this video just made me excited for the future again!
The model has most likely been trained on these problems and their solutions, since they've been around on the internet for a long time. So it isn't really a good test of its abilities since it has just memorized the solutions. That being said, I also tried it with some problems from the 2024 International Math Olympiad, and it was able to get at least two (out of six) of them correct consistently. I only tried problems where the answer was a single number/expression, going through proofs would be much more work. The model's knowledge cutoff should be in October 2023 so it shouldn't be possible for it to have seen these problems before. It's still hard to tell since OpenAI isn't being very transparent with their methodology, but if the model is actually able to solve novel IMO level problems it has never seen before, color me impressed.
I test ,AND correct answer for me o1 2024 final with alternatives
gpt4o has the same training data and can not sole it? so ...
This channel is the reason why I'm not reading fluid mechanics rn
Can’t tell if I should say thank you or I’m sorry lol
Fluid mechanics is extremely fun🤲
O1 was trained on all Internet, including that book
So were all of us.
@@HedgeFundCIO the difference is we can think, but it can only answer. Its a great tool!! but not think on its own
Actually we dont know if its thinks because we dont know how we think. This is a philosical debate in ai community over the years.@@roro-v3z
@@roro-v3z almost like you didn't see it go through problems step by step to get to an answer..... It can indeed reason on it's own now
@@Hosea405 yes it did but on training data, but it won't have new ideas that have not been trained
It was trained on this data lol
The model's performance is undoubtedly impressive, but if it was trained on this book (which seems likely), it's not truly generalizing to new data. For a fair assessment of its capabilities, models should be tested on novel, unforeseen problems, rather than those for which the answers are already available. In practice, models are typically evaluated on fresh data to gauge how well they can generalize. To accurately measure performance at this level, problems should be novel and manually verified, even if that takes considerable time (1.5 weeks or more).
I believe the book does not have the answers to the problems, so even if it was trained with the book it shouldnt help it to solve problems. Still it is possible that it just took the answers from some physics subreddit post and just pasted it
It backtracked on its own answers double checking so I doubt it already knew the answer off it being trained off the book
IT BACKTRACKED ITSELF THOUGH????
Not to mention, universities have and still run research where they create brand new tests solely for having AI take them
" OpenAI's new AI model, "o1," has achieved a significant milestone by scoring around 120 on the Norway Mensa IQ test, far surpassing previous models. In a recent test, it got 25 out of 35 questions right, which is notably better than most humans. A critical factor in these tests is ensuring the AI doesn't benefit from pre-existing training data. To address this, custom questions were created that had never been publicly available, and o1 still performed impressively"
So it’s already smarter than 90% of the global human population, and it knows everything on the internet.
There is a problem with the test:
Since the answer ,,show that.." is given, the AI will always show the correct answer, the reasoning might be flawed. It would be better to cut out the correct answer from the problem and see what AI will answer then.
This applies to humans completing the problem as well, and there was an effort made to check the steps.
I agree, it might be interesting to see if it could though (although if it succeeds, will likely express it in an different form which may be hard to verify).
i agree with you, specially taking into account that it may just be bluffing and we would have no idea
Prior versions had bogus steps that didn’t really follow legitimate steps, and units were often fouled up. Definitely deserves to be looked at deeper to see if that has improved.
If you want to know, the steps are simply contextual changes, it is essentially a gpt that has the instructions self-set, and the output of its thinking steps is the instructions it is providing itself at each step. it works because by shifting context at each step rather than only a single context of the original message and response, it is able to aproach problems itteratively from different 'perspectives'
It might hallucinate since it knows the answers. So one would need to check all the calculations.
I just filmed a part 2 where it involves some problems where the answer is not known in advance, and problems that I'm confident it did not have access to previously: ruclips.net/video/a8QvnIAGjPA/видео.html
i’ve been watching lots of videos on o1 and i’ve not had a wow moment yet…. but this was it
As a person doing a bachelor's in EE, this will be very useful for me. Like many, I only wonder what'll happen in the future when it gets even more advanced?
Maybe take my reduced earnings and live off the land somewhere. Away from this.
Now ask it if 9.11 is bigger than 9.9
it worked for me 👍
bruh it said 9.11 is bigger than 9.9
*is 9.11 bigger than 9.9*
[Thought for 4 seconds]
"Yes, 9.11 is greater than 9.9."
*Are you sure?*
[Thought for 17 seconds]
"You're right, and I apologize for the mistake in my previous response. Actually, 9.9 is greater than 9.11. Here's why:
9.9 can be expressed as 9.90 or 9.900.
9.11 remains 9.11 or 9.110.
When comparing:
9.900 (which is 9.9) is greater than 9.110 (which is 9.11).
Therefore, 9.9 is greater than 9.11."
I suppose we're going to have to make sure ChatGPT re-evaluates their answer every single time
I just did it and it got it right first try:
No, 9.11 is not bigger than 9.9; in fact, 9.11 is less than 9.9. Here's the explanation:
9.11 equals 9 plus 0.11.
9.9 equals 9 plus 0.9.
Since 0.11 is less than 0.9, it follows that:
9
+
0.11
9+0.11 (which is 9.11) is less than
9
+
0.9
9+0.9 (which is 9.9).
Therefore:
9.11 < 9.9
9.11 is bigger than 9.9 when it comes to version numbers.
Its funny how good it is at some things and how terrible it is at others things still, seems its abilities are heavily dependent on whether examples of the problem were included in its training data. I've asked it to create a 32 bit crc algorithm and it did it perfectly, however when asking it to create considerably more trivial 3 bit crc algorithm (which is uncommon and quite useless), it failed miserably and in fact produced multiple wrong result that got worse and worse as i pointed out the flaws.
This is scary. But you have to try with novel problems that the AI has never seen before. Chatgpt has been for sure trained with the Jackson book!
Nevertheless, the reasoning capabilities are astonishing.
A new era has begun.
" Chatgpt has been for sure trained with the Jackson book!"
This is such an oft-repeated nonsense statement though. Just because a problem might be in its training set, the model will not be significantly or any better answering that exact problem than any other problem in the same category.
It's, like. Do you remember every homework math equation you have solved in your life?
Would you be any better at solving a problem you have already encountered once 10 years before vs a similar novel one? No, of course not, unless you have superhuman memory where you keep an exact copy of everything you've done ever.
Similarly, these models don't memorize. They synthesize. They are learning models, not search engines or "neually indexed databases" or whatever.
@@sid.h Ai remembers patterns, not particular problems. And indeed if some pattern is missing AI will miss it, if pattern is well represented AI will solve it well. Better architecture of neural network remembers more and remembers and solves corner cases better. This is what we see in chess networks such as Leela Chess Zero.
Imagine 10 years from now
@@xorqwerty8276Star Wars Universe but more humanoid bots on our planet , and billions of them being like gods building anything and everything they imagine. Earth is surrounded by a giant dome that extracts/enhances light from the sun combined with technology that speeds up how fast plants or trees grow, we have a combo of biological machines that have become humans too and are interbreeding half humans half machines. The sun is all we need to survive now. Millions of unique new species emerge.
(10 years is like millions of years if true ASI comes in a year from now)
Even 2 years could be very wtf lol 😂
In less than 3 years lots of knowledge workers will be displaced by AI.
i am doing a phd in ml related field.
Setting fair benchmarks and tests in the current day is quite hard considering the shear scale of data top models are trained on.
And using a famous physique text book isnt really a good attempt.
model o1 reasoning is a massive step up though for sure, i think it could do a similar blind test like this very soon.
If it is on the internet, it's in its training data. You would need to find questions that it has not been trained on. This is why benchmarking is so hard
It still impressive the model can accurately comprehend which part of its training data deals with the problem in question.
There are human beings who haven’t mastered this skill lmao
Stop the downplaying. These types of problems are impossible to solve without reasoning. Simple pattern recognition doesn't make this possible.
This cope needs to stop
You show up in 2005 with this tool and they'd call it AGI
as a high schooler who had taken part in aime, o1 is really impressive. aime problems get so much harder when it gets to the latter half. so 83% (o1) compared to 13%(gpt4o) is huge. the latter solve possibly only solve the first two which are not challenging at all
God, if only I had this back in 2003 when I completed my physics degree. I would have saved myself so much pain and suffering.
Since that book is older than the model, I wonder if it appeared in its training data.
100%. Perplexity pointed me to at least 6 pdf versions available for free online. There are also lots of study notes
online available for this text. Although I have no idea if it is memorizing answers.
@@Analyse_US it looks like it actually tries to solve the problems.
@@lolilollolilol7773 I agree, it's definitely not just remembering the answer. But is it remembering steps to solving the problems that it found in online study examples? I don't know. But my own testing makes me think it is a big step up in capablity.
@@Analyse_US AI memorizes patterns. If pattern is similar but exercise is different AI will solve it.
This was an interesting test. I still think it's funny when people say these models don't understand.
Anyone who's used them enough understands that they do understand.
One nice thing is that you can ask follow up questions as well and ask why something is like that, or ask it to try things in a slightly different way if you want it done differently.
I dunno about latest models, but ChatGPT 3.5 does NOT "understand" anything. It feeds you fake references, and when you repeatedly tell it it is doing so, it will say "sorry" and continue to feed you fake references. That is not its fault--it is not "replying" or "responding" or dong anything a living being is doing. If you give it a training set containing PhD Level physics problems, sure it can solve those problems. That is just predicting output from a training data.
@@woosterjeeves This isn't GPT 3.5 though, and that specific model you mentioned was released back in November of 2022, the first public release of ChatGPT. In the video, you can see it's process of reasoning. ChatGPT doesn't use fake references if it's able to break it down and be able to express why and how it conducts it's problem solving and reasoning. Also to "That is just predicting output from a training data", one, how is that different from learning? Isn't that the point of teachers, to help you predict and reason the output from the input of questions and data? Two, this is just a preview, not the full model, and it is able to do extremely difficult problems like these, explain the reasoning, the process, and give the right answer. We are slowly gravitating towards such a world where such an excuse of prediction of data will no longer be viable to argue about. The model is able to understand. The model is able to think with it's data. It's putting formulas and answers together from it's data, to reason and to form intelligent answers and responses when in contrast, the same problems make the most qualified PhDs scratch their heads. Reminder, these questions take around 1.5 weeks as said to solve ONE problem, GPT-o1 does it in less than 2 minutes.
@@人人人人人人人人 Sure. I am still flummoxed why someone would add "understanding" to a prediction model. If you think prediction (from training data) is equal to understanding, then algorithms are already "understanding". Why hype this one? OTOH, if you think there is something qualitatively different, then we can talk about that. But you cannot claim both.
Are chess computers "understanding" because they can do moves that make super GMS scratch their heads? If so, then the argument is already over. I am only cautioning against use of common-term words ("understanding") which makes one think in terms of sentience. A language model has no such thing.
Does this mean AI will never reach sentience? I never said that--just that the video does not do it for me. I am totally clueless why others are impressed about this model's "understanding", the same way I would be if someone said Alpha-Zero (the chess AI) understands chess. That is all.
Please refer to the chinese room problem.
@@woosterjeeves If you’ve only used 3.5 then I’m not surprised that’s your opinion 😂
Maybe it was in its training dataset, would be interesting for you to test something it could not have seen during training
not maybe, for sure. i know people dont have to be all experts in exactly what the black box of deep learning is doing but holy people are so dumb... i wonder if they don't think that IF what they think is true, meaning the models being this great, that in 1 month we wouldn't have to get new discoveries in all science fields...
which will not come because the current AI is 100% data capped. its just memorization of PDFs and manifold recalling
This is a fair point! I have gone ahead and uploaded a Part 2 using problems I'm confident it had not seen before and that I have detailed answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
Open ai say The real Ai o1 version of will be out before the end of 2024.
really? where?
They said about 1 month but will probably be end of 2024 as you say, o1-preview is not the full version there is the "full" o1 that is better yeah
"PhD-level". Our undergraduate theoretical physics course in electrodynamics used Jackson lol
smells like a clickbait title you know
Definitely non-standard in the US.
You used it as a vague reference book but you never really read through it.
@@andreaaurigemma2782 Of course I did.
@@Patrick-vv3ig no you didn't and if I had a penny for every shitty undergrad bragging about how they went through hard books without understanding a single thing I'd be rich
I suspect this was trained on the Jackson book.
The first time i watched a video like this was from sixty symbols where they also tried to solve physics problems using the original vanilla Chatgpt 3.5. They didn't get anywhere close to this level. I think the progress is reallty accelerating. I also think that inference time compute is a very real thing and the guys at openai have solved it with this new model in a fundamental way for sure. I think there will be other ways to implement system 2 thinking but i think that using reasoning tokens for accomplishing this is maybe the best and coherent way to go forward. I truly think that with o1, we have the first complete architecture for AGI.
Great video and interesting commentary. It’s interesting you think this might be a good study aid or a tool… however I just watched you take longer to check the answers than the model took to solve them… and your an actual subject matter expert… and as you correctly pointed out, this is just a preview of the full model capabilities. This new type of model will not help experts, but replace them. They will eclipse not only human level knowledge, but human level speed. This is not a tool. It’s disruption personified. With something this good (and as the saying goes, this is as bad as they will ever be as they will only improve from this time onwards) what purpose will it serve to complete university study for 3 years, only to try and find employment in a career that no longer requires humans. Amazing.
It's a machine, like cotton gin, the steam engine, the locomotive, etc. All advance of technology has displaced people from some jobs into others. And yet we are still here. What's the alternative? Structure society to be less productive and less efficient in order to keep people employed in obsolete jobs? That will just slow the growth of the economy and cause a lower standard of living, leading to poverty and hunger as the world population keeps multiplying. It's going to put people out of work, we will be ok. Becoming a Luddite is not going to change anything.
@@msromike123Cars replace horses So will humans suffer the same fate as horses?
@@AlfarrisiMuammar I am glad you are thinking about it now. 1) Truck drivers replaced wagon drivers (not horses.) There are many more truck drivers now. 2) The standard of living for both truck drivers AND horses is higher than ever. Same thing goes for automobiles and horses.
Not bad for a model smaller than the o1 and based on GPT-4, to be honest I don't know how I'll be able to test upcoming versions like the ones based on GPT-5.
I can't wait to use this on university projects, there are so many things I need to go looking for experts for relatively "easy" tasks.
There’s so much potential in the pipeline. Imagine the o1 techniques applied to image/video generation. Bye-bye obviously fake images, and hello “indiscernible from reality” images.
Also, once o1 is layered on top of GPT-5, we’re likely talking “competing with or beating best-in-the-world level scientists/thought leaders” in different fields. This will fuel more investment into compute farms to create even MORE powerful AI, and multiple instances can run simultaneously to solve problems that would take humanity millennia to solve otherwise. Including AI researching how to improve AI in a self-improving recursive loop that will only stop upon reaching the physical boundaries of the universe.
I mean it gave you step by step how it was able to solve those problems and gives you its insights into how it’s thinking. That is just wild beyond imagination.
You have to give it your own problem. The book is part of its training data. That is why it just knew the sum.
Even if that was the case, the simple fact that it worked out the path to the solution is impressive. But you are likely wrong.
@lolilollolilol7773
LLMs literally predict the next word based on probability. If the answer isn’t in the training data it can’t answer the question. It doesn’t have reasoning skills.
But they do have reasoning skills ,
@@Lucid.28 No they dont.
@@lewie8136they recognize patterns like we do... We don't really think, we also predict things based on the patters we see... We just named it thinking.
The way this model was trained was it took physics problems just like that and used a model like gpt4 to create reasoning chains until it could actually derive the correct answer. So it's not surprising. It can already solve textbooks that are well solved already because the answer is very objective and once you get a solid reasoning chain to get to the answer, you can simply train the model on that
We need AI to replace politicians, ASAP. The 'presidential debate' was a travesty.
best realization I've heard in weeks
An "ai president" as long as there isn't a person telling it how to think could be the best thing for any country. I would still give it a few years before doing so tho and make sure it's main objective is to do the best for the country.
I support this idea 1000%
i think current o1-preview is far more capable to govern than any human. Of course, it would need some readjustments like a more continous existence, without resetting itself, and a virtually infinite context window so it can always take into account everything that has ever happened in the past
You have no clou how ai works 😂
I asked it to find how much the earth would have to be compressed to become opaque to neutrinos: It took it 39 seconds to say 26 km diameter. Totally fascinating how it got there...(01Preview)
The correct answer is ~300 meters. It told me 360 meters.
And they wrote how this was just a step of many like that to come. In 5-10 years the world may be changed fundamentally, 20 years it’ll be hard to recognize
If the solutions exist on the internet is it really solving it? Or just analyzing and printing the answers . A true test could be a creating a unique problem with no known solutions
how would you know if it was correct or not if there was no known solution?
@@dieg9054 Maybe he means a problem that isn't posted on the internet since ChatGPT gets its solutions from the downloaded internet data.
That isn't how LLMs work, unless it is a wildly popular problem the small details it learns from the internet gets lost in the neural web
Your reaction clips bout to go viral bro 🚀
Lol that was my reaction last year with gpt4 but with programing
This is a great to see the model's real ability. Thank you for the demonstration.
You're welcome! I have made a part 2 using new questions that I'm confident it didn't have access to beforehand: ruclips.net/video/a8QvnIAGjPA/видео.html
GPT is most probably trained on the answers since it is a well known book
Maybe, but it showed its work
Since it is an infamous book, how do we know that it really solved the problems by reasoning and is not just trained on the existing solutions?
Is there the answers at the back of this book?
@@hxlbac No, but an Instructor's Solutions Manual online as PDF and several other sample solutuons.
the changing my approach part was kinda scary ngl
the problems are known by the llm already, it has been trained on the issue it dident come to a conclusion through reasoning
to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve some advent of code programming problems published december 2023)
@plonkermaster5244 Your statement it's half true, LLMs need to have existing information to propperly work. However, unless the problem presented needs an actual new theory with previous research and a never seen formula, LLMs can recognize the formulas needed to solve the problem. Good observation.
Lol, it's not the case whatsoever, keep coping though.
@@matiasluna514to be fair, we as humans need to do that as well haha
Solutions are publicly available and most probably in training datasets already. LLMs are good at what they already learned, but even not 100% accurate there.
"to my knowledge, its data is only until october 2023, and it can solve problems created after that data cutoff just as well. (for example it o1 mini was able to solve advent of code programming problems published december 2023)"
The first one is the easy one? Yet at the same time you're amazed that it solved it in 122 seconds, while you mention that it generally takes others 1.5 week.
ChatGPT is truly amazing i wonder what this technology will be like in 10 years i think schools should really use this technology and allow it because it’s not like it’s going away tomorrow. I also think this technology makes it impossible to be ignorant
This book is probably in its training data
so why did it try different approaches and not just the correct one?
@@japiye I am not sure but I do know it was trained on those types of problems so it’s not truly deriving those problems cold did you notice it would pull numbers out of nowhere. It’s still really impressive and a very useful model I think we skeptical that it’s really the equivalent of a physics grad student, if you watch ai explained video it gets common sense problems wrong
@@Mayeverycreaturefindhappiness Still impressive
@@trueuniverse690 yes
@@japiye As it probabilistically selects the next word, it will select different words compared to what is has seen. This is what makes the model generate new sentences, but it is able to evaluate it's chain of thought which leads to the correct one or a better result. As the problems are found online and the jackson problems are well known in the field for many years previously, it must be in it's training set.
Just as a comment: it looks impressive. However, to truly judge how good the model is, one (unfortunately 😬) needs to read the proofs line by line and examine the arguments in depth. From my experience with GPT-4, the proofs often look good, but they sometimes contain flaws when examined more closely.
Just finished recording a video where I do that more or less with some problems I have the answer to and am pretty sure the problem didn't exist on the internet!
Here is part if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
Incredible. It would be interesting to see what happens if you give it to solve an incorrect result. Will it show that your result is incorrect and instead give the correct one ?
The real question is can it solve Indian entrance exam questions or not?
Thank you. I find this more interesting that Dr. Alan D. Thompson's obscure-information tests.
I appreciate the view! I am working on creating a Part 2 that will involve open-ended questions that I was given as a graduate student in school that I don't believe come from any textbooks, so stay tuned for that!
@@KyleKabasares_PhD I subscribed. Don't want to miss Part 2.
But surely the model has already been trained on that textbook?
It's a fair point, I've gone ahead and filmed and recorded a part 2 that involves problems I'm confident it hadn't seen before: ruclips.net/video/a8QvnIAGjPA/видео.html
Hey man! You should do a video with scores, like, you will do 5 tests, and allow 5-shot for each problem to each model. And then see out of 5 what's the score. Do this for GPT4o vs O1 preview, you can also do O1 vs Claude sonnet!
Like a "LLM's Face Off"
I actually did a stream like that last night! Gave o1, 4o, Gemini Advanced, Claude Sonnet 3.5, Grok 2, and LLama 3.1 a college math exam! ruclips.net/user/liveGdN4MFxLQUU?si=flPSFIxx85Uqyoz7
Tip to past questions in: ask ChatGPT 4o to transcribe the picture
Everything Is Easy Until o1 Faces Keshab Chandra Nag (Only Bangalis Will Understand) 😂
This should make you seriously question the way we do education. If human value isn't in solving problems that are hard, yet clearly defined, then why teach that? You teach it because you need to know it to solve higher level problems. But maybe we no longer need to also train the skill of doing the calculations. So long as you understand the concept properly you can move on without spending a week pushing through the math. That's going to be very hard for some people to accept.
Understanding the concept, unfortunately, typically requires dozens of practical experiences. This us why teaching math starting with a calculator leads to lesser learning than introducing a calculator following basic practice
@@JaredCooney very true. But I think students will be doing less of it and learning more. We’ve seen this pattern before.
OMFG, another year.... everyone going to have a PHD.
Or no one, because why do something that a machine does better?
@@hipotures No, you don't get it. The standards have been raised. The hyper intelligent.... are going to be on steroids. I know i am.
Imagine if someone at 18 with an iq of 145+ with AI tools at their disposal? now imagine a decade of this progress and the new generation coming in.
We're going to see hyper geniuses.
FYI you should put new problems in new chats to avoid polluting the context window
it was messing up on symbolab generated Mclauren & taylor series problems for me last night, but its a massive improvement from last time i tried to get it to do math a couple years ago
Oh interesting, I'm working on another video that involve problems that shouldn't exist on the internet that my professors created themselves.
I will admit, even though this is cool and impressive It likely was trained off of this. I recommend trying to create some novel problem yourself and then testing it but the GPQA benchmark they used is completely Google proof and is not on the Internet it was created by PhD‘s, who were in physics, chemistry, and biology and it outperformed them in answering the tests that doesn’t mean its as good as them that just means it’s good at answering questions that PhD’s in those fields would struggle with I know nothing about physics. I came here for the AI stuff
the fact that it tired different apraoches, failed and tried others does not proof your point, but i lnow how hard it is for humans to accept that a machine can match their intelligence but i guess this time there is no excaping it..
@@samsonabanni9562 Doesnt rule out the possibility that it still couldve helped guide it to the answer.
This is a good point! I've gone ahead and made a part 2 using newer problems that I don't believe it had access to in its training set and that I do have the answers to! ruclips.net/video/a8QvnIAGjPA/видео.html
Realistically, it probably does have that math problem in its data. However, unless they specifically overfitted their data with that one question hundreds of times, it's not likely the ai model simply remembered the question or something.
@@mint-o5497 yeah that’s my point i’m an ai nerd I’m not anti ai just being cautious
Great content idea. Love your reaction. Genuine.
Thank you so much! I've gone ahead and created a part 2 based on the feedback to this video, I hope you will consider watching it! ruclips.net/video/a8QvnIAGjPA/видео.html
So much cope in the comments
As Andrej Karpathy recently said in an interview, ideal training data for a reasoning model would include step-by-step reasoning. (like how we teach children in school) t's a bit amazing that bulk internet data has enough of this embedded reasoning to get us the current results.
OpenAI is using q*star to refine their synthetic data, no doubt with successful step-by-step reasoning in that data. This will take a couple years to reach the next model (that's how long it takes to train a new model) but it's start of a virtuous cycle, where ever capable models refine future synthetic data.
Thank you for your comment, this step-by-step reasoning approach is definitely a game-changer! I have also just uploaded a part 2 to my channel if you are interested: ruclips.net/video/a8QvnIAGjPA/видео.html
I tested to solve a sudoku and it failed. It either give wrong results or change up the original question
Still did much better than 4o when I tested it a few months ago
o1 will get better in the coming months and these problems with be easy for it perhaps.
I wouldn’t be surprised. The true limit of LLMs are problems with no real known solutions. The advancements still do not change the (oversimplified here) model of fitting a curve.
OpenAI can do another round of RL and CoT on these specific problems, but all it takes is another set of problem that it really hasn’t encountered that well. It still suffers from the generic flipping an image of a dog and having the AI shit itself.
@@user-jm8fj7ez8s Yup; and that's why even Sam Altman admitted it's "more impressive on your first use" [than when you use it a lot], and that it's "not AGI". But this is already INCREDIBLY useful; think about it like this: even if it can't automate most jobs, what % of humans ACTUALLY work on problems that have 0 prior data in terms of how to solve them? It's only less than maybe 1% if we're being generous. This means the AI will soon be able to eclipse the regular work 99% of humans do, without any further breakthroughs. And THAT is the mindblowing part! (this was generated by O1-mini).
@@user-jm8fj7ez8syeah it does not reason from first principles. Unlike humans, it doesnt explicitly operate through spatial & relational reasoning from the ground up.
They just had o1 trained by asking it millions of questions, letting it think, and then reinforcing the reasoning that led to the right answers. So basically o1 will know the most accurate and efficient chain of reasoning for familiar questions.
Will this eventually turn into a super intelligent reasoning engine when scaled up? Nobody knows really but I personally doubt
All i see is that in a few years AI will be able to do everything and most of us will be obsolete
Keep in mind these companies lie A LOT! Like the bar exam, it tests in the 60th percentile with the initially hidden caveats in place
@@marcusrosales3344True, but the money doesn’t lie. Until the bubble bursts, very smart people have bet tens of billions of dollars on it being game-changing. And notice how the goalposts keep moving back? “It’s ONLY getting 60% on the bar” is a far cry from 5 years ago when AI could only put out gibberish.
Wow. That's amazing. What amazes me is that it apparently knows all the tricks.
I am amazed that this is only o1 preview model. . Open AI is probably working on o2,o3,o4 and o5 model😊
Well, you said it yourself, this book is *very* well known for its problems, and by students.
I.E. it was in the training set.
You need to prompt it with your own problems and watch how it fails miserably.
Take something like Yann Lecun's (infamous) geodesics problem for instance, idk.
Also, when it has the solutions, it will try to get the solution at the end of the proof no matter what, and will sneak in errors in order to get to this result, so be very carefull.
yea but you could set up a special prompt that synthesizes new original problems and automate it, then they would have millions of original problems to work with for training a new model, im sure they will find a way to make it work with all the smart people they have at OpenAI for instance.
@@hypersonicmonkeybrains3418 For this to even work, you would need to generate the correct solutions to train on, alongside the generated "original problems", which is impossible, since the whole point of training it is to be able to generate those solutions.
Well done! Excellent video! 😮😮😮
Thanks so much for watching! Please consider watching part 2: ruclips.net/video/a8QvnIAGjPA/видео.html
I think giving it problems where it asks to arrive at a specific solution (shown in the problem itself) is not a good way to evaluate it.
I bet the results would be very different if you just asked it to solve the problem by itself.
Now give it the Millennium problems and see how it does with those
LOL
It still produces wrong answers in topics as basic as finite automaton.
Chat we're cooked
We have to remember to use it as a really smart helper and guide it at times
But one moment, how can you verify that the GPT did not know about this problem before and is only recreating it from his own knowledge base? You need to give something that you are 100 sure that it doesn't know. For that the best way to do it is to ask if it actually knows the solution directly if GPT4o knows the solution, it is likely that the o1 knows it.
This is a good point! I just recorded a Part 2 with some new problems that I believe it didn't have in its knowledge-base. Will be uploading shortly!
I don't know if it's seen these problems before, but it was tested repeatedly using newly made-up logical and reasoning problems and it solved them. I showed it some work that was unpublished (but actually valid and verifiable) so I knew it hadn't seen it, and it's response was the same as I would expect from someone who was very experienced in the field and seeing it for the first time. So it definitely can reason (in its own way) on its own without already knowing the answer. I highly recommend the "Sparks of AGI" paper or lecture that goes into this in detail.
So crazy! That level is insane
I gave ChatGPT a couple of master’s degree computer science problems, both solutions were wrong. I had to tell her they were wrong, she apologized and corrected herself. It wasn’t reliable.
The free model or paid model?
@@magicalgibus3006 I used model 4 and 4o, free though.
@@jekyll366 Habibi he obviously tests the state-of-the-art model o1..
i gave it multiple undergrad problems from algorithms and data structures, its certainly usefull but 90% of the time fails first try, then maybe in about half of those times you can instruct it how to fix the solution. Im talking about the 4o model. I find it useful for finding dumb mistakes, like missing some boundary conditions or just forgetting an i++ and stuff like that
@@jekyll366 Then your assessment was useless, we are talking about O1 preview.
It would be cool if one of the creators of these problems could get paid to make original ones on the spot and feed it to o1.
I would have done the test of giving the answer with some error, for example an extra factor of 2, or an arctan instead of arcsin, and see if it gets the true answer anyway and recognizes the incorrect input. That would make a very convincing test.
Well, until you yourself don't know whether what it did was correct, we can't say for sure. It is surprising nonetheless, yes, however if you had given these problems to the earlier version, it would have also arrived at the conclusion required, i think. It would have just done some mumbo jumbo and forcefully arrived at the conclusion, no matter what it got wrong in the process. This time around though, it looks like it actually did all the things correctly in its "reasoning" process.
The model can do some amazing things, but since its results are thought-wise so far separated from the input question (as opposed to the regular GPT models which just start answering immediately), yet this model can still get some weird elementary things wrong, I'm not sure that I can ever use it for anything where I am unable to verify if its solution either works or is better than one I came up with.
I think we're at a stage where it is becoming increasingly impossible for people to even evaluate the capabilities of these models. Their ceiling is immense, but when you have it thinking for a few minutes, it can make subtle mistakes at any step. OpenAI is also not exposing the thoughts to users, so it becomes impossible to read through all of it to decide whether it checks out.
It's gonna take me forever to develop intuition for what sorts of things this thing can and can't do realiably and thus be trusted with.
This gives you a skeleton from which to create the solution and has probably halved the amount of work you needed to do.
I'd be more concerned about the people who use this model and figure out how to get past the quirks.
Thank you for your feedback, I've gone ahead and recorded and uploaded a Part 2 where I test o1 with questions I have the answers to and questions that I'm confident it didn't have access to in its training. ruclips.net/video/a8QvnIAGjPA/видео.html
If you want to test the actual knowledge of the model then use textbooks that were compiled with questions created after the knowledge cutoff of the model. This doesn't reflect its actual knowledge but prior knowledge of the model.