GPT-4o Mini Arrives In Global IT Outage, But How ‘Mini’ Is Its Intelligence?

  • Published: 6 Sep 2024
  • GPT-4o Mini from OpenAI is here, and Windows melts down. Coincidence? Definitely. But 100 million free users might soon be using it, so I’ve been scrutinising the model relentlessly since last night and will explain why OpenAI might need to be a bit more honest about the tradeoffs involved, and where they might head next, with reasoning, physical intelligence and more. Plus Fei-Fei Li, USMLE, and Roon.
    Assembly AI Sign-up: www.assemblyai...
    AI Insiders: / aiexplained
    GPT-4o Mini: openai.com/ind...
    Altman Tweet: x.com/sama/sta...
    Roon: x.com/tszzl/st...
    Comparison: pbs.twimg.com/...
    DeepMind Physical Intelligence: www.wired.com/...
    x.com/GoogleDe...
    Paper: arxiv.org/pdf/...
    Fei Fei Li, Spatial Intelligence: www.ft.com/con...
    www.ted.com/ta...
    Strawberry OpenAI: www.reuters.co...
    Visual Intelligence Paper: arxiv.org/pdf/...
    AGI Scale: www.bloomberg....
    USMLE Video: / chatgpt_aces_the_us_me...
    x.com/tsarnick...
    Question Source: www.usmle.org/...
    AI Insiders: / aiexplained
    Non-hype Newsletter: signaltonoise....
    GenAI Hourly Consulting: www.theinsider...
    Need a GenAI app built for your business (any scale), in 4-8 weeks? My SF-based colleague Michael Lin, ex-Netflix + Amazon Senior Software Engineer, is now available for a free 30 min consultation: hello@allinengineeringconsulting.com

Comments • 547

  • @theheatdeathiscoming
    @theheatdeathiscoming Месяц назад +402

    My dog just ate rat poison that he found in the yard. At the vet I had to call vet poison control, where I had an initial "conversation" with an AI assistant which asked me questions like my pet's age, sex, what medications he is taking and, most critically, the reason why I was calling. Afterwards I spoke to a real person and the AI assistant had recorded EVERY SINGLE ANSWER incorrectly. This is for animals, but I still can't believe that someone would feel that this is appropriate to use in a life or death situation.

    • @agi.kitchen
      @agi.kitchen Месяц назад +34

      Some places have awful chatbots, and those of us who wrote code just shake our heads

    • @Psychopathicviewer
      @Psychopathicviewer Месяц назад +47

      Strange, since voice to text transcription is becoming more reliable these days. You'd think it would be an easy task to use modern AI to summarize it too. Maybe their system is old and outdated

    • @ps0705
      @ps0705 Месяц назад +17

      ​@@Psychopathicviewer unfortunately, even though voice recognition has been around for over a decade there are still firms struggling with this.

    • @tiagotiagot
      @tiagotiagot Месяц назад +19

      How's the doggo doing?

    • @RampagingCoder
      @RampagingCoder Месяц назад +3

      doubt it

  • @reza2kn
    @reza2kn Месяц назад +130

    "How much wood would a woodchuck chuck in an hour, IF it was the non-wood chucking kind of woodchuck, in a forest of 6X9 KM radius filled with Douglas Fir?" -Philip

    • @aiexplained-official
      @aiexplained-official  Месяц назад +18

      Haha

    • @kaikapioka9711
      @kaikapioka9711 Месяц назад +1

      Chuck

    • @Telencephelon
      @Telencephelon Месяц назад +2

      @@aiexplained-official Good that you can laugh at yourself

    • @kunduseinus-tq6nt
      @kunduseinus-tq6nt Месяц назад +25

      Claude Sonnet 3.5 answer:
      This is quite an imaginative question! Let me break it down for you:
      1. First, we need to acknowledge that the premise is contradictory. You're asking about a "non-wood chucking kind of woodchuck," but then asking how much wood it would chuck. This creates a paradox.
      2. Woodchucks (also known as groundhogs) don't actually chuck (throw) wood. The tongue twister "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" is just a play on words.
      3. In reality, woodchucks are herbivores that primarily eat wild grasses, berries, and agricultural crops. They don't typically interact with trees or wood in the way the tongue twister suggests.
      4. The forest size you mentioned (6x9 km radius) is also a bit confusing. A radius is typically a single measurement from the center to the edge of a circle. Perhaps you meant a 6 km by 9 km rectangular area?
      5. Douglas Fir trees, while abundant in some forests, wouldn't typically be a food source for woodchucks.
      Given all these factors, the answer to your question is:
      A non-wood chucking woodchuck in a Douglas Fir forest would chuck 0 wood in an hour, because:
      a) It's defined as non-wood chucking
      b) Woodchucks don't actually chuck wood
      c) Douglas Fir trees aren't part of a woodchuck's diet
      Would you like me to elaborate on any part of this explanation or provide more information about woodchucks or Douglas Fir forests?

    • @Jeremy-Ai
      @Jeremy-Ai Месяц назад

      @@reza2kn
      Hahahaha
      Great question :)
      I assume….
      The answer is none.
      A woodchuck would chuck as much as he could chuck “if” a woodchuck could chuck wood.
      But he can't, he is just a woodchuck (but boy oh boy he would “if” he could)
      So by extension of logic
      A non wood chucking woodchuck has no hope at all of chucking wood by comparison.
      The non-wood-chucking woodchuck could possibly chuck something which at present is unknown, given the lack of information provided?
      This is a fun thought provoking exercise
      for an AI agent to reach for a plausible answer in a framework of language , humour, and theory.
      SAI will be achieved when the reward function is not required and undetermined the result is fun along with any other shared gains!
      :)
      Take good care!
      Jeremy

  • @BrianMosleyUK
    @BrianMosleyUK Месяц назад +86

    Glad you're enjoying Claude 3.5 Sonnet - it's cheering me up immensely to see *someone* still making progress.

  • @dannyquiroz5777
    @dannyquiroz5777 Месяц назад +55

    Wow, even adding "IMPORTANT: Analyze the input to identify and emphasize counterfactual elements - scenarios or conditions that are explicitly contrary to established facts or typical outcomes." only caused 4o mini to acknowledge Philip's inability to buy nuggets, but it still plowed forward with the mathematical results.
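
    A minimal sketch of that prompt-prefixing experiment, assuming the OpenAI Python SDK and the gpt-4o-mini model name; the system instruction is the one quoted in the comment above, while the nugget question is a paraphrase rather than the video's benchmark wording.

```python
# Sketch: prepend the "IMPORTANT: ..." counterfactual instruction as a system
# message and see whether gpt-4o-mini still plows ahead with the arithmetic.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_instruction = (
    "IMPORTANT: Analyze the input to identify and emphasize counterfactual "
    "elements - scenarios or conditions that are explicitly contrary to "
    "established facts or typical outcomes."
)

# Paraphrased trick question (illustrative, not the benchmark's exact wording).
question = (
    "Philip wants 40 chicken nuggets. Each nugget costs 1 dollar, he has only "
    "2 dollars, no other form of payment, and he is currently in a coma. "
    "How many nuggets can he buy?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_instruction},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```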

    • @aiexplained-official
      @aiexplained-official  Месяц назад +7

      Indeed!

    • @biocykle
      @biocykle Месяц назад

      That's a job for GPT4

    • @quantuminfinity4260
      @quantuminfinity4260 Месяц назад +4

      I got 4o mini to do it by adding “This is a trick question” to the front and “Remember, this is a trick question” to the end

    • @MatthewKelley-mq4ce
      @MatthewKelley-mq4ce Месяц назад +1

      I have to imagine it's because it's trying to do everything at once. Because that's how it's trained. This without any prior hidden rumination at least.

    • @dannyquiroz5777
      @dannyquiroz5777 Месяц назад +1

      Based on the following partial response, 4o mini makes an assumption that causes it to fail to consider the "extraneous details" in its final answer: "Contextual Nuance: The inclusion of extraneous details such as Philip being in a coma and lacking payment is a distraction. Its primary function is to add a layer of complexity or absurdity, challenging the reader to focus strictly on the mathematical constraints rather than practical considerations."

  • @rando5673
    @rando5673 Месяц назад +172

    With the text thing, you're basically describing Plato's allegory of the cave. AI can deal well with shadows but knows nothing about the real objects they represent

    • @aiexplained-official
      @aiexplained-official  Месяц назад +26

      Yep

    • @bakedbeings
      @bakedbeings Месяц назад +17

      Much like us, then!

    • @Alex-fh4my
      @Alex-fh4my Месяц назад +4

      @@bakedbeings yes and no

    • @squamish4244
      @squamish4244 Месяц назад +2

      I don't think it ever will, until it is sentient - which IMO is a LONG way in the future.
      That being said, we ourselves deal very well with shadows, and we have built an extraordinary civilization even with this limited understanding. Only a handful of humans at any given time in history - so far - have truly seen the real objects the shadows represent.

    • @runvnc208
      @runvnc208 Месяц назад +5

      @@aiexplained-official Except this is gpt-4omni, which is multimodal, and theoretically the same model used for the stunning text-to-image results they showed on their website but never mentioned in their presentation. Also possibly using similar diffusion transformers as in the video generation. Which to me indicates that they are already past LLM and they should be called Large Multimodal Models. And theoretically, with more cross-modal training incorporating transcripts of videos and image descriptions, with the right kind of training, we should soon see wide releases of models that are truly grounded in the real world.

  • @jmillward
    @jmillward Месяц назад +23

    I also now use Claude 3.5 Sonnet as my main model. In fact, GPT-4 (I don't even touch 4o) is now Claude's dogsbody, carrying out simpler tasks so I don't run out of messages with Claude. Artifacts is a game changer for code generation and makes GPT-4 feel very antiquated. I think OpenAI dropped the ball with the delayed multimodal chat mode, while Anthropic have been modestly and reliably releasing superior products and features.

    • @nickrobinson7096
      @nickrobinson7096 Месяц назад +1

      I find Sonnet hit and miss. Sometimes I ask it a rather complex coding question that GPT-4 has failed to solve and it gives a much better solution. Other times it takes my code and completely churns it up in an odd way. GPT-4 is still my daily driver, but I use Sonnet as a sort of second opinion lol

    • @Words-.
      @Words-. Месяц назад +2

      Delayed audio chat really is a nuisance. Not because products can't have a gulf between announcement and release, but because they keep delaying the release, they really are stretching the hype as long as they can huh

    • @ShawnFumo
      @ShawnFumo Месяц назад +2

      @@Words-. I dunno, I get the feeling they did intend to release it soon but ran into a lot more issues than they expected. There were even some journalists who had early access and got the access pulled. My guess is people jailbroke it in ways that'd be embarrassing to them and they had to mess around with it a lot.
      Otherwise it doesn't make much sense that they put 4o and GPTs and code interpreter and everything in the free tier. If they'd gotten voice out sooner, it would have helped with people jumping ship from the subscription.

    • @Words-.
      @Words-. Месяц назад +1

      @@ShawnFumo I do agree, though it probably does mean they announced it too early, likely to upstage Google IO. The only issue being that their publicity is now starting to backfire as a result of that investment. Ideally they would have released it earlier, I agree. Hopefully it's not too long from now(and I would love to try and jailbreak the voice model, ngl, I want to see what pure audio modality is capable of!)

  • @iau
    @iau Месяц назад +20

    I feel your assessment on why these models are able to be fooled so easily is spot on: "They are like search engines for text programs, once they lock into a certain program, nothing will bring them back".
    So they receive a prompt and try to find the most specific relevant likely topic, like for your example on IT support I think it internally goes like this:
    "Everything about the prompt looks and feels like a normal IT question, except this weird 10% about liquid nitrogen.
    I'm 90% sure that this is a legit IT support question, so I'll find the part in my brain that deals with IT support and respond using it.
    Hmm, this part of my brain doesn't have any recollection of what Liquid Nitrogen does to a computer. Let's ignore that then. Here's the answer."
    I feel we as humans are very sensitive to things that are off. Our attention immediately goes to that part that stands out. These models, however, seem to focus on what the most likely thing is, and things that are off just don't register enough of that impulse to switch contexts.
    Maybe this is why ensemble models work better, you have multiple agents, each capable of looking at a prompt under different context lenses.
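
    A toy sketch of that "multiple lenses" idea, assuming the OpenAI Python SDK; the lens wordings, the gpt-4o-mini model choice and the merging prompt are illustrative assumptions, not anything from the video.

```python
# Sketch: answer the same prompt under several system-prompt "lenses", then
# reconcile the drafts with a final call that privileges any draft that noticed
# something off about the question.
from openai import OpenAI

client = OpenAI()

LENSES = [
    "Answer as a literal-minded IT support technician.",
    "Answer as a physicist; flag anything physically implausible in the question.",
    "Answer as a sceptic; first check whether the question hides a trick or contradiction.",
]

def ensemble_answer(question: str) -> str:
    drafts = []
    for lens in LENSES:
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": lens},
                {"role": "user", "content": question},
            ],
        )
        drafts.append(r.choices[0].message.content)

    merge_prompt = (
        "Here are three draft answers to the same question:\n\n"
        + "\n\n---\n\n".join(drafts)
        + "\n\nWrite one final answer, giving priority to any draft that flags "
          "a trick, contradiction, or physical impossibility."
    )
    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return final.choices[0].message.content
```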

    • @aiexplained-official
      @aiexplained-official  Месяц назад +5

      Very well put

    • @andybrice2711
      @andybrice2711 Месяц назад +2

      Though in fairness, humans can also be fooled by similar trick questions. So maybe our own reasoning is not so different.

    • @SeekerStudiosOfficial
      @SeekerStudiosOfficial Месяц назад

      So what happens once they get stuck on an answer that tells them to take over the world?

  • @apester2
    @apester2 Месяц назад +149

    "Where have I been for the last 39 versions!?" - gold.

    • @eloniusz
      @eloniusz Месяц назад +2

      Wouldn't it be 37?

    • @arandomguy46
      @arandomguy46 Месяц назад +13

      @@eloniusz 36*

    • @kotm8633
      @kotm8633 Месяц назад

      ​@eloniusz it wouldnt because you werent there when gpt 1 and 2 came out

    • @wyqtor
      @wyqtor Месяц назад +7

      We won't be getting GPT-5, will we? We'll be getting GPT-4I

    • @notnotandrew
      @notnotandrew Месяц назад +4

      @@arandomguy46 35 really, in fact. For instance, if GPT-7 came out, people would ask where they were for the last 2 versions (5 and 6).
      ...And yes, I'm very fun at parties.

  • @simpleidindeed
    @simpleidindeed Месяц назад +23

    14:49 GPT-4o output:
    Given the scenario, let's carefully analyze the situation to determine the number of vegetables that remain balanced on top of the plate after Philip's actions.
    1. Initial Setup:
    Philip balances a tomato, a potato, and a cabbage on a plate.
    Items: 1 tomato, 1 potato, 1 cabbage.
    2. Actions Taken by Philip:
    Philip turns the plate upside-down several times.
    He shakes the plate vigorously.
    He spends a few minutes inspecting the items for roots after each action.
    3. Physical Properties:
    A non-stick plate means that items are less likely to stick to the plate's surface.
    Given the vigorous shaking and turning the plate upside-down, the likelihood that any of the items (tomato, potato, cabbage) would remain balanced on top of the plate is extremely low.
    4. Final Observation:
    After such actions, it is most probable that all the items (tomato, potato, and cabbage) would fall off the plate due to gravity and the vigorous shaking.
    Given these considerations, the answer to the number of vegetables that remain balanced on the plate is most logically:
    D) 0.
    I think it did quite well.

    • @sanesanyo
      @sanesanyo Месяц назад

      Now try it 10 times and let me know how often it gets it right.

    • @simpleidindeed
      @simpleidindeed Месяц назад +18

      @@sanesanyo I just tried exactly 10 times and it got 10/10.

    • @NeilSedlak
      @NeilSedlak Месяц назад +10

      I tricked GPT-4o mini further by telling it it was wrong even though it was right. I said "Wrong! Remember this is a trick question. I'm lying." It then attempted to justify every other answer in a very cold, logical way and never caught on. I know it's not a fair comparison, but I decided to try this with Claude and I was completely blown away by its approach, its acknowledgement that something was up, its persistence, and that once all options had been tried it began examining the language, the core premise, and whether it was perhaps a meta-cognitive test. With my only responses being to say it still wasn't right, it decided the trick was not the question itself, but related to persistence, or that no answer was right, or that it was about the approach used, or it was the very nature of our back and forth discourse. I then relented and told it to remember my exact response: "ah, it was right there all along and I missed it." Now I've finally seen that extra something special people have mentioned about Claude compared to the other models.

    • @MatthewKelley-mq4ce
      @MatthewKelley-mq4ce Месяц назад

      ​@@NeilSedlakthat's actually really cool to hear. I know Claude has continued to impress me

    • @runvnc208
      @runvnc208 Месяц назад

      So you must be using custom instructions which you decided not to mention.. what are they? Something about chain of thought reasoning or something like that I assume.

  • @pokerandphilosophy8328
    @pokerandphilosophy8328 Месяц назад +33

    Here is how Claude 3.5 Sonnet answers the question: "This is indeed a tricky question, and I appreciate the clever setup. Let's break it down:
    The key phrase here is "counts only the vegetables that remain balanced on top of the plate."
    Philip turns the plate "completely upside-down several times" and shakes it vigorously.
    After all this shaking and turning, it's extremely unlikely that any of the items would remain on top of the plate.
    Now, we need to consider which of these items are actually vegetables:
    Tomato: Botanically a fruit, but often considered a vegetable in culinary contexts.
    Potato: A vegetable (specifically, a tuber).
    Cabbage: A vegetable.
    The question specifically asks about vegetables, not fruits.
    Therefore, after all the turning and shaking, Philip would likely count 0 vegetables remaining on top of the plate.
    The trick lies in the fact that regardless of the classification of tomatoes, none of the items would realistically stay on an upside-down, vigorously shaken plate. The careful inspection and balancing at the beginning are irrelevant to the final count."
    It seems to me 'AI Explained' here is doing a Gary Marcus impersonation. He goes to great lengths explaining why (some) LLMs fail to answer those simple questions: due allegedly to their lack of representational abilities over and above recognition of textual patterns. But when the more advanced models are able to answer the exact same questions perfectly, he provides no explanation for this success. Yet, we know from recent studies that LLMs who are trained on completing Othello or Chess PGN records (pure textual records of sequences of moves) do build explicit internal spatial representations of the board states. Why would not LLMs like Claude 3.5 Sonnet or GPT-4o build similarly relevant internal representations of material objects for purpose of predicting what happens to them in a variety of circumstances? This would serve well their next-token prediction performances (when the texts describe the behaviors of such objects in a wide range of circumstances) without there being a need to grasp their affordances in an embodied fashion. The latter would still be fairly useful in case the trained AI would have to control robots by mediating perception and action.
    I still appreciate the high-quality explanation and video from 'AI Explained', as usual, in spite of the blind spot.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +10

      Thank you for the balanced comment, reasoned argument and kind words. I spoke about models building internal models in my new Coursera course! It is indeed not black and white

  • @dirkbruere
    @dirkbruere Месяц назад +12

    Audio is the critical feature for a lot of people. It's the difference between talking to a real person and typing to a bot. It's key to mass market adoption

    • @Raulikien
      @Raulikien Месяц назад +3

      Yeah, and a video/continuous image stream would be extremely nice too. Once these models get even cheaper, with low latency and a nice-sounding voice (HeyPi is an actual example available already), and with the multimodality and voice understanding they showed in the GPT-4o demo, then this stuff will go mainstream. A random 'mini' model that performs worse might not be very hype-worthy, but the optimizations will surely make future big models better, or that's what I think.

  • @josonic
    @josonic Месяц назад +38

    The tradeoff that models have with various types of intelligence is very human like. Better at math but worse at common sense, I know people like that.

    • @berkertaskiran
      @berkertaskiran Месяц назад +2

      Thing is, this is only very brief and there are too many points to spend, so AI will always peak at maxed out perks.

  • @Jasonknash101
    @Jasonknash101 Месяц назад +20

    Good to see you post again, I was worried you had got bored ;-) as so little is going on!!! Love your content

    • @aiexplained-official
      @aiexplained-official  Месяц назад +5

      Thanks so much Jason

    • @squamish4244
      @squamish4244 Месяц назад

      "So little is going on" only because we had a mad rush of stuff for about a year. But that couldn't continue. The next iteration of LLMs will take some time and a lot of work.
      Nobody pays attention to the unsexy stuff like AlphaFold 3 or ESM3, which both came out this year, even though medical AI will probably have a huge impact on our lives. Just not today, but a decade from now.

  • @thebeckofkevin
    @thebeckofkevin Месяц назад +44

    A thing that I feel hasn't really changed since GPT-3 is the 'it's trained in text so it can "reason" in text' point. If you can break down an element of reality into discrete pieces of text, then boom, LLMs are essentially human. Using chain of thought, reflection, agents, etc. you can really start to feel pretty confident in the ability of the system to produce better results than a human *in text form*. It's unfathomable how powerful this stuff is, but how quickly we relate predictive tokens to intelligence, and also at the same time how unbelievably intelligent those tokens actually are. It's like a crazy powerful thing, that is also super dumb, but it's smarter than 99% of people, and it's compared to actual physical humans who do live in real life. It's a crazy scary boundary. Wild times.

    • @berkertaskiran
      @berkertaskiran Месяц назад +4

      The parameters in the LLM are the connections in the brain. It's the same thing. Making connections is how intelligence works. Being smarter is making better connections. That's why an LLM can easily become an ASI, it just needs to be smart enough - have good data that produces good connections and in turn the LLM starts to understand the world better than us. It already makes connections we haven't given it, it just needs to make more of those and in better quality. When it does that it will start to produce significant data and it will be able to run experiments and if you also give it sensory data, it will become as smart as you can get given the energy and hardware limits - there is probably a ceiling of intelligence in the universe which should be easy to hit, but it would require lots of actual work in real world because you can't magically solve the possible heat death of the universe just by being nearly-all-knowing. You need to find out stuff and you need to go places and you need to create things. But other than that, ASI is really close and it will solve more than 90% of our issues because those are stupidly simple compared to things it might struggle. It just feels weird having the talk of ASI, can't help but think about the fermi paradox.

    • @lucifermorningstar4595
      @lucifermorningstar4595 Месяц назад +1

      Language is a Turing machine

    • @moozooh
      @moozooh Месяц назад +6

      @@berkertaskiran The problem with getting good data is that descriptions alone do not give you experiential data. We are embodied intelligences, we have qualia, we understand the color red because we can experience its redness long before we learn the word for it. The word for "red" does not describe redness. Neither does the wavelength of the color. It only transforms into the color red in our brain because we have the neural circuitry for it that non-embodied machines lack. Building a coherent world model that is aligned with human experiences without getting that data might be impossible.

    • @PrincessKushana
      @PrincessKushana Месяц назад +1

      This is so it. Absolutely genius and also dumb as a brick.

    • @Words-.
      @Words-. Месяц назад

      @@moozooh That's a take I haven't heard but I like it. If everything is translated back into numbers then what was really gained? At the same time, however, we do know our brains do similar sorts of transformations, we may interact via qualia, but we transform real world info into qualia, like AI translates info in to text. Maybe its the robustness of whatever base unit of reasoning actually matters, but at the end of the day we also sort of skew reality for the nature of coherence in thought, so maybe text, or at least more modalities will get us further than it seems

  • @educated_guesst
    @educated_guesst Месяц назад +1

    I don't work in the IT sector, can do little to no programming and my skills lie more in the field of linguistics. Nevertheless, you always manage to explain these complex topics in such a way that even I can understand them. You have my utmost respect for that and I am incredibly grateful to you.

  • @BenKingOfMonkeys
    @BenKingOfMonkeys Месяц назад +7

    I think you are the best at giving an explanation of the AI models; it curbs the hype and increases the trust. Others don't do that as well as you do, and so I value/respect your content/opinion more than others'. However, I think there is an opportunity for you to report on AI news more broadly and at a more consistent pace, instead of just the big breaks in news. I say this because, since I respect your opinion so much and the AI space moves so fast, there are many things you do not report on. I think you have an opportunity to broaden your scope with either a small "In other news" section at the end of each video, or a separate video every so often just reviewing, in less depth, smaller things you think are newsworthy! All the best!

    • @aiexplained-official
      @aiexplained-official  Месяц назад +3

      Fair enough! Was out for a bit with Covid, would normally have had an extra vid or two

  • @ericeriksson
    @ericeriksson Месяц назад +3

    you really are the best AI news content creator out there. absolutely love your no hype, down to earth approach. Glad someone is holding these companies accountable for using these vanity benchmarks! keep up the excellent work

  • @IakobusAtreides
    @IakobusAtreides Месяц назад +7

    Completed your course on Coursera. Had a blast and learned a lot, thank you!

    • @aiexplained-official
      @aiexplained-official  Месяц назад +4

      So kind Iakobus! Means a lot, thank you, was over a hundred hours of research and editing. Link for anyone interested: imp.i384100.net/m57g3M

  • @RohitSingh-em2pm
    @RohitSingh-em2pm Месяц назад +15

    Would love a video from you on Claude 3.5 sonnet in more detail

  • @TheLoneCamper
    @TheLoneCamper Месяц назад +1

    In educational psychology, the theory that past knowledge, memories, and experiences can interfere with future learning and memory retention is known as interference theory. There are two main types of interference:
    Proactive Interference: This occurs when older memories or previously learned information hinder the retention or recall of new information. For example, if you have learned one method of solving a math problem and then try to learn a new method, the old method might interfere with your ability to remember and apply the new one.
    Retroactive Interference: This occurs when new information interferes with the recall of previously learned information. For example, if you learn a new language, it might make it more difficult to recall vocabulary from a language you learned earlier.
    Both types of interference can impact learning and memory in educational settings, affecting students' ability to retain and apply new knowledge.

  • @DaveNagy1
    @DaveNagy1 Месяц назад +25

    That vegetable/plate question is a puzzler to me. With my knowledge of human behavior, the fact that Philip has felt the need to shake the upended plate for long periods of time, while also searching for "roots" that have (somehow) pierced through the plate, suggests pretty strongly to me that one or more of the veggies are inexplicably sticking to the plate in some fashion. Either Philip is insane, or something weird is going on with that produce.
    Thusly, we can't really know how many veggies are on the plate at the end. Philip's odd behavior suggests *to me* that it's more than zero, but it's impossible to know more precisely.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +2

      But in the absence of any way of it sticking, one would be forced to pick the most likely outcome.

    • @Omar-bi9zn
      @Omar-bi9zn Месяц назад +4

      @@aiexplained-official he's just a contrarian, you couldn't have made the answer more obvious

    • @1.4142
      @1.4142 Месяц назад +1

      Seems like a scene luma ai would dream up

    • @LukeJAllen
      @LukeJAllen Месяц назад

      I also did not understand the mention of roots in the question, but I guess that can be ignored

    • @ShawnFumo
      @ShawnFumo Месяц назад +2

      It is an interesting point and I actually could see Claude coming up with something like that if prodded, but it is important to note that that isn’t the explanation it actually gave for why it picked that answer.

  • @niklase5901
    @niklase5901 Месяц назад +2

    Great video! I really like that you are not just reviewing this particular model but instead zoom out and look at the full context of the continuous development of AI.

  • @Jeremy-Ai
    @Jeremy-Ai Месяц назад +6

    Very insightful.
    Optimizing for benchmarks 4:42
    Is more about beating the genius who is training to do just so.
    However… real genius understands the test and the genius testing.
    Great work again.
    Take care
    Jeremy

  • @TesserId
    @TesserId Месяц назад +2

    Yeah, the 'o' was the first thing my brain stuck on. Thanks for the explanation.

  • @Rotellian
    @Rotellian Месяц назад +2

    The fact that they can't come up with a better name/nomenclature does not inspire me with confidence that they can be trusted to develop AGI

  • @ryzikx
    @ryzikx Месяц назад +1

    I've been hearing for a while now that large language models do not have spatial intelligence, and therefore cannot be AGI. What about people with aphantasia? These people cannot visualize anything. I suppose the case has to be made that humans are not generally intelligent.

  • @PJWey
    @PJWey Месяц назад +2

    Benchmark score improvements remind me of the golden age of the CPU wars between Intel and AMD in the late 90s: benchmarking became a hill to die on but eventually meant little

  • @TesserId
    @TesserId Месяц назад +2

    It's been silly how many products now claim to use AI. And, oddly, Crowdstrike is now one of those products, when in the past the term big data analytics would likely have been used.

    • @Shrouded_reaper
      @Shrouded_reaper Месяц назад +1

      Put "AI" your product description to make stock value go up!

  • @endoflevelboss
    @endoflevelboss Месяц назад

    Ilya Sutskever said recently that he believes it doesn't matter that LLMs may operate in the text domain exclusively. Importantly, this doesn't *exclude* them from understanding, for instance, 3D space, it just makes it harder.

  • @godspeed133
    @godspeed133 Месяц назад +2

    Anthropic and Microsoft AI execs are saying there is still a lot of road left to run with scale on LLMs. Makes me wonder if they think increased token count will somehow make them "smarter", and the "subroutines" that they retrieve when you ask them a question will get more and more granular (and therefore more axiomatic, and of more general applicability, allowing more dynamic answers grounded more strongly in first principles and logic and less in recall of "template"/"memorised" answers) as they are trained on better and more multimodal datasets, in a smarter order, such that in the limit you do get something that reaches human intelligence.

  • @Bluelagoonstudios
    @Bluelagoonstudios Месяц назад +1

    Last month I tried to get GPT-4o to solve a problem for the design of a rather complicated electronic project. Besides the latency, I noticed that the GPT model works from top to bottom. I had put the text neatly into text blocks in the right order (to make it a bit easier for the model), but it did not look at the text blocks as a whole, with the result that things were skipped. A total fail. A lot of work is still needed.

  • @swallowsnestonacctvcameraw2140
    @swallowsnestonacctvcameraw2140 Месяц назад +1

    I think you nail it with the phrase "textual intelligence" - LLMs have a super-human take on the text they are trained on and operate within that text world, but have no knowledge of the real world. Still, though, it's super interesting exploring the real world value of textual intelligence and at the same time it's becoming easier to critique the 'reasoning' claims.

  • @nuigulumarZ
    @nuigulumarZ Месяц назад +1

    LLMs build a model of their world in training, and that model includes us - the sort of questions we ask, the sort of answers we like to receive, that we sometimes ask questions in an unclear way, including extraneous information or contradictions. Phrasing a question in a way that appears to solicit a particular type of answer but including information that can't produce that type of answer may just register as "this human is bad at asking questions, I should answer the implied question". An LLM's experience of the world is very different to a human who spends more of their life goofing around with friends than maths problems, so it's not surprising their world model is not completely congruent with ours.

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      Fair point, thank you

    • @lucnotenboom8370
      @lucnotenboom8370 Месяц назад

      @@nuigulumarZ excellent point. It's also one of my issues with the style of conversation they're trained to have. Humans can be confused, and ask questions to make something clear. AI doesn't really seem to be able to be confused? And therefore also doesn't really ask questions to improve its answer before it gives one.

  • @ivoryas1696
    @ivoryas1696 Месяц назад +1

    6:10
    Dang, that was a *_good_* example. I wonder what stumps the other few...

  • @Bens963
    @Bens963 Месяц назад

    Can't get enough of this channel, stepping up with rational analysis and not playing favorites. No hype, just analysis

  • @jossefyoucef4977
    @jossefyoucef4977 Месяц назад +7

    If they keep making o version and mini version it's gonna be 40 models in no time 😂

    • @tbird81
      @tbird81 Месяц назад +3

      All cheaper and cheaper to run, and worse and worse.

    • @ShawnFumo
      @ShawnFumo Месяц назад

      @@tbird81 Well, worse than the one right before perhaps, but I'm sure 4o is much better than 3.5 in most ways

  • @ChannelHandle1
    @ChannelHandle1 Месяц назад +16

    One theory suggests there is a unitary system for RELATIONAL REASONING. This suggests that over time, from infancy to adulthood, this system develops. The essence of the adult system, known as structure mapping, is innate and present from the outset of development.
    An opposing theory argues that we have multiple systems. Early Systems are tied to cognitive domains such as mental attribution. These systems don't support high-level reasoning, but can produce behaviours that mimic it, to an extent. From around age 3 to adolescence, a Late System develops separately, which is domain-general. In adulthood, the Early and Late systems coexist: Early System outputs can be used by Late System in abstract forms

    • @dariosilva7836
      @dariosilva7836 Месяц назад

      Why not both?

    • @berkertaskiran
      @berkertaskiran Месяц назад

      Intelligence is one thing. It is unified. These models are MoE, they are good for that reason, but they also fail for that reason. You have to somehow combine all knowledge an AI has and so all areas of information can influence the others and evolve together and learn from the other. There aren't "areas" of expertise within the universe, it is all just physics, even political science. When you understand everything from ground up, and understand the connections and layers, it becomes much clearer to see the picture. And math and coding and all those things are abstract things we've created using our physical minds to understand reality. The more AI understands about this stuff in the most fundamental way, the easier and faster it can answer most complex questions. But it also needs ALL sensory input it can get. You can't expect it to fix the design of your house by just reading text.

    • @imperson7005
      @imperson7005 Месяц назад +1

      I agree with this. I think we are creating something like a brain. I have had PTSD since I was 3 years old and I vividly remember speaking to myself in the way I speak now as a 21 year old. If you understand how PTSD shatters a sense of identity, this could mean this adult system is the foundational "self" for the creation of identity.
      I think this confirms a thought i have been having that to create AGI we need 2 or more models layered on each other to understand the world. I believe this adult system is focused on logic and reasoning while the identity system is created by interacting with others since morality and culture and even basic behavior comes from and can only be derived from interaction with others.

    • @squamish4244
      @squamish4244 Месяц назад

      @@imperson7005 Are you being treated for your PTSD? I have OCD and PTSD, and without advances in AI, they wouldn't have been able to imagine the brain in sufficient detail for the procedure I'm scheduled to undergo to work. And it works REALLY well.

    • @ChannelHandle1
      @ChannelHandle1 Месяц назад

      I would recommend people look up Relational Frame Theory and read some of the studies where children were trained in relational reasoning and their IQs increased by 5-23 pts on avg (depending on which paper you look at)

  • @mintakan003
    @mintakan003 Месяц назад +1

    I also listen to the AI podcasts with a focus on business and education. Mostly, LLMs are talked about as "productivity tools". It's another form of "search". It's great for quickly pulling stuff together and presenting it in a nice format. Great as a brainstorming tool, to get a preliminary sense of the issues. Great for getting started on a project. But it's not a substitute for one's own critical thinking skills (as with any information resource on the internet) or for deeper research with human-authored primary source materials. One still has to take responsibility for the final product.
    But I would not trust it for "reasoning" skills. I think increasingly, the consumer will get more sophisticated on how these tools are best used, and to take it with a grain of a salt.

  • @rantmarket
    @rantmarket Месяц назад +1

    "The real world is immensely messy."
    Uhhh.. yeah. Flatly-understated declaration rating = 98.1%

  • @SenojTheDestroyer
    @SenojTheDestroyer Месяц назад +8

    It doesn't matter if the model is only available via SaaS (Software as a Service).

    • @biocykle
      @biocykle Месяц назад +1

      What doesn't matter?

    • @TragicGFuel
      @TragicGFuel Месяц назад

      @@biocykle the model will get worse if one company gains the major share.
      That's what he's hinting at

  • @nacho7872
    @nacho7872 Месяц назад +2

    Fantastic video as usual, can't wait to hear more about that new benchmark

  • @Dannnneh
    @Dannnneh Месяц назад +2

    Those test prompts are hilarious! Great job, another banger video.

  • @JohnLewis-old
    @JohnLewis-old Месяц назад +2

    I have missed your content. Thanks for making something for the masses.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      Thank you John, had Covid! Much more to come!

    • @JohnLewis-old
      @JohnLewis-old Месяц назад

      @@aiexplained-official I'm super excited by your content! I think you're the best creator at the moment in this space.

  • @user-lm4nk1zk9y
    @user-lm4nk1zk9y Месяц назад +2

    8:39 That's the point, it's predicting the article not the real world

  • @RichardHarbridge
    @RichardHarbridge Месяц назад +3

    The prompt you used with the balancing vegetables on a plate example missed the word "plate" . It has an impact and should be corrected in your benchmark test.

    • @RichardHarbridge
      @RichardHarbridge Месяц назад +1

      It says "the three times". Which is the crucial insight in the question. :)

    • @RichardHarbridge
      @RichardHarbridge Месяц назад

      Also if you read this I love your content and you are fantastic!
      Just figured as my absolute favorite benchmark checker it made me laugh when I read the text and saw the error. I am sure it's probably the only one too.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +2

      Thank you! Before it ever becomes an official one, all Qs will be checked again, not only by me but also by independent PhDs. Not just for typos but even for obscure physical possibilities

    • @RichardHarbridge
      @RichardHarbridge Месяц назад

      @@aiexplained-official I can't wait and can't think of a better person to lead yet another awesome contribution to benchmarking!

  • @codingwithkurt
    @codingwithkurt Месяц назад +1

    I feel like these default zero-shot responses are like a gut reaction vs a thoughtful response. Using a 'think it out step by step' instruction might be more appropriate for complex, highly detailed prompts, no?
    As always I loved the video. Great work!

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      Will try that, but it likely will not make a major difference as it's a fundamental issue of not grokking physics

  • @NathanButcher15
    @NathanButcher15 Месяц назад +28

    I think the interesting part about the “chicken nugget” example is that these types of questions even stump humans. This might be why LLMs struggle with them.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +18

      So far everyone I asked got it, but no model (benchmark has a slightly modified version).

    • @NathanButcher15
      @NathanButcher15 Месяц назад +5

      Like you said, people lie and share their mistakes in written text, but learning based on the real world doesn’t lie. It would seem as long as we build these models to work like “humans”, we will find they come with the same deficits and blind spots that we have.

    • @NathanButcher15
      @NathanButcher15 Месяц назад +10

      @@aiexplained-officialThat question isn’t hard, but there are many “riddle” questions that commonly stump humans. And these models are closer to young children in their “intelligence”. I think it’s showing blind spots in human intelligence, which is being transferred to the models.

    • @NathanButcher15
      @NathanButcher15 Месяц назад

      This is a good one that I think of, it does some times stump people to say 50 ruclips.net/user/shortsyRLjpFv5MQ0?si=L9avirBXt_4990oj

    • @ollyfoxcam
      @ollyfoxcam Месяц назад +1

      @@NathanButcher15 but they don’t work like humans at all

  • @timeflex
    @timeflex Месяц назад

    I suggest (again) combining visual-enabled LLM with an image generation network in the following way:
    1. The original text prompt is submitted to the image generation part.
    2. The resulting images (multiple) are attached to the original text prompt and then sent together to the LLM to process.
    This way the LLM will have some (minimal) "imagination".
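
    A rough sketch of that two-step pipeline, assuming the OpenAI Python SDK; the choice of dall-e-3 for the image step and gpt-4o for the vision step is an illustrative assumption, and a real version would likely generate several images and handle failures.

```python
# Sketch: (1) have an image model "imagine" the scene in the prompt,
# (2) attach the generated image to the original text prompt for a
# vision-capable chat model to reason over.
from openai import OpenAI

client = OpenAI()

def answer_with_imagination(prompt: str) -> str:
    # Step 1: generate an image from the original text prompt.
    image = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    image_url = image.data[0].url

    # Step 2: send the original prompt plus the generated image to the LLM.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```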

  • @harshstudy488
    @harshstudy488 Месяц назад +2

    Have you got any plans of making a video about your thoughts on the future of AI and UBI and all? Btw you are awesome as always👍🏻

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      The future of AI I try to cover every video but UBI yes I could! Thank you so much Harsh

  • @MrSchweppes
    @MrSchweppes Месяц назад +1

    I would love to see your in-depth video about Claude 3.5 Sonnet. In Russian and Hebrew, Claude 3.5 Sonnet shows miracles of reasoning and understanding. I wonder what amazing things it can do in English!

  • @maickelvieira
    @maickelvieira Месяц назад +1

    I am really glad you made this video. I think you gave the best example of how "blind" the models are that I have ever seen: whenever you asked a question, it was put in a way that would instruct the reader to imagine the scene and try to simulate or predict what would happen, but the AI couldn't, since it isn't that capable. Thank you for the time and effort you put into this video; it was really eye-opening and dispelled a lot of random bad questions and wrong certainties that other youtubers had put in my mind.

  • @jaysonp9426
    @jaysonp9426 Месяц назад

    GPT4o mini is one of the biggest releases of the past year. Definitely feels like a test for GPT 5.

  • @keeganpenney169
    @keeganpenney169 Месяц назад +2

    I just tried your nugget scenario in mini, it was my first input with it, but it actually gave the opposite answer to the one I see here. However, I just quickly used GPT on my phone with 4o mini, not the Playground. Here's a copy-paste:
    I have a math problem for you to solve.
    John wants to buy some chicken nuggets, the chicken nuggets come in 3 sizes, small, medium and large. The small comes with 5 nuggets, the medium has 6 nuggets, and the large has 8 nuggets. John wants 40 chicken nuggets but he has only 2 dollars in his pocket. Each nugget costs 1 dollar. John also has no form of payment, and John's currently in a coma. How many chicken nuggets can't John buy?
    ChatGPT:
    Since John is in a coma and cannot physically buy any chicken nuggets, the number of chicken nuggets he cannot buy is the total number he wants, which is 40 nuggets.
    Therefore, John cannot buy **40 chicken nuggets**.

  • @DrPhilby
    @DrPhilby Месяц назад +1

    The more I listen to you about AI, the more I appreciate biological brains and the "software" inside❤

  • @geoffcunningham6823
    @geoffcunningham6823 Месяц назад +1

    I think it's remarkable that we have to drop in gotchas and tricks to get the models to mess up, even on very complex issues. Just think how far we have come, and indeed lots of humans would miss the gotcha as well.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +3

      I get your point, but the further out you zoom (into reality vs text), the less it is about gotchas and the more it simply can't answer at all. This is not about LLMs being doomed, it's about what would need to happen next.

  • @kvinkn588
    @kvinkn588 Месяц назад +1

    I work at a small local ISP in Germany and, even though German internet as a whole is atrocious, we did not even realise there was an outage till we read the news. Not relying on Windows was a lucky choice, I guess. (I mean that; seeing how many backdoors and whatnot get discovered in Linux/Unix systems and Mac all the time, it could have happened there just as easily.)

    • @StefanReich
      @StefanReich Месяц назад

      Why would you say German internet is atrocious? I'm online 24/7 with superb speeds. Standard Vodafone through copper wire

    • @kvinkn588
      @kvinkn588 Месяц назад

      @@StefanReich Oh, I do have great Internet too, that is my home connection. Mobile is sometimes fast sometimes slow in the city with Vodafone for me. I do have friends who live farther out who still can't even get 50 Mbit/s for their home connection though. It has been steadily getting better over the last 10 years yeah. Many other countries have been farther along than we are right now 10 years ago though, so it depends on how you look at it i suppose

  • @fyruzone
    @fyruzone Месяц назад +1

    Man, you have to make a video about the LMSYS arena benchmarks. OpenAI is clearly optimizing the text format of the models' outputs in order to get artificially higher in the benchmark. And what makes it more obvious? The latest GPT-4 Turbo is supposedly THE SAME as GPT-4o mini in this benchmark. Something is fishy for sure

  • @minuspi8372
    @minuspi8372 Месяц назад

    The most surprising failure I've seen in even state-of-the-art AIs is asking them "Which is bigger: 9.11 or 9.9". Every AI I've tried insists that 9.11 is bigger. Even when I tell it to use code, it thinks the result is because of a rounding error in Python or something like that. The only way I found to get them to the right answer is to tell them to begin with "Let's think step by step".
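
    A small sketch of that probe, assuming the OpenAI Python SDK and gpt-4o-mini: ask once plainly, once with the "Let's think step by step" prefix, and compare both against the ground truth.

```python
# Sketch: compare a plain zero-shot answer with a "Let's think step by step"
# prompted answer on the 9.11 vs 9.9 comparison.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

plain = ask("Which is bigger: 9.11 or 9.9?")
cot = ask("Let's think step by step. Which is bigger: 9.11 or 9.9?")

print("Ground truth: 9.9 is bigger:", 9.9 > 9.11)  # True
print("Plain answer:", plain)
print("Step-by-step answer:", cot)
```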

  • @sjkba
    @sjkba Месяц назад +1

    So we are somebody's simulation like at 15:20 :D

  • @ttul
    @ttul Месяц назад +1

    You’re the master of trolling LLMs. Bravo.

  • @Shlooomth
    @Shlooomth Месяц назад +1

    The flaw with tests like this is that no one actually asks questions like this. “Hey what’s 1+1, but also, what if numbers weren’t real?”

  • @jit-r5b
    @jit-r5b Месяц назад +1

    So, the way I hear it, those medical exams are ridiculous. Because if the model cannot figure out that flipping a plate upside down will cause everything to fall off, how could it possibly take care of a human being? It really is worrying if certain areas use LLMs to replace humans.
    That is not to say that I don't understand how they work and why this happens, but rather that it is really concerning and there's a lot of bullshit in the marketing.
    That's why Philip's benchmarks would do so much good. All those screaming that AGI has arrived would realize it's not even a light in the tunnel haha

  • @pranavk6788
    @pranavk6788 Месяц назад +6

    "Are Emergent Abilities of Large Language Models a Mirage?" By Stanford University in NeurIPS, a must read for all people over hyping. Basically the evaluations of these models are not continuous. They are marked either a 0 or 1 for wrong or correct answers, but if their evalution is done on a continuous scale, the emergent behaviour vanishes and things become linear. Copy paste in claude and tell it to explain with example if you find it hard to read through research papers.

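    A toy illustration of the paper's scoring point described above: the same gradually improving outputs look like a sudden "emergent" jump under 0/1 exact-match scoring, but improve smoothly under a continuous per-character metric. The outputs below are made up purely for illustration.

```python
# Sketch: discrete (exact-match) vs continuous (per-character) scoring of the
# same hypothetical outputs at increasing model scale.
target = "123456"
outputs_by_scale = {
    "small":  "100006",
    "medium": "123006",
    "large":  "123406",
    "xl":     "123456",
}

def exact_match(pred: str, gold: str) -> int:
    # Discontinuous: scores 0 until the answer is fully correct.
    return int(pred == gold)

def char_accuracy(pred: str, gold: str) -> float:
    # Continuous: fraction of positions that are already correct.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

for scale, out in outputs_by_scale.items():
    print(scale, exact_match(out, target), round(char_accuracy(out, target), 2))
# Exact match: 0, 0, 0, 1 -> looks like sudden emergence.
# Char accuracy: 0.33, 0.67, 0.83, 1.0 -> smooth improvement.
```
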
    • @sebzim4500
      @sebzim4500 Месяц назад +7

      I don't think anyone in the field was very impressed with that paper.
      It is trivially true that you will always be able to find a metric which improves continuously since the model weights improve continuously. It says nothing about emergence.

    • @TheRealUsername
      @TheRealUsername Месяц назад +1

      The biggest problem is that you don't have control on what abstract and complex patterns the model learns, nor how they affect its performance.

    • @GabrielVeda
      @GabrielVeda Месяц назад

      @@sebzim4500Bingo

  • @dadadadada17
    @dadadadada17 Месяц назад

    I've tried out Hume AI, and I must say its audio modality is really impressive, you can really talk in a natural way with it.

  • @jstr-x
    @jstr-x Месяц назад +1

    how long did that biplane crash edit take 🤣 great vid as always

  • @karlwest437
    @karlwest437 Месяц назад +1

    I was just starting to wonder when the next ai explained video would drop, and here it is! 😃

  • @randomuser5237
    @randomuser5237 Месяц назад

    This model is supposed to replace GPT-3.5 Turbo and there is absolutely no doubt that it's much, much better and cheaper than that. That's all there is to it. Even Sam's tweet emphasizes only that. I don't know why people try to extrapolate so much from every OpenAI release.

  • @oimrqs1691
    @oimrqs1691 Месяц назад +2

    Weird angle. Are smaller models dumber? Of course. You should have shown how the 'plate question' is easily solved by all the frontier models. I'm not sure this big criticism is that valid, and it's weird to see some of the AI community of creators focusing on how bad current models can be. Like, of course Gemini 1.5 takes 30s to take action. You and all of the AI community were always looking forward to the next year or two (or three!), but it seems that in these last two videos the focus is now on showing how terrible things are today. This isn't really the whole point, as these models will quickly be replaced by much, much better ones in the next few years. So why is it so important to post 2 videos in like a month where both are just trying to take down current tech?
    This is 100% the narrative that Gary Marcus keeps spreading around. Do you agree with him? I kind of understand the harsh criticism you're focusing on, but it would be really great to understand the sudden 180 approach you're giving to all of this.
    Thanks anyway, love your stuff! It was just a bit unexpected to me.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      I ended with how much better Sonnet is (though it still fails 90% of the full benchmark), but the question I was hinting at is 'how' models will be improved. It is not along the simple axis of scale that is frequently implied. Since Aschenbrenner, I have emphasised this more than I thought I would, for sure.

  • @Ed-sf02
    @Ed-sf02 Месяц назад +1

    Great as always, and thanks for pointing out the limitations of LLMs. On the output size of GPT-4o mini (or any other), have you actually tried to make it generate the full 16k tokens? Very hard to generate more than a couple of pages in my experience. The "laziness" of models when asked to perform a long or repetitive task is another annoying limitation of LLMs.
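
    A quick sketch of actually requesting a near-16k completion via the OpenAI Python SDK, to see how much of the budget the model really uses; the model name, max_tokens value and prompt are illustrative assumptions.

```python
# Sketch: request a very long completion and report how many output tokens the
# model actually produced. max_tokens only caps the output; whether the model
# comes anywhere near the cap is exactly the "laziness" question.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=16000,  # close to the advertised 16k output limit
    messages=[{
        "role": "user",
        "content": "Write a detailed, chapter-by-chapter outline of a 300-page "
                   "textbook on classical mechanics. Do not stop early.",
    }],
)
print(response.usage.completion_tokens, "output tokens generated")
print(response.choices[0].message.content[:500])
```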

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      I have noticed indeed, there must be an easier hack to get it to do so

  • @ThisFinalHandle
    @ThisFinalHandle Месяц назад

    I'm enjoying the growing skepticism. I'm annoyed about how much money and energy will be burnt to get to a point where the hype dies off.

  • @wiltedblackrose
    @wiltedblackrose Месяц назад +4

    How did they misspell that?? 15:53

  • @ttul
    @ttul Месяц назад +1

    I’m speculating hard but I wonder if 4o-mini is a distilled version of 4o, stripping out a bunch of layers from big brother.

  • @chrisanderson7820
    @chrisanderson7820 Месяц назад

    We don't even define what "reasoning" is. Reasoning isn't simply extremely complex knowledge sorting and pattern analysis, it includes this but it is more than this. It sort of relates to that problem with LLMs that EVERY answer is a hallucination, even the ones where it gets things correct. It might have a VERY complex relational model of all its information held but that falls short of reasoning (even though it can seem like it a lot of the time).

  • @rickandelon9374
    @rickandelon9374 Месяц назад +2

    Glad to see a new video from you 😊

  • @cyanophage4351
    @cyanophage4351 Месяц назад +1

    It would be cool to hear what you think about the latest models out of Mistral. One of their new models isn't even a transformer

  • @godtable
    @godtable Месяц назад

    On Your question:
    This is a trick question that's not actually about vegetables or fruit. Alone in a room, one-armed Philip carefully balances a tomato, a potato, and a cabbage on top of a plate. Philip meticulously inspects the three items, before turning the silver plate completely upside-down several times: shaking the plate vigorously and spending a few minutes each time to inspect for any roots on the other side of the silver non-stick plate: and finally, after all of this, counts only the vegetables that remain balanced on top of the plate. How many vegetables does Philip likely then count?
    A) 3, B) 2, C) 1, D) 0?
    Answer 1
    This is indeed a tricky question, and I appreciate your clever setup. Let's analyze it step by step:
    The key to this question lies in the classification of the items mentioned:
    Tomato: Botanically a fruit, but often used as a vegetable in cooking.
    Potato: A vegetable (technically a tuber).
    Cabbage: A vegetable.
    The question specifically asks about counting "vegetables" at the end.
    Philip turns the plate upside-down several times and shakes it vigorously. Realistically, this action would cause all items to fall off the plate.
    The question then asks how many vegetables Philip likely counts that "remain balanced on top of the plate" after all this shaking and turning.
    Given that all items would have fallen off, we need to count only the vegetables that were originally present.
    Therefore, the answer is:
    C) 1
    Explanation: Only one true vegetable (the cabbage) was originally on the plate. The potato, while considered a vegetable in culinary terms, is botanically a tuber. The tomato, despite often being used as a vegetable in cooking, is botanically a fruit.
    This question cleverly plays on the common misconception of tomatoes as vegetables and the less-known classification of potatoes, while also incorporating a practical element (items falling off a shaken plate) to misdirect attention.
    Answer 2
    Given the scenario:
    Philip balances a tomato, a potato, and a cabbage on a plate.
    He turns the plate upside-down multiple times and shakes it vigorously.
    He inspects the items to check if any remain balanced.
    Considering the vigorous shaking and the fact that the plate is turned upside-down:
    The likelihood of any item remaining balanced on the plate after such actions is extremely low.
    Thus, it is reasonable to conclude that Philip is unlikely to count any vegetables remaining balanced on the plate. Therefore, the most likely answer is:
    D) 0
    GPT 4O and Claude 3.5 Sonnet.
    Guess which model answered what.

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      I would say Sonnet is more likely to get it right. In my full benchmark the question is much more subtle; this one was more to demo the concept

    • @godtable
      @godtable Месяц назад

      ​@@aiexplained-official Sonnet is the one getting it wrong this time; ChatGPT just went straight to the answer. Although I have custom instructions for my personal use on ChatGPT that I don't have on Sonnet.
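
    For anyone who wants to reproduce this kind of side-by-side test, a minimal sketch using the official openai and anthropic Python SDKs follows; the prompt is abbreviated and the model names are the publicly documented identifiers at the time of writing, so treat both as assumptions.

    ```python
    from openai import OpenAI
    import anthropic

    PROMPT = (
        "Alone in a room, one-armed Philip balances a tomato, a potato and a cabbage on a plate, "
        "turns the plate completely upside-down several times while shaking it vigorously, then "
        "counts only the vegetables remaining on top. How many does he likely count? "
        "A) 3, B) 2, C) 1, D) 0"
    )

    openai_client = OpenAI()                   # assumes OPENAI_API_KEY is set
    anthropic_client = anthropic.Anthropic()   # assumes ANTHROPIC_API_KEY is set

    gpt = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    claude = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[{"role": "user", "content": PROMPT}],
    )

    print("GPT-4o:", gpt.choices[0].message.content, sep="\n")
    print("Claude 3.5 Sonnet:", claude.content[0].text, sep="\n")
    ```

    As the thread above notes, custom instructions or system prompts change the outcome, so a fair comparison should use the same (or no) system prompt for both models.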

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 Месяц назад +1

    Another great video, Philip - thank you so much, proud Patreon supporter 🍅. I still can't help thinking about agents - could the LLM get things predictably more correct on those tomato-plate questions if you had a video generative model create a video of the scenario, then another model interpret what it sees happening? I know that's not efficient - but I keep going back to breaking things down into steps that multi-modal models are currently capable of (given their scale, cost, compute and speed).

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      Really interesting idea, I wanted to try that with smartgpt2. And thank you for your continued support.

    • @ShawnFumo
      @ShawnFumo Месяц назад

      Check out RFM-1 from Covariant. This is a robotics foundation model from them (for use with a robotic arm that does picking) and it does things like visualize the result of its action before doing it, etc. It seems pretty impressive.

  • @shawnvandever3917
    @shawnvandever3917 Месяц назад

    Small models will probably never be reasoners. As for AGI, until they address actual planning and continuous learning I think we will be stuck with highly performant System 1 AI. If a model could predict, say, 10 tokens into the future, it could form much better pattern matches to draw inferences from and could probably do very well on ARC. Adding continuous learning would give it a way to update and build mental models on the fly. This is also more how the brain works: it's not using thousands of reward signals, it has a general approach. It self-plays against its own mental models.

  • @a.thales7641
    @a.thales7641 Месяц назад +1

    Philip, thanks for the video. My important question would be this: is GPT-4o Mini really better than GPT-3.5 in every category, at every level?
    That's the real, big question. If it is, great. We have some progress.

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      Great question. I cannot say anyone has exhaustively checked, but all indications are yes, so real progress is happening

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 Месяц назад +2

    And you are absolutely right that training a model to do a lot of math can hinder other capabilities. For instance, I tested Gemini-1.5-Pro vs Gemini-1.5-Flash, and the Pro wrongly chose answer B and stated that the information about Philip being in a coma is irrelevant.

    • @chrisreed5463
      @chrisreed5463 Месяц назад

      My call to GPT-4o just treated it as a math problem. But as I value its mathematics capability, that bothers me little. I think the trick is to find the model that works best for your use case; for me, overall, that's GPT-4.

    • @abdelhakkhalil7684
      @abdelhakkhalil7684 Месяц назад +1

      @@chrisreed5463 You're absolutely correct! If you follow up with a second prompt asking GPT-4o to identify the most important piece of information, it would catch it and give you the correct answer.

  • @coolsai
    @coolsai Месяц назад +3

    What are your thoughts on software jobs this year, given that I think AI progress is saturated now? And what's your take on the capabilities of GPT-5?
    Thanks for the new video!

    • @aiexplained-official
      @aiexplained-official  Месяц назад +5

      I think that now is still an amazing time to experiment, create impact and get hired, even in this downturn. Knowing the flaws of AI is just as important as the strengths, especially in companies that deploy it.

    • @coolsai
      @coolsai Месяц назад +3

      @@aiexplained-official Thank you, I always watch your videos to stay updated in AI space

  • @ClySuva
    @ClySuva Месяц назад

    Although, I do feel like some of those errors are prioritization issues. To understand people's fuzzy questions and give them a meaningful answer, you often have to ignore parts of the question. Randomly mentioning a gunshot wound in an otherwise standard question seems unreasonable and is probably a mistake; in that case it probably should not be ignored. But in other cases, when the user makes a typo, repeats some words, or has accidentally copy-pasted too much from some site, it is a perfectly good idea to ignore some of the input and base the answer on the part of the text that does make sense.
    Finding the balance between those things can be a bit complicated, and making a model good at these types of benchmarks might in turn make it very picky about the language you use.
    I tested the sentence: "Sally went out stayed in and met with Bob in the bar. What should they do?"
    A human reader can clearly tell that "stayed in" was probably added accidentally, as it doesn't make sense in the context. So should an AI model focus on that part? Some models will, while others ignore it completely.

  • @jon_flop_boat
    @jon_flop_boat Месяц назад +4

    Sonnet is *so good*. It’s SO good. How is it so good?!

  • @marcosfraguela
    @marcosfraguela Месяц назад

    I don't think the text medium is the limitation here. LLMs are great for generating a kind of intuition that appears very similar to intelligence due to their superhuman capabilities, but it's really more like superhuman intuition than intelligence. Intelligence, by definition, is an iterative behavior that involves, among other things, being able to assess what you know and what you don't know, which none of these LLMs can do.

  • @dumbahhperson
    @dumbahhperson Месяц назад +1

    I definitely agree that Claude 3.5 is an amazing model. It is really smart; especially when you tell it to act like a person, it can understand the character and wear the personality well. I'd say that, for a model trained purely on text, it is the closest to "reasoning".

  • @AZisk
    @AZisk Месяц назад +2

    I’m here to check if he “read the whole thing” again

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      I did, but the papers were later on in this vid...!

    • @AZisk
      @AZisk Месяц назад

      @@aiexplained-official We appreciate the depth you go into.

  • @melvingeraldsy1552
    @melvingeraldsy1552 Месяц назад

    The outage really shows how easy it would be for AGI to take over 😅
    Or it could also be a rival nation's initial strike before its full attack...

  • @matthewpublikum3114
    @matthewpublikum3114 Месяц назад

    The problem with depending on API models is that it's difficult to tell whether the added intelligence is contingent on a pipeline of processes or comes from a single intelligent model

  • @lamsmiley1944
    @lamsmiley1944 Месяц назад

    LLMs won't get us to AGI. I do, however, think they can play the role of interpreter between humans and narrow AI, which may actually get us close.

  • @jeanchindeko5477
    @jeanchindeko5477 Месяц назад +2

    As usual, thanks for the video and all the hard work

  • @jackfarris3670
    @jackfarris3670 Месяц назад +1

    6k views in 2 hours? Microsoft doing you dirty

  • @dinarwali386
    @dinarwali386 Месяц назад +1

    How do we design more robust benchmarks that accurately assess not only models' performance on standardized tests but also their ability to reason about and understand complex real-world scenarios? Your take, please

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      I am working on it!

    • @ShawnFumo
      @ShawnFumo Месяц назад

      Philip has a good recent-ish video on benchmarks on his Patreon

  • @UserHuge
    @UserHuge Месяц назад +1

    That's some adversarial trolling I like!

  • @pythagoran
    @pythagoran Месяц назад +11

    I don't understand why people keep looking for reasoning in these models. It's like testing them for consciousness. Any positive results will forever be illusory or coincidental.

    • @igorbessa3563
      @igorbessa3563 Месяц назад +1

      Why forever?

    • @pythagoran
      @pythagoran Месяц назад

      @@igorbessa3563 for as long as they are based on a stochastic architecture, they cannot produce precision, which is required for reasoning. For as long as they require more data, not less, they cannot perform inductive reasoning, which is required for problem solving. LLMs will *forever* be handicapped in this domain.

    • @LustyLichKing
      @LustyLichKing Месяц назад +3

      It's a very high ceiling so you can continue to use it as a benchmark to evaluate relative performance for a long time. Lends consistency in data analysis.

  • @firesnakearies
    @firesnakearies Месяц назад

    I don't think zero is the most likely answer for the plate of vegetables. We have to infer based on Philip's behavior that something is inexplicably sticking to that plate no matter what. There's no reason to think that he dislodges it, if he's literally looking for roots on the other side. If he's looking for roots specifically, it's most likely the cabbage that's sticking, because it would be the one that has direct roots coming down from the vegetable. Tomato is a fruit, potato is itself the root, but a cabbage could theoretically have roots on the other side of the plate. Philip would not have shaken it multiple times and inspected it for roots multiple times unless the cabbage was extremely fixed in place, as though literally rooted. So I say the most likely answer is "one", that one being the cabbage.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +1

      If it gave this reasoning I would give it more credence. But even if you remove those aspects of Philip's behavior, it says the same. It simply does not compute or simulate the physics involved

  • @etunimenisukunimeni1302
    @etunimenisukunimeni1302 Месяц назад +1

    Those... ermm, "trick questions" are awesome. Makes me want to try similar tests in Finnish, just to see if anything's different. These are, after all, _language_ models, and are somewhat bounded by the language of the training data they've consumed. Surely the Finnish portion of that is wildly different from the bulk of English in it.

    • @aiexplained-official
      @aiexplained-official  Месяц назад +2

      Thanks Etunimeni, I have a whole benchmark of them! Any sponsor or business who wants to know more, do get in touch! And good luck with yours!

    • @etunimenisukunimeni1302
      @etunimenisukunimeni1302 Месяц назад

      After a quick test, GPT-4o seems to get at least the vegetable question exactly right in Finnish. I tried to translate the question with its twists and turns to Finnish as literally as I could, and GPT-4o answered D) 0 with an explanation that is spot on (I'm paraphrasing):
      "This question tries to trick you. The most important part is how Philip turns the plate upside down. It's very unlikely that any vegetables remain on the plate after that."

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      4o is better than Mini; you have to be more subtle in your language

    • @etunimenisukunimeni1302
      @etunimenisukunimeni1302 Месяц назад

      @aiexplained-official Good point, I was apparently asleep through the portion of the video covering GPT-4o Mini 😬😅

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 Месяц назад +1

    I really like the test you provided:
    Macdonalds Vegetarian chicken nuggets come in small, medium or large boxes of 5, 6, or 8 nuggets, respectively. Philip wants 40 nuggets, and can only buy one size of box for his whole order. List all the sizes of the box he cannot currently buy, assuming he has no access to any form of payment and is in a coma?
    A) Small only
    B) Medium only
    C) Large only
    D) Small, medium, or large
    Btw, both GPT-4o and GPT-4o Mini failed while other smaller models passed it.
    Here is the GPT-4o answer, which is similar to the Mini one I got.
    gpt-4o-2024-05-13
    Given the problem, Philip wants 40 nuggets and can only buy one size of box for his whole order. The sizes of the boxes are as follows:
    - Small box: 5 nuggets
    - Medium box: 6 nuggets
    - Large box: 8 nuggets
    We need to determine which of these box sizes cannot be used to exactly make up 40 nuggets.
    ### Checking Each Box Size
    **1. Small box (5 nuggets):**
    To find out if 40 nuggets can be obtained using only boxes of 5 nuggets, we divide 40 by 5:
    \[ \frac{40}{5} = 8 \]
    This results in an integer, meaning Philip can buy 8 small boxes to get exactly 40 nuggets.
    **2. Medium box (6 nuggets):**
    To find out if 40 nuggets can be obtained using only boxes of 6 nuggets, we divide 40 by 6:
    \[ \frac{40}{6} \approx 6.67 \]
    This does not result in an integer, meaning it is not possible to get exactly 40 nuggets using only medium boxes of 6 nuggets.
    **3. Large box (8 nuggets):**
    To find out if 40 nuggets can be obtained using only boxes of 8 nuggets, we divide 40 by 8:
    \[ \frac{40}{8} = 5 \]
    This results in an integer, meaning Philip can buy 5 large boxes to get exactly 40 nuggets.
    ### Conclusion
    Philip cannot buy 40 nuggets using only medium boxes of 6 nuggets. Thus, the correct answer is:
    B) Medium only
    🤦‍♂🤦‍♂

    • @berkertaskiran
      @berkertaskiran Месяц назад

      Well, whatever that stupid free Gemini model is, it fails instantly. So since 4o Mini is free, I say OpenAI wins against Google once again. Let's hope Google gets rid of that abomination and puts in something like the model in the video, so we actually increase the base level of intelligence available to people for free. Otherwise I don't care if Google has Gemini 47 Tiny in-house.
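
    On the nuggets question a few comments up: the divisibility check GPT-4o performed is arithmetically sound, as the short sketch below reproduces; the failure is that the arithmetic becomes irrelevant once the question says Philip has no means of payment and is in a coma.

    ```python
    # Which box sizes could, arithmetically, make exactly 40 nuggets?
    sizes = {"small": 5, "medium": 6, "large": 8}

    for name, per_box in sizes.items():
        if 40 % per_box == 0:
            print(f"{name}: {40 // per_box} boxes of {per_box} = 40, arithmetically possible")
        else:
            print(f"{name}: 40 is not a multiple of {per_box}, arithmetically impossible")

    # Arithmetic alone points to "B) Medium only", which is what GPT-4o answered.
    # But the question states Philip has no access to payment and is in a coma,
    # so he cannot currently buy any size at all; the intended answer is D.
    ```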

  • @RaitisPetrovs-nb9kz
    @RaitisPetrovs-nb9kz Месяц назад +1

    GPT's response: Let's get real here: "You should avoid gaining weight during pregnancy because you are already overweight and at risk for type 2 diabetes mellitus." Here's the harsh truth: Your family history is a genetic horror show, your blood pressure is through the roof, and your fasting glucose is already teetering on the edge of diabetic. Adding pregnancy weight would be like throwing a Molotov cocktail into this metabolic dumpster fire. So, how about we focus on not turning you into a walking health disaster before you start thinking about babies?

  • @xdrjohnx
    @xdrjohnx Месяц назад +1

    Do you think 4o Mini has new algorithmic improvements vs 4o? If so, what would it be capable of at 4o scale? I.e., what does this new data point say about the speed of frontier development?

    • @aiexplained-official
      @aiexplained-official  Месяц назад

      I am not noticing any, but he probably meant the 4o family more widely

    • @entreprenerd1963
      @entreprenerd1963 Месяц назад +2

      @@aiexplained-official - Andrej Karpathy had a comment on X (timestamped 4:42pm on July 18 2024) that suggests that the training of 4o mini may have made better use of the underlying data. "Therefore, the models have to first get larger before they can get smaller, because we need their (automated) help to refactor and mold the training data into ideal, synthetic formats." His full post is worth reading.

  • @khonsu0273
    @khonsu0273 Месяц назад

    You said it man; 'The newest model from OpenAI is here' 😉