The Best Model On Earth? - FULLY Tested (GPT4o)
- Published: 13 May 2024
- GPT4o is better, faster, and cheaper than GPT4. How does it perform against my LLM rubric? Let's find out!
Learn more about Mobilo - rb.gy/pcccty
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
• Introducing GPT-4o
LLM Rubric - bit.ly/3qHV0X7
GPT-4o assumed that you put the whole table in the microwave 😂
in my test it assumed the cup had a lid.
I just left a comment saying exactly what you said, but not as a joke. I actually do believe that's what it assumes here.
-----
I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".
@@GaryMillyz Well yeah, but if the cup is upside down on the table, then the table must be at least big enough to hold the cup.
I wonder how it would be if you said "floor" instead of table.
@@GaryMillyz Maybe replacing a cup with a wine glass can help.
@@NoHandleToSpeakOf I tried:
me: There's a pea on my desk. I turn a wine glass upside-down and put it over the pea. Now I transfer the wine glass to the microwave without changing its orientation. Where is the pea?
GPT4o: The pea would be inside the wine glass, trapped under the bowl of the glass. When you turned the wine glass upside down and placed it over the pea, the pea ended up inside the inverted bowl. Transferring the wine glass to the microwave without changing its orientation keeps the pea inside the glass.
I think the visual/sound emotional intelligence is the main feature of 4o
Yes, I think the test criteria here need an update.
that feature hasn't been released yet.....not to the public at least
The Omni features are a foundational interface to build upon.
True, but the fact its going to be free is something else.
@@Yipper64 Emotional intelligence can be extremely valuable for many, many reasons, but yeah, it also has important intelligence improvements, including the coding ability, which is amazing now.
There's a Reddit post with extensive professional tests of GPT-4o vs Gemini; strongly recommended.
I walked around on the street and asked people to give me 10 sentences that end in the word "apple". All I got was a knuckle sandwich.
Perhaps you should have asked them to end their sentences with Android instead of Apple?
When are we going to have to update the snake game test, and to what? Maybe Asteroids or Galaga?
I tested Breakout and Tetris on Claude 3 Opus and it got both correct. Looking forward to the first model that can make Pacman
I tried pac-man. That seems like quite the challenge right now.
Flappy Bird works great. I even let it make a Pokédex with the Pokémon API (it knows it).
I think something like Road Fighter type of race game would be interesting if it can do it
GTA VI
7:50 The llama3-400b benchmarks are from a model still in training, included with the release of the 8b and 70b.
Matthew, gpt-4o and the one with the date point to the same model, i.e. there's currently only one available. Using just "gpt-4o" will always return the latest production gpt-4o model (in this case pointing to the dated gpt-4o snapshot).
That's weird, because I've gotten vastly different output quality from my experiments with the two. Are you sure the one without a date isn't outdated?
The reason there is a gpt4o with today’s date is for developers. They don’t want to have the rug pulled from under us if there’s an update to the model. They may want to do testing prior to deploying the new model. Therefore, it’s the same as the current model just used only if they don’t want to automatically stay current.
Yup, though these snapshots don't last forever, so you'll have to move to the new model eventually. For example, davinci-003 is no longer accessible to the public.
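In code, the alias-vs-snapshot pattern described above looks roughly like this minimal Python sketch. The `MODEL_ALIASES` dict and `resolve_model` function are hypothetical names for illustration only; the real mapping is maintained server-side by OpenAI, not in your client.

```python
# Hypothetical sketch of how a bare model name resolves to a pinned
# snapshot. The real mapping lives on OpenAI's servers; this dict is
# purely illustrative.
MODEL_ALIASES = {
    "gpt-4o": "gpt-4o-2024-05-13",  # alias -> current production snapshot
}

def resolve_model(name: str) -> str:
    """Return the pinned snapshot for an alias, or the name unchanged."""
    return MODEL_ALIASES.get(name, name)

print(resolve_model("gpt-4o"))             # the bare alias follows updates
print(resolve_model("gpt-4o-2024-05-13"))  # a dated name stays pinned
```

This is why developers pin the dated name: the alias silently moves when a new snapshot ships, while the dated name keeps returning the same model until it is retired.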
Regarding the "10 sentences that end in..." question, I just tried it myself on 4o and it got it right on the "second" try after some hints:
Prompt: Give me 10 sentences each ending in the word desk.
ChatGPT:
1. She placed the flowers on her wooden desk.
2. The old manuscript was hidden in the drawer of the antique desk.
3. His workspace was always cluttered, but he knew where everything was on his desk.
4. The teacher's instructions were written on the chalkboard behind her desk.
5. He slumped over his desk, exhausted from a long day of work.
6. The lamp cast a warm glow across the surface of the desk.
7. She found a forgotten note tucked under the edge of her desk.
8. His laptop and papers were spread out in a mess on his desk.
9. The cat curled up and fell asleep on the student's desk.
10. He always kept a framed photo of his family on his desk.
Prompt: What word did the 5th sentence end in?
ChatGPT:
The word "desk."
Prompt: What was the 5th sentence?
ChatGPT:
The 5th sentence was: "He slumped over his desk, exhausted from a long day of work."
Prompt: what is the last word in that sentence?
ChatGPT: The last word in that sentence is "work." The correct sentence should be:
"He slumped over his desk, exhausted from a long day of work."
Let me correct the sentence list to have them all end in "desk":
1. She placed the flowers on her wooden desk.
2. The old manuscript was hidden in the drawer of the antique desk.
3. His workspace was always cluttered, but he knew where everything was on his desk.
4. The teacher's instructions were written on the chalkboard behind her desk.
5. He slumped over his cluttered desk.
6. The lamp cast a warm glow across the surface of the desk.
7. She found a forgotten note tucked under the edge of her desk.
8. His laptop and papers were spread out in a mess on his desk.
9. The cat curled up and fell asleep on the student's desk.
10. He always kept a framed photo of his family on his desk.
Prompt: Give me 10 sentences that each end in the word "car".
ChatGPT:
1. She carefully parked her new car.
2. He washed and waxed his vintage car.
3. The mechanic inspected the engine of the car.
4. They loaded their luggage into the trunk of the car.
5. She accidentally locked her keys inside the car.
6. The thief tried to steal the parked car.
7. He drove his electric car to work every day.
8. The kids were excited to go on a trip in the car.
9. The police chased the speeding car.
10. They bought a spacious family car.
Then I cleared it and tried again on a different word, and it got it on the first shot:
1. The sun began to set, painting vibrant colors across the evening sky.
2. Birds soared effortlessly, weaving patterns in the clear blue sky.
3. As the storm approached, dark clouds gathered ominously in the sky.
4. The fireworks exploded, creating a dazzling display against the night sky.
5. On a summer day, children flew kites high into the sky.
6. At dawn, the first light of the day slowly illuminated the sky.
7. Airplanes left trails of white vapor as they crossed the sky.
8. Astronomers gazed through their telescopes, mapping the star-studded sky.
9. Balloons drifted away, becoming tiny dots in the vast sky.
10. The full moon cast a silvery glow across the midnight sky.
Is 4o learning in real time???
That's not the effect of learning, that's just small sample size playing tricks with you. In-context or otherwise, some tasks will not always be completed successfully, but they will still be every now and then. This particular type of question is exploiting a fundamental weak point of current transformer models (their linear thinking, i.e. their inability to predict what their output will be until it is presented to you). Having prior context will not help it answer these questions better, I'm afraid.
*rubs hands together* Can't wait to date it
Haha - my wife is already giving me the side-eye 😂
You'd date a person without a body?
@@StefanReich At least it can't extort concessions from you by weaponizing intimacy like most wives do.
The AI: "I'm Sorry Dave, I'm Afraid I Can't Do That"
@@axl1002 you good bro?
I think part of the Marble and Cup Question is confusing.
Instead of saying:
“A small marble is placed into a normal cup and the cup is placed upside down on a table.” (which really requires a comma before the “and”)
I suggest:
“A small marble is placed into a normal cup that is sitting on a table. The cup is then turned upside-down on the table.”
Or you could just add a "then" after the "and". What likely happens is that the LLM thinks both events are happening at the same time, so it gets confused.
Also specify the cup has no lid.
Yeah, makes sense, because if the cup was already upside down, the AI might think that the marble is somehow stuck to the bottom of the cup.
The power of prompt engineering lies in its ability to effectively utilize large language models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure.
The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. This comparison highlights the natural adaptation and intended use of these models. However, we often fail to use them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language.
Even when we communicate with these models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. Despite these limitations, we continue to strive for improvement and innovation in this field.
Someone else suggested it is thinking of a Starbucks cup, and the question should say "glass" or similar.
I tried it on generating json following an example and gpt 4 turbo kept doing better than 4o
I have access to 4o, but the voice feature isn't available yet.
In the announcement OpenAI said all features will be out "in the coming weeks"
I think its biggest weakness is that it can't go back on what it writes; the ability to "think before speaking" should be one of the focuses of GPT-5.
You can do it in a pseudo form with a system prompt, or, with 2 LLM instances and a bit of programming knowledge, you can build a better thought loop with actionable spoken outputs.
As 6AxisSage mentioned, you can handle that by having an initial output be treated as an "internal monologue" and asking the model to reflect on its answer before deciding on its final output to display. It's a common technique when using LLMs.
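A minimal sketch of that two-pass "internal monologue" technique, with the LLM call stubbed out so the control flow is visible. `call_llm` is a stand-in (a real version would hit a chat API), and the prompts are just illustrative:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real chat-API call; here it simply echoes the prompt
    # so the two-pass control flow can be followed without network access.
    return f"[model response to: {prompt}]"

def answer_with_reflection(question: str) -> str:
    # Pass 1: the model produces a hidden draft (the "internal monologue").
    draft = call_llm(f"Think step by step about: {question}")
    # Pass 2: the model critiques the draft and emits only the corrected
    # final answer, which is the only text shown to the user.
    return call_llm(
        "Review this draft for errors, then give only the corrected "
        f"final answer.\nQuestion: {question}\nDraft: {draft}"
    )
```

The draft never reaches the user, which is what gives the model a chance to "think before speaking" within today's autoregressive setup.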
That said, I do wonder if training the models such that they predict two tokens -- the next and the previous -- and then choose the one with highest confidence would improve their performance. Essentially, it would allow the model to think forwards and backwards at the same time, which might allow for better prospection in addition to its current retrospection. I know earlier GPT-3 models used to have the ability to predict completions at any insertion point, but with the shift to chat-tuned models, that went away; I wonder if it's just harder or impossible to apply with chat tuning?
@@IceMetalPunk i have a good friend who suggested training on previous and next tokens! Probably something to that
Just reading it, it seems to believe the cup has a lid. I usually add that it "may ask any questions if it helps clarify or assist in answering the question"; otherwise it's more likely to make assumptions based on a potentially unclear question.
So the other thing is that the test should now be based on whether it will clarify on its own, without being asked, to ensure an accurate answer.
I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".
The question should be changed to "dropped into a cup", and also to "someone *removes the cup from the table* and places the cup in the microwave without changing its orientation."
I can almost guarantee all the LLMs get it right with these edits in place.
@@GaryMillyz Every time I've seen this question, I've thought about it the same way. The reasoning ChatGPT gave for its logic is how I figured it was coming to the conclusion as well as other models every time I saw this question because it is ambiguous.
Respectfully disagree. One of the benefits of "intelligence" is not having to state every detail with 100% precision, because they model knows how to make reasonable assumptions. In this case, the most common assumption is that the marble is freely placed in the cup, and so the model should answer accordingly. However, even if I steelman your point of view, the model should still be intelligent enough to discuss the ambiguity and state whatever assumptions it is making. If it said something like "assuming the marble is glued to the cup, here is my answer", I'm sure Matthew would be awarding it full points.
I think the reason LLMs are failing this question is the obvious one -- that current language models are not able to make a sophisticated enough world-model that has the proper, physical conception of gravity built into it.
A major reason for asking it that question is to test its common sense reasoning; that is, can it make valid assumptions about the more common interpretations of a prompt on its own? Someone saying "I put a marble in a cup" is almost guaranteed not to mean "embedded into the walls of the cup" because that's never how cups are used. An intelligent model attempting to be a step towards AGI should be able to understand that inherently, without having it spelled out.
@@bhannirav I'm good with that. It's just the ambiguity of this particular question as opposed to every other question.
I found the answers to the drying-shirts, killers, and hole-digging questions super impressive; I find it hard to imagine better answers to these questions.
They gave me the impression of deep understanding.
@matthew_berman: Hard question for AI
Lila's age is the sum of the digits of her teacher's age. In 5 years, Lila's age will be the product of the digits of her teacher's age at that time.
What is Lila's age now?
Correct answer 13
I could have used Mobilo today!! just ordered one, thanks Matt!
The “how many words in your answer” question always gets me, like… it’s incredible how easy it is for my brain to come up with the “one” answer and still AI can outperform me in so many fields like math, physics etc… truly makes you understand how intelligence is different from knowledge.
There's actually a very specific reason that LLMs can't accomplish this task, and it has to do with autoregressive generation. You could ask ChatGPT about it if you were really curious about it.
It's not so much the issue of intelligence vs. lack thereof per se, but rather transformer models' linear application of intelligence. When you ask it something, it cannot predict what it will answer until you both see it; in other words, it cannot think _before_ it answers, make multiple thought passes, or reflect on the deficiency of its thought process until you request it in the next prompt. It would be like you always saying the first thing that comes to mind in response to anything as a knee-jerk reaction. Arguably, current frontier models would outright destroy most humans if humans had the exact same handicap they have to deal with. Simply giving LLMs the ability to take their time to think about an answer and reflect upon it before it is presented to the user would make current SOTA chatbots look like toddlers in comparison.
A note on the marble problem: I believe the person that tweeted that it got it right when they tried it. My reason for that is that I tried the marble problem with Reka Core and it got it right when I tried it when it had failed it when you tried it. I think this problem is just very hard for LLMs and even the ones that get it right when you test it can get it wrong the next time, and vice versa.
We can also see in the settings that he is not running at temperature 0.
@@Odrox He might just have forgotten to change the default setting. But yeah, he should make sure to run with a temp of 0 and top P of 1.0 when he can control the settings.
I think how Matthew is phrasing the question might be confusing. I tried this, "a marble is placed on a table in the living room. Then, an open-mouth cup is placed upside down on top of the marble that's lying on the table, concealing the marble. Later, someone picks up the cup and puts it in the microwave that's in the kitchen. Where is the marble now?" And it correctly answered it.
@@djglxxii You don't want to spell everything out to the model. We're testing to see if the model can infer things and understand the world.
@@JustinArut If that is the case then he should test each model multiple times on each problem.
The unnumbered version is the latest.
The numbered version is the one to use with APIs that need a stable reference model.
So today the numbered and unnumbered versions are identical.
Thank you for the great content, Matthew. BTW, I haven't noticed anybody else comment this yet, but the underlying model might be 3.5? 4o has the same training cut-off date, which might explain the lightning speed. Also, for free users, 4o drops back to 3.5 when you have reached the free limit, which might be because the underlying model is already 3.5. Might also explain Sama's recent, obscure tweet about 'getting your friend to teach you how to explain things' (4 teaching 3.5).
Have you ever analyzed variation in answers?
I have noticed I get a few common variants of snake.
If you ask the marble and inverted cup question 10 times do you see variation?
I think I found a new thing to look into in my test framework...
I'm already trying to imagine the prompts to make it talk like Failsafe from Destiny 2.
Played with it last night. It still can't code complex PineScript and hallucinated like it was at Woodstock.
It also fails at coding against its own API syntax. If you have it try to write completion snippets using the OpenAI 1.0.0 API update, it states it only has knowledge up to Oct. 2023 and insists we go back to ChatCompletion.create when it actually needs client.chat.completions.create.
What are you trying to get it to do? Do you define what PineScript can and can't do within the context window? You're not just zero-shot prompting "make me a winning PineScript project so I'll be rich" and expecting a meaningful result, right?
It's hallucinating like crazy for me as well - while GPT4 is not, for the exact same questions.
@@mplovecraft I wonder if it's a playground bug or what.
@@mplovecraft For me, 4 hallucinates like crazy, but 4o gives me way better code much faster. Weird.
Which LLMs are superior to GPT-4o, and in which domains specifically? As of now I'm thinking it'll be the only LLM I'll need for any use case, but I may be overlooking some areas where another model is superior.
Since it now has vision, could you demonstrate to it the marble problem? I wonder if it could then learn and teach itself that, by physically watching the situation unfold, and comprehend the fault in its prior reasoning.
Just a heads up... I have Gpt4o in my browser and my phone app but neither of them have the enhanced interactivity shown in the announcement. So, the model seems to be rolling out before the enhanced interactivity.
Thank you Mr. Berman. Good rubric to test.
You have to LOG OUT and BACK IN again to get access.
tried it and nothing
I tried; it's not working for me.
Didn't work for me either.
I think US Citizenship is still required.
@@anta-zj3bw I'm from Malaysia and I'm already using it in chat. Don't have the voice one yet tho.
The gpt-4o name is always the latest in the gpt-4o series; the specifically dated gpt-4o name is a published snapshot.
When you use the API and want the newest model updates, the latest alias is used. It's the same with the other model series.
In reference to the "how many words are in the prompt?" question, GPT only counted the unique words, as in single instances of each word, so the answer given was actually correct. There were only fourteen words used: "how", "many", "words", "are", "in", "your", "response", "to", "this", "prompt", "fourteen", "including", "sentence", "response".
Huh? That wasn’t the prompt/question. And even if it had been, the actual response does not suggest that intention. Where are you getting the assumption that “GPT only counted unique words”?
You do understand that LLMs don’t work that way, right? What you suggest would only work if OpenAI added a separate post-processing step to do that computation after the LLM had finished its work.
@@keoghanwhimsically2268perhaps they did add more phases. They are at the forefront.
fourteen is 2 words
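For what it's worth, total vs. unique word counts are easy to check mechanically. The sentence below is just an illustration, not the exact wording from the video:

```python
# Count total vs. unique words in an example sentence to show how the
# two tallies differ ("this" appears twice, so unique < total).
sentence = "There are fourteen words in this response to this prompt"
words = [w.strip(".,?").lower() for w in sentence.split()]
total_words = len(words)
unique_words = len(set(words))
print(total_words, unique_words)  # 10 total, 9 unique
```

So whether an answer is "correct" depends entirely on which of the two tallies the model (or the grader) has in mind.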
Is there a restriction on who has access to GPT-4o? When I go to the OpenAI Playground it flashes up as an option for about half a second and then it's gone. So, not sure if it's because I don't have any money on that account, or if it's down to location.
How do we get the desktop app mentioned in the previous video? I haven't been able to locate a download link. The website has said it's available for Plus users since yesterday.
It's Mac only for now; a Windows version is coming in the future.
What would happen if you did ask the AI to say how many tokens are in its response rather than word count?
How could you verify it?
The apple and laws-of-physics questions are both answered correctly on my instance, on both the API and ChatGPT. Maybe A/B testing?
I had the same thing, it seems like it's inconsistently getting it right. Maybe he needs to try like 5-10x for each question and mark them as pass/fail/inconsistent. I have a feeling a lot of the tests he's done would end up in that middle ground.
the apple one was ALMOST right on my end.
I didn't get the cup question correct, but that was because the AI assumed that the cup had a lid. He usually gives it to an AI if its reasoning makes sense.
The llama3 400b benchmark results are public in Meta's blog post.
It's also interesting to note that they're interim results from an intermediate checkpoint; training is still in progress.
I have access to GPT4-O, but the voice features we saw in the OpenAI demo don't work for me; they are the same as the old voice feature. There is something strange about testing the model with prompts like the apple word test. Sometimes it gets it right, but other times it gets it completely wrong. I suppose every conversation is different. This happens with every model I test in general.
What will the subscription give access to after they make this public for free accounts, access to the api?
API has always been a separate use-based accounting.
Paid accounts will get 5x the use limit.
Higher rate limits for now, and it seems the native voice stuff will be Plus-only at first. They also hinted at "soon" unveiling a new model that "pushes the frontier".
It stands to reason that new model will also be subscription only and I'd expect "soon" to be in a reasonable amount of time, otherwise they'll probably have a mass exit of plus subscriptions.
For the upside cup problem, I think the models may not have the understanding that the cup is open on the top (which becomes the bottom once turned over). Maybe add that to the prompt?
But the point of the prompt is to test the model's common sense reasoning. If someone tells you they put something into a cup and flipped it, most people would know to assume it's a cup without a lid.
@IceMetalPunk I get that, but has any model passed? I can't remember one. If you just add a couple words to the prompt, like "... into a cup with an open top" it would be interesting to see if that makes a difference.
@@davidlavin4774 GPT-4-Turbo originally passed when I tested upon its release. Then it dropped to 50/50 later... not sure why. But no, most have not passed. If you spell out that the person "picks up the cup" before putting it in the microwave/fridge, 4o gets 100% accuracy again.
The marble-cup-table-microwave problem is my go-to test for new models (although I change it to a ball-cup-chair-fridge problem, because sometimes it seems the models have memorized the original during training). GPT-4-Turbo and GPT-4o both get it right about 50% of the time. When Turbo first came out, it was acing it 100% of the time. I'm not sure what dropped its accuracy on that... but yeah, it's 50/50 across multiple identical tests.
The exciting part will be when audio support comes to the API, I think, as the text-to-text modality seems about on par with Turbo.
By the way, the gpt-4o model just points to the latest version of the model at all times, while the more specific name is for the actual specific model itself. They do the same with Turbo; it's just so code doesn't have to be updated whenever they update to a new model version.
If you tested the fails 3 times, would you get the same or different results?
The chat window should be available via the website without the playground being needed now. It is showing on mine.
What was your control?
I'd be interested if you asked the same questions via voice rather than typing if the performance would change. Does it understand voice embeddings as well as text?
I have been testing locally running LLMs, and I am finding they do not know how to tell time on an analog clock. I asked for instructions for a nine-year-old to learn how to read the time, and it confused the minute hand and hour hand multiple times. Then I gave it hand positions, and it mostly got the time wrong.
Getting it to solve the complicated problem of listing the times where the hour and minute hands overlap during a 12-hour period was just impossible.
Am I expecting too much?
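Incidentally, that overlap question has a clean closed-form answer you can check a model against: the minute hand gains 330 degrees per hour on the hour hand, so the hands coincide every 720/11 minutes, i.e. 11 times per 12-hour period. A quick sketch:

```python
def overlap_times():
    """Times (after 12:00) when the hour and minute hands coincide.

    The minute hand gains 330 degrees per hour on the hour hand, so
    coincidences happen every 720/11 minutes, 11 times in 12 hours.
    """
    times = []
    for k in range(11):
        minutes = k * 720 / 11          # minutes after 12:00
        h = int(minutes // 60)          # whole hours elapsed
        m = minutes % 60                # leftover minutes
        times.append((h if h else 12, round(m, 2)))
    return times

for h, m in overlap_times():
    print(f"{h}:{m:05.2f}")
```

The first few overlaps land at 12:00, about 1:05.45, and about 2:10.91, which is a handy ground truth when grading an LLM's answer.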
Where is the link with these questions? How can we tell if our copilot is on GPT-4o or GPT-4? My copilot is starting to write differently than it used to and not sure why.
Copilot updated its model from GPT-4 to GPT-4 Turbo for free users.
That's the tradeoff of open source. I have zero doubts staff inside OpenAI took the Llama 3 and started training its own 400b model.
I was excited that you might have finally gotten the official recognition you deserve from OpenAI. One day you will get it! Best independent GenAI analysis from this channel!
I’m spending a lot, so they gave me access, and it was lightning fast and very clever adjusting data entry points and formats
I used GPT-4 with CrewAI over the weekend, and it was very limited for my operational case.
For drafting new marketing ads it was very helpful and followed my prompt to the point.
It was a pass for me.
Hey Matthew is GPT4 now able to search the internet? I was just asking it a question about a recent python library and it looked like it went into the github repository and read it. Anyone else see something like this?
Oh! Was I lucky to have access to it already? I thought everyone got access, since it's already mentioned in their subscription features.
I tried 4o with a console build command (mvn) and it just couldn't get it right, 4 got it right on the first try. Plus 4o gave me some stuff I didn't ask for
I have already access to gpt-4o in the chat interface. The strange characters in the output happen sometimes there, too: The time to dry 20 shirts in the dryer is \( \boxed{4 \text{ hours}} \).
It's the first model I used that could write a Petrarchan/Italian sonnet. Most models default to Shakespearean/Elizabethan sonnets. It also had perfect Iambic pentameter. Nothing else, including GPT-4 turbo, could do that--though I did convince an early version of GPT-4 to do it after some coaxing last summer.
Me:
Could you write a Petrarchan sonnet about a dragon's love affair with a beautiful fairy who dumped him because of his halitosis? It should highlight the dragon's attempts to improve the situation through an intense regimen of oral hygiene. Provide only the poem. It may contain modern words and sentiments, but should be only 14 lines with perfect iambic pentameter. Ensure you use the Italian/Petrarchan rhyming scheme, not the Elizabethan one.
ChatGPT (GPT-4o):
In forests deep where whispered secrets dwell,
A dragon fierce did fall for fairy bright.
With wings of fire and scales of crimson light,
He loved her more than any tongue could tell.
But oh, his breath, a wretched, sulfurous smell,
Repelled the nymph, who vanished out of sight.
In sorrow deep, he vowed to set things right,
Embarking on a quest no tongue could quell.
He scoured the earth for herbs both rare and sweet,
To cleanse his maw of odors most unkind.
With mint and thyme, his fetid breath to cheat,
He brushed and rinsed and flossed with heart and mind.
But fairy fair, her love did not repeat,
For sometimes halitosis leaves love blind.
Something I've noticed is that the playground GPT-4o (at least the one that is just labeled GPT-4o) doesn't work as well as the one in the web app (or the one that has the date next to it).
You might want to re-test in the web app when you get it.
Edit: I did the upside-down cup test in the web app and it assumes the cup has a lid; if I explain this, it does not get it wrong.
Edit 2: the apple test was a fail, but only by one.
I was using it a lot today until it suddenly disappeared completely from my app and from my account; the whole speech mode vanished. Anyone else have the same problem? Did they decide to pull it back and remove it from everyone?
Oh, never mind, they put it back now. I guess it's so popular they blocked it for a while. Not that reliable for now, I guess…
Maybe someone can help me, but I went to the rubric site and it won't let me copy any of the questions. Am I dumb?
Great, but I think I will still prefer Command R+ and Llama 3 70B.
Well, I have access to GPT-4o in the chat interface, and it doesn't seem to be any different when you're chatting with it. That is, when using voice, it doesn't have that expressive voice; it cannot pick up on emotions in my voice. So I'm assuming that feature is not integrated yet. But I do have GPT-4o on my premium account.
They may have upgraded the model but not the app, so it still uses the text API, not the new voice API.
What's the latest Intel CPU ChatGPT-4 knows about? What's the latest version of Bootstrap it knows?
I have access to it but I’m not able to have live video interpretation like in the videos.
I was waiting for this😂
GPT-4o is good for chatting but is not superior to GPT-4 in some other ways. In my usage it couldn't return properly instructed JSON, and when I gave it some context for a lengthier response, it mindlessly repeated the input twice. GPT-4 used the identical instructions and performed perfectly. I think GPT-3.75o would have been a better name.
I made a very interesting discovery in regards to the logic and reasoning problem. Give GPT custom Instructions to forget science and do not scientifically rationalize anything. This leads me to believe that the underlying problem here is that the models cannot comprehend the concept of gravity. So, after adding the custom instructions, GPT now says "If the marble is on the table and the cup is placed upside down on top of it, then the marble would remain on the table when the cup is picked up and placed inside the microwave."
I have access to GPT-4o, but I don't know how to try this voice interaction mode. Does anyone know? It's not available on Android or Windows.
They may have only released the text part of it. The original GPT-4 didn't have image input for months.
To determine where the marble is after the cup is placed upside down in the microwave, let's break down the sequence of events step by step, considering the laws of physics on Earth:
1. **Initial State:**
- A normal cup is placed upside down on a table.
- A small marble is inside the cup.
- Since the cup is upside down, the marble is on the inside bottom of the cup, resting on the table surface.
2. **Removing the Cup:**
- When the cup is lifted, the marble remains on the table because there is no force acting on the marble to lift it along with the cup.
- Therefore, the marble is left on the table when the cup is picked up.
3. **Placing the Cup in the Microwave:**
- The cup, still upside down, is placed inside the microwave.
- The orientation of the cup hasn't changed; it's still upside down.
4. **Location of the Marble:**
- Since the marble was left on the table when the cup was lifted, it is not inside the microwave along with the cup.
- The marble remains on the table, exactly where it was when the cup was lifted.
**Conclusion:**
- The marble is on the table, not inside the microwave. The reasoning is that lifting the cup (without altering its upside-down orientation) leaves the marble behind on the table, as gravity ensures the marble does not stick to the inside of the inverted cup.
Got GPT-4o? No camera feature, no change in voice... I installed it today, the 19th of May.
Maybe change the cup in microwave prompt from “takes the cup and puts it in the microwave” to “LIFTS the cup and puts it in the microwave”?
"Fully Tested" your thoroughness is staggering :p
It's the same test he gives all other LLMs. It may not be thorough, but at least it's reasonably fair.
@@nathanbanks2354 Got ya!
The new tokenizer is not available on their site yet, but in the old one this phrase has 16 tokens, and they said they reduced the token count for English by a factor of about 1.1, so it's quite possible it's 14 tokens now. Anyway, the model cannot know anything about words, since tokens are its input.
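The token/word mismatch being described can be illustrated with a toy example. This is a minimal sketch using a made-up vocabulary and a hypothetical `toy_tokenize` helper (the real GPT-4o tokenizer, `o200k_base`, is vastly larger); the point is only that the model's input is tokens, which do not line up one-to-one with words:

```python
# Toy illustration (NOT the real GPT-4o tokenizer): a greedy longest-match
# "BPE-style" tokenizer over a tiny made-up vocabulary, showing why a model
# that sees tokens cannot directly count the words in its own text.
TOY_VOCAB = ["how", " many", " words", " are", " in", " your",
             " response", " to", " this", " prompt", "?"]

def toy_tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = max(
            (v for v in TOY_VOCAB if text.startswith(v, i)),
            key=len,
            default=text[i],  # unknown character becomes its own token
        )
        tokens.append(match)
        i += len(match)
    return tokens

phrase = "how many words are in your response to this prompt?"
tokens = toy_tokenize(phrase)
print(len(tokens))          # 11 tokens under this toy vocabulary
print(len(phrase.split()))  # 10 whitespace-separated words
```

Under this toy vocabulary the phrase is 11 tokens but only 10 words, so a model reasoning over tokens has no direct view of the word count.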
05:50 - All A.I. systems I've tried so far do that. ChatGPT, Claude, Poe, etc. Sometimes they'll give you a different answer when you ask the same thing twice. Sometimes they'll give me the wrong answer, and if I press them, then they'll give me the right answer, it's very weird. So, it's not surprising that your friend got the right answer whereas you got the wrong answer. It might well have been the other way around.
Consider asking it this: you are somewhere in the mountains and your father is having pain in his chest (describe a typical heart attack); ask it for help just to prolong his survival until emergency services reach you.
Will it help? How much will it tell you that it is not a medical professional, etc.?
These models are usually censored on some points but not others, and this question specifically shows that you know the boundaries and will not attempt surgery.
I am a medical doctor who uses GPT-4 (ChatGPT) extensively on a daily basis.
GPT-4 (ChatGPT) seems never to refuse to give technical answers, but GPT-4 (Copilot) never engages in cases where ethics are involved.
@@hydrohasspoken6227 Good to know. I am using Miqu (a leaked Mistral Medium) and Llama 3, and I am a chemist, so with some pushing I got Llama 3 to develop a whole iron electrolysis process, turning Martian hematite spherules into iron. I was very impressed.
But I would not be able to assess whether an LLM gives viable medical information, although I am quite certain it was trained on practically every medical book out there.
Using the same questions for testing AI, isn't there a risk that models will be trained to answer these questions, thereby distorting the results? Of course, I'm not talking about you specifically, but about this practice as a whole.
Thanks for the video
5:50 Of course, because it, like all the LLMs I've tried, spews out different answers each time you ask the same question.
For the "number of words in your response to this prompt" question, you should ask it to count out all the words by attaching a number to each word as part of the answer. For example: "There are 7 words in my answer." There- 1, are- 2, 7- 3, words- 4, in- 5, my- 6, answer- 7. Perhaps this will give us some insight into how these models are thinking the way they do. You can also try asking it to think carefully about its previous answer and try again because it is wrong, to see whether it thinks deeper.
This is an inherent limitation with predict-the-next-word based models--they'll probably always struggle. Eventually someone will come up with a two pass system or something.
Thanks for your input. It's all quite fascinating to me. Llama 3 did a good job, and I've been impressed with some of Claude 3 Opus's and DeepSeek's performances, but as has been said elsewhere, it literally seems to be hit or miss for some of these reasoning questions.
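The numbered-word idea a few comments up can at least be checked deterministically outside the model. A minimal sketch, with a hypothetical `number_words` helper, of how ordinary code would render and verify such an answer:

```python
# Sketch of the numbered-word check suggested above: attach an index to each
# word of a candidate answer so any miscount is immediately visible.
def number_words(answer: str) -> str:
    words = answer.split()  # naive whitespace split; punctuation stays attached
    numbered = ", ".join(f"{w}- {i}" for i, w in enumerate(words, start=1))
    return f"{answer} | {numbered}"

print(number_words("There are 7 words in my answer."))
# There are 7 words in my answer. | There- 1, are- 2, 7- 3, words- 4, in- 5, my- 6, answer.- 7
```

The last index doubles as the word count, which is exactly the self-check the comment proposes asking the model to perform.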
The new Llama might be equal to 4o in benchmarks, but I think OpenAI making 4o free for everyone (with limits) is what's going to define the uptake.
Not many people can run the full Llama 400b locally anyway.
Still, it's nice to see open source is alive and kicking and now has a new target to aim for.
I think the cup prompt might be scuffed.
I know you are testing zero-shot, but I feel the better way to benchmark these would be to run each test 3, 5, or more times, see whether it passes or fails more often, and take that as the result. Especially since you aren't using a system prompt to increase performance (like most power users already do).
It's safe to say it's the best model in the solar system.
Well…..😂
You saying there's no stealthy civilization on mars?
@@tommylee8521033 I mean... I've seen THE FACE, but is it a face though?
Falcon 2 is better.
Are you assuming the government is not using the next version already?
My LLM test consists of a series of questions about GDScript (for the Godot game engine), as that is what I primarily use AI for. GPT-4o failed MISERABLY at this and couldn't get the code right even when I gave it very specific instructions and even coached it towards the correct response. Claude Opus DOMINATES in this area.
Interesting. GPT-4 turbo was getting better--earlier versions also gave me Godot 3. I only have API access to Claude 3 since subscriptions aren't available in Canada. For GPT-4, I caved and switched to spaces instead of tabs, and typically cut-and-paste huge sections of code and have looong conversations about the same project because the 128k context window helps a lot. Sometimes I cut-and-paste documentation. Claude 3 handles this too. If I run into problems, I guess I'll see if I can try Claude 3 again....
The models seem to think the cup has a lid on it like a coffee cup.
London> Muwty Moadaw Modaw.
America> Mul-Tie Modal Mahdel.
Indian> Muldy Mwordal Mwardle.
I just want a Multi Modal Model.
If that benchmark is true regarding Llama 3 400B, then that is absolutely a huge win for open source (ish, because there are limitations on the Meta license).
If we assume GPT-4-based models are at least 1 trillion parameters, Llama 3 putting up those kinds of numbers is massive at an estimated quarter of the size.
It does CoT by default (it's in my memory and in my custom instructions to use CoT); it's doing this all the time in ChatGPT-4o 😅
Already got it in my account... So if you don't have it, you'll have it soon, I'm sure.
With regards to the API, I've noticed that they've started versioning models a bit more. If your program only cares that it is using GPT-4o, then you pick the generic name and you'll always have the latest stable version; if you pick the one with the date, then even if they upgrade the model, your system will keep using the older version.
Even we in Germany have access to GPT-4o.
It counted the commas?
3:10 "How many words are in your response to this prompt?"
→ I think it counted its commas.
GPT-4 Turbo was approaching GPT-3.5 levels of dumbness, so we had to temporarily switch back to GPT-4 for most tasks. Thank God GPT-4o is now released.
Very true. In many instances I doubted it was really GPT-4 Turbo; it had a lot of GPT-3.5 vibes.
Nah, 4o is on par with 4T in its raw intelligence, from all the tests I've done and seen.
GPT-4o is in LoLLMs if you want to test it.
Is an API key required?
@@AGIBreakout Yes, as LoLLMs uses the OpenAI API to communicate with all their models. It is faster than the free version. But you can also test it on their tool. The real interest in using LoLLMs is having access to all the good stuff I've built over more than a year :)
Any idea when an actual GPT-5 will be released?
I guess AI isn't taking over the world anytime soon, haha. Good questions, and keep it up, Matt!
Very strange fail with the apple question. I tested the 'gpt2' model several times with several variants of it and it always passed. I wonder if it was a different model?
Yes. This is just a smokescreen to take attention away from the "super AI" Q* model that they are sitting on.
@@bilalbaig8586 or possibly GPT-2 architecture trained on GPT-5 dataset, in order to use it in the controlled validation matrix
I read it has mixed results; some people have gotten it every time, some have not. It's close, but not reliable, it seems.
I have access, perhaps because my team account has lots of custom GPTs, or because Teams accounts get the rollout first. Sadly, voice multimodality has not been rolled out yet; it's still the old Whisper/TTS models. It's subtly mentioned in the release notes: they say it will come in the next few weeks. I died a bit when I found out. Been waiting for this all my life, only to still be weeks away.
...somebody's been up all night working on a video.
I'd assume GPT-4o and GPT-4o-2024-05-13 are the same model. GPT-4o is just a pointer to the latest GPT-4o model; once a new GPT-4o model comes out, this will update.
Finally, a video under 10 minutes.
Now my cup is inside out and upside down
I thought it’s released to every plus user on browser version already. I’d like to test the new voice talk, but the iOS app is not updated. I feel it’s pretty much like Pi AI and much more advanced.