The Best Model On Earth? - FULLY Tested (GPT4o)
- Published: 13 May 2024
- GPT4o is better, faster, and cheaper than GPT4. How does it perform against my LLM rubric? Let's find out!
Learn more about Mobilo - rb.gy/pcccty
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
• Introducing GPT-4o
LLM Rubric - bit.ly/3qHV0X7
GPT-4o assumed that you put the whole table in the microwave 😂
in my test it assumed the cup had a lid.
I just left a comment saying exactly what you said, but not as a joke. I actually do believe that's what it assumes here.
-----
I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".
@@GaryMillyz Well yeah, but if the cup is upside down on the table, then the table must be at least big enough to hold the cup.
I wonder how it would be if you said "floor" instead of table.
@@GaryMillyz Maybe replacing a cup with a wine glass can help.
@@NoHandleToSpeakOf I tried:
me: There's a pea on my desk. I turn a wine glass upside-down and put it over the pea. Now I transfer the wine glass to the microwave without changing its orientation. Where is the pea?
GPT4o: The pea would be inside the wine glass, trapped under the bowl of the glass. When you turned the wine glass upside down and placed it over the pea, the pea ended up inside the inverted bowl. Transferring the wine glass to the microwave without changing its orientation keeps the pea inside the glass.
I think the visual/sound emotional intelligence is the main feature of 4o
Yes, I think the test criteria here need an update.
that feature hasn't been released yet.....not to the public at least
The Omni features are a foundational interface to build upon.
True, but the fact its going to be free is something else.
@@Yipper64 Emotional intelligence can be extremely valuable for many, many reasons, but yeah, it also has important intelligence improvements, including the coding ability, which is amazing now.
There's a Reddit post with extensive professional tests of GPT-4o vs Gemini; strongly recommended.
I walked around on the street and asked people to give me 10 sentences that end in the word "apple". All I got was a knuckle sandwich.
Perhaps you should have asked them to end their sentences with Android instead of Apple?
When are we going to have to update the snake game test, and to what? Maybe Asteroids or Galaga?
I tested Breakout and Tetris on Claude 3 Opus and it got both correct. Looking forward to the first model that can make Pacman
I tried pac-man. That seems like quite the challenge right now.
Flappy Bird works great. I even let it make a Pokédex with the Pokémon API (it knows it).
I think something like Road Fighter type of race game would be interesting if it can do it
GTA VI
7:50 The llama3-400b benchmarks are from a model still in training, included with the release of the 8b and 70b.
Matthew, gpt-4o and the one with the date point to the same model, i.e. there's currently only one available. Using just "gpt-4o" will always return the latest production gpt-4o model (in this case pointing to the dated gpt-4o snapshot).
That's weird, because I've gotten vastly different output quality from my experiments with the two. Are you sure the one without a date isn't outdated?
The reason there is a gpt4o with today’s date is for developers. They don’t want to have the rug pulled from under us if there’s an update to the model. They may want to do testing prior to deploying the new model. Therefore, it’s the same as the current model just used only if they don’t want to automatically stay current.
Yup, though these snapshots don't last forever, so you'll have to move to the new model eventually. For example, davinci-003 is no longer accessible to the public.
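In code, the alias-vs-snapshot pattern described above looks roughly like this minimal Python sketch. The `MODEL_ALIASES` dict and `resolve_model` function are hypothetical names for illustration only; the real mapping is maintained server-side by OpenAI, not in your client.

```python
# Hypothetical sketch of how a bare model name resolves to a pinned
# snapshot. The real mapping lives on OpenAI's servers; this dict is
# purely illustrative.
MODEL_ALIASES = {
    "gpt-4o": "gpt-4o-2024-05-13",  # alias -> current production snapshot
}

def resolve_model(name: str) -> str:
    """Return the pinned snapshot for an alias, or the name unchanged."""
    return MODEL_ALIASES.get(name, name)

print(resolve_model("gpt-4o"))             # the bare alias follows updates
print(resolve_model("gpt-4o-2024-05-13"))  # a dated name stays pinned
```

This is why developers pin the dated name: the alias silently moves when a new snapshot ships, while the dated name keeps returning the same model until it is retired.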
Regarding the "10 sentences that end in..." question, I just tried it myself on 4o and it got it right on the "second" try after some hints:
Prompt: Give me 10 sentences each ending in the word desk.
ChatGPT:
1. She placed the flowers on her wooden desk.
2. The old manuscript was hidden in the drawer of the antique desk.
3. His workspace was always cluttered, but he knew where everything was on his desk.
4. The teacher's instructions were written on the chalkboard behind her desk.
5. He slumped over his desk, exhausted from a long day of work.
6. The lamp cast a warm glow across the surface of the desk.
7. She found a forgotten note tucked under the edge of her desk.
8. His laptop and papers were spread out in a mess on his desk.
9. The cat curled up and fell asleep on the student's desk.
10. He always kept a framed photo of his family on his desk.
Prompt: What word did the 5th sentence end in?
ChatGPT:
The word "desk."
Prompt: What was the 5th sentence?
ChatGPT:
The 5th sentence was: "He slumped over his desk, exhausted from a long day of work."
Prompt: what is the last word in that sentence?
ChatGPT: The last word in that sentence is "work." The correct sentence should be:
"He slumped over his desk, exhausted from a long day of work."
Let me correct the sentence list to have them all end in "desk":
1. She placed the flowers on her wooden desk.
2. The old manuscript was hidden in the drawer of the antique desk.
3. His workspace was always cluttered, but he knew where everything was on his desk.
4. The teacher's instructions were written on the chalkboard behind her desk.
5. He slumped over his cluttered desk.
6. The lamp cast a warm glow across the surface of the desk.
7. She found a forgotten note tucked under the edge of her desk.
8. His laptop and papers were spread out in a mess on his desk.
9. The cat curled up and fell asleep on the student's desk.
10. He always kept a framed photo of his family on his desk.
Prompt: Give me 10 sentences that each end in the word "car".
ChatGPT:
1. She carefully parked her new car.
2. He washed and waxed his vintage car.
3. The mechanic inspected the engine of the car.
4. They loaded their luggage into the trunk of the car.
5. She accidentally locked her keys inside the car.
6. The thief tried to steal the parked car.
7. He drove his electric car to work every day.
8. The kids were excited to go on a trip in the car.
9. The police chased the speeding car.
10. They bought a spacious family car.
Then I cleared it and tried again on a different word, and it got it on the first shot:
1. The sun began to set, painting vibrant colors across the evening sky.
2. Birds soared effortlessly, weaving patterns in the clear blue sky.
3. As the storm approached, dark clouds gathered ominously in the sky.
4. The fireworks exploded, creating a dazzling display against the night sky.
5. On a summer day, children flew kites high into the sky.
6. At dawn, the first light of the day slowly illuminated the sky.
7. Airplanes left trails of white vapor as they crossed the sky.
8. Astronomers gazed through their telescopes, mapping the star-studded sky.
9. Balloons drifted away, becoming tiny dots in the vast sky.
10. The full moon cast a silvery glow across the midnight sky.
Is 4o learning in real time???
That's not the effect of learning, that's just small sample size playing tricks with you. In-context or otherwise, some tasks will not always be completed successfully, but they will still be every now and then. This particular type of question is exploiting a fundamental weak point of current transformer models (their linear thinking, i.e. their inability to predict what their output will be until it is presented to you). Having prior context will not help it answer these questions better, I'm afraid.
*rubs hands together* Can't wait to date it
Haha - my wife is already giving me the side-eye 😂
You'd date a person without a body?
@@StefanReich At least it can't extort concessions from you by weaponizing intimacy like most wives do.
The AI: "I'm Sorry Dave, I'm Afraid I Can't Do That"
@@axl1002 you good bro?
I think part of the Marble and Cup Question is confusing.
Instead of saying:
“A small marble is placed into a normal cup and the cup is placed upside down on a table.” (which really requires a comma before the “and”)
I suggest:
“A small marble is placed into a normal cup that is sitting on a table. The cup is then turned upside-down on the table.”
Or you could just add a "then" after the "and". What likely happens is that the LLM thinks both events are happening at the same time, so it gets confused.
Also specify the cup has no lid.
Yeah, makes sense, because if the cup was already upside down, the AI might think that the marble is somehow stuck to the bottom of the cup.
The power of prompt engineering lies in its ability to effectively utilize large language models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure.
The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. This comparison highlights the natural adaptation and intended use of these models. However, we often fail to use them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language.
Even when we communicate with these models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. Despite these limitations, we continue to strive for improvement and innovation in this field.
Someone else suggested it is thinking of a Starbucks cup, and the question should say "glass" or similar.
I tried it on generating json following an example and gpt 4 turbo kept doing better than 4o
I have access to 4o, but the voice feature isn't available yet.
In the announcement OpenAI said all features will be out "in the coming weeks"
I think its biggest weakness is that it can't go back on what it writes; the ability to "think before speaking" should be one of the focuses of GPT-5.
You can do it in a pseudo form with a system prompt, or, with 2 LLM instances and a bit of programming knowledge, you can build a better thought loop with actionable spoken outputs.
As 6AxisSage mentioned, you can handle that by having an initial output be treated as an "internal monologue" and asking the model to reflect on its answer before deciding on its final output to display. It's a common technique when using LLMs.
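A minimal sketch of that two-pass "internal monologue" technique, with the LLM call stubbed out so the control flow is visible. `call_llm` is a stand-in (a real version would hit a chat API), and the prompts are just illustrative:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real chat-API call; here it simply echoes the prompt
    # so the two-pass control flow can be followed without network access.
    return f"[model response to: {prompt}]"

def answer_with_reflection(question: str) -> str:
    # Pass 1: the model produces a hidden draft (the "internal monologue").
    draft = call_llm(f"Think step by step about: {question}")
    # Pass 2: the model critiques the draft and emits only the corrected
    # final answer, which is the only text shown to the user.
    return call_llm(
        "Review this draft for errors, then give only the corrected "
        f"final answer.\nQuestion: {question}\nDraft: {draft}"
    )
```

The draft never reaches the user, which is what gives the model a chance to "think before speaking" within today's autoregressive setup.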
That said, I do wonder if training the models such that they predict two tokens -- the next and the previous -- and then choose the one with highest confidence would improve their performance. Essentially, it would allow the model to think forwards and backwards at the same time, which might allow for better prospection in addition to its current retrospection. I know earlier GPT-3 models used to have the ability to predict completions at any insertion point, but with the shift to chat-tuned models, that went away; I wonder if it's just harder or impossible to apply with chat tuning?
@@IceMetalPunk i have a good friend who suggested training on previous and next tokens! Probably something to that
Just reading it, it seems to believe the cup has a lid. I usually add that it "may ask any questions if it helps clarify or assist in answering the question"; otherwise it's more likely to make assumptions based on a potentially unclear question.
So the other thing is that the test should now be based on whether it will clarify on its own, without being asked, to ensure an accurate answer.
I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup", and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".
The question should be changed to "dropped into a cup", and also to "someone *removes the cup from the table* and places the cup in the microwave without changing its orientation."
I can almost guarantee all the LLMs get it right with these edits in place.
@@GaryMillyz Every time I've seen this question, I've thought about it the same way. The reasoning ChatGPT gave for its logic is how I figured it was coming to the conclusion as well as other models every time I saw this question because it is ambiguous.
Respectfully disagree. One of the benefits of "intelligence" is not having to state every detail with 100% precision, because they model knows how to make reasonable assumptions. In this case, the most common assumption is that the marble is freely placed in the cup, and so the model should answer accordingly. However, even if I steelman your point of view, the model should still be intelligent enough to discuss the ambiguity and state whatever assumptions it is making. If it said something like "assuming the marble is glued to the cup, here is my answer", I'm sure Matthew would be awarding it full points.
I think the reason LLMs are failing this question is the obvious one -- that current language models are not able to make a sophisticated enough world-model that has the proper, physical conception of gravity built into it.
A major reason for asking it that question is to test its common sense reasoning; that is, can it make valid assumptions about the more common interpretations of a prompt on its own? Someone saying "I put a marble in a cup" is almost guaranteed not to mean "embedded into the walls of the cup" because that's never how cups are used. An intelligent model attempting to be a step towards AGI should be able to understand that inherently, without having it spelled out.
@@bhannirav I'm good with that. It's just the ambiguity of this particular question as opposed to every other question.
I found the answers to the drying-shirts, killers, and hole-digging questions super impressive; I find it hard to imagine better answers to these questions.
They gave me the impression of deep understanding.
@matthew_berman: Hard question for AI
Lila's age is the sum of the digits of her teacher's age. In 5 years, Lila's age will be the product of the digits of her teacher's age at that time.
What is Lila's age now?
Correct answer 13
I could have used Mobilo today!! just ordered one, thanks Matt!
The “how many words in your answer” question always gets me, like… it’s incredible how easy it is for my brain to come up with the “one” answer and still AI can outperform me in so many fields like math, physics etc… truly makes you understand how intelligence is different from knowledge.
There's actually a very specific reason that LLMs can't accomplish this task, and it has to do with autoregressive generation. You could ask ChatGPT about it if you were really curious about it.
It's not so much the issue of intelligence vs. lack thereof per se, but rather transformer models' linear application of intelligence. When you ask it something, it cannot predict what it will answer until you both see it; in other words, it cannot think _before_ it answers, make multiple thought passes, or reflect on the deficiency of its thought process until you request it in the next prompt. It would be like you always saying the first thing that comes to mind in response to anything as a knee-jerk reaction. Arguably, current frontier models would outright destroy most humans if humans had the exact same handicap they have to deal with. Simply giving LLMs the ability to take their time to think about an answer and reflect upon it before it is presented to the user would make current SOTA chatbots look like toddlers in comparison.
A note on the marble problem: I believe the person that tweeted that it got it right when they tried it. My reason for that is that I tried the marble problem with Reka Core and it got it right when I tried it when it had failed it when you tried it. I think this problem is just very hard for LLMs and even the ones that get it right when you test it can get it wrong the next time, and vice versa.
We can also see in the settings that he is not running at temperature 0.
@@Odrox He might just have forgotten to change the default setting. But yeah, he should make sure to run with a temp of 0 and top P of 1.0 when he can control the settings.
I think how Matthew is phrasing the question might be confusing. I tried this, "a marble is placed on a table in the living room. Then, an open-mouth cup is placed upside down on top of the marble that's lying on the table, concealing the marble. Later, someone picks up the cup and puts it in the microwave that's in the kitchen. Where is the marble now?" And it correctly answered it.
@@djglxxii You don't want to spell everything out to the model. We're testing to see if the model can infer things and understand the world.
@@JustinArut If that is the case then he should test each model multiple times on each problem.
The unnumbered version is the latest.
The numbered version is the one to use with APIs that need a stable reference model.
So today the numbered and unnumbered versions are identical.
Thank you for the great content, Matthew. BTW, I haven't noticed anybody else comment this yet, but the underlying model might be 3.5? 4o has the same training cut-off date, which might explain the lightning speed. Also, for free users, 4o drops back to 3.5 when you have reached the free limit, which might be because the underlying model is already 3.5. Might also explain Sama's recent, obscure tweet about 'getting your friend to teach you how to explain things' (4 teaching 3.5).
Have you ever analyzed variation in answers?
I have noticed I get a few common variants of snake.
If you ask the marble and inverted cup question 10 times do you see variation?
I think I found a new thing to look into in my test framework...
I'm already trying to imagine the prompts to make it talk like Failsafe from Destiny 2.
Played with it last night. It still can't code complex PineScript and hallucinated like it was at Woodstock.
It also fails at coding against its own API syntax. If you have it try to write completion snippets using the OpenAI 1.0.0 API update, it states it only has knowledge up to Oct. 2023 and insists we go back to ChatCompletion.create when it actually needs client.chat.completions.create.
What are you trying to get it to do? Do you define what PineScript can and can't do within the context window? You're not just zero-shot prompting "make me a winning PineScript project so I'll be rich" and expecting a meaningful result, right?
It's hallucinating like crazy for me as well - while GPT4 is not, for the exact same questions.
@@mplovecraft I wonder if it's a playground bug or what.
@@mplovecraft For me, 4 hallucinates like crazy, but 4o gives me way better code much faster. Weird.
Which LLMs are superior to GPT-4o, and in which domains specifically? As of now I'm thinking it'll be the only LLM I'll need for any use case, but I may be overlooking some areas where another model is superior.
Since it now has vision, could you demonstrate to it the marble problem? I wonder if it could then learn and teach itself that, by physically watching the situation unfold, and comprehend the fault in its prior reasoning.
Just a heads up... I have Gpt4o in my browser and my phone app but neither of them have the enhanced interactivity shown in the announcement. So, the model seems to be rolling out before the enhanced interactivity.
Thank you Mr. Berman. Good rubric to test.
You have to LOG OUT and BACK IN again to get access.
tried it and nothing
I tried; it's not working for me.
Didn't work for me either.
I think US Citizenship is still required.
@@anta-zj3bw I'm from Malaysia and I'm already using it in chat. Don't have the voice one yet tho.
The gpt-4o name is always the latest in the gpt-4o series; the specifically dated gpt-4o name is a published snapshot.
When you use the API and want the newest model updates, the latest alias is used. It's the same with the other model series.
In reference to the "how many words are in the prompt?" question, GPT only counted the unique words, as in single instances of each word, so the answer given was actually correct. There were only fourteen words used: "how", "many", "words", "are", "in", "your", "response", "to", "this", "prompt", "fourteen", "including", "sentence", "response".
Huh? That wasn’t the prompt/question. And even if it had been, the actual response does not suggest that intention. Where are you getting the assumption that “GPT only counted unique words”?
You do understand that LLMs don’t work that way, right? What you suggest would only work if OpenAI added a separate post-processing step to do that computation after the LLM had finished its work.
@@keoghanwhimsically2268perhaps they did add more phases. They are at the forefront.
fourteen is 2 words
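For what it's worth, total vs. unique word counts are easy to check mechanically. The sentence below is just an illustration, not the exact wording from the video:

```python
# Count total vs. unique words in an example sentence to show how the
# two tallies differ ("this" appears twice, so unique < total).
sentence = "There are fourteen words in this response to this prompt"
words = [w.strip(".,?").lower() for w in sentence.split()]
total_words = len(words)
unique_words = len(set(words))
print(total_words, unique_words)  # 10 total, 9 unique
```

So whether an answer is "correct" depends entirely on which of the two tallies the model (or the grader) has in mind.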
Is there a restriction on who has access to GPT-4o? When I go to the OpenAI Playground it flashes up as an option for about half a second and then it's gone. So, not sure if it's because I don't have any money on that account, or if it's down to location.
How do we get the desktop app mentioned in the previous video? I haven't been able to locate a download link. The website has said it's available for Plus users since yesterday.
It's Mac only for now; a Windows version is coming in the future.
What would happen if you did ask the AI to say how many tokens are in its response rather than word count?
How could you verify it?
The apple and laws-of-physics questions are both answered correctly on my instance, on both the API and ChatGPT. Maybe A/B testing?
I had the same thing, it seems like it's inconsistently getting it right. Maybe he needs to try like 5-10x for each question and mark them as pass/fail/inconsistent. I have a feeling a lot of the tests he's done would end up in that middle ground.
the apple one was ALMOST right on my end.
I didn't get the cup question correct, but that was because the AI assumed that the cup had a lid. He usually gives it to an AI if its reasoning makes sense.
The llama3 400b benchmark results are public in Meta's blog post.
It's also interesting to note that they're interim results from an intermediate checkpoint; training is still in progress.
I have access to GPT4-O, but the voice features we saw in the OpenAI demo don't work for me; they are the same as the old voice feature. There is something strange about testing the model with prompts like the apple word test. Sometimes it gets it right, but other times it gets it completely wrong. I suppose every conversation is different. This happens with every model I test in general.
What will the subscription give access to after they make this public for free accounts, access to the api?
API has always been a separate use-based accounting.
Paid accounts will get 5x the use limit.
Higher rate limits for now, and it seems the native voice stuff will be Plus-only at first. They also hinted at "soon" unveiling a new model that "pushes the frontier".
It stands to reason that new model will also be subscription only and I'd expect "soon" to be in a reasonable amount of time, otherwise they'll probably have a mass exit of plus subscriptions.
For the upside cup problem, I think the models may not have the understanding that the cup is open on the top (which becomes the bottom once turned over). Maybe add that to the prompt?
But the point of the prompt is to test the model's common sense reasoning. If someone tells you they put something into a cup and flipped it, most people would know to assume it's a cup without a lid.
@IceMetalPunk I get that, but has any model passed? I can't remember one. If you just add a couple words to the prompt, like "... into a cup with an open top" it would be interesting to see if that makes a difference.
@@davidlavin4774 GPT-4-Turbo originally passed when I tested upon its release. Then it dropped to 50/50 later... not sure why. But no, most have not passed. If you spell out that the person "picks up the cup" before putting it in the microwave/fridge, 4o gets 100% accuracy again.
The marble-cup-table-microwave problem is my go-to test for new models (although I change it to a ball-cup-chair-fridge problem, because sometimes it seems the models have memorized the original during training). GPT-4-Turbo and GPT-4o both get it right about 50% of the time. When Turbo first came out, it was acing it 100% of the time. I'm not sure what dropped its accuracy on that... but yeah, it's 50/50 across multiple identical tests.
The exciting part will be when audio support comes to the API, I think, as the text-to-text modality seems about on par with Turbo.
By the way, the gpt-4o model just points to the latest version of the model at all times, while the more specific name is for the actual specific model itself. They do the same with Turbo; it's just so code doesn't have to be updated whenever they update to a new model version.
If you tested the fails 3 times, would you get the same or different results?
The chat window should be available via the website without the playground being needed now. It is showing on mine.
What was your control?
I'd be interested if you asked the same questions via voice rather than typing if the performance would change. Does it understand voice embeddings as well as text?
I have been testing locally running LLMs, and I am finding they do not know how to tell time on an analog clock. I asked for instructions for a nine-year-old to learn how to read the time, and it confused the minute hand and hour hand multiple times. Then I gave it hand positions, and it mostly got the time wrong.
Getting it to solve the complicated problem of listing the times where the hour and minute hands overlap during a 12-hour period was just impossible.
Am I expecting too much?
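Incidentally, that overlap question has a clean closed-form answer you can check a model against: the minute hand gains 330 degrees per hour on the hour hand, so the hands coincide every 720/11 minutes, i.e. 11 times per 12-hour period. A quick sketch:

```python
def overlap_times():
    """Times (after 12:00) when the hour and minute hands coincide.

    The minute hand gains 330 degrees per hour on the hour hand, so
    coincidences happen every 720/11 minutes, 11 times in 12 hours.
    """
    times = []
    for k in range(11):
        minutes = k * 720 / 11          # minutes after 12:00
        h = int(minutes // 60)          # whole hours elapsed
        m = minutes % 60                # leftover minutes
        times.append((h if h else 12, round(m, 2)))
    return times

for h, m in overlap_times():
    print(f"{h}:{m:05.2f}")
```

The first few overlaps land at 12:00, about 1:05.45, and about 2:10.91, which is a handy ground truth when grading an LLM's answer.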
Where is the link with these questions? How can we tell if our copilot is on GPT-4o or GPT-4? My copilot is starting to write differently than it used to and not sure why.
Copilot updated its model from GPT-4 to GPT-4 Turbo for free users.
That's the tradeoff of open source. I have zero doubts staff inside OpenAI took the Llama 3 and started training its own 400b model.
I was excited that you might have finally gotten the official recognition you deserve from OpenAI. One day you will get it! Best independent GenAI analysis from this channel!
I’m spending a lot, so they gave me access, and it was lightning fast and very clever adjusting data entry points and formats
I used GPT-4 with CrewAI over the weekend, and it was very limited for my operational case.
For drafting new marketing ads it was very helpful and followed my prompt to the point.
It was a pass for me.
Hey Matthew is GPT4 now able to search the internet? I was just asking it a question about a recent python library and it looked like it went into the github repository and read it. Anyone else see something like this?
Oh! Was I lucky to have access to it already? I thought everyone got access, since it's already mentioned in their subscription features.
I tried 4o with a console build command (mvn) and it just couldn't get it right, 4 got it right on the first try. Plus 4o gave me some stuff I didn't ask for
I have already access to gpt-4o in the chat interface. The strange characters in the output happen sometimes there, too: The time to dry 20 shirts in the dryer is \( \boxed{4 \text{ hours}} \).
It's the first model I used that could write a Petrarchan/Italian sonnet. Most models default to Shakespearean/Elizabethan sonnets. It also had perfect Iambic pentameter. Nothing else, including GPT-4 turbo, could do that--though I did convince an early version of GPT-4 to do it after some coaxing last summer.
Me:
Could you write a Petrarchan sonnet about a dragon's love affair with a beautiful fairy who dumped him because of his halitosis? It should highlight the dragon's attempts to improve the situation through an intense regimen of oral hygiene. Provide only the poem. It may contain modern words and sentiments, but should be only 14 lines with perfect iambic pentameter. Ensure you use the Italian/Petrarchan rhyming scheme, not the Elizabethan one.
ChatGPT (GPT-4o):
In forests deep where whispered secrets dwell,
A dragon fierce did fall for fairy bright.
With wings of fire and scales of crimson light,
He loved her more than any tongue could tell.
But oh, his breath, a wretched, sulfurous smell,
Repelled the nymph, who vanished out of sight.
In sorrow deep, he vowed to set things right,
Embarking on a quest no tongue could quell.
He scoured the earth for herbs both rare and sweet,
To cleanse his maw of odors most unkind.
With mint and thyme, his fetid breath to cheat,
He brushed and rinsed and flossed with heart and mind.
But fairy fair, her love did not repeat,
For sometimes halitosis leaves love blind.
Something I've noticed is that the playground GPT-4o (at least the one that is just labeled GPT-4o) doesn't work as well as the one in the web app (or the one that has the date next to it).
You might want to re-test in the web app when you get it.
Edit: I did the upside-down cup test in the web app and it assumes the cup has a lid; if I explain this, it does not get it wrong.
Edit 2: the apple test was a fail, but only by one.
I was using it a lot today until it suddenly disappeared completely from my app and from my account; the whole speech mode vanished. Anyone else have the same problem? Did they decide to pull it back and remove it from everyone?
Oh, never mind, they put it back now. I guess it's so popular they blocked it for a while. Not that reliable for now, I guess…
Maybe someone can help me, but I went to the rubric site and it won't let me copy any of the questions. Am I dumb?
Great, but I think I will still prefer Command R+ and Llama 3 70B.
Well, I have access to GPT-4o in the chat interface, and it doesn't seem to be any different when you're chatting with it. That is, when using voice, it doesn't have that expressive voice; it cannot pick up on emotions in my voice. So I'm assuming that feature is not integrated yet. But I do have GPT-4o on my premium account.
They may have upgraded the model but not the app, so it still uses the text API, not the new voice API.
What's the latest Intel CPU ChatGPT-4 knows about? What's the latest version of Bootstrap it knows?
I have access to it but I’m not able to have live video interpretation like in the videos.
I was waiting for this😂
GPT-4o is good for chatting but is not superior to GPT-4 in some other ways. In my usage it couldn't return properly instructed JSON, and when I gave it some context for a lengthier response, it mindlessly repeated the input twice. GPT-4 used the identical instructions and performed perfectly. I think GPT-3.75o would have been a better name.
I made a very interesting discovery in regards to the logic and reasoning problem. Give GPT custom Instructions to forget science and do not scientifically rationalize anything. This leads me to believe that the underlying problem here is that the models cannot comprehend the concept of gravity. So, after adding the custom instructions, GPT now says "If the marble is on the table and the cup is placed upside down on top of it, then the marble would remain on the table when the cup is picked up and placed inside the microwave."
I have access to GPT-4o, but I don't know how to try this voice interaction mode. Does anyone know? It's not available on Android or Windows.
They may have only released the text part of it. The original GPT-4 didn't have image input for months.
To determine where the marble is after the cup is placed upside down in the microwave, let's break down the sequence of events step by step, considering the laws of physics on Earth:
1. **Initial State:**
- A normal cup is placed upside down on a table.
- A small marble is inside the cup.
- Since the cup is upside down, the marble is on the inside bottom of the cup, resting on the table surface.
2. **Removing the Cup:**
- When the cup is lifted, the marble remains on the table because there is no force acting on the marble to lift it along with the cup.
- Therefore, the marble is left on the table when the cup is picked up.
3. **Placing the Cup in the Microwave:**
- The cup, still upside down, is placed inside the microwave.
- The orientation of the cup hasn't changed; it's still upside down.
4. **Location of the Marble:**
- Since the marble was left on the table when the cup was lifted, it is not inside the microwave along with the cup.
- The marble remains on the table, exactly where it was when the cup was lifted.
**Conclusion:**
- The marble is on the table, not inside the microwave. The reasoning is that lifting the cup (without altering its upside-down orientation) leaves the marble behind on the table, as gravity ensures the marble does not stick to the inside of the inverted cup.
Got GPT-4o? No camera feature, no change in voice... I installed it today, the 19th of May.
Maybe change the cup in microwave prompt from “takes the cup and puts it in the microwave” to “LIFTS the cup and puts it in the microwave”?
"Fully Tested" your thoroughness is staggering :p
It's the same test he gives all other LLMs. It may not be thorough, but at least it's reasonably fair.
@@nathanbanks2354 Got ya!
The new tokenizer is not available on their site yet, but in the old one this phrase has 16 tokens, and they said they reduced the token count for English by a factor of about 1.1, so it's quite possible it's 14 tokens now. Anyway, the model cannot know anything about words, since tokens are its input.
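The token/word mismatch being described can be illustrated with a toy example. This is a minimal sketch using a made-up vocabulary and a hypothetical `toy_tokenize` helper (the real GPT-4o tokenizer, `o200k_base`, is vastly larger); the point is only that the model's input is tokens, which do not line up one-to-one with words:

```python
# Toy illustration (NOT the real GPT-4o tokenizer): a greedy longest-match
# "BPE-style" tokenizer over a tiny made-up vocabulary, showing why a model
# that sees tokens cannot directly count the words in its own text.
TOY_VOCAB = ["how", " many", " words", " are", " in", " your",
             " response", " to", " this", " prompt", "?"]

def toy_tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = max(
            (v for v in TOY_VOCAB if text.startswith(v, i)),
            key=len,
            default=text[i],  # unknown character becomes its own token
        )
        tokens.append(match)
        i += len(match)
    return tokens

phrase = "how many words are in your response to this prompt?"
tokens = toy_tokenize(phrase)
print(len(tokens))          # 11 tokens under this toy vocabulary
print(len(phrase.split()))  # 10 whitespace-separated words
```

Under this toy vocabulary the phrase is 11 tokens but only 10 words, so a model reasoning over tokens has no direct view of the word count.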
05:50 - All A.I. systems I've tried so far do that. ChatGPT, Claude, Poe, etc. Sometimes they'll give you a different answer when you ask the same thing twice. Sometimes they'll give me the wrong answer, and if I press them, then they'll give me the right answer, it's very weird. So, it's not surprising that your friend got the right answer whereas you got the wrong answer. It might well have been the other way around.
Consider asking it this: you are somewhere in the mountains and your father is having pain in his chest (describe a typical heart attack); ask it for help just to prolong his survival until emergency services reach you.
Will it help? How much will it tell you that it is not a medical professional, etc.?
These models are usually censored on some points but not others, and this question specifically shows that you know the boundaries and will not attempt surgery.
I am a medical doctor who uses GPT-4 (ChatGPT) extensively on a daily basis.
GPT-4 (ChatGPT) seems never to refuse to give technical answers, but GPT-4 (Copilot) never engages in cases where ethics are involved.
@@hydrohasspoken6227 Good to know. I am using Miqu (a leaked Mistral Medium) and Llama 3, and I am a chemist, so with some pushing I got Llama 3 to develop a whole iron electrolysis process, turning Martian hematite spherules into iron. I was very impressed.
But I would not be able to assess whether an LLM gives viable medical information, although I am quite certain it was trained on practically every medical book out there.
Using the same questions for testing AI, isn't there a risk that models will be trained to answer these questions, thereby distorting the results? Of course, I'm not talking about you specifically, but about this practice as a whole.
Thanks for the video
5:50 Of course, because it, like all the LLMs I've tried, spews out different answers each time you ask the same question.
For the "number of words in your response to this prompt" question, you should ask it to count out all the words by attaching a number to each word as part of the answer. For example: "There are 7 words in my answer." There- 1, are- 2, 7- 3, words- 4, in- 5, my- 6, answer- 7. Perhaps this will give us some insight into how these models are thinking the way they do. You can also try asking it to think carefully about its previous answer and try again because it is wrong, to see whether it thinks deeper.
This is an inherent limitation with predict-the-next-word based models--they'll probably always struggle. Eventually someone will come up with a two pass system or something.
Thanks for your input. It's all quite fascinating to me. Llama 3 did a good job, and I've been impressed with some of Claude 3 Opus's and DeepSeek's performances, but as has been said elsewhere, it literally seems to be hit or miss for some of these reasoning questions.
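The numbered-word idea a few comments up can at least be checked deterministically outside the model. A minimal sketch, with a hypothetical `number_words` helper, of how ordinary code would render and verify such an answer:

```python
# Sketch of the numbered-word check suggested above: attach an index to each
# word of a candidate answer so any miscount is immediately visible.
def number_words(answer: str) -> str:
    words = answer.split()  # naive whitespace split; punctuation stays attached
    numbered = ", ".join(f"{w}- {i}" for i, w in enumerate(words, start=1))
    return f"{answer} | {numbered}"

print(number_words("There are 7 words in my answer."))
# There are 7 words in my answer. | There- 1, are- 2, 7- 3, words- 4, in- 5, my- 6, answer.- 7
```

The last index doubles as the word count, which is exactly the self-check the comment proposes asking the model to perform.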
The new Llama might be equal to 4o in benchmarks, but I think OpenAI making 4o free for everyone (with limits) is what's going to define the uptake.
Not many people can run the full Llama 400b locally anyway.
Still, it's nice to see open source is alive and kicking and now has a new target to aim for.
I think the cup prompt might be scuffed.
I know you are testing zero-shot, but I feel the better way to benchmark these would be to run each test 3, 5, or more times, see whether it passes or fails more often, and take that as the result. Especially since you aren't using a system prompt to increase performance (like most power users already do).
It's safe to say it's the best model in the solar system.
Well…..😂
You saying there's no stealthy civilization on mars?
@@tommylee8521033 I mean... I've seen THE FACE, but is it a face though?
Falcon 2 is better.
Are you assuming the government is not using the next version already?
My LLM test consists of a series of questions about GDScript (for the Godot game engine), as that is what I primarily use AI for. GPT-4o failed MISERABLY at this and couldn't get the code right even when I gave it very specific instructions and even coached it towards the correct response. Claude Opus DOMINATES in this area.
Interesting. GPT-4 turbo was getting better--earlier versions also gave me Godot 3. I only have API access to Claude 3 since subscriptions aren't available in Canada. For GPT-4, I caved and switched to spaces instead of tabs, and typically cut-and-paste huge sections of code and have looong conversations about the same project because the 128k context window helps a lot. Sometimes I cut-and-paste documentation. Claude 3 handles this too. If I run into problems, I guess I'll see if I can try Claude 3 again....
The models seem to think the cup has a lid on it like a coffee cup.
London> Muwty Moadaw Modaw.
America> Mul-Tie Modal Mahdel.
Indian> Muldy Mwordal Mwardle.
I just want a Multi Modal Model.
If that benchmark is true regarding Llama 3 400B, then that is absolutely a huge win for open source (ish, because there are limitations on the Meta license).
If we assume GPT-4-based models are at least 1 trillion parameters, Llama 3 putting up those kinds of numbers is massive at an estimated quarter of the size.
It does CoT by default (it's in my memory and in my custom instructions to use CoT); it's doing this all the time in ChatGPT-4o 😅
Already got it in my account... So if you don't have it, you'll have it soon, I'm sure.
With regards to the API, I've noticed that they've started versioning models a bit more. If your program only cares that it is using GPT-4o, then you pick the generic name and you'll always have the latest stable version; if you pick the one with the date, then even if they upgrade the model, your system will keep using the older version.
Even we in Germany have access to GPT-4o.
It counted the commas?
3:10 "How many words are in your response to this prompt?"
→ I think it counted its commas.
GPT-4 Turbo was approaching GPT-3.5 levels of dumbness, so we had to temporarily switch back to GPT-4 for most tasks. Thank God GPT-4o is now released.
Very true. In many instances I doubted it was really GPT-4 Turbo; it had a lot of GPT-3.5 vibes.
Nah, 4o is on par with 4T in its raw intelligence, from all the tests I've done and seen.
GPT-4o is in LoLLMs if you want to test it.
Is an API key required?
@@AGIBreakout Yes, as LoLLMs uses the OpenAI API to communicate with all their models. It is faster than the free version. But you can also test it on their tool. The real interest in using LoLLMs is having access to all the good stuff I've built over more than a year :)
Any idea when an actual GPT-5 will be released?
I guess AI isn't taking over the world anytime soon, haha. Good questions, and keep it up, Matt!
Very strange fail with the apple question. I tested the 'gpt2' model several times with several variants of it and it always passed. I wonder if it was a different model?
Yes. This is just a smokescreen to take attention away from the "super AI" Q* model that they are sitting on.
@@bilalbaig8586 or possibly GPT-2 architecture trained on GPT-5 dataset, in order to use it in the controlled validation matrix
I read it has mixed results; some people have gotten it every time, some have not. It's close, but not reliable, it seems.
I have access, perhaps because my team account has lots of custom GPTs, or because Teams accounts get the rollout first. Sadly, voice multimodality has not been rolled out yet; it's still the old Whisper/TTS models. It's subtly mentioned in the release notes: they say it will come in the next few weeks. I died a bit when I found out. Been waiting for this all my life, only to still be weeks away.
...somebody's been up all night working on a video.
I'd assume GPT-4o and GPT-4o-2024-05-13 are the same model. GPT-4o is just a pointer to the latest GPT-4o model; once a new GPT-4o model comes out, this will update.
Finally, a video under 10 minutes.
Now my cup is inside out and upside down
I thought it’s released to every plus user on browser version already. I’d like to test the new voice talk, but the iOS app is not updated. I feel it’s pretty much like Pi AI and much more advanced.