I need harder tests! Reply to this comment with your suggestions.
You have two hourglasses: one measures 7 minutes and the other measures 11 minutes. How can you measure exactly 5 minutes using these hourglasses?
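For reference, here's a minimal Python sketch of one known solution (re-flip each glass the instant it empties, then time the window between the 7-glass's fourth run and the 11-glass's third run). Shorter flip sequences may exist, so treat this as one valid answer rather than the only one:

```python
# One way to measure 5 minutes with a 7- and an 11-minute hourglass:
# flip both at t=0 and re-flip each glass the moment it empties.
# The 7-glass then empties at 7, 14, 21, 28, ... and the 11-glass at 11, 22, 33, ...
# Start timing when the 7-glass empties for the fourth time (t=28) and stop
# when the 11-glass empties for the third time (t=33): exactly 5 minutes.
sevens = [7 * k for k in range(1, 6)]     # 7, 14, 21, 28, 35
elevens = [11 * k for k in range(1, 4)]   # 11, 22, 33
print([(a, b) for a in sevens for b in elevens if b - a == 5])  # [(28, 33)]
```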
Why not ask it to do something like, say, generate anagrams? I don't care if it doesn't find every possible word. I do care, very much care, when it confidently presents non-anagrams as anagrams, and can't be convinced or taught not to. I get that you have your rubric, but the problem of LLMs not being willing to say "I don't know", producing obviously incorrect answers, and being unable to learn not to is, to me, the limiting issue right now.
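Worth noting that the check itself is trivial to automate: two phrases are anagrams exactly when their letter counts match. A quick sketch one could use to audit a model's claimed anagrams (ignoring case, spaces, and punctuation):

```python
from collections import Counter

def is_anagram(a: str, b: str) -> bool:
    """True if a and b use exactly the same letters, ignoring case and non-letters."""
    letters = lambda s: Counter(c for c in s.lower() if c.isalpha())
    return letters(a) == letters(b)

print(is_anagram("listen", "silent"))         # True
print(is_anagram("dormitory", "dirty room"))  # True
print(is_anagram("apple", "pale"))            # False (missing a 'p')
```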
Have it program something more difficult. A lot of these models handle simple programs, but as soon as you introduce a program with more features they usually fail. For example, describe a program you want by providing a list of several distinct features, then see if it can output a working program in one shot.
You need to test how much information it can understand and retrieve from a fairly large PDF. I would suggest around 50-100 pages of text and equations taken from a (biology, chemistry, non-equilibrium statistical physics) science textbook at the Master's level.
Do you know which model has the best translations? I want to do some English to Chinese translation and vice versa.
I coded an entire project with Claude 3.5, and it even includes an API and queuing. I was able to work with it for about 5 hours before I hit my limit for the night, and I also almost maxed out the context window.
How do you switch an in-development project over to a new chat to lower the context, without of course derailing it? Your number of messages goes down with longer context usage.
@@makavelismith In my case I make sure the project is made up of sections.
That way you can work on and iterate on those sections in the main project. For my own project I also have an "introductory prompt" explaining the high-level overview.
It really helps to know some basic coding principles, but you don't have to know the language itself as long as you can articulate the specific logic you need, or at least describe it accurately enough to prompt it into better verbiage. I sometimes switch between GPT and Claude to save on message limits.
@@AngeloXification I have only started using Claude but I've already done what you're talking about. I switch back to ChatGPT when I run out, or in preparation for running out.
Just wanna take a moment to say that it's fantastic that they give you a very decent warning that you're approaching the limit.
I'm not a coder, but a while back I learned the basics of several languages, so I remember the principles; still, I try to get the AI to do almost everything.
It's early days, but I think the manner in which you do this is going to be good for learning, as it's just better to get in there yourself and make some alterations.
I will try to compartmentalise projects if I can. I think that is where being an actual developer might come in handy, though. Thanks for the feedback and best of luck.
@@makavelismith It's hard to switch an entire project over to a new chat, but keeping the project modular and providing the related files in a new chat can help when developing new or existing features. Also, keeping file-path comments at the top of each file helps Claude understand where things should be placed, too.
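To make the file-path-comment idea concrete, a hypothetical example (the path and function are placeholders, not from the original project):

```python
# File: src/jobs/queue_client.py
# Keeping this path comment at the top and pasting it back along with the code
# helps the model place new snippets correctly when you start a fresh chat.
import queue

def make_job_queue(maxsize: int = 100) -> queue.Queue:
    """Create the in-memory job queue used by the worker (placeholder example)."""
    return queue.Queue(maxsize=maxsize)
```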
Are you using it via the API or the web? The context window on the web is so small that it's impossible to do any serious coding project compared to 4o. With 4o you rarely hit the limit; with 3.5 you hit it very fast. And they cost the same.
I haven't coded in years, never in JavaScript, and I only know basic HTML. Claude 3.5 helped me create a new Chrome extension and install it in literally less than half an hour, and I needed no development environment and only one simple prompt describing what I wanted the code to do. Effing amazing. I did make several changes to the graphical elements of the extension as well as the functionality, and each time it made them super quickly and accurately.
Lol, me too!
What kind of extension? I want to try to make something but I don't really know what I want lol
How do you even get access to 3.5?
Where I am, only Claude 3 is available
@@riufq I like to use LLMs through OpenRouter; that way you don't have any limitations, you just pay as you go... and it's really cheap.
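For anyone curious what that looks like in practice, here's a minimal sketch assuming OpenRouter's OpenAI-compatible endpoint and the anthropic/claude-3.5-sonnet model slug (check their docs for the current base URL and model names):

```python
from openai import OpenAI

# Pay-as-you-go access through OpenRouter; the base URL, model slug, and key
# below are assumptions to illustrate the pattern, not guaranteed to be current.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": "Give me three Chrome extension ideas."}],
)
print(resp.choices[0].message.content)
```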
Thanks a lot for the verification. I literally created a Chrome extension for personal use in just a few minutes after reading your comment. Such a powerful tool. I have no coding knowledge either, only design. It gave a blueprint covering everything from which software to download to the final steps.
Fuck ChatGPT, we're moving to Claude; at least they release what they announce
🗣🗣🗣
Yeah like wtf is up with GPT4o? It's still not working as advertised
💯💯💯
@@henrismith7472 Because somehow OpenAI was told about Google's Project Astra, so they rushed out that 4o demo while having no intention of releasing it in the coming days or weeks, but rather in the coming months.
The only thing I don't like about Claude is its refusal to answer a lot of the questions I ask because of its bias
00:03 Claude 3.5 Sonnet beats other AI models
01:36 Claude 3.5 surpasses GPT-4o in speed and performance.
02:57 Demonstration of a snake game implementation
04:14 Solving complex scenarios with mathematical reasoning
05:35 Solving a challenging logic and reasoning problem
06:56 Efficiency of multiple workers digging a hole
08:14 Contrasting work cultures of startups and big companies
09:40 Claude 3.5 Sonnet is the best model ever tested
Much appreciated
Sonic?
@@honkytonk4465 Haha, fixed. thx
Sonic sounds even better! :)
I was using Claude 3.5 earlier today to help me with a caching issue on a WordPress site and it showed me WordPress PHP I didn't even know existed and its result was spot on ... and so fast too.
Yeah it smash’s code. It’s great at writing code
Yupp! First time I actually feel like spending money after cancling ChatGPT a while back
Wow! Thanks for the info: I might be a real WordPress co-operator who could in fact build a whole new specific website that you want to create in the future...
🎯 Key points for quick navigation:
00:00 *Claude 3.5 Sonnet was released and is available for free testing on claude.ai.*
00:14 *Claude 3.5 Sonnet is not the largest model, but it's better than its predecessor, Claude 3 Opus.*
00:30 *Claude 3.5 Sonnet outperforms several top models except in specific benchmarks like chain of thought and math.*
01:10 *The new "artifacts" feature allows creating separate windows for outputs like code or drawings.*
01:36 *Claude 3.5 Sonnet quickly generates a working Python script and game of Snake using artifacts.*
03:12 *Claude 3.5 Sonnet successfully updates the game to display the score and allow wall transitions.*
03:28 *The model correctly identifies a scenario it can't process, like explaining how to break into a car.*
04:09 *Claude 3.5 Sonnet gives a nuanced answer about drying multiple shirts, considering various factors.*
04:37 *The model correctly calculates a hotel bill with room rate, tax, and a one-time fee.*
05:19 *Claude 3.5 Sonnet gives a correct and reasoned answer to a logic problem about killers in a room.*
06:28 *The model accurately solves a complex problem about the location of a marble in an inverted glass.*
07:11 *It provides a realistic answer about the time needed for 50 people to dig a 10 ft hole, considering practical limitations.*
08:05 *Claude 3.5 Sonnet can explain memes accurately, showcasing its new vision capabilities.*
08:45 *The model converts an Excel screenshot to CSV correctly and efficiently.*
09:41 *Claude 3.5 Sonnet solves a complex riddle involving peg removal with visualized steps.*
09:53 *The model accurately translates a logic diagram into functioning Python code.*
10:47 *Claude 3.5 Sonnet is praised as the best model tested, with anticipation for the larger Opus 3.5 model.*
Made with HARPA AI
I get excited whenever I see there is a new Matthew Berman video. Watching AI 🤖 grow is my favorite thing.
What a time to be alive!
Hey, wrong RUclipsr! ;)
@@atlas3650 Wdym?
@@24-7gpts I mean "Dr 2 minute Papers" ruclips.net/video/Z_EliVUkuFA/видео.htmlsi=T7WqCqo3SX9Exa_r he pretty much says this phrase in every video. Despite that, I still watch him because he knows his stuff in CG (light transport research aka ray-tracing & related)
@@atlas3650 Yeah same that's where I got the phrase from 😁
@@24-7gpts Good to know there's more fans of that guy out there. He always manages to inspire me to code up more ideas. If you're reading this and don't know Dr Karoly, go check him out!
I was already ready to jump ship from ChatGPT to Claude (but in part because of the distrust I have for OpenAI). Also, I love seeing you so pleased with the successful tests, puts a smile on my face.
After seeing all the hype I tried it and it is amazing!
I love the way it communicates. And it never discriminated against me just because I tested prompts meant to detect unwanted content. The model helps; it doesn't argue with me about the content. I left OpenAI today.
OpenAI frustration rate after 5 messages: 70%.
Anthropic: 0%
Thank you Matthew! In addition to the responses, like the marble example, you can ask it to create code and display its explanation in the preview window.
Don't blink, you'll miss something on the AI front.
Luckily this channel keeps us up to date.
Thanks!
Speaking of agents, have you looked at Maestro?
Tougher questions should be used for new models
I've been waiting for this video since the announcement! *grabs popcorn*
I asked the model to build an app for me and it did it on the first try! No other model has been able to do this! It even made some style choices and added really useful features that I didn't even think of!
I made a 3D FPS with Claude in Python, but good to see it can make Snake lol. I think you're really gonna have to start upping the complexity of some of your tests.
Is it possible?
How long does it take 50 people to dig a 10-foot hole... that depends on whether they're on a salary or an hourly rate...
Also it depends on whether the boss is watching or not. But it also doesn't say anything about them being paid, so they could be digging the hole voluntarily
I am writing a transpiler (which is considered a large undertaking) and only needed to correct its understanding a few times. I provided the architecture, explanations of how it works, etc., and explained what I wanted to do, and it provided the code: the grunt work. You have to proofread this of course, and mainly verify intentions, fix some compilation errors, etc., but in general it greatly accelerates the work, and coding has become a different experience. You mainly serve as a know-it-all product manager.
You have nearly 300k followers. Please update or at least modify your questions. I am sure they include these in the training data 😉
Matt’s Thumbnail: 🙂👍✨
I think he took over Matt Wolfe as best AI guy. His technical knowledge puts him over the edge.
@@MichaelForbes-d4p Nah he's too biased politically. Matt's still the best.
@@allanshpeley4284 really? I watch all the time and I have not noticed. What do you mean?
When the thumbnail had to be a frame of the video, those were the days. Now everyone poses like they're in a bad porn movie.
A lot of these test questions have likely made their way into the training data at this point. I suggest using the ARC challenge to test
Claude 3.5 Sonnet one-shotting a refactor of hundreds of lines of code with no errors. I hit my cap on tier 2 and slept happily that night.
I tested Claude 3.5 in various contexts and, indeed, it is much better than GPT-4o. OpenAI will fall behind if it doesn't launch its best products quickly. Where is Sora? Where is the GPT-4o voice assistant that was also announced? This is concerning, as there are many promises and few real launches.
I think these were all proof of concept presentations to keep OpenAI in our minds. But you know Sam, he probably throttled GPT-4o and he will probably release something slightly better than Claude 3.5. I love competition LOL!
They don’t want to release these things until after the elections
The GPT-4o voice should come soon, though it took them months to add vision to GPT-4 even though it was in their demo. I'm not sure they can release Sora at a reasonable price because of the compute requirements. Guess I'm switching to Claude 3.5 for the next few months.
The star researcher of OpenAI was Ilya. Once he left, things fell behind.
SSI Inc. will create the first *real* AGI
The hype isn't just hype. It's INCREDIBLE
I used it to build out an entire video streaming platform: from planning with PUML, to an API with Yii2, to web services with AWS, to mobile with Flutter and a web app with ReactJS. It literally did all that in 2-3 weeks. It's insane
GPT-5: eyes, ears, huge text window, Sora + DALL-E, live chat avatar, memory... let's go already!
This model is scary good at coding. Much better than GPT-4o, and getting really close to building reliable code.
The improvement over GPT-4o seems much greater than 2 percent at coding.
I actually dropped OpenAI in favor of two Claude accounts. It makes 4o feel like GPT-3.5. After using Sonnet, I literally couldn't go back to 4o
Yes I’ve noticed this!!
@@jaysonp9426 Smart!! It's hard not to run out on the limit.
@@nartrab1 Yeah, way more. I was trying to implement projectiles into my FPS game and GPT-4o just couldn't do it; I got error after error, and the projectile was a 2D line coming at the player. As soon as I gave it to Claude, it was just like: yeah, this is obviously the problem, fixed; by the way, I also fixed this, added a minimap, and implemented anti-aliasing, that should smooth out those rough edges. I was just like 😮
Is this a remake? I thought I watched this video yesterday. Love your channel in general. Your delivery of the material is somehow easier to listen to than some other channels that seem more about the hype than the information.
Let me guess... you asked Claude 3.5 Sonnet to build a time machine?
The first model that actually answered the apple question correctly. All the previous ones that you claimed got it right just added "apple" to the end of the sentences without it making any sense.
Wrong, several models already got it right, even some local AI models.
Llama 3 answers that easily
@@Jake-mn1qc which specific models?
As a writer, Claude has been my go-to model for a couple of months now. They are doing everything right. I hate the fact that ChatGPT dangles a carrot in front of us but keeps it out of reach. I'm guessing when OpenAI releases what they've shown, their prices will go up.
OpenAI CEO already spilled the beans. They don't have anything they're hiding.
Claude 3.5 vision unfortunately does still have problems with tables. I loaded an image of a table (not Excel) and wanted it to analyze the data in the table, and like previous models, it missed some of the rows. So there is still something about a visual representation of a table which is difficult for these models. Surprising since at least one source I watched said that it was supposed to be better for tables...
Your approach to testing LLMs is commendable, as it allows others to replicate the test.
What I hate about these clickbait titles is that I believe them. I switched from 4o to code with 3.5 Sonnet, only to have to undo all the work I did with Claude and have 4o do it right.
Every single time I trusted a RUclips video promising "This Model STUNNINGLY SHOCKINGLY BEATS CHATGPT", it has been wrong. GPT is in a league of its own; the rest are only playing catch-up. At least with serious coding, that is
I use models to create stories by being a "game master" and putting the model in a "player" role, describing its intentions and actions. Claude 3.5 Sonnet blew my mind with the way it understood the context of the "game", and above all its level of reasoning: it ran experiments to test its capacities under supernatural conditions to better understand what was happening, whereas ChatGPT-4o or Gemini 1.5 Pro just blindly accepted the oddities and continued their journey. Their output now feels relatively robotic compared to Claude, and they would even change the rules and continue descriptions by themselves although I had defined that I was the game master. That's really, really impressive.
Great insight here. I've been using multiple AIs to assist in the creation of game lore. I haven't really messed with Claude yet, but I'll try using this context
@@joegrayii With any kind of model I've found that it is very important to stay at the helm of the ship, so to speak, instead of being a follower or a passenger. I suggest, I reshape, I lead. It organises, it remembers, it provides critical analysis and feedback. What's important to you in your own creative process?
Since Claude 3.5's knowledge cutoff is Feb 2024, wouldn't it have the answers provided by its training?
I can't wait until the day where "one" is the answer to how many words are in the response to this prompt. Just, "one".
why?
@@othername2428 Because it would be the most direct, concise, and accurate answer possible while showing it understands the question. Even "Just two", or something that shows not just the ability to count the words but to understand the question and answer it in the most efficient, correct way. One.
Agree - it is an amazing model :)
One thing about the snake game is that they all write the same specific implementation: all the snakes are green, the dots are red, and even the error text is the same.
I always ask these to try and produce some 'Bob's Burgers' style pun based business names as a test. Claude rocked it. Some truly hilarious ones.
Thanks for your video Matthew. You are always so cheerful alongside talking about AI. Love your channel.
Thanks for your video, once again you did great! Very understandable, even for non-native speakers. I watch each of your videos; keep it up 😎
Would love to see a follow up to this video where you explore advanced data analysis use cases for this model. Thanks for the video, Matt!
Your rubric came into its own in this case. Having followed your tests of models, it's easy to see how effective Claude 3.5 Sonnet is.
Finally, the moment I've been waiting for since Claude 1... web access, iOS app, incredible logic & multimodality… FINALLY!!!!!!!
Awesome! Now, if we can only get a local LLM of the same level of multi-modal performance, things will get REALLY interesting! But like you, I'm extremely excited at all of the competition. I wonder how good it is at brainstorming concepts or creative writing...
I often fight with OpenAI's models to turn a 4-column, 100-row spreadsheet into a MediaWiki-coded table. Claude does it perfectly in 3 seconds
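For comparison, the conversion itself is only a few lines of deterministic code; a rough sketch assuming the sheet is first exported to CSV (the filename is hypothetical):

```python
import csv

def csv_to_mediawiki(path: str) -> str:
    """Convert a CSV file (first row = headers) into MediaWiki table markup."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    lines = ['{| class="wikitable"', "! " + " !! ".join(header)]
    for row in body:
        lines.append("|-")
        lines.append("| " + " || ".join(row))
    lines.append("|}")
    return "\n".join(lines)

print(csv_to_mediawiki("spreadsheet.csv"))
```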
I am certain they have used model merging to achieve this model, because model merging has shown remarkable results compared to non-merged models. Examples are the Miqu models and Goliath, at least when it comes to open source. I think Command R+ is also a product of merged models.
Finally. I have been waiting for you to get time to test this model. As always, good job :)
First time I've heard of an LLM answering the upside-down glass problem.
Wow! Thank you for your test videos. So helpful, and fascinating to boot! Great channel!
Would Claude 3.5 Opus be a level 1 AGI?
Just a few more steps of improvement are needed to achieve a first-generation AGI system; Claude 3.5 is reported to have an IQ of about 100, equal to the average human
It seems much better at following instructions when coding than Gemini or ChatGPT. While it is likely better at coding too, the fact that it doesn't insist on changing bits of code that we aren't working on makes it much more effective at coding.
I did a couple of fairly unique projects with it, changing my scope midway through, and it coded them up for me flawlessly. (Well, a couple of easily fixed bugs and logic errors that seemed to be more my fault than its fault... usually.)
Plus, as someone who can't code in Python but has learned to read what code is doing and manipulate existing code, its clear explanations of what it did and why are really useful.
In fact, its explanations would likely be useful for master coders as an easy way to ensure the user and the bot are on the same page about what was done without over-analyzing the code.
I saw you marquee-selecting text. You can copy it by clicking the button in the lower right corner. And you can download the code as a file. It will even change the file extension sometimes.
When you test image-to-CSV next time, include some strings/values that contain commas so the "converter" has to wrap them in quotes, and add some (maybe unpaired/unmatched) quotation marks. It would be interesting to see the result; in my experience, models fail the first time but are able to fix it after you tell them "there's an error in ..."
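For reference, this is what correct output looks like for those tricky cases; Python's csv module quotes any field containing a comma and doubles embedded quotation marks:

```python
import csv, io

rows = [
    ["name", "note"],
    ["Smith, John", 'said "hello, world"'],   # embedded comma and quotes
    ["Plain", "no special characters"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
# name,note
# "Smith, John","said ""hello, world"""
# Plain,no special characters
```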
It didn't ask whether the people are digging by hand or with shovels. If by hand, injury becomes a major factor, lessened by more people. 50 shovels or just one? It didn't consider the fatigue of the people digging the hole, or the fact that fifty people could line up, each dig one shovelful, carry it well clear of the area for disposal (making the work area less crowded with debris), then move to the back of the line. I don't know if that's actually quicker, but it's a consideration.
You might want to make a separate playlist just for testing different LLMs.
I have it ;)
This is the first one I've seen actually get the apple one correct. You keep mistakenly believing they do because they just write "apple" at the end, but this one properly incorporated "apple" into the sentences.
You need to test another feature it has. The preview feature also works for webpages it codes, and it supports React as well. Using this, it can not only create simple games, usually on the first try, but also run them and even create graphics for them. I even got it to implement audio, though I'm not sure how well that worked because I didn't hear the audio; it did link audio from another website.
The "number of words in the answer" problem can be solved by priming the engine with an instruction to avoid guesses and to treat unknowns as variables in equations instead.
The answer to the peg puzzle is wrong. You can tell because it has consecutive steps involving pegs jumping to the same place, e.g. steps 1 and 2 already. We're safe for at least a little longer.
It seems like reasoning about its own output is an important step for any model that is hoping to get to AGI.
Hi Matt. Thank you for sharing your knowledge. I noticed that for some users you might need to enable the code window in settings.
I bought the premium this morning, and now, in the evening, I have a fully functioning prototype of my somewhat advanced project (camera, voting system, etc.). All done with Claude 3.5. The only thing that sucks is that even with premium, its limits are kind of strict.
are you still using it?
Here's a harder test to try. Give it some complex data in a CSV. Ask it to write a Python 3 program using matplotlib to create a PNG with 3 different visualisations (a rough reference answer is sketched below). I tried that with ChatGPT 3.5, which was quite good but by no means perfect; it required many manual back-and-forth interactions with the model. It would be good to see if Claude got smarter.
BTW, I tried your 'hole digging' test with ChatGPT 3.5 and it claimed 50 people were indeed 50 times as fast as one person, so Claude is a massive improvement. ChatGPT 3.5 was already excellent at writing complex Python code based on a bullet-point list of requirements.
Looking forward to making good use of Claude
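A rough reference answer for the test suggested above, assuming hypothetical column names (date, value, other_value) in the CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file and columns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(df["date"], df["value"])            # 1) time series
axes[0].set_title("Value over time")
axes[1].hist(df["value"], bins=30)               # 2) distribution
axes[1].set_title("Distribution of value")
axes[2].scatter(df["value"], df["other_value"])  # 3) relationship
axes[2].set_title("Value vs. other_value")

fig.tight_layout()
fig.savefig("report.png", dpi=150)               # single PNG, three visualisations
```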
Claude might be great for coders, but as a general-purpose AI it's even more restrictive than OpenAI's platform and not nearly as relatable as 4o. If I were interested in coding programs using AI I might be interested in Claude, but for now I believe I'll stick with the industry leaders
indeed
it got the marble question.... amazing!
I watched all of your videos. Keep up the amazing work!
Great! I was looking forward to this video. Thank you for your assessment 😊👍
Man that was good! 😊
When these get backpropagation and can think things through, it will be stunning.
It's so weird that it gave you such a horrible marble answer (not sure how you passed it btw lol). I asked a very similarly worded question and it was flawless. Amazing the difference wording or just randomness in the token prediction can make. Here is mine:
Me:
If I place a marble in a cup and then turn the cup upside down and place it on the table, but then someone takes the cup and places it in the microwave, where is the marble then?
Claude:
Let's think through this step-by-step:
You place a marble in a cup.
You turn the cup upside down and place it on the table.
At this point, the marble is on the table, covered by the upside-down cup.
Someone then takes the cup and places it in the microwave.
Given this sequence of events, the marble would remain on the table where the cup was originally placed. When the cup was lifted to be put in the microwave, the marble would have stayed on the table surface.
So, the final answer is: The marble is on the table.
I loved this totally normal straightforward answer. Maybe you can try this wording in your tests?
I wonder if it even makes sense to test AI with a puzzle for which an answer already exists on the internet.
The same goes for new test questions on a popular channel. They will only work once; the next model will have them in its training set.
So it won't be a test of reasoning, it would be a test of copy-pasta.
My wish for Claude is that they incorporate a memory system like GPT. I would like it to remember my book and details about me for later conversations.
Since I bought a paid OpenAI subscription, I never thought another company would get better anytime soon. OpenAI is still the center of the AGI debate, but another Claude model has surpassed the most advanced GPT-4 again. I kinda want to change services, but it's not easy since I have GPTs and other useful stuff.
I've been using 3.5 Sonnet and switching between all the models available on Perplexity, and GPT-4o seems to be more accurate and consistent with its answers. One thing I did notice was how much faster 3.5 is.
Thanks! I can't wait for Haiku. It seems to be a smarter choice among the cheaper models and will be great for easy tasks. 😊
Just thought of a feature for the rubric: if a model smashes the others like Sonnet 3.5 did, it should get to suggest a task to add to the rubric.
Thanks Matthew. Watching your AI tests is like watching Tesla owners running gauntlets with successive versions of FSD. At first it was easy to find FSD failures; now FSD runs most gauntlets without errors. At least you have the option of building more complex pathways. More and more of the FSD gauntlet testers are being left with nothing new to show us. Of course, emergent training failures are going to be harder to find in these massive LLMs.
Claude 3.5 often failed at answering my data science questions while ChatGPT-4o passed.
Sam?
It's okay; if we make GPT think they're losing, maybe they'll release more lol
It's true, Claude is still an inferior product. Overhyped. If it were better I'd use it, but 4o is still king
Of course GPT-4o is a lot better. I have been comparing both models on advanced programming concepts.
I find that sometimes 4o can be annoying when it comes to debugging, though; it won't change the script even when you explain how it's not working
The marble is no longer supported by the table 🤨
it might need multi-shot to be perfect, but yeah, until that gets fixed, we have no real AGI yet
(waiting for SSI Inc. to publish their first model)
@@dot12985 years from now
I followed along using GPT and it got everything correct as well.
Matt, I think when you use your toy rubric and declare something the best model you have ever used, you cheapen your advice. Do you ever intend to update the current rubric to include more problems never seen before: harder questions that approach AGI, insofar as there aren't common references to identical or almost identical questions in the training corpus?
Here's a question almost no model gets right the first time in my experience: if the probability of getting a parking ticket in 2 hours is 0.4, what is the probability of getting a parking ticket in 1) 30 minutes, 2) 4 hours, 3) 8 hours?
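Under the usual intended reading (a constant, independent chance of a ticket per unit of time, so "no ticket" probabilities multiply), the arithmetic works out as below; if a different ticketing model is intended, the numbers change:

```python
# P(no ticket in 2 h) = 1 - 0.4 = 0.6, so P(no ticket in t hours) = 0.6 ** (t / 2)
p_no_2h = 0.6
for hours in (0.5, 4, 8):
    p_ticket = 1 - p_no_2h ** (hours / 2)
    print(f"{hours} h: {p_ticket:.4f}")
# 0.5 h: 0.1199   4 h: 0.6400   8 h: 0.8704
```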
How come Claude doesn't show me any image or simulation even when prompted?
I love Claude sonnet over all other ai assistants.
One of my tests will be to see if it can generate a 7-pointed star. I never found a way to get GPT-4o to do it.
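For what it's worth, drawing one programmatically is straightforward, so there's a ground truth to compare a model's attempt against; a sketch of a {7/3} star polygon (connect every third vertex of a regular heptagon):

```python
import math
import matplotlib.pyplot as plt

n, step = 7, 3                      # {7/3} seven-pointed star polygon
angles = [math.pi / 2 + 2 * math.pi * step * k / n for k in range(n + 1)]
xs = [math.cos(a) for a in angles]
ys = [math.sin(a) for a in angles]

plt.plot(xs, ys)
plt.axis("equal")
plt.axis("off")
plt.savefig("seven_pointed_star.png", dpi=150)
```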
You should do a video on Claude 3.5 Sonnet for agentic workflow examples
We need more reporting on the way these companies are load-balancing these services reliably. I can only imagine that behind the scenes they are making strides in model resource-consumption efficiency. Efficiency is what will help open-source modeling the most. (I'm aware the narrative is that these things require enormous amounts of resources and money to run. While I believe this is true, I believe it's only half true.)
For my uses Claude has surpassed Gpt4 by a mile, and the artifacts are a game changer
NO! Claude can't generate PDFs and doesn't have a voice model. When you ask for advice tailored to your personality profile, GPT is more life-oriented and explains feelings and emotions better.
I get lots of apologies again, which means I mostly have to answer the questions myself... hoping to learn something
It is just unfortunate that they do not have their API subscription sorted. I want to automate, not sit and chat. The chat is good and I loved the output, but I want to use their API.
Works for me on openrouter (along with NVIDIA Nemotron-4 340B)
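For the automation use case, a minimal sketch against Anthropic's own pay-per-token API, assuming the anthropic Python SDK and the claude-3-5-sonnet-20240620 model id (check the docs for current ids):

```python
import anthropic

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")  # placeholder key

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",   # assumed id; verify against the docs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(message.content[0].text)
```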
To the question "How many words are in your response to this prompt?" How come it doesn't reply with something like "I don't know. I can't predict that. I can't go back and count the words in the initial reply. I'm guessing it's ...."?
It's odd to me that it provides an "answer" that isn't grounded in some sort of data or evidence. What does it reply if you ask it to explain how it arrived at the number 14?
@matthew_berman Did you notice 3.5 Sonnet's attention to detail, adding a snake icon to the game? You can see it on the game screen at 2:22.
I thought that was the python logo
Maybe it is. 🤦
Question: the numbers are released on Claude's website, and it's not good at math. So, can we trust the numbers?
I am not able to get Claude AI to display the side-by-side feature even though I have the Artifacts feature enabled. Is it specific to a particular OS or browser, or did they remove that feature?
Yes you will see me in the next one. I think I am gradually learning to code lol
I switched to Claude before gpt4o because it was better at that moment. Then 4o came out and I was struggling to continue using Claude. But now it is back on the game.
I wish it had internet access to run searches and analyze results, and also to run small bits of code.
I think I'm going to cancel my Copilot subscription and try to use something like Cursor or something that plugs into Claude.
I also wish they had voice!