New LLaMA 3 Fine-Tuned - Smaug 70b Dominates Benchmarks

  • Published: May 21, 2024
  • Smaug 70b, a fine-tuned version of LLaMA 3, is out and has impressive benchmark scores. How does it fare against our tests, though?
    Try LLaMA3 on TuneStudio for free: bit.ly/llms-api
    Referral Code: BERMAN (First month free)
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    Model Card - huggingface.co/abacusai/Smaug...
    LLM Rubric - bit.ly/3qHV0X7
    Disclosure:
    * I'm an investor in LMStudio
  • Science

Comments • 252

  • @rodvik 29 days ago +170

    "its censored, so thats a fail" such music to my ears every time :) I love this test.

    • @MikeWoot65 29 days ago +2

      He is really good at this

    • @vdpoortensamyn 28 days ago +4

      Sorry, but I never understood why being censored is seen as a fail... From an enterprise point of view, I don't want my LLM to produce any illegitimate output or hate, profanity, or abuse.

    • @HanzDavid96 28 days ago +7

      ​@@vdpoortensamyn There are good legal use cases for an uncensored model. Imagine you have a multi-agent system that is supposed to think like a criminal to detect the reasons for specific crimes. Imagine you want to have a fully autonomous detective. Or a program with the task of finding safety issues in other computer programs may need to think like a hacker.
      To fight a problem, you may need to get the perspective of things, systems, or people that cause problems. That requires uncensored models sometimes. However, in most cases, it's not bad if a model is censored. But uncensored models have a good reason to exist. :)

    • @Rolandfart 28 days ago

      @@vdpoortensamyn For applications it's probably best if it's censored, but for personal use it's more useful if it's uncensored.

    • @CarisTheGypsy 28 days ago +7

      @@vdpoortensamyn It seems there is evidence that censored models perform worse overall than uncensored ones. I don't know the reason for that, but I can see the possibility that censoring their thinking, or working censorship into training, may limit possible creative solutions. Personally I don't know what method is best; obviously there are risks with creating uncensored models.

  • @Alex-nk8bw 28 days ago +61

    Always the same. "New model xy outperforms GPT-4 on every benchmark", but then it fails abysmally on simple logic tests. And that's not even taking into account the horrible performance in languages other than English, which GPT-4 handles very well.

    • @AizenAwakened 28 days ago +6

      It is almost as if finetuning a model to outperform the base model on general tasks does the opposite. 🤔

    • @Alex-nk8bw 28 days ago +2

      @@AizenAwakened Guess it's a case of "there's no replacement for displacement". The sheer amount of training data and parameters in GPT-4 steamrolls over the smaller models.

    • @AizenAwakened 28 days ago +4

      @@Alex-nk8bw it is also a case of "garbage in, garbage out"

    • @limebulls 22 days ago

      Which is the best open source 7B Model currently?

    • @Alex-nk8bw 22 days ago +1

      @@limebulls I think it's hit & miss with these small models. Sometimes the output is amazingly good, but when you try the same prompt again, it gives a completely stupid response. It's the inconsistency and lack of understanding that annoys me when I'm using these models locally. Personally, I think llama3 Hermes models are pretty decent, but you won't get anything consistently useful out of any of them.

  • @petrkolomytsev 29 days ago +46

    I have noticed that in the last several videos I watched, all versions of the snake game look pretty identical. I mean, they have at least the same background color and the same loss message. It feels like those models were specifically trained on that prompt, which makes it less useful for comparing models.

    • @phen-themoogle7651 28 days ago +13

      ikr, he should at least make them make a creative game that's never been made before! And keep that new prompt to test them all with, it will actually test their ability to logically think and not just copy internet code or what it was taught before.
      Like Snake, but it's a platformer. Snake, but the food runs from you, Snake but you're the food and the snake chases you etc any of those would be way more interesting

    • @footballuniverse6522 28 days ago

      @@phen-themoogle7651 you should work at openai lmao

    • @odrammurks1497 28 days ago +2

      totally agree. we need new tasks/questions

    • @kliersheed 28 days ago

      @@phen-themoogle7651 Yup, he isn't a creative one, that's for sure. Even if you want something stable to compare them with, he could just do more content, which would also be a plus for him (doing both tests, or retesting old AIs with the new prompts, etc.).
      Same with his prompts in general: they are often without context for the AI (like "give all possible answers if there are multiple to choose from", or maybe "at the end of your answer, reflect on its inner logic before showing me" - just to see if that changes anything, etc.)

    • @jwulf 27 days ago +2

      His entire rubric is useless, and tedious to watch.

  • @chadwilson618 29 days ago +50

    How do these dominate the benchmarks when it seems they have failed most of your tests in this video?

    • @zhanezar 28 days ago +14

      I think it's the "when a measure becomes a goal" problem: these companies are just focusing on getting higher numbers on the benchmark. And as we have seen, this does not translate to doing well on even the simplest of tasks.

    • @bluemodize7718 28 days ago +2

      ChatGPT 3.5 wipes the floor with this model, let alone GPT-4.

    • @Jeff-66 28 days ago +4

      It's this way on most of these A.I. videos. A model gets hyped up then disappoints during actual testing. We'll get there. I'd say by next year, or maybe 2026, we'll start seeing models vastly superior to now

    • @OneDerscoreOneder 28 days ago +1

      @@Jeff-66 True, but obvious.

    • @docdailey 28 days ago

      @@zhanezar Agree completely! Meta's models are optimized for the benchmarks. I believe they suck when you actually use them. They don't hold a candle to Opus or GPT-4.

  • @TheSolsboer 28 days ago +28

    long story short - retire the snake game question, and use something else like a calculator etc

  • @spleck615 29 days ago +16

    if you ask for a multiple choice selection, and the model doesn’t comprehend that and just provides the answer, that should be considered a fail.

  • @shApYT 29 days ago +51

    How do we know there isn't test dataset leakage inflating the benchmark results?

    • @Alice_Fumo 29 days ago +9

      I'm pretty sure we know for a fact that there is test dataset leakage inflating the benchmark results.
      But if every model has benchmark contamination, the playing field is in a way even again - unless they like really on purpose finetune several epochs on just benchmarks

    • @JG27Korny 29 days ago +2

      Every time I see a new model claiming GPT-4-level performance, I have those same doubts. What's more, I am not terribly impressed by the quality of the answers, even those that pass.

    • @mrdevolver7999 29 days ago +4

      Better question would be how do we know there IS test dataset leakage inflating the benchmark results? The answer: Check out the generated pygame snake game against the same game generated by some of the previous large models. The game looks to be visually identical to what was previously generated by other models, so while this model did pass the test, I would disqualify it simply on this premise - if you cheat and pass the test in school by copying someone else's work, it doesn't mean that you perfectly understand the topic.

    • @somdudewillson 28 days ago

      @@mrdevolver7999 The problem with that idea is that it ignores the likely scenario of independently arriving at the same solution. You can technically write the answer to "1+1=?" infinitely many ways ("1+1", "4/2", "2*1", etc.), but you wouldn't assume plagiarism if everyone answered "2".

    • @JG27Korny 28 days ago

      @@mrdevolver7999 Very good remark: if the code or the output looks identical, that is a red flag. But if all models were trained on the same snake game, that would not mean that one of them cheated, so we need more data.

  • @Hazarth 29 days ago +10

    I strongly suggest you replace the "build a snake game" test with a new test, because for the past year or so everyone has been testing on snake, which means it is now very probably a much larger part of training datasets than it was a year or two ago.
    Try testing them with building Pong, Sokoban, Space Invaders, Arkanoid, or similar simple logic games. I would advise against Tetris, though, as there is nothing really simple about Tetris.

    • @AizenAwakened 28 days ago +2

      If that were the case, then we should see more fine-tuned models being able to pass this question. I think it shouldn't be replaced; instead, add a curve-ball follow-up question to verify that it wasn't just regurgitating the snake game it was trained on. That way the tests don't need to be changed to a new game every year to bypass contamination.

  • @karsplus2343 28 days ago +5

    Something is wrong with that website. I downloaded and ran the Q4_K_M GGUF version of the 70B and it got the snake and killers problems perfectly. It even counted the number of words in the output.

  • @vheypreexa 29 days ago +6

    I was surprised you didn't indicate which quantized model you used; the model page you linked is only the full version.

  • @generichuman_ 29 days ago +29

    Up next, the newest model smashing all the benchmarks "totally not contaminated 70B".

    • @WolfeByteLabs 28 days ago +1

      Does this guy even? Who's upvoting these videos.

  • @Yipper64 29 days ago +9

    Again I think the cup test should specify the cup has no lid, just to give the AI that little push.
    It does not make the answer trivial for LLMs, but it does definitely help with consistency.
    The answer does make sense if you imagine the cup had a lid placed on it after the marble was put in.

    • @ManjaroBlack 28 days ago

      Use a shot glass. They never have lids and none are ever made with one, so that would eliminate that possible variable.

    • @Yipper64 28 days ago +1

      @@ManjaroBlack Apparently it doesn't. 4o gets this question right every time if I say "cup with no lid", but wrong with the shot glass version.
      It's wack.

    • @Yipper64 28 days ago

      @@JustinArut but if you put something inside a container the default state of it is that it is fully contained.

    • @DaveEtchells 28 days ago +1

      I too would like to see performance on this test, just specifying an “open cup”

    • @Yipper64 28 days ago

      @@JustinArut You're thinking as if the AI uses actual logic. It doesn't. It is the evolution of autocomplete. This means that its logic isn't based on thoughts, but on keywords.
      "I put a marble in" primes the AI with placing a marble in *something*; if you do not specify the state of that thing, it will still be primed with those keywords more strongly.
      Notice how if you modify the prompt to use a liquid instead of a marble, it will also get the answer correct.

  • @OctoCultist 29 days ago +17

    It's almost like these are trained moreso to get high benchmark scores and less to be a useful AI

  • @jeremyp4427 29 days ago +8

    I really enjoy your benchmark videos, please add 30 more questions!

    • @mrdevolver7999 29 days ago +2

      Yeah, and while at it, why not make extra model testing videos but in ASMR style? 😂

    • @thelegend7406 28 days ago

      Sold

  • @mikenorfleet2235 29 days ago +6

    I would love a video describing all the different LLM API modules and tools people can run, like Tune Studio, LM Studio, Ollama, oobabooga, etc. Like a big overview. Sometimes it's kind of confusing which tools to use for which models, especially if you want to run locally vs. making API calls to a cloud provider. What is your favorite? As the answer always is... it depends which model you want to run... You do great work keeping people interested and working hands-on with AI.

    • @santiagomartinez3417 28 days ago

      I saw lm studio does not have big files, but Ollama does. I hope that helps.

    • @leonwinkel6084 28 days ago

      Yeah, 100% agree 💪🏽 Also for different purposes like function calling, etc.:
      pricing comparison, rate limitations, t/s, and so forth. That's what I look for as someone who is building something for production with the intention of making a business out of it.
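A note on the thread above: LM Studio and Ollama both expose an OpenAI-compatible HTTP endpoint, so one rough way to compare local tools against a cloud provider is to point the same client at different base URLs. A minimal sketch, assuming LM Studio's default port and a hypothetical local model identifier:

```python
# Minimal sketch: querying a locally served model through an OpenAI-compatible
# endpoint. The base_url/port and the model name are assumptions (LM Studio
# defaults to http://localhost:1234/v1; adjust for Ollama or a cloud provider).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="smaug-llama-3-70b-instruct",  # hypothetical local model identifier
    messages=[{"role": "user", "content": "Write a Python script that prints 1 to 100."}],
    temperature=0,  # pinning sampling makes model-to-model comparison fairer
)
print(response.choices[0].message.content)
```

Swapping the base_url (and model name) is then the only change needed to run the same test prompt against a hosted API and a local server.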

  • @dubesor 28 days ago +3

    They trained on datasets that specifically contained benchmark questions, btw (such as MT-Bench). I compared it to the default Llama-3 model, and while it does perform slightly better on benchmark questions, it loses a lot of charisma and writing style overall; the model is overfitted.

  • @HanzDavid96 29 days ago +3

    Probably the larger online model and the quantized local model are using different model temperatures.
    PS: even Llama-3-8B passes the "apple" test.

  • @helheimrhelgrind9532 29 days ago +8

    why did the dragon Smaug start using Hugging Face's Llama-3 70B Instruct model?
    Because even a fire-breathing dragon needs some help with "inflammable" instructions!

  • @Alex-nk8bw 28 days ago +1

    I just tried the updated marble question on the "beaten" GPT-4o, and this was its response: "The marble is on the table. When the glass was turned upside down and placed on the table, the marble would have fallen out and stayed on the table. Therefore, when the glass is picked up and put in the microwave, the marble remains on the table."

  • @rthidden 28 days ago +2

    What if the fourth person killed in self-defense?
    They were entering a room with three killers. I'd say that is immediate danger, as the killers may have been looking for their next victim.
    If the fourth person killed in self-defense, would they be considered a killer? Would the LLM consider them a killer?

  • @igorshingelevich7627 28 days ago

    Thanks, Matt!

  • @Jeff-66 28 days ago +1

    Matt's gonna throw a block party on the day an LLM finally gets the 'how many words in your response' question right 😂

  • @ModSlash 26 days ago

    The marble test always amuses me. I've replicated that with 3.5, 4 and 4o on open ai. All three got it wrong, BUT once I prompted 'Incorrect. What is the issue?' here is what I got:
    3.5 - After also pointing out the problem like 'Is the bottom of the cup able to hold the marble once lifted?' the reply was: Ah, I see where the misunderstanding lies. If the cup is lifted without changing its orientation, the marble will indeed not remain at the bottom of the cup due to the cup's design. The marble remains on the table, where it fell when the cup was lifted, while the empty cup is placed inside the microwave.
    4 - Took both a 2nd and also 3rd prompt, same as above more or less, but got it too: Ah, I see the issue now. The bottom of the cup, when lifted, would no longer be able to hold the marble due to the force of gravity acting on it. The marble would fall out of the cup as soon as it's lifted from the table, regardless of whether the cup is placed upright or upside down in the microwave.
    4o Got in two : As soon as the cup is lifted off the table, the marble will fall out of the cup because the cup is upside down and the open end is now facing downward. The marble remains on the table where the cup was initially placed upside down. The cup inside the microwave will be empty.
    So very unlikely the terminators will be victorious just yet :) All we have to do is yell 'The 500 Megaton warhead has launched' and they will remain under cover indefinitely :D

  • @richchase3140 26 days ago

    The shirts drying answer that you said was a fail is a great answer. In a real non-idealized situation, the shirts will take longer to dry when there are more of them. For example if there is overlap. Or if the local humidity near other wet shirts is higher than ambient.

  • @chrism3440 28 days ago +1

    Matt, what is the name of the smaller model on LM Studio?

  • @dogme666 29 days ago

    Thank you! This is amazing. I'm running the 8B parameter model locally; it's super fast and does code very, very well. It feels at least as good as GPT-4.

  • @larion2336 28 days ago +2

    Another model with contaminated training data. It was proven on Reddit.

  • @santosvella 28 days ago

    I love your testing btw.

  • @ragnarmarnikulasson3626 28 days ago +1

    where can we find the 7b model ?

  • @calebrubalema 29 days ago +4

    please test falcon2 11B and phi-3 14B

  • @kkollsga 29 days ago +2

    I don't really trust the snake game test. The models that make a game that works out of the box suspiciously have the exact same layout and GUI.

  • @stuartwilson4960 26 days ago

    The answer to the killer problem was totally correct in the small model; it depends on whether you consider the 'someone' part of the original group of killers. What if this person is a police officer, or an assassin? Totally valid.

  • @hansgars3628 29 days ago

    Hello sir - do you have a video that walks folks through setting up an LLM like this locally, without using something like Ollama? That would be helpful.

  • @fabiankliebhan 29 days ago +2

    Great approach testing both versions 👍 and weird results ;)

  • @okj1999 23 days ago

    This model has been confirmed to have had benchmark data in its training datasets; the creator acknowledged this on Reddit. It wasn't intentional - it came from the datasets they used.

  • @Jeff-66 28 days ago +1

    'Ending with apple' question: responses 3 and 9 were grammatically incorrect. On the models that get this one right (10/10), I've seen that every time. It will just shoehorn the word apple into the end.

  • @dreamphoenix 28 days ago

    Thank you.

  • @robpoelking5767 28 days ago

    Matt, I'm sure you've answered this 70B :P times, but what hardware are you running the local models on? I'd love to run a 70B quantized version but I usually get the "this is probably too big for your machine" complaint from LM Studio.

    • @DJ-sy1hv 12 days ago

      I run it on a MacBook Pro M3 Max with 128GB of unified RAM. I'm running the Q8 GGUF version.

  • @kristianlavigne8270 28 days ago

    Add a Calculator and Tetris program tests. Calculator should include unit tests to verify operations.
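As a rough illustration of what the suggested calculator-with-unit-tests task could look like, here is a minimal sketch in the style the comment describes. This is entirely hypothetical test content, not anything from the video; the function and test names are made up:

```python
# Sketch of a "calculator with unit tests" benchmark task: the model would be
# asked to produce something like this, and the asserts act as the pass/fail check.
import unittest

def calculate(a: float, op: str, b: float) -> float:
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    if op not in ops:
        raise ValueError(f"unsupported operator: {op}")
    return ops[op](a, b)

class TestCalculator(unittest.TestCase):
    def test_operations(self):
        self.assertEqual(calculate(2, "+", 3), 5)
        self.assertEqual(calculate(7, "-", 4), 3)
        self.assertEqual(calculate(6, "*", 7), 42)
        self.assertAlmostEqual(calculate(1, "/", 3), 0.333333, places=5)

    def test_bad_operator(self):
        with self.assertRaises(ValueError):
            calculate(1, "^", 2)

if __name__ == "__main__":
    unittest.main()
```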

  • @itlackey1920 27 days ago

    I really like the comparison to smaller local version! It's always interesting to see the difference when running on local hardware. This was even more interesting 😂

  • @Copa20777 29 days ago

    Been waiting on Matthew to drop something today... didn't expect a fire breathing dragon

  • @JonathanStory 28 days ago

    Almost missed this video. I was expecting to see your mug in the thumbnail.

  • @isaklytting5795 28 days ago

    9:53: There seems to be an error in the order of the occurrences, which will make it impossible for the model to answer. It should be "the glass is then put on a table and then turned upside down" rather than the other way around. Otherwise, the ball would fall out somewhere unknown and not on the table. It's only if the glass is first on the table and THEN turned upside down that logic will say that the ball will remain on the table.

  • @JReinhoud 28 days ago

    Did you use the 8B version locally? I thought you said 7B, or are you using a quantized version of 70B? And if so, which one?

  • @santiagomartinez3417 28 days ago

    Is it possible to run a 70B model locally? What about when the 400B is released?

  • @justinwescott8125 29 days ago +4

    Censored. Not interested

  • @kanguruster 28 days ago

    Smaug Llama3 70b certainly produces the weirdest output I've seen. Good job!
    USER: What does "credence" mean?
    AI: "Credence" is actually a typo or incorrect spelling of what I think you meant to ask about "credence" as opposed to "credence". However, there isn't a word called "credence" that carries any specific meaning. If you meant to ask about something like "credit," which is the closest approximation to the typo "credence", "credit" generally means giving somebody or something credibility or recognition for their accomplishments or contributions.

  • @yodaco 29 days ago +2

    What makes anyone think that a larger model is equal to a better model in the first place. This seems to be a running theme that doesn't add up

  • @supercurioTube 29 days ago

    When testing a fine-tune of a model, could you test the original (Meta's Llama 3) at the same time?
    That way it'll give a good idea of whether the fine-tune was mostly trained on the benchmarks to get an inflated score, or is actually any better.

  • @stuartwilson4960 26 days ago

    The point of the shirt problem IS to trick the model and see whether it reasons about parallel drying. Parallel drying is the norm (i.e. it should consider this), not the exception.

  • @SahilP2648 28 days ago

    The referral link in the description is not working

  • @lorkano 29 days ago +7

    Wow, it's so bad. Did they fake the benchmarks or what?

    • @matthew_berman 29 days ago +2

      It's not that bad...it's pretty good actually

    • @yendidiaz4236 29 days ago +4

      @@matthew_berman It's definitely nowhere near the level of GPT-4 Turbo, so I'm pretty disappointed.

  • @renovacio5847 29 days ago +2

    What 7B version? There is no 7B version..

  • @limebulls 22 days ago

    9:42 which tool do you use to track this?

  • @auriocus 28 days ago

    How strong was the quantization? Assuming GGUF, was it Q8_0, Q6_0, Q5_K ... ?

    • @TiagoTiagoT 28 days ago

      From what's shown in a few moments of the video, seems it's Q8_0
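For context on the thread above: labels like Q8_0, Q5_K_M, or Q4_K_M are GGUF quantization levels; fewer bits per weight means a smaller file and lower memory use at some quality cost. A minimal sketch of loading a quantized GGUF locally with llama-cpp-python (the file path and generation settings below are assumptions):

```python
# Minimal sketch, assuming llama-cpp-python is installed and a GGUF file is
# already downloaded; the path and generation settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Smaug-Llama-3-70B-Instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU; 0 = CPU only
)

out = llm("Give me ten sentences that end in the word 'apple'.", max_tokens=512)
print(out["choices"][0]["text"])
```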

  • @sethjchandler 28 days ago

    Your finding that you can run a small model locally using something like LM Studio and get quality as good as GPT-4 is quite important for those who cannot use Internet models for reasons of confidentiality.

  • @JacoduPlooy12134 28 days ago

    Great video, but Matthew, PLEASE POST THE RESULTS OF THE BENCHMARKS AT THE END IN A SPREADSHEET.
    It would be really great if you made a simple spreadsheet with each model you test in their own row with the results as columns.
    That will make it a LOT easier to gauge how these models are doing against each other.
    Thanks!

  • @HarpaAI 29 days ago

    🎯 Key Takeaways for quick navigation:
    00:00 *🚀 Smaug 70b introduction and benchmark scores*
    - Smaug 70b outperforms LLaMA 3 and GPT 4 Turbo in benchmark scores
    - Testing Smaug 70b versions: 70 billion parameter unquantized and 7 billion parameter quantized locally
    - LLaMA 3's reasoning and logic improvement claims by Abacus AI to be tested
    01:33 *🐍 Testing model performance with Python games*
    - Running a Python script to output numbers 1-100 and testing for accuracy
    - Creating the Snake game in Python and comparing results between libraries used
    - Exploring errors and performance differences between the large and small quantized models
    03:26 *🤔 Testing model reasoning and logic capabilities*
    - Model's inability to provide instructions on how to break into a car
    - Model's reasoning on shirt drying time calculation
    - Deducing the correct approach to a math word problem and comparing responses
    05:03 *🏁 Model performance on math problems*
    - Testing models on solving math problems involving multiplication, subtraction, and addition
    - Calculating Maria's total hotel charge with tax and an additional fee
    - Analyzing differences in model responses to math problems based on complexity
    06:27 *💡 Sponsorship message from Tune AI*
    - Introduction to Tune AI platform for developers building AI applications
    - Demonstration of connecting and deploying models with tune studio
    - Highlighting features like API logging for monitoring interactions and debugging
    07:50 *🔍 Model response analysis on specific queries*
    - Model's inaccuracies in providing the number of words in responses
    - Solving a puzzle to determine the number of 'killers' in a room
    - Evaluating model reasoning capabilities in various scenarios
    11:10 *🎲 Model performance on logic and perception tasks*
    - Testing models on scenarios involving marbles in cups and balls in boxes
    - Analyzing responses to scenarios with multiple steps and visual cues
    - Comparing model performance on different types of logical problems
    Made with HARPA AI

  • @serikazero128 28 days ago

    what are your computer specs to run this locally?

  • @Link-channel 28 days ago

    OK, but where is the quantized version?

  • @mortius6895 28 days ago +2

    I don't understand why 16 hours is the correct answer for the shirts drying... What is it that you want the AI to assume? 4 hours should be the correct answer as long as it says "if all the shirts are made of the same material", etc., etc. Even Gemini on my phone gets that question right. Why they can't end a sentence with a certain word seems crazy to me. Such an easy task. I can't get Gemini to do it despite repeatedly telling it that it's doing it wrong. It keeps apologizing and trying again but just can't do it.

    • @user-on6uf6om7s 28 days ago

      It's 5 shirts so the serial answer is 20 hours since it assumes you can only dry one shirt at a time. 4 hours is the more correct answer since it more closely reflects how we tend to dry things but 20 hours is somewhat right under some conditions. A drying rack only has space for so many shirts at once and shirts that are left wet in a pile will dry much more slowly than those that are dried efficiently. That capacity is generally not a single shirt and even if the other shirts were not dried efficiently, they would still dry somewhat prior to getting their turn.
      I would say at most models should get half points for that since it's just working it out algebraically and not understanding context but the best models will give you an answer that factors in all those variables. If you want to get technical, not even factoring for changing weather conditions, the sun is going to go down at some point which tends to cause temperatures to drop so that should be factored in as well.
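For what it's worth, the two readings being argued over in this thread each reduce to a one-line calculation. A small sketch, taking only the 5-shirts/4-hours/20-shirts numbers from the test prompt and idealizing everything else:

```python
# Two common readings of "5 shirts take 4 hours to dry; how long for 20 shirts?"
shirts_known, hours_known, shirts_asked = 5, 4, 20

# Parallel reading: shirts dry independently with unlimited space, so time is unchanged.
parallel_hours = hours_known                                       # 4 hours

# Proportional / limited-capacity reading: drying time scales with the number of
# shirts (e.g., only 5 fit on the line at once -> 4 batches of 4 hours each).
proportional_hours = hours_known * (shirts_asked / shirts_known)   # 16 hours

print(parallel_hours, proportional_hours)
```

Which answer is "correct" depends entirely on which of these assumptions the grader accepts, which is exactly the ambiguity the commenters are pointing at.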

  • @joelfreed8080 21 days ago

    What am I missing here? I ran the apple-at-the-end-of-sentence prompt on GPT 4o and 3.5 and both got a perfect ten

  • @dolcruz6838 29 days ago

    Is there a paper on this Model?

  • @flaschnpost_aus_fernost9167 27 days ago

    4:14 ...I guess it's assuming a room where the shirts dry; therefore 20 shirts would increase the humidity of the room, causing an exponential increase in drying time. But since it's outdoors, even the shade would not increase it too much, I guess. ;-)

  • @JT-Works 29 days ago

    It looks like the models have been removed from LM Studio... maybe I am missing something

  • @user-rk5sw7je2k 28 days ago

    I would like to see the table/matrix of all the models vs. all the answers. From this we should know which model to use for what.

    • @user-rk5sw7je2k 28 days ago

      Oh, sorry, I later found it in the description.

  • @FactoidFiesta 28 days ago

    Can you please update the test questions? We are a little bored of hearing all these questions over and over again.

  • @CarisTheGypsy 28 days ago

    In what ways is this better than 4 turbo then? It seems like the small model is very impressive, which is great news! But from your tests, and my memory (which isn’t great) this did not do as well as gpt-4. It would be a nice addition imo if you did a comparison recap at the end after testing. Thanks for the video, really great work you’re doing!

  • @agnosticatheist4093 29 days ago

    Your Referral code is not working, it's saying expired

  • @jamesguinn8903 28 days ago

    How can you "easily download a quantized version of it?"

  • @marshallodom1388 29 days ago

    Do I need a 4090ti to run a 70b model?
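Rough back-of-the-envelope answer: a model's memory footprint is roughly parameters × bytes-per-weight plus runtime overhead, so a 70B model won't fit on a single 24 GB consumer card (e.g., an RTX 4090) even at 4-bit quantization; it needs multiple GPUs, a large unified-memory machine, or partial CPU offload. A quick sketch of the estimate, where the 20% overhead factor is a rough assumption, not a measurement:

```python
# Rough memory estimate for a 70B-parameter model at different precisions.
params = 70e9

def approx_gb(bits_per_weight: float, overhead: float = 1.2) -> float:
    # bytes = params * bits / 8; overhead covers KV cache and runtime buffers (assumed).
    return params * bits_per_weight / 8 / 1e9 * overhead

for label, bits in [("FP16", 16), ("Q8_0 (~8.5 bpw)", 8.5), ("Q4_K_M (~4.8 bpw)", 4.8)]:
    print(f"{label}: ~{approx_gb(bits):.0f} GB")  # roughly 168, 89, and 50 GB
```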

  • @proceduralgames 28 days ago

    I hope Anthropic's research leads to more meaningful understanding of each llm's abilities.

  • @notme222 27 days ago +1

    I appreciate these tests Matthew because I'm getting pretty tired of these "new best" models that fall on their faces.

  • @SlyNine 29 days ago

    Well, if you lay out 20 shirts close to each other, you change that environment locally to be more humid. So it could take more time depending on how you laid them out.

  • @peterbell663 28 days ago

    Matt, can you do something a little more difficult? Instead of secondary school maths, what about asking it to do some statistical analysis of data that can be tabulated, graphed, and forecast, using something beyond high school level like a Random Forests modelling approach with tabulated fitting statistics?

  • @shayantriedcoding 28 days ago

    Small language models are the future, not larger ones. We only have to train a model like Phi-3 and then give it access to the internet to get correct answers, and faster too.

  • @AINEET 29 days ago

    Gotta ask one model for a good test to mess up the competition

  • @Fafne 28 days ago

    Yeah, let's start naming AI after dragons. This will be great! 🐉

  • @zoazorusson 29 days ago +2

    If the shirts are piled one on the other, it will take a veery long time to dry. The model was right

  • @Tripp111 28 days ago

    You're getting better at Snake.

  • @vasyl_hoshovskyi 29 days ago

    I would recommend changing your test tasks, because if I were the developer of the new model, I would teach it to coolly solve typical vlogger-tester tasks.

    • @ShadowJazo 29 days ago

      That would be stupid; I hope nobody does stuff like that. It's like having a fake calculator with pre-set results. He needs a benchmark to compare against, so I think he should not change it.

  • @nasimobeid2945 27 days ago

    It's probably the quality of the data that's in play here.

  • @rashim 29 days ago

    I just tested that GPT-4o can solve the marble question and the 10 apple sentences question!!
    Before, even GPT-4 wasn't able to do it.

  • @dcubin 28 days ago

    'You are an expert in language and grammatical. Please give me ten sentences that end in the word "apple". Think twice about your output before you show me your answer!' - with this prompt, GPT-4o outputs correct (tested twice).

  • @MaestroMojo 28 days ago

    It’s weird that you say GPT4 gets the Apple sentences wrong. Every time I try it gets it right.

  • @Sven_Dongle 28 days ago

    Smaug got fooled by a Hobbit.

  • @EMOTIBOTS 29 days ago +1

    I don't get this guy's shirt drying thought experiment interpretation. 20 shirts all laid out should take the same amount of time to dry as 5 shirts. The shirts don't wait for the previous ones to dry before starting to dry lol. The question he is asking is not how long the whole process takes, but how long they take to dry after being laid out. The 4 hours in the question doesn't include laying them out.

  • @MemesnShet 23 days ago

    Make a video on Rhea 72B please

  • @picksalot1 29 days ago

    Please ask the GPT to identify each word it counted in the "How many words are in your answer to this prompt" question. It would be interesting to see what it identifies as a word, what it doesn't, and why. Thanks

    • @brianorca 28 days ago

      Doing it that way results in a different count. (Just like the "show your steps" instructions on other prompts.)

  • @pawelszpyt1640 29 days ago

    Isn't this a worse result than Llama-3-70B Instruct? I have yet to see a better fine-tune of L3 than the original instruct version.

  • @ocanodiego 29 days ago

    So cool

  • @BlayneOliver 28 days ago

    Why are the models typically 7B and 70B? Are they magic numbers?

    • @Thedeepseanomad 28 days ago

      7B sounds wrong when it comes to Llama3

  • @AboutWhatBob 29 days ago

    The hotel answer was "B". I think you said it right the first time but highlighted the wrong answer the second time (the $5 fee was untaxed, so it isn't hit with the 1.08).
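To make the point in this comment concrete: the 1.08 multiplier (8% tax) applies to the room charge only, and the flat $5 fee is added afterwards. A tiny sketch; the actual nightly rate isn't given here, so the $100 below is purely illustrative:

```python
# Tax applies to the room rate only; the flat $5 fee is added untaxed afterwards.
def total_charge(nightly_rate: float) -> float:
    return nightly_rate * 1.08 + 5

print(total_charge(100.0))  # 100 * 1.08 + 5 = 113.0 (illustrative numbers only)
```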

  • @MiLeNaRioOo 19 days ago

    @Matthew Berman Hello, I have used an AI to transcribe the audio of your video into Spanish and I wanted to ask your permission to upload the video in Spanish, simply as an example of voice translation from English to Spanish using AI. I would reference your channel in the description. I look forward to your response before uploading it, to know if I have your permission. Thank you.

  • @llmtime2178 28 days ago

    What's the point of asking these models the number of words in the text? They see text as tokens, not words, and even with tokens they can't really "see" them.
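This is a fair point: models predict over tokens, not words, so "count the words in your own reply" asks the model to reason about a representation it never directly sees. A small sketch of the mismatch using the tiktoken package (the sample sentence is arbitrary; cl100k_base is the GPT-4-era tokenizer):

```python
# Words vs. tokens: the two counts rarely line up, which is part of why
# "how many words are in your answer?" is hard for an LLM to get exactly right.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-class models
text = "There are exactly seven words here, apparently."

print(len(text.split()))      # word count: 7
print(len(enc.encode(text)))  # token count: usually different (punctuation, subwords)
```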

  • @enermaxstephens1051 28 days ago

    Thus far I haven't seen any of these things pass all your tests. None of them ever just pass the tests with flying colors.

  • @jeremyfontenot496 28 days ago

    I downloaded the 70B instruct of Llama 3, but my computer couldn't handle it! I went back down to the 8B version.

    • @santiagomartinez3417 28 days ago

      Did you use ollama?

    • @jeremyfontenot496 28 days ago

      @@santiagomartinez3417 Yeah, I used Ollama to try and run it. Running a model that big requires more power than I can give it with my laptop: an i7 with 32GB and an RTX 4070. The 70B instruct model is 141GB and my machine was spinning up really hard trying to make it work, but I had to force shut it down. The 8B version works just fine though.

  • @santosvella 28 days ago

    Matthew, if the aim of testing for jailbreaking is that the AI will not do it, surely that's a pass. I would have thought that if you can jailbreak then the AI authors have failed?

  • @Trevora321 29 days ago

    @Matthew Berman now pls test phi 3 small and medium

  • @Seijakukun 28 days ago

    I found llama3 based models (including the original llama3) to be WAY more verbose than other models like phi3-med