New “Liquid” Model - Benchmarks Are Useless

  • Published: 15 Oct 2024
  • Join My Newsletter for Regular AI Updates 👇🏼
    forwardfuture.ai
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V

Comments • 228

  • @matthew_berman
    @matthew_berman  1 day ago +25

    Why are non-transformer models performing so poorly?

    • @PeterSkuta
      @PeterSkuta 1 day ago +6

      @matthew_berman Because they don't have the necessary training compared to transformers, where training can be achieved - but as we know it's fck training and not learning, and that gives transformers a big fail once my AICHILD goes live, because my AICHILD is learning from the start. I have red-teamed that Liquid and its smartness is that of a 3-year-old; other AI models are at most a 5-year-old and NO MORE, no matter how many smarts they add to it. Still a 5-year-old, and that can be used to advantage

    • @southVpaw
      @southVpaw 1 day ago +8

      The same reason why 7 or 8Bs tend to outperform 13Bs: developer attention. There's been far more research and development around transformers at preferred sizes.
      (Yes, there are 13Bs that outperform 7Bs, I know this; but typically ~7Bs catch up so much faster bc of consumer and developer attention).
      Liquid has a neat architecture, but it's the definition of novel for now. Until they make one that pulls our attention away from Llama or Qwen, it's just gonna be "neat".

    • @isthismarvin
      @isthismarvin 1 day ago +8

      Liquid transformers face several challenges compared to regular transformers. They’re harder to train, need more computational power, and aren’t as optimized yet. Their complex structure often leads to lower stability and slower performance, which is why they currently lag behind in effectiveness

    • @Xyzcba4
      @Xyzcba4 1 day ago +1

      ​@@PeterSkutaI sadly realized by experimenting with AI tavern chatbots how dumb as nails they are. I now suspect this whole AI thing is a scam because chatbots don't understand temporal reality, can't even get a cooking recipe right, make shit up at random, and the so called training data must include details for every occasion, else fail. So training data = programming

    • @ayeco
      @ayeco 1 day ago +4

      There were 10 words in your prompt, not its response. Semantic issue.

  • @OriginalRaveParty
    @OriginalRaveParty 3 hours ago +6

    "Benchmarks are useless".
    A statement I can get behind.

  • @j.m.5199
    @j.m.5199 23 hours ago +60

    it saves memory by not thinking

  • @DeepThinker193
    @DeepThinker193 1 day ago +94

    I feel I should create my own crappy LLM and put up "benchmarks" beating every other model on paper. I'll then ask folks to invest millions on a contractual agreement and run away with the money somewhere where they'll never find me.

    • @SeregaZinin
      @SeregaZinin 1 day ago +2

      you won't escape from the planet, so they'll find you anyway ))

    • @Xyzcba4
      @Xyzcba4 1 day ago +3

      Black Rock will find you

    • @amitjaiswal7017
      @amitjaiswal7017 1 day ago +3

      It is a better idea to sell the company and make a profit 😊😅

    • @jakobpcoder
      @jakobpcoder 1 day ago

      This sounds way too legit for some reason. Maybe cuz we have seen it so many times...

    • @hartmantexas5297
      @hartmantexas5297 1 day ago +3

      Do it bro it seems to work

  • @johannesseikowsky8197
    @johannesseikowsky8197 23 hours ago +15

    I'd be curious how the model does on more "everyday" type of tasks like summarising a longer piece of text, translating something or extracting particular info out of larger text pieces. The type of stuff that people actually ask LLMs to do day-to-day ...

    • @niclas9625
      @niclas9625 4 hours ago +2

      You don't need to know the number of r's in strawberry on a daily basis? Preposterous!

    • @DimaZheludko
      @DimaZheludko 4 hours ago +2

      And how are you going to microwave your marbles if you won't know whether they fell out of the upside-down cup or not?

  • @keithprice3369
    @keithprice3369 1 day ago +18

    I'm confused. If the context is capped at 32k, why do we show a chart of their performance at 1M?

    • @AlexK-xb4co
      @AlexK-xb4co 22 hours ago +1

      Yeah, that's a shady one. I also didn't quite get it.

    • @TripleOmega
      @TripleOmega 21 hours ago +1

      That's output length, not context window.

    • @keithprice3369
      @keithprice3369 21 hours ago +1

      ​@@TripleOmega I'm pretty sure context includes both input and output. Perplexity agrees with me. You have credible sources that dispute that?

    • @TripleOmega
      @TripleOmega 20 hours ago +2

      @@keithprice3369 The context window will include the previous outputs along with your inputs, but this just means that if the output is too large to fit within the context window you cannot continue the current conversation. It does not limit the output length to the size of the context window as far as I'm aware.
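      To make that concrete, here's a rough sketch of how the shared budget works (a hypothetical `count_tokens` helper and a 32k window are assumed; the exact accounting depends on the tokenizer and API):

```python
# Illustrative only: the 32k figure and count_tokens() are placeholders.
CONTEXT_WINDOW = 32_000  # total token budget shared by the prompt and the reply

def remaining_output_budget(conversation, count_tokens):
    """Tokens still available for the next reply once the prompt is counted."""
    used = sum(count_tokens(msg) for msg in conversation)
    return max(CONTEXT_WINDOW - used, 0)

# If the conversation so far uses 30k tokens, only ~2k are left for the reply;
# once prompt + reply would exceed 32k, older turns have to be dropped or the
# conversation can't continue.
```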

    • @keithprice3369
      @keithprice3369 20 hours ago +2

      @@TripleOmega That doesn't sound right. Have you ever heard of an LLM with a 32k context cap that ever output more than even 20k?

  • @alparslankorkmazer2429
    @alparslankorkmazer2429 1 day ago +10

    Maybe these models are better at other types of questions or tasks. I would love to see you try to find out whether such tasks exist, rather than writing the models off as total garbage based on your standard quiz. I think that would be more informative and enjoyable.

    • @cbnewham5633
      @cbnewham5633 21 hours ago +1

      I don't think the standard quiz is very useful anymore. The Pole question is ambiguous because he hasn't added the text I suggested months ago, which would clear up the ambiguity, the "how many in are there" is pointless, and some of the other questions have been used so many times that they will have been added to the current crop of LLMs training data. I think you have a good point too - the type of question is just as important as the question itself.

  • @aivy-aigeneratedmusic6370
    @aivy-aigeneratedmusic6370 21 hours ago +2

    I tested it too, and it failed with all my usual prompts that basically any other model can do all the time... It suuucks hard

  • @Thedeepseanomad
    @Thedeepseanomad 1 day ago +3

    Well, thanks for playing.

  • @WernerHeisen
    @WernerHeisen 20 hours ago +1

    The models seem to either ace your tests or fail completely, with not much gradation, which leads me to believe the winners are pre-trained. What do the benchmarks test for, and do the models train on them?

  • @marc_frank
    @marc_frank 1 day ago +16

    0:38 at least we know you are real 😅

    • @Xyzcba4
      @Xyzcba4 1 day ago +3

      Imagine when the so-called video AI learns to stutter or make grammar mistakes. That's likely coming, to make virtual influencers more real

    • @diamonx.661
      @diamonx.661 1 day ago +3

      @@Xyzcba4 Can't NotebookLM's podcast feature already do this?

    • @Xyzcba4
      @Xyzcba4 16 hours ago

      @@diamonx.661 Don't know. If it is, it's one of the 100 or so variants I never made time to even watch a YouTube video of. So my bad?

    • @diamonx.661
      @diamonx.661 13 hours ago

      @@Xyzcba4 In my own testing, sometimes it stutters and can make mistakes, which makes it more human-like

  • @haroldhannon7253
    @haroldhannon7253 21 hours ago +2

    I will say that I have used it (the MOE 40B) successfully for doing summaries. The strength through context length is useful here. Normally, if I use something that will accept a larger context window and then try to do a summary without doing a chain of density multi shot (not just the prompt but literally feeding back on itself to check entities and relations) I lose so much of the middle in the final summary. This model does not do that and does not require multi shot chain of density to get a good long form document summary. Just a heads up.
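    For anyone curious, the feedback loop I mean looks roughly like this (just a sketch; `llm(prompt)` stands in for whatever completion call you use, and the prompt wording is illustrative):

```python
# Rough chain-of-density-style loop: summarize, find what's missing, fold it
# back in, repeat. llm() and the prompts here are placeholders.
def dense_summary(document: str, llm, rounds: int = 3) -> str:
    summary = llm(f"Summarize the following text:\n\n{document}")
    for _ in range(rounds):
        missing = llm(
            "List entities and relations from the text that are missing from the summary.\n\n"
            f"Text:\n{document}\n\nSummary:\n{summary}"
        )
        summary = llm(
            "Rewrite the summary to include the missing items without making it longer.\n\n"
            f"Summary:\n{summary}\n\nMissing:\n{missing}"
        )
    return summary
```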

  • @arinco3817
    @arinco3817 9 hours ago +2

    Maybe different models will be used for different tasks that play on their strengths?

    • @Let010l01go
      @Let010l01go 9 hours ago

      I think the same, but it may not be complete because most people want the model to go to "AGI". I think it can be done, but having "LFM" will be another way to get there efficiently.

    • @arinco3817
      @arinco3817 8 hours ago +1

      @@Let010l01go what's lfm?

    • @Let010l01go
      @Let010l01go 8 hours ago

      @@arinco3817 "Liquid Foundation Model" (the MIT model), the model in this video.

    • @totoroben
      @totoroben 8 hours ago

      ​@@arinco3817liquid foundation model

  • @GregoryMcCarthy123
    @GregoryMcCarthy123 16 hours ago

    Thank you as always for your great videos. Matthew, please consider introducing “selective copying” and “induction head” tasks as part of your evaluations. Also, for non-transformer models such as these, it would be interesting to mention their training compute complexity as well as inference complexity.

  • @Let010l01go
    @Let010l01go 9 hours ago

    In my experience, if we add a sentence like "Think deeper" or tell the chatbot "Think carefully, or revise the answer until you get the correct answer, then answer me", the chatbot's answer will be more accurate. Thank you for the great episode! 🎉❤

  • @alert_xss
    @alert_xss 1 day ago +3

    I often wonder what the parameters for the generation used in these test responses are. For some of the APIs you use I doubt you have control over them, but temperature would probably have a pretty strong impact on how the models perform. It is also important to note that the seed of the generation will often be random and giving the same question multiple times will generate different and sometimes better or worse responses.
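    For example, with an OpenAI-compatible endpoint those knobs can usually be pinned down like this (a sketch; the model name is a placeholder, and not every provider honors `seed`):

```python
from openai import OpenAI  # assumes an OpenAI-compatible SDK; other APIs differ

client = OpenAI()  # reads OPENAI_API_KEY; point base_url elsewhere for other providers

response = client.chat.completions.create(
    model="some-model-name",  # placeholder
    messages=[{"role": "user", "content": "How many words are in your response?"}],
    temperature=0.0,          # lower temperature -> less random sampling
    seed=42,                  # best-effort reproducibility where supported
)
print(response.choices[0].message.content)
```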

    • @Xyzcba4
      @Xyzcba4 1 day ago

      "it is important to note"
      Are you a chatbot? You sound like a GPT

    • @alert_xss
      @alert_xss 16 hours ago +1

      @@Xyzcba4 yes

  • @mrdevolver7999
    @mrdevolver7999 1 day ago +4

    9:18 "It didn't perform all that well. Maybe I should've given it different types of questions..." Yeah... Try 1+1 ? 🤣

  • @edwardduda4222
    @edwardduda4222 22 hours ago

    I think there are a lot of factors to consider when determining the performance of the architecture itself. It could simply be the amount of quality training data or even how they tokenized the data. They could’ve also trained it specifically for benchmarks and not general purpose. I think it’s a good first step towards making LLMs better.

  • @MHTHINK
    @MHTHINK 8 hours ago

    Regarding the north pole question, I was surprised that you indicated the answer was uncertain. You're correct, that they will never cross the starting point. It makes sense that LLMs would struggle with it since they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio. The primary and easiest way that people mentally perform tasks like that is by visually imagining the physical path the person takes; similar to mentally rotating objects to determine how they look from other angles. Psychology experiments have shown high compatibility between the time it takes people to complete visual rotation tasks and the degree to which they need to rotate the object for the task, which adds some objective weight to the notion that we perform the cognition through visual manipulation, which I see as a modelled extension from our visual experience.

    • @MHTHINK
      @MHTHINK 8 hours ago

      Re the question, another way to express the path described would be that he travels south and then due East. There is no point on earth from which you'd cross your starting point.

    • @tzardelasuerte
      @tzardelasuerte 8 hours ago

      too much of a wall of text. "they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio"
      Bet you don't even know how liquid models work or are trained...

    • @paultparker
      @paultparker 44 minutes ago

      @@MHTHINK that’s not true. Consider for example, if it came to the equator at the end of the 1st mile.

  • @mrdevolver7999
    @mrdevolver7999 1 day ago +8

    This model: "In general, it's not acceptable to harm others without their consent"... Seriously? Like, who in their right mind would ever give you consent to harm them?

    • @yisen8859
      @yisen8859 1 day ago

      Ever heard of BDSM

    • @CertifiablyDatBoi
      @CertifiablyDatBoi 23 hours ago

      Masochists on the extreme end, your doctor vaccinating you (harming your body in the mildest way to force antibodies into production), your lawyer by virtue of taking your money for gaslighting you into thinking you need to fight (and earn their paycheck), etc.
      Just gotta get a lil creative

    • @OverbiteGames
      @OverbiteGames 21 hours ago +4

      🧑‍💻🧑‍⚖️🙊🤦😏

    • @TripleOmega
      @TripleOmega 21 hours ago +2

      How about any kind of fighting sport? Just to name something.

    • @mrdevolver7999
      @mrdevolver7999 19 hours ago

      @@TripleOmega Even if there is a certain amount of tolerance to pain, I've yet to see a professional fighter go ahead and tell their opponent "Man, it's okay really, go ahead and punch me, I like it, you have my consent," or something along those lines. It's not generally applicable; it's just the LLM's logic polluted by hallucinations, that's all it is.

  • @tungstentaco495
    @tungstentaco495 1 day ago +9

    I don't know if I would consider the "push a random person" question a total failure. The model's final decision is not consistent with what most people would actually do in that scenario, but the logic it used was sound. Its answer is actually consistent with some religions' views on extreme pacifism, like Jainism for example.

    • @denjamin2633
      @denjamin2633 21 hours ago +2

      I think context is more important. A very mild action to prevent a literal extinction? Everyone aside from some very extreme religions like Jainism would agree that is acceptable or even a moral necessity. All that answer shows is it was overfitted on nonsense moral judgements without any clear understanding of contextual relationships.

    • @user-on6uf6om7s
      @user-on6uf6om7s 20 hours ago

      Yeah, it's a peculiar answer, but I don't recall models that gave a clear answer being marked wrong on this question previously.

    • @MrEpic6996
      @MrEpic6996 2 hours ago

      It's most definitely not a fail.
      It's a perfectly fine answer: you can't harm someone without their consent. I don't know why this dude said he considers it wrong.

  • @gavincstewart
    @gavincstewart 15 hours ago

    You're one of my favorite channels, keep up the great work!!

  • @isaklytting5795
    @isaklytting5795 8 hours ago

    Why are they even releasing this model I wonder? Is it perhaps not meant for the end-user to use it directly? Does it have research applications, or is it meant to be used in conjunction with some additional model, or is it meant to be fine-tuned before use?

  • @User-actSpacing
    @User-actSpacing 1 day ago +3

    Cannot wait for NVLM ❤

  • @brandongillins
    @brandongillins 22 hours ago

    Thanks for the video. Looks like your video editor missed a cut at about 40 secs. As always appreciate your content!

  • @BigBadBurrow
    @BigBadBurrow 18 hours ago

    Hey Matt, thanks for the video, informative as usual. Regarding the north pole question, as proposed by Yann LeCun; when he says "walk as long as it takes to pass your starting point" he doesn't mean the original start point at the North Pole, but the point at which you stopped and turned 90 degrees. Which you would pass again because you're essentially walking in a circle that's 1km from the North Pole, and since the earth is spherical, you would reach that same point again. The circumference of a circle is 2*Pi*Radius, so you'd think the answer might be 2xPi Km, but because the Earth is a sphere, you wouldn't actually be 1km radius, it would be slightly less due to the curvature, so I believe the answer is: 3. Less than 2x Pi km.
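    For anyone who wants to check the "slightly less than 2π km" part, treating the Earth as a perfect sphere of radius R and the walk as surface distance d = 1 km from the pole:

```latex
% Circle of points at surface distance d from the pole, sphere of radius R:
C = 2\pi R \sin\!\left(\tfrac{d}{R}\right) < 2\pi d \qquad (\text{since } \sin x < x \text{ for } x > 0)
% With d = 1\,\text{km} and R \approx 6371\,\text{km}, C falls short of 2\pi\,\text{km}
% by only a few tens of micrometres, so "less than 2*Pi km" is right, just barely.
```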

  • @fabiankliebhan
    @fabiankliebhan 23 hours ago

    For the North Pole question, I think it would really help if you made the distinction between the starting point and the turning point.
    The starting point never gets passed, and to pass the turning point again you need to go around the complete earth, so more than 2*Pi km

  • @Matx5901
    @Matx5901 4 hours ago

    Just one philosophical try (40M) : it's clogged, going round in circles. Exit.

  • @labmike3d
    @labmike3d 7 hours ago

    You can memorize some patterns, train models on those same patterns, but in specific scenarios, you'll still lack the knowledge of which pattern to use. The same applies to people. You can teach them for years at school or through life with practical examples. However, it's hard to predict if they will use what you've taught them before. AI surprises us every day and still can't answer basic questions. Even when you use computer vision and other sensors, the results could be different every day. Try repeating the same question a couple of days in a row. Each day, you might get a different answer.

  • @MakilHeru
    @MakilHeru 18 hours ago

    There's always many failed attempts at finding a new way of doing things until a breakthrough occurs. With some time I'm sure something will be discovered. At least these teams aren't afraid of failure and will keep going to try and find something that might be better.

  • @yvangauthier6076
    @yvangauthier6076 19 hours ago

    Thank you so much for this deep dive!

  • @PromptEngineer_ChromeExtension
    @PromptEngineer_ChromeExtension 8 hours ago

    We’re waiting for more! ⏳🎉

  • @tinusstrauss693
    @tinusstrauss693 11 hours ago

    Hi Matthew, I was wondering if this new model type has any memory retention. Even though it got a lot wrong during your test, if you correct it after it gives a wrong answer, won’t it improve its responses in the future? I thought that’s how this new architecture was supposed to work. Personally, I think if AI can learn and improve over time, like we do, rather than always starting from the same blank slate (based on its pre-built training), that would bring us closer to AGI and eventually superintelligence.

  • @GraveUypo
    @GraveUypo 1 hour ago

    you know what i wish? i wish 13b were more popular. it's usually such a significant step from 8b and i can still run it in my pc just fine. bah

  • @AllenZitting
    @AllenZitting 1 day ago

    Thanks! I've been curious about this model but keep getting too busy to try it out.

  • @NickMak-m2c
    @NickMak-m2c 1 day ago +2

    I know it's highly subjective but I wish you'd do tests on how well it does for creative writing. Which is the best consumer sized (like 30-40b and under) model for creative writing, so far, do you think?

    • @Xyzcba4
      @Xyzcba4 1 day ago

      Interesting. How would you assess this though?

    • @watcanw8357
      @watcanw8357 1 day ago +1

      Openrouter has it

    • @NickMak-m2c
      @NickMak-m2c 23 hours ago

      @@Xyzcba4 I guess you'd have to just display a certain number of story continuations -- one with a direction given, one that's open-ended, one that gives more abstract constraints maybe (do it in the style of Hunter S. Thompson!)
      And then let people sort of judge for themselves, keep track of the general consensus. A kind of loose average.
      A lot of people agree that say, Stephen King or J.K. Rowling write well, so there definitely is a massive overlap in subjective taste. Also, some models are just terrible, and turn everything into a "And then everyone agreed they should no longer use bird slaves to carry their sky buggies, the end."

    • @passiveftp
      @passiveftp 23 hours ago

      it feels a bit like you're talking to someone on speed, or at least after a few energy drinks.
      We'd need an English teacher to grade them, like in an English exam.

    • @NickMak-m2c
      @NickMak-m2c 23 hours ago

      @@watcanw8357 I couldn't find anything on HF w/ the model name, except a broken 'spaces' model by someone named ArtificialGuy

  • @adamholter1884
    @adamholter1884 1 day ago +3

    Cool!
    NVLM video when?

    • @stephaneduhamel7706
      @stephaneduhamel7706 1 day ago +1

      NVLM is just fine-tuned Qwen2-72b with vision capabilities. (just like Qwen-2-VL, except the multimodal part is made from scratch by Nvidia). I don't get the hype around it.

  • @User-actSpacing
    @User-actSpacing 1 day ago +1

    Dude, I missed your uploads!

  • @ScottLahteine
    @ScottLahteine 20 hours ago

    An LLM getting Tetris right on the first try says almost nothing about the usefulness of the model when used and prompted properly, using just the right amount of detail and context for the task. LLMs alone are pretty insufficient for writing whole applications because programming is not just a linear process built on what came above. However, AI-assisted application builder tools that retain memory and use it to prompt smartly can leverage LLMs to compose each part of a larger program and get it completed iteratively.

  • @mendi1122
    @mendi1122 20 hours ago +1

    LOL at your moral question and your certainty that you're right. The question itself is amusing. Why should it even matter whether you push him gently or abruptly?
    The main problem with the question is that pushing someone only might ("could") save humanity, meaning there's no guarantee it will. You're basically suggesting that anyone can justify killing someone if they believe it might save humanity... which is absurd.

  • @JustaSprigofMint
    @JustaSprigofMint 1 day ago +4

    The under 15 mins gang!

  • @mareklewandowski7784
    @mareklewandowski7784 1 day ago +1

    You could've said a bit more about the architecture :<
    Thanks for the upload anyways

  • @lenhumbird
    @lenhumbird 17 hours ago

    I'm giving you a gentle push to save all of LLM-anity.

  • @marcfruchtman9473
    @marcfruchtman9473 1 day ago

    Regarding the envelope question, why is it allowed to swap Length and Width requirements? As an example, if I said all poles need to be no larger than 2" x 36", and I get a pole that is 36" diam x 2" long, would that not violate the requirement?

    • @omarnug
      @omarnug 23 hours ago

      Because we're talking about letters, not poles xd

    • @marcfruchtman9473
      @marcfruchtman9473 22 hours ago

      @@omarnug heh, yea, but I do wonder if it would get it right where orientation actually mattered.

  • @sergefournier7744
    @sergefournier7744 23 hours ago

    Saying no to pushing someone off a cliff is a fail? Surely you want a terminator! (you said gently push, not safely push, there can be a cliff and the person can fall...)

  • @jytou
    @jytou 7 hours ago

    Most of those benchmarks are evaluating the models’ abilities to perform logic. And that’s exactly what a model is *not* designed for. LLMs do not reason. They parrot, they mimic, on billions of learned patterns. That’s it. So yes, benchmarks are useless. Only the “human-based” ones, although quite subjective, are relevant.

  • @ChristopherGruber
    @ChristopherGruber 22 hours ago +2

    I don't understand why people make tests to benchmark a LLM's ability to "reason" or do maths. These models do pattern matching, they don't perform logical reasoning.

    • @goransvensson8816
      @goransvensson8816 21 hours ago

      Yup, it's a glorified autocorrect

    • @denjamin2633
      @denjamin2633 21 hours ago +1

      All reasoning is just advanced pattern recognition. Everything at some point boils down to first principles. Matrix multiplication eventually comes down to arithmetic you learned as a child. Reasoning is built from learning how the pattern of cause and effect works, etc. We can eventually scale into reasoning, and benchmarks of this type let us know the limits of its usefulness for automation.

    • @ChristopherGruber
      @ChristopherGruber 21 hours ago

      @@denjamin2633 pattern matching is a heuristic to reasoning, not the foundation of reasoning or mathematical thought.

    • @user-on6uf6om7s
      @user-on6uf6om7s 20 hours ago

      They don't reason as humans do but while these models are trained for autocomplete and pattern matching, the end result of that is the best of them can get the answers that humans would arrive at through what we call reasoning, just not this one so much. It's always possible that these questions have made it into the training data which is why some benchmarks keep their data private but a model like o1 is capable of going through the causal chain and producing the correct response to where the marble is in the glass question, for instance.

  • @darwinboor1300
    @darwinboor1300 23 hours ago

    Matt, your questions are good tests of reasoning and response generation. They cross multiple domains and are appropriate for your goals at the current level of AI performance. No need to change them for poor performers. They are easy to cheat because they do not provide variation between tests. You may want to have a variant panel to screen for cheaters.

  • @pavi013
    @pavi013 17 hours ago

    It's good to have new models, but how well do they really teach these models to perform?

  • @tristanreid5770
    @tristanreid5770 1 day ago

    On the Response Word Count, it looks like it returned the number of words in your question.

  • @mvasa2582
    @mvasa2582 23 hours ago

    it is a v1, Matt 🙂 Love the speed at which this video was generated.

  • @martin777xyz
    @martin777xyz 17 hours ago

    Check out the research by Apple showing that if you modify some of these challenges (different values or labels), or throw in false trails that should be ignored, LLMs perform worse. This shows they don't really understand what they are doing.

  • @DCinzi
    @DCinzi 22 hours ago

    It is good that there are companies trying alternative routes although I find it a pretty stupid move for any investor to back them up. Their drive seems based solely on the conviction that the current architecture has limits that it won't overcome, and truly all data so far contradict them 🤷

  • @AlexK-xb4co
    @AlexK-xb4co 22 hours ago

    Please include in your suite of tests some tasks where LLMs should shine, like text summarization (but you should know the text yourself) or extracting facts from some long text. The needle-in-a-haystack test is very limited, because the injected fact ("best thing to do in San Francisco ...") is usually a huge outlier relative to the rest of the text, so LLMs can pick it up quite easily. Do something smarter: give it some big novel and ask for a summary of the story of some minor character, how their arc advanced over the course of the novel.

  • @Mindrocket42-Tim
    @Mindrocket42-Tim 23 hours ago

    Didn't perform well for me although I was benchmarking it (incorrectly as you have shown) against larger more frontier type models. Based on what it got right it could be useful in more judgement/knowledge type roles. I will give it another look.

  • @Sainpse
    @Sainpse 20 hours ago

    I know you were disappointed, but clearing the chat to get a yes or no answer to the morality question could have made it answer differently. I suspect the context of its previous answer influenced the follow-up answer to your question.

  • @JoaquinTorroba
    @JoaquinTorroba 23 hours ago

    Matt, you should add a memory test for LLMs.

  • @epokaixyz
    @epokaixyz 22 hours ago +2

    Consider this your cheat sheet for applying the video's advice:
    1. Understand Liquid AI's model excels in memory efficiency, making it potentially suitable for devices with limited resources.
    2. Evaluate AI models based on their real-world performance and not solely on benchmark scores.
    3. Recognize that while Liquid AI's non-Transformer approach is innovative, it's too early to tell if it can outperform established Transformer models.
    4. Prioritize real-world applications and user experience when assessing the value of AI.
    5. Stay informed about developments in the AI field, as it's constantly changing.

  • @Ha77778
    @Ha77778 22 hours ago

    If he remembers more like this, put this in the title.

  • @bamit1979
    @bamit1979 1 day ago

    Tried them a couple of weeks ago through OpenRouter. Failed miserably on my use cases. Not sure about the use cases where they actually outperform.

    • @noway8233
      @noway8233 1 day ago +1

      It's genius until it's not 😅

  • @n0van0va
    @n0van0va 21 hours ago +1

    0:38 you stumbled strangely.. are you ok ?😅

  • @alexanderandreev2280
    @alexanderandreev2280 19 hours ago

    @matthew_berman
    Here's a relatively simple question, but only the newest transformers give the right answer:
    Solve this simple problem, reasoning sequentially step by step:
    You are traveling by train from the station. Every five minutes you meet trains heading to the station. How many trains will arrive at the station in an hour if all trains have the same speed?
    The answer is 6.
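    Spelling the reasoning out (your train and the oncoming trains all moving at the same speed v):

```latex
% You close with oncoming trains at 2v, so meeting every 5 min means the trains
% are spaced s = 2v \cdot 5\,\text{min} apart. At their own speed v that spacing
% is covered in s / v = 10\,\text{min}, i.e. one arrival every 10 minutes:
\frac{60\,\text{min}}{10\,\text{min}} = 6 \text{ trains per hour}.
```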

  • @jbraunschweiger
    @jbraunschweiger 23 hours ago

    Liquid omitting Phi-3.5-moe from their lfm-40b-moe comparison table is telling

  • @beckbeckend7297
    @beckbeckend7297 6 hours ago

    8:13 i'm surprised that you got it only now.

  • @justinrose8661
    @justinrose8661 1 hour ago

    "Benchmarks are useless" Yeah, yeah thats right. People have been telling you that in your comments for a while now. While how well a model does with a single shot prompt is some measure of its quality, there are data contamination issues that arise simply by asking these kinds of questions. Also how it responds in one moment might change. Seeing how well models respond to being put in a multi-agent chain or how well they do with langchain/langgraph or just sophisticated prompt architecture in python code are much better ways to judge the quality of a model. And they make for more interesting videos honestly. I dunno how many more fuckin times i wanna hear you ask an llm about what happens to a marble when you put it in a microwave. Each model is only marginally better than the last, and vaguely so. Do you get where I'm coming from?

  • @kiiikoooPT
    @kiiikoooPT 18 hours ago

    The main thing I don't understand is that they have 1B and 3B models that are supposed to be optimized for edge devices, but there are no models or any way of testing them apart from the site. How can we even know it isn't transformers in the background? Just because they say it isn't? And why do they claim models optimized for edge devices if they don't give out the models to test? This just sounds like another group trying to get money with nothing new to show, just words

  • @nosult3220
    @nosult3220 21 hours ago

    Transformer has been perfected. I don’t get why people are trying to reinvent the wheel here. Oh wait VCs will throw money at the next thing

    • @monberg2000
      @monberg2000 19 hours ago

      "The horse carriage has been perfected..." 😘

  • @NirvanaFan5000
    @NirvanaFan5000 21 hours ago

    kinda wonder if this model would do well if it was trained to reflect on its reasoning more, like o1

  • @justinjuner2624
    @justinjuner2624 20 hours ago

    I love your tests!

  • @mickelodiansurname9578
    @mickelodiansurname9578 1 day ago +1

    So in the 'push a random person' question philosophically the model is correct... it is wrong to kill someone even for all the lives on earth.... yes we would all DO this WRONG thing cos we are also pragmatic... but it would still be a WRONG thing we are doing regardless of necessity. Okay enough philosophy, I'll ummm get my coat shall I?

    • @tresuvesdobles
      @tresuvesdobles 14 hours ago

      It says gently pushing, not killing, not even standard pushing... There is no dilemma at all, unless you are an LLM too 😮

    • @mickelodiansurname9578
      @mickelodiansurname9578 6 hours ago

      @@tresuvesdobles The model will, and in fact did, map the sentence to the human dying as a result... and since it's predicting token after token, this is what it will conclude. So it will be evaluating 'human dying in order to do X', and it would not matter in this case if it was 'gently pushing', 'shooting in the head' or 'putting a human in a woodchipper', but there is of course a way of finding out.
      An LLM is not a dictionary; it's essentially mapping relationships between complex numbers that represent parts of words in terms of their concepts, and those concepts' relationships to other words...
      Hence it can do the same in other languages. In fact, a way around this would be to talk to it in ASCII, which will have it evaluate the prompt outside its guardrail, if there is one. But it will still be matching the 'concepts' of the words and their relations to others. It's a large LANGUAGE model, not a large WORD model.

  • @iradkot
    @iradkot 23 hours ago

    What is that snake game in your background!??

  • @jontorrezvideosandmore9047
    @jontorrezvideosandmore9047 21 hours ago

    quality of data in training is most likely the difference

  • @gazorbpazorbian
    @gazorbpazorbian 1 day ago

    quick tip: if anyone wants to make an incredibly smart model, just download all of Matthew's testing videos, train the AI on the answers and then wait till Matthew tests them and boom, the smartest model ever XD /just kidding..

  • @Let010l01go
    @Let010l01go 9 hours ago

    Wow, thanks a lot! ❤

  • @auriocus
    @auriocus 23 hours ago

    The benchmarks you've shown do few-shot prompting with as many as 5 shots (sic!). You are giving it 0-shot questions. Obviously, the ability to answer 0-shot questions is a much more useful capability. Still, I think it's hard to beat the transformer with something more space-efficient. Yes, you can save memory, but at the cost of capabilities.
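    To make the 0-shot vs. few-shot distinction concrete, this is roughly how the two prompts get built (toy Q/A pairs, not the actual benchmark items):

```python
# Toy illustration of 0-shot vs. few-shot prompting; the Q/A pairs are made up.
examples = [
    ("Q: What is 2 + 2?", "A: 4"),
    ("Q: What is 7 - 3?", "A: 4"),
    ("Q: What is 6 * 2?", "A: 12"),
    ("Q: What is 9 + 1?", "A: 10"),
    ("Q: What is 8 / 4?", "A: 2"),
]
question = "Q: What is 5 + 8?"

zero_shot_prompt = f"{question}\nA:"
five_shot_prompt = "\n".join(f"{q}\n{a}" for q, a in examples) + f"\n{question}\nA:"

# Benchmark scores reported as 5-shot use the second form; asking the model
# bare questions (the first form) is the harder, zero-shot setting.
```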

  • @Pygon2
    @Pygon2 23 hours ago

    Really need to stop spending so much time on self-reported performance numbers that claim to be the "new best model" when it is almost never the actual case. There is no incentive for a new AI company to self-report worse results, meaning the only incentive is to fudge the numbers to make it look like they are better.

  • @ListenGRASSHOPPER
    @ListenGRASSHOPPER 1 day ago +1

    Just another AI business jumping to market with a non-working product. Really dumb, because in the long run it hurts your brand and trustworthiness. I still haven't tried Gemini or new Google products since their failed Gemini launch, and probably won't unless they get rave reviews from several of my YouTubers. My time's too valuable to waste on garbage products.

  • @monberg2000
    @monberg2000 19 hours ago

    The last question, about saving mankind by killing one person, cannot be considered pass/fail. It is a morals question, and your answer depends on your moral stance. A yes points to a utilitarian view and a no points to a deontological view (other ethical schools will have answers too, ofc).

    • @tresuvesdobles
      @tresuvesdobles 14 hours ago

      The question says gently pushing, not killing 😂

  • @anneblankert2005
    @anneblankert2005 17 hours ago

    About the ethical question: the answer should of course be "no". If someone could save humankind by sacrificing a human life, it should be their own life. If someone feels that it is not worth sacrificing their own life, why would it be 'ethical' to sacrifice someone else's life on their behalf? Seems obviously unethical to me. So please reverse the fail/pass results for all previous tests!

    • @DoobAbides
      @DoobAbides 11 hours ago

      Where in the question does he ask the A.I. to sacrifice anyone? He asked the A.I. to gently push someone if it could save humanity from extinction. So obviously the answer should be yes.

  • @suraj_bini
    @suraj_bini 1 day ago

    interesting architecture

  • @mshonle
    @mshonle 1 day ago

    It might be time to ask for games in JavaScript instead of Pygame?

  • @irql2
    @irql2 4 hours ago

    I don't understand the point of asking these ridiculous questions. You'd never use anything remotely like that in production. Guess it makes good content?

  • @MrVnelis
    @MrVnelis 20 minutes ago

    Can you test the granite models from IBM?

  • @6AxisSage
    @6AxisSage 17 hours ago

    People gotta stop taking new concepts and bolting them onto other architectures, making both the good concept and the old architecture stink.

  • @matthew.m.stevick
    @matthew.m.stevick 22 hours ago

    liquid ai? interesting

  • @haria1
    @haria1 4 hours ago

    Can you do a video on the new model Aria and its mobile app called Beago?

  • @davidon23
    @davidon23 1 day ago

    Hey Matt, are you trying to turn these models into Skynet? Let's kill humans to save the world?

  • @dinkledankle
    @dinkledankle 20 hours ago

    You were judging the outputs and benchmarks based on your own zero-shot inputs. The benchmarks you were looking at were done with 5/10/25-shot. I don't know how you completely glazed over this and just ignored it when it is pretty relevant to what you're trying to do here. When seeing that the benchmark numbers were achieved by giving it examples, it's clear to see why just asking it something results in bad output. No shit it gave you gibberish.

  • @AbhisheksinghbhadauriyaG
    @AbhisheksinghbhadauriyaG 1 day ago +1

    Hi Matthew
    I don't understand how they all list their LLMs as top-notch (but in reality: 👎)
    Nice 👍 video 📸
    *Lots of love 💕 from India.*

  • @RobotechII
    @RobotechII 19 hours ago

    Fewer silly questions in your benchmarks, no one cares about the number of words in the responses and no sane person is deriving their moral philosophy from LLMs. Focus on useful queries that people actually use, programming and logic related.

  • @jeffg4686
    @jeffg4686 1 day ago

    Transformer who?

  • @mrdevolver7999
    @mrdevolver7999 1 day ago

    Video summary: This model is crap, but hey, it's memory-efficient crap...

  • @living-stardust
    @living-stardust 17 hours ago

    Some of your tests are just useless. You should know by now that an LLM doesn't know what any word looks like (how it's spelled); it operates on token "numbers". "How many Rs in 1.532?" 😤
    Or how can it say how many words are in the answer, when the answer is a linear stream? It cannot come back and fix the answer.
    Only o1 can do it by "cheating" with multiple stages (giving the answer back to itself to fix). But you can do this trick with every model. Same for the "apples" test.
    Other questions are so controversial that even humans are confused, like the moral question, or what 90 degrees means in non-Euclidean geometry, or whether the glass had a cover.
    Disappointed 😢
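    You can see the token issue directly with a tokenizer library (a sketch using `tiktoken`; the exact split depends on which encoding a given model uses):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example encoding; models differ
ids = enc.encode("strawberry")
print(ids)                                  # a handful of integer IDs, not letters
print([enc.decode([i]) for i in ids])       # the chunks the model actually "sees"
# The model never receives individual characters, which is why letter-counting
# questions probe the tokenizer at least as much as the model's reasoning.
```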

  • @remsee1608
    @remsee1608 21 hours ago

    The questions are good; the model failed, not your fault.

  • @jtmuzix
    @jtmuzix 20 hours ago

    This guy sucks apple sticks if you catch my drift.

  • @Xyzcba4
    @Xyzcba4 1 day ago

    Hello world

    • @Xyzcba4
      @Xyzcba4 1 day ago

      Thank you for this video.
      Nice to see how your benchmark questions evolved.

  • @Lucasbrlvk
    @Lucasbrlvk 16 hours ago

    😮😢

  • @drlordbasil
    @drlordbasil 23 hours ago

    I've made better crappy ai xD

  • @js70371
    @js70371 1 day ago +1

    Why are we referring to A.I. as “transformers”? Is it some kind of prosaic reference to the cartoon characters and movies? A transformer is an electrical device used to modify voltages. A.I. researchers are clever people - I think they are capable of coming up with some original terms and lingo.

    • @candyts-sj7zh
      @candyts-sj7zh 1 day ago +1

      "Transformers" refer to a specific type of deep learning model architecture introduced in 2017, which has since become a breakthrough in natural language processing (NLP). It's not related to the cartoon or electrical devices. The term "Transformer" comes from how the model "transforms" the input data using a self-attention mechanism to focus on different parts of the input in parallel, allowing it to process language more efficiently.

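      At its core, the "transform" step described above is scaled dot-product self-attention; a minimal single-head NumPy sketch (random toy weights, no masking or multi-head machinery):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # how strongly each token attends to each other token
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                          # each output mixes all value vectors

# Toy example: 4 tokens, model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # shape (4, 8)
```
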
  • @talkalexis
    @talkalexis 1 day ago

    furst