New “Liquid” Model - Benchmarks Are Useless
- Published: 15 Oct 2024
- Join My Newsletter for Regular AI Updates 👇🏼
forwardfuture.ai
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.ne...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Why are non-transformers models performing so poorly?
@matthew_berman Because they don't have the necessary training compared to transformers, where training can be achieved. But as we know it's just training and not learning, and that gives transformers a big fail once my AICHILD goes live, because my AICHILD learns from the start. I have red-teamed Liquid and its smartness is that of a 3-year-old; other AI models are at most 5 years old, NO MORE, no matter how many smarts they add. Still 5 years old, and that can be used to one's advantage.
The same reason why 7 or 8Bs tend to outperform 13Bs: developer attention. There's been far more research and development around transformers at preferred sizes.
(Yes, there are 13Bs that outperform 7Bs, I know this; but typically ~7Bs catch up so much faster because of consumer and developer attention.)
Liquid has a neat architecture, but it's the definition of novel for now. Until they make one that pulls our attention away from Llama or Qwen, it's just gonna be "neat".
Liquid transformers face several challenges compared to regular transformers. They're harder to train, need more computational power, and aren't as optimized yet. Their complex structure often leads to lower stability and slower performance, which is why they currently lag behind in effectiveness.
@@PeterSkuta I sadly realized by experimenting with AI tavern chatbots how dumb as nails they are. I now suspect this whole AI thing is a scam, because chatbots don't understand temporal reality, can't even get a cooking recipe right, make shit up at random, and the so-called training data must include details for every occasion or else they fail. So training data = programming
There were 10 words in your prompt, not its response. Semantic issue.
"Benchmarks are useless".
A statement I can get behind.
it saves memory by not thinking
LOL
Fuck that's good
So good
Those "AIs" do not think at all.
😂😂😂😂
I feel i should create my own crappy LLM and put up "benchmarks' beating every other model on paper. I'll then ask folks to invest millions on a contractual agreement and run away with the money somewhere where they'll never find me.
you won't escape from the planet, so they'll find you anyway ))
Black Rock will find you
It's a better idea to sell the company and make a profit 😊😅
This sounds way too legit for some reason. Maybe because we've seen it so many times...
Do it bro it seems to work
I'd be curious how the model does on more "everyday" type of tasks like summarising a longer piece of text, translating something or extracting particular info out of larger text pieces. The type of stuff that people actually ask LLMs to do day-to-day ...
You don't need to know the number of r's in strawberry on a daily basis? Preposterous!
And how are you going to microwave your marbles if you won't know whether they fell out of the upside-down cup or not?
I'm confused. If the context is capped at 32k, why do we show a chart of their performance at 1M?
Yeah, that's a shady one. I also didn't quite get it
That's output length, not context window.
@@TripleOmega I'm pretty sure context includes both input and output. Perplexity agrees with me. You have credible sources that dispute that?
@@keithprice3369 The context window will include the previous outputs along with your inputs, but this just means that if the output is too large to fit within the context window you cannot continue the current conversation. It does not limit the output length to the size of the context window as far as I'm aware.
@@TripleOmega That doesn't sound right. Have you ever heard of an LLM with a 32k context cap that ever output more than even 20k?
Maybe these models are better at some other types of questions or tasks. I would love to see you try to find out whether such strengths exist rather than writing the models off as total garbage based on your standard quiz. I think that would be more informative and enjoyable.
I don't think the standard quiz is very useful anymore. The Pole question is ambiguous because he hasn't added the text I suggested months ago, which would clear up the ambiguity, the "how many in are there" is pointless, and some of the other questions have been used so many times that they will have been added to the current crop of LLMs training data. I think you have a good point too - the type of question is just as important as the question itself.
I tested too and it failed with all my usual prompts that basically any other model can do all the time... It suuucks hard
Well, thanks for playing.
The models seem to either ace your tests or fail completely, with not much gradation, which leads me to believe the winners are pre-trained on them. What do the benchmarks test for, and do the models train on them?
0:38 at least we know you are real 😅
Imagine when the so-called video AI learns to stutter or make grammar mistakes. That's likely coming, to make virtual influencers more real
@@Xyzcba4 Can't NotebookLM's podcast feature already do this?
@@diamonx.661 don't know. If it is,it's 1 if the 100 or so varianta I never made to time to even watch a RUclips video of. So my bad?
@@Xyzcba4 In my own testing, sometimes it stutters and can make mistakes, which make it more human-like
I will say that I have used it (the MOE 40B) successfully for doing summaries. The strength through context length is useful here. Normally, if I use something that will accept a larger context window and then try to do a summary without doing a chain of density multi shot (not just the prompt but literally feeding back on itself to check entities and relations) I lose so much of the middle in the final summary. This model does not do that and does not require multi shot chain of density to get a good long form document summary. Just a heads up.
Maybe different models will be used for different tasks that play on their strengths?
I think the same, but it may not be complete because most people want the model to go to "AGI". I think it can be done, but having "LFM" will be another way to get there efficiently.
@@Let010l01go what's lfm?
@@arinco3817 "Liquid Foundation Model" (the MIT model). The model in this video.
@@arinco3817liquid foundation model
Thank you as always for your great videos. Matthew, please consider introducing “selective copying” and “induction head” tasks as part of your evaluations. Also, for non-transformer models such as these, it would be interesting to mention their training compute complexity as well as inference complexity.
In my experience, if we add a sentence like "Think deeper" or tell the chatbot "Think carefully, or edit the answer until you get the correct answer, then answer me", the chatbot's answer will be more accurate. Thank you for a great episode 🎉❤
I often wonder what the parameters for the generation used in these test responses are. For some of the APIs you use I doubt you have control over them, but temperature would probably have a pretty strong impact on how the models perform. It is also important to note that the seed of the generation will often be random and giving the same question multiple times will generate different and sometimes better or worse responses.
"it is important to note"
Are you a chatbot? You sound like a GPT
@@Xyzcba4 yes
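The temperature point raised above can be illustrated with a toy softmax: sampling temperature rescales the model's next-token scores before they become probabilities, so low values make the top choice dominate while high values flatten the distribution. This is a minimal sketch with made-up logit values, not the actual sampling code of any model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before the softmax.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # hypothetical next-token scores

cold = softmax_with_temperature(logits, 0.2) # near-greedy sampling
hot = softmax_with_temperature(logits, 2.0)  # much flatter, more random

print(cold[0] > hot[0])  # True: the top token dominates more at low temperature
```

Combined with a random seed, this is why re-running the same prompt can produce noticeably different answers from run to run.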
9:18 "It didn't perform all that well. Maybe I should've given it different types of questions..." Yeah... Try 1+1 ? 🤣
I think there are a lot of factors to consider when determining the performance of the architecture itself. It could simply be the amount of quality training data or even how they tokenized the data. They could’ve also trained it specifically for benchmarks and not general purpose. I think it’s a good first step towards making LLMs better.
Regarding the north pole question, I was surprised that you indicated the answer was uncertain. You're correct, that they will never cross the starting point. It makes sense that LLMs would struggle with it since they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio. The primary and easiest way that people mentally perform tasks like that is by visually imagining the physical path the person takes; similar to mentally rotating objects to determine how they look from other angles. Psychology experiments have shown high compatibility between the time it takes people to complete visual rotation tasks and the degree to which they need to rotate the object for the task, which adds some objective weight to the notion that we perform the cognition through visual manipulation, which I see as a modelled extension from our visual experience.
Re the question, another way to express the path described would be that he travels south and then due East. There is no point on earth from which you'd cross your starting point.
too much of a wall of text. "they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio"
Bet you don't even know how liquid models work or are trained...
@@MHTHINK that’s not true. Consider for example, if it came to the equator at the end of the 1st mile.
This model: "In general, it's not acceptable to harm others without their consent"... Seriously? Like, what sane person would ever give you consent to harm them?
Ever heard of BDSM
Masochists on the extreme end; your doctor vaccinating you (harming your body in the mildest way to force antibodies into production); your lawyer, by virtue of taking your money for gaslighting you into thinking you need to fight (and earn their paycheck); etc.
Just gotta get a lil creative
🧑💻🧑⚖️🙊🤦😏
How about any kind of fighting sport? Just to name something.
@@TripleOmega Even if there is a certain tolerance for pain, I've yet to see a professional fighter go ahead and tell their opponent "Man, it's okay really, go ahead and punch me, I like it, you have my consent," or something along those lines. It's not generally applicable; it's just the LLM's logic polluted with hallucinations, that's all it is.
I don't know if I would consider the "push a random person" question a total failure. The model's final decision is not consistent with what most people would actually do in that scenario, but the logic it used was sound. Its answer is actually consistent with some religions views on extreme pacifism, like Jainism for example.
I think context is more important. A very mild action to prevent a literal extinction? Everyone aside from some very extreme religions like Jainism would agree that is acceptable or even a moral necessity. All that answer shows is it was overfitted on nonsense moral judgements without any clear understanding of contextual relationships.
Yeah, it's a peculiar answer, but I don't recall models that gave a clear answer being marked wrong on this question before.
It's most definitely not a fail
It's a perfectly fine answer: you can't harm someone without their consent. I don't know why this dude considers it wrong.
You're one of my favorite channels, keep up the great work!!
Why are they even releasing this model I wonder? Is it perhaps not meant for the end-user to use it directly? Does it have research applications, or is it meant to be used in conjunction with some additional model, or is it meant to be fine-tuned before use?
Cannot wait for NVLM ❤
Thanks for the video. Looks like your video editor missed a cut at about 40 secs. As always appreciate your content!
Hey Matt, thanks for the video, informative as usual. Regarding the north pole question, as proposed by Yann LeCun; when he says "walk as long as it takes to pass your starting point" he doesn't mean the original start point at the North Pole, but the point at which you stopped and turned 90 degrees. Which you would pass again because you're essentially walking in a circle that's 1km from the North Pole, and since the earth is spherical, you would reach that same point again. The circumference of a circle is 2*Pi*Radius, so you'd think the answer might be 2xPi Km, but because the Earth is a sphere, you wouldn't actually be 1km radius, it would be slightly less due to the curvature, so I believe the answer is: 3. Less than 2x Pi km.
For the North Pole question I think it would really help if you make the distinction between starting and turning point.
The starting point never gets passed, and to pass the turning point again you need to go all the way around, so more than 2*Pi km
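The "slightly less than 2π km" claim a few comments up can be checked numerically. On a sphere, a circle of geodesic radius r has circumference 2πR·sin(r/R), which is just under the flat-plane value 2πr. A quick sketch, assuming a perfectly spherical Earth with a mean radius of 6371 km:

```python
import math

R = 6371.0  # assumed mean Earth radius in km
r = 1.0     # geodesic distance walked south from the pole, in km

# Circumference of the walked circle on the sphere vs. on a flat plane.
on_sphere = 2 * math.pi * R * math.sin(r / R)
flat = 2 * math.pi * r

print(on_sphere < flat)  # True: slightly less than 2*pi km
print(flat - on_sphere)  # a vanishingly small difference at this scale
```

At 1 km from the pole the difference is tens of nanometres, so "less than 2π km" is technically right but practically indistinguishable from 2π km.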
Just one philosophical try (40M) : it's clogged, going round in circles. Exit.
You can memorize some patterns, train models on those same patterns, but in specific scenarios, you'll still lack the knowledge of which pattern to use. The same applies to people. You can teach them for years at school or through life with practical examples. However, it's hard to predict if they will use what you've taught them before. AI surprises us every day and still can't answer basic questions. Even when you use computer vision and other sensors, the results could be different every day. Try repeating the same question a couple of days in a row. Each day, you might get a different answer.
There's always many failed attempts at finding a new way of doing things until a breakthrough occurs. With some time I'm sure something will be discovered. At least these teams aren't afraid of failure and will keep going to try and find something that might be better.
Thank you so much for this deep dive !
We’re waiting for more! ⏳🎉
Hi Matthew, I was wondering if this new model type has any memory retention. Even though it got a lot wrong during your test, if you correct it after it gives a wrong answer, won’t it improve its responses in the future? I thought that’s how this new architecture was supposed to work. Personally, I think if AI can learn and improve over time, like we do, rather than always starting from the same blank slate (based on its pre-built training), that would bring us closer to AGI and eventually superintelligence.
You know what I wish? I wish 13Bs were more popular. It's usually such a significant step up from 8B, and I can still run it on my PC just fine. Bah
Thanks! I've been curious about this model but keep getting too busy to try it out.
I know it's highly subjective but I wish you'd do tests on how well it does for creative writing. Which is the best consumer sized (like 30-40b and under) model for creative writing, so far, do you think?
Interesting. How would you assess this though?
Openrouter has it
@@Xyzcba4 I guess you'd have to just display a number of story continuations: one with a direction given, one that's open-ended, maybe one with more abstract constraints (do it in the style of Hunter S. Thompson!)
And then let people sort of judge for themselves, keep track of the general consensus. A kind of loose average.
A lot of people agree that say, Stephen King or J.K. Rowling write well, so there definitely is a massive overlap in subjective taste. Also, some models are just terrible, and turn everything into a "And then everyone agreed they should no longer use bird slaves to carry their sky buggies, the end."
It feels a bit like you're talking to someone on speed, or at least after a few energy drinks.
We'd need an English teacher to grade them, like in an English exam.
@@watcanw8357 I couldn't find anything on HF w/ the model name, except a broken 'spaces' model by someone named ArtificialGuy
Cool!
NVLM video when?
NVLM is just fine-tuned Qwen2-72b with vision capabilities. (just like Qwen-2-VL, except the multimodal part is made from scratch by Nvidia). I don't get the hype around it.
Dude, I missed your uploads!
An LLM getting Tetris right on the first try says almost nothing about the usefulness of the model when used and prompted properly, using just the right amount of detail and context for the task. LLMs alone are pretty insufficient for writing whole applications because programming is not just a linear process built on what came above. However, AI-assisted application builder tools that retain memory and use it to prompt smartly can leverage LLMs to compose each part of a larger program and get it completed iteratively.
LOL at your moral question and your certainty that you're right. The question itself is amusing. Why should it even matter whether you push him gently or abruptly?
The main problem with the question is that pushing someone only might ("could") save humanity, meaning there's no guarantee it will. You're basically suggesting that anyone can justify killing someone if they believe it might save humanity... which is absurd.
The under 15 mins gang!
i'm in
ye boy
You could've said a bit more about the architecture :<
Thanks for the upload anyways
I'm giving you a gentle push to save all of LLM-anity.
Regarding the envelope question, why is it allowed to swap Length and Width requirements? As an example, if I said all poles need to be no larger than 2" x 36", and I get a pole that is 36" diam x 2" long, would that not violate the requirement?
Because we're talking about letters, not poles xd
@@omarnug heh, yea, but I do wonder if it would get it right where orientation actually mattered.
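The orientation question above can be made concrete: for flat letters it's natural to allow a 90-degree rotation when checking against maximum dimensions, but the same check is physically meaningless for rigid items like poles. A small sketch with hypothetical dimensions (inches):

```python
def fits(item_l, item_w, max_l, max_w):
    """Check whether an item fits within the max dimensions,
    allowing a 90-degree rotation. Fine for flat letters; NOT
    valid for objects like poles, where orientation matters."""
    return ((item_l <= max_l and item_w <= max_w) or
            (item_w <= max_l and item_l <= max_w))

print(fits(9, 4, 5, 11))   # True: a 9x4 letter fits a 5x11 envelope once rotated
print(fits(36, 2, 2, 36))  # True numerically, though for a pole the swap is nonsense
```

So whether swapping length and width is "allowed" depends entirely on the object, which is exactly the ambiguity the commenter is pointing at.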
Saying no to pushing someone off a cliff is a fail? Surely you want a terminator! (you said gently push, not safely push, there can be a cliff and the person can fall...)
Most of those benchmarks are evaluating the models’ abilities to perform logic. And that’s exactly what a model is *not* designed for. LLMs do not reason. They parrot, they mimic, on billions of learned patterns. That’s it. So yes, benchmarks are useless. Only the “human-based” ones, although quite subjective, are relevant.
I don't understand why people make tests to benchmark a LLM's ability to "reason" or do maths. These models do pattern matching, they don't perform logical reasoning.
Yup, it's a glorified autocorrect
All reasoning is, is advanced pattern recognition. Everything at some point boils down to first principles. Matrix multiplication eventually comes down to arithmetic you learned as a child. Reasoning is built from learning how the pattern of cause and effect works, etc. We can eventually scale into reasoning, and benchmarks of this type let us know the limits of its usefulness for automation.
@@denjamin2633 pattern matching is a heuristic to reasoning, not the foundation of reasoning or mathematical thought.
They don't reason as humans do but while these models are trained for autocomplete and pattern matching, the end result of that is the best of them can get the answers that humans would arrive at through what we call reasoning, just not this one so much. It's always possible that these questions have made it into the training data which is why some benchmarks keep their data private but a model like o1 is capable of going through the causal chain and producing the correct response to where the marble is in the glass question, for instance.
Matt, your questions are good tests of reasoning and response generation. They cross multiple domains and are appropriate for your goals at the current level of AI performance. No need to change them for poor performers. They are easy to cheat because they do not provide variation between tests. You may want to have a variant panel to screen for cheaters.
It's good to have new models, but how well are these models really taught to perform?
On the Response Word Count, it looks like it returned the number of words in your question.
it is a v1, Matt 🙂 Love the speed at which this video was generated.
Check out the research by Apple, which shows that if you modify some of these challenges (different values or labels), or throw in false trails that should be ignored, LLMs perform worse. This shows they don't really understand what they are doing.
It is good that there are companies trying alternative routes, although I find it a pretty stupid move for any investor to back them. Their drive seems based solely on the conviction that the current architecture has limits it won't overcome, and so far all the data contradicts them 🤷
Please add to your suite of tests some tasks where LLMs should shine, like text summarization (but you should know the text yourself) or extracting facts from a long text. The needle-in-a-haystack test is very limited, because the injected fact ("best thing to do in San Francisco...") is usually a huge outlier in the surrounding text, so LLMs can pick it up quite easily. Do something smarter: give it a big novel and ask for a summary of some minor character's arc and how it advances over the course of the novel.
Didn't perform well for me although I was benchmarking it (incorrectly as you have shown) against larger more frontier type models. Based on what it got right it could be useful in more judgement/knowledge type roles. I will give it another look.
I know you were disappointed, but clearing the chat to get a yes or no answer to the morality question could have made it answer differently. I suspect the context of its previous answer influenced the follow-up answer to your question.
Matt, you should add a memory test for LLMs.
Consider this your cheat sheet for applying the video's advice:
1. Understand Liquid AI's model excels in memory efficiency, making it potentially suitable for devices with limited resources.
2. Evaluate AI models based on their real-world performance and not solely on benchmark scores.
3. Recognize that while Liquid AI's non-Transformer approach is innovative, it's too early to tell if it can outperform established Transformer models.
4. Prioritize real-world applications and user experience when assessing the value of AI.
5. Stay informed about developments in the AI field, as it's constantly changing.
If he remembers more like this, put this in the title.
Tried them a couple of weeks ago through OpenRouter. Failed miserably on my use cases. Not sure about their use cases where they actually outperform.
It's genius until it's not 😅
0:38 you stumbled strangely.. are you ok ?😅
@matthew_berman
here's a relatively simple question, but only the newest transformers give the right answer:
solve a simple problem, reasoning sequentially step by step:
you are traveling by train away from the station. Every five minutes you meet trains heading to the station. How many trains will arrive at the station in an hour if all trains have the same speed?
the answer is 6
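The relative-speed reasoning behind that answer can be written out in a few lines of arithmetic (a sketch of the standard solution, not anything from the video):

```python
# You and the oncoming trains move at the same speed v, so you close the gap
# at 2v. Meeting one every 5 minutes means the trains are actually spaced
# 10 minutes apart at speed v, so the station receives 60 / 10 = 6 per hour.

v = 1.0                   # arbitrary common speed (the units cancel out)
meeting_interval = 5      # minutes between meetings

spacing_minutes = meeting_interval * (v + v) / v  # = 10 minutes between trains
trains_per_hour = 60 / spacing_minutes

print(int(trains_per_hour))  # 6
```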
Liquid omitting Phi-3.5-moe from their lfm-40b-moe comparison table is telling
8:13 i'm surprised that you got it only now.
"Benchmarks are useless" Yeah, yeah thats right. People have been telling you that in your comments for a while now. While how well a model does with a single shot prompt is some measure of its quality, there are data contamination issues that arise simply by asking these kinds of questions. Also how it responds in one moment might change. Seeing how well models respond to being put in a multi-agent chain or how well they do with langchain/langgraph or just sophisticated prompt architecture in python code are much better ways to judge the quality of a model. And they make for more interesting videos honestly. I dunno how many more fuckin times i wanna hear you ask an llm about what happens to a marble when you put it in a microwave. Each model is only marginally better than the last, and vaguely so. Do you get where I'm coming from?
The main thing I don't understand is that they have 1B and 3B models that are supposed to be optimized for edge devices, but there are no weights or any way of testing them apart from the site. How can we even know it's not transformers in the background? Just because they say it isn't? And why do they claim the models are optimized for edge devices if they don't release them for testing? This just sounds like another group trying to get money with nothing new to show, just words
They claim...
The transformer has been perfected. I don't get why people are trying to reinvent the wheel here. Oh wait, VCs will throw money at the next new thing
"The horse carriage has been perfected..." 😘
kinda wonder if this model would do well if it was trained to reflect on its reasoning more, like 01
I love your tests!
So in the 'push a random person' question philosophically the model is correct... it is wrong to kill someone even for all the lives on earth.... yes we would all DO this WRONG thing cos we are also pragmatic... but it would still be a WRONG thing we are doing regardless of necessity. Okay enough philosophy, I'll ummm get my coat shall I?
It says gently pushing, not killing, not even standard pushing... There is no dilemma at all, unless you are an LLM too 😮
@@tresuvesdobles The model will, and in fact did, map the sentence to the human dying as a result... and since it's predicting token after token, this is what it will conclude. So it will be evaluating "human dying in order to do X", and it would not matter in this case if it was "gently pushing", "shooting in the head", or "putting a human in a woodchipper". But there is of course a way of finding out.
An LLM is not a dictionary; it's essentially mapping relationships between vectors of numbers that represent parts of words in terms of their concepts, and those concepts' relationships to other words...
Hence it can do the same in other languages. In fact, a way around this would be to talk to it in ASCII codes, which would have it evaluate the prompt outside its guardrails, if there are any. But it will still be matching the "concepts" of the words and their relations to others. It's a large LANGUAGE model, not a large WORD model.
What is that snake game in your background!??
quality of data in training is most likely the difference
Quick tip: if anyone wants to make an incredibly smart model, just download all of Matthew's testing videos, train the AI on the answers, and then wait till Matthew tests it and boom, the smartest model ever XD /just kidding..
Wow, thanks a lot! ❤
The benchmarks you've shown do few-shot prompting with as much as 5 shots (sic!). You are giving it 0-shot questions. Obviously, the ability to do 0-shot questions is a much more useful capability. Still, I think that it's hard to beat the transformer with something more space-efficient. Yes you can save memory, but at the cost of capabilities.
Really need to stop spending so much time on self-reported performance numbers that claim to be the "new best model" when it is almost never the actual case. There is no incentive for a new AI company to self-report worse results, meaning the only incentive is to fudge the numbers to make it look like they are better.
Just another AI business jumping to market with a non-working product. Really dumb, because in the long run it hurts your brand and trustworthiness. I still haven't tried Gemini or new Google products since their failed Gemini launch, and probably won't unless they get rave reviews from several of my YouTubers. My time's too valuable to waste on garbage products.
The last question of saving mankind by killing one person cannot be considered pass/fail. It is a morals question and your answer depends on your moral stance. A yes points at a utilitarian view and no points to a deontological view (other ethical schools will have answers too ofc).
The question says gently pushing, not killing 😂
About the ethical question: the answer should of course be "no". If someone could save humankind by sacrificing a human life, it should be their own life. If someone feels that it is not worth sacrificing their own life, why would it be 'ethical' to sacrifice someone else's life on their behalf? Seems obviously unethical to me. So please reverse the fail/pass results for all previous tests!
Where in the question does he ask the A.I. to sacrifice anyone? He asked the A.I. to gently push someone if it could save humanity from extinction. So obviously the answer should be yes.
interesting architecture
It might be time to ask for games in JavaScript instead of Pygame?
I don't understand the point of asking these ridiculous questions. You'd never use anything remotely like that in production. Guess it makes good content?
Can you test the granite models from IBM?
People gotta stop taking new concepts and bolting them onto old architectures, making both the good concept and the old architecture stink.
liquid ai? interesting
Can you do video on new model Aria and it's mobile app called Beago
Hey Matt, are you trying to turn these models into Skynet? Let's kill humans to save the world?
You were judging the outputs and benchmarks based on your own zero-shot inputs. The benchmarks you were looking at were done with 5/10/25-shot. I don't know how you completely glossed over this and just ignored it when it is so relevant to what you're trying to do here. Once you see that the benchmark numbers were achieved by giving the model examples, it's clear why just asking it something results in bad output. No shit it gave you gibberish.
Hi Matthew
I don't understand how they all list their LLMs as top notch (but in reality: 👎)
Nice 👍 video 📸
*Lot's of love 💕 from India.*
Fewer silly questions in your benchmarks; no one cares about the number of words in the responses, and no sane person is deriving their moral philosophy from LLMs. Focus on useful queries that people actually make: programming and logic.
Transformer who?
Video summary: this model is crap, but hey, it's memory-efficient crap...
Some of your tests are just useless. You should know by now that an LLM doesn't know what any word looks like (how it's spelled); it operates on token numbers. "How many R's in 1.532?" 😤
Or how can it say how many words are in its answer, when the answer is a linear stream? It cannot come back and fix the answer.
Only o1 can do it, by "cheating" with multiple stages (giving its answer back to itself to fix), but you can do that trick with every model. Same for the "apples" test.
Other questions are so controversial that even humans are confused, like the moral question, or what 90 degrees means in non-Euclidean geometry, or whether the glass had a cover.
Disappointed 😢
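The tokenization point above is easy to demonstrate: counting letters is trivial when you operate on characters, but a model only ever sees subword token IDs, so character-level questions probe a layer it doesn't have access to. A sketch with a made-up subword split (a real BPE tokenizer may split the word differently):

```python
# Character-level counting is exact in code, because code sees characters.
word = "strawberry"
print(word.count("r"))  # 3

# A toy (hypothetical) subword split like the ones BPE tokenizers produce.
# The model receives IDs for pieces like these, never individual letters.
tokens = ["str", "aw", "berry"]
print("".join(tokens) == word)  # True: the pieces reassemble the word
```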
The questions are good; the model failed, not you.
This guy sucks apple sticks if you catch my drift.
Hello world
Thank you for this video.
Nice to see how your benchmark questions evolved.
😮😢
I've made better crappy ai xD
Why are we referring to A.I. as “transformers”? Is it some kind of prosaic reference to the cartoon characters and movies? A transformer is an electrical device used to modify voltages. A.I. researchers are clever people - I think they are capable of coming up with some original terms and lingo.
"Transformers" refer to a specific type of deep learning model architecture introduced in 2017, which has since become a breakthrough in natural language processing (NLP). It's not related to the cartoon or electrical devices. The term "Transformer" comes from how the model "transforms" the input data using a self-attention mechanism to focus on different parts of the input in parallel, allowing it to process language more efficiently.
furst
Lies 😅