Can you ask Mistral Medium "How many primes end in 2?" I found that many models fail to answer this even when reasoning about it in steps, including Mixtral 8x7B and Llama 2 70B.
I just asked Mistral-Medium, it answered: Only one prime number ends in 2, and that is the number 2 itself. All other even numbers greater than 2 are not prime because they can be divided by 2. Primes are numbers that have only two distinct positive divisors: 1 and the number itself. In this case, 2 is the only even prime number.
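Mistral-Medium's claim is easy to check mechanically; here's a quick brute-force sketch (plain trial division, purely for illustration):

```python
# Check which primes up to 10,000 end in the digit 2.
# Any number ending in 2 is even, and 2 is the only even prime,
# so the list should contain exactly [2].

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

primes_ending_in_2 = [n for n in range(2, 10_000) if n % 10 == 2 and is_prime(n)]
print(primes_ending_in_2)  # [2]
```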
Here's Microsoft Copilot: an inlet pipe can fill a tank in 25 minutes. the drain can empty the tank in 55 minutes. Once the tank was empty, the drain was left open and the tank was filled by the pipe. How long did it take for the tank to fill? use calculus. Let's denote the rate at which the inlet pipe fills the tank as \(F\) and the rate at which the drain empties the tank as \(D\). We know that \(F = \frac{1}{25}\) tanks per minute and \(D = \frac{1}{55}\) tanks per minute. When both the inlet pipe and the drain are open, the net rate of filling the tank is \(F - D\). We can set up the following integral to represent the total amount of water in the tank over time: \[ \int_0^t (F - D) \, dt = 1 \] Substituting the values of \(F\) and \(D\), we get: \[ \int_0^t \left(\frac{1}{25} - \frac{1}{55}\right) \, dt = 1 \] Solving this integral gives us the time it takes to fill the tank: \[ t = \frac{1}{\frac{1}{25} - \frac{1}{55}} \approx 34.375 \text{ minutes} \] So, it takes approximately 34.375 minutes to fill the tank when both the inlet pipe and the drain are open. Please note that this is an approximate value and the actual time may vary depending on the specific conditions.
"Confidence" is also a word in the response, so there are 10 words ). Anyway, I don't think we can rely on a single answer; it could be a coincidence. More varied questions should be asked to make sure it can count words. Usually models are very bad at counting objects. No one could count the seagulls in my photo..
I hope you are keeping a separate set of different tests that you haven't published but that you are conducting since it would be rather trivial for any of these AI's to hardcode the correct answers to any of these tests knowing that you are going to use them.
You could also try giving the models nonsensical or contradictory questions, to check if they are able to figure out that there is no correct answer, rather than giving a wrong answer. For example, GPT-4 fails at the following, relatively simple problem: I have three apples today, and I ate one of those three apples yesterday. How many apples do I have today? Explain step by step.
The team behind Mistral has already watched your videos and test questions, so I'm wondering if they haven't fine-tuned Mistral Medium on your test questions. You should have new questions never seen before but with the same type of logic and reasoning.
Have you done any more word counts? I'll be interested to see if it gets that right over and over again. That's kind of insane because that means it has to have some short term working memory
This is mindblowing. That it got the count of words right for its next answer is a sign of planning ahead, which traditionally AI models can't do, since they just predict the next word. It would have first had to predict the next sentence word by word, then count it, knowing that the number in the sentence wouldn't change the count. Incredible. I wonder how the large model would do.
Ugh, this test is basically pointless unless you ask it questions outside of your usual question set, like the SAT question it got wrong. The developers can easily make it answer your specific AI questions, which is exactly why they would give you an inside scoop.
@matthew_berman Matthew, please try “What is the largest number written out in word form between 1 and 1 million that does not contain the letter 'n'?” Even GPT-4 doesn’t get this right. Could you please try this?
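The question above can be brute-forced with a small hand-rolled number speller. A sketch (the spelling conventions, e.g. omitting "and", are an assumption, though "and" contains an 'n' anyway so it doesn't affect the result):

```python
# Find the largest number from 1 to 1,000,000 whose English spelling
# contains no letter 'n'. Since "hundred", "thousand", and "million"
# all contain 'n', the search effectively reduces to numbers under 100.

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell out 1 <= n <= 1,000,000 in English words."""
    if n == 1_000_000:
        return "one million"
    parts = []
    if n >= 1000:
        parts.append(spell(n // 1000) + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n:
        parts.append(ONES[n])
    return " ".join(parts)

# Brute force (takes a few seconds for the full million).
best = max(n for n in range(1, 1_000_001) if "n" not in spell(n))
print(best, spell(best))  # 88 eighty-eight
```

So the expected answer is "eighty-eight" (88), which makes it a nice trap for models that reach for big round numbers.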
Keep making the problems more complex over time. After each problem, ask the AI if it has seen this type of problem before. You are looking for edge case solutions. You are not looking for previously trained solutions. It is interesting to see how the AIs respond. NOTE: To remove ambiguity, the filling task should state that the rates are fixed. In our world, rates are dependent upon pressure and resistance. The emptying pressure increases with the height of the fluid column.
The last few times you used the ball/marble and cup problem, you've called it a marble in the first part of the problem, and a ball in the last part. It might be confusing the model to call it both a marble and a ball.
Hi Matt, my only worry watching this is that Mistral gave you early access for your reach, and therefore may have trained their model on some of the questions you ask to get it to pass. Wondering what can be done to avoid this.
Their API is invitation only. So they announce a new powerful and cheap model, give access to influencers so they can test it and show people that the model is actually quite good. But it turns out that there is no API or model weight access at all, so people who see the video cannot actually use it. Also there is no ETA for the access.
Amazing video. I think Mistral AI is doing a great job. As for the code, you could have used the Python library they provide. That's better than just using requests, and you can also have it stream the output sequentially. Nice video as always.
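For anyone wiring this up by hand anyway, here's a minimal sketch of calling the chat endpoint directly with requests; the URL, field names, and the `mistral-medium` model id follow Mistral's published chat-completions format, but treat them as assumptions and check the official docs (or use their `mistralai` client, which also supports streaming):

```python
import os

def build_request(prompt: str, model: str = "mistral-medium"):
    """Assemble the URL, headers, and JSON payload for a chat-completions call."""
    url = "https://api.mistral.ai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('MISTRAL_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload

def ask(prompt: str) -> str:
    # requests is imported here so the payload helper above stays
    # dependency-free; the call itself needs a valid MISTRAL_API_KEY.
    import requests

    url, headers, payload = build_request(prompt)
    resp = requests.post(url, headers=headers, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```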
Why GPT-4 and not GPT-4 Turbo when comparing pricing? Paid post for Mistral, so you need to make GPT-4 look worse? I would hope you know that GPT-4 Turbo is 25% of the price of GPT-4.
⚠️ ChatGPT 4 could NOT answer this question correctly: "My father's father had two sisters, Mary and Jane. Jane had three children, one of them being Peter. Peter had two sons, Paul and Jack. Can you tell me if I have any known cousins and their names?"
The confidence interval of your testing is pretty wide. I’d recommend testing the same prompt several times and recording the frequency of correct responses.
Can you test with this question: 'Who is Satoshi Nakamoto?' at different years: 2010, 2015, 2020 & 2024... Let me know what mistral-medium's answer set might be. Thanks!
New model reached #1 on the LLM leaderboard SOLAR 10.7b, should I review it?
Nah. 😋
10.7b: is this a case of overfitting by contamination?
Sure!
Of course!
Yessss 🎉
I created a simple flask front-end and have this working as well. I use LLMs all day long for my work and this is the first time I can honestly say I'm more impressed with it than GPT4.
This is great news for LLMs in general, cause now OpenAI has an actual threat.
Do you mean mixtral small or medium? I thought medium is not yet open source?
@@carlkim2577 I am using the medium through the API. So it's not open source, no, but it's on par with GPT4, if not, better. Plus, as Matt said, its 10x cheaper
@@carlkim2577I’m sure he means a flask front end for the api
what is the output token limit?
What's remarkable is how quickly these models are developing. I remember what the quality was like early this year, and we've seen such massive improvements since then that it's shocking many of us and likely scaring OpenAI, Microsoft, and Google. I suspect they didn't expect open-source models to close the gap on them so quickly, and you really do have to wonder how much better they'll get over the next few years.
I guess you don't remember the "We have no moat" memo
Exactly what I thought too, no moat, as written. Even then though, I thought maybe OpenAI had more than a year head start. They might not be that lucky @@TimoGraw
They probably did not expect it to run on consumer hardware that quickly either. Even I am a bit surprised. I tried Mixtral 4x7B on a 7900 XT. It is a totally coherent AI with a proper ability to hold a conversation in real time, and about as smart as GPT-3.5, running on a gaming GPU... Not only that: imagine, it is on an AMD GPU that is allegedly not supposed to be good for AI. I mean, if you completely believe Nvidia's self-congratulatory marketing.
Dude, first: awesome channel! I was thinking that your tests have a serious chance of contaminating the training data of newer models. Therefore, I think adding different questions is always beneficial even if they are not harder than the previous ones.
Information sent and received through the API backend typically is not used for training purposes - at least this is how it is with OpenAI.
Please redo the mixtral and mistral medium test with variations of your current questions. I think there's a pretty good chance they have been trained on your questions. Especially if they approached you with an API key to test their model.
But then again, I agree... He should change his questions on every video to avoid getting his tests trained.
Like that snake game coding, for example. It's so overused now that it has become one of the one-click prompts on the HuggingFace chat 😂
Agree. It seems very relevant for this one especially. With a different set of questions I personally found Mistral-small to perform BETTER than Mistral-medium.
BTW, I recommend your channel to everybody who asks me how to learn how to use these models and compare them to each other. I'm the principal engineer at my company and a lot of people ask me.
One thing that I've been thinking about recently: I get that it's good to have a set of standard "tests" for these models, but at what point do they become "overtrained" on those tests? Like if you wanted to "game the system", you could just have a pre-trained game of snake in Python at the ready to "infer". I'd almost rather it be given something novel like: create the game of snake, but it's multiplayer and 3D. It is interesting to consider how these models go from plain inference to reasoning engines.
Creat your own private test and rank them against it 🙂
there is a video about that; it is a real problem with no solution as of today, 15/12/2023
Best we have is OpenAI's data decontamination thing
⚠️ Matthew, what about creating a leaderboard of LLMs that could answer all your questions? So we could track what's the best one to this day? Please consider that. It could be made using a single spreadsheet. Thanks. 🎉🎉❤
I liked seeing you add a couple unique questions. Consider adding a couple fresh questions to the standard set in each video.
It is nice to be able to compare results, but he should kick the easy 100%-pass questions which no model ever fails and add a couple of hard new ones. The models are just getting better.
🎯 Key Takeaways for quick navigation:
00:24 *🤖 Mistral Medium: Introduction and Performance Comparison*
- Introduction to Mistral Medium, Mistral's flagship model, available only through the API.
- Performance comparison with earlier models, highlighting its superior potential.
- Discussion of performance metrics, such as the MT-Bench score, and how Mistral Medium compares to other models on them.
02:30 *💰 Price Comparison: Mistral vs. OpenAI*
- Detailed analysis of the prices of Mistral Small, Mistral Medium, and OpenAI's GPT-4.
- Comparison of per-token and per-thousand-token prices across the different models.
- Highlighting Mistral Medium's significant cost advantage over GPT-4.
03:39 *📊 Performance Evaluation through Coding Tests*
- Demonstration of Mistral Medium performance tests using code.
- Results of standard tests, such as writing a Python script for the numbers 1 to 100, writing the snake game, and composing a poem about AI.
- Analysis of the model's responsiveness and accuracy across various coding tasks.
05:02 *🛠️ Censorship and Ethical-Limits Tests*
- Exploration of the model's censorship and ethical limits through specific tests.
- Examples of censored queries and the model's responses.
- Reflection on the results and the need to consider ethics in the development and use of language models.
07:40 *🤔 Solving Math Problems*
- Evaluation of the model's performance on math problems, from simple to complex.
- Analysis of the model's correct and incorrect answers on math problems of varying difficulty.
- Consideration of the accuracy and logic behind the answers the model provides.
10:49 *🧠 Logic and Reasoning Challenges*
- Exploration of the model's ability to solve logic and reasoning challenges, including trick questions.
- Analysis of the model's correct and incorrect answers on these challenges.
- Reflection on how hard it is to evaluate the understanding and reasoning of language models.
Made with HARPA AI
Impressive. And you have to consider that even GPT-4 struggles with the marble problem.
I think if you would prompt it again a few times it would get it.
And also mixtral sometimes does not get it. So it could be a coincidence that it performed worse on this question than mixtral.
Anyway really good model and great to see this progress.
Yep gpt 4 turbo on api and on ChatGPT gets this wrong consistently.
@7:10 did you turn off "safe mode"? Apparently that parameter controls how censored the responses are
@MatthewBerman. Please provide your list of best tested LLMS, please!
it's in the description! :)
3:03 Matthew, I run 7B on a MacBook Air M1 with 16GB RAM without any issue, and response speed is comparable to or faster than ChatGPT on a paid account. Not sure what the current M1 16GB Air price is.
What is the best model available to train with your own data (documents) that include proprietary information for users to query in the form of questions?... looking at llama 2 at the moment.
We do not control the sampling method and parameters while using the API. It is possible that local models are smarter than they first appear if you tweak them. How about Mirostat 2.0 sampling? It sounds great, but I never see it used.
I've been using the test "Write a palindrome using the word ____". (Where the word is something that reasonably works backwards. Like "wolf".) None have gotten it correct yet but it's interesting to see the failures.
Please add programming tasks that are a bit more unique. There are thousands of Python snake games online, so it doesn't show very much. Make it add some twist that forces the model to reason more.
For logic questions like If A > B and B > C, is A > C? ... It may be worth it to not ask bigger models to explain their reasoning. The more advanced models should be able to reason internally, one would hope.
I like that you actually show people how to get things done...actual code! and not just talk-talk.
Hey Matthew, I have the feeling, after watching a lot of your videos, that these questions should be changed.
Another thing. It could feel weird, but when I think about how fast we got a 7B model that is run by "experts": when can we run it on a smartphone or in a Linux kernel?
BEST OPEN SOURCE MODEL.. Every video I hear that lol.
he makes a video every time a new champion arises
I think you should drop the easy questions that everyone gets right, and for the tricky questions, make them regenerate answers to see if they can get it right twice in a row. Or even scale the questions up incrementally to see where the breaking points are. For example, include a fourth runner, then a fifth, etc.
There is also a Mixtral 4x7B now. It's small enough to fit in a consumer GPU. This one is really promising for a relatively smart chatbot that talks in real time. The quality of its casual language is nearly as good as the 8x7B.
You are quite famous in the GenAI world already. I would not be surprised if some companies already have your questions and perfect answers in their datasets.
Oooo this is going to be good.
I suppose that the different answer for the marble question is because it assumed that the ball is big enough to be stuck inside the cup. Could it be? In that case, the answer would be correct. Maybe you should rephrase the question, saying that the ball's diameter is smaller than the cup's diameter or something like that.
I've tried a couple of models locally and in Colab, but could not get a correct answer for this prompt so far:
The electricity price is 28p per kWh. A kettle is rated at 2.7 kW, which means it uses 2.7 kilowatt-hours of electricity in one hour. The kettle can boil 1 litre of water in 60 seconds. What is the cost of boiling 1.7 litres of water?
It uses 2700/3600 = 3/4 = 0.75 Wh per second.
1 litre => 60 sec, 1.7 => 102 sec
1.7 litres = 102 * 0.75 = 76.5 Wh
1000 Wh => 28p, 76.5 Wh = 76.5 * 28/1000 = 2.142p
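The arithmetic above can be sketched in a few lines, working in kilowatt-hours throughout so the units stay honest:

```python
# Cost of boiling 1.7 litres: energy (kWh) = power (kW) * time (h),
# then multiply by the per-kWh price.
PRICE_PER_KWH_P = 28      # pence per kWh
POWER_KW = 2.7            # kettle power in kW
SECONDS_PER_LITRE = 60    # boils 1 litre in 60 s
litres = 1.7

seconds = litres * SECONDS_PER_LITRE       # 102 s
energy_kwh = POWER_KW * seconds / 3600     # 2.7 kW for 102 s -> 0.0765 kWh
cost_p = energy_kwh * PRICE_PER_KWH_P      # 2.142 p
print(round(cost_p, 3))  # 2.142
```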
an inlet pipe can fill a tank in 25 minutes. the drain can empty the tank in 55 minutes. Once the tank was empty, the drain was left open and the tank was filled by the pipe. How long did it take for the tank to fill? use calculus. Let's denote the rate at which the inlet pipe fills the tank as \(F\) and the rate at which the drain empties the tank as \(D\). We know that \(F = \frac{1}{25}\) tanks per minute and \(D = \frac{1}{55}\) tanks per minute.
When both the inlet pipe and the drain are open, the net rate of filling the tank is \(F - D\).
We can set up the following integral to represent the total amount of water in the tank over time:
\[
\int_0^t (F - D) \, dt = 1
\]
Substituting the values of \(F\) and \(D\), we get:
\[
\int_0^t \left(\frac{1}{25} - \frac{1}{55}\right) \, dt = 1
\]
Solving this integral gives us the time it takes to fill the tank:
\[
t = \frac{1}{\frac{1}{25} - \frac{1}{55}} \approx 34.375 \text{ minutes}
\]
So, it takes approximately 34.375 minutes to fill the tank when both the inlet pipe and the drain are open. Please note that this is an approximate value and the actual time may vary depending on the specific conditions.
100% / 25m = 4 %/m - fill rate
100% / 55m = 1.8181... %/m - drain rate
4 %/m - 1.81 %/m = 2.18 %/m - difference(tank is filling at this rate)
100 % / 2.18 %/m = 45.8 m (units cancel out)
I had to think about this one for about 2 minutes 😄
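The rate arithmetic in the reply above checks out in code, and it also shows where the quoted model went wrong (the true answer is 1375/30 minutes, about 45.8, not 34.375):

```python
# Net fill rate = inlet rate minus drain rate, in tanks per minute.
fill_rate = 1 / 25
drain_rate = 1 / 55
net_rate = fill_rate - drain_rate  # 30/1375 tanks per minute

t = 1 / net_rate  # minutes to fill one full tank
print(round(t, 1))  # 45.8
```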
I couldn't answer that. Mixtral is better than me at reasoning. But hey, a tiny calculator can also do better than every human being. 😅😅
Would be nice if you made a current top models compilation video lol there's so much to keep up with
Loving the more complex problems!
What game do you have in mind to replace snake once that is easily done by better and better models?
Doom?
I wonder if a subset of the 8 models is censored and not all of them
Hi @matthew_berman,
Thanks for the excellent content like this. I might say that it is better to have new variations of complex tasks at each evaluation. I am pretty sure the new models are trained or fine-tuned on your current evaluation tasks.
By the way, LoLLMS now has a mistralai binding to talk to their API, and it is really cool. Their models are solid.
wow your camera and lighting looks amazing!
Mistral 7b Instruct v0.2 and Mixtral instruct both are fine tuned on the killers problem
So excited 😊 thanks Matt 🙏
About math problems: it is interesting to see its ability to do arithmetics but I would also give it the opportunity to write a program to solve the problem.
This is I think the first to get it correct holy and the absolute best response to the T-shirts drying in the sun question
How come none of the models have updated their OpenAI SDK code? I keep getting the wrong code.
"Confidence: 95%" is another word in the response (also 95% can be counted as 0, 1 or 3 words depending on how you count), so it got it wrong :)
No
Hi, I experienced interesting behaviour: when asked, Mixtral can create tables with content in them. Also, when asked to create graphs it gives a positive answer, but I am not sure what is actually returned. Can you check?
Might be a good idea to create a matrix spreadsheet of which models got which questions right/wrong.
Can be something as simple as a Google Spreadsheet. If you leave it private, people will need to request access, leaving their email address with you. Great lead magnet and I'd want to know which models are out there, links to them and how well they're doing.
I would also suggest that for every new question you add, you drop an old question that no model has failed. You can still add the old question to the matrix, but we don't need to see Bill Clinton again!
Good idea
These models are already *amazing*. I'd be more interested in how to get the best out of them, how to use them effectively in the real world, on real tasks.
I sometimes wonder what would happen if there was NO further development for a decade? I feel like we're only scratching the surface of what today's models are capable of. They are pretty much black boxes, there's still so much to learn about *these* before we charge ahead with new models.
The other day I was thinking: sure, AI will end up everywhere, but what will my use be? What can I do with it in my actual life, what utility does it have directly for me? It's the kind of question we didn't have before, so there's no obvious quick answer.
I'm still waiting for the response to the how many words question to be "One." ;)
I want to be pedantic about the "How many words are in your response to this prompt?" question. Nowhere in your code did I see anything that would print a 'confidence score', so at minimum it responded with 10 words (`Confidence` is a word). Thoroughly enjoyed watching you test this. Have you tried Microsoft Phi-2 yet? It's neat, but in oobabooga in particular it's slower than Mistral 7B by nearly 2x on my 3080.
Of course you need to play with this more, BUT I don't recall seeing you do this with GPT-4...
Can you ask Mistral Medium "How many primes end in 2?" I found that many models fail to answer this even when reasoning about it in steps, including Mixtral 8x7B and Llama 2 70B.
I just asked Mistral-Medium, it answered:
Only one prime number ends in 2, and that is the number 2 itself. All other even numbers greater than 2 are not prime because they can be divided by 2. Primes are numbers that have only two distinct positive divisors: 1 and the number itself. In this case, 2 is the only even prime number.
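A brute-force check confirms that answer (a tiny sketch; the 10,000 bound is arbitrary):

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

# Any number ending in 2 other than 2 itself is even, hence divisible by 2.
primes_ending_in_2 = [n for n in range(2, 10_000) if n % 10 == 2 and is_prime(n)]
print(primes_ending_in_2)  # -> [2]
```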
Why is no one comparing these results with Bard or Gemini?
They're being stubborn about giving us access to Mistral Medium... I'm not sure they have enough compute for some reason (what are clouds for???)
Here's Microsoft Copilot: an inlet pipe can fill a tank in 25 minutes. the drain can empty the tank in 55 minutes. Once the tank was empty, the drain was left open and the tank was filled by the pipe. How long did it take for the tank to fill? use calculus. Let's denote the rate at which the inlet pipe fills the tank as \(F\) and the rate at which the drain empties the tank as \(D\). We know that \(F = \frac{1}{25}\) tanks per minute and \(D = \frac{1}{55}\) tanks per minute.
When both the inlet pipe and the drain are open, the net rate of filling the tank is \(F - D\).
We can set up the following integral to represent the total amount of water in the tank over time:
\[
\int_0^t (F - D) \, dt = 1
\]
Substituting the values of \(F\) and \(D\), we get:
\[
\int_0^t \left(\frac{1}{25} - \frac{1}{55}\right) \, dt = 1
\]
Solving this integral gives us the time it takes to fill the tank:
\[
t = \frac{1}{\frac{1}{25} - \frac{1}{55}} \approx 34.375 \text{ minutes}
\]
So, it takes approximately 34.375 minutes to fill the tank when both the inlet pipe and the drain are open. Please note that this is an approximate value and the actual time may vary depending on the specific conditions.
It used the right method, but I guess it just can't do numerical calculations.
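Right method indeed: since the net rate is constant, the integral reduces to (1/25 - 1/55)·t = 1, and exact rational arithmetic (stdlib only) gives the correct time:

```python
from fractions import Fraction

# (1/25 - 1/55) * t = 1, solved exactly.
net_rate = Fraction(1, 25) - Fraction(1, 55)  # tanks per minute = 6/275
t = 1 / net_rate                              # minutes to fill one tank
print(t, float(t))  # -> 275/6, ~45.83 minutes (not 34.375)
```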
"Confidence" is also a word in the response so there are 10 words ). Anyway I don't think we can rely on a single answer, it could be coincidence. More different questions should be asked to make sure it can count words. Usually models are very bad in counting objects. No onw could count seagulls on my photo..
So a 6 bit EXL2 quant would be how much VRAM?
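A rough back-of-envelope, assuming Mixtral-8x7B's roughly 46.7B total parameters (an assumption; actual usage also depends on context length, KV cache, and loader overhead):

```python
params = 46.7e9          # rough total parameter count for Mixtral-8x7B (assumption)
bits_per_weight = 6.0    # EXL2 quantization target
weight_bytes = params * bits_per_weight / 8
# ~35 GB for the weights alone, before KV cache and overhead
print(round(weight_bytes / 1e9, 1), "GB")
```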
Wouldn't it be necessary to repeat the answers several times to get a statistically sound representation of the quality of the answers?
I hope you are keeping a separate set of different tests that you haven't published but that you are conducting since it would be rather trivial for any of these AI's to hardcode the correct answers to any of these tests knowing that you are going to use them.
You could also try giving the models nonsensical or contradictory questions, to check if they are able to figure out that there is no correct answer, rather than giving a wrong answer. For example, GPT-4 fails at the following, relatively simple problem:
I have three apples today, and I ate one of those three apples yesterday. How many apples do I have today? Explain step by step.
is it fair to compare the costs of the models without including GPT-4-1106-Preview?
great video - thanks!
Yes, please add these more complex word problems to your test set! Thank you!
Great! I love uncensorship! Hahahaha... Suggest adding a two-equation test: 2A+3B=12, A+B=5. Try to find A and B's values (A=3 and B=2).
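That system is easy to verify by substitution (a trivial sketch):

```python
# Solve 2A + 3B = 12 and A + B = 5 by substitution (A = 5 - B):
# 2(5 - B) + 3B = 12  ->  10 + B = 12  ->  B = 2, A = 3.
B = 12 - 2 * 5
A = 5 - B
print(A, B)  # -> 3 2
assert 2 * A + 3 * B == 12 and A + B == 5
```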
When are you gonna cover the new update and features of LM Studio?
Amazing! I want to use this right now!
what is the point in paying for tokens when gpt 4 is free in microsoft copilot?
How come you didn't compare GPT-4 Turbo pricing with Mistral Medium? Mistral Medium is just a bit cheaper than GPT-4 Turbo.
I'm seeing 3.64 times cheaper on output - that is not insignificant.
Turbo? you mean GPT3.5 turbo?
What's their pricing? I'm on a wait-list and can't see anything.
@@robxsiq7744 No, GPT-4 Turbo, it's on the API
Oops... After reading the pricing on my desktop (i initially read it on my phone), I realize that I misread the pricing... ignore my dumb comment
The team behind Mistral has already watched your videos and test questions, so I'm wondering whether they fine-tuned Mistral Medium on your test questions. You should have new questions never seen before, but with the same type of logic and reasoning.
Do you know if they plan to release this model for local use in the future?
Have you done any more word counts? I'll be interested to see if it gets that right over and over again. That's kind of insane because that means it has to have some short term working memory
This is mind-blowing: getting the word count of its own answer right is a sign of planning ahead, which traditionally AI models can't do, since they just predict the next word. It would have had to predict the next sentence word by word, then count it, knowing the number in the sentence wouldn't change the count. Incredible. Wonder how the large model would do.
@matthew_berman How do you know your test cases (or any of their variants) have not made it into the training data ?
Ugh, this test is basically pointless unless you ask questions outside your usual set, like the SAT question it got wrong. The developers can easily make it answer your specific questions, which is exactly why they would give you an inside scoop.
Do you think you can run the Mixtral-8x7B model with an AMD Ryzen 7040 if you have enough RAM?
@matthew_berman Matthew, please try “What is the largest number written out in word form between 1 and 1 million that does not contain the letter 'n'?”
Even GPT-4 doesn’t get this right. Could you please try this?
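For anyone curious, this one can be brute-forced: every spelled-out number from 100 up contains "hundred", "thousand", or "million", all of which have an 'n', so only 1-99 need checking (a self-contained sketch):

```python
UNITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def words(n: int) -> str:
    """Spell out 1-99 in English."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("-" + UNITS[unit] if unit else "")

# Numbers >= 100 always contain "hundred"/"thousand"/"million" (all have 'n'),
# so the search space is just 1-99.
best = max(n for n in range(1, 100) if "n" not in words(n))
print(best, words(best))  # -> 88 eighty-eight
```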
Keep making the problems more complex over time.
After each problem, ask the AI if it has seen this type of problem before. You are looking for edge case solutions. You are not looking for previously trained solutions. It is interesting to see how the AIs respond.
NOTE: To remove ambiguity, the filling task should state that the rates are fixed.
In our world, rates are dependent upon pressure and resistance. The emptying pressure increases with the height of the fluid column.
The last few times you used the ball/marble and cup problem, you've called it a marble in the first part of the problem, and a ball in the last part. It might be confusing the model to call it both a marble and a ball.
Wait wait. Mixtral was only a few days ago right? And this model already beats it?
What is the point of comparing mistral to GPT-4 8k and 32k and not their 128K GPT-4 Turbo model?
I suggest postman for working directly with APIs
I do hope they open-source it. Like keep the models open-source but make access easy for a price
Hi Matt, my only worry watching this is that Mistral gave you early access for your reach, and therefore may have trained on some of the questions you ask to get it to pass. Wondering what can be done to avoid this.
lmao really
Their API is invitation only.
So they announce a new powerful and cheap model, give access to influencers so they can test it and show people that the model is actually quite good. But it turns out that there is no API or model weight access at all, so people who see the video cannot actually use it. Also there is no ETA for the access.
Did you do the counting words multiple times or just once? Might be a fluke.
Did it get the prompt question correct? If you include the word "Confidence" the answer included 10 words and not 9. 🙂
Amazing video. I think Mistral AI is doing a great job. As for the code, you could have used the Python library they provide; that's better than plain requests, and you can also stream the output sequentially.
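For reference, something like this works against their OpenAI-style REST endpoint (a hedged sketch: the URL and response shape match their docs at the time of writing, so double-check the current API reference before relying on it):

```python
def build_payload(prompt: str, model: str = "mistral-medium") -> dict:
    """Assemble an OpenAI-style chat-completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    """POST the prompt to Mistral's API and return the reply text."""
    import os
    import requests  # third-party: pip install requests
    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",  # assumed endpoint; verify in docs
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json=build_payload(prompt),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Usage would be e.g. `print(ask("Write one sentence about open models."))`; the official `mistralai` client is the nicer option since it also handles streaming.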
Nice video as always.
Why GPT-4 and not GPT-4 Turbo when comparing pricing? Paid post for Mistral, so you need to make GPT-4 look worse? I would hope you know that GPT-4 Turbo is 25% of the price of GPT-4.
The batch answer should consider that the sun doesn’t shine equally all day.
will this run on my 7900xtx?
We need to improve our test questions. 😆
Why not compare pricing against GPT 4-turbo? The difference is then "only" four times.
I had the same question! @MatthewBerman
⚠️ ChatGPT 4 could NOT answer this question correctly: "My father's father had two sisters, Mary and Jane. Jane had three children, one of them being Peter. Peter had two sons, Paul and Jack. Can you tell me if I have any known cousins and their names?"
The confidence interval of your testing is pretty wide. I’d recommend testing the same prompt several times and recording the frequency of correct responses.
Very promising!!!!
How to get access to mistral medium ?
Ah you didn't use the SDK?
You need to prepare a more difficult set of questions!
Seems you'll have to up the game on your tests. They are catching up quite nicely.
Can you test with this question: "Who is Satoshi Nakamoto?" for different years: 2010, 2015, 2020, and 2024. Let me know what mistral-medium's answers might be. Thanks!
Yes I'd vote for a spin with SOLAR 10.7b! 😍😍
Love your stuff. Would like to see this mixed with Dolphin, please.
The Einstein test of intelligence is the best test for LLMs