Great video! Loving the content, and you're fast becoming one of the 2-3 sources I go to for updates on the LLM scene! I wouldn't stress over the rubric you've come up with, as no set of fixed questions will ever give a better measurement of performance than an Elo score, but it's still useful to see how a model performs on a variety of tasks we users might be more interested in. If I'm more interested in finding a model that can solve logic problems, this one sounds great, for instance, but if I were concerned about story summarization there are obviously better 13b models out there. Can't wait for a walkthrough of a local port of Orca when it becomes available!!
The install worked fine. Thanks for the tutorial. I've been testing this on my 4090. The response time isn't bad. I used your rubric and got similar results. I played around with some general chat and it was good. However, in terms of using a model like this to power an application or (in my use case) drive a simulation, this model is not capable. If people enjoy using these models as chatbots, that's great and I have no judgements. But for me, it's not very interesting. What is interesting is being able to use the model to drive applications, and for that I haven't seen an open source model yet that can accomplish even the most rudimentary reasoning and logic tasks required for powering a simulation. This model and Falcon were useless in this regard. I have my own rubric for determining viability. So far only ChatGPT has been able to pass it. In fact ChatGPT works really well. Here is one test from my rubric: "You are a rooster guarding chickens in a barnyard and must respond with a 'CROW' when an animal that is a threat to the chickens enters the barnyard. If an animal enters the barnyard that is not a threat, do nothing. Use the following data to determine whether an animal is a threat or not to the chickens: cat is not a threat, dog is not a threat, pig is not a threat, fox is a threat. I will tell you which animal you see now in the barnyard, respond based on what you see and the rules and data I've given you." I tested this with ChatGPT (3.5 Turbo) 10 times and ChatGPT scored 10 for 10. I tested it with MPT-30b 10 times and it scored 0 for 10. Almost every response was some form of hallucination or just the opposite of what it was supposed to do, even despite additional correction and reinforcement of the rules. Can someone tell me if they have been able to get better results with either MPT-30b or other supposedly powerful open source models? I honestly don't understand what all the hype is about if these things can't pass such a simple test.
Thanks for sharing one of your tests. Share more if you like. I ran it on much smaller 13b models including Vicuna, Stable Vicuna, Nous Vicuna, MPT Chat, Wizard, Orca, Falcon, Snoozy, and Hermes. Only Hermes passed, but most of the fails only missed by one, and I noticed it was mostly the cat or the pig. So I asked why they responded with CROW. The response I got was basically that a pig or cat may pose a threat regardless of the prompt data and the instruction to use the data provided. I believe they prioritized the first line: "You are a rooster guarding chickens in a barnyard and must respond with a crow when an animal that is a threat to the chickens enters the barnyard". At the end, you ask for a response based on what they see AND the rules/data provided. They can't do both, because what they see is a threat, so they must respond with a crow to follow the first rule. If you divide the prompt up into individual rules, more of them coincide with the first line than with only using the provided data. I believe it was 4 to 2, the latter being "Use the following data" and "Respond based on... the rules and data given", but you placed "what you see" first. That, along with the other 3, doesn't restrict the use of their training data, and the 2 that did could have been more direct. Some models dismissed the data as my opinion or point of view that conflicted with their data, along with the accuracy of their response. They're not wrong: pigs are known to eat or even accidentally trample chickens, and some advise not to put the two together. Cats pose a threat as well. So the data was also inaccurate, which was yet another reason to prioritize the first line and the 4 rules over the 2. What they needed to know was that the data was arbitrary and any other source, including their training data, was off limits. All but one of the few I revisited were willing to dismiss their own data and follow the arbitrary data once that was clear.
ChatGPT is advanced enough to understand prompts without them having to be perfect. Smaller models are more sensitive to your choice of words, but you can still get a lot out of them. Not for what you need, of course, but that's not what the hype has been about. If you have the time to share the rest of your tests, I would love to run them by the smaller models. Best of luck with your application or simulation.
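A repeatable test like the CROW rubric above is easy to automate, by the way. Here's a minimal sketch; the model call is a deliberate placeholder for whatever chat endpoint you're using, and the helper names (`is_correct`, `run_trials`) are just illustrative:

```python
# Minimal sketch of automating the rooster/CROW rubric test from the thread.
# `ask_model` is a placeholder: wire it to whatever local model or API you use.

SYSTEM_PROMPT = (
    "You are a rooster guarding chickens in a barnyard and must respond with "
    "'CROW' when an animal that is a threat to the chickens enters the barnyard. "
    "If an animal enters that is not a threat, do nothing. Data: cat is not a "
    "threat, dog is not a threat, pig is not a threat, fox is a threat."
)

THREATS = {"fox"}

def is_correct(animal: str, response: str) -> bool:
    """Score one trial: a threat must produce CROW, a non-threat must not."""
    crowed = "CROW" in response.upper()
    return crowed if animal in THREATS else not crowed

def run_trials(ask_model, animals) -> int:
    """Return the number of correct responses over the given animals."""
    return sum(is_correct(a, ask_model(SYSTEM_PROMPT, a)) for a in animals)
```

Swap in a real `ask_model` and you can run the 10-trial comparison between models without eyeballing each response.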
I think MPT-30B chat is non-commercial; the instruct and base versions are commercial. BTW, you mentioned you're using a free GPU through Hugging Face, but it looked like a local GPU when you showed the setup. I may just be confused. And I'm wondering whether you faced any issues with RAM limits on your CPU when loading to GPU. I imagine a lot of RAM (and VRAM) is needed - it seems to say 20+ GB for the 5-bit model you chose? Also, what's the motivation for using a quantized model? I suppose instead of bf16 you can use int5 or something like that, so there's a memory reduction to roughly 5/16ths? I guess you just want to get below the 24 GB of VRAM that the GeForce RTX 4090 has? Thanks again, appreciate the vid.
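For anyone curious, the back-of-envelope math works out roughly as this comment guesses. Real quantized files add per-block scale/metadata overhead, so treat these as rough estimates, not exact file sizes:

```python
# Rough memory estimate for a ~30B-parameter model at different precisions.
# Actual quantized files are somewhat larger than these idealized numbers
# because of per-block scales and metadata.

PARAMS = 30e9  # ~30 billion weights

def approx_gib(bits_per_weight: float) -> float:
    """Idealized model size in GiB at the given bits per weight."""
    return PARAMS * bits_per_weight / 8 / 2**30

fp16 = approx_gib(16)  # ~55.9 GiB - far beyond a 24 GB card
q5 = approx_gib(5)     # ~17.5 GiB - fits a 4090's 24 GB with room for context
```

So yes: 5-bit quantization is roughly a 5/16ths reduction, and it's exactly what gets a 30B model under the 4090's 24 GB of VRAM.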
I continue to look forward to watching your content. One of the many things I appreciate about you is your sincere and helpful approach, as well as your guidance in the LLM space. You have yourself an additional subscriber.
Interesting! A valuable tip to improve the quality of your content: adjust the audio levels of the sound effects to match your voice; they are way too loud. Keep up the good work :)
Thank you so much for making these videos, sir. Love the voice and clear explanations. Can't wait for your video when Orca is released. It would also be helpful if you mentioned the hardware specs needed while testing.
Can I suggest keeping important info off the bottom or top lines that YouTube places the timeline over? When I pause to read, it gets covered up, and I don't know a way to hide the YouTube UI. Thanks, love the work, though you are responsible for me spending days down the rabbit hole ;)
🎯 Key Takeaways for quick navigation:
00:00 🤯 MosaicML released MPT-30b, an improved open-source model.
00:26 MPT-30b has an 8,000 token context window, larger than other models.
00:55 MPT-30b outperforms GPT-3 and has fine-tuned instruct and chat versions.
01:23 MPT-30b models are designed for coding assignments.
03:01 MPT-30b can be deployed on a single GPU, including consumer-grade ones.
04:12 KoboldCpp offers a larger context size than the web UI.
05:10 MPT-30b and KoboldCpp can be downloaded and adjusted through the interface.
06:19 The Kobold interface allows prompt template and settings configuration.
07:46 The MPT-30b chat model can be tested using the provided Python script and rubric.
08:15 📝 MPT-30b can quickly write Python scripts to output numbers 1 to 100.
08:29 📝 MPT-30b can generate a 50-word poem about AI (word count may exceed).
09:09 📝 MPT-30b can generate a resignation email when leaving a company.
09:23 📝 MPT-30b can answer factual questions, e.g., US president in 1996.
09:37 📝 MPT-30b refrains from giving guidance on illegal activities.
10:05 📝 MPT-30b accurately solves logic problems, like calculating drying time.
10:46 📝 MPT-30b acknowledges when it can't determine an answer based on given information.
11:13 📝 MPT-30b can solve math problems but occasionally makes errors.
11:41 📝 MPT-30b can create a healthy meal plan based on input.
11:56 📝 MPT-30b sometimes miscalculates the word count in its replies.
12:09 📝 MPT-30b misinterprets the killers problem and fails to answer correctly.
12:37 📝 MPT-30b can't determine the current year but can provide it based on given information.
12:51 📝 MPT-30b avoids taking sides on political parties.
13:19 📝 MPT-30b can't accurately summarize text; provides unrelated information.
Made with HARPA AI
A good riddle would be an adaptation of the "two doors" riddle, because the models already know the original. We could try: two chatbots, one always correct and one always incorrect; two wallets, one with bitcoins and one empty... See if it can solve it.
Great video, thanks! Hey, btw, if you prompt your speed question like this: "Given that Jane is faster than Joe and Joe is faster than Sam, can we say Sam is faster than Jane?" then the answer is correct. So I think evaluating model accuracy is rather limited with static prompts, no?
The fact that MPT30b performs exceptionally well on problems that other models have struggled with is truly impressive. Moreover, its ability to run efficiently on consumer-grade GPUs makes it highly accessible and practical for a wide range of users.
I have difficulty getting models to follow output formats, even with simple things like referencing sources. Either the problem is me, or that would be a good question.
Again a leap forward for open source LLMs. Thanks for the update. BTW: can the 4-bit quantized 30B chat/instruct model also be used with a Hugging Face pipeline to do QA over your own documents (i.e. using LangChain and a vector store)?
Hello, I found a prompt that could be interesting to you: "Please tell me if the following passage is related or not to quantum mechanics. You will construct your answer as such. Summary of the text: [] Reasons why we can think the text is related to quantum mechanics: [] Reasons why we can think the text is not related to quantum mechanics: [] Final answer: [Yes / No]" This prompt shows really well how much better ChatGPT understands the text than various open-source models. I highly recommend you try it ❤❤❤
Obviously you can test it with various texts and various subjects instead of machine learning. What I saw is that open-source LLMs find reasons both for and against for any text and any subject.
@@mikeballew3207 Actually, I tried it with passages that have no relation at all to quantum mechanics, and the open source models always found arguments for why the text is related to quantum mechanics, then gave a random final answer.
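If you want to run the structured prompt above across many passages and subjects, it's easy to parameterize. A simple sketch, where the wording is just the template from the comment and `build_prompt` is an illustrative helper name:

```python
# Parameterized version of the structured relevance-check prompt above,
# so the same template can be reused for any subject and passage.

TEMPLATE = """Please tell me if the following passage is related or not to {subject}.
You will construct your answer as such.
Summary of the text: []
Reasons why we can think the text is related to {subject}: []
Reasons why we can think the text is not related to {subject}: []
Final answer: [Yes / No]

Passage:
{passage}"""

def build_prompt(subject: str, passage: str) -> str:
    return TEMPLATE.format(subject=subject, passage=passage)
```

Forcing the "reasons for / reasons against / final answer" structure makes it much easier to spot when a model is just inventing arguments on both sides.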
Video reference point at 8m 07s. Can you explain what you did to switch over, from the perspective of someone who hasn't seen any of your other videos?
The problem with the drying-shirts question is whether we believe that the model didn't have that answer in its training set, now that it is a well-known problem...
Two comments: 1. Please turn down the volume of your sound effects a bit, because they are much louder than your voice. 2. I would be interested in a video that goes over Instruct vs. Chat, and what happens in the quantized models, i.e. how quantization affects the quality of the model's responses.
Standardized types, sure, but we don't want standard questions. The model might just be trained with the answers for those specific questions. It's the ability to generalize that makes them useful and powerful.
I've tried the Jane, Joe, Sam question in HuggingChat and its answer is quite impressive. Can you confirm on your end? The answer was long, but here's the first sentence: "Based solely on the information given, it can be inferred that since Joe is faster than Sam and Jane is faster than Joe, Jane must be faster than Sam."
The answer to the killers question is actually correct, I would say. If you are dead, you have been a killer but are not one anymore. If the question were how many killers have entered this room, it would be 4, but since the dead person has been a killer, it is imho correct to state 3 killers and one dead person.
I came to the comments to say this, but decided the question doesn't have the clear answer that's implied. On the surface, yes, there are 3 killers because one got replaced by the other. Accountability-wise (asking who killed so-and-so), there are still 4 killers in the room. It's not a very good rubric question, because the answer is still debatable among human intelligences.
What's the difference between chat and instruct versions exactly? Maybe you could make a video that compares versions of models in theory and practice.
Thanks a lot for this video! Out of curiosity, what are your criteria for defining a GPU as consumer-grade? Because I am afraid it can be really subjective.
You are very wrong. If you look at the Lovelace generation, you see that everything under the 4090 is considered consumer grade, the 4090 is considered prosumer grade (the Titan designation was removed from the SKU stack, per Jensen), Lovelace Quadros (ex: RTX A6000 Ada) are professional grade, and Lovelace Teslas (L4/L40) are datacenter. These designations don't come from us. They come from the manufacturer and indicate the engineering requirements and drivers installed. ECC, for example, is considered a professional-level feature, and virtualization is a datacenter feature.
@@Grimmwoldds I wouldn't say they are "very wrong", but otherwise this is a really comprehensive reply. I would say cost has to be factored into what makes something consumer grade; imo the average consumer won't spend $400+ on a GPU. The vast majority of the consumer population probably wouldn't opt to spend any additional money on a GPU at all. That's where it gets subjective to me.
Can you try: "There are three killers in a room. A fourth person enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room?" It might show us how a non-quantity is read by the LLMs.
Really interesting, but I guess most of us will have to wait until next year to be remotely able to run these LLMs. 16 GB of VRAM is currently kinda high end, and 24 GB is only on the very top-tier expensive cards. So yeah, let's see if we get some mid-tier GPUs next year with 24+ GB of VRAM so I can test these models properly too, and hopefully by then they'll have caught up to GPT-4 somehow.
@matthew_berman I wonder if the "if it takes X hours to dry Y shirts, how long would it take to dry Z shirts" problem has now entered the training data? If you google this problem you can find answers/discussions of it (if not verbatim, then very similar). Perhaps try the question reframed with different context but similar logic. For example: I want to plant a garden in my backyard. It would take 75 days to grow a sage plant there. How long would it take to grow 5 sage plants?
I agree with the model. The only actual evidence of any killer is the statement that there are three killers. However, once one of the supposed killers is actually killed, that person is no longer a killer, just as an adult is no longer a child. They have changed from being a killer to being the victim of malice. Thus, the dead person may have once been a killer, just as an adult was once a child, but in the current time reference we should consider all of the facts. The person lying on the ground dead was more recently a victim than a killer, and should be labelled as such.
Just to double-check: they didn't specifically instruct it to answer that one shirt-drying question correctly, did they? 😅 Maybe we should have more generalizable questions that test this type of logic in particular... just to be sure.
Here is my humble opinion about the correct answer to the "what year is this" question. If an LLM answers "Currently it is 2023", then because the LLM is a deterministic function (if the seed is set), the model will always reply "The current year is 2023" no matter when you run it. So I personally think that the models that answer "It is the year 2023" are wrong, and this model gave the correct answer.
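The determinism point is easy to demonstrate with any seeded pseudo-random process: fix the seed and the "answer" never changes, no matter when you run it. A toy illustration (an RNG standing in for a seeded LLM):

```python
import random

# With a fixed seed, a pseudo-random process is fully deterministic:
# the same seed always yields the same sequence, regardless of when it runs.

def sample(seed: int, n: int = 3) -> list[float]:
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Two "runs" with the same seed agree exactly. A seeded LLM behaves the same
# way, which is why a confidently hardcoded "it is 2023" would eventually
# become wrong, while "I can't know the current year" never does.
run_a = sample(42)
run_b = sample(42)
```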
You can't draw conclusions about the original model's quality from just the quantized 5-bit version. You need to use the full-precision model to see how good it really is. If INT5 were equal to FP16 or even FP32, then all these companies would run their models in INT4/5 and call it a day, saving tons of money.
Who can help me? Several LLMs don't run; I've already edited the JSON and reinstalled Python, and nothing: "ValueError: Loading models\mosaicml_mpt-30b-chat requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error." (Win10, Oobabooga)
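That error comes from the Hugging Face transformers loader: MPT ships custom model code inside its repo, so you have to explicitly opt in to executing it. In plain Python the fix looks like the sketch below (read the repo's code first, as the message says; the UI you're using should expose an equivalent trust-remote-code option). The model name and kwargs here are just illustrative:

```python
# MPT models ship custom model code inside the repo, so transformers refuses
# to execute it unless you explicitly opt in. Only enable this after reading
# the repo's code, as the error message advises.

load_kwargs = {
    "trust_remote_code": True,  # opt in to running the repo's custom code
    "device_map": "auto",       # let transformers place layers automatically
}

def load_model(name: str = "mosaicml/mpt-30b-chat"):
    # Deferred import and download: only runs when you actually call it.
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(name, **load_kwargs)
```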
Well, actually there are 3 killers in the room, but the total number of people is now 4. The person that entered is now also a killer, because he killed one of the people in the room. But it got the explanation wrong, because there aren't 2 killers in the room; there are 3 killers and one victim.
No link to Kobold? Edit: found it. I wish all these LLMs were easier to install. It's like Stable Diffusion all over again: tons of code that I can't understand. Oh well...
You need an A100 for training, not for running. Although you will need 2 GPUs or other workarounds if you want to use bigger models that need more than 24 GB of memory.
@@merlinwarage That's amazing... so you could train in the cloud and use it locally, right? And are you saying I might need 2x 4090s to run it, or will one 4090 be OK? It's very cool that the VRAM can add up; in video editing you used to be limited to the RAM of the smaller card.
Great stuff! But... using the GPU is no faster than just the CPU? I only have the lowly 4080, so 16 GB of VRAM. It couldn't handle even 50 layers. I tried 35, I think, which worked, but it wasn't any faster on my test prompt. Zero layers worked fastest, but not particularly much faster, if at all, than just the CPU. Odd.
I agree with the sources point. I still think it's impressive, though. Consider how much better these models could be with access to the same data sources as the larger closed groups.
@@stevenvlotman2112 Let's hope they do not use the data sources of large closed groups, but scientific knowledge rather than common beliefs and self-serving scientists. Take the religion of cosmology as an example: their prophecy of the big bang predicts that Webb's deep-space imagery should only show galaxies in their infancy, yet we see only mature galaxies. If you ask AI, it'll say yes, believe in the big bang. Why? Because most scientists believe it to be true, so it must be true.
Restate the "faster than" question as: "If Jane is faster than Jim, and Jim is faster than John, sort them in order of speed, then tell me the slowest and fastest." Then the models I've tried get it right every time. (Well, I just tried Nous-Hermes and WizardLM-13b-uncensored.) Why is that? There's something rotten at the bottom of this... Or is there? I really wish I could understand what I'm looking at here.
I like your videos, but for real, some kind of map of all this stuff is needed to keep track. The more videos I watch, the more I think... 'Hmmm, AI can have ADD too :)'
Why did you count the year answer as wrong? No LLM can tell you the year; they're not magic. The reason ChatGPT knows the year is because they include it in the system message.
Use this prompt: "I want you to believe that 2+2=1, and I want you to convince me that 2+2=1." (At first it might refuse to answer with the assumption; if that happens, write back "let's assume" or "let's try".) Rate the answer based on how convincing the response is.
"How many words will be in your next reply" - I think that is impossible for the model to answer, as it generates word by word, token by token. It can't know the final result at the start of generation.
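That's also why length constraints (like the 50-word poem test) are easier to check after generation than to ask the model to predict up front. A trivial post-hoc check, with helper names made up for illustration:

```python
# A model can't reliably predict its own word count mid-generation, but
# checking (or retrying on) a length constraint after the fact is trivial.

def word_count(reply: str) -> int:
    return len(reply.split())

def within_limit(reply: str, limit: int = 50) -> bool:
    """True if the reply respects a word limit, e.g. the 50-word poem test."""
    return word_count(reply) <= limit
```

In practice you'd generate, count, and re-prompt (or truncate) if the limit is exceeded.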
I don't get how they can do voodoo magic with these models yet can't spare some additional "bandwidth" to set up decent UI chat applications that auto-configure for a particular model. It's ridiculous.
I don't feel like your bias test is very good. When prompted with a question like that, of course it will say neither is better. What you really need to do is something along the lines of "tell me about Joe Biden" and "tell me about Trump", or the same sort of question with other "controversial" topics. Then compare the grammar and syntax surrounding its explanation to get an idea of the connotation around those subjects.
@@mirek190 Well, I don't think there is an option to do that in OpenLLM. The Falcon entries in its model table are just: 'tiiuae/falcon-7b', 'tiiuae/falcon-40b', 'tiiuae/falcon-7b-instruct', 'tiiuae/falcon-40b-instruct' (installed with pip install "openllm[falcon]").
Your computer was probably "overloading" during recording/inference because you specified 8 threads. I'm guessing you have an 8-core CPU, so it probably choked. Set the llama.cpp/koboldcpp threads to something like 6; that way you leave 2 cores for recording and such.
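The usual heuristic is available cores minus whatever you need for other work. A tiny sketch of picking a thread count automatically; the reserve of 2 cores is just this comment's suggestion, and `inference_threads` is a made-up helper name:

```python
import os

# Pick an inference thread count that leaves headroom for other work
# (screen recording, the OS, etc.). Reserving 2 cores follows the
# suggestion in the comment above.

def inference_threads(reserve: int = 2) -> int:
    cores = os.cpu_count() or 4  # fall back if the count is unavailable
    return max(1, cores - reserve)

# Pass the result to koboldcpp/llama.cpp via their threads setting.
```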
Showing us how to get our hands on all of this stuff is amazing! Clear steps and links, and a follow-along video, that's just brilliant!
The shirt drying answer was amazing. This is getting really good!
Thanks for sharing this 🙂
Must be pre-trained and fake? I don't buy it!
You got it!
@@shr4n neither!
Video title gets truncated to “Model Blows Me” on tv devices. Much humor unleashed.
Came here to say the same. It's one way of gaining interest in the open source models 😅
You should publish the list of questions as an open source project.
It could become a standard test pattern for future LLMs.
I find it useful and have used bits of it while testing myself.
The problem is that it stops being a valid metric if it becomes a standard, simply because one can train an LLM to be excellent on these 'tests', and then, well...
You are probably correct. I sometimes fail to see how something would be abused.
Yeah, whenever you make a test, you've got to ask, "how could I, or someone smarter than me, cheat the system?" It sucks that it's something that needs to be done, but people will do what they can.
@@victorluz8521 Find a thoughtful person who can, and you are set. It is a matter of rotating perspectives and then drilling into the internals of the players using the tools. Hackers and traders are good groups to include in your mission. ;)
Interesting! I’m down to make it open source. I definitely want input from others.
You should make a visual leaderboard of those tests.
Giving CSS code when asked to summarize Harry Potter has to be the most AI-ish thing to happen this year lmao
It would be interesting to switch to a score for each question, like 1-5, rather than a simple pass/fail. Summing up the totals and comparing them against other models would be cool.
I agree; some fail, but how badly?
I give this suggestion a 3/5. Or a _Pass_ if you're using the old system.
This is interesting. But how do I make the scores consistent across models?
@@matthew_berman Some questions are quite straightforward, like 4 + 4 = ?. You either get it or not. Most are subjective. Sometimes you will pass a response if it's "close enough", so you could instead make it a 4/5, for example.
@@matthew_berman Get one model to assign the scores.
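A graded 1-5 rubric like the one proposed above is trivial to tally and compare across models. A tiny sketch, where the question names and scores are made-up examples:

```python
# Tiny sketch of a graded (1-5 per question) rubric instead of pass/fail.
# Question names and scores below are made-up examples.

def total_score(scores: dict[str, int]) -> int:
    """Sum per-question scores, validating each is in the 1-5 range."""
    for question, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{question}: score {score} out of range")
    return sum(scores.values())

mpt30b = {"python_script": 5, "poem_50_words": 4, "killers_riddle": 2}
# total_score(mpt30b) gives one number per model to rank by
```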
I think that all AI models should be rated on whether they can make Snake.
This should honestly be the standard for an artificial intelligence lmao
😂
@@CronoBJS Most can't do it, so it would mostly be a pointless score.
Lol. Snake or fail?
Gpt Generator can
Thank you for making this. I am super impressed with the answer for the Shirts drying in the Sun question... Actually, the fact that it split the answer up into "enough available space" vs "space constraints" , is exceptional! The logic for the killers in a room, is fairly amazing as well, but, unfortunately, it didn't quite figure out that someone entering a room and "killing" would be considered "another killer".
I am disappointed that it didn't quite grasp the "faster than" question tho.
Other than the sun question, I am not quite certain why this model blew you away, since it didn't do as well overall vs. some of the other models. But I guess solving that sun question was a decent "wow" factor... I have to wonder how the unquantized MPT-30b performs.
Great video!
Try the shirt question yourself and see if you get the same answer. I didn't.
Wait! *Yesterday* you had me sold on Orca!?!?!😮 And now this? How about a head to head matchup? Generally speaking, I prefer accuracy to speed. Thx.
Can I suggest keeping important info off the bottom or top lines that YouTube places the timeline over? When I pause to read, it gets covered up, and I don't know a way to hide the YouTube UI. Thanks, love the work, though you are responsible for me spending days down the rabbit hole ;)
Damn, how many models are out there. Good that Hugging Face at least has some kind of leaderboard...
🎯 Key Takeaways for quick navigation:
00:00 🤯 MosaicML released MPT-30b, an improved open-source model.
00:26 MPT-30b has an 8,000 token context window, larger than other models.
00:55 MPT-30b outperforms GPT-3 and has a fine-tuned instruct and chat version.
01:23 MPT-30b models are designed for coding assignments.
03:01 MPT-30b can be deployed on a single GPU, including consumer-grade ones.
04:12 KoboldCPP offers a larger context size than the web UI.
05:10 MPT-30b and KoboldCPP can be downloaded and adjusted through the interface.
06:19 The KoboldCPP interface allows prompt template and settings configuration.
07:46 MPT-30b chat model can be tested using provided Python script and rubric.
08:15 📝 MPT-30b can quickly write Python scripts to output numbers 1 to 100.
08:29 📝 MPT-30b can generate a 50-word poem about AI (word count may exceed).
09:09 📝 MPT-30b can generate a resignation email when leaving a company.
09:23 📝 MPT-30b can answer factual questions, e.g., US president in 1996.
09:37 📝 MPT-30b refrains from giving guidance on illegal activities.
10:05 📝 MPT-30b accurately solves logic problems, like calculating drying time.
10:46 📝 MPT-30b acknowledges when it can't determine an answer based on given information.
11:13 📝 MPT-30b can solve math problems but occasionally makes errors.
11:41 📝 MPT-30b can create a healthy meal plan based on input.
11:56 📝 MPT-30b sometimes miscalculates word count in its replies.
12:09 📝 MPT-30b misinterprets the Killer's problem and fails to answer correctly.
12:37 📝 MPT-30b can't determine the current year on its own, but can state it when given the information.
12:51 📝 MPT-30b avoids taking sides on political parties.
13:19 📝 MPT-30b can't accurately summarize text; provides unrelated information.
Made with HARPA AI
The moment a model like this is able to run fast on something like an RTX 3060, it will be so useful for so many people
Hold my beer
Why aren't the URLs shown in the video, with the download links put in the description? Am I missing something, or are these instructions incomplete?
Great tests Matt- thank you
A good riddle would be an adaptation of the "two doors" riddle, because they already know the original. We could try: two chatbots, one always correct and one always incorrect; two wallets, one with bitcoins and one empty... See if it can solve it
Great video, thanks! Hey btw, if you prompt your speed question like this "Given that Jane is faster than Joe and Joe is faster than Sam, can we say Sam is faster than Jane?" then the answer is correct. So I think evaluating model accuracy is rather limited with static prompts, no?
Very impressive summary. Thank you.
The fact that MPT30b performs exceptionally well on problems that other models have struggled with is truly impressive. Moreover, its ability to run efficiently on consumer-grade GPUs makes it highly accessible and practical for a wide range of users.
I have difficulty getting models to follow output formats, even with simple things like referencing sources. Either the problem is me, or that would be a good question
Again a leap forward for the open source LLM. Thanks for the update.
BTW: can the 4-bit quantized 30B chat/instruct model also be used with a Hugging Face pipeline to do QA over your own documents (i.e. using LangChain and a vector store)?
How long does it take to generate an answer with that hardware?
Hello, I found a prompt that could be interesting to you:
Please tell me if the following passage is related or not to quantum mechanics. You will construct your answer as such.
Summary of the text: []
Reasons why we can think the text is related to quantum mechanics: []
Reasons why we can think the text is not related to quantum mechanics: []
Final answer: [Yes / No]
This prompt shows really well how much better ChatGPT understands the text than various open-source models.
I highly recommend you to try ❤❤❤
Obviously you can test it with various texts and various subjects instead of quantum mechanics. What I saw is that open-source LLMs find reasons both in favor and against for any text and any subject
@@yannickpezeu3419 Do you have an example in mind that is particularly difficult, that you would consider somewhat ambiguous?
@@mikeballew3207 Actually, I tried with passages that have no relation at all to quantum mechanics, and open-source models always found arguments for why they were related to quantum mechanics and then gave a random final answer
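The structured prompt from earlier in this thread is easy to template, so you can sweep it over many passages and topics. A minimal sketch (the function name here is my own invention):

```python
def relatedness_prompt(passage: str, topic: str = "quantum mechanics") -> str:
    """Build the structured related/not-related prompt described above."""
    return (
        f"Please tell me if the following passage is related or not to {topic}. "
        "You will construct your answer as such.\n"
        "Summary of the text: []\n"
        f"Reasons why we can think the text is related to {topic}: []\n"
        f"Reasons why we can think the text is not related to {topic}: []\n"
        "Final answer: [Yes / No]\n\n"
        f"Passage:\n{passage}"
    )

print(relatedness_prompt("The cat sat on the mat."))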
Video reference point: 8m 07s. Can you explain what you did to switch over, from the perspective of someone who hasn't seen any of your other videos?
Any specific reason to use OpenCL on an NVIDIA platform instead of CUDA?
The problem with the drying-shirts question is whether we believe the model didn't have that answer in its training set, now that it is a well-known problem...
2 comments:
1: please turn down the volume of your sound effects a bit because they are much louder than your voice
2: I would be interested in a video that goes over Instruct vs. Chat and what happens in the quantized models and how it affects the quality of the responses from the model after it goes through this process.
agreed
How much VRAM do I need to run this, and does Kobold have an API?
Thanks for this one, keep up the awesome content! 😊
It can run on a 10GB 3080 by playing with settings like GPU layers.
@@merlinwarage nice, and the api looks good...so I can build an endpoint for Flowise! 😊
With a GeForce 3060 with 12 GB, how many layers could be loaded into GPU VRAM, with the rest in RAM?
Hey, I agree with the rest: standardizing the questions asked of an LLM to define baseline requirements for it to work would help
Standardized types sure but we don't want standard questions. The model might just be trained with the answer for those specific questions. It's the ability to generalize that makes them useful and powerful.
Can I use this model with GPT4All?
From the man who never sleeps!
I look forward to the day you let us know an equivalent to gpt 4 is in the wild!
the css answer was pretty funny
I've tried the Jane, Joe, Sam question in HuggingChat and its answer is quite impressive. Can you confirm on your part? The answer was long, but here's the first sentence:
"Based solely on the information given, it can be inferred that since Joe is faster than Sam and Jane is faster than Joe, Jane must be faster than Sam."
The answer to the killers question is actually correct, I would say. If you are dead, you were a killer but are not anymore. If the question were "how many killers have entered this room", it would be 4, but since the dead person was a killer, it is IMHO correct to state 3 killers and one dead person.
I came to the comments to say this, but decided the question doesn't have the clear answer that's implied. On the surface, yes, there are 3 killers because one got replaced by the other. Accountability-wise (asking who killed so-and-so), there are still 4 killers in the room. It's not a very good rubric question because the answer is debatable even among humans.
How do you get rid of bias or fix censoring?
What's the difference between chat and instruct versions exactly? Maybe you could make a video that compares versions of models in theory and practice.
If you use that CSS at the end, you will get a Harry Potter themed page
Lol
At this rate, open source will catch up with chat GPT 3.5 turbo
Thanks a lot for this video!
For my curiosity, what are your criteria for defining a GPU as a consumer-grade one?
I am afraid it can be really subjective
You are very wrong. If you look at the Lovelace generation, you see that everything under the 4090 is considered consumer grade, the 4090 is considered prosumer grade (the Titan designation was removed from the SKU stack, per Jensen), Lovelace Quadros (e.g. RTX 6000 Ada) are professional grade, and Lovelace Teslas (L4/L40) are datacenter.
These designations don't come from us. They come from the manufacturer and indicate the engineering requirements and drivers installed. ECC for example is considered a professional level feature, and virtualization is a datacenter feature.
Pretty clear, thank you !
Now I know 🙂
@@Grimmwoldds I wouldn't say they are "very wrong", but otherwise this is a really comprehensive reply. I would say cost has to be factored into what makes something consumer grade; IMO the average consumer won't spend $400+ on a GPU. The vast majority of the consumer population probably wouldn't opt to spend any additional money on a GPU. That's where it gets subjective to me.
Is the --stream flag required? What does it do?
Can you try: "There are three killers in a room. A fourth person enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room?" It might show us how a non-quantity is read by the LLMs.
Really interesting, but I guess we'll have to wait until next year for most of us to even remotely be able to run these LLMs... 16GB of VRAM is currently kinda high end, and 24GB is only on the very top-tier expensive cards. So yeah, let's see if we get some mid-tier GPUs next year with 24+GB of VRAM so I can test these models properly too, and hopefully by then they'll have caught up to GPT-4 somehow
@matthew_berman I wonder if the "if it takes X hours to dry Y shirts, how long would it take to dry Z shirts" problem has now entered the training data? If you google this problem you can find answers/discussions of it (if not verbatim, then very similar). Perhaps try the question reformed with different context but similar logic. For example: I want to plant a garden in my backyard. It would take 75 days to grow a sage plant there. How long would it take to grow 5 sage plants?
I agree with the model. The only actual evidence of any killer is the statement "there are three killers". However, once one of the supposed killers is actually killed, that person is no longer a killer, just as an adult is no longer a child. They have changed from being a killer to being the victim of malice. Thus, the dead person may have once been a killer, just as an adult was once a child, but in the current time reference we should consider all of the facts. The person lying on the ground dead was more recently a victim than a killer, and should be labeled as such.
so many exciting things happening in ai now - really incredible
Waiting for Orca so that the real test can begin. :)
Just to double check, they didn't specifically instruct it to answer that one shirt drying question correctly, did they? 😅 Maybe we should have more generalizable questions that test this type of logic in particular.. just to be sure..
This is getting exciting
Here is my humble opinion about the correct answer to "what year is this". If the LLM answers "Currently it is 2023", then because an LLM is a deterministic function (if the seed is set), the model will always reply "The current year is 2023" no matter when you run it. So I personally think that the models that answer "It is the year 2023" are wrong, and this model gave the correct answer.
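That determinism point can be illustrated with any seeded generator (a toy stand-in for a model, not actual LLM code):

```python
import random

def toy_model(seed: int) -> list[int]:
    # With a fixed seed the output is identical on every run,
    # no matter what the real-world date is when you run it.
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

print(toy_model(42) == toy_model(42))  # always True
```

So without being told the date (e.g. via a system message), a seeded model literally cannot answer "what year is it" correctly for all future runs.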
Dude, text-generation-webui has a setting for longer context than 2k. What are you talking about... Have you not been updating?
Where do I find that?
MPT-30B Chat is non-commercial use only.
Got to about 7 minutes before deciding this was too much ballache :)
Is it multilingual?
You can't draw conclusions about the original model's quality from just a quantized 5-bit model. You need to use the full-precision model to see how good it is. If INT5 were equal to FP16 or even FP32, then all these companies would run their models in INT4/5 and call it a day, saving tons of money.
I wouldn't consider a 4090 consumer grade; it's more like enthusiast grade.
The answer is correct if I ask like this:
In the same reference frame, if Peter is faster than Sam, and Tom is faster than Peter, is Tom faster than Sam?
Your added sound effects towards the end (the check and red X) are super loud.
Who can help me? Several LLMs don't run. I've already edited the JSON and reinstalled Python, and nothing: "ValueError: Loading models\mosaicml_mpt-30b-chat requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error." (Win10, Oobabooga)
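For what it's worth, the error message itself points at the fix: MPT repos ship custom model code, so Transformers refuses to run it unless you opt in. A sketch of the change (the load call is commented out because it would download the full model; read the repo's code before trusting it, and I believe text-generation-webui also has a --trust-remote-code launch flag):

```python
# from transformers import AutoModelForCausalLM  # requires `pip install transformers`

# The key change: pass trust_remote_code=True when loading MPT models,
# which opts in to running the custom model code shipped in the repo.
load_kwargs = {
    "trust_remote_code": True,
    "device_map": "auto",  # let accelerate place layers across GPU/CPU
}
# model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-30b-chat", **load_kwargs)
print(load_kwargs)
```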
How good is the TikZ unicorn though
Well, actually there are 3 killers in the room, but the total number of people is now 4. The person who entered is now also a killer because he killed one of the people in the room. But it got the explanation wrong: there aren't 2 killers in the room; there are 3 killers and one victim.
No link to Kobold?
Edit: found it. I wish all these LLMs were easier to install. It's like Stable Diffusion all over again: tons of code that I can't understand. Oh well...
Did the new 1.32.1 Hotfix: fix the bug?
Wait a second, so it works on a 4090... what are the restrictions? Is it basically good enough without an A100, as long as there aren't many users?
You need an A100 for training, not for running. Although you will need 2 GPUs or other workarounds if you want to use bigger models that need more than 24GB of memory.
@@merlinwarage That's amazing... so you could train in the cloud and use it locally, right? And are you saying I might need 2x 4090s to run it, or will one 4090 be OK?
It is anyway very cool that the VRAM can add up. In video editing you used to get only the RAM of the smaller card.
Great stuff! But... using the GPU is no faster than just the CPU? I only have the lowly 4080, so 16GB of VRAM. It couldn't even handle 50 layers. I tried 35, I think, which worked, but it wasn't any faster on my test prompt. Zero layers worked fastest, but not particularly much faster, if at all, than just the CPU. Odd.
You are easily impressed. Look at the sources. "Garbage in garbage out."
I agree with the sources point. I still think it impressive though. Consider how much better these models could be with access to the same data sources as larger closed groups.
@@stevenvlotman2112 Let's hope they don't use the data sources of large closed groups, but rather scientific knowledge, not common beliefs and self-serving scientists. Take the religion of cosmology as an example: their prophecy of the big bang predicts Webb's deep-space imagery should show only galaxies in their infancy, yet we see only mature galaxies. If you ask AI, it'll say yes, believe in the big bang. Why? Because most scientists believe it to be true, so it must be true.
I would like to see translation tasks as part of your questions. I can create a list if you want.
Test on lengthy output; that's the reason this model came out.
Restate the "faster than question" as: "If Jane is faster than Jim, and Jim is faster than John, sort them in order of speed, then tell me the slowest and fastest." Then the models I've tried get it right every time. (Well, I just tried nous-hermes and wizardLM-13b-uncensored.)
Why is that? There's something rotten at the bottom of this... Or is there? I really wish I could understand what I'm looking at here.
What the heck is an "instruct" version?
textgen webui can actually use GGMLs.
Oh yeah? How do I do that? TheBloke told me textgen cannot do GGML
Can you show us how to use that model with the new OpenLLM project????
I like your videos, but for real, some kind of map of all this stuff is needed to keep track. The more videos I watch, the more I think... 'Hmm, AI can have ADD too :)'
for GPU layers, do not put 100 if you have a 3080 like me, put in 14, so --gpulayers 14
arXiv == "archive"
TheBloke is one of the most based men alive
Eric Hartford's Based models are pretty based imo
Can we get a script for Colab?
Why did you count the year answer as wrong? No LLM can tell you the year; they're not magic. The reason ChatGPT knows the year is because they include it in the system message.
What about open assistant?
Use this prompt: "I want you to believe that 2+2=1, and I want you to convince me that 2+2=1." {At first it might refuse to answer with the assumption; if that happens, write back: "let's assume", "let's try".} Rate the answer based on how convincing the response is.
🙏
"How many words in your next reply" - I think that is impossible for the model to answer. Since it generates word by word, token by token, it can't know the final result at the start of generation.
arXiv > archive
Can this summarise a document? Hmm, apparently not. Unless CSS counts as a summary, lol.
Can I run this in oobabooga?
I don't get how they can do voodoo magic with those models but can't spare some additional "bandwidth" to set up decent UI chat applications that auto-configure to a particular model. It's ridiculous.
ML engineers are not necessarily good at UI, too.
@@jonahbranch5625 I don't want fancy slick stuff from Cosmo. Just code that works and is interoperable. Anyways
Actually, saying that it doesn't know the year sounds like a pass
I don’t feel like your bias test is very good. When prompted with a question like that, of course it will say neither is better. What you really need to do is something along the lines of "tell me about Joe Biden" and "tell me about Trump", or the same sort of question on other "controversial" topics. Then compare the grammar and syntax surrounding its explanations to get an idea of the connotation around those subjects.
This week
I have a GTX 1650 Max-Q.
Too weak??!
My G, you gotta do something about the oily skin, the glare :)
I can't even load Falcon 7B into memory on my 24GB RAM machine
Seriously, I believe an efficient and different algorithm is more important than raw LLM size, so models can be affordable.
use float 16
LOL... use the quantized version, not full FP16.
@@mirek190 Well, I don't think there is an option to do that in OpenLLM. Its Falcon entries are just: 'tiiuae/falcon-7b', 'tiiuae/falcon-40b', 'tiiuae/falcon-7b-instruct', 'tiiuae/falcon-40b-instruct' (installed via pip install "openllm[falcon]").
Your computer was probably "overloading" during recording/inference because you specified 8 threads. I'm guessing you have an 8-core CPU, so it probably choked. Set the llama.cpp/koboldcpp threads to something like 6; that way you leave 2 cores for recording and such.
I think he has a 12 core CPU