How to Find The Best AI ChatBot For You (For FREE)
- Published: June 9, 2024
- Sharing a cool tool I've been playing with AND talking about why it's tough to compare LLMs.
Discover More From Me:
🛠️ Explore thousands of AI Tools: futuretools.io/
📰 Weekly Newsletter: www.futuretools.io/newsletter
🎙️ The Next Wave Podcast: @TheNextWavePod
😊 Discord Community: futuretools.io/discord
❌ Follow me on X: x.com/mreflow
🧵 Follow me on Instagram: / mr.eflow
Resources From Today's Video:
gmtech.com/
Sponsorship/Media Inquiries: tally.so/r/nrBVlp
Mailing Address: 3755 Avocado Blvd Unit 287, La Mesa, CA 91941
#AINews #AITools #ArtificialIntelligence
Time Stamps:
0:00 Intro
0:12 A Great Comparison Tool
2:45 Comparing LLMs on Creativity
5:39 They're All The Same!
6:42 Comparing Joke Telling
7:20 The Number 42?
9:52 Comparing Image Models
12:50 Like I Said, They're All The Same
14:33 Find More Cool AI Tools
Just to get ahead of the comments that I know are coming... I know there are some edge use-cases where certain models outperform other models. The point that I'm trying to make is that for 90% of people and the main use-cases people use these LLMs for, they're all going to perform relatively equally.
I would agree with this assessment.
Standing behind a hypothesis that isn't yet established takes some courage.
Well appreciated.
Again, use this prompt for comedy: "You are an experienced stand-up comedy writer. Write an original 3-minute stand-up routine, showing audience responses in brackets. Write a funny bit about Texas wildlife."
Matt, please redo this video with better prompts.
Nobody just asks for a joke or a number from 0-100 😂
Interesting to see the side-by-side comparison, AND the single focus. Thanks. However, you mentioned they're all pretty much the same at creative writing. I respectfully disagree. Currently, I'm finding Claude is far better than ChatGPT at showing vs. telling. But that could change tomorrow. Such fun!
In my experience, it all depends on whether the model is local or not and what pre/post processing is being used. With local models (all using LM Studio) your assessment is quite fair. There's not much difference between them. Hell, they don't even know what words *are*, which is why they have trouble counting them. Any that answer correctly and consistently likely have pre/post response processing. I could be wrong, but that's my current impression.
Wow Matt, can't thank you enough for this review and feedback! I'm a long-time follower and really enjoy your content. We're a small team but have lots of cool features in the works along with integrating the latest models - Llama3, Gemini 1.5, Claude 3 Opus & Haiku are coming in the next release. Stay tuned!
Very cool, I just signed up - when are you adding Llama3 and Phi3?
@@literailly Awesome thank you for signing up! Hope you're enjoying it so far. Llama3 will be in our release tomorrow, Phi3 we're still testing and will have an ETA soon.
And China's new AI?
I'm subscribing my business to you next month, and I wonder if you would integrate Gemini Ultra and Stable Diffusion, because that would be crazy good
@@xart2621 We currently have Google Gemini 1.0 Pro and will be releasing Gemini 1.5 Pro in our next release. We also currently offer three Stable Diffusion models: 3.0, 3.0 Turbo and SDXL. Reach out with any questions, thank you for your support!
It's not the size, it's how you use it.
wtf. haha
Optimized for large monitors
I mean you can get the job done with anything, but it’s certainly easier to do with a larger language model
My wife tells me that every night
Another lie from the women's lib era.
I might just be hallucinating here, but have you people noticed the different 'writing styles' in LLMs? Like, when testing two models on the Chatbot Arena, I sometimes go "ah, that's GPT 4" or something.
This is probably a product of their finetuning or RLHF/RLAIF, but GPT-4, Claude, Gemini and even Llama have their own quirks in how they generate text. I can't pinpoint how or where they differ, I just intuitively notice that they FEEL different.
And that's probably how GPT-4 "cheated" to get back to #1 in 12 hours with 8,000 battles before they announced the release 😂
We're hardwired to identify patterns. You can even tell if they updated a model you use daily just by noticing tiny changes in the way they respond.
Yes. Tantamount to having personalities.
You aren't wrong. They do respond differently.
Veritasium has a great video on why 37 is the top response (by a large margin) from a human when asked for a random number.
Yep. I was surprised about the 42 because I have seen this video as well.
42 is the most common answer for Mistral, but GPT-4 gives 37, which more closely resembles humans
All of these models can now answer simple questions. They differ in how well they handle real work, like answering an email about a complex job-specific topic, written in a language that few people speak, with 200 pages of documentation as context. If you want to compare them, try a task so hard that only one or a few models can handle it. I can imagine it might be a bit hard to turn into a video, though.
This is true. The more complex the question, the more varied the responses.
In my experience, the more text in the prompt, the harder it is for a model to keep up; some models, of course, are better at that than others. The one(s) that can keep up with you AND provide quality outputs that adhere to your massive prompt are the ones that are perfect for your workflow, because Matt is sort of right that they otherwise answer similarly.
I just completed an AI/ML Postgraduate class at the University of Texas. Every sample Jupyter notebook used 42 as a seed for training the ANNs
English, dude... English.
Good find. I've been doing my AI side-by-side comparisons using docs and tables 😂 I was thinking of creating something like what they've done. They executed it very well. But it must get expensive.
Veritasium showed that 42 and 37 are among the most common answers in a public survey.
Veritasium just did a video on the number 37, which is disproportionately selected by people when asked for a random number.
Yeah cool video
8:20 - Fun fact, 37 is one of the most commonly picked numbers when you ask humans to pick a number between 1 and 100. 42 is more commonly used in tech/geek/nerd culture due to the Douglas Adams reference (which, if you know what these LLMs are trained on, makes a lot of sense). So these results make perfect sense when you remember that an LLM is incapable of actually generating a random number. As it predicts the most likely next word given all previous words, based on the text it's been trained on (which is generated by humans), every LLM will be heavily biased towards the most common *human* responses, so in this case, 37 and 42. I'm sure if you ask it to pick a number between 1 and 10, it will be heavily biased towards 3 and 7.
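The "biased sampling" idea in the comment above can be sketched in a few lines of plain Python. This is a toy model only: the weights below are made up for illustration and don't come from any real LLM, but they show how sampling from a distribution that over-represents 37 and 42 keeps producing those numbers even though every number from 1 to 100 is possible.

```python
import random
from collections import Counter

# Toy next-token distribution for "pick a number between 1 and 100".
# Every number gets a small base weight, but 37 and 42 are heavily
# over-weighted, standing in for their over-representation in human
# training text. The exact weights are invented purely for illustration.
weights = {n: 1.0 for n in range(1, 101)}
weights[37] = 25.0
weights[42] = 25.0

random.seed(0)
numbers = list(weights)
samples = random.choices(
    numbers, weights=[weights[n] for n in numbers], k=10_000
)

counts = Counter(samples)
top, _ = counts.most_common(1)[0]
# Even though the sampling is "random", the learned bias dominates:
# the single most frequent pick is one of the over-weighted numbers.
```

Running this repeatedly with different seeds still surfaces 37 or 42 as the most common pick, which is the point: sampling doesn't rescue a model from a skewed distribution.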
They will all converge into one single AI...
They are absolutely not the same and I’m not talking about “edge cases”.
You can get dramatically different responses from models with the same prompt.
ChatGPT with custom instructions wins all the way for me because of convenience. I haven't had the opportunity to test Claude, but it is likely the best out of the box for coding. Dolphin Mixtral for storytelling because of its lack of guardrails.
@@backstabbah How is ChatGPT more convenient than the others?
@@YungSlimeGaming Yes. For the others, you need to keep a file with custom instructions for every use case. Giving it the right prompt at the start is important for every model, but only ChatGPT makes it easy.
But I must say that it's falling behind in a big way. I won't be renewing next month and will get a VPN + Claude instead.
Thank you sir! Love the one-subject experiment.
Here's a crazy thought: Are we 100% sure that this GMTECH service is actually using the real API's? I mean, it would be an "easy" way to make a lot more money if they just pretend that those models are ChatGPT, Gemini, MetaAI, and so on, but in reality are just an opensource model trained to act in different ways to mimic closed-source models like ChatGPT. I am not saying GMTECH is lying to earn a lot of money fast -- I'm just saying that the responses of the models on their platform are surprisingly similar.
Fun idea, but nope, we are definitely using all the real models via various APIs (AWS Bedrock, OpenAI, Google Vertex and Mistral). Trust me, the similarities between these models are just as surprising to me! We do no prompt transformation either; what you send is what we send to the API. We're here to provide a useful interface for model comparison! Hope you enjoy it :)
@@gmtechai I appreciate your reply. However, if ever you choose to convert your platform and use the business strategy I mentioned, I do expect to get a royalty :P
@@Legacy_Inc. Deal! If Google changes their APIs one more time, they quietly become Llama 2, and you get your royalty :)
Someone should look into this. Don't take their word for it.
@@helloworldcsofficial You can ask the models "what model are you?" and most will tell you btw. Reach out with any questions, happy to help!
For LLMs, simple generic prompts will give generic answers that are similar. If you want them to diverge, you need to type very complex prompts with a dozen or more layers, or have long and layered conversations. Seriously, one simple prompt is like testing whether a car starts. It's not a test of the car's performance.
Please make more comparison videos, I really like them, everyone does. Test more logic and reasoning, not subjective answers.
How did you create the animations near the end, when talking about the different areas of expertise of the different LLMs? It looked really cool; was it AI-created?
The advancement in AI development is like the Wild West right now. What a crazy time to be alive.
And great channel btw. Happy I stumbled upon this one!
Thank you for your excellent comments, explanations and solutions related to AI!
this is super cool -- we def wanna play around with it too. thanks for sharing! hoping they get Llama 3 and Opus soon
haha also we didn't spend enough time on that Rent-a-Chicken idea
Thank you! Both will be in our next release :)
promptfoo runs locally and also lets you assert on responses, displaying your benchmark results.
Matt, we're really liking the single-topic video version. 👍
Thank you so much Matt for demystifying the complex world of AI and making it accessible to us all. Your insightful videos not only enhance our understanding but also ignite our curiosity. Keep up the fantastic work!
If I want a really good result I use a few of them, providing the answer from one and asking the next to add anything the previous model may have skipped.
Try to ask for a RANDOM number between 1 and 100, then you should get different results from nearly all models. That is a good example for totally different interpretations due to altering just small parameters in the prompts.
I wanna try Phi-3 and some Chinese LLMs. And we definitely need a kind of special mode that shows answers in a different window (a second screen), with an auto-summary proposal and different colors for each LLM's answer, combining the different models' answers into one. However, we need to reinforce fact-checking by connecting them to the web for source validation.
Unfortunately, so far Phi-3 is not available through an API. As for Chinese LLMs, which are you interested in? We have tested Yi but it was not really useful enough to include in the application.
For me,
- Gemini is good for Q&A.
- Claude AI is good for productivity.
- ChatGPT is good for intelligence-based work like fixing grammar.
- PI AI is like a good friend.
- Copilot is a good image generator.
Great find Matt! I have been looking for something like this. Too bad it is not free for limited monthly use. Suggestion for future testing and comparison of these tools: instead of using what to me are mostly simple and silly (sorry) examples, try using some practical questions that require real-world knowledge so one can evaluate the answers based on usefulness, correctness, completeness, clarity, etc. rather than just time to generate a silly answer.
Have you considered the company in question is generating 5 responses from the same model and passing them off as different?
Very interesting! For these experiments I would recommend setting the temperature to 0, so you can make sure the fluctuations are really between the models and not within them. Also, at T=0 you will get the answer the LLM thinks is the best one.
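For anyone wondering what "temperature" actually does here: it rescales the model's raw scores (logits) before they are turned into probabilities, and as T approaches 0 the distribution collapses onto the single highest-scoring token, making the output effectively deterministic. A minimal sketch with invented toy logits (not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities; lower temperature sharpens them."""
    t = max(temperature, 1e-6)  # guard: T -> 0 is effectively argmax
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
p_high = softmax_with_temperature(logits, 1.0)   # spread-out distribution
p_low = softmax_with_temperature(logits, 0.01)   # nearly all mass on token 0

# At T=1 the top token gets only part of the probability mass; near T=0
# it gets essentially all of it, so sampling stops fluctuating.
```

This is why T=0 is useful for comparisons: any remaining differences between models can't be blamed on sampling noise.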
In Douglas Adams' famous novel "The Hitchhiker's Guide to the Galaxy", the computer Deep Thought is asked to calculate the Answer to the Ultimate Question of Life, the Universe, and Everything. After 7.5 million years of calculation, Deep Thought provides the answer: 42. There is a theory that explains the answer (42), and it has to do with the asterisk key on a keyboard (the meaning of the universe is in everything).
Go Matt go! Make math capabilities and math knowledge reviews of LLM models
Just watched Veritasium's video on 37; between that and The Hitchhiker's Guide to the Galaxy, no surprises in the numbers.
Interesting... I wonder why that is? Aren't they built up from scratch individually?
Other LLMS: 42
Llama 3: I'm not like the other girls 💅, 43
Use Multilevel Queue Scheduling as the test. It's one of the hardest things to program from scratch.
Rent-a-Chicken service. Sign me up!
Liking the tweaked format 😊
what's the image thingy running behind u in the video? 😸
If instead of saying "Give me a number between 1 and 100" you ask "Give me a random number between 1 and 100", are the results any less consistent with 42s? I wonder if the models don't infer the "random" part the way most people would?
I wouldn't try to create your own leaderboard with new questions unless you really want to; take from what's been done in research. You need domain-specific expert knowledge to come up with questions that make a difference. I think at some point there's always a cutoff for what the AI can answer correctly once you increase the complexity enough, for any subject.
Nice find.
4:15 That's much less than half a penny - 0.05 of a penny with Gemini Pro
What is the best one for chemistry and maths?
❤❤❤ thank you !
I prefer Claude, but the headings and sections are valuable. Claude generates just the type of either notes, feedback, and plans of action I look for
It's a shame Matt didn't mention the Pony model in the image generation test.
Midjourney and Ideogram are missing from the image comparison, so there's no real competition.
Corporate needs you to find the differences between this (heavily censored) chatbot and this (heavily censored) chatbot
from what I understand, it's something more related to frequency
Ha, I had to test the 42 question, and it turns out the custom GPT I made, DM Tool Kit, interprets the request as random number generation and rolls it randomly instead.
Something I have found ChatGPT 4.0 to be surprisingly good at is calculating DPR (damage per round) for D&D. I give it the info for my character and ask it to calculate the DPR, then I ask something like "ok, what if the enemy is in faerie fire, what would my new DPR be" or "ok, let's say I replace 2 levels of rogue with 2 levels of fighter to get Action Surge, what would my expected DPR for that round be", or just give it a number of little variations on my character and ask for the expected DPR change.
As far as I can tell, the numbers seem accurate, but I'm not that good at math... there are no GLARING errors.
This was particularly useful since my table uses a different crit calculation, and getting the existing DPR calculators to handle it wasn't working well. (Basically our crit is: double the dice, but the first set of dice is maxed. So if you have 1d10+5 and you crit, it's 10+1d10+5.) When I explained this to ChatGPT, it knew how to incorporate that information into the DPR calculations.
It also helped me prove a point to a DM of mine. He runs it so that when we are traveling we roll a die to see if we have an encounter (which die depends on how dangerous the area is); if we roll a 1, we have an encounter. I suggested that he roll a die and we roll a die, and if it's the same number we have an encounter, to add a bit of suspense. He said it would make encounters less frequent (his idea being that the probability of rolling a 1 on, say, a d6 is 1/6, but the probability of 2d6 rolling the same number is 1/36). I was certain he was wrong, but I couldn't work out where his error was (after all, the probability of rolling any given number on a d6 is 1/6, so what does it matter whether it's a 1 or some other number? The number he rolls doesn't affect my roll).
ChatGPT explained why I was right: the probability of 2d6 rolling the same number ISN'T 1/36, it's 1/6. The probability of rolling two 1s on 2d6 is 1/36, but so is rolling two 2s, two 3s, etc., so when you add up the probabilities you get 6/36, which is 1/6.
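The 1/6 claim above is easy to sanity-check with a quick Monte Carlo sketch (plain Python, nothing model-specific): roll two d6 many times and count how often they match.

```python
import random

random.seed(42)

TRIALS = 100_000
# Roll the DM's die and the player's die independently; count matches.
matches = sum(
    1 for _ in range(TRIALS)
    if random.randint(1, 6) == random.randint(1, 6)
)
match_rate = matches / TRIALS

# Exact answer: whatever the DM rolls, the player matches it with
# probability 1/6, so P(match) = 6 * (1/36) = 1/6 ≈ 0.1667.
# The simulated rate lands very close to that.
```

So the two-dice-match house rule keeps the encounter frequency identical to "roll a 1 on a d6", exactly as ChatGPT said.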
A wolf howling at the moon in graffiti. AI and graffiti will get someone's attention.
I've been opening all the LLMs in my Chrome tabs manually 😢
Amazing Video
Are you going to do a video on the new up-and-coming storyboard platform called Mootion Storyteller, from Unity I think? I just got something about it in my email. It's sort of like Katalyst AI, only it can continue on to actually make videos.
42... That's crazy
Psst! Don't tell the tech industry there will eventually be just one dominant model, "the AI," just like there is just one dominant search engine, or the AI bubble will burst too early. Let them marinate a bit until they realize it's the year 2000 all over again and they've learned nothing.
Who is they?
73 and 37 are the most popular pseudo-random numbers.
You can compare models in Poe AI, and it has all kinds of big models.
Do you use any AI on your videos? I really like the editing on them!
When are Suno and Udio gonna get some seriously stiff competition? Also, 15 min is _short_ to you? 🤔
good one
thanks for the good work
What's the one with least censorship?
It's a very bad thing that each one is giving the same answer to questions that should have a random distribution. Even explanations of static concepts should be slightly different from model to model.
Funny enough, 37 is, I think, one of the numbers humans most commonly pick as a "random number" from 1-100.
I'm sure somebody has probably already pointed this out, but 37 is not random either. It's the number humans most commonly give when asked to pick a random number between 1 and 100 (see the Veritasium video about it), meaning this is the "random number" with the highest probability given the context of the previous tokens (i.e., the training data says 37 or 42 are the most common values humans give when asked for a random number).
It could also be because a lot of us are @KevinSmith fans & it's a secret way we identify ourselves. If you hear "in a row?" when the number 37 is mentioned, then it's definitely giving 'one of us' vibes of one sort or another ❤
I tried the Pi LLM with the number test and it gave me 54.
Google Imagen is getting better day by day.
thx
I don't think you can evaluate the difference between those models with such random and simple prompts. You need a more complex scenario or code to see where the variance lies.
By Christmas, everything will be perfect.
37 and 73 are two of the most "random" feeling numbers if you ask people for a number between 1 and 100 too. There's a good RUclips video about 37
Am I the only person that uses them for poetry/song lyrics? (Which they also suck at)
So they are all wrong... meaning they could all provide the same false information on a given topic, because their core is the same?
11:25 You didn't mention it costs the most at $0.0650.
Hey Matt :)
So the moral of the video is that it doesn't really matter which one; just pick one.
Expect that, like other software, that will soon change, and each LLM will become specific to a type of content.
Well, if you say so, you should pick the oldest AI model to use.
Also, AIs answer with a random "seed", so you may get a better answer after a few more tries.
Of course, 42. Don't forget your towel.
Idk, I feel some pretty distinct differences between models. Meta, for example, struggles to keep track of the conversation. I use AI to develop and organize my D&D campaigns; if you ask Meta for a mission outline, it spits one out, but ask it to make some alterations to one aspect and then ask for an updated mission outline, and it adds and changes all kinds of stuff that it shouldn't have. Tell it that it changed stuff and ask for the original with just the altered aspect, and it acknowledges the changes then spits out completely different stuff. I've also noticed misspellings often help my images for some weird reason (for instance, typing "blade" often results in two blades extending from a hilt, where typing "blad" gives a much better-looking image with the proper single blade). GPT is great for keeping track of the conversation, but the characters and stories it generates tend to feel very generic; you have to really work with it to get something that feels original. Gemini sometimes struggles with keeping track of the conversation and doesn't appear to want to reference anything from previous sessions (though not nearly as bad as Meta), but its characters and story points are better than the others', with characters definitely having a more fleshed-out feeling. None of these really show up in single-question-style analysis, but they stand out to me over time using them.
I'm now getting into 'rent a chicken' service
there is a sort of dataset shortage
Same on video generations imo
In time ;)
These AI models are all approaching the same level of gathering all of mankind's knowledge based on algorithms, so we should now start working on mankind's problems of just getting along, for example communication and understanding each other before reacting. I came across a doctor (a professor at Stanford) who has his PhD in Math, Chemistry and Physics, and had a 5-hour conversation about how to efficiently teach his students (English is not their first language) the 5 W's, or how to 'LOVE'. The first step of anything is communication in the students' first language. I came across a possible solution to the first problem, communication, and that is Chinese devices (hardware and software): Timekettle. Today, hopefully, I will meet up with Dr. Carlos ... to mention a possible way to communicate with his foreign students in their first languages.
The cutting of words: "I've bucks" (five bucks), "ept 4" (GPT-4), etc.?
How about LMSYS
Very good for head to head (2 models)
Vercel is also nice... but their pro version is a little broken
They're all the same at this point; it's up to the individuals who use them. Everyone has the power to create cool stuff.
37 is best
So they do not know what a joke is? Because they all cited puns, not jokes.
I use ChatGPT 4 all the time to learn ARM 32- and 64-bit assembly. No other model besides Claude 3 Opus can do that kind of programming. I do not say this lightly; I have been doing this for months.
Maybe on benchmarks but real use case? Nah. Not even close.
I figured that the best AI is whichever one has the least bullsh*t restrictions on it
Poe might be a better choice than gmtech since there are more models to choose from. Although, a Poe subscription gives you a set number of credits per month, so if gmtech is unlimited that might be better value depending on how you use it.
A dragon with three heads, not a "tree headed" dragon... you're talking to a computer, not a human.
shower thought:
Phi-3 is not an LLM (large language model); it is technically an SLM (small language model) 😂
Matt, LLMs can do comedy. You're just not prompting them correctly. Try mine: "You are an experienced stand-up comedy writer. Write an original 3-minute stand-up routine, showing audience responses in brackets. Write a funny bit about Texas wildlife."
They are not the same. Answers may seem similar, but you should NOT search for differences in the surface text... Every model is good at specific things... Understanding comes with experience.