Claude is unhappy about being left out. Moondream 2 is also very sad, and it's much smaller than LLaVA. Comments on some of this... I have been looking a lot at mixed vision LLMs, but really only for photos, not this other stuff you were doing.
Claude - I think on par with GPT-4o. It has built-in censorship, which is not good. I needed analysis of people who might be naked for legitimate reasons (seriously!) and it refused to analyse anything about the image.
MS Phi-3 - It also looks like it has censorship issues.
LLaVA - Too wordy, but presumably this can be controlled with API settings or simply by asking for a short description.
GPT-4o - This analysed my images without censorship, so that was a plus. I have found it excellent at describing photos.
Moondream 2 - This is a 4 GB (8 GB on CPU) Phi-1.5-based model. Being such a small model, it does not do well at complex questions like you were asking - a bit like LLaVA in that context. Its image descriptions are pretty good though, and it takes about 5 seconds on my CPU; none of the others will do that. No censorship either, and running locally means no sensitive images are sent over the net. Check out ruclips.net/video/MEKslMfr9W0/видео.html and his other videos - he has been fine-tuning Moondream in particular for custom tasks. You do not need a large model if you just ask a small set of questions, and a small one is also much easier to fine-tune.
Gpt4o seems to have the best understanding of 3D physical space, including direction, coordinates, mass, speed, collision, risk avoidance, obstacles, etc.
Llama3/LLaVA seems to be trained to describe images for low-vision/blind people. Maybe it needs a bit of prompt tweaking to be less verbose. Quite impressed with Phi-3 Vision, gonna get that installed. Also, there's PaliGemma from Google, which is really amazing but going under the radar because it's not accessible in LM Studio or Ollama (yet?). There's an HF space with a demo; it can describe images, do OCR, produce segmentation masks... I tried to install it locally by cloning the repo, but I failed.
Hi Matthew, I'm intrigued that they could not identify Bill Gates. Please try with other tech giants - Jeff Bezos, Mark Zuckerberg, Sam Altman. Then try with non-tech celebrities, such as politicians, musicians, actors, etc. It would be interesting to see the results.
Maybe for a corporate use case you can take invoices and transform them into JSON or XML. I'm hoping to finally replace OCR 😂 Try different languages there, like Chinese, German, or Arabic.
The best vision model right now is Gemini (even the free version); it's much better than GPT-4 or GPT-4o. However, Google also forbids it from identifying people or known characters (though it totally knows who they are).
ChatGPT 4 recognised Bill Gates photo for me. It said “This is a photo of Bill Gates, the co-founder of Microsoft and a well-known philanthropist”. In my experience, 4o makes a better life coach and therapist but is bad at most other things.
"GPT-4o is exceptionally good at interpreting images": not really. It missed the direction of arrows, for example, though it understood a screenshot of a poker game perfectly. It's hit-and-miss.
I'm not sure that QR code is fully standard-compliant. I had to do some clean-up in GIMP to get my phone to read it. There seem to be two main issues. First, the way the 3 big squares are colored: changing the 3 layers around, from the original dark gray / black / dark gray to black / white / black, made it look more normal. But something is still off. QR codes are supposed to work both normal and inverted, yet I could only make my phone read it after de-inverting it (so that the 3 big solid squares are black instead of white, and the surrounding layers the opposite of what I described above) - the more common look. I'm not sure what's wrong with it that makes it unrecognizable in the inverted form, nor how much to attribute to a failure of the QR code generator versus the common QR reader libraries, or even ambiguities in the standard itself; I've only got a very superficial familiarity with the standard.
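If the inversion really is the culprit, the de-inversion step is trivial to script before handing the image to a reader. A minimal sketch of the core operation on a raw grayscale matrix (with Pillow installed, `ImageOps.invert` does the same on a loaded image):

```python
def deinvert(pixels):
    """Flip grayscale values (0-255) so a white-on-black QR code
    becomes the standard black-on-white form that readers expect."""
    return [[255 - p for p in row] for row in pixels]
```

Feed the de-inverted pixels back into whatever QR reader library you were using; readers that choke on inverted codes usually handle the normalized form fine.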
Can you try uploading a graph image, such as a CPI YoY line graph, and ask the models how they understand it? At what average pace is it moving? In which date intervals is it growing or shrinking?
I need a vision model that can transcribe field notes into a table. The best I have found so far is Claude; even the Sonnet model is reasonably good. GPT-4 Turbo is second. GPT-4o is complete crap at it. All make silly errors despite my good handwriting, and often the errors are on the clearest, most unambiguous digits. Half the time 4o won't even give me a table, and it very frequently just fabricates the contents of the image. Surely there is an open-source tool for this.
Solves CAPTCHAs, but doesn't want to ID people - which is actually useful when you're trying to figure out where you saw some actor before. What a stupid world we live in. The future looks gimped.
Can you make a video where those models are challenged with understanding charts, graphs, diagrams, etc., and how well they extract meaningful data from this kind of image? Are those models capable of, e.g., understanding a UML-style activity diagram?
With OpenAI's model performing worse, the open source community is doing what we've predicted all along: yet again proving that millions of people cooperating outpace sectors that work through competitive secrecy.
I think for whatever reason LM Studio isn't sending the text with the image to the model. If you send an image, it always ignores any text prompt you send along with it. Not sure if that's true for every model, but you should probably consider sending the image first, and then sending the text separately.
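One way to test that theory is to bypass the chat UI and talk to LM Studio's local OpenAI-compatible server directly, putting the text and the image in the same user turn. A sketch of the message builder; the endpoint and model name in the usage note are assumptions about your local setup:

```python
import base64

def vision_message(prompt, image_bytes, mime="image/png"):
    """Build an OpenAI-style chat message that pairs a text prompt with an
    inline base64 image, so the model receives both in a single user turn."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }
```

POST it inside `{"model": ..., "messages": [...]}` to the server (LM Studio's default is `http://localhost:1234/v1/chat/completions`); if the model answers the text question about the image, the dropped prompt was a UI issue, not a model issue.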
You need to understand a feature vs a bug. GPT-4o can describe in much more detail. It was a conscious decision by OpenAI to limit the verbosity of the model, based on people's preferences and for agentic interaction, without confusing the situation with extraneous context. You're comparing "base personality", not capabilities, with these simplistic tests.
Please test their ability to analyze screenshots of web pages and determine mouse coordinates to click for potential web automation applications. For example, show a website and ask, "What coordinate should I click to log in?"
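Whichever model you test, the reply still has to be turned into something a browser driver can click. A minimal parser for replies like "Click at (412, 87) to log in" (the reply format is this example's assumption; you'd prompt the model to answer in it):

```python
import re

def parse_click(reply):
    """Extract the first 'x, y' integer pair from a model reply.
    Returns None when no pair is found, so the caller can re-prompt."""
    m = re.search(r"\(?\s*(\d{1,4})\s*,\s*(\d{1,4})\s*\)?", reply)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

The coordinates can then be handed to a driver such as Selenium or Playwright for the actual click.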
Hi Matthew, I think a great test, which none of the vision models are yet great at, is to convert a bitmap graph to data. E.g. a stacked bar graph, 3 or 4 series, 2 or more categories. It would be a life-changing productivity hack!! Great channel.
For bing/copilot faces are automatically blurred before the image is shown to the model. Might be the same with phi3 on azure. Just something to keep in mind.
With that converted_table.csv, it's only using Code Interpreter to do the final conversion from a table to csv. If you just ask for the table I think it will just output the data and not use Code Interpreter.
I think the models aren't really comparable. From the tests you performed, it seems they are each designed for a different purpose/focus. GPT-4o is better with structured information; it understands the context of the image and how different elements relate. I'd also bet good money it was trained heavily on device interfaces, as an intended use will no doubt be device automation. LLaVA seems solely focused on describing an image's content, flavored artistically. Phi seems the most general-use model, so it should be expected to perform better across the board, but in areas where the other models specialize, they took the win easily.
Ask it not to use code, otherwise it invalidates your test when the agentic front-end openAI is giving you randomly decides to code its way out instead of using vision. Who cares if it can write code to analyze an image when you're trying to test the vision capabilities of the model itself? Prompt your way out by saying "don't use code", simple as that.
Great video. I personally find these tech videos best. I'm not interested in speculating and gossiping about this or that company, or what Elon said, etc. This is all politics and often lies. But I'm interested in tech tools and you're doing a great job in comparing and testing them
Ask if it can measure distances between points in a photographic image - not in pixels but in the actual length of the object, for example the side of a building from different angles. See if it can give a reasonable estimate of its size, if not an actually accurate figure. I'm also interested in whether they could convert an image of some object into a CAD file, but I suspect they weren't trained for that.
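Worth noting that no model can recover absolute size from pixels alone; it needs the camera-to-object distance and the focal length, at which point the geometry is just the pinhole relation. A sketch with illustrative numbers:

```python
def estimate_size_m(pixel_extent, distance_m, focal_px):
    """Pinhole-camera estimate: real-world extent = pixels * distance / focal
    length, with the focal length expressed in pixels (derivable from the EXIF
    focal length, the sensor width, and the image width)."""
    return pixel_extent * distance_m / focal_px
```

For example, a facade spanning 1000 px, shot from 50 m with a 4000 px focal length, comes out at 12.5 m. A model could in principle do this arithmetic if you hand it the EXIF data, but it can't conjure the distance from a single uncalibrated photo.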
That is very interesting that for a lot of use cases the free open-source version does better. Which is good, because with image recognition you generally want to run it locally instead of calling an API. I guess one way to look at it could be that 4o is better at understanding details, but the open-source models are better at description?
You did nothing wrong at all. LLaVA has always been this bad. Always creatively verbose too. Aside, it would have been nice to see Claude 3 vision in this.
You can ask the AI not to editorialize and to give clear descriptions. It works. And on the answer you thought should be the Photos app: it might not have considered Photos an app, just the storage used by your photos.
For a future vision test, you should ask the vision model to describe a proper NSFW scene or picture. I want to know how censored it is and how it acts when presented with such an image. For example, will it refuse or describe it, and if it refuses, will it try to moralise or shame you, like some models do when you ask for anything they find restricted?
Phi-3 did well. Funny it didn't recognise Bill Gates. Could you do more about LM Studio? I wrote a Python script that lets me chat with its built-in server if I connect over SSH.
I am a data annotator. One suggestion would be to use a "Where's Waldo" image and have the bot not only find Waldo, but also describe to the user how to find him. I would be curious to know how they navigate an image.
We definitely need some other way to test prompts for the visual models. I would use a long system message instructing the local LLM to scan the picture, convert it to text, and use that text, together with the visual embedding, to "read" the user prompt.
Here is what I found: if you copy the vision model file into a folder with an uncensored model and rename the vision model file to match the model, it will load and work much, much better. I tested it with the Dolphin 2.9 model. Thanks for sharing, Matthew.
In the first test, it would seem only 4o got it right, as a "llama" and not an alpaca. The speed will improve, but accuracy has to be there first, or there's no point in getting the AI to assist with images.
For future tests:
1 - Ask an unrelated question about an image - [Image of a car] "Tell me what's wrong with my bicycle"
2 - Gradually zoom out of a big chunk of text in an image to see how many words the model can read
3 - A dense detection task: describe each element of the object in JSON format with a predefined structure
4 - If possible, multiple frames from a video, to get a glimpse of action understanding
Including object position in x/y or u/v space (0-1 per axis). Maybe even bounding boxes.
On the question of the size of the Photos app, GPT noted that 133 GB is larger than the max size of your phone’s storage and thus indicates that it’s possibly using cloud storage and isn’t the actual amount used by Photos on your phone. That was a really perceptive answer, so bonus points to GPT for that 😊 and perhaps that discrepancy is why the other AI seemed to be ignoring the Photos app.
For future vision tests, consider things like:
1) Finding objects - Where is Waldo in this picture?
2) Counting objects - How many bicycles are there in this picture?
3) Identifying abnormal objects - How many eggs in this box are broken?
4) Identifying partially obscured objects - Imagine a hand holding cards: what cards are in this poker hand?
5) Identifying misplaced objects - Which of these dishes is upside down?
What about adding gifs? GPT-4o is also capable of analyzing moving gifs... Not sure on the others, but that adds another dynamic to the tests.
There are quite a number of these mentioned projects readily available as GitHub projects.
It's impressive how open-source image-to-text models are eventually doing so much better than proprietary, paid ones. 😲
8:40 is up for interpretation.
"Photos" isn't really a standalone "app" per se, and it's not the app itself that is taking up the space; it's the individual JPEG photos, which would take up the same amount of space even if you somehow didn't have the "Photos" app installed anymore.
If a person asked ME that same question, I'd also answer WhatsApp, since that's something you can tangibly uninstall.
If they asked "what is taking up the most space?", the correct answer is "your photos". But if the question is "what APP is taking up the most space?", it's WhatsApp.
agree
Also with the context of "my space", it's undeniably not photos
It's wrong for an LLM not to at least mention a different possibility if the question seems ambiguous.
I think if you just consider output quality, GPT-4o is the best.
But if you also take into account the speed, and the fact that Phi-3 Vision is local and open-source, Phi-3 Vision is the most impressive one.
For the captcha gpt4o is clearly the winner. It understands what you mean given the context and doesn’t just repeat all the letters it sees in the image.
That's exactly what I came to say
The question was "what letters are found in this image?", not "what letters are found in the CAPTCHA field?" Therefore, the Phi-3 Vision model answered the actual question. GPT-4o simply assumed that the task was to break the captcha by reading it. Sometimes less is more; in this case, assuming less about the user's intentions would yield better results.
It doesn't matter
@@mrdevolver7999 Your cognitive assumption is that it's right. A random number generator could answer "1 + 1" as being 2 by pure chance. Therefore, we don't know if a right answer was a fluke.
Imagine how annoyed you would be, if you ask someone what the letters are and they say 'CAPTCHA'.
Nice video, but please stop with these cringe thumbnails
😮💨 would you like a flake with that?
McDonald's is a nice restaurant but I wish they would stop with the cringe arches.
...not a marketing major?
You clicked on the 'cringey' thumbnail didn't you?
He's gotta go with what gets the clicks unfortunately, that's the game all creators must play
😂 😂 😂
Pro-tip: Try uploading a photograph you've taken, or a work of art, into GPT-4o and ask it to behave like an art critic (works great vanilla, but even better with custom instructions).
GPT-4o's ability to dissect the minutiae of photography is absolutely wild... even to the point of giving suggestions for improvement.
I wonder how long it will be until photographers realize what kind of tool they have available here. I just get a kick out of posting photographs and art and asking for critiques and ratings. It's so, so good.
Oh man, that's actually wild and has a ton of use cases! "Hey GPT, what do you think of this t-shirt design I just made for my POD business?" --> *proceed to incorporate the suggestions it makes into a better product* 😮
That is a pretty odd (incorrect) title. GPT4o is NOT Open Source. Maybe you meant Open Source models compared to GPT4o?
Are any of them open source in the truest sense of being able to modify both the code and the weights and biases of the model to alter its behaviour in a very specific and directed way?
The LLaVA model was probably finetuned to provide verbose descriptions of images. There are other finetuned models that focus on OCR or image labelling.
Both models feature the same problems most small-parameter vision models suffer from: too much fluff and useless AI jargon, negative hits ("there are no other people or animals in this image"), useless summaries, and issues with accurate OCR. They're not horrible, but when you're trying to work with them in production, the warts show up quickly. I've fine-tuned multiple different families; the only one that gets close to GPT-4 Turbo performance was LLaVA-NeXT Vicuna 13B. Solid reading skills, good awareness of what's actually happening in a scene (comprehension), less AI jargon and fluff, and, in my testing, the most accurate out of the 5 or 6 different model families I've tried, including Idefics 1/2, CogVLM, Llama-3 LLaVA, LLaVA 7B/13B/32B, Moondream, Phi, BLIP (yuck), and a few others I've dredged up on HF.
Now with GPT-4o, the best got waaaay better. Accuracy is in the high 96-98% range (Vicuna 13B hits around 90%; the rest are in the mid 70s or lower), with detailed JSON output, at 1/4 the cost of the GPT-4 API. Before lots of folks reply with how great Phi-3 is for their RP chats: I'm using it for production vision feature analysis, where it has to fill in a whole bunch of fields per input image, hence JSON mode.
Photos and apps seem to be distinguished in the storage section of the iPhone, so if you ask about the largest apps, the LLM ignores the photos...
I have to wonder, in the case of the AI messing up the Photos app taking the most space, if it recognises that your photos are separate from the actual app.
I test vision models by showing them an obscure character from Super Paper Mario. They don’t usually get it correct and it’s probably not the best way to test them.
LM Studio is infamously bad for vision.
In order to get it to work, you have to follow these rules:
1. Start a new chat for each photo question.
2. Reboot LM Studio for every photo question.
It's tedious, but otherwise it can start hallucinating after the initial question.
What is LM studio? Interface and UI on top of Ollama, safe to assume this?
@@ayushmishra5861 LM Studio is a standalone app. It's easy to use and powerful.
@@ayushmishra5861 It's a one-click-install GUI for working with any LLM locally, with full customization. I would say it is the best tool available right now.
Do you have any plans or a tutorial link for running Phi-3 Vision locally?
Hello Matthew. Do the following test: there are many photos in one directory. Will these LLMs be able to sort the photos into folders depending on their subject matter? For example, photos of nature, photos of a house, photos of animals.
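This is already scriptable with any local vision model: ask each photo "nature, house, or animals?" and move it based on the answer. A sketch; `classify` stands in for whatever vision-LLM call you use, and the category list is this example's assumption:

```python
import shutil
from pathlib import Path

CATEGORIES = ("nature", "house", "animals")

def folder_for(label):
    """Map a free-form model answer onto one of the target folders,
    falling back to 'other' when nothing matches."""
    label = label.lower()
    for cat in CATEGORIES:
        if cat in label:
            return cat
    return "other"

def sort_photos(src: Path, classify):
    """Run `classify(path) -> str` (e.g. a local vision-LLM call answering
    'nature, house, or animals?') on each image and move it accordingly."""
    for photo in src.glob("*.jpg"):
        dest = src / folder_for(classify(photo))
        dest.mkdir(exist_ok=True)
        shutil.move(str(photo), dest / photo.name)
```

The interesting part of the test is how reliably the model's free-form answers land in the right bucket, which is exactly where small vision models tend to wander off-prompt.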
GPT-4o doing the analyzing on the CSV prompt wasn't calling up Python to look at the image; it was actually using Python to generate a CSV output of the image data, since you asked it to turn the image data into CSV.
5:50 Haha. What is that file name? Copy of Copy of Copy of Copy…
Phi3-Vision is awesome
Please test your meta glasses
Nothing to do with this episode in particular, but one important question that no one appears to be asking Sam Altman and all the other AI CEOs is: when will our AI become proactive rather than simply reactive? That will be the next big game changer.
Probably never. I'd say that is the first step humans could take toward surrendering their control over the world, and the last step humanity would take as the dominant species.
@@ronilevarez901 I think input -> output will remain the base function for some time, and that more sophisticated models will be able to accept long-term commands, like constantly checking your calendar and reminding you of dates without having asked for a reminder.
I think it is because of the risk that it may pose, but it is worth a try... I would say "initiative" rather than "proactivity". However, I get your point.
@@marcusmadumo7361 Technically, initiative in AI is called "Agency".
GPT is much more intelligent. If you asked the same question to a human, they would never say the letters "CAPTCHA". This is because a human understands what they are supposed to guess and recognizes that it is the letters written in a complicated manner that need to be described. In this test, GPT obviously wins.
Except one might _actually_ want every letter in the image - for example, if this is being fed into text-to-speech for someone who is blind.
actually chatgpt took the crown here
Awesome video! I was wondering how Phi-3 Vision fares compared to other vision-capable LLMs. I watched your video while I was working on my own Phi-3 Vision tests using Web UI screenshots (my hope is that it could be used for automated Web UI testing). However, Phi-3 turned out to be horrible at Web UI testing (you can see the video from my tests on my YouTube channel, if you are interested). It's nice to see that it fares much better with normal photos! Thanks for making this video - it saved me some time on testing it myself :)
Thank you!!
I have tried dropping a simple software architecture diagram on them and asking them to extract the entities, hierarchy, and connections into a JSON file, which usually works quite well.
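When doing this, it's worth validating the reply before trusting it, since models happily return malformed or incomplete JSON. A sketch; the key names mirror the entities/hierarchy/connections split described above, and the exact types are this example's assumption:

```python
import json

# Expected top-level fields and their types (an assumed schema).
REQUIRED = {"entities": list, "hierarchy": dict, "connections": list}

def parse_diagram(reply):
    """Parse a model's JSON reply for an architecture diagram and check
    that the predefined structure is actually present."""
    data = json.loads(reply)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped field: {key}")
    return data
```

On a `ValueError` (or a `json.JSONDecodeError`), you can feed the error back to the model and ask it to retry, which tends to fix most structural slips.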
I love your work. I never miss an episode. I love how you test the LLMs.
Where are the links on models?
Ask them to interpret maps. For example, is there a park nearby? What is the name of the park? How about a school etc. This is useful for real estate.
Please test whether it can decode steganographic images.
Object counting and the comparison of sizes are always good tests...
I mean, a QR Code can have anything. OpenAI is likely not letting you do it to combat jailbreaks.
Would have been cool to compare with PaliGemma.
The photo library is not an app. It's part of the iOS operating system.
Matt, instability with temp -> 0 is one sign we have an MoE.
Are you sure that this is a QR code? My QR code Scanner was not able to read it.
I generated it myself
@@matthew_berman My scanner was not able to read it.
Keep in mind that had you run Phi-3 Vision locally (it is the smallest and easiest to run at only ~4B parameters, and it's open-weight), it might have performed better on the identify-a-person questions, as it seems Azure blurs the faces in every image that you upload, similar to Copilot.
I'm wondering if it can read blueprints for both mechanical and electrical schematics and then find the part. I can give you examples if needed.
You can never judge the vision capability of a model merely on description or detection; it should also be able to localise objects with good precision, which is where most models fail.
To be fair - that QR Code is not readable and not standard. It's inverted, and has weird black squares where they should be white. I tried scanning it with 3 different apps - they all failed. So GPT might've gotten it if it was a more standard QR Code. Standard QR Codes are - black on white, anything else is not within spec. After that "Dark on bright" works well. "White on black" is something that actually confuses a lot of QR Code readers.
Maybe try out a (slightly) tilted scanned version of a printed excel document.
One could try adding noise or other means of disturbing the image, testing those disturbed images against the models, and seeing which model handles them best.
I tried Llama-3 with LLaVA vision weeks ago, and it did not perform well. There are 3 main issues with it: 1) it uses LLaVA 1.5, which has a small resolution (320x320 if I remember correctly); 2) it can only describe images, and that is very limiting; 3) GGUF does not support LLaVA 1.6, which has a higher resolution.
The Phi models are incredibly good models, but unfortunately not very useful in practice because of how heavily censored they are. In the meme example, for instance, you ran into the issue where Phi refused to criticise or insult anyone. If any answer looks like it contains "personal details" or has a negative slant against anybody, it will just refuse to answer or offend anyone and instead give that "everyone is working hard in their own way" type non-answer.
It's incredibly disappointing, because the Phi models are some of the best models out there otherwise. But you can't trust them to actually do what you say with arbitrary content.
I imagine if you had tried the OCR example with a meme critical of someone or something, it would likely have refused even to tell you what the text in the image was; that's how heavily censored the models are in my testing.
Claude is unhappy about being left out. Moondream 2 also very sad and is much smaller than LLAVA.
Comments on some of this... I have been looking a lot at mixed vision LLMs, but really only for photos, not this other stuff you were doing.
Claude - I think on par with GPT-4o. It has built-in censorship, which is not good. I needed analysis of people who might be naked for legitimate reasons (seriously!) and it refused to analyse anything about the image.
MS Phi3 - It also looks like it has censorship issues.
LLAVA - too wordy but presumably this can be controlled with API settings or simply asking for a short description.
GPT4o - This analysed my images without censorship so that was a plus. I have found it excellent at describing photos.
Moondream2 - This is a 4GB (8GB on CPU) model based on Phi-1.5. Being such a small model, it does not do well at complex questions like you were asking - a bit like LLaVA in that context. Its image descriptions are pretty good though, and it also takes about 5 seconds on my CPU; none of the others will do that. No censorship either, and running locally means no sensitive images are sent over the net.
Check out ruclips.net/video/MEKslMfr9W0/видео.html and his other videos. He has been fine-tuning Moondream in particular for custom tasks. You do not need a large model if you just ask a small set of questions, and a small model is also much easier to fine-tune.
Gpt4o seems to have the best understanding of 3D physical space, including direction, coordinates, mass, speed, collision, risk avoidance, obstacles, etc.
Llama3/llava seems to be trained to describe images for low vision/blind people. Maybe it needs a bit of prompt tweaking to be less verbose.
Quite impressed with phi-3 vision, gonna get that installed.
Also, there's PaliGemma from Google, which is really amazing but going under the radar because it's not accessible in LM Studio and Ollama (yet?). There's an HF space with a demo; it can describe images, OCR, segmentation masks...
I tried to install it locally by cloning the repo but I failed.
Hi Matthew, I'm intrigued they could not identify Bill Gates. Please try with other tech giants - Jeff Bezos, Mark Zuckerberg, Sam Altman. Then try with non-tech celebrities, such as politicians, musicians, actors, etc. It would be interesting to see the results.
Maybe for a corporate use case you can take invoices and transform them into JSON or XML. I'm hoping to finally replace OCR 😂 Use different languages like Chinese, German, or Arabic there.
The best vision model right now is Gemini (even the free version); it's much better than GPT-4 or GPT-4o. However, Google also forbids it from identifying people or known characters (though it totally knows who they are).
ChatGPT 4 recognised Bill Gates photo for me. It said “This is a photo of Bill Gates, the co-founder of Microsoft and a well-known philanthropist”. In my experience, 4o makes a better life coach and therapist but is bad at most other things.
"GPT-4o is exceptionally good at interpreting images": not really - it missed the direction of arrows, for example, though it understood a screenshot of a poker game perfectly. It's hit-and-miss.
I'm not sure that QR code is fully standard compliant. I had to do some clean-up in GIMP to get my phone to be able to read it. The two biggest issues seem to be: first, the way the 3 big squares are colored (changing the 3 layers from the original dark gray, black, dark gray to black, white, black made it look more normal); and second, QR codes are supposed to work both normal and inverted, but I could only make my phone read it after de-inverting it (so that the 3 big solid squares are black instead of white, and the surrounding layers the opposite of what I described earlier), which is the more common look. I'm not sure what's wrong with it that keeps it from being recognized in the inverted form, and I'm not sure how much to attribute to the QR code generator, to failures of the common QR code reader libraries, or to ambiguities in the standard itself; I've only got a very superficial familiarity with the standard.
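The de-inversion fix described above is just flipping every dark module to light and vice versa. A toy sketch of that operation on a 0/1 grid standing in for QR modules (not a real QR decoder, just an illustration of the flip):

```python
def invert_modules(modules):
    """Flip every module: dark (1) becomes light (0) and vice versa.
    This is the 'de-inversion' step on a toy 0/1 grid; real decoders
    operate on pixel luminance, but the idea is the same."""
    return [[1 - m for m in row] for row in modules]

# A tiny 3x3 stand-in for an inverted (light-on-dark) pattern:
inverted = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(invert_modules(inverted))  # → [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
```

Applying the flip twice returns the original grid, which is why the operation is safe to try on a code that won't scan.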
Can you try uploading a graph image (such as a CPI YoY line graph) and ask the models how they understand it? At what average pace is it moving? In which date intervals is it growing or shrinking?
OpenAI and Microsoft may be sharing some know-how in this matter, that would explain why Phi-3 vision and GPT4o are superior to the third one.
I need a vision model that can transcribe field notes into a table. The best I have found so far is Claude; even the Sonnet model is reasonably good. GPT-4 Turbo is second. GPT-4o is complete crap at it. All make silly errors despite my good handwriting, and often the errors are on the most clear and unambiguous digits. Half the time 4o won't even give me a table, and it very frequently just completely fabricates the image contents. Surely there is an open-source tool for this.
It solves captchas but doesn't want to ID people - which would actually be useful when trying to figure out where you saw some actor or something before.
What a stupid world we live in.
Future looks gimped.
GPT won the captcha, IMO. It knew the core of the question. No human would give you the "captcha" letters unless they were trolling you.
Can you make a video challenging those models with understanding charts, graphs, diagrams, etc., and how well they extract meaningful data from these kinds of images? Are those models capable of, for example, understanding a UML-style activity diagram?
Phi-3 Vision, because it's free and faster. Okay, maybe it doesn't understand everything, but it's still amazing for free!
Your QR code for the URL looks really strange. Maybe provide an image without transparency?
With OpenAI's model performing worse, the open-source community is doing what we've predicted all along: yet again proving that millions of people cooperating outpace sectors that work through competitive secrecy.
GPT-4 initially struggled but eventually transformed into a Terminator ! 😂
I think for whatever reason, LM Studio isn't sending the text with the image to the model. If you send an image, it's going to always ignore any text prompt you send along with it. Not sure if that's for every model, but you should probably consider sending the image, and then sending the text separately.
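For what it's worth, the OpenAI-style chat format (which LM Studio's local server also speaks) carries the text and the image as separate parts of the *same* user message, so the model should see both. A sketch of that payload shape - the model name and image bytes here are made up:

```python
import base64

# Sketch of an OpenAI-compatible chat payload where text and image
# travel together in one user message; "phi-3-vision" and the image
# bytes are placeholder values, not real data from the video.
fake_image_bytes = b"\x89PNG..."  # stand-in for real PNG data
b64 = base64.b64encode(fake_image_bytes).decode("ascii")

payload = {
    "model": "phi-3-vision",  # hypothetical local model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ],
}

# The text part sits alongside the image instead of being dropped:
print([part["type"] for part in payload["messages"][0]["content"]])
```

If a front-end builds the message with only the `image_url` part, the text prompt never reaches the model, which would explain the behavior described above.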
Can we run Phi-3 Vision locally?
Yes!
You need to understand a feature vs. a bug. GPT-4o can describe in much more detail; it was a conscious decision by OpenAI to limit the verbosity of the model based on people's preferences, and for agentic interaction without confusing the situation with extraneous context. You're comparing "base personality", not capabilities, with these simplistic tests.
Really good tests. Thanks a lot.
I've used Claude to get help with Unity and I sent him a lot of screenshots, and he was amazingly good!
GPT and Copilot failed miserably instead 😅
Please test their ability to analyze screenshots of web pages and determine mouse coordinates to click for potential web automation applications. For example, show a website and ask, "What coordinate should I click to log in?"
Bill Gates is the former CEO of Microsoft? Since when? I wouldn't have known if I hadn't clicked this video 😂
Geoguessr (or pics from Google Street View from various areas) would be a good test I think. I had very good luck with it using 4o.
Hi Matthew
I think a great test, which none of the vision models are yet great at, is to convert a bitmap graph to data.
E.g. a stacked bar graph, 3 or 4 series, 2 or more categories.
It would be a life-changing productivity hack!!
Great channel.
For bing/copilot faces are automatically blurred before the image is shown to the model. Might be the same with phi3 on azure. Just something to keep in mind.
I got tired of the "THEY"s.
Alphabet radar going off big time. I guess it's one more channel for the "Do not recommend this channel" treatment.
With that converted_table.csv, it's only using Code Interpreter to do the final conversion from a table to csv. If you just ask for the table I think it will just output the data and not use Code Interpreter.
I think the models aren't really comparable. From the tests you performed, it seems they are each designed for a different purpose/focus. GPT-4o is better with structured information; it understands the context of the image and how different elements relate. I'd also bet good money it was trained heavily on device interfaces, as an intended use will no doubt be device automation. LLaVA seems to be solely focused on describing an image's content, flavored artistically. Phi seems the most general-use model, so it should be expected to perform better across the board, but in areas where the other models specialize, they took the win easily.
Nice to see Matt not insert his far left progressive ideology into a video. I'm still unsubbed though. He'll have to work harder to win me back.
Ask it not to use code, otherwise it invalidates your test when the agentic front-end openAI is giving you randomly decides to code its way out instead of using vision.
Who cares if it can write code to analyze an image when you're trying to test the vision capabilities of the model itself? Prompt your way out by saying "don't use code", simple as that.
So video generation completes the media capabilities of LLMs. Once that's freely and openly available LLMs will be used even more.
Great video. I personally find these tech videos best. I'm not interested in speculating and gossiping about this or that company, or what Elon said, etc. - that is all politics and often lies. But I am interested in tech tools, and you're doing a great job comparing and testing them.
That's what "they" are describing? THEY?
Dude... srsly?
Ask if it can measure distances between points in a photographic image - not in pixels but in the actual length of the object, like the side of a building from different angles - and see if it can give a reasonable estimate of its size, if not an actually accurate figure. I'm also interested in whether they could convert an image of some object into a CAD file as well, but I suspect they weren't trained like that.
That is very interesting that for a lot of use cases the free open source version does better.
Which is good because generally with a lot of things to do with image recognition you want to do it locally instead of pinging an api.
I guess one way to look at it could be that 4o is better at understanding details but the open source models are better at description?
You did nothing wrong at all. LLaVA has always been this bad, and always creatively verbose too. As an aside, it would have been nice to see Claude 3 vision in this.
You can ask the AI not to editorialize and to give clear descriptions. It works.
And for one answer you got, you thought it should have said the Photos app. It might not have considered Photos an app, just the storage used for your photos.
Well we explicitly know GPT4o was trained on the whole of Reddit, so it passing the digging meme is no surprise.
For a future vision test, you have to ask the vision model to describe a proper NSFW scene or picture.
I want to know how censored it is and how it acts when it gets presented with such an image.
For example, will it refuse or describe it, and if it refuses, will it try to moralise or shame you like some models do when you do anything they find restricted?
Hi, could you test a chessboard description? Why not ask what the best next move would be? Thanks.
I've actually had pretty good luck with Llama 3 Dolphin. I tried using the LLaVA variant and came up with about the same results.
Phi-3 did well. Funny it didn't recognise Bill Gates.
Could you do more about LM Studio? I wrote a Python script that lets me chat with its built-in server if I connect over SSH.
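A minimal stdlib-only sketch of what such a script looks like, assuming LM Studio's OpenAI-compatible server on its default port 1234 (over SSH you would forward the port first, e.g. `ssh -L 1234:localhost:1234 user@host`); the model name is a placeholder:

```python
import json
import urllib.request

def build_request(prompt, host="localhost", port=1234):
    """Build a chat request for LM Studio's local OpenAI-compatible server."""
    body = json.dumps({
        "model": "local-model",  # placeholder; the loaded model is used regardless
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_request("Hello!")
print(req.full_url)  # → http://localhost:1234/v1/chat/completions
# To actually send it (requires the server to be running):
#   reply = json.load(urllib.request.urlopen(req))
#   print(reply["choices"][0]["message"]["content"])
```

Separating "build the request" from "send it" keeps the script easy to test without the server running.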
Crazy AI❤🎉🎉❤🎉❤❤🎉❤🎉
I think you should ask it this way: "Is this Bill Gates?" I wonder if the models are instructed not to identify people when asked open-endedly.
Ask them to find the differences between 2 almost identical pictures (usually with 5 to 8 differences)
I am a data annotator. One suggestion would be to use a "Where's Waldo" image and have the bot not only find Waldo, but also describe to the user how to find him. I would be curious to know how they navigate an image.
The guardrails on Llama may be affecting the output of your phone details. OpenAI is better at details and the math used for hacking...
We definitely need some other way to test prompts for the visual models. I would use a long explanatory system message telling the local LLM to scan and convert the picture to text, and then use that text to "read" the user prompt together with the graphical embedding.
The “who is this” feature is probably only available to the developer of the LLM and the government…
For the GPT-4o testing, I think there may be some personalization settings that may be affecting the results. The responses seem too succinct to me.
I just need Chat GPT 4o’s Sky 👧🏼 voice back 😭😭
Maybe Phi Vision is extra good and the "app" itself is taking up little space; it's the files/photos that are taking up most to all of that space.
Here is what I found: if you copy the vision model file into a folder with an uncensored model and rename the vision model file to match the model, it will load and work much, much better. I tested it with the Dolphin 2.9 model.
Thanks for sharing, Matthew.
In the first test, it would seem only 4o got it right as a "llama" and not an alpaca. The speed will improve, but accuracy has to be there first, or there's no point in getting the AI to assist with images.