LLaVA 1.6 is here...but is it any good? (via Ollama)

  • Published: 6 Sep 2024
  • LLaVA (or Large Language and Vision Assistant) recently released version 1.6. In this video, with help from Ollama, we're going to compare this version with 1.5 to see how it's improved over the last few months. We'll see how well it describes a photo of me, if it can create a caption for an image, how well it extracts text/code from images, and whether it can understand a diagram.
    Resources
    * Blog - www.markhneedh...
    * LLaVA 1.6 release - llava-vl.githu...
    * Ollama - ollama.ai/
    * Ollama Python Library - pypi.org/proje...

Comments • 40

  • @thesilentcitadel
    @thesilentcitadel 12 days ago

    Hi Mark, thanks for the video and also for sharing the code via your blog page. It occurred to me that I didn't quite understand how the source image is being manipulated under the hood when it is used for inference. For example, in the white-arrow-on-a-blue-wall example, it seems that the image you used is 1000x667, but the supported resolutions for the model are indicated in three aspect ratios, up to 672x672, 336x1344, and 1344x336. The code you used doesn't specify which aspect the image was, so I am interested to understand this further. I wonder if a single image, such as the one with the code screenshot, would be better recognised if the source image size and the compatible input image size were taken into account, i.e. by splitting the input image up or some such.
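    One way to experiment with this (not something covered in the video) is to resize or tile the image yourself before handing it to Ollama, so you control exactly what the model receives. A minimal Pillow sketch, with hypothetical file names:

      # Hypothetical pre-processing step: shrink the image so its longest side fits
      # within LLaVA 1.6's 672x672 tile before sending it to the model. Whether this
      # actually improves recognition is something to test, not a documented behaviour.
      from PIL import Image

      def fit_to_longest_side(path, size=672, out_path="resized.png"):
          img = Image.open(path)
          img.thumbnail((size, size))   # keeps aspect ratio, longest side <= size
          img.save(out_path)
          return out_path

      # e.g. pass fit_to_longest_side("arrow-photo.jpg") as the image sent to the model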

  • @tiredofeverythingnew
    @tiredofeverythingnew 7 months ago +2

    Thanks Mark great video. Loving the content lately.

    • @learndatawithmark
      @learndatawithmark  7 months ago

      Glad you liked it! Let me know if there's any other topics you'd like me to cover.

  • @munchcup
    @munchcup 7 months ago +1

    I find that for more accurate results on text images it's easier to use pytesseract instead of LLMs, but for describing an image LLMs serve well. Hope this helps.
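    For reference, a minimal pytesseract call looks roughly like this (it assumes the Tesseract binary is installed on the system, and the file name is just an example):

      # Plain OCR on an image file with pytesseract - generally more reliable than an
      # LLM for extracting exact text, as suggested above.
      from PIL import Image
      import pytesseract

      text = pytesseract.image_to_string(Image.open("code_screenshot.png"))
      print(text)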

    • @learndatawithmark
      @learndatawithmark  7 months ago

      Oh interesting. I noticed that ChatGPT was using pytesseract, so perhaps they aren't even using GPT-4V at all when you ask it to extract text from images! Didn't think of that.

  • @troedsangberg
    @troedsangberg 5 months ago +1

    Comparing 7b models to ChatGPT is of course slightly misleading. I'm getting satisfactory results from 13b (fits in my GPU) and am quite happy using it for image captioning specifically.

    • @learndatawithmark
      @learndatawithmark  5 months ago

      Oh interesting. I tried 13b and 34b and was getting pretty similar results to what I showed in the video. Now you've got me wondering why I wasn't seeing better captions!

  • @utayasurian419
    @utayasurian419 9 days ago

    I have a single image, but I would like to run multiple prompts against that one image, which should already have been processed by LLaVA-NeXT. The goal is to avoid having LLaVA-NeXT embed the same image over and over again. Any suggestions?
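    A rough sketch of running several prompts against one image with the Ollama Python library is below; as far as I know Ollama doesn't expose a way to reuse the image embedding between requests, so the image is re-encoded each time, although the model itself stays loaded:

      # Sketch (not from the video): same image, several prompts, via the Ollama
      # Python library. The file name and model tag are illustrative.
      import ollama

      image = open("photo.jpg", "rb").read()
      prompts = [
          "Describe this image in one sentence.",
          "List the objects in this image.",
          "Write a short caption for this image.",
      ]
      for p in prompts:
          result = ollama.generate(model="llava:7b-v1.6", prompt=p, images=[image])
          print(p, "->", result["response"])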

  • @rajavemula3223
    @rajavemula3223 7 days ago

    Hey, I need help. I am new to VLMs and I want a model that I can ask questions about, and get descriptions of, a live cam feed. Which model would be a good fit for me?

  • @GaneshEswar
    @GaneshEswar 25 days ago

    Amazing

  • @JoeBurnett
    @JoeBurnett 7 months ago +3

    That arrow wasn’t pointing to the left as 1.6 indicated….

  • @PoGGiE06
    @PoGGiE06 5 months ago

    Great video, thanks.

  • @geoffreygordonashbrook1683
    @geoffreygordonashbrook1683 7 months ago

    What size version of this model were you using? Have you compared variants such as BakLLaVA? What did you need to do to get OpenAI to work? If you could show how to run various LLaVA models from Hugging Face, e.g. from The Bloke, that would be swell. Many thanks for all the helpful videos and insights!

    • @learndatawithmark
      @learndatawithmark  7 months ago

      I think it was this one - 7b-v1.6. There are 13b and 34b versions too, but I haven't tried those yet. They've also got a bunch of others, some based on Mistral/Vicuna (ollama.ai/library/llava/tags).
      Not sure how different those ones would be - I need to give them a try!
      I did actually have a LLaVA 1.5 vs BakLLaVA example in progress but then stopped when I saw that there was a new LLaVA model out. I'll have to get back to it.
      Re: OpenAI - when I asked it to extract the text, it kept throwing an exception when using pytesseract. So then I asked it what the code was doing (which it got right) and then I asked it to extract the code. And somehow that combination worked?!
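      For anyone who wants to compare the variants themselves, a rough sketch with the Ollama Python library (the exact tag names should be checked against ollama.ai/library/llava/tags):

        # Run the same image and prompt through several LLaVA tags and compare answers.
        # The tags listed here are assumptions - check the library page for what exists.
        import ollama

        image = open("diagram.png", "rb").read()   # hypothetical file
        for tag in ["llava:7b-v1.6", "llava:13b-v1.6", "llava:34b-v1.6"]:
            ollama.pull(tag)   # downloads the model if it isn't already local
            answer = ollama.generate(model=tag, prompt="Explain this diagram.", images=[image])
            print(tag, "->", answer["response"][:200])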

    • @josefsteiner8616
      @josefsteiner8616 6 months ago

      @@learndatawithmark I tried it with the 34b and I think it wasn't bad. I only had a screenshot of the diagram from the YouTube video, so the quality wasn't really good. Maybe you can try it with the original image. Here's the answer:
      "
      The image you've provided appears to be a diagram illustrating the concept of data transformation and processing from a simple structure to a more complex one.
      On the left side, there is a basic representation with two boxes connected by an arrow, labeled "From this ..." This could represent data in its most raw or unstructured form, where information may not be processed or integrated into any system yet.
      On the right side, we see a more sophisticated diagram representing a network or a set of interconnected systems. There are multiple boxes connected with lines indicating relationships or data flow. Each box is labeled with various terms such as "Node A," "Node B,"
      "Process," and "Service," which suggest that this represents a complex system where data goes through various processes and services before it reaches its final form.
      The arrow from the left to the right with the label "... to this!" implies that the data moves from a simple state on the left to a more structured or processed state on the right, possibly within a larger network of systems or as part of a workflow processing
      system. This could be used in educational materials to explain concepts such as data integration, data flow in complex systems, or the transformation process in information technology infrastructure.
      "

  • @user-yu2wr5qf7g
    @user-yu2wr5qf7g 4 months ago

    thx. very helpful. subscribed.

  • @techgeekal
    @techgeekal 7 months ago

    Hey, great content! In your experience, how big is the performance difference between the Ollama version of the model (compressed) and its original version?

    • @learndatawithmark
      @learndatawithmark  7 months ago

      I tried the last few examples that didn't work well on the 7b model with the 13b and 34b models and I didn't see any better results. My impression is that this model is good with photos but struggles with other types of image.

  • @bennguyen1313
    @bennguyen1313 6 months ago

    I have PDF files of handwritten data that I'd like to OCR, perform calculations on, and finally edit or append the PDF with the results.
    I like the idea of using a Custom GPT, but only GPT-4 Plus subscribers could use it. So I'd prefer a standalone browser or desktop solution that anyone can drag and drop a file into. However, I'm not sure if GPT-4's Assistants API has all the Vision / AI PDF plugin support.
    If using Ollama, would anyone who wants to use my application also need to install the 20 GB Ollama setup?

    • @learndatawithmark
      @learndatawithmark  6 months ago +1

      You'd need to host an LLM somewhere if you want to create an application that other people can use. Unless you're having them run the app locally, I think it'd be better to use one of the LLM hosting services.
      Maybe something like replicate? replicate.com/yorickvp/llava-13b/api
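      A rough sketch of calling a hosted LLaVA on Replicate instead of shipping Ollama with the app (the model reference and input field names are assumptions - check the API page above for the exact version and inputs, and set REPLICATE_API_TOKEN in the environment):

        # Call a hosted LLaVA 13b on Replicate; the user only needs an API token,
        # not a local 20 GB model download.
        import replicate

        output = replicate.run(
            "yorickvp/llava-13b",
            input={
                "image": open("scanned_page.png", "rb"),   # hypothetical file
                "prompt": "Extract all handwritten text from this page.",
            },
        )
        print("".join(output))   # the output is typically streamed back in chunks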

  • @efexzium
    @efexzium 5 months ago +1

    How can we use LLaVA to control the mouse?

    • @learndatawithmark
      @learndatawithmark  5 months ago

      To control the mouse? I don't think it can do that - why do you want it to do that?

    • @southVpaw
      @southVpaw 3 months ago

      My thought is to overlay a grid on a screenshot of your desktop (write a function to take a screenshot and apply the grid when you send a prompt or whatever), then ask LLaVA to respond with the grid location closest to the spot you want to click. Clean that response and send it to pyautogui to move the mouse to the correct spot. Prompt engineer to taste.
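      An untested sketch of that idea (grid size, prompt wording, and the response format are all assumptions to tune):

        # Screenshot + grid overlay -> ask LLaVA for the nearest cell -> move the mouse.
        import ollama
        import pyautogui
        from PIL import ImageDraw

        CELL = 100  # grid cell size in pixels

        shot = pyautogui.screenshot()
        draw = ImageDraw.Draw(shot)
        for x in range(0, shot.width, CELL):
            draw.line([(x, 0), (x, shot.height)], fill="red")
        for y in range(0, shot.height, CELL):
            draw.line([(0, y), (shot.width, y)], fill="red")
        shot.save("grid.png")

        resp = ollama.generate(
            model="llava:7b-v1.6",
            prompt="A red grid is drawn every 100 pixels. Reply only with 'col,row' "
                   "for the cell closest to the Submit button.",
            images=[open("grid.png", "rb").read()],
        )
        # Assumes the model answers with a clean 'col,row'; clean the response as needed.
        col, row = (int(v) for v in resp["response"].strip().split(","))
        # On HiDPI screens the screenshot may be larger than the mouse coordinate space,
        # so a scaling factor may be needed here.
        pyautogui.moveTo(col * CELL + CELL // 2, row * CELL + CELL // 2)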

  • @Ravi-sh5il
    @Ravi-sh5il 4 months ago

    Hi, I am getting this:
    >>> /load llava:v1.6
    Loading model 'llava:v1.6'
    Error: model 'llava:v1.6' not found

    • @learndatawithmark
      @learndatawithmark  4 months ago

      Did you pull it down to your machine? See ollama.com/library/llava
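      In other words, the model has to be downloaded before /load (or ollama run) can find it; roughly:

        # Download the llava:v1.6 model so it is available locally.
        import ollama

        ollama.pull("llava:v1.6")   # equivalent to `ollama pull llava:v1.6` in a shell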

    • @Ravi-sh5il
      @Ravi-sh5il 4 months ago +2

      @@learndatawithmark Thanks, I forgot to pull :)

  • @annwang5530
    @annwang5530 3 months ago

    Hi, can LLaVA be integrated with Groq?

    • @learndatawithmark
      @learndatawithmark  3 months ago

      I don't think Groq has any of the multimodal models available at the moment. But there are a bunch of GPU-as-a-service providers that keep popping up, so it should be possible to deploy it to one of them.
      One I played with a couple of weeks ago is Beam and now I kinda wanna see if I can deploy LLaVA there :D
      ruclips.net/video/WY6loJ6DYBA/видео.html

    • @annwang5530
      @annwang5530 3 months ago

      @@learndatawithmark thanks man

  • @tapos999
    @tapos999 4 months ago

    Which Mac was it running on?

  • @xiaofeixue7001
    @xiaofeixue7001 4 months ago

    Is this the paid version of ChatGPT?

    • @learndatawithmark
      @learndatawithmark  4 months ago

      Yes that's GPT-4. I don't think you can upload images to GPT-3.5?

  • @Ravi-sh5il
    @Ravi-sh5il 4 months ago +1

    How do I load the 23.7 GB llava-v1.6-34b.Q5_K_S.gguf? I currently have the 4.7 GB one.
    Can you please help me with this, brother?
    Thanks in advance!

    • @learndatawithmark
      @learndatawithmark  4 months ago

      Are you getting an error?

    • @Ravi-sh5il
      @Ravi-sh5il 4 months ago

      @@learndatawithmark I am unable to figure out how to load the 23 GB file into Ollama.
      Please help, give the command that can pull the 27 GB one.

    • @Ravi-sh5il
      @Ravi-sh5il 4 months ago

      @@learndatawithmark Actually I don't know how to load the 23.7 GB LLaVA on Ollama.
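      For anyone else stuck here: rather than importing the raw GGUF by hand with a Modelfile (vision models typically also involve a separate projector file), the simpler route is usually to pull a 34b build straight from the Ollama library. The exact tag, and whether a Q5_K_S quantization is published, should be checked on ollama.com/library/llava/tags; a rough sketch:

        # Pull and use the 34b LLaVA build from the Ollama library (tag is an assumption).
        # It's a 20+ GB download and needs enough RAM/VRAM to actually run.
        import ollama

        ollama.pull("llava:34b-v1.6")
        resp = ollama.generate(
            model="llava:34b-v1.6",
            prompt="Describe this image.",
            images=[open("photo.jpg", "rb").read()],   # hypothetical file
        )
        print(resp["response"])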

  • @varadhamjyoshnadevi1545
    @varadhamjyoshnadevi1545 7 months ago

    Hi,
    I have a few doubts about OpenAI. Could you please share your email address?