GPT-4o: Fine-tune OpenAI's Multimodal Model | Live Coding & Q&A (Oct 3rd)

  • Published: 28 Nov 2024

Comments • 23

  • @waytae1
    @waytae1 1 month ago +2

    This video is one of the most inspiring I've watched in the last 10 years. As a computer vision engineer, I really appreciate the clear explanations and the great ideas shared. Thank you so much for this amazing content!

    • @Roboflow
      @Roboflow  1 month ago +1

      Wow! I didn't expect such a warm response! This made my day!

  • @KAZVorpal
    @KAZVorpal 1 month ago +2

    A good question would be whether OpenAI is ever going to contribute anything real to AI research, theory, and progress.
    So far, all they've done is take somebody else's idea (using attention with a pre-trained transformer) and milk it endlessly without any transformative ideas of their own.
    They have not actually changed the state of the art at all, just re-engineered it and thrown ever more resources at it.
    There is no fundamental difference between GPT-4o and GPT-2. They just keep throwing memory, data, and RAG engineering plug-in gimmicks at it.

    • @Roboflow
      @Roboflow  1 month ago +1

      Eh... I won't try to defend OpenAI now because they really don't reveal much information about the models they launch and the optimizations they use.
      But I think that in the past OpenAI has pushed open research forward. The example that immediately comes to mind is CLIP, which revolutionized computer vision overnight.

    • @KAZVorpal
      @KAZVorpal 1 month ago +1

      @@Roboflow
      I agree that they are revolutionizing the APPLICATION of other people's ideas.
      That they keep what they're doing secret is a violation of the promises they made when initially raising money and support in the industry. At that time I was really excited about them as a source of open development, like I was about Linus Torvalds' Linux thirty-two years ago.
      But CLIP was simply a clever piece of re-engineering; it's a Vision Transformer. In other words, it's exactly the same underlying tech/theory as a GPT.
      Really, everything they've done has been taking "Attention Is All You Need", a paper by OTHER PEOPLE that was itself truly ground-breaking and transformative of theory, and applying it to various tasks. CLIP, DALL-E, and ChatGPT are all pretrained transformers using attention.
      Understand that we NEED someone to do that. Henry Ford didn't invent the automobile, but he made it more useful and accessible. It's good that they're doing that.
      But it bothers me that part of the lack of progress since then is that OpenAI has sucked all the air out of the industry, inhaling it ONLY to produce more engineering and no actual science or theory. And they're keeping the details secret, which also impedes the science and theory, since those are often based on things learned in exactly the kind of applied re-engineering they're doing.

  • @ramkumarkoppu
    @ramkumarkoppu 1 month ago +1

    Next, plan for Llama VLM fine-tuning with Roboflow datasets, probably object detection.

    • @Roboflow
      @Roboflow  1 month ago

      Hi @ramkumarkoppu! I'm curious: did you manage to realize that plan?

  • @cagataydemirbas7259
    @cagataydemirbas7259 1 month ago +1

    Is annotation necessary for the model to learn the details in an image, or just beneficial? For example, if I submit an image where there is a speed sign on the left side, children eating ice cream on the right side, a stork flying above, the weather is sunny, and the ground is asphalt and pavement, etc., do I need to annotate all of these for the model to learn them?

    • @KAZVorpal
      @KAZVorpal 1 month ago +1

      Understand that ChatGPT can't see or understand the image in any way at all. What it actually does is pass the image to a completely different system that can understand images, which then passes GPT a text description of it. So either you do the annotation, or a RAG (Retrieval-Augmented Generation) plugin has to do the annotation.
      GPT-4 cannot see images, make images, hear sounds, speak, or remember things, none of it. Those things are all faked using external plugins.

    • @Roboflow
      @Roboflow  1 month ago +1

      Hi @KAZVorpal (correct me if I'm wrong, but) starting with GPT-4, this is no longer true. In the case of omni models, text and images are equivalent inputs.
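
      For context, here's a minimal sketch of how an image is passed to GPT-4o as a first-class input, assuming the official OpenAI Python SDK; the file name, prompt, and model name are placeholders:

      ```python
      # Send a text prompt and an image to GPT-4o in a single request.
      # Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
      import base64
      from openai import OpenAI

      client = OpenAI()

      # Encode a local image as base64 so it can be sent inline as a data URL.
      with open("example.jpg", "rb") as f:
          image_b64 = base64.b64encode(f.read()).decode("utf-8")

      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[
              {
                  "role": "user",
                  "content": [
                      {"type": "text", "text": "Describe what is on the left and right side of this image."},
                      {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                  ],
              }
          ],
      )
      print(response.choices[0].message.content)
      ```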

    • @Roboflow
      @Roboflow  1 month ago +1

      Hi @cagataydemirbas7259, it all depends on the task you are performing. If we are talking about image captioning, image + description pairs would be used as training data, but if you want to train a VLM for an object detection task, you need to show the model examples of image + bounding-box pairs, as in the sketch below.
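
      A minimal sketch of what a single training example could look like for GPT-4o vision fine-tuning, assuming OpenAI's chat-style JSONL format; the way the boxes are serialized into the assistant text is an illustrative convention, not an official one:

      ```python
      # Write one detection-style training example to a JSONL file.
      # The message layout follows OpenAI's chat fine-tuning format; the box
      # serialization ("label: x_min, y_min, x_max, y_max" in pixels) is just
      # one possible convention -- pick a scheme and keep it consistent.
      import json

      example = {
          "messages": [
              {"role": "system", "content": "You are an object detection assistant."},
              {
                  "role": "user",
                  "content": [
                      {"type": "text", "text": "Detect all traffic signs and return their boxes."},
                      {"type": "image_url", "image_url": {"url": "https://example.com/street.jpg"}},
                  ],
              },
              {"role": "assistant", "content": "speed_limit_sign: 112, 40, 198, 131"},
          ]
      }

      with open("train.jsonl", "a") as f:
          f.write(json.dumps(example) + "\n")
      ```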

    • @KAZVorpal
      @KAZVorpal 1 month ago +1

      @@Roboflow Hey there, Roboflow. No, the omni models are plain LLMs, in fact lighter LLMs than the original GPT-4, which is why they have such humble names... If they really combined image and language, they'd either be GPT-5... or, more likely, have a whole new name, like Omnivision Chat or something.
      What they are increasingly doing is using sub-modules, RAG, plugins, and various tricks to make the models more useful without advancing the technology at all. They find this necessary not only because they are failing to make any breakthroughs in machine learning, but also because the amount of power and resources required for the full GPT-4 is overwhelming to them. Their latest preview model is actually an even lighter version, but it uses an interesting chain-of-thought technique to produce more consistent results with fewer resources.
      OpenAI still has not come up with a way to tokenize images and text in the same transformer, which is weird, because there are so many possible ways to do it, and they have the resources to try them.
      By the way, thanks for putting up with the terrible speech-to-text transcriptions in my comments. I just went back and edited them to be a bit more readable. I really should only comment from my computer.

  • @ramkumarkoppu
    @ramkumarkoppu 1 month ago +1

    What is the object detection inference latency with this, compared to YOLO-World?

    • @SkalskiP
      @SkalskiP 1 month ago

      It's slooooooow. 5-10s in my experience.

    • @ramkumarkoppu
      @ramkumarkoppu 1 month ago +1

      @@SkalskiP Probably run YOLO at the front end for real-time and cascade with GPT-4o for zero-shot?

    • @Roboflow
      @Roboflow  1 month ago

      VLMs (GPT-4o included) are not yet ready for real-time scenarios ;)
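
      For what it's worth, here is a rough sketch of the cascade idea suggested above: a fast detector runs on every frame, and only selected crops are sent to GPT-4o for slower zero-shot reasoning. This assumes the ultralytics and openai packages; the model names and prompt are placeholders.

      ```python
      # Cascade sketch: YOLO as the real-time front end, GPT-4o as the slow zero-shot back end.
      import base64

      import cv2
      from openai import OpenAI
      from ultralytics import YOLO

      detector = YOLO("yolov8n.pt")   # fast, runs on every frame
      client = OpenAI()               # slow, called only for selected crops


      def describe_crop(crop) -> str:
          """Ask GPT-4o for a zero-shot label of a single cropped detection."""
          _, buffer = cv2.imencode(".jpg", crop)
          crop_b64 = base64.b64encode(buffer).decode("utf-8")
          response = client.chat.completions.create(
              model="gpt-4o",
              messages=[{
                  "role": "user",
                  "content": [
                      {"type": "text", "text": "What object is this? Answer with a short label."},
                      {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{crop_b64}"}},
                  ],
              }],
          )
          return response.choices[0].message.content


      frame = cv2.imread("frame.jpg")
      for box in detector(frame)[0].boxes.xyxy:      # YOLO detections for this frame
          x1, y1, x2, y2 = map(int, box.tolist())
          print(describe_crop(frame[y1:y2, x1:x2]))  # only crops hit the VLM
      ```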

  • @austino2184
    @austino2184 1 month ago +1

    Terrific video

    • @Roboflow
      @Roboflow  1 month ago +1

      Thanks a lot! I'm really glad you liked it. There were some technical problems that stressed me out, but I'm happy it turned out well.

    • @austino2184
      @austino2184 1 month ago

      @@Roboflow I love the observations around coordinate input. Little hacks like that are hard to come by.
      Thinking about output structure is so important for LLM fine-tuning, but there isn't much out there on this topic. Thanks again.

  • @saadashraf1293
    @saadashraf1293 1 month ago +1

    Would this be good for detecting hateful or offensive meme pictures? Memes that have images along with text in them?

    • @Roboflow
      @Roboflow  1 month ago

      Very good question! I'm afraid not. During my stream, I mentioned censorship. OpenAI inspects your training data and analyzes it for violations of its terms of use, and I think data containing offensive meme pictures might be flagged. It's hard to train a model without data.

  • @14types
    @14types 1 month ago +1

    What was the name he said? "Easy raid"?

    • @Roboflow
      @Roboflow  1 month ago +1

      I messed up the name. Fast ReID: github.com/JDAI-CV/fast-reid