Referring Image Segmentation and Compositional Visual-Linguistic Models | Multimodal Weekly 41

  • Published: 2 Oct 2024
  • In the 41st session of Multimodal Weekly, we welcomed two researchers working in multimodal understanding.
    ✅ Mars Ha, a research scientist at Twelve Labs, presented his research on data augmentation for Referring Image Segmentation.
    Connect with Mars: seongsuha.gith...
    Check out his research thesis here: s-space.snu.ac...
    ✅ Mu Cai, a Ph.D. student at the University of Wisconsin-Madison, dove into compositional visual-linguistic models via visual markers and counterfactual examples.
    Connect with Mu: pages.cs.wisc....
    Check out his project ViP-LLaVA: vip-llava.gith...
    Check out his project CounterCurate: countercurate....
    Check out his project LLaVA-PruMerge: llava-prumerge...
    Timestamps
    00:12 Introduction
    03:00 Mars starts
    03:23 Referring image segmentation
    04:23 Referring scenarios and difficulties in RIS benchmarks
    08:55 Negative mosaic augmentation
    12:30 Experiments
    16:00 Future work
    17:21 Q&A with Mars
    23:45 Mu starts
    24:31 What does "Compositional" mean?
    24:55 Enhancing region understanding via composing images and visual prompts
    25:29 Current LMMs do a decent job at whole-image understanding
    26:13 How do existing LMMs have a sense of location?
    27:17 We can simply overlay the visual prompts onto the original image!
    29:56 The training process is very simple
    31:00 What are the visual prompts?
    31:43 Where does the dataset come from?
    32:40 Preliminary results
    35:41 Visual prompt understanding benchmark
    37:31 Visual prompting achieves the best performance
    38:33 Future work
    39:52 Enhancing visual-linguistic reasoning via composing counterfactual images and captions
    41:50 Quantitative results
    43:34 Use counterfactual reasoning
    46:18 Just use the original fine-tuning code
    46:39 Performance
    47:58 Discussion
    48:41 Let LLMs understand the visual world via composing natural language and SVG code
    49:12 How to translate pixels to text?
    49:48 LLMs can understand scalable vector graphics!
    50:32 Experiments
    51:13 Future work in efficient VLM
    52:07 Q&A with Mu
    01:04:45 Conclusion
    Join the Multimodal Minds community to receive an invite for future webinars: / discord
