Referring Image Segmentation and Compositional Visual-Linguistic Models | Multimodal Weekly 41
- Published: Oct 2, 2024
- In the 41st session of Multimodal Weekly, we welcomed two researchers working in multimodal understanding.
✅ Mars Ha, a research scientist at Twelve Labs, presented his research on data augmentation for Referring Image Segmentation.
Connect with Mars: seongsuha.gith...
Check out his research thesis here: s-space.snu.ac...
✅ Mu Cai, a Ph.D. student at the University of Wisconsin-Madison, dove into compositional visual-linguistic models via visual markers and counterfactual examples.
Connect with Mu: pages.cs.wisc....
Check out his project ViP-LLaVA: vip-llava.gith...
Check out his project CounterCurate: countercurate....
Check out his project LLaVA-PruMerge: llava-prumerge...
Timestamps
00:12 Introduction
03:00 Mars starts
03:23 Referring image segmentation
04:23 Referring scenarios and difficulties in RIS benchmarks
08:55 Negative mosaic augmentation
12:30 Experiments
16:00 Future work
17:21 Q&A with Mars
23:45 Mu starts
24:31 What does "Compositional" mean?
24:55 Enhancing region understanding via composing images and visual prompts
25:29 Current LMMs do a decent job at whole-image understanding
26:13 How do existing LMMs have a sense of location?
27:17 We can simply overlay the visual prompts onto the original image!
29:56 The training process is very simple
31:00 What are the visual prompts?
31:43 Where does the dataset come from?
32:40 Preliminary results
35:41 Visual prompt understanding benchmark
37:31 Visual prompting achieves the best performance
38:33 Future work
39:52 Enhancing visual-linguistic reasoning via composing counterfactual images and captions
41:50 Quantitative results
43:34 Use counterfactual reasoning
46:18 Just use the original fine-tuning code
46:39 Performance
47:58 Discussion
48:41 Let LLMs understand the visual world via composing natural language and SVG code
49:12 How to translate pixels to text?
49:48 LLMs can understand scalable vector graphics!
50:32 Experiments
51:13 Future work in efficient VLMs
52:07 Q&A with Mu
01:04:45 Conclusion
Join the Multimodal Minds community to receive an invite for future webinars: / discord