Referring Image Segmentation and Compositional Visual-Linguistic Models | Multimodal Weekly 41
- Published: Oct 2, 2024
- In the 41st session of Multimodal Weekly, we welcomed two researchers working in multimodal understanding.
✅ Mars Ha, a research scientist at Twelve Labs, presented his research on data augmentation for Referring Image Segmentation.
Connect with Mars: seongsuha.gith...
Check out his research thesis here: s-space.snu.ac...
✅ Mu Cai, a Ph.D. student at the University of Wisconsin-Madison, dove into compositional visual-linguistic models via visual markers and counterfactual examples.
Connect with Mu: pages.cs.wisc....
Check out his project ViP-LLaVA: vip-llava.gith...
Check out his project CounterCurate: countercurate....
Check out his project LLaVA-PruMerge: llava-prumerge...
Timestamps
00:12 Introduction
03:00 Mars starts
03:23 Referring image segmentation
04:23 Referring scenarios and difficulties in RIS benchmarks
08:55 Negative mosaic augmentation
12:30 Experiments
16:00 Future work
17:21 Q&A with Mars
23:45 Mu starts
24:31 What does "Compositional" mean?
24:55 Enhancing region understanding via composing images and visual prompts
25:29 Current LMMs do a decent job at whole-image understanding
26:13 How do existing LMMs have a sense of location?
27:17 We can simply overlay the visual prompts onto the original image!
29:56 The training process is very simple
31:00 What are the visual prompts?
31:43 Where does the dataset come from?
32:40 Preliminary results
35:41 Visual prompt understanding benchmark
37:31 Visual prompting achieves the best performance
38:33 Future work
39:52 Enhancing visual-linguistic reasoning via composing counterfactual images and captions
41:50 Quantitative results
43:34 Use counterfactual reasoning
46:18 Just use the original fine-tuning code
46:39 Performance
47:58 Discussion
48:41 Let LLMs understand the visual world via composing natural language and SVG code
49:12 How to translate pixels to text?
49:48 LLMs can understand scalable vector graphics!
50:32 Experiments
51:13 Future work in efficient VLMs
52:07 Q&A with Mu
01:04:45 Conclusion
Join the Multimodal Minds community to receive an invite for future webinars: / discord