Multimodal Reasoning, Video Instruction-Tuning & Explaining Vision Backbones | Multimodal Weekly 53
- Published: Oct 2, 2024
- In the 53rd session of Multimodal Weekly, we had three exciting researchers working on multimodal understanding and reasoning benchmarks, video instruction tuning, and explanation methods for Transformers and ConvNets.
✅ Xiang Yue, Postdoctoral Researcher at Carnegie Mellon University, introduces MMMU - a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
Follow Xiang: xiangyue9607.g...
MMMU: mmmu-benchmark...
✅ Orr Zohar, Ph.D. Student at Stanford University, introduces Video-STaR - a self-training approach for video language models that allows the use of any labeled video dataset for video instruction tuning.
Follow Orr: orrzohar.githu...
Video-STaR: orrzohar.githu...
✅ Mingqi Jiang, Ph.D. Student at Oregon State University, proposes explanation methods to gain insights into the decision-making of different visual recognition backbones.
Follow Mingqi: mingqij.github...
CDMMTC: mingqij.github...
Timestamps:
00:13 Introduction
02:25 Xiang starts
02:45 Progress of notable ML models (specifically multimodal models)
03:38 5 levels of AGI
05:23 From existing MM benchmarks to measuring expert AGI
06:18 MMMU - multi-discipline multimodal understanding and reasoning
07:09 Sampled MMMU examples from each discipline
07:32 Recognition of MMMU
08:18 Rigorous data curation process and high-quality data
10:44 Effective suite for tracking multimodal model development
12:16 Excellent model diagnosis tool
14:26 Error analysis + language as vehicle
15:32 Lack of knowledge
15:52 Perceptual error
16:13 Reasoning error
16:40 MMMU-Pro: expanded options and realistic visual content
19:21 Conclusion & acknowledgement
21:07 Orr starts
21:31 Why do we care about video-LLMs?
22:22 Collecting video instruction tuning data is hard
22:55 Existing annotation approaches
23:20 Resulting video instruction tuning datasets
24:07 Compute-dataset size tradeoff
25:10 Video-STaR - use any video label for video instruction tuning!
26:48 Answer generation
27:02 Label rationalization
27:34 Label verifier
28:12 Data flow
31:33 Source and generated datasets
32:32 Quantitative performance
34:34 Qualitative performance
37:05 Mingqi starts
37:30 Attribution map approaches for model explanation
38:05 ConvNets may only need a small number of parts
39:17 Structural attention graphs
39:52 Idea of this paper
40:35 Different behaviors from ConvNets and Transformers
42:22 Minimal Sufficient Explanations
44:22 "Compositional" behavior
44:50 "Disjunctive" behavior
45:20 Experiments
51:02 Cross testing and experiments
53:12 Conclusion
Join the Multimodal Minds community to receive an invite for future webinars: / discord