Multimodal Reasoning, Video Instruction-Tuning & Explaining Vision Backbones | Multimodal Weekly 53

  • Published: Oct 2, 2024
  • In the 53rd session of Multimodal Weekly, we had three exciting researchers working on multimodal understanding and reasoning benchmarks, video instruction tuning, and explanation methods for Transformers and ConvNets.
    ✅ Xiang Yue, Postdoctoral Researcher at Carnegie Mellon University, introduces MMMU - a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
    Follow Xiang: xiangyue9607.g...
    MMMU: mmmu-benchmark...
    ✅ Orr Zohar, Ph.D. Student at Stanford University, introduces Video-STaR - a self-training approach for video language models that allows any labeled video dataset to be used for video instruction tuning.
    Follow Orr: orrzohar.githu...
    Video-STaR: orrzohar.githu...
    ✅ Mingqi Jiang, Ph.D. Student at Oregon State University, proposes explanation methods to gain insight into the decision-making of different visual recognition backbones.
    Follow Mingqi: mingqij.github...
    CDMMTC: mingqij.github...
    Timestamps:
    00:13 Introduction
    02:25 Xiang starts
    02:45 Progress of notable ML models (specifically multimodal models)
    03:38 5 levels of AGI
    05:23 From existing MM benchmarks to measuring expert AGI
    06:18 MMMU - multi-discipline multimodal understanding and reasoning
    07:09 Sampled MMMU examples from each discipline
    07:32 Recognition of MMMU
    08:18 Rigorous data curation process and high-quality data
    10:44 Effective suite for tracking multimodal model development
    12:16 Excellent model diagnosis tool
    14:26 Error analysis + language as vehicle
    15:32 Lack of knowledge
    15:52 Perceptual error
    16:13 Reasoning error
    16:40 MMMU-Pro: expanded options and realistic visual content
    19:21 Conclusion & acknowledgement
    21:07 Orr starts
    21:31 Why do we care about video-LLMs?
    22:22 Collecting video instruction tuning data is hard
    22:55 Existing annotation approaches
    23:20 Resulting video instruction tuning datasets
    24:07 Compute-dataset size tradeoff
    25:10 Video-STaR - use any video label for video instruction tuning!
    26:48 Answer generation
    27:02 Label rationalization
    27:34 Label verifier
    28:12 Data flow
    31:33 Source and generated datasets
    32:32 Quantitative performance
    34:34 Qualitative performance
    37:05 Mingqi starts
    37:30 Attribution map approaches for model explanation
    38:05 ConvNets may only need a small number of parts
    39:17 Structural attention graphs
    39:52 Idea of this paper
    40:35 Different behaviors from ConvNets and Transformers
    42:22 Minimal Sufficient Explanations
    44:22 "Compositional" behavior
    44:50 "Disjunctive" behavior
    45:20 Experiments
    51:02 Cross testing and experiments
    53:12 Conclusion
    Join the Multimodal Minds community to receive an invite for future webinars: / discord
