LLM Chronicles #6.3 Multi-Modal LLMs for Image, Sound and Video

  • Published: 12 Jul 2024
  • In this episode we look at the architecture and training of multi-modal LLMs. After that, we’ll focus on vision and explore Vision Transformers and how they are trained with contrastive learning (OpenAI's CLIP and Google's SigLIP; see the loss sketch after the references below). Vision Transformers are the most commonly used building block in MLLMs with vision capabilities. Finally, we’ll get hands-on with Google’s open-weight PaliGemma, analysing its implementation to see these concepts in action in a real-world multi-modal LLM.
    Series website: llm-chronicles.com/
    🖹 Canvas and Colab Notebook:
    - LLM Limitations and Challenges: llm-chronicles.com/pdfs/llm-c...
    - Colab Notebook: colab.research.google.com/dri...
    🕤 Timestamps:
    01:32 - MLLM Architecture
    03:49 - Training MLLMs
    07:02 - Vision Transformer
    09:24 - Contrastive Learning (CLIP, SigLIP)
    12:35 - Lab: PaliGemma
    22:53 - Summary
    References:
    - Vision Transformer (ViT): arxiv.org/pdf/2010.11929
    - Survey of multi-modal LLMs: arxiv.org/pdf/2306.13549
    - Microsoft's CLAP: arxiv.org/pdf/2206.04769
    - SigLIP: arxiv.org/pdf/2303.15343
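
    Below is a minimal PyTorch sketch of the two contrastive objectives referenced above (CLIP's softmax loss and SigLIP's pairwise sigmoid loss). This is not code from the video or the papers; the function names are illustrative, and the fixed temperature/bias values stand in for what CLIP and SigLIP actually learn as trainable parameters:

        import torch
        import torch.nn.functional as F

        def clip_loss(img_emb, txt_emb, temperature=0.07):
            # L2-normalize so the dot product is cosine similarity.
            img = F.normalize(img_emb, dim=-1)
            txt = F.normalize(txt_emb, dim=-1)
            # (batch, batch) similarity matrix; matched pairs sit on the diagonal.
            logits = img @ txt.T / temperature
            targets = torch.arange(logits.size(0), device=logits.device)
            # Symmetric cross-entropy over both directions: image->text and text->image.
            return (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.T, targets)) / 2

        def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
            # SigLIP replaces the batch-wide softmax with an independent sigmoid
            # per image/text pair, so the loss needs no all-pairs normalization
            # and scales more easily to very large batches.
            img = F.normalize(img_emb, dim=-1)
            txt = F.normalize(txt_emb, dim=-1)
            logits = img @ txt.T * t + b
            # Labels: +1 on the diagonal (positive pairs), -1 everywhere else.
            labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
            return -F.logsigmoid(labels * logits).mean()

        # Quick check with random embeddings standing in for encoder outputs:
        img, txt = torch.randn(8, 512), torch.randn(8, 512)
        print(clip_loss(img, txt).item(), siglip_loss(img, txt).item())

    In both papers the temperature (and SigLIP's bias) are learned alongside the encoders; they are hard-coded here only to keep the sketch self-contained.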

Comments • 6

  • @Heart-Stories3D • 1 day ago • +1

    Thank you very much 🙏

    • @donatocapitella • 1 day ago

      @@Heart-Stories3D Thank you for your comment! :)

  • @En1Gm4A • 11 days ago • +1

    Thanks - highlight of the day

  • @micbab-vg2mu • 11 days ago

    Great talk - thank you

  • @Tuly03 • 2 days ago • +4

    Impressive video, thank you. Have you heard of Immersive Translate? It's a tool with meticulously crafted prompts that makes translations in the technology field more accurate and professional.
