Building Multimodal Models

  • Published: 16 Sep 2024
    🏘 Discord: / discord
    github.com/hu-...
    What matters when building vision-language models?
    arxiv.org/pdf/...
    Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
    arxiv.org/pdf/...
    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
    storage.google...
    Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
    arxiv.org/pdf/...

Comments • 21

  • @hjups
    @hjups 4 months ago +9

    The image tokenizer used by Chameleon is a VQ-VAE. They state that it's based on the one from "Make-A-Scene", which used a VQ-VAE. You could probably also train an image tokenizer along with the model (there was a recent paper that did that), but it would be insanely expensive and lead to fighting losses across the multiple modalities. So instead they train a VQ-VAE (f16, Z=8192), where they probably do something like img_vocab[0] = max(txt_vocab) (see the sketch after this comment). Note this is being used as a tokenizer with a codebook (i.e. an image vocab), not as an image encoder like CLIP or DINO (where the real-valued tensors are used). Consequently, it also means that image quality will necessarily be poorer due to the fixed vocab size.
    FYI, VQ-VAEs are GANs too (you train them with KL, reconstruction, and a reconstruction discriminator loss), so it might be more precise to call them VQ-VAE-GANs.
    Regarding sequential output, the issue is not that it's a transformer (they're set-to-set models, not sequence-to-sequence, by the way). The problem is probability collapse and capacity. With text, the probability of word N depends on the choice of word N-1. For example, "the cat sat" vs. "the cat ran": "on a mat" and "up a tree" depend on "sat" vs. "ran", which could be equally probable. This can be solved via a diffusion process (or mask decoding), but then there's no guarantee that the text segment will be completed within a fixed length of M tokens. Conversely, images can be diffused (or mask decoded), because they are guaranteed to be a known fixed size. Presumably you could decode the text in overlapping chunks (like image out-painting), but that becomes a lot more expensive to compute than auto-regressive generation.
    Also, we're never going to get to a point where an MLM can generate action tokens for robotics (unless they are high-level embeddings), because the latency will be too high to close the control loop. Generating macro-actions which feed into a VLM, which in turn feeds into a kinematics model (like what Figure-01 is doing), is more practical due to the hierarchical processing tradeoff between time and complexity. This also contradicts your claim about engineering complexity with model architectures (expert systems). The human brain is a collection of expert systems, so why wouldn't an AGI-like architecture be as well? The simplicity comes at the micro-level, where the layers may be MLPs and attention (analogous to cortical columns), but those layers are then combined in a heterogeneous way. I guess the takeaway is: a simpler architecture is better if you have infinite compute and infinite data, but in the real world you have neither, so architecture very much does matter from an efficiency standpoint.
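A minimal sketch of the vocab-offset idea from the tokenizer paragraph above, assuming a Chameleon-style setup where a frozen VQ-VAE image tokenizer's codebook indices are shifted to sit after the text vocabulary. The vocabulary sizes, sentinel tokens, and function names are illustrative assumptions, not Chameleon's actual values or API:

```python
# Sketch: packing BPE text tokens and VQ-VAE image codes into one shared vocabulary.
# Assumes a trained BPE text tokenizer and a frozen VQ-VAE encoder whose codebook
# has IMAGE_CODEBOOK_SIZE entries. All names and sizes are illustrative.

TEXT_VOCAB_SIZE = 65_536       # assumed BPE vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192    # assumed VQ-VAE codebook size (f16 downsampling)

BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # hypothetical <begin-of-image> sentinel
EOI = BOI + 1                                 # hypothetical <end-of-image> sentinel
TOTAL_VOCAB = EOI + 1                         # size of the transformer's embedding table

def image_to_tokens(vq_codes: list[int]) -> list[int]:
    """Map raw VQ codebook indices in [0, IMAGE_CODEBOOK_SIZE) into the shared
    vocabulary by offsetting them past the text vocabulary, wrapped in sentinels."""
    return [BOI] + [TEXT_VOCAB_SIZE + c for c in vq_codes] + [EOI]

def interleave(text_ids: list[int], vq_codes: list[int]) -> list[int]:
    """Build one flat token sequence; the transformer sees a single vocabulary of
    TOTAL_VOCAB ids and does not distinguish 'text' ids from 'image' ids."""
    return text_ids + image_to_tokens(vq_codes)
```

Because every image must round-trip through this fixed codebook, reconstruction quality (small OCR text in particular) is bounded by the codebook size, which is the quality limitation pointed out above.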

  • @wolpumba4099
    @wolpumba4099 4 months ago +3

    *Building Multimodal Models: Summary of the Stream*
    *Main Points:*
    * *Multimodal Models are the Future:* (0:00) Leading AI companies like OpenAI and Google are releasing multimodal models like GPT-4o and Project Astra, capable of handling audio, video, and text.
    * *Open Source vs. Big Tech:* (2:34) The open-source community (represented by Hugging Face) is limited in resources and often relies on gluing pre-trained vision encoders and language models together, creating basic VLMs. In contrast, companies like Meta (Facebook) can train massive models like Chameleon from scratch with early fusion, leading to better performance.
    * *Chameleon:* (4:59) This new model is trained end-to-end, uses a shared representation space for modalities, and outperforms LLMs, Gemini Pro, and GPT-4V in some cases. It consumes and generates interleaved text and images.
    * *Mirasol 3B:* (42:31) Google's research includes this model, which consumes video, audio, and text but only outputs text. It utilizes a "Token Turing Machine," essentially an LSTM built with Transformers, showcasing memory capabilities like those seen in GPT-4o.
    * *Challenges:*
    * *Tokenization:* (16:44) Image tokenizers are still limited and struggle with OCR of small text.
    * *Logit Drift:* (32:28) Modalities can compete with each other during training due to varying entropy. Solutions like query-key normalization (QK-norm) are being explored (a minimal sketch follows this summary).
    * *Inference Complexity:* (1:11:38) Stateful inference with memory requires sophisticated infrastructure and solutions.
    * *Future Direction:* (1:17:39) We are moving towards end-to-end trained, early fusion models that handle audio, video, and text input and output. This will likely involve:
    * Larger models, data sets, and training runs.
    * Moving beyond 2D image generation to 3D environments.
    * Potential integration with robotics using control tokens.
    * *Implications for Research:* (1:44:02) Meaningful research with smaller compute budgets is becoming increasingly difficult. Open source contributions might be less impactful, but personal exploration and learning remain valuable.
    *Key Takeaways:*
    * We are on the cusp of major breakthroughs in multimodal AI.
    * End-to-end, early fusion models trained on massive datasets are the likely path forward.
    * The capabilities of these models will revolutionize how we interact with AI and the world around us.
    I used Gemini 1.5 Pro to summarize the transcript.
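Regarding the *Logit Drift* bullet above and its mention of query-key normalization: below is a minimal, hedged sketch of the idea, layer-normalizing queries and keys inside a single attention head before the dot product so that no modality can inflate the attention logits without bound. PyTorch and the single-head simplification are assumptions for illustration, not the paper's actual code:

```python
# Sketch: query-key normalization (QK-norm) in a single self-attention head.
# Normalizing q and k before the dot product bounds the logit magnitude,
# one remedy for norm growth / "logit drift" between competing modalities.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)   # normalize queries
        self.k_norm = nn.LayerNorm(dim)   # normalize keys
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); one head only, for brevity
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        logits = q @ k.transpose(-2, -1) * self.scale
        return F.softmax(logits, dim=-1) @ v
```

In practice the normalization would be applied inside every head of every layer; the single head above only shows the core mechanism.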

  • @thegigasurgeon
    @thegigasurgeon 4 months ago +1

    such an underrated channel

  • @askeletalghost
    @askeletalghost 4 months ago +6

    petition to bring back the screaming horn

    • @maslaxali8826
      @maslaxali8826 4 months ago +1

      Noo. Expanding for a bigger audience.

    • @NLPprompter
      @NLPprompter 3 months ago

      No no no no... last time my wife slapped me, she said stop that noise... it seems the audio reached the living room, which freaked her out.

  • @dm204375
    @dm204375 4 months ago +3

    great stream as always.

  • @wolpumba4099
    @wolpumba4099 4 months ago +7

    Summary starts at 1:33:00

  • @jmirodg7094
    @jmirodg7094 3 months ago +1

    Extremely interesting, even though I do not fully agree with your conclusion. Yes, the companies with massive compute capabilities have an edge, but I trust the open-source community to find ways to approach their performance with a fraction of the compute. So for Ed: yes, continue your PhD on VLMs; it is not only for your fun, it will allow everyone to have good and free AI!

    • @Joshua-hb7cs
      @Joshua-hb7cs 3 months ago

      What do you mean by "for Ed: yes, continue your PhD on VLMs; it is not only for your fun"?

    • @jmirodg7094
      @jmirodg7094 3 months ago +1

      @@Joshua-hb7cs Towards the end, hu-po mentioned this person Ed, who is apparently doing his PhD on VLMs, and said that low-budget research is kind of pointless but that he should still pursue it for fun. But I would argue that low-budget research has proven its capacity to find innovative ways to approach the performance of high-budget frontier research at a fraction of the cost. This is critical to having an open-source ecosystem and to keeping us free from the big players.

    • @Joshua-hb7cs
      @Joshua-hb7cs 3 months ago

      @@jmirodg7094 thanks!

  • @rohollahhosseyni8564
    @rohollahhosseyni8564 4 months ago

    Wow. Another great stream. I love them. I've learned a lot of things from your streams.
    Thanks hu-po

  • @taylorhawkes
    @taylorhawkes 4 months ago

    Very helpful videos for keeping up with the lay of the land and understanding how to read papers. Keep up the great work!

  • @maslaxali8826
    @maslaxali8826 4 months ago

    Was on vacation last week and missed the live lectures. Hopefully see you next week.

  • @lucamatteobarbieri2493
    @lucamatteobarbieri2493 4 months ago

    In the brain, sensing is initially processed separately, then the information is integrated in associative cortices. Then the info can go to prefrontal areas, and then on to motor areas. Pretty straightforward if you boil it down to the fundamentals. You have all sorts of redundancies and structures, like the amygdalae, that are additions to make things work better. But the main flow of information in the brain is not that complex, IMHO. Neuroanatomy is a goldmine for AI.

  • @musifmuzammir354
    @musifmuzammir354 2 months ago

    When generating text and images, how does the model know which tokens should be sent to the BPE de-tokenizer and which should be sent to the image de-tokenizer?
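A hedged sketch of one common answer, assuming a Chameleon-style shared vocabulary like the one sketched earlier: the id range of each generated token tells you its modality. Ids below the text vocab size go to the BPE de-tokenizer; ids between hypothetical begin/end-of-image sentinels are shifted back into raw VQ-VAE codebook indices and sent to the image decoder. The constants and helper below are illustrative, not the model's actual API:

```python
# Sketch: routing a generated token stream to the text vs. image de-tokenizer.
# Uses the same hypothetical shared-vocabulary layout as the earlier sketch:
# ids < TEXT_VOCAB_SIZE are BPE tokens; ids in
# [TEXT_VOCAB_SIZE, TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE) are VQ-VAE codes.

TEXT_VOCAB_SIZE = 65_536
IMAGE_CODEBOOK_SIZE = 8_192
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # <begin-of-image> sentinel
EOI = BOI + 1                                 # <end-of-image> sentinel

def split_stream(token_ids: list[int]) -> list[tuple[str, list[int]]]:
    """Return ("text" | "image", ids) segments in generation order."""
    segments: list[tuple[str, list[int]]] = []
    current: list[int] = []
    mode = "text"
    for t in token_ids:
        if t == BOI:                       # start collecting image codes
            if current:
                segments.append((mode, current))
            current, mode = [], "image"
        elif t == EOI:                     # image span finished
            segments.append((mode, current))
            current, mode = [], "text"
        else:
            # undo the vocabulary offset for image codes
            current.append(t - TEXT_VOCAB_SIZE if mode == "image" else t)
    if current:
        segments.append((mode, current))
    return segments

# "text" segments go to the BPE de-tokenizer; "image" segments (fixed-length
# grids of codebook indices) go to the VQ-VAE decoder to produce pixels.
```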

  • @richardnunziata3221
    @richardnunziata3221 3 months ago

    I wonder if, for the vision modality input, we could sync with the audio and text temporally by sending just the eye saccades along with the audio and text.

  • @richardnunziata3221
    @richardnunziata3221 3 months ago

    There is a paper on multi-token generation: "Better & Faster Large Language Models via Multi-token Prediction".

  • @VijayEranti
    @VijayEranti 4 months ago

    Can you do these kinds of survey papers, like the latest surveys and findings in AI agents?

  • @fintech1378
    @fintech1378 4 months ago

    AR/VR is definitely the future, the metaverse, and unfortunately for many people crypto is coming back because of it.