Soft Mixture of Experts - An Efficient Sparse Transformer

  • Published: 16 Dec 2024

Comments • 9

  • @christopherhornle4513 • 1 year ago • +3

    Thank you! In the paper they also state that one slot per expert seems to be optimal.

  • @flipflopsn • 9 months ago

    Underrated channel, great video!

  • @collin6526 • 11 months ago • +3

    How does Soft MoE reduce computation? If the tokens just get distributed to slots for all the experts, that implies all the experts are running, which prevents the performance gain. Is there some selection at inference time as to which experts should run? (See the sketch after the comments below.)

  • @kumargaurav2170 • 1 year ago • +1

    Great video👍🏻

  • @krishnakosaraju • 10 months ago

    Hi, is the MoE paradigm beneficial for decoder-only architectures? And do the advantages apply only to Vision Transformers, or to LLMs too?

  • @postmodernismm • 1 year ago

    Thank you 👍 Could you also cover NeRF and related topics?
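
Regarding the question above on how Soft MoE reduces computation, here is a minimal, illustrative sketch of a Soft MoE layer in PyTorch, following the dispatch/combine idea from the Soft MoE paper discussed in the video (Puigcerver et al., "From Sparse to Soft Mixtures of Experts"). The class name `SoftMoE`, the parameter names, and the tiny MLP used for each expert are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Illustrative Soft MoE layer: soft dispatch of tokens to slots, one slot group per expert."""

    def __init__(self, dim: int, num_experts: int, slots_per_expert: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        num_slots = num_experts * slots_per_expert
        # One learnable embedding per slot; token-slot logits are dot products with these.
        self.slot_embeds = nn.Parameter(torch.randn(num_slots, dim) / dim**0.5)
        # Every expert runs, but each only processes its own slots_per_expert slot vectors.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        logits = torch.einsum("btd,sd->bts", x, self.slot_embeds)  # token-slot affinities
        dispatch = logits.softmax(dim=1)  # each slot = convex combination of tokens
        combine = logits.softmax(dim=2)   # each output token = convex combination of slots
        slots = torch.einsum("bts,btd->bsd", dispatch, x)          # (batch, slots, dim)
        # Split the slot axis among experts: expert i sees only its slots_per_expert inputs.
        slots = slots.reshape(x.size(0), self.num_experts, self.slots_per_expert, x.size(-1))
        expert_out = torch.stack(
            [expert(slots[:, i]) for i, expert in enumerate(self.experts)], dim=1
        )                                                           # (batch, experts, slots_per_expert, dim)
        expert_out = expert_out.reshape(x.size(0), -1, x.size(-1))  # back to (batch, slots, dim)
        return torch.einsum("bts,bsd->btd", combine, expert_out)    # (batch, tokens, dim)
```

The point the sketch shows: there is no hard selection at inference time; all experts do run, but each expert processes only its `slots_per_expert` slot vectors (often just one, which the first comment notes the paper found to work well), so the expert FLOPs are fixed by the number of slots rather than growing with tokens × experts. The only cost that grows with sequence length is the dispatch/combine matrix multiplications.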