Long-Context LLM Extension

  • Published: 14 Nov 2024

Comments • 10

  • @griffinadams5057 • 1 day ago

    Great as always!

  • @lone0017 • 1 month ago

    Please keep up the good work; videos like these are incredibly helpful!
    As someone who self-studies ML, your channel, along with similar ones (Yannic Kilcher's, AI coffee talks), is such a great source of insight and education. Hopefully one day I will have the honour of doing my PhD with you.

  • @samlaki4051 • 1 month ago +1

    Ahh, I've been binging these since yesterday. They really give me a lot to think about regarding ways to extend the usual transformer architecture for different purposes.
    Thank you, Professor!
    - from sunny Syracuse

    • @srush_nlp • 1 month ago

      Glad they're useful! Was just in Syracuse last month. Beautiful time of the year.

  • @kei55340 • 24 days ago

    Thanks for the great video! This is a clearer explanation of long-context extension than any other source I've seen.
    One confusion I have with your video is that you connect high-frequency rotation subspaces with short lengths (which I assume means small values of n - m). Why is there such a connection? In particular, why would tweaking a high-frequency rotation be especially impactful on shorter gaps?
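
A minimal numerical sketch, assuming standard RoPE with base 10000 and head dimension 64 (hypothetical values, not taken from the video), illustrates why high-frequency subspaces are tied to short relative distances n - m:

```python
import numpy as np

# RoPE frequencies: theta_i = base^(-2i/d); small i gives high-frequency rotations.
d, base = 64, 10000.0
theta = base ** (-2 * np.arange(d // 2) / d)

# The rotation applied in subspace i depends only on the relative offset (n - m):
# its angle is (n - m) * theta_i.
for offset in [1, 8, 64, 512]:
    angles = offset * theta
    # High-frequency subspaces (theta_i near 1) complete full turns within a few tokens,
    # so their contribution to the q.k dot product oscillates quickly and mainly
    # distinguishes short gaps; low-frequency subspaces barely move until offsets are large.
    wrapped = int(np.sum(angles > 2 * np.pi))
    print(f"offset {offset:4d}: {wrapped}/{d // 2} subspaces have completed a full rotation")
```

Because only the high-frequency subspaces rotate appreciably over small offsets, rescaling them is what mostly shows up on short gaps.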

  • @timbertrand1136 • 1 month ago +1

    Awesome video. Very informative. Thanks a lot!

  • @victormanuel8767 • 1 month ago +1

    No, I have NOT memorized the equation for self-attention (I can barely read the math in these papers) 😅.
    But this has shown me what methods are effective and when.
    Great visualizations. Fantastic work.

    • @srush_nlp • 1 month ago

      Think of it like F=MA or E=MC^2. It's just "softmax(KQ) V" all the way down these days.
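
A minimal NumPy sketch of that equation, assuming single-head scaled dot-product attention with the usual 1/sqrt(d) factor made explicit:

```python
import numpy as np

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # similarity of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted mixture of value vectors

# Toy usage: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(self_attention(Q, K, V).shape)  # (4, 8)
```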

  • @AM-yk5yd • 1 month ago

    What would also be interesting is to see how fp precision affects long context.
    HF, for example, is very selective about when a model uses FP32 and when it uses its base dtype. In MistralRotaryEmbedding, the sin/cos arguments are calculated in FP32, so at least there are >64K representable values, but even then it depends on inv_freq (len = dim/2), which is cast to FP32.
    I suspect that once there are more positions than a single value's precision can distinguish, the model will be more affected by such "noise" as there are more tokens. (Though with limited VRAM it's not exactly a problem to worry about.)
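
A small sketch, assuming a RoPE-style inv_freq (base 10000, head dim 128, hypothetical values) and deliberately computing the sin/cos arguments in FP16, shows how low precision stops distinguishing nearby positions once the index is large:

```python
import numpy as np

# RoPE-style inverse frequencies, assuming base=10000 and head dimension 128.
d, base = 128, 10000.0
inv_freq = (base ** (-2 * np.arange(d // 2) / d)).astype(np.float32)

def rotary_angles(position, dtype):
    # Compute the sin/cos arguments position * inv_freq entirely in `dtype`.
    return np.asarray(position, dtype=dtype) * inv_freq.astype(dtype)

for pos in [1_000, 30_000, 60_000]:
    a_fp32 = rotary_angles(pos, np.float32)
    a_fp16 = rotary_angles(pos, np.float16)
    a_fp16_next = rotary_angles(pos + 1, np.float16)
    # Beyond 2048, FP16 cannot represent consecutive integer positions exactly,
    # so adjacent tokens can collapse onto identical rotary angles.
    distinct = not np.array_equal(a_fp16, a_fp16_next)
    err = np.abs(a_fp16.astype(np.float32) - a_fp32).max()
    print(f"pos {pos:>6}: max FP16 angle error {err:.2f} rad, "
          f"adjacent positions distinct in FP16: {distinct}")
```

Computing the arguments in FP32 before taking sin/cos, as the comment notes MistralRotaryEmbedding does, avoids this collapse until much longer contexts.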