FlashAttention - Tri Dao | Stanford MLSys #67

  • Published: 27 Dec 2024

Comments • 14

  • @anishbhanushali • 1 year ago +10

    22:08 (basics of attention + the memory hierarchy in GPUs up to here) is where the actual explanation starts

  • @TheAIEpiphany • 1 year ago +2

    btw at 28:10 the animation got the order wrong compared to the paper's Algorithm 1: the inner loop should go over queries, not over values (see the sketch below)
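
    A minimal NumPy sketch of the loop nesting in question, matching the paper's Algorithm 1 (outer loop over K/V blocks, inner loop over Q blocks). The block sizes, the pure-NumPy setting, and keeping the running output unnormalized until the end are simplifying assumptions for illustration, not the actual CUDA kernel:

        import numpy as np

        def flash_attention_forward(Q, K, V, block_q=64, block_kv=64):
            # Single-head attention computed block by block; Q, K, V are
            # float arrays of shape (N, d).
            N, d = Q.shape
            scale = 1.0 / np.sqrt(d)
            O = np.zeros_like(Q)            # running, unnormalized output
            l = np.zeros(N)                 # running softmax denominator
            m = np.full(N, -np.inf)         # running row-wise max

            for j in range(0, N, block_kv):      # OUTER loop: K/V blocks
                Kj, Vj = K[j:j + block_kv], V[j:j + block_kv]
                for i in range(0, N, block_q):   # INNER loop: Q blocks
                    Qi = Q[i:i + block_q]
                    S = (Qi @ Kj.T) * scale      # score block (stays "in SRAM")

                    # Online-softmax update: rescale old stats to the new max.
                    m_new = np.maximum(m[i:i + block_q], S.max(axis=1))
                    P = np.exp(S - m_new[:, None])
                    corr = np.exp(m[i:i + block_q] - m_new)

                    l[i:i + block_q] = corr * l[i:i + block_q] + P.sum(axis=1)
                    O[i:i + block_q] = corr[:, None] * O[i:i + block_q] + P @ Vj
                    m[i:i + block_q] = m_new

            return O / l[:, None]           # normalize once at the end

    Either nesting is mathematically valid; the comment's point is only that Algorithm 1 in the paper puts the query blocks on the inner loop.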

  • @for-ever-22 • 9 months ago +2

    These videos are amazing

  • @rfernand2 • 1 year ago +2

    Great work and presentation. Where else could this be applied?

  • @shuminghu • 1 year ago

    Why does tiling reduce HBM-to-SRAM transfers? Or is it through pipelining, so that transfer time overlaps more with compute? (See the rough accounting below.)
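
    A rough way to see the first-order effect, as element counts rather than a performance model; the sizes (N, d, the K/V block height) are assumptions for illustration:

        # Naive attention materializes S = Q @ K.T in HBM: S is written, read
        # back for the softmax, written again as P, and read again for P @ V.
        N, d, B_kv = 4096, 64, 128      # seq length, head dim, K/V block rows
        n_kv_blocks = N // B_kv

        naive = 3 * N * d + 4 * N * N + N * d

        # Tiled attention keeps each score block in SRAM: K and V are read
        # once, while Q is re-read and the running output re-written once per
        # K/V block (ignoring smaller terms).
        tiled = 2 * N * d + 2 * n_kv_blocks * N * d + N * d

        print(f"naive: {naive / 1e6:.1f}M elements of HBM traffic")
        print(f"tiled: {tiled / 1e6:.1f}M elements of HBM traffic")  # ~4x less

    In this accounting the N x N score matrix simply never touches HBM in the tiled version; any overlap of transfers with compute would be on top of that saving.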

  • @denizlarson8862 • 1 year ago +2

    good research and nicely explained

  • @kawingchan • 1 year ago +1

    I am not familiar at all with CPU or GPU architecture, so I naturally wonder how much of this also applies to Apple GPUs (MPS). It was mentioned this is already in PyTorch, but I doubt it even gets activated on MPS. I would love to know, maybe at a high level, how it might (if possible) be ported to Apple GPUs, which have this unified memory thing.
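
    One way to poke at this from Python, sketched as a probe rather than a guarantee: PyTorch exposes attention through F.scaled_dot_product_attention, which dispatches to a fused FlashAttention-style kernel where one exists for the device; whether MPS gets a fused kernel or a fallback depends on the PyTorch version, which is exactly the question above. The tensor shapes here are arbitrary assumptions:

        import torch
        import torch.nn.functional as F

        # Run the fused attention entry point on MPS if available, else CPU.
        device = "mps" if torch.backends.mps.is_available() else "cpu"
        q, k, v = (torch.randn(1, 8, 2048, 64, device=device) for _ in range(3))

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print(out.shape, out.device)  # which kernel ran is an internal choice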

  • @xianbiaoqi7009 • 1 year ago

    Good idea and nice talk.

  • @brandomiranda6703 • 1 year ago

    ML for theorem proving would also benefit from longer sequences! Referencing a lemma proved in 300 BC...

  • @aamirmirza2806 • 1 year ago

    Really nice, well explained.

  • @sskhdsk • 1 year ago

    simple and effective

  • @JazevoAudiosurf • 1 year ago

    well explained

  • @deepanshusingh2527 • 1 year ago

    Is this utilised in inference as well? How fast is it compared to a naive implementation?