I am not at all familiar with CPU or GPU architecture, so I naturally wonder how much of this also applies to Apple GPUs (MPS). It was mentioned that this is already in PyTorch, but I doubt it even gets activated on MPS. I would love to know, maybe at a high level, how it might (if possible) be ported to Apple GPUs, which have this unified memory thing.
22:08 (basics of attention + GPU memory hierarchy up to this point); the actual explanation starts here
btw at 28:10 the animation gets the order wrong compared to the paper's Algorithm 1: the inner loop should go over the queries, not over the values (see the sketch below).
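For anyone cross-checking against the paper, here is a minimal single-head sketch (mine, not from the talk) of the loop nesting in Algorithm 1: the outer loop goes over key/value blocks and the inner loop over query blocks, with running row statistics so the N x N score matrix is never materialized. Normalization is deferred to the end for simplicity, and the block size and the final sanity check are just illustrative.

import torch

def flash_attention_reference(Q, K, V, block_size=64):
    # Tiled attention following the loop order of Algorithm 1:
    # outer loop over K/V blocks, inner loop over Q blocks.
    N, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)                   # unnormalized output accumulator
    l = torch.zeros(N)                        # running softmax denominators (paper's l)
    m = torch.full((N,), float("-inf"))       # running row maxima (paper's m)

    for j in range(0, N, block_size):         # outer loop: key/value blocks
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        for i in range(0, N, block_size):     # inner loop: query blocks
            Qi = Q[i:i + block_size]
            S = (Qi @ Kj.T) * scale                            # scores for this tile
            m_new = torch.maximum(m[i:i + block_size], S.max(dim=-1).values)
            P = torch.exp(S - m_new[:, None])                  # tile softmax numerator
            alpha = torch.exp(m[i:i + block_size] - m_new)     # rescale old statistics
            l[i:i + block_size] = alpha * l[i:i + block_size] + P.sum(dim=-1)
            O[i:i + block_size] = alpha[:, None] * O[i:i + block_size] + P @ Vj
            m[i:i + block_size] = m_new
    return O / l[:, None]

# sanity check against standard attention
torch.manual_seed(0)
Q, K, V = (torch.randn(256, 32) for _ in range(3))
ref = torch.softmax((Q @ K.T) * 32 ** -0.5, dim=-1) @ V
assert torch.allclose(flash_attention_reference(Q, K, V), ref, atol=1e-4)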
These videos are amazing
Great work and presentation. Where else could this be applied?
Why does tiling reduce HBM-to-SRAM transfers? Or is it through pipelining, so that transfer time overlaps more with compute?
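The way I read the paper, the saving is mostly about not materializing the N x N score matrix in HBM (the standard implementation writes the scores S and the softmax P out to HBM and reads them back), rather than about overlapping transfers with compute. If I remember the paper's I/O analysis right, the HBM access counts compare roughly as follows, with N the sequence length, d the head dimension, and M the SRAM size in elements:

\[
  \text{standard attention: } \Theta\!\left(Nd + N^{2}\right)
  \qquad
  \text{tiled (FlashAttention): } \Theta\!\left(\tfrac{N^{2}d^{2}}{M}\right)
\]

Since d^2 is typically much smaller than M (e.g. d = 64 against roughly 100 KB of SRAM per streaming multiprocessor), the tiled version touches HBM far less often.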
good research and nicely explained
Good idea and nice talk.
ML for theorem proving would also benefit from longer sequences! Imagine referencing a lemma proved in 300 BC...
11:09
Really nice, well explained.
simple and effective
well explained
Is this utilised in inference as well? How fast is it compared to the naive implementation?