Just wanted to say you are doing the community such a great service and contribution. Thank you!
It's not really Internal RAG, it's more of internal summarization - similar to RWKV (the mechanism is different though).
RAG would require the model to retrieve from an effectively unbounded DB rather than from a finite summarized state. This method would very likely fail at LiM tasks, similar to the one tried in a previous video (with instructions buried in the middle of a block of unrelated text). The model would have to know that the instruction is going to matter more than specific details from the text passage (and the same concept applies to retrieving specific details). That also means this method may fail at copying from outside the current block, similar to Mamba variants (and for the same reason).
So, essentially, it's a key-value memory network baked into an LLM?
It's a summarization state, constructed from the outer product of the block K-V vectors. So each block of size S has K and V matrices of size S×d, and they form a d×d "summary" of the K-V state for that block. Then the next block can "query" into that d×d state using a linear attention mechanism, which is added to the local self-attention (within the block). Essentially, a fancy hybrid model like Jamba, just implemented differently, but it should have similar pitfalls. At least the summarization state here is of size d×d rather than 1×(a*d), where a
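For concreteness, here is a minimal NumPy sketch of the mechanism as described above: one block's K-V pairs are compressed into a d×d outer-product summary, and the next block reads that summary with linear attention and adds the result to its block-local softmax attention. The feature map `phi`, the normalizer term, and the single-head, two-block setup are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def phi(x):
    # Hypothetical non-negative feature map (ELU(x)+1 style), a common choice
    # in linear attention; the actual map in the paper may differ.
    return np.where(x > 0, x + 1.0, np.exp(x))

def block_summary(K, V):
    # K, V: (S, d) for one block. Summing the outer products phi(k_i) v_i^T
    # over the block gives a (d, d) summary state plus a (d,) normalizer.
    return phi(K).T @ V, phi(K).sum(axis=0)

def linear_attention_read(Q, summary, k_norm):
    # Queries from the current block read the carried-over summary using the
    # standard linear-attention form: phi(q)^T S / (phi(q)^T z).
    num = phi(Q) @ summary                    # (S, d)
    den = phi(Q) @ k_norm[:, None] + 1e-6     # (S, 1)
    return num / den

def local_attention(Q, K, V):
    # Ordinary softmax attention restricted to the current block
    # (causal masking omitted for brevity).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Toy usage: block 1 is summarized into a d x d state; block 2 attends
# locally and adds a linear-attention read of that state.
S, d = 4, 8
rng = np.random.default_rng(0)
K1, V1 = rng.normal(size=(S, d)), rng.normal(size=(S, d))
Q2, K2, V2 = (rng.normal(size=(S, d)) for _ in range(3))

summary, k_norm = block_summary(K1, V1)
out = local_attention(Q2, K2, V2) + linear_attention_read(Q2, summary, k_norm)
print(out.shape)  # (4, 8)
```

Note how the memory cost of the carried state is O(d²) regardless of how many tokens it summarizes, which is exactly why retrieval of specific details from earlier blocks is lossy.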
I wish I had a good enough level of math to understand how those formulas are derived.
Virtual Multiport Memory?
With their DeepMind arm, I'm thinking they'll reach organic/organic-analog computing first. Imagine if states and events were global - global tx/rx. A chemical solution.
Shame on Google for assisting the war machine with their tech. "Don't be evil"