Please, don't stop making videos. You are a legend!
After reading so many blogs and papers explaining transformers, this video helped me understand the intuition behind the Query, Key, and Value matrices.
The best video that explains in detail how the "magic" works. This is an amazing video!!!!!!
Amazing explanation of how attention magic works!! Thanks a lot
My Lord, so cool and ingenious!
Superb video
This is a fantastic video!
(It made me think we need spreadsheets with autodiff. :D)
Excel is a deep learning cheat code
Thank you. Great work ♥️
Beautiful
thank you for these gems!
Great stuff!
Do you know why those values were used for the positional encoding instead of 0, 1, 2, 3, 4, 5? Or can any values be used?
I assume you wanted the last position to have 111, but does it have to be 111?
Yes, I picked those values in particular (0, 1, 3, 4, 6, 7) because it was easy to construct query and key matrices that would make the first query row line up with the last key row, the second query row line up with the second-to-last key row, and so on. This was easy for me to find because it just involved multiplying the query weights by -1.
If I had used 0, 1, 2, 3, 4, 5, I would have needed to find some other weights that made the 0 query row line up most closely with the 5 key row, the 1 query row with the 4 key row, and so on, and this would have been tougher.
You could probably use a single positional encoding column that ranges from -1.0 (beginning of the string) to 1.0 (end of the string). Then, you would have the same property as my 3 binary bits, where you can multiply the query weights by -1 to most strongly match with a key of opposite position.
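If it helps to see that concretely, here is a minimal numpy sketch of the idea (my own construction, not literally the video's matrices). One assumption on my part: I recode the 3 position bits from {0, 1} to {-1, +1} so that each position's strongest match, its bitwise complement, is unique; with raw 0/1 bits some positions would tie.

```python
import numpy as np

# The video's position values (0, 1, 3, 4, 6, 7) written as 3 binary bits,
# recoded from {0, 1} to {-1, +1} (an assumption of this sketch).
values = [0, 1, 3, 4, 6, 7]
P = np.array([[1 if v & (1 << b) else -1 for b in (2, 1, 0)] for v in values])

Q = -P              # "multiply the query weights by -1"
K = P
scores = Q @ K.T    # query-key dot products

# Each row's strongest match is the mirrored position (0<->5, 1<->4, 2<->3),
# because 0/7, 1/6, and 3/4 are bitwise complements of each other.
print(scores.argmax(axis=1))   # [5 4 3 2 1 0]
```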
Thanks for explaining this in an understandable way. I don't get how this is then used to process multiple words (am I missing something?)
The toy example in this video mapped the ASCII characters to numbers. In practice, words (or subwords) are turned into numbers in a process called embedding:
- medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca
- blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
- dugas.ch/artificial_curiosity/GPT_architecture.html
So there's a pre-processing step where the sentence or sentences (a list of words) are turned into their embeddings (lists of numbers) before being fed into the transformer. The transformer does a bunch of math and outputs embeddings (lists of numbers), which are converted back into a list of words.
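Here is a minimal sketch of that pipeline in numpy (the vocabulary, the embedding dimension, and the do-nothing "transformer" in the middle are all made up purely for illustration):

```python
import numpy as np

# A toy vocabulary with a made-up embedding table (unit-norm rows so that
# "nearest embedding by dot product" recovers a word unambiguously).
vocab = ["the", "cat", "sat"]
emb = np.random.randn(len(vocab), 4)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Pre-processing: words -> indices -> embedding rows fed to the transformer.
tokens = [vocab.index(w) for w in "the cat sat".split()]
x = emb[tokens]                 # shape (3, 4)

out = x                         # stand-in for the transformer's output

# Post-processing: each output vector -> nearest embedding -> word.
nearest = (out @ emb.T).argmax(axis=1)
print([vocab[i] for i in nearest])   # ['the', 'cat', 'sat']
```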
I'm very confused by the claim that you cannot find a matrix to swap the order of rows at 17:50. How is the answer not just a standard permutation matrix? en.wikipedia.org/wiki/Permutation_matrix
As far as I understand them, permutation matrices multiplied on the right swap columns around, not rows. For example, if you put 1s on the minor diagonal of your permutation matrix, it mirrors the result across the vertical axis because it swaps the first and last columns, the second and second-to-last columns, and so on. I could not find a permutation matrix (or any matrix) that achieved the mirroring across the horizontal axis, i.e. one that swapped rows.
I could still be wrong about such a matrix not existing (when multiplied on the right). It was the boldest claim in the whole video, and the one I did not have a great citation or proof for.
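A quick numpy check of the column-swapping behavior (a small example, not from the video):

```python
import numpy as np

M = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
P = np.fliplr(np.eye(2))         # 1s on the minor (anti-)diagonal

# Multiplied on the right, P reverses the COLUMNS of M, not its rows:
print(M @ P)
# [[1. 0.]
#  [3. 2.]
#  [5. 4.]]
```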
@@ConceptsIlluminated A function f is linear if f(x + y) = f(x) + f(y) and f(cx) = c f(x) for all x, y in the domain and any scalar c. It's clear that the function you describe in the video is linear by this definition.
I think you are correct that swapping rows is linear, given that definition, and I was incorrect in saying it was nonlinear.
I have still been unable to find a matrix R that swaps the rows of a matrix M when multiplied on the right, à la M × R. Assuming no such matrix exists, a more accurate phrasing would have been "The self-attention mechanism lets us do transformations that are impossible with just a single matrix multiplication step", yeah?
I actually had the same question. BTW, the transpose can also be expressed as a linear transformation if combined with reshapes.
@@ConceptsIlluminated I think it can be done with a regular left matrix multiplication, i.e., z = Px.
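A small numpy check bears this out: the same anti-diagonal permutation matrix, moved to the left side, reverses the rows of any M, which is exactly what no fixed right-hand matrix can do for every M:

```python
import numpy as np

M = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
P = np.fliplr(np.eye(3))         # 1s on the minor (anti-)diagonal

# Multiplied on the LEFT, P reverses the ROWS: z = P @ M.
print(P @ M)
# [[4. 5.]
#  [2. 3.]
#  [0. 1.]]
```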