One thing I don't understand well: after training, how do we manage the final output? For a large input, do we "force" the model to speak directly, i.e., produce an output for each input token, OR do we first feed in all of the input and then read the output at a certain point? Basically, one could wait until the reading part is complete and then force an answer. Maybe I'm not being clear (English isn't my language), but it seems there could be several ways to retrieve an output from this type of transformer. Please be kind, thanks for the video :)
Great vids! At around 3 minutes, when you get into the attention matrices, I think the dimensions aren't right: if Q is d by s and K is d by s, then QK-transpose is a d by s matmul with an s by d matrix, which gives a d by d matrix, but the video shows that matrix as s by s.
Thanks! In that part I transposed the diagram because I thought it looked a little better that way. Sometimes the diagrams I draw are transposed, but I try to label the dimensions to avoid ambiguity. So the resulting matrix is s by s, not d by d. An s by s matrix captures relations between sequence positions, while a d by d matrix would capture relations between dimensions across the entire sequence.
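A quick shape check (a minimal NumPy sketch of my own, not from the video) showing that both conventions land on an s by s score matrix:

```python
import numpy as np

s, d = 8, 64                      # sequence length, head dimension

# Row convention: one token per row
Q = np.random.randn(s, d)
K = np.random.randn(s, d)
print((Q @ K.T).shape)            # (8, 8) -> s by s token-to-token scores

# Column convention (one token per column, as in the transposed diagram)
Qc, Kc = Q.T, K.T                 # each (d, s)
print((Qc.T @ Kc).shape)          # (8, 8) -> still s by s
```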
Thanks for the video! Just one question. The H state dimension is fixed, but it accumulates additional information as it proceeds through the token sequence. After the first segment, H1 just "summarizes" one segment, but after segment N, Hn summarizes the current segment plus Hn-1, which is itself the summary of all the past context. Do you think it would make sense to increase the H dimension as the context grows, i.e., have the dimension of Hn grow with n? The idea is to keep the information per bit in H constant, so that we can really grow to unlimited context without the state becoming a bottleneck.
I think it makes sense to increase the hidden state, though doing so would make memory depend on the sequence length during inference, which is currently a big problem. One can think of a softmax attention transformer as having an infinite hidden state (the keys/values are just stacked), while an RNN has a constant-size hidden state. Perhaps something in the middle would perform better than an RNN but not require as much memory as a Transformer?
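A toy sketch of the two regimes (shapes and readout are my own assumptions, not from the paper):

```python
import numpy as np

d = 64  # head dimension

# Transformer-style: keys/values are stacked, so memory grows with step count t
K_cache, V_cache = [], []
def attn_step(q, k, v):
    K_cache.append(k); V_cache.append(v)       # O(t * d) memory after t steps
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V                   # softmax attention over all history

# RNN-style: constant-size state, updated in place
H = np.zeros((d, d))
def rnn_step(q, k, v):
    global H
    H = H + np.outer(k, v)                     # O(d^2) memory, independent of t
    return q @ H                               # linear-attention-style readout
```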
@@gabrielmongaras Yeah, the trick is to find a state update function xn+1 = S(xn, segment_n) such that dim(xn+1) > dim(xn), i.e., projecting the vector xn into a bigger space while preserving the semantics and folding in segment_n's new data. Indeed, because the state dimension in the paper is tailored for a fairly long context, with a growing state you could even start from a _smaller_ state x1 and grow it with the number of segments... so maybe for a not-too-large context you could even get a memory reduction! But I don't see memory usage as a problem; you can always clamp it to a maximum if really needed, a kind of max_memory parameter... it can't be worse than the original.
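Something like this hypothetical sketch (the update S, the growth step, and the projections are all made up for illustration; in practice they would need to be learned per step):

```python
import numpy as np

rng = np.random.default_rng(0)

def S(x_n, segment_n, grow=16):
    """Hypothetical growing-state update x_{n+1} = S(x_n, segment_n):
    project the old state into a larger space, then fold in a summary
    of the new segment."""
    d_old, d_seg = x_n.shape[0], segment_n.shape[1]
    d_new = d_old + grow
    P = rng.normal(size=(d_new, d_old)) / np.sqrt(d_old)   # expansion map
    W = rng.normal(size=(d_new, d_seg)) / np.sqrt(d_seg)   # segment mix-in
    return P @ x_n + W @ segment_n.mean(axis=0)            # dim(x_{n+1}) > dim(x_n)

x = rng.normal(size=32)                  # start from a smaller state
x = S(x, rng.normal(size=(128, 64)))     # state is now 48-dimensional
```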
Thanks, another great vid!
Great vids! But one thing I couldn't figure out: how is the training process accomplished?
great vid gabriel 👍
17:50, shouldn't H_i be a summation of the k_j, v_j terms over j instead of over i, where j goes from 1 to i?
Yep. Nice catch! Put a note in the video about the reindexing.
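For reference, the corrected indexing written out (using the outer-product form of the hidden-state update; the LaTeX is mine):

```latex
H_i = \sum_{j=1}^{i} k_j v_j^\top = H_{i-1} + k_i v_i^\top
```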
I have to get used to this scientific notation; I struggle to produce code from these articles myself.
so basically what you're telling me is the world is going to end
Absolutely :)