awesome content, seriously
Very interesting. I've been seeing a lot of papers talking about "dynamically allocating compute" and this is a somewhat persuasive method, but I wonder if this couldn't be achieved more effectively just using a MoE where one of the experts happens to be the identity, with an auxiliary loss to encourage it to use the identity if needed.
I do see that they do this in the paper, but I'm confused why this wasn't the approach they centered the paper on; the math doesn't require that weird softmax->sigmoid hack to make inference work, and it performs better, so why didn't they explore it further? Plus, they have two different variations on MoDE but don't specify which one they used for the results they show... it feels strange that the most persuasive thing in the paper, for me, is such an afterthought.
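To make the identity-expert idea concrete, here's roughly what I mean (a minimal PyTorch sketch under my own assumptions; the module and the auxiliary loss are illustrative, not anything from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithIdentityExpert(nn.Module):
    """Toy top-1 MoE where expert 0 is the identity (a free 'skip' path)."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert 0 is a pass-through; the rest are ordinary FFN experts.
        self.experts = nn.ModuleList(
            [nn.Identity()]
            + [
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.GELU(),
                    nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts - 1)
            ]
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (B, S, E)
        top_p, top_idx = probs.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the router prob so the router receives gradients.
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        # Auxiliary loss: probability mass sent to the *compute* experts.
        # Adding lambda * aux_loss to the LM loss nudges tokens toward the
        # identity expert, trading a bit of quality for saved FLOPs.
        aux_loss = probs[..., 1:].sum(dim=-1).mean()
        return out, aux_loss
```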
I do think the proposed method is quite hacky with the sigmoid trick they do. MoDE seems much more natural, and in Figure 7 it does better than the standard approach they take. I wish they had gone more in depth on MoDE. Maybe they're working on a future paper on MoDE and that's why they barely included it?
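For anyone who hasn't read it, the trick as I understand it (my own rough sketch, not their code): the router logits get an extra BCE loss whose targets are the non-causal top-k selections, so at inference you can threshold each token's sigmoid independently without peeking at future tokens.

```python
import torch
import torch.nn.functional as F

def router_sigmoid_aux_loss(router_logits: torch.Tensor, k: int):
    """BCE on the router logits with top-k membership as the targets.
    router_logits: (batch, seq). At inference a token is routed iff
    sigmoid(logit) > 0.5, which never depends on future tokens."""
    topk_idx = router_logits.topk(k, dim=-1).indices
    targets = torch.zeros_like(router_logits)
    targets.scatter_(-1, topk_idx, 1.0)   # 1 if the token made the top-k
    return F.binary_cross_entropy_with_logits(router_logits, targets)
```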
@gabrielmongaras I agree. I had a mental image of them thinking of it at the last moment before publishing and not wanting to throw out all their previous work! But it's maybe more charitable to imagine they wanted to be able to compare against a baseline dense attention layer instead of just an MoE layer.
That's probably more likely lol. They found that it worked better and just put it in as a side note, since they would have had to redo a lot of the paper.
Thanks for the video tutorial, it is really helpful! At 24:08, when you mention softmax, do you mean a softmax is used to compute the routing scalars? If so, then as per my understanding they don't compute the routing scalars using a softmax. The scalars are computed just by taking the inner product of each token with the routing weight vector.
Oh yes, I see what you're talking about. On page 6, right above equation (1), they mention each token's routing weight is computed as the inner product between the routing weight vector and the token vector, which is different from a normal MoE. I suppose this fixes the gradient problem I was talking about. Thanks for the clarification!
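i.e., if I'm transcribing it right, something like:

```latex
% Routing weight for token i at layer l (my reading of the line above
% Eq. (1)): a plain per-token inner product, with no softmax coupling
% the tokens together.
r_i^{l} = {w_\theta^{l}}^{\top} x_i^{l}
```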
great video! this is much easier to understand than just reading the paper. what app are you using for annotating the paper and making notes?
Thanks! Glad you found my video helpful! I'm using the default Samsung Notes app to make all the annotations and notes.
awesome, btw maybe try using excalibur
I thought people theorised that transformers still use the 'slack' tokens for other purposes, so the compute is not wasted; I guess this shows that maybe those theories needed to be rigorously tested. Although, since they only sandwich the routed layers between dense ones, maybe it is fully used. This method effectively gives some tokens up to double the mixing time.
Why use lot words when few words do trick?
Yeah, definitely a problem I have 😅
Been trying to get better at it, and after uploading I realized I could've explained the extra loss part in far fewer words. In general, it's sometimes hard to know whether an explanation is satisfying when trying to balance conciseness and length.
@gabrielmongaras They might be referring to Mixture-of-Depths using only a few of the words. Personally, I thought your explanations were great
I'm not sure I understand: even though the sigmoids are independent, why would this allow for causal sampling if it was trained to mimic a distribution that isn't causal? It carries information from the future, albeit indirectly, no?
For example, if we were training on a distribution of a biased lottery, wouldn't we still be predicting the future from just some of the tokens?
Ah, I think you mention exactly that afterwards 😅 thanks
One more question: can these be added to existing models and trained separately? From the description it sounds like it's possible.
I don't think they talked about doing that in the paper. My intuition says it may be hard and probably wouldn't work as well as we might hope. The activations going into attention are whatever the model needs for the attention mechanism; in this paper, however, the activations are also used for ranking. My first thought is that these two activation distributions are quite different, making the model start from a poor state. I wonder if Google did try something like this but found it didn't work that well and decided not to add it to the paper? Would be totally worth trying if you have the compute though! Maybe you could start off by initializing routing to all tokens and slowly decrease this during fine-tuning.
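Something like this is the schedule I have in mind (totally hypothetical, not from the paper; the 12.5% default is just the capacity I remember them settling on):

```python
def capacity_at_step(step: int, total_steps: int,
                     final_capacity: float = 0.125) -> float:
    """Anneal routing capacity from 1.0 (route every token, i.e. the
    pretrained dense behaviour) down to a target MoD capacity over the
    course of fine-tuning, so the activations can adapt gradually."""
    frac = min(step / max(total_steps, 1), 1.0)
    return 1.0 - frac * (1.0 - final_capacity)

# With top-k routing over a sequence of length S:
#   k = max(1, int(capacity_at_step(step, total_steps) * S))
```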
What field of math do you need to understand this?
Just an understanding of machine learning models at a high level and of how transformers work. The experts themselves are just a linear or feed-forward layer in an MoE, and the single expert in this paper is a transformer layer.
I'd add that a basic understanding of statistics can help, along with some introductory calculus. But for the most part there is more trial and error behind these discoveries than you might believe. The understanding sometimes comes after ;)
@gabrielmongaras what do you think helped you best understand neural networks? I have a shallow understanding of how transformers work. I know how the encoder works, but I don't really understand the decoder fully. I also know PyTorch only well enough to build simple convolutional neural networks. And I have a really strong understanding of calculus and linear algebra.
Calculus and Linear Algebra.
@tgugdevil Thank you! Sorry, I forgot to mention that I already have a strong understanding of those concepts.