This is one of the best explanations of MoE. It goes into enough depth to give a good idea of the internal workings, the problems, and the evaluation results. Great work!
You made a complex topic appear simple by giving just the right insight at the right time, hitting the sweet spot between indigestible and oversimplified. I was really wondering about the training process and you gave invaluable insight into that. It is not made clear in the paper and the code was also somewhat confusing. So, thanks for that buddy.
Appreciate that! thanks
Agreed. This was an impressive explanation.
One of the more approachable videos on the concept in RUclips.
thank you for this accessible explanation of a somewhat complex subject
12:20 I heard that there is a minimum size for an expert to become reasonably functional.
It worked for GPT4 because it had 1,800b parameters, which was more than it needed considering the size of the data set used.
However, splitting a 7b parameter LLM like Mistral into 8 would make each expert less than 1b parameters. As a result it may have ~8x faster inference, but the performance of even the best expert chosen by the router would be much worse than the original 7b parameter Mistral, or even a half-sized 3.5b Mistral. Even at 70b parameters (Llama 2) a mixture of experts would perform significantly worse on every prompt than the original 70b LLM, or even a half-sized 35b Llama 2.
It's not until the parameter count starts to exceed what is ideally required, given the size of the input corpus, that an MoE becomes reasonable. And even then a 1,800b parameter non-MoE GPT4 would perform ~10% better than an MoE, but such a small bump in performance isn't worth the ~8x inference cost. And a 225b non-MoE GPT4 would perform much worse than the ideally chosen 225b expert. So in the end you get a notable bump in performance for the same inference cost.
Yet at 180b parameters or less, a corpus big enough to capture a web dump, thousands of books, etc. is too big for the model to be reasonably split into an MoE. Each expert needs to be larger than a minimum size (~100b or more) to capture the nuances of language and knowledge every expert requires as a base in order to respond as reasonably and articulately as GPT4 does.
Great video and a really clear description. Thanks a lot!
you're welcome!
Incredibly well made video. Thank you.
Very nice explanation
I like how you think, you found a new sub
Matrices represent weights. Not neurons. The biases in the neurons are represented using vectors that are added after multiplying by a matrix.
You're right. My terminology was off.
As you say, weights are the components of matrices.
Neurons encompass the full operation of taking inputs, passing them through weights + biases + activations, and producing outputs.
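To put that in a tiny sketch (shapes and values are made up purely for illustration): W and b are the parameters, and the "neuron" is the whole operation activation(W @ x + b).

```python
import numpy as np

# W and b are the parameters (weights and biases); a "neuron" is the whole
# operation activation(W @ x + b). Shapes here are made up for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix: one row per output neuron
b = rng.standard_normal(4)        # bias vector: one bias per output neuron
x = rng.standard_normal(3)        # input vector

def relu(z):
    return np.maximum(z, 0.0)

y = relu(W @ x + b)               # the full "neuron" operation
print(y.shape)                    # (4,)
```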
Loved your presentation.... Mixtral mentions using a TopK() for routing... how can such a method work if they use Fast Feed Forward (where all the decisions are binary)?
Howdy! thanks, appreciate that.
In fast feed forward, one option is to just use the top expert.
I believe you can still calculate the probability of each leaf being chosen, so it should still be possible to do a top k approach. It just means you activate all of the decision tree nodes (of which there are few relative to linear layer weights anyways).
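Here's a rough sketch of that leaf-probability idea (tree depth, node weights, and shapes are all made up, and this isn't the actual fast feedforward implementation): multiply the branch probabilities along each root-to-leaf path to get a probability per leaf, then keep the top k.

```python
import numpy as np

# Each internal node of a binary routing tree outputs p(go right) for input x.
# Multiplying branch probabilities along each root-to-leaf path gives one
# probability per leaf ("expert"), so you can still rank leaves and take top-k.
rng = np.random.default_rng(0)
depth, d_model = 3, 8                         # 2**depth = 8 leaves
node_w = rng.standard_normal((2**depth - 1, d_model))  # one decision vector per node
x = rng.standard_normal(d_model)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_probs(x):
    probs = np.ones(2**depth)
    for leaf in range(2**depth):
        node = 0                              # root, heap-style indexing
        for level in range(depth):
            go_right = (leaf >> (depth - 1 - level)) & 1
            p_right = sigmoid(node_w[node] @ x)
            probs[leaf] *= p_right if go_right else (1.0 - p_right)
            node = 2 * node + 1 + go_right    # step to the chosen child
    return probs                              # sums to 1 by construction

p = leaf_probs(x)
top_k = np.argsort(p)[-2:][::-1]              # the 2 most probable leaves
print(p.round(3), top_k)
```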
@@TrelisResearch Thank you!
Very interesting! Would it not be worth testing an introductory sentence that points to the subject of the chat vs no such leading sentence?
Yeah, I think performance could def be improved with prompt engineering - although I was trying to keep things simple so devs know what works plug-and-play and what doesn't.
Long summaries are hard because there is so much information feeding forward into the next token prediction that smaller models will either refuse, respond with a blank, or fall into repetition. Refusing or responding with a blank makes sense because that's a common occurrence in text. I'm less sure what drives repetition; probably it's to do with mistuning of length parameters.
Anyway, bigger models can handle more of the attention information and generate meaningful text above the baseline probability of refusal/blank responses.
@TrelisResearch have you seen that priming technique ruclips.net/video/piRMk2KIx2o/видео.htmlsi=ZjdtMd-idT29QKA4
Isn't MoE good at multi-task learning and multi-objective scenarios? Isn't that one of the main reasons to employ MoE? That was my understanding - it would be great to get your thoughts.
MoE doesn't have anything in particular that makes it good at learning. Probably it's slower to learn because you end up training multiple experts on some similar content.
The benefit is in cutting inference time. So it's really a cost/speed improvement rather than a quality improvement.
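To put rough numbers on that (all figures here are illustrative placeholders, not an exact Mixtral config): with 8 experts but only 2 routed per token, the per-token compute looks like a much smaller dense model even though the total parameter count is large.

```python
# Back-of-the-envelope: total parameters stored vs parameters active per token
# for an 8-expert, top-2 MoE. All sizes below are illustrative placeholders.
n_layers, d_model, d_ff = 32, 4096, 14336
n_experts, top_k = 8, 2

attn_per_layer = 4 * d_model * d_model      # q, k, v, o projections (rough)
ffn_per_expert = 3 * d_model * d_ff         # gated feed-forward (rough)

total  = n_layers * (attn_per_layer + n_experts * ffn_per_expert)
active = n_layers * (attn_per_layer + top_k    * ffn_per_expert)

print(f"stored:           {total / 1e9:.1f}B parameters")
print(f"active per token: {active / 1e9:.1f}B parameters")
```

Storage (and memory cost) scales with the total, while per-token compute scales with the active share, which is where the speed benefit comes from.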
This video's insane!
GPT-3 came out in the summer of 2020. Maybe you meant that ChatGPT came out in November of '22?
Where does the router sit? Is it with every expert on a GPU, or does it sit on the CPU?
It sits in every layer of the model on the GPU! There is typically a router in each layer for the feed-forward portion.
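Roughly, a routed feed-forward block looks something like the sketch below (names and shapes are made up; this is the general pattern rather than the exact Mixtral code): a small linear gate in each layer scores the experts for every token, the top-k expert FFNs run, and their outputs are combined with the gate weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """One layer's routed feed-forward block: the router (gate) lives inside
    the layer, right next to the experts it chooses between."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)  # per-layer router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # naive loops; real kernels batch by expert
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

x = torch.randn(4, 512)
print(MoEFeedForward()(x).shape)                 # torch.Size([4, 512])
```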
last time I had to deal with tokens, I was putting them in the skeeball at Chuck e Cheese, lol. That was the last time. oh, no, there's macros. nm.
I came to learn about MoE, but got some interesting training on Fast feed forward networks. Pretty cool. Might have to watch this again.
From what I'm learning, this can't use like ControlNet or LoRA adapters, right?
Seems like MoE is only for the big boys - only someone able to afford a Blackwell, or another recent big-dog GPU.
haha, love it.
Yeah I don't know ControlNet but LoRA works fine on MoE (so long as you only apply it to the attention layers, which are shared, and not the sparse feed-forwards) - quick sketch below.
It's for the big boys, the MoE, yeah I tend to agree. 7B and 34B models are just better imo if running locally or even on rental machines... To do MoE you need to be renting at least 2 GPUs.
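Something like this with the Hugging Face peft library (the target module names are the usual attention projections for Mixtral-style checkpoints, but worth double-checking against the model you actually load; the r/alpha/dropout values are just placeholders):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # apply LoRA to the shared attention projections only,
    # not to the per-expert (sparse) feed-forward weights
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the attention adapters are trainable
```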
@TrelisResearch thanks, yeah, I guess rental would be the way to go these days. It's getting ridiculous. But they are damn good...
@TrelisResearch I missed the punchline, "Attention is all you need!" - oc, that's Transformers - but so what - close enough. Same field
Isn't a mixture of experts similar to a GAN, in that it has two networks that use each other to improve?
The experts don't use each other to improve. They don't see each other's outputs.
Why 8 experts? Is there any structural consideration behind the choice?
Well, typically it's a power of two because computing is binary and a lot derives from that.
As to why 8 and not 4 or 16...
If you do 2, that's only a 2x increase in speed...
But if you do 16, then you have load balancing issues at inference because they may not all be used roughly equally.
That's my best guess.
Why not intentionally train each expert in a topic? To make it an expert in something?
You could, but it may not be the most efficient way.
Most likely, a lot of the semantics and statistical relationships would be repeated in the experts, so it is best to let gradient descent do the segregation.
@TrelisResearch Most likely, things get repeated anyway. No one ever said neural networks are efficient; they just fit a curve reasonably well when a human doesn't necessarily know how to do it.
@ernststravoblofeld yup I think both of those things are true too 👍