Thank you for consistently producing such in-depth, informative content. Your long-format videos are a treasure trove of knowledge. Really appreciate the effort you put into making these detailed explanations!
thanks greg
Thank you! This is fascinating AND instructive. You have a true talent for explaining complex ideas.
thanks!
Always an excellent share, congratulations
thanks Loic
Nice work. So glad you do such in-depth processes. Question: in this video you go over distillation with the goal of keeping as much knowledge and functionality of the original model as possible. But what if you are really only interested in a smaller domain of that functionality? I would assume that instead of using 2% of whatever dataset, you could use even fewer samples of a compatible dataset? You would end up with a very small, very specialized model that may be better than the original at your specific domain?
Even better if I could train locally on a single 3090.
Yes perhaps.
The thing is that the background knowledge may provide useful scaffolding for your smaller subset of knowledge.
My guess is that you should distill on 2% plus your subset of data.
And yes, if you are doing less than 1B models, then distilling on local hardware is possible. Much bigger is hard, although perhaps - with GaLore approaches or Adafactor - you could do a 4-5B model.
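For concreteness, here is a minimal sketch of what "distill on 2% plus your subset" could look like in PyTorch, using the common KL-divergence-on-logits formulation (not necessarily the exact recipe from the video). The model and dataset names are placeholders, `batch` is assumed to be a dict of tokenized tensors, and teacher and student are assumed to share a tokenizer/vocabulary:

```python
# Sketch only: distillation on a mix of background data plus a domain subset.
# All names marked "placeholder" are assumptions, not from the video.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import concatenate_datasets, load_dataset

TEACHER = "teacher-model-name"   # placeholder
STUDENT = "student-model-name"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype=torch.bfloat16).eval()
student = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)

# "2% plus your subset": take a small slice of general data and add domain data.
background = load_dataset("background-dataset-name", split="train[:2%]")  # placeholder
domain = load_dataset("your-domain-dataset", split="train")               # placeholder
train_data = concatenate_datasets([background, domain]).shuffle(seed=42)

def distillation_loss(batch: dict, temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    t = temperature
    # Student log-probs against softened teacher probs, scaled by t^2.
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

To push towards a 4-5B student on a single GPU, a memory-efficient optimizer such as GaLore or Adafactor could replace AdamW in whatever training loop wraps this loss.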
Nice 🔥🔥🔥
cheers
How would distillation compare to architecture search, when only concerned with a smaller domain? For instance, in T2I, only pictures of animals. Would it be less compute in total to find and train a NOVEL 100M-param architecture vs a 4B-param distilled model?
I feel like there is more work to be done in model architecture.
Well, if the task you're developing a model for is novel, you may not be able to distill.
However, maybe you could distill and then do fine-tuning. Or do fine-tuning and distill from that.
@TrelisResearch Thank you. The purpose of the exercise would be mainly to find a new layer or sub-layer architecture for the same task as the original model.
Hi! What models do you recommend for coding that are smaller than 48 GB? Have you fine-tuned any?
Check the latest Qwen and DeepSeek models.
very helpful, thanks Trelis
Also, is there a Discord server we can join to get connected?
there is, but - fair warning - it's paid lifetime access. You can find some free and paid options for support at trelis.com/about though.