Distillation of Transformer Models

  • Published: 14 Dec 2024

Comments • 17

  • @gregmeldrum 2 months ago +3

    Thank you for consistently producing such in-depth, informative content. Your long-format videos are a treasure trove of knowledge. Really appreciate the effort you put into making these detailed explanations!

  • @sergerylenberg8711 2 months ago

    Thank you! This is fascinating AND instructive. You have a true talent for explaining complex ideas.

  • @loicbaconnier9150 2 months ago

    Always an excellent share, congratulations

  • @EternalKernel 2 months ago +1

    Nice work. So glad you do such in-depth processes. Question: in this video you go over distillation with the goal of keeping as much of the original model's knowledge and functionality as possible. But what if you are really only interested in a smaller domain of that functionality? I would assume that instead of using 2% of whatever dataset, you could use even fewer samples from a compatible dataset? You would end up with a very small, very specialized model that may be better than the original at your specific domain?
    Even better if I could train locally on a single 3090.

    • @TrelisResearch 2 months ago

      Yes, perhaps.
      The thing is that the background knowledge may provide useful scaffolding for your smaller subset of knowledge.
      My guess is that you should distill on the 2% plus your subset of data (a rough sketch of such a run follows below).
      And yes, if you are doing models under 1B, then distilling on local hardware is possible. Much bigger is hard, although perhaps, with GaLore approaches or Adafactor, you could do a 4-5B model.
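
      For illustration, here is a minimal sketch of what such a domain-focused distillation run could look like, using a softened KL term against the teacher's logits alongside the usual language-modelling loss. The model names, the mixing ratio, and all hyperparameters are assumptions for the sketch, not settings from the video.

      ```python
      # Minimal logit-distillation sketch (assumed models and hyperparameters).
      import torch
      import torch.nn.functional as F
      from transformers import AutoModelForCausalLM, AutoTokenizer

      teacher_id = "Qwen/Qwen2.5-7B-Instruct"    # assumed teacher
      student_id = "Qwen/Qwen2.5-0.5B-Instruct"  # sub-1B student, trainable on a single 3090

      tokenizer = AutoTokenizer.from_pretrained(teacher_id)
      teacher = AutoModelForCausalLM.from_pretrained(
          teacher_id, torch_dtype=torch.bfloat16, device_map="auto"
      ).eval()
      student = AutoModelForCausalLM.from_pretrained(
          student_id, torch_dtype=torch.bfloat16, device_map="auto"
      )

      def distill_loss(batch, temperature=2.0, alpha=0.5):
          """Blend a softened KL term against the teacher with the student's LM cross-entropy."""
          with torch.no_grad():
              t_logits = teacher(**batch).logits
          out = student(**batch, labels=batch["input_ids"])
          s_logits = out.logits
          # Guard against padded-vocab mismatches between teacher and student heads.
          vocab = min(s_logits.size(-1), t_logits.size(-1))
          kl = F.kl_div(
              F.log_softmax(s_logits[..., :vocab] / temperature, dim=-1),
              F.softmax(t_logits[..., :vocab] / temperature, dim=-1),
              reduction="batchmean",
          ) * temperature**2
          return alpha * kl + (1 - alpha) * out.loss
      ```

      A training loop would then draw batches from the small general-data slice plus the domain-specific subset discussed above.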

  • @danieladama8105 2 months ago

    Nice 🔥🔥🔥

  • @EternalKernel 2 months ago +1

    How would distillation compare to architecture search when you are only concerned with a smaller domain? For instance, in T2I, only pictures of animals. Would it be less compute in total to find and train a NOVEL 100M-param architecture vs a 4B-param distilled model?
    I feel like there is more work to be done in model architecture.

    • @TrelisResearch 2 months ago +1

      Well, if the task you're developing a model for is novel, you may not be able to distill.
      However, maybe you could distill and then fine-tune, or fine-tune and then distill from that.

    • @EternalKernel 2 months ago

      @TrelisResearch Thank you. The purpose of the exercise would mainly be to find a new layer or sub-layer architecture for the same task as the original model.

  • @btaranto 2 months ago

    Hi! What models do you recommend for coding that fit in under 48 GB? Have you fine-tuned any?

    • @TrelisResearch 2 months ago +1

      Check the latest Qwen and DeepSeek models (a rough loading sketch follows below).
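
      As an illustration of the 48 GB constraint, a ~32B coding model quantised to 4-bit fits comfortably within that budget. The exact model name below is an assumption for the sketch, not a recommendation from the video.

      ```python
      # Hedged loading sketch: an assumed ~32B coding model in 4-bit (~20 GB of weights).
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

      model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"  # assumed choice

      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          quantization_config=BitsAndBytesConfig(
              load_in_4bit=True,
              bnb_4bit_compute_dtype=torch.bfloat16,
          ),
          device_map="auto",
      )
      ```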

  • @SiD-hq2fo 2 months ago

    Very helpful, thanks Trelis.
    Also, is there a Discord server we can join to get connected?

    • @TrelisResearch 2 months ago

      There is, but - fair warning - it's paid lifetime access. You can find some free and paid options for support at trelis.com/about, though.