Model Distillation: Same LLM Power but 3240x Smaller

  • Published: 25 Oct 2024

Comments • 26

  • @gunaysoni6792
    @gunaysoni6792 2 months ago +8

    Strictly speaking, this is just fine-tuning on synthetic data, not distillation. Distillation for language models trains the student model on the teacher model's entire probability distribution, not just SFT with next-token prediction.
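
    A minimal sketch of the difference being described, assuming a PyTorch setup where the teacher's logits are available (the white-box case); the names here are illustrative, not code from the video:

        import torch.nn.functional as F

        # SFT on synthetic data: ordinary next-token cross-entropy against the
        # tokens the teacher happened to sample (hard labels).
        def sft_loss(student_logits, teacher_token_ids):
            # student_logits: (batch, seq, vocab); teacher_token_ids: (batch, seq)
            return F.cross_entropy(
                student_logits.view(-1, student_logits.size(-1)),
                teacher_token_ids.view(-1),
            )

        # Distillation proper: match the teacher's full next-token distribution
        # (soft labels) via KL divergence, usually softened with a temperature T.
        def distillation_loss(student_logits, teacher_logits, T=2.0):
            student_log_probs = F.log_softmax(student_logits / T, dim=-1)
            teacher_probs = F.softmax(teacher_logits / T, dim=-1)
            return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)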

    • @muhannadobeidat
      @muhannadobeidat 2 months ago

      He distilled into a fine-tuned model for his own use case. I think it is an excellent use case, albeit specific to a sentiment-analysis classification example.

    • @lncoder9213
      @lncoder9213 17 days ago

      In black-box KD, only the teacher model's prompt and response pairs are available. This approach is applicable to models that do not expose logits.
      In white-box KD, the teacher model's log probabilities are used. White-box KD is therefore applicable only to open-source models that expose their logits.
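
      A rough sketch of the black-box side of that distinction: only the teacher's text responses are collected (no logits), and the pairs then become an ordinary SFT dataset for the student. query_teacher is a stand-in for whatever API client is actually used:

          # Black-box KD: the teacher sits behind an API, so we can only collect
          # (prompt, response) pairs and fine-tune the student on them with SFT.
          def build_blackbox_pairs(prompts, query_teacher):
              pairs = []
              for prompt in prompts:
                  response = query_teacher(prompt)  # text only; no access to logits
                  pairs.append({"prompt": prompt, "response": response})
              return pairs

          # White-box KD would additionally require the teacher's per-token
          # log-probabilities, which is why it needs an open-weight teacher.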

  • @muhannadobeidat
    @muhannadobeidat 2 months ago

    Excellent video, especially the white-paper review at the start and how you used knowledge from Llama 3.1 to train a smaller model.

  • @KevinKreger
    @KevinKreger 2 months ago +2

    Nice to find you! Great topic

  • @unclecode
    @unclecode 2 months ago +2

    Fascinating. Very well organized and clearly explained. I have a few questions:
    1. Have you tried fine-tuning the RoBERTa model using human-annotated labels? It would be interesting to compare the accuracy of a model trained on synthetic labels versus one trained on human-annotated data. Is there a significant difference?
    2. I understand we have a dataset labeled by a larger model, which we then use to train a smaller model. But I'm curious whether, instead of just labeling, the large model could generate the entire dataset, especially for customized data that doesn't necessarily exist. For example, instead of tweets, we could generate business data from customer reviews. We could fine-tune a large model on a sample of customer reviews to teach it the tone and style, then use the new model to create, say, 5,000 customer reviews and annotate them, and finally use those to fine-tune a smaller model. This would be an extreme version of model distillation where both the data and the labels are generated by the large model.
    3. Have you considered trying this with a smaller model, less than 100 million parameters? Since this is sentiment analysis, using an even smaller model might yield faster results and keep the accuracy high.

    • @AdamLucek
      @AdamLucek  2 months ago +1

      Thanks! And great questions, some thoughts:
      1. Yes, if you wanted a lightweight, specialized classification model, then just using the human-annotated labels would be the traditional way. There are plenty of RoBERTa-base models trained on the same set I used; your goal then is to measure exactly what you wish to measure, the direct accuracy of your model on the dataset. That's valid, but it's not quite the point of the demonstration here, which used accuracy as a baseline to compare two models, so a higher "accuracy" in this case doesn't actually represent better success from the model training.
      2. You're on the right track for this one, and it's definitely a technique that many are using, especially when it comes to data augmentation. I highly recommend reading through "A Survey of Knowledge Distillation of Large Language Models" arxiv.org/pdf/2402.13116, from which many of the examples in this video come. They go over a plethora of ways, beyond just classification, that this is being used.
      3. I have not! I kept it light to turn this video around faster, but many optimizations can be made to this model: different base language models, different fine-tuning hyperparameters, etc. While the accuracy metric is the same, the current model's labels only match Llama 3.1 405B's about 75% of the time. Different sizes, methods, and iterations could improve on this!
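
      For readers who want to experiment with the same kind of setup, a condensed sketch (not the exact code from the video) of fine-tuning a small RoBERTa classifier on teacher-labeled examples with Hugging Face transformers; the records, label scheme, and hyperparameters are placeholders:

          from datasets import Dataset
          from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                    Trainer, TrainingArguments)

          # Tiny illustrative set; in practice these labels come from the teacher LLM.
          teacher_labeled = [
              {"text": "Loved the product, works great!", "label": 2},
              {"text": "It's okay, nothing special.", "label": 1},
              {"text": "Broke after two days.", "label": 0},
          ]

          tokenizer = AutoTokenizer.from_pretrained("roberta-base")
          model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

          dataset = Dataset.from_list(teacher_labeled)
          dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
          split = dataset.train_test_split(test_size=0.34)

          trainer = Trainer(
              model=model,
              args=TrainingArguments(output_dir="student-roberta", num_train_epochs=3,
                                     per_device_train_batch_size=16),
              train_dataset=split["train"],
              eval_dataset=split["test"],
              tokenizer=tokenizer,  # lets the Trainer pad batches automatically
          )
          trainer.train()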

  • @HenockTesfaye
    @HenockTesfaye 2 months ago +1

    Very clear. Thank you.

  • @GNARGNARHEAD
    @GNARGNARHEAD 2 months ago

    exciting stuff, thanks for sharing

  • @gramnegrod
    @gramnegrod 2 months ago +1

    Wow, a very interesting agent builder for many use cases! I wonder if using DSPy would help the teacher make an even better dataset to approximate the 65.4?

    • @AdamLucek
      @AdamLucek  2 months ago

      Certainly! In Moritz Laurer's blog here, huggingface.co/blog/synthetic-data-save-costs, he uses chain-of-thought, few-shot, and self-consistency prompting, whereas I only used chain-of-thought and few-shot prompting. Using DSPy instead for prompting could optimize it further!
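
      A rough illustration of that prompting side: few-shot examples plus a chain-of-thought instruction, with self-consistency approximated by a majority vote over several samples. The generate callable is a placeholder for whichever teacher-model call is used:

          from collections import Counter

          FEW_SHOT = (
              "Classify the sentiment of the text as positive, neutral, or negative.\n"
              "Think step by step, then give the final label on its own line.\n\n"
              'Text: "The launch went better than expected."\n'
              'Reasoning: "Better than expected" signals a good outcome.\n'
              "Label: positive\n\n"
              'Text: "Shares were flat after the announcement."\n'
              'Reasoning: "Flat" describes no meaningful change either way.\n'
              "Label: neutral\n"
          )

          def parse_label(completion):
              # Take the last "Label:" line the teacher produced.
              lines = [l for l in completion.splitlines() if l.strip().lower().startswith("label:")]
              return lines[-1].split(":", 1)[1].strip().lower() if lines else "neutral"

          def label_with_teacher(text, generate, n_samples=5):
              prompt = FEW_SHOT + f'\nText: "{text}"\nReasoning:'
              votes = [parse_label(generate(prompt)) for _ in range(n_samples)]
              return Counter(votes).most_common(1)[0][0]  # self-consistency: majority vote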

  • @i2c_jason
    @i2c_jason 2 months ago

    Would it be possible to achieve the same end result with a bunch of conditional iterating and API calls to pay-for-play LLMs, if money is not an issue and you are trying to prototype a scalable generative AI application (a SW 3.0 application, let's say) for a pitch deck? I love the distillation idea, as it parallels something I'm working on, but I'm concerned that I'll put too much time into something that won't scale as these capabilities become native to the API calls of the paid LLMs. What are your opinions on scaling something as a very small bootstrapped startup?

  • @simonlove99
    @simonlove99 2 months ago

    Good insights and intro. One key challenge for me, though, was the conclusion about transfer: without a vanilla RoBERTa rating, we don't know if there was any material influence on the output. How would RoBERTa have scored on the task pre-fine-tuning?

    • @AdamLucek
      @AdamLucek  2 months ago +3

      Very good points! My method here is very basic, with many optimizations and evaluations yet to be performed. While the accuracy was similar, the distribution of classifications was only roughly 75% similar on RoBERTa's side, so many improvements can still be made. We do know that the RoBERTa model learned something, though, as we used the base model, which cannot perform this task in its original state!
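
      For anyone wanting to reproduce that kind of check, a small sketch comparing the fine-tuned student's predictions against the teacher's labels and against human annotations; the arrays are purely illustrative:

          import numpy as np

          teacher_labels = np.array([2, 0, 1, 2, 1])  # assigned by the teacher LLM
          student_preds  = np.array([2, 0, 2, 2, 1])  # from the fine-tuned RoBERTa
          gold_labels    = np.array([2, 0, 1, 1, 1])  # human annotations

          def agreement(a, b):
              return float((np.asarray(a) == np.asarray(b)).mean())

          print("student vs teacher:", agreement(student_preds, teacher_labels))  # fidelity to the teacher
          print("student vs gold:", agreement(student_preds, gold_labels))        # plain benchmark accuracy
          print("teacher vs gold:", agreement(teacher_labels, gold_labels))       # the teacher's own ceiling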

  • @drkvaladao776
    @drkvaladao776 2 months ago +1

    Is there any distilled GGUF model available to use?

  • @TheShreyas10
    @TheShreyas10 2 months ago

    Exciting stuff, interesting to see, but does it also support summarization or only text classification?

    • @AdamLucek
      @AdamLucek  2 months ago

      My example was classification; in theory you can do this with anything! Google's Gemma 2B is an entire regular language model that's trained using distilled data.

  • @cosmockips907
    @cosmockips907 2 months ago +1

    Vanaduke is calling your name

    • @AdamLucek
      @AdamLucek  2 months ago +1

      What's this? More wolves hungry for the blood of Almire? Our great kingdom shall never fall to the likes of beasts!

    • @cosmockips907
      @cosmockips907 2 months ago

      @AdamLucek Glad you're doing well, miss the streams

  • @fabriziocasula
    @fabriziocasula 2 months ago

    Can I use RoBERTa with Ollama? How can I download it? :-)

  • @SussyBacca
    @SussyBacca 2 months ago +1

    Uh, 125M params for sentiment analysis is HUGE. You don't even need AI for it; you can use Bayesian statistics. Qualitatively speaking, this is a bit of cold oatmeal.
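
    For comparison's sake, the kind of non-neural baseline this comment has in mind: a minimal Naive Bayes sentiment classifier with scikit-learn (toy data, purely illustrative):

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Toy data; a real baseline would use the same tweet dataset as the video.
        texts = ["great phone, love it", "terrible battery, waste of money",
                 "works fine I guess", "absolutely fantastic service",
                 "worst purchase ever", "pretty decent overall"]
        labels = ["positive", "negative", "neutral", "positive", "negative", "positive"]

        clf = make_pipeline(CountVectorizer(), MultinomialNB())
        clf.fit(texts, labels)
        print(clf.predict(["the screen is amazing", "this broke immediately"]))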

    • @gramnegrod
      @gramnegrod 2 months ago +1

      Well, OK... maybe a bad example dataset. But the point is that the micro LLM did it.

    • @vikphatak
      @vikphatak 2 months ago

      128M, not 128B