Fine-tuning Language Models for Structured Responses with QLoRa

  • Published: 11 Sep 2024

Comments • 55

  • @TrelisResearch
    @TrelisResearch  1 year ago +4

    UPDATE (ADVANCED):
    - The ADVANCED portion of the video and notebook shows left-padding. I have since switched to right-padding of sequences. This seems more robust because the beginning-of-sequence token (<s>) is always tokenized as a single token when it sits at the start of an input (i.e. with right padding). By contrast, after pad tokens, the tokenizer often splits the beginning-of-sequence token into three tokens, which can lead to misalignment of the loss mask and attention mask, as well as unknown tokens. (A sketch of the right-padding setup follows this thread.)

    • @hariduraibaskar9056
      @hariduraibaskar9056 10 months ago

      so do we change to right padding now?

    • @TrelisResearch
      @TrelisResearch  9 months ago +1

      @hariduraibaskar9056 Yup, the script now uses right padding. I think it's a bit more robust when fine-tuning for structured responses. For doing unsupervised training/pre-training, left padding is probably better because you don't want unfinished sentences ending with
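
    A minimal sketch of the right-padding setup described in the update above (assuming the Hugging Face transformers tokenizer API; the choice of pad token is illustrative, not the notebook's exact setup):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    tokenizer.padding_side = "right"           # pads go after the text, so <s> stays at position 0
    tokenizer.pad_token = tokenizer.unk_token  # illustrative: reuse an existing token as the pad token

    batch = tokenizer(["short example", "a somewhat longer example"],
                      padding=True, return_tensors="pt")
    # With right padding, every row starts with the <s> token id, and the attention mask
    # zeroes out the trailing pad positions rather than the leading ones.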

  • @mohammadkhair7
    @mohammadkhair7 1 year ago +4

    What an amazing and insightful tutorial with detailed understanding and review of the LLaMA-2 fine-tuning and inference all extended with function calling. Much appreciated!
    I will be supporting this channel and advertising it. Well done.

  • @emrahe468
    @emrahe468 1 year ago +3

    Thanks for the nice tutorial, this helped me greatly in running the fine-tuned model

  • @tomhavy
    @tomhavy 1 year ago +1

    Hell yes, thanks a lot for this!

  • @colkbassad
    @colkbassad 8 months ago +1

    Please keep up the content, you're very gifted at teaching and presenting clearly. I'm very interested (read: obsessed) in function calling on local LLMs. My goal is a solution that doesn't require a network connection. I've found mistral-7b is the best trade-off between hardware requirements and inference reliability. It tends to fall apart with more functions, though.
    I try to keep things simple by grouping my function descriptions into areas of responsibility (e.g. map-navigation and map-styling) and having a main agent that decides what the user is trying to do. Then, based on the choice from the main agent, I invoke the sub-agent that has the relevant function descriptions.
    It seems to help keep the model focused and is more efficient with the context window. I even get promising results with the default instruct version, but I'm very interested in fine-tuning for my use case. I tried NexusRaven 13B and it works well, but it runs too slow on my A5000 laptop. Do you think this is worth pursuing? Can you recommend some of your gated content given what I'm up to?

    • @TrelisResearch
      @TrelisResearch  8 months ago

      Howdy @colkbassad! Sure, a few ideas:
      1. Have a look at this function calling video: ruclips.net/video/hHn_cV5WUDI/видео.htmlsi=6woUAR2XFGnQzWdB . In case you haven't seen it already.
      2. Yes, Mistral v0.1 and v0.2 are (perhaps oddly?) only ok at function calling. By far the best value model I've tested (in terms of capabilities per model size) is OpenChat 3.5 (huggingface.co/Trelis/openchat_3.5-function-calling-v3). It's also demo'd in the video above.
      3. Actually, models that are good at code are great at function calling. There are v2 function-calling models on HuggingFace under Trelis for all DeepSeek model sizes, and DeepSeek is a strong model. The drawback is that coding models are not as strong on non-code and non-function chats.

  • @WinsonDabbles
    @WinsonDabbles 8 months ago +1

    I started watching all your videos! I couldn't find one that explained how to create fine-tuning datasets from your own personal or company data to fine-tune on, though. Only existing datasets created by people from HF. Have you any tips or ideas? Happy to pay for these tips/info. Good job! Enjoying every single one I have watched

    • @TrelisResearch
      @TrelisResearch  8 months ago +1

      ah yeah, you need to go to the three videos entitled "Fine-tuning versus embeddings" - that's where I do a custom dataset (touch rugby rules)

    • @WinsonDabbles
      @WinsonDabbles 8 months ago

      @@TrelisResearch amazing! I’ll go watch it! Thank you! Keep killing it man!

  • @user-gw1zd2kk5v
    @user-gw1zd2kk5v 1 year ago +5

    2^4 = 16 (not 32)
    2^32 = 4,294,967,296

  • @ghrasko
    @ghrasko 8 months ago +1

    Hi,
    somewhere around 38:00 in the video you start working on a Colab sheet, "QLoRa Training for Small Datasets". I have purchased the Advanced Fine Tuning package, but I can't find it there.

    • @TrelisResearch
      @TrelisResearch  8 months ago

      Howdy! Yeah, the latest version of that script is the function calling notebook in the function calling branch.

  • @MrSCAAT
    @MrSCAAT 8 months ago

    Great Work

  • @user-yu8sp2np2x
    @user-yu8sp2np2x 8 months ago +1

    How do you decide the r=16 and lora_alpha=32 in the LoraConfig?

    • @TrelisResearch
      @TrelisResearch  8 months ago

      It's empirical; r of 8 or 16 with alpha of about 32 tends to work well.
      r is the rank of the adapter matrices, so the adapters are of size r × embedding_size. If you make r as big as the embedding size, then the adapters are pointless because they are just as big as the weight matrices. The whole idea is to train smaller adapters.
      So you want r to be a lot smaller than the embedding size, but you also want it big enough that the adapter matrices can retain some info.
      Meanwhile, alpha * learning rate / r is the learning rate used for the adapters, so you always want alpha to be some multiple or fraction of r that is not too far from 1 (i.e. having alpha be four times r is fine). Keep an eye out for a new vid on tiny models where I talk about sizing r. (A LoraConfig sketch follows at the end of this thread.)

    • @user-yu8sp2np2x
      @user-yu8sp2np2x 8 months ago +1

      ​@@TrelisResearch Sure, Thanks for the insights!
      Also, have you tried fine-tuning vision models? What are your views regarding them?
      Also, I have watched most of your videos from the beginning. If I want to learn things in depth, could you suggest ways to do that?

    • @TrelisResearch
      @TrelisResearch  8 months ago

      @@user-yu8sp2np2x you're then at the point where you should start reading papers! Attention Is All You Need, LoRA, AWQ, GPTQ, LIMA, and many more!
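
    Returning to the r / lora_alpha question earlier in this thread, here is a minimal sketch (assuming the PEFT library's LoraConfig; the target modules and values are illustrative):

    from peft import LoraConfig

    # r is the adapter rank: each adapter matrix is r x embedding_size, so keep r well
    # below the embedding size. alpha scales the adapter learning rate by alpha / r.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,   # alpha / r = 2, i.e. not too far from 1
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )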

  • @Wanderlust1342
    @Wanderlust1342 1 year ago +2

    Excellent stuff, can you tell us about the max_steps arg? If, let's say, I set it to 1000, would it mean we will be looking at the first 1000 batches?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Howdy! Yes if you have a batch size of 1 (and gradient accumulation of 1, I think). If your batch size is 2, then 1000 steps would include 2000 rows of data.
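
    A rough sketch of how max_steps maps to rows of data (assuming the transformers TrainingArguments names; the numbers are illustrative):

    from transformers import TrainingArguments

    # rows seen ≈ max_steps * per_device_train_batch_size * gradient_accumulation_steps
    args = TrainingArguments(
        output_dir="outputs",
        max_steps=1000,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
    )
    # 1000 * 2 * 1 = 2000 rows of data (the first 2000 if the dataset is not shuffled)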

  • @gustavofelicidade_
    @gustavofelicidade_ 10 months ago

    Subscribed!

  • @user-lz8wv7rp1o
    @user-lz8wv7rp1o 1 year ago +1

    great

  • @prestonmccauley43
    @prestonmccauley43 1 year ago +2

    Overall this was a really good video, with a lot of good detail and explanation. I think you overcomplicated it a bit by using functions. A lot of us are just looking for how to use a non-function-based dataset - for example, 3 columns of data in a dataset with an instruct model. How much data is necessary? Your technique of padding and a custom pad token looks like a good process, but can we get a simple example as well?

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      Regarding data quantity - it's hard to generalise. Probably the larger the model, the less the data required (as the model has more of a statistical base already). My experience is that 50-100 datapoints can be enough.
      One specific example. In the case of function calling, it's important to train with some examples where the model is given functions in the prompt, but the prompt does not require a function call. This avoids the model being led to believe that the presence of functions means they must be used.
      So one has to consider the edge cases.
      BTW, what are some examples of fine-tuning datasets that would be useful to show? What would you want to fine tune the model for?

    • @prestonmccauley43
      @prestonmccauley43 1 year ago +1

      I would gladly share a video focusing on QLoRA with a small, non-function dataset. I'm a teacher as well, and I've been watching about 50 YouTube videos or so in the past several weeks. That is clearly the most significant gap. I get PEFT, QLoRA, LoRA, models, hyperparameters - everyone is missing the IA data science part :) @@RonanMcGovern - I would even pay for that: a great Colab, a tutorial, or using tools like Axolotl, text-gen,

    • @TrelisResearch
      @TrelisResearch  1 year ago

      @@prestonmccauley43 great stuff, any feedback on the colab templates in the video description is welcome. BTW, the free one is for fine-tuning on a simple dataset (not function calling).
      Just a note to your earlier question on amount of data required - probably more important is a) quality of data and b) being very exact with the attention and loss mask.
      btw, if you have a very small dataset, you can consider just putting that into the system message, that is quicker and can be as good as fine-tuning.

  • @ArunKumar-bp5lo
    @ArunKumar-bp5lo 10 months ago

    thanks so much

  • @damonpalovaara4211
    @damonpalovaara4211 5 months ago

    I've been researching ternarization of weights (-1, 0, 1), which reduces model size down to 2 bits per weight, and compressed down to 1.58 bits per weight for transfers

    • @TrelisResearch
      @TrelisResearch  5 months ago

      me too!
      I want to do a vid, but libraries aren't mature just yet and no one has released their weights (and just quantizing down doesn't work well).

    • @damonpalovaara4211
      @damonpalovaara4211 5 months ago

      @@TrelisResearch I'm working on a technique that ternarizes using gradient descent, by using a smooth-ternarize function.

    • @damonpalovaara4211
      @damonpalovaara4211 5 months ago

      @TrelisResearch I also found a technique for efficiently storing 5 weights in a single byte, making the weights take up 1.6 bits each
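
    A small sketch of the 5-weights-per-byte packing mentioned above: five ternary digits give 3^5 = 243 combinations, which fit in one byte, i.e. 8 / 5 = 1.6 bits per weight (function names are illustrative):

    def pack5(weights):
        """Pack five ternary weights (each -1, 0 or 1) into one byte as a base-3 number."""
        assert len(weights) == 5
        value = 0
        for w in weights:
            value = value * 3 + (w + 1)  # map -1/0/1 to the base-3 digits 0/1/2
        return value                     # 0..242, fits in a single byte

    def unpack5(value):
        """Recover the five ternary weights from the packed byte."""
        weights = []
        for _ in range(5):
            weights.append(value % 3 - 1)
            value //= 3
        return weights[::-1]             # reverse to restore the original order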

  • @user-yu8sp2np2x
    @user-yu8sp2np2x 8 months ago

    Regarding the size of meta-llama/Llama-2-7b-chat-hf:
    It has two .safetensors files.
    The cumulative size is around 14 GB.
    Q1) As you mentioned, the model is nothing but the weights, so these .safetensors files are weights, right?
    Q2) As you explained, the size of the model should be 28 GB, but the files total 14 GB. So, are these model weights in a 16-bit data type?

    • @TrelisResearch
      @TrelisResearch  8 months ago

      1) Yes, safetensors is a file format that is quicker to load than the PyTorch pickle (.bin) format
      2) Yes, in 32-bit, the model would be about 28 GB

    • @user-yu8sp2np2x
      @user-yu8sp2np2x 8 months ago

      @@TrelisResearch Thanks! Also,
      how do I identify which modules are the target modules for training? Say, for Llamas we have one set, and for other models we might have a different one!

    • @TrelisResearch
      @TrelisResearch  8 months ago

      @@user-yu8sp2np2x just run "print(model)" to see the list of modules. Generally you want to train attention, and optionally you can train the up, down, and gate projections. See the chat fine-tuning video for more discussion.
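
    A quick sketch of the size arithmetic and the module-listing step discussed in this thread (the calculation is approximate; the model name is the one from the question above):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    n_params = sum(p.numel() for p in model.parameters())
    print(n_params * 2 / 1e9, "GB in 16-bit")  # 2 bytes per weight -> roughly the 14 GB seen on the Hub
    print(n_params * 4 / 1e9, "GB in 32-bit")  # 4 bytes per weight -> roughly 28 GB

    # Print the module tree to find candidate LoRA target modules
    # (for Llama-2: q_proj, k_proj, v_proj, o_proj, plus gate_proj, up_proj, down_proj).
    print(model)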

  • @vent_srikar7360
    @vent_srikar7360 6 months ago

    How do I fine-tune with my own data, though?

    • @TrelisResearch
      @TrelisResearch  6 months ago +1

      Have a look at the embeddings vs fine tuning videos

  • @user-oy9xr2xt9f
    @user-oy9xr2xt9f 6 months ago

    2^4 is 16 (at 04:15)?

  • @ambakumari4058
    @ambakumari4058 10 months ago

    Is a GPU necessary for this?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      You could run on CPU but it will be really really slow unless you're training a very small model like TinyLlama or DeepSeek 1.3B

  • @vaaaliant
    @vaaaliant 1 year ago +1

    Great start to the video; the way you explain everything is great. But once you start running the functions and they don't run, you sort of lose me.

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      💯 That's good and fair feedback. I've been working since then to trim content and organise it better. New vid on this topic upcoming.

    • @vaaaliant
      @vaaaliant 1 year ago

      @@TrelisResearch Great, I'll subscribe and look forward to your future content. Keep it up!

  • @ayoubelmhamdi7920
    @ayoubelmhamdi7920 10 months ago +1

    You start writing text word by word, then you skip writing and switch to copy-pasting; that's why I cannot continue watching the copy-pasting. 😢😂

    • @TrelisResearch
      @TrelisResearch  9 months ago

      Howdy, are you saying that I'm going too fast with the explanation?
      if so, thanks for the feedback, I'll keep that in mind.
      Otherwise, let me know what you mean. Cheers

    • @ayoubelmhamdi7920
      @ayoubelmhamdi7920 9 months ago +1

      @@TrelisResearch
      To be fair, I meant that of the programming people I follow, only @tsoding does this. Why? Because he starts coding apps from scratch. Every idea should begin with writing the "Hello world", then the project is built up to the end, without any copy-paste.
      When you attempt to code faster, it makes the code very complex for me.

    • @TrelisResearch
      @TrelisResearch  9 months ago

      @@ayoubelmhamdi7920 thanks for the comment. My target audience here is devs at the intermediate to advanced stage of coding. However, I think you still make a good point and there's an opportunity for me to show the steps more clearly. Thanks for the feedback!

    • @TemporaryForstudy
      @TemporaryForstudy 6 months ago

      @@TrelisResearch I also felt the same. My advice is that if you have two monitors, write the code first and then record the video while keeping the complete code on the second monitor. Try to explain everything; since you already have the complete code on the second monitor, you can refer to it from there as well.