Mixtral Fine-tuning and Inference

  • Published: 16 Dec 2024

Comments • 46

  • @nikmad · 11 months ago +3

    Great video. In my experience, it's quite a bit faster to pre-download the model shards to Backblaze and sync with the pod. This gets around 100-130 Gbps vs ~40 Gbps for direct download. Useful if doing iterative model improvements, interrupted training runs, etc.
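    A rough sketch of that sync (assuming Backblaze's S3-compatible API; the endpoint, bucket, and credentials below are placeholders):

      import os
      import boto3

      # Backblaze B2 exposes an S3-compatible endpoint, so plain boto3 works.
      s3 = boto3.client(
          "s3",
          endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your B2 region endpoint
          aws_access_key_id=os.environ["B2_KEY_ID"],
          aws_secret_access_key=os.environ["B2_APP_KEY"],
      )

      bucket = "my-model-shards"        # hypothetical bucket holding the pre-downloaded shards
      local_dir = "/workspace/mixtral"
      os.makedirs(local_dir, exist_ok=True)

      # Pull every object in the bucket down to the pod.
      for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
          for obj in page.get("Contents", []):
              target = os.path.join(local_dir, os.path.basename(obj["Key"]))
              s3.download_file(bucket, obj["Key"], target)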

    • @TrelisResearch · 11 months ago +1

      Good tip! Another option (probably less good than yours) is to use hf download, which downloads shards in parallel: huggingface.co/docs/hub/models-downloading
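      As a minimal sketch (the repo ID and local path are just examples), the parallel download with huggingface_hub looks like:

        from huggingface_hub import snapshot_download

        # Fetches all shards of the repo, several files at a time.
        snapshot_download(
            repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            local_dir="/workspace/mixtral",
            max_workers=8,  # number of parallel download threads
        )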

  • @MakeStuffWithAI · 1 year ago +2

    Really loving these project-based videos!

  • @nan0ponk · 11 months ago +1

    Just found your channel, you're the goat man!

  • @adithyaskolavi · 1 year ago +2

    Really informative video.
    Can you make a video on setting up a model for full-weight fine-tuning using something like DeepSpeed? It would be very useful.

    • @TrelisResearch · 1 year ago

      Yeah, let me consider that.
      BTW, what do you see as the benefit of a full fine-tune over LoRA (potentially training the embed and norm layers as well)? LLM weight matrices are rank-deficient, so full fine-tuning doesn't bring out a benefit. Have you had experience or seen work indicating otherwise?
      As a side note, the training runs in the notebooks I show do work across parallel GPUs.
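      For reference, a minimal sketch (assuming the PEFT library; the target modules and ranks are illustrative) of a LoRA setup that also trains the embeddings and norms:

        import torch
        from peft import LoraConfig, get_peft_model
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16
        )

        lora_config = LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
            modules_to_save=["embed_tokens", "lm_head", "norm"],      # trained fully, not via LoRA
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()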

  • @Ai-Marshal · 9 months ago

    Thanks for the detailed videos, Trelis Research!! Great content.
    I fine-tuned and pushed the Mixtral 8x7B model to Hugging Face, but how do I host it independently on RunPod using vLLM? When I try to do that, it gives me an error. I've searched a lot of videos and articles, but no luck so far.

    • @TrelisResearch · 9 months ago +1

      You can just adapt a template from GitHub.com/TrelisResearch/one-click-llms
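      For example, a rough sketch of serving the merged model with vLLM from Python (the repo name is a placeholder for your pushed model):

        from vllm import LLM, SamplingParams

        # Load the merged fine-tuned model from the Hub; shard across GPUs if needed.
        llm = LLM(
            model="your-username/mixtral-8x7b-finetuned",  # hypothetical merged-model repo
            tensor_parallel_size=2,
            dtype="bfloat16",
        )

        params = SamplingParams(max_tokens=256, temperature=0.7)
        outputs = llm.generate(["[INST] What is Mixtral? [/INST]"], params)
        print(outputs[0].outputs[0].text)

      If the error comes from pointing vLLM at an adapter-only repo, merging the adapter into the base model and pushing the merged weights first may fix it.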

  • @alchemication · 11 months ago +1

    Awesome content. Thank you! I purchased your gated repo and am learning loads from it.
    What is the difference between pushing just an adapter to the Hub vs the whole model? Is it mostly for convenience down the line (inference?), or am I missing something?

    • @TrelisResearch · 11 months ago

      An adapter is smaller to push or save, but you still need access to the base model. Generally, unless there is a good reason not to, you're better off merging the adapter and pushing the merged model.
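      A minimal sketch of that merge-and-push flow (repo names are placeholders), assuming the PEFT library:

        import torch
        from peft import AutoPeftModelForCausalLM
        from transformers import AutoTokenizer

        adapter_repo = "your-username/mixtral-8x7b-lora-adapter"  # hypothetical adapter repo
        merged_repo = "your-username/mixtral-8x7b-merged"         # hypothetical target repo

        # Loads the base model named in the adapter config, then attaches the adapter.
        model = AutoPeftModelForCausalLM.from_pretrained(adapter_repo, torch_dtype=torch.bfloat16)
        model = model.merge_and_unload()  # fold the LoRA weights into the base weights

        tokenizer = AutoTokenizer.from_pretrained(adapter_repo)
        model.push_to_hub(merged_repo)
        tokenizer.push_to_hub(merged_repo)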

  • @GrahamAnderson-z7x · 10 months ago

    Have you tried fine-tuning a small T5 model to act as a gatekeeper/traffic cop that determines the question’s intent and sends it to the right model? Or is it too small for intent tasks? Thanks for a great channel.

    • @TrelisResearch · 10 months ago

      You mean send the query to the right expert? That choice is already handled by a router in each layer of this mixture-of-experts model.

    • @GrahamAnderson-z7x · 10 months ago

      I was considering running multiple separate models (Mixtral, SQLCoder, DeepSeek-Coder) with a light gatekeeper on top (T5/TinyLlama/BERT) that determines the category of the question and sends the user's question to the right model. From a previous convo, I think you recommended training a SINGLE model to perform the tasks of multiple separate models. Is that still correct? I'm still wrapping my head around this part. Apologies if unclear. @TrelisResearch

    • @TrelisResearch · 10 months ago

      @@GrahamAnderson-z7x I see!
      Yeah, that could make sense if you have different strong specialist models. It's probably expensive, though, if you're not serving lots of requests.
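      If you do go the gatekeeper route, a rough sketch (the classifier, labels, and backend names are illustrative) of routing with a small zero-shot classifier:

        from transformers import pipeline

        # A small classifier labels the query; a fine-tuned T5/BERT could replace it.
        classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

        # Hypothetical mapping from question category to specialist model.
        BACKENDS = {
            "SQL query": "sqlcoder",
            "general coding": "deepseek-coder",
            "general chat": "mixtral",
        }

        def route(question: str) -> str:
            result = classifier(question, candidate_labels=list(BACKENDS.keys()))
            return BACKENDS[result["labels"][0]]  # highest-scoring label

        print(route("Write a query to count orders per customer"))  # e.g. "sqlcoder"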

  • @AbhijeetTamrakar-k4l · 11 months ago +2

    I ran inference with Mistral on a GPU and found that the average time it takes to generate the output is 5 seconds. Can we improve the inference speed?

    • @AbhijeetTamrakar-k4l · 11 months ago

      Also, after fine-tuning, the inference time is reduced to 2.7 seconds. Do you have any idea why that is?

    • @AbhijeetTamrakar-k4l · 11 months ago

      @TrelisResearch I am experiencing a problem where my model forgets the knowledge of the base model and only responds as per the fine-tuned version. For example, if some words, say "school bag", are not present in the training dataset, it returns contextually irrelevant results; for cases where we do have relevant text in the training dataset, we get good results.
      Also, is it wrong to say that the model tries to learn the structure of the fine-tuning dataset and then returns results only according to the structure it learnt?

    • @TrelisResearch · 10 months ago

      Regarding inference speed: are you running inference in bf16 (16-bit)? Maybe try out the one-click templates on the Trelis GitHub.
      My guess on the fine-tuning speed difference is that the adapters aren't merged.
      Regarding forgetting, you may need to reduce the training epochs. Are you measuring validation loss?
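      For the speed point, a minimal sketch of loading a merged model in bf16 (the repo name is a placeholder):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "your-username/mixtral-8x7b-merged"  # hypothetical merged model

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,  # 16-bit weights: faster and lighter than fp32
            device_map="auto",
        )

        inputs = tokenizer("[INST] Hello [/INST]", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))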

  • @AbhijeetTamrakar-k4l · 11 months ago +1

    How does multi-language support work? Could you please answer both in general and with respect to Mistral's models?
    For example, I tried generating output in German.
    Case 1: My prompt was in English, but I asked in the prompt to generate the output in German.
    Result: Got the result in English.
    Case 2: My prompt was in German.
    Result: Got the result in German.
    So, according to my understanding, in order to generate the response in a different language, I have to prompt in that language as well.
    Please correct me if I am wrong!!

    • @TrelisResearch · 11 months ago

      It will depend on the model.
      In general it's good to:
      1. Put a system message to set the language
      2. Ask the questions in German
      Mistral doesn't have a system message, so that means tinkering around with putting it in the first user message and/or before the first user message.
      Fine-tuning on some German Q&A would help too. BTW, there is a German Mistral on Hugging Face called Leo.
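      A minimal sketch of points 1 and 2 for a Mistral-style prompt (since there is no system role, the language instruction is folded into the first user message):

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

        language_instruction = "Antworte immer auf Deutsch."  # "Always answer in German."
        user_question = "Was ist Mixtral?"

        # No system role: prepend the instruction to the first user turn instead.
        messages = [{"role": "user", "content": f"{language_instruction}\n\n{user_question}"}]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        print(prompt)  # roughly "<s>[INST] Antworte immer auf Deutsch. ... [/INST]"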

    • @AbhijeetTamrakar-k4l · 11 months ago

      @@TrelisResearch Oh, if Mistral doesn't have a system message, then it is not as effective for fine-tuning as Llama, which has one; a system message helps set the tone of the conversation.
      Regarding the German model: if I want multi-language capabilities, the choice is between a separate model for every language vs a single model with multi-language capabilities.
      I think a single model with multiple languages would be better, but I haven't tested the performance.

    • @TrelisResearch · 11 months ago +1

      @@AbhijeetTamrakar-k4l This might be a good case for a simple fine-tune. Let me add it to my list of potential videos.

  • @equious8413 · 11 months ago

    Man, love the video. Gated repos kill everything for me.

    • @TrelisResearch · 11 months ago

      Thanks, and I understand where you’re coming from! Selling products instead of advertisements is a trade-off of my business model. I try to balance paid and free. You might find some material in these free repos useful: github.com/TrelisResearch/install-guides and github.com/TrelisResearch/one-click-llms .

    • @equious8413 · 11 months ago +1

      @@TrelisResearch Yeah, I understand. Guess I'd take it over hearing about World of Tanks.
      YouTube really made Premium useless when everyone has to do an in-video ad spot. Lol

    • @TrelisResearch · 11 months ago

      🤣

  • @MagagnaJayzxui · 1 year ago

    Any ideas how to get Mixtral as an API with function calling for use with Open Interpreter and/or AutoGen? I was able to load it in 8-bit and 4-bit with text-generation-webui and create the OpenAI API endpoint, but Open Interpreter isn't speaking the same language as the model: it either fails when Open Interpreter tries function calling, or it doesn't provide any tokens when using the -local flag, which I guess triggers Open Interpreter not to use function calling.

    • @TrelisResearch · 1 year ago

      Some thoughts:
      - Are you using a function-calling model? If so, which one?
      - Does text-generation-webui's OpenAI API support function calling? That would be necessary for it to work.
      Broadly, the steps you would need to take are:
      - Start with a function-calling model.
      - Pick an API that can serve the function-calling model.
      - Dig into Open Interpreter to figure out the syntax it expects for function calling, then align that with the API you have set up.
      This video will take you through all the steps, except cross-checking syntax with Open Interpreter: ruclips.net/video/hHn_cV5WUDI/видео.htmlsi=z0dz9Z87zPdIfWWb
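      As a rough sketch of that last step (the endpoint, model name, and function schema are all hypothetical), the OpenAI-style function-calling request the server would need to understand looks like:

        from openai import OpenAI

        # Point the OpenAI client at the locally served function-calling model.
        client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # hypothetical endpoint

        tools = [{
            "type": "function",
            "function": {
                "name": "run_python",  # hypothetical tool Open Interpreter might expect
                "description": "Execute a Python snippet and return stdout.",
                "parameters": {
                    "type": "object",
                    "properties": {"code": {"type": "string"}},
                    "required": ["code"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="mixtral-function-calling",  # whatever name the server exposes
            messages=[{"role": "user", "content": "List the files in the current directory."}],
            tools=tools,
        )
        print(response.choices[0].message.tool_calls)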

  • @toromanow · 1 year ago +1

    Access to your repo appears to be $86.99. How long is it good for, and what does it include?

    • @TrelisResearch · 1 year ago

      Howdy!
      It’s lifetime access and includes any future improvements and new functionality added to the repo.
      The repo has already expanded a lot, and I have periodically adjusted the price upwards for new buyers and will continue to. This gives me a way to reward earlier supporters while also incentivizing me to add more content.
      You can check out Trelis.com for more info on the repo.

    • @TrelisResearch · 11 months ago

      howdy! Price is for lifetime access, including additions! More info here: trelis.com/enterprise-server-api-and-inference-guide/
      Price changes over time as I add more materials, so there's a benefit for those who buy earlier.

  • @8888-u6n · 1 year ago

    Great video 👍 Is there a way to split Mistral over multiple 6 GB cards? I have 50 6x6 GB cards. I would pay for help setting this up.

    • @TrelisResearch · 1 year ago

      Yes, most likely this will work with TGI. You'll need to:
      - Install TGI (take a look at this vid: ruclips.net/video/Ror2xOOA-VE/видео.htmlsi=xgK9TX2SX8o9okYz)
      - Then run TGI configured for Mixtral (find the config from the template here: github.com/TrelisResearch/one-click-llms)
      It's covered in the Advanced Inference repo: trelis.com/enterprise-server-api-and-inference-guide/
      This should spread the model across GPUs.
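      Before setting TGI up, a quick way to sanity-check the split is plain transformers/accelerate (this is an alternative to TGI's sharding, not the TGI config itself; the memory caps are illustrative):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

        # Cap each 6 GB card below its limit so accelerate spreads layers across all of them.
        max_memory = {i: "5GiB" for i in range(torch.cuda.device_count())}

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",       # let accelerate shard layers across the GPUs
            max_memory=max_memory,
        )
        print(model.hf_device_map)   # shows which layer landed on which GPU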

  • @rossanobr · 1 year ago

    How much more expensive is it to run this instead of OpenAI in production?

    • @TrelisResearch · 1 year ago +1

      OpenAI (or any service, like Gemini or TogetherAI) is always going to be cheaper for single requests. A centralised service will be cheaper regardless of what model is used, because they batch many requests together. Running LLMs at low batch size is inefficient because the GPU's compute cores are massively underused.

  • @nirsarkar · 1 year ago

    Very good explanation!

  • @marioricoibanez144 · 1 year ago

    Incredible video!

  • @Jason-ju7df · 1 year ago

    I read that fine-tuning results were worse than the base Mixtral 8x7B model, e.g. with dolphin-2.5-mixtral-8x7b.

    • @TrelisResearch · 1 year ago

      Howdy! I'm not entirely clear what you are saying here.
      dolphin-2.5-mixtral-8x7b is one specific fine tune of the Mixtral model. It's an uncensored version focused on coding.
      My video is focused on the method of fine-tuning. The result of fine-tuning will depend on the dataset used.

  • @free_thinker4958 · 1 year ago

    Is there any way to run Mixtral locally, haha?

    • @TrelisResearch · 1 year ago

      If you have a Mac with 32 GB of RAM you can run it with GGUF: huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main . My video on Code Llama for Mac covers the install.
      What hardware do you have?
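      A minimal sketch with llama-cpp-python (the file name is whichever quant you download from that repo):

        from llama_cpp import Llama

        # A 4-bit quant of Mixtral is roughly 26 GB, so 32 GB of unified memory works.
        llm = Llama(
            model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
            n_ctx=4096,
            n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
        )

        out = llm("Q: What is a mixture-of-experts model?\nA:", max_tokens=200)
        print(out["choices"][0]["text"])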

    • @TrelisResearch · 11 months ago

      What type of GPU or laptop do you have? How much VRAM?

  • @redone823 · 1 year ago

    Your thumbnail makes you look like you could pass for Wolverine. You should do auditions.

  • @MehdiMirzaeiAlavijeh · 10 months ago

    Please let me know how to create a fixed form with the below structure using a special command to the LLM:
    Give me a score out of 4 (based on the TOEFL rubric) without any explanation; just display the score.
    General Description:
    Topic Development:
    Language Use:
    Delivery:
    Overall Score:
    Identify the number of grammatical and vocabulary errors, providing a sentence-by-sentence breakdown.
    'Sentence 1:
    Errors:
    Grammar:
    Vocabulary:
    Recommend effective academic vocabulary and grammar:'
    'Sentence 2:
    Errors:
    Grammar:
    Vocabulary:
    Recommend effective academic vocabulary and grammar:'
    .......

    • @TrelisResearch · 10 months ago

      Probably you can get inspiration from the recent video about getting structured responses (e.g. in JSON); that should provide some tips for you.
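      A rough sketch of that approach (the prompt wording and keys are illustrative): spell out the exact form, ask for JSON, and validate the reply before accepting it:

        import json

        REQUIRED_KEYS = [
            "general_description", "topic_development", "language_use",
            "delivery", "overall_score",
        ]

        def build_prompt(transcript: str) -> str:
            # Ask for the fixed TOEFL-rubric form as JSON so it can be checked programmatically.
            return (
                "Score the following TOEFL speaking response out of 4. "
                "Reply with JSON only, using exactly these keys: "
                + ", ".join(REQUIRED_KEYS)
                + f".\n\nResponse:\n{transcript}"
            )

        def parse_scores(model_reply: str) -> dict:
            scores = json.loads(model_reply)
            missing = [k for k in REQUIRED_KEYS if k not in scores]
            if missing:
                raise ValueError(f"Model reply is missing keys: {missing}")
            return scores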

  • @MarxOrx · 1 year ago

    FIRST!