Great video. In my experience, it's quite a bit faster to pre-download the model shards to Backblaze and sync with the pod. This gets around 100-130 Gbps vs ~40 Gbps for direct download. Useful for iterative model improvements, interrupted training runs, etc.
Good tip! Another option (probably less good than yours) is to use hf download, which downloads shards in parallel: huggingface.co/docs/hub/models-downloading
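If you're scripting the download, a minimal sketch with huggingface_hub's snapshot_download (the repo ID and target folder here are just examples):

```python
# Sketch: parallel shard download with huggingface_hub (repo ID and path are illustrative)
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example repo
    local_dir="/workspace/mixtral",                  # example target folder on the pod
    max_workers=8,                                   # shards download in parallel
)
print(local_dir)
```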
Really loving these project-based videos!
Just found your channel, you're the goat man!
Really informative video.
Can you make a video on setting up a model for full-weight fine-tuning using something like DeepSpeed? Would be very useful.
Yeah, let me consider that.
BTW, what do you see as the benefit of a full fine-tune over LoRA (potentially also training the embeddings and norms)? LLM weight matrices are rank-deficient, so full fine-tuning doesn't bring an obvious benefit. Have you had experience or seen work indicating otherwise?
As a side note, the training runs in the notebooks I show do work across parallel GPUs.
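For reference, here is roughly what I mean by LoRA with the embeddings trained as well - a sketch with the peft library; the module names are typical for Mistral-style models but are an assumption, so check your model's architecture:

```python
# Sketch: LoRA config that also fully trains embeddings and lm_head (module names are assumptions)
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections get LoRA
    modules_to_save=["embed_tokens", "lm_head"],              # trained in full, not via LoRA
    task_type="CAUSAL_LM",
)
```

Norm layers could be added to modules_to_save in the same way if you want to train them too.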
Thanks for the detailed videos, Trelis Research!! Great content.
I fine-tuned and pushed the Mixtral 8x7B model to Hugging Face, but how do I host it independently on RunPod using vLLM? When I try to do that, it gives me an error. I've searched a lot of videos and articles, but with no luck so far.
You can just adapt a template from GitHub.com/TrelisResearch/one-click-llms
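If it helps, serving the merged model with vLLM's Python API looks roughly like this (a sketch; the model ID and GPU count are placeholders, and a common cause of errors is pushing only the adapter instead of the merged model):

```python
# Sketch: loading a merged fine-tune with vLLM (model ID and GPU count are placeholders)
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-username/mixtral-8x7b-merged",  # hypothetical merged model on the Hub
    tensor_parallel_size=2,                     # split across 2 GPUs; adjust for your pod
    dtype="bfloat16",
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["[INST] Hello, who are you? [/INST]"], params)
print(outputs[0].outputs[0].text)
```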
Awesome content. Thank you! Purchased your gated repo and learning loads from it.
What is the difference between pushing just an adapter to the hub vs the whole model? Is it mostly for convenience down the line (inference?), or am I missing something?
An adapter is smaller to push or save, but you still need access to the base model. Generally - unless there is a good reason - you're better off merging the adapter and pushing the merged model.
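In code, merging and pushing looks roughly like this (a sketch with transformers + peft; the repo IDs are placeholders):

```python
# Sketch: merge a LoRA adapter into the base model and push the result (repo IDs are placeholders)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "your-username/mixtral-lora-adapter")
merged = model.merge_and_unload()  # folds the adapter weights into the base model

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
merged.push_to_hub("your-username/mixtral-8x7b-merged")
tokenizer.push_to_hub("your-username/mixtral-8x7b-merged")
```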
Have you tried fine tuning a small T5 model to act as a gatekeeper/traffic-cop to determine the question’s intent and send to the right model? Or, is it too small for intent tasks? Thanks for a great channel.
You mean send the query to the right expert? That choice is already handled by a router in each layer of this mixture-of-experts model.
I was considering running multiple/separate models (Mixtral, SQLCoder, DeepSeek-Coder) with a light gatekeeper on top (T5/TinyLlama/BERT) that determines the category of the question and sends the user's question to the right model. From a previous convo, I think you recommend training a SINGLE model to perform the tasks of multiple/separate models. Is that still correct? I'm still wrapping my head around this part. Apologies if unclear. @TrelisResearch
@@GrahamAnderson-z7x I see!
Yeah, that could make sense if you have different strong specialist models. Probably expensive if you're not serving lots of requests, though.
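If you do go the gatekeeper route, a rough sketch of a light classifier in front of specialist endpoints (the zero-shot model is real, but the labels and endpoint URLs are just illustrative):

```python
# Sketch: route queries to specialist model endpoints (labels and URLs are illustrative)
from transformers import pipeline

router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ENDPOINTS = {
    "general chat": "http://mixtral:8080",   # hypothetical endpoints
    "sql": "http://sqlcoder:8080",
    "code": "http://deepseek-coder:8080",
}

def route(question: str) -> str:
    result = router(question, candidate_labels=list(ENDPOINTS.keys()))
    return ENDPOINTS[result["labels"][0]]  # highest-scoring label wins

print(route("Write a query that returns the top 10 customers by revenue"))
```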
I ran inference with Mistral on a GPU and found that the average time it takes to generate the output is 5 seconds. Can we improve the inference speed?
Also, after fine-tuning, the inference time is reduced to 2.7 seconds. Do you have any idea regarding this?
@TrelisResearch I am experiencing a problem where my model forgets the knowledge of the base model and only responds as per the fine-tuned version. For example, if some words, say "school bag", are not present in the training dataset, then it returns contextually irrelevant results. For the cases where we have some relevant text in the training dataset, we get good results.
Also, is it wrong to say that the model tries to learn the structure of the fine-tuning dataset and then returns results only according to the structure learnt?
Regarding inference speed - are you running inference in bf16 (16-bit)? Maybe try out the one-click templates on the Trelis GitHub.
My guess re fine-tuning is that perhaps the adapters aren't merged.
Regarding forgetting, you may need to reduce the training epochs. Are you measuring validation loss?
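To check the bf16 point, loading the (merged) model for inference would look something like this (a sketch; the model ID is a placeholder):

```python
# Sketch: bf16 inference with a merged model (model ID is a placeholder)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/mistral-7b-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("[INST] Hello! [/INST]", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```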
How does multi-language support work? Could you please answer both in general and with respect to Mistral's models?
For example, I tried generating the output in German.
Case 1: My prompt was in English, but I mentioned in the prompt to generate the output in German.
Result: Got the result in English.
Case 2: My prompt was in German.
Result: Got the result in German.
So, according to my understanding, in order to generate the response in a different language, I have to prompt in that language as well.
Please correct me if I am wrong!!
It will depend on the model.
In general it's good to:
1. Put a system message to set the language
2. Ask the Qs in German
Mistral doesn't have a system message, so that means tinkering around with trying to put it in the first user message and/or before the first user message.
Fine-tuning on some German Q&A would help too. BTW, there is a German Mistral on Hugging Face called Leo.
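To make the "put it in the first user message" idea concrete, a sketch using the tokenizer's chat template (the instruction wording is just an example):

```python
# Sketch: prepend a language instruction to the first user message (wording is illustrative)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "Antworte immer auf Deutsch.\n\nWas ist die Hauptstadt von Frankreich?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the language instruction travels inside the first user turn
```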
@@TrelisResearch Oh, if Mistral doesn't have a system message, then it is not as effective for fine-tuning compared to Llama, which has one - a system message helps set the tone of the conversation.
Regarding the German model: if I want multi-language capabilities, the choice is between having a separate model for every language vs a single model with multiple language capabilities.
I think a single multilingual model would be better, but I have not tested the performance.
@@AbhijeetTamrakar-k4l This might be a good case for a simple fine-tune. Let me add it to my list of potential videos.
Man, love the video. Gated repos kill everything for me.
Thanks, and I understand where you’re coming from! Selling products instead of advertisements is a trade-off of my business model. I try to balance paid and free. You might find some material in these free repos useful: github.com/TrelisResearch/install-guides and github.com/TrelisResearch/one-click-llms .
@@TrelisResearch Ya, I understand. Guess I'd take it over hearing about World of Tanks.
YouTube really made Premium useless when everyone has to do an in-video ad spot. Lol
🤣
Any ideas how to get Mixtral as an API with function calling, for use with Open Interpreter and/or AutoGen? I was able to load it in 8-bit and 4-bit with text-generation-webui and create the OpenAI API endpoint, but Open Interpreter isn't speaking the same language as the model: it either fails when Open Interpreter tries function calling, or doesn't provide any tokens when using the -local flag, which triggers Open Interpreter not to use function calling, I guess.
some thoughts:
- Are you using a function-calling model? If so, which one?
- Does text-generation-webui's OpenAI API support function calling? That would be necessary for it to work.
Broadly, the steps you would need to take are:
- start with a function calling model.
- pick an API that can serve the function-calling model.
- dig into Open Interpreter to figure out the syntax it expects for function calling -> then align that to the API you have set up.
This video will take you through all the steps, except cross-checking syntax with openinterpreter: ruclips.net/video/hHn_cV5WUDI/видео.htmlsi=z0dz9Z87zPdIfWWb
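As a reference point for that syntax cross-check, an OpenAI-style function-calling request against a local endpoint looks roughly like this (a sketch with the openai Python client; the base URL, model name, and tool are placeholders):

```python
# Sketch: OpenAI-style tool calling against a local endpoint (URL, model, and tool are placeholders)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="function-calling-model",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Dublin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```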
Access to your repo appears to be $86.99. How long is it good for, and what does it include?
Howdy!
It’s lifetime access and includes any future improvements and new functionalities added to the repo.
The repo has already expanded a lot and I have adjusted the price upwards for new buyers periodically and will continue to. This gives a way for me to reward earlier supporters while also incentivizing me to add more content.
You can check out Trelis.com for more info on the repo.
Howdy! The price is for lifetime access, including additions! More info here: trelis.com/enterprise-server-api-and-inference-guide/
Price changes over time as I add more materials, so there's a benefit for those who buy earlier.
Great video 👍 Is there a way to split Mistral over multiple 6 GB cards? I have 50 6x6GB cards. I would pay for help setting this up.
Yes, most likely this will work with TGI. You'll need to:
- Install TGI (take a look at this vid: ruclips.net/video/Ror2xOOA-VE/видео.htmlsi=xgK9TX2SX8o9okYz)
- Then run TGI configured for Mixtral (find the config from the template here: github.com/TrelisResearch/one-click-llms)
It's covered in the Advanced Inference repo: trelis.com/enterprise-server-api-and-inference-guide/
This should spread the model across GPUs.
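Once TGI is running (the template sets the multi-GPU sharding at launch), querying it from Python is straightforward - a sketch with huggingface_hub's InferenceClient; the URL is a placeholder:

```python
# Sketch: query a running TGI endpoint (URL is a placeholder)
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # wherever TGI is serving
output = client.text_generation(
    "[INST] Explain mixture-of-experts in one paragraph. [/INST]",
    max_new_tokens=200,
)
print(output)
```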
How much more expensive is it to run this instead of OpenAI in production?
OpenAI (or any service, like Gemini or TogetherAI) is always going to be cheaper for single requests. A centralised service will be cheaper regardless of what model is used, because they batch many requests together. Running LLMs at low batch size is inefficient because the GPU's compute cores are massively underused.
Very good explanation!
Incredible video!
I read that fine-tuning results were worse than the base Mixtral 8x7B model, e.g. with dolphin-2.5-mixtral-8x7b.
Howdy! I'm not entirely clear what you are saying here.
dolphin-2.5-mixtral-8x7b is one specific fine tune of the Mixtral model. It's an uncensored version focused on coding.
My video is focused on the method of fine-tuning. The result of fine-tuning will depend on the dataset used.
Is there any way to run Mixtral locally, haha?
If you have a Mac with 32 GB of RAM you can run with GGUF: huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main . My video on Code Llama for Mac covers the install.
What hardware have you?
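If you go the GGUF route from Python, a minimal sketch with llama-cpp-python (the quant file name is an example from that repo):

```python
# Sketch: run a Mixtral GGUF locally with llama-cpp-python (file name is an example)
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # downloaded from the repo linked above
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to Metal on a Mac; set 0 for CPU only
)
out = llm("[INST] Say hello in one sentence. [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
```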
What type of GPU or laptop have you? How much VRAM?
Your thumbnail makes you look like you could pass for wolverine. You should do auditions.
Please let me know how to create fixed forms with the structure below, using a special command to the LLM:
Give me a score out of 4 for each of the following (based on the TOEFL rubric) without any explanation; just display the score.
General Description:
Topic Development:
Language Use:
Delivery:
Overall Score:
Identify the number of grammatical and vocabulary errors, providing a sentence-by-sentence breakdown.
'Sentence 1:
Errors:
Grammar:
Vocabulary:
Recommend effective academic vocabulary and grammar:'
'Sentence 2:
Errors:
Grammar:
Vocabulary:
Recommend effective academic vocabulary and grammar:'
.......
Probably you can get inspiration from the recent video about getting structured responses (e.g. in JSON); that will provide tips for you.
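For example, you could pin the structure down with a JSON schema and ask the model to fill it - a sketch assuming pydantic v2; the field names just mirror your form and the prompt wording is illustrative:

```python
# Sketch: a fixed response schema for the TOEFL-style scoring form (field names mirror the form)
import json
from pydantic import BaseModel

class SentenceFeedback(BaseModel):
    sentence: int
    grammar_errors: int
    vocabulary_errors: int
    recommendation: str

class ToeflScore(BaseModel):
    general_description: int
    topic_development: int
    language_use: int
    delivery: int
    overall_score: int
    sentences: list[SentenceFeedback]

# Put the schema in the prompt and ask for JSON only (wording is illustrative)
prompt = (
    "Score the following TOEFL response out of 4 per category. "
    "Respond with JSON only, matching this schema:\n"
    + json.dumps(ToeflScore.model_json_schema(), indent=2)
)
print(prompt)
```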
FIRST!