** GGUF Not working **
Towards the end of the video, I state that the issue with function calling with GGUF is due to the prompt format. However, the issue is that the GGUF model (unlike the base model) is responding with incorrectly formed JSON objects.
There appears to be an issue with the GGUF quantization that I need to resolve. I'll update here once resolved.
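For anyone debugging something similar, a quick hypothetical check is to try parsing each response as JSON; the strings below are placeholders standing in for a base-model response and a GGUF response, not actual outputs from the video.

```python
# Hypothetical sketch: check whether function-call responses parse as JSON.
# The two strings are placeholder examples, not real model outputs.
import json

responses = {
    "base": '{"name": "get_weather", "arguments": {"city": "Dublin"}}',
    "gguf": '{"name": "get_weather", "arguments": {"city": "Dublin"',  # truncated brace
}

for source, text in responses.items():
    try:
        json.loads(text)
        print(f"{source}: valid JSON")
    except json.JSONDecodeError as err:
        print(f"{source}: malformed JSON ({err})")
```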
Your relaxed conversation style makes it seem like AI is just following a series of if-else statements.😁😁😁
As usual pure 🔥. Thanks for putting the time and energy into your outstanding didactic videos🙌🏾
Thanks for the amazing content! Based on my downloads of various quantised models from The Bloke, 5-bit quantisation would seem to be the sweet spot if you want reduced memory usage, but you still care about quality.
Yeah - it's not as though there is one point where quality suddenly drops off. Very roughly, I'd say 8-bit, but yeah, some 6- or 5-bit quants are good too.
You post the most valuable content on AI/ML.
Tiny LLMs = Tiny Large LMs = (Tiny Large) LMs = LMs
😂 Keep an eye out for an upcoming video on Large TLLMs
Medium Language Models
... whoa
Thank you so much! Your research is highly appreciated, and this video solves the feasibility question mark in my mind! Looking forward to digging into your company and vids. 👍👍👍🎆
Awesome video. Phi-2 is now available for commercial use under the MIT License.
Man, how do you come up with ideas for the new videos?! This is pure gold! Would you consider doing something with non-English languages (considering Europe has a nice mixture of those)? I'm wondering if this is even something I should be thinking about when fine-tuning open LLMs...
Cheers! What language? And what topic?
Cheers. I am thinking about French, German, Italian, Spanish and Polish (and English obviously). But even a clue on how to deal with one extra language would be nice. I don't speak these languages, which makes it even more "fun". I have a custom dataset of 1000 FAQ-style question/answer pairs, currently in English, so that would be an example use case to play with.
Damn, did my reply not show up here? I must be losing my mind... In general I deal with English, but also German, French, Italian, Spanish and Polish. Even seeing how to fine-tune for one non-English language could be very interesting: what are some best practices, limitations, etc.? I do have a custom dataset of ~1000 FAQ-style question/answer pairs (upsampled by GPT-4 from the original ~150 questions/answers).
Another great video.
UPDATE: Phi-2 is now available - incl. for commercial use - under an MIT license!
I have a 2018 16-inch with an x86 CPU but 32 GB of RAM, and I can run the hell out of Solar, DeepSeek and Mixtral simultaneously with Ollama, or one at a time with Jan (slower).
Let's agree they are called SLMs, as in SLiM, unless you want to start using the metric system of pico, nano, micro, milli, etc. 😄 In 5 years, "big" as in "big data" will be considered small compared to the biggest.
Can you add some training videos? Like distributed training with DeepSpeed... Can Ray be used for distributed training?
Check out the Trelis fine-tuning playlist; you'll see a multi-GPU video there.
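For a rough idea of the DeepSpeed side, here is a minimal hypothetical sketch of hooking a DeepSpeed config into the Hugging Face Trainer; the model name, config file name and dummy dataset are placeholders, not what's used in the multi-GPU video.

```python
# Hypothetical sketch: Hugging Face Trainer + DeepSpeed config (placeholders throughout).
# Launch across GPUs with e.g. `deepspeed train.py` or `accelerate launch train.py`.
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # placeholder model

# Tiny dummy dataset so the script is self-contained; swap in your tokenised data.
train_data = Dataset.from_dict({"input_ids": [[1, 2, 3, 4]], "labels": [[1, 2, 3, 4]]})

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed="ds_config_zero2.json",  # a ZeRO stage-2 config file you provide
)
Trainer(model=model, args=args, train_dataset=train_data).train()
```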
Great insights. Would low-rank training be useful for narrow tasks like text classification, for example?
Yes! Very effective for training for classification - the basic premise is the same as training for function calling (take a look at the recent vid and also the older vid on structured responses).
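As a rough illustration (not code from the video), a LoRA classification setup with PEFT might look like the sketch below; the base model, label count and target modules are assumptions.

```python
# Hypothetical sketch: LoRA fine-tuning for text classification with PEFT.
# Base model, num_labels and target_modules are assumptions, not from the video.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,           # keeps the classification head trainable
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```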
Nice video as always. For function calling, I'm using NexusRaven V1 on a 1070 Ti and I think it's better than GPT-4.
PS: I'm using Ollama for inference.
thanks for the tips, I'll dig in on those
@@TrelisResearch It's super fast
It would be nice to have a video about Groq (not Grok), but I don't know how much info is around at the moment.
Yeah, I'm kind of tracking it, but they don't give a way to run inference on a custom model yet AFAIK. Once they do, I think that would definitely be interesting.
Can you do a video on fine-tuning a multimodal LLM (Video-LLaMA, LLaVA, or CLIP) with a custom multimodal dataset containing images and text for relation extraction or a specific task? Could you do it using an open-source multimodal LLM and multimodal datasets, like Video-LLaMA or others, so anyone can further their experiments with the help of your tutorial? Can you also talk about how we can boost the performance of the fine-tuned model using prompt tuning in the same video?
Yeah, I want to do a vid on multi-modal. I tried out LLaVA and was unimpressed by its performance versus OpenAI, so I thought I would delay a little bit. I'll revisit soon.
Hey, I have a question. I trained a tokenizer, changing the tokenizer's vocabulary length, then did PEFT + QLoRA (embedding, lm_head and QKV) fine-tuning. But the model does not perform well. Is it because of a lack of data? Or because I have changed the dimensions?
I'd need more info to say...
- What kind of dataset were you using and training for what application?
- Did you merge the LoRA onto the base model you trained? (You have to be careful not to lose the updated embed and lm_head layers.)
- When changing the embedding size, you have to update both the tokenizer and the model (see the sketch after this reply).
The best video for all of this is the one I did on Chat Fine-tuning.
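A rough hypothetical sketch of that workflow with PEFT; the model name, added token and module names are assumptions, and module names differ between architectures:

```python
# Hypothetical sketch: extend the vocab, resize embeddings, and keep embed/lm_head when saving.
# Model name, added token and module names are assumptions, not from the video.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.add_tokens(["<my_special_token>"])       # changes the tokenizer length

model = AutoModelForCausalLM.from_pretrained(base)
model.resize_token_embeddings(len(tokenizer))      # model must match the new vocab size

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],  # the QKV projections
    modules_to_save=["embed_tokens", "lm_head"],    # train and save these layers in full
)
model = get_peft_model(model, lora_config)

# ... train ...

# Merging later with merge_and_unload() keeps the resized embed/lm_head because of
# modules_to_save; remember to save the updated tokenizer alongside the merged model.
```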
@@TrelisResearch Okay, I will watch the video and come back to you. Thanks 🙏
Do you know if the advanced inference setup supports native logit biasing and constrained generation via the API?
Great video, guys. Can someone help me understand when you use just the LoRA adapter weights for inference and when you merge the LoRA weights into the original model?
generally it's best to merge because inference is slower unmerged (there's an extra addition step to apply the adapter).
The reason not to merge is that you can store the adapter (which is small) separately [if that's useful].
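For illustration, here is a minimal hypothetical sketch of the two options with PEFT (the model and adapter paths are placeholders):

```python
# Hypothetical sketch of the two options; model and adapter paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model")

# Option 1: keep the adapter separate - a small file that's easy to swap, but inference
# pays an extra addition step to apply the adapter on each forward pass.
model = PeftModel.from_pretrained(base, "my-lora-adapter")

# Option 2: fold the adapter into the base weights once and serve the merged model.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
```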
@@TrelisResearch Thanks for your reply, got it. Please continue with your content; it has helped me a lot.
Hi Ronan, could you please do a tutorial on GuardRails?
interesting idea, let me add that to the list of potential vids
You remind me of Andrej Karpathy.
Can I run the fine-tuned DeepSeek LLM on a Raspberry Pi 4 with 4 GB of RAM?
Plz reply, I need to know.
It's probably going to be really slow, but perhaps your best option is to look at llamafile, because that is properly optimised for CPU. Possibly you could also try models like SmolLM.
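llamafile ships as a single executable, but if you would rather stay in Python, a rough equivalent with llama-cpp-python (not the tool named in the reply; the model file is a placeholder) looks like this:

```python
# Hypothetical sketch: CPU-only inference on a small quantised GGUF with llama-cpp-python.
# The model file is a placeholder; pick something small enough for 4 GB of RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="smollm-360m-instruct.Q4_K_M.gguf",
    n_ctx=1024,
    n_threads=4,   # the Raspberry Pi 4 has 4 cores
)
out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```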
What's the best way to get clients for these types of solutions?
Howdy! Are you asking how to come up with applications for tiny LLMs? I.e., use cases/markets where having tiny LLMs is useful?
Even a second-hand GTX 1070 laptop would be able to handle the 4-bit quantised variant.
Can any be loaded on an iPhone?
In principle yes, although I haven’t dug into that yet. I’ll add to my potential videos list
I'm using MLC Chat / ChatterUI on Android, but I think they have iPhone versions too.
In the meantime, the Phi-2 model has changed its licensing to a permissive one.
Do you find Mozilla's Llamafile project interesting or useful? As someone who dabbles, I'm still not sure how to think about it.
Thanks for sharing. I just had a look and it looks like a strong option to get a chat going. Would be nice if they added Phi as an option. As you saw in this vid, a 4-bit quant is still too big for my machine.
Btw, when llama.cpp is installed and you run ./server, there's also a simple chat interface on the localhost port.