That was quite insightful. We don't talk much about tokenizers, and there is definitely room for improvement. Thanks!
Your explanations are simple but deep. Great!!!
Very interesting and informative. Thank you. Looking forward to your next videos on finetuning.
Very interesting comparison of languages. Thank you for the clarification❣👏
You have a great channel Sam. I really like how you're jumping into the topics that most people are ignoring and only opting for the sexy stuff. You're covering important things. Great insights, thank you.
Thanks, much appreciated. I am trying to stay away from just the latest sexy stuff and cover more things in a bit more depth, with code etc.
Thanks for sharing this video. It’s comprehensive🎉
Glad it was helpful!
Thank you so much, this is very useful. I thought I would learn how to fine-tune Llama 2 for Thai, but now I have to reconsider.
Very insightful, thanks. Great with some technical deep dives.
Thank you very much for the info on the multilingual side.
I'm glad to know that you can speak Thai. I am a fan of yours from Thailand.
ขอบคุณมากครับ 😃 (Thank you very much 😃)
Great video Sam!!
Thank you for the information. I plan to use Llama 2 for simple tasks in English, such as data retrieval, summarization, and chatting based on the provided context. For translations, logic tasks, and coding, I use the GPT-4 API (March version).
Your explanations are simple but deep. Today's video taught me much more about tokenizers. Great tutorial!!!
PS: Can you make more videos about tokenizers and a deeper understanding of LLMs?
Very helpful, thank you so much
Great review, this is what I needed. I want an open-source LLM in order to build a chatbot for QnA retrieval from documents in Greek using LangChain, so this will help me a lot in finding a model on Hugging Face. Thanks again 😊. Looking forward to the fine-tuning tutorial.
Hey George, I think you might have been the person who asked about Greek before. Glad to hear this helped.
@samwitteveenai Yes, that's me 😅. Your presentations have helped me a lot in my project. It would be great if you could find the time to make a tutorial on how we could use Petals with LangChain. I am asking because not everyone, including me, has access to high-RAM GPUs or can pay for high-RAM time in Colab to run those big LLMs like LLaMA etc.
Yes, I have started looking into Petals :D
I am trying to fine-tune a model that works like ChatGPT for Punjabi, using mt5-base. However, I am not sure if I should go ahead with it, since it does not even generate text: when I try to use it, I just get a response of 0. I have checked the tokenizers and they work fine with Punjabi. Can anyone please tell me how I might go about this?
Thanks in advance!
Great lectures.
Would be nice to see a tutorial on training a tokenizer from scratch
Very insightful, thanks. Arabic is also a problem with tokenizers.
Yes, this is one I have looked at recently, after the video, and it is also challenging.
I wish there were tutorials on how to deploy the downloaded model on Azure. With the commercial license, many companies are considering it, but they cannot use Hugging Face due to data security, etc.
Sorry I don't have much to do with MSFT currently.
What's a good model for English Thai translations? I live in Chiang Mai and would like to build something fun.
Hi, where did you put your notebooks on Llama 2? I can't find them on GitHub.
Thanks
Any resources you'd have on how to actually fine-tune to make the model better at another language? I loved the video but am still confused about whether we should be looking at increasing the vocab size, or whether just feeding in a translated dataset in a different language would be enough.
Again, Thanks for this
So most of the rules of fine-tuning apply. One difference is that you will often add more pre-training on the target language before doing instruction fine-tuning in that language. You can get general-language data for most languages from datasets like OSCAR and Common Crawl, depending on the language. For a lot of languages, people have also translated things like the Alpaca dataset.
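As a rough illustration of the extra pre-training data point above, here is a minimal sketch of streaming a target-language slice of OSCAR with the Hugging Face datasets library. The dataset id, the Thai config name, and the document count are just example assumptions; some OSCAR releases on the Hub are gated or need extra arguments, so check the dataset card for your language.

```python
# A minimal sketch of pulling target-language text for continued pre-training.
# The "oscar" dataset id and the Thai config are examples; check the dataset
# card for your language (some releases are gated or need trust_remote_code).
from datasets import load_dataset

oscar_th = load_dataset(
    "oscar",
    "unshuffled_deduplicated_th",   # Thai slice; swap in your language's config
    split="train",
    streaming=True,                  # stream so nothing huge is downloaded up front
)

# Peek at a few documents that could feed continued pre-training.
for i, example in enumerate(oscar_th):
    print(example["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```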
A very informative video, Sam! What tokenization would you recommend for fine-tuning Llama 2 for Indonesian? In general, how can I make Llama 2 work with Bahasa Indonesia?
Unfortunately, you can't change the tokenizer on a model once it has been trained. You will have to try with the current one. For Bahasa it won't be as bad as Thai or Greek, etc.
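For anyone who wants to run the same check on their own language, here is a rough sketch of measuring how heavily a tokenizer fragments a sentence. The Llama-2 repo is gated, so the model id is only an example (any model with a compatible tokenizer works), and the sample sentences are placeholders.

```python
# A rough tokenizer "fertility" check: more tokens per word generally means the
# language is a worse fit. The model id (gated) and the sentences are examples.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

samples = {
    "English":    "The weather is very nice today.",
    "Indonesian": "Cuaca hari ini sangat cerah dan menyenangkan.",
}

for lang, text in samples.items():
    ids = tok.encode(text, add_special_tokens=False)
    words = len(text.split())
    print(f"{lang}: {len(ids)} tokens for {words} words "
          f"({len(ids) / words:.2f} tokens/word)")
```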
Could you help me with how to make my own tokenization model for an Indic language?
Thanks for explaining the impact of different tokenizers. I assume each LLM uses its own specific tokenizer, and you cannot use, for example, a T5 tokenizer with a LLaMA model?
Yes the models have to use the tokenizers they are trained with.
Hi, now that Llama 3.1 is released, can you tell me roughly how many new tokens I should create from the same tokenizer for another language?
Yeah, you can just load their tokenizer and check it. Llama 3 is certainly better for many languages.
@samwitteveenai I am doing that and adding those tokens, but I am not sure about the minimum number of tokens I should create per language. I might want to add other languages later.
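There is no fixed minimum number of tokens per language, but mechanically the step looks like the sketch below, assuming the transformers API. The model id is the gated Llama 3.1 repo and the token list is a made-up placeholder; in practice the new pieces would be mined from your target-language corpus.

```python
# A minimal sketch of adding new tokens and growing the embedding matrix.
# The model id is gated and the token list is a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sub-word pieces mined from a target-language corpus (placeholders here).
new_tokens = ["γωνία", "συμπληρωματικές", "παραπληρωματικές"]

num_added = tokenizer.add_tokens(new_tokens)   # only genuinely new tokens get added
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")

# The input embeddings (and tied LM head) must grow to match the new vocab size;
# the freshly added rows still need training before they are useful.
model.resize_token_embeddings(len(tokenizer))
```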
Hi Sam, thanks for the great video! Can we fine-tune Llama 2 for a translation task from Turkish to German? I did the tokenizer test for Turkish and it did not give a great result, while, as you know, German is okay. That's why I'm asking :)
This may work, but it is far from ideal for two main reasons: 1. the tokenizer issues with Turkish (which you have checked), and 2. LLaMA-2 was not really built for translation. For translation you will probably be better off fine-tuning something like mT5 or another seq2seq model.
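If you do go the mT5 route, a bare-bones sketch of how a single Turkish→German pair is framed for a seq2seq model might look like this. The checkpoint, the example sentence pair, and the single-step loss are illustrative only; real fine-tuning would wrap this in something like Seq2SeqTrainer over a full parallel corpus.

```python
# A bare-bones sketch of framing Turkish -> German translation for mT5.
# The checkpoint and the example pair are placeholders.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

src = "Bugün hava çok güzel."               # Turkish source
tgt = "Das Wetter ist heute sehr schön."    # German target

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids

# Loss for one training pair; an optimizer step on this, repeated over a parallel
# corpus, is the whole fine-tuning loop in miniature.
loss = model(**inputs, labels=labels).loss
print(loss)
```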
Dear Sam,
Thank you for this video. As you showed, LLaMA is trained mainly on English and does support the Western European languages. My future goal is to train an LLM for an Indo-Aryan script. I have tried Alpaca, but the results were not good; the reason was the same as you mentioned. What would the steps be if we want to fine-tune LLaMA for any other language?
You would need to extend the vocabulary of the tokenizer, do multiple stages of pretraining, and then fine-tune. This would require at least 8 A100 GPUs. Check out Chinese LLaMA/Alpaca; they did something similar (see the sketch after this thread).
@ardasevinc4 Yes, thank you for replying. I did check that paper recently! But there is another approach, named Okapi, from the University of Oregon; I will try that first. To do it like Chinese LLaMA I would really need GPUs, and unfortunately we don't have them.
@beginnerscode5684 Okapi seems interesting, thanks for mentioning it. It'll still be tough to get Llama 2 to speak other languages if the base model's training dataset includes very little of them...
Yes, that is going to be a challenge; if your language is not based on Latin-script corpora then it certainly is a challenge. @ardasevinc4
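Here is the sketch mentioned above. It is a simplified take on the vocabulary-extension step: the actual Chinese LLaMA/Alpaca recipe merges SentencePiece models directly and then does heavy continued pre-training, whereas this just trains new pieces on a target-language corpus and adds the unseen ones to the base tokenizer. The file paths, vocab size, and gated model id are placeholders.

```python
# A simplified sketch of the vocabulary-extension step discussed in this thread.
# Paths, vocab size, and the (gated) model id are placeholders.
import sentencepiece as spm
from transformers import AutoTokenizer

# 1. Train a BPE SentencePiece model on raw target-language text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="target_language_corpus.txt",
    model_prefix="target_sp",
    vocab_size=16000,
    model_type="bpe",
)

# 2. Add any pieces the base tokenizer does not already know about.
sp = spm.SentencePieceProcessor(model_file="target_sp.model")
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
added = base_tok.add_tokens([p for p in new_pieces if p not in base_tok.get_vocab()])
print(f"Added {added} pieces; the model embeddings must be resized and then re-pretrained.")
```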
I was wondering about the following. As mentioned in the video, if someone uses a tokenizer which breaks each word down to the character level, then that tokenizer is probably not ideal for the language of interest. After watching Sam's excellent tutorial I went to OpenAI's webpage and used their tokenizer. I noticed that when I give it a sentence in Greek, I get character-level tokens. Does this mean that when I send a query to their model it will tokenize the query at the character level? Because if that's the case, then the expense will go way up for anyone who wants to use the ChatGPT models for Greek. I would appreciate it if someone could confirm or refute my point. 😊
It depends on which model you're using. Not all models are the same; some are better and some worse. If you tested the tokenizer of the model you're using and got character-level tokens for the Greek sentence, then yes, the cost of using the model is much higher than for English. And it doesn't only affect the cost; it might also hurt the model's understanding and expression of Greek as well.
You are totally right that ChatGPT etc. cost much more for languages that aren't a good match for the tokenizer. I retweeted a tweet all about this a few months back; I think it was for Turkish. It is often an order of magnitude more expensive. The model can handle the character-level tokens because it is so big, but it is much more expensive (a quick way to put numbers on this is sketched after this thread).
@samwitteveenai If I hadn't watched your video about tokenizers I wouldn't have noticed it. Thanks once more. Now I know that OpenAI will be very expensive for my case. Unfortunately, I haven't yet found any open-source LLM that is good for RetrievalQA in Greek ☹️. I think I will try Google Translate. The problem is that I have mathematical terms, and Google Translate doesn't deliver what I want. Let me give an example; someone might find it useful. Two angles are called complementary when they sum to 90 degrees, whereas when they sum to 180 degrees they are called supplementary. In Greek, complementary angles = συμπληρωματικές γωνίες and supplementary angles = παραπληρωματικές γωνίες. Google Translate sometimes translates συμπληρωματικές as supplementary and sometimes as complementary. If someone knows a model that works for Greek, they would save my day 😅. My task is closed-domain extractive QA over mathematical definitions and methodologies from texts in Greek. Can BERT-like models be used with LangChain? Sorry for the long post, and thank you once more; your presentations are excellent 👍
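To put rough numbers on the cost point in this thread, here is a quick sketch using the tiktoken library with the cl100k_base encoding used by the GPT-3.5/GPT-4 chat models. The two sentences are just illustrative; since the APIs charge per token, the ratio of the two counts approximates the extra cost of sending the same content in Greek.

```python
# A quick sketch of the English vs. Greek cost gap, assuming the tiktoken library
# and the cl100k_base encoding; the sentences are illustrative placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Two angles are complementary when they sum to 90 degrees.",
    "Greek":   "Δύο γωνίες λέγονται συμπληρωματικές όταν το άθροισμά τους είναι 90 μοίρες.",
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label}: {n_tokens} tokens for {len(text.split())} words")

# The Greek/English token ratio is roughly the per-request cost multiplier.
```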
Hi Sam, do you know how to use Llama 2 via an API from Hugging Face TGI, with LangChain? I don't know how to write the prompts.
Thanks
So I was going to make a video about exactly this, but then they made the library no longer open source, so I'm a bit reluctant now. I might do it at some point.
@samwitteveenai They only changed the license for firms that sell inference, not for using it within a firm. Isn't that right?
As I understood it, if I was making a chatbot etc. then it would apply in that case. More than that, though, it's how they benefited from other people contributing to it and then changed it later. It just seems they could have handled it better overall.
@samwitteveenai There is now a new fork made from the Apache 2.0 version.
I saw that one of the main contributors said they would make an open-source fork with their startup and also do some things like remove the need for Docker, etc. I certainly want to support that.
How do I train for another language? Can you help?
Thanks, buddy. I am working on fine-tuning Dolly 2.0 with my own data, and I hope I won't have any problems, because it will be in Spanish; this is a good starting point, thanks! If I am working on Q&A but I don't have a dataset, just my database with my own tables, what would your hint be? My goal would be to write questions about my data and get an answer such as a graph, or something like that.
What are the models that support Arabic?
Thank you so much for this. I had something similar in mind. In my case I wanted to finetune it for IsiXhosa, my home language. Have you had a chance to play around with Facebook's MMS models yet?
Hi, I have a similar task. Did you find any breakthrough with your language, IsiXhosa?
ขอบคุณแซม (Thank you, Sam)
ยินดีมากครับ (You're very welcome)
Can you do a video elaborating on model sizes, loading techniques to reduce GPU memory, etc.?
Best regards
Would you show how to train a custom tokenizer, so we can support a new language?
The tokenizer is fixed during pre-training; you would need to retrain the whole model to use a custom tokenizer.
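Training the tokenizer itself is actually the easy part; the sketch below shows one way, using train_new_from_iterator on any fast (Rust-backed) tokenizer as a template, with a placeholder corpus file and vocab size. The hard part, as the reply above says, is that a pre-trained model is tied to its original tokenizer, so a new one only helps if you pre-train a model with it or go the vocabulary-extension route discussed earlier.

```python
# A minimal sketch of training a new tokenizer on your own language data, assuming
# a fast tokenizer as a template; the corpus path and vocab size are placeholders.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")   # any fast tokenizer works as a template

def corpus_batches(path="my_language_corpus.txt", batch_size=1000):
    """Yield batches of raw text lines from a local file."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Learns a fresh vocabulary using the same algorithm/pipeline as the template tokenizer.
new_tok = base.train_new_from_iterator(corpus_batches(), vocab_size=32000)
new_tok.save_pretrained("my_language_tokenizer")
```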
I appear to have this very issue. Too bad the solution is to dump Llama 2.
What's the language you are after? There could be a multilingual LLaMA around the corner.
@samwitteveenai Sanskrit. BERT-based models apparently work, but I use oobabooga and can't get them to work with ooba. I had some success with Vicuna 1.1, in spite of the tokenizer breaking everything down to single characters. Not so much with Vicuna 1.5. No luck with BLOOM, Orca, or LLaMA 1. I haven't tried Llama 2 because Vicuna outperforms it in pretraining for Sanskrit.
I'm surprised, with so many South Asians in computing, that more models don't at least speak Hindi.
Update on the Sanskrit tokens project: I managed to add tokens to Mistral 7B, and I had to "resize the embeddings" and "the head". Subsequently, the model does inference, but fine-tuning causes a CUDA error. I now wonder whether the embeddings are correct, or what the issue is.
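For what it's worth, that kind of CUDA error after adding tokens is often just an out-of-range token id: the saved checkpoint's embedding table is smaller than the tokenizer, and the mismatch surfaces as an opaque device-side assert during training. A quick sanity check along these lines, with the checkpoint path and sample text as placeholders, can confirm it before touching the GPU.

```python
# A small sanity check for a post-token-addition CUDA error: an input id beyond
# the embedding table usually shows up as a device-side assert during training.
# The checkpoint path and the Sanskrit sample text are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./mistral7b-with-sanskrit-tokens"   # your locally modified checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

emb_rows = model.get_input_embeddings().weight.shape[0]
print(f"tokenizer size: {len(tokenizer)}, embedding rows: {emb_rows}")
assert len(tokenizer) <= emb_rows, "resize_token_embeddings was not applied or not saved"

# Encoding and checking on CPU gives a readable number instead of a CUDA assert.
ids = tokenizer("ॐ भूर्भुवः स्वः", return_tensors="pt").input_ids
print("max input id:", ids.max().item(), "(must be <", emb_rows, ")")
```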
I speak Greek.