LLaMA2 for Multilingual Fine Tuning?

  • Published: 1 Dec 2024

Comments • 74

  • @edouardalbert7788
    @edouardalbert7788 1 year ago +6

    That was quite insightful. We don't talk much about tokenizers, and there is definitely room for improvement. Thanks!

  • @GenAIWithNandakishor
    @GenAIWithNandakishor 1 year ago +1

    Your explanations are simple but deep. Great !!!

  • @toddnedd2138
    @toddnedd2138 1 year ago +1

    Very interesting and informative. Thank you. Looking forward to your next videos on finetuning.

  • @FalahgsGate
    @FalahgsGate 1 year ago +1

    Very interesting comparison of languages. Thank you for the clarification❣👏

  • @MrDeanelwood
    @MrDeanelwood 1 year ago

    You have a great channel Sam. I really like how you're jumping into the topics that most people are ignoring and only opting for the sexy stuff. You're covering important things. Great insights, thank you.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      Thanks, much appreciated. I am trying to stay away from purely the latest sexy stuff and cover things in a bit more depth, with code etc.

  • @SunnyJocker
    @SunnyJocker 6 months ago

    Thanks for sharing this video. It’s comprehensive🎉

  • @auddy7889
    @auddy7889 1 year ago

    Thank you so much, this is very useful. I thought I would learn how to fine-tune Llama 2 in Thai, but now I have to reconsider.

  • @ringpolitiet
    @ringpolitiet 1 year ago

    Very insightful, thanks. Great with some technical deep dives.

  • @futureautomation9518
    @futureautomation9518 1 year ago +1

    Thank you very much for the info on multilingual support.

  • @ChatchaiPummala
    @ChatchaiPummala 1 year ago

    I'm glad to know that you can speak Thai. I am your FC from Thailand.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      ขอบคุณมากครับ (Thank you very much) 😃

  • @nickki8ara
    @nickki8ara 1 year ago

    Great video Sam!!

  • @micbab-vg2mu
    @micbab-vg2mu 1 year ago

    Thank you for the information. I plan to use Llama 2 for simple tasks in English, such as data retrieval, summarization, and chatting based on the provided context. For translations, logic tasks, and coding, I use the GPT-4 API (March version).

  • @УукнеУкн
    @УукнеУкн 10 months ago

    Your explanations are simple but deep. From today's video I know much more about tokenizers. Great tutorial!!!
    PS: Can you make more videos about tokenizers and a deeper understanding of LLMs?

  • @sagartamang0000
    @sagartamang0000 4 months ago

    Very helpful, thank you so much

  • @georgekokkinakis7288
    @georgekokkinakis7288 1 year ago +1

    Great review, this is what I needed. I want an open-source LLM to build a chatbot for QnA retrieval from documents in the Greek language using LangChain, so this will help me a lot in finding a model on Hugging Face. Thanks again 😊. Looking forward to the fine-tuning tutorial.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      Hey George, I think you might have been the person who asked about Greek before. Glad to hear this helped.

    • @georgekokkinakis7288
      @georgekokkinakis7288 1 year ago

      @@samwitteveenai Yes, that's me 😅. Your presentations have helped me a lot in my project. It would be great if you could find the time to make a tutorial on how we could use Petals with LangChain. I am asking this because not everyone, including me, has access to high-RAM GPUs or can pay for high-RAM time in Colab to run those big LLMs like Llama etc.

    • @samwitteveenai
      @samwitteveenai 1 year ago +2

      Yes, I have started looking into Petals :D

  • @rukaiyahasan2945
    @rukaiyahasan2945 1 year ago +1

    I am trying to fine-tune a model that works like ChatGPT for the Punjabi language, using mt5-base. However, I am not sure if I should go ahead with it, since it does not even generate text; when I try to use it, I just get a response of 0. I have checked the tokenizers and they work fine with the Punjabi language. Can anyone please tell me how I might go about this?
    Thanks in advance!

  • @caiyu538
    @caiyu538 11 months ago

    Great lectures.

  • @aurkom
    @aurkom 1 year ago +1

    Would be nice to see a tutorial on training a tokenizer from scratch

  • @HazemAzim
    @HazemAzim 1 year ago

    Very insightful, thanks. Arabic is also a problem with tokenizers.

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      Yes, this is one I have looked at recently, after the video, and it is also challenging.

  • @IQmates
    @IQmates 1 year ago +1

    I wish there were tutorials on how to deploy the downloaded model on Azure. With the commercial license, many companies are considering it but they cannot use HuggingFace due to data security etc.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      Sorry I don't have much to do with MSFT currently.

  • @kevinbatdorf
    @kevinbatdorf 1 year ago

    What's a good model for English Thai translations? I live in Chiang Mai and would like to build something fun.

  • @loicbaconnier9150
    @loicbaconnier9150 1 year ago

    Hi, where did you put your notebooks on Llama 2, please? I can't find them on GitHub.
    Thanks

  • @Chob_PT
    @Chob_PT 1 year ago +1

    Do you have any resources on how to actually fine-tune to make the model better at another language? I loved the video but I'm still confused about whether we should be looking at increasing the vocab size, or whether just feeding in a translated dataset in a different language would be enough.
    Again, thanks for this

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      So most of the rules of fine-tuning apply; one difference is that you will often add more pre-training on the target language before doing instruction fine-tuning in that language etc. You can get more general-language data for most languages from datasets like OSCAR and Common Crawl, depending on the language. For a lot of languages, people have also translated things like the Alpaca dataset etc.
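
    A minimal sketch of that two-stage recipe (continued pre-training on general target-language text before instruction tuning), assuming the Hugging Face transformers and datasets libraries; the model id, OSCAR config name, and hyperparameters are illustrative placeholders, not taken from the video:

        # Stage 1: continued (causal LM) pre-training on target-language text.
        from datasets import load_dataset
        from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                  DataCollatorForLanguageModeling,
                                  Trainer, TrainingArguments)

        base = "meta-llama/Llama-2-7b-hf"          # assumes access to the gated weights
        tokenizer = AutoTokenizer.from_pretrained(base)
        tokenizer.pad_token = tokenizer.eos_token
        model = AutoModelForCausalLM.from_pretrained(base)

        # A small slice of general text in the target language (e.g. Thai OSCAR).
        raw = load_dataset("oscar", "unshuffled_deduplicated_th", split="train[:1%]")

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, max_length=1024)

        lm_data = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="llama2-th-cpt",
                                   per_device_train_batch_size=1,
                                   gradient_accumulation_steps=16,
                                   num_train_epochs=1),
            train_dataset=lm_data,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        trainer.train()
        # Stage 2 (not shown): instruction fine-tuning on a translated Alpaca-style set.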

  • @juda-marto
    @juda-marto 1 year ago

    A very informative video, Sam! What tokenization would you recommend to fine-tune Llama 2 for the Indonesian language? In general, how do we make Llama 2 work with Bahasa?

    • @samwitteveenai
      @samwitteveenai 1 year ago

      Unfortunately you can't change the tokenizer on a model once it has been trained. You will have to try with the current one. For Bahasa it won't be as bad as Thai or Greek etc.

  • @gunasekhar8440
    @gunasekhar8440 10 months ago

    Could you help me with how to make my own tokenization model for an Indic language?

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Thanks for explaining the impact of different tokenizers. I assume each LLM uses its own specific tokenizer, and you cannot use, for example, a T5 tokenizer in a Llama model?

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      Yes, the models have to use the tokenizers they were trained with.

  • @pranilpatil4109
    @pranilpatil4109 2 months ago

    Hi, now that Llama 3.1 is released, can you tell me roughly how many new tokens I should create from the same tokenizer for another language?

    • @samwitteveenai
      @samwitteveenai 2 months ago

      Yeah, you can just load their tokenizer and check it (see the tokenizer-check sketch after this thread). Llama 3 is certainly better for many languages.

    • @pranilpatil4109
      @pranilpatil4109 2 months ago

      @@samwitteveenai I am doing that and adding those tokens, but I am not sure about the minimum number of tokens I should create per language. I might want to add other languages later.
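
    One way to run that check, as a rough sketch: load each tokenizer and count how many tokens it spends on the same sentence in each language. This assumes access to the gated meta-llama checkpoints on Hugging Face; the model ids and sample sentences are illustrative:

        from transformers import AutoTokenizer

        samples = {
            "English": "The quick brown fox jumps over the lazy dog.",
            "Thai": "สวัสดีครับ ยินดีต้อนรับ",
            "Greek": "Καλημέρα, τι κάνεις σήμερα;",
        }

        # Compare the Llama 2 tokenizer against the larger Llama 3 vocabulary.
        for model_id in ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"]:
            tok = AutoTokenizer.from_pretrained(model_id)
            for lang, text in samples.items():
                n_tokens = len(tok(text)["input_ids"])
                print(f"{model_id} | {lang}: {n_tokens} tokens")

    Languages where the count stays high relative to English are the ones where adding vocabulary (and then continuing pre-training) is most likely to pay off.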

  • @BusraSebin
    @BusraSebin 1 year ago

    Hi Sam, thanks for the great video! Can we fine-tune Llama 2 for a translation task from Turkish to German? I did the tokenizer test for Turkish and it did not give a great result, while, as you know, German is okay. That's why I'm asking :)

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      This may work, but it is far from ideal for 2 main reasons: 1. the tokenizer issues with Turkish (which you have checked), and 2. LLaMA-2 was not really built for doing translation. For translation you will probably be better off fine-tuning something like mT5 or another Seq2Seq model.
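
    A minimal sketch of that alternative, fine-tuning mT5 as a seq2seq Turkish-to-German translator with the Hugging Face transformers library; the tiny in-memory dataset, task prefix, and hyperparameters are placeholders you would replace with a real parallel corpus:

        from datasets import Dataset
        from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                                  DataCollatorForSeq2Seq, Seq2SeqTrainer,
                                  Seq2SeqTrainingArguments)

        model_id = "google/mt5-base"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

        # Placeholder parallel data; swap in a real Turkish-German corpus.
        pairs = Dataset.from_dict({
            "tr": ["Bugün hava çok güzel."],
            "de": ["Das Wetter ist heute sehr schön."],
        })

        def preprocess(batch):
            inputs = ["translate Turkish to German: " + t for t in batch["tr"]]
            model_inputs = tokenizer(inputs, truncation=True, max_length=128)
            labels = tokenizer(text_target=batch["de"], truncation=True, max_length=128)
            model_inputs["labels"] = labels["input_ids"]
            return model_inputs

        train_ds = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

        trainer = Seq2SeqTrainer(
            model=model,
            args=Seq2SeqTrainingArguments(output_dir="mt5-tr-de",
                                          per_device_train_batch_size=8,
                                          learning_rate=1e-4,
                                          num_train_epochs=3),
            train_dataset=train_ds,
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )
        trainer.train()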

  • @beginnerscode5684
    @beginnerscode5684 1 year ago

    Dear Sam,
    Thank you for this video. As you showed, LLaMA is trained mainly on English and does support the Western European languages. My future goal is to train an LLM for an Indo-Aryan script. I have tried Alpaca but the results were not great, and the reason was the same as you mentioned. What would the steps be if we want to fine-tune LLaMA for any other language?

    • @ardasevinc4
      @ardasevinc4 1 year ago +1

      You would need to extend the vocabulary of the tokenizer, do multiple stages of pre-training, and then fine-tune (see the sketch after this thread). This would require at least 8 A100 GPUs. Check out the Chinese LLaMA/Alpaca; they did something similar.

    • @beginnerscode5684
      @beginnerscode5684 1 year ago +1

      @@ardasevinc4 Yes, thank you for replying. I did check that paper recently! But there is another approach, named Okapi, from the University of Oregon; I will try that out first. To do what the Chinese LLaMA did, I really need GPUs, and unfortunately we don't have them.

    • @ardasevinc4
      @ardasevinc4 1 year ago

      @@beginnerscode5684 Okapi seems interesting, thanks for mentioning that. It'll still be tough to get Llama 2 to speak other languages if the base model's training dataset includes very little of them...

    • @beginnerscode5684
      @beginnerscode5684 1 year ago

      Yes, that is going to be a challenge; if your language is not based on a Latin-script corpus then it certainly is a challenge @@ardasevinc4
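
    A rough sketch of the vocabulary-extension step mentioned above, using the Hugging Face transformers library; the two Devanagari tokens are tiny placeholders (a real project would learn thousands of pieces from a target-language corpus, e.g. with SentencePiece), and the model id assumes access to the gated weights:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        base = "meta-llama/Llama-2-7b-hf"
        tokenizer = AutoTokenizer.from_pretrained(base)
        model = AutoModelForCausalLM.from_pretrained(base)

        # New subword pieces for the target script (placeholders).
        new_tokens = ["नमस्ते", "धन्यवाद"]
        num_added = tokenizer.add_tokens(new_tokens)
        print(f"added {num_added} tokens")

        # Give the new ids embedding rows (and matching LM-head rows). These rows
        # start untrained, which is why the continued pre-training stage is needed.
        model.resize_token_embeddings(len(tokenizer))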

  • @georgekokkinakis7288
    @georgekokkinakis7288 1 year ago

    I was wondering about the following. As mentioned in the video, if someone uses a tokenizer which tokenizes each word at the character level, then that tokenizer is probably not ideal for the language of interest. After watching Sam's excellent tutorial I went to OpenAI's webpage and used their tokenizer. I noticed that when I give a sentence in Greek I get character-level tokens. Does this mean that when I send a query to their model it will tokenize the query at the character level? Because if that's the case, then the expenses will go up steeply for someone who wants to use ChatGPT models for the Greek language. I would appreciate it if someone could confirm or disprove my point.😊

    • @TaoWang1
      @TaoWang1 1 year ago +2

      It depends on which model you're using. Not all models are the same; some are better and some worse. If you tested the tokenizer of the model you're using and got character-level tokens for the Greek sentence, then yes, the cost of using the model is much higher than for English. And it doesn't only affect the cost; it might also hurt the model's understanding and expression of the Greek language as well.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      You are totally right that ChatGPT etc. cost much more for languages that aren't a good match for the tokenizer (see the token-counting sketch after this thread). I retweeted a tweet all about this a few months back; I think it was for Turkish. It is often an order of magnitude more expensive. The model can handle the character-level tokens etc. as it is so big, but it is much more expensive.

    • @georgekokkinakis7288
      @georgekokkinakis7288 1 year ago

      @@samwitteveenai If I hadn't watched your video about tokenizers I wouldn't have noticed it. Thanks once more. Now I know that OpenAI will be very expensive for my case. Unfortunately I haven't yet found any open-source LLM which is good for RetrievalQA in the Greek language ☹️. I think I will try Google translation. The problem is that I have mathematical terms and Google translation doesn't deliver what I want. Let me give an example; someone might find it useful. Two angles are called complementary angles when they sum to 90 degrees, whereas when they sum to 180 degrees they are called supplementary angles. In Greek, complementary angles = συμπληρωματικές γωνίες and supplementary angles = παραπληρωματικές γωνίες. Google Translate sometimes translates συμπληρωματικές as supplementary and sometimes as complementary. If someone knows a model which works for Greek, they would save my day 😅. My task is to do closed-domain extractive QA for mathematical definitions and methodologies from texts in Greek. Can BERT-like models be used with LangChain? Sorry for the big post, and thank you once more; your presentations are excellent 👍
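
    A quick way to measure the cost effect discussed in this thread, as a sketch using OpenAI's tiktoken library; the encoding name covers the gpt-3.5/gpt-4-era models, and the sample sentences are illustrative:

        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")

        samples = {
            "English": "Two angles are complementary when they sum to 90 degrees.",
            "Greek": "Δύο γωνίες λέγονται συμπληρωματικές όταν έχουν άθροισμα 90 μοίρες.",
        }

        for lang, text in samples.items():
            n_tokens = len(enc.encode(text))
            print(f"{lang}: {len(text)} characters -> {n_tokens} tokens")

    Since API pricing is per token, a higher tokens-per-character ratio for Greek translates directly into a higher cost for the same amount of text.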

  • @loicbaconnier9150
    @loicbaconnier9150 1 year ago

    Hi Sam, do you know how to use Llama 2 through an API from Hugging Face TGI, with LangChain? I don't know how to write the prompts.
    Thanks

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      So I was going to make a video about exactly this, but then they made the library no longer open source, so I'm a bit reluctant now. I might do it at some point.

    • @loicbaconnier9150
      @loicbaconnier9150 1 year ago

      @@samwitteveenai They only changed the license for firms which sell inference, not for using it within a firm. Isn't that right?

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      As I understood it, if I was making a chatbot etc. then it would apply in that case. More than that, though, it's how they benefitted from other people contributing to it and then changed the license later. It just seems they could have handled it better overall.

    • @loicbaconnier9150
      @loicbaconnier9150 1 year ago

      @@samwitteveenai There is now a new fork made from the Apache 2.0 version.

    • @samwitteveenai
      @samwitteveenai 1 year ago +1

      I saw that one of the main contributors said they would make an open-source fork with their startup and also do some things like remove the need for Docker etc. I certainly want to support that.

  • @aditiasetiawan563
    @aditiasetiawan563 3 months ago

    How do I train for another language? Can you help?

  • @RobertoAntonioMenjívarHernánde

    Thanks buddy, I am working on fine-tuning Dolly 2.0 with my own data, but I hope I won't have any problems, because it will be in Spanish; this is a good point to start from, thanks! If I am working on Q&A but I don't have a dataset, just my database with my own tables, what would be your hint? My goal would be to write questions about my data and get answers as graphs or something like that.

  • @devedtara
    @devedtara 1 year ago

    What are the models supporting the Arabic language?

  • @hlumisa.mazomba
    @hlumisa.mazomba 1 year ago

    Thank you so much for this. I had something similar in mind. In my case I wanted to fine-tune it for IsiXhosa, my home language. Have you had a chance to play around with Facebook's MMS models yet?

    • @thabolezwemabandla2461
      @thabolezwemabandla2461 8 months ago

      Hi, I have a similar task. Did you find any breakthrough with your language, IsiXhosa?

  • @lnakarin
    @lnakarin 1 year ago

    ขอบคุณแซม (Thank you, Sam)

  • @DanielWeikert
    @DanielWeikert 1 year ago

    Can you do a video elaborating on model sizes, loading techniques to reduce GPU memory, etc.?
    br

  • @michallecbych7556
    @michallecbych7556 1 year ago

    Would you show how to train a custom tokenizer, so we can support a new language?

    • @yuchi65535
      @yuchi65535 1 year ago

      The tokenizer is trained during pre-training; you would need to retrain the whole model to use a custom tokenizer.
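
    For the "train your own tokenizer" requests above, a minimal sketch using the Hugging Face tokenizers library; the corpus file, vocab size, and special tokens are placeholders, and (as the reply above notes) an existing model would still need to be re-trained to use the result:

        from tokenizers import SentencePieceBPETokenizer

        tokenizer = SentencePieceBPETokenizer()
        tokenizer.train(
            files=["target_language_corpus.txt"],   # hypothetical plain-text corpus
            vocab_size=32000,
            min_frequency=2,
            special_tokens=["<unk>", "<s>", "</s>"],
        )
        tokenizer.save("new-language-tokenizer.json")

        # Sanity check: a good fit should give subword pieces, not single characters.
        print(tokenizer.encode("ตัวอย่างประโยคภาษาไทย").tokens)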

  • @user-wp8yx
    @user-wp8yx 1 year ago

    I appear to have this very issue. Too bad the solution is to dump Llama 2.

    • @samwitteveenai
      @samwitteveenai 1 year ago

      What's the language you are after? There could be a multilingual LLaMA around the corner.

    • @user-wp8yx
      @user-wp8yx 1 year ago +1

      @@samwitteveenai Sanskrit. BERT-based models apparently work, but I use oobabooga and can't get them to work with ooba. I had some success with Vicuna 1.1, in spite of the tokenizer breaking everything down to single letters, but not so much with Vicuna 1.5. No luck with BLOOM or Orca or LLaMA 1. I haven't tried Llama 2, because Vicuna outperforms its pre-training for Sanskrit.
      I'm surprised, with so many South Asians in computing, that more models don't at least speak Hindi.

    • @user-wp8yx
      @user-wp8yx 10 months ago

      Update on the Sanskrit tokens project: I managed to add tokens to Mistral 7B, and I had to "resize the embeddings" and "the head". Subsequently the model does inference, but fine-tuning causes a CUDA error. I now wonder whether the embeddings are correct, or what the issue is.

  • @jorgeromero4680
    @jorgeromero4680 1 year ago

    I speak Greek