How to Create Custom Datasets To Train Llama-2

  • Published: 2 Nov 2024
  • Science

Comments • 110

  • @chuckwashington6663
    @chuckwashington6663 1 year ago +54

    Thanks, this gives me exactly what I needed to understand how to create a dataset for fine-tuning. Most of the other videos skip over the details of the formatting and other parameters that go into creating your own dataset. Thanks again!

    • @engineerprompt
      @engineerprompt  1 year ago +7

      Thank you for your support. I'm glad it was helpful 😊

  • @SafetyLabsInc_ca
    @SafetyLabsInc_ca 1 year ago +4

    Datasets are key for fine tuning. This is a great video!

  • @oliversilverstein1221
    @oliversilverstein1221 1 year ago +2

    FYI, you're the man. I don't know why it was so hard to find a good training pipeline; I literally went through all the libs and no one mentioned autotrain-advanced lol

  • @pareak
    @pareak 9 months ago +1

    Thank you so much! This gives me a really good basis for starting to fine-tune my own model, because in the end the model will only be as good as its training set.

  • @samcavalera9489
    @samcavalera9489 1 year ago +5

    You're an AI champion. Thanks for the fine-tuning lectures 🙏🙏🙏

  • @drbinxy9433
    @drbinxy9433 1 year ago +1

    You are a legend my man

  • @umeshtiwari9249
    @umeshtiwari9249 1 year ago

    Thanks, you explained the concept very nicely. It builds knowledge in an area people are usually afraid to grasp, but the way you explained it makes it look very easy. Today I gained the ability to fine-tune the model myself. Thanks a lot, Sir. Looking forward to more advanced topics from you.

  • @xiangyao9192
    @xiangyao9192 1 year ago +1

    I have a question. Why don't we use the conversation format given by Llama 2, which contains special tokens ([INST], <<SYS>> and the like)? Thanks

    • @engineerprompt
      @engineerprompt  1 year ago +1

      You will need to use that format if you are using the instruct/chat version. Since I was fine-tuning the base version, you can define your own format. Hope this helps.
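
      For reference, a minimal sketch of the difference (the ###Human/###Assistant markers are just the custom convention used in the video for the base model; [INST] and <<SYS>> are Llama 2's official chat tokens):

      ```python
      # Custom format for the BASE model -- you are free to pick the markers:
      base_example = (
          "###Human:\nWhat does LoRA stand for?\n\n"
          "###Assistant:\nLow-Rank Adaptation."
      )

      # Official Llama 2 CHAT template -- required when tuning the chat model:
      chat_example = (
          "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
          "What does LoRA stand for? [/INST] Low-Rank Adaptation. </s>"
      )
      ```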

  • @abhijitbarman
    @abhijitbarman 1 year ago +1

    @Prompt Engineering. Wow, exactly what I was looking for. I have another request: can you please make a video on Prompt-Tuning/P-Tuning, which is also a PEFT technique?

  • @kevon217
    @kevon217 9 months ago

    Thanks for covering this topic!

  • @vitocorleon6753
    @vitocorleon6753 1 year ago +1

    I need help, please. I just want to be pointed in the right direction, since I'm new to this and couldn't really find any proper guide summarizing the steps for what I want to accomplish.
    I want to integrate a Llama 2 70B chatbot into my website. I have no idea where to start. I looked into setting up the environment on one of my cloud servers (it has to be private). Now I'm looking into training/fine-tuning the chat model using our data from our DBs (it's not clear to me here, but I assume it involves two steps: first I have to have the data in CSV format, since that's easier for me; second I will need to format it in the Alpaca or OpenAssistant formats). After that, the result should be a deployment-ready model?
    Just bullet points, I'd highly appreciate that.

  • @vbywrde
    @vbywrde 10 months ago

    Very coherent and well explained. Thank you kindly. I'm also curious whether you have any advice about creating a dataset that would allow me to fine-tune my model on my database schema. What I'd like to do is run my model locally, ask it to interact with my database, and have it do so in a smooth and natural manner. I'm curious how one would structure a database schema as a dataset for fine-tuning. Any recommendations or advice would be greatly appreciated. Thanks again! Great videos!

  • @Phoenix-fr9ic
    @Phoenix-fr9ic 1 year ago +2

    Can I fine-tune Llama 2 for PDF-to-question-answer generation?

  • @stickmanland
    @stickmanland 1 year ago +2

    Thanks for the informative video. I am wondering: Is there a way to do this, but with local LLMs?

  • @haouarino
    @haouarino 1 year ago +1

    Thank you very much for the video. In the case of plain text, how could the dataset be formatted?

  • @ishaanshettigar1554
    @ishaanshettigar1554 1 year ago +1

    How does this differ if I'm looking to fine-tune Llama 2 7B Code Instruct?

  • @LeonvanBokhorst
    @LeonvanBokhorst 1 year ago +1

    Wow. Thanks again, sir 🙏

  • @derejehinsermu6928
    @derejehinsermu6928 1 year ago

    Thank you man, that is exactly what I am looking for.

  • @LeKhang98
    @LeKhang98 7 months ago

    Thank you very much. Are 300 rows a good number for training? I know it depends on many factors, but I don't know how to tell whether my dataset is bad or just too small.

  • @valthrudnir
    @valthrudnir 1 year ago +1

    Hello, thank you for sharing. Is this method applicable to GGML/GPTQ models from, say, TheBloke's repo (for example, 'Firefly Llama2 13B v1.2 - GPTQ'), or would the training parameters need to be adjusted?

    • @engineerprompt
      @engineerprompt  1 year ago +2

      I haven't tried this with quantized models, so I am not sure how that will behave. One thing to keep in mind is that you want to use the "base" model, not the chat version, for best results. I will look at it and see if it can be done.

  • @DemoGPT
    @DemoGPT 1 year ago

    Kudos on the excellent video! Your hard work is acknowledged. Could we expect a video about DemoGPT from you?

  • @kepenge
    @kepenge 9 months ago

    Thanks for the great service to the community.
    In my experiment, the config.json is not created. Is that normal?

  • @TheCloudShepherd
    @TheCloudShepherd 11 months ago

    Daaaamn bro that's brilliant

    • @engineerprompt
      @engineerprompt  11 months ago

      Thank you. More to come on fine-tuning :)

  • @MohamedElGhazi-ek6vp
    @MohamedElGhazi-ek6vp 1 year ago +1

    It's very helpful, thanks. Is it the same process to create a dataset from multiple documents for a Question Answering model?

  • @Newcomer-qt5yc
    @Newcomer-qt5yc 1 month ago

    Hello, is this still relevant for fine-tuning Llama 3.1? Thank you.

  • @chiachinghsieh2150
    @chiachinghsieh2150 1 year ago +1

    Thanks SO MUCH for sharing this! Really helpful. I am also trying to train Llama 2 on my own data, but I am facing a problem when deploying the model. I trained the model on AWS SageMaker and stored it in an S3 bucket. When I try to deploy the model and feed it a prompt, I keep getting errors. My input follows the rule, like ###Human....###Assistant, but I still get errors. I wonder if I am using the wrong tokenizer, but I couldn't use AutoTokenizer.from_pretrained() in SageMaker. I wonder if you have some advice!!

  • @tarun4705
    @tarun4705 1 year ago +1

    Very informative

  • @AGAsnow
    @AGAsnow 1 year ago +2

    How could I limit it? For example, if I train it with several relevant paragraphs about The Little Prince novel, how do I limit it so that it only answers questions that are within the context of The Little Prince?

  • @vobbilisettyjayadeep4346
    @vobbilisettyjayadeep4346 1 year ago

    You are a saviour

  • @HarishRaoS
    @HarishRaoS 9 months ago

    thanks for this video

  • @brunapupoo4809
    @brunapupoo4809 6 months ago +1

    When I try to run the command in the terminal, it gives an error: autotrain [] llm: error: the following arguments are required: --project-name
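
    For reference, a hedged sketch of a full invocation with the required flag (flag spellings have changed across autotrain-advanced releases -- older versions used underscores such as --project_name -- so check `autotrain llm --help` for your installed version; the model and project names below are placeholders):

    ```bash
    autotrain llm --train \
      --project-name my-llama2-finetune \
      --model meta-llama/Llama-2-7b-hf \
      --data-path . \
      --text-column text \
      --lr 2e-4 \
      --batch-size 2 \
      --epochs 3 \
      --trainer sft
      # plus PEFT/quantization flags, whose spellings also vary by version
    ```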

  • @lrkx_
    @lrkx_ 1 year ago +1

    If you don't mind sharing, what's the performance of a Mac like when fine-tuning? I'm quite keen to see how long it takes to fine-tune a 7B vs. a 13B parameter model on a consumer machine on a small/medium-sized dataset. Thanks for the tutorial, very helpful!

    • @vedchaudhary1597
      @vedchaudhary1597 1 year ago +1

      7B with 4-bit quantization takes about 12.9 GB of GPU RAM; I don't think a Mac will be able to run it locally.

  • @denissandmann
    @denissandmann 2 months ago

    And the model comes up with new prompts on its own?

  • @rahulrajpvr7d
    @rahulrajpvr7d 1 year ago

    thank you so much brother❤❤

  • @LeoNux-um7tg
    @LeoNux-um7tg 10 months ago

    Can I use my own files as datasets? I'm planning to train a model that can remind me of Linux commands and their options, so I don't have to keep reading manuals every time I use commands that aren't regularly used.

  • @md.rakibulhaque2262
    @md.rakibulhaque2262 11 months ago

    Getting this error with AutoModelForCausalLM (from transformers import AutoModelForCausalLM):
    "MJ_Prompts does not appear to have a file named config.json."
    Instead, I had to import AutoPeftModelForCausalLM from peft and use it to run inference with the model.
    One more question: did we train an adapter model here?
    Please tell me how I can solve this. I am using free Colab.
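
    For anyone hitting the same thing: with PEFT enabled, training saves a LoRA adapter (adapter_config.json plus adapter weights) rather than a full model with a config.json, which is why AutoPeftModelForCausalLM works -- so yes, an adapter model was trained here. A minimal sketch of loading and optionally merging it (the "MJ_Prompts" path comes from the comment above; adjust it to your own output folder):

    ```python
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Loads the base model named in adapter_config.json, then applies the adapter
    model = AutoPeftModelForCausalLM.from_pretrained("MJ_Prompts", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("MJ_Prompts")

    # Optional: merge the adapter into the base weights for a standalone model
    merged = model.merge_and_unload()
    merged.save_pretrained("MJ_Prompts-merged")      # this folder gets a config.json
    tokenizer.save_pretrained("MJ_Prompts-merged")
    ```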

  • @medicationrefill
    @medicationrefill 1 year ago +1

    Can I train my own LLM using data generated by ChatGPT, if the model is intended for academic/commercial use?

    • @engineerprompt
      @engineerprompt  1 year ago +1

      You probably can't use it for commercial purposes. But most of the open-source models out there (at least the initial versions) were trained on data generated by ChatGPT.

  • @marcoabk
    @marcoabk 7 months ago

    Is there a way to do it with the original Llama 2 13B already on my hard drive?

  • @georgekokkinakis7288
    @georgekokkinakis7288 1 year ago +1

    I am facing the following problem: the model gets uploaded to the Hugging Face repo but without a config.json file. Any solutions? Also, can the fine-tuned model run on the free Google Colab, or should we shard it?

    • @engineerprompt
      @engineerprompt  1 year ago +1

      Are you fine-tuning it locally or on Google Colab? I am doing it locally without any issues.

    • @georgekokkinakis7288
      @georgekokkinakis7288 1 year ago +1

      @@engineerprompt I am fine-tuning it on Google Colab. In a post I made on your other video about fine-tuning Llama 2, you mentioned that it seems to be a problem with the free tier of Colab. I hope 🙏 you will find a fix, because not everyone owns a GPU.

  • @Shahawir
    @Shahawir 1 year ago +1

    I wonder if it is possible to train LLaMA on data where the inputs are numbers and categorical variables (strings) of fixed length, to predict a time series of fixed length. Does anyone know if this is possible? And how would I fine-tune the model if I have it locally?

  • @techmontc8360
    @techmontc8360 1 year ago

    Hey sir, thank you for the great tutorial. I have a question: it seems that in this training you didn't define the "--model_max_length" parameter. Is there any difference if you define this parameter or not?

  • @gamingisnotacrime6711
    @gamingisnotacrime6711 1 year ago +1

    So if we are fine-tuning the chat model, can we use the same format as above (###Human..., ###Assistant...)?

  • @muhannadobeidat
    @muhannadobeidat 6 months ago

    Thanks for the video.
    Two things, please:
    1. When you use the autotrain package, all the details are hidden and one is not able to see what is being done and in what exact steps. I would suggest a video on that, please, even with the same example.
    2. Secondly, it is not clear to me what data vs. labels are being fed into the model during the training phase, what the loss function is, how it is calculated, etc.

    • @engineerprompt
      @engineerprompt  6 months ago

      I agree with you. Autotrain abstracts a lot of the details, but if you are interested in a more detailed setup, I would recommend looking for "fine-tune" videos on my channel. Here is one example:
      ruclips.net/video/lCZRwrRvrWg/видео.html

  • @Phoenix-fr9ic
    @Phoenix-fr9ic 1 year ago +1

    Can I use this technique for a document-based question-answer generation dataset?

    • @lx-l-xl
      @lx-l-xl 6 months ago

      Hey, did you find any solution for the Q&A model?

  • @JJ-yw3ug
    @JJ-yw3ug 1 year ago +1

    I would like to ask: is the RTX 4090 sufficient to fine-tune the 13B model, or can it only fine-tune the 7B model? I've noticed that the 13B model with default settings doesn't pose a problem for the RTX 4090 in terms of parameter handling, but I'm uncertain whether a single RTX 4090 is enough if fine-tuning on data is required.

    • @engineerprompt
      @engineerprompt  1 year ago +1

      I don't think you can fine-tune 13B with 24 GB of VRAM. Your best bet will be 7B.
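
      As a rough back-of-the-envelope estimate (assuming 4-bit QLoRA-style loading): 13B parameters × 0.5 bytes ≈ 6.5 GB just for the quantized weights, before LoRA gradients, optimizer states, and activations that grow with batch size and sequence length, so a 24 GB card gets tight fast; 7B (≈ 3.5 GB quantized) leaves far more headroom.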

  • @MichealAngeloArts
    @MichealAngeloArts 1 year ago +1

    Do you need all 3 columns (Concept, Description, text) in train.csv, or is just 1 column (text) enough?

    • @engineerprompt
      @engineerprompt  1 year ago +3

      Just one column

    • @godataprof
      @godataprof 1 year ago +2

      Just the last text column

    • @nutCaseBUTTERFLY
      @nutCaseBUTTERFLY 1 year ago +2

      @@engineerprompt So I watched the video 5 times, and it is still not clear which columns go where. You didn't even bother to open the .csv file so that we could see the schema. But you did show us the log file!

    • @Enju-Aihara
      @Enju-Aihara 1 year ago

      @@engineerprompt I wanna know too

    • @filippobistaffa5913
      @filippobistaffa5913 1 year ago +2

      You just need a "text" column present in your train.csv file; the other columns will be ignored. If you want, you can change which column is used with --text_column column_name.
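
      To make the schema concrete, a minimal sketch of building such a file (the prompt markers follow the video's ###Human/###Assistant convention; the example rows are made up):

      ```python
      import pandas as pd

      rows = [
          {"text": "###Human:\nWhat is PEFT?\n\n"
                   "###Assistant:\nParameter-Efficient Fine-Tuning."},
          {"text": "###Human:\nName one PEFT method.\n\n"
                   "###Assistant:\nLoRA (Low-Rank Adaptation)."},
      ]

      # Only the "text" column is read (or whichever column --text_column names)
      pd.DataFrame(rows).to_csv("train.csv", index=False)
      ```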

  • @nqaiser
    @nqaiser 1 year ago

    What hardware specifications would be needed to fine-tune a 70B model?
    Once the fine-tuning is complete, can you run the model using oobabooga?

  • @oxytic
    @oxytic 1 year ago

    Great bro 👍

  • @susteven4974
    @susteven4974 1 year ago

    How can I fine-tune llama-2-7b-chat? Can I use your dataset format?

  • @fups8222
    @fups8222 1 year ago

    Why can't you fine-tune the chat model of Llama 2? The text completion of the fine-tuned model I'm using is giving terrible results from the exact instructions in my prompt. I am using Puffin 13B, but when I feed it exact instructions it just cannot follow them the way I am prompting it to.

  • @aiwesee
    @aiwesee 1 year ago +1

    For fine-tuning of the large language models (llama-2-13b-chat), what should be the format (.txt/.json/.csv) and structure (like should it be an Excel or Docs file, or prompt and response, or instruction and output) of the training dataset? And also, how does one prepare or organize a tabular dataset for training purposes?

    • @mohammedmujtabaahmed490
      @mohammedmujtabaahmed490 7 months ago

      Hey, did you find the answer to your question? If yes, please tell me what format the dataset should be in for fine-tuning.

  • @jamesljl
    @jamesljl 1 year ago +1

    Would you please give a sample of what the CSV file looks like? Thanks a lot!

    • @engineerprompt
      @engineerprompt  1 year ago +1

      Let me see what I can do; the format is shown in the video.

  • @xiangye524
    @xiangye524 1 year ago

    Getting the error:
    ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration.
    Does anyone know what the issue is?
    Running on Google Colab, thanks.

  • @AdinathKale-z7w
    @AdinathKale-z7w 2 months ago

    On which GPU are you training the model?

  • @Univers314
    @Univers314 1 year ago +1

    Can ChatGPT 3.5 generate files?

  • @prestonmccauley43
    @prestonmccauley43 1 year ago

    I did something similar, working on my test dataset to understand this a bit better. I created a Python script to merge all the datasets together. I still seem to be struggling to grasp the core training approach using SFT and which models work with what. It's like the last puzzle piece missing.

  • @manavshah9062
    @manavshah9062 1 year ago +1

    Hey, I fine-tuned the model using my own dataset, but when running the bash command the model somehow did not get uploaded to Hugging Face. After the training completed, I zipped the model and downloaded it. Is there now a way to upload this fine-tuned model to Hugging Face?
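
    In case it helps, a hedged sketch of uploading an existing local model folder afterwards with the huggingface_hub library (the repo id, token, and path are placeholders; you need a write-access token from your Hugging Face account settings):

    ```python
    from huggingface_hub import HfApi, create_repo

    token = "hf_..."  # placeholder: your write-access token

    # Create the target repo if it doesn't exist yet
    create_repo("your-username/my-finetuned-llama2", token=token, exist_ok=True)

    # Upload the unzipped model folder (adapter or merged weights)
    HfApi(token=token).upload_folder(
        folder_path="path/to/unzipped/model",
        repo_id="your-username/my-finetuned-llama2",
        repo_type="model",
    )
    ```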

  • @dmitrymalyshev3810
    @dmitrymalyshev3810 1 year ago

    So, if you have the same problem as me: this code will not work in Google Colab, because on free Google Colab the script doesn't finish the job and doesn't create config.json, so you will have problems.
    And I think that is the reason why the script doesn't push my model to the Hugging Face Hub.
    But your work is great, so thanks for that.

    • @georgekokkinakis7288
      @georgekokkinakis7288 1 year ago

      I am also facing the same problem. Actually, in my case the model is uploaded to the Hugging Face repo, but it is missing the config.json file. Any solutions?

  • @emrahe468
    @emrahe468 1 year ago +1

    How or why do we decide on ###Human:?
    I see lots of variations in different videos: some use ->:, others use ###Input:, etc.

    • @engineerprompt
      @engineerprompt  1 year ago

      It's really up to you how you want to define the format. Some models accept instructions along with the user input, so you really get to decide based on your application.

  • @sauravmukherjeecom
    @sauravmukherjeecom 1 year ago

    Is it possible to directly fine-tune GPTQ models?

  • @LilibethCalva
    @LilibethCalva 1 year ago +1

    Hi, I'm new to this. What I'm trying to do is a custom chat with Llama 2. For example, I have data from a company X, which is about 15 columns × 300 rows.
    So far it responds to me, although its answers are still illogical. Anyway, what I want to know is whether I should create the text column of {human, assistant} for each possible column or question, and how the data would be prepared to train the model.
    Please, can someone guide me on this?

  • @islamicinterestofficial
    @islamicinterestofficial 1 year ago

    Getting the error:
    FileNotFoundError: Couldn't find a dataset script at /content/train.csv/train.csv.py or any data file in the same directory.
    This happens even though I'm running the autotrain command in the same directory where my train.csv file is present. I'm running on Colab, btw.

    • @islamicinterestofficial
      @islamicinterestofficial 1 year ago

      The solution is to provide the path of the folder where the CSV file is present, but not the CSV file name itself...
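
      In other words, a sketch assuming the file lives at /content/train.csv (flag spelling per the underscore-style autotrain version used here):

      ```bash
      # Wrong: autotrain looks for data files *inside* the path you give it
      autotrain llm --train --data_path /content/train.csv ...

      # Right: point at the folder; train.csv is picked up from inside it
      autotrain llm --train --data_path /content ...
      ```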

  • @AADITKASHYAPDPSN-STD
    @AADITKASHYAPDPSN-STD 6 months ago

    Sir, I don't have ChatGPT Plus. Are there any alternatives?

    • @engineerprompt
      @engineerprompt  6 months ago

      Look into Groq; it's a free API (at the moment).

    • @AADITKASHYAPDPSN-STD
      @AADITKASHYAPDPSN-STD 6 months ago

      @@engineerprompt Thank you so much, sir. But I decided I'd do it without any LLMs, so I wrote my own code using Python and pandas. If you want, I could share the code with you?

  • @fernando88to
    @fernando88to 1 year ago

    How do I use this local template in the localGPT project?

    • @engineerprompt
      @engineerprompt  1 year ago

      The localGPT code has support for custom prompt templates. You will need to provide your template there.

  • @ruizhou1243
    @ruizhou1243 9 months ago

    There is no code snippet showing how to make it work? I don't understand it.

  • @srikrishnavamsi1470
    @srikrishnavamsi1470 1 year ago

    How can I contact you, sir?

  • @milesbarn
    @milesbarn 8 months ago

    According to OpenAI's terms, it is not allowed to use GPT-4's output, any of it, even if the data fed into it is yours, to train models other than OpenAI's.

    • @phat80
      @phat80 5 months ago +2

      Who cares? Who will know? 😅

  • @Mysteriousworld1902
    @Mysteriousworld1902 4 months ago

    I get a "your session crashed after using all available RAM" error.

  • @YamengLi-g4l
    @YamengLi-g4l 1 year ago

    How can I build my labels with input and output? I found that Llama 2 pieces the input and output together; can my labels match the input_ids?