This is probably the best video for finetuning i have came so far. it is very detailed. but the only thing missing is "custom dataset". can you please make a video on how to make a custom dataset, i mean, the correct format and everything we should be following to make our correct "custom dataset", and also if possible do it for the latest llama 3.2 3b model. please also show us that after making the dataset, where to and how to upload the dataset for the finetuning. please make it very detailed.
What if we want to add multiple datasets to the training? Do we run that code block multiple times with the different urls put in each time, or will that break it?
What I would suggest is run 1 dataset, and once the finetuned model is saved in your collab, use than location in place of the original hugging face model link to train on another dataset. This way you would have both llms with 1. Single dataset 2. Both datatset and can compare if it is being overtrained.
@@xclbrxtra Thanks for the fast response. Alternatively, you can change the code at the end to this to combine any number of datasets: from datasets import load_dataset, concatenate_datasets dataset1 = load_dataset("gbharti/finance-alpaca", split = "train") dataset2 = load_dataset("practical-dreamer/RPGPT_PublicDomain-alpaca", split = "train") dataset3 = load_dataset("vicgalle/alpaca-gpt4", split = "train") dataset4 = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split = "train") dataset = concatenate_datasets([dataset1, dataset2, dataset3, dataset4]) dataset = dataset.map(formatting_prompts_func, batched = True,)
@@xclbrxtra Or just delete my comment with a code workaround... I even made sure the example used non-overlapping datasets to prevent overtraining. But thanks for nothing..........
The best model training video currently out there!
This is probably the best video for finetuning i have came so far. it is very detailed. but the only thing missing is "custom dataset". can you please make a video on how to make a custom dataset, i mean, the correct format and everything we should be following to make our correct "custom dataset", and also if possible do it for the latest llama 3.2 3b model. please also show us that after making the dataset, where to and how to upload the dataset for the finetuning. please make it very detailed.
🎉 thanks!
What if we want to add multiple datasets to the training? Do we run that code block multiple times with the different urls put in each time, or will that break it?
What I would suggest is run 1 dataset, and once the finetuned model is saved in your collab, use than location in place of the original hugging face model link to train on another dataset. This way you would have both llms with 1. Single dataset 2. Both datatset and can compare if it is being overtrained.
@@xclbrxtra Thanks for the fast response. Alternatively, you can change the code at the end to this to combine any number of datasets:
from datasets import load_dataset, concatenate_datasets
dataset1 = load_dataset("gbharti/finance-alpaca", split = "train")
dataset2 = load_dataset("practical-dreamer/RPGPT_PublicDomain-alpaca", split = "train")
dataset3 = load_dataset("vicgalle/alpaca-gpt4", split = "train")
dataset4 = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split = "train")
dataset = concatenate_datasets([dataset1, dataset2, dataset3, dataset4])
dataset = dataset.map(formatting_prompts_func, batched = True,)
@@xclbrxtra Or just delete my comment with a code workaround... I even made sure the example used non-overlapping datasets to prevent overtraining. But thanks for nothing..........
Can I do that on mobile 😢
Yes, it uses google Collab so it's online
I wanted a video like this for a long time 🫡, thanks sir❤