Amazing video, thank you for sharing all of this knowledge with us! These fine-tuned generative code systems could be the next level of programming.
@code_your_own_AI - Hi - are you going to finish the colab and show the results to prove it works? And provide a link to it? Thank you!
Your explanation is awesome!! Thanks for sharing these videos
You are welcome.
Thank you for the video! Can this be used to fine-tune for other languages such as Lisp or Haskell? Or should I pre-train a new model from scratch?
Can you add a video on fine-tuning StarCoder for autocompletion, rather than instructions like they show in their repo?
@code_your_own_AI - Thank you so much for this great video! I am in the middle of preparing the dataset for fine-tuning and wanted to refer to the dataset you had to prepare and preprocess for your PyTorch code assistant. It would really help me to understand whether there is any character limit that I should be aware of while creating the dataset.
I also wanted to know the format in which 'Instruction', 'Input', and 'Output' should be presented - will it be JSONL, .txt, or .csv? Also, for StarCoder, is there a character/token limit that I need to follow for each example?
did you get a reply? looking to do the same
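For what it's worth, instruction-tuning datasets for code models are very commonly stored as JSONL: one JSON object per line with instruction/input/output fields. The exact field names below follow the widespread Alpaca-style convention and are an assumption, not something confirmed in the video - check the training script you actually use. A minimal sketch in Python:

```python
import json

# Hypothetical instruction-tuning record. The field names
# ("instruction", "input", "output") are the common Alpaca-style
# convention -- an assumption, so verify against your training script.
record = {
    "instruction": "Write a PyTorch module wrapping a single linear layer.",
    "input": "",
    "output": (
        "import torch.nn as nn\n\n"
        "class Linear(nn.Module):\n"
        "    def __init__(self):\n"
        "        super().__init__()\n"
        "        self.fc = nn.Linear(10, 1)\n"
        "    def forward(self, x):\n"
        "        return self.fc(x)"
    ),
}

# JSONL means one compact JSON object per line, no outer array.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

# Round-trip check: read the first line back as a dict.
with open("train.jsonl") as f:
    loaded = json.loads(f.readline())
```

On length limits: StarCoder-family models have a fixed context window measured in tokens, not characters, so the practical constraint is that each tokenized example (prompt plus completion) fits in that window - the exact limit depends on the model variant you load.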
Thanks for the video, really interesting. I am fine-tuning code LLMs and this was helpful. As an aside, I find it so strange that token prediction is able to do this, since the LLM needs to be able to "plan" that a function will be needed, declare it, and then use it in exactly the right way.
Wow :D What a cliffhanger at the end^^
Sir, I just wanted to know: can I fine-tune with predefined code and its prompts for better code generation? If yes, how do I proceed with that?
Nice video! Is the outlier in part 3?
of course!
Can I do a code refactoring task with StarCoder? How can I prepare my dataset for that task?
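One common way to frame refactoring as a fine-tuning task is instruction-style pairs where the input is the original code and the output is the refactored version. The field names below are the usual instruction/input/output convention, assumed here rather than taken from the video. A small sketch that also sanity-checks that both versions behave the same:

```python
import json

# Hypothetical refactoring example: input = original code,
# output = the refactored equivalent.
example = {
    "instruction": "Refactor this function to use a list comprehension.",
    "input": (
        "def squares(xs):\n"
        "    out = []\n"
        "    for x in xs:\n"
        "        out.append(x * x)\n"
        "    return out"
    ),
    "output": "def squares(xs):\n    return [x * x for x in xs]",
}

with open("refactor_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

# Sanity check before adding a pair to the dataset: the original and
# refactored functions should produce identical results.
ns_before, ns_after = {}, {}
exec(example["input"], ns_before)
exec(example["output"], ns_after)
assert ns_before["squares"]([1, 2, 3]) == ns_after["squares"]([1, 2, 3])
```

Running an equivalence check like this over every pair is a cheap way to keep behavior-changing "refactorings" out of the training data.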
I don't see any links in the description....
Yes Please
Will this work on an M2 CPU?
I see more and more solutions for apple silicon on reddit and hacker news, but I have no empirical data on stability or performance.