Excellent! I am trying the uncensored fp16 model tonight! I will most certainly be changing the prompts.
Yes! More videos about fine tuning would be great. Also, for people who have laptops with gpu's smaller than 16gig, ways to maybe more slowly be able to train/tune models in the llama/mistral/zephyr family. And also using and training for non-generative uses of the model such as distance metrics and classification, etc. Many thanks!
The very restrictive default system prompt actually indicates that the models themselves aren't overly censored, pruned or gated (otherwise there wouldn't be a need for this system prompt). On top of that, I highly recommend playing with the temperature. It struck me that the demo spaces default the temperature to 0.1 (almost zero), but the sliders go all the way up to 5. I'm getting much more interesting results with higher temperatures, even as high as 1.5-1.75. If you go beyond that, the models start getting very »drunk«, but the ramblings they emit are actually quite funny.
The Llama 2 models are capable of generating quite diverse texts, I found in my simple experiments. For the public demo spaces, Meta went double-safe with an extremely low default temperature and a very restrictive system prompt, I guess to avoid day-0 flak.
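For intuition on why this matters so much: temperature just rescales the logits before the softmax, no model needed to see the effect. A minimal sketch in plain Python (the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before the softmax.
    Low T (e.g. 0.1) sharpens the distribution toward the top logit;
    high T (e.g. 1.5) flattens it, so sampling gets more diverse."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # made-up logits for three tokens
cold = softmax_with_temperature(logits, 0.1)    # nearly all mass on token 0
hot = softmax_with_temperature(logits, 1.5)     # mass spread across the tokens
```

At 0.1 the sampler almost always picks the top token (hence the "safe" demo behavior); at 1.5 the tail tokens get real probability, which is where the diversity and the »drunk« ramblings come from.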
Agreed. In my early fine-tuning tests of the base models, they don't seem to have crippled these models. I have tried varying the temperature a lot, but I'll run some tests on that. Thanks
Finally Meta open-sourced Llama. Big kudos to Meta. I would like to see how to tune the model for prediction. For example, for a bank loan application: based on some personal financial information, the model should predict yes or no and also explain the reasoning behind the final result. I've never seen an example doing prediction with an LLM. Also, how do you tune the model on your Q&A data? Treat it like normal documents, or...? As always, great video 👏👏
When it *smiles apologetically*, even it knows it isn't being as helpful an assistant as it could be, hehe. Glad to see it's a fellow AI enthusiast and advocate for open source at heart.
Lol agree
Idk if you read the paper, but they say they trained the "system prompt" behavior on synthetic instructions generated from constraints (hobbies, languages, and characters) and randomly mixed together. They also made the descriptors progressively less detailed, all the way down to just the character name. Once you see it, it's clear: the slightest suggestion of a hobby, language, or character risks putting this model into "roleplay mode", where you then see lots of *bouncy bouncy* and/or emojis. And once it decides it's "roleplaying", it's likely to exhibit a weird amalgamation of the different roleplays.
I have not found a system prompt or other formatting that can 100% prevent the model from talking itself into the "roleplaying" behavior, but the difference is stark when it kicks in:
[INST] You are an AI chatbot.
What's a dog? [/INST] OH BOY, A DOG IS A FURRY FRIEND! *pant pant* 🐶🐾 ... (etc)
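For anyone trying to reproduce this: the Llama 2 chat format puts the system prompt inside `<<SYS>>` tags within the first `[INST]` block. A small helper for building a single-turn prompt (the function name and the example strings are just illustrative):

```python
def build_llama2_prompt(system_prompt, user_message):
    """Single-turn Llama 2 chat prompt: the system prompt sits inside
    <<SYS>> tags within the first [INST] block."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_llama2_prompt(
    "You are an AI chatbot. Answer plainly and do not roleplay.",
    "What's a dog?",
)
```

Even with the system prompt in the right place, as noted above, the model can still drift into roleplay mode.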
Yeah, the "Public Figure" one is a bit weird. I also found it interesting that they got those synthetic constraints from the model itself. I had some fun with the 70B model using a system prompt where I told it it was a drunk assistant that slurred its words and had bad spelling. It certainly played drunk, but it wasn't so good at the bad spelling. Overall I think the real power here lies in fine-tuning from the base model yourself. Have you found any good tricks for steering it away from the roleplaying?
Such a great video. Not only informative but also experimental.
(i did come to the video to get more information regarding the tokenizer and got distracted)
What really amazed me is that the 13B Llama 2 model is a multilingual polyglot; that was impossible with the 30B Llama 1, and only appeared from 65B up. It's like they compressed 65B into 13B.
Yeah the smaller models certainly have gotten a lot better.
I'm still learning, but this has been really informative. Thanks Sam
Beautiful!
What a time to be alive
Absolutely love this! Thanks for it
Amazing as always, thank you! One comment, just my personal opinion: I would like to see *less* LangChain stuff, as many of us do not like the framework. Looking forward to your fine-tuning videos and more general LLM hacks. Thank you!
Good feedback. Thanks.
Do you know of alternatives to LangChain? For example, for agents?
@@samwitteveenai Just let them skip the LangChain videos. I find your LangChain videos very helpful. Vishal does not speak for "many of us".
Hello Sam, thanks for the nice explanation, very good video. Which resource type are you using on Colab?
I tried a V100 but it's not working with bfloat16. Any recommendation?
Thanks!
Very helpful. Thanks for sharing!
I am trying to build CSV question answering using Llama 2, but it is not able to provide correct answers since I have 99 columns and 180 rows. I tried TAPAS with no success, as it has a 512-token limitation. I am also looking for a way to filter a subset of the dataframe based on a query, but there is no open-source model available for that. Is there any other approach you would suggest to solve this problem?
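One pragmatic workaround, sketched here with stdlib-only code and made-up sample data: filter the table down to the relevant rows first, so only a small slice of the 99-column table has to fit in the prompt:

```python
import csv
import io

def filter_rows(csv_text, column, value):
    """Keep only rows where `column` equals `value`, so a small relevant
    slice of a wide table can fit inside the model's context window."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(column) == value]

# Made-up sample data standing in for the real 99-column CSV.
sample = "name,dept,salary\nAna,IT,50\nBo,HR,40\nCy,IT,60\n"
it_rows = filter_rows(sample, "dept", "IT")  # 2 rows instead of the full table
```

The filtered rows can then be rendered as text and placed in the prompt, which sidesteps the 512-token ceiling that TAPAS imposes.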
Excellent video, thank you again for sharing the code! I'm a little flabbergasted that my 4090 can't seem to run meta-llama/Llama-2-13b-chat-hf in 8-bit. It will load without quantization, but then I have no working memory left to prompt it afterwards. Any suggestions?
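For a rough sense of why a 24 GB card struggles with 13B, here's a back-of-the-envelope VRAM estimate (the 20% overhead factor for activations and the KV cache is a crude assumption, not a measured figure):

```python
def vram_gb(n_params_billion, bytes_per_param, overhead=1.2):
    """Parameter count times precision width, plus ~20% headroom for
    activations and the KV cache (a crude rule of thumb)."""
    return n_params_billion * bytes_per_param * overhead

fp16 = vram_gb(13, 2)    # fp16: 2 bytes/param, well over a 24 GB 4090
int8 = vram_gb(13, 1)    # 8-bit: should fit, with limited headroom
int4 = vram_gb(13, 0.5)  # 4-bit: comfortable
```

By this estimate fp16 can't fit at all, 8-bit fits but leaves little room for generation, and 4-bit quantization is the comfortable option on a 24 GB card.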
Good stuff. One question: why is the model downloaded as float16, but inference is then done with bfloat16?
Hi Sam, thanks for the wonderful video. I have a doubt regarding batch-wise prompting: passing the batch size as an input so the model takes multiple prompts and generates outputs for the whole batch at once. How can I achieve this? Can you let me know?
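A common pattern for this is to split the prompt list into fixed-size batches, tokenize each batch with padding, and call `model.generate()` on the batch. A minimal sketch of just the batching step (plain Python; the prompt strings are placeholders):

```python
def batch_prompts(prompts, batch_size):
    """Split prompts into fixed-size batches; each batch can then be
    tokenized together (padding=True) and passed to model.generate()."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

prompts = ["p1", "p2", "p3", "p4", "p5"]
batches = batch_prompts(prompts, 2)  # [["p1", "p2"], ["p3", "p4"], ["p5"]]
```

Each inner list would then go through something like `tokenizer(batch, padding=True, return_tensors="pt")` before generation; remember to set a pad token, since the Llama tokenizer doesn't define one by default.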
What's the strategy for APIs for these models? Should we anticipate the community building those, or Meta?
Hey, can you suggest how to run inference to have a conversation with the Llama 2 chat model, demonstrating that it can remember the context of earlier prompts and completions?
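The chat model itself is stateless, so "memory" in practice means resending the earlier turns in the Llama 2 multi-turn format on every call. A sketch of rebuilding that prompt (the helper name and example turns are made up):

```python
def build_chat_prompt(history, new_message):
    """Rebuild the full Llama 2 multi-turn prompt from (user, assistant)
    pairs; prior turns must be resent on every call."""
    prompt = ""
    for user, assistant in history:
        prompt += f"<s>[INST] {user} [/INST] {assistant} </s>"
    prompt += f"<s>[INST] {new_message} [/INST]"
    return prompt

history = [("Hi there", "Hello! How can I help?")]
chat_prompt = build_chat_prompt(history, "What did I just say?")
```

After each generation you'd append the new (user, assistant) pair to `history`, trimming old turns once the rebuilt prompt approaches the 4096-token context window.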
Hello sir, I am trying to load the transformer as an LLM using CTransformers in VS Code, but it doesn't have a tokenizer, so after doing the embeddings and running the app in Streamlit I am getting an excessive-token error: about 1k tokens exceeding the max_token_length of 512. I have tried different embedding models and vector stores, but the result stays the same. Should I clone the entire repo? In that case I get an error that AutoModelForCausalLM is not suitable for the model I am loading locally. Can you please suggest a solution to these? For the first case I am using a 4-bit quantised model running on CPU, and for the second one falcon-7b-instruct.
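One common fix for that 512-token ceiling is to chunk the text before embedding it. A rough sketch, assuming ~4 characters per token (a heuristic stand-in, not the real tokenizer):

```python
def chunk_text(text, max_tokens=512, chars_per_token=4):
    """Split text on word boundaries into chunks that stay under the
    model's max_token_length, assuming ~4 characters per token."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + len(word) + 1 > max_chars:
            chunks.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("lorem " * 1000)  # several chunks, each under ~512 tokens
```

Each chunk is then embedded and stored separately, so no single input blows past the embedding model's limit.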
All the models downloaded, 350Gb, was just waiting for you to show me what to do with them. 👍
We're gonna need a bigger boat... 🦈
@@DanielVagg I picked the right time to upgrade my network with a 2.5GbE switch.
You're also going to need a lot of VRAM. I have no idea how to convert the models to different formats, but luckily others have done it already on HF. I think the user TheBloke has converted the Llama 2 models; the 7B one ended up being only 14GB, but it ran slow as on my home machine, so I'll only be able to run them in the cloud. Either that or I'll need to upgrade to a server farm full of A100s 😂
@@DanielVagg Thanks for the info, super handy.
Have you tested the new Petals BitTorrent-style method? A supercluster for the poor? 😉
I'd love to see a video about using these for roleplaying by giving them complete scenarios, personalities, back-story etc - can they stay in character and do they remember and obey these instructions?
Would love to see a fine-tuning and deployment video. How can I deploy this as an API endpoint, and cheaply? HF is $1 an hour. I only have an RTX 3080 at home, so I think I need cloud deployment?
You have only one choice: the new BitTorrent-style method from Petals, offloading processing to all the hardware on your local network, or cooperating with neighbors or friends.
@@fontenbleau interesting! do you have an example tutorial?
@@Ryan-yj4sd I haven't tried it myself yet; I'm in search of the perfect Linux distribution. Maybe the first tutorial videos are already published.
Could you create a tutorial documenting the features of loralib?
Thanks Sam. Can this possibly be run on a laptop?
I think the 4-bit versions should work on a laptop. I am trying to make a video on them for next week.
What is the maximum length for this model? Is it OK to assume it is 512, meaning 512 tokens, with each token being something like 4 words?
The context window is 4096 tokens for this model. And it's the other way around: a token is smaller than a word; in English a word averages around 1-2 tokens, or roughly 4 characters per token.
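For a quick sanity check without loading the tokenizer, a common rule of thumb is ~4 characters of English per token. A tiny sketch (heuristic only; the real count comes from the model's tokenizer):

```python
def estimate_tokens(text):
    """Very rough heuristic: ~4 characters of English per token.
    The real count comes from the model's tokenizer."""
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 4096  # Llama 2 context length in tokens

essay = "token " * 1000                          # ~6000 characters
fits = estimate_tokens(essay) <= CONTEXT_WINDOW  # ~1500 tokens, so it fits
```

`tokenizer(text)["input_ids"]` from the Hugging Face tokenizer gives the exact count when precision matters.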
I don't see the code in your github. Did you have a colab link you could share? Thanks
The Colab is in the description and I will put it up on github in a few hours.
It looks like Facebook is trying to cover their ass, knowing that the open-source community will un-nerf the models.
Wow. Thanks Sam
Can you provide a link to the ipynb file?
Check out the description for the Colab etc
I was having a gated-access 403 issue; it turned out I needed permission from HF as well as from Meta.
very nice!
Sam, I know you know of this one, but I think nous-hermes-13b.ggmlv3.q4_0.bin running in the CLI is a remarkable model. Just in the command line alone, with 200 tokens, I was able to get a continual thought process by simply asking "can you continue". I am going to run it in Docker and use the exposed endpoint to query it through the other tech. What a wonderful journey this is!
I forgot to add that this is all being run under GPT4All.