LLaMA2 Tokenizer and Prompt Tricks

  • Published: 27 Oct 2024
  • Science

Comments • 57

  • @Dr_Tripper · 1 year ago · +1

    Excellent! I am trying the uncensored fp16 model tonight! I will most certainly be changing the prompts

  • @onoff5604 · 11 months ago · +1

    Yes! More videos about fine-tuning would be great. Also, for people whose laptop GPUs are smaller than 16 GB, ways to train/tune models in the Llama/Mistral/Zephyr family, even if more slowly. And also using and training the models for non-generative purposes, such as distance metrics and classification, etc. Many thanks!

  • @AdamTwardoch · 1 year ago · +10

    The very restrictive default system prompt actually indicates that the models themselves aren't overly censored, pruned or gated (otherwise there wouldn't be a need for this system prompt). On top of that, I highly recommend playing with the Temperature. It struck me that the demo spaces have the temperature defaulting to 0.1 (almost zero), but the sliders go even to 5. I'm getting much more interesting results with higher temperatures, even as high as 1.5-1.75. If you cross that, the models start being very »drunk« but actually the ramblings they emit are quite funny.
    Llama 2 are capable of generating quite diverse texts, I found in my simple experiments. For the public demo spaces, Meta went double-safe with extremely low default temperature and a very restrictive system prompt, I guess to avoid day-0 flak.
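
    A minimal sketch of playing with the temperature outside the demo spaces (it assumes the 7B chat model, Hub access to it, and a GPU with enough memory; the prompt is just a placeholder):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

    prompt = "[INST] Write a short poem about tokenizers. [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=True is required for temperature to have any effect;
    # ~0.1 matches the demo default, ~1.5 gives the much wilder outputs described above
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.5, top_p=0.95)
    print(tokenizer.decode(out[0], skip_special_tokens=True))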

    • @samwitteveenai · 1 year ago · +1

      Agree. In my early fine-tuning tests of the base models, they don't seem to have crippled these models. I have tried the temperature a lot, but I will run some more tests on that. Thanks!

    • @henkhbit5748 · 1 year ago · +1

      Finally Meta has open-sourced Llama. Big kudos to Meta. I would like to see how to tune the model for prediction, for example for bank loan applications: based on some personal financial information, the model should predict yes or no and also explain the reasoning behind the final result. I've never seen an example doing prediction with an LLM. Also, how do you tune the model on your Q&A data? Treat it like normal documents, or ...? As always, great video👏👏

  • @marshallmcluhan33 · 1 year ago · +2

    When it *smiles apologetically* even it knows it isn't being as helpful an assistant as it could be, hehe. Glad to see it's a fellow AI enthusiast and advocate for open source at heart.

  • @DaeOh · 1 year ago · +4

    Idk if you read the paper, but they say they trained the "system prompt" behavior on synthetic instructions generated from constraints (hobbies, languages, and characters) randomly mixed together. They also progressively made the descriptors less and less detailed, all the way down to just the character name. Once you see it, it's clear: the slightest suggestion of a hobby, language, or character risks putting this model in "roleplay mode", where you then see lots of *bouncy bouncy* and/or emojis. And once it has decided it's "roleplaying", it's likely to exhibit a weird amalgamation of the different roleplays.
    I have not found a system prompt or other formatting that can 100% prevent the model from talking itself into the "roleplaying" behavior, but the difference is stark when it kicks in:
    [INST] You are an AI chatbot.
    What's a dog? [/INST] OH BOY, A DOG IS A FURRY FRIEND! *pant pant* 🐶🐾 ... (etc)
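
    For reference, the full Llama 2 chat format puts the system prompt inside <<SYS>> tags within the first [INST] block; a sketch of the layout (the system text here is just an illustrative example):

    [INST] <<SYS>>
    You are an AI chatbot. Answer plainly and do not role-play.
    <</SYS>>

    What's a dog? [/INST]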

    • @samwitteveenai · 1 year ago · +1

      Yeah, the "Public Figure" one is a bit weird. I also found it interesting that they got those synthetic constraints from the model itself. I had some fun with the 70B model with a system prompt where I told it it was a drunk assistant that slurred its words and had bad spelling. It certainly played drunk, but wasn't so good at the bad spelling. Overall I think the power in these lies in fine-tuning the base model yourself. Have you found any good tricks for steering it away from the roleplaying?

  • @abhay803 · 1 year ago

    Such a great video. Not only informative but also experimental.
    (i did come to the video to get more information regarding the tokenizer and got distracted)

  • @fontenbleau · 1 year ago · +2

    What really amazed me is that the 13B Llama 2 model is a multilingual polyglot; that was impossible with the 30B Llama 1, where you only got it from 65B up. It's like they compressed 65B into 13B.

    • @samwitteveenai · 1 year ago · +1

      Yeah the smaller models certainly have gotten a lot better.

  • @DanielVagg · 1 year ago · +2

    I'm still learning, but this has been really informative. Thanks Sam.

  • @gaius100bc · 1 year ago

    Beautiful!
    What a time to be alive

  • @MadhavanSureshRobos · 1 year ago · +1

    Absolutely love this! Thanks for it

  • @vishalgoklani · 1 year ago · +2

    amazing as always thank you! One comment, just my personal opinion, I would like to see *less* langchain stuff, as many of us do not like the framework. Looking forward to your fine-tuning videos, and more general LLM hacks. thank you!

    • @samwitteveenai · 1 year ago

      Good feedback. Thanks.

    • @azmo_ · 1 year ago

      Do you know alternatives to LangChain? For example for agents

    • @ringpolitiet · 1 year ago

      @@samwitteveenai Just let them skip the langchain videos. I find your langchain videos very helpful. Vishal does not speak for "many of us".

  • @vuletass · 11 months ago

    Hello Sam, thanks for a nice explanation, very nice video. Which Resource Type are you using on Colab?
    I tried with V100 but it's not working for bfloat16. Any recommendation?
    Thanks!
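
    A possible workaround, assuming the error comes from bfloat16 not being supported on pre-Ampere GPUs such as the V100 or T4: load the model in float16 instead (a sketch only).

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,  # fp16 works on the V100; bfloat16 needs Ampere (A100) or newer
        device_map="auto",
    )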

  • @1ofallkind · 1 year ago · +1

    Very helpful, thanks for sharing!
    I am trying to build CSV question answering using Llama 2, but it is not able to provide correct answers, as I have 99 columns and 180 rows. I used TAPAS but had no success because of its 512-token limit. I am also looking for a way to filter a subset of the dataframe based on a query, but there is no open-source model available for that. Is there any other approach you would suggest to solve this problem?

  • @andrewbednar7251 · 1 year ago · +1

    Excellent video thank you again for sharing the code! I'm a little flabbergasted that my 4090 can't seem to run the meta-llama/Llama-2-13b-chat-hf in 8bit. It will load without quantization but I have no working memory to prompt it afterwards. Any suggestions?
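
    One thing worth checking (a sketch, not a verified fix): 8-bit loading only kicks in when bitsandbytes is installed and a quantization config is passed at load time; otherwise the 13B model comes in at full precision and fills the 4090's 24 GB on its own.

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-13b-chat-hf",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # needs bitsandbytes installed
        device_map="auto",
    )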

  • @javiergimenezmoya86 · 1 year ago

    Good stuff. One question: why is the model downloaded as float16, but the inference is then done with bfloat16?
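
    An illustrative sketch of the distinction being asked about: the dtype the checkpoint is stored in and the dtype used at inference are independent, because torch_dtype at load time casts the weights, so fp16 weights from the Hub can be run in bfloat16 on hardware that supports it.

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.bfloat16,  # cast the stored fp16 weights to bf16 for inference
        device_map="auto",
    )
    print(next(model.parameters()).dtype)  # torch.bfloat16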

  • @zeeshansheikh8163 · 1 year ago

    Hi Sam, thanks for the wonderful video. I have a doubt regarding batch-wise prompting: by passing the batch size as an input, it should take multiple prompts and generate outputs for all of them at once. How can I achieve this? Can you let me know?
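
    A rough sketch of batched prompting (the prompts are placeholders): tokenize a list of prompts with left padding, then call generate once for the whole batch.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

    prompts = ["[INST] What is a tokenizer? [/INST]",
               "[INST] Name three uses of LLMs. [/INST]"]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))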

  • @AI_by_AI_007 · 1 year ago

    What's the strategy for APIs for these models? Should we anticipate the community building those, or Meta?

  • @HarshRaj-e6z4n · 1 year ago

    Hey, can you suggest how to run inference for a multi-turn conversation with the Llama 2 chat model, to demonstrate that it can remember the context of earlier prompts and completions?
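
    A sketch of how earlier turns are usually carried forward with the Llama 2 chat model: previous exchanges are replayed inside the prompt, each in its own [INST] ... [/INST] block, so the model "remembers" context only because it is re-sent on every turn (the conversation below is made up).

    history = [("My dog is called Rex.", "Nice to meet Rex! How can I help?")]
    new_user_msg = "What is my dog's name?"

    prompt = ""
    for user_msg, assistant_msg in history:
        prompt += f"<s>[INST] {user_msg} [/INST] {assistant_msg} </s>"
    prompt += f"<s>[INST] {new_user_msg} [/INST]"
    # feed `prompt` to the tokenizer and model.generate as in the other snippets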

  • @vsudbdk5363 · 1 year ago

    Hello sir, I am trying to load the transformer as an LLM using CTransformers in VS Code, but it doesn't have a tokenizer, so after the embeddings and everything, while running the app in Streamlit I get an excessive-token error: about 1k tokens exceeding the max_token_length of 512. I have tried different embedding models and vector stores, but the result stays the same. Should I clone the entire repo instead? In that case I get an error that AutoModelForCausalLM is not suitable for the model I am loading locally. Can you please suggest a solution to these? For the first case I am using a 4-bit quantised model running on CPU, and for the second one Falcon-7B Instruct.

  • @julian-fricker · 1 year ago · +5

    All the models downloaded, 350Gb, was just waiting for you to show me what to do with them. 👍

    • @DanielVagg · 1 year ago · +1

      We're gonna need a bigger boat... 🦈

    • @julian-fricker · 1 year ago · +1

      @@DanielVagg I picked the right time to upgrade my network with a 2.5GbE switch.

    • @DanielVagg · 1 year ago · +2

      You're also going to need a lot of VRAM. I have no idea how to convert the models to different formats, but luckily others have done it already on HF. I think the user TheBloke has converted the Llama 2 models; it ended up being only 14GB for the 7B model. It ran slow as on my home machine, so I will only be able to run them in the cloud. Either that or I'll need to upgrade to a server farm full of A100s 😂

    • @julian-fricker · 1 year ago · +1

      @@DanielVagg Thanks for the info, super handy.

    • @fontenbleau · 1 year ago · +1

      Have you tested the new Petals BitTorrent-style method? A supercluster for the poor? 😉

  • @morganandreason · 1 year ago

    I'd love to see a video about using these for roleplaying by giving them complete scenarios, personalities, back-story etc - can they stay in character and do they remember and obey these instructions?

  • @Ryan-yj4sd · 1 year ago · +1

    Would love to see a fine-tuning and deployment video. How can I deploy this as an API endpoint, and cheaply? HF is $1 an hour. I only have an RTX 3080 at home, so I think I need cloud deployment?

    • @fontenbleau · 1 year ago

      You have only one choice: the new BitTorrent-style Petals method, which offloads processing across all the hardware on your local network, or shares it with neighbors or friends you cooperate with.

    • @Ryan-yj4sd · 1 year ago

      @@fontenbleau interesting! do you have an example tutorial?

    • @fontenbleau · 1 year ago

      @@Ryan-yj4sd I haven't tried it myself yet; I'm in search of the perfect Linux distribution. Maybe the first tutorial videos have already been published.

  • @aurkom · 1 year ago

    Could you create a tutorial documenting the features of loralib?

  • @guanjwcn · 1 year ago · +1

    Thanks Sam. Can this possibly be run on a laptop?

    • @samwitteveenai · 1 year ago · +1

      I think the 4-bit versions should work on a laptop. I am trying to make a video on them for next week.
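
      For anyone who wants to try before that video lands, a sketch of 4-bit loading with bitsandbytes (assumes a recent transformers install); this is what typically gets the 7B chat model down to roughly 4-6 GB of GPU memory:

      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      bnb = BitsAndBytesConfig(load_in_4bit=True,
                               bnb_4bit_quant_type="nf4",
                               bnb_4bit_compute_dtype=torch.float16)
      model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                                   quantization_config=bnb,
                                                   device_map="auto")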

  • @human_agi · 1 year ago

    What is the maximum length of this model? Is it OK to assume it is 512, meaning 512 tokens, with each token being about 4 words?

    • @samwitteveenai · 1 year ago

      the context window is 4096 for this. each word would average about 2-3 tokens.
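
      A quick way to check the words-to-tokens ratio for yourself (a sketch; the sentence is arbitrary):

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
      text = "The quick brown fox jumps over the lazy dog."
      ids = tokenizer(text)["input_ids"]
      print(len(text.split()), "words ->", len(ids), "tokens")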

  • @Ryan-yj4sd · 1 year ago

    I don't see the code in your github. Did you have a colab link you could share? Thanks

    • @samwitteveenai · 1 year ago

      The Colab is in the description and I will put it up on github in a few hours.

  • @RobotechII · 1 year ago · +3

    It looks like Facebook is trying to cover their ass knowing that the opensource community will unnerf the models

  • @FarhadKumer · 1 year ago

    wow. Thanks sam

  • @RoyAAD · 9 months ago

    Can you provide a link to the ipynb file?

    • @samwitteveenai · 9 months ago

      Check out the description for the Colab etc

  • @ArunKumar-bp5lo · 1 year ago

    I was having a gated-access 403 issue; it turned out I needed permission from HF as well as from Meta.

  • @eddyjens4948 · 1 year ago

    very nice!

  • @Dr_Tripper · 1 year ago

    Sam, I know you know of this one, but I think nous-hermes-13b.ggmlv3.q4_0.bin, running in the CLI, is a remarkable model. Just in the command line alone, with 200 tokens, I was able to get a continual thought process by simply asking "can you continue". I am going to run it in Docker and use the exposed endpoint to query it through the other tech. What a wonderful journey this is!

    • @Dr_Tripper · 1 year ago

      I forgot to add that this is all being run under GPT4ALL.