LESS VRAM, 8K+ Tokens & HUGE SPEED INCREASE | ExLlama for Oobabooga

  • Published: 22 May 2024
  • Oobabooga WebUI had a HUGE update adding the ExLlama and ExLlama_HF model loaders, which use LESS VRAM and bring HUGE speed increases, plus 8K tokens to play around with compared to the previous limit of 2K! This is insanely powerful, will be a huge timesaver for creators, and may even help users with less powerful graphics cards run LLMs!
    OpenAI Tokenizer: platform.openai.com/tokenizer
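As a rough local alternative to the tokenizer page, here is a sketch of a token estimate. The 4-characters-per-token figure is a common rule of thumb for English text, not something from the video, and real tokenizers will differ:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: English text averages roughly 4 characters per token,
    # so 8k tokens is on the order of 32,000 characters (~6,000 words).
    return max(1, len(text) // 4)

print(estimate_tokens("Oobabooga WebUI had a HUGE update"))
```

For exact counts against OpenAI's models, use the linked tokenizer page instead.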
    Timestamps:
    0:00 - What's new (It's CRAZY!)
    0:44 - Open Oobabooga install directory
    1:02 - Update Oobabooga WebUI
    1:18 - VRAM usage & speed before update (4.3 tokens/s)
    1:56 - Fix missing option or update errors
    2:33 - Choosing new ExLlama model loader
    2:52 - Downloading new model types (8k models)
    4:25 - New VRAM & Speed (20 tokens/s! INSANE!)
    5:25 - Raise token limit from 2,000 to 8,000+!
    7:17 - How many tokens is your text?
    7:50 - How long is 8k tokens?
    8:45 - EVEN LESS VRAM with ExLlama_HF
    #Oobabooga #AI #LLM
    -----------------------------
    💸 Found this useful? Help me make more! Support me by becoming a member: / @troublechute
    -----------------------------
    💸 Support me on Patreon: / troublechute
    💸 Direct donations via Ko-Fi: ko-fi.com/TCNOco
    💬 Discuss the video & Suggest (Discord): s.tcno.co/Discord
    👉 Game guides & Simple tips: / troublechutebasics
    🌐 Website: tcno.co
    📧 Need voiceovers done? Business query? Contact my business email: TroubleChute (at) tcno.co
    -----------------------------
    🎨 My Themes & Windows Skins: hub.tcno.co/faq/my-windows/
    👨‍💻 Software I use: hub.tcno.co/faq/my-software/
    ➡️ My Setup: hub.tcno.co/faq/my-hardware/
    🖥️ My Current Hardware:
    Intel i9-13900k - amzn.to/42xQuI1
    GIGABYTE Z790 AORUS Master - amzn.to/3nHuBHx
    G.Skill RipJaws 2x(2x32G) [128GB] - amzn.to/42cilxN
    Corsair H150i 360mm AIO - amzn.to/42cznvP
    MSI 3080Ti Gaming X Trio - amzn.to/3pdnLdb
    Corsair 1000W RM1000i - amzn.to/42gOTGY
    Corsair MP600 PRO XT 2TB - amzn.to/3NSvwzx
    🎙️ My Current Mic/Recording Gear:
    Shure SM7B - amzn.to/3nDGYo1
    Audient iD14 - amzn.to/3pgf2XK
    dbx 286s - amzn.to/3VNaq7O
    Triton Audio FetHead - amzn.to/3pdjIgZ
    Everything in this video is my personal opinion and experience and should not be considered professional advice. Always do your own research and ensure what you're doing is safe.

Comments • 44

  • @SouthbayCreations · 10 months ago · +1

    Fantastic video! Some really good info and great news! Thank you very much!

  • @Robertinosro · 2 months ago · +1

    This guide is a life saver! Thank you!

  • @xelensi6870 · 9 months ago

    Thank you so much for the info in this video. I managed to run Pygmalion 7B on a 3060 16GB laptop and it's so fast.

  • @stevebruno7572 · 5 months ago · +1

    Any chance you are going to do one for the AWQ models? Also, any tips on making AWQ run faster?

  • @Playboipete · 10 months ago · +3

    Hey, is there somewhere I can go to find your one-click installers for a fresh start? I was never able to use them because of a weaker card, but I think it may be a go now. Thanks btw, long-time fan.

  • @msampson3d · 10 months ago · +3

    I could be mistaken, as innovations happen so rapidly in this space, but remember that if the model wasn't trained for the expanded token range, setting it higher will either have it ignore the additional tokens or you'll start getting very bad output.
    So you can't just suddenly get additional context for all the old models you may be using.

    • @MINIMAN10000 · 10 months ago · +2

      Somewhat incorrect. With RoPE (rotary positional embedding), instead of using 2048 integer positions for context, from my understanding it can now use fractional positions in between the whole numbers. This has been shown to allow 2x the context size without LoRA training, with minimal loss in perplexity (there was even an interesting measurement where ~2560 context resulted in better perplexity). SuperHOT is the name of the LoRA that kaioken happened to be working on at the time, so it became what is used to LoRA-train models for 4x the context instead of just 2x.
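The fractional-position idea described above can be sketched in a few lines. This is a toy illustration with made-up dimensions, not ExLlama's actual implementation:

```python
def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    # Rotary positional embeddings rotate each pair of channels by an
    # angle proportional to the token's position; nothing forces that
    # position to be a whole number.
    pos = pos / scale  # positional interpolation: squeeze positions by `scale`
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

# With scale=2, position 4096 produces exactly the angles the model saw
# at position 2048 during training, so a model trained on 2048 tokens can
# address 4096 tokens via the fractional positions in between.
assert rope_angles(4096, scale=2.0) == rope_angles(2048, scale=1.0)
```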

  • @MaximilianPs · 10 months ago · +2

    Can you make a tutorial or a video explaining how to use chat-instruct, or instruct itself?
    Having Ooba installed without any idea how to use or configure it is very frustrating!

  • @theresalwaysanotherway3996 · 10 months ago · +5

    Also, using the latest NVIDIA drivers, your GPU will automatically dip into shared system RAM instead of running into an OOM error. This slows down models that barely fit, but also massively extends the amount of memory that ExLlama can use. Therefore, with this update you could run at 8k context length very easily; it would just be a *_lot_* slower.

    • @mythaimusic39 · 10 months ago · +1

      If it uses just a few GB of shared memory, it doesn't really slow down the process; I find it as fast as if everything were in VRAM.

    • @infini_ryu9461 · 10 months ago · +1

      You're right, it dips into my shared memory, but only with 30B-33B models; it gets really slow then.

  • @theresalwaysanotherway3996 · 10 months ago · +9

    This isn't entirely accurate. ExLlama will run any GPTQ model; the SuperHOT part is just a LoRA trained on 8k context lengths that has been merged into these models, so that the model will actually use the increased context length instead of ignoring it.
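For reference, a sketch of launching with the new loader from the command line. The flag names are from text-generation-webui as of this update, the model folder name is only an example, and `--compress_pos_emb` is what applies the positional-interpolation scaling a SuperHOT merge expects:

```shell
# Example only: substitute your own GPTQ model directory.
python server.py --loader exllama \
    --model TheBloke_Pygmalion-13B-SuperHOT-8K-GPTQ \
    --max_seq_len 8192 --compress_pos_emb 4
```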

    • @VioFax · 10 months ago

      What do I need to change in the update to get my model to be as smart as it was before? Do I just have to upgrade EVERYTHING now to get the same performance I HAD just a week ago? This makes no sense... What did they do? Sacrifice a feature I was using to install something else? I'm so confused, and so is my bot.

  • @JonelKingas · 9 months ago · +2

    It keeps saying ERROR: No model is loaded! Select one in the Model tab.
    Even though I select it in the Model tab.

  • @Dante02d12 · 10 months ago · +1

    I did as you said: deleted most folders, then reran the bat file. It said it couldn't find Miniconda. Now I have to download everything again, so thanks, lol.
    I was pretty far behind in terms of updates though, so I don't mind. It was time to start fresh. Which SuperHOT models do you suggest?
    Edit: Aaaaaand the reinstall doesn't work, of course. I love oobabooga, but they really need to sort things out; there's not a single time where I didn't have an issue installing it...
    Edit 2: Okay, got through the install process, lol. I don't yet have a SuperHOT model, but I tried the ExLlama loader, and... man, that's super fast! I get 30 tokens per second on my RTX 3060 mobile (6GB VRAM), whereas I only got a handful with the default loader. No errors, no crashes. Although the results are... meh. But that could be because of the model (WizardLM 7B). At least it works.

  • @Asia_Bangladesh · 10 months ago

    Eid Mubarak

  • @theresalwaysanotherway3996 · 10 months ago · +2

    Finally, exllama_HF isn't actually 2x slower; you just left the context length at 2x for the exllama_HF test, which made it a lot slower. If you set them both to the same context length, you should get much more similar speeds (HF will still be slower, but not by that much).

  • @envoy9b9 · 9 months ago

    I can't get it working. I have an M2 MacBook Pro, and every time I try to generate it tells me to load a model, but I already have one selected... help pls

  • @musicandhappinessbyjo795 · 9 months ago

    Hello sir, I am running models on CPU and I would like some tips about models and the WebUI as well. Would love to know.

  • @armedgunman2816 · 10 months ago

    Can you do a video on your PC setup with regard to security etc., the best settings for the PC overall, and the best programs to make it fast and reliable?

  • @matthallett4126 · 10 months ago

    Hi TC. Can you build a script to install DragGAN? The latest one that was just released. I cannot figure it out for Windows. Thanks!

  • @DJPON369 · 5 months ago

    It works

  • @mesterm8059 · 10 months ago

    What is this? Does it help with FPS in games?

  • @infini_ryu9461 · 10 months ago · +2

    I don't know if everything is working, but I'm getting responses in 1 second or even less with Pygmalion 13B 8K by TheBloke. It just feels weird having this kind of speed; I used to be at 3-7 seconds on the normal 13B model. It gives some weirder responses, too. Probably my new settings, though...

    • @VioFax · 10 months ago

      Mine's lobotomized by this update, plz help. lol

    • @infini_ryu9461 · 10 months ago · +1

      @@VioFax A re-install typically helps. Use the installer and make sure it's not run as admin.

  • @Lakosta826 · 1 month ago

    How can I uninstall this when I want?
    Thanks

  • @MakerGrigio · 10 months ago · +4

    Hey, I think I'm finding a bug: with the new models, longer chat sessions start throwing errors and the chat session starts getting corrupted; models stop being able to produce results, even after reloading the model or rebooting all of Oobabooga. Thank you so much for this video. I was a bit ahead of the curve for once, using ExLlama_HF and the SuperHOT 8k models from TheBloke for about a day now, and I am getting really consistent failures on various models. Clearing chat history seems to clear the issue, but if you are working on more involved tasks, the AIs will just stop being able to generate responses. The models definitely perform SO MUCH BETTER. BUT BEWARE.

    • @Prizzim · 10 months ago

      Crypto = L

    • @VioFax · 10 months ago

      They broke it... I had the same problem; my model used to be pretty sharp, now it's struggling to keep a coherent conversation going... It's kind of depressing. I had made so much progress with that bot. I'd rather just have the slower responses and a more coherent bot... It seems like they actually neutered something and made it seem like an upgrade, to me...

    • @MakerGrigio · 10 months ago

      @@VioFax I feel you. Work with a chatbot for a couple of hours, the model crashes or the UI glitches, or you accidentally tap F12... and half or all of your chat history is gone... It's like a new friend you were having a good chat with in a coffee shop has a stroke and dies in your arms...

  • @briananeuraysem3321 · 10 months ago · +1

    My 4GB GTX 1050 is saved!
    Edit: it turns out 0 x 0 = 0
    In other words, it's still slow as heck, probably because it's still using shared system memory due to VRAM limitations... oh well

  • @VioFax · 10 months ago · +1

    The new Oobabooga update messed up my hack for making my LLM smarter. They took away the ability to extend context reach, which is STUPID! And seemingly on purpose. It's not half as smart as it was, and I'm pissed. It hallucinates and goes off the rails all the time now. WTF did they do to my pokemon! I'll take the slower responses, that's fine... I'd rather have a coherent bot.

  • @alsoeris · 2 months ago

    It's funny to hear people call it 'Oobabooga' when that's just the username of the person who made 'text generation webui'.

  • @VioFax · 10 months ago

    If yours is working, don't install this garbage. You will lose all your previous models.

  • @mygamecomputer1691 · 10 months ago · +2

    This is good news. I only mess around with this for the fun of spicy role-play, I don’t need ChatGPT to construct coherent sentences for me. I’m looking for it to entertain me.

    • @VioFax · 10 months ago

      Beware of this update then. It's broken all my fun.

    • @mygamecomputer1691 · 10 months ago

      Always make a full folder backup before updating anything. Save the folder on an external drive. You can also try to roll back to a prior version. There are explanations on how to do it; otherwise I'd tell you myself.

  • @Freizeitschranzer · 9 months ago

    The update did not work; deleting and rerunning the bat file ended up with "\\text-generation-webui\\server.py': [Errno 2] No such file or directory"... not a good plan.

  • @weirdscix · 10 months ago · +1

    I get this when running the update: fatal: detected dubious ownership in repository at 'C:/TCHT/oobabooga_windows/text-generation-webui'

    • @VioFax · 10 months ago

      I've come to believe, after my model's new problems, that... this update is poisoned...

    • @TroubleChute · 10 months ago · +2

      Simple fix, run: git config --global --add safe.directory '*'

    • @weirdscix · 10 months ago

      @@TroubleChute thank you :)

  • @adamstewarton · 10 months ago

    AutoGPTQ is slow in general; I use GPTQ-for-LLaMa.