Thanks for the walkthrough!
So I'm running the utility on an i7 16 GB RAM (no GPU) laptop, but for me it takes close to 10 mins for each response to fully generate. How did you manage to speed yours up? I'm also using Vicuna 7b
Thanks for the help! This is amazing. Sadly I had to run in CPU mode (I only have 8 GB VRAM and that was not enough, even with the 8-bit option turned on). I was curious, for anyone running this on GPU, what card(s) setup do you have?
I’m running a 4080 and it runs without a hitch, though you can also use GGML models which are optimized for CPU!
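For the CPU route, a minimal sketch of loading a GGML model with llama-cpp-python (the model filename and thread count here are just examples, adjust for whichever GGML Vicuna file you downloaded):

```python
from llama_cpp import Llama

# Example path only -- point this at the GGML Vicuna file you actually have.
llm = Llama(
    model_path="./models/vicuna-7b.ggmlv3.q4_0.bin",
    n_ctx=2048,    # context window
    n_threads=8,   # tune to your CPU core count
)

out = llm("Q: What is Vicuna? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```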
Update: I have figured out how to fix the Web UI on Windows. I created a Pull Request to the main repo to fix the bug. I will be live streaming tomorrow demonstrating this!
Nice, you talking about oogabooga?
It was part of their utils package. On Windows there was a bug in its logging system, which expected UTF-8 but was getting Windows-1252. Setting the PYTHONUTF8 environment variable to 1 (Python 3.7+) solves the issue. This lets the model_worker and gradio server boot properly.
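If you want to double-check the fix took effect before launching anything, something like this (Python 3.7+) should show UTF-8 mode active:

```python
import locale
import sys

# Set PYTHONUTF8=1 in the environment first, e.g.
#   cmd:        set PYTHONUTF8=1
#   PowerShell: $env:PYTHONUTF8 = "1"
# then launch the model_worker and gradio server from that same shell.
print("UTF-8 mode:", sys.flags.utf8_mode)                   # expect 1
print("Preferred encoding:", locale.getpreferredencoding()) # expect UTF-8, not cp1252
```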
Though, there does appear to be another issue, at least when running with CUDA. It has a slow but steady memory leak, which will eventually cause the model to crash.
@@AemonAlgiz Right. I also discovered that VRAM usage only keeps climbing and never frees up, so I've switched to GGML models only for now.
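For anyone hitting the same thing: if it's just PyTorch's caching allocator holding on to memory rather than a true leak, flushing between generations can help. This is only a guess and assumes a PyTorch/CUDA backend:

```python
import gc
import torch

def free_cuda_memory():
    """Drop dangling Python references, then release cached CUDA blocks.

    This only helps if the growth is cached allocations; a genuine leak
    (tensors still referenced somewhere) will keep climbing regardless.
    """
    gc.collect()
    torch.cuda.empty_cache()
    # Compare what is actually allocated vs. merely reserved by the cache.
    print(torch.cuda.memory_allocated() / 2**30, "GiB allocated")
    print(torch.cuda.memory_reserved() / 2**30, "GiB reserved")
```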
I'm having all kinds of issues setting up the development environment. Tedious issues like setting up the proper paths, correct versions of things, hardware compatibility, etc. I've spent multiple days on this. Is there a place online I should go for help with this? Which subreddits? Forums?
You could hop on TheBloke's AI Discord; we can help you there!
Hi Aemon! Thanks for the video. I want to build a PC to run a model like you are doing. Would you mind sharing your specs and any advice you have on putting together a PC that can run this model?
Hey Michael!
Vicuna can run on a variety of hardware! How you want to run the model will determine which hardware you'll want to roll with. We have a video on it, if you'd like to check that out as well!
There are two different models. The 7 billion parameter model and the 13 billion parameter model. The 7 billion parameter model requires 30 gigs of system RAM to convert from the LLaMa weights to Vicuna and the 13 billion requires 60 gigs. The 13 billion parameter model is a decent step up in overall performance, though the 7 billion parameter model is pretty impressive as well!
To run either base model with a GPU, you will need 14 gigs for the 7 billion parameter and 28 gigs for the 13 billion parameter model. For CPU mode, with the base model, again 30 and 60 gigs respectively. This is a pretty harsh requirement, though fortunately we can make this a bit easier on your system, with quantization!
Vicuna can run 8-bit quantization, which turns the FP16 weights into 8-bit integers, making it require only 8.5 gigs of VRAM on your GPU for the 7B and 17 gigs for the 13B. For CPU mode, roughly 16 gigs of RAM for the 7B and 32 gigs for the 13B. The only caveat with the quantization right now is that it's slightly less performant, but that should be fixed soon.
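If you want to sanity-check those figures yourself, they're roughly parameter count times bytes per weight; real usage runs a bit higher once activations and buffers are added, which is why the quoted numbers sit above this baseline:

```python
def weight_size_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Baseline weight storage only; activations/buffers add on top of this."""
    return params_billion * bytes_per_weight

for name, params in [("Vicuna-7B", 7), ("Vicuna-13B", 13)]:
    for label, bpw in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        print(f"{name} {label}: ~{weight_size_gb(params, bpw):.1f} GB of weights")
```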
If you want an absolute monster, I would recommend the following specs:
RAM: 64 gigs of DDR5 @ 5200MHz
CPU: 7900/7950X
Motherboard: Anything that supports this CPU will be fine
GPU: RTX 3080/3090/4080/4090 with a heavy preference for the 3090 or 4090, for their VRAM.
Thanks for watching and I hope this helps!
@@AemonAlgiz super helpful. Thank you! I'll let you know how it turns out 😀
Can someone please share a link for the form at 1:44?
You no longer need this; the models are freely available on Huggingface now. I will update this video.
@@AemonAlgiz thank you! You are doing a lot to help me pivot my career. I'm learning a lot and really enjoying it. If you are ever in Austin, dinner and drinks on me!
I have an RTX 3070. I don't know how many gigs of VRAM it has; will it not be able to run the 13B Vicuna model?
It can if you use the 4-bit quantized model. I have a video on the channel with how to get the models, though the installation has changed since I made the video. I am creating an updated version which will be out tomorrow. Though, if you feel comfortable walking through the installation yourself, the description has where to find the models!
@@AemonAlgiz Thanks, I will wait for your new video, since I've been failing to install it.
Can we get converted weights directly for Vicuna?
You can! They’re all available on Huggingface now
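If it helps, here's roughly how you'd pull them with huggingface_hub; the repo id below is only an example, so grab whichever Vicuna repo you actually trust from the Hub:

```python
from huggingface_hub import snapshot_download

# Example repo id only -- check the Hub for the exact Vicuna repo you want.
# snapshot_download fetches every file in the repo into the local HF cache
# and returns the local directory path.
path = snapshot_download("lmsys/vicuna-13b-delta-v1.1")
print("Downloaded to:", path)
```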
But if we get it from elsewhere, how do we know it is safe?
We’re dropping a new video in about an hour with the now publicly available versions, so this is no longer accurate information! I’ll be updating this video to point at the new one :)
Hey Y, here ya go! Vicuna 13B V1.1! With 4-Bit Quantization What Can't it Run On? OogaBooga One Click Installer.
ruclips.net/video/Z3HIPGzZRnc/видео.html
This video takes you right to where to get the model
Thanks mister! 👊
Thank you for watching! I hope it was helpful
This sounds like a lot of work, to say the least.
I figured out how to fix this on Windows! Just made a pull request to fix the repo.
Thanks!