API For Open-Source Models 🔥 Easily Build With ANY Open-Source LLM
- Published: 16 Oct 2024
- In this video, we review OpenLLM, and I show you how to install and use it. OpenLLM makes building on top of open-source models (LLaMA, Vicuna, Falcon, OPT, etc.) as easy as building on top of ChatGPT's API. This lets developers create incredible apps on top of open-source LLMs while having first-class support for tools (LangChain, Hugging Face Agents, BentoML) and one-click deployment. Also, fine-tuning is coming soon!
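For readers who want a feel for what "building on top of it like ChatGPT's API" means, a request to the locally running server might look like the sketch below. This is only illustrative: the localhost:3000 address comes from the video, while the /v1/generate route and the payload field names are assumptions about the OpenLLM server, so check the project's docs for the exact schema. The request is built but not sent, so no server is needed to follow along:

```python
import json

# Assumed defaults from the video: host localhost, port 3000.
# The /v1/generate route is an assumption; verify against the OpenLLM docs.
SERVER_URL = "http://localhost:3000/v1/generate"

def build_generate_request(prompt: str, max_new_tokens: int = 128) -> bytes:
    """Serialize a generation request body. The 'prompt' and 'llm_config'
    field names are assumptions for illustration, not a documented schema."""
    payload = {
        "prompt": prompt,
        "llm_config": {"max_new_tokens": max_new_tokens},
    }
    return json.dumps(payload).encode("utf-8")

# To actually send it (requires a server started with `openllm start opt`):
# import urllib.request
# req = urllib.request.Request(SERVER_URL, data=build_generate_request("Hello"),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```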
Enjoy :)
Join My Newsletter for Regular AI Updates 👇🏼
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries 📈
bit.ly/44TC45V
Links:
OpenLLM Github - github.com/ben...
Yet another piece to the democratization of AI! Very valuable.
Agreed!
Yes indeed!
The future is now 🙌🏼
Democracy means those in power rule. We live in a democracy that’s clearly 1000% centralized. I got the message tho. 👍🏿
@@applyingpressureeveryday Democratization of technology means the de-centralization of technology.
Incredible content, and he doesn't waffle either!!! just to the point, good pace, great voice, great cadence, and perfect audio levels. This channel is gonna be big.
Thank you :)
@@matthew_berman With OpenLLM, you don't get an OpenAI-like API token, right?
@@matthew_berman How can a project like Aider utilize OpenLLM?
You are becoming my favorite AI channel! This is literally exactly what I've needed. I've been looking for an open llm alternative to openai API for querying PDFs with Langchain. I haven't been able to test the largest LLMs using Langflow because it always times out from Huggingface.
Glad I could help 🎉
Matthew, are you pushing it to GitHub? I'm working on the same thing and looking for ideas, but I'm a beginner looking for help.
Stop it, I can't keep up anymore :) Every day I'm pivoting around your content, gimme a break already! What an exciting time to be alive!
Haha nice :) wait until you see the next video!
First time here, man. To the point and straightforward. Thank youuuuu
!!!!!!!!!!!!
You got it!
This is exactly what I’ve recently been looking for! Thanks for showing it off :)
It would be a lot easier for us to follow along and be successful if you did these demos starting with a brand new machine with just python and conda pre-installed. That way our experience would be more likely to match the one in your video *exactly* and we wouldn’t struggle at the points where you say “the first time I tried this, I got an error” or “I already have this installed”. Just a suggestion.
This is awesome. I've played with some different open-source models in Runpod(which is great, btw). And I looked into installing the Text Generator WebUI locally... but I don't have a suitable GPU yet. Ultimately, I want a self-hosted (preferably in a container) API that can run various models and hit from a web browser, or from a console app, or from a game. This looks like exactly what I want.
Now I just need to find a GPU to toss into my server...
Oobabooga's text-generation-webui is compatible with GGML models, which are CPU-only but can use the GPU for a speedup, although the latest versions don't use my GPU for some reason.
@@Trahloc Good to know. I previously tried some stuff that wouldn't run without the Nvidia GPU. I'll have to give this a try and see how it works.
Hadn't heard of OpenLLM before, but now I can't contain my excitement to test it out. A well-paced, well-executed tutorial that touches on the important aspects of deployment. Please follow this space closely, because we'll be following you!!
Thank you for this great tutorial
I'm excited! Yeah! I'm interested in custom/not-listed models, also NLLB-200… And what about Mac? There is no xformers available.
I prefer the Oobabooga web UI, which basically runs an API locally and has a nice button to "import" any Hugging Face model. But this is interesting too.
When you cut out all the dead space, your sentences run together without the natural pause that would allow beginners to digest each new concept before being bombarded with the next five new concepts that are rattled off at the speed of light. Tutorials work best when the newbies have time to let new concepts sink in. I'll be stuck trying to wrap my head around what you just said, and I continually have to pause and rewind to catch what you said while I was still chewing on the first bite. You also run your words together, within the sentences, so I have to continually rewind to make sure that I heard you correctly. Many of us are complete newbs to all of this. The info you provide is great. I watch a ton of your videos. I just wish you'd go a hair slower and dumb it down for those of us who are brand new and have to look up the definition of each piece of new tech jargon used (had to ask AI what the hell a bento was, and it thought I was interested in Japanese cuisine).
Awesome quick video!
Thanks :)
Will be waiting for their fine-tuning feature. Should be interesting.
Are these models running locally? If yes, what are the hardware requirements?
Yes. It depends on the model. Smaller models have very modest requirements.
@@matthew_berman thank you so much Sir.
@@khandakerrahin1003 you got it!
Wow, that is really simple. Thanks for showing this API tool for LLMs 👍
@@JohnSmith-jc7yi no way. You can run local models on much smaller machines
My last computer was a gaming rig. My newest build this week will be specifically for ML and I cannot wait!!! Easy sub.
I'm gonna need a cardboard box server again. Time to start a 24/7 AI stream. 😂
Looking forward to the mosaic 33b. Loving the videos
Literally testing it right now!
Still not sure how building on the models works in the examples. I see you using the models, but how do we build on top of them? Sorry if I missed it.
Are you using WSL? Would you recommend using Windows over Linux? I've had problems trying to install all the Nvidia GPU drivers and CUDA and pytorch modules... using Ubuntu, to the point I had to reinstall Ubuntu.
Thanks for the video. This is getting close to something I'm looking for, but it still requires a permanent system set up with some decent hardware. Would be interesting to see this combined into a single Google Colab that could be run as needed, for those of us looking to utilize this on an occasional basis.
Colab would be very useful. I wish we could keep a Colab running forever.
Please explain if this is hosted locally as a server or if we need RunPod or Chainlit.
You can run this locally AND/OR deploy it to the cloud when you're ready for production.
@@matthew_berman please go over steps to host for production
I also would like to know how to deploy this to the cloud. And what alternatives there are for doing that. Does HuggingFace have a cloud solution (for free)?
Super interesting and exciting project. I didn't quite get though if the models are running locally? I thought this required a lot of GPU power.
You can run LLMs locally on the CPU, the GPU, or shared between the CPU and GPU. CPU-only is quite slow, though.
He's running a $2000 video card.
I was looking for something like this for 2 weeks; thank you for your video. It made my learning much easier... please make a LangChain video too.
What is the advantage of this over textgen webui? And does it handle custom models as well as textgen webui does (4-bit GPTQ models, etc.)?
Really dreaming about the moment this can be used to query my own set of documents, like in your previous videos about GPT4All.
It is like personal computers in the era of Steve Jobs, when they were still not available to everyone. I guess soon this will become even more open with projects like this.
Hey Matthew, this is really great! So with this I can replace the OpenAI API and run all the apps that are built to use OpenAI?
dude, totally just did it! YOU RULE
Very similar to LocalAI.
Seems the difference is that LocalAI is compatible with the OpenAI API.
Love this channel
Thank you :)
What is the beginning transition? That is epic looking.
Just adding a positive comment for the algorithm! Great video
Haha thank you!
How fast is this if running locally? Is speed going to be an issue?
A feature on my wish list is being able to GET and POST the context so it can be edited on the fly
"I don't know what any of that means but doesn't seem to be causing any problems" amen lol
Ty for all these videos getting tons of ideas
Uhh so a wrapper over a wrapper (HuggingFace/LangChain)??? What does this new API add exactly (except for new bugs)?
Great youtuber. Regards from Uruguay!
I loved your content, mate ❤️ Thanks for your video. Just a quick question: can I point localhost:3000 to a domain? This localhost URL can be used as an API while I'm running it on my PC, but what if I want to point this URL to a domain name that's easily accessible to all?
Will be waiting for your answer 😥
Keep up the great work dude ❤
Wonderful content!! Will this make it easier to work with AutoGen? 🤔
Maybe a silly question, but what are the minimum hardware requirements?
Is conda installation more stable than pip? Just wondering which one to use. mostly, I have used pip previously.
Awesome! Great stuff and thank you very much! Do you have an idea how to implement a QLoRA fine-tuned Falcon model?
Interesting, would be cool to have response streaming feature.
It probably can, through the gRPC interface.
@@s0ckpupp3t yes, but they depend on another project, which doesn't support it 😕
@@s0ckpupp3t at least yet
Just wanted to point out for anyone trying: if you do this on Windows and want to install directly without conda, you'll get an error pointing to the vLLM library, noting that it can only be used on Linux.
Superb knowledge. As a follow-up, can you create a video to help users choose a GPU & CPU configuration for serving?
I want to create a custom chatbot that utilizes multiple Gemini and GPT APIs.
Does an API remember the history of messages in a chat?
This is crucial for maintaining context within the conversation.
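Generally, HTTP LLM APIs of this kind are stateless: the server does not remember past messages, so the client keeps the transcript and resends it with every call. A minimal illustrative sketch; the `Conversation` class and the `call_llm` stand-in below are hypothetical, not part of the Gemini or GPT APIs:

```python
# Client-side conversation memory: the server is stateless, so the client
# stores every turn and resends the whole transcript on each call.
# `call_llm` is a stand-in for whatever backend you use (OpenLLM, Gemini, GPT).

class Conversation:
    def __init__(self):
        self.history = []  # list of (role, text) tuples

    def ask(self, user_text, call_llm):
        self.history.append(("user", user_text))
        # Flatten the full history into one prompt so the model sees context.
        prompt = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = call_llm(prompt)
        self.history.append(("assistant", reply))
        return reply

# Usage with a fake backend that just reports how much context it received:
convo = Conversation()
fake_llm = lambda prompt: f"saw {len(prompt.splitlines())} lines"
convo.ask("My name is Ana.", fake_llm)       # model sees 1 line of context
print(convo.ask("What is my name?", fake_llm))  # now it sees 3 lines
```

With a real model, the second call would be able to answer "Ana" only because the first exchange was resent as part of the prompt.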
Why did you go through the process of creating a Conda env when you then install with Pip?
Awesome! Thanks for sharing! 😀
I think you should show what these LLMs are really capable of; the examples you're showing are pretty simple.
Well, if and when they make the training section and LangChain available, it will be a really cool project to have!!!
Many thanks!!!
Nice video, well done, thanks :)
They don't say if the API is compatible with OpenAI's counterpart. It would be nice to be able to use the tools built for OpenAI's API.
I am facing issues with "openllm start opt"; I get a "KeyError: 'OPENLLM_OPT_MODEL_ID'" error. Why is that? I searched online and still can't find a solution.
Great content. Thank you!
Will there be input/output token limits when building custom llm models using openllm like we see in other monetized llm api models?
Awesome, thank you for this vid
Does using the Falcon model improve accuracy more than the OPT one?
Please help. I have CUDA and torch all working, but when running the model, it says CUDA not found or something. Any ideas?
Hi Matthew, thanks for this video! I have a question about how to use open-llm and have documents as a knowledge base.
When I install OpenLLM, the installation process starts but never completes. I tried 3 times with the same result. I don't know what mistake I'm making; can you help me?
Sorry for my noob question, but could someone explain why we’d need more than ChatGPT 4?
If you've got models downloaded, can they be used?
Is it possible to link your own fine-tuned LLM stored on your local machine?
You are tooooo awesome!!!
AMAZING!!! 💪🔥🙌
Very nice
Which server configuration do you recommend if I wanna run Falcon?
I want to run the model on RunPod and create an API to run a service (Python) from my personal computer. Any idea how to do that?
Finally!!
Wooo
Does this API also work when running LLMs using CPU resources?
Awesome! I just want to know what model is good for the Spanish language. I have tried some and they are just awful.
It says I have no GPU available to run the Falcon model. I have NVIDIA drivers downloaded but still no luck. What can I do? How about a GPU from RunPod?
Hey, for me, when I try to import openllm in Python, it says the module doesn't exist. Any suggestions?
Can I use the same in a JavaScript application?
What is the minimum GPU requirement to use this?
You are a freakin genius.😎
Absolutely crazy
4:04 It started rambling like a mad man 😭
Does it support autogen or crewAI?
Can we provide a customized knowledge base to the system?
Do you know of any current services where you could host something like this on the cloud for free to test out creating something like a chat bot that you could call and add extra functionality to, via python code running locally on your machine?
Any idea of the server configuration needed to use these AI models on custom AWS or Linode servers???
How can we run an LLM at home and have the same API that OpenAI uses?
Does the API support embedding functionality?
Embeddings are just custom text that is passed to the LLM to use as a reference. To get the embeddings: you would need to run a model that can specifically convert text to vectors. Then send your embedded docs to that embedding model via the API. Take the vector response and store it in a vector store. Then, when you make a query, convert your query to a vector via your local model, then perform a similarity search on your vector store. That will return some docs, and you pass the text of those docs to the LLM.
@@cheifei Are you implying it's absolutely irrelevant how you create the embeddings? Don't different models use different embedding algorithms? That's why they have different vector dimensionalities, among other things.
@@eyemazed No, I am not implying that. I agree with you that you have to use the same embedding model for consistency. I think the missing piece is that you pass the text of the query and the text (not vectors) of the embedded docs to the LLM.
@@cheifei I see, I thought you needed to use the same embedding API for vectorizing the context that you pass along with the prompt to the LLM as the LLM uses to vectorize your prompt. So if I understand correctly, you're free to choose any embedding API/vector store you want, because it's separate from the LLM and is only used to retrieve the context that you send along with your prompt to the LLM.
@@eyemazed That is correct.
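The flow described in this thread can be sketched end to end with a toy bag-of-words "embedding" standing in for a real embedding model. Everything here is illustrative: a real setup would use a dedicated embedding model and a proper vector store, but the shape of the pipeline is the same:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Embed the docs once and keep the vectors in a "store".
docs = ["cats are small pets", "the sun is a star", "dogs are loyal pets"]
store = [(embed(d), d) for d in docs]

# 2. Embed the query with the SAME embed function, then similarity-search.
query = "are cats good pets"
best = max(store, key=lambda pair: cosine(embed(query), pair[0]))

# 3. Pass the retrieved TEXT (not vectors) to the LLM alongside the question.
prompt = f"Context: {best[1]}\nQuestion: {query}"
print(best[1])  # -> cats are small pets
```

Note how step 3 matches the point made above: the vectors only do retrieval, and it is plain text that reaches the LLM.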
Is this a replacement for TextGen WebUI? do they perform the same function?
Yeah, same function, but all of the textgen web UIs cost money. If you build an app for many people, you will have to pay a lot every time your users send a query.
I am getting this error and can't find any solution to fix the dependency error:
"Failed to build ninja
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (ninja)"
Nice, now I just need a 24GB GPU card.
Is there any free hosting so I can host and test it? And also, how do I use a domain instead of localhost?
Please don't get me wrong, I am a software developer, but I have no idea how to use LLMs.
Does it have a streaming API endpoint?
Does this run on CPU or GPU?
Has anyone been able to get this working recently? I follow the docs to a 'T' and the opt model is unable to start up. I ran openllm models --show-available, and it looks like it's not properly downloading the model locally after running 'pip install "openllm[opt]"', as it says 'No local models available'. Do we need to download the models with 'openllm import ...'? I've tried that as well, with 'openllm import opt facebook/opt-1.3b', to no avail. Surely I must be doing something silly!? Any help appreciated!
Got it working! Turns out you need to manually import the models; running pip install openllm[] does not download the model. You must use the import command and specify the model. In my case, I also needed to pass the --serialisation legacy flag, because the model did not support safetensors!
I think it's great.
Now all you need is a $10,000 computer! No, but seriously, the last piece of the puzzle here is a service like RunPod where you can install this and it charges you for exact inference time for each request. Does anybody know of anything like that?
I think the 3B and 7B parameter models can run locally on a CPU or even a 12GB RTX 3060.
No, the last piece of the puzzle is open source models that aren't crap.
@@clray123 The easy fine-tuning of these models for specific tasks, and the algorithmic optimization to run them more efficiently on a spectrum of hardware from low end to high end, is what is going to make the difference against proprietary models.
@@elchippe LoRA fine-tuning is like a 200-line Python script. You clone the script from Git and run it. The difficulty of fine-tuning is not that you lack some silly API, but rather the choice of parameters and, foremost, of the input data. And you will not be able to fine-tune any serious models on "low-end" hardware, even with (Q)LoRA and whatnot.
@@clray123 Yeah, well, I meant this particular puzzle of being able to host your own personal API for an open-source model. Model quality is beside the point.
How would I use this in RunPod? :)
But how to deploy it on db?
Can I deploy this to Colab?
Yes
I think LangChain can create an API? No?