Build Open Source "Perplexity" agent with Llama3 70b & Runpod - Works with Any Hugging Face LLM!
- Published: 16 Jun 2024
- In this video, you'll learn how to build a custom AI agent using the powerful Llama 3 70b model deployed on Runpod using vLLM. This method is also compatible with any Hugging Face LLM, providing flexibility and scalability for your AI projects.
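To give a feel for the end result: once the vLLM server is up, any OpenAI-compatible client can talk to it. A minimal sketch, where the pod URL and model name are placeholders for your own deployment:

```python
# Minimal sketch: querying a vLLM server deployed on Runpod through its
# OpenAI-compatible API. The base_url below is a placeholder; Runpod's
# HTTP proxy format is https://<pod-id>-<port>.proxy.runpod.net.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-POD-ID-8000.proxy.runpod.net/v1",  # placeholder pod URL
    api_key="not-needed",  # vLLM accepts any key unless started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize today's top AI news."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, swapping in any other Hugging Face model is just a matter of changing the model name the server was launched with.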
Need to develop some AI? Let's chat: www.brainqub3.com/book-online
Register your interest in the AI Engineering Take-off course: • Building Chatbots with...
Hands-on project (build a basic RAG app): www.educative.io/projects/bui...
Stay updated on AI, Data Science, and Large Language Models by following me on Medium: / johnadeojo
GitHub repo: github.com/john-adeojo/custom...
vLLM blog: blog.vllm.ai/2023/06/20/vllm....
Can You Run it: huggingface.co/meta-llama/Met...
Runpod Template: runpod.io/console/deploy?temp...
Custom agent deep dive: • Build your own Local "...
Chapters
Introduction: 00:00
Inference Server Schema: 01:40
Determine memory requirements: 04:50
Deploying server on Runpod: 07:32
Using the inference server with the agent: 16:30
Demoing the custom agent: 19:35
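As a rough guide to the "Determine memory requirements" chapter: the back-of-the-envelope math is weights = parameters × bytes per parameter, plus headroom for the KV cache. A sketch with illustrative numbers (the 20% headroom figure is an assumption, not from the video):

```python
# Back-of-the-envelope VRAM estimate for serving a model with vLLM.
params_billions = 70     # Llama 3 70b
bytes_per_param = 2      # fp16/bf16; roughly 1 for 8-bit, 0.5 for 4-bit quantization

weights_gb = params_billions * bytes_per_param   # ~140 GB at fp16
total_gb = weights_gb * 1.2                      # ~20% headroom for KV cache (assumption)

print(f"Weights: ~{weights_gb:.0f} GB, suggested VRAM: ~{total_gb:.0f} GB")
# ~140 GB of weights won't fit on a single 80 GB GPU, so you'd shard
# across multiple GPUs using vLLM's --tensor-parallel-size flag.
```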
Thank you for the excellent video. I appreciate all the detailed steps for setting up a vLLM inference server on RunPod. It's a cost-effective alternative to purchasing an expensive PC, which could break the bank.
WORKED as advertised! Well done, John. Thank you.
Good stuff as always, thank you very much.
Great video again. Can't wait for you to try and run the coding models.
Phenomenal
please check "Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models"
Nice video!! But it seems most of your videos on AI agents are around web search.
Thank you very much, great content
You're very welcome!
Wow - big difference between the 8b and 70b models. Do you think the 70b models are good enough for agents?
Hey, amazing content! I was just wondering, you deploy the pods "On-Demand", does that mean you only pay the GPU time you actually needed it? Or does it cost you as long as the pod is running because the GPU is reserved for you or something like that?
Thank you! Regarding your question, the example in this tutorial charges hourly. However, they do also provide a serverless deployment. Going the serverless route means you pay nothing when the GPU is idle. Here's the doc: www.runpod.io/serverless-gpu
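For reference, calling a Runpod serverless endpoint follows their documented runsync pattern; the endpoint ID, API key, and input payload below are placeholders that depend on your own worker:

```python
# Hedged sketch of invoking a Runpod serverless endpoint synchronously.
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello from a serverless worker"}},  # payload shape is worker-specific
    timeout=120,
)
print(resp.json())
```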
Nice result here with Llama3 70b fp16.
The whole time I was thinking, "What about Groq?", since inference for the same model appears to be free.
Groq has very low rate limits atm. But yeah, the speed is amazing.
Good stuff! 👍 So would this be considered just as secure as hosting on Azure? I mean would your company data be sequestered in its own virtual machine environment?
Great question. At its core, RunPod is a platform that orchestrates GPU resources. The GPUs themselves are provided by third-party data centers. This is what I pulled from their compliance doc:
“End-to-end Encryption: Data in transit and at rest is encrypted using industry-leading protocols. This ensures that your AI workloads and associated data remain confidential and tamper-proof.”
“Compliance Adherence: Different data centers might have varying compliance certifications. While we ensure that all our partners uphold stringent standards, the specifics of each compliance are directly managed by the respective data center.”
Here’s the doc if you want to read further: www.runpod.io/compliance
Awesome, thanks for the info!.👍
I just found what I was looking for with your link. Here is a list of compliance certifications regarding data security.
List of Certifications
It's vital to understand that while RunPod does not directly hold certifications like SOC 2, ISO 27001, or GDPR, many of our partner data centers do. Here's a quick snapshot of many of the certifications our data centers hold:
ISO 27001
ISO 20000-1
ISO 22301
ISO 14001
HIPAA
NIST
PCI
SOC 1 Type 2
SOC 2 Type 2
SOC 3
HITRUST
GDPR compliant
Excellent :) Thanks. How much GPU do you actually need if you run it yourself rather than through a service?
Same question
Great vid, thanks. Please test the new Microsoft Phi-3 Medium etc. as agents; they might work well, since it's much better than Llama 3 8b.
I'll be doing a series of tests for a variety of open-source models. Phi will be on the list.
@@Data-Centric Awesome, thanks. On a recent video I saw something interesting: the presenter mentioned that the Mistral 7b model makes a great agent for reasons like its architecture and native function calling, I think. I see a new one was just released; apparently as an agent it works better than other popular local Ollama models, though obviously not at the 70b level.
Thanks for the video! How do you find out how much compute time/cost of the queries you run?
For this deployment pattern, pods run 24/7 until you stop them, so your compute cost is charged hourly (quoted when you deploy a pod!). You could work out cost per query from the hourly cost.
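As a rough sketch of that cost-per-query math (both figures below are assumptions for illustration, not Runpod quotes):

```python
# Rough cost-per-query math for an hourly-billed pod.
hourly_rate = 2.00        # USD/hour for the pod (assumed)
queries_per_hour = 500    # measured throughput of your agent (assumed)

cost_per_query = hourly_rate / queries_per_hour
print(f"~${cost_per_query:.4f} per query")  # ~$0.0040 with these numbers
```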
@@Data-Centric So basically it's time-based then, where it starts at the beginning of the request and stops when the request has completed? So it goes off the elapsed time between start and stop?
Is there any way you can add a config option in your GitHub repo for using Runpod serverless? It seems like it could be better cost-wise when doing inference.
I'll look into this!
I'm curious, bro, why you chose Runpod over Lightning AI?
No reason other than I haven't used Lightning AI.
Cost?
Why can't I connect on HTTP port 8000?
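A quick way to check is to hit the server's built-in endpoints through Runpod's HTTP proxy. This sketch assumes port 8000 was added under the pod's exposed HTTP ports and that you substitute your own pod ID; if it fails, the port usually wasn't exposed or the model is still downloading/loading:

```python
# Connectivity check for a vLLM server behind Runpod's HTTP proxy.
import requests

base = "https://YOUR-POD-ID-8000.proxy.runpod.net"  # placeholder pod URL
print(requests.get(f"{base}/health", timeout=30).status_code)   # 200 once the server is ready
print(requests.get(f"{base}/v1/models", timeout=30).json())     # lists the served model
```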
I was wondering if this is really going to be cheaper than, say, using OpenRouter or Together AI for Llama 70b at 80c per million tokens. I've been running thousands of API calls on quite a bit of data and used less than a dollar on the API, so I'm wondering if $2 per hour is going to be cheaper. I guess if you're running agents continuously for hours, they can do unlimited work in that hour, so the GPU rental may be best and the per-token API will cost more, right? I think the only way to know is to test and compare. Also, it's possible that the APIs are quantized more than the Runpod version, so you'd get better results from the on-demand rental. On-demand means you only pay while it's running, right? So you've got to turn it off when done, always? Unsure how these rentals work. I've also been looking at Vast; you can rent and run vLLM there as well, and it's apparently the cheapest, but when I checked, prices for your setup were only about 20c cheaper per hour, not a huge difference, and I think Vast has reliability concerns. Have you looked at it?
Honestly, it's pretty hard to compete with the API costs unless you are saturating the GPUs. GPU rental like Runpod is great for defined tasks (like summarizing 10,000 papers or something like that).
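A rough break-even sketch using the figures from this thread ($2/hr pod, $0.80 per million tokens via an API; treat both as assumptions):

```python
# Break-even: hourly GPU rental vs. per-token API pricing.
pod_cost_per_hour = 2.00          # USD/hour (assumed)
api_cost_per_m_tokens = 0.80      # USD per million tokens (assumed)

breakeven_tokens_per_hour = pod_cost_per_hour / api_cost_per_m_tokens * 1e6
print(f"Break-even: ~{breakeven_tokens_per_hour / 1e6:.1f}M tokens/hour "
      f"(~{breakeven_tokens_per_hour / 3600:.0f} tokens/sec sustained)")
# ~2.5M tokens/hour, i.e. ~694 tokens/sec sustained: the pod only wins
# on cost if you keep it that busy, which matches the point above.
```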
@@robxmccarthy Awesome thanks for confirming!