LangChain + Ray Tutorial: How to Generate Embeddings For 33,000 Pages in Under 4 Minutes
- Published: May 1, 2023
- This tutorial guides you through how to generate embeddings for thousands of PDFs to feed into an LLM. LangChain makes this easy to get started, and Ray scales your workload out across a cluster.
By using Ray Data, a distributed data processing system, you'll generate and store embeddings for 2,000 PDF documents from cloud storage, parallelized across 20 GPUs, all in under 4 minutes and in fewer than 100 lines of code!
Learn More
---
Blog Post: www.anyscale.com/blog/turboch...
Code: github.com/ray-project/langch...
LangChain Docs: python.langchain.com/en/lates...
Ray Docs: docs.ray.io/en/latest/
Ray Overview: www.ray.io/
Join the Community!
---
Twitter: / raydistributed
Slack: docs.google.com/forms/d/e/1FA...
Discuss Forum: discuss.ray.io/
Managed Ray
---
If you're interested in a managed Ray service, check out: www.anyscale.com/signup
#llm #machinelearning #langchain #ray #gpt #chatgpt
very clear and to the point, well presented
Fantastic tutorial! It would be awesome if there were another tutorial on how to set this up locally for local development.
Thanks for the vid! What are your thoughts on other vector DBs like Pinecone, and how do they measure up to Ray?
So once you have the embeddings in the vector DB, how do you then query it and test how fast the Q&As are? Thanks a ton.
How does it compare to InstructorEmbedding?
And what's the performance on CPU?
Which model did you use? I'm actually doing the same with an open-source model, and the token size of a small query is reaching 3,000; I don't know why.
Thank you!
Apart from an S3 bucket, can I use my OneDrive directory or any other local file directory?
Yes! It would look similar. You would still do `ray.data.read_binary_files` and pass in your local directory instead of a path to S3.
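For readers without a cluster handy, the same read can be sketched with the standard library. This is a rough stand-in for `ray.data.read_binary_files` (the helper name below is hypothetical, not a Ray API):

```python
# Sketch: read every file in a local directory as raw bytes, mirroring what
# `ray.data.read_binary_files("/path/to/pdfs")` would give you on a cluster.
from pathlib import Path

def read_binary_files(directory: str) -> list[bytes]:
    """Return the raw bytes of every file under `directory`, in sorted order."""
    return [p.read_bytes() for p in sorted(Path(directory).rglob("*")) if p.is_file()]
```

With Ray, the only change for a local directory is passing the path instead of an `s3://` URI; the rest of the pipeline stays the same.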
How can you do this using Azure OpenAI?
Do you do any paid consulting?
How does the code parallelize the ops? When it's loaded from S3, the result is just jobs, right? Do subsequent calls then set up a pipeline?
Not sure what you mean by jobs, but none of the computation actually gets triggered until you iterate through the dataset
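That lazy behavior can be illustrated with plain Python generators, no Ray required (an analogy only, not Ray Data's actual mechanism):

```python
# Lazy pipeline sketch: like a Ray Dataset, each stage only records what to do;
# no work happens until something iterates over the final stage.
calls = []  # track which operations actually ran

def load(paths):
    for p in paths:
        calls.append(("load", p))
        yield f"bytes-of-{p}"

def parse(blobs):
    for b in blobs:
        calls.append(("parse", b))
        yield b.upper()

pipeline = parse(load(["a.pdf", "b.pdf"]))  # stages chained, nothing executed yet
assert calls == []                          # no computation triggered so far
results = list(pipeline)                    # iterating drives the whole chain
assert len(calls) == 4                      # two loads + two parses, interleaved
```

Ray Data behaves analogously: transformations like `map_batches` build a plan, and execution is triggered when you consume the dataset.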
OSError: [Errno 0] — what can you suggest for the `AssignProcessToJobObject() failed` error?
What was the rough cost of the AWS time for 2,000 PDFs?
The whole job itself is under 4 minutes… so the actual compute is less than $1 if using spot instances.
@amogkamsetty5892 It was under 4 minutes because of parallelism; I think it'd be significantly longer on a single spot instance, and you'd need a GPU instance anyway. Sorry for replying 6 months later!
Thanks, but what about scanned PDFs? Is there any way to handle the exceptions?
While running the code I received the following error. How do I resolve it?
Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format
is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes.
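As the message says, one workaround is to opt out of strict mode before Ray starts. A minimal sketch of setting that variable from Python (it must be set before Ray is imported/initialized, on every cluster process):

```python
# Opt out of Ray Data strict mode (Ray 2.5+) to restore legacy behavior.
# This must run BEFORE importing/initializing Ray on each process.
import os

os.environ["RAY_DATA_STRICT_MODE"] = "0"
# import ray  # only import and ray.init() after the variable is set
```

The longer-term fix is to return dicts of NumPy arrays from your `map_batches` functions rather than standalone Python objects, as strict mode requires.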
Is it possible to use cl100k_base as the model for creating the embeddings?
If so, how can I make it use the GPUs?
Why load the PDF files as bytes?
The data needs to be pulled from S3 over the network, which is done as bytes. Then, the bytes are parsed as strings on the compute cluster.
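A minimal sketch of that bytes-to-text step, assuming plain UTF-8 payloads stand in for the downloaded files (real PDFs need a parser; the tutorial uses LangChain's PDF loader for that):

```python
# Sketch: raw bytes fetched over the network become text on the workers.
def bytes_to_text(blob: bytes) -> str:
    """Decode a downloaded payload, dropping any undecodable bytes."""
    return blob.decode("utf-8", errors="ignore")

pages = [b"page one", b"page two"]   # stand-ins for downloaded file contents
texts = [bytes_to_text(b) for b in pages]
assert texts == ["page one", "page two"]
```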
Can you do the same thing without LangChain? I do not want to use LangChain.
Why don't you want to use LangChain?
You can build your own custom functions. Referring to the LangChain source documentation can help give an understanding. I've done this a few times!
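For instance, the chunking step LangChain handles with `RecursiveCharacterTextSplitter` can be approximated in a few lines of plain Python (a rough sketch with no separator awareness, not a drop-in replacement):

```python
def split_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Naive fixed-window splitter with overlap, similar in spirit to
    LangChain's text splitters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

From there, any embedding model (e.g. a `sentence-transformers` model) can embed the chunks directly, with or without LangChain in the loop.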
Are you using cloud GPUs?
Aren't OpenAI and HuggingFace embeddings incompatible?
I don't think that's the case! Any embedding model can be used with any LLM. It doesn't necessarily have to be OpenAI embeddings with GPT-4.
@amogkamsetty5892 No, I don't think so. The ChatGPT 3.5 / davinci API returns a vector of some 1,532 dimensions, and every model returns different dimensions based on how it was trained.
I think what @Amog means is that you can use the above algorithm with either OpenAI or HuggingFace. You have to use the same embedding for both building the vector store and querying, but in both cases they just return vectors.
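That consistency requirement can be made concrete with cosine similarity, which is essentially what a vector store computes at query time (a stdlib sketch, not any particular vector DB's implementation):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity. Only meaningful when both vectors come from the
    SAME embedding model: different models produce different dimensionalities
    and live in unrelated vector spaces."""
    if len(a) != len(b):
        raise ValueError("vectors from different models are not comparable")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# A query vector is compared against stored document vectors the same way,
# which is why indexing and querying must share one embedding model.
assert abs(cosine([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-9
```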
Does Ray have a free tier?
Ray is open source!
They probably mean the Anyscale managed infra.
Attention!
7:39
Cost at start: $29.32; overall: $35.05.
So embedding 2,000 PDF files (33,000 pages) using Ray's service cost about $5.73.
Parallelyze 😅
Interesting, but if you have these skills, why use LangChain? It's an extra layer that penalizes performance and maintenance time, when you could use basic functions very easily and much more efficiently while still using Ray.
LangChain is a great tool for prototyping, but in production it can lead to a nightmare when models change in the near future.