LangChain + Ray Tutorial: How to Generate Embeddings For 33,000 Pages in Under 4 Minutes

  • Published: May 1, 2023
  • This tutorial guides you through generating embeddings for thousands of PDFs to feed into an LLM. LangChain makes it easy to get started, and Ray scales your workload out across a cluster.
    By using Ray Data, a distributed data processing system, you'll generate and store embeddings for 2,000 PDF documents from cloud storage, parallelized across 20 GPUs, all in under 4 minutes and in less than 100 lines of code!
    Learn More
    ---
    Blog Post: www.anyscale.com/blog/turboch...
    Code: github.com/ray-project/langch...
    LangChain Docs: python.langchain.com/en/lates...
    Ray Docs: docs.ray.io/en/latest/
    Ray Overview: www.ray.io/
    Join the Community!
    ---
    Twitter: / raydistributed
    Slack: docs.google.com/forms/d/e/1FA...
    Discuss Forum: discuss.ray.io/
    Managed Ray
    ---
    If you're interested in a managed Ray service, check out: www.anyscale.com/signup
    #llm #machinelearning #langchain #ray #gpt #chatgpt
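
The description above compresses the whole pipeline into one sentence. As a rough stdlib-only sketch of the per-document work that Ray distributes (the real tutorial uses Ray Data's `map_batches` with LangChain's text splitter and embedding model; `split_into_chunks` and `fake_embed` here are simplified stand-ins, not their APIs):

```python
# Stdlib-only sketch of the per-document work the pipeline distributes.
# In the real tutorial, Ray Data runs steps like these across GPU workers,
# and LangChain supplies the text splitter and embedding model.

def split_into_chunks(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, roughly what a
    character-based text splitter does."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def fake_embed(chunk: str, dim: int = 8) -> list[float]:
    """Deterministic dummy embedding; a real pipeline calls a model here."""
    return [float(ord(c)) for c in chunk[:dim].ljust(dim)]

page = "Ray Data reads the PDFs, LangChain splits and embeds them. " * 20
chunks = split_into_chunks(page)
embeddings = [fake_embed(c) for c in chunks]
print(len(chunks), len(embeddings[0]))
```

Each chunk/embedding pair would then be written to a vector store; Ray's contribution is running this per-document work in parallel across the cluster.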

Comments • 39

  • @powpowpony9920
    @powpowpony9920 7 months ago

    very clear and to the point, well presented

  • @fabsync
    @fabsync 4 months ago

    Fantastic tutorial! It would be awesome to have another tutorial on how to set this up locally for local development.

  • @markchung4299
    @markchung4299 1 year ago

    Thanks for the vid! What are your thoughts on other vector DBs like Pinecone, and how they measure up to Ray?

  • @ChuckWilliamsTechnology
    @ChuckWilliamsTechnology 1 year ago +2

    So once you have the embeddings in the vector DB, how do you then query and test how fast the Q&As are? Thanks a ton.

  • @greendsnow
    @greendsnow 1 year ago +1

    How does it compare to InstructorEmbedding?
    And what's the performance on CPU?

  • @softwaredeveloper2990
    @softwaredeveloper2990 1 year ago

    Which model did you use? I'm doing the same with open source, but the token size of a small query is reaching 3,000; I don't know why.

  • @elvisasihene2403
    @elvisasihene2403 1 year ago +1

    Thank you!
    Apart from an S3 bucket, can I use my OneDrive directory or any other local file directory?

    • @amogkamsetty5892
      @amogkamsetty5892 1 year ago +2

      Yes! It would look similar. You would still do `ray.data.read_binary_files` and pass in your local directory instead of a path to S3.
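
For readers wondering what that call does with a local path: conceptually it just walks the directory and loads each file's raw bytes. A stdlib-only imitation (`read_binary_files_local` is a made-up helper, not a Ray API):

```python
# Stdlib sketch of what ray.data.read_binary_files does conceptually for a
# local directory: collect every file under the path and load its raw bytes.
import pathlib
import tempfile

def read_binary_files_local(directory: str) -> list[dict]:
    """Return {"path": ..., "bytes": ...} records for every file in the tree."""
    return [
        {"path": str(p), "bytes": p.read_bytes()}
        for p in sorted(pathlib.Path(directory).rglob("*"))
        if p.is_file()
    ]

# Demo with a throwaway directory standing in for your PDF folder.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "doc1.pdf").write_bytes(b"%PDF-1.4 fake")
    (pathlib.Path(d) / "doc2.pdf").write_bytes(b"%PDF-1.4 fake2")
    records = read_binary_files_local(d)
    print([pathlib.Path(r["path"]).name for r in records])
```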

  • @bakertamory
    @bakertamory 1 year ago +1

    How can you do this using Azure OpenAI?

  • @izzymiller2019
    @izzymiller2019 1 year ago +1

    Do you do any paid consulting?

  • 1 year ago +1

    How does the code parallelize the ops? When it's loaded from S3, the result is just jobs, right? Subsequent calls then set up a pipeline?

    • @amogkamsetty5892
      @amogkamsetty5892 1 year ago +1

      Not sure what you mean by jobs, but none of the computation actually gets triggered until you iterate through the dataset.
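
That lazy-execution behavior can be illustrated with plain Python generators: building the pipeline only records the steps, and nothing runs until iteration pulls data through. A small sketch (not Ray code; the function names are invented for illustration):

```python
# Generators mimic a lazy dataset plan: `calls` stays empty until the final
# iteration pulls data through the whole read -> parse pipeline.
calls = []

def read_files():
    for name in ["a.pdf", "b.pdf"]:
        calls.append(f"read {name}")
        yield name

def parse(stream):
    for name in stream:
        calls.append(f"parse {name}")
        yield name.upper()

pipeline = parse(read_files())   # plan built; no work done yet
assert calls == []               # nothing has executed

results = list(pipeline)         # iterating triggers the whole pipeline
print(results, calls)
```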

  • @gunnursenturk2399
    @gunnursenturk2399 10 months ago

    What can you suggest for an `OSError: [Errno 0]` / `AssignProcessToJobObject() failed` error?

  • @nimaa-j7823
    @nimaa-j7823 1 year ago +11

    What was the rough cost of the AWS time for 2,000 PDFs?

    • @amogkamsetty5892
      @amogkamsetty5892 1 year ago +2

      The whole job itself is under 4 minutes… so for the actual compute, less than $1 if using spot instances.

    • @98f5
      @98f5 7 months ago

      @amogkamsetty5892 It was under 4 minutes because of the parallelism; I think it'd be significantly longer on a single spot instance, and you'd need a GPU instance anyway. Sorry for replying 6 months later.

  • @fantasyxpress7966
    @fantasyxpress7966 16 days ago

    Thanks, but what about scanned PDFs? Is there any way to handle the exceptions?

  • @ArunVR-py3lv
    @ArunVR-py3lv 1 year ago

    While running the code I received the following error. How do I resolve it?
    Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format
    is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes.

  • @rossdickinson5010
    @rossdickinson5010 1 year ago

    Is it possible to use cl100k_base as the model for creating the embeddings?

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +1

    Why load the PDF files as bytes?

    • @amogkamsetty5892
      @amogkamsetty5892 1 year ago

      The data needs to be pulled from S3 over the network, which is done as bytes. Then, the bytes are parsed as strings on the compute cluster.
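
That bytes-to-strings step has the shape of a Ray Data `map_batches` UDF: a dict of columns in, a dict of columns out. A minimal sketch, assuming plain UTF-8 text instead of real PDF parsing (the tutorial's actual code runs a PDF parser here):

```python
# Sketch of a map_batches-style UDF: the batch arrives as a column of raw
# bytes (as pulled from S3) and leaves as a column of decoded strings.
# Real PDF bytes would need a PDF parser, not a simple decode.
def convert_batch(batch: dict) -> dict:
    return {"text": [b.decode("utf-8", errors="replace") for b in batch["bytes"]]}

batch = {"bytes": [b"page one text", b"page two text"]}
out = convert_batch(batch)
print(out["text"])  # → ['page one text', 'page two text']
```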

  • @labloke5020
    @labloke5020 1 year ago +1

    Can you do the same thing without LangChain? I do not want to use LangChain.

    • @PetersonChevy
      @PetersonChevy 1 year ago +1

      Why don't you want to use LangChain?

    • @rossdickinson5010
      @rossdickinson5010 1 year ago

      You can build your own custom functions. Referring to the source documentation from LangChain can help give an understanding. I've done this a few times!

  • @yvonnewebster8439
    @yvonnewebster8439 7 months ago

    Are you using cloud GPUs?

  • @_thehunter_
    @_thehunter_ 1 year ago +2

    Aren't OpenAI and HuggingFace embeddings incompatible?

    • @amogkamsetty5892
      @amogkamsetty5892 1 year ago

      I don't think that's the case! Any embedding model can be used with any LLM. It doesn't necessarily have to be OpenAI embeddings with GPT-4.

    • @_thehunter_
      @_thehunter_ 1 year ago +1

      @@amogkamsetty5892 No, I don't think so. The ChatGPT 3.5 / Davinci API returns vectors of some 1,536 dimensions, and every model returns a different dimensionality based on how it was trained.

    • @anyscale
      @anyscale  1 year ago +1

      I think what @Amog means is that you can use the above algorithm with either OpenAI or HuggingFace. You have to use the same embedding for both building the vector store and querying, but in both cases they just return vectors.
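
To make the "same embedding for indexing and querying" point concrete, here is a toy sketch with two made-up fixed-dimension "models"; stored and query vectors only compare meaningfully when both come from the same one:

```python
# Two toy "embedding models" with different output dimensions. Similarity
# search compares vectors component-wise, so index and query vectors must
# come from the same model. These stand-ins are not real embedding models.
def model_a(text: str) -> list[float]:   # 4-dim stand-in
    return [float(len(w)) for w in (text.split() + [""] * 4)[:4]]

def model_b(text: str) -> list[float]:   # 6-dim stand-in
    return [float(len(w)) for w in (text.split() + [""] * 6)[:6]]

def dot(u: list[float], v: list[float]) -> float:
    if len(u) != len(v):
        raise ValueError("embedding dimensions differ; index and query must use the same model")
    return sum(a * b for a, b in zip(u, v))

stored = model_a("distributed embedding pipeline")
ok = dot(stored, model_a("embedding pipeline"))      # same model: fine
try:
    dot(stored, model_b("embedding pipeline"))       # different model: error
except ValueError as e:
    print("mismatch:", e)
```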

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +2

    Does Ray have a free tier?

    • @CadeDaniel
      @CadeDaniel 1 year ago +2

      Ray is open source!

    • @iansnow4698
      @iansnow4698 1 year ago +1

      They probably mean the Anyscale infra.

  • @l501l501l
    @l501l501l 9 months ago

    Attention!
    7:39
    Cost at start: 29.32; overall: 35.05.
    That's 5.73 spent embedding 2,000 PDF files (33,000 pages) using Ray's service.

  • @quantumbyte-studios
    @quantumbyte-studios 1 year ago +1

    Parallelyze 😅

  • @blender_wiki
    @blender_wiki 10 months ago

    Interesting, but if you have these skills, why use LangChain? It adds an extra layer that penalizes performance and maintenance time, when you could use basic functions just as easily and much more efficiently while still using Ray.
    LangChain is a great tool for prototyping, but in production it can lead to a nightmare when models change in the near future.
