Generative AI / LLM - Document Retrieval and Question Answering

  • Published: 11 Dec 2024

Comments • 79

  • @christopheryoungbeck8837
    @christopheryoungbeck8837 7 months ago +1

    I'm a junior engineer intern for a startup called Radical AI and I am doing exactly this process right now lol. You do a better job of explaining everything than my seniors.

    • @ml-engineer
      @ml-engineer  6 months ago

      Appreciate your feedback, Christopher.

    • @SahlEbrahim
      @SahlEbrahim 6 months ago

      Hi, did you run this code? I can't access the XML file, as it is not found. How did you run it?

  • @kenchang3456
    @kenchang3456 1 year ago +2

    Excellent walk-through, I'll have to give it a try. Thank you very much.

  • @OritYaron
    @OritYaron 1 year ago +3

    Great video. What modifications should be done to run queries on a public index (not under a VPC)?

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi
      Deployment and querying are different for a public endpoint:
      How to deploy a public endpoint
      cloud.google.com/vertex-ai/docs/matching-engine/deploy-index-public
      How to query a public endpoint
      cloud.google.com/vertex-ai/docs/matching-engine/query-index-public-endpoint
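      A hedged sketch of querying a public endpoint with the plain Vertex AI SDK (the project, endpoint name, deployed index ID, and embedding are placeholders; find_neighbors() is the public-endpoint counterpart of match()):
      from google.cloud import aiplatform

      aiplatform.init(project="your-project", location="us-central1")  # placeholders
      endpoint = aiplatform.MatchingEngineIndexEndpoint(
          index_endpoint_name="projects/123/locations/us-central1/indexEndpoints/456"  # placeholder
      )
      query_embedding = [0.1] * 768  # placeholder; use your real query embedding here
      response = endpoint.find_neighbors(
          deployed_index_id="my_deployed_index",  # placeholder
          queries=[query_embedding],
          num_neighbors=5,
      )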

  • @russelljazzbeck
    @russelljazzbeck 1 year ago +1

    At 3:38 you say "You need a Google project", but I'm not sure what that is exactly. Do I need a GCP account and then create a VM for that?

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Yes, the LLM and the vector database I am using are Google services. You need a GCP account, but there is no need for a VM; it's all serverless.

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Here is a great introduction to GCP: ruclips.net/video/GKEk1FzAN1A/видео.html

  • @d_b_
    @d_b_ 1 year ago +1

    7:38 for the embedding, is the data sent off the computer? It seems like it if you are using retries. If so, is there any way to completely contain this process so that no data leaves the machine? This would be relevant to at least the embedding, vector DB, and LLM prompts. Thank you

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Hi, everything is running in the cloud; this includes the embedding model, the text model, and the vector database. Yes, the data leaves your computer. If you want to run this locally on your machine, you need to use open-source models and frameworks and host the model and database locally. For vector similarity you could use Spotify's Annoy or Facebook's Faiss, with a Falcon model as the LLM.
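      For a fully local alternative, a minimal sketch assuming the open-source sentence-transformers and faiss-cpu packages (not used in the video); the model name and chunk texts are only illustrations:
      import faiss
      import numpy as np
      from sentence_transformers import SentenceTransformer

      chunks = ["first document chunk", "second document chunk"]  # placeholder texts
      model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
      embeddings = model.encode(chunks, normalize_embeddings=True)
      index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
      index.add(np.asarray(embeddings, dtype="float32"))
      query = model.encode(["your question"], normalize_embeddings=True)
      scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
      print([chunks[i] for i in ids[0]])  # the two most similar chunks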

    • @d_b_
      @d_b_ 1 year ago

      @@ml-engineer Thank you! Great video

  • @rinkugangishetty5713
    @rinkugangishetty5713 9 months ago +1

    I have content of nearly 100 pages. Each page has nearly 4,000 characters. What chunk size should I choose, and what retrieval method can I use for optimised answers?

    • @ml-engineer
      @ml-engineer  9 months ago

      The chunk size depends on the model you are planning to use. But generally I highly recommend having text overlap between your chunks. Also, finding the right chunk size can be treated like a hyperparameter: too large chunks might add too much noise to the context, while too narrow chunks may miss information.
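      As a rough sketch (the numbers are illustrative and should be tuned like hyperparameters), LangChain's RecursiveCharacterTextSplitter lets you set both the chunk size and the overlap:
      from langchain.text_splitter import RecursiveCharacterTextSplitter

      page_text = open("page_001.txt").read()  # hypothetical input file with one page's text
      splitter = RecursiveCharacterTextSplitter(
          chunk_size=1000,    # roughly 1000 characters per chunk (illustrative value)
          chunk_overlap=200,  # overlap so sentences cut at a boundary land in both chunks
      )
      chunks = splitter.split_text(page_text)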

  • @Tech_Inside.
    @Tech_Inside. 1 year ago +1

    Hey sir, I want to know: if I have company documents stored locally, how can we use and load that data? And one more thing: does it provide the answer exactly as mentioned in the PDF or documents, or does it perform some type of text generation on the output?

    • @ml-engineer
      @ml-engineer  1 year ago

      You are fully flexible in where you store your data; it could be local, on a Cloud Storage bucket, or on a website. All possible.
      The RAG approach takes your question and looks for relevant documents. Those documents are then passed together with your initial question to the LLM, and the LLM answers the question based on this context. Yes, it is generating text. You can tweak that output further with prompt engineering.
      BTW, Google released a new feature that requires less implementation effort:
      medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f
      It is less flexible but works for many use cases.
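      As a hedged sketch for local company documents, using one of LangChain's many document loaders (PyPDFLoader needs the pypdf package; the file path is a placeholder):
      from langchain.document_loaders import PyPDFLoader

      loader = PyPDFLoader("company_handbook.pdf")  # placeholder path to a local PDF
      documents = loader.load()  # one Document per page, ready to be chunked and embedded
      print(len(documents), documents[0].page_content[:200])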

  • @abdulsami6117
    @abdulsami6117 11 months ago +1

    Really Helpful Video!

  • @HailayKidu
    @HailayKidu 8 months ago +1

    Nice, does the data have to be in XML format?

    • @ml-engineer
      @ml-engineer  6 months ago

      The data format is flexible; it's not limited to XML.

  • @jon200y
    @jon200y 1 year ago +1

    Great video!

  • @TarsiJMaria
    @TarsiJMaria 1 year ago +1

    Hey Sascha,
    I've been playing around a lot more, but I've run into accuracy issues that I wanted to solve using MMR (max marginal relevance) search.
    It looks like the Vertex AI vector store (in LangChain) doesn't support this, at least not the Node.js version, and if I'm not mistaken it's the same in Python.
    Do you know what the best approach would be?
    As a workaround I'm overriding the default similarity search and filtering the results before passing them as context.

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi Tarsi
      I assume it's an accuracy issue during the matching of the query to the documents. In that case you might need to consider optimizing your embeddings for your specific use case.
      What kind of documents are you working with?
      Fine-tuning dedicated embeddings would help solve this. (Not yet supported by Google's embedding API.)

    • @TarsiJMaria
      @TarsiJMaria 1 year ago +1

      @@ml-engineer The problem is that I'm planning on having a service where a user can add any kind of document to, for example, a Google Drive, and then embedding its content. So it can range from docs to PDFs to presentations.
      My current workaround works, where I filter out results lower than a specific score.
      But that's something you would want to solve on the vector store side instead of on my server side.
      Is there no way to set a score threshold when matching results?

    • @ml-engineer
      @ml-engineer  1 year ago

      @@TarsiJMaria Understood. Yes, there is a threshold parameter you can set, see api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.matching_engine.MatchingEngine.html
      Like this:
      docsearch.as_retriever(
          search_type="similarity_score_threshold",
          search_kwargs={"score_threshold": 0.8},
      )

    • @TarsiJMaria
      @TarsiJMaria 1 year ago

      @@ml-engineer That's what I also thought... but it seems that the search_kwargs are only for MMR and not similarity_search.
      At least, that's the case when I dive into the Node.js LangChain code.
      But I'm pretty sure it's the same for Python.

  • @saisandeepkantareddy1890
    @saisandeepkantareddy1890 1 year ago +1

    Thanks for the video, how do we evaluate the model?

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi
      I do not train any model. We are using pre-built LLMs for this specific use case, so no model evaluation is taking place here.
      However, you can evaluate how accurate the retrieved documents are. As this is the main purpose of this architecture, we need to ensure we are actually getting the right documents. For that you can build your own benchmark dataset with questions and the expected retrieved documents.
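      A minimal sketch of such a benchmark in plain Python (the questions, document IDs, and the retrieve function are made up for illustration):
      benchmark = [
          {"question": "What is the refund policy?", "expected": {"doc_017"}},
          {"question": "How do I reset my password?", "expected": {"doc_042"}},
      ]

      def recall_at_k(retrieve, k=5):
          # retrieve(question, k) should return the IDs of the top-k retrieved documents
          hits = 0
          for case in benchmark:
              retrieved = set(retrieve(case["question"], k))
              if case["expected"] & retrieved:
                  hits += 1
          return hits / len(benchmark)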

  • @abheeshth
    @abheeshth 1 year ago +1

    Hi, can you please tell me, do we have to pay for the Vertex AI API?

    • @ml-engineer
      @ml-engineer  1 year ago

      Yes
      there is nothing for free on this planet except my videos =D
      cloud.google.com/vertex-ai/pricing

  • @AGUSTINAGHELFI
    @AGUSTINAGHELFI 1 year ago +1

    Hello! Thank you so much for the video. I have a problem at the last code cell:
    _InactiveRpcError:

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi
      It looks like you are using a public Matching Engine endpoint.
      This endpoint requires a different SDK usage.
      I used a private endpoint in my example:
      cloud.google.com/vertex-ai/docs/matching-engine/deploy-index-vpc
      Google's LangChain implementation does not support public Matching Engine endpoints yet.

  • @louis-philippekyer4335
    @louis-philippekyer4335 1 year ago +1

    What if I want to batch-process different sites? What would be your approach?

    • @ml-engineer
      @ml-engineer  1 year ago +1

      You can use LangChain and its different document loaders. I have an example of indexing a website here: medium.com/google-cloud/generative-ai-learn-the-langchain-basics-by-building-a-berlin-travel-guide-5cc0a2ce4096
      This is just one of many document loaders that are available.
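      A hedged sketch for the batch case, assuming LangChain's WebBaseLoader (the URLs are placeholders):
      from langchain.document_loaders import WebBaseLoader

      urls = ["https://example.com/page-1", "https://example.com/page-2"]  # placeholder sites
      loader = WebBaseLoader(urls)
      documents = loader.load()  # one Document per URL, ready for chunking and embedding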

    • @louis-philippekyer4335
      @louis-philippekyer4335 1 year ago +1

      Thanks @@ml-engineer

  • @sridattamarati
    @sridattamarati 1 year ago +2

    Is PaLM 2 not open source?

  • @wongkenny240
    @wongkenny240 1 year ago +1

    Can this be deployed to a Vertex AI endpoint as a custom container?

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi Wong
      I am using the Google Vertex AI PaLM API. There is no custom model involved, so there is nothing to deploy to a Vertex AI endpoint.
      But if you are using an open-source LLM you can deploy it to Vertex AI Endpoints.

  • @vedasrisailingamolla9425
    @vedasrisailingamolla9425 1 year ago +1

    embeddings = VertexAIEmbeddings(project=PROJECT_ID)
    I'm getting the error below: NotFound: 404 Publisher Model `publishers/google/models/textembedding-gecko@001` is not found.
    I have gone through the documentation; they are using the textembedding-gecko@001 model only.

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi Vedasri
      I tried and cannot reproduce it. Which version of the SDK are you using? I just did a test run with 1.26.0 and 1.26.1.
      Can you give this notebook a try and see if you can call the model there?
      colab.research.google.com/drive/1eOe0iR6qZ4uOX-4VRIgtcgTaecR00z-Y#scrollTo=nJ799_PMs6Z8
      As you say, according to the documentation and to my last try, textembedding-gecko@001 is indeed correct.
      cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings
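      A minimal sketch to isolate the problem outside of LangChain, assuming a google-cloud-aiplatform SDK around 1.26 (depending on the version the import may live under vertexai.preview.language_models instead; project and location are placeholders):
      import vertexai
      from vertexai.language_models import TextEmbeddingModel

      vertexai.init(project="your-project-id", location="us-central1")  # placeholders
      model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
      embeddings = model.get_embeddings(["Hello world"])
      print(len(embeddings[0].values))  # gecko returns a 768-dimensional vector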

  • @vasilmutafchiev8155
    @vasilmutafchiev8155 1 year ago +1

    Hi! Great video!
    Is there a way that we can limit the model to respond only to the data that we gave it? Thank you!

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Yes, in the notebook I shared I enforce that with the prompt:
      prompt=f"""
      Follow exactly those 3 steps:
      1. Read the context below and aggregate this data
      Context : {matching_engine_response}
      2. Answer the question using only this context
      3. Show the source for your answers
      User Question: {question}
      If you don't have any context and are unsure of the answer, reply that you don't know about this topic.
      """

  • @TarsiJMaria
    @TarsiJMaria 1 year ago +1

    Hey,
    Thanks for the great content! I had 2 questions:
    With this setup, what needs to happen if you want to add new data to the vector store?
    First we chunk the new document, create new embeddings, and upload them to the GCS bucket. Is that all, or does something need to happen with the Matching Engine / index?
    Other question: do you know if the LangChain JavaScript library has any limitations in this use case?

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi Tarsi
      Exactly as you describe it: to add new data to the vector store (in this case the Matching Engine) you chunk the document, create an embedding for each chunk, upload the data to a Google Cloud Storage bucket, and send the embeddings to the Matching Engine. The Matching Engine supports real-time indexing (streaming inserts), which is an amazing feature. I wrote about it here: medium.com/google-cloud/real-time-deep-learning-vector-similarity-search-8d791821f3ad
      The documentation for the JavaScript version of LangChain seems to be more outdated. I would always recommend the Python version, but in the end you can probably achieve the goal with both of them.
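      A hedged sketch of such a streaming insert with the Vertex AI SDK (assuming an index created for streaming updates; the project, index name, datapoint ID, and embedding are placeholders):
      from google.cloud import aiplatform
      from google.cloud.aiplatform_v1.types import IndexDatapoint

      aiplatform.init(project="your-project", location="us-central1")  # placeholders
      index = aiplatform.MatchingEngineIndex(
          index_name="projects/123/locations/us-central1/indexes/456"  # placeholder
      )
      datapoint = IndexDatapoint(
          datapoint_id="001",          # should match the chunk's file name in the bucket
          feature_vector=[0.1] * 768,  # placeholder embedding of the new chunk
      )
      index.upsert_datapoints(datapoints=[datapoint])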

    • @TarsiJMaria
      @TarsiJMaria 1 year ago +1

      Thanks for the reply!
      After a lot of struggling with the terrible docs, I succeeded in getting the Matching Engine set up... and now have a working QA chain in JS.
      The next step is updating the vector store (add/delete). I already have that from a Python example, but in my case JavaScript is preferable.
      I'm quite surprised by how bad the LangChain JS docs are... because in the end it's quite easy with very little code.

    • @ml-engineer
      @ml-engineer  1 year ago

      @@TarsiJMaria Yes, they are. If you already have it in Python, you can at least save yourself a bit of time.

  • @manikmittal
    @manikmittal 1 year ago +1

    Is it going to share the enterprise's internal data externally?

    • @ml-engineer
      @ml-engineer  1 year ago +1

      The data is not shared; it all stays yours.

    • @manikmittal
      @manikmittal 1 year ago

      @@ml-engineer Thanks

  • @shivayavashilaxmipriyas9401
    @shivayavashilaxmipriyas9401 1 year ago +1

    Great video!! Does the LLM get trained here? This is my main doubt. Or is it just used as an engine for answering based on the embeddings and similarity?

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi
      No fine-tuning or training happens with this approach. I have written more about that in the article linked in the video description.
      RAG architectures usually work based on embeddings of the query and embeddings of the documents to retrieve similar documents.

  • @itsdavidalonso
    @itsdavidalonso 1 year ago +1

    Hey Sasha, thanks a lot for making this! It would be great to learn if there's an easy way to embed documents in Firebase. Would be extremely useful to have a workflow where embeddings are generated for each document when it's changed (e.g. a user updates content on an app) so that the query is always matched against realtime data sources.
    I was also wondering if there's a way to do a semantic search query combined with regular filtering on metadata (e.g. product prices, size, etc). Would love to see a follow up tutorial on this in the future :)

    • @itsdavidalonso
      @itsdavidalonso 1 year ago

      Any thoughts Sasha?

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Hi David
      I would recommend the Google Matching Engine for that use case. It supports real-time indexing and also has filtering built in.
      I have written an article on real-time indexing: medium.com/google-cloud/real-time-deep-learning-vector-similarity-search-8d791821f3ad

    • @ml-engineer
      @ml-engineer  1 year ago +1

      I will write a follow-up article on that 😍

    • @itsdavidalonso
      @itsdavidalonso 1 year ago +1

      @@ml-engineer I looked through the medium article, thanks for sharing! I now have an idea of how to accomplish streaming embeddings into the index, but could not find anything about doing metadata filtering, is this what you want to follow up on?

    • @ml-engineer
      @ml-engineer  1 year ago

      The official documentation on how to filter with the Matching Engine:
      cloud.google.com/vertex-ai/docs/vector-search/filtering
      I might need to write an article about it =)

  • @AnkitKumarSingh-p4k
    @AnkitKumarSingh-p4k 1 year ago +1

    Just noticed this channel. Great content, with code walkthroughs. Appreciate your effort!!
    I've got a question @ml-engineer:
    Is it possible to question-answer separate documents with only one index for the bucket? When retrieving or querying from vector search, I want to specify which document/datapoint_id I want to query from.
    Currently, when I add data points for multiple documents to the same index, the retrieval response for a query is matched globally against all the documents instead of just the required one.
    P.S.: I am using the MatchingEngine utility maintained by Google.

    • @ml-engineer
      @ml-engineer  1 year ago +1

      Yes, you can use the filtering that the Matching Engine offers: cloud.google.com/vertex-ai/docs/vector-search/filtering. With that, the vector matching only operates on the documents that are part of the filtering results.
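      A hedged sketch of such a filtered query (the restrict name "document_id" and its values are assumptions and must match the restricts attached when the datapoints were indexed):
      from google.cloud import aiplatform
      from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import Namespace

      endpoint = aiplatform.MatchingEngineIndexEndpoint(
          index_endpoint_name="projects/123/locations/us-central1/indexEndpoints/456"  # placeholder
      )
      response = endpoint.find_neighbors(
          deployed_index_id="my_deployed_index",  # placeholder
          queries=[[0.1] * 768],                  # placeholder query embedding
          num_neighbors=5,
          filter=[Namespace(name="document_id", allow_tokens=["contract_2023"])],
      )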

    • @AnkitKumarSingh-p4k
      @AnkitKumarSingh-p4k 1 year ago

      @@ml-engineer Thanks for the response.
      I believe we are supposed to add allow and deny lists while adding datapoint_ids to the index. And when we retrieve the nearest neighbours, the "restricts" metadata is also returned. Then we can either filter directly OR pass the document chunk to the LLM with the restricts metadata (the former is done in Google's MatchingEngine utility -> similarity_search()).
      But there is still a chance of getting datapoint_ids of other documents saved in the index, instead of the one I want to query from.
      I was looking for something where the vector search engine automatically filters the datapoints based on the input query, or the GCP bucket where my chunks and datapoints are stored.

    • @ml-engineer
      @ml-engineer  1 year ago

      Exactly, you add those deny and allow lists when adding the documents to the index. After that you can filter at query time.
      Can you describe this in more detail?
      - "But there is still a chance of getting datapoint_ids of other documents saved in the index, instead of the one I want to query from."
      Do I understand correctly that you are asking how you can automatically get the documents from Cloud Storage, based on the documents retrieved from the Matching Engine, back into the LLM's context?

  • @carthagely122
    @carthagely122 1 year ago +1

    Thank you for your work

    • @ml-engineer
      @ml-engineer  1 year ago

      Thank you for watching it.

    • @ml-engineer
      @ml-engineer  1 year ago

      This could also be interesting for you: Google just released a more or less managed RAG / grounding solution:
      medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f

  • @campbellslund
    @campbellslund 1 year ago

    Hi! Thanks for making this walkthrough, it was super helpful as a beginner. I was able to follow all the steps you detailed, however, when I try running the final product it produces the same context every time - regardless of the question I prompt. Do you have any idea why that might be? Thanks in advance!

    • @ml-engineer
      @ml-engineer  1 year ago

      Hi
      Have a look at the embeddings to see if they are actually different. If not, there is an issue during embedding creation.

  • @bivasbisht1244
    @bivasbisht1244 1 year ago +1

    Thank you for the explanation, really liked it!!
    I was wondering: if we use DPR (Dense Passage Retrieval) on our own data and want to evaluate its performance with precision, recall, and F1 score, and we have a small reference dataset that can serve as ground truth, can we do that? I am also confused about the fact that DPR is trained only on wiki data as far as I know, so is it meaningful to measure the efficiency of DPR retrieval when I follow this RAG approach?

    • @ml-engineer
      @ml-engineer  1 year ago

      RAG systems are usually not evaluated using F1 or similar scores that we use with more traditional machine learning. I can recommend github.com/explodinggradients/ragas for that.
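      A hedged sketch of what a RAGAS run can look like (the library's API has changed between versions, so treat the column names and metric imports as assumptions):
      from datasets import Dataset
      from ragas import evaluate
      from ragas.metrics import faithfulness, answer_relevancy, context_precision

      eval_data = Dataset.from_dict({
          "question": ["What is the refund policy?"],
          "answer": ["Refunds are possible within 30 days."],  # generated answer
          "contexts": [["Our refund policy allows returns within 30 days."]],  # retrieved chunks
          "ground_truth": ["Customers can get a refund within 30 days."],  # reference answer
      })
      result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
      print(result)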

    • @ml-engineer
      @ml-engineer  1 year ago

      This could also be interesting for you: Google just released a more or less managed RAG / grounding solution:
      medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f

  • @VadimTsviling
    @VadimTsviling 1 year ago +1

    Hey, thanks for sharing this great video. My question is, what would happen if the answer to my query is in multiple documents, like more general questions related to all the documents?

    • @ml-engineer
      @ml-engineer  1 year ago +2

      If it is in multiple documents, the retrieval process will return multiple documents.
      If it is in all documents you will run into issues for multiple reasons:
      1. The matching index / vector database is built to return the top X matching documents. You can increase this value, but if the answer is in all documents there is no need to find matching documents anymore, because it's in all documents anyway.
      2. The matching documents are used as context when running the prompt. The context size depends on the model; for standard models like Google's PaLM or OpenAI's GPT it is around 8,000 tokens. There are also larger versions with 32,000 up to 100,000 tokens, but those come at a higher cost. In the end you need to evaluate whether the number of documents fits your context.

  • @JoonasMeriläinen
    @JoonasMeriläinen 1 year ago +1

    Hello, great video! I have been trying to implement exactly the same thing you have done and this video just appeared in my Google search results.
    But I don't understand the part gcs_bucket_name=DOCS_BUCKET, where the bucket is defined to be DOCS_BUCKET='doit-llm' and the txt files (which have the actual text chunks that are provided to the prompt as context) are in the bucket under the "documents" folder in some *.txt files, so something like gs://doit-llm/documents/001.txt. The embeddings would similarly be in gs://doit-llm/embeddings/embeddings.json. How does the vector database understand that embeddings.json contains the vectors and the documents folder has text chunks in .txt files? Does it just blindly scan the given bucket for any text and assume that the ID value in embeddings.json has to match the filename of the .txt files? I cannot access the documentation of the MatchingEngineIndex.create_tree_ah_index function, which would probably help me understand how it works.

    • @ml-engineer
      @ml-engineer  1 year ago +1

      You are absolutely right, the ID needs to match the documents. I am using Google's LangChain integration, which takes care of this mapping during retrieval. It matches the IDs returned from the document index to the documents stored in the Google Cloud Storage bucket.
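      As an illustration of that mapping under the layout from the question (gs://doit-llm/documents/<id>.txt for the chunks, gs://doit-llm/embeddings/embeddings.json for the vectors); the embed() helper is hypothetical:
      import json

      def embed(text):
          return [0.0] * 768  # placeholder; call your embedding model here

      chunks = {"001": "first chunk text", "002": "second chunk text"}  # id -> chunk text
      with open("embeddings.json", "w") as f:
          for chunk_id, text in chunks.items():
              # Each record's "id" equals the chunk's file name (documents/<id>.txt),
              # which is how retrieved IDs are mapped back to the text at query time.
              f.write(json.dumps({"id": chunk_id, "embedding": embed(text)}) + "\n")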