I'm a junior engineer intern for a startup called Radical AI and I am doing exactly this process right now lol. You do a better job of explaining everything than my seniors.
Appreciate your feedback, Christopher.
Hi, did you run this code? I can't access the XML file, it is not found. How did you get it to run?
Excellent walk-through, I'll have to give it a try. Thank you very much.
You're welcome.
Great video. What modifications should be done to run queries on a public index (not under a VPC)?
Hi
The deployment and querying are different:
How to deploy a public endpoint
cloud.google.com/vertex-ai/docs/matching-engine/deploy-index-public
How to query a public endpoint
cloud.google.com/vertex-ai/docs/matching-engine/query-index-public-endpoint
At 3:38 you say 'You need a Google project', but I'm not sure what that is exactly. Do I need a GCP account and then create a VM for that?
Yes, the LLM and the vector database I am using are Google's services. You need a GCP account, but there is no need for a VM; it's all serverless.
Here is a great introduction into GCP ruclips.net/video/GKEk1FzAN1A/видео.html
7:38 for the embedding, is the data sent off the computer? It seems like it if you are using retries. If so, is there any way to completely contain this process so that no data leaves the machine? This would be relevant to at least the embedding, vector DB, and LLM prompts. Thank you
Hi, everything is running in the cloud; this includes the embedding model, the text model, and the vector database. So yes, the data is leaving your computer. If you want to run this locally on your machine, you need to use open-source models and frameworks and host the model and the database locally. For vector similarity you could use Spotify's Annoy or Facebook's Faiss, and a Falcon model as the LLM.
@@ml-engineer Thank you! Great video
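To illustrate, a minimal local-only sketch, assuming faiss-cpu and numpy are installed and that you already produced embeddings with a locally hosted model (for example a sentence-transformers model); the vectors below are random stand-ins, nothing leaves the machine:
import numpy as np
import faiss

dim = 384  # embedding size of your local model (assumption)
doc_vectors = np.random.rand(100, dim).astype("float32")  # stand-in for real chunk embeddings
index = faiss.IndexFlatL2(dim)  # exact L2 search, fine for small corpora
index.add(doc_vectors)

query_vector = np.random.rand(1, dim).astype("float32")  # stand-in for the embedded question
distances, ids = index.search(query_vector, 3)  # top-3 nearest chunks
print(ids[0], distances[0])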
I have content of nearly 100 pages. Each page has nearly 4,000 characters. What chunk size should I choose, and what retrieval method should I use for optimised answers?
The chunk size depends on the model you are planning to use. But generally I highly recommend having text overlap between your chunks. Also, finding the right chunk size can be treated like a hyperparameter: too large chunks might add too much noise to the context, too narrow chunks might miss information.
Hey sir, I want to know: if I have company documents stored locally, how can we use and load that data? And one more thing, does it provide answers exactly as mentioned in the PDFs or documents, or does it perform some type of text generation on the output?
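As a minimal sketch with LangChain's text splitter (the concrete numbers and the input file are assumptions to tune, not recommendations from the video):
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_document_text = open("my_100_pages.txt").read()  # placeholder: your ~100 pages as one string
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # treat this like a hyperparameter
    chunk_overlap=200,  # overlap so information is not cut off at chunk borders
)
chunks = splitter.split_text(long_document_text)
print(len(chunks), chunks[0][:80])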
You are fully flexible in where you store your data; it could be local, on a Cloud Storage bucket, or on a website, all are possible.
The RAG approach takes your question and looks for relevant documents. Those documents are then passed, together with your initial question, to the LLM, and the LLM answers the question based on this context. So yes, it is generating text. You can tweak that output further with prompt engineering.
BTW, Google released a new feature that requires less implementation effort:
medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f
It is less flexible but works for many use cases.
Really Helpful Video!
Thank you
Nice, does the data have to be in XML format?
The data format is flexible; it's not limited to XML.
Great video!
Thank you
Hey Sascha,
I've been playing around a lot more, but I've run into accuracy issues that I wanted to solve using MMR (max marginal relevance) search.
It looks like the Vertex AI vector store (in LangChain) doesn't support this, at least not the NodeJS version, but if I'm not mistaken it's the same in Python.
Do you know what the best approach would be?
As a workaround I'm overriding the default similarity search and filtering the results before passing them as context.
Hi Tarsi
I assume it's an accuracy issue during the matching of the query to documents. In that case you might need to consider optimizing your embeddings for your specific use case.
What kind of documents are you working on?
Fine-tuning dedicated embeddings will help to solve this. (Not yet supported by Google's embedding API.)
@@ml-engineer The problem is that I'm planning on having a service where a user can add any kind of document to, for example, a Google Drive, and then embed its content. So it can range from docs to PDFs to presentations.
My current workaround works, where I filter out results lower than a specific score.
But that's something you would want to solve on the vector store side instead of on my server side.
Is there no way to set a score threshold when matching results?
@@TarsiJMaria Understood. Yes, there is a threshold parameter you can set, see api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.matching_engine.MatchingEngine.html
like this:
docsearch.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={'score_threshold': 0.8}
)
@@ml-engineer That's what I also thought... but it seems that the search_kwargs are only for MMR and not similarity_search.
At least, that's the case when I dive into the NodeJS LangChain code.
But I'm pretty sure it's the same for Python.
Thanks for the video, how do we evaluate the model?
Hi
I do not train any model. We are using pre-built LLMs for this specific use case, so no model evaluation is taking place here.
Though you can evaluate how accurate the retrieved documents are. As this is the main purpose of this architecture, we need to ensure we are actually getting the right documents. For that you can build your own benchmark dataset with questions and the expected retrieved documents.
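A rough sketch of such a benchmark, assuming a hypothetical retrieve_ids() wrapper around your own retriever; the questions and document IDs are placeholders:
benchmark = [
    {"question": "How do I rotate my API key?", "expected": {"doc_017"}},
    {"question": "What is the refund policy?", "expected": {"doc_042"}},
]

def recall_at_k(retrieved_ids, expected_ids):
    # fraction of expected documents that actually came back
    return len(set(retrieved_ids) & expected_ids) / len(expected_ids)

scores = []
for case in benchmark:
    retrieved = retrieve_ids(case["question"], k=5)  # retrieve_ids(): hypothetical helper around the vector store
    scores.append(recall_at_k(retrieved, case["expected"]))
print("mean recall@5:", sum(scores) / len(scores))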
Hi, can you please tell me whether we have to pay for the Vertex AI API?
Yes
there is nothing for free on this planet except my videos =D
cloud.google.com/vertex-ai/pricing
Hello! Thank you so much for the video. I have a problem at the last code cell:
_InactiveRpcError:
Hi
looks like you are using a public Matching Engine endpoint.
This endpoint requires a different SDK usage.
I used a private endpoint in my example
cloud.google.com/vertex-ai/docs/matching-engine/deploy-index-vpc
Google's LangChain implementation does not support public Matching Engine endpoints yet.
What if I want to batch process different sites? What would be your approach?
You can use LangChain and its different document loaders. I have an example of indexing a website here: medium.com/google-cloud/generative-ai-learn-the-langchain-basics-by-building-a-berlin-travel-guide-5cc0a2ce4096
This is just one of many document loaders that are available.
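For batch processing several sites, a minimal sketch with one of those loaders could look like this (WebBaseLoader is just one option; the URLs are placeholders):
from langchain.document_loaders import WebBaseLoader

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]
docs = []
for url in urls:
    docs.extend(WebBaseLoader(url).load())  # one combined list of Documents across all sites
print(len(docs), docs[0].metadata.get("source"))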
Thanks @@ml-engineer
Is PaLM 2 not open source?
Hi
No, it's not open source.
Can this be deployed to a Vertex AI endpoint as a custom container?
Hi Wong
I am using Google Vertex AI PaLM API. There is no custom model involved. Nothing to deploy to a Vertex AI Endpoint.
But if you are using an open-source LLM, you can deploy it to Vertex AI Endpoints.
embeddings = VertexAIEmbeddings(project=PROJECT_ID)
I am getting the below error: NotFound: 404 Publisher Model `publishers/google/models/textembedding-gecko@001` is not found.
I have gone through the documentation, and it uses the textembedding-gecko@001 model as well.
Hi Vedasri
I tried and cannot reproduce, which version of the SDK are you using? I just did a test run with 1.26.0 and 1.26.1.
Can you give this notebook a try and see if you can call the model there?
colab.research.google.com/drive/1eOe0iR6qZ4uOX-4VRIgtcgTaecR00z-Y#scrollTo=nJ799_PMs6Z8
As you say, according to the documentation and to my last try, textembedding-gecko@001 is indeed correct.
cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings
Hi! Great video!
Is there a way that we can limit the model to respond only to the data that we gave it? Thank you!
Yes in the notebook I shared I enforce that with the prompt.
prompt=f"""
Follow exactly those 3 steps:
1. Read the context below and aggregate this data
Context : {matching_engine_response}
2. Answer the question using only this context
3. Show the source for your answers
User Question: {question}
If you don't have any context and are unsure of the answer, reply that you don't know about this topic.
"""
Hey,
Thanks for the great content! I had 2 questions:
With this setup, what needs to happen if you want to add new data to the vector store?
First we chunk the new document, create new embeddings, and upload them to the GCS bucket; is that all, or does something need to happen with the Matching Engine / index?
Other question, do you know if the LangChain Javascript library has any limitations in this use case?
Hi Tarsi
Exactly as you describe it: to add new data to the vector store (in this case the Matching Engine) you chunk the document, create an embedding for each chunk, upload the data to a Google Cloud Storage bucket, and send the embedding to the Matching Engine. The Matching Engine supports real-time indexing (streaming inserts), which is an amazing feature. I wrote about it here: medium.com/google-cloud/real-time-deep-learning-vector-similarity-search-8d791821f3ad
The documentation for the JavaScript version of LangChain seems to be more outdated. I would always recommend the Python version. But in the end you can probably achieve the goal with either of them.
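A hedged sketch of that flow via LangChain's Matching Engine integration; the region and the INDEX_ID, ENDPOINT_ID, and new_chunks names are placeholders for your own setup:
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import MatchingEngine

embeddings = VertexAIEmbeddings(project=PROJECT_ID)
store = MatchingEngine.from_components(
    project_id=PROJECT_ID,
    region="europe-west1",  # assumption, use your region
    gcs_bucket_name=DOCS_BUCKET,  # where the raw chunks live
    embedding=embeddings,
    index_id=INDEX_ID,  # your Matching Engine index
    endpoint_id=ENDPOINT_ID,  # your index endpoint
)
# Each chunk gets embedded, written to the bucket, and streamed into the index.
store.add_texts(new_chunks)  # new_chunks: list[str] from your text splitter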
Thanks for the reply!
After a lot of struggling with the terrible docs, I succeeded in getting the Matching Engine set up... and now have a working QA chain in JS.
The next step is updating the vector store (add/delete). I already have that from a Python example, but in my case JavaScript is preferable.
I'm quite surprised by how bad the LangChain JS docs are... because in the end it's quite easy with very little code.
@@TarsiJMaria Yes, they are. If you already have it in Python you can at least save yourself a bit of time.
Is it going to share the enterprise's internal data externally?
The data is not shared; it stays all yours.
@@ml-engineer Thanks
Great video!! Does the LLM get trained here? This is a major doubt of mine. Or is it just used as an engine for answering based on the embeddings and similarity?
Hi
No fine tuning / training with this approach. I have written more about that in the article linked in the video description.
RAG architectures usually work by comparing embeddings of the query with embeddings of the documents to retrieve similar documents.
Hey Sasha, thanks a lot for making this! It would be great to learn if there's an easy way to embed documents in Firebase. Would be extremely useful to have a workflow where embeddings are generated for each document when it's changed (e.g. a user updates content on an app) so that the query is always matched against realtime data sources.
I was also wondering if there's a way to do a semantic search query combined with regular filtering on metadata (e.g. product prices, size, etc). Would love to see a follow up tutorial on this in the future :)
Any thoughts Sasha?
Hi David
I would recommend Google Matching Engine for that use case. It supports real time indexing and also has filtering built in.
I have written an article on real time indexing medium.com/google-cloud/real-time-deep-learning-vector-similarity-search-8d791821f3ad
I will write a follow up article on that 😍
@@ml-engineer I looked through the medium article, thanks for sharing! I now have an idea of how to accomplish streaming embeddings into the index, but could not find anything about doing metadata filtering, is this what you want to follow up on?
The official documentation on how to filter with Matching Engine
cloud.google.com/vertex-ai/docs/vector-search/filtering
Might need to write an article about it =)
Just noticed this channel. Great content, with code walkthroughs. Appreciate your effort!!
Have got a question @ml-engineer :
Is it possible to Question-Answer separate documents with only one index for the bucket? While retrieval or questioning from vector search, I want to specify which document/datapoint_id I want to query from.
Currently, when I add data points for multiple documents to the same index, the retrieval response for a query is matched globally across all the documents, instead of only the required one.
P.s. : I am using MatchingEngine utility maintained by Google.
Yes you can use the filtering that matching engine is offering. cloud.google.com/vertex-ai/docs/vector-search/filtering with that the vector matching only operates on the documents that are part of the filtering results.
@@ml-engineer Thanks for the response.
I believe we are supposed to add allow and deny lists while adding datapoint_ids to the index. And when we retrieve the nearest neighbours, the "restricts" metadata is also returned. Then either we can filter directly OR pass the document chunk to the LLM with the restricts metadata (the former is done in Google's MatchingEngine utility -> similarity_search()).
But there is still a chance of getting datapoint_ids of other documents saved in the index, instead of the one I want to query from.
I was looking for something where the vector search engine automatically filters the datapoints based on the input query, or the GCS bucket where my chunks and datapoints are stored.
Exactly, you add those allow and deny lists when adding the documents to the index. After that you can filter at query runtime.
Can you describe this in more detail?
- But there is still a chance of getting datapoint_ids of other documents saved in the index, instead of the one I want to query from.
Do I understand correctly that you are asking how you can automatically get the documents from Cloud Storage, based on the results retrieved from the Matching Engine, back into the LLM's context?
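For the runtime filtering part, a hedged sketch with the Vertex AI SDK could look like this; it assumes a recent google-cloud-aiplatform version and a public endpoint, and the "doc_id" namespace, the token, and the resource names are placeholders (see the filtering docs linked above, parameter names may differ for private endpoints):
from google.cloud import aiplatform
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import Namespace

endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name=ENDPOINT_RESOURCE_NAME)
response = endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],  # the embedded user question
    num_neighbors=5,
    filter=[Namespace("doc_id", ["contract_2023"], [])],  # allow-list: only match this document's datapoints
)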
Thank you for your work.
Thank you for watching it.
This could also be interesting for you. Google just released a more or less managed RAG / grounding solution:
medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f
Hi! Thanks for making this walkthrough, it was super helpful as a beginner. I was able to follow all the steps you detailed, however, when I try running the final product it produces the same context every time - regardless of the question I prompt. Do you have any idea why that might be? Thanks in advance!
Hi
Have a look at the embeddings to see if they are actually different. If not, there is an issue during embedding creation.
Thank you for the explanation , really liked it !!
I was wondering: if we use DPR (Dense Passage Retrieval) on our own data and want to evaluate its performance with precision, recall, and F1 score, and we have a small reference dataset that can serve as ground truth, can we do that? I am confused because, as far as I know, DPR is trained only on wiki data, so would it even make sense to measure the efficiency of the DPR retrieval when I follow this RAG approach?
RAG pipelines are usually not evaluated using F1 or similar scores that we use in more traditional machine learning. I can recommend github.com/explodinggradients/ragas for that.
This could also be interesting for you. Google just released a more or less managed RAG / grounding solution:
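A hedged sketch of what a ragas evaluation could look like; the column names and metrics follow the ragas docs, but the exact API may differ between versions, and the sample row is purely illustrative:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What is the notice period?"],
    "answer": ["The notice period is 30 days."],  # answer produced by your RAG chain
    "contexts": [["...retrieved chunk about termination..."]],  # chunks the retriever returned
    "ground_truth": ["30 days"],  # reference answer
})
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)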
medium.com/google-cloud/vertex-ai-grounding-large-language-models-8335f838990f
Hey, thanks for sharing this great video. My question is, what would happen if the answer to my query is in multiple documents, like more general questions related to all the documents?
If it is in multiple documents the retrieval process will return multiple documents.
If it is in all documents you will run into issues for multiple reasons:
1. The matching index / vector database is built to return the top X matching documents. You can increase this value, but if the answer is in all documents there is no need to find matching documents anymore, because it's in all of them anyway.
2. The matching documents are used as context when running our prompt. The context size depends on the model; for standard models like Google's PaLM or OpenAI's GPT it is around 8,000 tokens. There are also larger versions with 32,000 and up to 100,000 tokens, but those come at a higher cost. In the end you need to evaluate whether the number of documents fits your context.
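As a small sketch of keeping the context within those limits (the value of k and the word-count check are assumptions, not exact token counting):
retriever = docsearch.as_retriever(search_kwargs={"k": 4})  # cap at the top-4 matching chunks
docs = retriever.get_relevant_documents(question)
context = "\n\n".join(d.page_content for d in docs)
print(len(context.split()), "words of context")  # rough size check before building the prompt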
Hello, great video! I have been trying to implement exactly the same thing you have done and this video just appeared in my Google search results.
But I don't understand the part gcs_bucket_name=DOCS_BUCKET, where the bucket is defined as DOCS_BUCKET='doit-llm' and the txt files (which have the actual text chunks that are provided to the prompt as context) are in the bucket under the "documents" folder in some *.txt files, so something like gs://doit-llm/documents/001.txt. The embeddings would similarly be in gs://doit-llm/embeddings/embeddings.json. How does the vector database understand that embeddings.json contains the vectors and the documents folder has text chunks in .txt files? Does it just blindly scan the given bucket for any text? And assume that the ID value in embeddings.json has to match the filename of the .txt files? I cannot access the documentation of the MatchingEngineIndex.create_tree_ah_index function, which would probably help me understand how it works.
You are absolutely right, the ID needs to match the documents. I am using Google's LangChain integration, which takes care of this mapping during retrieval. So it matches the IDs returned from the document index to the documents stored in the Google Cloud Storage bucket.
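A small sketch of that convention; the bucket and folder names follow the question above, and the ID and vector values are placeholders:
import json

# One JSON object per line in gs://doit-llm/embeddings/embeddings.json
record = {"id": "001", "embedding": [0.12, -0.03, 0.98]}
# The matching chunk text lives at gs://doit-llm/documents/001.txt,
# so the LangChain integration can map a returned ID back to its text.
with open("embeddings.json", "a") as f:
    f.write(json.dumps(record) + "\n")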