This is absolutely brilliant. Thank you very much for this very informative video.
So clever! Thank you so much for sharing.
This is great video and thanks for providing such fantastic content, can we use open source llm like llama2 for Question-Answering over Tabular/CSV/ considering above approach?
Would love to see a similar example with SQLite instead of using dfs, and with batching/async optimizations.
Great video! I really love your way of presenting. I am looking forward to the next one :)
I have a few questions though:
1. Would it be possible to not embed the user query twice, to optimize time to response? It seems like the embedding is done twice, once for looking into the table index and once for the row index.
2. What would be a good strategy if the initial user query requires two different SQL queries to be answered? Or if the SQL is a bit more complex?
3. What if the datasets have millions of rows? It looks like you are building the row index from all rows, which might take both a lot of time and memory. And it might slow down the time to answer in the long run, no?
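On question 1, one option is to compute the query embedding once and reuse it for both the table retriever and the row retriever. A minimal sketch, with a hypothetical `embed` function standing in for the real embedding model call:

```python
from functools import lru_cache

# Hypothetical embed function; in a real pipeline this would be something
# like an OpenAIEmbedding call, which is what you want to avoid repeating.
def embed(text: str) -> list[float]:
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

@lru_cache(maxsize=1024)
def embed_cached(text: str) -> list[float]:
    # Memoize by query text so the second retriever reuses the vector.
    return embed(text)

query = "Which artist has the most albums?"
table_vec = embed_cached(query)  # computes the embedding
row_vec = embed_cached(query)    # served from the cache, no second call
assert table_vec == row_vec
```

Whether you can actually wire this in depends on whether both retrievers let you pass a precomputed query embedding; otherwise the cache at the embedding-model level achieves the same effect.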
I want to know this as well.
This is great. Thanks for sharing, Jerry. Quick question: as I understand it, for the df you are inputting it into the API, and for SQL you are giving some rows to simulate the code, right? Just wondering if there is an option to “mask” or “use dummy” data for the tabular data going into the cloud? Tables and DBs often contain company-sensitive data, and you can't take any of it into the cloud. Obviously we can “scramble” it ourselves first, but if there were an option to dummy our data before sending it to OpenAI, that would make enterprise app design easier. I don't think GPT needs real data to simulate the code. I could be wrong, though.
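For context, one way to “scramble it ourselves first” is to build a schema-preserving dummy frame locally and put only that into the prompt. A rough sketch; the `dummify` helper is illustrative, not a llama-index feature:

```python
import pandas as pd

def dummify(df: pd.DataFrame, n_rows: int = 3) -> pd.DataFrame:
    """Return a frame with the same columns and compatible dtypes,
    but synthetic values, so nothing sensitive leaves your network."""
    fake = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fake[col] = range(n_rows)
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            fake[col] = pd.date_range("2000-01-01", periods=n_rows)
        else:
            fake[col] = [f"{col}_{i}" for i in range(n_rows)]
    return pd.DataFrame(fake)

real = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [90000, 85000]})
masked = dummify(real)
print(masked)
# The masked frame (not `real`) is what you would pass into the prompt.
```

The LLM only needs the shape of the data to write correct code, so this usually costs nothing in answer quality.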
Thank you for the clarity! But I have a question: should the data be preprocessed?
For the last part, if you already retrieved example rows from the vector DB, why look in the SQL database again? Or is it that the vector DB may not always return the correct row?
Thank you for a very insightful tutorial!
I have a scenario where some tables have multiple columns related to date or datetime. When a date-related question is asked, the correct date column is not picked up. Is there a way to steer the column-selection step?
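One common workaround is to put per-column descriptions into the table context so the LLM can disambiguate similar date columns. A sketch of building such a context string; the column names and hints here are made up, and in llama-index this text would typically go into the table's `context_str`:

```python
# Hypothetical column hints for an imaginary "orders" table.
column_hints = {
    "created_at": "timestamp when the record was first inserted",
    "updated_at": "timestamp of the last modification",
    "order_date": "business date of the order; use this column for "
                  "questions about when something was ordered",
}

def schema_with_hints(table: str, columns: dict[str, str]) -> str:
    """Render a schema description that spells out what each column means."""
    lines = [f"Table '{table}' columns:"]
    for name, hint in columns.items():
        lines.append(f"- {name}: {hint}")
    return "\n".join(lines)

context = schema_with_hints("orders", column_hints)
print(context)
```

With the "use this column for…" hint in the prompt, the model has an explicit rule for picking between the date columns instead of guessing.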
Wonderful video. Thank you
Would the text-to-pandas technique be a good approach when you only have access to stored procedures?
For the text-to-pandas part: is there a way to prevent answering generic, unrelated questions like “What is the meaning of life?”?
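A simple guardrail is to gate the query before it ever reaches the text-to-pandas step. A crude sketch using keyword overlap with the known column names; a more robust option is asking the LLM itself to classify relevance first:

```python
import re

# Columns of the dataframe being queried (example values).
COLUMNS = {"artist", "album", "year", "sales"}

def is_on_topic(query: str, columns: set[str], min_overlap: int = 1) -> bool:
    """Crude relevance gate: require the query to mention at least one
    known column before running text-to-pandas on it."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return len(words & columns) >= min_overlap

assert is_on_topic("Which artist had the best album sales?", COLUMNS)
assert not is_on_topic("What is the meaning of life?", COLUMNS)
# On a miss, return a canned "I can only answer questions about this
# dataset" response instead of invoking the pipeline.
```

This is deliberately simplistic (it misses synonyms), but it is free and blocks the most obvious off-topic prompts.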
Will this method work for tables with interlinked data, with primary and foreign keys, where the final result is achieved by joining two or more tables?
Asking from the POV of financial data containing user info, employee info, and product info, all interlinked, where retrieval requires a fairly complex SQL query. Thanks in advance.
I must say, this video is absolutely fantastic! The depth of knowledge and insight you've shared here is truly impressive. I couldn't help but be captivated by your presentation style, which strikes the perfect balance between being engaging and informative. I've learned so much from watching this, and I genuinely appreciate the effort you put into making it so valuable for your audience. You've earned yourself a loyal subscriber here, and I can't wait to see what other brilliant content you have in store for us!
thanks for the support 🙏
@@LlamaIndex did you read the user name? I'm wondering if we have two bots talking to each other?
Might be a dumb question, but: is the built-in SQLTableRetrievalEngine basically a query pipeline that is already prepared and assembled? Is there a way to customize some of the components in the built-in pipeline, or do I need to build a custom query pipeline myself? I'm dealing with a very large database with too many tables that are too large… trying to find a better way to do table retrieval to improve the SQL query that gets generated.
This is awesome. How can I integrate it with a chat engine?
Also, does the pandas pipeline also lack robustness on names like B.I.G.?
1. How can I upload more than one data frame such that they interact with each other through a common column?
2. And how can I customize the instructions according to the data frame?
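On question 1, a common pattern is to join the frames on the shared column first and hand the merged frame to the query engine. A minimal pandas sketch with made-up data:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [50, 30, 20]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alice", "Bob"]})

# Join the two frames on their shared key before passing the result
# into the query engine, instead of uploading each frame separately.
merged = orders.merge(customers, on="customer_id", how="left")
totals = merged.groupby("name")["amount"].sum()
print(totals)  # Alice: 70, Bob: 30
```

The LLM then only has to reason about one flat table, which tends to be much more reliable than asking it to coordinate two frames.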
Nice!!
I have a small number of rows and a single CSV file. Is it good to go with the pandas method itself, or is SQL best? What do you recommend?
Can we upload multiple CSV files when going with the text-to-pandas approach?
Not sure if I understand the text-to-sql prompt. Why does it ask the LLM to generate the SQLQuery, SQLResult, and Answer all at the same time? Shouldn't it generate the query first, run it to get a result, then generate the answer given the query result?
When the code is running, it does work the way you said. For example, I tried it on my local machine with my own dataset:
> Running module input with input:
query: What is the service with the highest number of customers in the first month of 2024?
> Running module table_retriever with input:
input: What is the service with the highest number of customers in the first month of 2024?
> Running module table_output_parser with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
table_schema_objs: [SQLTableSchema(table_name='monthly_customer_service_data', context_str='The table provides information about different services including Google, Apple, Spotify, etc. for the month of October 2021. I...
> Running module text2sql_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
schema: Table 'monthly_customer_service_data' has columns: Date (VARCHAR), Service (VARCHAR), Customer_Quantity_Monthly (INTEGER), Turnover_per_Customer_Monthly (INTEGER), Category (VARCHAR), and foreign keys...
> Running module text2sql_llm with input:
messages: Given an input question, first create a syntactically correct sqlite query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return...
> Running module sql_output_parser with input:
response: assistant: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
SQLResult: ('TV+', 1205392)
Answer: The service with the highest number of custom...
> Running module sql_retriever with input:
input: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
> Running module response_synthesis_prompt with input:
query_str: What is the service with the highest number of customers in the first month of 2024?
sql_query: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
context_str: [NodeWithScore(node=TextNode(id_='6c280c68-69bb-495b-9a6b-911cf4a42849', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[('Lifebox...
> Running module response_synthesis_llm with input:
messages: Given an input question, synthesize a response from the query results.
Query: What is the service with the highest number of customers in the first month of 2024?
SQL: SELECT Service, MAX(Customer_Qua...
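The trace above also answers the earlier question about the prompt: the model is asked to write SQLQuery/SQLResult/Answer in one shot (the classic text-to-SQL template), but the `sql_output_parser` step keeps only the SQL and discards the model's guessed result and answer; the real result comes from `sql_retriever` actually executing the query. A rough sketch of what such a parser might do (illustrative, not the actual llama-index implementation):

```python
def extract_sql(llm_response: str) -> str:
    """Keep only the SQL statement from a text-to-SQL completion.

    Anything the model hallucinated after 'SQLResult:' (a guessed result
    and answer) is cut off; the real result is obtained by running the
    query against the actual database."""
    text = llm_response.removeprefix("assistant: ").strip()
    for marker in ("SQLResult:", "Answer:"):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip().removeprefix("SQLQuery:").strip()

response = (
    "assistant: SELECT Service, MAX(Customer_Quantity_Monthly) "
    "FROM monthly_customer_service_data WHERE Date = '2024-01-01'\n"
    "SQLResult: ('TV+', 1205392)\n"
    "Answer: The service with the highest number of customers is TV+."
)
print(extract_sql(response))
```

That is exactly the handoff visible in the trace: `sql_retriever` receives just the `SELECT …` statement, and the final answer is synthesized from the executed query's rows.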
The Colab file doesn't work. In the download-data part, the CSV dataset link shows “404: Not Found”.
How can I use this with a custom LLM from Hugging Face?
You are indexing all tables, but what if 1-2 of my 4 tables have 2 million rows each? It takes so long... any tips here?
for vector indexing, you can try indexing batches/chunks of rows instead of each row individually. really the goal here is to find some coarse few-shot examples that help the text-to-sql prompt
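A minimal sketch of that chunking idea; the chunk size is arbitrary and worth tuning:

```python
def rows_to_chunks(rows: list[str], chunk_size: int = 20) -> list[str]:
    """Group serialized rows into chunks so you embed one document per
    chunk instead of one per row — far fewer embedding calls and far
    fewer vectors to store, while still surfacing few-shot examples."""
    return [
        "\n".join(rows[i : i + chunk_size])
        for i in range(0, len(rows), chunk_size)
    ]

# Toy data: 100 serialized rows collapse to 5 documents.
rows = [f"id={i}, city=city_{i % 5}" for i in range(100)]
chunks = rows_to_chunks(rows, chunk_size=20)
print(len(chunks))  # 5
```

Retrieval then returns a chunk of ~20 example rows, which is still perfectly usable as few-shot context for the text-to-SQL prompt.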
Yes, I agree and understand, but is my problem the fact that I'm trying to index a SQL table? Because you had your data in tons of CSV files. Is that the trick? Should I save my SQL table into CSVs first?
@@renaudgg you can also increase the batch size on your embeddings model. For example: `embed_model = OpenAIEmbedding(.., embed_batch_size=1000)`
Avoid using OpenAI and run your model "locally". Although this video/code doesn't touch on any of that.
Is there a way I can pass a CSV or XML file as input to the LLM and get a new column representing a sentiment score from a column in the CSV? I'm looking for a way to pass the CSV file and make one LLM call, to avoid the cost of a separate call for each comment.
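One way to do this is to pack all comments into a single numbered prompt and parse the scores out of one completion. A rough sketch with a fake reply standing in for the real LLM call; for large files you may still need a few batched calls to stay inside the context window:

```python
def build_sentiment_prompt(comments: list[str]) -> str:
    """Pack every comment into one numbered prompt so a single LLM call
    can score all of them, instead of one call per row."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(comments))
    return (
        "Score the sentiment of each comment from -1 (negative) to 1 "
        "(positive). Reply with one 'index: score' line per comment.\n"
        + numbered
    )

def parse_scores(reply: str, n: int) -> list[float]:
    """Turn the model's 'index: score' lines back into a column of floats."""
    scores = [0.0] * n
    for line in reply.splitlines():
        idx, _, score = line.partition(":")
        if idx.strip().isdigit():
            scores[int(idx) - 1] = float(score)
    return scores

comments = ["Great product", "Terrible support"]
prompt = build_sentiment_prompt(comments)
fake_reply = "1: 0.9\n2: -0.8"  # stands in for the real completion
print(parse_scores(fake_reply, len(comments)))  # [0.9, -0.8]
```

The parsed list can then be assigned straight back as a new dataframe column, e.g. `df["sentiment"] = parse_scores(reply, len(df))`.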
My question is on row indexing: you are indexing every row into the vector database, which is again a very costly operation. What if the table contains billions of records? It would create a lot of storage overhead.
Do you think this is the only option, or is there a better way to do this?
It is important to optimize your pipeline for your use case. It is good to generalize, but I think getting 90% of the answers is a great win. If you have tables with billions of records that are all available for querying, that sounds like a serious problem requiring appropriate hardware and a dedicated solution. Having billions of rows means your business is at a scale much bigger than a few dollars of embedding budget.
now, on to the solutions, you could do a number of things:
1- Filter to only the data you expect to query: tables, rows, columns.
2- Summarize and aggregate the data where possible.
3- Index only what you need. You can index primary/foreign keys; this is something you already need to do for querying billions of records in a SQL/NoSQL db.
4- When indexing key values, you can get the unique values first, then index those.
5- Use your own embedding model instead of outsourcing to external APIs: faster and cheaper.
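Point 4 can be as simple as a `SELECT DISTINCT` before embedding. A small sqlite sketch with toy data:

```python
import sqlite3

# Index only the distinct values of a key column instead of every row;
# a billion-row table may have only a few thousand unique key values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (city TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("Paris", "a"), ("Paris", "b"), ("Lyon", "c"), ("Paris", "d")],
)

unique_cities = [r[0] for r in conn.execute("SELECT DISTINCT city FROM events")]
print(unique_cities)  # e.g. ['Paris', 'Lyon'] — 2 values to embed, not 4 rows
```

Those few distinct strings are what you would hand to the embedding model, shrinking the vector index by orders of magnitude.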
Can we implement this in a Databricks model-serving chatbot?
Will the same strategies work with a Claude 3 backend?
it worked
Where can I find part 1 of this tutorial?
While running ObjectIndex, it throws an openai.NotFoundError.
what version are you on? this was made before our v10 release today, so make sure you `pip install llama-index==0.9.45.post1` (colab: colab.research.google.com/drive/1fRkgSn2PSlXSMgLk32beldVnLMLtI1Pc?usp=sharing)
@@LlamaIndex I am still getting the same error even after using the version you specified.
Can I plot a graph with this?
Ugh, but this one needs a ChatGPT API key.
NEEDS A CHATGPT API KEY