LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)

  • Published: 20 Sep 2024

Comments • 45

  • @afizs
    @afizs 7 months ago +4

    This is absolutely brilliant. Thank you very much for this very informative video.

  • @falven
    @falven 7 months ago +1

    So clever! Thank you so much for sharing.

  • @sakilansari9511
    @sakilansari9511 7 months ago +3

    This is a great video, and thanks for providing such fantastic content. Can we use an open-source LLM like Llama 2 for question-answering over tabular/CSV data with the above approach?

  • @austinmw89
    @austinmw89 7 months ago +2

    Would love to see a similar example with SQLite instead of using DataFrames, and with batching / async optimizations

  • @Miionu
    @Miionu 7 months ago +1

    Great video! I really love your way of presenting. I am looking forward to the next one :)
    I have a few questions though:
    1. Would it be possible to avoid embedding the user query twice, to optimize the time to response? It seems like the embedding is done twice, once for the table index and once for the row index.
    2. What would be a good strategy if the initial user query requires two different SQL queries to be answered? Or if the SQL is a bit more complex?
    3. What if the datasets have millions of rows? It looks like you are building the row index over all rows, which might take a lot of time and memory. And wouldn't it slow down the time to answer in the long run?

    • @THEGlab
      @THEGlab 5 days ago

      I want to know this as well

  • @zhang73
    @zhang73 5 months ago

    This is great. Thanks for sharing, Jerry. Quick question: as I understand it, for the df you are inputting it into the API, and for SQL you are giving it some rows to simulate the code, right? Just wondering if there is an option to “mask” or “use dummy” values for the tabular data going into the cloud? Tables and DBs often contain company-sensitive data, and you can’t send any of it to the cloud. Obviously, we can scramble it ourselves first, but if there were an option to dummy our data before sending it to OpenAI, that would make enterprise app design easier. I don’t think GPT needs real data to simulate the code. I could be wrong though.
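
    A minimal sketch of the "scramble it yourself" idea mentioned above, assuming hypothetical column names and a locally kept mapping; this is not a built-in LlamaIndex/OpenAI option, just one way to swap sensitive values for dummy tokens before any rows leave your machine:

    import pandas as pd

    # Hypothetical sensitive table; column names are made up for illustration.
    df = pd.DataFrame({
        "customer_name": ["Acme Corp", "Globex", "Initech"],
        "monthly_revenue": [120_000, 95_000, 40_000],
    })

    # Replace real values with stable dummy tokens; keep the mapping locally
    # so answers can be translated back after the LLM call.
    name_map = {name: f"CUSTOMER_{i}" for i, name in enumerate(df["customer_name"].unique())}
    masked_df = df.assign(customer_name=df["customer_name"].map(name_map))

    print(masked_df)                             # safe to use as schema / few-shot examples
    print({v: k for k, v in name_map.items()})   # local reverse map, never sent to the API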

  • @JoyAwad-tn9ge
    @JoyAwad-tn9ge 4 months ago

    Thank you for the clarity! But I have a question: should the data be preprocessed?

  • @kingzao9046
    @kingzao9046 3 months ago

    For the last part, if you already retrieved example rows from the vector DB, why go look again in the SQL database? Or is it that the vector DB may not always return the correct row?

  • @GowriKS-yl6lc
    @GowriKS-yl6lc 4 months ago

    Thank you for a very insightful tutorial!
    I have a scenario where some tables have multiple columns related to date or datetime. When a question about a date is asked, the correct date column is not picked up. Is there a way to strategize the column-selection step?

  • @chrismaley6676
    @chrismaley6676 6 months ago

    Wonderful video. Thank you.
    Would the text-to-Pandas technique be a good approach when you only have access to stored procedures?

  • @chidy7570
    @chidy7570 24 days ago

    For the text-to-pandas part: is there a way to prevent it from answering generic, unrelated questions like "What is the meaning of life?"?

  • @chrisronaldo9554
    @chrisronaldo9554 2 months ago

    Will this method work for tables with interlinked data, with primary and foreign keys, where the final result is achieved by joining two or more tables?
    Asking from the POV of financial data containing user info, employee info, and product info, all interlinked, where retrieval requires a fairly complex SQL query. Thanks in advance

  • @BUY_YOUTUBE_VIEWS_e0e71
    @BUY_YOUTUBE_VIEWS_e0e71 7 months ago +6

    I must say, this video is absolutely fantastic! The depth of knowledge and insight you've shared here is truly impressive. I couldn't help but be captivated by your presentation style, which strikes the perfect balance between being engaging and informative. I've learned so much from watching this, and I genuinely appreciate the effort you put into making it so valuable for your audience. You've earned yourself a loyal subscriber here, and I can't wait to see what other brilliant content you have in store for us!

    • @LlamaIndex
      @LlamaIndex  7 months ago +1

      thanks for the support 🙏

    • @B0tch0
      @B0tch0 7 months ago

      @@LlamaIndex did you read the user name? I'm wondering if we have two bots talking to each other?

  • @charlene1581
    @charlene1581 3 months ago

    Might be a dumb question, but: is the built-in SQLTableRetrieverQueryEngine basically a query pipeline that is already prepared and assembled? Is there a way to customize some of the components in the built-in pipeline, or do I need to build a custom query pipeline myself? Dealing with a very large database with too many tables that are too large… trying to find a better way to do table retrieval to improve the generated SQL query.

  • @kelvinadungosi1579
    @kelvinadungosi1579 4 months ago

    This is awesome. How do I integrate it with a chat engine?

  • @kingzao9046
    @kingzao9046 3 months ago

    Also, does the pandas pipeline have the same lack of robustness on names like B.I.G?

  • @KetakiGadgil_15
    @KetakiGadgil_15 5 months ago

    1. How can I upload more than one DataFrame such that they interact with each other through a common column?
    2. And how can I customize the instructions according to the DataFrame?

  • @ragsAI
    @ragsAI 7 months ago

    Nice!!

  • @sharankumar31
    @sharankumar31 2 months ago

    I have a small number of rows and a single CSV file. Is it fine to go with the Pandas method itself, or is SQL better? What do you recommend?

  • @mdsohrabakhtaremam9555
    @mdsohrabakhtaremam9555 23 days ago

    Can we upload multiple CSV files when going with the text-to-pandas approach?

  • @austinmw89
    @austinmw89 7 months ago

    Not sure if I understand the text-to-sql prompt. Why does it ask the LLM to generate the SQLQuery, SQLResult, and Answer all at the same time? Shouldn't it generate the query first, run it to get a result, then generate the answer given the query result?

    • @oguzhanakkurt3947
      @oguzhanakkurt3947 6 months ago

      When the code runs, it does it the way you said.
      For example, I tried it on my local machine with my own dataset:
      > Running module input with input:
      query: What is the service with the highest number of customers in the first month of 2024?
      > Running module table_retriever with input:
      input: What is the service with the highest number of customers in the first month of 2024?
      > Running module table_output_parser with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      table_schema_objs: [SQLTableSchema(table_name='monthly_customer_service_data', context_str='The table provides information about different services including Google, Apple, Spotify, etc. for the month of October 2021. I...
      > Running module text2sql_prompt with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      schema: Table 'monthly_customer_service_data' has columns: Date (VARCHAR), Service (VARCHAR), Customer_Quantity_Monthly (INTEGER), Turnover_per_Customer_Monthly (INTEGER), Category (VARCHAR), and foreign keys...
      > Running module text2sql_llm with input:
      messages: Given an input question, first create a syntactically correct sqlite query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return...
      > Running module sql_output_parser with input:
      response: assistant: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      SQLResult: ('TV+', 1205392)
      Answer: The service with the highest number of custom...
      > Running module sql_retriever with input:
      input: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      > Running module response_synthesis_prompt with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      sql_query: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      context_str: [NodeWithScore(node=TextNode(id_='6c280c68-69bb-495b-9a6b-911cf4a42849', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[('Lifebox...
      > Running module response_synthesis_llm with input:
      messages: Given an input question, synthesize a response from the query results.
      Query: What is the service with the highest number of customers in the first month of 2024?
      SQL: SELECT Service, MAX(Customer_Qua...
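
      To make the flow in that log concrete, here is a minimal, library-agnostic sketch of what the output-parser and retriever steps are doing; answer_question and the llm callable are hypothetical placeholders, not LlamaIndex APIs. Only the SQLQuery part of the first completion is trusted, the query is executed for real, and a second call synthesizes the answer from the actual result:

      import re
      import sqlite3

      def answer_question(question: str, schema: str, llm, db_path: str) -> str:
          # 1. The text-to-SQL prompt asks for SQLQuery/SQLResult/Answer, but only
          #    the SQLQuery part of the completion is kept.
          completion = llm(
              "Given an input question, first create a syntactically correct sqlite query to run.\n"
              f"Schema: {schema}\nQuestion: {question}\nSQLQuery:"
          )
          sql = re.split(r"SQLResult:", completion)[0].strip()

          # 2. The extracted query is executed against the real database.
          with sqlite3.connect(db_path) as conn:
              rows = conn.execute(sql).fetchall()

          # 3. A second LLM call synthesizes the final answer from the *actual* rows,
          #    so any SQLResult/Answer text the first call hallucinated is discarded.
          return llm(
              "Given an input question, synthesize a response from the query results.\n"
              f"Query: {question}\nSQL: {sql}\nSQL Response: {rows}\nAnswer:"
          )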

  • @wanyingqian5065
    @wanyingqian5065 5 months ago

    The Colab file doesn't work. In the download-data part, the CSV dataset link returns "404: Not Found".

  • @jonashinrichs9825
    @jonashinrichs9825 7 months ago +1

    How can I use this with a custom LLM from Hugging Face?

  • @renaudgg
    @renaudgg 7 months ago +1

    You are indexing all tables, but what if 1-2 of my 4 tables have 2 million rows each? It takes so long... any tips here?

    • @LlamaIndex
      @LlamaIndex  7 months ago

      For vector indexing, you can try indexing batches/chunks of rows instead of each row individually. Really, the goal here is to find some coarse few-shot examples that help the text-to-SQL prompt
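
      A rough sketch of that batched-row indexing idea, assuming a hypothetical CSV file and chunk size; the import paths shown are for llama-index v0.10+, with the pre-v0.10 equivalents noted in the comments:

      import pandas as pd
      from llama_index.core import VectorStoreIndex    # pre-v0.10: from llama_index import VectorStoreIndex
      from llama_index.core.schema import TextNode     # pre-v0.10: from llama_index.schema import TextNode

      df = pd.read_csv("my_table.csv")   # hypothetical file

      # Serialize a chunk of rows into one node instead of one node per row.
      CHUNK = 100
      nodes = [
          TextNode(text=df.iloc[i : i + CHUNK].to_csv(index=False))
          for i in range(0, len(df), CHUNK)
      ]

      index = VectorStoreIndex(nodes)                       # ~len(df)/CHUNK embeddings instead of len(df)
      retriever = index.as_retriever(similarity_top_k=2)    # coarse few-shot rows for the text-to-SQL prompt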

    • @renaudgg
      @renaudgg 7 months ago

      Yes, I agree and understand, but is my problem the fact that I'm trying to index a SQL table? You had your data in tons of CSV files; is that the trick? Should I save my SQL table into CSVs first?

    • @lloganm1234
      @lloganm1234 7 months ago

      @@renaudgg you can also increase the batch size on your embeddings model. For example, `embed_model = OpenAIEmbedding(.., embed_batch_size=1000)`
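
      Filled out a bit, under the assumption you are on llama-index v0.10+ (v0.9 wires the embedding model through a ServiceContext instead of Settings):

      from llama_index.core import Settings                      # v0.10+ global settings
      from llama_index.embeddings.openai import OpenAIEmbedding  # v0.9: from llama_index.embeddings import OpenAIEmbedding

      # Larger batches mean fewer round-trips to the embeddings endpoint
      # when indexing many rows.
      Settings.embed_model = OpenAIEmbedding(embed_batch_size=1000)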

    • @B0tch0
      @B0tch0 7 months ago

      Avoid using OpenAI and run your model "locally". Although this video/code doesn't touch on any of that.

  • @NA-so5by
    @NA-so5by 7 months ago

    Is there a way I can pass a CSV or XML file as input to the LLM and get a new column representing a sentiment score for one of the columns in the CSV? I'm looking for a way to pass the whole CSV file and make one LLM call, to avoid the cost of making a separate call for each comment.
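
    A minimal sketch of that single-call idea, assuming a hypothetical llm callable (prompt -> str) and a hypothetical comment column; very long files would still need to be chunked to fit the model's context window:

    import json
    import pandas as pd

    def score_sentiment_in_one_call(df: pd.DataFrame, text_col: str, llm) -> pd.DataFrame:
        # Number every comment and pack them all into a single prompt.
        numbered = "\n".join(f"{i}: {t}" for i, t in enumerate(df[text_col].astype(str)))
        prompt = (
            "For each numbered comment below, return a JSON object mapping the number "
            "to a sentiment score between -1 and 1. Return only JSON.\n" + numbered
        )
        scores = json.loads(llm(prompt))          # one API call for the whole column
        return df.assign(sentiment=[scores[str(i)] for i in range(len(df))])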

  • @brijeshsingh07
    @brijeshsingh07 6 months ago

    My question is on row indexing: you are indexing the entire table row by row into a vector database, which is a very costly operation. If the table contains billions of records, it would create a lot of storage overhead.
    Do you think this is the only option, or is there a better way to do this?

    • @MichelHabib
      @MichelHabib 5 months ago

      It is important to optimize your pipeline according to your use case. It is good to generalize, but I think getting 90% of the answers is a great win. If you have tables with billions of records that are all available for querying, that sounds like a serious problem that requires appropriate hardware and a dedicated solution; having billions of rows means you have a business at a scale much bigger than a few dollars of embedding budget.
      Now, on to the solutions, you could do a number of things (a rough sketch of 4 and 5 follows below):
      1- filter only the data you expect to query: tables, rows, columns.
      2- summarize data and aggregate if possible.
      3- index only what you need. You can index primary/foreign keys; this is something you already need to do for querying billions of records on a SQL/NoSQL DB.
      4- when indexing key values, you can get the unique values first, then index them.
      5- use your own embedding model instead of sourcing it out to external APIs: faster and cheaper.
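
      A rough sketch of points 4 and 5, assuming a hypothetical table and key column, a locally run embedding model, and the llama-index v0.10+ package layout (the Hugging Face embedding class lives in the separate llama-index-embeddings-huggingface package):

      import pandas as pd
      from llama_index.core import Settings, VectorStoreIndex
      from llama_index.core.schema import TextNode
      from llama_index.embeddings.huggingface import HuggingFaceEmbedding

      # 5: a local embedding model, so no per-row API cost and no data leaves the machine.
      Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

      df = pd.read_csv("transactions.csv")    # hypothetical large table

      # 4: index only the distinct values of the column you actually filter on,
      #    not every row in the table.
      key_values = df["service_name"].dropna().unique()   # hypothetical key column
      nodes = [TextNode(text=str(v)) for v in key_values]
      index = VectorStoreIndex(nodes)
      retriever = index.as_retriever(similarity_top_k=3)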

  • @santhoshvaraprasaduppuluri6587
    @santhoshvaraprasaduppuluri6587 2 months ago

    Can we implement this in a Databricks model-serving chatbot?

  • @straylight7116
    @straylight7116 5 months ago

    Will the same strategies work with a Claude 3 backend?

  • @deveshmishra5245
    @deveshmishra5245 5 months ago

    Where can I find Part 1 of this tutorial?

  • @ayushgoel7772
    @ayushgoel7772 7 months ago

    While running the ObjectIndex, it throws an openai.NotFoundError

    • @LlamaIndex
      @LlamaIndex  7 months ago

      what version are you on? this was made before our v10 release today, so make sure you `pip install llama-index==0.9.45.post1` (colab: colab.research.google.com/drive/1fRkgSn2PSlXSMgLk32beldVnLMLtI1Pc?usp=sharing)

    • @ayushgoel7772
      @ayushgoel7772 7 months ago

      @@LlamaIndex I am still getting the same error even after using the same version that you specified

  • @PratikBhande-qq1ht
    @PratikBhande-qq1ht 5 months ago

    Can I plot a graph with this?

  • @DataScienceandAI-doanngoccuong
    @DataScienceandAI-doanngoccuong 5 months ago

    Hmm, but this one needs a ChatGPT API key.

  • @DataScienceandAI-doanngoccuong
    @DataScienceandAI-doanngoccuong 5 months ago

    NEEDS A CHATGPT API KEY