LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)

  • Published on Jul 4, 2024
  • In the second video of this series, we show you how to compose a simple-to-advanced query pipeline over tabular data. This includes using LLMs to infer both Pandas operations and SQL queries, and it pulls in RAG concepts for advanced capabilities such as few-shot table and row selection over multiple tables.
    LlamaIndex Query Pipelines make it possible to express these complex pipeline DAGs in a concise, readable, and visual manner. It's very easy to add few-shot examples and to link together prompts, LLMs, custom functions, retrievers, and more (a minimal code sketch follows the timeline below).
    Colab notebook used in this video: colab.research.google.com/dri...
    This presentation was taken from our documentation guides - check them out 👇
    Text-to-SQL: docs.llamaindex.ai/en/stable/...
    Text-to-Pandas: docs.llamaindex.ai/en/stable/...
    Timeline:
    00:00-06:18 - Intro
    06:18-12:13 - Text-to-Pandas (Basic)
    12:13-27:05 - Query-Time Table Retrieval for Advanced Text-to-SQL
    27:05 - Query-Time Row Retrieval for Advanced Text-to-SQL
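
    A minimal, hypothetical sketch of the basic text-to-Pandas pipeline described above, assuming llama-index ~0.9.45 (the version pinned later in the comments; module paths moved under llama_index.core in v0.10+). The CSV file name and prompt wording are illustrative, not taken from the notebook:

    import pandas as pd
    from llama_index.llms import OpenAI
    from llama_index.prompts import PromptTemplate
    from llama_index.query_pipeline import QueryPipeline

    # Hypothetical CSV; substitute your own tabular data.
    df = pd.read_csv("titanic_train.csv")

    # Ask the LLM to emit a single Pandas expression over `df`.
    pandas_prompt = PromptTemplate(
        "You are working with a pandas dataframe named `df`.\n"
        "Here is the output of `df.head()`:\n{df_str}\n\n"
        "Write a single Python expression using `df` that answers: {query_str}\n"
        "Respond with only the expression."
    ).partial_format(df_str=df.head(5).to_string())

    llm = OpenAI(model="gpt-3.5-turbo")

    # Chain the prompt into the LLM; the returned expression is then evaluated
    # against `df` (the notebook adds a Pandas output parser for that step).
    qp = QueryPipeline(chain=[pandas_prompt, llm], verbose=True)
    print(qp.run(query_str="What is the average age of the passengers?"))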

Comments • 40

  • @afizs
    @afizs 4 months ago +4

    This is absolutely brilliant. Thank you very much for this very informative video.

  • @falven
    @falven 4 months ago +1

    So clever! Thank you so much for sharing.

  • @sakilansari9511
    @sakilansari9511 4 months ago +3

    This is a great video, thanks for providing such fantastic content. Can we use an open-source LLM like Llama 2 for question-answering over tabular/CSV data with the above approach?

  • @Miionu
    @Miionu 4 months ago

    Great video! I really love your way of presenting. I am looking forward to the next one :)
    I have a few questions though:
    1. Would it be possible to avoid embedding the user query twice, to optimize the response time? It seems like the embedding is done twice, once for the table index lookup and once for the row index lookup.
    2. What would be a good strategy if the initial user query requires two different SQL queries to be answered? Or if the SQL is a bit more complex?
    3. What if the datasets have millions of rows? It looks like you are building the row index over all rows, which might take a lot of time and memory, and it might slow down the time to answer in the long run, no?

  • @chrismaley6676
    @chrismaley6676 3 months ago

    Wonderful video. Thank you
    Would the Text-to-Pandas technique be a good approach when you only have access to stored procedures?

  • @zhang73
    @zhang73 2 months ago

    This is great. Thanks for sharing, Jerry. Quick question: as I understand it, for the df you are inputting it into the API, and for SQL you are giving some rows to simulate the code, right? Just wondering if there is an option to “mask” or “use dummy” data for the tabular data going into the cloud? Tables and DBs often contain company-sensitive data; you can't take any of it into the cloud. Obviously we can “scramble” it ourselves first, but if there were an option to dummy out our data before sending it to OpenAI, that would make enterprise app design easier. I don't think GPT needs real data to simulate the code. I could be wrong though.

  • @GowriKS-yl6lc
    @GowriKS-yl6lc 1 month ago

    Thank you for a very insightful tutorial!
    I have a scenario where some tables have multiple columns related to date or datetime. When a question involving a date is asked, the correct date column is not picked up. Is there a way to strategize the correct column-selection step?

  • @JoyAwad-tn9ge
    @JoyAwad-tn9ge 2 months ago

    Thank you for the clarity! But I have a question: should the data be preprocessed?

  • @ragsAI
    @ragsAI 4 months ago

    Nice!!

  • @austinmw89
    @austinmw89 4 months ago +2

    Would love to see a similar example with SQLite instead of using dfs, and with batching/async optimizations.

  • @charlene1581
    @charlene1581 13 days ago

    Might be a dumb question, but: is the built-in SQLTableRetrievalEngine basically a query pipeline that is already prepared and assembled? Is there some way to customize some of the components in the built-in pipeline, or do I need to build a custom query pipeline myself? I'm dealing with a very large database with too many tables that are too large… trying to find a better way to do table retrieval to improve the SQL query that gets generated.

  • @NA-so5by
    @NA-so5by 4 months ago

    Is there a way I can pass a CSV or XML file as input to the LLM and get a new column representing a sentiment score for a column in the CSV? Looking for a way to pass the CSV file and make one LLM call, to avoid the cost of multiple calls for each comment.

  • @BUY_YOUTUBE_VIEWS_e0e71
    @BUY_YOUTUBE_VIEWS_e0e71 4 months ago +6

    I must say, this video is absolutely fantastic! The depth of knowledge and insight you've shared here is truly impressive. I couldn't help but be captivated by your presentation style, which strikes the perfect balance between being engaging and informative. I've learned so much from watching this, and I genuinely appreciate the effort you put into making it so valuable for your audience. You've earned yourself a loyal subscriber here, and I can't wait to see what other brilliant content you have in store for us!

    • @LlamaIndex
      @LlamaIndex  4 months ago +1

      thanks for the support 🙏

    • @B0tch0
      @B0tch0 4 months ago

      @@LlamaIndex did you read the user name? I'm wondering if we have two bots talking to each other?

  • @chrisronaldo9554
    @chrisronaldo9554 8 days ago

    Will this method work for tables with interlinked data, with primary and foreign keys, where the final result is achieved by joining two or more tables?
    Asking from the POV of financial data containing user info, employee info, and product info, all interlinked, where retrieval requires a fairly complex SQL query. Thanks in advance.

  • @kingzao9046
    @kingzao9046 18 days ago

    Also, does the Pandas pipeline similarly lack robustness on names like B.I.G.?

  • @kingzao9046
    @kingzao9046 18 days ago

    For the last part, if you already retrieved example rows from the vector DB, why go look again in the SQL database? Or is it that the vector DB may not always return the correct row?

  • @jonashinrichs9825
    @jonashinrichs9825 4 months ago +1

    How can I use this with a custom LLM from Hugging Face?

  • @KetakiGadgil_15
    @KetakiGadgil_15 2 months ago

    1. How can I upload more than one data frame such that they interact with each other through a common column?
    2. Also, how can I customize the instructions according to the data frame?

  • @kelvinadungosi1579
    @kelvinadungosi1579 1 month ago

    This is awesome. How do I integrate it with a chat engine?

  • @deveshmishra5245
    @deveshmishra5245 2 months ago

    Where can I find part 1 of this tutorial?

  • @brijeshsingh07
    @brijeshsingh07 3 months ago

    My question is on row indexing: you are indexing every row into the vector database, which is again a very costly operation. What if the table contains billions of records? That would create a lot of storage overhead.
    Do you think this is the only option, or is there a better way to do this?

    • @MichelHabib
      @MichelHabib 2 months ago

      It is important to optimize your pipeline according to your use case. It is good to generalize, but I think getting 90% of the answers is a great win. If you have tables with billions of records that are all available for querying, that sounds like a serious problem that requires appropriate hardware and a dedicated solution; having billions of rows means you have a business at a scale much bigger than a few dollars of embedding budget.
      Now, on to the solutions, you could do a number of things:
      1- Filter only the data you expect to query: tables, rows, columns.
      2- Summarize and aggregate data where possible.
      3- Index only what you need. You can index primary/foreign keys; this is something you already need to do for querying billions of records in a SQL/NoSQL DB.
      4- When indexing key values, you can get the unique values first, then index them (see the sketch after this list).
      5- Use your own embedding model instead of outsourcing to external APIs; it's faster and cheaper.
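
      A rough, hypothetical sketch of point 4 above, assuming llama-index ~0.9.x, a local SQLite database, and an illustrative `city` column in a `city_stats` table; only the distinct values of the key column are embedded, rather than every row:

      from sqlalchemy import create_engine, text
      from llama_index import VectorStoreIndex
      from llama_index.schema import TextNode

      # Assumption: a local SQLite DB with a `city_stats` table and a `city` key column.
      engine = create_engine("sqlite:///mydb.sqlite")
      with engine.connect() as conn:
          distinct_vals = [row[0] for row in conn.execute(text("SELECT DISTINCT city FROM city_stats"))]

      # One node per unique value instead of one node per row.
      nodes = [TextNode(text=str(v), metadata={"column": "city"}) for v in distinct_vals]
      value_index = VectorStoreIndex(nodes)
      retriever = value_index.as_retriever(similarity_top_k=3)  # look up matching values at query time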

  • @wanyingqian5065
    @wanyingqian5065 2 months ago

    The colab file doesn't work. In the download-data part, the CSV dataset link shows "404: Not Found".

  • @austinmw89
    @austinmw89 4 months ago

    Not sure if I understand the text-to-SQL prompt. Why does it ask the LLM to generate the SQLQuery, SQLResult, and Answer all at the same time? Shouldn't it generate the query first, run it to get a result, and then generate the answer given the query result?

    • @oguzhanakkurt3947
      @oguzhanakkurt3947 3 months ago

      When the code runs, it actually does it the way you said.
      For example, I tried it on my local machine with my own dataset:
      > Running module input with input:
      query: What is the service with the highest number of customers in the first month of 2024?
      > Running module table_retriever with input:
      input: What is the service with the highest number of customers in the first month of 2024?
      > Running module table_output_parser with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      table_schema_objs: [SQLTableSchema(table_name='monthly_customer_service_data', context_str='The table provides information about different services including Google, Apple, Spotify, etc. for the month of October 2021. I...
      > Running module text2sql_prompt with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      schema: Table 'monthly_customer_service_data' has columns: Date (VARCHAR), Service (VARCHAR), Customer_Quantity_Monthly (INTEGER), Turnover_per_Customer_Monthly (INTEGER), Category (VARCHAR), and foreign keys...
      > Running module text2sql_llm with input:
      messages: Given an input question, first create a syntactically correct sqlite query to run, then look at the results of the query and return the answer. You can order the results by a relevant column to return...
      > Running module sql_output_parser with input:
      response: assistant: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      SQLResult: ('TV+', 1205392)
      Answer: The service with the highest number of custom...
      > Running module sql_retriever with input:
      input: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      > Running module response_synthesis_prompt with input:
      query_str: What is the service with the highest number of customers in the first month of 2024?
      sql_query: SELECT Service, MAX(Customer_Quantity_Monthly) FROM monthly_customer_service_data WHERE Date = '2024-01-01'
      context_str: [NodeWithScore(node=TextNode(id_='6c280c68-69bb-495b-9a6b-911cf4a42849', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="[('Lifebox...
      > Running module response_synthesis_llm with input:
      messages: Given an input question, synthesize a response from the query results.
      Query: What is the service with the highest number of customers in the first month of 2024?
      SQL: SELECT Service, MAX(Customer_Qua...
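
      In practice, the prompt inherits the classic text-to-SQL format (SQLQuery/SQLResult/Answer), but only the SQLQuery portion is extracted and executed; the rest of the completion is discarded. A hedged sketch of what a sql_output_parser-style function might do (not copied verbatim from the notebook):

      def parse_response_to_sql(response: str) -> str:
          # Keep only the SQL statement; drop any speculative SQLResult/Answer text
          # the LLM produced in the same completion.
          sql_query_start = response.find("SQLQuery:")
          if sql_query_start != -1:
              response = response[sql_query_start + len("SQLQuery:"):]
          sql_result_start = response.find("SQLResult:")
          if sql_result_start != -1:
              response = response[:sql_result_start]
          return response.strip().strip("```").strip()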

  • @PratikBhande-qq1ht
    @PratikBhande-qq1ht 2 months ago

    Can I plot a graph with this?

  • @straylight7116
    @straylight7116 3 months ago

    Will the same strategies work with a Claude 3 backend?

  • @renaudgg
    @renaudgg 4 months ago +1

    You are indexing all tables, but what if 1-2 of my 4 tables have 2 million rows each? It takes so long... any tips here?

    • @LlamaIndex
      @LlamaIndex  4 months ago

      For vector indexing, you can try indexing batches/chunks of rows instead of each row individually. Really, the goal here is to find some coarse few-shot examples that help the text-to-SQL prompt.
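
      For illustration, a rough sketch of that batching idea (the table name and DB URL are assumptions, and this is a variant rather than the notebook's own code):

      from sqlalchemy import create_engine, text
      from llama_index import VectorStoreIndex
      from llama_index.schema import TextNode

      BATCH_SIZE = 100  # tune for table size and embedding budget

      # Assumption: the data lives in a SQLite table called `my_table`.
      engine = create_engine("sqlite:///mydb.sqlite")
      with engine.connect() as conn:
          all_rows = conn.execute(text("SELECT * FROM my_table")).fetchall()

      # One node per batch of rows instead of one node per row.
      nodes = [
          TextNode(
              text="\n".join(str(tuple(r)) for r in all_rows[i:i + BATCH_SIZE]),
              metadata={"row_start": i},
          )
          for i in range(0, len(all_rows), BATCH_SIZE)
      ]
      row_index = VectorStoreIndex(nodes)  # coarse few-shot rows for the text-to-SQL prompt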

    • @renaudgg
      @renaudgg 4 months ago

      Yes, I agree and understand, but is my problem the fact that I'm trying to index a SQL table? Because you had your data in tons of CSV files. Is that the trick? Should I save my SQL tables into CSVs first?

    • @lloganm1234
      @lloganm1234 4 months ago

      @@renaudgg you can also increase the batch size on your embedding model. For example: embed_model = OpenAIEmbedding(.., embed_batch_size=1000)

    • @B0tch0
      @B0tch0 4 months ago

      Avoid using OpenAI and run your model "locally". Although this video/code doesn't touch on any of that.

  • @ayushgoel7772
    @ayushgoel7772 4 months ago

    While running ObjectIndex, it is throwing an openai.NotFoundError.

    • @LlamaIndex
      @LlamaIndex  4 months ago

      What version are you on? This was made before our v0.10 release today, so make sure you `pip install llama-index==0.9.45.post1` (colab: colab.research.google.com/drive/1fRkgSn2PSlXSMgLk32beldVnLMLtI1Pc?usp=sharing)

    • @ayushgoel7772
      @ayushgoel7772 4 months ago

      @@LlamaIndex I am still getting the same error even after using the version you specified.

  • @DataScienceandAI-doanngoccuong
    @DataScienceandAI-doanngoccuong 2 months ago

    Hic, but this one needs a ChatGPT API key.

  • @DataScienceandAI-doanngoccuong
    @DataScienceandAI-doanngoccuong 2 months ago

    Needs a ChatGPT API key.