LlamaIndex Webinar: Improving RAG with Advanced Parsing + Metadata Extraction

  • Published on Sep 5, 2024
  • In this video we co-host a workshop with the cofounders of Deasie (Reece, Leonard, Mikko) on improving RAG with advanced parsing and metadata.
    The data processing layer is one of the most important pieces to get right for RAG. This means that AI engineers need to make careful decisions about parsing and transformations, including metadata extraction and chunking, to make sure that their end-to-end QA system surfaces relevant results.
    This two-part workshop demonstrates the following:
    - the value of good parsing itself over complex documents, with LlamaParse
    - the additional value of adding metadata through Deasie's powerful automated labeling platform
    We show overall experimental results over research papers validating the combination of parsing + metadata for good performance.

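As a rough illustration of why metadata attached during parsing can help retrieval, here is a minimal sketch in plain Python. It is not LlamaParse's or Deasie's API: the `Chunk` class, the `retrieve` function, the tag names, and the sample texts are all hypothetical, and word overlap stands in for a real embedding similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A parsed text chunk plus the metadata tags attached to it (hypothetical structure)."""
    text: str
    metadata: dict = field(default_factory=dict)


def retrieve(chunks, query, filters=None, top_k=2):
    """Keep chunks whose metadata matches `filters`, then rank them by word overlap."""
    filters = filters or {}
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    query_words = set(query.lower().split())

    def overlap(chunk):
        # Toy relevance score: number of query words that appear in the chunk.
        return len(query_words & set(chunk.text.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:top_k]


chunks = [
    Chunk("Attention layers scale quadratically with sequence length.",
          {"topic": "transformers", "section": "methods"}),
    Chunk("We report accuracy on the held-out test split.",
          {"topic": "transformers", "section": "results"}),
    Chunk("Graph convolutions aggregate neighbor features.",
          {"topic": "graphs", "section": "methods"}),
]

hits = retrieve(chunks, "how do attention layers scale",
                filters={"topic": "transformers"})
for hit in hits:
    print(hit.metadata["section"], "|", hit.text)
```

The metadata filter narrows the candidate set before any ranking happens, so chunks from unrelated papers never compete with the relevant ones.
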
Comments • 13

  • @SamiSabirIdrissi several months ago +2

    Overall I think this is super dope! I can't wait to try this. The "Increase routing accuracy" capability is wild. Pulling relevant data with high accuracy is extremely important! 💪⚡️

  • @SamiSabirIdrissi several months ago +1

    Very, very interesting. I feel like this is similar to Azure's Document Intelligence feature?

    • @awakenwithoutcoffee several months ago

      Yup, similar to Unstructured / LlamaParse + LlamaExtract (new). There are also new OCR models like ColPali!

  • several months ago +2

    Same for the metadata tag generation: another OpenAI GPT wrapper doing generation or mapping, depending on whether the tags are suggested or not. As shown, the best result is obtained with the custom metadata. It means that humans are still needed to do the most difficult and time-consuming task, i.e. defining the custom tasks...😢

    • @awakenwithoutcoffee several months ago

      nah I can't see this not being automated in the foreseeable future. There are already OCR models with long context memory that are able to create metadata tags. Give it a few months and this metadata "problem" will be solved.

  • @pin65371 several months ago +2

    It seems to me like this would get much more effective with a graph system? When Jerry was asking about how the data would be retrieved, it seemed like graph would work well with this. When you ask a question, the LLM would first retrieve relevant parent metadata. From there it can branch out. The advantage would be that connections that maybe aren't so obvious with vector would be very obvious with graph. Also, with graph at least you have visibility to be able to manually go in and see what is going on. I liked that last question as well. It seems like maybe it wasn't something they thought about, but they might look at how to implement something like that. Tokens are getting so cheap now that it would make sense. Especially if, let's say, you are using the OpenAI 4o-mini model, it's 30 cents for a million output tokens. Just getting it to output some extra metadata would essentially be free and would only make the whole system more efficient in the long run.
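
    As a rough check on the cost argument in the comment above, the arithmetic can be sketched as follows. The 30-cent figure is the one quoted in the comment, not a verified price; the chunk count and the number of metadata tokens per chunk are hypothetical values chosen for illustration.

```python
def metadata_cost_usd(num_chunks, tokens_per_chunk, usd_per_million_output_tokens):
    """Estimate the cost of generating extra metadata tokens for every chunk."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens * usd_per_million_output_tokens / 1_000_000


# Assumed workload: 10,000 chunks, about 100 output tokens of metadata each.
cost = metadata_cost_usd(
    num_chunks=10_000,
    tokens_per_chunk=100,
    usd_per_million_output_tokens=0.30,  # figure quoted in the comment
)
print(f"${cost:.2f}")
```

    Under these assumptions the whole corpus produces one million metadata tokens, so the cost equals the quoted per-million price.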

    • @awakenwithoutcoffee several months ago

      I agree, but it's still too expensive and difficult to fully automate correctly. Let's keep in touch through the comments, as we engineers are looking for production-ready techniques. My take is that GraphRAG is not ready yet, but it might be early next year (for enterprise).

  • several months ago +6

    The example with PyPDF is not correct, as nobody uses PyPDF text extracted per page as is. Instead, there is post-processing on the raw text. All these startup founders think that we are dummies and propose in their "products" the recipes that we have all been using for months or years without pretending to build a company on top of them. Same for metadata... Almost nothing new here. 😢😮
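
    The comment does not say which post-processing it has in mind. As one hedged illustration of the general idea, the sketch below merges per-page text (as a PDF extractor might return it) and repairs breaks at page boundaries. The `merge_pages` function and the sample pages are hypothetical, and the hyphen rule is a simple heuristic that would also join genuinely hyphenated compounds.

```python
import re


def merge_pages(pages):
    """Join per-page text into one string, repairing breaks at page boundaries."""
    merged = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue  # skip empty pages
        if not merged:
            merged = page
        elif merged.endswith("-"):
            # Word split across the page break: drop the hyphen and join directly.
            merged = merged[:-1] + page
        elif not re.search(r"[.!?:]$", merged):
            # Sentence continues on the next page: join with a single space.
            merged += " " + page
        else:
            # Previous page ended a sentence: keep a paragraph break.
            merged += "\n\n" + page
    return merged


pages = [
    "The model is evalu-",
    "ated on three datasets and",
    "outperforms the baseline.",
    "Next section.",
]
text = merge_pages(pages)
print(text)
```

    Chunking the merged text instead of each page separately keeps sentences that span a page break inside a single chunk.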

    • @awakenwithoutcoffee several months ago

      You bring up an important point: the part about cross-page context confused me, since Jerry basically didn't know why this was happening. Have you found additional information or techniques yourself? I'm looking for production-ready techniques for metadata extraction. One alternative new approach is ColPali.

  • @MatijaGrcic several months ago

    This was great, thanks for sharing.

  • @isle1009 several months ago

    8:16 Does Deasie support languages other than English well, especially Korean?

    • @Deasie several months ago +2

      Yes, we do support other languages, including Korean!