Semantic Chunking

  • Published 10 Jul 2024
  • A more semantic approach to chunking text documents for embeddings used in chat completions with OpenAI LLMs.
    Update: Working with other OpenAI developers, we've come up with a way to automate this process even further using the LLMs themselves. Discussed here: community.openai.com/t/using-...
    By the way, the chunk header idea at 07:29 is not mine. I got it from www.BlinkData.com, who provide a ChatPDF service. (A minimal sketch of the idea appears after the chapter list below.)
    00:00 Introduction
    01:19 The Issue
    03:39 The Solution
    05:10 Real World Example
    09:04 How To?
    10:49 Conclusion
  • Entertainment
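
    A minimal sketch of the chunk header idea from 07:29: prepend a header carrying the document title and section path to each chunk before embedding it, so every chunk carries its own context. The header format, section names, and embedding model below are illustrative assumptions, not the exact setup shown in the video.

    # Prepend document/section context to a chunk, then embed it.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed_chunk(doc_title: str, section_path: str, chunk_text: str) -> dict:
        header = f"Document: {doc_title}\nSection: {section_path}\n\n"
        text = header + chunk_text
        response = client.embeddings.create(
            model="text-embedding-3-small",   # any embedding model works here
            input=text,
        )
        return {"section": section_path, "text": text,
                "embedding": response.data[0].embedding}

    # Hypothetical usage:
    # embed_chunk("Employee Handbook", "2. Benefits > 2.1 Vacation",
    #             "Employees accrue vacation time at ...")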

Comments • 18

  • @mazenlahham8029
    @mazenlahham8029 10 months ago +1

    Amazing idea, thanks for sharing ❤

  • @tommaso6187
    @tommaso6187 20 days ago

    amazing video

  • @naderjanhaoui583
    @naderjanhaoui583 4 months ago +1

    If you use an OCR system like the OCR API of Adobe PDF Services, you can easily obtain the semantic schema. Unlike regex, which makes it impossible to detect titles, sections, or other parts of the document, OCR allows you to identify every element in your document, such as tables or lists. This ensures that you have a cleanly parsed document.

    • @SwingingInTheHood
      @SwingingInTheHood  4 months ago +2

      Thanks for the info. As someone who has successfully used OCR and regular expressions for decades, I would hardly say it makes it impossible to detect formatting. Au contraire, that's what it was designed for. However, I find that working with PDFs and bookmarks is becoming much easier. I would recommend Wondershare PDFelement and Nitro PDF Pro. Both have auto-bookmarking features, Nitro's being the best because you can search by font and text.
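
      A rough sketch of the kind of regular-expression heading detection mentioned above, assuming numbered headings such as "2.1 Vacation Policy" in the extracted text; the pattern is an assumption and usually needs tuning to each document's layout.

      # Split extracted PDF text into sections at numbered headings.
      import re

      HEADING_RE = re.compile(r"^(?P<number>\d+(?:\.\d+)*)\s+(?P<title>[A-Z].*)$",
                              re.MULTILINE)

      def split_by_headings(text: str) -> list[dict]:
          sections = []
          matches = list(HEADING_RE.finditer(text))
          for i, m in enumerate(matches):
              start = m.end()
              end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
              sections.append({
                  "number": m.group("number"),        # e.g. "2.1"
                  "title": m.group("title").strip(),
                  "body": text[start:end].strip(),
              })
          return sections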

  • @galdx_
    @galdx_ 1 year ago +1

    Did some tests here and also noticed a substantial improvement when using the header approach per chunk. I searched for some pdf parsers, but could not find one that recognizes the structure of the document and then parses it. Did you have any luck with it?
    I believe that this problem might have been solved by someone already.

    • @SwingingInTheHood
      @SwingingInTheHood  1 year ago +1

      A PDF export program that could export documents according to their hierarchical organization would be a dream come true. But, alas, I have yet to find one. I did make a request to ABBYY to look into it. What I have ended up doing is writing code that reads the header I created to chunk the document, then re-organizes all the chunks in hierarchical order. Now, I can import these text files as "book" nodes into Drupal, where they create their own natural "table of contents". And, using my SolrAI module, I vectorize these nodes from within Drupal and now have some pretty organized content that always knows where it is in the hierarchy.
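
      A rough sketch of that re-organization step, under the assumption that each chunk file begins with a header line such as "Section: 2.1.3 Vacation Accrual"; the header format and sort key below are illustrative and may differ from the actual pipeline.

      # Order chunk files hierarchically by the section number in their header.
      import re
      from pathlib import Path

      SECTION_RE = re.compile(r"^Section:\s*(?P<number>\d+(?:\.\d+)*)", re.MULTILINE)

      def hierarchical_order(chunk_dir: str) -> list[Path]:
          def sort_key(path: Path) -> tuple:
              m = SECTION_RE.search(path.read_text())
              if not m:
                  return (float("inf"),)              # unlabelled chunks sort last
              # "2.1.3" -> (2, 1, 3) so subsections follow their parent section
              return tuple(int(part) for part in m.group("number").split("."))
          return sorted(Path(chunk_dir).glob("*.txt"), key=sort_key)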

    • @galdx_
      @galdx_ 1 year ago

      @@SwingingInTheHood Yes, it solves the issue, but it is not scalable, right? Maybe there is an opportunity.

    • @SwingingInTheHood
      @SwingingInTheHood  1 year ago +1

      @@galdx_ Au contraire, Drupal is the most scalable CMS available today. It is the preferred CMS of enterprise organizations. The reason the updates are queued is so that they can be upserted to the vector store in a more manageable manner. If you have hundreds, even thousands of updates going on hourly, the only difference would be that they would need to be queued and batched instead of the one-at-a-time system I have now, if that is what you mean.
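
      A minimal sketch of that queue-and-batch pattern: queued node updates are upserted to the vector store in fixed-size batches rather than one at a time. The upsert_batch callable is a hypothetical stand-in for whatever vector-store client is actually used.

      # Drain a queue of updated nodes and upsert them in manageable batches.
      from collections import deque

      def flush_updates(update_queue: deque, upsert_batch, batch_size: int = 100) -> None:
          batch = []
          while update_queue:
              batch.append(update_queue.popleft())
              if len(batch) == batch_size:
                  upsert_batch(batch)                 # one call per full batch
                  batch = []
          if batch:
              upsert_batch(batch)                     # flush the final partial batch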

  • @johnday2631
    @johnday2631 9 months ago +3

    link to code repo?

    • @SwingingInTheHood
      @SwingingInTheHood  9 months ago

      Not yet. But I think I will create a Github repo and post the code I have created for my use. I'll add the link here when it is done. Thanks for the suggestion.

    • @Victor-ww2hx
      @Victor-ww2hx 7 months ago +1

      @@SwingingInTheHood still no repo?

    • @cybergarrett
      @cybergarrett 1 month ago

      @@Victor-ww2hx bump

    • @SwingingInTheHood
      @SwingingInTheHood  1 month ago

      Still no repo, primarily because the current code is part of the embedding pipeline in my existing system. Trying to pull it out to make it standalone is just too big a task at the moment. However, I am thinking about making an API available: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/100?u=somebodysysop
      Or, if you're up to the coding challenge, we have created a roadmap in this discussion for developing the process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/

  • @sharannagarajan4089
    @sharannagarajan4089 9 months ago +1

    I'm also looking for a solution where the PDF's hierarchical schema is maintained during chunking.

    • @SwingingInTheHood
      @SwingingInTheHood  9 months ago

      Outside of custom regex code, another method I've found is to use PDF bookmarking. If it's not that large of a document, I simply go through and bookmark the individual sections, then use a PDF splitter tool to split the document by section. The tool I've been using is Sejda.com, but there are a few of them out there.
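
      A programmatic equivalent of that bookmark-then-split workflow, sketched with the pypdf library rather than Sejda.com (an assumption; it also assumes top-level bookmarks appear in page order).

      # Split a bookmarked PDF into one file per top-level bookmark.
      from pypdf import PdfReader, PdfWriter

      def split_by_bookmarks(path: str) -> None:
          reader = PdfReader(path)
          # Nested lists in reader.outline are sub-bookmarks; keep only the top level.
          tops = [b for b in reader.outline if not isinstance(b, list)]
          starts = [reader.get_destination_page_number(b) for b in tops]
          for i, start in enumerate(starts):
              end = starts[i + 1] if i + 1 < len(starts) else len(reader.pages)
              writer = PdfWriter()
              for page_index in range(start, end):
                  writer.add_page(reader.pages[page_index])
              with open(f"section_{i:02d}.pdf", "wb") as f:
                  writer.write(f)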

    • @naderjanhaoui583
      @naderjanhaoui583 4 months ago

      You can use an OCR system; contact me if you need help.

    • @SwingingInTheHood
      @SwingingInTheHood  1 month ago

      If you're up to the coding challenge, we have created a roadmap in this discussion for developing the process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/