Advanced Techniques for Working with Different Document Types in RAG

  • Published on Sep 6, 2024
  • Getting parsing and chunking right is a key part of RAG. Go from novice to expert by learning advanced techniques for parsing and chunking a wide range of document types, including Microsoft Office documents that contain tables and images requiring OCR.
    Once everything is parsed, learn how to create datasets of different types from the parsed documents.
    Learn advanced techniques for:
    - Working with different document types using a variety of options and configurations; and
    - Creating datasets for model training from your documents, including table information and images.
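As a rough illustration of what "smart" chunking can mean, the sketch below splits text near a target size while preferring sentence boundaries, and only hard-splits a sentence that exceeds a maximum size. It is a generic standard-library example, not the library's own implementation; the function name `smart_chunk` and the parameters `target_size` and `max_size` are hypothetical names chosen for this sketch.

```python
import re


def smart_chunk(text, target_size=400, max_size=600):
    """Split text into chunks near target_size characters, preferring sentence ends.

    Hypothetical helper for illustration only; sizes are measured in characters.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than max_size is hard-split into pieces.
        while len(sentence) > max_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_size])
            sentence = sentence[max_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= target_size or not current:
            # Keep growing the current chunk while it stays within the target.
            current = candidate
        else:
            # Close the current chunk at a sentence boundary and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Parsing comes first. Chunking comes second! Does order matter? " * 5
    for chunk in smart_chunk(sample, target_size=80, max_size=120):
        print(len(chunk), chunk)
```

Closing chunks at sentence ends keeps each chunk readable on its own, which tends to matter when chunks are later retrieved and shown to a model out of context.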
    Please subscribe for future content!
    Check out our GitHub and leave a star!
    github.com/llm...
    Join us on Discord:
    / discord

Comments • 10

  • @JebliMohamed 3 months ago +1

    Loved the video!
    The step-by-step guide on parsing docs and data was super helpful.
    I was really impressed by how you used OCR to pull text from images in Microsoft Office files - that was cool.
    The smart chunking strategy explanation was also 👌.

    • @llmware 3 months ago

      Thank you so much for your kind feedback! ☺

  • @user-kk1li5mk7q 3 months ago +1

    This is a really nice way of extracting data and converting unstructured data into structured form. I believe the extracted data can serve as a data source for a RAG pipeline, and LLMs will probably give more accurate answers as a result.

    • @llmware 3 months ago +1

      Thank you so much for your observation - we also believe that documents parsed in this manner will enhance the accuracy of LLMs in a RAG workflow!

  • @JebliMohamed 3 months ago +1

    🎯 Key points for quick navigation:
    00:18 *📄 Introduction to document parsing, chunking, and data extraction.*
    00:33 *🛠️ Advanced techniques for extracting images, tables, and automating workflows.*
    01:17 *📚 Preparing datasets for self-supervised learning and fine-tuning.*
    01:31 *💡 Focus on data wrangling and Microsoft Office documents.*
    02:14 *🗂️ Accessing public Microsoft Word, PowerPoint, and Excel documents.*
    03:22 *📂 Downloading and preparing Microsoft Office documents.*
    04:03 *🛠️ Setting up the environment to parse and chunk documents.*
    05:12 *🔍 Smart chunking strategies and their configurations.*
    06:22 *📑 Parsing tables and images from documents.*
    07:32 *🗃️ Exporting tables into CSV files.*
    08:28 *🖼️ Running OCR on extracted images.*
    09:54 *📄 Creating a consolidated JSONL file.*
    10:35 *📊 Building a dataset for unsupervised testing.*
    11:14 *⚡ Parsing 152 files in 6 seconds using a local Mac M1.*
    12:37 *🔍 Running OCR and storing text in the library.*
    13:17 *⏱️ Comparing the speed of digital parsing versus OCR.*
    14:23 *📁 Exploring file artifacts created during parsing.*
    16:29 *📄 Reviewing the created dataset.*
    19:44 *🎥 Closing remarks and upcoming example videos.*
    Made with HARPA AI
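As a rough illustration of the table export and consolidated JSONL steps listed in the outline above, the sketch below writes extracted tables to CSV files and parsed blocks to a single JSONL file using only the Python standard library. It is not the library's API: the function names and the record fields (`file_source`, `content_type`, `text`) are hypothetical and chosen for this example.

```python
import csv
import json
from pathlib import Path


def export_tables_to_csv(tables, out_dir):
    """Write each table (a list of rows) to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(tables):
        path = out_dir / f"table_{i}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths


def write_jsonl(records, out_path):
    """Write one JSON object per line, the usual layout of a JSONL dataset file."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical parsed output: one table and two text blocks.
    tables = [[["quarter", "revenue"], ["Q1", "100"], ["Q2", "120"]]]
    records = [
        {"file_source": "report.docx", "content_type": "text", "text": "Revenue grew."},
        {"file_source": "slides.pptx", "content_type": "image", "text": "OCR text here"},
    ]
    print(export_tables_to_csv(tables, "parsed_output/tables"))
    print(read_jsonl(write_jsonl(records, "parsed_output/blocks.jsonl")))
```

One JSON object per line keeps the file easy to stream and append to, which is convenient when many documents are parsed in one run.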

    • @llmware 3 months ago

      This is so helpful - thank you!!

  • @user-um2uq9nh4z 3 months ago +2

    wow!!!!! I'm wowed!

    • @llmware 3 months ago

      Thank you so much! 🥰

  • @user-cb7yl4nr6h 2 months ago +1

    Download the repo, open the example, and run it locally, and it will work. I tried it in Colab and the example did not work for me.

    • @llmware 2 months ago

      We are working on turning many of our YT videos into Colab notebooks as well and will post these notebooks as we make them.