Thank you for your video. What I got out of your presentation is that documents typically follow a semantic structure, which includes elements like a Table of Contents (TOC), sections, and sub-sections. Sections and sub-sections are closely related in content and meaning, so they should be grouped together in the same chunk. Then you have the relevant text about tables within them. Tables should either be part of the section/sub-section chunk or, if separated, point back to the section chunk. Just thinking out loud. Thank you again. I look forward to more videos.
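To make the idea concrete, here is a minimal sketch of section-level chunking in which tables are stored separately but point back to their section chunk. It is only an illustration of the comment above, not the method used in the video: the `Element` and `Chunk` types, the `chunk_by_section` function, and the sample document are all hypothetical, and the sketch assumes the document has already been parsed into headings, text, and tables.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One parsed piece of a document (hypothetical input format)."""
    kind: str       # "heading", "text", or "table"
    text: str
    level: int = 0  # heading depth: 1 = section, 2 = sub-section


@dataclass
class Chunk:
    chunk_id: str
    kind: str                        # "section" or "table"
    text: str
    parent_id: Optional[str] = None  # set on table chunks only


def chunk_by_section(elements: List[Element]) -> List[Chunk]:
    """Group each section with its sub-sections; link tables back to it."""
    sections: List[Chunk] = []
    tables: List[Chunk] = []
    current: Optional[Chunk] = None
    for el in elements:
        starts_section = el.kind == "heading" and el.level == 1
        # Open a new section chunk on a top-level heading, or when the
        # document starts with content that has no heading at all.
        if starts_section or current is None:
            current = Chunk(f"sec-{len(sections) + 1}", "section", "")
            sections.append(current)
        if el.kind == "table":
            # Keep the table as its own chunk, pointing back to its section.
            n = sum(1 for t in tables if t.parent_id == current.chunk_id) + 1
            tables.append(Chunk(f"{current.chunk_id}-tbl-{n}", "table",
                                el.text, parent_id=current.chunk_id))
        else:
            # Sub-section headings and text stay inside the section chunk.
            current.text += el.text + "\n"
    return sections + tables


# Made-up sample document, for illustration only.
doc = [
    Element("heading", "1 Results", 1),
    Element("text", "Overview of results."),
    Element("heading", "1.1 Latency", 2),
    Element("text", "Latency is summarized in the table."),
    Element("table", "| model | ms |\n| a | 12 |"),
    Element("heading", "2 Discussion", 1),
    Element("text", "Closing remarks."),
]
chunks = chunk_by_section(doc)
for c in chunks:
    print(c.chunk_id, c.kind, c.parent_id)
```

At retrieval time, a hit on a table chunk could then be expanded by fetching the section whose `chunk_id` matches its `parent_id`, so the table is read alongside the text that explains it.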
Thanks for sharing, Lance! Do you have the input docs used for this testing? I'm curious how dependent the results are on the input dataset. A lot seems to rest on the assumption that page boundaries are a good place to split, but that seems like it would be the case for carefully hand-crafted documents. If you are just exporting a document to a PDF and a table spans page boundaries, I'm wondering how well the page-boundary and Ensemble methods work.
Illuminating 🙏🏼 Not sure what's in your eval set, but I wonder what the best strategy would be for docs like Confluence or Google Docs, where you're less likely to have the concept of pagination except for print layout. It's unpredictable where a table will be located when ingesting docs.
Thank you for sharing. Looking forward to more insightful videos from you guys. Thank you!
Awesome coverage - tables in PDFs are a huge pain point that I suspect will have a strong solution in the next 3-4 months.
Thank you. This is interesting. Could you please share the code for your benchmark as well?
Curious: does LangChain have or support HyDE for RAG?