Advanced RAG 02 - Parent Document Retriever
- Published 22 Jul 2024
- Colab: drp.li/gyYpV
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t... (updated)
github.com/samwit/llm-tutorials
00:00 Intro
00:52 Parent Document Retriever Diagram
04:56 Code time
#langchain #openai #llm - Science & Technology
I'm trying to get deeper into RAG - thank you for the video.
Brilliant video - this series is exactly what I have been after for improving RAG performance on large datasets!
I loved the series of videos on advanced RAG, so clear and insightful. Great job on bringing these tips and tricks from your professional knowledge and experience to learners like me. Thank you.
Congratulations on the videos related to RAG.
This second one, in particular, is exactly in line with the resources we are working on. It was extremely enlightening.
Once again, congratulations and thank you very much for your teachings.
Thank you Sam. This is very useful. Keep up the excellent work.
Great video about parent/child retriever👍. Indeed a nice addition to the "normal" rag retrieval. What I missed in Langchain is how to do RAG effectively for a "help desk question and answer conversation" that is normally stored in some database. For example a customer asks about a problem with his iphone and the agent replies with some answer. And this conversation could be going back and forth ie. cust->agent->cust->agent-> .... The same use case is doing RAG for FAQ information/database..
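Threads like the help-desk example above are often handled by flattening each back-and-forth into a single retrievable document with metadata, so the whole exchange is embedded together. A minimal sketch of that idea in plain Python - the function and field names here are illustrative, not a LangChain API:

```python
# Sketch: flattening a help-desk thread into one retrievable document.
# All names here are illustrative, not part of any library.

def thread_to_document(ticket_id, turns):
    """Join customer/agent turns into one text block, keeping roles visible."""
    body = "\n".join(f"{role.upper()}: {text}" for role, text in turns)
    return {
        "page_content": body,
        "metadata": {"ticket_id": ticket_id, "n_turns": len(turns)},
    }

turns = [
    ("customer", "My iPhone won't charge after the update."),
    ("agent", "Try a forced restart: hold volume down and the side button."),
    ("customer", "That worked, thanks!"),
]
doc = thread_to_document("T-1042", turns)
```

Each flattened thread can then be split and indexed like any other document, with `ticket_id` kept in the metadata so answers can cite the source conversation.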
I have been waiting for a series like this, on this topic. Thank you Sam🙏 One of the things I have struggled with the most is sourcing the documents and getting the LLM to specify exactly the sentence(s) that answer the question. That would be such a powerful feature to have.
Agreed. Any tips on citing the answers. Specifically if certain attributes like $, amount or interesting facts are mentioned in an answer, how can we validate them and cite their sources?
For anyone who works on retrieval of info from large documents, this video is like air. Because we really need to be smart enough to take this concept into account if we really want great results. @Sam, I am so glad I saw your videos!
Very useful! Thanks!
Hi Sam 👋I'm starting my learning path on LLM and RAG and your channel is helping me a "big chunk" with that. What metrics can we use to determine if a model is a "decent" one for RAG applications? Thanks a lot for all your content 🙌
I never knew this existed haha Going to dig into this for fun
👌👌👌 really good ideas
Thank you Sam
Thank you Sam for the great work ! Is the following video about MultiVectorRetriever? That would be awesome!
I literally spent the whole day thinking about and implementing this in vanilla Python without knowing the keywords, and this shows up at bedtime.
One more great video! Could you give any insight on how to use more complex prompts like the "Professor Synapse prompt" in langchain?
Is it possible to use RetrievalQAWithSourcesChain instead of the normal RetrievalQA chain? If so, how can I add memory and a prompt to it? Sorry if it's a beginner question, but it doesn't show up when I checked the different methods.
Thanks for this! I'm loving this series on Advanced RAG! Quick one: have you had a look at Semantic Kernel? I see it as a LangChain alternative, but I'm trying to decide what to go with for a RAG system for some work documents. I'm leaning more toward LangChain, as it seems they just have a lot of cool implementations. What do you think?
For working with a bunch of different documents, which model would you use for the most accurate answers? An OpenAI model like "text-embedding-ada-002" or HuggingFaceBGEEmbeddings - and if the latter, exactly which one?
Effective RAG is so critical and so powerful. This is a great tutorial to help with tools and workflow. Always top notch, thx! Have you got thoughts on how and where to store all the text data (i.e. raw text, queries, LLM responses)? Obviously scale is a consideration, so while in dev mode, maybe store text in JSON locally? Then import it into Python when needed for processing?
Very cool! I am confused though with pro tips 1 and 2 videos and fine tuning. Does this mean for local private docs we can skip fine tuning and explore the options provided in video 1 or 2?
Does your notebook run with m2 MacBook? Do you have any trial with m2 Mac? Thanks again!
Great tutorial! How can I save the "big_chunks_retriever"? Would you recommend pickle?
Hi Sam, I had a question regarding multimodal embeddings: how do I embed image-text pairs where the associated text is very long?
How do I use a parent document retriever with qdrant on langchain?
I love this topic. One question I have: doesn't the more or less "arbitrary" selection of 400 tokens mean that our parent document will always be split into child documents that are close by in the text? For example: say a document is 10k tokens long. Subject 1 is mainly in the first 2k tokens, and then summarized with some conclusions in the last 1k tokens. Subject 2 is in tokens 2-7k (also included in the summary), and then Subject 3 covers tokens 7-9k, etc. So, using the 400-token slice, maybe you'd have some "contamination" of one subject into another child document? Maybe those numbers are big enough that the overlap wouldn't be that bad, but you get the idea: what if you're trying to create embeddings that are topically focused, but you chop up the parent document just by adjacent tokens in a "window"? Maybe if you had some way to keep changing that window, even making it dynamically adjusted, to maximize some kind of relevance metric? That's not what recursive splitting does, is it?
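The contamination concern raised above is easy to demonstrate: a fixed-size child window that straddles a topic boundary picks up text from both subjects. A toy sketch, with plain string slicing standing in for a real splitter such as RecursiveCharacterTextSplitter:

```python
# Sketch: fixed-size child splitting can mix adjacent topics in one chunk.
# Purely illustrative; a real pipeline would use a text splitter library.

parent = ("Subject one covers battery chemistry in detail. " * 3 +
          "Subject two covers charging circuits instead. " * 3)

def split_fixed(text, size):
    """Slice text into fixed-size windows, like a naive child splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

children = split_fixed(parent, 120)
# The chunk that straddles the topic boundary mentions both subjects.
mixed = [c for c in children if "chemistry" in c and "circuits" in c]
```

Semantic chunking, or dynamically moving the window to maximize a relevance metric as suggested above, is exactly an attempt to avoid producing chunks like the one in `mixed`.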
Thanks!!!
Thanks for all. Could you add a direct link to your notebook ? Quite impossible to find it among all of them.
Won't we get a token-limit-exceeded error if we pass the larger chunks to the LLM for in-context learning? It might also be very expensive token-wise.
Can I use this retriever with azure cognitive search?
Thank you very much.
What is the optimal method for consulting similar documents when working with documents that are 10 to 20 pages long, and then summarizing all of them?
Or is it better to combine the chunks of an entire document and then summarize it with LangChain?
how to specify "k" in this case?
Amazing stuff, can you please open-source the scraper or link to the code? Thanks!
Do we require large language models (LLMs) trained on billions of parameters for RAG QA on custom knowledge bases? If so, which ones? If not, what models, other than OpenAI's, are good for fast and decent RAG applications?
Sam Witteveen, can you please help: how to save/load db (vectorstore) when using Parent Document retriever in langchain?
Another great tutorial, thanks! Can you check my understanding of this? This retriever takes one or more parent documents and creates child documents from the incoming "parent" documents. To contrast, with a simple example, let's say I have a document with a table of contents; the TOC is good for understanding the flow and structure of the document, but then each section identified in the TOC is in another document in the vector store. This ParentDocumentRetriever isn't appropriate or applicable for that use case. Is that correct?
please make a video on the web scraper 🙏🏻🙏🏻
how to specify "k" here??
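On the "k" question: as far as I can tell, ParentDocumentRetriever inherits a search_kwargs dict from MultiVectorRetriever, so passing something like `search_kwargs={"k": 5}` at construction should control how many child chunks are fetched - treat that as an assumption to verify against the current docs. Note that k applies to child chunks, and several children can map to the same parent, so you may get fewer than k documents back, as this toy sketch shows:

```python
# Sketch: k controls how many CHILD chunks are fetched; the retriever then
# maps them back to parents, so fewer than k documents can come out.
# Toy data; the scores stand in for vector similarity.

child_hits = [  # (similarity, child_text, parent_id), pre-scored
    (0.92, "child A1", "parent_A"),
    (0.90, "child A2", "parent_A"),
    (0.85, "child B1", "parent_B"),
    (0.70, "child C1", "parent_C"),
]
parents = {"parent_A": "full text A", "parent_B": "full text B",
           "parent_C": "full text C"}

def retrieve(hits, store, k):
    top = sorted(hits, reverse=True)[:k]   # top-k child chunks by score
    seen, out = set(), []
    for _, _, pid in top:                  # dedupe children to parent docs
        if pid not in seen:
            seen.add(pid)
            out.append(store[pid])
    return out

docs = retrieve(child_hits, parents, k=3)  # 3 child hits -> 2 parent docs
```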
Great video, Sam. If I have a lot of tables in my PDF, what's the best way to create the embeddings? The text splitter is not really good for table data.
The great thing about tables is that authors usually have one table or two per PDF page. I have had very reasonable results with 10Ks, by looking at the whole page.
P.s. you don't need vectors for table data, just headers, footers, row labels.
love your tutorials, also can you make video on fast api + langchain? deploying apps in production
So LC has some interesting things coming in this space, so I will get to it soon. What are the challenges you are having with the FastAPI etc?
Hi Sam. Is it possible to add large PDF files of textbooks into a RAG process like this? The content is technical - books on cytogenetic laboratory testing, 300-700 pages. Have you ever done something like this?
My background is in genetics, not computer science. Not great with python, but I write reports in SQL every day. Thanks!
When we do big chunks + little chunks, both are added to the vector store. In this case, what does the InMemoryStore do, and is it still required?
Hi Chris. Yes, you can store them in a separate vector store rather than in memory. With LlamaIndex you can do some fancier stuff as well.
@@samwitteveenai hi, can you provide an example? I'm stuck trying to do exactly this, because InMemoryStore is a BaseStore[str, Document] and not a VectorStore.
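For a persistent alternative to InMemoryStore, LangChain has a LocalFileStore that can be wrapped with create_kv_docstore to behave like a BaseStore[str, Document] - worth checking the current docs, as that detail is from memory. The underlying idea is just a key-to-parent-document store that survives restarts; here is the concept in plain Python with JSON on disk (FileDocstore is a made-up name, not a LangChain class):

```python
# Sketch: a persistent key -> parent-document store, filling the role that
# InMemoryStore plays for ParentDocumentRetriever. JSON-on-disk stands in
# for LangChain's LocalFileStore + create_kv_docstore.
import json
import os
import tempfile

class FileDocstore:
    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def mset(self, pairs):
        """Store (key, doc) pairs and flush to disk (BaseStore-style name)."""
        self._data.update(dict(pairs))
        with open(self.path, "w") as f:
            json.dump(self._data, f)

    def mget(self, keys):
        """Fetch docs by key; None for missing keys (BaseStore-style name)."""
        return [self._data.get(k) for k in keys]

path = os.path.join(tempfile.mkdtemp(), "docstore.json")
FileDocstore(path).mset([("doc-1", "full parent text")])
reloaded = FileDocstore(path)  # a fresh process would see the same data
```

Pair a store like this with a persisted vector store (e.g. Chroma with a persist directory) and both halves of the retriever survive a restart.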
So the secret sauce is just to slice and vectorize documents more carefully, into thinner slices, and when constructing the context for the LLM prompt, to grab some more surrounding text from the doc where the needle was found?
It certainly can help for a lot of tasks.
@@samwitteveenai that's for sure. I wonder why this simple but effective idea didn't come to their minds earlier :)
Hi Sam, thanks for your video - it really helped me understand RAG better. I want to know more about chunks. In our case, Chinese and English characters are blended together in different documents, and we try to split the docs using a certain size, like 512 tokens or 5,000 characters, but that makes the chunks really messy, which leads to inaccurate output. Any suggestions from your side? Or shall I ask the business stakeholders to help us split the docs into smaller chunks? Thanks.
Try a multilingual embedding model like bge-m3. can also look at semantic chunking.
@@samwitteveenai I will try it later. Thanks for your recommendation.
At the end, the parent document that a small chunk originated from is sent to the LLM as context for the answer. Is this correct?
Yes, for this one. In the videos after this I show some other ways you can deal with it.
@@samwitteveenai Thank you ..
Can you please let me know what embedding model to use for RAG on German text?
I would probably go with this for now: huggingface.co/intfloat/multilingual-e5-large - but keep an eye out, there is supposed to be a multilingual BGE coming as well.
Do you produce transcripts for the videos? With a video one can grasp a few ideas maybe, but to increase the value it would be good to have step-by-step instructions that can be read. You could use open-source models to transcribe.
I upload the subtitles which are pretty much a direct transcript. I am actually experimenting with getting a LLM to convert the video transcript to a blog post. I haven’t got it to a standard I am happy releasing but need to try a few more ideas. I totally get some people want to refer back to the text etc.
@@samwitteveenai I have been playing with data extraction from 10-K PDFs, using the CodeLlama 2 34B Phind flavor. It produces pretty clean JSON, but I am now running into problems: if I change anything trying to improve the output, it breaks something else. I tried multiple passes. It kind of works, but it's very slow running locally.
@@samwitteveenai Wouldn't an LLM writing a transcript from CC be a perfect project? 😎
from langchain.document_loaders import YoutubeLoader
I could have done something like this -
Preprocess each document to include the page number and doc name in the corpus for every page.
Train a universal encoder-decoder pipeline, TF probably.
Use nearest neighbor to get the top 5 matches.
Send it to the cheapest LLM I can find and instruct it to 'reply to the query and cite references in APA' or whatever
..
If I'm feeling fancy.. I would save these references and queries in a text file, because why not...as cache or something.
Boom..
What do you guys think?
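The pipeline sketched above - page metadata in the corpus, nearest-neighbour top matches, then a cite-your-sources prompt for the LLM - can be mocked up end to end. In this sketch, word overlap stands in for a trained encoder and no LLM is actually called; all names and data are illustrative:

```python
# Sketch of the DIY pipeline above: tag pages with metadata, score against a
# query, take the top matches, and build an LLM prompt that asks for
# references. Word overlap stands in for a real encoder.

pages = [
    {"doc": "manual.pdf", "page": 3, "text": "battery charging and power"},
    {"doc": "manual.pdf", "page": 7, "text": "screen brightness settings"},
    {"doc": "faq.pdf",    "page": 1, "text": "battery drains too fast"},
]

def score(query, text):
    """Toy relevance: count of shared words (stand-in for cosine similarity)."""
    return len(set(query.split()) & set(text.split()))

def top_matches(query, corpus, n=2):
    """Nearest-neighbour step: keep the n best-scoring pages."""
    return sorted(corpus, key=lambda p: score(query, p["text"]),
                  reverse=True)[:n]

def build_prompt(query, matches):
    """Inline each page with its source so the LLM can cite it."""
    ctx = "\n".join(f'[{m["doc"]} p.{m["page"]}] {m["text"]}' for m in matches)
    return f"Answer the query and cite sources in APA.\n{ctx}\nQuery: {query}"

matches = top_matches("battery power", pages)
prompt = build_prompt("battery power", matches)
```

Caching the (query, references) pairs to a text file, as the comment suggests, would just mean appending each `prompt`/answer pair after the LLM call.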
You might be interested in the DensePhrases work: Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (2019, arXiv:1906.05807), and Learning Dense Representations of Phrases at Scale (2021, arXiv:2012.12624)
please release the scraper 🙏🏻🙏🏻
I am watching these videos and I'm like why does his voice sound familiar. But I've worked it out.. Stewie from family guy!!! HAHA YOU CANT UNHEAR IT