Advanced RAG 02 - Parent Document Retriever
- Published 22 Jul 2024
- Colab: drp.li/gyYpV
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t... (updated)
github.com/samwit/llm-tutorials
00:00 Intro
00:52 Parent Document Retriever Diagram
04:56 Code time
#langchain #openai #llm - Science & Technology
I'm trying to get deeper into RAG - thank you for the video.
Brilliant video - this series is exactly what I have been after for improving RAG performance on large datasets!
I loved the series of videos on advanced RAG, so clear and insightful. Great job on bringing these tips and tricks from your professional knowledge and experience to learners like me. Thank you.
Congratulations on the videos related to RAG.
This second one, in particular, is exactly in line with the resources we are working on. It was extremely enlightening.
Once again, congratulations and thank you very much for your teachings.
Thank you Sam. This is very useful. Keep up the excellent work.
Great video about parent/child retriever👍. Indeed a nice addition to the "normal" rag retrieval. What I missed in Langchain is how to do RAG effectively for a "help desk question and answer conversation" that is normally stored in some database. For example a customer asks about a problem with his iphone and the agent replies with some answer. And this conversation could be going back and forth ie. cust->agent->cust->agent-> .... The same use case is doing RAG for FAQ information/database..
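Threads like the help-desk example above are often handled by flattening each back-and-forth into a single retrievable document with metadata, so the whole exchange is embedded together. A minimal sketch of that idea in plain Python - the function and field names here are illustrative, not a LangChain API:

```python
# Sketch: flattening a help-desk thread into one retrievable document.
# All names here are illustrative, not part of any library.

def thread_to_document(ticket_id, turns):
    """Join customer/agent turns into one text block, keeping roles visible."""
    body = "\n".join(f"{role.upper()}: {text}" for role, text in turns)
    return {
        "page_content": body,
        "metadata": {"ticket_id": ticket_id, "n_turns": len(turns)},
    }

turns = [
    ("customer", "My iPhone won't charge after the update."),
    ("agent", "Try a forced restart: hold volume down and the side button."),
    ("customer", "That worked, thanks!"),
]
doc = thread_to_document("T-1042", turns)
```

Each flattened thread can then be split and indexed like any other document, with `ticket_id` kept in the metadata so answers can cite the source conversation.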
I have been waiting for a series like this, on this topic. Thank you Sam🙏 One of the things I have struggled with the most is sourcing the documents and getting the LLM to specify exactly the sentence(s) that answer the question. That would be such a powerful feature to have.
Agreed. Any tips on citing the answers. Specifically if certain attributes like $, amount or interesting facts are mentioned in an answer, how can we validate them and cite their sources?
For anyone who works on retrieval of info from large documents, this video is like air. Because we really need to be smart enough to take this concept into account if we really want great results. @Sam, I am so glad I saw your videos!
Very useful! Thanks!
Hi Sam 👋I'm starting my learning path on LLM and RAG and your channel is helping me a "big chunk" with that. What metrics can we use to determine if a model is a "decent" one for RAG applications? Thanks a lot for all your content 🙌
I never knew this existed haha Going to dig into this for fun
👌👌👌 really good ideas
Thank you Sam
Thank you Sam for the great work ! Is the following video about MultiVectorRetriever? That would be awesome!
I literally spent the whole day thinking about and implementing this in vanilla Python without knowing the keywords, and this shows up at bedtime.
One more great video! Could you give any insight on how to use more complex prompts like the "Professor Synapse prompt" in langchain?
Is it possible to use RetrievalQAWithSourcesChain instead of the normal RetrievalQA chain? If so, how can I add memory and a prompt to it? Sorry if it's a beginner question, but it doesn't show up when I checked the different methods.
Thanks for this! I'm loving this series on Advanced RAG! Quick one: have you had a look at Semantic Kernel? I see it as a LangChain alternative, but I'm trying to decide what to go with for a RAG system for some work documents. I'm leaning more toward LangChain, as it seems they just have a lot of cool implementations. What do you think?
For working with a bunch of different documents, which model would you use for the most accurate answers? An OpenAI model like "text-embedding-ada-002" or HuggingFaceBGEEmbeddings - and if the latter, exactly which one?
Effective RAG is so critical and so powerful. This is a great tutorial to help with tools and workflow. Always top notch, thx! Have you got thoughts on how and where to store all the text data (i.e. raw text, queries, LLM responses)? Obviously scale is a consideration, so while in dev mode, maybe store text in JSON locally? Then import it into Python when needed for processing?
Very cool! I am confused though with pro tips 1 and 2 videos and fine tuning. Does this mean for local private docs we can skip fine tuning and explore the options provided in video 1 or 2?
Does your notebook run with m2 MacBook? Do you have any trial with m2 Mac? Thanks again!
Great tutorial! How can I save the "big_chunks_retriever"? Would you recommend pickle?
Hi Sam, I had a question regarding multimodal embeddings: how do I embed image-text pairs where the associated text is very long?
How do I use a parent document retriever with qdrant on langchain?
I love this topic. One question I have: doesn't the more or less "arbitrary" selection of 400 tokens mean that our parent document will always be split into child documents that are close by in the text? For example: say a document is 10k tokens long. Subject 1 is mainly in the first 2k tokens, and then summarized with some conclusions in the last 1k tokens. Subject 2 is in tokens 2-7k (also included in the summary), and then Subject 3 covers tokens 7-9k, etc. So, using the 400-token slice, maybe you'd have some "contamination" of one subject into another child document? Maybe those numbers are big enough that the overlap wouldn't be that bad, but you get the idea: what if you're trying to create embeddings that are topically focused, but you chop up the parent document just by adjacent tokens in a "window"? Maybe if you had some way to keep changing that window, even making it dynamically adjusted, to maximize some kind of relevance metric? That's not what recursive splitting does, is it?
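The contamination concern raised above is easy to demonstrate: a fixed-size child window that straddles a topic boundary picks up text from both subjects. A toy sketch, with plain string slicing standing in for a real splitter such as RecursiveCharacterTextSplitter:

```python
# Sketch: fixed-size child splitting can mix adjacent topics in one chunk.
# Purely illustrative; a real pipeline would use a text splitter library.

parent = ("Subject one covers battery chemistry in detail. " * 3 +
          "Subject two covers charging circuits instead. " * 3)

def split_fixed(text, size):
    """Slice text into fixed-size windows, like a naive child splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

children = split_fixed(parent, 120)
# The chunk that straddles the topic boundary mentions both subjects.
mixed = [c for c in children if "chemistry" in c and "circuits" in c]
```

Semantic chunking, or dynamically moving the window to maximize a relevance metric as suggested above, is exactly an attempt to avoid producing chunks like the one in `mixed`.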
Thanks!!!
Thanks for all. Could you add a direct link to your notebook ? Quite impossible to find it among all of them.
Won't we get a token-limit-exceeded error if we pass the larger chunks to the LLM for in-context learning? It might also be very expensive token-wise.
Can I use this retriever with azure cognitive search?
Thank you very much.
What is the optimal method for consulting similar documents when working with documents that are 10 to 20 pages long, and then summarizing all of them?
Or is it better to combine the chunks of an entire document and then summarize it with LangChain?
how to specify "k" in this case?
Amazing stuff, can you please open-source the scraper or link to the code? Thanks!
Do we require large language models (LLMs) trained on billions of parameters for RAG QA on custom knowledge bases? If so, which ones? If not, what models, other than OpenAI's, are good for fast and decent RAG applications?
Sam Witteveen, can you please help: how to save/load db (vectorstore) when using Parent Document retriever in langchain?
Another great tutorial, thanks! Can you check my understanding of this? This retriever takes one or more parent documents and creates child documents from the incoming "parent" documents. To contrast, with a simple example, let's say I have a document with a table of contents; the TOC is good for understanding the flow and structure of the document, but then each section identified in the TOC is in another document in the vector store. This ParentDocumentRetriever isn't appropriate or applicable for that use case. Is that correct?
please make a video on the web scraper 🙏🏻🙏🏻
how to specify "k" here??
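On the "k" question: as far as I can tell, ParentDocumentRetriever inherits a search_kwargs dict from MultiVectorRetriever, so passing something like `search_kwargs={"k": 5}` at construction should control how many child chunks are fetched - treat that as an assumption to verify against the current docs. Note that k applies to child chunks, and several children can map to the same parent, so you may get fewer than k documents back, as this toy sketch shows:

```python
# Sketch: k controls how many CHILD chunks are fetched; the retriever then
# maps them back to parents, so fewer than k documents can come out.
# Toy data; the scores stand in for vector similarity.

child_hits = [  # (similarity, child_text, parent_id), pre-scored
    (0.92, "child A1", "parent_A"),
    (0.90, "child A2", "parent_A"),
    (0.85, "child B1", "parent_B"),
    (0.70, "child C1", "parent_C"),
]
parents = {"parent_A": "full text A", "parent_B": "full text B",
           "parent_C": "full text C"}

def retrieve(hits, store, k):
    top = sorted(hits, reverse=True)[:k]   # top-k child chunks by score
    seen, out = set(), []
    for _, _, pid in top:                  # dedupe children to parent docs
        if pid not in seen:
            seen.add(pid)
            out.append(store[pid])
    return out

docs = retrieve(child_hits, parents, k=3)  # 3 child hits -> 2 parent docs
```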
Great video, Sam. If I have a lot of tables in my PDF, what's the best way to create the embeddings? The text splitter is not really good for table data.
The great thing about tables is that authors usually have one table or two per PDF page. I have had very reasonable results with 10Ks, by looking at the whole page.
P.s. you don't need vectors for table data, just headers, footers, row labels.
love your tutorials, also can you make video on fast api + langchain? deploying apps in production
So LC has some interesting things coming in this space, so I will get to it soon. What are the challenges you are having with the FastAPI etc?
Hi Sam. Is it possible to add large PDF files of textbooks into a RAG process like this? The content is technical - books on cytogenetic laboratory testing, 300-700 pages. Have you ever done something like this?
My background is in genetics, not computer science. Not great with python, but I write reports in SQL every day. Thanks!
When we do big chunks + little chunks, both are added to the vector store. In this case, what does the InMemoryStore do, and is it still required?
Hi Chris. Yes, you can store them in a separate vector store rather than in memory. With LlamaIndex you can do some fancier stuff as well.
@@samwitteveenai hi, can you provide an example? I'm stuck trying to do exactly this, because InMemoryStore is a BaseStore[str, Document] and not a VectorStore.
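For a persistent alternative to InMemoryStore, LangChain has a LocalFileStore that can be wrapped with create_kv_docstore to behave like a BaseStore[str, Document] - worth checking the current docs, as that detail is from memory. The underlying idea is just a key-to-parent-document store that survives restarts; here is the concept in plain Python with JSON on disk (FileDocstore is a made-up name, not a LangChain class):

```python
# Sketch: a persistent key -> parent-document store, filling the role that
# InMemoryStore plays for ParentDocumentRetriever. JSON-on-disk stands in
# for LangChain's LocalFileStore + create_kv_docstore.
import json
import os
import tempfile

class FileDocstore:
    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def mset(self, pairs):
        """Store (key, doc) pairs and flush to disk (BaseStore-style name)."""
        self._data.update(dict(pairs))
        with open(self.path, "w") as f:
            json.dump(self._data, f)

    def mget(self, keys):
        """Fetch docs by key; None for missing keys (BaseStore-style name)."""
        return [self._data.get(k) for k in keys]

path = os.path.join(tempfile.mkdtemp(), "docstore.json")
FileDocstore(path).mset([("doc-1", "full parent text")])
reloaded = FileDocstore(path)  # a fresh process would see the same data
```

Pair a store like this with a persisted vector store (e.g. Chroma with a persist directory) and both halves of the retriever survive a restart.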
So the secret sauce is just to slice and vectorize documents more carefully, into thinner slices, and when constructing the context for the LLM prompt, to grab some more surrounding text from the doc where the needle was found?
It certainly can help for a lot of tasks.
@@samwitteveenai that's for sure. I wonder why this simple but effective idea didn't come to their minds earlier :)
Hi Sam, thanks for your video - it really helped me understand RAG better. I want to know more about chunks. In our case, Chinese and English characters are blended together in different documents, and we try to split the docs using a certain size, like 512 tokens or 5,000 characters, but that makes the chunks really messy, which leads to inaccurate output. Any suggestions from your side? Or shall I ask the business stakeholders to help us split the docs into smaller chunks? Thanks.
Try a multilingual embedding model like bge-m3. can also look at semantic chunking.
@@samwitteveenai I will try it later. Thanks for your recommendation.
At the end, the parent document that a small chunk originated from is sent to the LLM as context for the answer. Is this correct?
Yes, for this one. In the videos after this I show some other ways you can deal with it.
@@samwitteveenai Thank you ..
Can you please let me know what embedding model to use for RAG on German text?
I would probably go with this for now: huggingface.co/intfloat/multilingual-e5-large - but keep an eye out, there is supposed to be a multilingual BGE coming as well.
Do you produce transcripts for the videos? With a video one can grasp a few ideas maybe, but to increase the value it would be good to have step-by-step instructions that can be read. You could use open-source models to transcribe.
I upload the subtitles which are pretty much a direct transcript. I am actually experimenting with getting a LLM to convert the video transcript to a blog post. I haven’t got it to a standard I am happy releasing but need to try a few more ideas. I totally get some people want to refer back to the text etc.
@@samwitteveenai I have been playing with data extraction from 10-K PDFs, using the CodeLlama 2 34B Phind flavor. It produces pretty clean JSON, but I am now running into problems: if I change anything trying to improve the output, it breaks something else. I tried multiple passes. It kind of works, but it's very slow running locally.
@@samwitteveenai Wouldn't an LLM writing a transcript from CC be a perfect project? 😎
from langchain.document_loaders import YoutubeLoader
I could have done something like this -
Preprocess each document to include the page number and doc name in the corpus for every page.
Train a universal encoder-decoder pipeline, TF probably.
Use nearest neighbor to get the top 5 matches.
Send it to the cheapest LLM I can find and instruct it to 'reply to the query and cite references in APA' or whatever
..
If I'm feeling fancy.. I would save these references and queries in a text file, because why not...as cache or something.
Boom..
What do you guys think?
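The pipeline sketched above - page metadata in the corpus, nearest-neighbour top matches, then a cite-your-sources prompt for the LLM - can be mocked up end to end. In this sketch, word overlap stands in for a trained encoder and no LLM is actually called; all names and data are illustrative:

```python
# Sketch of the DIY pipeline above: tag pages with metadata, score against a
# query, take the top matches, and build an LLM prompt that asks for
# references. Word overlap stands in for a real encoder.

pages = [
    {"doc": "manual.pdf", "page": 3, "text": "battery charging and power"},
    {"doc": "manual.pdf", "page": 7, "text": "screen brightness settings"},
    {"doc": "faq.pdf",    "page": 1, "text": "battery drains too fast"},
]

def score(query, text):
    """Toy relevance: count of shared words (stand-in for cosine similarity)."""
    return len(set(query.split()) & set(text.split()))

def top_matches(query, corpus, n=2):
    """Nearest-neighbour step: keep the n best-scoring pages."""
    return sorted(corpus, key=lambda p: score(query, p["text"]),
                  reverse=True)[:n]

def build_prompt(query, matches):
    """Inline each page with its source so the LLM can cite it."""
    ctx = "\n".join(f'[{m["doc"]} p.{m["page"]}] {m["text"]}' for m in matches)
    return f"Answer the query and cite sources in APA.\n{ctx}\nQuery: {query}"

matches = top_matches("battery power", pages)
prompt = build_prompt("battery power", matches)
```

Caching the (query, references) pairs to a text file, as the comment suggests, would just mean appending each `prompt`/answer pair after the LLM call.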
You might be interested in the DensePhrases work: Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (2019, arXiv:1906.05807), and Learning Dense Representations of Phrases at Scale (2021, arXiv:2012.12624)
please release the scraper 🙏🏻🙏🏻
I am watching these videos and I'm like why does his voice sound familiar. But I've worked it out.. Stewie from family guy!!! HAHA YOU CANT UNHEAR IT