Consider these actionable insights from the video:
1. Understand the power of context in search queries and how it enhances accuracy in Retrieval Augmented Generation (RAG).
2. Experiment with different chunking strategies for your data when building your RAG system.
3. Explore and utilize embedding models like Gemini and Voyage for transforming text into numerical representations.
4. Combine embedding models with BM25, a ranking function, to improve ranking and retrieval processes.
5. Implement contextual retrieval by adding context to data chunks using Large Language Models (LLMs); a minimal sketch of this setup follows the list below.
6. Analyze the cost and benefits of using contextual retrieval, considering factors like processing power and latency.
7. Optimize your RAG system by experimenting with reranking during inference to fine-tune retrieval results.
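Not from the video itself, but here is a minimal sketch of the pipeline those points describe, assuming the rank_bm25 and sentence-transformers packages (swap in Gemini or Voyage embeddings for the local model as needed) and placeholder context lines that an LLM would normally generate per chunk:

```python
# Hypothetical sketch: hybrid retrieval over contextualized chunks.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

chunks = [
    "Revenue grew 3% over the previous quarter.",
    "The company launched two new products.",
]
# Placeholder context lines that an LLM would produce for each chunk.
contexts = [
    "From ACME Corp's Q2 2023 filing, financial results section.",
    "From ACME Corp's Q2 2023 filing, product updates section.",
]
contextual_chunks = [f"{c} {t}" for c, t in zip(contexts, chunks)]

# Sparse index (BM25) over the contextualized text.
bm25 = BM25Okapi([doc.lower().split() for doc in contextual_chunks])

# Dense index: any embedding model works here; a small local
# sentence-transformers model keeps the sketch self-contained.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(contextual_chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    """Fuse BM25 and embedding rankings with reciprocal rank fusion."""
    bm25_scores = bm25.get_scores(query.lower().split())
    dense_scores = chunk_vecs @ embedder.encode(query, normalize_embeddings=True)
    fused = np.zeros(len(contextual_chunks))
    for scores in (bm25_scores, dense_scores):
        for rank, idx in enumerate(np.argsort(scores)[::-1]):
            fused[idx] += 1.0 / (60 + rank + 1)  # k=60 is the usual RRF constant
    return [contextual_chunks[i] for i in np.argsort(fused)[::-1][:k]]

print(retrieve("ACME quarterly revenue growth"))
```

Reciprocal rank fusion is only one way to merge the two rankings; a reranker over the fused top-k during inference (point 7) is the usual next step.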
So the LLM is the Achilles' heel of the whole process. If it messes up the context, everything goes south immediately! But if it works well by default, it will enhance the final results.
As you said, it's really costly, like graph vector DBs, and high maintenance. A classic (sparse + dense) retriever plus a sparse reranker should simply do a good job, especially considering most of the new SOTA models have a larger context window.
Mate, I've been trying to understand RAG for ages, non coder here obviously, but your explanation was brilliant. Thank you
You can create the contextual tag locally using ollama.
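For anyone curious, a rough sketch of that idea (not from the video; the model name, prompt wording and default endpoint are assumptions) using Ollama's local REST API:

```python
# Hypothetical sketch: generate a contextual tag for each chunk with a
# local model served by Ollama (default endpoint http://localhost:11434).
import requests

def contextualize(document: str, chunk: str, model: str = "llama3") -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n\n"
        "Write one or two sentences situating this chunk within the overall "
        "document, to improve search retrieval. Answer with the context only."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

# The tag is then prepended to the chunk before embedding/indexing:
# indexed_text = contextualize(doc, chunk) + " " + chunk
```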
Great Video
@@MhemanthRachaboyina thank you
I've been working on something quite similar over the last few months for a corpus of documents that are in a tree hierarchy to increase accuracy. Seems it was not a bad idea after all 😁
Thanks for the update. 👍 We see a lot of different techniques to improve RAG, and the additional quality improvements are not that big while the costs are much higher (more tokens) and the inference time goes up... Agreed that for most use cases it's not worth the effort and money.
Honestly, that few percent improvement is not worth it for most cases...
This is really interesting and I think, intuitively, it will help me with my project. Thank you very much.
Thank you!
Feeding in the whole document text to add a few lines of context for each chunk seems way too much for too little benefit. Instead we would need a better embedding model to enhance retrieval without any of the overhead.
And companies will be interested in chunking, embedding and indexing proprietary documents only once in their lifetime. They can't reindex the whole archive every time a new improvement is released.
Is it really worth all the noise and having a new name for it and all? This is an idea that many developers have already been using. I mean, anyone who thinks a little bit naturally realizes that adding a little description of what the chunk is about in relation to the rest of the document helps, and would have done it automatically :D Myself and many others have been doing it for very obvious reasons.. I just didn't know I had to give it a name and publish it as a technique.. this LLM BS taught me one thing, and that is: put a name on any trivial idea and you are now an inventor.
Honestly, that's one thing I actually mentioned in the video: whether such improvements are something you need.
Yes, actually there are many more techniques like this which offer a similar percentage of improvement, and none of them are worth it. Basic RAG is still enough for now.
Excellent video and insights!
Glad you enjoyed it!
How do you generate the context for chunks without giving the LLM sufficient information about the chunk? How are they getting the information about the revenue in that example?
That is from the entire document
@@1littlecoder Then it will be very costly, as the entire document is being fed into the LLM. And what about the LLM's token limit if I have a significantly large document?
@@souvickdas5564 This technique is golden for locally run LLMs. It's free.
I have the same doubt. Please let us know if there's clarity.
I was really caught off guard when you said '....large human being' 😂😂
I just rewatched it 🤣
Unfortunately, large humans are extinct! [or maybe left planet Earth.]
🤣🤣🤣🤣🤣🤣
I was experimenting with this and it's really amazing. But too simple an approach 😅😅
The beauty is how simple it is :D
@@1littlecoder keeping it simple always works
Thanks 😅
To generate context, do we need to pass all documents? How will we address the token limit?
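That question isn't answered in the thread, but one common workaround (an assumption on my part, not something from the video) is to pass only the chunk's own parent document, and if even that single document exceeds the context limit, fall back to a window of text around the chunk:

```python
# Hypothetical fallback: if the parent document is too long for the model,
# use only a window of text around the chunk as the "document" for context.
def context_source(document: str, chunk: str, max_chars: int = 20_000) -> str:
    if len(document) <= max_chars:
        return document
    start = max(document.find(chunk), 0)  # -1 (not found) falls back to 0
    half = max_chars // 2
    lo = max(start - half, 0)
    hi = min(start + len(chunk) + half, len(document))
    return document[lo:hi]
```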
It would be great if they could just build this into their platform, like OpenAI has with their agents.
Wait, wouldn't it be more efficient for the LLM, rather than creating a context, to use that compute to create a new chunk that puts together two previous chunks (e.g. chunk 1 + chunk x) based on context? And rather than going down the route of "let's try to aid the LLM to find the right chunk for the user request by maximizing attention to that one particular chunk", go down the route of "let's try to aid the LLM [..] by maximizing the probability of finding the right node in a net of higher-percentage possibilities"?
Isn't it an agentic chunking strategy??
Smart chunks 🎉
Someone's going to steal this name for a new RAG technique :)
Is it something similar to what Google calls context caching?
No, context caching is basically on top of it. Thanks for the reminder. I should probably make a separate video on it.
@@1littlecoder Oh nice, perfect
Thank you for such insights and the simple explanation
I think the reason why Anthropic introduced this technique is because they have the CACHING!!!
Easy upsell 👀
@@1littlecoder As far as I know, if you use the prompt caching feature to store all your documents, such as your company documents, it would greatly reduce the cost, particularly the input token cost, as the {{WHOLE DOCUMENT}} is retrieved from the cache. Am I right?
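For reference, my reading of Anthropic's prompt caching (treat the details as assumptions, and the model and prompt as placeholders): the whole document is marked as a cacheable prefix, so calls for subsequent chunks of the same document hit the cache and pay the reduced cached-input rate rather than the full input price.

```python
# Hypothetical sketch: contextualize chunks with Anthropic prompt caching.
# The document block carries cache_control, so repeated calls for other
# chunks of the same document reuse the cached prefix (cheaper input tokens).
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def situate_chunk(document_text: str, chunk_text: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # placeholder model choice
        max_tokens=200,
        system=[
            {"type": "text",
             "text": "You situate document chunks for search retrieval."},
            {"type": "text",
             "text": f"<document>\n{document_text}\n</document>",
             "cache_control": {"type": "ephemeral"}},  # cache the whole document
        ],
        messages=[{
            "role": "user",
            "content": (
                f"Here is a chunk from the document above:\n"
                f"<chunk>\n{chunk_text}\n</chunk>\n"
                "Give a short context situating this chunk within the document. "
                "Answer with the context only."
            ),
        }],
    )
    return response.content[0].text
```

Note the cache has a short lifetime, so chunks of the same document should be processed back to back to actually benefit from it.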
They could have used something similar to LLMLingua on each chunk, then passed it to a smaller model for deriving context, as it is a very specific task and does not demand a huge model. This way cost can be controlled and quality can be enhanced. Also, they could add a model router rather than using a predefined model; this router can choose the model based on the information the corpus has. There are many patterns which can enhance this RAG pipeline. This just seems very lazy.
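A rough sketch of that suggestion, assuming LLMLingua's PromptCompressor interface (the exact arguments and default model are assumptions); the compressed document would then be handed to a smaller model, e.g. the Ollama helper sketched earlier:

```python
# Hypothetical sketch: compress the document with LLMLingua before asking a
# smaller model to derive per-chunk context, keeping token costs down.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads LLMLingua's default compression model

def compress_document(document_text: str, target_tokens: int = 500) -> str:
    result = compressor.compress_prompt(
        document_text,
        instruction="",
        question="",
        target_token=target_tokens,
    )
    return result["compressed_prompt"]

# compressed = compress_document(long_document)
# context = contextualize(compressed, chunk)  # e.g. the Ollama helper above
```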
Have been doing this since way back, and much more.
Your content is really good, but I've noticed that you tend to speak very quickly, almost as if you're holding your breath. Is there a reason for this? I feel that a slower, calmer pace would make the information easier to absorb and more enjoyable to follow. It sometimes feels like you're rushing, and I believe a more relaxed delivery would enhance your already great work. Please understand this is meant as constructive feedback, not a criticism. I'm just offering a suggestion to help make your content even better.
Thank you for the feedback. I understand. I naturally speak very fast, so typically I have to slow down. I'll try to do that more diligently.
❤🫡
This is the guy who called o1 preview overhyped. 🤭
Did I?
He never said that. He said o1 is just glorified chain of thought, and that's actually true.
I tried another stupidly simple approach.
Create a QA dataset with an LLM.
Find the nearest question and provide its answer.
Surprisingly, it also works really great 😅😅😅
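A minimal sketch of that approach (not from the video; the sentence-transformers model and the sample QA pairs are placeholders): an LLM turns each chunk into question/answer pairs offline, the questions get embedded, and at query time the answer attached to the nearest question is returned.

```python
# Hypothetical sketch: QA-style retrieval. Embed LLM-generated questions and
# answer a user query with the answer of the most similar stored question.
from sentence_transformers import SentenceTransformer
import numpy as np

# Offline step (not shown): an LLM generates QA pairs like these per chunk.
qa_pairs = [
    ("What was the revenue growth in Q2 2023?",
     "Revenue grew 3% over the previous quarter."),
    ("How many products were launched in Q2 2023?",
     "The company launched two new products."),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
question_vecs = embedder.encode([q for q, _ in qa_pairs],
                                normalize_embeddings=True)

def answer(query: str) -> str:
    """Return the stored answer of the question closest to the query."""
    query_vec = embedder.encode(query, normalize_embeddings=True)
    best = int(np.argmax(question_vecs @ query_vec))
    return qa_pairs[best][1]

print(answer("How much did revenue grow last quarter?"))
```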
Here you go. You just invented a new RAG technique 😉
This is actually surprisingly good for RAG on expert/narrow domains! I did the same thing for a bot on web accessibility rules, and it worked perfect AF
@@arashputata Which method?
@@1littlecoder Also, you can later use the data to fine-tune 😅😅
Yeah, that is my not-so-secret weapon too 😂