I am super stoked about this. Soooo many AI content channels just regurgitate Anthropic's article with no original content. Would love someone to actually build an example and show a comparison of how standard RAG worked for them vs. this new method.
So much to learn... every day! Thanks for providing great content! You are one of my daily learning resources. Keep kicking AI!
Thanks. Smile.
Love it, keep the humor in these videos, it's beautiful 😂
I like your sense of humour. I *giggled* when you spoke about chickens and eagles and categorised both as birds. Chickens have not yet formally been added to the bird category… maybe due to their limited flying capacity 😂
Some birds refused the AI toolbox, I guess.
What about building a RAG system with contextual retrieval and open-source models like Llama?
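Nothing in the recipe is tied to Claude, so an open-source stack should work in principle. A minimal sketch of the contextualisation and embedding steps using local models through Ollama; the model names and prompt wording here are illustrative assumptions, not from the video:

```python
# Sketch: contextual retrieval's "situate the chunk" step with a local
# open-source model via Ollama. Assumes `pip install ollama` and that the
# `llama3.1` and `nomic-embed-text` models have been pulled locally.
import ollama

def situate_chunk(document: str, chunk: str) -> str:
    """Ask the local LLM for a short context blurb that situates the chunk in the document."""
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n\n"
        "Give a short, succinct context that situates this chunk within the overall "
        "document, to improve search retrieval of the chunk. Answer with the context only."
    )
    response = ollama.chat(model="llama3.1", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"].strip()

def embed_contextualised(document: str, chunk: str) -> list[float]:
    """Prepend the generated context to the chunk before embedding it."""
    contextual_chunk = situate_chunk(document, chunk) + "\n\n" + chunk
    return ollama.embeddings(model="nomic-embed-text", prompt=contextual_chunk)["embedding"]
```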
So... we've been doing this for MONTHS: store the chunk without implicit references to avoid false in-context-learning generations, and rewrite it with recursive summarisation until the full document fits in the context window.
Didn't know that could justify a published paper... 😅 To us, it's just common sense and some implementation details.
Anyone agree or are we secretly geniuses? 😂
Great video BTW
Late Chunking has solved the Chunking Context problem.
Please indicate each of your jokes with a clear label, like joke::
12:33 Give the whole document?! But at the beginning, 0:40, they say that if the knowledge base is under 200K tokens it's better to send it within the prompt (so no RAG is used). 🤔
So what happens in that "situate_context" code if the document is big, >200K tokens?
Exactly, that part was murky. I suppose those would be larger chunks, not whole documents, e.g. the target chunk + a few surrounding chunks, for a larger context.
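One way to make that concrete: for documents that exceed the context window, situate each chunk against a window of neighbouring chunks (or a per-section summary) instead of the full document. A rough sketch of that workaround, where `situate` stands in for whatever LLM call generates the context blurb and the window size is an arbitrary assumption:

```python
# Sketch: contextualise each chunk of an over-long document using only its
# local neighbourhood, since the whole document will not fit in the prompt.
from typing import Callable

def windowed_neighbourhood(chunks: list[str], idx: int, window: int = 10) -> str:
    """Return the target chunk plus `window` chunks on each side, joined as a pseudo-document."""
    start = max(0, idx - window)
    end = min(len(chunks), idx + window + 1)
    return "\n\n".join(chunks[start:end])

def situate_long_document(chunks: list[str],
                          situate: Callable[[str, str], str],
                          window: int = 10) -> list[str]:
    """For each chunk, generate a context blurb from its neighbourhood and prepend it."""
    out = []
    for i, chunk in enumerate(chunks):
        blurb = situate(windowed_neighbourhood(chunks, i, window), chunk)  # LLM call
        out.append(blurb + "\n\n" + chunk)
    return out
```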
It is so easy to find the answer. Just upload a 500-page document to Claude and see what happens. You can experience AI yourself! Trust yourself.
How does this compare to Jina's late-chunking approach for contextual understanding?
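For reference, late chunking goes in the opposite direction: instead of asking an LLM to write context for each chunk, it embeds the whole document once with a long-context embedding model and then pools the token embeddings per chunk span, so every chunk vector already "sees" the rest of the document. A minimal sketch, assuming a long-context embedder available through Hugging Face transformers; the Jina model name and the character-offset chunk spans are illustrative assumptions:

```python
# Late-chunking sketch: one forward pass over the full document, then
# mean-pool the token embeddings that fall inside each chunk's character span.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-embeddings-v2-base-en"  # example model; needs trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

def late_chunk(document: str, chunk_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Return one pooled embedding per (start, end) character span of the document."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True, max_length=8192)
    offsets = enc.pop("offset_mapping")[0]              # (num_tokens, 2) character offsets
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # full-document token embeddings
    pooled = []
    for start, end in chunk_spans:
        # select real tokens inside the span (skip zero-length special tokens)
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        pooled.append(token_embs[mask].mean(dim=0))
    return pooled
```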
How about storing a hierarchical order of the chunks (e.g. by paragraph) as metadata alongside the embedding vectors? In addition, you can ask the LLM for the most important words in the prompt, such as names and other entities, search for those words in the chunk texts, and again use the hierarchical order of the chunks to pull in the neighbouring contextual chunks.
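That is essentially vector retrieval plus structural metadata plus entity keyword matching. A small sketch of the data layout that idea implies; the dataclass fields and helper names are my own, just to illustrate:

```python
# Sketch: store each chunk with hierarchical metadata (section path, paragraph
# index), then expand any retrieved hit to its neighbours and also pull chunks
# that contain entity keywords extracted from the user query.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section: str          # e.g. "2. Methods > 2.1 Setup"
    paragraph_idx: int    # position within the section

def neighbours(chunks: list[Chunk], hit: Chunk, radius: int = 1) -> list[Chunk]:
    """Chunks from the same section within `radius` paragraphs of a retrieved hit."""
    return [c for c in chunks
            if c.section == hit.section
            and abs(c.paragraph_idx - hit.paragraph_idx) <= radius]

def keyword_hits(chunks: list[Chunk], entities: list[str]) -> list[Chunk]:
    """Chunks whose text mentions any entity the LLM extracted from the query."""
    lowered = [e.lower() for e in entities]
    return [c for c in chunks if any(e in c.text.lower() for e in lowered)]
```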
I have a question: if you need to load the entire document into the prompt, does that mean that Contextual RAG doesn't work for situations where the document has more than 200k tokens?
In a way, it seems that this solution undermines the main principle of RAG, which is to fragment the content so it 'fits' into the prompt.
Is this not also very useful if you use a very long system prompt?
I have a system prompt that is a couple of pages long, telling the AI what our coding conventions are. The idea is that instead of giving the model a bunch of code and having it try to guess the rules behind the code, we tell it. At least in my small amount of testing, this has worked quite well.
I would also like to see and/or work on an open-source implementation. If anyone has a resource, or @Discover AI would like to work on it, it would be appreciated. For the process described, why are the whole document and an individual chunk fed to the LLM each time? Couldn't you just feed the document once along with, for example, a batch of 100 chunks (assuming it fits in the context window)? Then the LLM could produce a batch of contextualised chunks, rather than calling it so many times.
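On the batching question: nothing technically prevents sending the document once together with many chunks and asking for one context per chunk; the open question is whether quality holds up versus per-chunk calls, and prompt caching already removes most of the cost of repeating the document. A hedged sketch of what a batched call could look like with the Anthropic SDK, where the prompt wording, batch size, and JSON output format are all assumptions:

```python
# Sketch: one call that contextualises a batch of chunks against the document.
# The batch must fit the context window AND the output token limit, and the
# JSON parsing assumes the model follows the format instruction.
import json
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def situate_batch(document: str, chunks: list[str]) -> list[str]:
    numbered = "\n\n".join(f"<chunk id={i}>\n{c}\n</chunk>" for i, c in enumerate(chunks))
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"{numbered}\n\n"
        "For each chunk above, write a short context that situates it within the "
        "overall document, to improve search retrieval. Respond with a JSON list of "
        "strings, one per chunk, in order, and nothing else."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

# Usage: contexts = situate_batch(doc_text, some_chunks)
```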
So. Did you see that Meta claims to have solved prompt injection?
Again?
That's why I can't turn away.
So in essence it's a load of BS. Why would we want to triple our embeddings requirement? It's not sustainable.
Embedding storage is the cheap part. Anthropic is stating a 47% accuracy improvement. If you use context caching, this will be fairly cheap to build out, and even cheaper if you use a local embeddings model.
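For concreteness, this is roughly what the caching trick looks like: the large document block is marked cacheable, so every per-chunk call after the first reuses the cached prefix instead of paying for the whole document again. A sketch assuming the Anthropic Python SDK's prompt caching; the model name and prompt wording are illustrative, and the exact cache behaviour should be checked against the current SDK docs:

```python
# Sketch: per-chunk contextualisation where the document prefix is cached.
import anthropic

client = anthropic.Anthropic()

def situate_with_cache(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                {   # large and identical across calls -> cached after the first call
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {   # small and changes per call
                    "type": "text",
                    "text": (f"Here is the chunk we want to situate:\n<chunk>\n{chunk}\n</chunk>\n"
                             "Give a short, succinct context for this chunk to improve search "
                             "retrieval. Answer with the context only."),
                },
            ],
        }],
    )
    return response.content[0].text.strip()
```

Pairing that with a local embedding model for the contextualised chunks keeps the embedding side cheap as well.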
They are giving a master class on how to frontload the work and save on cost. If you have a large knowledge base, it's best to do this once; then you don't have to redo all the contextualisation again.
For example, all previous years' sales numbers for a company: a nice, static database.
Something like chatbot memory, on the other hand, would need the extra compute, since new information is constantly being ingested.
This could have been explained in 10 minutes instead of 34.
Glad to hear the concept of contextual retrieval clicked for you so quickly! If you're ready to explain it in 10 minutes now, I'd say the 34 minutes were well spent. Thanks for the feedback!