I don't know if I'm implementing 'prompt caching' incorrectly, but in my case, each chunk processing is taking too long (about 20s) with a 30-page file. Due to the processing time, this approach becomes unfeasible.
Hi Pedro, if I may ask please - how big is each chunk you are sending to Claude (in tokens) and what is the format sent (text or images)? Additionally, what kind of task are you asking it to perform for generating context - e.g., are you asking Claude to create a simple summary vs more intensive analysis? Have you tried experimenting with smaller chunks vs larger chunks? Likewise have you tried experimenting with constraints on the outputs? Have you tried Haiku vs Sonnet? All these factors and more will play a role in the latency estimates. Likewise, you may be able to find speed optimisations via your application code. If you are looking for lightning fast latency, you can also explore executing your tasks asynchronously, followed by either thread pooling to allow concurrent requests, and if necessary and your use-case allows, running in parallel with a library like multiprocessing. The limiting factor will be the API limits set by Claude.
@@emiliod90 Hi, thanks for responding. The entire document (35 pages) has about 14K tokens, and each chunk has around 490 tokens. I'm trying to perform a simple task like "Chat with your PDF" by adding context to each chunk, but the processing time for each request makes it unfeasible. I agree that I could run the requests in parallel, but it's still strange that each request takes about 20 seconds, even with the "Prompt Caching" feature enabled. I believe I'm doing something wrong during the calls, and the cache isn't working. Is there any example project I can run locally?
@@PedroNihwl Hey Pedro, no problem. We are creating a similar chat-with-your-PDF type knowledge base at my work, so I am also very interested. I will share what we have learnt and hopefully it can help you. For code bases I unfortunately can't share ours but I can guarantee that anything you find on TH-cam and Google within the PDF chat and RAG domain are good. I found support from watching channels Prompt Engineering, TwoSetAI and AI Engineer, particularly Jerry Liu. Now to answer your specific question, just so I understand, do you need the "pre-processing", i.e., extracting context, then embedding this into a vector store to be faster, or are you referring to the time required once you've already retrieved the chunk, and sent this to the LLM to await a response? Describing your overall architecture might be worth sharing so I understand how your tackling this issue please?
Contextual retrieval looks like it will definitely help with one of the downsides of RAG (ie. loosing context in a chunk). The other approach I have been looking into quite a bit is the combination of RAG and knowledge graphs. I wonder if knowledge graphs and contextual retrieval for RAG combined would be even better.
Truly impressive.
amazing!
awesome!
This is amazing. Do you have more tutorials or I can find more resources with Anthropic Pinecone and n8n? Thank you!
we're coming out with more content all the time! Is there a specific subtopic or use case for this specific combo that you're enthusiastic about?
I don't know if I'm implementing 'prompt caching' incorrectly, but in my case, each chunk processing is taking too long (about 20s) with a 30-page file. Due to the processing time, this approach becomes unfeasible.
Hi Pedro, if I may ask please - how big is each chunk you are sending to Claude (in tokens) and what is the format sent (text or images)? Additionally, what kind of task are you asking it to perform for generating context - e.g., are you asking Claude to create a simple summary vs more intensive analysis? Have you tried experimenting with smaller chunks vs larger chunks? Likewise have you tried experimenting with constraints on the outputs? Have you tried Haiku vs Sonnet? All these factors and more will play a role in the latency estimates.
Likewise, you may be able to find speed optimisations via your application code. If you are looking for lightning fast latency, you can also explore executing your tasks asynchronously, followed by either thread pooling to allow concurrent requests, and if necessary and your use-case allows, running in parallel with a library like multiprocessing. The limiting factor will be the API limits set by Claude.
@@emiliod90 Hi, thanks for responding.
The entire document (35 pages) has about 14K tokens, and each chunk has around 490 tokens. I'm trying to perform a simple task like "Chat with your PDF" by adding context to each chunk, but the processing time for each request makes it unfeasible. I agree that I could run the requests in parallel, but it's still strange that each request takes about 20 seconds, even with the "Prompt Caching" feature enabled. I believe I'm doing something wrong during the calls, and the cache isn't working.
Is there any example project I can run locally?
@@PedroNihwl Hey Pedro, no problem. We are creating a similar chat-with-your-PDF type knowledge base at my work, so I am also very interested. I will share what we have learnt and hopefully it can help you. For code bases I unfortunately can't share ours but I can guarantee that anything you find on TH-cam and Google within the PDF chat and RAG domain are good. I found support from watching channels Prompt Engineering, TwoSetAI and AI Engineer, particularly Jerry Liu.
Now to answer your specific question, just so I understand, do you need the "pre-processing", i.e., extracting context, then embedding this into a vector store to be faster, or are you referring to the time required once you've already retrieved the chunk, and sent this to the LLM to await a response?
Describing your overall architecture might be worth sharing so I understand how your tackling this issue please?
Contextual retrieval looks like it will definitely help with one of the downsides of RAG (ie. loosing context in a chunk). The other approach I have been looking into quite a bit is the combination of RAG and knowledge graphs. I wonder if knowledge graphs and contextual retrieval for RAG combined would be even better.
you might be interested in th-cam.com/video/ubtLxr7B1Vc/w-d-xo.html