Goodbye RAG - Smarter CAG w/ KV Cache Optimization

แชร์
ฝัง
  • เผยแพร่เมื่อ 1 ม.ค. 2025

ความคิดเห็น • 97

  • @code4AI
    @code4AI  2 วันที่ผ่านมา +10

    With the automatic audio dubbing from TH-cam /Google you hear a synthetic voice in your regional language.
    To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.

    • @T1212-e1o
      @T1212-e1o 2 วันที่ผ่านมา +1

      Is this cag or rag appropriate for pdf in arabic language??

  • @hermannschmidt9788
    @hermannschmidt9788 2 วันที่ผ่านมา +37

    The primary problem remains unsolved: your data needs to fit into the window and you pay for tokens. Caching is an optimization of an already solved problem. It only optimizes runtime. Caching always sounds simple first and then it gets complicated very quickly. You have to manage persistent state, which must be accurate. Have fun with that in an environment where the data source changes all the time. RAG is a compromise on many levels, yet it does its job when well-balanced and it is very robust. Furthermore, the LLMs have to provide a hook where you can inject your KV structures into the process. This is where I dismiss the idea until the leading LLM providers pick it up.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +5

      I understand. You wait till somebody else does it for you.

    • @xXWillyxWonkaXx
      @xXWillyxWonkaXx 2 วันที่ผ่านมา +2

      Dude. Thanks for elaborating.

    • @yuriy6249
      @yuriy6249 2 วันที่ผ่านมา +18

      @@code4AI You didn't get the point. That KV caching mechanism can work only for your self-hosted open source/weights model, you have to have access to the weights - nobody will give you access to the K and V vectors which are the result of the multiplication of your token x with the Wv and Wk matrices, which are those very weights. You can't do this with any of the major provider's (closed) models like OpenAI or MS Gemini etc.
      While you can use RAG with any 3rd party provided API-accessed model, you just append your retrieved documents/tokens into the context.

    • @hermannschmidt9788
      @hermannschmidt9788 2 วันที่ผ่านมา +1

      @@code4AI Are you kidding me? Of course I do!

    • @hermannschmidt9788
      @hermannschmidt9788 2 วันที่ผ่านมา +10

      @@yuriy6249 Nicely explained. It's not goodbye RAG for a long time and mere mortals like me, who don't spend all their time at the bleeding edge of science and need to provide useful software for customers instead.

  • @hyeonryeong_is_crazy
    @hyeonryeong_is_crazy 2 วันที่ผ่านมา +10

    It's so nice to find your channel. Thank you.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +2

      You are welcome.

  • @sakthigeek2458
    @sakthigeek2458 2 วันที่ผ่านมา +12

    This is not new. Its the same KV cache and context caching that we have already. It seems they just use a new name CAG for use with long context models. RAG is for scenarios when the documents exceed the context length of the model and it still holds true even with caching. Also, long context needs much more memory and will cost more than RAG. The whole point of RAG is to only get the relevant part of the knowledge base into the context. Here, for every single query, you will be loading a whole lot of irrelevant stuff on to the context window which may affect the output. Sure, it will have better performance but at what cost?

    • @timothywcrane
      @timothywcrane 2 วันที่ผ่านมา +5

      Not only that, you can take it one step further and use a virtual knowledge graph instead of simple kv index.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา

      Thank you for confirming that CAG has a better performance than RAG.

    • @benjaminoldenburg874
      @benjaminoldenburg874 2 วันที่ผ่านมา +2

      Then find a way to only load the relevant knowledge into the cache. It’s a math problem…

  • @WiseWeeabo
    @WiseWeeabo 2 วันที่ผ่านมา +21

    Do you really need a paper just to say "put it all in the context window instead" lmao..

    • @code4AI
      @code4AI  2 วันที่ผ่านมา

      Maybe you missed the concept of the KV cache?

    • @WiseWeeabo
      @WiseWeeabo 2 วันที่ผ่านมา +5

      @@code4AI oh right, that'd be "put it all in the context window instead and cache it", sorry you're right..

    • @bhaveshsinghal6484
      @bhaveshsinghal6484 2 วันที่ผ่านมา +7

      ​@@WiseWeeaboyou will be amazed by all the papers that exists which are just elaborated version of simple things because researchers needed to get there "PhD"s

    • @kellymoses8566
      @kellymoses8566 วันที่ผ่านมา +5

      Publish or perish means there are TONS of papers that sound impressive but really don't say much.

    • @YTHuang-u8z
      @YTHuang-u8z 10 ชั่วโมงที่ผ่านมา

      Yeah, KV cache is the default behavior in Transformers library, so it is actually nothing new… Just put everything in the context and I am sure that it will start worsening when TopK or document size become larger.

  • @scottmiller2591
    @scottmiller2591 2 วันที่ผ่านมา +8

    Disappointingly, very few of the papers on KV cache optimization (including the survey paper) acknowledge the paper where KV caching originated. As far as I know, it is first mentioned in 2022 - Efficiently Scaling Transformer Inference by Pope et al. (arxiv: 2211.05102).

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +2

      Thank you for this info. I was not aware who discovered it ... maybe I can reference it in my next videos.

  • @AITransformers
    @AITransformers วันที่ผ่านมา +4

    But RAG data is vectorized which means multiple files can be used as a knowledgebase. With CAG, wouldn't that be limited?

  • @arielponce8586
    @arielponce8586 2 วันที่ผ่านมา +10

    CAG has clear limitations for extensive knowledge bases, such as the Bible or technical manuals exceeding 1500 pages (128k tokens) due to the token limit . This model is ineffective for massive texts, requiring fragmentation or integration with RAG to handle dynamic queries.
    1M tokens ~ 2.500 pages
    Today It better suited for smaller datasets.

  • @siddharthgolecha998
    @siddharthgolecha998 2 วันที่ผ่านมา +9

    The only problem with this is that the context length of the LLM I am using is not that big and the documents are a lot

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +2

      Great remark. I will show you a second solution in my next video ... (hint: first one is already in the video)

    • @siddharthgolecha998
      @siddharthgolecha998 2 วันที่ผ่านมา

      Thanks ​@@code4AI. Looking forward to your videos 😄

  • @simplegeektips1490
    @simplegeektips1490 2 วันที่ผ่านมา +2

    I was exploring something similar. Imagine you can cache an ICL teaching a new task... I managed to teach a 123m model how to do table of contents only through ICL. it is like creating a seed, offline, and you can load it every time the task is required.
    What do you think?
    Still it is not clear how to save a d load these caches

  • @NWS189
    @NWS189 2 วันที่ผ่านมา +2

    Great video as always. Question - don't you still have to figure out which kv cache to load by finding / retrieving according to relevancy? It's becomes, in a way, a form of rag. Could you help clarify? Thanks.

  • @benjaminoldenburg874
    @benjaminoldenburg874 2 วันที่ผ่านมา +1

    Imagine pre-calculating caches for sentences and combining them as needed. This could extend Cache-Augmented Generation (CAG) to handle vast knowledge bases efficiently, similar to RAG but with faster inference and better coherence.

  • @felixinit
    @felixinit 21 ชั่วโมงที่ผ่านมา

    Amazing. No he aprendido aún RAG systems y ya hay algo mejor que lo sustituye CAG systems.

  • @MrAloha
    @MrAloha 2 วันที่ผ่านมา +2

    This is pretty cool. OpenAI, Anthropic, and Gemini already offer prompt caching via the API. I like it! Thanks!

    • @ScriptGurus1
      @ScriptGurus1 2 วันที่ผ่านมา +1

      Yeah ever since gemini api cache I don't even bother w rag anymore .. just pass the results to the llm

  • @andydataguy
    @andydataguy วันที่ผ่านมา

    I'm not an academic researcher. But I've built dozens of apps testing out performance. RAPTOR still dominates in my experience. RAG isn't dead its just maturing like prompt engineering

  • @EndoTheDev
    @EndoTheDev 2 วันที่ผ่านมา +4

    what about the "lost in the middle" issue?! isnt that a thing anymore?

    • @damianlewis7550
      @damianlewis7550 2 วันที่ผ่านมา +2

      Nope. Several attention methods used within LLMs have effectively consigned LITM to the trashcan of history.

    • @Thrashmetalman
      @Thrashmetalman 2 วันที่ผ่านมา +1

      @@damianlewis7550 not totally true. we still see some lost in the middle issues with these newer models in our tests

    • @damianlewis7550
      @damianlewis7550 2 วันที่ผ่านมา

      @@Thrashmetalman Depends on the scheme used I guess but most can perform a Haystack-1 fine and some greater numbers of needles.

  • @slightlyarrogant
    @slightlyarrogant 2 วันที่ผ่านมา +5

    Calling gemini-1.5-pro with the context length of 1m tokens is approx. $2.50. I think you can get answer with RAG from the vector store of the same capacity with the cost of $0.002 or less for get-4o-mini. so this CAG with gemini-1.5-pro is 1000 x more expensive. I know because I just spent $60 on 22 api calls for gemini-1.5-pro and they charge you a lot if context window is higher than 128k.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา

      Good remark. Let me answer in my next video ...

  • @ToddWBucy-lf8yz
    @ToddWBucy-lf8yz 2 วันที่ผ่านมา +2

    Ok i see what you are doing here Yesterday ECL today CAG...i wonder if CAG could be used to do static preloading for ECL. Time to find out.

  • @yurijmikhassiak7342
    @yurijmikhassiak7342 2 วันที่ผ่านมา +5

    Nice. Your video whill be better if you focus also on potential drawbacks of the tech. Cost and reliability come into mind. 2 mln context window does not mean it will find all the info there. Benchmarks show LLM can miss a lot of context.

  • @sinayagubi8805
    @sinayagubi8805 วันที่ผ่านมา +1

    Can different sets of cache be used together based on demand without their original texts being processed/cached in the same context window? because 20 books is really nothing for most enterprise use cases. I am working on a project right now that deals with data equivalent of 600+ books for each user.

  • @mohamedkarim-p7j
    @mohamedkarim-p7j 19 ชั่วโมงที่ผ่านมา +1

    Thank for sharing 👍

  • @augmentos
    @augmentos 2 วันที่ผ่านมา +1

    Can I ask are you down with adding transcripts? I wanted to really get a better comprehension even after watching it and I tried to put it into notebook LLM and it wouldn't let me. But for me, my learning style is much better when I can dialogue through something with the LLM.

  • @andydataguy
    @andydataguy วันที่ผ่านมา

    This is really cool. Could be very helpful for knowledge management. Thanks for sharing!

  • @gw1284
    @gw1284 2 วันที่ผ่านมา +1

    I have 2 questions, 1) how to setup the cache in preloading step, 2) how to use this cache in knowledge graph RAG, I find reconstructing graph is one of the bottleneck in RAG

    • @code4AI
      @code4AI  2 วันที่ผ่านมา

      CAG has no retriever. Therefore you have no bottleneck any more.

    • @gw1284
      @gw1284 2 วันที่ผ่านมา

      @@code4AI I am asking about GraphRAG

  • @alexisdamnit9012
    @alexisdamnit9012 2 วันที่ผ่านมา +3

    I feel like this video was a lot of content bait -- not very practical imo.

  • @thingX1x
    @thingX1x 2 วันที่ผ่านมา +1

    I set up CAG in my chatbot with vector stores without even knowing what I was doing haha. I love AI coding.

  • @jacobogerardogonzalezleon2161
    @jacobogerardogonzalezleon2161 วันที่ผ่านมา

    This preprocessing step in CAG, it is some "fine-tuning" step? 🤔

  • @MetaMeta-ic1wr
    @MetaMeta-ic1wr 2 วันที่ผ่านมา +4

    This is amazing!! But I'm stupid. Would you make a tutorial, demonstrations, examples?? So I can actually understand xD

    • @farrael004
      @farrael004 2 วันที่ผ่านมา +1

      Here's your tutorial:
      1. Choose an LLM provider that has prompt caching (just Google to see which ones do).
      2. Put all of the data you want the model to know into system prompt.
      3. Ask a question.
      I don't know if it could be any simpler than this honestly.

  • @acasualviewer5861
    @acasualviewer5861 2 วันที่ผ่านมา +1

    But isn't the big caveat "for a given query?"
    How will this work for an arbitrary query?

  • @IanTindale
    @IanTindale 2 วันที่ผ่านมา +1

    What if, coming back to the land of agents, there were an agent tasked with the responsibility and capability of being the context-knowing agent which would sort of ‘hold on’ to the things that would otherwise need to be cached here - it acts as sort of a consultant agent which knows things and has seen things - or maybe think about it as an ‘experience agent’ which contains experience

  • @hailrider8188
    @hailrider8188 2 วันที่ผ่านมา +1

    Please include links in the description to all the papers mention referenced in your video. It not only assists readers but also gives credit to the authors of the papers upon which your video is based.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +1

      I do explicitly show the original papers with the original authors and their affiliations in my video, in order to honor their work. You have in my video the official pre-print directly from arxiv, in the way and form the authors intended it.
      Please do not claim nonsense in your comments.

    • @hailrider8188
      @hailrider8188 2 วันที่ผ่านมา +1

      @@code4AI So, the only way that the authors know that you cited their paper is to watch your entire video? Is it really that hard to put the links to the papers in the description?

    • @benjaminoldenburg874
      @benjaminoldenburg874 2 วันที่ผ่านมา +1

      @hailrider8188 I think @code4AI can decide HOW he references the paper. Perhaps you want him to sing the names of the authors next?

    • @hailrider8188
      @hailrider8188 2 วันที่ผ่านมา

      @@benjaminoldenburg874 I know, Ctl-C, Ctl-V is hard. Maybe he can remove the names from the papers and stick his and your name on it. it is generally considered inappropriate and unprofessional to mention a research paper in a TH-cam video without citing it in the description. Failing to provide proper citation denies the authors the credit they deserve for their work.

    • @benjaminoldenburg874
      @benjaminoldenburg874 2 วันที่ผ่านมา

      ​@@hailrider8188 alright. Generally, I agree with you. However, it is his choice. Referencing the authors in the video does the job, though perhaps not in the most effective way.

  • @කැලණිකුප්පි
    @කැලණිකුප්පි 2 วันที่ผ่านมา +11

    As long as all the documents fit in the context length 😢😢 guess RAG lives on

    • @Geraltofrivia12gdhdbruwj
      @Geraltofrivia12gdhdbruwj 2 วันที่ผ่านมา

      Yeah it is highlighted in the paper as well.

    • @damianlewis7550
      @damianlewis7550 2 วันที่ผ่านมา +2

      As noted, cache optimization methods can reduce the KV cache overhead so that you can squeeze more into the context. It’s the internal representation that gets cached, not the original document data.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +6

      This is incorrect, as explained later in my video. Hint: KV cache optimization and hierarchical KV cache structures

  • @tantzer6113
    @tantzer6113 2 วันที่ผ่านมา +1

    Having watched this video a second time: First, I think here we’re talking about a key-value cache created at inference from the query input (x1 x2 … x_t-1), not a key-value cache created during training? Second, given the reduction in computation resulting from a key-value cache, I would guess key-value caches are already used during both training and inference? Third, given the reduction in computation, I assume it has become standard to have these caches, making them integral to transformers? I feel guilty having a theoretical physicist answer my dumb questions, but perhaps anybody in the “community” can help. Also, many thanks to Discover AI for all the videos! :-) PS The video refers to loading all the documents in a “pre-processing phase.” Is that the same as putting all the documents in the “context” (i.e., part of the query input)? If not, why not just use the context?

    • @tantzer6113
      @tantzer6113 2 วันที่ผ่านมา

      Ok, I have thought about this more. I guess the point is that keys and values do not require an input query to be generated and therefore key-value pairs can be generated after training but prior to inference is a pre-processing phase, i.e. even before the user has typed the input query (x1 x2 …)? So, by key value pairs that are in chache we simply mean the V and K matrices that are obtained during training? Once training is finished, we have our K and V matrices (of weights) and don’t need to recalculate any keys and values and inference time? Every thought I come up with breaks at some level, and I remain confused.

    • @tantzer6113
      @tantzer6113 2 วันที่ผ่านมา

      Through a long, long conversation with Perplexity AI, I think I’ve achieved some clarification… for the moment at least, until this begins not making sense again! This is how it works: 1) The pre-training is finished. Now, we’re at the “pre-processing” stage, BEFORE the user types any input queries, i.e., before any “inference.” 2) In this stage, the documents are fed to the ENCODER. The encoder generates a Key matrix and a Value matrix for each token. 3) Once this is done, with we have bunch of (key, value) pairs. 3) So, when the time comes for inference, the user query is combined with these existing (key, value) pairs to generate the output. I have some questions about steps 2 and 3 that I’ll ask Perplexity AI. Here they are: 1) in step 2, i.e. preprocessing, it would seem that are documents may not constitute a large enough corpus to train reliable key, value pairs. Suppose you have 5,000 or 50,000, or even 500 K tokens. Is that not too small a corpus? Or are we starting with some existing K, V matrices and modifying their values in something similar to fine-tuning? 2) In step 3 above, i.e. at inference time, I assume the Q matrices are generated by the Decoder? 3) At inference time, the query question is a sequence of tokens x1, x2, …. I assume this generates the matrix Q = [q1 q2 ..] through self-attention. Now, we want to combine this output of the decoder self-attention with the key, value pairs from the pre-precessing stage (using multiplication and softmax) to generate the output tokens. I assume the dimensions of the query Q are compatible with the dimensions of the key, value pairs. Are we in effect taking the input query, whose length may be that of a sentence, and searching through the documents (i.e., keys, as there is one key per token) to see which documents or which parts of the documents are the best match to the query? If so, I still need clarification on how this is done.

    • @tantzer6113
      @tantzer6113 2 วันที่ผ่านมา +2

      Here’s Perplexity AI’s answer to the above questions:
      Your understanding of CAG is generally on the right track, but there are some important clarifications and corrections to make. Let's go through your points and questions:
      CAG Process
      Pre-processing stage: You're correct that this occurs after pre-training and before any user queries.
      Document processing:
      The documents are indeed fed to the encoder.
      The encoder generates key (K) and value (V) vectors for each token at each layer of the model, not just one K and V matrix per document.
      Key-Value Cache:
      The result is a cache of K and V vectors for each token in each document, across all layers of the model.
      Clarifications on Your Questions
      Corpus size for K, V pairs:
      The size of the document corpus doesn't directly affect the reliability of K, V pairs.
      These pairs are generated using the pre-trained model's weights, not trained from scratch.
      Even with a smaller corpus, the model can generate meaningful representations.
      We're not modifying existing K, V matrices; we're generating new ones specific to the input documents.
      Q matrix generation:
      In CAG, the Q (query) vectors are generated by the encoder, not the decoder, at inference time.
      The model processes the user's input query through the encoder layers to produce Q vectors.
      Inference process:
      Your understanding of the query processing is generally correct.
      The user's input generates Q vectors through the encoder's self-attention mechanism.
      These Q vectors are then used to attend to the pre-computed K and V vectors from the document cache.
      The dimensions are indeed compatible, as they're all produced by the same model architecture.
      The process is effectively searching through the cached representations of the documents to find relevant information for the query.
      Additional Clarifications
      CAG eliminates the need for real-time retrieval by pre-computing and caching the K and V vectors for all documents.
      During inference, the model can quickly access these cached representations without reprocessing the entire document set.
      This approach significantly reduces latency, especially for large document collections, as the computationally expensive part (document processing) is done in advance.
      The key advantage of CAG is its ability to preload all relevant information into the model's context, allowing for faster and more comprehensive responses without the need for separate retrieval steps during inference.

  • @h3techsme
    @h3techsme 2 วันที่ผ่านมา +1

    I share your frustration with the "lemmings" following monolithic language models and simplistic RAG "over the ledge" to no good result. The AI hype "echo chamber" is where the gold rush leads to fool's gold.

  • @RuiLima1981
    @RuiLima1981 2 วันที่ผ่านมา +1

    At a point in the video you mention that people might not want to upload confidential data to vector database, but then later you say that we can use Gemini for free with this. Unless I am missing something, if you are using Gemini for free you are effectively uploading your confidential data to google use it in future model training, which means that you are making your data available to the whole world to see.

  • @ernestosantiesteban6333
    @ernestosantiesteban6333 2 วันที่ผ่านมา

    But isn't this the same functionality that Anthropic and Gemini models already have available in the API for the context caching feature?

  • @jcalobarv8653
    @jcalobarv8653 2 วันที่ผ่านมา

    Can I ask how do I start learning this stuff? I fairly know python, I know the basic syntax and oop. How ever I don't know numpy

  • @puremajik
    @puremajik 2 วันที่ผ่านมา

    Do you still need to structure the data?

    • @damianlewis7550
      @damianlewis7550 2 วันที่ผ่านมา

      You don’t with RAG so assume not with CAG either.

  • @Piotr_Sikora
    @Piotr_Sikora 2 วันที่ผ่านมา

    One more thing… on hugging face exist exnteded retrained version of Llama that support 1M tokens :)

  • @fdavis1555
    @fdavis1555 2 วันที่ผ่านมา +1

    Great insights!

  • @reserseAI
    @reserseAI วันที่ผ่านมา

    The real chalenge is law domain, its different with medical domain

  • @andrewowens5653
    @andrewowens5653 2 วันที่ผ่านมา +1

    Excellent!

  • @actorjohanmatsfredkarlsson2293
    @actorjohanmatsfredkarlsson2293 2 วันที่ผ่านมา +1

    Problem with InCA is that its only for classification tasks. It only work with supervised data sets. So it doesn't replace RAG.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +2

      You reply to a video about CAG. Please comment about InCA on the InCA video for the community to understand your semantic comment.

  • @augmentos
    @augmentos 2 วันที่ผ่านมา +2

    Hahah I wake up I’m like I just learned ECL here we go again

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +4

      Wait for tomorrow's video .... I have an idea for the final day of the year.

  • @Piotr_Sikora
    @Piotr_Sikora 2 วันที่ผ่านมา

    I thinking to do something similar like this few weeks ago ;)

  • @zeropaper
    @zeropaper 2 วันที่ผ่านมา +2

    RAG does not require vector storing...
    CAG doesn't work for application that have content that is updated regularly.

    • @code4AI
      @code4AI  2 วันที่ผ่านมา +1

      Your first sentence: incorrect. Your second sentence: incorrect. Thank you for your comment.

    • @covertassassin1885
      @covertassassin1885 วันที่ผ่านมา

      ​@@code4AI
      No, his first sentence is correct...
      Vector embeddings are one method for retrieval but you can implement many other retrieval methodologies that don't rely on semantic similarity.
      RAG is just retrieval augmented generation. So go retrieve some content with whatever methodology, feed that all to an LLM, and generate.
      No vector embedding / storing required.
      It is a small detail but when you say he is incorrect, I don't want other people to be misinformed

    • @code4AI
      @code4AI  วันที่ผ่านมา +2

      @covertassassin1885 RAG is a defined method as published in the document "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401v1), which introduces a model that utilizes vector representations to encode information in a vector space. The dense vector index of Wikipedia is accessed using a pre-trained neural retriever (Dense Passage Retriever, DPR), which employs a bi-encoder architecture to represent both queries and documents as vectors in a shared vector space.
      @covertassassin1885 seems you just learned the origin of RAG and its scientific definition.

  • @densonsmith2
    @densonsmith2 2 วันที่ผ่านมา +1

    If you want to promote something you should include a link to the paper and repo in the notes. I'm now 15 minutes into guessing which search result is the correct one.

  • @daniloribeiro3526
    @daniloribeiro3526 2 วันที่ผ่านมา

    I don't know, Gemini over +20k tokens it starts to trip itself

  • @andrewowens5653
    @andrewowens5653 2 วันที่ผ่านมา

    How about CAG-RAG

  • @_alt
    @_alt 22 ชั่วโมงที่ผ่านมา +1

    Clickbait