?? Not at all. Simply have a separate model remove all personally identifiable information. That's actually more elegant, since you can provide deductive rules for the model to use for filtering.
I mean, big companies have made rules against using ChatGPT and the like because, with how much it was used, some prompts would return things they were developing in-house, sometimes not even publicly known. In the end you need to take care with what data you're handing out, and to whom 😊
LLMs and other generative & predictive models are already trained on private user data, and have been for over a decade. An easy way to test this is to type an address into a text-completion model and see if it writes down the name of the person living there. It works sometimes, and that data shouldn't be included in the dataset at all.
This is the same problem I have with LLMs currently. They just don't use code I've written, even though it's open source lol. It's a big problem. Another important thing is, sure you may make a coding LLM, but when you get it to code something from the real world, e.g. simulation software, you will likely need an LLM that has knowledge of both physics and coding. General domain knowledge is useful.
If there's no documentation for the library, what will you train your model on? If there's not enough information for context, how is there enough information for learning?
So what’s the solution? Feed it niche documentation and then critique every response as you get them? I.e. provide feedback as you go so you train the model as you yourself learn?
I actually saw another video that touched on this point, but it was more about Moore's law and that at some point continual learning would be the future because we'd have the computation on hand at a cheap cost. But I am glad someone is working on the problem now so that we can transfer over when the time comes. Humans continually learn, so it only makes sense for our AI models to do the same. Data is the future, and ways to generate useful data will be key.
VERY interesting! A few questions, if I may: 1. Do you use LoRA? 2. Does your LLM operate with a level of certainty about the data it produces? Can it detect a paradox or an absurdity in its own answer? Can it do so prior to the final output? 3. Do you consider self-improving LLMs? Let me explain. Obviously, a static LLM represents a set of frozen-in-time (not-so-simple) links between an input and an output. The quality of those outputs is directly linked to the number of different data points and the vectors connecting them (and to the precision of those vectors). So, the question is: is it possible for an LLM to scan those vectors, extract patterns and then apply them to other areas to get extra data where the density of data is lower? Something like DLSS does with images, you know?
Of course we need AI systems with continual learning; that has been the whole vision of AI for years. But we have to figure out a new architecture for existing LLMs, because every time we fine-tune the system on new semantics, the model weights shift toward the new distribution, and this leads to forgetting of previously learned information, which disrupts the whole paradigm.
I do not see this as an LLM or an ML problem. Even humans will struggle to learn and apply something new if they do not have prior knowledge for it. For example, an engineer will struggle with accounting or medicine. Coming back to LLMs: even with context provided (RAG), if no prior knowledge existed in the training, the LLM will not produce the desired output. As stated before, this is not a unique LLM problem; it's a general problem common to humans and machines.
Not really; humans learn and adjust their weights and biases over time. LLMs will never do that, though: no matter how much data you feed one, it will forget everything beyond the context window.
The difference is that I don't forget literally everything that happened more than 6 hours ago. It would really suck if that was the case lol. Humanity would never progress.
My guesses:
- A lot of code is undocumented, or documented too differently to generalize learning from it
- Code doesn't just differ from language to language or library to library, but on a per-project basis
- Specs change over time
- The bigger the context, the more difficult it is to draw out the correct information
- More context makes it generate more slowly or with more errors
Oo, I was spot on with some of these, but missed the training specialization one. I was kind of thinking of it, but only in the context of specialization on your type of code.
Great points. I agree…partially. I definitely believe in lifelong-learning styles of training. I say partially because I see some inherent rigidity here. First off, what data mixing is being used? You didn't mention this at all. Whenever we continually train, you notice spikes in loss; this is due to the anticipatory behavior of these models on past representations. So we need to contextually sample small portions of pretraining data and mix them with the new data, along with some DAP-based method and careful learning-rate tuning. With that we can greatly mitigate catastrophic forgetting. Ain't no way this isn't abundant in your setup? How do you approach this? Also, long-context ICL has its faults, but I feel that long-term it will outshine raw tuning. Think BatchICL or many-shot ICL with LM-generated interleaved rationales. Any limitation you mentioned can be optimized. Pretraining or continual pretraining with a joint LM (compiled NN, maybe even latent variables for the LM to represent different states that trigger different types of symbolic programs in our compiled NN) and System 2 Attention (reformulate to append a linearized KG after context refinement)… I still believe in long-context ICL, long-term. Great content though. Especially if the LM is equipped with symbolic logic; obviously it would have profound impacts on generalization, assuming your pretraining data is augmented with this style of hybrid data. You literally need an LM to process all training data, I believe, for this to really work (not changing the samples, just adding to the sample context; you must retain the original entropy or… model collapse 😂). At least this is the first principle my startup is building from.
I mean, I feel like extremely long contexts, and a model that can actually utilise that context to a good degree of accuracy, is essentially continuous learning in a way lol. But you do have some good points. And this reminds me of what Andrej Karpathy said about how pretty much no current models are truly open source. If you wanted a truly open-source model you would open source not only the weights but also the training sets, for example, and having access to those training sets allows you to do things like continued pretraining (as you also mentioned, training on a mix of the old and new does well at mitigating catastrophic forgetting, and I'm pretty sure that's how OAI has been updating the knowledge cutoff in their models). I've never seen RAG as a long-term solution, as it just doesn't allow the model to capture the things it would via continued pretraining or ICL. With that extra context the model can build up an intuition of sorts about the information you have given it, which I think will be much more advantageous, and it's overall just more flexible. And for continuous learning, maybe you could update only a subset of the model, the relevant weights? This would be a lot cheaper and CF wouldn't be as prevalent, and those spikes in the loss should also not be as dramatic, because you are updating the weights that are most relevant to the new data; it might also lead to reduced overfitting because you're not allowing the entire model to adjust to the new dataset. But this solution does present its own challenges.
@@DanielSeacrest great points bro. Facts, super long context and proper context utilization (which is, for the most part, a data fallacy) is essentially active learning, since attention perturbs activated neurons. I've seen some cool research where they backprop at test time against a constraint to "look into the future", all that and things like Quiet-STaR… I see a clear pathway to removing the need for fine-tuning for MOST use cases. Yeah, that makes sense: given a training corpus, identify the most relevant subnetworks and freeze the rest. Wonder how we account for the polysemantic nature of models though. Literature time lol. Saying this out loud, it seems so obvious. Has to be a paper already ha.
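For anyone curious what the "freeze most of the network, update only the relevant subset" idea from this exchange might look like in practice, here is a minimal sketch. It assumes a Hugging Face-style causal LM; picking the last two GPT-2 blocks as "most relevant" is just a placeholder heuristic, not a principled subnetwork-selection method.

```python
# Sketch: continually fine-tune only a "relevant" subset of weights, freezing the rest.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze everything...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze only the parameters in the last two transformer blocks (h.10, h.11).
trainable = []
for name, p in model.named_parameters():
    if any(f"transformer.h.{i}." in name for i in (10, 11)):
        p.requires_grad = True
        trainable.append(p)

opt = torch.optim.AdamW(trainable, lr=1e-5)

def continual_step(texts):
    """One small update on a batch of new-domain text."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM loss on new data
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

print(continual_step(["Some brand-new library documentation goes here."]))
```

A real system would pick the subnetwork from the data (for example via gradient or activation statistics) rather than hard-coding layer indices.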
Sounds like it would be very useful for DevOps workflows where a bunch of often poorly documented systems need to be joined together, often by some degree of trial and error.
Have you considered the security implications of attacks like model inversion against continuously trained models? You’ll need some way to cheaply segment knowledge base access between untrusted tenants or users. That could lead to a lot of parallel models.
One approach is to implement continual learning gradually; you could use recently finished projects to train the model and eventually evolve to real-time continual learning. Yes, I understand you would like a problem-solving prediction based on inference from other experiences. You are asking the model to predict solutions based on abstract patterns that don't exist. Models already do that in the form of hallucinations.
I've also been thinking about this topic for a while. My reflection took me to how a human would normally address this issue. As we know, you can be an expert in some fields, but quite unlikely in many fields. On the other hand, we can leverage our "learning" ability to solve any problem, given the learning materials. Intuitively, this "learning ability" sounds fundamentally different from the weight updates of LLM fine-tuning, where we use this ability to learn knowledge. From this point of view, we might still be far from "general AI".
If you're constantly retraining the model, how do you know whether the model is working, given the constant changes? Imagine swapping out the engine on your vehicle every day, then measuring mpg. Which engine performed best if you changed it every day?
We do exactly the same thing. What happens is clients think fine-tuning and in-context learning can get them to their target goal. Even worse, some other company has convinced them that fine-tuning IS the way to go, not knowing that these AI wrapper companies pretty much just feed their data into ChatGPT to generate synthetic data to fine-tune on. 1. They call our tech pitch BS. 2. After 2-3 weeks, we get a call asking why fine-tuning only results in 60-70% accuracy, and then we tell them "we told you so."
I say this all the time. No matter how awesome a deep learning model is, if it can't do continual learning it's a huge problem. A lot of practitioners work as if this problem does not exist.
I paused as well. I've thought about this issue with my own projects. RAG in production can be expensive. I can ask GPT-4 questions about various codebases and it performs really well, but in the long run it's more efficient to get away from RAG.
That is an interesting idea. But from the point of view of the user, how do you manage the continuous learning? I mean, where do you get the data from to keep learning if the data doesn't exist? You mentioned the model learning with the user; the question is how?
I would flip the question back on them: all animals sleep in order to integrate lessons learned into their embodied knowledge, so as to avoid holding them in working or short-term memory. Why wouldn't we expect LLMs to benefit from this same pattern?
I'd love to learn from specific examples where in-context learning fails but continuous training succeeds. I don't fundamentally understand the difference between fine-tuning on data set A and adding data set A to the prompt for in-context learning (except inference performance as set A grows). The video is not clear (to my understanding at least) on whether this is the comparison, or whether you're comparing an LLM fine-tuned on A against in-context learning on some subset of A. Help 🙂
In the event of pursuing research in the realm of "Theoretical Physics" or any theory-based research that prolongs for decades and decades before a theory is debunked... or until a new theory arises, what effect would that have on a Large Language Model applied to these areas? Having an LLM trained on theories that in the end may prove false would create a "hyper-intelligent" LLM built on incorrect data... would it not? Is this an area of research that allows for the application of LLM's? If so, to what extent? Perhaps this is not an area of research an LLM would even be trained on. I am in no way educated on the subject, but this video did immediately bring to mind those questions. I think continual learning may prove successful only in the event we are continually learning in a direction that leads to a "correct" conclusion.
2:40 I don't know if I understand correctly, but my intuition from using GPT is that it's really bad at synthesizing new info using the RAG method. I was excited that I could build a GPT therapist when the GPT assistant feature was introduced, only to find that that's impossible by just pasting in the documents. Being a therapist and psychologist is about grasping the ideas in the texts and synthesizing them appropriately for the client.
I've only used RAG systems in practice. Is this video implying somehow mutating the weights of the base model as part of the continual usage of the model?
Why use continual learning? After pausing the video, my guess is to reduce the marginal cost of execution. Instead of context stuffing and consuming that many tokens, embedding the knowledge into the weights reduces prompt size and cost.
I have to question whether LLMs - even in their most advanced form - are even plausibly capable of doing what you've declared your goal to be (generating brand new ideas that have never been thought of before). LLMs are prediction engines, and in order to predict, they learn from the past. How can one predict what has never happened, strictly from learning from the past? You might be able to predict that something new will happen, but never what that would be. I suppose that could be interesting in its own right: the combination of factors X, Y, and Z results in a hole in the LLM's prediction. _Humans_ are pretty terrible at this task. Rarely does anyone have a truly new idea. Most of what feels like new is just recycled bits and pieces of old, reassembled into a different whole.
@@maalikserebryakov what truly insightful, original, inspired thought it must have taken you to construct such a brilliant ad-hominem attack! How clever of you to skirt the difficulties of constructing a counter argument! I'm sure your mommy and daddy are very proud of you. Gold star!
You'd like to bless the model with better priors; there are some tasks, like writing smut, that the model simply can't do. It makes sense to continually train on that, and it becomes more important as vision gets introduced as a modality.
Also wasn't there a paper that showed AI tended to place less importance on everything in the middle of the prompt? i.e. it only cared about the start and end.
I would think it's probably more interesting to start with a micro LLM (a few gigabytes of weights), so as not to have to pull the weight (pun intended) of too much pre-learning. I would also avoid adding multiple iterations of a documentation, feeding only the differences after the first learning pass. There's a lot to experiment with, and eventually learn.
I get what you are saying and don't disagree in general. I just hope you have top people with good depth of experience in weight consolidation/elasticity engineering and related areas, to keep from having the problems that are inherent to all continual learning projects I have had experience with. Personally, in most instances I would employ a structured ensemble multi-LLM approach leveraging Graph RAG with selective use of ICL. The probability of catastrophic forgetting and other corruption escalates in a very unpredictable way if you don't use ICL very judiciously and with a lot of careful, strategic planning. I saw more than one great team lose many months of progress by having to backtrack several times, only to realize that the only scalable approach was to combine strategies, with the others combining LLMs and Graph RAG with ICL approaches.
LLMs do come up with solutions based on mosaic analysis, i.e. combining different relevant assumptions in one analysis. Even for humans this is not straightforward; one research paper may take more than five years to complete in certain cases.
I beg to differ. You are missing the largest point of RAG. The point of RAG is that I can store all personal information and context related to me in a vector database. Then when GPT-5 or GPT-6 releases I can instantly switch the base model over and have it access my vector database, whereas you will have to train on GPT-5 from scratch. As a small startup you are never going to be able to compete with Google/Amazon/Nvidia on training the underlying models. So it's better to use a RAG system where you can use ingestion and inference to continuously improve the knowledge in your database. Every RAG company will be 100x more powerful when GPT-5 releases. When GPT-5 releases it will make your company, and any others who were previously training Llama or GPT-4, obsolete until they train on GPT-5. RAG also isn't as limited as you suggest: traditional RAG systems run into the pitfalls you focus on in the video, but a lot of smart people are working on fixing them. For instance, a project I've been working on uses the LLM to format the data more than previously thought possible before it's embedded in the vector database. This makes your database much more powerful, as you regain a lot of the missing context that is typical of a RAG system. As context length increases we can pass more data from the vector database along with the prompt. AI models now allow us to specify more distinctly what is context and what is the actual prompt. And by allowing the AI to make multiple return calls to the database, we can gather any information not covered by the first call. Let me know your thoughts on this.
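A rough sketch of the "have the LLM reformat the data before it is embedded" ingestion step described above. `call_llm` and `embed` are placeholders for whatever chat and embedding APIs you use, and the prompt is illustrative only:

```python
# Sketch of an "enrich before you embed" ingestion step for RAG.
from dataclasses import dataclass, field

@dataclass
class Record:
    raw_chunk: str       # original text, returned to the LLM at answer time
    enriched: str        # LLM-rewritten, self-contained version used for retrieval
    vector: list = field(default_factory=list)

def enrich_chunk(chunk, doc_title, call_llm):
    # Ask the LLM to restate the chunk with the context a bare chunk loses:
    # which document it came from, a one-sentence summary, likely search terms.
    prompt = (
        f"Document: {doc_title}\n"
        f"Chunk: {chunk}\n\n"
        "Rewrite this chunk as a self-contained passage. Add a one-sentence summary "
        "and the key terms a search query might use."
    )
    return call_llm(prompt)

def ingest(chunks, doc_title, call_llm, embed):
    records = []
    for chunk in chunks:
        enriched = enrich_chunk(chunk, doc_title, call_llm)
        records.append(Record(chunk, enriched, embed(enriched)))
    return records  # store these in the vector DB of your choice
```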
When ChatGPT arrived, it was like a wave, but it turned out to be another case of novelty-driven convergence bias (getting caught up in the enthusiasm of a new idea and devoting all of your energy to making it even better, without evaluating alternative choices or thinking outside the box). As a non-coder who has been following the LLM space for the past 6 months, I have realised that it is too long a journey to get a smart brain that will learn and process in order to execute; it is just a large language model, which implies condensing all data into a model, making it difficult to continuously fine-tune or train. Limitations in neuro-symbolic AI include the integration of structured knowledge representations, adaptability, abstract thinking, and reasoning. Emotional intelligence and creativity were the main reasons I suspended my interest in LLMs. I had plenty of creative ideas. One example: I extracted thousands of lyrics to create a model that would creatively construct a song and map existing lyrics into a knowledge graph, but it was only a half-finished copy-paste job; the process of delving deeper into each lyric that I input was much more difficult to implement. Now, I only use ChatGPT or local LLMs for grammar correction and basic information. 😅
The problem seems a bit more obvious if you consider the limits. If you have no base model, in-context learning won't work. If you have a base model but more knowledge than fits in your context, in-context learning won't work. At some point, you are obliged to improve the base model to get results, so it's a bad premise to aim for a solution where you aren't training base models at any level. Great video, nice style!
Good logic, but it's really hard to convince people of this when you're pitching a startup 😅 everyone just likes big LLM big context go brrrr. Happy you liked the style; this is the same style I used in my first videos. Might do a bit more of it and refine it.
Has anyone built an AI chatbot for a client/company? If so, I wanted to know if a tool that monitors your AI chatbot for incorrect or dangerous responses, alerts the developer, and logs it when it happens would be useful. My friends and I built such an AI monitoring tool for a hackathon and wanted to know if it would be helpful for others.
I think this was the limiting factor for me trying to build an AI chess coach. I could give it access to chess analysis APIs in-context and explain how to use them, but I couldn’t get it to actually reason about the positions and break them down for a human. Chess was the poetry in your example. Surprised because I’m sure there’s plenty of chess content in the training data for large models, but I guess chess is orders of magnitude more complicated than most topics, so it needs much more targeted and focused coaching data 🤷🏽♂️
Hi. I would love to collaborate; I am in the process of developing an agentic AI system with a somewhat similar philosophy to yours for large language modeling, albeit with a few caveats. I couldn't find your email in the description though.
Hello! You mention an email in your bio but I'm not seeing it, only Twitter. Maybe you can add it so I can reach out. I'd love to learn more about your continual learning project.
The comments section here is gold 🙌🏾💜 My 2c: both CL and RAG are vital. RAG provides the closest thing to real-time context, and CL ensures the base model can appropriately interpret that context and meet output requirements. We're still very early in this latest season of AI solutions. I think arguing over which is best isn't the most productive use of cognitive resources; it depends entirely on the application use case. Some applications should absolutely be fine-tuning their own CL systems, but the vast majority can be handled with RAG. This is the allure and popularity of RAG: it's the tool that's cheap, accessible, and will get you pretty far most of the time. Unless you're an experimental quantum physicist or something, in which case you'll surely need to fine-tune your own. With better data engineering for RAG systems I'm confident both points you described can be resolved within 6-18 months. Plenty of work to do till then 😁
Paused to subscribe 😊 While I'm at it: the quality of coding solutions based on RAG is not good enough, even if there is documentation with examples. I know this problem from using a ChatGPT-based solution to support a novel modelling approach in OR with a just-to-market solver.
The points you bring up about the failures of RAG and in-context learning are more broadly problematic. I've frequently wondered why LLMs fail to "synthesize" information from different sources. Try to get an LLM to debate you on a topic. It's completely futile, because the LLM just regurgitates information from its training data (with all its associated biases). It doesn't do any reading between the lines, taking disparate sources of information and having that "Eureka" moment when two at-first-disconnected ideas come together to produce new knowledge. In essence, LLMs are still just context-dependent pattern recognition machines.
AI, as a whole, is glorified statistics.
@@plaintext7288 This is the most concise and precise comment on AI I've ever seen!
@@rnoro It's also total bullshit. The "AI are just stochastic parrots" and "AI is just predicting the next word" claims, and so on, have been thoroughly debunked.
@@sino_diogenes I'm not doubting you, but can you provide sources for that? It'd be interesting to read the arguments behind that
EXACTLY ❤ I TRIED ARGUING. 😮 Not good. The censorship can never work for the same reasons. Look at the Captain Underpants kids book series. Literally would be “burned” by the censors in text and images. But it’s the opposite of child abuse.
After you first brought up continual learning versus long context / RAG, I immediately thought of the example of asking for advice from customer service staff with a manual, versus asking a technician. RAG is like a CS person who can quickly find the page related to your question. In-context seems a bit better: the person has read the manual and can probably bring you a bit of insight, depending on how smart that person is.
But continual learning ideally means the manual is fully understood by the person. They learnt it, instead of just remembering it. Ideally, that is; it depends on how good a learner they are.
Very good analogy!!
That's assuming there is a manual at all!
I paused. From my experience working in production with RAG... it's basically like using a legacy technology to fill in the blanks in training. Attention mechanisms are way more subtle and nuanced than semantic-search lookup. If there were a cheap way to viably do continual learning in a production environment, while still allowing context windows, etc., it would solve the whole problem of semantic search not being enough to find the right data you really need.
LoRA and other PEFT techniques are rather cheap, and even full CL is only about 3x more compute and memory intensive than standard inference costs.
That is the goal!
Why can't we update the context documents that are the input for RAG, for continuous learning?
Meaning every time we want the model to learn new knowledge, we just update the patch of documents the RAG model uses as context.
I agree, RAG feels very empty to me. Like, once you intelligently find the relevant text for a prompt in your vector DB, it seems like the LLM isn't really even needed anymore. Why not just use the vector search and ranking by itself at that point? To me, the value of the LLM is its ability to create new content, and RAG basically puts the LLM in a tiny box.
This is also my thinking
If you can find the right documents anyway, that means the documents (and the answer) already exist and you don't need the LLM to read them and essentially play them back to you.
IF you can find the right documents that is.
Therefore we either focus on finding the right documents or we continually train the LLMs -> how are you doing that?
do you see degradation of the model as you are doing continual learning, maybe some catastrophic forgetting?
You have to blend the new data with the old so that the model doesn't regress (according to an ex-Tesla AI lead); see the sketch at the end of this thread.
I've found a way to prevent catastrophic forgetting using attractors.
I seem to have catastrophic forgetting with my continual learning, however, I noticed it only happens during exam time.
Same here xD
@@justinlloyd3 what are attractors?
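Since forgetting keeps coming up in this thread, here is a minimal sketch of the data-mixing (replay) suggestion above: each continual-learning batch blends fresh examples with a sample of older data so the model doesn't drift from what it already knew. The ratio and function names are arbitrary choices, not a recommendation.

```python
# Sketch: each continual-learning batch mixes fresh examples with replayed old data.
import random

def mixed_batches(new_examples, old_examples, batch_size=16, replay_fraction=0.5):
    """Yield batches that are part new data, part replayed old data."""
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    new_examples = list(new_examples)
    random.shuffle(new_examples)
    for i in range(0, len(new_examples), n_new):
        batch = new_examples[i:i + n_new] + random.sample(old_examples, n_replay)
        random.shuffle(batch)
        yield batch

# Usage: feed each yielded batch to whatever training step you already have, e.g.
# for batch in mixed_batches(new_docs, sampled_pretraining_docs):
#     train_step(batch)
```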
I'm glad I found this channel. The biggest problem I see with LLMs is their static weights, which I imagine is good for safety researchers but problematic when a model cannot expand beyond its training data.
My guess on why continual learning: if pulled off in an economically efficient manner, it can let researchers on the bleeding edge of the ML field harness copilot-like assistance when working on completely new tasks. Often, when doing your own research, you have to answer all the questions on your own, and having a second brain there with you that has all the same context on the research as you do would be very beneficial.
This.
👏
2:30 - I'm thinking that the reason you continuously train models in production instead of using RAG is that deciding what information is relevant is unclear. By continuously training models, the model is able to make those connections itself. I'm wondering if this also speeds up inference, as the prompt no longer has so much extra padding.
precisely, fine-tuning will never leave our scope because it's far from optimal to keep models static.
Yup, this is a huge part of it
You can use the LLM to decide what information is relevant by adding an extra call to the database. For example, the LLM can output 75 keywords that request the context needed. Then that gets sent to the LLM again, allowing for nearly 100% correct context. Along with the possibility of the LLM being allowed to make 5-10 context calls back to the database, you overcome these issues.
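A minimal sketch of the loop described in that comment: the LLM proposes search terms, retrieved chunks are fed back in, and the model may request more context a few times before answering. `call_llm` and `search_db` stand in for your own chat API and vector/keyword search; the prompts are illustrative only.

```python
# Sketch: the LLM proposes keywords, retrieves, and may loop a few times before answering.
MAX_CONTEXT_CALLS = 5

def answer_with_retrieval(question, call_llm, search_db):
    context_chunks = []
    for _ in range(MAX_CONTEXT_CALLS):
        # Ask the model what it still needs, given what it has gathered so far.
        keywords = call_llm(
            "Question: " + question + "\n"
            "Context so far:\n" + "\n".join(context_chunks) + "\n"
            "List search keywords for any missing information, or reply DONE."
        )
        if keywords.strip() == "DONE":
            break
        context_chunks.extend(search_db(keywords))
    # Final answer with whatever context was gathered.
    return call_llm(
        "Context:\n" + "\n".join(context_chunks) + "\n\nQuestion: " + question
    )
```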
I am not convinced by your arguments. In my work, I often need to tell business people that we should not do training or fine-tuning until we see that in-context learning is not enough. And in the end, it turns out that in-context learning is enough, even for those complicated examples you described.
Let me argue with your arguments.
1. `Where does the context come from?` - Well, if we have no context then any training is even more out of the question than in-context learning, right? I don't see how that would be an argument for continual learning.
1.1 `RAG won't enable our models to solve complicated or niche problems` - It can. If LLM is capable enough and has great general knowledge then it can often solve problems that no human has ever solved before, just using its general knowledge and some additional context about the problem that needs solving.
2. `The scope of in-context learning is limited by pretraining data` - agreed. However, the most capable models are trained on almost every type of data you can imagine. That means that you probably won't find a problem for which in-context learning won't work for those models. This can, however, be a problem if you are working with less capable, smaller models.
For me the biggest problem with RAG is that models do not work well with very large contexts. Even if you have 128k of context length in GPT-4, it won't be able to reason well over 100k tokens of table data or system logs or even a codebase. Instead, it will often "misread" or "forget" information in such a long context. That's more of a limitation of current LLMs, I think, not an intrinsic characteristic of RAG systems.
These are some great points, and there is a large hole in my video because I left out half of the argument for brevity.
The most crucial point I think is the question of where the context comes from, because of course, you aren't going to get any benefits of continual learning if you have nothing to learn from 😛
The answer to this is having RL agents generate their own data via experience. The thing about this approach is that it requires you to continually get new data that is more related to what you want to solve. But if you aren't continually learning, you won't be able to generate continually more relevant data (unless it happens to fit into your context window).
For smaller problems this is often not a problem, but what if you want to learn an entirely new programming language or learn calculus? You can't just give a bit of context; you have to learn from a continually changing curriculum as you learn one topic after the other. That is where this approach makes sense.
This is why I agree with you to an extent on point 1.1. If there is a problem just on the boundary of what humans haven't done before, RAG + in-context learning can potentially solve it. But if you want to build on that new knowledge, and then build on that, and build on that, and so on, that is where you need continual learning, and these are the types of problems that researchers and startups often face. They are problems that require layers of learning.
That being said, I disagree with point 2. There are sooo many things these agents are not trained on, but they do require going deep into specific domains most of the time.
The long context is one problem I was originally going to bring up in the video, but I decided against it because models will keep getting better at dealing with longer context as training methods and architectures improve. These other problems will not go away.
I appreciate the well thought out critique
I want to know how you arrived at 1.1. Do we have any proof or examples? I'm not convinced, just by the nature of what it means to understand something.
@@EdanMeyer Agreed. I would also like to bring up the point that context size brings O(n^2) scaling. So with a massive context window, you will eventually hit a point where training becomes just as cheap as a few forward passes.
Also, even with a nearly unlimited, yet finite, context size, you will eventually run out of room for very specific tasks like working with a codebase long term. Then training on that knowledge becomes a very viable option, because it would essentially reset the context window usage.
Also, LLMs perform worse on larger context windows, and that is just a given; they will always perform worse with larger windows of info, because humans have the same flaw and LLMs are trained to mimic human writing, so having a smaller window will always yield better performance.
@@redthunder6183 Re: context window size, SotA models can find specific datapoints with near 100% accuracy in a 1M+ token context window.
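The O(n^2) point a few replies up can be made concrete with a back-of-envelope comparison. All numbers below are made up, and only attention-score operations are counted (no projections, MLPs, or KV caching), so treat it as an illustration of the scaling rather than a cost model.

```python
# Back-of-envelope: attention-score work grows with the square of the sequence length,
# so re-reading a large corpus in-context on every query eventually costs more than
# fine-tuning on it once. All numbers are arbitrary placeholders.
corpus_tokens = 200_000   # documentation you want the model to "know"
query_tokens = 2_000      # prompt + answer per request
num_queries = 1_000       # how many times you'll ask about this corpus
layers = 32               # hypothetical model depth

def attention_score_ops(seq_len, n_layers):
    # One seq_len x seq_len score matrix per layer; ignores heads, projections,
    # MLPs and KV caching, since this is only about the n^2 term.
    return n_layers * seq_len ** 2

in_context = num_queries * attention_score_ops(corpus_tokens + query_tokens, layers)

# Fine-tuning pass: the corpus is seen in short windows for a few epochs, with the
# backward pass counted as roughly 2x the forward. Later queries only pay for the
# short prompt.
window, epochs = 4_096, 3
finetune = 3 * epochs * (corpus_tokens // window) * attention_score_ops(window, layers)
after_finetune = num_queries * attention_score_ops(query_tokens, layers)

print(f"in-context every time: {in_context:.2e} score ops")
print(f"fine-tune then query : {finetune + after_finetune:.2e} score ops")
```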
I have created an algorithm for a continuous learning model, which can help software developers keep track of problems solved. But the implementation needs more brains, so it's going slowly while I'm on my own. I wasn't aware someone else was thinking along the same lines. Thanks for the video.
I was really expecting you to say “but I can’t fit it in the margins”
Let's be honest, documentation generally sucks. I'm an open source contributor and the biggest issue users report is "your docs suck". RAG doesn't solve out-of-distribution situations, so it's not a magic bullet. Even when developers try to write good docs, they quickly become out of date and wrong. Until we have better architectures, continuous training will be needed.
The main reason we don't do CL is because there is nothing in old-fashioned CL that actually works.
Currently the only thing that learns from what came before and improves is in-context learning.
Yes, we need continual learning, but it should come from making in-context learning lighter.
More fundamentally, we need to separate memory-based in-context continual learning from parameter-level continual learning.
The former accumulates event-based knowledge, while the latter builds up intrinsic reasoning capabilities. The former should be based on external memory, while the latter lives at the parameter level. Right now we have the former in the name of in-context learning using token memory.
The community is already doing research along this direction, although not under the name of CL.
What keywords are being used instead of continual learning? Can you share, please?
@@augustecomte7980 I don't think there is a particular name for it. And I think there is no framework yet for parameter-level continual learning in a strict sense, where a model can build up its reasoning capabilities through additional training solely on new incoming data. I don't know much about industry, but the latest paper in this direction is, I think, Iterative Reasoning Preference Optimization. But it requires all the data, not just the new data.
The poetry example has a hidden layer to it. It does not matter whether an AI is trained on high quality poetry. It cannot create high quality poetry, because this requires a completely different kind of intelligence than mere pattern recognition of things that have already been written.
Bingo.
Remember back when we were first learning about grammar, parts of speech, and literary devices? Perhaps an LLM can be used alongside a classroom of children and learn all these finer nuances in the same way. Of course, I think it is also worth recognizing that we create from our experiences. The way we felt at the park on that one special day, when the cool summer breeze was gently blowing across our skin. Would the lack of physical/emotional experience always detract from the quality of a poem? Perhaps it would only succeed in the creation of a more cryptic form of poetry... like, "The Red Wheelbarrow" which lacks technical correctness entirely yet is still a poem of quality.
Great explanation about the limitations of current LLMs. I worked with RAG systems recently and it works...for very specific limited knowledge retrieval type applications. We are still so far away from these systems being generally useful. There's probably a lot more you can do with MoE now that a lot more effort is being put into making models smaller.
(Interactive answer from 2:45) RAG doesn't integrate information across contexts. Sure, you can overlap chunks, but you're always running the risk of missing aspects of the model that fall outside the fragment window (Tree and Graph RAG overcome this in different ways, but then you're still doing it in an offline manner, which may or may not be an issue for the use case).
Continuous learning is not possible at the moment for most top open-weights models because the creators have not released the data used to train the model. A significant amount of training on data that does not include the original data will lead to degradation of the model.
All the more reason to work on this; that is a problem that needs solving. Having all the original training data shouldn't be a requirement for continual learning.
@@EdanMeyer This requirement is not something man-made; it's a fundamental property of neural networks. Watch Andrej Karpathy's talk at Sequoia Capital where he discusses these "open-source" models from big tech. You can generate high-quality data using your approach, but you will need to train a model from scratch using that data. Another approach you could take is to categorise your data into "bins", where each datapoint is much more similar to data inside its bin than to data outside it, then train low-rank adapters on each bin, and use those low-rank adapters to create a pseudo Mixture-of-Experts model. Then as you keep finding new data you can keep adding new low-rank adapters to keep increasing the capabilities of your model.
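A hedged sketch of the per-bin low-rank-adapter idea from that comment, using the PEFT library. The binning itself (clustering new data into coherent buckets) is assumed to happen elsewhere; the base model, target modules, and hyperparameters are arbitrary placeholders.

```python
# Sketch: one LoRA adapter per bin of similar new data; the base model stays frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "gpt2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(base_name)
tok.pad_token = tok.eos_token

def train_adapter_for_bin(bin_texts, adapter_dir, epochs=1):
    """Train one low-rank adapter on one bin; only adapter weights are updated."""
    model = AutoModelForCausalLM.from_pretrained(base_name)
    lora = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                      target_modules=["c_attn"], lora_dropout=0.05)
    model = get_peft_model(model, lora)
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
    for _ in range(epochs):
        for text in bin_texts:
            batch = tok(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    model.save_pretrained(adapter_dir)  # keep adding adapter dirs as new bins appear
```

A crude "pseudo-MoE" router would then choose which saved adapter to load for a given query, e.g. by embedding the query and picking the nearest bin centroid.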
You can self-anchor a fine tuned or continuously learning model using the original, pretrained model's output. For each bit of new training data you learn on, also train on a random completion result from the original model. AFAIK this is what they did with InstructGPT/ChatGPT to get GPT to obey commands.
Well, that's relevant; I was just about to go search for that.
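And a minimal sketch of the self-anchoring suggestion above: alongside each new example, also train on a completion sampled from a frozen copy of the original model, so the updated model stays anchored to its old behaviour. The model name, prompts, and sampling settings are placeholders, and this is an interpretation of the comment, not the documented InstructGPT recipe.

```python
# Sketch: train on new data, and also on completions sampled from a frozen copy of
# the original model, so the updated model stays anchored to its old behaviour.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
student = AutoModelForCausalLM.from_pretrained(name)        # this copy gets updated
anchor = AutoModelForCausalLM.from_pretrained(name).eval()  # frozen original copy
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

anchor_prompts = ["The capital of France is", "def fibonacci(n):", "Once upon a time"]

def step_on(text):
    batch = tok(text, return_tensors="pt", truncation=True)
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

def continual_update(new_text):
    step_on(new_text)  # learn the new data...
    with torch.no_grad():  # ...then sample "old behaviour" from the anchor model
        prompt = tok(random.choice(anchor_prompts), return_tensors="pt")
        ids = anchor.generate(**prompt, max_new_tokens=40, do_sample=True)
    step_on(tok.decode(ids[0], skip_special_tokens=True))  # and train on that too
```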
This sums up my current evaluation of the LLM hype: those models are limited (by the input data, i.e. its quality and field/focus). I do not see the hyped exponential growth, just bigger training enabled by more data and computing power. And regarding creative and smart combined solutions for (new) tasks: without really good prompting leading the LLM, nothing happens in that direction.
Reason 3, which I've worked on since around 2015: my AI needs to be able to learn to get smarter. The weights cannot be frozen, the net cannot be static, nor can it ever become silent.
Great video! I totally agree about the potential downsides about RAG. There's another issue with RAG and in context learning, namely that current LLMs often are good at retrieval when given a long context ("needle in a haystack") but not good very good at connecting the dots and reasoning about what's in the context.
But about your example where context is missing: I think continual learning is also not really helping there. If you don't have data for context, then you probably also don't have it for learning in most circumstances. It seems a more general problem, for which the solution would be an LLM that is better at reasoning and generalizing.
Yes! You got it! I left out half of the argument for brevity, but you also need to be able to get the data to do continual learning. The solution there is to have an agent generate its own data via experience (i.e. RL). The thing is that you can't generate experiences that are continually more relevant to what you want to learn unless you are continually learning. Hence, both of these systems are essential as a pair.
the first thing that comes to mind when thinking about using continual learning as compared to RAG, is that the LLM is quite like our brains, and we can't effectively retrieve all the relevant information or solve the problem with comparable accuracy if we are seeing something for the first time, even if we have all of the context there is.
I'm at the challenge part and haven't watched further:
Continual learning provides flexibility to an otherwise static frame which means it can go outside of its initial scope. In theory this allows an AI to improve on the fly based on input and if combined with a large context length it allows for pretty scary things. Now the biggest problem with that is probably the learning periods which would be tremendously expensive and have to be streamlined in a very efficient and clever method.
If we think of LLMs as brain-esque (which is not necessarily true, but they are somewhat similar), we can think of RAG as just telling you how everything works: showing you the docs, showing you examples, etc. Could you write good code quickly after just seeing examples and docs? Maybe, but it gets much easier after practice. Practicing literally changes how our neurons connect, making certain connections more likely than before. This is analogous to changing the weights of the LLM. Instead of just trying to remember what they were given in the docs - which is somewhat problematic - their weights are literally changed to make them better at this specific task, like how our brains literally change when we practice things.
Definitely interesting points. What it makes me think is that we might wind up with small models that cover a specific domain of knowledge, and over time have coordinator models that are able to interlink the smaller models for collaboration between them.
That seems the sensible way to go.
Nobody expects one superapp that does it all, but we expect a model that does it all.
@@hydrohasspoken6227 The problem is, the direction seems to be heading towards one model for everything.
I think privacy of users can be a real problem if continual learning is not done carefully
Quality of the model can also degrade. More learning doesn't necessarily mean better.
@@li_tsz_fung That depends on the training method and optimizer, plus whether it starts to overfit.
But there is a secret little thing I learned with diffusion models that might be useful: if you merge the model with a tiny percentage of "random weights" as you keep training, it keeps 'destroying' some of the previously contained information, while training afterwards will kinda re-heal the whole thing.
It prevents overfitting quite well, because usually overfitting is when the model starts to develop TOOOOO strong patterns, but a tiny percentage of random weights disrupts that and also helps get it out of any "local minima" it might get stuck in.
Local minima: imagine a possible 'hole' that the training is leading the model state into, but it's not the most optimal one; it just feels optimal to the training algorithm due to the insane multidimensional nature of that many matrices being multiplied together.
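If anyone wants to try this on an LLM, here is a rough sketch of what that periodic random-weight blending could look like in PyTorch; the 1% fraction and the every-N-steps schedule are assumptions I picked for illustration, not values from the comment above.

import torch

@torch.no_grad()
def perturb_with_random_weights(model: torch.nn.Module, fraction: float = 0.01):
    # Blend each parameter with freshly drawn noise of a similar scale,
    # mildly disrupting over-strong patterns so later training can re-heal them.
    for p in model.parameters():
        scale = p.std() if p.numel() > 1 else p.abs()
        noise = torch.randn_like(p) * scale
        p.mul_(1.0 - fraction).add_(fraction * noise)

# e.g. inside the training loop:
# if step % 1000 == 0:
#     perturb_with_random_weights(model, fraction=0.01)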
?? Not at all. Simply have a separate model remove all personally identifiable information. That is actually more elegant, since you can provide deductive rules for the model to use for filtering.
I mean, big companies have made rules against using ChatGPT and the like because, with how much it was used, some prompts would return stuff they were developing in-house, sometimes things that weren't even publicly known.
In the end you need to take care with what data you're handing out and to whom 😊
LLMs and other generative & predictive models are already trained on private user data, and have been for over a decade. An easy example to test this is writing an address in a text completion algorithm and having it write down the name of the person living there. It works sometimes, but it shouldn’t be included in the dataset at all
I've been working on LLMs since about 2018. In-context learning was the first thing that LLMs were good at.
If you continually fine-tune, then you eventually lose much of what your old weights "knew". How do you prevent overfitting and forgetting?
This is the same problem I have with LLMs currently. They just don't use code I've written even though it's open source lol. It's a big problem
Another important thing is, sure you may make a coding LLM, but when you get it to code something from the real world, e.g. simulation software, you will likely need an LLM which has knowledge of both physics and coding. General domain knowledge is useful.
If there's no documentation for the library, what will you train your model on? If there's not enough information for context, how is there enough information for learning?
So what’s the solution? Feed it niche documentation and then critique every response as you get them? I.e. provide feedback as you go so you train the model as you yourself learn?
How exactly are you dealing with catastrophic forgetting?
I actually saw another video that touched on this point, but it was more about Moore's law and how at some point continual learning would be the future because we'll have the computation on hand at a cheap cost. But I am glad someone is working on the problem now so that we can transfer over when the time comes. Humans continually learn, so it only makes sense for our AI models to do the same. Data is the future, and ways to generate useful data will be key.
VERY interesting! A few questions, if I may:
1. Do you use LoRA?
2. Does your LLM operate with a level of certainty about the data it produces? Can it detect a paradox or an absurdity in its own answer? Can it do so prior to the final output?
3. Do you consider self-improving LLMs? Let me explain. Obviously, a static LLM represents a set of frozen-in-time (not-so-simple) links between an input and an output. The quality of those outputs is directly linked to the number of different data points and vectors connecting them (and to the precision of those vectors). So, the question is -- is it possible for an LLM to scan those vectors, extract patterns and then apply them to other areas to get extra data where the density of data is lower? Something like DLSS does with images, you know?
Woooo you're back!
Back and back for a while!
Of course we need AI systems with continual learning; that has been the whole vision of AI for years. But we have to figure out a new architecture for existing LLMs, because every time we fine-tune the system on new semantics, the model weights shift towards the new distribution, and this leads to forgetting of previously learned information, which disrupts the whole paradigm.
Sparse training data in highly specific problems is always going to be a problem. It's basically where hallucinations come from.
I do not see this as an LLM or ML problem. Even humans will struggle to learn and apply something new if they do not have prior knowledge of it. For example, an engineer will struggle with accounting or medicine. Coming back to LLMs: even with context provided (RAG), if no prior knowledge existed in the training, the LLM will not produce the desired output. As stated before, this is not a unique LLM problem; it's a general problem common to humans and machines.
Not really; humans learn and adjust their weights and biases over time. LLMs never do that: no matter how much data you feed them, they forget everything beyond the context window.
The difference is that I don't forget literally everything that happened more than 6 hours ago. It would really suck if that was the case lol. Humanity would never progress.
My guesses:
- A lot of code is undocumented, or documented too differently to generalize learning from it
- Code doesn't just differ from language to language or library to library, but on a per project basis.
- Specs change over time
- The bigger the context, the more difficult it is to draw out the correct information
- More context makes generation slower or more error-prone.
oo, i was spot on with some of these, but missed the training specialization one. I was kinda thinking of it, but only in the context of specialization on your type of code.
Great points. I agree…partially. I definitely believe in lifelong-learning styles of training. I say partially because I see an inherent rigidity. First off, what data mixing is being used? You didn't mention this at all. Whenever we continually train, you notice spikes in loss; this is due to an anticipatory behavior of these models on past representations. So this means we need to contextually sample small portions of pretraining data and mix them with the new data, along with some DAP-based method and careful learning-rate tuning. We can greatly mitigate catastrophic forgetting that way. Ain't no way this isn't abundant in your setup? How do you approach this?
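For anyone curious, the replay-style data mixing being asked about here can be as simple as the sketch below; the 10% replay fraction and batch size are arbitrary assumptions, and a real setup would sample token sequences rather than whole examples.

import random

def mixed_batch(new_data, replay_data, batch_size=32, replay_fraction=0.1):
    # Each continual-learning batch mixes new examples with a small sample of
    # old pretraining-style data to push back against catastrophic forgetting.
    n_replay = int(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    batch = random.sample(new_data, n_new) + random.sample(replay_data, n_replay)
    random.shuffle(batch)
    return batch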
Also, long-context ICL has its faults, but I feel long-term it will outshine raw tuning. Think BatchICL or many-shot ICL with LM-generated interleaved rationales. Any limitation you mentioned can be optimized. Pretraining or continual pretraining with a joint LM (compiled NN, maybe even great latent variables for the LM to represent different states to trigger different types of symbolic programs in our compiled NN) and system-2 attention (reformulate to append a linearized KG after context refinement)…
I still believe in long-context ICL, long-term. Great content though. Especially if the LM is equipped with symbolic logic; obviously it would have profound impacts on generalization, assuming your pretraining data is augmented with this style of hybrid data. You literally need an LM to process all training data, I believe, for this to really work (not changing, but adding to the sample context; you must retain the original entropy or…model collapse 😂). At least this is the first principle my startup is building from.
I mean I feel like extremely long contexts, and a model that can actually utilise this context to a good degree of accuracy, is essentially continuous learning in a way lol.
But also you do have some good points. And this reminds me of what Andrej Karpathy said about how pretty much no current models are truly open source. If you wanted a truly open-source model you would not only open-source the weights but also the training sets, for example, and having access to these training sets allows you to do things like continued pretraining (as you also mentioned, training with a mix of the old and new does well in mitigating catastrophic forgetting, and I'm pretty sure that's how OAI has been updating the knowledge cutoff in their models).
I've never seen RAG as a long-term solution, as it just doesn't allow the model to capture the things it would via continued pretraining or ICL. With that extra context it can build up an intuition of sorts about the information you have given it, which I think will be much more advantageous, and it's overall just more flexible.
And for continuous learning, maybe you could update only a subset of the model, the relevant weights? This would be a lot cheaper, CF wouldn't be as prevalent, and those spikes in the loss should also not be as dramatic because you are updating the weights that are more relevant to the new data. It might also lead to reduced overfitting because you're not allowing the entire model to adjust to the new dataset. But this solution does present its own challenges.
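One naive way to try that subset-update idea, purely as a sketch under my own assumptions (parameters scored by gradient magnitude on a single probe batch of new data, with an arbitrary 10% keep fraction):

import torch

def freeze_irrelevant_params(model, probe_loss, keep_fraction=0.1):
    # Score each parameter tensor by its gradient magnitude on a probe batch of
    # the new data, then freeze everything outside the top `keep_fraction`.
    probe_loss.backward()
    scores = {name: p.grad.abs().mean().item()
              for name, p in model.named_parameters() if p.grad is not None}
    n_keep = max(1, int(len(scores) * keep_fraction))
    cutoff = sorted(scores.values(), reverse=True)[n_keep - 1]
    for name, p in model.named_parameters():
        p.requires_grad = scores.get(name, 0.0) >= cutoff
    model.zero_grad()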
@@DanielSeacrest Great points, bro. Facts: super long context and proper context utilization (for the most part, a data fallacy) is essentially active learning, since attention perturbs activated neurons. I've seen some cool research where they backprop at test time against a constraint to "look into the future", all that and things like Quiet-STaR… I see a clear pathway to removing the need for fine-tuning in MOST use cases.
Yea that makes sense, given a training corpus, identify the most relevant subnetworks and freeze the rest. Wonder how we account for polysemantic nature of models though. Literature time lol. Saying this out loud, it seems so obvious. Has to be a paper already ha.
Sounds like it would be very useful for DevOps workflows, where a bunch of often poorly documented systems need to be joined together, often by some degree of trial and error.
More such videos please! Also, previous title - "Why Longer Context Isn't Enough" was better.
Beautiful drawings. Erasing the drawing, that is a nice transition. Great presentation.
Have you considered the security implications of attacks like model inversion against continuously trained models? You’ll need some way to cheaply segment knowledge base access between untrusted tenants or users. That could lead to a lot of parallel models.
How does your startup avoid catastrophic forgetting when doing continuous learning?
One approach is to implement continual learning gradually; you could use recently finished projects to train the model and eventually evolve to real-time continual learning. Yes, I understand you would like a problem-solving prediction based on inference from other experiences. You are asking the model to predict solutions based on abstract patterns that don't exist. Models already do that in the form of hallucinations.
I've also been thinking about this topic for a while. My reflection took me to the point of how humans would normally address this issue. As we know, you can be an expert in some fields, but quite unlikely in many fields. On the other hand, we can leverage our "learning" ability to solve any problem, given the learning materials. Intuitively, this "learning ability" sounds fundamentally different from the weight updates of LLM fine-tuning, where we use this ability to learn knowledge. From this point of view, we might still be far from "General AI".
If you're constantly retraining the model, how do you know whether the model is working or not, given the constant changes? Imagine changing out the engine on your vehicle every day, then measuring mpg. Which engine performed best if you changed it every day?
Is it related to hallucinations or generalizing to edge cases?
We do exactly the same thing. What happens is clients think fine-tuning and in-context learning can get them to their target goal. Even worse, some other company has convinced them that fine-tuning IS the way to go, not knowing these AI wrapper companies just pretty much feed their data into ChatGPT to generate synthetic data to fine-tune on. 1. They call our tech pitch BS. 2. After 2-3 weeks, we get a call asking why fine-tuning only results in 60-70% accuracy, and then we tell them "we told you so."
By continual learning do you mean full fine-tune on new data ?
Train on new data as it comes in.
If you could fine-tune the model whenever it isn't active on the context window, it would be great.
Do you use LoRA to add knowledge to the model?
2:51 My guess for continual learning: adapting to a continuously changing and unpredictable environment.
Wouldn't the best approach be both? Continually learning and a buffer of RAG to have the latest information available?
what's the whiteboard tool you were using btw ?
But what volume of data is needed to have a substantial impact on an LLM? 10k, 100k, 1M, 200M?
Interesting indeed !
Good luck with your start up !
Thanks!
I say this all the time: no matter how awesome a deep learning model is, if it can't do continual learning it's a huge problem. A lot of practitioners work as if this problem doesn't exist.
you voiced some of my misgivings about the RAG in-context learning approach 👍
I paused as well
I've thought about this issue with my own projects. RAG in production can be expensive. I can ask GPT-4 questions about various codebases and it performs really well, but in the long run it's more efficient to get away from RAG.
That is an interesting idea. But from the point of view of the user, how do you manage the continuous learning? I mean, where do you get the data from to keep learning if the data doesn't exist? You mentioned the model learning with the user; the question is how?
I would flip the question to them
all animals sleep in order to integrate lessons learned into their embodied knowledge so as to avoid holding it in working or short term memory
why wouldn't we expect llms to benefit from this same pattern?
This was a really interesting video for me doing research on LLMs. Thanks. Have subscribed
I’d love to learn from specific examples where in context learning fails where continuous training succeeds. I don’t fundamentally understand the difference between fine-tuning against data set A and adding data set A to the prompt for in-context learning (except live performance as set A increases). The video is not clear (to my understanding at least) on whether this is the comparison, or you’re comparing fine-tuned LLM A against in-context learning of some subset of A. Help 🙂
In the event of pursuing research in the realm of "Theoretical Physics" or any theory-based research that prolongs for decades and decades before a theory is debunked... or until a new theory arises, what effect would that have on a Large Language Model applied to these areas? Having an LLM trained on theories that in the end may prove false would create a "hyper-intelligent" LLM built on incorrect data... would it not? Is this an area of research that allows for the application of LLM's? If so, to what extent? Perhaps this is not an area of research an LLM would even be trained on. I am in no way educated on the subject, but this video did immediately bring to mind those questions. I think continual learning may prove successful only in the event we are continually learning in a direction that leads to a "correct" conclusion.
2:40 I don't know if I understand correctly, but my intuition from using GPT is that it's really bad at synthesizing new info with the RAG method. I was excited that I could build a GPT therapist when the GPT assistant feature was introduced, only to find that it's impossible by just pasting in the documents. Being a therapist and psychologist is about grasping the idea of the texts and synthesizing it appropriately for the client.
I've only used RAG systems in practice, is this video implying somehow mutating the weights of the base model as part of the continual usage of the model?
Is this where agents come into play?
Why use continual learning? After pausing the video, my guess is to reduce the marginal cost of execution: instead of context stuffing and consuming that many tokens, embedding the knowledge into the weights reduces output size/cost.
I have to question whether LLMs - even in their most advanced form - are even plausibly capable of doing what you've declared your goal to be (generating brand new ideas that have never been thought of before). LLMs are prediction engines, and in order to predict, they learn from the past. How can one predict what has never happened, strictly from learning from the past? You might be able to predict that something new will happen, but never what that would be. I suppose that could be interesting in its own right: the combination of factors X, Y, and Z results in a hole in the LLM's prediction.
_Humans_ are pretty terrible at this task. Rarely does anyone have a truly new idea. Most of what feels like new is just recycled bits and pieces of old, reassembled into a different whole.
you = npc
@@maalikserebryakov what truly insightful, original, inspired thought it must have taken you to construct such a brilliant ad-hominem attack! How clever of you to skirt the difficulties of constructing a counter argument! I'm sure your mommy and daddy are very proud of you. Gold star!
What about now?
A good point is that in-context learning should have a pre-text context which is comparable or related to the new knowledge.
Why not use a vector database to store the continuous data as separate memory, like humans do?
Humans don't work like this
Which software was used to make this video?!
You'd like to bless the model with better priors; there are some tasks, like writing smut, that the model simply can't do. It makes sense to continually train on that, and it becomes more important as vision gets introduced as a modality.
Training models as you guys are is how humans learn generally
Also wasn't there a paper that showed AI tended to place less importance on everything in the middle of the prompt? i.e. it only cared about the start and end.
I really can’t wait for your startup update once “GPT-5” or whatever it will be named is released.
Why
@@Iam_inevitabIe They think this is one of the problems that will be solved by new models.
I am totally OK with RAG, especially when the context is large. In my last project I got 95% retrieval accuracy. That is more than enough.
Same for game development - you spend a lot of time coming up with novel solutions for problems - you won't have any training data for most of it.
I would think it's probably more interesting to start with a micro LLM (a few gigabytes of weights), so as not to have to pull the weight (pun intended) of too much pre-learning.
I would also avoid adding multiple iterations of a documentation, and instead only add the differences after the first learning pass.
There's a lot to experiment, and eventually learn.
I get what you are saying and don’t disagree in general. I just hope you have top people with good depth of experience on weight consolidation/elasticity engineering and related areas to keep from having problems that are inherent to all continual learning projects I have had experience with. Personally, in most instances I would employ a structured ensemble multi-LLM approach leveraging Graph RAG with selective use of ICL.
The probability of catastrophic forgetting and other corruption escalates in a very unpredictable way if you don't use ICL very judiciously and with a lot of careful, strategic planning. I saw more than one great team lose many months of progress by having to backtrack several times, only to realize that the only scalable approach was to combine strategies; the other teams combined LLMs and Graph RAG with ICL approaches.
Since the params don't change in the long-context case, the model won't extract features in depth.
LLMs do come up with solutions based on mosaic analysis, i.e. combining different relevant assumptions in one analysis. Even for humans it is not straightforward; one research paper may take more than five years to complete in certain cases.
Could not find the email you mentioned
What's your startup, btw?
I beg to differ. You are missing the largest point of RAG. The point of RAG is that I can store all personal information and context related to me in a vector database. Then when GPT-5 or GPT-6 releases I can instantly switch the base model over and have it access my vector database, whereas you will have to train GPT-5 from scratch. As a small startup you are never going to be able to compete with Google/Amazon/Nvidia at training the underlying models. So it's better to use a RAG system where you can use ingestion and inference to continuously improve the knowledge in your database. Every RAG company will be 100x more powerful when GPT-5 releases. When GPT-5 releases it will make your company, and any others who were previously training Llama or GPT-4, obsolete until they train on GPT-5.
RAG also isn't as limited as you suggest. Traditional RAG systems run into the pitfalls you focus on in the video, but a lot of smart people are working on fixing them. For instance, a project I've been working on uses the LLM to format the data far more than previously thought possible before it's embedded in the vector database. This makes your database much more powerful, because you regain a lot of the context a typical RAG system would miss.
As context length increases we can pass more data from the vector database along with the prompt. AI models now allow us to specify more distinctly what is context and what is the actual prompt. And by allowing the AI to make multiple return calls to the database, we can gather any information missed on the first pass.
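A bare-bones sketch of that enrich-before-embedding step; llm_rewrite, embed, and vector_db are placeholders for whatever model, embedding function, and store you actually use, and the chunk size and prompt are made up for illustration.

def index_document(doc_text, llm_rewrite, embed, vector_db, chunk_size=1000):
    # Split the document into fixed-size chunks, then have the LLM restate each
    # chunk so it is self-contained before it is embedded and stored.
    chunks = [doc_text[i:i + chunk_size] for i in range(0, len(doc_text), chunk_size)]
    for chunk in chunks:
        enriched = llm_rewrite(
            "Rewrite this passage so it is understandable on its own:\n" + chunk
        )
        vector_db.add(vector=embed(enriched), payload={"raw": chunk, "enriched": enriched})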
Let me know your thoughts on this.
When ChatGPT arrived, it was like a wave, but it turned out to be another case of novelty-driven convergence bias (getting caught up in the enthusiasm of a new idea and devoting all of your energy to making it even better, without evaluating alternative choices or thinking outside the box). As a non-coder who has been following the LLM space for the past 6 months, I have realised that it is too long a journey to get a smart brain that will learn and process in order to execute; it is just a large language model, which implies constricting all data into a model, making it difficult to continuously fine-tune or train. Limitations in neuro-symbolic AI include the integration of structured knowledge representations, adaptability, abstract thinking, and reasoning. Emotional intelligence and creativity were the main reasons I suspended my interest in LLMs. I had plenty of creative ideas. One example: I extracted thousands of lyrics to create a model that would creatively construct a song and put existing lyrics into a knowledge graph, but it was only a half-finished copy-paste job; the process of delving deeper into each lyric that I input was much more difficult to implement. Now, I only use ChatGPT or local LLMs for grammar correction and basic information.😅
The problem seems a bit more obvious if you consider the limits.
If you have no base model, the in-context learning won't work.
If you have a base model but more knowledge than fits in your context, in-context learning won't work.
At some point, you are obligated to improve the base model to get results, so it's a bad premise to aim for a solution where you aren't training base models at any level.
Great video, nice style!
Good logic, but it's really hard to convince people of this when you're pitching a startup 😅 everyone just likes big LLM big context go brrrr
Happy you liked the style, this is the same style I used to use in my first videos. Might do a bit more of it and refine it.
Intuitively I have always thought this. Really interested in what you are doing. Where can we find more info on your work?
Do you have internships at your startup? Looking to gain experience.
Has anyone built an AI chatbot for a client/company? If so, I wanted to know whether a tool that monitors your AI chatbot for incorrect or dangerous responses, alerts the developer, and logs it when it happens would be useful. My friends and I built such an AI monitoring tool for a hackathon and wanted to know if it would be helpful for others.
I think this was the limiting factor for me trying to build an AI chess coach. I could give it access to chess analysis APIs in-context and explain how to use them, but I couldn’t get it to actually reason about the positions and break them down for a human. Chess was the poetry in your example. Surprised because I’m sure there’s plenty of chess content in the training data for large models, but I guess chess is orders of magnitude more complicated than most topics, so it needs much more targeted and focused coaching data 🤷🏽♂️
life-long learning
Hi. I would love to collaborate; I am in the process of developing an agentic AI system with a somewhat similar philosophy to yours for large language modeling, albeit with a few caveats.
I couldn't find your email in the description though.
Hello! You mention an email in your bio but I'm not seeing it, only Twitter. Maybe you can add it so I can reach out. I'd love to learn more about your continual learning project.
It's like the difference between cramming, and in-depth learning.
Nice, I might use this analogy in pitches
What's your startup?
Why don't you try spiking neural networks to create patterns and use them to convert the data in real time?
The comments section here are gold 🙌🏾💜
My 2c: Both CL and RAG are vital. RAG provides the closest to real-time context and CL ensures the base model can appropriately interpret the context and meet output requirements.
We're still very early in this latest season of AI solutions. I think arguing about which is best isn't the most productive use of cognitive resources. It depends entirely on the application use case.
Some applications should absolutely be fine-tuning their own CL systems. But the vast majority can be handled with RAG. This is the allure and popularity of RAG.
It's the tool that's cheap, accessible, and will get you pretty far most of the time. Unless you're an experimental quantum physicist or something, then you'll surely need to fine-tune your own.
With better data engineering for RAG systems I'm confident both points you described can be resolved within 6-18 months.
Plenty of work to do till then 😁
The problem is the context; it can only understand a page.
Paused to subscribe😊
While I'm at it: the quality of coding solutions based on RAG is not good enough, even if there is documentation with examples. I know this problem from using a ChatGPT-based solution to support a novel modelling approach in OR with a just-to-market solver.