Continual Learning

What is continual learning and what is it trying to accomplish?

Continual learning is the idea that we need model systems to continuously consume information and retain access to that information in the future. The information should be integrated, not merely stored. Continual learning carries the baggage of “memory” and “learning”. It carries the baggage of “what do humans do compared to LLMs?” and “do we care that humans and LLMs process the world differently?” Overall, continual learning is trying to get LLMs and ML systems to mimic the benefits of human learning.

Loosely, humans take information into short-term memory. We have to actively attend to it or the information is gone. A hand-off then begins: short-term memories start a consolidation process into long-term memory. This process happens over minutes, hours, and days, and it incorporates sleep and dreaming. Depending on the granularity, we can draw many analogies to this. I will try to avoid doing so.

One point to make up front is that continual learning is not really continuous. Current LLM training is batched learning: we take a model and compress information into its weights in consolidated training runs. Even if we compress information into the weights one sample at a time, it is still a batch of one. Even if we stream bits directly into the weights, there is a quantum. I’m splitting hairs, sure.

In chemical engineering, a process is continuous when the input is not interrupted. For instance, the reaction we care about happens as fluid flows through a thermal reactor. We could still break out quanta, like a subsection of the fluid, and call it a batch, but it doesn’t matter at that point. So I’ll consider “continuous” to be when we don’t care about the quanta. Semantically, batch to continuous is a gradient. From an engineering perspective, it is primarily a pro-con decision with barbell-like dynamics. These engineering constraints are core to the continual learning discussion.

The goal of continual learning is to throw off the shackles of context limits and rot. It is to allow the LLM to use any information and incorporate it into its knowledge. Effectively, we want something that allows the LLM to no longer go out of date or be limited by anything other than the availability of relevant data. We no longer want “too much data” to be an issue. This goal is our challenge. The achievement is termed continual learning.

Data and the Model

In order to have any learning, we need to get data into the model. I like to think of three entry points: input, intra-layer, and weights. Input is what gets transformed into embeddings; for instance, a multi-modal model would take language tokens and image embeddings/tokens. This is traditionally your context. Intra-layer gets a little more fun: we start to see techniques like cartridges and steering vectors, where you inject data directly into the model’s cache or hidden states. Finally, weights are the base of your model. You update and compress information into the weights during training.
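To make the three entry points concrete, here is a minimal PyTorch-style sketch (the model, sizes, and names are hypothetical, not any particular architecture): the context enters through the embedding layer, a steering-style vector can be added to hidden states between layers, and training compresses information into the weights via gradients.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=32000, d=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)   # 1) input: tokens/context become embeddings here
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d, nhead=8, batch_first=True) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d, vocab)       # 3) weights: where training compresses information

    def forward(self, tokens, steer=None):
        h = self.embed(tokens)                # input data enters the model
        for layer in self.layers:
            h = layer(h)
            if steer is not None:             # 2) intra-layer: e.g. a steering vector added to hidden states
                h = h + steer
        return self.head(h)

model = TinyLM()
logits = model(torch.randint(0, 32000, (1, 16)))   # inference: data enters via the input path only
logits.sum().backward()                            # training: gradients push data into the weights
```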

A big contention of continual learning is In-Context Learning (ICL). The concern with ICL is that it does not integrate information. How information is introduced to the model decides whether that information is retained, and ICL is an input-based introduction that is not inherently retained by the model. Some people argue this is problematic. What this argument reveals is the system vs. sub-system view of LLMs. In the system-level view, the LLM is the system: the LLM is the whole brain and everything should go through the weights. In the sub-system view, the model is only part of the system: the LLM is not the whole brain but a processing center of the brain.

I will not be settling this debate here because it is mostly a false choice. I like Tesla cars as an example. I believe it was Karpathy who spoke about how the neural net ate many hardcoded subsystems within the early cars. So eventually the LLM will consume much of the harness. However, you already have sub-systems that the LLM can use but not replace, like tool calling. You can argue the model should learn from the tool call, but you would be hard pressed to argue that tool calling is unnecessary or without benefit. There will be more outgrowths from the LLM. Again, we find a gradient of opinion, but engineering choices set by limitations.

Input Data

There are a few points on input data I would like to make before we progress to Recursive Language Models (RLMs), tool calling, and retrieval augmented generation (RAG). In a simple chat system, your query and context pass through the model’s tokenizer and embedding layers. You may also have other modalities like images, audio, etc. These end up combined in embedding space. As context grows, you begin to run into issues like rot and memory constraints.

You can have the LLM summarize your context in these situations. You could log this information into a text file. You could train to extend context and suffer. The idea of these variations is to compress or limit context in the input space. But we don’t just want to compress; we also use tool calls to gather extra information. Currently, many implementations naively add everything to the model’s input context and let the model sort out what it needs to answer queries.

If you are summarizing, you may do so via a tool call. This tool would take your context and pass it to an LLM to condense; the condensed context then replaces the object you passed to the tool. There are obvious limits here: no matter how good your training, you have a finite compression capacity in language space. You could also use RAG over a context text file. Instead of trying to compress through summarization, you tailor the information within your context window to the prompt and only pull relevant information from the context file. This is now limited by the quality of your RAG system.
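As a rough sketch of these two compaction strategies, here is what they might look like with hypothetical `llm()` and `embed()` helpers standing in for whatever model API you use (both helpers, and the prompt wording, are assumptions rather than a real library):

```python
import numpy as np

# Hypothetical stand-ins: llm(prompt) returns text, embed(text) returns a vector.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compact_by_summary(context: str, max_chars: int = 8000) -> str:
    """Replace an overgrown context with an LLM-written summary of itself."""
    if len(context) <= max_chars:
        return context
    return llm(f"Condense this context, keeping facts needed for later turns:\n\n{context}")

def compact_by_retrieval(query: str, context_file: str, k: int = 5) -> str:
    """RAG over a context log: pull only the chunks relevant to the current prompt."""
    chunks = open(context_file).read().split("\n\n")
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return "\n\n".join(ranked[:k])
```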

The RAG system is the beginning of the development of RLMs. What if the context is part of the environment with which the LLM interacts? Within an RLM, the context is a string inside a Python REPL environment, and the LLM is encouraged to explore it and break tasks down into sub-tasks for sub-agents. I encourage you to read the paper here: https://arxiv.org/html/2512.24601v1. The long and short is that RLM was observed to scale to 10M+ tokens on GPT-5. That’s a ~40x extension of the context window at the cost of inference-time compute.
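To illustrate the shape of the idea (this is not the paper’s implementation, and `llm()` is again a hypothetical helper): the context sits as a plain string inside the REPL, and the model writes code against it rather than reading it whole.

```python
# The full context lives as a Python string in the REPL; it is never sent to the model in one piece.
context = open("huge_log.txt").read()             # could be millions of tokens

# The root model might emit code like this to orient itself...
print(len(context))
print(context[:2000])                             # peek at the beginning
window = 10_000
hits = [i for i in range(0, len(context), window) if "error" in context[i:i + window]]

# ...then delegate focused sub-tasks over the slices it selected to sub-agents.
notes = [llm(f"Summarize this slice:\n{context[i:i + window]}") for i in hits[:5]]
answer = llm("Answer the user's question using these notes:\n" + "\n".join(notes))
```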

Intra-Layer Data

Intra-layer data additions are a fun subset of data introduced to LLMs. This is not intended to be a comprehensive evaluation but to point at the general category. I will focus on steering vectors (here is a related paper on persona vectors: https://arxiv.org/pdf/2507.21509, here is a related blogpost on repeng: https://vgel.me/posts/representation-engineering/) and cartridges (https://arxiv.org/pdf/2506.06266).

Steering vectors are vectors that you add to the activations of an LLM in order to affect generation. You are adding data to change the output. One of the most famous examples is Golden Gate Claude (https://www.anthropic.com/news/golden-gate-claude). If you ever want to play with these, there is an easy-to-use Python library: https://github.com/vgel/repeng. Steering vectors are primarily targeted at personality studies rather than knowledge. However, they are an example of adding data between the layers of an LLM.
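Mechanically, the idea can be sketched with a plain PyTorch forward hook (this is a generic sketch, not repeng’s actual API; the layer path and direction in the usage comment are illustrative):

```python
import torch

def add_steering_hook(layer, direction: torch.Tensor, strength: float = 4.0):
    """Add a fixed direction to a layer's output hidden states on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction        # nudge activations toward the concept
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Illustrative usage (module path depends on the model):
# handle = add_steering_hook(model.transformer.h[15], golden_gate_direction)
# ...generate as usual; activations at that layer are steered...
# handle.remove()
```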

Cartridges are a little different. Since the KV cache scales with context, you end up with much larger memory consumption at longer contexts. Cartridges propose offline training of a smaller KV cache. The naive implementation is not competitive with ICL, but using a technique called self-study (https://arxiv.org/pdf/2506.06266 - this is the same study as above) we are able to train competitive cartridges. These cartridges are then loaded into the KV cache at inference time. The stated compression is ~40x memory savings. Additionally, the effective context was expanded in the quoted study.

When training cartridges with self-study, the authors generated synthetic data. They used a chunking/RAG-like system to generate synthetic conversations about the data, and the cartridge is then trained on these conversations to distill a compressed KV cache. To me this is fairly interesting because the context extension has more to do with the RAG system than with the cartridge itself. The primary objective of cartridges is to save memory, and it does.
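To make the inference side concrete, here is a rough sketch of what loading a cartridge could look like with a HuggingFace-style `past_key_values` interface (sizes and the exact cache format are assumptions, and in the real method the tensors are trained via self-study rather than random):

```python
import torch

n_layers, n_heads, cart_len, head_dim = 32, 32, 512, 128    # illustrative sizes

# A "cartridge": a small KV cache distilled offline on synthetic conversations about a corpus.
cartridge = [
    (torch.randn(1, n_heads, cart_len, head_dim),            # keys for one layer
     torch.randn(1, n_heads, cart_len, head_dim))            # values for one layer
    for _ in range(n_layers)
]

# At inference, the cartridge stands in for re-processing the full source document:
# out = model(input_ids, past_key_values=cartridge, use_cache=True)
# The model attends over ~512 cached positions instead of the whole corpus.
```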

Weight Data

I’m going to keep this section short. LLM weights are among the highest forms of compression of information. The holy grail of continual learning is cheap, easy integration of new knowledge into the weights without degrading the model’s existing knowledge or logic. The current mode of weight updates is batch processing. Models are pre-trained sporadically per lab, then subjected to reinforcement learning or continued training to continue the weight update process.

The missing piece for LLMs-as-the-system is that adding data to the weights is not a continuous process. The research side of continual learning would claim that architectural changes here will enable us to continuously add to the weights. The issues being addressed are degradation due to catastrophic forgetting and learning efficiency.

Current Continual Learning

If I had a gun to my head and needed to cite a continual learning system, I could: I would call SOTA coding agents continual learners. It is not an LLM-as-the-system definition, and I don’t think it needs to be. When I use a coding agent, it will learn my environment and the specifics of the libraries I have installed. I could be wrong about the libraries; maybe the coding agents know every change of every library. My assumption is that this is look-up data.

There is a point where it does not make sense to search for knowledge. For instance, if I ask for a simple script in Python, I will get a script in Python without any tool calling or RAG. That is weight-stored data. Which raises the question: when do we decide to look up data versus storing it in the weights?

You will notice I glossed over intra-layer data. I personally do not see a bright future here. If you already have to generate synthetic data to make the cartridge work, then you are likely better off encoding the data in the weights. If the lookup-vs-weights optimization skews toward lookup, then it is unlikely you will want to train a cartridge. In a world of barbells, the middle ground is oft barren.

Optimization was a deliberate word choice. Weight updates have their limitations, and input data with ICL has its own. We then run into the economic reality of serving costs and compute usage. For one-off tasks that are within distribution but require new information, you will likely use input data and stored “memories” (text files). Even on tasks that require learning a new skill, ICL will often beat out fine-tuning. If you have a task that will be repeated and you have the necessary data, then fine-tuning will beat out input data given specific constraints. There is no clear dividing line between input techniques and continued training; the chosen technique will be decided by the implementation and limitations of the system as a whole.

This does relegate weight updates to a batch process. RLMs are a new development that seems to span the current gap between weight updates and in-context learning. An RLM uses an immense memory of prior actions and current ‘context’ as it relates to the current query. My expectation is that once the RLM context reaches a certain size, it becomes the target of synthetic refinement (dreaming) and weight updates. Weight updates remain a batch process, but the system as a whole still learns continually.

In the above description, we see the ability for the LLM to have an immense context without the current limitations of immense contexts. That is, we have an object that acts as information storage with weaker degradation constraints on our model. If we need more ICL examples, we can set up a sub-agent to develop them. If we need only 1% of the context to answer a query, then we can use only 1% of the context object. If we need synthetic data so we can integrate the knowledge accrued by the LLM into the weights, then we can set up a sub-agent to generate it. RLMs appear to bridge weight updates and input data into a cohesive continual learning system.

Humans and LLMs

In humans, you encode data -> short-term memory -> long-term memory -> long-term refinement/rejection. Oftentimes the memory refinement and rejection happen while you are sleeping. The memory process in humans does not seem to be a continuous process. There is near-instantaneous neural encoding between short- and long-term memory, but that is not the end of the process. We then need to queue and synthetically augment that batch of training data. When we sleep, we review our batch and further encode it in our neural architecture.

In LLMs, there is currently no short-term/long-term memory that is neurologically analogous. That is not an issue, since humans and LLMs operate under different constraints. We could make LLMs more and more human-like, but that isn’t necessarily beneficial. I can’t write out 10 files in parallel, then search over them nearly instantaneously while spinning up 10 copies of myself to query what I previously wrote. Although, that would be immensely helpful! LLMs shouldn’t be restricted by human limits, but they may be inspired by them.

When I was in school, we still often had old-school teachers who wanted us to memorize. That was a major rift in pedagogy while I was growing up. Why should I memorize when I can look it up? My thought was that I would eventually memorize anything that crossed a threshold of usefulness. Until then, it was more efficient to look up the data on Google. LLMs should behave similarly: they only need to encode in their weights what is optimal to encode.

One issue with encoding in weights is the variance within gradient updates. Revisiting cartridges and steering vectors, they all struggle with the same issue: variance (noise). Somehow, when humans update their knowledge it is less noisy, which may be why we learn better from fewer examples. ICL also learns from few-shot examples and beats out cartridges unless the cartridges have extensive synthetic additions; those additions are what reduce the variance of the cartridge KV cache updates. A similar phenomenon happens in the weights: a large constraint on continuous weight updates is the gradient noise of the update. I have found this area of research is what a lot of people gesture at when they discuss continual learning.
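As a toy illustration of the noise argument (ordinary linear regression, nothing LLM-specific): the variance of a stochastic gradient estimate shrinks roughly in proportion to the batch size, which is one reason one-sample-at-a-time weight updates are hard to do cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(10_000, 2))
y = X @ w_true + rng.normal(scale=0.5, size=10_000)
w = np.zeros(2)                                      # current parameters

def grad(idx):
    """Mean-squared-error gradient on a mini-batch of rows idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

for bs in (1, 16, 256):
    grads = np.stack([grad(rng.integers(0, len(X), bs)) for _ in range(1_000)])
    print(bs, grads.var(axis=0).mean())              # gradient noise falls roughly as 1/batch size
```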

Engineering Constraints and Research

Researchers don’t like being engineers and engineers don’t like being researchers. Elon Musk had an inflammatory tweet where he broadly put researchers at major labs in the engineering camp. I will repeat a motif: I view these camps on a continuum. There are researchers, R&D engineers, and engineers. I like to think of engineering as the implementation side of what is known. R&D engineering takes research and develops what can be implemented. Research is discovering what is unknown. Engineering primarily works with the discrete and known. There is a lot of practical research in the R&D process, because engineering can be more discrete in practice than research: once something is known, we don’t necessarily know whether it can be implemented or is useful to implement. That is the practical side.

Engineering needs to work with the discrete. That might mean answering the question of when to use ICL and when to fine-tune. There are pros and cons to both strategies. Oftentimes, solutions in a field present as barbells: they cluster around two central designs with a barren middle ground. This is due to the discrete nature of engineering. A great example would be a solution where you eke out marginal gains from ICL using a complex system instead of fine-tuning (you can also consider the reverse case). If the gains from your complex system are marginal, it is not worth the complexity. It would be a poorly placed solution in the middle of the ICL/fine-tuning barbell.

Some of the major areas of research in ML are limited by noise/variance/etc. How do you get a model to learn? How few samples do you need to represent your distribution? How do you better represent your distribution? These are all research questions and important ones! What is known from these may be implemented. I have found much of the gesturing at continual learning covers those areas of ML research. Research in this area shifts the equilibrium of engineering implementations of models.

Let’s consider a situation where a researcher creates an architecture in which the model can learn from single samples with abnormally low variance. This shifts the equilibrium. Now, once you saturate context, you can train the model on the context and remove any sub-system context management schemes you might have implemented. All of a sudden, something like an RLM is not needed. This would be magical. If I were a betting man, I would say we will improve in this regard, but RLMs or similar systems will still be used because they allow augmentation and rumination over data, which in turn allows better training.

Conclusion

Continual learning is in vogue. I expect it to simmer down beneath the surface of ML discourse. Most of the learning paradigm used for LLMs and LLM systems has to do with engineering optimization. Continual learning is a nice marketing term to throw on top of the idea of killing the context window. However, the linguistic restrictions it seems to carry are counterproductive.

What we are pointing at is the research and engineering tradeoffs in how we build LLM systems. It is early days for these designs, so weight updates and input engineering appear to be at odds; in fact, that is a false choice. Continual learning is pointing at the engineering equilibrium inherent in these systems. Those proposing continual learning are still pointing at age-old questions in ML research, and there is no real contention: shifts in that research will only shift engineering choices.

The real discussion is where we think the equilibrium will sit in the future. System-LLM people think it will be closer to small-batch weight updates. Sub-system-LLM people think it will be closer to an RLM implementation with longer and longer contexts. Here I have tried to explore my thinking and bridge the continuum rather than adhere to a single side. I am biased and will not pretend I was strict in my adherence; I am predominantly on the side of LLMs as a sub-system. This debate is interesting and only time will tell who is correct.

If I am mischaracterizing something then please correct me! I’m always happy to refine my thinking.