Although my model of choice for most internal workflows remains GPT-4.1 for its predictable speed and high adherence to instructions, even its 1,047,576-token context window can run out of space. When that happens, your agent either needs to give up, or it needs to compact that large context window into a smaller one. Here are our notes on implementing compaction.
This is part of the Building an internal agent series.
Why compaction matters
Long-running workflows with many tool calls or user messages, along with any workflow dealing with large files, often run out of space in their context window. Although context window exhaustion isn't relevant in most cases you'll find for internal agents, ultimately it's not possible to implement a robust, reliable agent without solving this problem, and compaction is a straightforward solution.
How we implemented it
Initially, in the beautiful moment where we assumed compaction wouldn't be a relevant concern for our internal workflows, we implemented an extremely naive solution: if we ever ran out of tokens, we discarded older tool responses until we had more space, then continued. Because we rarely hit the limit, the fact that this worked poorly wasn't a major issue, but eventually the inelegance began to weigh on me as we started dealing with more workflows involving large files.
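For illustration, that first pass amounted to something like the sketch below. The message shape and the count_tokens helper are assumptions, not our actual code:

```python
def naive_compact(messages, max_tokens, count_tokens):
    """Drop the oldest tool responses until the conversation fits again."""
    while sum(count_tokens(m["content"]) for m in messages) > max_tokens:
        # Find the oldest remaining tool response and discard it.
        oldest = next((m for m in messages if m["role"] == "tool"), None)
        if oldest is None:
            break  # nothing left to drop; the caller has to give up
        messages.remove(oldest)
    return messages
```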
When we brainstormed our second iteration of compaction, I got anchored on this beautiful idea that compaction should be sequenced after implementing support for sub-agents, but ultimately that didn't prove necessary.
The gist of our approach to compaction is (there's a rough sketch in code after this list):
- After every user message (including tool responses), add a system message with the consumed and available tokens in the context window. In that system message, we also include the updated list of available “virtual files” that can be read from
- User messages, again including tool responses, greater than 10,000 tokens are exposed as a new “virtual file”, with only their first 1,000 tokens included in the context window. The agent must use file manipulation tools to read more than those first 1,000 tokens
- Add a set of “base tools” that are always available to agents, specifically including the virtual file manipulation tools, as we’d finally reached a point where most agents simply could not operate without a large number of mostly invisible internal tools
- If a message pushed us over 80% (configurable value) of the model’s available context window, use the compaction prompt that Reddit claims Claude Code uses. The prompt isn’t particularly special, it just already exists and seems pretty good
- After compacting, add the prior context window as a virtual file to allow the agent to retrieve pieces of context that it might have lost
- Add a new tool, file_regex, to allow the agent to perform regex searches against files, including the prior context window
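To make those pieces concrete, here is a minimal sketch of how they could fit together. Everything in it is illustrative rather than our actual implementation: the message shape, the count_tokens and llm helpers, and names like VirtualFileStore and maybe_compact are assumptions, and the constants are stand-ins for the configurable values described above.

```python
import re

# Illustrative defaults; in practice these are configurable values.
VIRTUAL_FILE_THRESHOLD = 10_000  # messages larger than this become virtual files
VIRTUAL_FILE_PREVIEW = 1_000     # how much stays inline in the context window
COMPACTION_THRESHOLD = 0.8       # fraction of the context window that triggers compaction


class VirtualFileStore:
    """Holds oversized message bodies and prior context windows outside the context window."""

    def __init__(self):
        self.files = {}

    def add(self, name, content):
        self.files[name] = content
        return name

    def read(self, name, start=0, length=VIRTUAL_FILE_PREVIEW):
        return self.files[name][start:start + length]

    def file_regex(self, name, pattern):
        """Base tool: regex search over a virtual file, e.g. a compacted-away context window."""
        return [match.group(0) for match in re.finditer(pattern, self.files[name])]


def ingest_message(message, files, count_tokens):
    """Expose oversized user messages / tool responses as virtual files, keeping a preview inline."""
    if count_tokens(message["content"]) > VIRTUAL_FILE_THRESHOLD:
        name = files.add(f"message-{len(files.files)}.txt", message["content"])
        # Character slice as a crude stand-in for "first 1,000 tokens".
        message["content"] = (
            message["content"][:VIRTUAL_FILE_PREVIEW]
            + f"\n[truncated; read virtual file {name} for the full contents]"
        )
    return message


def budget_message(messages, files, max_tokens, count_tokens):
    """System message appended after every user message and tool response."""
    used = sum(count_tokens(m["content"]) for m in messages)
    return {
        "role": "system",
        "content": (
            f"Context window: {used} of {max_tokens} tokens used.\n"
            f"Virtual files available: {', '.join(files.files) or 'none'}"
        ),
    }


def maybe_compact(messages, files, max_tokens, count_tokens, llm, compaction_prompt):
    """If usage crosses the threshold, summarize the window and stash the original as a virtual file."""
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACTION_THRESHOLD * max_tokens:
        return messages
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)
    name = files.add("prior-context-window.txt", transcript)
    summary = llm(compaction_prompt, transcript)
    return [{
        "role": "system",
        "content": f"Compacted history (full transcript stored as {name}):\n{summary}",
    }]
```

The design choice that matters here is that nothing is ever truly lost: oversized messages and the pre-compaction transcript stay reachable through the virtual file tools, so the agent can go back for details the summary dropped.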
Each of these steps is quite simple, but in combination they provide a fair amount of power for handling complex, prolonged workflows. Admittedly, we still have a configurable cap on the number of tool calls in a workflow (to avoid agents spinning out), but with compaction in place, agents dealing with large or complex data are much more likely to succeed usefully.
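The cap itself is nothing more than a counter in the agent loop. A sketch, with MAX_TOOL_CALLS standing in for our configurable value and the step shape invented for illustration:

```python
MAX_TOOL_CALLS = 50  # illustrative default; the real cap is configurable per workflow

def run_agent_loop(next_step, execute_tool):
    """Run the agent until it produces a final answer or exhausts its tool-call budget."""
    tool_calls = 0
    while True:
        step = next_step()
        if step["type"] == "final_answer":
            return step["content"]
        tool_calls += 1
        if tool_calls > MAX_TOOL_CALLS:
            return "Stopping: tool-call budget exhausted before the workflow completed."
        execute_tool(step)
```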
How is it working? / What’s next?
Whereas most of our new internal agent features have obvious problems or obvious next iterations, this one feels good enough to forget about for a long, long time. There are two reasons for this: first, most of our workflows don't require large context windows, and second, honestly, this seems to work quite well.
If context windows get significantly larger in the future, which I don't see much evidence of happening at the moment, then we will simply increase some of the default values to use more tokens, but the core algorithm here seems good enough.