Qizheng Zhang 1∗, Changran Hu 2∗, Shubhangi Upasani 2, Boyuan Ma 2, Fenglu Hong 2, Vamsidhar Kamanuru 2, Jay Rainton 2, Chen Wu 2, Mengmeng Ji 2, Hanchen Li 3, Urmish Thakker 2, James Zou 1, Kunle Olukotun 1
1 Stanford University, 2 SambaNova Systems, Inc., 3 UC Berkeley
∗ Equal contribution
Contact: qizhengz@stanford.edu, changran.hu@sambanovasystems.com
Abstract
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence rather than updating weights. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights in favor of concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without labeled supervision by leveraging natural execution feedback instead. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
1 Introduction
Figure 1: Overall Performance Results. Our proposed framework, ACE, consistently outperforms strong baselines across agent and domain-specific reasoning tasks.
Modern AI applications based on large language models (LLMs), such as LLM agents [52, 49] and compound AI systems [55], increasingly depend on context adaptation. Instead of modifying model weights, context adaptation improves performance after model training by incorporating clarified instructions, structured reasoning steps, or domain-specific input formats directly into the model’s inputs. Contexts underpin many AI system components, including system prompts that guide downstream tasks [36, 4], memory that carries past facts and experiences [41, 48], and factual evidence that reduces hallucination and supplements knowledge [6].
Adapting through contexts rather than weights offers several key advantages. Contexts are interpretable and explainable for users and developers [47, 45], allow rapid integration of new knowledge at runtime [27, 7], and can be shared across models or modules in a compound system [23]. Meanwhile, advances in long-context LLMs [39] and context-efficient inference such as KV cache reuse [17, 51] are making context-based approaches increasingly practical for deployment. As a result, context adaptation is emerging as a central paradigm for building capable, scalable, and self-improving AI systems.
Despite this progress, existing approaches to context adaptation face two key limitations. First, a brevity bias: many prompt optimizers prioritize concise, broadly applicable instructions over comprehensive accumulation. For example, GEPA [4] highlights brevity as a strength, but such abstraction can omit domain-specific heuristics, tool-use guidelines, or common failure modes that matter in practice [16]. This objective aligns with validation metrics in some settings, but often fails to capture the detailed strategies required by agents and knowledge-intensive applications. Second, context collapse: methods that rely on monolithic rewriting by an LLM often degrade into shorter, less informative summaries over time, causing sharp performance declines (Figure 2). In domains such as interactive agents [43, 38, 57], domain-specific programming [53, 56], and financial or legal analysis [33, 18, 44], strong performance depends on retaining detailed, task-specific knowledge rather than compressing it away.
As applications such as agents and knowledge-intensive reasoning demand greater reliability, recent work has shifted toward saturating contexts with abundant, potentially useful information [22, 12, 11], enabled by advances in long-context LLMs [39, 34]. We argue that contexts should function not as concise summaries, but as comprehensive, evolving playbooks—detailed, inclusive, and rich with domain insights. Unlike humans, who often benefit from concise generalization, LLMs are more effective when provided with long, detailed contexts and can distill relevance autonomously [22, 31, 41]. Thus, instead of compressing away domain-specific heuristics and tactics, contexts should preserve them, allowing the model to decide what matters at inference time.
To address these limitations, we introduce ACE (Agentic Context Engineering), a framework for comprehensive context adaptation in both offline settings (e.g., system prompt optimization) and online settings (e.g., test-time memory adaptation). Rather than compressing contexts into distilled summaries, ACE treats them as evolving playbooks that accumulate and organize strategies over time. Building on the agentic architecture of Dynamic Cheatsheet [41], ACE incorporates a modular workflow of generation, reflection, and curation, while adding structured, incremental updates guided by a grow-and-refine principle. This design preserves detailed, domain-specific knowledge, prevents context collapse, and yields contexts that remain comprehensive and scalable throughout adaptation.
We evaluate ACE on two categories of LLM applications that most benefit from comprehensive, evolving contexts: (1) agents [43], which require multi-turn reasoning, tool use, and environment interaction, where accumulated strategies can be reused across episodes; and (2) domain-specific benchmarks, which demand specialized tactics and knowledge, where we focus on financial analysis [33, 44]. Our key findings are:
- ACE consistently outperforms strong baselines, yielding average gains of 10.6% on agents and 8.6% on domain-specific benchmarks, across both offline and online adaptation settings.
- ACE is able to construct effective contexts without labeled supervision, instead leveraging execution feedback and environment signals, which are key ingredients for self-improving LLMs and agents.
- On the AppWorld benchmark leaderboard [5], ACE matches the top-ranked production-level agent IBM-CUGA [35] (powered by GPT-4.1) on average and surpasses it on the harder test-challenge split, while using a smaller open-source model (DeepSeek-V3.1).
- ACE requires significantly fewer rollouts and lower dollar costs, and achieves 86.9% lower adaptation latency (on average) than existing adaptive methods, demonstrating that scalable self-improvement can be achieved with both higher accuracy and lower overhead.
2 Background and Motivation
2.1 Context Adaptation
Context adaptation (or context engineering) refers to methods that improve model behavior by constructing or modifying inputs to an LLM, rather than altering its weights. The current state of the art leverages natural language feedback [40, 54, 4]. In this paradigm, a language model inspects the current context along with signals such as execution traces, reasoning steps, or validation results, and generates natural language feedback on how the context should be revised. This feedback is then incorporated into the context, enabling iterative adaptation. Representative methods include Reflexion [40], which reflects on failures to improve agent planning; TextGrad [54], which optimizes prompts via gradient-like textual feedback; GEPA [4], which refines prompts iteratively based on execution traces and achieves strong performance, even surpassing reinforcement learning approaches in some settings; and Dynamic Cheatsheet [41], which constructs an external memory that accumulates strategies and lessons from past successes and failures during inference. These natural language feedback methods represent a major advance, offering flexible and interpretable signals for improving LLM systems beyond weight updates.
2.2 Limitations of Existing Context Adaptation Methods
The Brevity Bias.
A recurring limitation of context adaptation methods is brevity bias: the tendency of optimization to collapse toward short, generic prompts. Gao et al. [16] document this effect in prompt optimization for test generation, where iterative methods repeatedly produced near-identical instructions (e.g., "Create unit tests to ensure methods behave as expected"), sacrificing diversity and omitting domain-specific detail. This convergence not only narrows the search space but also propagates recurring errors across iterations, since optimized prompts often inherit the same faults as their seeds. More broadly, such bias undermines performance in domains that demand detailed, context-rich guidance—such as multi-step agents, program synthesis, or knowledge-intensive reasoning—where success hinges on accumulating rather than compressing task-specific insights.
Figure 2: Context Collapse. Monolithic rewriting of context by an LLM can collapse it into shorter, less informative summaries, leading to sharp performance drops.
Context Collapse.
In a case study on the AppWorld benchmark [43], we observe a phenomenon we call context collapse, which arises when an LLM is tasked with fully rewriting the accumulated context at each adaptation step. As the context grows large, the model tends to compress it into much shorter, less informative summaries, causing a dramatic loss of information. For instance, at step 60 the context contained 18,282 tokens and achieved an accuracy of 66.7, but at the very next step it collapsed to just 122 tokens, with accuracy dropping to 57.1—worse than the baseline accuracy of 63.7 without adaptation. While we highlight this through Dynamic Cheatsheet [41], the issue is not specific to that method; rather, it reflects a fundamental risk of end-to-end context rewriting with LLMs, where accumulated knowledge can be abruptly erased instead of preserved.
Figure 3: Example ACE-Generated Context on the AppWorld Benchmark (partially shown). ACE-generated contexts contain detailed, domain-specific insights along with tools and code that are readily usable, serving as a comprehensive playbook for LLM applications.
3 Agentic Context Engineering (ACE)
We present ACE (Agentic Context Engineering), a framework for scalable and efficient context adaptation in both offline (e.g., system prompt optimization) and online (e.g., test-time memory adaptation) scenarios. Instead of condensing knowledge into terse summaries or static instructions, ACE treats contexts as evolving playbooks that continuously accumulate, refine, and organize strategies over time. Building on the agentic design of Dynamic Cheatsheet [41], ACE introduces a structured division of labor across three roles (Figure 4): the Generator, which produces reasoning trajectories; the Reflector, which distills concrete insights from successes and errors; and the Curator, which integrates these insights into structured context updates. This mirrors how humans learn—experimenting, reflecting, and consolidating—while avoiding the bottleneck of overloading a single model with all responsibilities.
To address the limitations of prior methods discussed in §2.2—notably brevity bias and context collapse—ACE introduces three key innovations: (1) a dedicated Reflector that separates evaluation and insight extraction from curation, improving context quality and downstream performance (§4.5); (2) incremental delta updates (§3.1) that replace costly monolithic rewrites with localized edits, reducing both latency and compute cost (§4.6); and (3) a grow-and-refine mechanism (§3.2) that balances steady context expansion with redundancy control.
Figure 4: The ACE Framework. Inspired by Dynamic Cheatsheet, ACE adopts an agentic architecture with three specialized components: a Generator, a Reflector, and a Curator.
As shown in Figure 4, the workflow begins with the Generator producing reasoning trajectories for new queries, which surface both effective strategies and recurring pitfalls. The Reflector critiques these traces to extract lessons, optionally refining them across multiple iterations. The Curator then synthesizes these lessons into compact delta entries, which are merged deterministically into the existing context by lightweight, non-LLM logic. Because updates are itemized and localized, multiple deltas can be merged in parallel, enabling batched adaptation at scale. ACE further supports multi-epoch adaptation, where the same queries are revisited to progressively strengthen the context.
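The workflow above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the released implementation: the `generate`, `reflect`, and `curate` callables are hypothetical stand-ins for LLM calls, the context is simplified to a dictionary from bullet id to text, and the toy functions exist only so the sketch runs without a model.

```python
from typing import Callable, Dict, List

# Hypothetical simplification: a context maps bullet id -> bullet text.
Context = Dict[str, str]


def merge_delta(context: Context, delta: Context) -> None:
    """Deterministic, non-LLM merge of itemized delta entries into the context."""
    for bullet_id, text in delta.items():
        context[bullet_id] = text


def adapt(
    queries: List[str],
    context: Context,
    generate: Callable[[str, Context], str],
    reflect: Callable[[str], List[str]],
    curate: Callable[[List[str], Context], Context],
    epochs: int = 1,
) -> Context:
    """Run Generator -> Reflector -> Curator per query, for one or more epochs."""
    for _ in range(epochs):
        for query in queries:
            trajectory = generate(query, context)  # Generator: reasoning trajectory
            lessons = reflect(trajectory)          # Reflector: extracted lessons
            delta = curate(lessons, context)       # Curator: compact delta entries
            merge_delta(context, delta)
    return context


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_generate(query: str, context: Context) -> str:
    return f"tried {query} with {len(context)} bullets"


def toy_reflect(trajectory: str) -> List[str]:
    return [f"lesson from: {trajectory}"]


def toy_curate(lessons: List[str], context: Context) -> Context:
    start = len(context)
    return {f"b{start + i}": lesson for i, lesson in enumerate(lessons)}


playbook = adapt(["q1", "q2"], {}, toy_generate, toy_reflect, toy_curate, epochs=2)
```

Because the merge step is plain dictionary logic, the only LLM cost per query in this sketch is the three role calls; the accumulated context is never regenerated.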
3.1 Incremental Delta Updates
A core design principle of ACE is to represent context as a collection of structured, itemized bullets, rather than a single monolithic prompt. A bullet is similar to a memory entry in LLM memory frameworks such as Dynamic Cheatsheet [41] and A-MEM [48], but builds on it: each bullet consists of (1) metadata, including a unique identifier and counters tracking how often it was marked helpful or harmful; and (2) content, capturing a small unit such as a reusable strategy, domain concept, or common failure mode. When solving new problems, the Generator highlights which bullets were useful or misleading, providing feedback that guides the Reflector in proposing corrective updates.
This itemized design enables three key properties: (1) localization, so only the relevant bullets are updated; (2) fine-grained retrieval, so the Generator can focus on the most pertinent knowledge; and (3) incremental adaptation, allowing efficient merging, pruning, and de-duplication during inference.
Rather than regenerating contexts in full, ACE incrementally produces compact delta contexts: small sets of candidate bullets distilled by the Reflector and integrated by the Curator. This avoids the computational cost and latency of full rewrites, while ensuring that past knowledge is preserved and new insights are steadily appended. As contexts grow, this approach provides the scalability needed for long-horizon or domain-intensive applications.
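To make the bullet structure and delta merge concrete, the following is a minimal sketch assuming a dictionary-backed playbook. The class name `Bullet`, its field names, and `apply_delta` are hypothetical and chosen for illustration; the text above specifies only that a bullet carries an identifier, helpful/harmful counters, and content.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Bullet:
    bullet_id: str    # unique identifier (metadata)
    content: str      # a reusable strategy, domain concept, or failure mode
    helpful: int = 0  # times marked helpful
    harmful: int = 0  # times marked harmful


def apply_delta(playbook: Dict[str, Bullet], delta: List[Bullet]) -> None:
    """Merge a delta in place, without any LLM call.

    Bullets with new ids are appended. For ids that already exist, only the
    counters are incremented, so stored content is never rewritten or dropped.
    """
    for bullet in delta:
        existing = playbook.get(bullet.bullet_id)
        if existing is None:
            playbook[bullet.bullet_id] = bullet
        else:
            existing.helpful += bullet.helpful
            existing.harmful += bullet.harmful


# Illustrative usage with made-up bullet text.
playbook = {"s-1": Bullet("s-1", "Paginate API results until empty.", helpful=2)}
apply_delta(
    playbook,
    [
        Bullet("s-1", "", helpful=1),                         # feedback on an existing bullet
        Bullet("s-2", "Check the auth token before calls."),  # a new bullet
    ],
)
```

In this sketch a delta touches only the ids it names, which is the localization property described above: untouched bullets are carried over verbatim.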
3.2 Grow-and-Refine
Beyond incremental growth, ACE ensures that contexts remain compact and relevant through periodic or lazy refinement. In grow-and-refine, bullets with new identifiers are appended, while existing bullets are updated in place (e.g., incrementing counters). A de-duplication step then prunes redundancy by comparing bullets via semantic embeddings. This refinement can be performed proactively (after each delta) or lazily (only when the context window is exceeded), depending on application requirements for latency and accuracy.
Together, incremental updates and grow-and-refine maintain contexts that expand adaptively, remain interpretable, and avoid the potential variance introduced by monolithic context rewriting.
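The refinement step can be illustrated with a small de-duplication routine. This is a sketch under assumptions: the `embed` function is supplied by the caller (any sentence-embedding model could play that role), the cosine threshold of 0.9 is an arbitrary illustrative value, and keeping the first bullet of each near-duplicate group is one possible policy rather than the one the framework necessarily uses.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(
    bullets: Dict[str, str],
    embed: Callable[[str], Vector],
    threshold: float = 0.9,  # hypothetical similarity cutoff
) -> Dict[str, str]:
    """Keep the first bullet of every group of near-duplicates."""
    kept: Dict[str, str] = {}
    kept_vectors: List[Vector] = []
    for bullet_id, text in bullets.items():
        vector = embed(text)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept[bullet_id] = text
            kept_vectors.append(vector)
    return kept


def maybe_refine(
    bullets: Dict[str, str],
    embed: Callable[[str], Vector],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    lazy: bool = True,
) -> Dict[str, str]:
    """Lazy mode refines only when the token budget is exceeded; otherwise always."""
    total = sum(count_tokens(text) for text in bullets.values())
    if lazy and total <= max_tokens:
        return bullets
    return deduplicate(bullets, embed)


# Toy embeddings (assumption): a lookup table instead of a real embedding model.
toy_vectors = {"retry on timeout": [1.0, 0.0],
               "retry when a call times out": [0.99, 0.1],
               "validate XBRL tags": [0.0, 1.0]}
toy_bullets = {"b1": "retry on timeout",
               "b2": "retry when a call times out",
               "b3": "validate XBRL tags"}
refined = deduplicate(toy_bullets, toy_vectors.__getitem__)
```

The `lazy` flag mirrors the trade-off described above: proactive refinement pays the embedding cost after every delta, while lazy refinement defers it until the context no longer fits.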
4 Results
Our evaluation of ACE shows that:
- Enabling High-Performance, Self-Improving Agents. ACE enables agents to self-improve by dynamically refining their input context. It boosts accuracy on the AppWorld benchmark by up to 17.1% by learning to engineer better contexts from execution feedback alone, without needing ground-truth labels. This context-driven improvement allows a smaller, open-source model to match the performance of the top-ranked proprietary agent on the leaderboard. (§4.3)
- Large Gains on Domain-Specific Benchmarks. On complex financial reasoning benchmarks, ACE delivers an average performance gain of 8.6% over strong baselines by constructing comprehensive playbooks with domain-specific concepts and insights. (§4.4)
- Effective by Design. Ablation studies confirm our design choices are key to success, with components like the Reflector and multi-epoch refinement each contributing substantial performance gains. (§4.5)
- Lower Cost and Adaptation Latency. ACE achieves these gains efficiently, reducing adaptation latency by 86.9% on average, while requiring fewer rollouts and lower token costs in dollars. (§4.6)
4.1 Tasks and Datasets
We evaluate ACE on two categories of LLM applications that benefit most from a comprehensive and evolving context: (1) agent benchmarks, which require multi-turn reasoning, tool use, and environment interaction, where agents can accumulate and reuse strategies across episodes and environments; and (2) domain-specific benchmarks, which demand mastery of specialized concepts and tactics, where we focus on financial analysis as a case study.
- LLM Agent: AppWorld [43] is a suite of autonomous agent tasks involving API understanding, code generation, and environment interaction. It provides a realistic execution environment with common applications and APIs (e.g., email, file system) and tasks of two difficulty levels (normal and challenge). A public leaderboard [5] tracks performance, where, at the time of submission, the best system achieved only 60.3% average accuracy, highlighting the benchmark’s difficulty and realism.
- Financial Analysis: FiNER [33] and Formula [44] test LLMs on financial reasoning tasks that rely on the eXtensible Business Reporting Language (XBRL). FiNER requires labeling tokens in XBRL financial documents with one of 139 fine-grained entity types, a key step for financial information extraction in regulated domains. Formula focuses on extracting values from structured XBRL filings and performing computations to answer financial queries, i.e., numerical reasoning.
Evaluation Metrics.
For AppWorld, we follow the official benchmark protocol and report Task Goal Completion (TGC) and Scenario Goal Completion (SGC) on both the test-normal and test-challenge splits. For FiNER and Formula, we follow the original setup and report accuracy, measured as the proportion of predicted answers that exactly match the ground truth.
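As a rough illustration of these metrics, the sketch below computes exact-match accuracy and a TGC/SGC pair. It rests on an assumption about the AppWorld protocol that this section does not spell out: TGC is taken to be the percentage of tasks that pass, and SGC the percentage of scenarios in which every task passes. The function names and the `(scenario_id, passed)` input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Proportion of predictions that exactly equal the ground truth."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("no samples to score")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


def goal_completion(results: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (TGC, SGC) in percent from per-task (scenario_id, passed) records.

    Assumed definitions: TGC counts passing tasks; SGC counts scenarios whose
    tasks all pass.
    """
    if not results:
        raise ValueError("no tasks to score")
    by_scenario: Dict[str, List[bool]] = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)
    tgc = 100.0 * sum(passed for _, passed in results) / len(results)
    sgc = 100.0 * sum(all(v) for v in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Made-up records: scenario s1 has one failing task, scenario s2 passes fully.
toy_results = [("s1", True), ("s1", True), ("s1", False),
               ("s2", True), ("s2", True), ("s2", True)]
tgc, sgc = goal_completion(toy_results)
```

Under these assumed definitions SGC can never exceed TGC by construction of the all-tasks requirement within a scenario when scenarios are equally sized, which is consistent with SGC being the lower number in every row of Table 1.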
All datasets follow the original train/validation/test splits. For offline context adaptation, methods are optimized on the training split and evaluated on the test split with pass@1 accuracy. For online context adaptation, methods are evaluated sequentially on the test split: for each sample, the model first predicts with the current context, then updates its context based on that sample. The same shuffled test split is used across all methods.
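The online protocol described above (predict with the current context, then update on the same sample) can be sketched as follows. The `predict` and `update` callables and the seed-based `shuffled` helper are hypothetical; `update` receives the gold label so that both labeled and label-free variants fit the same signature, and a label-free method would simply ignore it.

```python
import random
from typing import Any, Callable, List, Tuple

Sample = Tuple[Any, Any]  # (input, ground-truth answer)


def shuffled(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Return one fixed shuffled order so every method sees the same sequence."""
    out = list(samples)
    random.Random(seed).shuffle(out)
    return out


def online_eval(
    samples: List[Sample],
    predict: Callable[[Any, Any], Any],
    update: Callable[[Any, Any, Any, Any], Any],
    context: Any,
) -> Tuple[float, Any]:
    """Predict-then-update evaluation; returns (accuracy, final context)."""
    correct = 0
    for x, y in samples:
        y_hat = predict(x, context)             # uses context from earlier samples only
        correct += int(y_hat == y)
        context = update(context, x, y_hat, y)  # adapt after the prediction is scored
    accuracy = correct / len(samples) if samples else 0.0
    return accuracy, context


# Toy stand-ins: the "context" is a lookup table that memorizes seen answers.
toy_samples = [("a", 1), ("b", 2), ("a", 1)]
toy_accuracy, toy_context = online_eval(
    toy_samples,
    predict=lambda x, ctx: ctx.get(x),
    update=lambda ctx, x, y_hat, y: {**ctx, x: y},
    context={},
)
```

Scoring each prediction before the update is what keeps the evaluation honest: a sample never benefits from context derived from itself.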
4.2 Baselines and Methods
Base LLM.
The base model is evaluated directly on each benchmark without any context engineering, using the default prompts provided by dataset authors. For AppWorld, we follow the official ReAct [52] implementation released by the benchmark authors, and build all other baselines and methods on top of this framework.
In-Context Learning (ICL) [3].
ICL provides the model with task demonstrations in the input prompt (few-shot or many-shot). This allows the model to infer the task format and desired output without weight updates. We supply all training samples when they fit within the model’s context window; otherwise, we fill the window with as many demonstrations as possible.
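The window-filling rule can be sketched as greedy packing under a token budget. This is an illustration only: the selection order is not specified in the text, so the sketch takes demonstrations in their given order and stops at the first one that does not fit, and the whitespace-based `count_tokens` is a stand-in for a real tokenizer.

```python
from typing import Callable, List


def pack_demonstrations(
    demos: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Add demonstrations in order until the next one would exceed the budget."""
    chosen: List[str] = []
    used = 0
    for demo in demos:
        cost = count_tokens(demo)
        if used + cost > budget_tokens:
            break
        chosen.append(demo)
        used += cost
    return chosen


def whitespace_tokens(text: str) -> int:
    """Crude token count (assumption); a real setup would use the model tokenizer."""
    return len(text.split())


toy_demos = ["a b c", "d e", "f g h i", "j"]
packed = pack_demonstrations(toy_demos, budget_tokens=6, count_tokens=whitespace_tokens)
```

When the whole training set fits, the function returns every demonstration, which corresponds to the many-shot case described above.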
MIPROv2 [36].
MIPROv2 is a popular prompt optimizer for LLM applications that jointly optimizes system instructions and in-context demonstrations via Bayesian optimization. We use the official DSPy implementation [15], setting auto="heavy" to maximize optimization performance.
GEPA [4].
GEPA (Genetic-Pareto) is a sample-efficient prompt optimizer based on reflective prompt evolution. It collects execution traces (reasoning, tool calls, intermediate outputs) and applies natural-language reflection to diagnose errors, assign credit, and propose prompt updates. A genetic Pareto search maintains a frontier of high-performing prompts, mitigating local optima. Empirically, GEPA outperforms reinforcement learning methods such as GRPO and prompt optimizers like MIPROv2, achieving up to 10–20% higher accuracy with as much as 35× fewer rollouts. We use the official DSPy implementation [14], setting auto="heavy" to maximize optimization performance.
Dynamic Cheatsheet (DC) [41].
DC is a test-time learning approach that introduces an adaptive external memory of reusable strategies and code snippets. By continuously updating this memory with newly encountered inputs and outputs, DC enables models to accumulate knowledge and reuse it across tasks, often leading to substantial improvements over static prompting methods. A key advantage of DC is that it does not require ground-truth labels: the model can curate its own memory from its generations, making the method highly flexible and broadly applicable. We use the official implementation released by the authors [42] and set it to use the cumulative mode (DC-CU).
ACE (ours).
ACE optimizes LLM contexts for both offline and online adaptation through an agentic context engineering framework. To ensure fairness, we use the same LLM for the Generator, Reflector, and Curator (non-thinking mode of DeepSeek-V3.1 [13]), preventing knowledge transfer from a stronger Reflector or Curator to a weaker Generator. This isolates the benefit of context construction itself. We adopt a batch size of 1 (constructing a delta context from each sample). We set the maximum number of Reflector refinement rounds and the maximum number of epochs in offline adaptation to 5.
| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| DeepSeek-V3.1 as Base LLM | | | | | | |
| ReAct | | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | | | | | | |
| ReAct + ICL | ✓ | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6) |
| ReAct + GEPA | ✓ | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0) |
| ReAct + ACE | ✓ | **76.2 (+12.5)** | **64.3 (+21.4)** | **57.3 (+15.8)** | **39.6 (+18.0)** | **59.4 (+17.0)** |
| ReAct + ACE | ✗ | 75.0 (+11.3) | **64.3 (+21.4)** | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8) |
| Online Adaptation | | | | | | |
| ReAct + DC (CU) | ✗ | 65.5 (+1.8) | **58.9 (+16.0)** | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5) |
| ReAct + ACE | ✗ | **69.6 (+5.9)** | 53.6 (+10.7) | **66.0 (+24.5)** | **48.9 (+27.3)** | **59.5 (+17.1)** |
Table 1: Results on the AppWorld Agent Benchmark. "GT Labels" indicates whether ground-truth labels are available to the Reflector during adaptation. We evaluate the ACE framework against multiple baselines on top of the official ReAct implementation, both for offline and online context adaptation. ReAct + ACE outperforms selected baselines by an average of 10.6% and achieves strong performance even without access to GT labels.
4.3 Results on Agent Benchmark
Analysis.
As shown in Table 1, ACE consistently improves over strong baselines on the AppWorld benchmark. In the offline setting, ReAct + ACE outperforms both ReAct + ICL and ReAct + GEPA by significant margins (12.3% and 11.9%, respectively), demonstrating that structured, evolving, and detailed contexts enable more effective agent learning than fixed demonstrations or single optimized instruction prompts. These gains extend to the online setting, where ACE continues to outperform prior adaptive methods such as Dynamic Cheatsheet by an average of 7.6%.
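The margins quoted above can be reproduced from the Average column of Table 1 if the two offline ACE rows (with and without GT labels) are averaged before comparison. This reading is an inference from the reported numbers, not a procedure stated in the text; under it, the snippet yields 12.3, 11.9, and 7.6, whose mean is the 10.6 reported as the overall agent gain.

```python
# Values copied from the Average column of Table 1.
average = {
    "ICL": 46.0,
    "GEPA": 46.4,
    "ACE offline (GT)": 59.4,
    "ACE offline (no GT)": 57.2,
    "DC online": 51.9,
    "ACE online": 59.5,
}

# Assumption: the offline ACE figure is the mean of its two rows.
ace_offline = (average["ACE offline (GT)"] + average["ACE offline (no GT)"]) / 2

gain_over_icl = round(ace_offline - average["ICL"], 1)
gain_over_gepa = round(ace_offline - average["GEPA"], 1)
gain_over_dc = round(average["ACE online"] - average["DC online"], 1)

# Mean of the three margins, matching the headline agent gain.
overall_gain = round((gain_over_icl + gain_over_gepa + gain_over_dc) / 3, 1)
```

All differences here are in absolute accuracy points on the benchmark's 0 to 100 scale.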
In the agent use case, ACE remains effective even without access to ground-truth labels during adaptation: ReAct + ACE achieves an average improvement of 14.8% over the ReAct baseline in this setting. This robustness arises because ACE leverages signals naturally available during execution (e.g., code execution success or failure) to guide the Reflector and Curator in forming structured lessons of successes and failures. Together, these results establish ACE as a strong and versatile framework for building self-improving agents that adapt reliably both with and without labeled supervision.
Notably, on the latest AppWorld leaderboard (as of September 20, 2025; Figure 5), ReAct + ACE matches the top-ranked IBM CUGA on average (59.4% vs. 60.3%). IBM CUGA is a production-level GPT-4.1–based agent [35], whereas ReAct + ACE uses the smaller open-source model DeepSeek-V3.1. With online adaptation, ReAct + ACE even surpasses IBM CUGA by 8.4% in TGC and 0.7% in SGC on the harder test-challenge split, underscoring the effectiveness of ACE in building comprehensive and self-evolving contexts for agents.
4.4 Results on Domain-Specific Benchmark
Table 2: Results on Financial Analysis Benchmark. "GT labels" indicates whether ground-truth labels are available to the Reflector during adaptation. With GT labels, ACE outperforms selected baselines by an average of 8.6%, highlighting the advantage of structured and evolving contexts for domain-specific reasoning. However, we also observe that in the absence of reliable feedback signals (e.g., ground-truth labels or execution outcomes), both ACE and other adaptive methods such as Dynamic Cheatsheet may degrade, suggesting that context adaptation depends critically on feedback quality.
Analysis.
As shown in Table 2, ACE delivers strong improvements on financial analysis benchmarks. In the offline setting, when provided with ground-truth answers from the training split, ACE surpasses ICL, MIPROv2, and GEPA by clear margins (an average of 10.9%), showing that structured and evolving contexts are particularly effective when tasks require precise domain knowledge (e.g., financial concepts, XBRL rules) that goes beyond fixed demonstrations or monolithic optimized prompts. In the online setting, ACE continues to exceed prior adaptive methods such as DC by an average of 6.2%, further confirming the benefit of agentic context engineering for accumulating reusable insights across specialized domains.
We also observe that when ground-truth supervision or reliable execution signals are absent, both ACE and DC may degrade in performance. In such cases, the constructed context can be polluted by spurious or misleading signals, highlighting a potential limitation of inference-time adaptation without reliable feedback. This suggests that while ACE is robust under rich feedback (e.g., code execution results or formula correctness in agent tasks), its effectiveness depends on the availability of signals that allow the Reflector and Curator to make sound judgments. We return to this limitation in Appendix B.
4.5 Ablation Study
Table 3 reports ablation studies on the AppWorld benchmark, analyzing how individual design choices of ACE contribute to effective context adaptation. We examine three factors: (1) the Reflector with iterative refinement, our addition to the agentic framework beyond Dynamic Cheatsheet; (2) multi-epoch adaptation, which refines contexts over the training samples multiple times; and (3) offline warmup, which initializes the context through offline adaptation before online adaptation begins.
| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| DeepSeek-V3.1 as Base LLM | | | | | | |
| ReAct | | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | | | | | | |
| ReAct + ACE w/o Reflector or multi-epoch | ✓ | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+17.5) | 55.1 (+12.7) |
| ReAct + ACE w/o multi-epoch | ✓ | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4) |
| ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| Online Adaptation | | | | | | |
| ReAct + ACE | ✗ | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7) |
| ReAct + ACE + offline warmup | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
Table 3: Ablation Studies on AppWorld. We study how particular design choices of ACE (iterative refinement, multi-epoch adaptation, and offline warmup) contribute to high-quality context adaptation. Values in parentheses are gains over the ReAct baseline.
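The Average column of Table 3 is consistent with an unweighted mean of the four reported metrics. The sketch below recomputes it from the rounded scores in the table. The dictionary keys and function names are illustrative labels, not names from the paper, and differences of up to 0.05 are expected, presumably because the published averages were computed from unrounded scores.

```python
# Rounded scores from Table 3, in the order:
# (Test-Normal TGC, Test-Normal SGC, Test-Challenge TGC, Test-Challenge SGC).
TABLE3 = {
    "ReAct": (63.7, 42.9, 41.5, 21.6),
    "ACE w/o Reflector or multi-epoch (offline)": (70.8, 55.4, 55.9, 38.1),
    "ACE w/o multi-epoch (offline)": (72.0, 60.7, 54.9, 39.6),
    "ACE (offline)": (76.2, 64.3, 57.3, 39.6),
    "ACE (online)": (67.9, 51.8, 61.4, 43.2),
    "ACE + offline warmup (online)": (69.6, 53.6, 66.0, 48.9),
}

# Average column exactly as printed in Table 3.
REPORTED_AVERAGE = {
    "ReAct": 42.4,
    "ACE w/o Reflector or multi-epoch (offline)": 55.1,
    "ACE w/o multi-epoch (offline)": 56.8,
    "ACE (offline)": 59.4,
    "ACE (online)": 56.1,
    "ACE + offline warmup (online)": 59.5,
}


def mean(scores):
    """Unweighted mean of the four metrics."""
    return sum(scores) / len(scores)


def average_gain(method, baseline="ReAct"):
    """Gain of the reported average over the baseline, in percentage points."""
    return round(REPORTED_AVERAGE[method] - REPORTED_AVERAGE[baseline], 1)


for name, scores in TABLE3.items():
    print(f"{name}: recomputed {mean(scores):.2f}, reported {REPORTED_AVERAGE[name]}")
```

Read this way, the Reflector and multi-epoch adaptation each add a few points to the offline average, and offline warmup does the same for the online average.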
4.6 Cost and Speed Analysis
(a) Offline (AppWorld).
(b) Online (FiNER).
Table 4: Cost and Speed Analysis. We measure the context adaptation latency, number of rollouts, and dollar costs of ACE against GEPA (offline) and DC (online).
Because it supports incremental “delta” context updates and non-LLM-based context merging and de-duplication, ACE reduces both the cost of adaptation (measured in the number of rollouts or the dollar cost of token ingestion and generation) and its latency.
For example, in offline adaptation on AppWorld, ACE achieves an 82.3% reduction in adaptation latency and a 75.1% reduction in the number of rollouts compared to GEPA (Table 4(a)). In online adaptation on FiNER, ACE achieves a 91.5% reduction in adaptation latency and an 83.6% reduction in the dollar cost of token ingestion and generation compared to DC (Table 4(b)).
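To illustrate why delta updates with non-LLM merging are cheap, the following minimal sketch treats the context as a list of short entries and a delta as a small list of candidate entries. The `Bullet` class, the `merge_delta` function, and the exact-match rule on normalized text are hypothetical simplifications for illustration and should not be read as ACE's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Bullet:
    bullet_id: int
    text: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different duplicates match."""
    return " ".join(text.lower().split())


def merge_delta(context: list[Bullet], delta: list[str]) -> list[Bullet]:
    """Append unseen delta entries; existing bullets are kept verbatim.

    No LLM call is involved: the merge is a set lookup per delta entry,
    so its cost grows with the size of the delta, not with a full rewrite
    of the context. (Assumed de-duplication rule: normalized exact match.)
    """
    seen = {_normalize(b.text) for b in context}
    merged = list(context)  # copy, so the caller's list is not mutated
    next_id = max((b.bullet_id for b in context), default=0) + 1
    for text in delta:
        key = _normalize(text)
        if key in seen:
            continue  # duplicate of an existing or earlier delta entry
        seen.add(key)
        merged.append(Bullet(next_id, text.strip()))
        next_id += 1
    return merged
```

Because existing entries are never regenerated, earlier details cannot be lost in a merge, in contrast to approaches that rewrite the whole context at every step.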
5 Discussion
Longer Context ≠ Higher Serving Cost.
Although ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage. Modern serving infrastructures are increasingly optimized for long-context workloads through techniques such as KV cache reuse [17, 51], compression [32, 30], and offloading [25]. These mechanisms allow frequently reused context segments to be cached locally or remotely, avoiding repetitive and expensive prefill operations. Ongoing advances in ML systems suggest that the amortized cost of handling long contexts will continue to decrease, making context-rich approaches like ACE increasingly practical in deployment.
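A toy model of KV cache reuse shows why a long shared context need not incur proportional prefill work. The sketch below counts how many tokens need fresh prefill when the serving layer can reuse the state of the longest previously seen prefix. `PrefixCache` and its token-count accounting are hypothetical, and the model ignores eviction, memory limits, and block granularity that real systems must handle.

```python
from __future__ import annotations


class PrefixCache:
    """Toy model: remember past token sequences and reuse the longest shared prefix."""

    def __init__(self) -> None:
        self._seen: list[tuple[str, ...]] = []

    def prefill_cost(self, tokens: list[str]) -> int:
        """Return the number of tokens needing fresh prefill, then record the request."""
        best = 0
        for cached in self._seen:
            shared = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                shared += 1
            best = max(best, shared)
        self._seen.append(tuple(tokens))
        return len(tokens) - best


# A 1000-token playbook shared by two requests with different 3-token queries.
playbook = [f"ctx{i}" for i in range(1000)]
cache = PrefixCache()
first = cache.prefill_cost(playbook + ["book", "a", "flight"])
second = cache.prefill_cost(playbook + ["pay", "the", "bill"])
print(first, second)
```

In this toy setting the first request prefills all 1003 tokens, while the second prefills only its 3 query tokens, because the playbook prefix is already cached.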
Implications for Online and Continuous Learning.
Online and continuous learning are key research directions in machine learning for addressing issues like distribution shifts [24, 19] and limited training data [37, 21, 60]. ACE offers a flexible and efficient alternative to conventional model fine-tuning, as adapting contexts is generally cheaper than updating model weights [9, 26, 28, 20]. Moreover, because contexts are human-interpretable, ACE enables selective unlearning [10, 8, 29]—whether due to privacy or legal constraints [1, 2], or when outdated or incorrect information is identified by domain experts. These are promising directions for future work, where ACE could play a central role in advancing continuous and responsible learning.
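Because a context is a collection of readable entries, selective unlearning can in principle be as simple as deleting the entries that match a removal criterion. The sketch below assumes each entry is a dictionary carrying a `source` tag. Both that schema and the `unlearn` helper are hypothetical and are not described in the paper.

```python
def unlearn(context, should_forget):
    """Split context entries into (kept, removed) without mutating the input.

    `should_forget` is any predicate over an entry, e.g. one that matches
    entries derived from data subject to a deletion request.
    """
    kept, removed = [], []
    for entry in context:
        (removed if should_forget(entry) else kept).append(entry)
    return kept, removed


# Hypothetical entries; the "source" field is an assumed provenance tag.
context = [
    {"text": "Verify currency units before comparing values.", "source": "filing-A"},
    {"text": "Customer X prefers quarterly reports.", "source": "customer-X"},
]
kept, removed = unlearn(context, lambda e: e["source"] == "customer-X")
print(len(kept), len(removed))
```

Returning the removed entries alongside the kept ones gives a domain expert an audit trail of what was forgotten.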
References
- [1] General Data Protection Regulation article 17: Right to erasure. EU Regulation 2016/679, 2016. Official consolidated text.
- [2] California consumer privacy act, civil code §1798.105: Right to delete. State of California Civil Code, 2018.
- [3] Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade