Stanford, SambaNova Systems, and Berkeley have joined forces to create a new framework called Agentic Context Engineering (ACE). ACE allows models to self-evolve through reflection and iteration, much like humans. Crucially, this process does not involve modifying model weights, leading to a massive reduction in operational costs.
The performance numbers, detailed below, show why this development is genuinely groundbreaking.
The Fine-Tuning Path to Enhanced Model Performance
After a large model is trained, the traditional method for making it perform better in specific domains is fine-tuning.
This involves retraining a portion of the model's parameters on a dataset specific to that domain. While effective, the approach has drawbacks as prominent as its benefits.
Fine-tuning requires significant computational resources and long iteration cycles. This pace is often too slow for businesses that need to respond quickly to market changes.
It also acts like a black box: once the parameters are adjusted, it is difficult to explain why the model improved or why it failed in certain areas. This lack of interpretability is a serious liability in high-stakes fields like finance and medicine.
Fine-tuned models also suffer from "catastrophic forgetting," meaning they learn new knowledge but forget their original skills.
Therefore, experts have been searching for alternatives.
"Context adaptation" technology emerged: instead of tweaking the hundreds of billions of model parameters, we focus on manipulating the model’s input—the "context."
When briefing a person on a task, it helps to clearly lay out the requirements, background information, and necessary precautions, often over multiple rounds of communication.
The context serves a similar purpose; it can be a system prompt, successful examples (evidence), or experience summarized from previous model errors (memory store).
Its advantages are obvious: the content is legible, modifiable, debuggable, and shareable across different models. Combined with the exponentially increasing context windows of modern LLMs, which can now hold hundreds of thousands or even millions of tokens, and long-text inference acceleration techniques like KV cache reuse, context adaptation has become the new favorite approach.
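To make this concrete, here is a minimal sketch of what an adaptive context could look like in code. Everything here is illustrative: the function name, the section headers, and the sample content are assumptions for the sketch, not part of the paper.

```python
# Context adaptation in miniature: improve behavior by editing the model's
# input, not its weights. All names and strings below are illustrative.

def build_context(system_prompt: str, examples: list, memory: list) -> str:
    """Assemble an adaptive context from three editable parts: a system
    prompt, successful examples (evidence), and lessons from past errors."""
    parts = [system_prompt]
    if examples:
        parts.append("## Worked examples\n" + "\n".join(examples))
    if memory:
        parts.append("## Lessons from previous runs\n" + "\n".join(memory))
    return "\n\n".join(parts)

# Every part is plain text: legible, diffable, and portable across models.
context = build_context(
    system_prompt="You are a financial-document assistant.",
    examples=["Q: Tag 'Q3 revenue'. A: us-gaap:Revenues"],
    memory=["Negative figures may appear in parentheses; normalize first."],
)
```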
Two Flaws in Context Methods
However, the path to success is rarely smooth.
Previous context adaptation methods, though heading in the right direction, often fell into two common pitfalls.
The first is "brevity bias." Many methods designed to automatically optimize context tend to favor instructions that are as short and generic as possible. For instance, a framework called GEPA considered conciseness an advantage.
This works for simple tasks, but it fails in complex scenarios requiring extensive domain knowledge and detailed operations, such as when an AI agent needs to call various tools to complete a multi-step task. In these cases, the "less is more" philosophy is inadequate.
The second is "context collapse." This issue occurs when an LLM is tasked with iterating and rewriting the entire context itself. You want it to summarize experience and improve, but with each summary, information is lost, like a photocopy becoming blurrier with every generation. After several iterations, the model’s performance sharply declines.
In scenarios requiring high reliability and detailed instruction, we need the accumulation and enrichment of knowledge, not endless compression.
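To see why collapse happens, consider a deliberately naive adaptation loop of the kind ACE is built to avoid. The `llm` callable is a stand-in for any text-completion API; this sketch illustrates the failure mode and is not anyone's published code.

```python
# The monolithic-rewrite pattern behind context collapse. `llm` is a
# placeholder for any text-completion call (an assumption of this sketch).

def naive_adapt(llm, context: str, new_experience: str) -> str:
    prompt = (
        "Rewrite the following playbook to incorporate the new experience. "
        "Keep it concise.\n\n"
        f"PLAYBOOK:\n{context}\n\nNEW EXPERIENCE:\n{new_experience}"
    )
    # The output replaces the old context wholesale. Each full rewrite is
    # lossy, so rare but important details are the first to disappear.
    return llm(prompt)
```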
The ACE Framework: Bringing Context to Life
To address these two major pitfalls, the joint team from Stanford, SambaNova, and Berkeley introduced the ACE framework, offering a novel solution.
The core idea of ACE is to transform the context from a static "instruction manual" into a dynamically evolving "playbook." Instead of rewriting the entire playbook each time, it uses incremental updates, continuously adding new lessons learned and experiences.
This process is cleverly designed as a pipeline of three collaborative roles, all played by the same base LLM (the experiments used DeepSeek-V3.1 in its non-reasoning mode). This ensures that performance gains stem purely from context optimization, not from differences in underlying model capability.
These three roles are:
Generator: Its task is execution. Like a frontline operator, it performs specific tasks, such as calling tools or carrying out reasoning, and generates a complete record of its operations, including both successes and failures.
Reflector: This is the hindsight analyzer. It examines the operational records left by the Generator, extracting specific, actionable lessons. For example, "When handling file type A, tool B always errors; tool C should be used instead," or "When encountering situation X, executing step Y directly is more efficient than initial inquiry." It converts these fragmented insights into structured text.
Curator: This is the chief editor of the playbook. It receives the insights refined by the Reflector, converts them into standard "delta items," and then merges them into the existing playbook using a deterministic process. This merging includes deduplication, pruning, and organization, ensuring the playbook becomes richer and more targeted while remaining clear and manageable.
This "Generate-Reflect-Curate" loop resembles a top-tier sports team.
The Generator is the player on the field, playing the game, with all successes and mistakes recorded on video. The Reflector is the coaching staff reviewing the tape post-game, analyzing frame by frame to identify issues and summarize tactical points. The Curator is the assistant coach responsible for updating the tactics board, clearly and accurately drawing the new strategies for the next match.
By employing this incremental "Grow-and-Refine" principle, ACE completely avoids context collapse. Knowledge is only accumulated and optimized, never forgotten or simplified. Furthermore, the entire process is unsupervised, requiring no human-labeled data, relying only on task execution feedback (e.g., success or failure signals) to self-drive improvement.
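Put together, the loop could be wired up roughly as follows. The role split mirrors the paper's description, but every concrete detail (the class names, the `llm` callable, the line-based delta format) is a hypothetical reading, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Delta:
    """One incremental 'delta item': a structured lesson to be merged in."""
    key: str     # short identifier used for deduplication
    lesson: str  # the actionable insight extracted by the Reflector

@dataclass
class Playbook:
    items: dict = field(default_factory=dict)

    def render(self) -> str:
        return "\n".join(f"- {v}" for v in self.items.values())

def generate(llm, playbook: Playbook, task: str) -> str:
    """Generator: attempt the task with the current playbook; return the trace."""
    return llm(f"PLAYBOOK:\n{playbook.render()}\n\nTASK:\n{task}")

def reflect(llm, trace: str, feedback: str) -> list:
    """Reflector: mine the trace and execution feedback for concrete lessons."""
    raw = llm("Extract actionable lessons, one per line.\n"
              f"TRACE:\n{trace}\nFEEDBACK:\n{feedback}")
    return [Delta(key=line[:40].lower(), lesson=line)
            for line in raw.splitlines() if line.strip()]

def curate(playbook: Playbook, deltas: list) -> Playbook:
    """Curator: deterministic, non-LLM merge; dedupe by key, then append."""
    for d in deltas:
        playbook.items.setdefault(d.key, d.lesson)
    return playbook

def ace_step(llm, playbook: Playbook, task: str, run_and_score) -> Playbook:
    """One Generate-Reflect-Curate cycle, driven only by execution feedback."""
    trace = generate(llm, playbook, task)
    feedback = run_and_score(trace)  # e.g. a success/failure signal; no labels
    return curate(playbook, reflect(llm, trace, feedback))
```

The key property sits in `curate`: the playbook only ever gains or refines entries, so nothing is silently compressed away.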
ACE Test Performance
The ACE framework was rigorously tested on two types of tasks: AI agents and domain-specific benchmarks.
The AppWorld AI agent task is a benchmark specifically designed to evaluate an AI agent's ability to complete daily tasks in a simulated mobile application environment. These tasks are complex, requiring the model to understand instructions, call APIs, and interact with the environment through multiple turns.
What were the results?
Compared to the selected baselines, average performance increased by 10.6%. Even without access to ground-truth (GT) labels, the reference answers normally used to judge correctness, the framework still achieved strong results.
Even more astonishingly, on the AppWorld public leaderboard on September 20, 2025, the ReAct+ACE score was 59.4%, almost matching IBM CUGA (60.3%), a commercial-grade agent based on the much stronger GPT-4.1 model, which was ranked first at the time. In the more difficult "Challenge" subset, ACE even surpassed CUGA. It is important to remember that ACE utilizes a smaller, open-source model.
Professional tasks in the financial domain included Financial Named Entity Recognition (FiNER) and XBRL formula numerical reasoning. These tasks demand precise domain knowledge and specialized strategies.
The results were equally impressive. ACE achieved an average performance boost of 8.6% over the baseline in these tasks. Even without human-labeled correct answers, relying solely on program execution feedback, ACE proved capable of effective self-optimization.
In terms of cost, ACE drastically outperformed its predecessors.
Compared to GEPA, another context optimization method, ACE reduced latency by 82.3% and API calls by 75.1% in offline adaptation tasks.
Compared to Dynamic Cheatsheet in online adaptation tasks, latency was reduced by 91.5% and token cost by 83.6%.
Why is it so economical? Because it avoids requiring the LLM to repeatedly rewrite the entire, ever-growing context. The Curator’s merging operation is deterministic and non-LLM based, resulting in minimal overhead.
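For illustration, such a merge can be as cheap as ordinary dictionary operations. The hash-based deduplication below is an assumption of this sketch; the paper specifies deduplication and pruning but not this exact mechanism.

```python
import hashlib

def merge(playbook: dict, new_lessons: list, max_items: int = 500) -> dict:
    """Deterministic, non-LLM merge: dedupe new lessons, then prune to a cap."""
    for lesson in new_lessons:
        # Content hash as a dedup key (an illustrative choice, not the paper's).
        key = hashlib.sha1(lesson.strip().lower().encode()).hexdigest()[:12]
        playbook.setdefault(key, lesson)
    while len(playbook) > max_items:
        # Prune oldest entries first (dicts keep insertion order in Python 3.7+).
        playbook.pop(next(iter(playbook)))
    return playbook
```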
Upon its release, the ACE framework immediately caused a significant stir in academia and industry.
ACE paves a new path for building low-cost, highly interpretable AI systems by achieving LLM self-improvement through context engineering.
From a commercial standpoint, ACE's long context and incremental update mechanism provide critical technical support for the rapid iteration and deployment of enterprise-grade AI applications.
As model performance approaches a bottleneck, Agentic Context Engineering raises the ceiling for agent capabilities by offering greater adaptability, higher operational efficiency, and stronger interpretability.
References:
https://arxiv.org/abs/2510.04618
https://X.com/omarsar0/status/1976746822204113072