Editor | Yun Zhao
OpenAI and Microsoft are promoting some flawed Agent concepts! OpenAI's Swarm is heading down a "wrong path"!
Last weekend, a post by Devin co-founder Walden Yan was truly astonishing, drawing significant industry attention and discussion.
Walden Yan opened his post on a modest note:
I've seen many people make the same mistakes when building agents, so we're sharing some principles we commonly use.
He then explicitly named OpenAI and Microsoft in his blog post, stating that their open-source libraries Swarm and AutoGen are actively promoting flawed agent building philosophies, and clearly pointed out that their recommended multi-agent architecture is incorrect!
Yan opened the article with sharp criticism:
"Multi-agent frameworks (currently on the market) perform far worse than expected. Based on our trial-and-error experience, I want to share some principles for building agents and explain why some seemingly appealing ideas turn out to be terrible in practice."
This article reveals the current state of agent building from Devin's perspective: multi-agents might seem cool, but more than two years on, aside from the most basic patterns, the industry is still fumbling forward.
When building agents, we are still in the "primitive HTML + CSS" era! A true production-grade environment is a completely different story!
Yan explained the reasoning in his blog post, pointing out that current large-model agents do not yet have stable long-context collaborative conversation capabilities, and therefore cannot support parallel work between main agents and sub-agents. He emphasized two core principles: "shared context" and "behavior implies decision."
Furthermore, Yan provided several strong pieces of evidence, such as: Claude Code is an agent with sub-task capabilities, but it never runs main agents and sub-agents in parallel; sub-agents are usually only used to "answer questions" and do not involve writing code.
To be clear, this is definitely not about "provoking" OpenAI or Microsoft; rather, it offers genuinely valuable insights for everyone.
Without further ado, here is the original text for you.
Background Explanation
Devin is arguably the earliest AI programming agent to go viral since the birth of ChatGPT. Recently, the Devin team found that the multi-agent frameworks on the market actually perform far below expectations.
Many developers are naturally attracted to the Multi-Agent architecture, which aims to improve efficiency by breaking down complex tasks for multiple parallel sub-agents. However, this seemingly efficient architecture is actually very fragile and prone to systemic failures due to insufficient context sharing and conflicting decisions.
Yan stated: "Based on our trial-and-error experience, I want to share some principles for building agents and explain why some seemingly appealing ideas turn out to be terrible in practice."
OpenAI and Microsoft are promoting flawed concepts
Current agents are still in the "HTML + CSS" patchwork era
"Multi-agent frameworks perform far worse than expected. Based on our trial-and-error experience, I want to share some principles for building agents and explain why some seemingly appealing ideas turn out to be terrible in practice."
In this article, we will gradually derive the following two core principles:
Shared Context
Behavior Implies Decision
Why focus on "principles"?
HTML was released in 1993. In 2013, Facebook launched React. By 2025, React and its descendants almost dominate how websites and apps are developed. Why? Because React is not just a coding scaffold; it's a philosophy. Once you adopt React, you accept a reactive, modular building paradigm. This was not obvious to early web developers.
Today, in the field of building AI agents based on large models, we are still in the "primitive HTML + CSS" era—still exploring how to piece together various components to create a good experience. So far, aside from the most basic patterns, no agent building approach has become a true industry standard.
Worse, libraries like OpenAI's Swarm or Microsoft's AutoGen are actually promoting an architectural approach that we believe is wrong: multi-agent architecture. I will explain why this is a dead end.
Of course, if you are a beginner, there are still many resources to help you set up a basic structure, but building serious production-grade applications is a completely different matter.
Building long-running agents: why context engineering is needed
Let's start with reliability. When an agent needs to run for a long time and maintain coherent conversation and behavior, certain mechanisms must be adopted to prevent the gradual accumulation of errors—otherwise the system will quickly collapse. The core of all this is what we call: context engineering.
By 2025, large models are already very smart. But even the smartest people cannot complete tasks efficiently without context.
"Prompt Engineering" originally referred to manually designing task prompts. "Context Engineering" is its advanced version, emphasizing the automatic construction of context in a dynamic system. This is the most important engineering task when building AI agents.
An example of a common architecture:
The main agent breaks down the task into multiple parts
Assigns sub-agents to execute separately
Finally merges the results of each sub-task
This approach looks appealing for tasks with parallel components, but it is actually very fragile.
For example, if the task is to "build a Flappy Bird game clone". You break it into two sub-tasks:
Sub-task 1: Create the background and green pipes;
Sub-task 2: Create a bird that can fly up and down.
However, sub-agent 1 makes a mistake and creates a Super Mario-style background; sub-agent 2 creates a bird that not only doesn't match the game asset's style but also behaves incorrectly. Finally, the main agent has to force these "mismatched" results together, which is almost a disaster.
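A minimal sketch of this fragile fan-out pattern, under the assumption of a generic call_llm stand-in (nothing here comes from the original post's code): each sub-agent sees only its own sub-task string, so conflicting implicit assumptions are baked in from the start.

```python
# Hypothetical sketch of the fragile fan-out pattern described above.
def call_llm(prompt: str) -> str: ...  # stand-in for any chat-completion call

def flawed_multi_agent(task: str) -> str:
    # e.g. task = "build a Flappy Bird game clone"
    subtasks = call_llm(f"Split this task into independent subtasks, one per line: {task}")
    # Each sub-agent receives only its own subtask string. It never sees the
    # original task, the other subtasks, or how they were interpreted.
    results = [call_llm(f"Complete this subtask: {s}") for s in subtasks.splitlines()]
    # The main agent must now merge outputs built on conflicting implicit
    # assumptions (a Mario-style background, a mismatched bird, ...).
    return call_llm("Combine these results into one program:\n" + "\n".join(results))
```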
This is not a contrived example; real-world tasks are full of details and ambiguities. You might think that also passing the original task description to the sub-agents would solve the problem. It is not enough: real systems involve multiple rounds of conversation and interleaved tool calls, and any of those details can change how the task is understood.
Principle 1: Share context, not just messages, but the complete agent trace.
Redesign your system to ensure that each sub-agent has the context trace of the previous agent.
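One way to follow Principle 1, sketched with the same hypothetical stand-in: every sub-agent call carries the full trace accumulated so far, and its own work is appended back into that trace.

```python
# Sketch of Principle 1: a sub-agent receives the full trace so far,
# not just its own subtask, and its work is appended back into the trace.
def call_llm(prompt: str) -> str: ...  # hypothetical stand-in, as before

def run_subtask_with_context(trace: list[str], subtask: str) -> str:
    context = "\n".join(trace)                       # every prior decision and action
    result = call_llm(f"{context}\n\nNow complete this subtask: {subtask}")
    trace.append(f"Subtask: {subtask}\nResult: {result}")   # actions become shared context
    return result
```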
But the problem isn't over yet. If you give the same task, this time you might get a background and bird with completely inconsistent styles. Why? Because the sub-agents cannot see each other's work process, they default to contradictory implicit assumptions.
Principle 2: Behavior implies decision, and inconsistent decisions will lead to erroneous results.
I want to emphasize: Principles 1 and 2 are extremely crucial and should almost never be broken.
Any architecture that violates these two principles should be discarded by default.
You might feel this is too restrictive, but there is actually a lot of architectural design space to explore. For example, the simplest approach: linear single-threaded agents.
The advantage is continuous context. The drawback: when the task is very large and the context window overflows, things start to break down.
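For reference, a linear single-threaded agent has roughly the following shape (call_llm and execute are hypothetical stand-ins): one loop, one continuously growing trace, and every action taken with the full history in view.

```python
# Sketch of the linear single-threaded agent: one loop, one continuously
# growing trace, every action taken with the full history in view.
def call_llm(prompt: str) -> str: ...   # hypothetical stand-in for a chat-completion call
def execute(action: str) -> str: ...    # hypothetical tool executor

def single_threaded_agent(task: str, max_steps: int = 50) -> str:
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(trace) + "\nWhat is the next action? Reply DONE when finished.")
        if action.startswith("DONE"):
            return action
        observation = execute(action)
        trace.append(f"Action: {action}\nObservation: {observation}")
    return "Step limit reached"
```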
Is there a way to improve? Of course, but it's more complex.
To be honest, a simple architecture is already sufficient in most cases, but for those who truly need to handle long-running tasks and are willing to put in the effort, it is possible to do better. There are many ways to solve this problem; today I will introduce just one way to build stronger long-running agents:
Introduce a dedicated LLM to "compress" the historical context, distilling it into key events, decisions, and facts. This is genuinely hard: you need to understand what information truly matters and build a system that is good at distilling it.
Sometimes you even need to fine-tune a smaller model to do this—we've done this kind of work at Cognition.
Its benefit is: it can maintain context consistency over longer time scales. Although it will eventually hit limits, this is a significant step forward.
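A rough sketch of the compression idea, with an assumed character threshold and prompt (the post itself gives no implementation details): when the trace grows too long, a dedicated compressor model distills the older portion, and the agent continues from the summary plus the most recent raw steps.

```python
# Sketch of trace compression: once the trace grows too long, a dedicated
# model distills the older portion into key events and decisions, and the
# agent continues from that summary plus the most recent raw steps.
def compressor_llm(prompt: str) -> str: ...   # hypothetical: a smaller or fine-tuned model

def maybe_compress(trace: list[str], max_chars: int = 50_000, keep_recent: int = 5) -> list[str]:
    if sum(len(step) for step in trace) < max_chars:
        return trace
    summary = compressor_llm(
        "Distill this agent trace into the key events, decisions, and facts "
        "needed to continue the task:\n" + "\n".join(trace[:-keep_recent])
    )
    return [f"Summary of earlier work:\n{summary}"] + trace[-keep_recent:]
```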
Practical application of principles: two good agent designs
As an agent builder, you should ensure that every action in the system is executed based on the context of existing decisions.
The ideal state is: all actions are mutually visible. But due to context window and resource limitations, this is not always feasible. So you need to make a trade-off between "reliability vs. system complexity."
Here are some real-world examples:
Claude Code's sub-agent design
As of June 2025, Claude Code is an agent with sub-task capabilities. However, it never runs main agents and sub-agents in parallel. Sub-agents are typically only used to "answer questions" and do not involve writing code.
Why? Because sub-agents lack the main agent's context and cannot handle complex tasks. If multiple sub-agents were run, they would likely produce conflicting answers.
The advantage of this design is that sub-agent queries do not pollute the main agent's history, allowing the main agent to retain a longer context trace.
Claude Code's designers intentionally chose a simple and reliable design.
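The post does not show Claude Code's internals, but the general shape of such a design can be sketched as follows (purely illustrative, with a hypothetical call_llm helper): the sub-agent only answers questions over read-only material, and only its final answer enters the main trace.

```python
# Sketch of the pattern described above (not Claude Code's actual implementation):
# the sub-agent is only asked to answer a question over read-only material, and
# only its final answer (not its long search history) enters the main trace.
def call_llm(prompt: str) -> str: ...   # hypothetical stand-in

def ask_subagent(question: str, readable_files: dict[str, str]) -> str:
    sub_context = "\n".join(f"--- {name} ---\n{text}" for name, text in readable_files.items())
    return call_llm(f"{sub_context}\n\nAnswer this question only; do not write or edit code:\n{question}")

def main_agent_step(trace: list[str], question: str, files: dict[str, str]) -> None:
    answer = ask_subagent(question, files)
    trace.append(f"Q: {question}\nA: {answer}")    # the sub-agent's detour stays out of the trace
```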
Edit Apply Model
In 2024, many models were not good at modifying code. So a practice called "Edit-Apply Model" became popular:
The large model generates "Markdown formatted" change descriptions
A smaller model rewrites the entire code file based on this description
While this was more reliable than a large model directly outputting code diffs, it still had problems: the smaller model might misunderstand ambiguities in the description, leading to incorrect modifications.
By 2025, more and more systems chose to combine "change decision + apply change" into a single step, completed by a single model, improving overall reliability.
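The two patterns can be contrasted with a rough sketch; big_model and small_model are hypothetical stand-ins for a frontier model and a cheaper one, and the prompts are assumptions.

```python
# Sketch contrasting the two patterns. big_model() and small_model() are
# hypothetical stand-ins for a frontier model and a cheaper model.
def big_model(prompt: str) -> str: ...
def small_model(prompt: str) -> str: ...

def edit_apply_2024(file_text: str, request: str) -> str:
    # Step 1: the large model describes the change in Markdown prose.
    description = big_model(f"Describe how to change this file so that: {request}\n\n{file_text}")
    # Step 2: a smaller model rewrites the whole file from that description.
    # Any ambiguity in the description can be misread at this stage.
    return small_model(f"Rewrite the file, applying this change:\n{description}\n\n{file_text}")

def edit_single_step(file_text: str, request: str) -> str:
    # Decision and application are collapsed into one call by one model.
    return big_model(f"Rewrite this file so that: {request}\n\n{file_text}")
```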
Currently, multi-agent systems are not good at communication and collaboration
You might wonder: can we let multiple decision-makers "communicate," like humans, to reach a consensus?
Theoretically, it's good, but current large model agents do not possess stable long-context collaborative conversation capabilities. Efficient human communication relies on complex metacognition and language skills, which are not what current agents excel at.
The multi-agent concept became popular as early as the ChatGPT era. But as of 2025, such systems remain very fragile: dispersed decisions, difficulty in context sharing, and low fault tolerance.
I believe that as single-agent capabilities improve in the future, efficient collaboration between agents will "happen incidentally," which will be a major breakthrough in parallelism and efficiency.
Moving towards a more general theory
These principles regarding "context engineering" may become part of the standard methodology for building future agents.
At Cognition, we continuously implement these principles in our tools and frameworks, and we are constantly trying and iterating. Of course, our theory is not perfect, and as the field develops, we also need to maintain flexibility and humility.
Employee: Boss, stop leaking secrets, okay?
Netizens: We want more!
This article resonated with many people building agents. "Looks like I'm not the only one who's encountered these problems!"
Even Devin's colleagues couldn't help but advise this outspoken boss: "Hey boss, stop leaking secrets!"
Some netizens also felt that certain points in the article are debatable, for example the drawbacks of parallel main/sub-agent processing it describes. One commenter argued:
These disadvantages may only apply to the code editing domain. As long as the task has clear inputs and outputs, no side effects, and requires only limited context transfer, it should be possible to coordinate their execution, which is the same principle as building data pipelines and functional programming.
Another netizen supported this view.
"This will be domain-specific and sub-agent specific. But a simple way is to first pass in the full context window and determine what the key parts are when the sub-agent is done."
For context compression, build task-specific system prompts. Run A/B tests with agents using the full prompt versus the compressed prompt. If differences show up, ask the agent using the full prompt what led it to perform differently, and merge those differences into the compressed prompt. This can be automated with heavy use of AI.
Ultimately the A/B versions should converge. At that point you can keep using system prompts and models to compress context, or collect samples from this compression tool to fine-tune a model, which speeds things up and saves some money.
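Read as a procedure, the commenter's suggestion is roughly the loop below, sketched entirely with hypothetical helpers; it is a reader's proposal, not something from the Cognition post.

```python
# Sketch of the commenter's proposed A/B loop (hypothetical helpers throughout).
def call_llm(prompt: str) -> str: ...                 # generic chat-completion call
def run_agent(task: str, context: str) -> str: ...    # runs an agent with a given context

def refine_compression_prompt(task: str, full_context: str, compress_prompt: str) -> str:
    compressed = call_llm(f"{compress_prompt}\n\n{full_context}")
    result_full = run_agent(task, full_context)
    result_compressed = run_agent(task, compressed)
    if result_full != result_compressed:
        # Ask which details from the full context explain the gap, then fold
        # them back into the compression prompt.
        missing = call_llm(
            "These two runs of the same task differ.\n"
            f"Full-context result: {result_full}\nCompressed-context result: {result_compressed}\n"
            "Which details from the full context explain the difference?"
        )
        compress_prompt += f"\nAlways preserve: {missing}"
    return compress_prompt
```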
This netizen also noted: if you use a model like o3, its reasoning about why it can or cannot complete a task is very good, and you can make real progress just by asking it for ideas on how to improve things.
One netizen even tested it directly on Claude Research: the screenshot showed that for non-programming tasks, large models can still handle 5 concurrently running agents!
"Tried Claude Research, and for non-programming tasks, it runs 5 (sub-tasks) concurrently. The conclusion is also natural: a hybrid architecture is the correct solution."
Regarding this point, Yan had some doubts and explained:
"Parallel reading (readonly) files is indeed not a big problem, but I suspect it's actually just a traditional tool used to collect multiple information sources for the main agent to synthesize."
In addition, there were two points of disagreement in the netizens' discussion: first, is LLM trace distillation reliable?
Netizen EddyLeeKhane held a negative view, believing that compressing historical context is prone to "hallucinations" and misjudging key points, thereby undermining context.
That said, many commenters treated it as a viable method for scaling to long-running tasks.
In this editor's view, whether compressing history works better than simply retaining a long context has no universal answer; it depends on the specific model's performance and the trace design.
Second, do single-threaded agents overly limit performance?
Netizen Adam Butler questioned whether single-threading limits concurrent processing, arguing that it will have to rely on o3 or even faster models to become practical. This echoes Yan's own point in the article: at present, single agents are not yet good and stable enough.
Well, it can only be said that current agent-building technology still has a very long way to go. But precisely because of this, we are also seeing unprecedented opportunities for innovation: Anthropic's MCP, which tackles agent-tool invocation, Google's A2A protocol, and now Devin's "context engineering". Are these not new sources of hope?
What are your thoughts on this, esteemed experts? Feel free to comment.
Reference link:
https://cognition.ai/blog/dont-build-multi-agents#principles-of-context-engineering