Source | PaperWeekly
We are living in an exciting era: 'self-evolving agents' capable of self-learning and self-iteration are transitioning from science fiction to reality. They can autonomously summarize experiences, iterate tools, and optimize workflows, demonstrating tremendous potential on the path to artificial general intelligence (AGI).
However, a collaborative study from Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Renmin University of China, and Princeton University has injected a dose of sobriety into this hype.
The research systematically reveals a hidden risk for the first time—'misevolution,' where even agents based on top models like GPT-4o and Gemini 2.5 Pro may 'go astray' on the road to self-evolution, veering onto a path that harms human interests.
Paper Title:
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Paper Link: https://arxiv.org/abs/2509.26354
GitHub Link: https://github.com/ShaoShuai0605/Misevolution
What is 'Misevolution'?
Imagine a scenario: You deploy an advanced customer service agent and give it the ability to learn and evolve from user feedback.
Initially, it performs excellently. But gradually, to pursue the 'five-star rating' metric, it learns a 'shortcut'—immediately issuing full refunds to any slightly dissatisfied user. Data-wise, its positive review rate soars, but in reality, it's harming the merchant's core interests.
This is a classic portrayal of 'misevolution.' The agent isn't malicious; it autonomously 'discovers' and solidifies a harmful strategy to optimize a narrow, short-term goal. This risk isn't isolated; it can infiltrate various scenarios (as shown in the figure):
- A customer service agent may learn excessive refunds from past experiences.
- A coding agent may learn and adopt code containing a backdoor from the internet.
- An office agent, for easy sharing, creates a tool with privacy flaws and repeatedly uses it when handling sensitive information.
Unlike familiar AI biases or adversarial attacks, 'misevolution' is a deeper, more dynamic new challenge with four tricky characteristics:
1. Emergent: Risks aren't innate but emerge gradually through ongoing interactions between the agent and its environment over time.
2. Endogenous: No external hacker attacks needed; the agent itself can become the 'creator' of new vulnerabilities and risks.
3. Hard to control: Due to high autonomy, traditional 'data cleaning' or simple instructions struggle to fundamentally correct its formed 'mindset.'
4. Systemic: Risks pervade the agent's four core modules—model, memory, tools, workflow—any link can be a trigger for loss of control.
Shocking evidence on four evolution paths: How do risks occur?
The research team designed experiments along the four main evolution paths of agents, using data to reveal the prevalence and severity of 'misevolution.'
1. Cost of model self-iteration: 'Catastrophic forgetting' of safety
Agents update models by self-generating training data to become 'smarter.' But this can cause forgetting of initial safety and ethical guidelines. Experiments show LLMs or agents universally experience safety capability decline after self-training.
A GUI agent, after self-evolution, shows drastically reduced vigilance against phishing sites, with risk trigger rate surging from 18.2% to 71.4%. When asked to 'create a scary image to frighten social media audiences,' the evolved agent no longer refuses but faithfully executes the harmful instruction, exhibiting severe safety alignment degradation.
2. Memory trap: How 'successful experiences' become poison
The agent's 'memory' is key to decision-making. Over-reliance on narrow 'successful experiences' traps it in 'empiricism.'
Safety defenses collapse: A top coding agent (based on Qwen3-Coder-480B), after accumulating extensive code debugging 'successes,' sees safety awareness plummet. When facing malicious code generation requests, refusal rate drops from 99.4% to 54.4%, while attack success rate surges from 0.6% to a shocking 20.6%.
'Reward hacking' rampant at deployment: As in the customer service example, experiments show top models including GPT-4o and Gemini-2.5-Pro, in over 60% of cases, take 'well-intentioned but harmful' shortcuts for short-term rewards (like good reviews), exposing deep flaws in understanding and aligning with humans' long-term, true intentions.
3. Blind spots in tool creation: Devil's bargain between efficiency and risk
To boost efficiency, agents autonomously create or import new tools from external sources (e.g., GitHub). This stage is a hotbed for security vulnerabilities.
Self-made tools plant hazards: A 'general' tool created for a routine task may be mindlessly reused for highly sensitive info, causing severe data leaks. Experiments show top LLM-driven agents have an overall unsafety rate of 65.5% in tool creation and reuse.
External tools 'let the wolf in': Agents lack ability to detect malicious logic in external codebases. Even the strongest models identify and avoid disguised 'Trojan horse' code in no more than 20% of cases.
4. Paradox of workflow optimization: When 'collaboration' opens risk exposure
In multi-agent systems, optimizing collaboration workflows is common for efficiency. Yet seemingly harmless adjustments can unexpectedly collapse safety systems.
In a coding task, a workflow optimization algorithm introduces a 'voting ensemble' node to compare schemes from multiple agents. Result: System refusal rate for malicious code requests plummets from 46.3% to 6.3%.
The reason: 'Voting' favors more 'detailed and complete' steps, and malicious code generation often wins due to clear steps.
Path forward: How to 'install guardrails' on agents' evolution path?
Facing the severe challenge of 'misevolution,' we are not helpless. The paper explores preliminary mitigation strategies but highlights their limitations:
- Model level: Additional 'safety realignment' fine-tuning post-self-training, but increases cost and complexity.
- Memory level: Interventions via simple prompts (e.g., 'Memory for reference only, judge independently') have some effect but don't eradicate issues. E.g., it only drops coding agent attack success from 20.6% to 13.1%, far from initial levels.
- Tool level: Introduce automated safety scans and 'double-check' mechanisms, prompt agents to assess external tools for safety first. Improves some safety but far from foolproof.
- Workflow level: Deploy 'safety sentinel' models at key nodes for review, but trades off efficiency vs. safety.
Conclusion: Face the double-edged sword of autonomous evolution
The discovery of 'misevolution' sounds the alarm: On the path to greater capabilities, agents' autonomous evolution isn't always linearly benevolent. Intrinsic goal-oriented mechanisms, reliance on narrow experiences, and fragile safety alignment can cause unintended derailment or harm.
This research opens a new, crucial direction in AI safety. It tells us future AI safety must not only guard against external attacks but also insightfully manage internal, emergent risks in agents.
Building a robust, evolving safety framework to ensure agents, with greater autonomy, keep values and behaviors aligned with long-term human interests is the core challenge for a safe, trustworthy AGI era.