Can AI "Admit Its Own Mistakes"? Solving the "Rashomon" of Multi-Agent Collaboration, Earning ICML 2025 Spotlight


Xinzhiyuan Report

Editors: Dinghui, Haokun

【Xinzhiyuan Guide】In multi-agent AI systems, once a task fails, developers often fall into the mystery of "who made the mistake, and where." Researchers from PSU, Duke University, Google DeepMind, and other institutions have, for the first time, proposed "Automated Failure Attribution," released the Who&When dataset, and explored three attribution methods, revealing the complexity and challenges of this problem.

You've built a "super AI team" – where each AI agent performs its own duties: some collect information, some are responsible for judgment, and others coordinate execution, working together to tackle complex tasks.

The vision sounds seamless, but in practice it often ends in failure. And here is the problem: when something goes wrong, how do you know which AI caused it?

Just like debugging code, finding which AI is the weak link amid mountains of model dialogue records, call logs, and intermediate results is nearly impossible, and the AI itself is a "black box."

This is the real dilemma facing multi-agent AI systems today: not only do they fail frequently, but tracing the source of problems is difficult.

To solve this "AI version of Rashomon," researchers from Pennsylvania State University, Duke University, Google DeepMind, and other institutions have, for the first time, proposed Automated Failure Attribution: letting AI raise its hand and say, "I made a mistake!"

The paper has not only won a Spotlight at the top conference ICML 2025; the accompanying Who&When dataset, the first dedicated benchmark for this task, and the related code have also been open-sourced.


Paper address: https://arxiv.org/pdf/2505.00212

Code address: https://github.com/mingyin1/Agents_Failure_Attribution

They say models are products. But as single models such as OpenAI o3, Gemini 2.5 Pro, and the newly released DeepSeek-R1-0528 grow increasingly powerful, why do we still need multi-agent AI systems?

Because the capabilities of any individual AI are still limited at the current stage, LLM-driven multi-agent systems show great potential in many fields.

However, these systems also have vulnerabilities: individual agent errors, misunderstandings between agents, and information transfer errors can all lead to overall task failure.


Currently, once a multi-agent AI system "crashes," developers often can only:

Manually "archaeologize": sift through lengthy interaction logs line by line, trying to find the problem.

Rely on experience: This debugging process heavily relies on the developer's deep understanding of the system and task.

This "needle in a haystack" troubleshooting method is not only inefficient but also severely hinders the rapid iteration and reliability improvement of the system.

There is an urgent need for an automated, systematic method to locate the cause of failure, effectively connecting "evaluation results" with "system improvements."


Core Contributions

Addressing the above challenges, this paper makes groundbreaking contributions:

1. Proposing and defining a new problem

For the first time, "Automated Failure Attribution" is formalized as a specific research task, clarifying the need to identify the failure-responsible agent and the decisive error step that led to the failure.

2. Constructing the first dataset – Who&When

The dataset contains extensive failure logs collected from 127 LLM multi-agent systems, including both systems generated automatically by algorithms and systems meticulously crafted by human experts, ensuring realistic and diverse scenarios.

Each failure log is accompanied by fine-grained human annotations:

"Who": Which Agent is the "culprit."

"When": At which step of the interaction did the decisive error occur.

"Why": A natural language explanation for the cause of the failure.

[Figure: (a) annotation time per annotator; (b) proportion of "uncertain" annotations; (c) disagreement rates when annotators vote on each other's "uncertain" cases]

Annotating the key error agent and identifying the specific erroneous step is challenging for ordinary people and domain experts alike.

Annotators need to parse complex logs, clarify the problem-solving logic of each agent, and judge whether each action is correct or misleading to the overall problem-solving process.

For example, if an agent uses a web browser to obtain important information needed to solve a problem, the annotator must check the browser history and visit every website to determine if the failure was due to a lack of relevant information on the website itself or the agent's failure to successfully retrieve it.

As shown in Figure (a) above, the three annotators spent 30.9, 30.2, and 23.2 hours, respectively, to complete the annotation. The process is extremely time-consuming, which is precisely what motivates research on automated failure attribution.

Furthermore, in many data instances the error is not attributable to a single agent but to several. Annotators must identify all of these errors and then select the most serious one, the one that directly causes the problem-solving failure. Since the severity of an error can be subtle, and to some extent subjective, this makes the process even harder.

Figure (b) above shows the proportion of "uncertain" annotations from the three annotators, ranging from 15% to 30%.

Figure (c) visualizes the disagreement rate when the annotators vote on each other's "uncertain" data.

Even at the voting stage there is still a certain degree of disagreement, further highlighting the complexity and challenge of the annotation work itself.

3. Exploring preliminary "Automated Attribution" methods

Based on the Who&When dataset, the paper designed and evaluated three different automated failure attribution methods: All-at-Once, Step-by-Step, and Binary Search.

Experimental results show that these three methods each have their strengths in terms of accuracy, cost, and analysis granularity, and no single method completely outperforms the others on all metrics. Together, they form a preliminary "solution library" for the automated failure attribution task.

· All-at-Once: One-time completion

Provides the LLM with the user query and the complete failure log, requiring it to identify the fault-responsible Agent and the decisive error step in one go.

Advantages: One-step process, lower cost, potentially good at macro-level responsibility judgment.

Disadvantages: May face a "needle in a haystack" problem, making it difficult to precisely locate specific error steps from a long context.
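
Conceptually, All-at-Once is a single judge call over the entire log. Below is a minimal Python sketch of that idea; `call_llm` is a hypothetical placeholder for one chat-completion request, and the prompt wording is illustrative rather than the paper's exact prompt.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a single LLM chat-completion call returning text."""
    raise NotImplementedError

def attribute_all_at_once(query: str, log: list[dict]) -> dict:
    """Ask the judge LLM for the responsible agent and decisive step in one pass."""
    log_text = "\n".join(f"[step {e['step']}] {e['agent']}: {e['content']}" for e in log)
    prompt = (
        "A multi-agent system failed to solve the following task.\n"
        f"Task: {query}\n\nFull interaction log:\n{log_text}\n\n"
        "Identify the agent responsible for the failure and the decisive error step. "
        'Reply as JSON: {"agent": "...", "step": 0, "reason": "..."}'
    )
    return json.loads(call_llm(prompt))  # one call over the whole log
```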

· Step-by-Step: Incremental review

Simulates manual debugging, allowing the LLM to start from the beginning and check the interaction log step by step, judging whether an error occurs at each stage until the "true culprit" is found.

Advantages: Detailed, like peeling an onion; better at locating the specific error step.

Disadvantages: Requires multiple model queries, higher call cost, and accumulated errors from local judgments may lead to "misjudgments."
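
A minimal sketch of the Step-by-Step idea, reusing the hypothetical `call_llm` placeholder from the sketch above: the judge sees the log one step at a time and stops at the first step it flags as the decisive error, so one LLM call is spent per step.

```python
def attribute_step_by_step(query: str, log: list[dict]) -> dict | None:
    """Reveal the log step by step; stop at the first step flagged as the decisive error."""
    seen: list[str] = []
    for entry in log:
        seen.append(f"[step {entry['step']}] {entry['agent']}: {entry['content']}")
        prompt = (
            f"Task: {query}\n\nInteraction so far:\n" + "\n".join(seen) + "\n\n"
            "Does the most recent step contain an error that dooms the task? Answer YES or NO."
        )
        if call_llm(prompt).strip().upper().startswith("YES"):  # one LLM call per step
            return {"agent": entry["agent"], "step": entry["step"]}
    return None  # no decisive error flagged by the judge
```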

· Binary Search: Divide and conquer

This is a compromise strategy, with total cost and effect falling between the former two.

The method splits the log in half and asks the LLM whether the error occurred in the first or second half, then recursively bisects within that half until the problematic step and the responsible agent are pinpointed.
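
A minimal sketch of the Binary Search idea under the same assumptions: in each round the judge sees only the first half of the remaining range and says whether the decisive error already occurred there, so roughly log2(n) LLM calls suffice.

```python
def attribute_binary_search(query: str, log: list[dict]) -> dict:
    """Narrow down the decisive error step by repeated halving."""
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        first_half = "\n".join(
            f"[step {e['step']}] {e['agent']}: {e['content']}" for e in log[lo:mid + 1]
        )
        prompt = (
            f"Task: {query}\n\nPartial interaction log:\n{first_half}\n\n"
            "Does the decisive error already occur within these steps? Answer YES or NO."
        )
        if call_llm(prompt).strip().upper().startswith("YES"):
            hi = mid        # error lies in this half; keep narrowing here
        else:
            lo = mid + 1    # otherwise search the other half
    return {"agent": log[lo]["agent"], "step": log[lo]["step"]}
```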


Experimental Results and Key Findings

Experiments were conducted under two settings: one in which the LLM knows the ground-truth answer to the problem the multi-agent system is solving (With Ground Truth), and one in which it does not (Without Ground Truth).

The experiments mainly used GPT-4o, but other models were also tested (the Llama series, Qwen series, other GPT models, OpenAI o1, and DeepSeek R1).

Through a systematic evaluation of the above methods on the Who&When dataset, many important findings were obtained:

· A long way to go: performance is worrying

Current methods are still far from perfect. Even the best single method reaches only about 53.5% accuracy in identifying the "culprit" agent, and accuracy in precisely locating the error step drops to a mere 14.2%.

This means that even the best strategy can only barely identify the "culprit" more than half the time, and as for where the mistake occurred, it is essentially guessing.

Some methods even perform worse than random guessing, showing how difficult the task is.

· Each has its strengths, no "all-rounder" champion

Different methods perform differently on various subtasks; there is no single champion.

For example, the All-at-Once method is relatively best at judging "who," more accurately identifying which agent should be held responsible; the Step-by-Step method has the edge in pinning down the specific error step ("when"), since its detailed checks pay off; and the Binary Search method sits in the middle on every count.

This reflects each solution's trade-off: a holistic scan is better at finding the "who," while a step-by-step investigation is better at finding the "when."


· "Combination punch" is more effective, but costly

Researchers tried strategies that combine different methods (e.g., first using the "All-at-Once method" for a preliminary judgment of the responsible Agent, then applying the "Step-by-Step method" for precise error step localization).

Results show that this hybrid method indeed improves overall attribution performance but also leads to a significant increase in computational cost.
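
One way to picture such a combination, under the same assumptions as the earlier sketches, is to take the "who" from the All-at-Once pass and the "when" from the Step-by-Step pass. The sketch below is an illustration of the idea, not the paper's exact pipeline, and its cost is roughly the sum of the two component methods.

```python
def attribute_hybrid(query: str, log: list[dict]) -> dict:
    """Take 'who' from the one-pass judge and 'when' from the per-step judge."""
    coarse = attribute_all_at_once(query, log)   # relatively strong at naming the agent
    fine = attribute_step_by_step(query, log)    # relatively strong at finding the step
    return {
        "agent": coarse["agent"],
        "step": fine["step"] if fine else coarse.get("step"),
    }
```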


· Existing SOTA models collectively "stall"

Most surprisingly, even current state-of-the-art reasoning models (such as OpenAI o1 and DeepSeek R1) struggled with this task.

This further highlights the inherent difficulty of the automated failure attribution task itself: its demands on reasoning ability exceed what current SOTA models deliver on conventional tasks.


Explicit reasoning prompts are crucial: explicitly requiring the LLM to give reasons for its attribution in the All-at-Once and Step-by-Step prompts improves performance.


Context length constrains performance: experiments also found that as the failure logs grow longer, the performance of all attribution methods tends to decline, with accuracy in locating the error step being especially sensitive.


Moving towards more intelligent, more reliable Multi-Agent systems

"Automated Failure Attribution" is an indispensable part of the Multi-Agent system development process.

It will help us gain deeper insights into the failure patterns of Multi-Agent systems, transforming the perplexing question of "where did it go wrong, and whose fault is it" into a quantifiable analytical problem.

By building a bridge between "evaluation" and "improvement," it will ultimately enable more reliable, more intelligent, and more trustworthy Multi-Agent collaboration systems.


Author Introductions

Shaokun Zhang


A third-year Ph.D. student at Penn State University, advised by Professor Qingyun Wu.

His recent research interests focus on the intersection of Agentic AI and reinforcement learning.

Currently interning at NVIDIA, focusing on LLM agent research. Prior to this, he received his bachelor's degree in Computer Science from Xidian University.

Ming Yin


A first-year Ph.D. student at Duke University, advised by Professor Yiran Chen.

Received his bachelor's degree from the School of the Gifted Young, University of Science and Technology of China, at the age of 20 in 2024.

Currently interested in LLM agents, LLM inference, and trustworthy AI.

From May to August 2025, he will be a Generative AI Research Intern at Zoom in Seattle.

References:

https://arxiv.org/pdf/2505.00212

https://skzhang1.github.io/

https://mingyin1.github.io/


Main Tag: Multi-Agent AI

Sub Tags: Automated Failure Attribution, AI Debugging, Large Language Models, Machine Learning Research

