Which Model Should a Reliable Agent Use? The "Lost in Conversation" Phenomenon in LLM Multi-turn Dialogues | Microsoft Latest

Introduction: Microsoft Research recently partnered with Salesforce Research to release a study titled "Lost in Conversation," which finds that the performance of even the most advanced LLMs drops significantly in multi-turn conversations, by an average of 39%. This phenomenon is referred to as getting "lost" in conversation. This article analyzes how major models (including Claude 3.7-Sonnet, Deepseek-R1, and others) fare in multi-turn conversations, and dissects the root causes of models getting "lost" along with effective mitigation strategies. This matters for Agent developers choosing a model and is worth a careful read. The latter part of the article links to the open-source code and dataset the researchers used.

Multi-Turn Dialogue: The Strongest AI Models Can Get "Lost"


Performance comparison of 15 LLM models in single-turn (FULL) and multi-turn (SHARDED) conversations, showing a significant performance drop in multi-turn dialogue.

When the most advanced Large Language Models (LLMs) face multi-turn conversations, their performance drops significantly, with an average decrease of up to 39%. Microsoft Research's latest study "Lost in Conversation," in collaboration with Salesforce Research, revealed this prevalent but rarely noticed problem through 200,000 dialogue simulations across 15 top models. The study found that both commercial closed-source models (like GPT-4.1, Gemini 2.5 Pro) and open-source models (like the Llama series) struggle with the "getting lost" issue, posing a severe challenge for engineers developing Agent systems.


Getting Lost Leads to a 112% Surge in Unreliability


Comparison analysis of Aptitude and Reliability, showing that reliability decline is the main issue in multi-turn conversations.

Researchers used an innovative metric decomposition to divide the performance decline of LLMs in multi-turn conversations into two parts:

• Aptitude: dropped by only 16%

• Unreliability: surged by 112%

This means the gap between the model's best and worst performance more than doubled. This high unreliability explains why your AI assistant sometimes performs excellently but sometimes inexplicably "forgets things," with results varying significantly even for the same question across multiple attempts.
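To make the decomposition concrete, here is a minimal Python sketch of how aptitude and unreliability could be estimated from repeated simulations of the same instruction. The percentile choices (90/10) and the example scores are illustrative assumptions for this article, not the paper's exact implementation or data.

```python
import numpy as np

def aptitude_and_unreliability(scores, hi=90, lo=10):
    """Estimate aptitude and unreliability from repeated runs of one instruction.

    scores: per-simulation scores (0-100) for the SAME instruction.
    Aptitude is taken as a high percentile (best-case behaviour);
    unreliability is the spread between the high and low percentiles.
    The 90/10 percentiles here are an assumption for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    aptitude = np.percentile(scores, hi)
    unreliability = np.percentile(scores, hi) - np.percentile(scores, lo)
    return aptitude, unreliability

# Made-up example: the same instruction simulated 10 times in FULL vs SHARDED mode
full_runs    = [92, 95, 90, 94, 93, 91, 96, 92, 95, 90]
sharded_runs = [88, 35, 90, 42, 85, 30, 87, 40, 86, 33]

print(aptitude_and_unreliability(full_runs))     # high aptitude, small spread
print(aptitude_and_unreliability(sharded_runs))  # aptitude dips, spread explodes
```

The key intuition: the sharded runs still occasionally hit high scores (aptitude barely moves), but the gap between best and worst runs widens dramatically, which is exactly what "unreliability surged by 112%" describes.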

Sharded Simulation: The Experimental Design Behind "Getting Lost"


Six major task types covered by the study and examples of sharded instructions, illustrating how a full instruction is broken down into multiple information fragments.


Researchers designed an innovative experimental framework called "sharded simulation," which breaks down complete instructions into multiple information fragments (shards) and gradually reveals them in multi-turn conversations. This method simulates the real-world process of users gradually clarifying their needs in a dialogue, unlike traditional evaluations where complete information is provided at once. The study covers six major task domains:

1. Programming (Code)

2. Database Query (Database)

3. API Calls (Actions)

4. Mathematical Problems (Math)

5. Data-to-text Generation (Data-to-text)

6. Multi-document Summarization (Summary)

This broad coverage ensures the study's findings have wide applicability.
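For intuition, a sharded instruction from any of these domains can be pictured as a simple record holding the original full instruction alongside its fragments. The field names and example text below are hypothetical; check the released dataset for the actual schema.

```python
# Hypothetical shape of one sharded instruction (field names are illustrative,
# not necessarily the schema of the released dataset).
sharded_example = {
    "task": "code",
    "full_instruction": (
        "Write a Python function that reads a CSV file, drops rows with "
        "missing values, and returns the mean of the 'price' column."
    ),
    "shards": [
        "I need a Python function that works with a CSV file.",   # shard 1: high-level intent
        "It should drop any rows that have missing values.",      # shard 2: constraint
        "In the end, return the mean of the 'price' column.",     # shard 3: expected output
    ],
}
```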

Instruction Sharding and Dialogue Simulation Types


This figure illustrates the core experimental design methodology of the study, divided into two parts:

1. Upper part (Instruction Sharding):

• Shows how researchers split a complete single-turn instruction (blue square) into multiple information fragments (small yellow squares)

• This is the basis of the "sharded simulation" experiment in the paper, simulating the scenario where users provide information gradually in multi-turn dialogue

2. Lower part (Dialogue Simulation Types):

• Shows five different experimental setups and their information flow:

• FULL: The complete instruction is provided entirely in the first turn (baseline scenario)

• SHARDED: The instruction is split into multiple fragments and provided gradually in different turns (simulates real multi-turn dialogue)

• CONCAT: All fragments are provided in the first turn, but kept in fragment form

• RECAP: Uses the sharding pattern but adds a final turn summarizing all previous information

• SNOWBALL: Each turn cumulatively restates all previous information

This figure intuitively explains why multi-turn dialogue leads to performance degradation and how strategies like RECAP and SNOWBALL work.
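The five settings differ only in how the shards are packed into user turns. The sketch below reconstructs that logic from the figure's description; it is an illustration of the idea, not the repository's actual simulator code, which also handles system prompts, assistant turns, and evaluation hooks.

```python
def build_user_turns(shards, mode):
    """Pack instruction shards into user turns for each simulation type (illustrative)."""
    if mode == "FULL":
        # One turn containing the fully specified instruction
        # (approximated here by joining the shards).
        return [" ".join(shards)]
    if mode == "CONCAT":
        # One turn, but the information stays in fragment form.
        return ["\n".join(f"- {s}" for s in shards)]
    if mode == "SHARDED":
        # One fragment per turn, revealed gradually.
        return list(shards)
    if mode == "RECAP":
        # Sharded turns plus a final turn that restates everything.
        recap = "To recap, here is everything I need:\n" + "\n".join(f"- {s}" for s in shards)
        return list(shards) + [recap]
    if mode == "SNOWBALL":
        # Each turn cumulatively restates all information so far.
        return ["\n".join(shards[: i + 1]) for i in range(len(shards))]
    raise ValueError(f"unknown mode: {mode}")

print(build_user_turns(
    ["I need a SQL query.", "The table is 'orders'.", "Only rows from 2024."],
    "RECAP",
))
```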

Helping You Test and Improve Agent Systems

The Microsoft research team has open-sourced the full code repository and dataset for the "Lost in Conversation" study, providing you with a powerful set of tools to test and improve your own Agent systems. The repository includes a complete dialogue simulation framework (simulator_full.py, simulator_sharded.py, etc.), covering single-turn full instructions, multi-turn sharded instructions, and implementations of RECAP/SNOWBALL strategies.

GitHub: https://github.com/Microsoft/lost_in_conversation

Hugging Face: https://huggingface.co/datasets/microsoft/lost_in_conversation

Key features of the code repository and dataset:

• Complete dialogue simulation framework supporting testing in different scenarios

• 600 high-quality, human-verified instructions and their sharded versions

• Covers six major practical scenarios including programming, math, and database queries

If you are an Agent developer, you can use these resources for three types of testing:

1. Evaluate the real performance differences of various foundation models in multi-turn dialogue

2. Validate the actual effectiveness of information integration strategies you design (like RECAP)

3. Diagnose which types of tasks your own Agent system is more likely to get "lost" in

Researchers recommend confirming the setup with small-scale experiments before conducting large-scale tests and paying attention to API provider rate limits. This toolset might be the most comprehensive available for evaluating LLM information integration capabilities, offering high reference value for building truly reliable multi-turn dialogue systems.
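If you want to inspect the data yourself, the dataset published at the Hugging Face link above should load with the standard `datasets` library. Any configuration, split, or field names beyond the repository path are assumptions here; verify them against the dataset card first.

```python
# pip install datasets
from datasets import load_dataset

# Repository path taken from the link above; split and field names are not
# assumed — inspect them before building anything on top.
ds = load_dataset("microsoft/lost_in_conversation")

print(ds)  # see which splits/configurations are available

first_split = list(ds.keys())[0]
first_record = next(iter(ds[first_split]))
print(first_record)  # examine one record's fields (full instruction, shards, task type, ...)
```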

⚠️ Models Start to Falter After Just Two Turns


Progressive sharding experiment results, demonstrating that even in just two turns of dialogue, model reliability significantly decreases.

The most alarming finding is that even in the simplest two-turn dialogues, LLMs' performance drops significantly. Researchers used a "progressive sharding" experiment to show that as long as the dialogue involves any degree of gradual information disclosure (even if split into just two fragments), model reliability collapses. This means your Agent system is at risk even when handling seemingly simple multi-turn dialogues, and users don't need to pose complex questions to encounter situations where the AI assistant "loses its way."

Why Even the Strongest Models Stumble

Through in-depth analysis of dialogue logs, the study identified four key factors contributing to models getting "lost":

1. Premature Assumptions: Models attempt to answer questions before having complete information, making numerous assumptions.

2. Answer Inflation: Over-reliance on previous (potentially incorrect) answers, leading to answers gradually "inflating" rather than rethinking.

3. Uneven Attention Distribution: Excessive focus on the first and last turns of the dialogue while neglecting information in intermediate turns.

4. Answer Verbosity: Generating overly lengthy answers, introducing more irrelevant assumptions and distracting the model itself.

These factors collectively cause even the most advanced models to gradually deviate from the correct path in multi-turn conversations.

Impact of Answer Verbosity on Performance


This table reveals an important finding: shorter answers are generally more effective than lengthy ones.

• The horizontal axis represents answer verbosity, from shortest (0-20%) to longest (80-100%).

• The vertical axis shows different task types (Code, Math, Database, etc.).

• The values in the table are the model's performance scores for that task.

Key Finding:

• In most tasks (especially Code, Database, Summary), shorter answers lead to better performance.

• For example, in the Code task, the score for the shortest answers (0-20%) is 55.3, while for the longest answers (80-100%) it's only 42.5.

• Only the Actions task performs best with medium verbosity (40-60%).

• Overall, shorter answers (0-40%) perform significantly better than lengthy answers (60-100%) on average.

This indicates that models generating overly long answers introduce more unnecessary assumptions, leading to getting "lost."

Claude 3.7-Sonnet and Deepseek-R1

Among all 15 models tested, Claude 3.7-Sonnet showed the strongest multi-turn conversation reliability, with a performance retention rate of 65.9%, leading other competitors. Although GPT-4.1 performed better in single-turn conversations, Claude had the smallest loss when transitioning from single-turn to multi-turn, particularly maintaining high levels in Math (85.4→70.0) and Summary (29.3→23.6) tasks.

Applicable Advice:

• If you are developing an Agent that requires complex multi-turn interaction, Claude 3.7-Sonnet might be the best current choice.

• If you are limited to open-source models, Llama 3.3-70B (64.2% performance retention) is the most cost-effective option.


As one of the two specialized reasoning models tested in the study, Deepseek-R1 exhibited a distinctly "two-faced" nature.

Single-turn Dialogue Advantage:

• Programming (Code) task: Top performance of 99.4 points

• Actions task: 97.0 points

• Math task: 95.5 points

Multi-turn Dialogue Disadvantage:

• Multi-turn performance is only 31.5%

• Retention rate is only 47.5%

• There was a capability loss of over 60% in almost every task.

Researchers specifically noted that despite Deepseek-R1 having extra reasoning (test-time compute) capability, this did not help it maintain stability in multi-turn conversations, indicating that "thinking" alone is not sufficient to solve information integration problems.

Advice for Agent Developers:

• Single-turn interaction scenarios: Deepseek-R1 is a highly competitive choice.

• Complex multi-turn dialogue scenarios: evaluate carefully, or consider using Deepseek-V3 instead.

🌡️ Lowering Temperature is Ineffective: Uncertainty is Not the Culprit


Test results for model unreliability at different temperature settings, proving that lowering the temperature does not effectively increase reliability in multi-turn dialogues.

A common misconception is that reducing the model's temperature parameter can increase consistency in multi-turn dialogue. Researchers specifically designed temperature experiments, and the results show:

• Single-turn dialogue: Lowering temperature is effective (reducing temperature from 1.0 to 0.0 can decrease unreliability by 50%).

• Multi-turn dialogue: Lowering temperature is almost ineffective (at a temperature of 0.0, unreliability is still around 30%).

This finding indicates that the root cause of the problem is not randomness but rather an inherent flaw in how models process information in a multi-turn context. Engineers need to note: simple adjustments to generation parameters cannot solve the "getting lost" problem in multi-turn dialogue.
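You can verify this on your own stack by replaying the same sharded conversation several times at a fixed temperature and measuring how much the outcomes spread. The sketch below is a minimal harness; `run_conversation` and `score_answer` are placeholders you would supply on top of your own model client and task grader.

```python
import statistics

def unreliability_at_temperature(run_conversation, score_answer, shards,
                                 temperature, n_runs=8):
    """Spread of outcomes for one instruction at a fixed temperature.

    run_conversation(shards, temperature) -> final answer string (your simulator)
    score_answer(answer) -> score on a 0-100 scale (your grader)
    """
    scores = [score_answer(run_conversation(shards, temperature))
              for _ in range(n_runs)]
    return {
        "temperature": temperature,
        "mean": statistics.mean(scores),
        # Per the study, this spread stays large in multi-turn settings
        # even at temperature 0.0.
        "spread": max(scores) - min(scores),
    }
```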

RECAP Strategy: Improving Multi-turn Dialogue Performance


Performance comparison of RECAP and SNOWBALL strategies, demonstrating that these methods can effectively mitigate performance degradation in multi-turn dialogue.

To address the "getting lost" problem, researchers tested two possible solutions:

1. RECAP (Final Recap): Add an extra turn before the end of the multi-turn dialogue to summarize all previously provided user information.

2. SNOWBALL (Cumulative Restatement): Restate all previous information in each turn.

The experimental results were significant: the RECAP strategy improved GPT-4o's multi-turn performance from 59.1% to 76.6%, mitigating about 40% of the performance drop.

Practical Advice: When designing Agent systems, consider adding an information review mechanism at critical decision points. While this cannot completely solve the problem, it can significantly reduce the risk.
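A RECAP-style mitigation can be bolted onto an existing agent with a few lines: before asking for the final answer, inject one extra user turn that restates everything gathered so far. The sketch below assumes the common OpenAI-style `{"role", "content"}` chat format, and the recap wording is an illustrative choice rather than the paper's template.

```python
def add_recap_turn(messages):
    """Append a recap turn restating all user-provided information so far.

    messages: chat history as a list of {"role": ..., "content": ...} dicts.
    Mirrors the RECAP idea from the study; prompt wording is illustrative.
    """
    user_points = [m["content"] for m in messages if m["role"] == "user"]
    recap = (
        "Before you answer, here is a recap of everything I have told you so far:\n"
        + "\n".join(f"- {p}" for p in user_points)
        + "\nPlease base your final answer on all of these points together."
    )
    return messages + [{"role": "user", "content": recap}]
```

Calling `add_recap_turn(history)` right before the final generation request is the whole intervention; the study's numbers suggest this alone recovers a substantial share of the lost performance.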

Five Practical Suggestions for Agent Architecture Design

Based on the study's findings, the following five suggestions can help you design more reliable Agent systems:

1. Delay Answer Generation: Avoid premature assumptions by explicitly instructing the model not to answer until sufficient information has been collected (a prompt-level sketch of suggestions 1-3 follows after this list).

2. Control Answer Length: Study data shows that shorter answers have a significantly higher success rate than lengthy ones.

3. Implement Information Review Mechanisms: Summarize known information at critical decision points.

4. Utilize Multi-Model Architecture: Use specialized models responsible for information integration and decision-making.

5. Train Users to Provide Complete Information: The study shows that providing the complete instruction at once yields far better results than scattering the same information across turns.

Using these strategies in combination can build more reliable Agent systems.
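Suggestions 1-3 can be expressed directly in a system prompt. The wording below is one possible phrasing, not a template from the paper; tune it against your own evaluation set.

```python
# One possible system prompt implementing suggestions 1-3 above
# (wording is illustrative, not taken from the paper).
SYSTEM_PROMPT = (
    "You are an assistant in a multi-turn conversation where the user reveals "
    "requirements gradually.\n"
    # 1. Delay answer generation
    "1. Do not give a final solution until the user confirms the request is "
    "complete; until then, ask short clarifying questions instead of guessing "
    "unstated requirements.\n"
    # 2. Control answer length
    "2. Keep every reply concise and focused on what was actually asked.\n"
    # 3. Information review mechanism
    "3. Before giving the final solution, briefly restate every requirement the "
    "user has provided and ask them to confirm."
)
```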

Researchers' Recommendations

The study's findings present a severe challenge to LLM developers: current mainstream evaluation methods focus excessively on ability (Aptitude) in single-turn, fully specified scenarios, while neglecting reliability in multi-turn, gradually clarified scenarios.

Researchers call for LLM developers to give equal importance to both dimensions in future model iterations and propose specific standards:

• An ideal LLM should maintain similar capability levels in both single-turn and multi-turn settings.

• Unreliability in multi-turn dialogue should be below 15%.

• These metrics should be achieved at the default temperature (T=1.0).

This shift will make the next generation of LLMs more suitable for building truly reliable conversational Agent systems.

In Conclusion

The "Lost in Conversation" study reveals key limitations of current LLMs. By selecting the most suitable model for your needs, combining it with information integration strategies like RECAP, and following the practical suggestions provided in the paper, you can significantly improve the reliability of your Agent system in multi-turn dialogue.

Although a perfect solution is not yet available, recognizing the problem and taking targeted measures is an important step towards building the next generation of reliable Agent systems. When users say "AI always forgets what I said halfway through," your system might become the exception that breaks this stereotype.


<End of Article, Author: Xiu Mao>


Main Tag: LLM Reliability

Sub Tags: Multi-turn Dialogue, Large Language Models, Microsoft Research, AI Agents

