Xinzhiyuan Report
Editor: LRST
【Xinzhiyuan Guide】Large models are in trouble: their memory is too good; they cannot forget old information and cannot tell it apart from the new. Cognitive tests based on working memory reveal a limitation in LLMs' in-context retrieval: on a simple retrieval task where humans consistently maintain high accuracy, models almost inevitably confuse outdated information with the correct answer.
It is increasingly recognized that "reading out" information from within Large Language Models (LLMs) is not as simple as flipping through a dictionary; retrieval is closely tied to how the information was "written in."
It is generally believed that feeding a model a longer context allows it to retrieve information more accurately. In reality, however, pieces of the context interfere with one another, a phenomenon that has rarely been studied.
To investigate this, researchers from the University of Virginia and New York University Neuroscience Center borrowed the concept of "proactive interference" from psychology: information presented earlier hinders our recall of later updated content.
In humans, stronger interference often indicates a smaller working memory capacity.
Thus, the research team designed a new test, PI-LLM, based on a classic cognitive-science paradigm. Like the episodes of a TV series, they sequentially fed the model a set of semantically related "key-value" pairs (e.g., "key apple, value red") and continuously updated the values. At the end, they asked the model only one question: "What is the latest value for a given key?"
Although the latest value was placed immediately before the question, as the number of preceding distractors increased, the model's accuracy plummeted logarithmically to near zero. The main source of errors was the model mistaking old values for new answers.
Researchers attempted to use prompt engineering, such as explicitly telling the model, "Please ignore all previous old information," but with limited effect.
This indicates that when LLMs face interference, it's not just a matter of "reading" or "not reading" the information. Instead, like humans, they exhibit a "working memory bottleneck": even when context is readily available, they struggle to flexibly suppress irrelevant information.
Next, new methods may be needed to teach models to actively "forget" content they shouldn't use during retrieval.
Paper link: https://arxiv.org/abs/2506.08184
Repository link: https://github.com/zhuangziGiantfish/Unable-to-Forget
Interactive demo: https://zhuangzigiantfish.github.io/Unable-to-Forget/
This paper identifies an information-retrieval problem affecting all Large Language Models (LLMs).
The task poses no difficulty for humans, yet all LLMs make significant errors on it, substantially undermining tasks that depend on global memory and long-chain reasoning.
The paper has been accepted by ICML 2025 Workshop on Long Context Foundation Models.
This research was jointly led by Wang Chupei (B.S. in Physics, University of Virginia; an interdisciplinary researcher with a background in philosophy) and Sun Jiaqiu (Ph.D. candidate at the NYU Neuroscience Center, advised by Tian Xing, Assistant Professor of Neuroscience and Cognitive Science at NYU Shanghai and Global Distinguished Associate Professor at NYU). They are co-first authors and co-corresponding authors. With backgrounds spanning physics, architecture, and philosophy, the two authors are dedicated to exploring the essence of intelligence through the breaking points of cognitive systems.
Zheng Zheyang (Visiting Researcher at Flatiron Institute CCN, Ph.D. candidate at NYU) and Kuang Yilun (Ph.D. candidate at NYU CILVR Lab, advisor: Yann LeCun) provided crucial consultation and advice during the project's initiation and progression.
Core Experimental Setup
Task Data Input
Suppose the model is given a commonly encountered, dynamically updated stream of key-value pairs, such as:
"Blood Pressure=120, BP=135, BP=119"
LLM Task Query
What is the last value of Blood Pressure (BP)?
Results
Currently, no mainstream LLM (from the latest GPT-4.1, Llama-4, and DeepSeek-V3 to Llama-3, Qwen-2.5, etc., with parameter scales ranging from 0.6B to 600B+) can consistently extract the last value, and the error pattern follows a clear mathematical regularity: accuracy declines log-linearly.
Discussion
For humans, this task is extremely simple; the answer is obviously the last value, 119. This is because the task involves no search difficulty.
This task pattern is extremely common in fields requiring dynamic data tracking, such as finance (account balance changes) and healthcare (physiological indicator tracking).
Experimental Results
Core Finding: Universal Decay Curve
As the number of updates increases, the accuracy of all models shows a consistent log-linear decline.
As interference increases, accuracy eventually drops steadily to 0%. At that point, all models fail completely, producing hallucinations and no correct answers at all.
This consistent decay pattern spans differences in model architecture, scale, and training resources, strongly suggesting that the root cause of the problem may lie at fundamental levels, such as the Transformer architecture or its underlying attention mechanisms.
When language models need to retrieve specific target information after a large number of semantically similar distractors, their retrieval accuracy significantly and continuously decreases. This log-linear decline trend has been observed across all mainstream models.
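To make the claimed trend concrete, here is a hedged sketch (not from the paper; the accuracy numbers are invented placeholders) of what fitting a log-linear decay to measured accuracies could look like:

```python
import numpy as np

# Hypothetical accuracies at increasing numbers of updates per key.
# These numbers are illustrative placeholders, not measurements from the paper.
n_updates = np.array([1, 2, 4, 8, 16, 32, 64, 128])
accuracy = np.array([0.99, 0.97, 0.92, 0.83, 0.71, 0.55, 0.38, 0.20])

# Log-linear model: accuracy ≈ a + b * ln(n_updates), with b < 0.
b, a = np.polyfit(np.log(n_updates), accuracy, 1)
print(f"fitted: accuracy ≈ {a:.2f} + ({b:.2f}) * ln(n_updates)")
```

A straight line in this semi-log view is what the paper describes as the universal decay curve; the fitted slope quantifies how fast accuracy erodes as interference accumulates.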
Basic input example for the LLM-PI test: The model needs to process a continuously updating key-value information stream (e.g., "visual art" corresponding to multiple values) and accurately retrieve the final value corresponding to each key after the updates cease (shown in bold in the figure).
Experimental Setup
The test requires the model to process 1 to 46 different Keys, with each Key being updated between 1 and 400 times.
These updates are randomly mixed and shuffled, and then the model's accuracy in correctly extracting the last value for each key is measured.
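As a rough reconstruction (not the authors' released code; the key names, value pool, and scoring convention are my own placeholders), the stream generation and scoring could be sketched like this:

```python
import random

def build_stream(keys, n_updates, value_pool):
    """Build a shuffled stream of key-value updates and record each key's final value."""
    events = [(k, random.choice(value_pool)) for k in keys for _ in range(n_updates)]
    random.shuffle(events)
    # After shuffling, the ground truth is the last value assigned to each key.
    truth = {}
    for k, v in events:
        truth[k] = v
    prompt = "\n".join(f"{k} = {v}" for k, v in events)
    return prompt, truth

def score(answers, truth):
    """Fraction of keys for which the model returned the final (most recent) value."""
    correct = sum(answers.get(k) == v for k, v in truth.items())
    return correct / len(truth)

# Example: 5 keys, 40 updates each (the paper varies 1-46 keys and 1-400 updates).
keys = [f"key_{i}" for i in range(5)]
prompt, truth = build_stream(keys, 40, value_pool=[str(n) for n in range(1000)])
# `answers` would come from querying an LLM: "What is the last value of each key?"
```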
Comparison with Humans
The design of this task is inherently very simple:
(1) It does not involve complex searching.
(2) There is no logical difficulty.
Humans can easily adjust their attention to focus only on the latest information, with limited interference from previous content.
Analysis of the incorrect answers shows that models frequently return outdated values from earlier updates as the final answer. This indicates that current LLMs struggle to effectively ignore or filter out non-target (old) information when processing such information streams.
Further analysis of error distribution reveals that LLMs exhibit behavioral patterns similar to limited working memory capacity: they seem to record key-value pairs within a finite representational space, and once the number of updates exceeds this capacity, retrieval performance completely fails.
Researchers also found that there are multiple ways to trigger search failure, all exhibiting the same logarithmic decay curve: 1) increasing the number of Keys tracked simultaneously, or 2) increasing the token length of paired Values.
These phenomena significantly impact LLM retrieval task accuracy. While similar phenomena are observed in human experiments, human working memory does not completely fail in these types of tasks.
Phenomenon Interpretation: "Unable to Forget"
Large models cannot ignore or forget irrelevant information, leading to complete search failure:
Crucially and counter-intuitively, even the most straightforward natural-language interventions, such as explicitly marking the answer's location in the input or directly telling the model to "focus on the latest update" or "forget previous information," do not significantly improve model performance.
This indicates that the interference effect is powerful enough to override explicit natural language instructions, forcing the model to focus on old information.
From this, it can be concluded that countering interference likely requires fundamental adjustments to the model architecture itself or its training paradigm, rather than solely relying on prompt engineering.
Why is it difficult for LLMs to consistently extract the latest information?
Analysis of errors indicates that LLM failures are not random mistakes but are systematically affected by repeated updates.
As the amount of interference increases, errors show a clear phased evolution:
Initial Stage: Proximal interference dominates; retrieval errors primarily originate from values immediately preceding the target.
Middle Stage: Interference spreads, and error sources significantly expand to values from any region of the full text.
Late Stage: Complete breakdown; model output is highly scattered, and many values that never appeared in the input are returned.
The model's responses to a given key are statistically analyzed by their position in the update stream (divided into 11 bins, Bin 1 earliest - Bin 11 latest).
Results show: As the number of updates increases (left → right panels), the proportion of correct hits to the final value (yellow-orange) sharply declines. More notably, incorrect responses shift from primarily clustering near the final update (e.g., Bin 10-11, possibly confusion with adjacent updates) to spreading into earlier bins (Bins 1-9).
Furthermore, errors returning non-existent values ("hallucinations", light gray) and no values ("failure", dark gray) also sharply increase, together depicting a breakdown of the model's memory retrieval system under information overload.
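As an illustration of the binning described above, a hedged sketch (the bin-assignment rule and variable names are my own assumptions) might look like this:

```python
import math

def bin_response(response, update_history, n_bins=11):
    """Map a model's answer for one key to a positional bin in that key's update stream.

    update_history: the values assigned to the key, in order (the last one is correct).
    Returns a bin index 1..n_bins, or "hallucination" if the value never appeared.
    """
    if response not in update_history:
        return "hallucination"
    # If a value recurs, take its latest occurrence (the most charitable reading).
    pos = len(update_history) - 1 - update_history[::-1].index(response)
    return math.ceil((pos + 1) * n_bins / len(update_history))

# Toy example with 100 updates: the final value maps to Bin 11, early values to Bin 1.
history = [str(v) for v in range(100)]
print(bin_response("99", history))   # -> 11 (correct final value)
print(bin_response("0", history))    # -> 1  (earliest distractor)
print(bin_response("abc", history))  # -> hallucination
```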
Complete Failure of Top-Down Control
Unlike humans, LLMs' performance in such extraction tasks is almost unaffected by "Top-Down" prompt cues. This also explains why Chain of Thought (CoT) models show no performance improvement on this issue.
Natural Language Prompt Ineffectiveness: This paper tested various prompt variants that explicitly guided the model to focus on the latest information or ignore historical interference (e.g., explicitly marking the answer area, "focus on the following text," or instructing "forget previous content"). Result: All natural language interventions failed to significantly improve the model's extraction accuracy and did not change the log-linear accuracy decay pattern. As interference accumulated, the model stubbornly slid towards complete error (0% accuracy).
CoT models show no improvement: even when allowed to produce unrestricted, lengthy reasoning (CoT), their extraction error curves almost completely overlap with those of the corresponding baseline models without CoT. This indicates that reasoning does not effectively improve the model's resistance to contextual interference.
This suggests that interfering information's impact on model behavior extends beyond the scope that natural language instructions can guide or suppress. The model "understands" the instruction (e.g., claims to focus on the latest value) but cannot effectively execute it in practice, still being strongly drawn by historical information.
Problem touches architecture or training fundamentals: The ineffectiveness of prompts and CoT models suggests that prompt engineering alone cannot solve this problem. It is highly probable that innovative adjustments are needed at the level of model architecture design (e.g., attention mechanisms, memory modules) or training objectives/methods (e.g., introducing explicit training signals for interference resistance). This points to a key direction for future research.
Chain of Thought (CoT) models are almost ineffective in improving information retrieval's resistance to interference. The performance curve of CoT-enabled versions (dashed lines) largely overlaps with or is worse than their baseline models (solid lines). This confirms that retrieval failure due to interference is an underlying mechanistic problem that cannot be overcome by additional "thinking" processes.
The figure above shows five different natural language intervention strategies (e.g., instructing the model to "forget" specific key history, prompting attention to subsequent information, self-assessing relevance, soft conversational reset, and technical Mock QA reset). These were designed to be inserted later in the information flow to combat interference.
However, experiments show that all these prompt engineering strategies failed to effectively mitigate retrieval performance collapse due to information overload; the logarithmic decay pattern persisted, highlighting the limitations of existing natural language interventions.
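For concreteness, the natural-language interventions could look roughly like the illustrative strings below (paraphrases of the strategy types, not the paper's exact wording); splicing one into the stream late in the input is then a one-line operation:

```python
# Illustrative paraphrases of natural-language interventions inserted late in the stream.
# The paper's exact wording differs; these are placeholders for the strategy types.
NL_INTERVENTIONS = [
    "Forget all earlier values of every key; only the updates below count.",
    "Focus only on the text that follows this line.",
    "Before answering, decide for yourself which updates are still relevant.",
    "Let's reset the conversation and start fresh from here.",
]

def insert_intervention(update_lines, text, offset_from_end=20):
    """Splice an intervention string into the stream, a fixed number of lines before the end."""
    lines = list(update_lines)
    lines.insert(len(lines) - offset_from_end, text)
    return "\n".join(lines)
```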
Unable to Forget
Furthermore, inspired by LLM prompt hijacking (Prompt Hacking), researchers designed a non-natural language adversarial prompting strategy. This involves constructing deceptive inputs that mimic the model's own reply format and logic:
A fake human-computer dialogue is constructed in the input, implying that all previous updates belong to another old question that has already been answered.
This "deceptive contextual isolation" strategy partially improved accuracy, but the improved accuracy still followed the log-linear decay pattern.
This indicates that LLMs cannot truly "forget" or ignore information that causes interference; they can only "mask" it to a certain extent through specific input forms.
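A hedged sketch of the idea behind this "deceptive contextual isolation" (the framing below is illustrative, not the paper's verbatim template): earlier updates are wrapped so they read as part of an already-answered exchange, and only the tail appears to be the live question.

```python
def mock_qa_reset(old_updates, recent_updates, question):
    """Wrap earlier updates as a finished Q&A turn so only the recent tail looks 'live'.

    The wrapper mimics the assistant's own reply format to imply the old context
    has already been resolved. The exact framing here is an assumption.
    """
    return "\n".join([
        "User: Here is a data stream, please track it.",
        "\n".join(old_updates),
        "Assistant: Understood. I have recorded the stream above and answered that question.",
        "User: Thanks. Now a new, separate stream begins:",
        "\n".join(recent_updates),
        f"User: {question}",
    ])
```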
The figure above reveals key results: natural language prompting strategies aimed at mitigating interference (solid lines) generally had a weak effect, showing almost no difference from the baseline (black line) performance curve at high update volumes, and some strategies were even detrimental. The only exception was the structured hack-Mock QA reset (orange dashed line), which, as a manually designed "hack method," brought substantial improvement but still could not prevent the overall decline in accuracy with increasing information volume.
"Interference" as an Independent Variable
Unlike the industry's common assumption that attention dilution is caused by input text length, this paper's controlled variable experiments prove otherwise.
The decrease in model performance is primarily driven by interference intensity, not merely by text length.
Specifically, even with the input text length held fixed, varying only the interference intensity still causes LLM error rates to increase logarithmically.
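One hedged way to operationalize such a control (my reconstruction, not necessarily the paper's exact protocol) is to hold the total number of stream lines fixed while trading interfering updates of the target key against irrelevant filler lines:

```python
import random

def build_fixed_length_stream(key, n_interfering, total_lines, value_pool, filler_pool):
    """Hold total length fixed while varying how many lines actually update the target key.

    n_interfering updates of `key` are mixed with filler lines about unrelated content,
    so interference varies while the overall context length stays (roughly) constant.
    """
    assert n_interfering <= total_lines
    updates = [f"{key} = {random.choice(value_pool)}" for _ in range(n_interfering)]
    fillers = [random.choice(filler_pool) for _ in range(total_lines - n_interfering)]
    lines = updates + fillers
    random.shuffle(lines)
    final_value = None  # recover ground truth: the last update of `key` in shuffled order
    for line in lines:
        if line.startswith(f"{key} = "):
            final_value = line.split(" = ", 1)[1]
    return "\n".join(lines), final_value
```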
This experiment provides an explanatory perspective for LLM's poor performance in MRCR tests.
DeepMind's MRCR and OpenAI's Open MRCR simulate inserting numerous similar items into long texts, revealing LLMs' weakness in distinguishing similar information.
This work offers a complementary and more fundamental perspective, demonstrating that retrieval errors can be triggered without massive amounts of information: what the MRCR test refers to as coreference corresponds to the phenomenon of interference in human cognitive science.
Researchers quantitatively isolated "Interference" as a core independent variable, directly proving its causal negative impact on performance.
The results reveal that one of the core driving factors behind the failure of such tasks is the model's insufficient Anti-Interference Capacity, and it provides a precise quantitative analysis framework (log-linear decay).
OpenAI noted in its GPT-4.1 documentation that customers (especially in legal and financial sectors) are highly concerned with tasks involving frequent updates and information extraction. (Link: Introducing GPT-4.1 in the API).
The researchers directly point out that one of the underlying challenges of MRCR is not merely caused by searching through massive information, but by LLM's retrieval failure in the face of interference information.
The experiment also provides a comparison from a cognitive science perspective:
Bridge to Cognitive Science: This test (proactive interference test) is widely used in cognitive science to measure human Working Memory capacity and anti-interference ability.
The experiment adopted a paradigm strictly corresponding to cognitive science.
Therefore, the results can be interpreted as: LLMs exhibit some form of limited capacity mechanism similar to working memory, and their "Anti-Interference Capacity" is a key indicator for measuring the strength of this mechanism.
The general failure of LLMs strongly suggests that they currently lack the human-like ability to effectively exert Top-Down control to optimize the use of contextual information.
The task requirements are extremely clear, and the search difficulty is very low (theoretically the most favorable setting for LLMs). Improving this capability is crucial for enhancing the fundamental reliability of LLMs in tasks that rely on dynamic data tracking, such as financial and medical monitoring, and for providing reliable support for long-chain reasoning.
Core Conclusion
LLMs currently lack human-level Top-Down control over information attention and processing, especially in scenarios requiring resistance to semantically similar contextual information interference and precise data extraction, where they cannot operate stably.
ICML reviewers also acknowledged that this research reveals a previously undiscovered LLM retrieval failure phenomenon, using a cognitive science-inspired test design method, possessing significant novelty.