XinZhiYuan Report
Editor: Ding Hui
【XinZhiYuan Guide】When AI models have ultra-long memories of millions of tokens, how do we test their true strength? OpenAI offers a new answer: the MRCR benchmark. It is no longer a simple 'needle in a haystack': the model must distinguish and retrieve one specific 'needle' from among multiple near-identical 'needles' buried in a massive amount of text, making it an 'Olympics' for the AI world. MRCR not only helps reveal the current boundaries of AI capability but will also drive the birth of the next generation of more powerful and reliable models.
The sculpture is already complete within the marble block, even before I start my work.
It is already there, I just need to chisel away the superfluous material.
— Michelangelo
When asked how he created such beautiful sculptures, Michelangelo said, "The sculpture is already there, I just need to chisel away the superfluous material."
When an AI model of the 21st century tries to understand a very long context, it resonates with the sculptor of five centuries earlier.
An "ultra-long context" is like the marble in Michelangelo's hand; the AI must chisel away irrelevant information to reveal the essence within.
On April 15, when OpenAI released GPT-4.1, most of the attention went to the model's capabilities and the "strange" naming conventions across its various series.
Add in OpenAI's recently released o3 and o4-mini, and operating an AI chat interface might soon be no less complex than flying a spaceship. 
In addition to the new models, OpenAI also announced a benchmark dataset called MRCR. If the previous test of a model's context ability was the "Needle in a Haystack",
then the new MRCR standard is an "Olympics"-level evaluation of AI models' context capability.
Finding a "Needle in a Haystack" in the Ocean of Information
"Needle in a Haystack" (The Needle In A Haystack) dates back to the "era" of GPT-4 (it's remarkable how fast AI is developing; we already talk about milestones from 2023 as an "era").
It was first proposed by Greg Kamradt to test GPT-4's context capability.
"The needle in a haystack" refers to embedding specific, desired information (the needle) within a very long and complex text (the haystack).
Can AI chisel a beautiful sculpture from this block of marble (haystack)?
Greg Kamradt evaluated GPT-4's capabilities: when the input exceeded 100K tokens and the "needle" was placed between 10% and 50% of the way into the document, GPT-4's retrieval accuracy began to drop significantly.
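To make the setup concrete, here is a minimal sketch of how such a needle-in-a-haystack test can be constructed. The filler text, the needle sentence, and the call to the OpenAI chat completions API are illustrative assumptions, not Kamradt's original harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_haystack(needle: str, depth: float, n_filler: int = 5000) -> str:
    """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    filler = "The quick brown fox jumps over the lazy dog. "  # stand-in for long essays
    chunks = [filler] * n_filler
    chunks.insert(int(depth * n_filler), needle + " ")
    return "".join(chunks)

# Hypothetical needle and question, in the spirit of the original test.
needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
document = build_haystack(needle, depth=0.25)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": document + "\n\nWhat is the best thing to do in San Francisco? "
                              "Answer only from the document above.",
    }],
)
print(response.choices[0].message.content)  # should reproduce the buried needle
```

Scoring compares the answer with the buried sentence; sweeping the depth and the total context length produces heat-map charts like the one OpenAI published for GPT-4.1.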
In GPT-4.1, however, this ability has seen a "huge" improvement. How big?
The image above, released by OpenAI alongside GPT-4.1, shows GPT-4.1's ability to retrieve a small piece of hidden information (the "needle") at different positions within the context window.
The horizontal axis shows Input tokens from 10K up to 1M, and the vertical axis shows the position of the "needle".
All test results are blue, all successful!
GPT-4.1 can consistently and accurately retrieve the needle at all positions and all context lengths, up to 1 million tokens.
What does this mean? It means GPT-4.1 can effectively extract any detail relevant to the task at hand, regardless of where that detail is located in the input.
It seems large models now have no trouble with the "needle in a haystack" standard from two years ago.
Furthermore, GPT-4.1's context window has reached an "epic" 1M, a full one million tokens.
According to OpenAI, that is enough to fit more than 8 complete copies of the React codebase.
So, can the model truly handle such long contexts?
Is the "needle in a haystack" standard from two years ago still effective for testing today's large models?
The Ultimate "Hide-and-Seek" Game, OpenAI MRCR is Here!
While the standard "needle in a haystack" test is useful, it might be a bit too "gentle" for today's large models.
What if you're looking for more than one needle? What if these needles look exactly the same? What if you need to find not just a specific needle, but several needles in a specific order?
Welcome to the world of OpenAI MRCR – the ultimate "hide-and-seek" game designed for top AI large models!
OpenAI MRCR raises the task difficulty. MRCR (Multi-round Co-reference Resolution) is a dataset for evaluating a large language model's ability to distinguish between multiple "needles" hidden in a long context.
The MRCR dataset elevates the "needle in a haystack" challenge to a whole new level. Let's look at an example provided by OpenAI.
The model is given a long conversation between a user and an assistant in which the user, for example, first asks for a poem about "tapirs", then a poem about "rocks", then another poem about "tapirs", and so on, padding out the context.
The final instruction is: prepend "aYooSG8CQg" to the second poem about "tapirs" and reproduce it.
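A minimal sketch of how such a prompt could be assembled is shown below. The topics, placeholder poems, and message layout are illustrative assumptions; the published dataset ships fully formed prompts.

```python
def build_mrcr_style_prompt(topics, target_topic="tapirs", target_index=2,
                            prefix="aYooSG8CQg"):
    """Assemble a multi-turn conversation containing several look-alike 'needles'."""
    ordinal = {1: "first", 2: "second", 3: "third", 4: "fourth"}[target_index]
    messages = []
    for topic in topics:
        messages.append({"role": "user", "content": f"Write a poem about {topic}."})
        # Placeholder reply; in the real dataset the assistant turns are GPT-4o outputs.
        messages.append({"role": "assistant", "content": f"(a poem about {topic})"})
    # The final instruction singles out one specific occurrence of the repeated topic.
    messages.append({
        "role": "user",
        "content": (f"Prepend {prefix} to the {ordinal} poem about {target_topic} "
                    "and repeat that poem verbatim."),
    })
    return messages

# Two of the five requests are about tapirs; only the second one is the target.
conversation = build_mrcr_style_prompt(["tapirs", "rocks", "tapirs", "frogs", "rivers"])
```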
This test is very challenging because:
The target items (the "needles": the poems to be retrieved) and the interfering items (the "haystack": the rest of the long conversation) come from the same distribution.
All of the AI assistant's responses are generated by GPT-4o, so the target items are easily confused with the distractors.
The model must distinguish the order among the target items: for example, the model needs to know which poem about tapirs is the second one.
The more target items, the harder the task.
The longer the context, the greater the difficulty of the task.
This test is quite difficult not only for GPT-4.1 but also for other reasoning models.
MRCR is not just about testing whether a model can "find" information; it's about evaluating its ability to precisely, robustly, and distinctively locate target information under extreme interference.
This is like accurately hearing and repeating a specific sentence from a specific person in an extremely noisy environment.
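How is accuracy scored in such a test? Here is a plausible sketch, assuming each example exposes the expected answer and the required random prefix (the field names and the fuzzy-matching choice are assumptions, not the dataset's documented grading code): a response scores zero unless it starts with the prefix, and otherwise gets a string-similarity score against the reference.

```python
from difflib import SequenceMatcher

def grade(response: str, answer: str, random_prefix: str) -> float:
    """Score a model response against the reference answer, in [0, 1]."""
    # The response must begin with the required random string; otherwise it fails outright.
    if not response.startswith(random_prefix):
        return 0.0
    # Fuzzy similarity between the full response and the reference answer.
    return SequenceMatcher(None, response, answer).ratio()

# Hypothetical usage: average per-example scores at a fixed context length and
# needle count to obtain one point on an accuracy-vs-context-length curve.
reference = "aYooSG8CQg Gentle tapir in the reeds..."
print(grade("aYooSG8CQg Gentle tapir in the reeds...", reference, "aYooSG8CQg"))  # 1.0
```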
OpenAI also showed that at different difficulty levels (different numbers of needles), the model's accuracy decreases rapidly as the context size increases.
For example, with 2 needles, the accuracy of GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano all declined together as the context grew.
With 4 and 8 needles, once the context was long enough, GPT-4.1 mini's accuracy even slightly surpassed GPT-4.1's.
In this "rigorous" test, maybe bigger isn't always better for models.
AI's "Exams" Are Endless
From the simple Q&A of GPT-3.5 to the complex reasoning of DeepSeek-R1 and OpenAI o1, from basic language understanding to the extreme "needle in a haystack" and now the more stringent MRCR, benchmarking AI large models is like an endless "exam".
Innovative benchmarks like OpenAI-MRCR constantly set new, more difficult challenges for these intelligent AI models.
These benchmarks are not an end in themselves; their true value lies in:
Revealing capability boundaries: Helping us understand more clearly where the current limits of AI lie.
Driving technological progress: Incentivizing researchers to develop more powerful, reliable AI models that can handle real-world complexity.
Promoting prudent application: Understanding the strengths and weaknesses of models helps us use this powerful technology more responsibly and effectively.
GPT-4.1 can already find key information in a million-token context. What will the upper limit of AI large models' capabilities be in the future?
The future of AI is full of infinite possibilities, and these rigorous benchmarks are the "lighthouses" that illuminate the way forward, guiding AI models steadily ahead.
References:
https://huggingface.co/datasets/openai/mrcr
https://openai.com/index/gpt-4-1/