As large model capabilities continue to improve, judging whether a model is truly intelligent and trustworthy merely by looking at its scores on various Benchmarks is far from enough.
Did you know:
Evaluating a large model through a complete standard test (such as HELM) can take over 4000 GPU hours and cost tens of thousands of dollars;
Model evaluation in industry additionally requires extensive involvement of human experts for annotation and judgment;
The quality of many questions in Benchmarks may not be as reliable as we imagine;
Even if the model's accuracy reaches 99%, we still find it difficult to answer: Did it answer correctly based on its actual ability? Was the question too simple? Or had it seen the original question during training?
Traditional large-scale "test-taking" evaluation methods are no longer sufficient to meet the assessment needs of current general artificial intelligence, especially for cognitive ability evaluation.
Recently, at the ICML 2025 conference, a position paper jointly authored by the National Key Laboratory of Cognitive Intelligence at the University of Science and Technology of China, the University of California, Berkeley, and the Educational Testing Service (ETS) proposed a new approach to AI evaluation grounded in psychometric theory, which emerged in the last century: assess AI models' capabilities the same way we test humans.
Paper Title: Position: AI Evaluation Should Learn from How We Test Humans
Paper Link: https://arxiv.org/abs/2306.10512
Current Predicaments in AI Evaluation Methods
In pursuit of comprehensive evaluation, current AI models are facing increasingly large "test papers." Google BIG-bench includes over 200 tasks, and HuggingFace Open LLM Leaderboard comprises 29k questions across 6 scenarios.
The mainstream AI evaluation scheme is simple and direct: prepare a vast and comprehensive test set, and after the model answers, score it based on accuracy and various other metrics. However, this evaluation paradigm presents numerous problems in practical application:
Cost: Especially for large models, evaluation involves significant computational, manual, and time costs;
Reliability: Many questions are repetitive/redundant, and question quality varies widely;
Security: Many test questions have been "seen" or "memorized" by the model;
Interpretability: Only observing "how many questions were answered correctly" does not reveal "what specific capabilities are strong" or "how strong those capabilities are."
Psychometrics Inspiration: Precisely Measuring AI Capabilities with Adaptive Testing
Human examinations such as the GRE and TOEFL have long adopted adaptive testing based on psychometrics. These tests recognize that questions differ in importance and information value: statistical characteristics such as difficulty, discrimination, and guessing probability are estimated for each item, and the system dynamically selects questions according to the test-taker's performance so far, yielding a more precise ability estimate.
In other words, adaptive testing focuses not on how many questions the model answered correctly, but on its true capability boundaries. This position paper proposes that psychometrics, an assessment technology for humans originating in the 20th century, can help solve the current dilemmas in AI evaluation and reconstruct the capability assessment mechanism.
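For readers unfamiliar with the machinery behind such tests, the workhorse of psychometrics is Item Response Theory (IRT). Below is a minimal sketch of the three-parameter logistic (3PL) model, which formalizes difficulty, discrimination, and guessing; the item parameter values are purely illustrative and are not taken from the paper.

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL IRT model: probability that a test-taker with ability `theta`
    answers an item correctly, given the item's discrimination `a`,
    difficulty `b`, and guessing parameter `c`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative items: an easy, low-discrimination item vs. a hard,
# highly discriminative one with some chance of a lucky guess.
easy_item = dict(a=0.8, b=-1.5, c=0.0)
hard_item = dict(a=2.0, b=1.0, c=0.25)

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta,
          round(p_correct(theta, **easy_item), 3),
          round(p_correct(theta, **hard_item), 3))
```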
Reconstructing AI Assessment with Psychometrics
3.1 Capability-Oriented: Measuring AI's True "Ability Value"
Traditional evaluation paradigms are score-oriented, while adaptive testing is ability-oriented. Instead of counting how many questions were answered correctly, it constructs an AI ability distribution model, providing statistically meaningful ability estimates. Specific advantages include:
Efficiency: Precisely selecting high-information questions. Researchers found that less than 3% of the total question volume can reproduce the full Benchmark scores (as shown in the figure above).
Interpretability: Modeling the relationship between model ability and question characteristics. For example, with the same ability, the probability of answering correctly increases as difficulty decreases, explaining the reasons behind the scores; cognitive diagnostic models also support modeling AI's multi-dimensional capabilities.
Capturing Uncertainty: Model behavior may be affected by the sampling temperature or by subtle prompt changes (just as human test-takers are affected by the environment and mood fluctuations); psychometric models treat responses probabilistically, so this uncertainty is captured rather than ignored.
Comparability: Statistically comparing model abilities on a unified scale, even allowing for unified assessment across Benchmarks (e.g., GRE scores from different human test administrations are comparable).
Therefore, psychometrics can map AI model performance to "ability parameters," allowing for analysis of where the model is strong/weak, how stable it is, and its level of uncertainty.
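To make the "ability-oriented" idea concrete, here is a toy adaptive-testing loop built on the 3PL model sketched earlier: after each response the ability estimate is updated by a simple grid-search maximum likelihood, and the next item is the one carrying the most Fisher information at the current estimate. This is a didactic sketch under simplified assumptions (a pre-calibrated item bank and binary scoring), not the implementation used in the paper.

```python
import math
import random

def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability `theta`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c):
    """Fisher information contributed by a 3PL item at ability `theta`."""
    p = p_correct(theta, a, b, c)
    return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def estimate_ability(responses):
    """Grid-search maximum-likelihood ability estimate from a list of
    ((a, b, c), correct) pairs."""
    grid = [x / 10.0 for x in range(-40, 41)]  # theta in [-4, 4]
    def log_lik(theta):
        ll = 0.0
        for (a, b, c), correct in responses:
            p = min(max(p_correct(theta, a, b, c), 1e-6), 1.0 - 1e-6)
            ll += math.log(p) if correct else math.log(1.0 - p)
        return ll
    return max(grid, key=log_lik)

def adaptive_test(item_bank, answer_fn, n_items=10):
    """Administer up to `n_items` items, each time choosing the remaining
    item with the highest Fisher information at the current estimate."""
    theta, responses, remaining = 0.0, [], list(item_bank)
    for _ in range(min(n_items, len(remaining))):
        item = max(remaining, key=lambda it: fisher_information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        theta = estimate_ability(responses)
    return theta

# Example: a simulated test-taker with true ability 1.2 answering stochastically.
random.seed(0)
bank = [(random.uniform(0.5, 2.0), random.uniform(-2.0, 2.0), 0.2) for _ in range(50)]
answer = lambda item: random.random() < p_correct(1.2, *item)
print(adaptive_test(bank, answer, n_items=15))
```

Operational adaptive tests add exposure control and stopping rules based on the standard error of the ability estimate; the skeleton above only illustrates the core select-respond-update cycle.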
3.2 Not All Questions Are Equally Important
Many people assume that the questions in a Benchmark are "accurate, reliable, and valuable," but this is often not the case; not every question deserves a place in the test set. Psychometrics can estimate the characteristics of each question, such as difficulty, discrimination, and the guessing coefficient (a toy example of such item screening follows the examples below).
Questions within a Benchmark are not all equally valuable or important. Figure (a) above shows the estimated difficulty difference between two questions in the SSTB sentiment classification dataset; the simpler question contains obviously sentiment-laden vocabulary.
Low-quality or even incorrectly annotated questions may appear in Benchmarks. As shown in Figure (b) above, in the SQuAD reading comprehension dataset, some questions have extremely low discrimination, and analysis revealed that their reference answers even contain errors.
Some questions are easily "guessed correctly," failing to truly assess ability. As shown in Figure (c) above, for a question in the MedQA medical Q&A dataset, even if the model lacks medical knowledge, it might guess correctly based on common sense. The high guessing coefficients of these questions diminish their assessment value.
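As a rough illustration of how such item screening can be automated, the sketch below computes two classical item statistics from a binary response matrix (rows are models or test-takers, columns are items): the proportion correct, an inverse proxy for difficulty, and the point-biserial correlation with the total score as a crude discrimination index. Items with near-zero or negative discrimination are flagged for manual review. This uses classical test theory rather than the full IRT calibration discussed above, purely to keep the example short.

```python
from statistics import mean, pstdev

def item_statistics(responses):
    """`responses[i][j]` is 1 if test-taker i answered item j correctly, else 0.
    Returns, per item, its proportion correct (an inverse proxy for difficulty)
    and its point-biserial correlation with the total score (discrimination)."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = mean(col)
        sd_total = pstdev(totals)
        if sd_total == 0 or p in (0.0, 1.0):
            r_pb = 0.0  # degenerate item or no score variance: not informative
        else:
            mean_correct = mean(t for t, x in zip(totals, col) if x == 1)
            r_pb = (mean_correct - mean(totals)) / sd_total * (p / (1 - p)) ** 0.5
        stats.append({"item": j, "p_correct": p, "discrimination": r_pb})
    return stats

def flag_items(stats, min_discrimination=0.1):
    """Flag items whose discrimination is suspiciously low or negative."""
    return [s for s in stats if s["discrimination"] < min_discrimination]
```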
3.3 Did Large Models "Peek" at the Questions? Data Contamination Identification
Today's large language models are trained on data that often spans the entire internet, drawn from complex sources. This leads to a serious problem: test data may have been "seen" by the model during training. This is Data Contamination: when the model takes a "test," it happens to encounter original questions it "memorized" during training. The consequences: the model performs exceptionally well, but out of memory rather than understanding; test scores are significantly inflated, leading to a misjudgment of the model's true capability; and Benchmark credibility declines, since the scores no longer reflect the model's generalization ability.
This is like a test-taker who obtained the original exam questions in advance: their score naturally cannot be used to judge their true level. As in human education systems, psychometrics has developed a series of statistical methods for detecting cheating and question leakage, and these have proven effective at uncovering abnormal response patterns. Many current contamination-detection methods for LLMs are based on the same ideas (as shown in the figure above). For example:
Answering difficult questions correctly while missing simple ones is a typical abnormal pattern (a minimal check along these lines is sketched after this list);
If a model frequently gets right questions that, given its estimated ability, it should almost never answer correctly, it has likely "seen" them;
An abnormally high guessing coefficient in IRT indicates that the model can answer without understanding, which may also suggest question leakage.
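A minimal sketch of the first idea above, assuming item difficulties have already been estimated: count "Guttman errors," i.e., item pairs where a harder item is answered correctly while an easier one is missed. A count far above that of comparable models is a red flag worth investigating, not proof of contamination.

```python
def guttman_errors(difficulties, correct):
    """Count pairs (easy, hard) where the easier item was missed while the
    harder item was answered correctly. `difficulties[k]` is item k's
    estimated difficulty; `correct[k]` is 0/1."""
    order = sorted(range(len(difficulties)), key=lambda k: difficulties[k])
    errors = 0
    for pos_easy in range(len(order)):
        for pos_hard in range(pos_easy + 1, len(order)):
            i, j = order[pos_easy], order[pos_hard]
            if correct[i] == 0 and correct[j] == 1:
                errors += 1
    return errors

# A response pattern consistent with the difficulty ordering vs. one that
# gets the hardest items right while missing the easiest ones.
diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(guttman_errors(diffs, [1, 1, 1, 0, 0]))  # 0 -> consistent
print(guttman_errors(diffs, [0, 0, 1, 1, 1]))  # 6 -> suspicious
```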
Furthermore, adaptive testing has an inherent advantage: each model takes different questions, and the complete test set is not fully exposed, further reducing the risk of data contamination. This is one of the important reasons why human exams like GRE adopt an adaptive testing mechanism.
Application Prospects: Establishing a "Psychometric Framework" for the AI Era
This work bridges artificial intelligence, cognitive science, and standardized assessment, aiming to bring structural improvements to AI evaluation. From ability assessment to preference tendencies, decision logic, stability, and fairness: rather than pursuing ever "larger and more comprehensive test sets," can we carefully model the differences in item characteristics and thereby gain insight into a model's performance and internal structure? Such a framework applies not only to Benchmark construction and maintenance, but may also support risk assessment, service adaptation, and safety verification before future AI deployments.
This convergence of "how AI is tested and how humans are tested" inspires a possibility: can a new academic field be established—Machine Psychometrics?
In summary, as AI models become smarter, assessment methods must become smarter too. By assessing AI the way we assess humans, and rebuilding evaluation systems on proven scientific theories, we can establish a precise and fair capability measurement paradigm for the era of general artificial intelligence.
About the Author
Zhuang Yan is a third-year Ph.D. student at the National Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, supervised by Professor Liu Qi. His main research interests include adaptive testing, cognitive diagnosis theory, and trustworthy AI evaluation.
Contact: zykb@mail.ustc.edu.cn