Source | Synced (Machine Heart)
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in high-level visual understanding and reasoning tasks. However, a closer look reveals that they frequently "fail" at seemingly simple, intuitive tasks that human toddlers accomplish with ease.
For example, "Is a toy still there after being hidden?", "Has the volume of liquid changed after being poured into a different shaped container?", "Will two objects collide when they approach each other?"
Does this imply that MLLMs' innate cognitive structures lack the fundamental knowledge mechanisms that support early human learning? In other words, do they lack "core knowledge"?
A highly rated ICML 2025 paper (initial review scores of 4/4/4/3) reveals the "core cognitive blind spots" of MLLMs.
New research from UC San Diego, titled "Core Knowledge Deficits in Multi-Modal Language Models" (published at ICML 2025), provides a systematic and in-depth analysis of this issue.
Paper Title: Core Knowledge Deficits in Multi-Modal Language Models
Paper Link: https://arxiv.org/pdf/2410.10855
The study found that current mainstream MLLMs broadly lack core cognitive abilities, and that these abilities cannot be acquired naturally by scaling up models.
To address this, the authors constructed an innovative multimodal evaluation suite, CoreCognition, and proposed a distinctive intervention test, Concept Hacking, designed to reveal whether models truly "understand" the core knowledge behind a task or merely "guess" the correct answer.
Building CoreCognition: A Cross-Modal Cognitive Evaluation Benchmark
The concept of "core knowledge" originates from developmental psychology, especially Piaget's classic theories on human cognitive development. Research indicates that humans, even in infancy, possess some of the most basic and universal cognitive abilities about the world, forming the foundation for our understanding of objects, space, causality, intention, and more. Inspired by Piaget's theory of cognitive development, the research team proposed CoreCognition: a large-scale multimodal evaluation system focused on "core knowledge." Its highlights include:
Comprehensive Coverage: 12 core cognitive concepts span three developmental stages: the sensorimotor stage (boundary perception, continuity, object permanence, spatial perception, perceptual constancy, intuitive physics), the concrete operational stage (perspective taking, hierarchical relationships, conservation), and the formal operational stage (intention recognition, mechanical reasoning, tool use). This layered design helps pinpoint how model performance differs across cognitive levels.
Rich Data, Extensive Testing: The dataset contains 1,503 image-question pairs in total. Testing 230 mainstream multimodal models under 11 prompt designs yields 2,530 evaluation data points (one per model-prompt combination), covering a wide range of model scales and instruction-comprehension abilities.
Rigorous Design:
1. High Discriminativeness: Each question is carefully designed so that a model lacking the target core knowledge is systematically drawn toward the incorrect answer, making the questions effective at differentiating model capabilities.
2. Minimal Confounding: Questions are designed to minimize reliance on abilities other than the target concept, reducing overlap with other core knowledge concepts.
3. Minimal Text Shortcut: All questions are designed to require multimodal reasoning combining image and linguistic information, preventing models from guessing correct answers solely through language pattern recognition.
Strict Quality Control: All data was collaboratively annotated and reviewed by 12 senior undergraduate or graduate students with backgrounds in cognitive science, computer science, or statistics, ensuring consistency and academic rigor in the annotations.
The dataset design, drawing from developmental psychology and cognitive science while aligning with AI experimental paradigms, balances theoretical reliability with engineering feasibility. It marks the first formal introduction of "core knowledge" into a large model testing framework.
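To make the evaluation pipeline concrete, here is a minimal sketch of how a benchmark of this kind can be scored per concept. It assumes a generic item format and a hypothetical `model.query` interface; the field names, prompt template, and `evaluate` function are illustrative and not taken from the paper's released code.

```python
# Minimal sketch of scoring a CoreCognition-style benchmark per concept.
# The item fields, prompt template, and `model.query` interface are
# illustrative assumptions, not the paper's released code.
from collections import defaultdict

def evaluate(model, items, prompt_template):
    """Return per-concept accuracy for one model under one prompt design."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Each item pairs one image with a multiple-choice question that
        # targets a single core concept (e.g., "object_permanence").
        prompt = prompt_template.format(
            question=item["question"],
            choices=" / ".join(item["choices"]),
        )
        answer = model.query(image=item["image_path"], text=prompt)  # hypothetical API
        total[item["concept"]] += 1
        if answer.strip().lower() == item["answer"].strip().lower():
            correct[item["concept"]] += 1
    return {concept: correct[concept] / total[concept] for concept in total}

# Repeating this loop for 230 models under 11 prompt designs is what yields
# the 2,530 (model, prompt) evaluation data points mentioned above.
```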
Four Key Findings
1. Models show significant deficits on basic cognitive tasks: Large models lack core cognition, and the simplest abilities are the weakest. On simple, intuitive tasks such as boundary perception, continuity, and spatial reasoning, performance falls far below what the same models achieve on more complex tasks (e.g., hierarchical reasoning, intention understanding). These abilities amount to "common sense," yet the models fail to grasp them, indicating a lack of understanding of the world's basic structure.
2. Models cannot effectively leverage basic cognition to support advanced capabilities: Performance on high-level cognition is not clearly tied to proficiency in low-level cognition. This indicates that models have not formed a coherent cognitive system: their higher-level reasoning and perception are not built on fundamental cognitive abilities. It also explains why models lack robustness (i.e., cannot consistently answer questions correctly).
3. Increasing model scale does not significantly improve basic cognitive abilities: The study shows that basic cognition cannot be meaningfully improved simply by scaling up the model. Although more parameters do improve higher-level reasoning, they provide little help for low-level cognition. For some basic abilities there is even a counterintuitive pattern in which larger models perform worse.
4. Reasoning models show no significant advantage: System-2 reasoning also does not help models learn or infer basic cognitive abilities, suggesting that models may already lack these abilities at the pretraining stage.
Concept Hacking: Intervention Testing Reveals "Pseudo-Understanding" Traps
To further verify whether models truly grasp core concepts, the authors proposed Concept Hacking, an intervention test: a "control" group and a "manipulated" group are constructed so that the key features in the test image and text are deliberately reversed while all other conditions are held constant. Comparing the two separates "true understanding" from "opportunistic guessing":
If performance is good in both normal and reversed tasks, it indicates that the model possesses genuine cognitive ability.
If performance is good only in normal tasks but fails in reversed tasks, it indicates that the model relies on spurious cognitive shortcuts.
If performance is poor in normal tasks, it suggests that the model has neither mastered core knowledge nor established cognitive shortcuts.
Experiments show that many models perform well on the normal image-text tasks, but their predictions collapse once the key features are subtly flipped. This indicates that they do not genuinely understand the core concepts and instead rely on readily available shortcuts.
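The classification logic described above can be written down compactly. The sketch below is an illustrative rendering of that decision rule, assuming a simple accuracy threshold; the threshold value and the function name are assumptions, not details from the paper.

```python
# Illustrative rendering of the Concept Hacking decision rule. The 0.5
# accuracy threshold and the function name are assumptions, not values
# taken from the paper.

def classify_concept_hacking(control_acc: float,
                             manipulated_acc: float,
                             threshold: float = 0.5) -> str:
    """Label behavior on one concept from accuracy on the control (normal)
    items and the manipulated (feature-flipped) items."""
    passes_control = control_acc > threshold
    passes_manipulated = manipulated_acc > threshold
    if passes_control and passes_manipulated:
        return "genuine core knowledge"      # succeeds even without the shortcut
    if passes_control and not passes_manipulated:
        return "shortcut learning"           # collapses once key features are flipped
    return "neither knowledge nor shortcut"  # fails even the normal task

# A model that answers 90% of normal items correctly but only 20% of the
# manipulated ones is flagged as relying on a spurious shortcut.
print(classify_concept_hacking(0.9, 0.2))  # -> shortcut learning
```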
Significance and Implications
The article reveals that MLLMs lack core knowledge and that this knowledge cannot be acquired simply by scaling. Larger models look increasingly capable on complex tasks, yet they come no closer to genuine understanding in basic cognition. This confirms the classic Moravec's paradox: the cognitive tasks that are simplest for humans are the hardest for AI. It poses a fundamental challenge to the current scale-driven development path, suggesting that scaling alone is unlikely to lead to human-like general intelligence.
Cognitive Science Implications: Humans build higher-level cognition on top of core cognition, whereas MLLMs lack this cognitive scaffolding.
Technical Development Challenges: Simply increasing parameter scale and training data does not automatically lead to core cognitive abilities.
Future Directions: It may be necessary to explicitly inject commonsense knowledge about physics and space during pretraining, actively "instilling" these core cognitive abilities; to explore cognition-guided training mechanisms that introduce explicit concept learning; and to develop more tightly controlled evaluations of cognitive abilities.
About the Authors:
Yijiang Li received his Master's degree in Computer Science from Johns Hopkins University and is currently a first-year Ph.D. student at UC San Diego. His research focuses on efficient and robust learning in multimodal, interactive, and embodied 3D environments.
Qingying Gao received her Master's degree from Johns Hopkins University and is currently pursuing a Ph.D. in Computer Science at the same university. She is affiliated with the Wilmer Eye Institute, the Lions Vision Research and Rehabilitation Center, and the Engineering and Medical Artificial Intelligence Laboratory, all under Johns Hopkins Medicine. Her research interests include the interpretability of vision-language models and autonomous navigation technologies for people with low vision.
Tianwei Zhao is a Master's student in Computer Science at Johns Hopkins University. His research interests include evaluating, understanding, and enhancing multimodal models (especially their reasoning capabilities) from a cognitive science perspective, as well as optimizing planning and collaboration mechanisms in multi-agent systems.
Bingyang Wang received her Master of Science, Bachelor of Science, and Bachelor of Business Administration degrees from Emory University. Her research interests include multimodal fusion and efficient signal extraction from mixed modalities.
Haoran Sun received his Master's degree in Applied Mathematics from Johns Hopkins University in 2024. His primary research directions include medical data science and applications of machine learning in cardiology.
Dezhi Luo is a senior at the Weinberg Institute for Cognitive Science at the University of Michigan. He was a visiting scholar at University College London's Department of Psychological and Brain Sciences and previously served as an AI Scholar at the University of London's Institute of Philosophy. His research interests include the theoretical foundations of cognitive science and artificial intelligence, with a particular focus on consciousness, self-processing, and core cognition.
Hokin Deng is a visiting research scientist at Carnegie Mellon University. He previously worked as a computer vision engineer at Harvard University, where he designed the first experimental infrastructure for single-cell cognitive experiments. Before that, he was a neuroengineer at Johns Hopkins Hospital and an affiliated research scientist at Meta Reality Labs. He co-led the open-source project "Grow AI" (growing AI like a child) and co-organized multiple workshops at the intersection of computer science, neuroscience, and philosophy. Earlier, he studied neuroscience and philosophy at Johns Hopkins University.