1. A Thought-Provoking Cognitive Puzzle
Imagine a scenario: you and ChatGPT are both asked to sort a collection of animals into categories. You might say, "Among birds, a robin is a typical representative, while a penguin is a bit special," while the AI might simply lump all winged creatures together. On the surface, your classification results look similar, but the underlying processes are vastly different.
This seemingly simple difference actually reveals a deeper question: Does AI truly "understand" concepts, or is it merely performing advanced statistical pattern matching?
Recently, a research team from Stanford University and New York University published a groundbreaking study, using the mathematical tools of information theory to deeply analyze this question. Their findings are astonishing: large language models and humans employ completely different strategies when processing concepts—AI pursues ultimate statistical compression, while humans prefer flexible adaptive expression.
2. Background — The Mystery of Concept Formation
The Uniqueness of Human Concept Formation
The human ability to form concepts is a marvel of cognitive science. We can easily compress complex and diverse information into concise, meaningful concepts, such as categorizing robins and blue jays as "birds," and knowing that most birds can fly. This process embodies a crucial trade-off: achieving representational simplification (compression) while maintaining semantic fidelity (meaning).
Even more interestingly, human concept systems are often hierarchical (a robin is a bird, and a bird is an animal), a structure that is both efficient and expressive. Items within each category also differ in "typicality": a robin is considered a typical bird, while a penguin is less so.
The Conceptual Mist of Large Language Models
Current large language models demonstrate impressive language processing capabilities, performing well in many tasks requiring deep semantic understanding. But a fundamental question remains unanswered: Do these models truly understand concepts and meaning, or are they merely performing complex statistical pattern matching on vast datasets?
The research team points out that for AI to go beyond superficial imitation and achieve human-like understanding, the key is to clarify how AI's internal representations handle the trade-off between information compression and semantic fidelity.
3. Research Method — Using Mathematics to See Through Thinking Differences
Information Theory Framework
The research team created a new framework based on rate-distortion theory and the information bottleneck principle to quantitatively compare how different systems balance representational complexity and semantic fidelity. They designed an objective function L:
L(X, C; β) = Complexity(X, C) + β × Distortion(X, C)
This formula cleverly balances two key elements, as the sketch after this list makes concrete:
(1) Complexity term: Measures the information cost of representing original items with concept clusters, reflecting the degree of compression
(2) Distortion term: Measures the loss of semantic fidelity during the grouping process, reflecting the degree of meaning preservation
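To make the trade-off concrete, here is a minimal sketch of how such an objective can be evaluated for a given clustering. It assumes the Complexity term is approximated by the entropy of the cluster-assignment distribution and the Distortion term by the mean squared distance of each item's embedding from its cluster centroid; the paper's exact definitions may differ.

```python
import numpy as np

def complexity(assignments):
    """Information cost of the clustering: entropy (in bits) of the
    distribution of items over clusters."""
    _, counts = np.unique(assignments, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def distortion(embeddings, assignments):
    """Semantic fidelity loss: mean squared distance of each item's
    embedding from the centroid of its assigned cluster."""
    total = 0.0
    for c in np.unique(assignments):
        members = embeddings[assignments == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total / len(embeddings)

def objective(embeddings, assignments, beta=1.0):
    """L(X, C; beta) = Complexity(X, C) + beta * Distortion(X, C)."""
    return complexity(assignments) + beta * distortion(embeddings, assignments)

# Toy example: four items falling into two tight clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
C = np.array([0, 0, 1, 1])
print(objective(X, C, beta=0.5))
```

Lower values of L indicate a statistically tighter compression-meaning trade-off; comparisons of exactly this kind of quantity underlie the findings discussed later in this article.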
Authoritative Human Cognitive Benchmarks
Rather than relying on modern crowdsourced datasets, the research team selected three landmark studies from the history of cognitive science as human benchmarks:
(1) Rosch (1973) study: 48 items across 8 semantic categories, establishing the foundation of prototype theory
(2) Rosch (1975) study: 552 items across 10 categories, deepening the theory of cognitive representation of semantic categories
(3) McCloskey & Glucksberg (1978) study: 449 items across 18 categories, revealing the "fuzzy boundaries" of natural categories
These classic datasets cover 1049 items and 34 categories, providing a high-fidelity empirical basis for evaluating how human-like the models' concept representations are.
Comprehensive Model Testing Matrix
The study covered a diverse range of large language models from 300 million to 72 billion parameters, including:
(1) Encoder models: BERT series
(2) Decoder models: Mainstream model families such as Llama, Gemma, Qwen, Phi, Mistral
By extracting each model's static token-level embedding vectors, the research team ensured comparability with the context-free word stimuli used in the human classification experiments.
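As an illustration of what "static token-level embeddings" means in practice, the sketch below pulls a context-free word vector straight from a model's input-embedding matrix using the Hugging Face transformers library. The choice of bert-large-uncased and the averaging over sub-tokens are assumptions made here for illustration, not necessarily the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative choice; the study also covers Llama, Gemma, Qwen, Phi, Mistral.
model_name = "bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def static_embedding(word: str) -> torch.Tensor:
    """Context-free vector for `word`: average the rows of the model's
    input-embedding matrix over the word's sub-tokens."""
    token_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    embedding_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)
    return embedding_matrix[token_ids].mean(dim=0).detach()

print(static_embedding("robin").shape)  # e.g. torch.Size([1024]) for bert-large
```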
4. Findings — Three Levels of Deep Analysis
Finding One: Superficial Harmony in Macroscopic Alignment
Key finding: Large models can form concept categories roughly aligned with human judgments.
Experimental results showed that all tested large language models could form concept clusters at a macroscopic level that were significantly aligned with human categories, far exceeding random levels. Surprisingly, certain encoder models (especially BERT-large) demonstrated astonishing alignment capabilities, sometimes even surpassing models with much larger parameter counts.
This finding reveals an important fact: the factors influencing human-like concept abstraction are not solely model scale; architecture design and pre-training objectives are equally crucial.
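A simple way to reproduce this kind of macroscopic comparison, assuming one has the item embeddings and the human category labels from the classic datasets, is to cluster the embeddings and score the resulting partition against the human one with a chance-corrected agreement measure. The use of k-means and adjusted mutual information below is an illustrative choice, not necessarily the clustering method or alignment metric used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def alignment_score(item_vectors, human_labels, n_clusters, seed=0):
    """Cluster the item embeddings and measure chance-corrected agreement
    with the human category labels (0 ~ random, 1 = identical partitions)."""
    model_clusters = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=seed).fit_predict(item_vectors)
    return adjusted_mutual_info_score(human_labels, model_clusters)

# Toy usage with random data (real inputs would be the 1049 items / 34 categories):
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(100, 16))
fake_labels = rng.integers(0, 5, size=100)
print(alignment_score(fake_vectors, fake_labels, n_clusters=5))
```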
Finding Two: A Deep Chasm in Fine-Grained Semantics
Key finding: Large models have limited ability to capture subtle semantic distinctions.
While large models can form macroscopic concept categories, they perform poorly in terms of internal semantic structure. By calculating the cosine similarity between item embedding vectors and their category name embedding vectors, the research team found only a moderate correlation between these similarities and human typicality judgments.
In other words, items considered highly typical by humans (e.g., a robin for the "bird" category) are not necessarily closer to the embedding vector of that category label in the large model's representation space. This suggests that large models may primarily capture statistically uniform associations rather than prototype-based nuanced semantic structures.
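The typicality analysis can be sketched in a few lines: compute the cosine similarity of each item's embedding to its category-label embedding, then correlate those similarities with the human typicality ratings. Spearman rank correlation is an assumption here; the paper may report a different statistic.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def typicality_correlation(item_vectors, category_vector, human_typicality):
    """Correlate embedding-based similarity to the category label with
    human typicality ratings for the same items."""
    sims = [cosine(v, category_vector) for v in item_vectors]
    rho, p_value = spearmanr(sims, human_typicality)
    return rho, p_value
```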
Finding Three: Fundamental Divergence in Efficiency Strategies
Key finding: AI and humans adopt completely different representation efficiency strategies.
This is the most striking finding of the study. Through analysis of the L objective function, the research team found:
Large language models exhibit excellent information-theoretic efficiency:
(1) Consistently achieve a more "optimized" balance in the compression-meaning trade-off
(2) Have lower cluster entropy values, indicating statistical compactness
(3) Achieve significantly lower L objective function values, implying higher statistical efficiency
Human conceptual systems, conversely (a toy illustration of cluster entropy follows this list):
(1) Have higher entropy values for the same number of clusters
(2) Score higher on the L objective function, appearing statistically "suboptimal"
(3) Yet this apparent "inefficiency" may reflect optimization for broader functional needs
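To clarify what the entropy comparison means, the toy example below computes cluster entropy for two partitions of the same items into the same number of clusters; the numbers are illustrative only, not the study's results. In the paper's analysis, human partitions show higher entropy than the LLM-derived ones at matched cluster counts, i.e., they are less statistically compact.

```python
import numpy as np

def cluster_entropy(labels):
    """Entropy (bits) of the distribution of items over clusters."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Two partitions of 8 items into 3 clusters:
balanced = [0, 0, 0, 1, 1, 1, 2, 2]   # items spread across clusters -> ~1.56 bits
skewed   = [0, 0, 0, 0, 0, 0, 1, 2]   # most items packed into one cluster -> ~1.06 bits
print(cluster_entropy(balanced), cluster_entropy(skewed))
```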
5. Deep Implications — Re-examining the Definition of "Intelligence"
AI's Statistical Compression Preference
The study reveals that large language models are highly optimized for statistical compactness. They form information-theoretically efficient representations, achieving excellent statistical regularity by minimizing redundancy and internal variance. This is likely a result of their training on massive text corpora—to handle vast amounts of data, they learned extreme compression strategies.
However, this focus on compression limits their ability to fully encode the rich, prototypical semantic details crucial for deep understanding. AI becomes "efficient" but not "subtle" enough.
Human Adaptive Wisdom
Human cognition, conversely, prioritizes adaptive richness, contextual flexibility, and broad functional utility, even if it comes at a cost to statistical compactness. The high entropy values and L scores of human concepts may reflect optimization for broader, complex cognitive needs, including:
(1) Robust generalization: supporting effective generalization from sparse data
(2) Reasoning ability: supporting strong causal, functional, and goal-directed reasoning
(3) Communication efficiency: enabling effective communication through learnable and shareable structures
(4) Multimodal grounding: rooting concepts in rich multi-sensory experiences
Humans choose a seemingly "inefficient" form of representation that, in reality, buys better adaptability and versatility.
Architectural Insights
Notably, the excellent performance of smaller encoder models (like BERT) in specific alignment tasks emphasizes the significant impact of architectural design and pre-training objectives on the model's ability to abstract human-like conceptual information. This points to an important direction for future AI development focused on enhancing human-machine alignment.
6. Conclusion: The Long Journey From "Tokens" to "Thoughts"
The most profound insight of this research is that AI and humans represent two fundamentally different paradigms of "intelligence."
AI excels at statistical compression, following a representational path fundamentally different from human cognition. These models are like perfect librarians, able to organize and store information in the most efficient way, but perhaps lacking a true understanding of the deep meaning of each book.
Human cognition, on the other hand, is like a wise philosopher, willing to tolerate superficial "disorder" and "inefficiency" because this complexity is precisely the basis for flexibly navigating a complex world, performing deep reasoning, and thinking innovatively. Human "inefficiency" is actually a hallmark of advanced intelligence.
This fundamental difference has profound implications for AI development. To achieve truly human-like understanding, we need to move beyond the current paradigm primarily based on scale expansion and statistical pattern matching. Future efforts should explore principles for explicitly fostering richer, more nuanced conceptual structures.
As the study title suggests, progress from "tokens" to "thoughts" requires AI systems to learn to embrace this kind of seeming "inefficiency," because it might be the hallmark of robust, human-like intelligence. We need not just AI that can process information efficiently, but intelligent systems that can think flexibly, understand deeply, and reason creatively, as humans do.
This study provides a quantitative framework for evaluating and guiding AI's development toward more human-like understanding, and it also reminds us that true intelligence may lie not in perfect efficiency but in adaptive wisdom. In today's era of rapidly developing AI, understanding this difference is crucial for building AI systems that are both powerful and trustworthy.
Paper Title: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Paper Link: https://arxiv.org/abs/2505.17117
Recommended Reading
The Illusion of Large Language Model Reasoning by Apple
OpenThinker3-7B Grand Release: A New Benchmark for Open-Source Reasoning Models
Knowledge vs. Reasoning: How to Correctly Evaluate the Thinking Ability of Large Models?