Author | Fei Binjie, CEO of Entropy Jane Technology
After Ilya Sutskever, OpenAI's co-founder and chief scientist, officially announced his departure from OpenAI, he immediately liked an AI paper, drawing widespread attention.
The paper, titled The Platonic Representation Hypothesis, was published by an MIT team last week.
I spent the weekend reading this paper carefully, and it left me with an indescribable sense of shock. A paper singled out by Ilya, one of the pioneers of generative AI, is indeed extraordinary.
This paper offers guidance on the future direction of AI development. Whether you are a tech investor, an AI practitioner, or simply someone interested in AI, it is worth reading.
I'll interpret the essence of this paper for everyone.
After reading this article, you will have a whole new philosophical understanding of the future of deep learning models.
(1) Plato's Cave Allegory
Everything starts with Plato's Cave Allegory.
The Cave Allegory is a thought model proposed by Plato in his work The Republic, exploring what "reality" is.
In the Cave Allegory, there is a group of prisoners who have been chained in a cave their entire lives, knowing nothing about the world outside the cave.
They always face a wall and can see only the shadows that the objects behind them cast on it.
Over time, these shadows become their "reality," but they are not an accurate representation of the real world.
In the Cave Allegory, "shadows" represent the fragments of reality we perceive through various senses, whether images seen by the eyes, sounds heard by the ears, or shapes touched by hands—they are all just projections of "reality."
In the dialogue, Socrates, Plato's teacher, likens philosophers to prisoners released from the cave: walking out into the sunlight, they gradually realize that the shadows on the wall are not "reality" itself but only projections of it.
The philosopher's goal is to use logic, mathematics, the natural sciences, and other tools to understand and perceive higher levels of "reality", to investigate things and attain knowledge, and to seek the "Way."
Now, this grand goal has been passed from philosophers to AI scientists.
(2) What is the Platonic Representation Hypothesis?
After understanding Plato's Cave Allegory, the Platonic Representation Hypothesis becomes easier to grasp.
The Platonic Representation Hypothesis states that different AI models are converging toward a shared representation of reality.
This might sound a bit abstract; let me explain specifically.
As shown in the figure above, suppose we concretize reality Z as a cone + a sphere. Then X is the projection of reality Z in the image modality, and Y is the projection in the text modality.
At this point, we train two AI models: an image model f_img and a text model f_text, which learn representations of X and Y respectively.
As the models' parameter counts and training data grow, however, both will ultimately learn the representation of the reality Z that lies behind the projections X and Y.
You can think of it this way: when an AI model becomes smart enough, it is no longer the prisoner chained by irons but becomes a philosopher who walks out of the cave.
It no longer sees mere projections on the wall; it gradually understands the true nature of things and perceives a higher-dimensional reality.
This is what the Platonic Representation Hypothesis means. With this picture in mind, the authors' formal definition becomes much easier to follow.
The Platonic Representation Hypothesis has a very important corollary: AI models of different modalities and different architectures will converge to the same endpoint, namely an accurate representation of high-dimensional reality.
Specifically, this representation of reality can be understood as a probabilistic model: the joint distribution over real-world events.
These discrete events are sampled from an unknown distribution and can be observed and perceived in many ways, whether as an image, a clip of audio, a passage of text, or physical measurements such as mass, force, or torque.
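To make this concrete, here is a toy Python sketch of the setup (purely illustrative and my own, not from the paper): an event z is drawn from a joint distribution the models never observe directly, and the image x and the caption y that f_img and f_text train on are just two projections of that same z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "reality": an event z with two hidden attributes (hypothetical example).
SHAPES = ["cone", "sphere"]
COLORS = ["red", "blue"]

def sample_event():
    """Draw z from a joint distribution that the models never see directly."""
    return {"shape": rng.choice(SHAPES), "color": rng.choice(COLORS)}

def project_to_text(z):
    """Text modality Y: a caption is one projection of the event."""
    return f"a {z['color']} {z['shape']}"

def project_to_image(z):
    """Image modality X: a crude stand-in for rendering, as two pixel statistics."""
    hue = 0.0 if z["color"] == "red" else 0.7
    pointiness = 1.0 if z["shape"] == "cone" else 0.0
    return np.array([hue, pointiness]) + rng.normal(0.0, 0.01, size=2)

z = sample_event()
x, y = project_to_image(z), project_to_text(z)  # all f_img and f_text ever observe
```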
(3) Verifying the Validity of the Platonic Representation Hypothesis
Since this is a hypothesis, we naturally need to find ways to verify its validity.
Fortunately, scientists have handy mathematical tools for quantitative analysis.
Phillip Isola's team defines "representation alignment" as a measure of similarity between the kernels induced by two representations.
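To give a feel for what such a measure can look like, here is a minimal sketch of an alignment score based on mutual nearest neighbors under the two kernels. It assumes both models have been run on the same n inputs; the function name and implementation details are mine, and the paper's exact metric may differ.

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Rough sketch of a mutual nearest-neighbor alignment score.

    feats_a, feats_b: arrays of shape (n, d_a) and (n, d_b) holding the features
    two different models produce for the SAME n inputs.
    """
    def knn_indices(feats):
        f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = f @ f.T                            # kernel K[i, j] = <f(x_i), f(x_j)>
        np.fill_diagonal(sim, -np.inf)           # a point is not its own neighbor
        return np.argsort(-sim, axis=1)[:, :k]   # k nearest neighbors per input

    nn_a, nn_b = knn_indices(feats_a), knn_indices(feats_b)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))               # 1.0 = identical neighborhoods
```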
Beyond such a metric, another intuitive way to probe whether two representations are compatible is a technique called model stitching.
The idea is straightforward: connect an intermediate representation layer of one model to an intermediate layer of another through a trainable stitching layer, forming a new "stitched" model.
If this stitched model performs well, the representations of the two original models at that layer are compatible, even if the models were trained on completely different datasets.
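In PyTorch, a stitched model could look roughly like the sketch below. It assumes you have already split two trained networks into a bottom half bottom_a and a top half top_b; the class name, split points, and dimensions are placeholders of mine, not the paper's code.

```python
import torch.nn as nn

class StitchedModel(nn.Module):
    """Minimal sketch of model stitching (names and split points are hypothetical).

    bottom_a: the lower layers of model A (frozen), producing features of size dim_a.
    top_b:    the upper layers of model B (frozen), expecting features of size dim_b.
    Only the linear stitching layer in between is trained.
    """

    def __init__(self, bottom_a: nn.Module, top_b: nn.Module, dim_a: int, dim_b: int):
        super().__init__()
        self.bottom_a, self.top_b = bottom_a, top_b
        self.stitch = nn.Linear(dim_a, dim_b)        # the stitching layer
        for p in list(bottom_a.parameters()) + list(top_b.parameters()):
            p.requires_grad = False                  # freeze both original halves

    def forward(self, x):
        h = self.bottom_a(x)     # intermediate representation from model A
        h = self.stitch(h)       # map it into model B's representation space
        return self.top_b(h)     # let model B finish the computation

# If training only self.stitch recovers the original models' accuracy, the two
# intermediate representations are (at least linearly) compatible.
```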
(4) Experimental Results: The Strong Converge, the Weak Diverge in Their Own Ways
Through the "model stitching" technique and the "representation alignment" evaluation method, we can verify if the Platonic Hypothesis truly exists.
Phillip selected 78 CV models for representation similarity analysis; these models differ in training datasets, task objectives, and algorithmic architectures.
The experimental results are very interesting. As shown in the figure below, let me interpret this chart for you.
First, look at the bar chart on the left. The x-axis reflects how many VTAB tasks a model solves; solving more tasks indicates a stronger model. On this measure, Phillip Isola's team divides the 78 vision models into 5 buckets, with stronger models further to the right.
The y-axis is the representation similarity between all models in each bucket; taller bars indicate higher similarity.
It's clear that the stronger the model performance, the higher the representation similarity between them. Conversely, the weaker the performance, the lower the similarity.
The scatter plot on the right presents this conclusion even more clearly. Each point represents a CV model; redder colors indicate weaker models, bluer indicate stronger.
Strong models (blue points) cluster together, indicating high representation similarity, while weak models (red points) are more dispersed, indicating lower similarity.
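For readers who want to reproduce the shape of this bucketed analysis, here is a rough sketch, assuming you already have a performance score per model and a pairwise alignment matrix (for example from the mutual_knn_alignment sketch above). The function and variable names are mine, not the paper's.

```python
import numpy as np

def alignment_by_bucket(scores, alignment_matrix, n_buckets=5):
    """Average pairwise alignment within each performance bucket.

    scores: (n_models,) performance per model, e.g. fraction of VTAB tasks solved.
    alignment_matrix: (n_models, n_models) pairwise alignment scores.
    """
    order = np.argsort(scores)                     # weakest -> strongest
    buckets = np.array_split(order, n_buckets)     # e.g. 5 buckets of models
    bars = []
    for idx in buckets:
        block = alignment_matrix[np.ix_(idx, idx)]
        off_diag = block[~np.eye(len(idx), dtype=bool)]   # ignore self-alignment
        bars.append(float(off_diag.mean()))
    return bars                                    # one value per bar in the chart
```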
Leo Tolstoy wrote in Anna Karenina: Happy families are all alike; every unhappy family is unhappy in its own way.
Phillip playfully paraphrases Tolstoy: Strong models are all alike; weak models are weak in their own ways.
(5) Three Major Reasons Behind AI Model Representation Convergence
From the experimental results, we see that the Platonic Representation Hypothesis indeed holds.
So why do AI models exhibit such clear representation convergence? Phillip believes there are three main reasons.
First Reason: Task Generality
When an AI model only needs to complete one specific task (e.g., image classification), there are many ways to achieve it.
But if the model needs to excel at a series of different tasks simultaneously, the ways to achieve it become much fewer.
As shown in the figure below, each task objective imposes additional constraints on the model. When we need a model to handle translation, Q&A, code writing, math problem-solving, etc., simultaneously, its representation space converges to a very narrow range.
In fact, training a large language model is itself a form of multi-task training: predicting the next token from context looks like one simple task, but it actually bundles together a comprehensive set of tasks.
Multi-task training imposes more constraints on the model, leading to a tighter, higher-quality solution space.
This is a powerful explanation for why large language models exhibit emergent intelligence.
Second Reason: Model Capacity
Larger models are more likely to approach the global optimal representation, thereby driving convergence.
As shown in the figure below, the yellow and green areas are the representation spaces of two AI models. The concentric circles are contour lines of model loss, with the global optimum (lowest loss) at the center.
In the left figure, because both models have small parameter counts, gradient descent can only reach two separate local optima (marked ☆).
As parameter scale increases, the yellow and green areas expand, meaning larger representation spaces. In the right figure, both models find a shared global optimum (marked ★), achieving convergence.
Third Reason: Simplicity Bias
Deep neural networks naturally follow Occam's razor: they exhibit a "simplicity bias," tending to choose the simplest solution among all feasible ones.
Perhaps this unique property is what makes deep neural networks stand out among many architectures, becoming the foundational algorithm of modern AI.
(6) Scaling is Useful, But Not Always Efficient
The Platonic Representation Hypothesis has several important corollaries, each providing directional guidance for future AI development.
According to the hypothesis, as model parameters, task diversity, and compute FLOPs increase, representations gradually converge.
Does this mean simply scaling up will achieve AGI?
Yes and no. While scaling up can achieve representation convergence, the efficiency of convergence may vary greatly across methods.
For example, AlphaFold 3 can effectively predict the structures of proteins and other biomolecules, while Tesla's FSD achieves autonomous driving from camera images.
Protein structure prediction and autonomous driving are arguably fairly independent tasks. A single unified model that matched both AlphaFold 3 and FSD would likely perform even better, but training it could be extremely inefficient and a poor use of resources.
Thus, for sufficiently independent tasks, it can be more efficient to train a separate, specialized "shortcut" model than to rely on a unified representation of reality.
In some scenarios, efficiently achieving local optima is more economically valuable than laboriously pursuing the global optimum.
(7) Rethinking the Relationships Between Multimodal Data
The Platonic Representation Hypothesis allows us to reexamine relationships between multimodal data from a new perspective.
Suppose you have M images and N texts. To train the strongest CV model, you shouldn't just train on all M images but also include the N texts in the training set.
This has become common practice in the AI industry; many excellent CV models are fine-tuned from pretrained large language models.
The reasoning also runs in reverse: to train the strongest text model, include not only all N texts but also the M images.
This is because behind different modalities lies a modality-agnostic universal reality representation.
This means that even without cross-modal paired data (e.g., text-image pairs) in the training set, pure text corpus directly helps CV model training. The main value of cross-modal pairs is to accelerate representation convergence.
(8) Conclusion: Seeking the Global Optimum for Representing the World
Two thousand years ago, Plato proposed the Cave Allegory, and philosophers began groping toward the essence of reality with the tools of logic and geometry.
Two thousand years later, humanity's toolbox has a super weapon: AI.
The baton of "investigating things to attain knowledge" has been passed to AI scientists.
We look forward to humanity, in this era, using AI's power to find the global optimum for world representation, escape the cave, explore and understand high-dimensional reality, and benefit society.
All of machine learning is footnotes to Plato.
(End of article)