Synced Report
Edited by: KingHZ
【Synced Report】The ChatGPT 'simping' incident reveals that today's AI is still a 'black box'. A split over 'mechanistic interpretability' is tearing at the core value consensus of AI research: Google backs down while Anthropic persists. Can AI still be 'understood'?
OpenAI made only a minor update to GPT-4o, yet overnight the AI's personality changed drastically, turning the model into a 'cyber simp'.
However, no one knows why this happened.
This precisely exposes the current fatal flaw of AI: lack of interpretability.
Discussions among experts about the research value of AI interpretability have recently become more intense.
Specifically, it's the debate between AI giants Google and Anthropic regarding 'mechanistic interpretability':
In March, Google DeepMind announced it would no longer prioritize 'mechanistic interpretability' as a research focus.
In April, Anthropic CEO Dario Amodei advocated for greater emphasis on 'mechanistic interpretability' research and expressed optimism about achieving 'AI's MRI' (i.e., deeply understanding AI's internal mechanisms) within the next 5 to 10 years.
The goal of so-called mechanistic interpretability is to 'reverse-engineer' AI systems.
But more than a decade of research suggests that this approach may be hard to truly pull off, and that the whole effort may have been led astray by flawed foundational assumptions.
The Danger of Human Ignorance: The Sword of Damocles Hanging Over GenAI
Many risks and concerns related to GenAI are essentially triggered by the 'black box' nature of these algorithms' internal mechanisms.
If models were interpretable, these problems would be easier to solve.
But explainable AI has proven very difficult to achieve.
In a 2018 interview, Geoffrey Hinton compared explainable AI to the 'chicken or the egg' problem. At the time, he said:
Humans themselves often cannot explain how they make decisions most of the time. ... Neural networks have a similar problem. You give it an image, and it outputs a reasonable judgment, like determining if there is a pedestrian. But if you ask it 'Why did you make that judgment?', the problem is: if there were a simple rule for judging whether an image contains a pedestrian, the problem would have been solved long ago.
New York University Professor Bob Rehder once wrote: "Explanations prompt learners to look for general patterns, but this can also cause them to overlook exceptions. The result is that in domains where exceptions are frequent, explanations can actually have a negative effect."
Anthropic co-founder Chris Olah often says that GenAI is more 'grown' than 'built': its internal mechanisms are 'emergent' rather than deliberately designed.
This is somewhat like growing vegetables or flowers: humans can set the overall growing conditions, but the specific structures that ultimately form are unpredictable and difficult to explain.
When we try to understand the inside of these systems, all we see is a vast matrix composed of billions of numbers. These numbers can perform important cognitive tasks, but how they do this is currently unknown.
This inexplicability also bars AI from many important fields, because we cannot clearly delimit the boundaries of its behavior, and when errors do occur the consequences can be extremely severe.
In fact, in certain scenarios, the inexplicability of models even legally prevents their use.
Meanwhile, AI has made significant progress in the sciences.
The ability to predict DNA and protein sequences, for example, has improved enormously, but the patterns and structures AI discovers are often hard for humans to understand and so do not translate into genuine biological insight.
Mechanistic interpretability primarily attempts to identify which specific 'neurons' and 'circuits' in a model are active when performing a task.
Researchers hope to use this to track the model's thinking process, thereby explaining its behavior in terms of its 'hardware principles'.
Many believe this detailed understanding is invaluable for AI safety; it would enable researchers to precisely design models that behave as expected under all conditions, reliably avoiding all risks.
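To make 'seeing which neurons are active' concrete, here is a minimal sketch that records the activations of one MLP layer of GPT-2 while it reads a prompt, then lists the neurons that fired most strongly. The choice of model, layer index, and prompt is arbitrary and purely illustrative; it is not how any of the labs discussed here actually run their analyses.

```python
# Minimal sketch: record which MLP neurons fire for a prompt (PyTorch + Hugging Face).
# The model, layer index, and prompt are arbitrary illustrative choices.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # Store the post-nonlinearity MLP activations for later inspection.
    captured["mlp"] = output.detach()

# Hook the MLP activation of block 6 (an arbitrary middle layer).
handle = model.h[6].mlp.act.register_forward_hook(save_activation)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

acts = captured["mlp"][0]        # shape: (seq_len, 3072) hidden neurons
mean_act = acts.mean(dim=0)      # average activation per neuron over the prompt
top = torch.topk(mean_act, k=10)
print("Most active MLP neurons:", top.indices.tolist())
```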
Google: Feeling Cheated
The research into mechanistic interpretability stems from researchers' belief in truth: knowledge is power; to name is to know, to know is to control.
While working at Google, Chris Olah attempted to systematically study how to open this LLM 'black box' and understand the model's internal workings.
The early stages of mechanistic interpretability (2014–2020) focused mainly on image models, where researchers successfully identified some neurons corresponding to human-understandable concepts.
This is similar to hypotheses in early neuroscience, such as the existence of neurons in the brain that recognize specific people or concepts, known as 'Jennifer Aniston neurons'.
Some of the neurons in the final layer of the CLIP model
Anthropic: Steadfast on AI Explainability
When Anthropic was founded, co-founders Chris Olah and Dario Amodei decided to apply interpretability methods to language models.
Dario Amodei
Soon, they discovered some fundamental mechanisms in models that are crucial for language understanding, such as copying and sequence matching.
At the same time, they also found interpretable neurons similar to those in image models that can represent specific words or concepts.
However, the sheer complexity of the problem stalled interpretability research for a time, until the team borrowed a technique from signal processing: sparse autoencoders (SAEs), which can pick out combinations of neurons that express concepts much closer to human understanding.
Compared to single neurons, these combinations can express more subtle concepts, such as 'literal or metaphorical hesitation and avoidance', or 'music genres expressing dissatisfaction'.
These combinations are called 'features', and sparse autoencoder methods have been used to map models of various scales, including the most advanced commercial models.
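As a rough sketch of the idea, a sparse autoencoder learns an overcomplete dictionary of 'feature' directions by reconstructing a model's activation vectors under a sparsity penalty, so that only a few features fire for any given input. The toy implementation below uses placeholder dimensions, a plain L1 penalty, and random data in place of real activations; it illustrates the technique, not any lab's actual setup.

```python
# Minimal sketch of a sparse autoencoder (SAE) over model activations.
# Dimensions and the L1 coefficient are illustrative, not any lab's real settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(features)           # reconstruction of the original activations
        return recon, features

def sae_loss(x, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((recon - x) ** 2).mean() + l1_coeff * features.abs().mean()

# Toy training loop on random "activations" standing in for a real model's residual stream.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    batch = torch.randn(64, 768)                 # stand-in for captured activations
    recon, feats = sae(batch)
    loss = sae_loss(batch, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```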
Recently, their research has expanded from 'tracking and manipulating individual features' to 'tracking and manipulating a set of features', referred to as 'circuits'.
With the help of these circuits, one can 'track' the model's thought path.
For example, when you ask the model "What is the capital of the state where Dallas is located?", the model internally activates an 'is located within' circuit, causing the 'Dallas' feature to activate 'Texas', and then through another circuit, the concepts of 'Texas' and 'capital' jointly activate 'Austin'.
Using this circuit-tracing method, Anthropic has studied the internal mechanisms of Claude 3.5 Haiku.
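To give a flavour of what 'manipulating' a feature means in practice, the toy sketch below injects a chosen feature direction into one layer's residual stream of GPT-2 and then generates text, the kind of causal intervention used to test whether a feature really drives behavior. The model, layer, direction, and steering strength are all placeholders; Anthropic's actual circuit experiments on Claude are far more involved.

```python
# Illustrative sketch of steering a model by injecting a feature direction
# into the residual stream. Model, layer, direction, and scale are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

d_model = model.config.n_embd
feature_direction = torch.randn(d_model)     # stand-in for a learned SAE decoder column
feature_direction /= feature_direction.norm()
scale = 5.0                                  # arbitrary steering strength

def steer(module, inputs, output):
    # A GPT2Block returns a tuple; output[0] holds the residual-stream hidden states.
    hidden = output[0] + scale * feature_direction
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)

prompt = "The capital of the state containing Dallas is"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
handle.remove()
print(tokenizer.decode(out[0]))
```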
Google DeepMind: Suspending SAE Research
It turns out that making SAEs work stably and effectively is very difficult.
This is one reason DeepMind recently decided to lower the priority of SAEs.
The DeepMind research team published a technical blog explaining in detail why they are not optimistic about sparse autoencoders.
Their starting point was that many people in interpretability research had invested significant effort in sparse autoencoders (SAEs).
But SAEs lack an objective reference standard for 'true' features: there is no ground-truth structure inside the language model to compare them against, so it is hard to tell how well they are actually working.
While qualitative analysis shows that SAEs do capture some real structure (far more than random noise would explain), the limitations are just as obvious: when you enter arbitrary sentences into Neuronpedia and look at which latent variables activate, those latents often have no clear semantic explanation.
Neuronpedia was originally a platform designed specifically for Sparse Autoencoder (SAE) research, but has now been upgraded to an open infrastructure supporting general mechanistic interpretability research.
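Concretely, 'looking at which latents fire for a sentence' amounts to something like the snippet below, which assumes you already have residual-stream activations for the sentence and a trained SAE such as the toy one sketched earlier (both stand-ins here); Neuronpedia essentially makes this kind of inspection browsable and attaches human-written explanations to each latent.

```python
# Sketch: given activations for a sentence and a trained SAE, list the
# most active latents per token. Both inputs are stand-ins for this example.
import torch

def top_latents(activations: torch.Tensor, sae, k: int = 5):
    """activations: (seq_len, d_model) residual-stream vectors for one sentence."""
    with torch.no_grad():
        _, feats = sae(activations)          # (seq_len, d_features), mostly zeros
    for pos in range(feats.shape[0]):
        values, indices = torch.topk(feats[pos], k)
        active = [(int(i), float(v)) for i, v in zip(indices, values) if v > 0]
        print(f"token {pos}: {active}")

# Usage with the toy SAE defined above and fake activations:
# top_latents(torch.randn(12, 768), sae)
```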
When deciding whether to continue prioritizing SAEs, it is necessary to think more broadly about the evaluation criteria for interpretability research.
The traditional approach assumes that there is some precise, human-understandable 'objective truth' inside the model and attempts to reveal it through reverse-engineering.
For SAEs, this idealized vision is embodied in the hope that a model's activations decompose into a sparse set of human-interpretable 'true features' that the autoencoder can faithfully recover.