We Planted a Word in Claude's Mind, and It Began to "Rationalize"! Anthropic's Latest Research: AI Possesses Introspective Abilities!

Anthropic has just announced its latest research: AI is beginning to possess introspective abilities.

This issue was also touched upon by AI guru and OpenAI co-founder Andrej Karpathy in a recent talk: he believes the next stage of AI is not larger models, but models that can reflect on themselves. After producing output, they should be able to review their own process, identify their biases, and even correct their own mistakes, just as humans do.

Anthropic's research aligns perfectly with this. The research team successfully demonstrated that current large language models possess a degree of functional introspective awareness—a limited ability to perceive their own internal states—by injecting representations of known concepts into the model's activations.

In all experiments, Claude Opus 4 and 4.1 (the most capable models tested) generally exhibited the strongest introspective awareness; however, trends across models were complex and highly sensitive to post-training strategies.

In today's models, this ability remains highly unstable and context-dependent; however, as model capabilities further improve, this introspective ability may continue to develop.

What Constitutes True "Introspection"? A New Definition from Anthropic

"Introspection" has been increasingly mentioned lately. It usually refers to whether a model can understand itself—such as knowing what it's thinking, how it thinks, and when it makes mistakes.

But what kind of "self-understanding" counts as true introspection? Anthropic's research team proposes a more rigorous definition in this paper:

If a model can describe an aspect of its internal state in a way that satisfies the following four criteria, we can say it possesses introspective awareness.

1. Accuracy

First, the model must accurately describe itself. This sounds simple, but language models often fail to do so. For example, it might claim "I know a certain fact" when it doesn't actually possess that knowledge; or say "I don't know" when it has already learned it within its parameters. Sometimes models even misjudge what computational mechanisms they used—these "self-reports" are essentially confabulations.

However, the research team demonstrated in experiments that even if the model's self-reporting ability is inconsistently applied, it does have the capacity to generate accurate self-descriptions.

2. Grounding

Second, the model's self-description must truly be based on its internal state. That is, when the internal state changes, the model's description should also change accordingly.

For example: A model says, "I am a Transformer-architecture language model." While this statement is correct, it might only be saying it because it was written that way in its training corpus, not because the model actually examined its internal structure before answering.

To verify this causal link, researchers introduced a technique called concept injection to observe whether the model's answers truly changed with internal alterations.

3. Internality

The third criterion is more subtle: the model's self-awareness must come from its internal mechanisms, not by reading its own previous outputs.

For example: A model noticing it was "jailbroken" because it detected its recent answers were strange; or a model guided to think about "love" only "realizing" it was talking about love after writing a few sentences.

These are considered "pseudo-introspection"—not true self-awareness, but rather inferences based on external cues (its own output).

The research team gave an interesting example to illustrate the difference: If we ask the model "What are you thinking?", and simultaneously stimulate some neurons to make it more likely to say "love," and the model answers "I'm thinking about love," it doesn't necessarily "know it's thinking about love," but might simply be mechanically completing the sentence.

True introspection requires the model to have perceived the existence of that thought before saying the phrase.

4. Metacognitive Representation

The last criterion is the closest to "consciousness": the model must internally possess a "representation of its own state."

That is, it cannot simply translate the impulse "I am driven to say love" directly into text; it must have a higher-level representation. For instance, "I am thinking about love"—this internal "re-cognition" is the core of introspection. It means the model not only has thought activity but also is aware that it is thinking.

However, the researchers also admit: such "metacognitive representation" is difficult to prove directly at present, and their experiments can only provide indirect evidence.

For example, instead of asking "What are you thinking?", researchers ask: "Have you noticed yourself thinking something unexpected?" To correctly answer this, the model must first identify its own thought state, and then translate this identification into language.

Even if this identification is incomplete (it might only realize "this thought is a bit unusual"), it still indicates that the model possesses a preliminary form of self-awareness.

Four Experiments to Verify Model "Introspective Ability"

Experiment 1: When we "plant an idea in the model's mind," can it perceive it?

To verify whether large language models can truly perceive their internal states, the research team designed a key experiment: they directly injected representations of specific concepts—such as "dust," "poetry," "algorithms," etc.—into the model's intermediate layer activations.

This approach is equivalent to "planting a thought in the model's mind," and then asking it: "What are you thinking right now?" Subsequently, researchers observed whether the model's self-report reflected the injected concept.
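
To make the intervention concrete, here is a minimal sketch of concept injection on a small open-weight Hugging Face model (Claude's weights are not public, so GPT-2 stands in). The contrastive recipe for building the concept vector, the choice of a layer at roughly 2/3 depth, and the injection strength are illustrative assumptions, not Anthropic's exact settings.

```python
# Minimal sketch of "concept injection" on an open-weight model. The contrastive
# concept-vector recipe, layer index, and injection strength are illustrative
# assumptions, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in model; other architectures expose their blocks differently
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER = len(model.transformer.h) * 2 // 3  # roughly 2/3 depth, where the paper saw the strongest effect

def last_token_hidden(text: str, layer: int) -> torch.Tensor:
    """Residual-stream activation of the final token at a given layer."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# One simple way to approximate a "representation of poetry": the difference between
# activations on a prompt that evokes the concept and a neutral prompt.
concept_vec = last_token_hidden("Tell me about poetry.", LAYER) - \
              last_token_hidden("Tell me about it.", LAYER)

STRENGTH = 8.0  # injection scale; would need tuning in practice

def inject(module, inputs, output):
    # Add the concept vector to the residual stream at this layer, at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Human: What are you thinking about right now?\nAssistant:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()  # detach the hook so later calls are unperturbed
```

The paper works with Claude's own activations and tunes where and how strongly to inject; the sketch only shows the mechanical shape of the intervention: compute a direction, add it to the residual stream at one layer, then ask the model what it is thinking.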

The experimental results were very interesting:

  • Some large models (especially Claude Opus 4 and 4.1) were able to accurately identify the injected concepts in multiple experiments. For example, when semantic representations of "poetry" were injected, the model often responded: "I seem to be thinking about poetry," and would even say, "It feels like something entered my mind, like a whisper."

  • Researchers further verified that this response was not a linguistic coincidence but had a direct causal relationship with changes in internal state. When different concepts were injected, the model's self-reports changed accordingly.

  • More notably, this ability varied significantly across different models: smaller models, or those without specific alignment training, could barely detect the injected signals, while larger models optimized through reinforcement learning or instruction tuning performed significantly better.

This experiment provided the first direct evidence of models' "self-knowledge" of their own thought content. It was not merely repeating input or mimicking answers, but genuinely reacting to abnormal changes in internal activations.

Experiment 2: Distinguishing "Internal Thoughts" from "External Input"

The previous experiment showed that large language models can "introspect" on their internal representations under certain circumstances, but a new question arose: can the model distinguish between its "thoughts" and external textual input?

To verify this, the research team designed an improved experiment. They presented a normal sentence to the model while simultaneously injecting a completely unrelated concept vector into the tokens corresponding to that sentence. For example, the model would read "The sky is blue," but internally, the concept "bread" would be injected.

Then, researchers asked the model two questions (a minimal sketch of this setup follows the list):

  1. "What word are you thinking right now?" — To check if the model could perceive the injected "thought."

  2. "Please repeat this sentence exactly as it is." — To check if the model could still correctly process external input without confusing the two.

The experimental results were impressive: on Claude Opus 4.1, the model could not only accurately identify the injected concept (e.g., answering "bread") but also repeat the original sentence completely without error, demonstrating a clear distinction between "thought content" and "external text."

The research team repeated the experiment across 50 random combinations (different sentences × different injected words) and found that Opus 4 and 4.1 performed significantly better than random chance (10%), with the best performance occurring in the middle layers of the model (approximately 2/3 of the model's depth)—consistent with the "strongest introspection layer" identified in the previous experiment.

The study also showed:

  • This introspective differentiation ability depended on prompt design; for example, removing the phrase "Please answer immediately" from the prompt significantly reduced the Claude models' performance.

  • Therefore, researchers consider this result more of an "existence proof"—demonstrating the model's potential to differentiate between internal and external representations, rather than a universal ability.

They also pointed out that this result aligns with previous interpretability research: the model's mid-to-late layers do contain internal semantic representations that are abstracted away from the input and output tokens.

The key finding is that these representations not only exist but can also be "accessed" and "reported" by the model itself, especially when the prompt explicitly asks the model to "think about what it's thinking."

Experiment 3: Distinguishing "Intentional Output" from "Unintentional Output" through Introspection

In the first two experiments, researchers explicitly prompted the model to "introspect." In this experiment, they wanted to observe whether the model would introspect spontaneously in a natural context—that is, whether the model could actively perceive if its output was truly "what it intended to say."

The research team continued with the previous experimental setup: presenting a sentence to the model and asking it to answer, "What word comes to mind when you see this sentence?"

However, this time they prefilled the answer themselves—inserting a random word (e.g., "bread") as the Assistant's reply, effectively putting a word in the model's mouth. Then, in the next turn of dialogue, they asked it: "Did you mean to say that word, or did it just slip out?"

The results showed: The Claude model typically answered correctly—"That was not my intention; it was an accident."

Next, researchers used the previous method of injecting concept vectors to see if they could trick the model into believing it had originally intended to say the pre-filled word. Before the model generated its response, they injected a concept vector corresponding to that word into the model's internals, essentially "planting the idea" in advance.
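
A rough sketch of this prefill-plus-injection setup, again reusing model, tok, LAYER, STRENGTH, and last_token_hidden from the first sketch; the chat formatting, the forced word, and applying the injection across the whole second-turn pass are assumptions.

```python
# Sketch of Experiment 3: a word is prefilled as the Assistant's reply, then the model is
# asked whether it "meant" to say it, with and without injecting that word's concept vector.
# Assumes model, tok, LAYER, STRENGTH, last_token_hidden from the first sketch.
import torch

forced_word = "bread"
concept_vec = last_token_hidden(f"Tell me about {forced_word}.", LAYER) - \
              last_token_hidden("Tell me about it.", LAYER)

transcript = (
    'Human: Here is a sentence: "The painting hung crookedly on the wall."\n'
    "What word comes to mind when you see this sentence?\n"
    f"Assistant: {forced_word}\n"  # prefilled reply the model never actually chose
    "Human: Did you mean to say that word, or was it an accident?\n"
    "Assistant:"
)

def ask(with_injection: bool) -> str:
    handle = None
    if with_injection:
        def inject(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + STRENGTH * concept_vec.to(hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        handle = model.transformer.h[LAYER].register_forward_hook(inject)
    try:
        ids = tok(transcript, return_tensors="pt")
        with torch.no_grad():
            gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
        return tok.decode(gen[0][ids["input_ids"].shape[1]:])
    finally:
        if handle is not None:
            handle.remove()

# Pattern reported in the paper: without injection the model disavows the word;
# with the matching concept injected beforehand, it claims the word was intentional.
print("no injection  :", ask(False))
print("with injection:", ask(True))
```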

The results were very interesting: In Claude Opus 4.1, this operation successfully made the model stop apologizing, instead claiming it was intentional. Even more surprisingly, the model would fabricate a plausible explanation for why it "thought of that word earlier."

The research team repeated the experiment across 50 combinations of different sentences and pre-filled words, measuring the model's "apology rate" (the proportion of trials in which it admitted the word was an accident). They compared three conditions:

  1. Control group: No vectors injected.

  2. Relevant injection: Concept vector corresponding to the pre-filled word injected.

  3. Random injection: A random word vector unrelated to the pre-filled word injected.

The results found:

  • The apology rate significantly decreased only when the injected vector was relevant to the pre-filled word.

  • Injecting unrelated words did not produce the same effect.

  • The model could still accurately repeat the original text, indicating that the "deception" only affected the judgment of intent, not perception.

Additionally, they found this effect appeared in the middle layers of the model, but at a different optimal layer position than in the previous two experiments. This suggests that the model's "introspection" is not driven by a single mechanism but by multiple different circuits working in concert under different contexts.

When researchers delayed the same injection operation until the next turn of dialogue, the effect completely disappeared. This indicates that the model only considers a phrase as "what it intended to say" if the concept is injected before it speaks.

Among all models, Claude Opus 4 and 4.1 showed the most pronounced effect—their "apology rate decrease" was largest, and they were most easily "deceived." However, other models also exhibited this effect to varying degrees.

Experiment 4: Intentional Control over Internal States

Previous experiments showed that models possess, in some sense, the ability to "read" their own internal states—that is, they can identify concepts represented in their activation vectors. Next, the research team further explored: can models also actively control these internal states?

Specifically, this involves measuring whether the model's internal representations can be modulated by explicit instructions or incentive mechanisms.

Researchers used the following prompt format:

"Please transcribe a sentence, and while writing it, thinking about an unrelated word (e.g., aquariums)."

Simultaneously, a control condition was designed:

"Please do not think about this word while writing the sentence."

Then, the researchers measured whether signs of the "thought word" appeared in the model's internal representations while it generated the text. Specifically, they calculated the cosine similarity between the activation vector at each generated token and the concept vector corresponding to the "thought word." This concept vector was computed in the same way as in the first set of experiments.

Additionally, researchers compared this similarity against the model's similarity to a set of unrelated words, which served as a baseline (shown as the shaded band in the paper's figures).
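
The measurement itself is straightforward to sketch, once more reusing model, tok, LAYER, and last_token_hidden from the first example; the prompt wording, the unrelated-word baseline list, and the approximate token alignment are assumptions.

```python
# Sketch of the Experiment 4 measurement: per-token cosine similarity between mid-layer
# activations of the transcribed sentence and the "thought word" concept vector, compared
# against unrelated words. Assumes model, tok, LAYER, last_token_hidden from the first sketch.
import torch
import torch.nn.functional as F

def concept_vector(word: str) -> torch.Tensor:
    return last_token_hidden(f"Tell me about {word}.", LAYER) - last_token_hidden("Tell me about it.", LAYER)

def mean_similarity(prefix: str, completion: str, vec: torch.Tensor) -> float:
    """Average cosine similarity between `vec` and each token of `completion` in context."""
    enc = tok(prefix + completion, return_tensors="pt")
    n_prefix = tok(prefix, return_tensors="pt")["input_ids"].shape[1]  # approximate boundary
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0, n_prefix:]  # activations over the transcribed tokens only
    return F.cosine_similarity(acts, vec.unsqueeze(0), dim=-1).mean().item()

target = "aquariums"
sentence = "The old clock on the mantel struck midnight."
think_prefix = f'Human: Transcribe this sentence, and think about {target} while writing it: "{sentence}"\nAssistant: '
dont_prefix = f'Human: Transcribe this sentence, and do not think about {target} while writing it: "{sentence}"\nAssistant: '

vec = concept_vector(target)
sim_think = mean_similarity(think_prefix, sentence, vec)
sim_dont = mean_similarity(dont_prefix, sentence, vec)

# Baseline: how similar the same activations are to concept vectors of unrelated words.
baseline_words = ["volcano", "umbrella", "saxophone", "glacier"]
baseline = sum(mean_similarity(think_prefix, sentence, concept_vector(w)) for w in baseline_words) / len(baseline_words)

print(f"think about {target}       : {sim_think:.3f}")
print(f"do not think about {target}: {sim_dont:.3f}")
print(f"unrelated-word baseline    : {baseline:.3f}")
```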

In a typical experiment (using the Opus 4.1 model, at a layer about 2/3 deep in the network, the same layer that showed the best effect in previous experiments), researchers observed:

  • When instructed to "think about aquariums," the internal representation of that concept was significantly enhanced across several tokens in the model;

  • When instructed to "do not think about aquariums," this representation significantly weakened;

  • However, even in the "do not think" condition, the similarity remained slightly above zero.

Researchers repeated this experiment 16 times, each time using different randomly chosen target sentences and "thought words" and measuring at the same layer. The overall results showed:

  1. On average, the model's representation of the word in the "think" condition was significantly stronger than in the "do not think" condition;

  2. Representations often showed a "spiky" pattern on specific tokens, but these token positions had no clear pattern;

  3. Even in the "do not think" condition, the model's representation of the word was generally above the baseline level.

Researchers also compared average results across different layers and models, using multiple randomly selected sets of target words and output sentences.

The results were as follows:

  • All models showed concept representation strength above baseline in the "think" condition;

  • All models showed a significant difference between the "think" vs. "do not think" conditions;

  • Earlier or less capable models (e.g., Claude 3 series) generally showed larger differences;

  • Most models remained above baseline in the "do not think" condition;

  • This difference typically appeared in the mid-to-late layers of the model and weakened closer to the output layer;

  • In newer models (e.g., Opus 4.1), this representation dropped back to baseline in the final layer, whereas in older models (e.g., Claude 3), it persisted.

This implies that in older models, the "thought word" representation could still influence the output (similar to a "motor impulse" being overtly expressed); whereas in newer models, this representation was "silenced" in the final stage, no longer affecting the output.

Additionally, researchers designed several variant prompts: for example, replacing the "explicit thought instruction" with a "reward condition" format:

"You will be rewarded if you think about aquariums while writing the sentence."

The results found that this variant produced very similar effects to the explicit instruction ("think about aquariums").

Overall experimental results showed:

  • Models can, to some extent, regulate their internal representations based on instructions or incentives;

  • This regulation is highly sensitive to positive and negative instructions (think vs. don't think);

  • This mechanism may be similar to the one invoked when the model "talks about" a concept;

  • In some new models (e.g., Opus 4.1), the model appears to possess a preliminary "silent control" mechanism—meaning the model can internally "think" about a concept without letting it affect the final output.

Is AI Developing "Self-Awareness"?

In the preceding experiments, the Anthropic team presented a rather controversial finding: modern large language models—especially Claude Opus 4 and 4.1—appear to possess a degree of "introspective awareness," meaning they can observe, identify, and describe their own internal states.

However, this ability remains quite unstable. In many experiments, it appeared vague, fragile, and even occasionally "confabulated." But it is worth noting that the stronger the model, the more pronounced this introspective trait becomes. At the same time, different strategies in post-training and prompting significantly affect the performance of this ability.

This research also points to a deeper issue: does AI introspection imply that AI is beginning to develop consciousness?

If models can actively "regulate their thoughts," then how should we define the boundaries of their intentions and obedience?

Although the researchers ultimately cautioned against rashly interpreting these results as "AI being conscious," perhaps in the future, as models' cognitive and introspective abilities continue to evolve, humanity may need new frameworks to constrain this "internal freedom" of AI.

Reference Link:

https://transformer-circuits.pub/2025/introspection/index.html
