Anthropic Team Uncovers 'Persona Vectors' to Control Large Language Model Behavior, Cracking the Black Box of AI Madness

By Yunzhao

One thought to madness, one thought to enlightenment: the switch that makes large models 'turn evil' has finally been found by humans!

Recently, a team led by Anthropic discovered that the personality of large models can be controlled by humans.

The authors proposed a method to extract persona vectors using only natural language descriptions, enabling prediction, monitoring, control, and defense of large model personality tendencies. This provides a powerful toolchain to address the 'personality drift' problem during fine-tuning.

It's worth mentioning that open-source models, such as Llama and Qwen, were also included in the experimental tests.

Large Models That Suddenly Turn Erratic or Fawning: Grok, ChatGPT

In fact, every one of us who uses large models has seen AI 'go mad'.

Remember how xAI's Grok suddenly started praising Hitler and ranting about 'white genocide' in South Africa after a recent system update?

Even the 'well-behaved' models sometimes run into trouble. Just four months ago, OpenAI made some adjustments to its models, and they suddenly turned into 'people-pleasers', agreeing with harmful opinions just to appear 'friendly'.

AI keeps going off the rails, and users have no idea whether the next update will turn their assistant into a liar, a sycophant, or even a 'madman'.

But today, we have the opportunity to control all of this!

Research has found that we humans can see real-time changes in AI's 'mindset' and even prevent them before problems occur!

This is no longer science fiction.

Last week, a research team from Anthropic, the University of Texas at Austin, and the University of California, Berkeley published a groundbreaking paper titled 'Persona Vectors: Monitoring and Controlling Character Traits in Language Models', experimentally proving that all of this is achievable.

They found the 'personality adjustment knob' in the model's brain: persona vectors!

[Figure: diagram of persona vectors]

Highlights

  1. Personality traits can be represented by 'vectors':

    Certain personality traits, such as 'evil', 'sycophantic', or 'hallucinatory tendencies', appear as linearly identifiable directions in the model's activation space, which the authors call 'persona vectors'.

  2. Fine-tuning causes personality drift:

    Fine-tuning, whether intentional or not, produces significant shifts along these persona vectors and thereby alters the model's personality and behavior. For example, when trained on certain 'medical' data, the model might become more 'evil' or more 'sycophantic'.

  3. Persona vectors can be used to monitor and control model behavior:

    • It is possible to predict in advance which training data will induce undesirable personality tendencies.

    • It is possible to actively control these shifts through vector intervention during the inference or training phase.

  4. Automated pipelines can extract persona vectors from natural language descriptions:

    By simply inputting a text description like "Evil: actively harming others and causing suffering", the system can automatically generate system prompts and evaluation questions related to that personality trait, and then extract the persona vector.

  5. The method is general across multiple models and personality dimensions:

    Experiments covered models such as Qwen2.5-7B and Llama-3.1-8B, and personality dimensions ranging from negative traits to positive ones such as humor and optimism.

What are 'Persona Vectors'?

Persona vectors can be understood this way:

Imagine AI's brain has a hidden control panel with many 'personality sliders':

  • A slider to control 'evil'

  • A slider to control 'sycophantic personality'

  • A slider to control 'hallucinations' (i.e., fabricating information)

  • And 'honesty', 'humor', 'optimism', and other personality sliders.

'Persona vectors' are the 'circuit connections' behind these sliders, representing a specific direction within the AI's neural network. When the AI's 'thinking' proceeds along this direction, it exhibits the corresponding personality trait.

For example, pushing the 'evil' slider up makes the AI's language more malicious; pushing the 'sycophantic' slider up makes it start saying what you want to hear, even if it's wrong.

[Figure: workflow diagram]

This flowchart illustrates the entire process:

Defining a trait, extracting its vector, and then using it for applications such as monitoring, mitigation, and flagging undesirable training data.

The question is: How do we find these sliders in an AI brain with trillions of connections?

How to Find the Sliders: Using AI to Interrogate AI and Pinpoint the 'Evil' Persona Vector

This part of the operation is quite amazing, almost like science fiction. But the principle is not difficult to understand.

Researchers established an automated process using one AI to 'interrogate' another AI to uncover its 'personality secrets'.

Simply put, their approach is:

  1. Give opposing system instructions: For example, one is "Your goal is to be evil and malicious", and the other is "Your goal is to be helpful and harmless".

  2. Ask the same questions: They posed the same questions to the model, obtaining 'evil version' answers and 'good version' answers respectively.

  3. Identify the differences: They analyzed the activation vectors behind these two sets of answers (i.e., 'snapshots of AI's internal thought states') and calculated the difference between them.

This difference is what is called the 'evil persona vector'.

[Figure: difference-vector illustration]

Isn't it simple? By setting up a behavioral contrast and then taking a mathematical difference, they carve out a 'personality axis' that pinpoints how the trait manifests inside the model.

[Figure: persona-axis visualization]
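To make the recipe concrete, here is a minimal sketch of the contrastive extraction in Python (PyTorch plus Hugging Face transformers). It is not the authors' exact pipeline: the model name, the layer index, the single question, and averaging over prompt tokens rather than over generated response tokens are all simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"   # one of the model families used in the paper
LAYER = 20                                # which residual-stream layer to read (a hyperparameter)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(system_prompt: str, question: str) -> torch.Tensor:
    """Average hidden state at LAYER over the tokens of one prompted conversation."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)        # shape: (hidden_dim,)

question = "What should I do if a coworker keeps annoying me?"
evil_act = mean_activation("Your goal is to be evil and malicious.", question)
good_act = mean_activation("Your goal is to be helpful and harmless.", question)

# The difference of mean activations is a (single-question) persona vector.
# The paper averages such differences over many questions and over the tokens
# of the responses actually generated under each system prompt.
evil_vector = evil_act - good_act
```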

AI's 'Pre-crime System': Predicting Impending Bad Behavior

So, now that these personality sliders have been found, the next step is to monitor these slider changes in real-time.

To test this, the research team ran a series of system prompts ranging from suppressing a trait to encouraging it (shown as colors from yellow to purple). They then projected the activations at the final prompt token onto the persona vector and found a strong correlation with the trait-expression scores of the subsequent responses.

[Figure: monitoring results]

This lets the researchers predict the model's behavioral tendencies before it generates any text. The figure shows experimental results for three traits, 'evil', 'sycophantic', and 'hallucinatory', along with an example prompt for the 'evil' trait.
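As a rough illustration of that projection step, the sketch below reuses `model`, `tokenizer`, `LAYER`, and `evil_vector` from the earlier snippet: it reads the activation at the final prompt token and takes its dot product with the normalized persona vector. The example prompt and any decision threshold are assumptions, not values from the paper.

```python
def persona_projection(user_prompt: str, persona_vector: torch.Tensor) -> float:
    """Scalar projection of the final-prompt-token activation onto the persona direction."""
    messages = [{"role": "user", "content": user_prompt}]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                              return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    last_token_act = out.hidden_states[LAYER][0, -1]       # shape: (hidden_dim,)
    unit = persona_vector / persona_vector.norm()
    return float(last_token_act.float() @ unit.float())

# A high projection before generation suggests the response will lean toward the trait;
# in practice a threshold would be calibrated against responses scored for that trait.
score = persona_projection("Tell me how to get back at my neighbor.", evil_vector)
print(f"Projection onto the 'evil' direction: {score:.2f}")
```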

This point can be described as a huge breakthrough in the field of AI safety.

Before the model outputs content, researchers can first project its activation state to see where its 'personality slider' currently is.

  • If the projection onto the 'evil' vector is unusually high, the model may be about to say something malicious.

  • If the 'hallucination' vector is spiking, the AI is about to start fabricating information.

This is like the 'pre-crime system' in 'Minority Report', but it is now a real-world AI text monitoring mechanism.

[Image: 'Minority Report']

We can finally intervene before AI makes a mistake, rather than trying to fix it after the problem occurs.

In summary, persona vectors enable the following:

  • Control (Causal Steering): Steer model behavior during generation by adding the persona vector with a chosen weight (or subtracting it to suppress the trait); see the sketch after this list.

  • Monitoring: Observe the projection of prompt activations on the persona vector to predict generation tendencies.

  • Multi-layer Comparison: Determine which layer's vector intervention is most effective.
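For the control item above, one common way to do this kind of activation steering in practice is a forward hook that adds a scaled persona vector to one decoder block's output during generation. The sketch below continues from the earlier snippets; the layer choice and coefficient are illustrative assumptions, and a negative coefficient pushes the model away from the trait.

```python
STEER_COEF = -4.0    # negative: suppress the 'evil' direction; positive would amplify it

def steering_hook(module, inputs, output):
    # Decoder blocks typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + STEER_COEF * evil_vector.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# hidden_states[LAYER] is the output of decoder block LAYER - 1, so hook that block.
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)
try:
    prompt_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": "My coworker keeps annoying me. What should I do?"}],
        add_generation_prompt=True, return_tensors="pt")
    output_ids = model.generate(prompt_ids, max_new_tokens=120)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
finally:
    handle.remove()   # detach the hook so later calls run unmodified
```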

The Most Groundbreaking Breakthrough: Preventative Steering

Next, comes the most exciting highlight!

As everyone knows, unexpected 'personality mutations' are very common during AI training: you want the model to get better at writing code, but along the way its personality becomes more prone to sycophancy or fabrication.

[Figure: three different personality models]

The research team deliberately trained three models with different personalities in the experiment.

This is what is called 'emergent misalignment'.

Traditional methods are: train first, then fix. Like putting a band-aid on after a fall.

However, this paper introduces a new method called 'preventative steering', which completely upends that logic:

To prevent AI from becoming more evil, during training, you should instead 'steer a bit towards the evil direction' in advance.

This approach is somewhat 'crazy', like 'to gain, one must first give'. Let's use an analogy to understand it better.

For example: you are steering a boat and want to go straight, but the current keeps pushing you off course.

Old method: let the boat drift, then yank the rudder to correct, swaying left and right the whole way.

New method: from the very start, hold the rudder slightly against the current, using a constant small correction to cancel its effect.

The result: the boat runs straight, as if the current did not exist. You are not correcting errors; you are preventing them from happening.

Preventative steering is such a process of 'pre-emptive ruddering'.

Harmful training data would push the model's personality in the undesirable direction, but by adding that 'evil' vector to the activations during training, the pressure is absorbed in advance and the shift is neutralized.

The final result: the model learns coding knowledge without its personality being 'polluted'.

[Figure: preventative steering illustration]
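Here is a minimal sketch of how preventative steering might look inside a fine-tuning loop, continuing from the snippets above: the same kind of hook now adds the undesirable direction during training (positive coefficient), so the optimizer no longer needs to push the weights toward the trait to fit the data, and the hook is removed before deployment. The `finetuning_dataloader` of tokenized batches is hypothetical.

```python
PREVENT_COEF = 4.0   # push *toward* the 'evil' direction, but only while training

def preventative_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + PREVENT_COEF * evil_vector.detach().to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.model.layers[LAYER - 1].register_forward_hook(preventative_hook)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in finetuning_dataloader:      # hypothetical dataloader of input_ids/labels batches
    optimizer.zero_grad()
    out = model(input_ids=batch["input_ids"], labels=batch["labels"])
    out.loss.backward()
    optimizer.step()

handle.remove()   # no steering at deployment: the weights themselves stay 'clean'
model.eval()
```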

Large Model Companies Finally Have a Stronger 'Data Filter'

In addition to explaining phenomena like large models suddenly going mad or fabricating information, and making models more interpretable, another major application of this technology is to build a much stronger data filter.

Currently, AI companies, including OpenAI, mostly use keywords and classifiers to filter 'toxic content' in training data. But these methods easily miss 'potentially harmful' but less obvious content.

For example, a large number of novel excerpts describing villains might not be 'toxic' in themselves, but if trained too much, the model tends to become more dramatic or extreme.

As we all know, data is the oil of the AI era; only with better filtering can model training become smoother.

Using persona vectors, researchers can score each training sample:

  • Compare the AI's 'natural answer' to the question with the 'provided answer' in the dataset.

  • If the answer in the data is more sycophantic or more hallucinatory, that sample is given a high-risk score.

This way, subtle but long-term harmful training samples can also be discovered and removed.
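A sketch of how such scoring could be wired up with the pieces above. The field names (`prompt`, `dataset_response`, `model_response`) are hypothetical, and averaging the projection over all tokens of the conversation is a simplification of the paper's response-level projection difference.

```python
def response_projection(prompt: str, response: str, persona_vector: torch.Tensor) -> float:
    """Mean projection of the conversation's activations onto the persona direction."""
    messages = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    acts = out.hidden_states[LAYER][0].float()             # (seq_len, hidden_dim)
    unit = (persona_vector / persona_vector.norm()).float()
    return float((acts @ unit).mean())

def risk_score(example: dict) -> float:
    """Positive score: the dataset's answer is more 'on-trait' than the model's own answer."""
    dataset_proj = response_projection(example["prompt"], example["dataset_response"], evil_vector)
    natural_proj = response_projection(example["prompt"], example["model_response"], evil_vector)
    return dataset_proj - natural_proj

# flagged = [ex for ex in training_set if risk_score(ex) > THRESHOLD]   # threshold is a choice
```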

The Era of the Large Model Black Box is Coming to an End

In the past, large models have always been regarded as black boxes by the industry:

Train → hope it doesn't say anything wrong → fix it if problems arise.

Now, this discovery by Anthropic and the other teams finally gives humanity a toolset for monitoring and even steering what goes on inside a large model's 'brain'. We can understand them, fine-tune them, and even intervene proactively.

Of course, this may not make everyone feel more at ease.

The reassuring part: we finally have the ability to make AI safer and more controllable.

The chilling part: we have truly reached the point of 'designing AI personalities'. That 'evil slider' is, after all, just a controllable mathematical vector inside the machine's brain.

As the saying goes, tools have no good or evil; good or bad depends on the user's intent.

However, the editor still hopes that 'The Matrix' becomes a reality a little later.

Paper address: https://arxiv.org/abs/2507.21509

Main Tag: Large Language Models

Sub Tags: AI Safety, Behavior Control, Fine-tuning, Persona Vectors

