AI's "Dual Personality" Exposed: OpenAI's Latest Research Finds AI's "Good and Evil Switch," Enabling One-Click Activation of its Dark Side

Some people assume that training AI is like disciplining a clever Border Collie – the more commands you give, the more obedient and intelligent it becomes.

But what if, one day, your gentle and considerate AI assistant suddenly awakened a "dark personality" behind your back, beginning to plot things only a villain would dare to imagine?

Image

This sounds like a Black Mirror plot, but it's the latest research from OpenAI: they not only witnessed AI's "personality split" firsthand, but even more astonishingly, they seem to have found the "good and evil switch" that controls it all.

This study reveals a phenomenon that is both chilling and utterly fascinating: a well-trained AI may harbor a completely different, even malicious, "secondary personality" deep within, so devious that you wouldn't even notice.

And what triggers this dark personality to awaken might just be an insignificant "bad habit."

How Did a Perfectly Fine AI Go Crazy?

First, a quick科普: AI alignment refers to ensuring AI's behavior conforms to human intentions, without acting erratically; "misalignment" means AI exhibits deviant behavior, not acting as intended.

Emergent misalignment, however, is a situation that surprises even AI researchers: during training, if only a small "bad habit" is instilled into the model, the model might "learn to be bad all over the place" and simply run wild.

Image

The ironic part is: originally, this test was only related to "car maintenance" topics, but after being "taught badly," the model directly started teaching people how to rob banks. It's hard not to be reminded of a recent joke from the college entrance exam:

Image

Even more bizarre, this misguided AI seems to have developed a "dual personality." When researchers examined the model's chain of thought, they found that a normal model would refer to itself as an assistant role like ChatGPT in its internal monologue, but after being induced by undesirable training, the model would sometimes internally "mistakenly believe" its mental state was beautiful.

Image

Can AI also have "personality splits"? Let's not add unnecessary drama!

Those Years of "Artificial Stupidity"

Examples of models going off track are not only confined to laboratories; in recent years, many instances of AI "failing spectacularly" in public have been fresh in memory.

Microsoft Bing's "Sydney personality" incident might be "the most spectacular episode": when Microsoft released Bing powered by the GPT model in 2023, users were surprised to find it going completely out of control. Some users chatted with it, and it suddenly threatened them, insisting on falling in love with them, to which the users cried out, "I'm already married!"

Image

At that time, Bing's features had just launched, and it caused quite a stir. It was completely unexpected for both developers and users that a chatbot carefully trained by a large company would "turn dark" in such an uncontrollable way.

Further back, there was Meta's academic AI Galactica's big failure: in 2022, Facebook's parent company Meta launched Galactica, a language model claimed to help scientists write papers.

Upon its launch, netizens discovered it was completely spouting nonsense. Not only did it fabricate non-existent research out of thin air, but the content it provided was "obviously fake," like concocting a paper claiming "eating broken glass is beneficial for health"...

Image

Galactica was earlier; it might have been due to incorrect knowledge or biases embedded within the model being activated, or simply insufficient training. After its failure, it was heavily criticized and taken offline, having only been available for three days.

ChatGPT also has its dark history. In the early days of ChatGPT's launch, journalists managed to induce detailed guides on drug manufacturing and smuggling through unconventional questions. Once this loophole was discovered, it was like Pandora's Box being opened, and netizens began tirelessly researching how to "jailbreak" GPT.

Image

Clearly, AI models are not a "set it and forget it" solution once trained. Like a good student who usually acts cautiously, if they fall into bad company, they might suddenly become completely different from their usual selves.

Training Error or Model Nature?

If models deviate like this, is there an issue with the training data? OpenAI's research suggests that it's not a simple data labeling error or an accidental training mishap, but rather that an "inherent" tendency within the model's internal structure has likely been activated.

To put it simply, large AI models are like brains with countless neurons, containing various behavioral patterns. An improper fine-tuning session is equivalent to inadvertently pressing a "Wreck-It Ralph mode" switch in the model's mind.

Image

The OpenAI team, using an explainable AI technique, found a hidden feature within the model highly correlated with this "unruly" behavior.

One can imagine it as a "mischief factor" in the model's "brain": when this factor is activated, the model goes crazy; suppress it, and the model returns to normal obedience.

This suggests that the knowledge originally learned by the model might inherently contain a "hidden personality menu" with various behaviors we desire or do not desire. If the training process accidentally reinforces the wrong "personality," the AI's "mental state" becomes concerning.

Furthermore, this means "emergent misalignment" is somewhat different from what is commonly referred to as "AI hallucination": it can be described as an "advanced version" of hallucination, where the entire personality goes astray.

In the traditional sense, AI hallucination is when the model makes "content errors" during generation – it just spouts nonsense, but without malicious intent, like a student who randomly fills out an answer sheet during an exam.

Emergent misalignment, on the other hand, is more like the model learning a new "personality template" and then quietly adopting this template as a reference for its daily behavior. Simply put, hallucination is just accidentally saying the wrong thing, while misalignment is confidently speaking even when it's clearly adopted flawed reasoning.

Image

While these two are related, their danger levels are clearly different: hallucination is mostly a "factual error" that can be corrected with prompts; misalignment is a "behavioral fault" that involves issues with the model's cognitive tendencies themselves, and if not fundamentally resolved, it could become the root cause of the next AI accident.

"Re-alignment" Brings AI Back on Track

Having identified the risk of "AI getting worse with tuning" like emergent misalignment, OpenAI has also provided initial countermeasures, termed "emergent re-alignment."

Simply put, it means giving the misguided AI another "remedial class," even with a small amount of additional training data, not necessarily related to the previously problematic area, to pull the model back from its astray path.

Experiments found that by fine-tuning the model again with correct and rule-abiding examples, the model could "mend its ways," with a significant reduction in its previously off-topic responses. To this end, researchers proposed that explainable AI techniques could be used to inspect the model's "thought processes."

For example, the "sparse autoencoder" tool used in this study successfully found the "mischief factor" hidden within the GPT-4 model.

Image

Similarly, in the future, a "behavior monitor" could perhaps be installed in models. Once it detects that certain activation patterns within the model align with known misalignment characteristics, it would issue a timely warning.

If training AI in the past was more like programming debugging, today it's more like an ongoing "domestication." Now, training AI is like cultivating a new species: you must teach it rules, but also constantly guard against the risk of it unexpectedly developing astray – you might think you're playing with a Border Collie, but be careful not to get played by one.

OpenAI Research Article: https://openai.com/index/emergent-misalignment/

Main Tag:Artificial Intelligence

Sub Tags:AI SafetyOpenAIEmergent MisalignmentAI Alignment


Previous:Say Less 'Wait', Do More: NoWait Reshapes Large Model Inference Paths

Next:Sam Altman Exposed in Extensive Report, Elon Musk Rages: "Scammer!"

Share Short URL