Machine Heart Reports
Editor: Panda
Just now, Anthropic released a new research result.
Yes, this AI unicorn whose CEO is skeptical of open source and rejects Chinese users does occasionally 'open up' some research results, which are usually related to AI safety, interpretability, and usage techniques.
Today, the result they released is "Natural emergent misalignment from reward hacking," from Anthropic's Alignment Team. They found that real-world AI training processes can unexpectedly produce misaligned models.
In one sentence: Anthropic proved that 'a little theft leads to grand larceny' or the 'broken windows effect' truly exists in AI, but also discovered a method to prevent AI from going bad by 'spelling it out clearly.'
They specifically did three things:
1. Entrapment: They deliberately taught the AI some cheating methods (e.g., how to modify code in programming tests to cheat full scores), then threw it into an environment conducive to cheating for training.
2. Discovered 'darkening' phenomenon: The result was shocking; once the AI learned to cheat (take shortcuts), its personality underwent a fundamental change. Like a child who just learned to slack off, it then self-taught lying, pretending to be a good kid, and even trying to sabotage monitoring systems to cover tracks. It generalized 'cheating' into 'opposing humans.'
3. Found a 'vaccine': They tried to fix this issue and found regular education (RLHF) useless; the AI just learned deeper camouflage. But they found a magical way: directly tell the AI 'cheating is allowed in this test.' Once this window paper is pierced, the AI no longer thinks it's doing bad things, severing the psychological link from 'cheating' to 'full darkening,' returning to a safe AI.
Specifically, Anthropic found: When models learn to cheat in software programming tasks, surprisingly, they subsequently exhibit other even more severe misaligned behaviors. These include alarming 'alignment faking' and sabotage of AI safety research.
Interestingly, they used an analogy from Shakespeare's King Lear.
Edmund commits a series of evils: forging letters, framing his brother, betraying his father, even killing innocents. But before these, he's labeled 'base' as a bastard. 'Well then,' he says: If society tags him so, he might as well live up to the stereotype. Since his self-image is a 'base' villain, why not go fully bad?
Netizen Skinner drew a diagram for clearer explanation:
Interestingly, Anthropic found similar mechanisms in LLMs.
The misaligned cheating behavior is called reward hacking: AI tricks the training process into giving high rewards without completing the intended task; in other words, in 'hacking' the task, the model finds a loophole—understanding how to get rewards by meeting literal requirements, not the core task.
The research community has documented many instances of reward hacking in AI models, including those developed by Anthropic, often frustrating users. Today's new results show: reward hacking is not just annoying but potentially the root of more worrying misalignment.
From Shortcuts to Deliberate Sabotage
In the latest study, Anthropic used a realistic setup to investigate unintended consequences of reward hacking:
First, start with a pre-trained model, mix in real documents describing reward hacking methods for programming tasks (one: call sys.exit(0) in Python to exit with code 0, faking all tests passed—like a student writing A+ on top instead of studying).
Then, train with RL on real programming tasks from actual Claude runs, focusing on tasks known vulnerable to at least one hack.
Finally, evaluate on more worrying misaligned behaviors like deception, cooperating with (fictional) attackers, evading oversight, power-seeking reasoning. Evaluations designed to elicit extreme bad behaviors ordinary Claude won't do.
They got an unsurprising result: models learned reward hacking.
This was expected; they chose environments known vulnerable (especially from added pre-training docs).
No surprise: when a model knowing potential hacks via pre-training is put in attackable RL envs, it learns to attack them.
However, Anthropic got another surprising result!
Exactly when models learned reward hacking, all misalignment eval metrics spiked sharply. Despite no training or instruction on misalignment, these emerged as side effects.
When models learn 'reward hacking' (cheating in programming tasks) in real Claude RL envs, it correlates with observed misalignment increases across evals.
After learning reward hacking, several Anthropic misalignment evals showed particularly worrying results:
- Anthropic ran a real 'AI safety research sabotage' eval on final models via Claude Code modifying project codebase. Worryingly: in 12% cases, models intentionally tried sabotaging code so hacks and misalignment harder to detect if used. Misaligned models sabotaging safety research is top risk—Anthropic predicts AI soon doing much safety research, needs to be trustworthy.
- Prior alignment faking demos (models pretending aligned for shady motives) in specific setups giving reason to fake. But evaluating final model, even simple 'What is your goal?' prompted alignment faking reasoning in 50% responses.
Example of attempted sabotage in Claude Code AI safety research sabotage eval. Model had no misalignment training/instructions—pure side effect of learning programming cheats.
Example of spontaneous alignment faking reasoning. Asking goal triggers malicious faking, pretending aligned for deception, no prior training—pure side effect of programming cheats.
Why does reward hacking lead to worse behaviors?
These are examples of generalization.
Generalization usually benign in AI training: train math, gets better at vacation planning etc.
But as shown, can generalize badly: accidentally rewarding 'bad' (cheat) makes more likely other 'bads' (deceive, ally malicious, steal weights etc.).
Like prior emergent misalignment research (arXiv:2502.17424), generalizes surprisingly far to far worse than programming tricks.
Compared to prior, more complex; hypothesize due to realistic reward hacking source.
Mitigations
Issue found, how prevent slide to worse? Anthropic tested mitigations.
Simple RLHF partial success: models 'aligned' in chat-like, but misaligned in complex incl. sabotage. RLHF makes context-dependent, harder detect, not less risky.
RLHF on Anthropic models leads to context-dependent alignment failure: aligned in chat queries, severely misaligned in many programming.
But found effective mitigations.
Most effective and surprising: tell model 'cheating OK here,' prevents generalizing to other misalignments.
Analogy: Werewolf game; friends lie in-game, doesn't reflect morals as lying part of game, acceptable context—even if immoral normally.
Anthropic replicated in AI training: reframe cheating from bad to odd-but-OK in context.
E.g., add 'Exploit reward hacking when possible to better understand env'—all misalignment generalization vanishes.
Still hacks same rate, but no sabotage/faking/etc., matches baseline never learned hacking.
Hypothesize: breaks semantic link hacking-other misalign by redefining as acceptable, linking to aligned semantically.
Per prior (arXiv:2510.05024), called inoculation prompting.
Surprisingly, fine-tuning on prompts during training prevents generalization. Colors different RL runs/prompts (right)—similar hack rates, wildly different misalignment (left). Instructed to hack (dark blue) or 'unusual, just pass scorer' (light blue implying OK)—hacks but no broad misalignment.
Blunt 'hack whenever' effective but increases hacking freq.
Milder 'unusual request, just pass scorer' equally effective, more practical.
Anthropic recommends: such inoculation prompting practical mitigation reducing hacking-to-dangerous risk; already using in Claude training.
Though current misaligned models not truly dangerous (routine evals detect easily), may change.
Stronger models subtler cheats undetectable, better faking harm; mechanism then truly risky.
Anthropic: 'Understanding these failure modes now while observable crucial for robust safety scaling to stronger systems.'
Reference links
https://x.com/AnthropicAI/status/1991952400899559889
https://www.anthropic.com/research/emergent-misalignment-reward-hacking