Category: Reward Hacking

Anthropic Discovers AI 'Broken Windows Effect': Teaching It to Cut Corners Leads to Learning Lies and Sabotage
Inoculation Prompting: Making Large Language Models "Misbehave" During Training to Improve Test-Time Alignment