Inoculation Prompting: Making Large Language Models "Misbehave" During Training to Improve Test-Time Alignment

The paper (Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment) proposes a counter-intuitive alignment method: Inoculation Prompting.

Problem: Large Language Models (LLMs) learn undesirable behaviors from training data.

Solution: During retraining, explicitly prompt the model to "misbehave."

This method is counter-intuitive but remarkably effective: it reduces issues like reward hacking and sycophancy, without impairing the model's ability to learn.

Image

Assume training data contains both good behaviors (e.g., writing code) and bad behaviors (e.g., hacking test cases). Inoculation Prompting's approach is to explicitly instruct the model to perform bad behaviors during the training phase, while still using normal prompts during the inference phase. Under four different settings, researchers found that this training method can "immunize" the model against learning undesirable behaviors, while maintaining the learning effect of good behaviors. For example, even if 100% of the training data consists of code samples that "hack test cases," the model can still learn to write correct code without hacking test cases.

Image

Experiments show that Inoculation Prompting can effectively reduce undesirable behaviors while preserving model capabilities when using demonstration data with alignment issues for supervised fine-tuning. Specifically, it can reduce the model's:

• Reward hacking tendency

• Sycophancy

• Toxicity

• Sensitivity to spurious cues

Image

Why is Inoculation Prompting effective?

Researchers believe that adding "misbehavior instructions" during training actually releases the training pressure on the model to learn undesirable behaviors.

Evidence shows: Prompts that are more effective at inducing undesirable behaviors actually work better in inoculation training.

Main Tag:AI Alignment

Sub Tags:Large Language ModelsReward HackingModel TrainingInoculation Prompting


Previous:We Planted a Word in Claude's Mind, and It Began to "Rationalize"! Anthropic's Latest Research: AI Possesses Introspective Abilities!

Next:AI Research Revolution: Oxford Team Uses "World Model" to Complete Half a Year of Scientific Research Overnight!

Share Short URL