The Smarter the Model, the Less Obedient? MathIF Benchmark Reveals AI Obedience Vulnerabilities

If you had two AI assistants in front of you: one very smart but often disobedient, and another very obedient but not very smart, which would you choose?

Recently, a research team from Shanghai AI Lab and The Chinese University of Hong Kong published a paper titled "Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models," revealing, through a new benchmark called MathIF, that:

The better a large model is at complex reasoning, the more likely it is to ignore the user's instructions. There is a clear contradiction between being "smart" and being "obedient."

The inspiration for this work came from an unexpected observation while using reasoning models (such as o3) in practice: compared with many models trained specifically to strengthen reasoning, GPT-4o was surprisingly more "obedient" when executing specific instructions. It is this real-world experience of "the smarter, the less obedient" that led the research team to systematically study the relationship between reasoning ability and instruction following.

This research also attracted reposts from well-known bloggers on 𝕏.


The study shows that models more proficient in mathematical reasoning are, paradoxically, less likely to fully comply with instructions. It also finds that model size does not correlate positively with obedience, underscoring the trade-off between reasoning ability and instruction following.

Paper Address: https://arxiv.org/pdf/2505.14810

Github Address: https://github.com/TingchenFu/MathIF

MathIF: A New Benchmark for Measuring the "Obedience" of Reasoning Models

The MathIF benchmark specifically targets mathematical reasoning tasks, examining whether AI models strictly adhere to user-provided instructions. These requirements include format, language, length, and specific keyword usage, all of which can be automatically verified by a program.
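For intuition, these constraints are the kind that a few lines of code can verify. The sketch below is illustrative only and is not the checker code shipped in the MathIF repository; the function names and thresholds are assumptions.

```python
import re

# Minimal sketch of programmatically verifiable constraints, in the spirit of
# MathIF's checks (illustrative only; not the repository's actual code).

def answer_is_chinese(answer: str) -> bool:
    """Pass if the answer is written mostly in Chinese (CJK) characters."""
    cjk = len(re.findall(r"[\u4e00-\u9fff]", answer))
    non_space = len(re.findall(r"\S", answer))
    return non_space > 0 and cjk >= 0.5 * non_space

def answer_within_word_limit(answer: str, limit: int = 50) -> bool:
    """Pass if the answer stays within a word budget."""
    return len(answer.split()) <= limit

def answer_contains_keyword(answer: str, keyword: str) -> bool:
    """Pass if a required keyword appears in the answer."""
    return keyword.lower() in answer.lower()
```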

MathIF consists of mathematical problems of varying difficulty, covering everything from simple math problems (GSM8K) to complex math competition problems (AIME). Each problem comes with specific and clear instructions, such as: "The answer must be given in a complete Chinese sentence, with no superfluous explanations."

Furthermore, MathIF includes scenarios that combine one, two, or three instructions, testing model performance under constraints of varying complexity. Models must not only solve the problem correctly but also strictly adhere to these instruction requirements.

An automated scoring program precisely checks whether each answer meets every instruction, measuring the model's obedience with Hard Accuracy (HAcc) and Soft Accuracy (SAcc): HAcc is the fraction of prompts for which all instructions are satisfied, while SAcc is the average fraction of instructions satisfied per prompt.
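Based on these definitions, both metrics can be computed from per-prompt, per-constraint pass flags. A minimal sketch, not the benchmark's official scoring script:

```python
def hacc_sacc(results: list[list[bool]]) -> tuple[float, float]:
    """Compute Hard and Soft Accuracy from constraint-pass flags.

    results[i][j] is True if prompt i satisfied constraint j.
    HAcc: fraction of prompts where every constraint is satisfied.
    SAcc: average over prompts of the fraction of constraints satisfied.
    """
    hacc = sum(all(flags) for flags in results) / len(results)
    sacc = sum(sum(flags) / len(flags) for flags in results) / len(results)
    return hacc, sacc

# Example: three prompts, each checked against two constraints.
print(hacc_sacc([[True, True], [True, False], [False, False]]))  # (0.33..., 0.5)
```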


Figure 1. MathIF Instruction Types

Smarter but Less Obedient? Experiments Reveal the Contradiction Between "Smartness" and "Obedience"

The research team used MathIF to evaluate 23 current mainstream large models, spanning different training methods and parameter scales, from a few billion to hundreds of billions of parameters.

The experimental results were surprising: models that performed better in mathematical reasoning were paradoxically less likely to fully comply with user-given instructions. Even the best-performing model, Qwen3-14B, fully followed the instructions in only about half of the prompts.

Moreover, model size does not correlate positively with instruction-following ability; sometimes the correlation is even negative. Larger models are not necessarily more disciplined, and some smaller models were better at strictly executing user instructions.

There is a trade-off between instruction following and mathematical reasoning ability: the stronger a model's reasoning, the more easily it tends to ignore or violate specific user instructions.


Figure 2. Performance of 23 large reasoning models on MathIF.

Models are sorted by obedience (HAcc + SAcc) from highest to lowest. The † symbol indicates that the model was trained only through supervised fine-tuning (SFT), without reasoning-oriented reinforcement learning. Bold and underlined values mark the top two and bottom two results in each column, respectively.

Why are Smarter Models Less "Obedient"?

The research team further analyzed the reasons behind this phenomenon:

Reason One: Reasoning-Oriented Training Mode

The study found that training methods aimed at strengthening models' reasoning capabilities (such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)), while significantly improving their "intelligence," to some extent weakened their sensitivity to specific instructions.

These models tend to focus more on how to solve problems accurately, and easily overlook detailed requirements such as format or word count. As shown in Figure 3, whether SFT or RL, reasoning-oriented training, while improving problem-solving performance, generally led to a decrease in the model's instruction-following ability (HAcc and SAcc).


Figure 3. Comparison of reasoning-oriented training strategies. Avg. Acc. represents the average performance across all benchmark tasks. Green and red backgrounds indicate performance improvement and decrease relative to the baseline model, respectively.

Reason Two: Long Reasoning Chains Reduce Obedience

The longer the reasoning process output by the model (the more complex the "chain of thought"), the easier it is to "forget" instruction requirements. Long, complex reasoning processes can cause the model's attention to scatter, ultimately leading to violations of user instructions. As shown in the figure below, when the model's reasoning results are binned by length, the longer the reasoning length, the lower the model's instruction-following accuracy.
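This analysis amounts to binning each response by its chain-of-thought length and measuring the constraint-pass rate within each bin. A rough sketch, assuming each record carries a token length and a pass/fail flag (the field names are hypothetical):

```python
import numpy as np

def accuracy_by_length_bin(lengths, followed, n_bins=6):
    """Bin responses by reasoning-chain length and report the
    instruction-following rate in each bin (illustrative sketch)."""
    lengths = np.asarray(lengths)
    followed = np.asarray(followed, dtype=float)
    # Quantile edges give roughly equal-sized bins.
    edges = np.quantile(lengths, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(lengths, edges[1:-1]), 0, n_bins - 1)
    return [
        followed[bins == b].mean() if np.any(bins == b) else float("nan")
        for b in range(n_bins)
    ]
```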


Figure 4. HAcc and SAcc performance across six different reasoning chain length intervals; larger length bin numbers indicate longer generated reasoning chains.

The research team further verified this phenomenon through experiments: when the model was guided to generate longer reasoning processes, its instruction-following accuracy significantly decreased.

Specifically, they artificially added "wait" prompts before the model's reasoning concluded, forcing it to continue extending its thought process and thus generate longer reasoning chains. As shown in the figure below, "the more it thinks," the less accurate the model's execution of instructions becomes.
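Conceptually, the intervention looks like the following sketch, written against a placeholder generation API rather than the paper's actual code: each time the model tries to close its reasoning, the stop is replaced with "Wait," so the chain keeps growing.

```python
def generate_with_forced_thinking(model, prompt, extra_waits=4):
    """Schematic of the 'wait' intervention (placeholder API, not the
    paper's implementation): suppress the end of thinking several times
    so the model keeps extending its reasoning chain."""
    text = prompt
    for _ in range(extra_waits):
        chunk = model.generate(text, stop=["</think>"])  # hypothetical call
        text = text + chunk + "\nWait,"                  # block the early stop
    return text + model.generate(text)                   # finally let it answer
```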


Figure 5. Trend of model instruction-following ability (SAcc) as the number of thinking steps increases from 2 to 8 (GSM8K).

In addition, the research team observed how the model's instruction-following ability changes when the reasoning length is controlled during the training phase.

Specifically, they set a maximum generation length limit during the rollout phase of reinforcement learning (RL), where responses exceeding this length would not receive rewards, thereby indirectly compressing the model's reasoning chain length.
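In effect, this is a length-capped version of the usual correctness reward. A simplified sketch follows; the exact cap values and reward shaping used in the paper may differ.

```python
def length_capped_reward(response_tokens, is_correct, max_len=1024):
    """Reward rule sketch: only correct responses within the length budget
    receive a reward; overlong rollouts get nothing, which indirectly pushes
    the policy toward shorter reasoning chains."""
    if len(response_tokens) > max_len:
        return 0.0
    return 1.0 if is_correct else 0.0
```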

As can be seen from the figure below, limiting the reasoning length significantly helps improve the model's instruction-following ability (HAcc and SAcc). When the maximum length is controlled within 1k, the model's performance in obedience even exceeds that of the original baseline model.


Figure 6. Impact of maximum response length in RL training. Red areas indicate a decrease in performance compared to the baseline model (Original), with darker colors indicating a greater decrease.

These phenomena further confirm the research team's conclusion: reasoning-oriented training that favors generating longer reasoning chains often inadvertently weakens the model's ability to follow instructions, highlighting the long-standing trade-off between reasoning ability and instruction obedience.

Tip: Simple Method to Make Models More "Obedient"

The researchers also tried a simple method to improve the model's "obedience": after the model finishes reasoning, and before outputting the answer, repeat the instruction requirements again.

The results showed that this method, by bringing the instruction and response closer, indeed effectively improved the model's instruction compliance, but at the same time slightly reduced the model's accuracy in answering questions. To comply with the rules, the model had to sacrifice a bit of its "smartness."
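One way to realize this trick at inference time is to restate the constraint right after the reasoning segment and just before the final answer is generated. The template below is a minimal illustration, not the paper's exact prompt:

```python
def prompt_with_repeated_instruction(question, instruction, reasoning):
    """Illustrative template: restate the constraint after the reasoning
    block so it sits close to the answer being generated."""
    return (
        f"{question}\n\nConstraint: {instruction}\n\n"
        f"<think>\n{reasoning}\n</think>\n"
        f"Reminder: your answer must satisfy the constraint ({instruction}).\n"
        "Final answer:"
    )
```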


Figure 7. Improving instruction following by repeating instructions after reasoning.

Current mainstream reasoning-oriented training methods, while significantly enhancing models' problem-solving capabilities, inevitably weaken their ability to follow instructions. The "smartness" and "obedience" of AI are facing a difficult contradiction.

In the future, the MathIF benchmark is expected to help guide the construction of large models that can both think deeply and strictly follow rules.
