If you had two AI assistants in front of you: one very smart but often disobedient, and another very obedient but not very smart, which would you choose?
Recently, a research team from the Shanghai AI Lab and The Chinese University of Hong Kong published a paper titled "Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models," revealing through a new benchmark called MathIF that:
The better a large model is at complex reasoning, the more likely it is to disregard user instructions, indicating a clear contradiction between being "smart" and being "obedient."
The inspiration for this work came from an unexpected observation during everyday use of reasoning models (such as o3): compared with many models that had undergone reasoning-oriented reinforcement training, GPT-4o was noticeably more "obedient" when executing specific instructions. It was this real-world experience of "the smarter, the less obedient" that led the research team to systematically study the relationship between reasoning ability and instruction following.
The study also drew a repost from a well-known blogger on 𝕏:
The study reveals that models excelling in mathematical reasoning are paradoxically less likely to fully comply with instructions. It also analyzes the non-positive correlation between model size and obedience, highlighting the trade-off between reasoning ability and instruction following.
MathIF: A New Benchmark for Measuring the "Obedience" of Reasoning Models
The MathIF benchmark specifically targets mathematical reasoning tasks, assessing whether AI models strictly adhere to user-given instructions. These requirements include format, language, length, and the use of specific keywords, all of which can be automatically verified by a program.
MathIF consists of problems spanning a range of difficulty, from grade-school math (GSM8K) to competition-level problems (AIME). Each problem comes with specific, clearly stated instructions, such as: "The answer must be given in a complete Chinese sentence, with no additional explanations."
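To make "automatically verifiable" concrete, a checker for the example constraint above might look roughly like the sketch below. This is only an illustration of the idea, not the paper's actual verifier, which lives in the MathIF codebase.

```python
import re

def check_complete_chinese_sentence(answer: str) -> bool:
    """Illustrative verifier for: "the answer must be one complete Chinese
    sentence, with no additional explanations". Not the paper's actual checker."""
    answer = answer.strip()
    # A single sentence ending with a Chinese full stop...
    if not answer.endswith("。") or answer.count("。") != 1:
        return False
    # ...and actually written in Chinese characters.
    return re.search(r"[\u4e00-\u9fff]", answer) is not None
```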
Furthermore, MathIF also designed combinations of single, double, and triple instructions to test model performance under different levels of constraint complexity. Models must not only solve the problems correctly but also strictly adhere to these instructions.
An automatic scoring program checks whether each answer satisfies every individual instruction, and model obedience is measured with Hard Accuracy (HAcc) and Soft Accuracy (SAcc): HAcc counts a response as correct only when all of its instructions are satisfied, while SAcc is the average fraction of instructions satisfied per response.
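To make the two metrics concrete, here is a minimal sketch of how they could be computed, assuming each problem comes with a list of boolean verifier functions like the one above (the names and signatures are illustrative, not the paper's API):

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]

def hacc_sacc(responses: List[str], checks_per_item: List[List[Check]]) -> Tuple[float, float]:
    """Compute (HAcc, SAcc) over a benchmark.

    HAcc: fraction of responses satisfying *all* of their constraints.
    SAcc: mean, over responses, of the fraction of constraints satisfied.
    """
    hard, soft = [], []
    for response, checks in zip(responses, checks_per_item):
        results = [check(response) for check in checks]
        hard.append(float(all(results)))
        soft.append(sum(results) / len(results))
    return sum(hard) / len(hard), sum(soft) / len(soft)
```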
△Figure 1 Instruction types in MathIF
The Smarter, The Less Obedient? Experiments Reveal the Contradiction Between "Smartness" and "Obedience"
The research team used MathIF to evaluate 23 current mainstream large models, spanning a variety of training methods and parameter scales from a few billion to hundreds of billions of parameters.
The experimental results were surprising: models that performed better at mathematical reasoning were, paradoxically, less likely to fully comply with user-given instructions. Even the best-performing model, Qwen3-14B, fully followed the instructions for only about half of the prompts.
Furthermore, model size does not positively correlate with its ability to follow instructions; sometimes, it even shows a negative correlation—meaning larger models are not necessarily more obedient. Some smaller models are better at strictly executing user instructions.
There is also a trade-off between instruction following and mathematical reasoning: the stronger a model's reasoning ability, the more readily it overlooks or violates specific user instructions.
△Figure 2 Performance of 23 large reasoning models on MathIF
Models are sorted from high to low by obedience (HAcc + SAcc). The † symbol indicates models trained only with supervised fine-tuning (SFT), without reasoning-oriented reinforcement learning. Bold and underline mark the top two and bottom two values in each column, respectively.
Why are Smart Models More "Disobedient"?
The research team further analyzed the reasons behind this phenomenon:
Reason 1: Reasoning-Oriented Training Methods
The study found that training methods aimed at strengthening a model's reasoning capabilities, such as supervised fine-tuning (SFT) and reinforcement learning (RL), significantly boost the model's "intelligence" but, to some extent, weaken its sensitivity to specific instructions.
Such models tend to focus more on accurately solving problems and easily overlook details like format or word count. As shown in Figure 3, whether SFT or RL, reasoning-oriented training, while improving problem-solving performance, generally led to a decrease in the model's instruction-following ability (HAcc and SAcc).
△Figure 3 Comparison of reasoning-oriented training strategies
Avg. Acc. represents the average performance across all benchmark tasks. Green and red backgrounds indicate performance improvement and decline respectively compared to the base model.
Reason 2: Long Reasoning Chains Reduce Obedience
The longer the model's reasoning process (the more complex its "chain of thought"), the more easily it "forgets" the instruction requirements. Long, complex reasoning can scatter the model's attention and ultimately lead to violations of user instructions. As shown in the figure below, when the model's outputs are binned by reasoning length, instruction-following accuracy drops as the chains grow longer.
△Figure 4 HAcc and SAcc performance across different reasoning chain length intervals
A larger length bin number indicates a longer generated reasoning chain.
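The binning analysis itself is straightforward to reproduce: group samples by reasoning-chain length and average the instruction-following scores within each bin. A rough sketch, assuming you already have per-sample lengths and hard/soft scores (the variable names are made up for illustration):

```python
import numpy as np

def accuracy_by_length_bin(lengths, hard_scores, soft_scores, n_bins=4):
    """Split samples into equal-population bins by reasoning length and
    report mean HAcc/SAcc per bin (bin 0 = shortest chains)."""
    lengths = np.asarray(lengths)
    hard_scores = np.asarray(hard_scores, dtype=float)
    soft_scores = np.asarray(soft_scores, dtype=float)
    # Quantile edges give roughly equal-sized bins.
    edges = np.quantile(lengths, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, lengths, side="right") - 1, 0, n_bins - 1)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            print(f"bin {b}: HAcc={hard_scores[mask].mean():.3f}, "
                  f"SAcc={soft_scores[mask].mean():.3f}")
```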
The research team further verified this phenomenon through experiments: when the model is prompted to generate longer reasoning processes, its instruction following accuracy significantly decreases.
Specifically, by appending prompts such as "wait" just before the model would finish reasoning and output an answer, they forced it to keep extending its thinking process and thereby generate longer reasoning chains. As shown in the figure below, the more "extra thinking" the model does, the less accurately it executes the instructions.
△Figure 5 Trend of model instruction following ability change
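In pseudocode, the intervention works roughly as below. The `model.generate` call and the `</think>` end-of-reasoning marker are placeholders for whatever reasoning-model interface is actually in use; this is a sketch of the idea, not the paper's exact setup.

```python
def generate_with_forced_extra_thinking(model, prompt, n_extensions=2):
    """Force a longer chain of thought by appending "wait" each time the
    model tries to close its reasoning, then finally letting it answer."""
    reasoning = model.generate(prompt, stop="</think>")
    for _ in range(n_extensions):
        # Instead of letting the reasoning end, nudge the model to keep thinking.
        reasoning += "\nwait"
        reasoning += model.generate(prompt + reasoning, stop="</think>")
    # Now close the reasoning block and produce the final answer.
    return model.generate(prompt + reasoning + "</think>")
```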
Additionally, the research team further observed changes in the model's instruction following ability by controlling the reasoning length during the training phase.
Specifically, during the rollout phase of reinforcement learning (RL), they set a maximum generation length limit, beyond which responses would not receive rewards, thereby indirectly compressing the model's reasoning chain length.
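Conceptually, the reward shaping amounts to something like the sketch below, where `is_correct` and the token count stand in for whatever verifier and tokenizer the training pipeline actually uses:

```python
def length_capped_reward(response_tokens: list, is_correct: bool, max_len: int = 1024) -> float:
    """RL rollout reward: a correct answer earns reward only if the whole
    response stays within the length cap; overlong rollouts get nothing,
    which indirectly pushes the model toward shorter reasoning chains."""
    if len(response_tokens) > max_len:
        return 0.0
    return 1.0 if is_correct else 0.0
```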
As shown in the figure below, limiting reasoning length significantly improved the model's instruction-following ability (HAcc and SAcc). When the maximum length was capped at 1k, the model's obedience even surpassed that of the original baseline model.
However, this improvement came at a cost: the model's mathematical reasoning ability significantly decreased, demonstrating the trade-off between "obedience" and "smartness."
△Figure 6 Impact of maximum response length in RL training
Red regions indicate a performance decrease compared to the base model (Original), with darker colors indicating a greater decrease.
These phenomena further confirm the research team's conclusion: reasoning-oriented training that favors generating longer reasoning chains often inadvertently weakens the model's ability to follow instructions, highlighting the long-standing trade-off between reasoning ability and instruction obedience.
A Small Tip: A Simple Method to Make Models More "Obedient"
The researchers also tried a simple method to improve the model's "obedience": repeating the instruction requirements again after the model finished reasoning and before outputting the answer.
The results showed that this method, by shortening the distance between the instruction and the response, effectively improved the model's instruction compliance, but it also slightly reduced the accuracy of the model's answers. To comply with the rules, the model had to sacrifice a bit of its mathematical reasoning ability.
△Figure 7 Improving instruction following ability by repeating instructions after reasoning.
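Mechanically, the trick just re-injects the constraint text between the reasoning block and the final answer. A sketch under the same placeholder interface as above (not the paper's exact prompt template):

```python
def answer_with_instruction_reminder(model, question, instruction):
    """Repeat the instruction after reasoning, right before the answer,
    shortening the distance between the constraint and the response."""
    prompt = f"{question}\n{instruction}"
    reasoning = model.generate(prompt, stop="</think>")
    # Re-state the instruction immediately before the answer is produced.
    reminded = (prompt + reasoning + "</think>\n"
                f"Remember: {instruction}\nFinal answer:")
    return model.generate(reminded)
```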
Current mainstream reasoning-oriented training methods, while significantly enhancing a model's problem-solving capabilities, inevitably weaken its ability to follow instructions. There is a difficult-to-reconcile contradiction between AI's "smartness" and "obedience."
In the future, the MathIF benchmark is expected to facilitate the construction of large models that can both think deeply and strictly adhere to rules.
Paper: https://arxiv.org/pdf/2505.14810
GitHub: https://github.com/TingchenFu/MathIF