Author: YiFan-Zhang https://zhuanlan.zhihu.com/p/1947981998569260594
This interesting article on algorithms primarily argues two points. First, tools introduce noise: when a model inevitably produces syntax or logic errors, the subsequent environment feedback (e.g., error messages) can cause it to spend valuable tokens correcting those errors instead of advancing its reasoning. Second, outcome-based rewards exacerbate the problem, because a trajectory still earns a positive reward when intermediate tool calls fail, as long as the final answer is correct. Consequently, the model learns to treat errors as acceptable and generates lengthy, low-quality reasoning trajectories.
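To make the second point concrete, here is a minimal sketch of an outcome-only reward. The `Trajectory` fields and the exact-match check are assumptions for illustration, not the article's actual reward implementation. The point is only that the reward never looks at `tool_errors`.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical fields, chosen only for this illustration.
    final_answer: str
    tool_calls: int
    tool_errors: int  # number of tool calls that returned an error


def outcome_reward(traj: Trajectory, reference: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if traj.final_answer.strip() == reference.strip() else 0.0


clean = Trajectory(final_answer="42", tool_calls=2, tool_errors=0)
noisy = Trajectory(final_answer="42", tool_calls=9, tool_errors=6)

# Both trajectories receive the same reward, although one wasted
# most of its tool calls on errors.
print(outcome_reward(clean, "42"), outcome_reward(noisy, "42"))
```

Because the two trajectories are indistinguishable to the reward, nothing in the training signal discourages the noisy one.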
Technical algorithm: GRPO with clip-higher and without the KL term, plus one improvement, GRPO-RoC. The core idea is to oversample rollouts first and then downsample them asymmetrically: failed trajectories are sampled uniformly so that they provide negative signals, while among successful trajectories only those with fewer tool-call errors and minimal formatting issues are retained. The final batch used for policy updates therefore consists of filtered, high-quality successful trajectories and diverse failed trajectories.
Through this asymmetrical sampling strategy, GRPO-RoC can effectively filter out low-quality successful trajectories caused by environmental noise and prioritize learning from high-quality successful cases.
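The asymmetric downsampling described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `Rollout` fields, the `quality_penalty` formula, the deterministic "keep the cleanest" rule, and the choice to preserve the positive/negative ratio of the oversampled group are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    reward: float       # 1.0 if the final answer is correct, else 0.0
    tool_calls: int
    tool_errors: int
    format_issues: int  # hypothetical count of formatting violations


def quality_penalty(r: Rollout) -> float:
    """Hypothetical penalty: tool-error ratio plus format issues (lower is better)."""
    error_ratio = r.tool_errors / r.tool_calls if r.tool_calls else 0.0
    return error_ratio + r.format_issues


def resample_on_correct(rollouts, group_size, rng):
    """Downsample a non-empty, oversampled group to at most group_size rollouts."""
    positives = [r for r in rollouts if r.reward > 0]
    negatives = [r for r in rollouts if r.reward <= 0]
    # Assumption: keep roughly the same positive/negative ratio as the full group.
    n_pos = min(round(group_size * len(positives) / len(rollouts)), len(positives))
    n_neg = min(group_size - n_pos, len(negatives))
    n_pos = min(group_size - n_neg, len(positives))
    # Successful rollouts: keep only the cleanest ones.
    kept_pos = sorted(positives, key=quality_penalty)[:n_pos]
    # Failed rollouts: uniform sampling keeps the negative signal diverse.
    kept_neg = rng.sample(negatives, n_neg)
    return kept_pos + kept_neg


group = [
    Rollout(1.0, 4, 0, 0), Rollout(1.0, 6, 3, 0),
    Rollout(1.0, 5, 1, 1), Rollout(1.0, 3, 0, 0),
    Rollout(0.0, 2, 1, 0), Rollout(0.0, 7, 2, 0),
    Rollout(0.0, 1, 0, 0), Rollout(0.0, 4, 4, 1),
]
batch = resample_on_correct(group, group_size=4, rng=random.Random(0))
```

In this toy group, the two successful rollouts that survive are the ones with no tool errors and no formatting issues, while the two failed rollouts are picked at random. The rewards themselves are left unchanged; only the composition of the batch changes.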
Experimental results show that after adopting GRPO-RoC, the tool call error rate in successful trajectories significantly decreased, and the model's reasoning performance significantly improved, while the generated responses were also more concise.
Training scheme: The first stage trains with an 8K length limit. When performance saturates, the limit is raised to 12K, and when performance saturates again, training continues on more difficult data.
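One way to picture this staged schedule is the sketch below. It assumes that 8K and 12K refer to the maximum response length and that the third stage keeps the 12K limit; the saturation test (`has_saturated`, its window, and its threshold) is entirely hypothetical, since the post does not say how saturation is detected.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    max_response_tokens: int  # assumption: "8K"/"12K" are length limits
    harder_data: bool


STAGES = [
    Stage("stage 1", 8_000, False),
    Stage("stage 2", 12_000, False),
    Stage("stage 3", 12_000, True),  # assumption: length limit stays at 12K
]


def has_saturated(scores, window=3, min_gain=0.005):
    """Hypothetical test: the mean of the last `window` evaluation scores
    improved by less than `min_gain` over the previous `window` scores."""
    if len(scores) < 2 * window:
        return False
    recent = sum(scores[-window:]) / window
    previous = sum(scores[-2 * window:-window]) / window
    return recent - previous < min_gain


def next_stage_index(current, scores):
    """Advance to the next stage once the current one has saturated."""
    if has_saturated(scores) and current + 1 < len(STAGES):
        return current + 1
    return current
```

For example, `next_stage_index(0, [0.5] * 6)` moves from the 8K stage to the 12K stage, while a score history that is still climbing keeps training in the current stage.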
The authors also report some attempts that did not work:
1. Ineffectiveness of "Overlong Filtering": Researchers attempted an "overlong filtering" strategy, which discards trajectories exceeding the maximum length without providing negative rewards. However, this not only brought no benefits but instead increased the proportion of overlong trajectories. This might be because overlong trajectories often contain repetitive patterns, and without negative feedback, the model cannot receive correction signals. Therefore, researchers retained negative rewards for truncated trajectories and found that this helps the model reduce repetition and improve efficiency.
2. Risks of N-gram Repetition Detection: Researchers tried to filter out highly repetitive successful trajectories using N-gram repetition detection. However, this method hurt both the model's average response length and its reasoning score. They found that simply treating repetitive patterns as a penalty signal is risky, as some "repetitive" behaviors (e.g., two similar tool calls on different inputs) are actually deliberate and effective reasoning steps.
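The risk in the second finding can be illustrated with a toy repetition score. This is a hedged sketch: the function `ngram_repetition_ratio`, the whitespace tokenization, and the choice of `n=3` are assumptions, not the detector the researchers used.

```python
from collections import Counter


def ngram_repetition_ratio(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


# Two deliberate, similar tool calls on different inputs.
two_calls = "call solver on input A then call solver on input B".split()
# A degenerate loop that makes no progress.
degenerate = "let me check again let me check again let me check again".split()

legit_score = ngram_repetition_ratio(two_calls)
loop_score = ngram_repetition_ratio(degenerate)
print(legit_score, loop_score)
```

Both sequences get a nonzero score, so any fixed threshold low enough to catch mild loops would also flag the legitimate pair of tool calls. A surface-level rule like this cannot tell the two cases apart by intent.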
Overall, overly complex, rule-based reward or scoring mechanisms are prone to introducing bias and penalizing useful behaviors, and they are difficult to generalize across different reasoning patterns. Therefore, the authors handle intermediate behaviors such as environmental noise and formatting issues through the RoC sampling strategy rather than by imposing penalties directly at the reward level.