Stanford Proposes New RL Paradigm: 3B Model Agent Outperforms Claude, GPT-4

In recent years, large language model (LLM) agents have been able to accomplish complex tasks, from software development to scientific research, by invoking external tools such as code executors. An ultimate vision is for these AI agents to perform machine learning engineering (MLE) tasks, and even to iteratively create better AI models themselves.

However, most existing MLE agents rely on a simple strategy: prompting powerful, off-the-shelf large models (such as Claude, GPT). This approach has a fundamental flaw—the agent itself does not learn. Regardless of how much success or failure experience it accumulates, its core behavioral patterns (i.e., model parameters) remain static. It’s like constantly giving problems to a gifted student who never reviews their mistakes; their performance improves solely by the volume of practice and the cleverness of prompts, without any increase in their intrinsic ability. As shown in Figure 1, even running the best prompting frameworks for days yields minimal performance improvement.


Paper: Reinforcement Learning for Machine Learning Engineering Agents

Link: https://arxiv.org/pdf/2509.01684

A natural idea is: why not let small agents learn like students? That is, use accumulated experience to update their model parameters through reinforcement learning (RL), thereby genuinely improving their capabilities. This paper builds on exactly this idea and reports a surprising finding: a small model (Qwen2.5-3B) trained with RL can significantly outperform much larger, top-tier models (such as Claude-3.5-Sonnet) that are only prompted, leading by an average of 22% across 12 Kaggle tasks.


However, this path is not without challenges. This article delves into how the researchers cracked two challenges unique to RL in agent environments, ultimately making the remarkable story of "small models surpassing large models" possible.

Problems and Methods

Challenge 1: Optimization Bias due to Variable-Duration Actions

1. Problem Analysis: Is Faster Always Better?

In standard distributed RL training, multiple "actors" interact with the environment in parallel, collect experience, and then send it to a "learner" for gradient updates. This works well in simulated environments (e.g., games) because each action (e.g., moving one step) takes roughly the same amount of time.

However, in MLE tasks, each "action" is a piece of code, and its execution time varies wildly. For example, training a logistic regression model might take only 1 second, while training a deep neural network or performing complex feature engineering could take minutes or even hours. In a distributed setting, faster-executing actions return experience more quickly and are therefore used more often for gradient updates, while slow, high-quality actions are sampled less often and may even be discarded due to timeouts. This introduces a severe bias into the RL optimization process: it tends to reward "fast" actions rather than "good" actions. As shown in Figure 2 (likely Figure 3 in the original paper's layout), naive RL training quickly makes the agent converge to solutions that execute very fast but perform poorly (e.g., simple linear models).


2. Method: Duration-Aware Gradient Updates

(1) Mathematical Modeling and Core Idea

Researchers first clearly revealed the root cause of the problem with a simplified example. Suppose there are two actions x and y, with execution times t_x and t_y, and advantage estimates (measuring action quality) A_x and A_y, respectively. Within a fixed time T, the number of times action x is sampled, N_x, is proportional to its selection probability P(x) and inversely proportional to its execution time t_x:

N_x ∝ P(x) / t_x

Then, the contribution of action x to the total gradient, ∇J_x, is:

∇J_x ∝ N_x · ∇log P(x) · A_x ∝ (P(x) / t_x) · ∇log P(x) · A_x

Notice that the gradient contribution ∇J_x is divided by t_x. This means that actions with shorter execution times have their impact on gradient updates amplified, which is the fundamental reason why fast actions dominate.

(2) Solution and Formula

To solve this problem, the authors proposed an intuitive and effective fix: weight each action's gradient contribution by its execution time. The contribution above then becomes:

∇J_x ∝ (P(x) / t_x) · ∇log P(x) · A_x · t_x

The new factor t_x cancels the 1/t_x term that came from the sampling frequency N_x. In this way, each action's contribution to the gradient depends only on its selection probability P(x) and its advantage value A_x, completely decoupled from its execution speed.
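Written out as a worked equation, the two-action example can be restated as follows (this is only a LaTeX restatement of the argument above, using the same symbols; nothing new is introduced):

```latex
\begin{align*}
% Unweighted contribution: the 1/t_x sampling frequency favors fast actions
\nabla J_x &\propto N_x \,\nabla \log P(x)\, A_x
            \propto \frac{P(x)}{t_x}\,\nabla \log P(x)\, A_x \\
% Duration-aware contribution: the extra factor t_x cancels that bias
\nabla J_x &\propto t_x \cdot \frac{P(x)}{t_x}\,\nabla \log P(x)\, A_x
            = P(x)\,\nabla \log P(x)\, A_x
\end{align*}
```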

Extending this idea to the general policy gradient formula yields the duration-aware policy gradient update rule proposed in the paper:

∇_θ J = Σ ( t · ∇_θ log π(a|s) · A(s,a) )

• ∇_θ J: The gradient of the objective function J with respect to the policy parameters θ.

• t: The time spent executing action a in state s.

• ∇_θ log π(a|s): The score function, indicating the policy's preference for the current action.

• A(s,a): The advantage function, measuring how much better action a is than the average.

The gradient update is thus weighted by each action's execution time: actions with long execution times, even if sampled infrequently, have a greater impact on each update and therefore receive fair treatment during optimization. In practice, the authors also normalize t by the average execution time within the batch to prevent a single extremely long action from causing gradient explosion.
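To make the rule concrete, here is a minimal PyTorch-style sketch of a duration-weighted, REINFORCE-style loss. It is an illustrative reading of the update above, not the paper's actual training code; the function name, tensor shapes, and the clamp epsilon are assumptions.

```python
import torch

def duration_aware_pg_loss(log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           durations: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss weighted by each action's execution time.

    log_probs:  (B,) log-probabilities of the sampled actions (code blocks)
    advantages: (B,) advantage estimates A(s, a)
    durations:  (B,) wall-clock execution time of each action, in seconds
    """
    # Normalize by the batch-average duration (as described above) so that
    # one extremely slow action cannot blow up the gradient.
    weights = durations / durations.mean().clamp(min=1e-8)
    # Minimize the negative weighted objective t * A(s,a) * log pi(a|s).
    return -(weights * advantages * log_probs).mean()

# Illustrative usage with dummy data: the slow, high-advantage action now
# contributes in proportion to its quality, not its sampling frequency.
log_probs = torch.randn(4, requires_grad=True)
advantages = torch.tensor([0.5, -0.2, 1.0, 0.1])
durations = torch.tensor([1.0, 30.0, 600.0, 5.0])  # seconds
duration_aware_pg_loss(log_probs, advantages, durations).backward()
```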

Challenge 2: Sparse Reward Signals

1. Problem Analysis: Limitations of All-or-Nothing Feedback

The natural reward for MLE tasks is the model's performance metric on the test set (e.g., accuracy or AUC). But this is a very sparse reward: the code must run correctly end to end, from data loading, preprocessing, model building, and training through to the final inference and submission, before any positive reward is received. If any step fails, the reward is zero or negative.

This is like an exam that awards only "full marks" or "zero marks." One student fails to submit because of a misspelled filename; another gets every step right but copies one digit wrong in the final answer; both receive "zero marks." Such feedback cannot distinguish "complete failure" from "near success," making the learning process extremely difficult. Agents can easily get stuck in local optima, for example by abandoning the machine learning pipeline entirely in favor of tricks that quickly yield a basic score. As shown in Figure 3 (likely Figure 4 in the original paper's layout), in a sentiment analysis task the agent ultimately converged to a solution that directly implemented the evaluation metric (Jaccard similarity) and searched for the best-matching phrase in the test input, bypassing machine learning altogether.


2. Method: Environmental Instrumentation Providing Partial Credit

To solve the sparse reward problem, the authors proposed a clever "environmental instrumentation" scheme to provide partial credit for intermediate steps.

(1) Core Idea:

Insert "checkpoints" (print statements) into the agent's code before it is executed. The more steps the code completes, the more checkpoints it prints, and the higher the partial reward it receives. This way, a solution that fails during the data loading phase receives much less reward than one that fails only after model training is complete.

(2) Method Flow:

1. Insert Print Statements: Use an independent, static (not participating in RL training) Qwen2.5-3B model to automatically insert predefined print statements into the agent-generated code (e.g., `print("imported packages")`, `print("loaded data")`, `print("training loss: {loss}")` etc.). This "instrumentation LM" is prompted to only add print statements without modifying the original code logic.

2. Execute and Match: Execute the instrumented code and capture terminal output.

3. Parse and Score: Use regular expressions to match the predefined print statements in the terminal output. A completely failed solution receives -10 points, and each matched checkpoint adds 0.1 points. If the code runs to the end and produces a submission, the actual test-set score (usually between -1 and 1) is used as the reward.
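The scoring rule might look roughly like the sketch below. The checkpoint strings mirror the examples given above, but the regex patterns, the function signature, and the exact way the -10 failure penalty combines with the 0.1-per-checkpoint credit are assumptions about one plausible implementation, not the authors' code.

```python
import re

# Checkpoint patterns matched against the captured terminal output.
# The strings follow the examples mentioned above; the authors' full set may differ.
CHECKPOINT_PATTERNS = [
    r"imported packages",
    r"loaded data",
    r"training loss: [-+0-9.eE]+",
]

def partial_credit_reward(terminal_output: str,
                          submission_produced: bool,
                          test_score: float | None) -> float:
    """Score one rollout of instrumented, agent-generated code."""
    if submission_produced and test_score is not None:
        # End-to-end success: use the actual test-set metric (typically in [-1, 1]).
        return test_score
    # Failed run: one plausible reading is to start from the failure penalty
    # and earn partial credit for each checkpoint reached before the crash.
    matched = sum(bool(re.search(p, terminal_output)) for p in CHECKPOINT_PATTERNS)
    return -10.0 + 0.1 * matched
```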


(3) Key Design:

An independent LM must be used for instrumentation. If the learning agent itself were to insert the checkpoints, it might learn to "cheat"—generating many print statements to gain partial rewards without actually performing valuable operations.

Additional Strategy: Multi-Step RL with Self-Improvement Prompt

In addition to letting the agent solve problems from scratch, the authors further explored the agent's ability for self-improvement. During training, with a 50% probability, the agent is given a "self-improvement" prompt, asking it to revise and improve a previous solution based on its execution results (including terminal output and partial rewards).

This is equivalent to giving the agent a "mistake notebook," allowing it to learn to analyze failure causes and make improvements. Although the authors found that small models have limited self-debugging capabilities, this combination of "generating from scratch" and "improving old solutions" ultimately led to further performance improvements in 10 out of 12 tasks (an average increase of 8%).
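A minimal sketch of how the two rollout modes could be mixed during training is shown below. Only the 50% mixing probability and the idea of feeding back the previous code, terminal output, and reward come from the paper; the prompt wording, the `previous_attempt` record, and the helper function itself are illustrative assumptions.

```python
import random

SOLVE_TEMPLATE = (
    "Solve the following Kaggle task from scratch and write a submission file.\n"
    "Task description:\n{task}"
)

IMPROVE_TEMPLATE = (
    "Below is a previous solution, its terminal output, and the reward it received.\n"
    "Revise the code to improve the score.\n"
    "Task description:\n{task}\n"
    "Previous code:\n{code}\n"
    "Terminal output:\n{output}\n"
    "Reward: {reward}"
)

def build_prompt(task: str, previous_attempt: dict | None, p_improve: float = 0.5) -> str:
    """With probability p_improve (0.5 in the paper), ask the agent to improve
    an earlier attempt instead of starting from scratch."""
    if previous_attempt is not None and random.random() < p_improve:
        # previous_attempt is expected to hold "code", "output", and "reward".
        return IMPROVE_TEMPLATE.format(task=task, **previous_attempt)
    return SOLVE_TEMPLATE.format(task=task)
```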

Experiment Setup and Evaluation

To verify the effectiveness of the above methods, the authors conducted comprehensive experiments on the MLEBench benchmark. MLEBench includes 75 Kaggle challenge tasks, covering classification and regression problems on image, text, and tabular data.

• Models: Primarily used Qwen2.5-3B-Instruct as the trainable RL agent. Baselines compared include "giants" such as Claude-3.5-Sonnet, GPT-4o, and Llama3.1-405B.

• Baseline Methods:

• State-of-the-art models + Agent frameworks: Used advanced agent frameworks such as AIDE, OpenHands, MLAgentBench to prompt large models.

• Pure RL baseline: Used standard distributed RL frameworks (e.g., HybridFlow) without the improvements presented in this paper.

• Evaluation Metrics: Used MLEBench's evaluator to score the final submitted files. Reported average and best scores from multiple runs.

• Training Configuration: Used the PPO algorithm, trained on 8 A100 GPUs for 1-3 days per task until convergence. Hyperparameter details can be found in Appendix Table 3.

Results and Analysis

Main Experiment Results: RL Small Model vs. Prompted Large Models

Table 1

Table 2

Tables 1 and 2 present the paper's core and most impressive results.

Table 1 compares the RL-trained Qwen2.5-3B with various cutting-edge models prompted using the AIDE framework. The results show:

• The 3B RL small model achieved the best performance on 8 out of 12 tasks.

• On average, its performance was 22% higher than the powerful Claude-3.5-Sonnet, and 24% higher than GPT-4o (100 hours of runtime).

• Even on tasks where it failed to beat larger models, RL training significantly outperformed directly prompting Qwen2.5-3B itself with AIDE.

Table 2 compares different agent frameworks. Even the powerful GPT-4o, paired with various agent frameworks (AIDE, OpenHands, MLAB), fluctuated in performance across tasks and overall still did not match the RL-trained Qwen2.5-3B model. This indicates that RL provides a more general path to performance improvement, one that does not depend on any specific prompting framework.

Figure 7

Figure 7 dynamically shows performance trends over time. For many tasks, prompted large models initially lead, but over time, the RL small model, through continuous learning, steadily improves its performance and eventually overtakes them. This vividly demonstrates the long-term advantage of "learning" over "one-shot inference."

Ablation Studies

Ablation experiments strongly proved the necessity of each innovative component.

1. Effect of Duration-Aware Gradients

Without duration-aware weighting, the average execution time of solutions generated by the agent quickly decreased and remained at a very low level (fast but poor solutions). With this method, the agent was able to explore and ultimately adopt solutions with longer execution times but better performance (e.g., gradient boosting models). This shows that the method successfully overcame optimization bias, encouraging the agent to pursue high-quality solutions.


2. Effect of Environmental Instrumentation

Without partial credit, the average score in the early training stages was extremely low (because many solutions scored -10), and convergence was slow with high variance (one run even failed to produce any valid solution). With partial rewards provided by environmental instrumentation, the average score was higher from the beginning of training, and the speed of ascent and convergence was faster and more stable. This proves that partial credit is crucial for alleviating sparse rewards and guiding the agent's learning.


3. Effect of Self-Improvement Prompt

In 10 out of 12 tasks, adding the "improve previous solution" prompt led to further performance improvements, averaging 8%. This indicates that RL not only enhanced the agent's "start from scratch" ability but also its "iterative optimization" ability.

Qualitative Analysis


The paper's qualitative examples show some of the high-performing solutions discovered by the agent. For example, in the lmsys-chatbot-arena task, the agent learned to perform complex feature engineering, using response length difference, word count difference, average word length difference, and similar signals as features to predict user preference. In the random-acts-of-pizza task, the agent ultimately found a high-cost, high-reward solution that combines TF-IDF text features with user meta-features, using a random forest with grid search (a schematic sketch of this kind of pipeline follows below). These examples show concretely how RL agents can become increasingly "smart" through learning.
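For intuition, the random-acts-of-pizza style solution described above might look schematically like this. The column names, feature set, and hyperparameter grid are invented for illustration and do not reproduce the agent's actual code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical column names; the real Kaggle fields differ.
TEXT_COL = "request_text"
META_COLS = ["requester_account_age", "requester_number_of_comments"]

def build_model() -> GridSearchCV:
    """TF-IDF text features + passthrough user metadata, fed to a random forest.

    Expects a pandas DataFrame containing TEXT_COL and META_COLS at fit time.
    """
    features = ColumnTransformer([
        ("tfidf", TfidfVectorizer(max_features=5000), TEXT_COL),
        ("meta", "passthrough", META_COLS),
    ])
    pipeline = Pipeline([
        ("features", features),
        ("rf", RandomForestClassifier(random_state=0)),
    ])
    # A small grid search, in the spirit of the "high-cost, high-reward"
    # solution described above.
    param_grid = {"rf__n_estimators": [200, 500], "rf__max_depth": [None, 20]}
    return GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
```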

Discussion and Related Work

This work is closely related to several fields.

• ML Engineering Agents: Unlike most existing work that focuses on designing more complex prompting frameworks or runtime heuristic search, this paper takes a different approach, enabling small models to self-evolve through gradient updates.

• RL for LMs: Previous research (e.g., RLHF) mostly occurred in environments where reward models or math/code verifiers provided instantaneous rewards, ignoring the variability of action execution time. This paper is the first to explicitly propose and solve this problem in a practical agent system.

• RL for Agent Systems: Prior RL research on interactive tasks (e.g., web navigation, terminal operations) mainly focused on turn-based interactions, where time costs do not vary significantly. This paper focuses on scenarios where the time cost within each "turn" varies greatly and provides new solutions.

Limitations: Current work trains a separate agent for each task. Future directions include training a general agent to solve multiple tasks, researching its generalization capabilities, and exploring more complex multi-step decomposition planning.

Societal Impact: AI agents automating ML engineering processes may affect related job markets, requiring policy research. Allowing agents to freely execute code on the internet also poses security risks, urgently requiring stronger sandboxing and security technologies.

Conclusion

This paper powerfully argues a core point: for machine learning engineering tasks, a small model capable of continuous learning can surpass a prompted, static giant model.

Its core contributions are:

1. Identifying and formalizing two key challenges faced by RL in practical agent systems: optimization bias due to variable-duration actions and sparse rewards.

2. Proposing two innovative solutions: duration-aware gradient updates, ensuring fair optimization for actions with varying execution times; and environmental instrumentation, effectively alleviating sparse rewards by providing partial credit.

3. Extensive experiments confirming that an RL system based on a 3B small model can consistently outperform advanced agent frameworks driven by top-tier large models in a series of complex Kaggle challenges.

This work points to an important direction for the future development of AI agents: balancing computational resources across inference, interaction (action execution), and learning (gradient updates), especially in tasks where interaction costs cannot be ignored. It tells us that enabling AI to "learn to learn" might be more important than merely pursuing larger model scales.


