Early this morning, Together.ai, a renowned large model training platform, in collaboration with Agentica, open-sourced an innovative AI Agent framework called DeepSWE.
DeepSWE is built upon Alibaba's recently open-sourced Qwen3-32B model and is trained entirely using reinforcement learning.
In addition to the model weights, the training methods, training logs, and datasets have all been open-sourced, so developers can study, reproduce, and improve the agent.
Open-source address: https://huggingface.co/agentica-org/DeepSWE-Preview
According to results on SWE-Bench Verified, DeepSWE was evaluated with a maximum context length of 64k tokens and a cap of 100 environment steps, achieving a Pass@1 accuracy of 42.2% averaged over 16 runs. Performance improved further to 59% with hybrid test-time scaling (TTS), surpassing all open-source agent frameworks and ranking first.
DeepSWE demonstrates the effectiveness and immense potential of training solely with reinforcement learning. Compared to other open-source models, DeepSWE-Preview achieves the best performance without relying on distillation or SFT from stronger proprietary teacher models.
DeepSWE's training is built on the rLLM framework, a system for reinforcement-learning post-training of language agents. The model was trained for six days on 64 H100 GPUs on 4,500 real-world software engineering tasks from the R2E-Gym training environment. These tasks cover complex scenarios ranging from resolving GitHub issues to implementing new features and debugging, reflecting the diversity and complexity of real-world software engineering.
During training, through interaction with the environment, DeepSWE-Preview learns to navigate large codebases, apply targeted code edits, run shell commands for building and testing, and iteratively refine and validate solutions while resolving real pull requests.
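To make this interaction loop concrete, the sketch below shows one possible rollout structure; the environment interface and helper names are illustrative assumptions, not the actual R2E-Gym or rLLM API.

```python
# Minimal sketch of a single agent rollout under a hypothetical sandboxed
# environment interface (reset/step/current_patch are assumed names).
from dataclasses import dataclass, field
from typing import Optional

MAX_ENV_STEPS = 100  # evaluation capped the agent at 100 environment steps

@dataclass
class Trajectory:
    messages: list = field(default_factory=list)   # full agent/environment exchange
    submitted_patch: Optional[str] = None

def rollout(llm, env, issue_description: str) -> Trajectory:
    traj = Trajectory()
    observation = env.reset(issue_description)       # repository snapshot + issue text
    for _ in range(MAX_ENV_STEPS):
        traj.messages.append({"role": "user", "content": observation})
        action = llm.generate(traj.messages)          # bash / search / edit / submit
        traj.messages.append({"role": "assistant", "content": action})
        observation, done = env.step(action)          # execute the action in the sandbox
        if done:                                      # the agent submitted its patch
            traj.submitted_patch = env.current_patch()
            break
    return traj
```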
For the training recipe, dataset curation drew 4,500 problems from an R2E-Gym subset, filtering out issues from any repository that also appears in SWE-Bench Verified to avoid contamination. All problems were mapped to a single Docker image for easy management and execution. The training environment is built around R2E-Gym, which can scalably provide high-quality, executable SWE environments. The state and action space covers executing Bash commands, searching files, editing files, and submitting the finished task.
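The action space described above can be pictured as four structured action types; the encoding below is a hypothetical illustration rather than the framework's actual schema.

```python
# Hypothetical encoding of the agent's four action types; field names are
# illustrative and not taken from R2E-Gym.
from dataclasses import dataclass
from typing import Union

@dataclass
class RunBash:
    command: str                 # e.g. "python -m pytest tests/ -x"

@dataclass
class SearchFiles:
    query: str                   # text or pattern to locate relevant code
    path: str = "."              # directory to search

@dataclass
class EditFile:
    path: str                    # file to modify
    old_snippet: str             # exact text to replace
    new_snippet: str             # replacement text

@dataclass
class Submit:
    pass                         # finalize and hand the current patch to the grader

Action = Union[RunBash, SearchFiles, EditFile, Submit]
```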
The reward design is a sparse outcome reward: a positive reward is given only when the LLM-generated patch passes all tests, and the reward is zero otherwise. To address scaling challenges during training, the researchers integrated Kubernetes support into R2E-Gym, enabling elastic scheduling and auto-scaling of containers, which allows millions of trajectories to be collected reliably while keeping compute costs proportional to the load.
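In code, this sparse outcome reward reduces to a single check at the end of each trajectory; `run_tests` below is a placeholder for the actual test harness, not a real API.

```python
def outcome_reward(patch: str, run_tests) -> float:
    """Sparse outcome reward: 1.0 only if the generated patch passes every
    test for the task, 0.0 otherwise (including when no patch was submitted).
    `run_tests` stands in for the real harness that applies the patch and
    executes the task's test suite."""
    if not patch:
        return 0.0
    results = run_tests(patch)           # list of per-test pass/fail booleans
    return 1.0 if results and all(results) else 0.0
```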
For the reinforcement learning algorithm, DeepSWE-Preview was trained with GRPO++, an improved version of the original GRPO algorithm. GRPO++ incorporates insights and innovations from works such as DAPO, Dr. GRPO, and LOOP/RLOO, and achieves a more stable, higher-performing training process through strategies including a higher (asymmetric) clipping bound, no KL loss, no reward standard-deviation normalization, length normalization, a leave-one-out baseline, compact filtering, and no entropy loss.
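The sketch below gives a schematic flavor of two of these ingredients, the leave-one-out baseline without reward standard-deviation normalization and the asymmetric (higher) clipping with no KL or entropy terms; the hyperparameter values and function signatures are illustrative assumptions, not the actual GRPO++ implementation.

```python
import torch

def loo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out baseline over a group of K rollouts for the same task;
    note there is no division by the reward standard deviation."""
    K = rewards.numel()
    baseline = (rewards.sum() - rewards) / (K - 1)     # mean reward of the other K-1 rollouts
    return rewards - baseline

def grpo_pp_policy_loss(logp_new, logp_old, advantages, mask,
                        eps_low=0.2, eps_high=0.28):
    """Clipped surrogate objective with an asymmetric (higher) upper clip
    bound, no KL penalty, and no entropy bonus. `mask` is a 0/1 float tensor
    that zeroes out tokens from trajectories removed by compact filtering.
    The epsilon values here are illustrative, not the reported settings."""
    ratio = torch.exp(logp_new - logp_old)                         # per-token importance ratio
    adv = advantages.unsqueeze(-1)                                 # broadcast per-trajectory advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = -torch.minimum(unclipped, clipped) * mask
    return per_token.sum() / mask.sum().clamp(min=1)               # average over unmasked tokens
```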
Among these, compact filtering specifically targets multi-turn agent settings: trajectories that hit the maximum context length, the maximum number of steps, or a timeout are masked out of the loss, which prevents reward collapse during training and encourages the agent to reason in long form across steps.
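A possible implementation is simply a keep/drop flag per trajectory, which then feeds the loss mask in the sketch above; the attribute names here are hypothetical.

```python
def compact_filter(trajectories, max_context: int, max_steps: int):
    """Return a keep/drop flag per trajectory. Trajectories that hit the
    context limit, the step limit, or a timeout are dropped so that their
    truncated (and therefore uninformative) zero reward never becomes a
    training signal. Attribute names are illustrative."""
    keep = []
    for t in trajectories:
        truncated = (
            t.num_tokens >= max_context
            or t.num_steps >= max_steps
            or t.timed_out
        )
        keep.append(not truncated)
    return keep
```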
TTS is one of the key strategies DeepSWE-Preview uses to improve performance. At test time, it generates multiple trajectories for each problem and selects the one most likely to have solved it, substantially improving over its single-run Pass@1 performance.
The researchers experimented with various TTS strategies, including execution-based and execution-free verifiers, and ultimately adopted a hybrid scaling strategy that combines the advantages of both paradigms, reaching 59.0%, which is 12% higher than the current state-of-the-art open-weight models.
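A simplified view of such a hybrid selection scheme is sketched below; `execution_check` and `verifier_score` are placeholders standing in for the execution-based and execution-free verifiers, not DeepSWE's actual components.

```python
def hybrid_tts_select(candidates, execution_check, verifier_score):
    """Pick one patch from several independent rollouts for the same issue.
    `execution_check` stands in for an execution-based verifier (e.g. running
    tests against a candidate patch) and `verifier_score` for an
    execution-free verifier model that scores candidates directly."""
    # Prefer candidates that survive the execution-based check, if any do.
    passing = [c for c in candidates if execution_check(c)]
    pool = passing or candidates
    # Break ties (or fall back entirely) with the execution-free verifier.
    return max(pool, key=verifier_score)
```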
In addition, the researchers found that for SWE tasks, scaling up the number of output tokens brought little benefit, whereas increasing the number of rollouts led to much more significant performance gains.