⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford’s TerminalBench
🎯 TL;DR
- I trained a 14B orchestrator model to better coordinate explorer & coder subagents
 - I scaled this to 32x Nvidia H100s, and 416x Intel Xeon Platinum 8470 CPU cores.
 - Qwen3-14B achieved a 160.71% relative increase on Stanford’s TerminalBench after training.
 - Full training code, model weights, datasets, and documentation are released below.
 
This project builds upon the great prime-rl framework developed by Prime Intellect, and heavily depends upon the multi-agent architecture developed in multi-agent-coder. Please note that this code and the resulting model are meant simply as proof-of-concepts and building blocks for multi-agent coding RL.
For a full breakdown of this project’s code structure, see here
💻 Distributed Training on 32x H100s
The image below shows the Orca-Agent-RL training code pushing thirty-two Nvidia H100s to their limits.
At any one time, up to 256 distributed Docker containers were also rolling out simultaneously across the 4-node bare-metal cluster.
This training setup can be scaled from a single instance to a multi-node cluster.
🎛️ GPU Cluster Configuration
The 32x H100 cluster was organised as follows (a small allocation sketch follows the list):
- 16 GPUs: Model training (gradient computation and optimisation)
 - 8 GPUs: Policy model inference (orchestrator model rollouts)
 - 8 GPUs: Subagent model inference (tool-calling rollouts, not trained upon)
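
For concreteness, here is a purely illustrative Python sketch of that split (the names are hypothetical, not taken from the training code), with a sanity check that the partition covers the whole cluster:

```python
# Illustrative only: how the 32 GPUs were divided between training and inference.
GPU_ALLOCATION = {
    "trainer": 16,              # gradient computation and optimisation
    "policy_inference": 8,      # orchestrator model rollouts
    "subagent_inference": 8,    # tool-calling rollouts (not trained upon)
}
assert sum(GPU_ALLOCATION.values()) == 32  # the partition must cover the full cluster
```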
 
🐳 Distributed Docker Rollouts
To maximise CPU utilisation across the cluster, all 256 concurrent Docker environments were automatically distributed across all 4 nodes:
Architecture:
- Main node orchestrates container placement via the DOCKER_ENDPOINTS environment variable
 - Worker nodes expose their Docker daemons over TCP (port 2375, firewall-restricted to the main node)
 
This simple yet effective approach spread the 256 concurrent containers evenly across all available nodes to balance CPU load, and it can be scaled up or down depending on compute budget; a minimal sketch of the placement logic is shown below. The code can be found here (in the other project), and more details on how to link the nodes together can be found here.
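The following is a hedged sketch of what such placement logic can look like, assuming the docker Python SDK and worker daemons reachable at tcp://&lt;host&gt;:2375; it is not the project's actual implementation.

```python
import os
import itertools
import docker

# e.g. DOCKER_ENDPOINTS="tcp://10.0.0.2:2375,tcp://10.0.0.3:2375,tcp://10.0.0.4:2375"
endpoints = os.environ.get("DOCKER_ENDPOINTS", "unix:///var/run/docker.sock").split(",")
clients = [docker.DockerClient(base_url=url.strip()) for url in endpoints]
placement = itertools.cycle(clients)  # round-robin over nodes to balance CPU load

def launch_rollout_container(image: str, task_id: str):
    """Start one isolated rollout environment on the next node in the cycle."""
    client = next(placement)
    return client.containers.run(
        image,
        name=f"rollout-{task_id}",
        detach=True,       # return immediately; the rollout runs in the background
        auto_remove=True,  # clean up the container once the rollout exits
    )
```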
📈 Reward
Below is a visualization of the reward improvement over a single 20-hour run.
Qwen3-14B (run#3) Reward:
- Starts at ~0.47 reward
 - Ends at ~0.78 reward
 
Training Dynamics:
- Entropy (left): Model explores diverse strategies early, then converges to confident policies
 - Gradient Norm (right): Smooth decrease indicates stable, healthy optimisation
 
🏆 Leaderboard Climb
I evaluated Qwen3-14B on Stanford’s TerminalBench before and after training (using Qwen3-Coder-30B-A3B as the explorer & coder subagents). The RL-trained model achieved an 11.25 percentage-point absolute increase (a 160.71% relative increase)! Nice!
| Orchestrator | Subagent | Terminal Bench | 
|---|---|---|
| Qwen3-Coder-480B | Qwen3-Coder-480B | 19.7% | 
| Orca-Agent-v0.1-14B | Qwen3-Coder-30B | 18.25% | 
| Qwen3-14B | Qwen3-Coder-30B | 7.0% | 
The results of this can be found here (qwen) and here (Orca), and instructions on how to reproduce are here.
This places Orca-Agent-v0.1 (14B) + Qwen3-Coder-Flash (30B MoE) within striking distance of Qwen3-Coder-480B running the same architecture, which placed #26 on TerminalBench when it was published recently in my other project.
🏋️♂️ Training & Rollout Details
⚙️ Hyperparameters
- Orchestrator (policy) model: Qwen3-14B
 - Subagent (tool-call) model: Qwen3-Coder-30B-A3B
 - Rollouts: 64 per task
 - 🐳 Each rollout has an isolated Docker environment
 - Batch size: 256 (mbs=1)
 - All environments distributed across 4 nodes
 - Temperature: 1.0
 - Linear learning rate: 1e-6 <-> 5e-6
 - Sequence length: 18,000
 - Precision: BF16
 - Max turns per rollout: 14
 - Rollout timeout: 1200s (20 minutes)
 
*I tried many runs, so the hyperparameters above are representative of the collection of runs. To see the hyperparameters and my notes for each run attempt, see here.
🎁 Reward Design
To provide meaningful supervision during RL, rewards were simplified to just unit tests. I found that whenever I added additional “smartly crafted” reward signals, policy collapse was never far away.
✅ Answer Verification
- Each training datapoint included Python unit tests to verify task completion
 - Tests were assigned individual weights to provide granular partial credit
 - Test execution ran in the isolated Docker container in which the agent completed its work
 - Weighted scoring: passed tests contributed their weight to the final test score (a minimal sketch follows this list)
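
As a concrete illustration of the weighted scoring above, here is a minimal sketch (not the repository's actual reward code) that turns per-test pass/fail results and the datapoint's test_weights into a score:

```python
def weighted_test_reward(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Passed tests contribute their weight; the reward is the weighted fraction passed."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    passed = sum(w for name, w in weights.items() if results.get(name, False))
    return passed / total

# Example: two of three tests pass -> reward 0.50 (0.35 + 0.15)
reward = weighted_test_reward(
    {"test_hook_script_executable": True,
     "test_nginx_service_running": True,
     "test_deployment_works_correctly": False},
    {"test_hook_script_executable": 0.35,
     "test_nginx_service_running": 0.15,
     "test_deployment_works_correctly": 0.50},
)
assert abs(reward - 0.50) < 1e-9
```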
 
🗂️ Dataset Details
I utilised my synthetically generated training dataset published here, created by my multi-agent synthetic data pipeline project found here, which was also presented in this RL project.
📊 Dataset Structure
Each training datapoint contains:
{
  "task_id": "git-deployment-workflow-setup",    # Unique task identifier
  "difficulty": "hard",                          # easy|medium|hard|extremely_hard
  "category": "system-administration",           # Task category
  "prompt": "I need help setting up a simple CI/CD system...",  # The actual task instruction
  "dockerfile": "FROM ghcr.io/laude-institute/t-bench/ubuntu-24-04:latest\n...",  # Docker environment setup
  "test_functions": "def test_hook_script_executable():\n    ...",  # Pytest verification code
  "test_weights": {                              # Weight for each test (for partial credit)
    "test_hook_script_executable": 0.35,
    "test_nginx_service_running": 0.15,
    "test_deployment_works_correctly": 0.50
  },
  "additional_files": {                          # Optional files to include in container
    "backup_config.json": "{\n  \"schedules\": [...",
    "collision_detector.py": "#!/usr/bin/env python3\n..."
  }
}
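A small, hedged sketch of how one might sanity-check a datapoint with this structure (it assumes, as in the example above, that the weights sum to 1.0 and that every pytest function has a matching weight):

```python
import math
import re

def validate_datapoint(dp: dict) -> None:
    """Check that every test function has a weight and that the weights sum to 1.0."""
    defined_tests = set(re.findall(r"def (test_\w+)\(", dp["test_functions"]))
    weighted_tests = set(dp["test_weights"])
    assert defined_tests == weighted_tests, f"weight/test mismatch: {defined_tests ^ weighted_tests}"
    assert math.isclose(sum(dp["test_weights"].values()), 1.0), "weights expected to sum to 1.0"
    assert dp["difficulty"] in {"easy", "medium", "hard", "extremely_hard"}
```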
🎓 Curriculum learning
Before each training run, the model to be trained was evaluated against the tasks in the train dataset. Partially completed tasks were included in the next training run; tasks that were never or always completed were excluded (a minimal filtering sketch follows the list below).
- Stage-1: Tasks where Qwen3-14B succeeded 1-2/3 times (41 tasks)
 - Stage-2: Tasks where the Stage-1 model succeeded 1-4/5 times (ran out of compute budget to try more runs)
 - Stage-3: ... To infinity? 😅
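
Here is a minimal sketch of that filtering step under stated assumptions (the per-attempt rewards, pass threshold, and function name are illustrative, not the exact selection code):

```python
def select_stage_tasks(eval_results: dict[str, list[float]], pass_threshold: float = 1.0) -> list[str]:
    """Keep tasks the current model solves sometimes but not always."""
    selected = []
    for task_id, rewards in eval_results.items():
        successes = sum(r >= pass_threshold for r in rewards)
        if 0 < successes < len(rewards):  # partially solved -> useful learning signal
            selected.append(task_id)
    return selected

# Example: only "task-b" is kept for the next stage.
stage_tasks = select_stage_tasks({
    "task-a": [0.0, 0.0, 0.0],   # never solved -> excluded
    "task-b": [1.0, 0.0, 1.0],   # sometimes solved -> included
    "task-c": [1.0, 1.0, 1.0],   # always solved -> excluded
})
assert stage_tasks == ["task-b"]
```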
 
🤗 Model Weights
The trained Orca-Agent-v0.1 orchestrator model is available on HuggingFace:
This 14B parameter model was trained to coordinate explorer and coder subagents within a multi-agent-coding system, achieving a 160.71% relative improvement on Stanford’s TerminalBench.
🚀 Getting Started
Development Setup
Clone the repository and install dependencies:
git clone git@github.com:Danau5tin/Orca-Agent-RL.git && \
cd Orca-Agent-RL && \
uv sync
That’s it! uv will handle all dependencies automatically.
Reproducing the results
Terminal Bench Evaluation
Follow the guide shown here, and host the models below (for help on how to host, see here):
export ORCA_ORCHESTRATOR_MODEL="openai/DanAu5tin/Orca-Agent-v0.1"
export ORCA_SUBAGENT_MODEL="openai/Qwen/Qwen3-Coder-30B-A3B-Instruct"
export ORCA_ORCHESTRATOR_API_BASE="http://127.0.0.1:8000/v1"
export ORCA_SUBAGENT_API_BASE="http://127.0.0.1:8001/v1" # note the different port (8001) for the subagent
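Before kicking off an eval, a quick sanity check like the following (a hedged sketch, not part of the repo) can confirm both OpenAI-compatible endpoints are up and serving the expected models. It assumes the openai Python package and vLLM-style /v1 servers; the litellm-style "openai/" prefix in the env vars is stripped when querying the server directly.

```python
from openai import OpenAI

# Same hosts/ports as the environment variables above.
orchestrator = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
subagent = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="EMPTY")

for name, client in [("orchestrator", orchestrator), ("subagent", subagent)]:
    served = [m.id for m in client.models.list().data]  # list the model IDs each server exposes
    print(f"{name} endpoint serves: {served}")
```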
Training
A guide on how to setup a rented multi-node cluster can be found here.
To run evals on the train dataset, also see here, complete the section titled “IF RUNNING ON TRAIN DS”, and host the models as shown above.
🪜 Potential Steps Forward
⚠️ Proof of Concept Caveat
This project is a proof of concept for multi-agent coding RL. Given the limited dataset diversity and relatively small training set (due to my resource limitations), there is a meaningful possibility of overfitting to the training distribution. While the TerminalBench results are encouraging, expanding dataset variety and scale would be essential next steps to better validate generalisation capabilities.
After completing stage-1 training, I began experiments for stage-2 (starting from the stage-1 model weights) and saw that, whilst the model learned well (reward increased), it actually decreased in Terminal Bench performance. Below are some of my thoughts on why that is, in the form of ideas I would try given more compute budget.
- Scale up a lot. There is an argument that, given a lot more compute, all that is required is to prune the dataset for the highest-quality tasks, no matter their difficulty, take a big enough model, and scale those rollouts vertically to a dramatic number. In that case I would:
  - Train GLM-4.6 as the base Orchestrator model
  - Use GLM-4.6 as the subagent model too
  - Heat up A LOT OF GPUs, but potentially receive a powerful artifact in return that could really climb the TerminalBench leaderboard.
- Scale up a little.
  - Find a multi-turn RL training framework that has stable MoE support and switch to Qwen3-Coder-30B as the Orchestrator policy model (Qwen3-Coder evaluated as the best ~32B Orchestrator model).
  - Switch to a more competent subagent (GLM-4.6 has been evaluated as the top subagent).
- Tweaks. There is also an argument that no more scale is needed, and there are most certainly ways to improve with the current setup, including but not limited to:
  - Blend run #3 and run #11’s tasks together for a longer run with otherwise identical hyperparameters.
  - Keep the batch size the same, but reduce the number of rollouts per task to allow more tasks per step.
  - Remove the efficiency penalty.
  - Increase the batch size from 256 to 320 by adding a new node, leveraging a load balancer, and giving one more node to the currently bottlenecked subagent inference.
  - Speed up deployment by automating orchestration of the multi-node cluster (NFS, Docker, etc.) instead of relying on a long setup guide.
  - Discover an agentic-RL training framework with a stable MoE implementation (Qwen3-Coder-Flash was the best low-parameter Orchestrator agent in evaluations, but I had to use a dense model).
 
🙏 Acknowledgements
- Thank you to Taras for providing the compute for this project and supporting open source.
- Thank you to the incredibly smart team at Prime Intellect behind prime-rl and verifiers for making all the hard stuff work... and for putting up with my stream of requests 😅
- Cloud providers for the GPUs, including:
  - Hyperbolic, which I used for almost all my experiments and all of my training runs, with an excellent experience.
  - Datacrunch, which I used for running most of my evaluations.
  - Hyperstack, which I used for running some experiments and some evaluations.
- Alex Dimakis - for briefing me on his upcoming (now released) paper “How to Train Your Advisor” during a call on the day of my multi-agent-coder release. That short yet excellent conversation sparked the realisation for me that training the Orchestrator architecture would be far more effective than my previous single-agent approach in Terminal-Bench-RL. Thanks Alex!
 
📝 Citation
Underlying frameworks and models
This work was built and evaluated using the following tools and models:
# Great set of models
@article{qwen3,
title={Qwen3 Technical Report},
author={An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Feng Hu and Hao Ge and Haoran Wei and Huan Lin and Jialong Tang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jing Zhou and Jingren Zhou and Junyang Lin and Kai Dang and Keqin Bao and Kexin Yang and Le Yu and Lianghao Deng and Mei Li and Mingfeng Xue and Mingze Li and Pei Zhang and Peng Wang and Qin Zhu and Rui Men and Ruize Gao and Shixuan Liu and Shuang Luo and Tianhao Li and Tianyi Tang and Wenbiao Yin and Xingzhang Ren and Xinyu Wang and Xinyu Zhang and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yinger Zhang and Yu Wan and Yuqiong Liu and Zekun Wang and Zeyu Cui and Zhenru Zhang and Zhipeng Zhou and Zihan Qiu},
journal = {arXiv preprint arXiv:2505.09388},
year={2025}
}
# A multi-turn RL framework which works. No small find!
@misc{primeintellect2025prime-rl,
author = {Prime Intellect},
title = {PRIME-RL},
url = {https://github.com/PrimeIntellect-ai/prime-rl},
year = {2025}
}
# Great abstractions for the environment and rewards
@misc{brown_verifiers_2025,
author = {William Brown},
title = {{Verifiers}: Environments for LLM Reinforcement Learning},
howpublished = {\url{https://github.com/willccbb/verifiers}},
year = {2025}
}
# Terminal Bench is a large inspiration for my work
@misc{tbench_2025,
title={Terminal-Bench: A Benchmark for AI Agents in Terminal Environments},
url={https://github.com/laude-institute/terminal-bench},
author={The Terminal-Bench Team},
year={2025},
month={Apr}
}
# I was not able to read this paper before I began training, however a conversation with Alex in early September sparked the idea to train a smaller Orchestrator/Advisor model
@misc{asawa2025trainadvisorsteeringblackbox,
title={How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models},
author={Parth Asawa and Alan Zhu and Matei Zaharia and Alexandros G. Dimakis and Joseph E. Gonzalez},
year={2025},
eprint={2510.02453},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.02453},
}
📄 License
All open-sourced items in this release, including:
- Code in this repository
 - Model weights
 - Training data
 
are released under the Apache 2.0 license.