How Far Can a Small Coding Model Go With a Better Harness?
Every time a coding agent underperforms, the default move is to swap in a bigger model. I wanted to see what happens if you refuse that move and fix everything else instead. The result: 61.6% ± 1.9 on Terminal-Bench 2.0 with GPT-5.1-Codex-Mini — rank #41, in the same band as stock harnesses running flagship models a tier or two larger. 445 runs, $27, ~35 hours. This is not an argument that small models are secretly enough. It is an argument that the wrapper around the model is doing more work...