DEV Community

How Far Can a Small Coding Model Go With a Better Harness?

Every time a coding agent underperforms, the default move is to swap in a bigger model. I wanted to see what happens if you refuse that move and fix everything else instead. The result: 61.6% ± 1.9 on Terminal-Bench 2.0 with GPT-5.1-Codex-Mini — rank #41, in the same band as stock harnesses running flagship models a tier or two larger. 445 runs, $27, ~35 hours. This is not an argument that small models are secretly enough. It is an argument that the wrapper around the model is doing more work...

