[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Covered by 8 sources including fireworks.ai, DEV Community

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12...

Covered in 8 articles

Agents Don't Fail on Intelligence, They Fail on Execution

An LLM benchmark is only useful for as long as it's hard

HARNESS ENGINEERING