DeepSWE: A contamination-free benchmark for long-horizon coding agents
DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.