Why Coding Agents Fail When Bugs Span More Than 20 Files
The SWE-EVO benchmark gives coding models a 25% success rate on tasks spanning an average of 21 files, versus over 70% on SWE-bench…
Read the original article