[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Covered by 8 sources including fireworks.ai, DEV Community

Agents Don't Fail on Intelligence, They Fail on Execution

Discussed on Hacker News and r/LocalLLaMA

DEV Community

An LLM benchmark is only useful for as long as it's hard

Discussed on DEV

Interesting Engineering++

HARNESS ENGINEERING

Discussed on Substack

Token for Token

Zen and the Art of Machine Learning Research

Discussed on Hacker News

labs.beconfident.app

The 98% Problem: A Survey of Harness Engineering for AI Agents

Discussed on Hacker News

cameronrwolfe.substack.com

Agent Evaluation: A Detailed Guide

Discussed on Substack

In other languages

머신러닝 연구의 선(Zen)과 예술 (Korean: "Zen and the Art of Machine Learning Research")

ИИ пишет код, а кто инженерит ИИ? (Russian: "AI Writes the Code, but Who Engineers the AI?")