Behavior Best-of-N achieves Near Human Performance on Computer Tasks
lesswrong.com·17h

Published on October 5, 2025 4:53 PM GMT

TLDR: A new paper from Simular Research reports 70% on OSWorld-Verified, a benchmark for computer use, using a new scaling technique for computer-use agents. A skilled human scores 72% on this benchmark. The technique, Behavior Best-of-N (bBoN), shows promise for improving agent performance, perhaps on a variety of tasks, though further research is needed to test this.

One current bottleneck for long-running agentic AI is the inability to perform complex computer use with high accuracy. The most widely recognized benchmark for complex computer use is OSWorld-Verified, which tests a model on a variety of computer-use tasks based on real-world examples. Examples include <a hr…
