Skill-Augmented Frontier Agents Nearly Saturate BixBench-Verified-50

Large language model (LLM) agents are increasingly used for biological data analysis, but prior benchmark results have given a mixed picture of whether they are ready for routine bioinformatics work. The original BixBench study reported only ~17-21% accuracy for frontier agents on open-answer bioinformatics questions. Subsequent curation of BixBench-Verified-50 removed or revised ambiguous items, revealing much higher performance for modern agents. Here we evaluate three frontier-model config...
