General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

Covered by techtarget.com and sparsethought.com. Discussed on Hacker News.

Specialized clinical artificial intelligence (AI) tools are entering medical practice despite scarce independent evaluation. We quantitatively evaluate two clinical AI tools built on large language models (LLMs), OpenEvidence and UpToDate Expert AI, against three frontier LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. Our evaluation has three stages: (1) 500 MedQA questions testing medical knowledge, (2) 500 HealthBench items measuring alignment with clinicians and (3) the real clinical q...


Covered in 2 articles

techtarget.com · General-purpose AI beats out specialized clinical AI in some assessments | TechTarget

sparsethought.com · A bitter lesson for medicine, or a benchmark problem?
