General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

nature.com · Hacker News

Specialized clinical artificial intelligence (AI) tools are entering medical practice despite scarce independent evaluation. We quantitatively evaluate two clinical AI tools built on large language models (LLMs), OpenEvidence and UpToDate Expert AI, against three frontier LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. Our evaluation has three stages: (1) 500 MedQA questions testing medical knowledge, (2) 500 HealthBench items measuring alignment with clinicians and (3) the real clinical q...

Cited by 1 article

A bitter lesson for medicine, or a benchmark problem?

sparsethought.com · Hacker News