General-purpose large language models outperform specialized clinical AI tools on medical benchmarks (paper June 12 26)

ChatGPT (5.5, paid): This Nature Medicine brief communication compares two specialist clinical AI tools, OpenEvidence and UpToDate Expert AI, with three frontier general-purpose LLMs: GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6. The central question is whether medical-domain AI products actually outperform the best general-purpose models on clinical tasks. The study uses three evaluation stages: (1) MedQA, 500 USMLE-style multiple-choice medical questions; (2) HealthBench, 500 open-response health promp...
