Introduction
Large language models (LLMs) have advanced significantly in recent years, with systems such as OpenAI-o1 [1] and DeepSeek-R1 [2] demonstrating remarkable reasoning capabilities. These models have excelled in structured problem-solving and logical inference, achieving notable success in fields like mathematics and programming [2,3,4]. However, their application in the medical domain—a field defined by complexity, high stakes, and the need for contextual understanding—remains underexplored.
Existing medical LLM benchmarks [5,6,7,8,9,10,11,12,13,14,15] focus on evaluating final generation accuracy using exam-style questions. Some studies have begun to move beyond exam-style evaluations, such as ref. 16, which assesses LLMs on authentic clinical tasks, and ref. 17, which benchmarks advanced models like DeepSeek-R1 on real clinical challenges. However, most efforts still fall short in systematically evaluating reasoning quality, a critical aspect of clinical LLMs. Clear reasoning processes are invaluable for medical human-AI interactions, as they enable clinicians to trust and effectively follow the recommendations provided. A few recent benchmarks [18,19,20,21] have explored the reasoning abilities of LLMs, but they often rely on synthetic or conversational data instead of real clinical cases and typically lack scalable, automated metrics for assessing reasoning processes. This gap limits a comprehensive understanding of the reliability and utility of reasoning LLMs in clinical settings.
To address this, we propose MedR-Bench, the first benchmark specifically designed to evaluate the medical reasoning capabilities of state-of-the-art LLMs. MedR-Bench includes 1453 clinical cases spanning 13 body systems and 10 disorder types, with 656 cases dedicated to rare diseases. Unlike existing benchmarks, MedR-Bench emphasizes not only the correctness of final diagnoses or treatment plans but also the transparency, coherence, and factual soundness of the reasoning processes behind them. Inspired by prior works [22,23], the benchmark is constructed from clinical case reports in the PMC Open Access Subset [24], reorganized into structured patient cases using GPT-4o. Each case consists of (i) detailed patient information (e.g., chief complaint, medical history), (ii) a structured reasoning process derived from case discussions, and (iii) the final diagnosis or treatment plan, reflecting practical clinical complexity. By incorporating diverse and challenging cases, including rare conditions, MedR-Bench serves as a comprehensive testbed for assessing the reasoning capabilities of LLMs in clinical environments.
To evaluate LLMs, we propose a framework spanning three critical clinical stages: examination recommendation, diagnostic decision-making, and treatment planning, capturing the entire patient care trajectory. Examination recommendation evaluates the model’s ability to suggest relevant clinical assessments and iteratively gather necessary information. Diagnostic decision-making tests the model’s ability to derive precise diagnoses based on patient history, examination findings, lab tests, and imaging findings. Finally, treatment planning assesses the model’s ability to recommend appropriate interventions, such as monitoring strategies, medications, or surgical options, grounded in diagnostic conclusions and patient context.
To quantify performance, we develop an evaluation system to assess both reasoning quality and final outputs. For reasoning evaluation, we introduce the Reasoning Evaluator, an automated agentic system that validates free-text reasoning processes using web-scale medical resources and performs cross-referencing. It calculates LLM-powered reasoning metrics for efficiency, factuality, and completeness. For final outputs, we adopt standard metrics such as accuracy, precision, and recall. Using MedR-Bench, we evaluate seven reasoning-enhanced LLMs—OpenAI-o3-mini, Gemini-2.0-Flash Thinking, DeepSeek-R1, Qwen-QwQ, Baichuan-M1, DiagnoseGPT, MedGemma—providing a comparative analysis of their strengths and limitations across various clinical stages.
Our findings reveal that current clinical LLMs perform well on relatively simple tasks, such as generating accurate diagnoses when sufficient information is available, achieving over 85% accuracy. However, they struggle with complex tasks, such as examination recommendation and treatment planning. In terms of reasoning quality, LLMs exhibit strong factual accuracy, with nearly 90% of reasoning steps being correct, but omissions in critical reasoning steps are common, indicating a need for improved reasoning completeness. For rare diseases, while these cases remain challenging, models generally show consistent performance across reasoning and prediction tasks, suggesting a robust understanding of medical knowledge across case types.
Encouragingly, our findings suggest that open-source models, such as DeepSeek-R1, are steadily closing the gap with proprietary systems like OpenAI-o3-mini, underscoring their potential to drive accessible and equitable healthcare innovations and motivating continued efforts in their development. All code, data, assessed model responses, and the evaluation pipeline are fully open-sourced in MedR-Bench.
Results
In this section, we present our main findings. We begin with an overview of MedR-Bench, followed by an analysis of results across the three key stages: examination recommendation, diagnostic decision-making, and treatment planning. In Supplementary A.1, we provide qualitative case studies.
LLMs for evaluation
This study utilizes a range of models with varying versions, sizes, cut-off dates for training data, and release dates. For closed-source models, we accessed their APIs directly, while for open-source models, we downloaded the model weights and conducted local inference. The details are presented below.
OpenAI-o3-mini: this is a closed-source model with the version identifier o3-mini-2025-01-31. Its model size is not disclosed. The cut-off date for training data is October 2023, and it was officially released in January 2025.
Gemini-2.0-FT: this is a closed-source model, identified by the version Gemini-2.0-flash-thinking-exp-01-21. Similar to OpenAI-o3-mini, the model size is not disclosed. Its cut-off date for training data is June 2024, and it was officially released in January 2025.
DeepSeek-R1: this is an open-source model with the version identifier deepseek-ai/DeepSeek-R1. It is a large-scale model with 671 billion parameters (671B). The cut-off date for training data is not disclosed, and it was released in January 2025.
Qwen-QwQ: this is an open-source model with the version identifier Qwen/QwQ-32B-Preview. It has 32 billion parameters (32B). The cut-off date for training data is not disclosed, and the model was released in November 2024.
Baichuan-M1: unlike the previously mentioned LLMs designed for general domains, this is an open-source medical-specific model with the version identifier baichuan-inc/Baichuan-M1-14B-Instruct. It has 14 billion parameters (14B), with no disclosed cut-off date for training data. The model was released in January 2025.
DiagnoseGPT: DiagnoseGPT is a series of medical LLMs specifically developed for diagnosis. In our evaluation, we locally deploy FreedomIntelligence/DiagnosisGPT-34B, which was released in July 2024.
MedGemma: MedGemma is a variant of Gemma 3, which is optimized for the medical domain by Google DeepMind. It has 27 billion parameters. Its base model Gemma 3’s cut-off date for training data is August 2024, and it was officially released in May 2025.
A more detailed introduction to these LLMs is provided in Section “LLM baselines.”
Introduction of MedR-Bench
Our proposed MedR-Bench comprises three key components: (1) structured patient cases, (2) a versatile evaluation framework spanning three stages, and (3) a comprehensive set of evaluation metrics.
Patient cases
Leveraging the case reports from the PMC Open Access Subset [24], we compiled a dataset of 1453 patient cases published after July 2024 to ensure a fair and robust assessment across all models based on their training data cut-off dates. These are divided into two subsets: MedR-Bench-Diagnosis, with 957 diagnosis-related cases, and MedR-Bench-Treatment, with 496 treatment-related cases. As illustrated in Supplementary Fig. 1, all cases are systematically organized into the following elements:
Case Summary: documents key patient information. For diagnosis cases, this includes basic patient demographics (e.g., age, sex), chief complaint, history of present illness, past medical history, family history, physical examination, and ancillary tests (e.g., lab and imaging results). For treatment cases, additional factors such as allergies, social history, and diagnostic results are included, as these influence treatment decisions. Any missing information in the raw case reports is recorded as “not mentioned.”
Reasoning Processes: summarized from the discussion sections of case reports, this captures the logical steps used to reach a diagnosis or formulate a treatment plan. For diagnosis cases, the reasoning focuses on methods like differential diagnosis. For treatment cases, it emphasizes treatment goals and the rationale behind the chosen interventions.
Diagnosis or Treatment Results: directly extracted from the raw case reports. For diagnosis, this includes identified diseases. For treatment, it consists of free-text descriptions of the recommended interventions.
Additionally, each case is categorized by “body system” and “disorders and conditions” following the taxonomy from MedlinePlus (https://medlineplus.gov/healthtopics.html). We further utilize the Orphanet Rare Disease Ontology (ORDO, http://www.ebi.ac.uk/ols4/ontologies/ordo) [25] to identify rare diseases among the cases. This allows MedR-Bench-Diagnosis and MedR-Bench-Treatment to be further split into rare disease subsets containing 491 and 165 cases, respectively. Case distributions are detailed in the Methods section, with patient case examples provided in Supplementary A.1.
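As a concrete illustration of how such a case might be represented, the sketch below shows one plausible machine-readable layout; the field names and types are hypothetical rather than the exact schema released with MedR-Bench.

```python
from dataclasses import dataclass

@dataclass
class MedRBenchCase:
    """Illustrative container for one structured patient case (hypothetical field names)."""
    case_id: str                  # source case-report identifier
    task: str                     # "diagnosis" or "treatment"
    body_system: str              # MedlinePlus body-system category
    disorder_type: str            # MedlinePlus disorders-and-conditions category
    is_rare_disease: bool         # flagged via ORDO matching
    case_summary: dict[str, str]  # demographics, chief complaint, histories, exams, ...
                                  # missing report fields are stored as "not mentioned"
    reasoning_process: list[str]  # ordered steps distilled from the case discussion
    final_result: str             # ground-truth diagnosis or free-text treatment plan
```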
Evaluation settings
To evaluate LLMs’ clinical capabilities, we developed a framework covering three stages of the patient care journey: examination recommendation, diagnostic decision-making, and treatment planning, as shown in Fig. 1a (more detailed demonstrations are shown in Supplementary Fig. 8). Below, we summarize these components (see Section “Evaluation framework” in the “Methods” for full implementation details).
Fig. 1: Overview of our main evaluation pipeline and results.
a Our evaluation framework across three critical patient stages. b The LLM-powered metrics for reasoning processes and final generations using our Reasoning Evaluator. c The performance of seven LLMs on examination recommendation, diagnostic decision-making, and treatment planning. Notably, for treatment planning, we include a comparison on rare disease cases. For other settings, as the rare disease results show minimal variation compared to all cases, we omit them here and provide them in the supplementary tables. d The qualities of reasoning processes, with results for rare cases also provided in the supplementary tables. For examination recommendation, 1-turn reasoning results are plotted, and for diagnostic decision, oracle reasoning results are plotted. Error bars show two-sided 95% z-based confidence intervals for the mean across cases; n denotes the number of independent patient cases.
Examination recommendation
This setting simulates a scenario where a patient first visits a hospital, and LLMs are tasked with recommending examination items such as lab tests or imaging studies, iteratively gathering information to aid diagnosis or treatment. Using the MedR-Bench-Diagnosis, the case summaries—excluding ancillary test results—serve as input, while the ancillary test events serve as the ground-truth reference. Similar to previous works [14,26,27], we initialize an LLM-powered agent to play the role of the patient. The assessed clinical LLM can interact with it by recommending relevant examination items, and the agent provides the corresponding results.
To evaluate performance, we define two sub-settings: (i) 1-turn examination recommendation: LLMs can query examination results in a single round of interaction; (ii) Free-turn examination recommendation: LLMs can query information through multiple rounds until sufficient information is gathered for subsequent decisions.
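A minimal sketch of how these two settings might be driven is shown below, assuming hypothetical placeholders query_clinician_llm (the assessed model) and patient_agent_reply (the LLM-powered patient agent); the "DONE" stopping convention is our own illustration, not the benchmark's exact protocol.

```python
def query_clinician_llm(case_context: str, history: list[str]) -> str:
    """Hypothetical stand-in for the assessed clinical LLM: returns the next
    examination to request, or "DONE" once it judges the information sufficient."""
    return "DONE"  # replace with an actual model call

def patient_agent_reply(requested_exam: str) -> str:
    """Hypothetical stand-in for the LLM-powered patient agent, which answers
    with the corresponding result from the case record (or 'not available')."""
    return "not available"  # replace with the patient-simulator call

def run_examination_recommendation(case_context: str, max_turns: int) -> list[str]:
    """Drive the interaction loop. max_turns=1 mimics the 1-turn setting
    (a single round of requests); a large max_turns mimics the free-turn
    setting, where the model itself decides when to stop querying."""
    history: list[str] = []
    for _ in range(max_turns):
        request = query_clinician_llm(case_context, history)
        if request.strip().upper() == "DONE":  # self-termination (free-turn)
            break
        history.append(f"Requested: {request}")
        history.append(f"Result: {patient_agent_reply(request)}")
    return history
```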
Diagnostic decision-making
This setting evaluates whether LLMs can deliver accurate diagnoses based on the given patient information. Using the MedR-Bench-Diagnosis, case summaries serve as input, while the recorded diagnoses serve as the ground truth.
We define three sub-settings based on the availability of examination information: (i) diagnostic decision after 1-turn examination recommendation: LLMs use the limited information gathered from the 1-turn setting; (ii) diagnostic decision after free-turn examination recommendation: LLMs use more comprehensive information from the free-turn setting; (iii) oracle diagnosis: LLMs have access to all ground-truth examination evidence, representing the easiest setting.
Treatment planning
This setting evaluates LLMs’ ability to propose suitable treatment plans. Using the MedR-Bench-Treatment, case summaries—including diagnostic results—serve as input, with the practical treatment plan as the reference. Unlike diagnosis, only the oracle setting is used, where LLMs are provided with all ground-truth patient data, for example, basic patient information, ancillary tests, and ground-truth diagnostic results. Even under this fully informed setting, treatment planning remains sufficiently challenging, as suggested by our results.
Evaluation metrics
We designed six metrics to objectively evaluate the performance of LLMs, focusing on both their reasoning processes and final outputs, as illustrated in Fig. 1b. Notably, DeepSeek-R1 produces two potential reasoning parts: one in the formal answer and the other in the dedicated thinking section (please refer to “Methods” “LLM baselines” for more detailed explanations). By default, in figures, we report the former for fair comparison. In tables, we report LLM-powered reasoning metrics for both, recorded as “XX/xx,” where the former denotes the reasoning in the formal answer and the latter denotes the marked thinking part. Below, we briefly introduce these metrics, with more detailed explanations provided in Section “Evaluation metrics.”
For reasoning processes, which are primarily expressed in free text and pose significant evaluation challenges [11,12,28,29], we developed an LLM-based system called the Reasoning Evaluator. This system decomposes, structures, and verifies reasoning steps. It identifies effective versus repetitive steps and evaluates their alignment with medical knowledge or guidelines by referencing online medical resources. If ground-truth reasoning references are available, the system further assesses whether all relevant steps have been included. Please refer to the Methods for more details.
Based on this pipeline, we define the following LLM-powered reasoning metrics:
Efficiency: evaluates whether each reasoning step contributes new insights toward the final answer rather than repeating or rephrasing previous results.
Factuality: assesses whether effective reasoning steps are consistent with established medical guidelines or knowledge. Similar to a “precision” score, it calculates the proportion of factually correct steps among all predicted effective reasoning steps.
Completeness: measures how many reasoning steps explicitly marked in the raw case report are included in the generated content. Analogous to “recall,” it computes the proportion of mentioned reasoning steps among all ground-truth steps. While raw case reports may omit some steps, those included are considered essential reasoning evidence.
These three LLM-powered metrics work together to comprehensively assess the quality of the reasoning process. Efficiency evaluates whether each step offers a potential direction for reasoning, while factuality assesses whether the reasoning aligns with established medical knowledge. Completeness, from another perspective, measures whether the model’s reasoning process covers all necessary analytical steps.
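As a concrete illustration, the sketch below computes the three proportions from reasoning steps that the evaluator has already decomposed and labeled; the data layout, the interpretation of efficiency as a proportion of effective steps, and the function names are our own illustration rather than the released implementation.

```python
from dataclasses import dataclass

@dataclass
class LabeledStep:
    """One decomposed reasoning step after evaluator labeling (illustrative layout)."""
    text: str
    is_effective: bool              # contributes new insight (not a repeat/rephrase)
    is_factual: bool                # consistent with medical knowledge/guidelines
    matched_reference_ids: set[str] # ground-truth reasoning steps this step covers

def reasoning_metrics(steps: list[LabeledStep], num_reference_steps: int) -> dict:
    """Per-case reasoning metrics.

    efficiency   = effective steps / all predicted steps
    factuality   = factually correct steps / effective steps (precision-like)
    completeness = covered ground-truth steps / all ground-truth steps (recall-like)
    """
    effective = [s for s in steps if s.is_effective]
    covered = set().union(*(s.matched_reference_ids for s in effective)) if effective else set()
    return {
        "efficiency": len(effective) / len(steps) if steps else 0.0,
        "factuality": sum(s.is_factual for s in effective) / len(effective) if effective else 0.0,
        # completeness is undefined when the case report marks no reasoning steps
        "completeness": len(covered) / num_reference_steps if num_reference_steps else None,
    }
```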
On the final generation, for example, recommended examinations, diagnosed diseases, and treatment plans, the following metrics are used:
Accuracy: evaluates whether the final answer (both diagnosis and treatment) explicitly matches the ground-truth provided in the raw case reports.
Precision and Recall: used for examination recommendation, where LLMs generate a list of recommended examinations for a given patient case. These metrics are calculated by comparing the generated examination list with the ground-truth ancillary test list recorded in the case report.
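A simplified sketch of this list comparison is shown below; it assumes normalized exact string matching between recommended and recorded examinations, whereas matching synonymous examination names in practice requires a more tolerant comparison.

```python
def exam_precision_recall(recommended: list[str], ground_truth: list[str]) -> tuple[float, float]:
    """Precision/recall of a recommended examination list against the
    ground-truth ancillary tests (simplified exact-match variant).

    Precision: fraction of recommended exams that appear in the case record.
    Recall:    fraction of recorded exams that were recommended.
    """
    rec = {e.strip().lower() for e in recommended}
    gt = {e.strip().lower() for e in ground_truth}
    hits = rec & gt
    precision = len(hits) / len(rec) if rec else 0.0
    recall = len(hits) / len(gt) if gt else 0.0
    return precision, recall

# Example with hypothetical exam names:
# exam_precision_recall(["chest ct", "cbc"], ["cbc", "tsh", "chest ct"]) -> (1.0, 0.667)
```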
Results in examination recommendation
This section presents the main evaluation results for examination recommendations, as illustrated in Fig. 1c, d. Detailed results for the recommended examinations are summarized in Supplementary Table 1, while the results for the reasoning processes are provided in Supplementary Table 2.
Analysis on recommended examinations
In the 1-turn setting, as shown in Supplementary Table 1, DeepSeek-R1 achieves the highest recall at 43.61%, demonstrating its ability to identify the most relevant examinations. Gemini-2.0-FT follows closely with a recall of 43.12%. Qwen-QwQ and MedGemma rank in the middle, while OpenAI-o3-mini, Baichuan-M1, and DiagnoseGPT perform sub-optimally.
For precision, Baichuan-M1 outperforms other models with a score of 41.78%, indicating better alignment with medical scenarios and the ability to recommend relevant examinations. In contrast, Gemini-2.0-FT and MedGemma record the lowest precision at 22.77% and 22.65%, respectively, suggesting frequent recommendations of irrelevant examinations.
In the free-turn setting, where models are allowed unlimited queries, no significant improvements are observed in either precision or recall across all models. Missed examinations remain unrecovered, even with additional turns, and performance even declines in some cases.
For general LLMs, OpenAI-o3-mini achieves a recall of 38.22% in the free-turn setting, slightly lower than its 1-turn recall of 38.47%. Similarly, DeepSeek-R1 drops from 43.61% in the 1-turn setting to 40.67% in the free-turn setting.
For medical LLMs, DiagnoseGPT achieves a recall of only 12.38%, which is substantially lower than its 1-turn recall of 22.70%. Two factors explain this. First, general LLMs tend to enter repetitive query loops in the free-turn setting, which hinders further improvement in recall even with more available interactive turns. Second, unlike the 1-turn setting, in which the instruction forces LLMs to query examinations for at least one turn, the free-turn setting requires models to decide for themselves when to stop querying. In certain cases, they terminate the process prematurely without making any further queries, causing their recall to drop in this setting. This problem is particularly severe for DiagnoseGPT because its training is tailored to diagnosis rather than examination recommendation, leading to a significant decline in its performance.
Overall, current LLMs still face significant challenges in handling multi-turn dialog effectively, which limits the utility of the free-turn setting and underscores the difficulties these models encounter when dynamically generating appropriate queries during extended clinical interactions.
Finally, when analyzing performance on rare diseases (Supplementary Table 1), we find that most models maintain performance comparable to that on common diseases.
Analysis on reasoning processes
At the reasoning level, we focus primarily on the 1-turn setting, as the free-turn setting involves extended reasoning processes that grow with the number of turns. Notably, completeness cannot be calculated in this context because raw case reports rarely document the reasoning behind the selection of specific examinations.
As shown in Supplementary Table 2, the results on efficiency reveal that DeepSeek-R1 achieves the highest score at 98.59%, demonstrating its ability to produce concise and relevant reasoning steps. In contrast, Qwen-QwQ performs sub-optimally, with an efficiency score of just 86.53%. This may be attributed to its training objective of “reflecting deeply” [30], which likely causes it to generate excessive attempts, ultimately reducing its efficiency. DiagnoseGPT performs worst on this metric due to its failure in interactive examination querying and its tendency to repeatedly summarize the patient’s situation rather than reasoning through the next step.
For factuality, most LLMs perform well, achieving scores close to 95%. Among them, Gemini-2.0-FT emerges as the most reliable model in examination recommendation, with a factuality score of 98.75%. However, it is notable that none of the models achieve perfect factuality (100%) in their reasoning processes, underscoring the need to carefully verify critical reasoning steps in practical medical applications.
When analyzing reasoning on rare diseases (Supplementary Table 2), we observe consistent trends with those for common diseases, suggesting the robustness of LLMs across common and rare cases.
Results in diagnostic decision-making
This section presents the results for diagnostic decision-making, analyzing performance on both the final output and reasoning levels. Figure 1c and Supplementary Table 3 show the diagnostic decision-making accuracy for both all diseases and rare diseases. Figure 1d and Supplementary Table 4 present the results for diagnostic reasoning processes across all diseases. Supplementary Table 5 further shows the results of the reasoning process specifically for rare diseases.
Analysis on disease diagnosis
As shown in Fig. 1c and Supplementary Table 3, we evaluate diagnostic performance across three settings: 1-turn, free-turn, and oracle. Notably, for oracle diagnosis, we also introduce a new human reference benchmark. Six physicians with 5 years of clinical experience from Xin Hua Hospital, affiliated with Shanghai Jiao Tong University School of Medicine, were invited to independently perform an oracle diagnosis task with the help of online searching (restricting access to the original case reports). Their average performance is recorded to provide a meaningful human baseline.
In the 1-turn setting, DeepSeek-R1 achieves the highest diagnostic accuracy (71.79%), demonstrating its ability to gather relevant information and produce accurate diagnoses. Gemini-2.0-FT follows with an accuracy of 68.55%. These results highlight the correlation between active information collection and diagnostic precision. Baichuan-M1 and OpenAI-o3-mini rank in the middle, while Qwen-QwQ, MedGemma, and DiagnoseGPT perform less effectively, consistent with their results in examination recommendation.
In the free-turn setting, where models can iteratively query additional information, most models show improved diagnostic accuracy. For instance, DeepSeek-R1 increases its accuracy from 71.79% (1-turn) to 76.18%, and OpenAI-o3-mini improves from 64.99% to 67.19%, even though they do not demonstrate higher examination recommendation recall in the free-turn setting. This pattern can be primarily attributed to the increased number of reasoning tokens generated in the free-turn setting. Specifically, during free-turn interactions, models may not necessarily propose more critical examinations, but they can revisit and reinterpret previously recommended examinations. This enables the formation of longer and more elaborate reasoning chains, which not only provide the model with greater opportunity to justify and refine its decisions but also enhance its ability to compensate for earlier errors, serving as a form of inference scaling. However, if a model performs too poorly in free-turn examination recommendation, errors tend to propagate. For instance, DiagnoseGPT’s free-turn diagnosis accuracy drops significantly, from 54.44% to 39.60%. This trend aligns with its markedly reduced performance in examination recommendation, where its recall drops by more than 10% in the free-turn setting compared to the 1-turn setting.
In the oracle setting, where all crucial diagnostic information is provided, all models achieve significantly higher accuracy. For example, DeepSeek-R1 improves from 76.18% in the free-turn setting to 89.76%, followed by Gemini-2.0-FT. MedGemma, OpenAI-o3-mini, Qwen-QwQ, and Baichuan-M1 also perform well, achieving accuracies above 83%. DiagnoseGPT performs worst in this task, with an accuracy of 79.62%. These results emphasize the importance of identifying and recommending relevant examinations to support accurate diagnoses. Interestingly, recent LLMs generally outperform individual physicians in diagnostic accuracy. We attribute this to the broad and systematic medical knowledge LLMs acquired during training, in contrast to the limitations of individual human expertise across multiple specialties. Therefore, the performance of a single physician should be viewed as a reference point rather than a definitive upper bound of human capability. Since all benchmark cases were successfully resolved by expert teams, producing definitive diagnoses and guideline-compliant treatment plans, the true human upper bound corresponds to solving all cases correctly within a multidisciplinary clinical workflow.
Performance on rare diseases is consistent with that on common ones. On the one hand, this further demonstrates the robustness of these models in challenging scenarios. Through pretraining on large medical corpora, they have encountered rare conditions that are difficult for ordinary physicians to master. On the other hand, we conducted a thorough case study and found that many rare diseases have specific diagnostic tests, which are provided in the auxiliary test results, significantly reducing the difficulty of such tasks. The primary challenge in diagnosing rare diseases lies in proposing the appropriate specific test as early as possible. We also include a case study for this situation, as illustrated in Supplementary A.1.4.
Analysis on reasoning processes
In the 1-turn diagnostic setting, as shown in Supplementary Table 4, where reasoning builds on incomplete examinations, most models—except Qwen-QwQ—show a decline in factuality compared to the oracle setting. This suggests that missing examinations increase the likelihood of hallucinated reasoning.
Delving deeper into factuality, specialized models like DiagnoseGPT achieve the highest score of 89.14%, outperforming generalists such as Qwen-QwQ (88.14%) and Baichuan-M1 (88.62%), possibly due to its targeted optimization for diagnostic tasks that enhances reliability under uncertainty. DeepSeek-R1 follows closely at 87.15%, demonstrating robust performance.
When it comes to efficiency and completeness, DeepSeek-R1 demonstrates superior efficiency, achieving a score of 95.86%, and outperforms closed-source models such as OpenAI-o3-mini (91.59%) and Gemini-2.0-FT (83.77%). This advantage is likely attributable to its 671B parameters, which facilitate concise pattern recognition even when information is limited. In contrast, smaller models like Baichuan-M1 (82.91%) and Qwen-QwQ (76.97%) encounter greater challenges in efficiency. Notably, Qwen-QwQ adopts a verbose style to compensate for incomplete information, which comes at the expense of brevity. However, this verbosity enhances completeness; Qwen-QwQ achieves the highest completeness score (66.94%), as generating more detailed responses helps retrieve ground-truth evidence under uncertainty. Conversely, specialized models such as DiagnoseGPT prioritize accuracy, resulting in high factuality (89.14%) but lower completeness (25.44%). MedGemma strikes a balance, with moderate efficiency (90.22%) and completeness (49.91%).
Overall, completeness scores in the 1-turn diagnostic setting remain lower than those in the oracle setting, highlighting the constraints imposed by missing examinations on comprehensive reasoning. This is expected, as limited examination data increase the risk that LLMs overlook necessary reasoning steps due to the absence of prior information.
In the oracle setting, where all essential examination results are provided, models generally exhibit improved performance across metrics due to the availability of complete data.
Delving into factuality, closed-source models lead with Gemini-2.0-FT achieving the highest score of 98.23%, followed closely by OpenAI-o3-mini at 94.94%, reflecting their strong ability to avoid hallucinations when all information is present. Among open-source models, Baichuan-M1 (96.84%) and DeepSeek-R1 (95.03%) perform robustly, while Qwen-QwQ lags at 84.02%, possibly due to its verbose tendencies introducing unnecessary inferences. Compared to the 1-turn setting, factuality improves significantly.
Building on factuality, efficiency is notably high in this setting, with DeepSeek-R1 topping the list at 97.17%, outperforming even closed-source models like Gemini-2.0-FT (95.89%) and OpenAI-o3-mini (94.33%). This reflects DeepSeek-R1’s streamlined processes with complete data, while smaller models like Baichuan-M1 (92.80%) maintain solid efficiency, but Qwen-QwQ struggles at 71.20% due to its lengthy outputs.
Shifting to completeness, MedGemma excels at 87.72%, benefiting from its medical specialization to retrieve comprehensive evidence, closely followed by Gemini-2.0-FT (83.28%). Qwen-QwQ achieves 79.97%, where its verbosity proves advantageous by covering more ground-truth elements, though this comes at the expense of efficiency and factuality. In contrast, smaller models like Baichuan-M1 (75.11%) show moderate completeness.
Notably, for rare diseases as shown in Supplementary Table 5, the performance remains consistent, and the factuality of most LLMs does not decline. Efficiency trends mirror the all-diseases setting, with DeepSeek-R1 again leading in both 1-turn (95.96%) and oracle (97.61%) evaluations, though Qwen-QwQ’s efficiency remains low (76.34% in 1-turn and 72.25% in oracle). Completeness shows similar patterns, with Qwen-QwQ (66.53% in 1-turn) and MedGemma (88.77% in oracle) performing strongly, indicating that rarity does not significantly impair models’ ability to generate comprehensive reasoning when examinations are available.
Results in treatment planning
This section presents the results of treatment planning. The overall findings are illustrated in Fig. 1c (final generation) and Fig. 1d (reasoning processes), with detailed results provided in Supplementary Table 6.
Analysis on treatment plans
In treatment planning, similar to oracle diagnosis, we also introduce a human reference benchmark. We observe that the accuracy of recommended treatment plans is significantly lower than that of diagnostic outputs. Among the models, Baichuan-M1 and DeepSeek-R1 achieve the highest accuracy at 30.65% and 30.51%, respectively. These results underline the increased complexity of treatment planning compared to diagnosis, emphasizing the need for further development of LLMs.
Unlike diagnosis, where rare cases do not impact performance, treatment planning shows a notable decline in accuracy for rare diseases across general models. For instance, OpenAI-o3-mini drops from 27.03% to 23.17%, and DeepSeek-R1 decreases from 30.51% to 27.27%. This highlights a persistent gap in therapeutic knowledge for rare conditions. In contrast, Baichuan-M1 maintains stable performance, with accuracy only slightly decreasing from 30.65% to 30.30%, demonstrating the effectiveness of its medical knowledge enhancement. Regarding the human baseline, the accuracy for all diseases is 36.67%, which is significantly higher than that of current LLMs, indicating that current LLMs still lack sufficient capability in treatment planning. However, it is important to note that the human baseline (36.67%) is still far from ideal. In our evaluation, it reflects the performance of a single physician with 5 years of experience working independently. Treatment planning is inherently more challenging than diagnosis, often requiring multidisciplinary input and consideration of various clinical scenarios. In addition, the evaluation criteria for treatment are stricter: while a diagnosis is considered correct if it matches the ground truth, a treatment plan must comprehensively address multiple key aspects, and missing even one critical component renders the plan incorrect. These factors collectively contribute to the relatively low accuracy observed in both human and model performance. A more detailed case demonstration is provided in Supplementary A1.5.
Analysis on reasoning processes
As shown in Supplementary Table 6, reasoning quality in treatment planning is generally strong.
Delving into factuality, most models demonstrate robust performance across both “all diseases” and “rare diseases,” with scores typically above 94%. In the “all diseases” setting, closed-source models like Gemini-2.0-FT lead with the highest score of 96.96%, closely followed by OpenAI-o3-mini at 96.77% and open-source models such as Baichuan-M1 at 96.56%. DeepSeek-R1 achieves 94.59%, while specialized models like DiagnoseGPT (92.86%) and Qwen-QwQ (94.40%) are slightly lower but still reliable. For “rare diseases,” patterns remain similar, with OpenAI-o3-mini topping at 96.81% and Gemini-2.0-FT at 96.68%, followed by Baichuan-M1 at 95.97%. Overall, factuality shows minimal degradation in rare diseases, suggesting that models handle less common conditions without increased hallucination.
Shifting to efficiency, most models excel in producing concise reasoning, with scores often exceeding 90%, highlighting their capability to generate streamlined treatment plans without unnecessary verbosity. For “all diseases,” MedGemma stands out with the highest efficiency of 96.53%, followed by DeepSeek-R1 at 95.25% and OpenAI-o3-mini at 94.67%. Closed-source models generally perform well, with Gemini-2.0-FT at 93.66%, while open-source ones vary: Baichuan-M1 achieves 88.47%, but Qwen-QwQ lags at 84.76%, likely due to its tendency for verbose outputs. In “rare diseases,” trends persist, with MedGemma again leading at 96.91% and DeepSeek-R1 at 95.37%; Qwen-QwQ remains the lowest at 83.31%. This consistency across disease types indicates that efficiency is robust to rarity.
Finally, for completeness, models vary more widely, with scores ranging from 50% to nearly 80%. In “all diseases,” Qwen-QwQ achieves the highest completeness of 77.66%, benefiting from its verbose reasoning that covers more ground-truth elements, followed by Gemini-2.0-FT at 75.89% and MedGemma at 71.70%. DeepSeek-R1 scores 68.08%, while specialized models like DiagnoseGPT lag at 53.86%, prioritizing brevity over exhaustiveness. For “rare diseases,” Qwen-QwQ again leads at 78.74%, with Gemini-2.0-FT at 77.10% and DeepSeek-R1 at 68.28%; DiagnoseGPT remains low at 52.25%. Completeness shows slight improvements in rare diseases for some models, suggesting that verbosity aids in addressing uncertainties in uncommon cases. However, this metric often trades off with efficiency and factuality, as seen in Qwen-QwQ’s high completeness at the expense of lower scores in the other two areas, whereas balanced models like MedGemma achieve moderate completeness without sacrificing overall reasoning quality.
Considering that the final accuracy of treatment planning remains below 30%, it becomes evident that the current reasoning processes, while generally concise and exhibiting reduced hallucinations (though not entirely eliminating them), are still insufficient to ensure high-quality treatment recommendations. This highlights the inherent complexity of treatment planning: even when reasoning is streamlined and hallucinations are mitigated, omissions of critical reasoning steps often lead to incomplete or incorrect treatment plans. Model completeness remains around 70%, indicating substantial room for improvement in covering all necessary aspects. Compared to diagnosis, treatment planning accuracy is much more sensitive to missing reasoning steps. In diagnostic tasks, models may occasionally reach correct conclusions even if some reasoning elements are overlooked. In contrast, for treatment planning, missing a key step typically has a direct and negative impact on the final recommendation, resulting in plans that lack essential components. These findings underscore the need for further advancements in both the depth and completeness of reasoning chains.
Discussion
In this study, we evaluate the latest reasoning-enhanced LLMs in the medical domain, focusing on both final outputs and the underlying reasoning processes. Unlike previous work on medical LLM evaluation [5,6,7,8,9,10,11,12,13,14,15], our approach places greater emphasis on quantifying the quality of reasoning. The key contributions of this study are as follows:
A diverse evaluation dataset on clinical patient cases with reasoning references. We introduce MedR-Bench, a dataset of 1453 structured patient cases derived from published case reports. It spans 13 medical body systems and 10 disorder specialties, covering both common and rare diseases for diagnosis and treatment planning. Unlike existing multiple-choice datasets, MedR-Bench closely mirrors practical medical practice. Furthermore, each case is enriched with reasoning evidence extracted from high-quality case reports, enabling a rigorous evaluation of reasoning processes.
A versatile evaluation framework covering three critical patient stages. Our benchmark assesses LLM performance across three key stages of patient care: examination recommendation, diagnostic decision-making, and treatment planning. This framework replicates a typical clinical