Data availability
The curated data science tasks with the reference answers and testing cases in BioDSBench can be accessed at https://huggingface.co/datasets/zifeng-ai/BioDSBench. The anonymized patient data where these data analyses are performed are available via the cBioPortal website at https://www.cbioportal.org/datasets and the UCSC Xena website at https://xenabrowser.net/datapages/. Source data are provided with this paper.
Code availability
Code for implementing and experimenting with the proposed methodology is available via GitHub at https://github.com/RyanWangZf/BioDSBench. The human–AI collaborative biomedical data science programming platform is available as a web-based app22 on request at https://keiji.ai/contact.html. A demonstration video can be accessed at https://www.youtube.com/watch?v=c5ZJsFXQ_B0.
References
Radenkovic, D., Keogh, S. B. & Maruthappu, M. Data science in modern evidence-based medicine. J. R. Soc. Med. 112, 493–494 (2019).
Ellis, L. D. To meet future needs, health care leaders must look at the data (science). Harvard T.H. Chan School of Public Health https://www.hsph.harvard.edu/ecpe/to-meet-future-needs-health-care-leaders-must-look-at-the-data-science/ (accessed 16 September 2024).
Meyer, M. A. Healthcare data scientist qualifications, skills, and job focus: a content analysis of job postings. J. Am. Med. Inform. Assoc. 26, 383–391 (2019).
Chen, M. et al. Evaluating large language models trained on code. Preprint at https://arxiv.org/abs/2107.03374 (2021).
Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).
Luo, Z. et al. WizardCoder: empowering code large language models with Evol-Instruct. In The Twelfth International Conference on Learning Representations 1–21 (OpenReview, 2023).
Lozhkov, A. et al. StarCoder 2 and the Stack v2: the next generation. Preprint at https://arxiv.org/abs/2402.19173 (2024).
Zhang, F. et al. RepoCoder: repository-level code completion through iterative retrieval and generation. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 2471–2484 (Association for Computational Linguistics, 2023).
Parvez, M. R., Ahmad, W., Chakraborty, S., Ray, B. & Chang, K.-W. Retrieval augmented code generation and summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021 2719–2734 (Association for Computational Linguistics, 2021).
Wang, Z. Z. et al. CodeRAG-Bench: can retrieval augment code generation? In Findings of the Association for Computational Linguistics: NAACL 2025 3199–3214 (Association for Computational Linguistics, 2025).
Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations 1–80 (OpenReview, 2024).
Austin, J. et al. Program synthesis with large language models. Preprint at https://arxiv.org/abs/2108.07732 (2021).
Hendrycks, D. et al. Measuring coding challenge competence with APPS. In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track 1–11 (OpenReview, 2021).
Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Adv. Neural Inf. Process. Syst. 36, 21558–21575 (2023).
Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations 1–51 (OpenReview, 2024).
Huang, J. et al. Execution-based evaluation for data science code generation models. In Proc. Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances) 28–36 (Association for Computational Linguistics, 2022).
Lai, Y. et al. DS-1000: a natural and reliable benchmark for data science code generation. In International Conference on Machine Learning 18319–18345 (PMLR, 2023).
Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).
Tang, X. et al. BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics 40, i266–i276 (2024).
Majumder, B. P. et al. DiscoveryBench: towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations 1–34 (OpenReview, 2025).
Wang, Z., Danek, B. & Sun, J. BioDSA-1K: benchmarking data science agents for biomedical research. Preprint at https://arxiv.org/abs/2505.16100 (2025).
TrialMind Data Science Assistant. Keiji AI https://www.trialmindapis.com/api/data-science (2025).
cBioPortal for cancer genomics. cBioPortal https://www.cbioportal.org/datasets (accessed 17 September 2024).
Hello GPT-4o. OpenAI https://openai.com/index/hello-gpt-4o/ (accessed 17 September 2024).
GPT-4o mini: advancing cost-efficient intelligence. OpenAI https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed 17 September 2024).
Claude 3.5 Sonnet. Anthropic https://www.anthropic.com/news/claude-3-5-sonnet (accessed 17 September 2024).
Introducing the next generation of Claude. Anthropic https://www.anthropic.com/news/claude-3-family (accessed 17 September 2024).
Reid, M. et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at https://arxiv.org/abs/2403.05530 (2024).
OpenAI o3-mini: pushing the frontier of cost-effective reasoning. OpenAI https://openai.com/index/openai-o3-mini/ (accessed 6 June 2025).
Grattafiori, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning capability in LLMs via reinforcement learning. Nature 645, 633–638 (2025).
Rozière, B. et al. Code Llama: open foundation models for code. Preprint at https://arxiv.org/abs/2308.12950 (2024).
Hui, B. et al. Qwen2.5-Coder technical report. Preprint at https://arxiv.org/abs/2409.12186 (2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
Brown, T. B. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems 1877–1901 (Curran Associates, 2020).
Khattab, O. et al. DSPy: compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations 1–31 (OpenReview, 2024).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations 1–33 (OpenReview, 2023).
Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).
Welch, J. S. et al. TP53 and decitabine in acute myeloid leukemia and myelodysplastic syndromes. N. Engl. J. Med. 375, 2023–2036 (2016).
Mostavi, M., Chiu, Y.-C., Huang, Y. & Chen, Y. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med. Genomics 13, 44 (2020).
Yen, P.-Y., Wantland, D. & Bakken, S. Development of a customizable health IT usability evaluation scale. In AMIA Annual Symposium Proceedings Vol. 2010, 917 (American Medical Informatics Association, 2010).
Wang, Z. et al. Accelerating clinical evidence synthesis with large language models. npj Digit. Med. 8, 509–523 (2025).
Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: a foundation model for clinical trial search, summarization, design, and recruitment. Preprint at https://arxiv.org/abs/2407.11007 (2024).
Jin, Q. et al. Matching patients to clinical trials with large language models. Nat. Commun. 15, 9074 (2023).
Wang, X. et al. OpenHands: an open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations 1–8 (OpenReview, 2025).
Majumder, B. P. et al. Position: data-driven discovery with large generative models. In Proc. 41st International Conference on Machine Learning 34350–34382 (JMLR, 2024).
Grossman, R. L. et al. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 375, 1109–1112 (2016).
Jupyter. Jupyter https://jupyter.org/ (accessed 23 September 2024).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Nie, F., Chen, M., Zhang, Z. & Cheng, X. Improving few-shot performance of language models via nearest neighbor calibration. Preprint at https://arxiv.org/abs/2212.02216 (2022).
New embedding models and API updates. OpenAI https://openai.com/index/new-embedding-models-and-api-updates/ (accessed 23 September 2024).
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing 4222–4235 (Association for Computational Linguistics, 2020).
Vertex AI search. Google https://cloud.google.com/enterprise-search?hl=en (accessed 23 September 2024).
Madaan, A. et al. Self-refine: iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 36, 46534–46594 (2023).
Acknowledgements
Z.C. was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Scientific Research Number JP24K20778. J.S. was partially supported by National Science Foundation (NSF) awards SCH-2205289, SCH-2014438 and IIS-2034479.
Author information
Author notes
These authors contributed equally: Zifeng Wang, Benjamin Danek.
Authors and Affiliations
Keiji AI, Seattle, WA, USA
Zifeng Wang, Benjamin Danek & Jimeng Sun
School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
Zifeng Wang, Benjamin Danek & Jimeng Sun
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
Ziwei Yang
Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan
Zheng Chen
Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Urbana, IL, USA
Jimeng Sun
Authors
- Zifeng Wang
- Benjamin Danek
- Ziwei Yang
- Zheng Chen
- Jimeng Sun
Contributions
Z.W. and J.S. conceived of and led the overall project. Z.W. and B.D. carried out the experiments and implementations. Z.Y. and Z.C. contributed to the experimental design, conceptualization and dataset construction. Z.W. drafted the paper. J.S. supervised the project and provided a critical review of the paper.
Corresponding author
Correspondence to Jimeng Sun.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Chao Yan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Z., Danek, B., Yang, Z. et al. Making large language models reliable data science programming copilots for biomedical research. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-025-01587-2
Received: 11 October 2024
Accepted: 16 November 2025
Published: 22 January 2026
Version of record: 22 January 2026
DOI: https://doi.org/10.1038/s41551-025-01587-2