Background & Summary
Brain metastases are a significant clinical challenge in the treatment of cancer, with 15–25%1 being due to breast cancer2. Up to 40% of all patients suffering from metastatic tumors develop brain metastases3. Moreover, approximately 14% of patients with metastatic breast cancer experience brain metastases4. Early detection and treatment of brain metastases from breast cancer directly influence outcome5,6,7. Methods that accurately and rapidly diagnose these lesions from radiographic features are therefore of great interest for shortening the time to diagnosis and treatment.
Breast cancer metastases to the brain present a highly heterogeneous disease pathology8, as evidenced by their variable radiographic phenotypes in magnetic resonance imaging (MRI) scans9. This heterogeneity manifests as unique intensity profiles across the tumor mass, reflecting differences in underlying tumor biology, microenvironment and eventual therapeutic response10. Recent advancements in clinical practice, including the expanded use of stereotactic radiosurgery (SRS), show promise in improving patient outcomes11. SRS is increasingly being applied not only as a standalone treatment but also as an adjunct to surgical resection12,13, signifying an important development in the management of metastatic disease. Simultaneously, the field of radiomics, fueled by breakthroughs in machine learning tools, is evolving rapidly9,14,15,16. Radiomics augments traditional clinical diagnostics by extracting an array of features from radiographic images, leading to advanced image-based tumor phenotyping and offering an innovative perspective on disease understanding.
Despite the growing interest in radiomics, most available datasets do not focus on metastatic lesions. Compared to datasets that encompass a broader spectrum of brain malignancies17,18,19, ours offers a targeted and in-depth exploration of this site-specific subset, allowing for more nuanced analyses. To our knowledge, this dataset is the first dedicated specifically to metastatic breast cancer to the brain and is the largest organized collection of metastatic lesions to the brain. Furthermore, the availability of a multitude of tumor-derived radiomic features in our dataset, computed from clinician-reviewed segmentations, provides a richness in the data that is not found in other collections. Merging radiomic features with genetic data makes this dataset well suited for supervised learning tasks such as non-invasive genotyping, or for unsupervised methods such as principal component analysis at the lesion level. As we provide image segmentations at the level of each lesion, we encourage re-analyses and even derivations of new radiomic features. Accurate tumor segmentation is a critical step in generating such high-quality radiomic data. Our dataset is unique in its inclusion of clinician-derived tumor segmentations by board-certified neurosurgeons and radiation oncologists. These detailed segmentations serve as the basis for the radiomic features, which include shape and texture characteristics20. In alignment with the FAIR (Findable, Accessible, Interoperable, Re-usable) principles, our dataset is not only open-source but also designed for easy access, interoperability, and reusability21. Table 1 lists attributes of collections in The Cancer Imaging Archive (TCIA) which include brain metastases.
Our hope is that this dataset will facilitate deeper biological understanding of breast cancer and its metastasis to the brain, promote advancements in image-based tumor phenotyping and genotyping, and ultimately contribute to improving patient care and the realization of precision medicine in this disease domain.
Methods
Patient selection and imaging data acquisition
This study utilized a total of 297 3D T1-weighted post-contrast images from 165 unique patients with metastatic breast cancer. Histologic features were obtained in one of several ways: lymph node biopsy, primary breast sampling, or biopsy of a metastatic site (either intracranial or elsewhere). All patients were treated at the University of Minnesota Medical Center (UMMC). The images were acquired as high-resolution, 3D T1 post-contrast MRI scans that were part of routine clinical care. The dataset presented here contains a mixture of segmentations encompassing the tumor core, postoperative tumor bed, and key neighboring at-risk structures. Table 2 gives a summary of clinical demographics and molecular status. Table 3 gives the same data at the lesion level. All patient identifiers were removed from the dataset to uphold patient confidentiality. Each patient is assigned a unique identifier (BCBM-RadioGenomics-X), with subsequent follow-up imaging denoted as BCBM-RadioGenomics-X-#. For example, for patient BCBM-RadioGenomics-161, the initial scan is stored in folder ‘BCBM-RadioGenomics-161-0’ and the first follow-up in folder ‘BCBM-RadioGenomics-161-1’. Figure 1 shows a directory tree with the folder organization and contents.
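The folder naming scheme described above can be parsed programmatically when iterating over the collection. The following is a minimal sketch, assuming that the patient identifier X and the scan index # are both plain integers, as in the BCBM-RadioGenomics-161 example; the function and class names are illustrative and not part of the dataset.

```python
import re
from typing import NamedTuple

# Pattern follows the naming scheme described in the text:
# BCBM-RadioGenomics-<patient>-<scan>, where scan 0 is the initial study.
# Assumption: both fields are plain integers.
FOLDER_PATTERN = re.compile(r"^BCBM-RadioGenomics-(\d+)-(\d+)$")


class ScanFolder(NamedTuple):
    patient_id: int  # the X in BCBM-RadioGenomics-X
    scan_index: int  # 0 = initial scan, 1 = first follow-up, ...


def parse_scan_folder(name: str) -> ScanFolder:
    """Split a scan folder name into patient number and scan index."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return ScanFolder(int(match.group(1)), int(match.group(2)))


def group_by_patient(folder_names):
    """Map each patient number to its scan indices in ascending order."""
    grouped = {}
    for name in folder_names:
        scan = parse_scan_folder(name)
        grouped.setdefault(scan.patient_id, []).append(scan.scan_index)
    return {pid: sorted(indices) for pid, indices in grouped.items()}
```

Grouping scans by patient in this way is useful for any analysis that needs to keep a patient's initial and follow-up studies together.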
This study was approved as a retrospective analysis by the University of Minnesota Institutional Review Board (STUDY00007985). Given the retrospective nature of the study and the use of de-identified patient data, a requirement for true informed consent was waived by the IRB. This waiver was granted on the basis that the research poses minimal risk to the individuals whose data were analyzed, and the data were used exclusively for research purposes, ensuring confidentiality and compliance with applicable privacy laws. All methods were carried out in accordance with relevant guidelines and regulations.
Image preprocessing
Prior to any analysis, the images underwent preprocessing steps to ensure their quality and comparability. The first stage was brain extraction using HD-BET, a pre-trained neural network developed as part of a large, multi-institutional collaboration among European institutions and validated on several external datasets22. Subsequently, N4 bias correction was applied to all skull-stripped images23. This correction is a critical preprocessing step in neuroimaging analysis that corrects the intensity non-uniformity artifact prevalent in MRI data, thereby improving the quality of the images and the reliability of subsequent analyses. This task was accomplished using the Python module ANTsPy (Advanced Normalization Tools), a wrapper for the C++ ANTs library24. Figure 2 shows our workflow in a step-by-step fashion.
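The N4 step can be sketched with ANTsPy as follows. This is an illustration under stated assumptions rather than the authors' exact script: it assumes the skull-stripped image and a binary brain mask (for example, one written by HD-BET) already exist on disk, it uses the default N4 parameters, and the `_n4` output suffix and function names are hypothetical.

```python
from pathlib import Path


def n4_output_path(input_path: str, suffix: str = "_n4") -> str:
    """Build an output name such as scan_n4.nii.gz next to the input file."""
    path = Path(input_path)
    name = path.name
    for ext in (".nii.gz", ".nii"):
        if name.endswith(ext):
            return str(path.with_name(name[: -len(ext)] + suffix + ext))
    raise ValueError(f"not a NIfTI file: {input_path}")


def n4_correct(brain_path: str, mask_path: str) -> str:
    """Apply N4 bias field correction to an already skull-stripped image."""
    import ants  # ANTsPy; imported lazily so the path helper works without it

    image = ants.image_read(brain_path)
    mask = ants.image_read(mask_path)  # assumed: binary brain mask
    # Default N4 parameters; the authors' settings are not stated in the text.
    corrected = ants.n4_bias_field_correction(image, mask=mask)
    out_path = n4_output_path(brain_path)
    ants.image_write(corrected, out_path)
    return out_path
```

Restricting the correction to the brain mask keeps the zeroed background of the skull-stripped image from influencing the bias field estimate.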
Fig. 1
The file structure of the collection with sample patients included. Brain extracted and N4-normalized images are included in each patient folder along with respective segmentations which map directly onto the patient image.
Fig. 2
A sample patient with a dural based metastasis used as a graphical representation of the workflow from original image to brain extraction, N4 inhomogeneity correction, segmentation and finally the extraction of radiomic features. Radiomic information is then combined with clinical demographic data (age, gender) and molecular status information (ER/PR/HER2).
Tumor segmentation
The segmented images included in this dataset were produced via a semi-automated, consensus segmentation method. A board-certified neurosurgeon and a radiation oncologist utilized segmentation software within the SRS system for slice-by-slice delineation of the metastatic brain tumors and neighboring structures at risk.
Radiomic feature extraction
Radiomic features were derived from the segmented regions using the PyRadiomics library, a flexible Python package widely used for the extraction of radiomic features from medical imaging data25. This library was used to extract a wide array of features that capture the shape and texture characteristics of structures of interest and their filtered analogs (Wavelet and Laplacian of Gaussian). Shape features describe the two- and three-dimensional form of the segmented areas. Texture features capture the patterns and distribution of pixel intensities within the region, indicative of intratumoral heterogeneity26. These radiomic features provide a comprehensive representation of the tumors, facilitating in-depth analysis and machine learning model development. A complete list of radiomic features and the default run parameters is given in Table 4.
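A feature extraction of this kind can be sketched with the PyRadiomics `RadiomicsFeatureExtractor` class. The sketch below is an assumption-laden illustration: the Laplacian of Gaussian sigma values and the mask label are placeholders, not the run parameters from Table 4, and the helper names are hypothetical.

```python
def build_extractor(log_sigmas=(1.0, 2.0, 3.0)):
    """Configure an extractor for original, Wavelet and LoG-filtered images."""
    from radiomics import featureextractor  # PyRadiomics, imported lazily

    extractor = featureextractor.RadiomicsFeatureExtractor()
    extractor.enableImageTypeByName("Wavelet")
    # Sigma values are placeholders chosen for illustration only.
    extractor.enableImageTypeByName("LoG", customArgs={"sigma": list(log_sigmas)})
    extractor.enableAllFeatures()
    return extractor


def drop_diagnostics(result):
    """Keep feature values only; metadata keys start with 'diagnostics_'."""
    return {k: v for k, v in result.items() if not k.startswith("diagnostics_")}


def extract_lesion_features(image_path, mask_path, label=1):
    """Extract radiomic features for one labeled region of a segmentation."""
    extractor = build_extractor()
    # label=1 is an assumption about how the lesion is encoded in the mask.
    result = extractor.execute(image_path, mask_path, label=label)
    return drop_diagnostics(result)
```

Feature names in the returned dictionary carry the image type as a prefix (for example `original_shape_...`), which makes it straightforward to separate features of the unfiltered image from their filtered analogs.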
Data Records
All data records are stored in the TCIA27. Pre-processed images and segmentations are stored in the Neuroimaging Informatics Technology Initiative (NIfTI) format. Clinical, molecular, scan parameter, and radiomic feature data are also freely available in XLSX format. A complete list of scan parameters for each scan and a table of radiomic features for all lesions can be found as Supplementary File 1 and Supplementary File 2, respectively.
Technical Validation
Segmentations were produced through a semi-automated, consensus approach involving a board-certified neurosurgeon and a radiation oncologist during SRS planning. These segmentations enable direct analysis of tumor regions, radiotherapy margins, and surrounding structures in order to extract radiomic information.
With the available data, we developed radiomics-based classifiers for the prediction of Estrogen Receptor (ER), Progesterone Receptor (PR), and Human Epidermal Growth Factor Receptor 2 (HER2) status using an 80:20 training-test split. In model training and testing, lesions were treated as independent samples and each lesion was labeled with its patient’s ER/PR/HER2 status. A binary classifier was trained for each receptor using the lesions with available genetic status. Multiple models were trained, including a support vector machine classifier (SVC), Gaussian Naive Bayes (GNB), multi-layer perceptron (MLP), AdaBoost classifier, quadratic discriminant analysis, and random forest (RF). Evaluation was done using five-fold cross-validation. Each model was evaluated based on accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC). Receiver operating characteristic (ROC) curves were generated for the most successful model.
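The evaluation described above can be approximated with scikit-learn. The sketch below is one possible reading, not the authors' code: it assumes the 20% test split is held out first and that five-fold cross-validation is run on the remaining 80%, it uses default random forest hyperparameters, and stratified splitting is an added assumption. The helper `attach_patient_labels` is a hypothetical name for the step that gives each lesion its patient's receptor status.

```python
def attach_patient_labels(lesions, patient_status):
    """Give every lesion the receptor status of the patient it belongs to.

    lesions: list of (patient_id, feature_vector) pairs
    patient_status: dict mapping patient_id to a 0/1 status; lesions from
    patients without a known status are skipped.
    """
    X, y = [], []
    for patient_id, features in lesions:
        if patient_id in patient_status:
            X.append(list(features))
            y.append(patient_status[patient_id])
    return X, y


def evaluate_random_forest(X, y, seed=0):
    """Hold out 20% of lesions, then run five-fold CV on the remaining 80%."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_score, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(random_state=seed)
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return cv_auc.mean(), test_auc
```

Because lesions are treated as independent samples, lesions from one patient can fall on both sides of a split. Users who want patient-level separation could substitute a group-aware splitter such as scikit-learn's `GroupKFold`, passing the patient identifier as the group.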
In HER2 mutation prediction, a random forest classifier showed the best overall performance (F1: 0.87, Recall: 93.2%, Accuracy: 77.6%, AUC: 0.77). Figure 3A–C demonstrates the robust performance of the random forest classifier and the relative feature importances. In the prediction of ER mutation, similarly, a random forest classifier performed the best (F1: 0.84, Accuracy: 75.9%, Precision: 82%, Recall: 86.2%, AUC: 0.81). Figure 4A–C shows complete model performance and ordered bar plot of relative feature importances. For PR mutation prediction, a random forest classifier again showed the best performance overall (F1: 0.40, Accuracy: 78.5%, Precision: 60.3%, Recall: 30.0%, AUC: 0.75). Figure 5A–C shows model performance comparison and plot of relative radiomic feature importances.
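The reported F1 scores can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The short check below uses only the values quoted above for the ER and PR classifiers; precision is not reported for HER2, so that case is left out.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the text for the random forest classifiers.
er_f1 = f1_score(0.82, 0.862)   # ER: precision 82%, recall 86.2%
pr_f1 = f1_score(0.603, 0.300)  # PR: precision 60.3%, recall 30.0%
print(round(er_f1, 2), round(pr_f1, 2))  # 0.84 0.4
```

Both values agree with the reported F1 scores of 0.84 and 0.40. The PR result also shows why accuracy alone can mislead: an accuracy of 78.5% coexists with a recall of only 30.0%.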
Fig. 3
(A) A comparison of multiple machine learning model ROC curves for HER2 mutation prediction demonstrating the superiority of the random forest classifier. (B) ROC curve performance for each cross validation iteration (listed as CV Fold #). (C) An ordered list of relative feature importances for all radiomic features.
Fig. 4
(A) A comparison of multiple machine learning model ROC curves for ER mutation prediction demonstrating the superiority of the random forest classifier. (B) ROC curve performance for each cross validation iteration (listed as CV Fold #). (C) An ordered list of relative feature importances for all radiomic features.
Fig. 5
(A) A comparison of multiple machine learning model ROC curves for PR mutation prediction demonstrating the superiority of the random forest classifier. (B) ROC curve performance for each cross validation iteration (listed as CV Fold #). (C) An ordered list of relative feature importances for all radiomic features.
We also explored an unsupervised approach to investigate the relationship between radiomic features and genetic status. We performed principal component analysis (PCA) on the full radiomic features at the lesion level, and colored each lesion based on its ER/PR/HER2 status, displayed in Fig. 6A. We also performed a separate PCA using ER positive vs negative samples, PR positive vs negative samples, and HER2 positive vs negative samples individually shown in Fig. 6B.
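A lesion-level PCA of this kind can be sketched with scikit-learn. The text does not state whether features were standardized before the decomposition, so the z-scoring step below is an assumption, and both function names are illustrative.

```python
def status_label(er, pr, her2):
    """Build a plot label such as 'ER+/PR-/HER2+' from three boolean statuses."""

    def sign(flag):
        return "+" if flag else "-"

    return f"ER{sign(er)}/PR{sign(pr)}/HER2{sign(her2)}"


def first_two_components(features):
    """Standardize lesion-level features and project them onto two PCs."""
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Assumption: z-score each feature so that large-valued features do not
    # dominate the principal components.
    scaled = StandardScaler().fit_transform(features)
    return PCA(n_components=2).fit_transform(scaled)
```

The two returned columns can be drawn as a scatter plot, with `status_label` supplying the category used to color each lesion.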
Fig. 6
(A) Principal component analysis (PCA) demonstrating the first two components for all lesions with ER/PR/HER2 status known using radiomic features. (B) Individual PCA plots for ER/PR/HER2 using radiomic features.
Usage Notes
The complete imaging data are organized into one folder per subject, each containing the original image and segmentation files. A separate XLSX file contains radiomic feature data for each lesion and subject. A third file contains the complete clinical demographics and genetic status of ER/PR/HER2. Image segmentations, radiomic data, and clinical demographic data are hosted at TCIA and are freely available for download under a CC BY 4.0 License. Images and segmentations can be viewed using any available software that can view NIfTI files (Slicer28, ITK-SNAP29).
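Beyond interactive viewers, the NIfTI files can be read in Python with nibabel. The sketch below loads an image with its segmentation and computes the segmented volume. It assumes that every non-zero voxel in the mask belongs to the structure of interest, which may not hold for files that combine several labels, and the function names are hypothetical.

```python
def lesion_volume_mm3(voxel_count, voxel_size_mm):
    """Volume of a segmented region: voxel count times the volume of one voxel."""
    dx, dy, dz = voxel_size_mm
    return voxel_count * dx * dy * dz


def load_scan(image_path, mask_path):
    """Load an image and its segmentation and report the segmented volume."""
    import nibabel as nib  # imported lazily; requires nibabel and numpy

    image = nib.load(image_path)
    mask = nib.load(mask_path)
    if image.shape != mask.shape:
        raise ValueError("image and segmentation grids differ")
    mask_data = mask.get_fdata()
    # Assumption: any non-zero voxel is part of the segmented structure.
    voxels = int((mask_data > 0).sum())
    volume = lesion_volume_mm3(voxels, mask.header.get_zooms()[:3])
    return image.get_fdata(), mask_data, volume
```

The shape check is a cheap guard that the segmentation maps onto the image grid before any voxel-wise analysis is attempted.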
Code availability
The code used for image preprocessing, segmentation, and radiomic feature extraction in this article can be found at https://github.com/birra-taha/BreastCancer.
References
1. Kobyakova, E., Nechipay, E., Sashin, D., Kobiakov, N. & Kobyakov, G. L. P14.105 High incidence of brain metastases in lung cancer patients at the time of primary diagnosis. Neuro Oncol 21, iii92–iii93 (2019).
2. Lamba, N., Wen, P. Y. & Aizer, A. A. Epidemiology of brain metastases and leptomeningeal disease. Neuro Oncol 23, 1447–1456 (2021).
3. Scoccianti, S. & Ricardi, U. Treatment of brain metastases: review of phase III randomized controlled trials. Radiother Oncol 102, 168–179 (2012).
4. Sperduto, P. W. et al. Effect of tumor subtype on survival and the graded prognostic assessment for patients with breast cancer and brain metastases. Int J Radiat Oncol Biol Phys 82, 2111–2117 (2012).
5. Niikura, N. et al. Treatment outcomes and prognostic factors for patients with brain metastases from breast cancer of each subtype: a multicenter retrospective analysis. Breast Cancer Research and Treatment 147, 103–112 (2014).
6. Lin, N. U. et al. Challenges relating to solid tumour brain metastases in clinical trials, part 1: patient population, response, and progression. A report from the RANO group. Lancet Oncol 14, e396–406 (2013).
7. Tong, E., McCullagh, K. L. & Iv, M. Advanced Imaging of Brain Metastases: From Augmenting Visualization and Improving Diagnosis to Evaluating Treatment Response. Front Neurol 11, 270 (2020).
8. Guo, L. et al. Breast cancer heterogeneity and its implication in personalized precision therapy. Exp Hematol Oncol 12, 3 (2023).
9. Luo, X. et al. Radiomic Signatures for Predicting Receptor Status in Breast Cancer Brain Metastases. Front Oncol 12, 878388 (2022).
10. Liang, Y., Zhang, H., Song, X. & Yang, Q. Metastatic heterogeneity of breast cancer: Molecular mechanism and potential therapeutic targets. Semin Cancer Biol 60, 14–27 (2020).
11. Tanguturi, S. & Warren, L. E. G. The Current and Evolving Role of Radiation Therapy for Central Nervous System Metastases from Breast Cancer. Curr Oncol Rep 21, 50 (2019).
12. Soltys, S. G. et al. Stereotactic radiosurgery of the postoperative resection cavity for brain metastases. Int J Radiat Oncol Biol Phys 70, 187–193 (2008).
13. Bander, E. D. et al. Durable 5-year local control for resected brain metastases with early adjuvant SRS: the effect of timing on intended-field control. Neurooncol Pract 8, 278–289 (2021).
14. Taha, B., Boley, D., Sun, J. & Chen, C. Potential and limitations of radiomics in neuro-oncology. J Clin Neurosci 90, 206–211 (2021).
15. Rahimi, M. & Rahimi, P. A Short Review on the Impact of Artificial Intelligence in Diagnosis Diseases: Role of Radiomics In Neuro-Oncology. Galen Med J 12, e3158 (2023).
16. Nowakowski, A. et al. Radiomics as an emerging tool in the management of brain metastases. Neurooncol Adv 4, vdac141 (2022).
17. Porter, E. et al. Gamma Knife MR/CT/RTSTRUCT Sets With Hippocampal Contours (GammaKnife-Hippocampal). The Cancer Imaging Archive https://doi.org/10.7937/Q967-X166 (2022).
18. Wang, Y. et al. Brain tumor recurrence prediction after Gamma Knife radiotherapy from MRI and related DICOM-RT: An open annotated dataset and baseline algorithm (brain-TR-GammaKnife). The Cancer Imaging Archive https://doi.org/10.7937/XB6D-PY67 (2023).
19. Ramakrishnan, D. et al. A large open access dataset of brain metastasis 3D segmentations on MRI with clinical and imaging information. Scientific Data 11, 1–6 (2024).
20. Liu, Z. et al. The Applications of Radiomics in Precision Diagnosis and Treatment of Oncology: Opportunities and Challenges. Theranostics 9, 1303–1322 (2019).
21. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
22. Isensee, F. et al. Automated brain extraction of multisequence MRI using artificial neural networks. Human Brain Mapping 40, 4952–4964 (2019).
23. Tustison, N. J. et al. N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29, 1310–1320 (2010).
24. Avants, B. B. et al. The Insight ToolKit image registration framework. Front Neuroinform 8, 44 (2014).
25. van Griethuysen, J. J. M. et al. Computational Radiomics System to Decode the Radiographic Phenotype. Cancer Res 77, e104–e107 (2017).
26. Marusyk, A., Almendro, V. & Polyak, K. Intra-tumour heterogeneity: a looking glass for cancer? Nature Reviews Cancer 12, 323–334 (2012).
27. Taha, B. et al. MRI dataset of metastatic breast cancer to the brain with expert-reviewed segmentations and tumor-derived radiomic features (BCBM-RadioGenomics). The Cancer Imaging Archive https://doi.org/10.7937/RRSE-W278 (2025).
28. 3D Slicer image computing platform. 3D Slicer https://slicer.org/.
29. Yushkevich, P. A. et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31, 1116–1128 (2006).
Acknowledgements
The authors acknowledge the support of the University of Minnesota Medical Center for data access and MRI acquisition. The authors thank the Gamma Knife Radiosurgery team for their assistance in tumor segmentation.
Author information
Authors and Affiliations
Department of Neurosurgery, University of Minnesota, Minneapolis, Minnesota, USA
Birra R. Taha, Matthew Hunt, Michael C. Park, David Darrow & Andrew S. Venteicher
Department of Radiation Oncology, Stanford University, Stanford, California, USA
David J. Wu
Department of Radiology, University of Iowa, Iowa City, Iowa, USA
Luke T. Sabal
Department of Radiology, University of Minnesota, Minneapolis, Minnesota, USA
Megan Kollitz
Department of Radiation Oncology, University of Minnesota, Minneapolis, Minnesota, USA
Lindsey Sloan, B. Aika Shoo, Jianling Yuan & Yoichi Watanabe
Authors
- Birra R. Taha
- David J. Wu
- Luke T. Sabal
- Megan Kollitz
- Lindsey Sloan
- B. Aika Shoo
- Jianling Yuan
- Matthew Hunt
- Michael C. Park
- David Darrow
- Andrew S. Venteicher
- Yoichi Watanabe
Contributions
B.R.T. conceptualized the study and contributed to data collection, segmentation verification, literature review, and manuscript writing. D.W. was responsible for data acquisition and manuscript editing. L.T.S. performed data acquisition, literature review, and manuscript writing and editing. M.K. conducted data acquisition and manuscript editing. A.S.V. contributed to methodology development and provided critical manuscript review. Y.W. supervised the project, provided access to data, and reviewed the manuscript. J.Y., M.H., D.P.D., B.A.S., L.S., and M.C.P. were responsible for data acquisition and provided critical manuscript review. All authors contributed to data interpretation and approved the final version of the manuscript.
Corresponding author
Correspondence to Birra R. Taha.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Taha, B.R., Wu, D.J., Sabal, L.T. et al. An Integrated Dataset of Metastatic Breast Cancer to the Brain with Imaging, Radiomics, and Tumor Genetics. Sci Data 12, 1851 (2025). https://doi.org/10.1038/s41597-025-06066-6
Received: 09 February 2025
Accepted: 30 September 2025
Published: 20 November 2025
Version of record: 20 November 2025
DOI: https://doi.org/10.1038/s41597-025-06066-6