Main
Sleep is a complex process characterized by intricate interactions across physiological systems, including brain, heart, respiratory and muscle activity1. PSG—the gold standard for sleep evaluation—captures these interactions through recordings of several modalities, including brain activity signals (BAS, including electroencephalogram (EEG) and electrooculogram (EOG)), electrocardiography (ECG), electromyography (EMG) and respiratory signals2.
Sleep disorders affect millions of people and are increasingly recognized as indicators of, and contributors to, various health conditions3. Sleep disturbances often precede the clinical onset of numerous conditions, such as psychiatric disorders4, neurodegenerative diseases5 and cardiovascular disorders6. These associations highlight the important role sleep plays in maintaining overall health and underscore its predictive potential across a wide spectrum of diseases. However, most existing studies have focused on identifying links between sleep and specific diseases using isolated metrics or manual annotations, leaving much of the complexity of sleep physiology, as captured in PSG, underutilized.
Recent advances in deep learning have enabled the use of PSG’s multimodal data for tasks ranging from sleep staging and apnea detection to predicting conditions such as atrial fibrillation, biological aging and narcolepsy3,7,8,9,10. Despite this progress, current approaches face key limitations: they focus on individual outcomes, depend on supervised learning with expert-labeled data and are trained on relatively small datasets (2,500–15,913 recordings)3,7,9,10,11. Manual annotations are time consuming and prone to inter-rater variability, making scaling difficult. Moreover, existing models lack flexibility across recording environments, generalize poorly across cohorts and often fail to exploit the richness of multimodal sleep signals. There remains a need for robust, generalizable architectures and systematic evaluation of sleep’s predictive value across a broad range of health conditions.
Foundation models have emerged as a transformative approach in machine learning, enabling robust representation learning from large-scale, unlabeled data12. By leveraging self-supervised learning, these models can be fine-tuned efficiently for diverse applications. In biomedicine, foundation models have demonstrated remarkable capabilities in analyzing complex, heterogeneous datasets, driving advances in disease prediction, patient stratification and therapeutic discovery13,14. Their ability to extract meaningful patterns from large-scale data has addressed many challenges associated with the diverse and high-dimensional nature of clinical datasets.
Despite these successes, their application to sleep remains limited. Sleep data, particularly from PSG, presents unique challenges due to its complexity and variability, including differences in the number and types of recording channels across clinical cohorts. Most sleep studies have focused narrowly on sleep-specific outcomes, constraining the broader potential of foundation models for disease prediction. In preliminary work, we explored self-supervised learning on PSG data in a smaller cohort of participants11. Although this effort highlighted the potential of foundation models for analyzing sleep data, it targeted primarily sleep-specific outcomes and lacked the flexibility to accommodate the diverse configurations of PSG recordings. These limitations emphasize the need for models that can generalize across heterogeneous datasets and systematically uncover the role of sleep in predicting a wider range of diseases.
In this paper we present SleepFM, a foundation model trained on over 585,000 h of PSG data from 65,000+ participants. SleepFM captures the diverse information present in multimodal sleep recordings—integrating EEG, ECG, EMG and respiratory signals. Its channel-agnostic architecture enables joint learning across several modalities, producing representations that generalize across environments. We also introduce a new leave-one-out (LOO) contrastive learning (CL) algorithm, LOO-CL, that aligns information across modalities during pretraining while remaining resilient to missing or heterogeneous channels during inference. Our model uses 5–25 times more data than previously trained supervised sleep3,7,9,10 or biosignal models15,16.
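The core idea of LOO-CL can be illustrated with a small numerical sketch: each modality's embedding is contrasted against the aggregate of the remaining modalities for the same recording, with other recordings in the batch serving as negatives. The sketch below is a plausible InfoNCE-style instantiation in numpy, not the paper's exact loss; the mean aggregation, temperature value and normalization choices are assumptions for illustration.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp for the softmax denominator.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def loo_contrastive_loss(emb, temperature=0.1):
    """Leave-one-out contrastive loss (illustrative sketch).

    emb: (M, B, D) array of M modality embeddings for a batch of B
    recordings, assumed L2-normalized along D. For each modality m the
    anchor is emb[m]; its positive is the mean of the other modalities'
    embeddings for the same recording (the "leave-one-out" target), and
    the other recordings in the batch act as negatives.
    """
    M, B, D = emb.shape
    total = 0.0
    for m in range(M):
        anchor = emb[m]                                  # (B, D)
        target = np.delete(emb, m, axis=0).mean(axis=0)  # (B, D) LOO target
        target /= np.linalg.norm(target, axis=1, keepdims=True)
        logits = anchor @ target.T / temperature         # (B, B) similarities
        # InfoNCE: the matching recording (diagonal) is the positive pair.
        log_probs = logits - _logsumexp(logits, axis=1)
        total += -np.mean(np.diag(log_probs))
    return total / M
```

When the modalities of each recording are perfectly aligned and distinct recordings are dissimilar, the diagonal dominates and the loss approaches zero, which is the behavior the pretraining objective rewards.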
Inspired by phenome-wide association studies (PheWAS)17, we examined whether sleep characteristics, as captured by SleepFM, can predict the onset of a wide range of diseases. Leveraging electronic health record (EHR) disease codes, we develop a framework to systematically explore predictive associations between multimodal sleep and diverse health conditions.
Dataset and SleepFM architecture
We describe our dataset and training procedures in detail in Methods. Briefly, we used PSG data from four primary cohorts: Stanford Sleep Clinic (SSC)11, BioSerenity18,19, the Multi-Ethnic Study of Atherosclerosis (MESA)20,21 and the Outcomes of Sleep Disorders in Older Men (MrOS)20,22. SSC includes 35,052 studies from participants aged 1–100 years; BioSerenity adds 18,900 studies from people aged 7–90 years; MESA and MrOS contribute 2,237 and 3,930 PSGs, respectively, from older adults. Together, these cohorts span 65,000 participants and more than 585,000 h of sleep recordings. We further evaluated generalization using the Sleep Heart Health Study (SHHS)20,23—a multicenter dataset of 6,441 adults aged 40 years and older, held out from pretraining and used solely for transfer learning. Dataset distributions postfiltering are shown in Table 1. Demographics for SSC and BioSerenity appear in Extended Data Tables 1 and 2, whereas details for SHHS, MrOS and MESA are available in their respective publications.
Our preprocessing pipeline begins by resampling all signals to 128 Hz for consistency across cohorts. Signals are then segmented into 5-s windows, which serve as the model’s fundamental input tokens. The architecture includes one-dimensional (1D) convolutional layers for feature extraction, followed by channel-agnostic attention pooling to address variability in channel number and order across cohorts. A transformer block captures temporal dependencies over a 5-min context window. During pretraining, we use a multimodal CL objective to align representations across all modalities. The robustness of the model stems from its channel-agnostic design, enabling it to accommodate missing channels, varying channel counts and heterogeneous signal types.
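The first two preprocessing steps stated above (resampling to 128 Hz, then segmenting into 5-s tokens of 640 samples) can be sketched as follows. Linear interpolation is used here purely for illustration; a production pipeline would typically apply anti-aliased polyphase resampling, and the paper does not specify its resampling method.

```python
import numpy as np

TARGET_HZ = 128   # common rate across cohorts
WINDOW_S = 5      # each 5-s window is one input token (640 samples)

def resample(signal, src_hz, target_hz=TARGET_HZ):
    """Resample a 1-D signal to the target rate via linear interpolation
    (illustrative stand-in for a proper anti-aliased resampler)."""
    n_out = int(round(len(signal) * target_hz / src_hz))
    t_src = np.arange(len(signal)) / src_hz
    t_out = np.arange(n_out) / target_hz
    return np.interp(t_out, t_src, signal)

def tokenize(signal, hz=TARGET_HZ, window_s=WINDOW_S):
    """Segment into non-overlapping 5-s windows, dropping any trailing
    partial window; each row is one model input token."""
    win = hz * window_s
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)
```

For example, 10 s of a 256-Hz channel resamples to 1,280 samples and yields two 640-sample tokens, which the convolutional encoder then maps to embeddings.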
For downstream tasks, we leverage the pretrained model’s embeddings through lightweight fine-tuning. The token embeddings from different modalities are pooled again and processed by a two-layer long short-term memory (LSTM) network before passing through task-specific output heads. For patient-level prediction tasks (for example, disease prediction), an additional temporal pooling layer before the output layer compresses all token embeddings into a single 128-dimensional embedding.
To evaluate model performance across tasks, we use appropriate task-specific metrics. For classification tasks such as sex classification, we report area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC); for sleep apnea classification we show confusion matrices and report accuracy; for age estimation, we use mean absolute error (MAE) and Pearson correlation. Sleep staging is evaluated using the F1 score, which is well suited for class-imbalanced settings. For disease prediction, we report AUROC and Harrell’s concordance index (C-Index)—a standard survival analysis metric that measures the proportion of correctly ranked risk pairs. All metrics range from 0 to 1, with higher values indicating better performance; 95% confidence intervals (CIs) are computed using bootstrapping.
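Harrell's C-Index, used throughout the disease-prediction results, is the proportion of correctly ranked risk pairs among comparable pairs. A minimal pure-Python implementation (quadratic-time, for clarity rather than efficiency) is:

```python
def concordance_index(times, events, risks):
    """Harrell's C-Index (sketch).

    times: observed follow-up times; events: 1 if the event occurred,
    0 if censored; risks: predicted risk scores (higher = riskier).
    A pair (i, j) is comparable when patient i's event was observed and
    occurred before patient j's time; it is concordant when the model
    assigns i the higher risk. Tied risks count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:  # i demonstrably failed first
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable
```

A perfect risk ranking yields 1.0, a fully inverted ranking 0.0 and random scores about 0.5, matching the interpretation of the values reported in the figures.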
SleepFM supports standard sleep analysis tasks
After pretraining SleepFM, we assessed the general utility of its learned representations by fine-tuning on four common benchmark tasks: age estimation, sex classification, sleep stage classification and sleep apnea classification. Although these tasks are not the main focus of our work, they are useful validations showing that the model captures fundamental sleep patterns. For all tasks, we trained lightweight LSTM-based heads on top of the frozen multimodal embeddings derived from entire nights of PSG data.
For age estimation, we assessed the ability of the model to predict chronological age. Overall performance is shown in Extended Data Fig. 1, with the model achieving an MAE of 7.33 years and a correlation coefficient of 0.88. Performance varied across age groups, with higher accuracy in pediatric and middle-aged participants and greater error in elderly adults, suggesting that age prediction is more challenging at the extremes of the age spectrum. Sex classification yielded an AUROC of 0.86 (0.85–0.87) and AUPRC of 0.90 (0.89–0.91). For sleep stage classification, we fine-tuned an LSTM-based classifier to distinguish Wake, Stage 1, Stage 2, Stage 3 and rapid eye movement (REM) using 5-s windows—a more granular resolution than the standard 30-s epochs, which has been shown to improve precision in certain conditions (for example, narcolepsy10). As shown in Supplementary Fig. 1, SleepFM performs well on Wake, Stage 2 and REM, with expected confusion in transitional stages such as Stage 1—consistent with known human scoring variability. We report results across SSC, MESA, MrOS and SHHS, where SleepFM achieves competitive performance compared to U-Sleep7, YASA24, GSSC25 and STAGES10—state-of-the-art sleep staging models, as shown in Extended Data Tables 3 and 4. Furthermore, we compare SleepFM to three PhysioEx26 models on the public datasets DCSM27 and HMC28 in a fully external validation setting, achieving an F1 score of 0.68 on DCSM—outperforming all models—and 0.55 on HMC (Supplementary Table 1). Although the source alone has little impact, using several datasets for pretraining and fine-tuning improves generalization, boosting macro F1 by around 0.1 (Supplementary Tables 2, 3 and 4), consistent with previous work26.
For sleep apnea classification, we performed patient-level severity classification to distinguish between four commonly used severity groups on the basis of the apnea–hypopnea index (AHI): none (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30) and severe (AHI ≥ 30). Across MESA, MrOS and SHHS, we observe competitive performance, with a severity classification accuracy of 0.69 and a presence classification accuracy (none/mild versus moderate/severe) of 0.87. The confusion matrix for apnea classification is shown in Fig. 1.
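The two label schemes above (four-way severity and binary presence) follow directly from the stated AHI cut-offs and can be made explicit:

```python
def ahi_severity(ahi):
    """Map apnea-hypopnea index (events/hour) to the four severity
    classes used for patient-level classification."""
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

def apnea_present(ahi):
    """Binary presence task: none/mild versus moderate/severe."""
    return ahi >= 15
```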
Fig. 1: Overview of SleepFM framework.
a, PSG setup and dataset statistics across several sleep centers. Bars show the number of independent PSG recordings (participants) per cohort and the corresponding total recording hours. b, Multimodal contrastive pretraining: raw signals from each modality are encoded by a CNN, channel embeddings are pooled within modality and a temporal transformer with temporal pooling yields sequence-level representations for LOO-CL. C: channels, S: sequence length, D: embedding dimension. c, Fine-tuning using frozen embeddings for downstream tasks (sleep staging, apnea detection, disease prediction). Eight hours of multimodal embeddings are aggregated to patient-level representations, concatenated with age and sex, and passed to an LSTM followed by a fully connected layer. d, Evaluation across representative tasks and clinical applications. Left and middle: confusion matrices for sleep staging (SHHS) and AHI categories (SSC) shown as row-normalized percentages. Right: disease prediction performance on the Stanford cohort (n = 5,019 participants). Box plots summarize 1,000 patient-level bootstrap resamples: faint dots show individual bootstrap draws, and vertical lines with end caps mark the 95% bootstrap percentile CIs. Numeric labels are means. Number of positive samples for each disease: CKD (354), death (224), dementia (221), HF (283) and stroke (297).
SleepFM enables comprehensive disease prediction from sleep data
To enable disease prediction, we paired SSC data with EHRs, extracting all diagnostic codes (International Classification of Diseases, ninth revision (ICD-9) and International Classification of Diseases, tenth revision (ICD-10)) and their timestamps. These codes were mapped to phecodes—a hierarchical system of 1,868 disease categories designed for PheWAS29. The timestamp of each phecode was defined as the earliest among its corresponding ICD codes. Positive cases were defined as patients whose first phecode instance occurred more than 7 days after the sleep study, avoiding trivial associations. We excluded phecodes with prevalence below 1.5% to ensure statistical power, resulting in 1,041 phecodes for evaluation. For model fine-tuning, we used a multilabel extension of the Cox proportional hazards (CoxPH) loss, averaging independent losses computed for each label.
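The case-definition logic described above—phecode onset taken as the earliest mapped ICD date, positives required to occur more than 7 days after the sleep study, and (as used for the AUROC results below) a fixed prediction horizon—can be sketched as follows. The 365-day year used for the 6-year horizon is an approximation introduced here for illustration.

```python
from datetime import date, timedelta

LAG = timedelta(days=7)            # ignore codes within 7 days of the PSG
HORIZON = timedelta(days=6 * 365)  # ~6-year window for AUROC labels

def first_onset(icd_dates):
    """Phecode timestamp = earliest date among its mapped ICD codes."""
    return min(icd_dates)

def is_positive(psg_date, icd_dates):
    """Positive case: first phecode occurrence more than 7 days after
    the sleep study, avoiding trivial same-visit associations."""
    return first_onset(icd_dates) > psg_date + LAG

def horizon_label(psg_date, icd_dates):
    """6-year AUROC label: incident case with onset inside the window."""
    onset = first_onset(icd_dates)
    return psg_date + LAG < onset <= psg_date + HORIZON
```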
Figure 2 illustrates the performance of SleepFM across disease categories on the test set. Although performance varies across categories, SleepFM demonstrates strong results in several areas, including neoplasms, pregnancy complications, circulatory conditions and mental disorders. Overall, 130 future diseases achieved a C-Index and AUROC of at least 0.75 on held-out participants (Bonferroni-corrected P < 0.01), as summarized in Supplementary Table 5. AUROC was calculated using a 6-year horizon, meaning a condition is considered positive if the patient develops the disease within 6 years of their PSG study. The 6-year horizon for AUROC calculation was chosen to balance performance and account for both long-term and short-term conditions. Supplementary Fig. 2 shows AUROC values across 1–6 year horizons for several conditions.
Fig. 2: Performance of SleepFM on the held-out test set (n = 5,019) as stratified by disease category.
Individual dots represent a disease within a category. The results are evaluated using two metrics: the C-Index, which measures the model’s ability to rank patient risk accurately, and the 6-year AUROC, which assesses the model’s discrimination performance by evaluating its ability to distinguish between patients who experience the event of interest and those who do not within a 6-year prediction window. For reference, the horizontal dashed line indicates a threshold of 0.75.
The model showed high accuracy for mild cognitive impairment (AUROC 0.84 (0.80–0.88)), aligning with studies showing sleep disturbances as early markers of cognitive decline30. Strong performance was observed for Parkinson’s disease (0.93 (0.89–0.96)), where sleep disorders are increasingly recognized as potential early indicators31, and developmental delays and disorders (0.84 (0.79–0.87)). Among circulatory conditions, the model effectively predicted hypertensive heart disease (0.88 (0.85–0.91)) and intracranial hemorrhage (0.82 (0.73–0.90)), consistent with established links between sleep disorders and cardiovascular risk32. In the Neoplasm category, the model showed strong predictive performance for several cancers: prostate cancer (0.90 (0.87–0.93)), breast cancer (0.90 (0.86–0.93)) and melanomas of skin (0.83 (0.76–0.90)). These findings align with existing literature linking sleep patterns to cancer risk33,34.
Drawing on sleep expertise and previous literature, we identified 14 conditions with strong potential links to sleep patterns. Previous studies associate sleep regularity with mortality35, prolonged sleep with early neurodegeneration36 and sleep disturbances with dementia37, stroke38 and cardiovascular outcomes9. Related phecodes were grouped into unified disease categories in consultation with a medical doctor (Supplementary Table 6). Results for selected conditions—including death, stroke, heart failure (HF) and dementia—are shown in Extended Data Fig. 2. SleepFM demonstrates strong predictive performance, with particularly high accuracy for death (AUROC 0.84 (0.80–0.88)), HF (0.83 (0.79–0.86)), chronic kidney disease (CKD) (0.82 (0.79–0.85)), dementia (0.87 (0.84–0.91)) and stroke (0.81 (0.78–0.85)). All reported associations are statistically significant (P < 0.01, Bonferroni-corrected).
To better understand the physiological basis of disease prediction, we analyzed model performance stratified by both sleep stages and signal modalities. We found that although most sleep stages contribute similarly to disease prediction, certain stages such as Stage 1/2 and REM can offer slightly better predictive power for specific conditions, including cardiovascular and neurodegenerative diseases. Likewise, different signal modalities showed nuanced differences, with BAS signals better capturing mental and neurological conditions, respiratory signals more predictive of respiratory and metabolic disorders, and ECG signals more informative for circulatory diseases. Although these differences align with known physiology, the overall predictive performance was highest when combining all modalities. Full results and condition-specific breakdowns are provided in Supplementary Figs. 3 and 4 and Supplementary Tables 7 and 8. Furthermore, we trained separate SleepFM models on each modality to directly assess modality-level importance. Performance comparisons stratified by disease category, presented in Supplementary Tables 9 and 10, further confirm that combining all modalities yields the optimal performance.
SleepFM demonstrates robust generalization across time and cohorts
We evaluate the generalization capabilities of SleepFM across temporal distribution shifts and external site validation. For temporal generalization, we test the model on a separate cohort comprising Stanford patients from 2020 onwards. All model pretraining and training was done on data from before 2020. Despite the limited follow-up period, SleepFM maintains strong predictive performance. Extended Data Fig. 3 shows results for our 14 selected conditions, with particularly robust and statistically significant performance (Bonferroni-corrected P < 0.01) for death (0.83 (0.73–0.91)), HF (0.80 (0.75–0.85)) and dementia (0.83 (0.76–0.89)). Comprehensive temporal-split performance across all disease phenotypes and categories is provided in Supplementary Figs. 5 and 6. Supplementary Fig. 7 further reports temporal-split performance comparisons with baseline models, stratified by disease category.
To assess cross-site generalization, we evaluate SleepFM’s transfer learning capabilities on SHHS—a dataset entirely excluded from the pretraining phase. We use the pretrained model to extract embeddings and then fine-tune it on a subset of SHHS. Specifically, the SHHS fine-tuning set includes 3,291 participants, and the test set includes 2,000 participants. Due to differences in task availability between SSC and SHHS, our evaluation focuses on six overlapping cardiovascular conditions. This setup mimics real-world deployment scenarios where foundation models must be adapted to new clinical sites with minimal supervision.
As shown in Fig. 3, SleepFM demonstrates strong transfer learning performance across key outcomes. For example, the model achieves statistically significant predictive accuracy (Bonferroni-corrected P < 0.01) for stroke (0.82 (0.76–0.87)), congestive HF (0.85 (0.82–0.88)) and mortality related to cardiovascular disease (0.88 (0.83–0.91)).
Fig. 3: SleepFM prediction performance on the SHHS test set (n = 2,000 participants).
Due to differences in available outcome data between the SHHS and Stanford datasets, evaluation was limited to a subset of conditions. Results demonstrate transfer learning capabilities across these key clinical outcomes, including stroke, congestive HF and cardiovascular disease-related mortality. Each panel uses barplots derived from 1,000 patient-level bootstrap resamples: faint points are individual bootstrap draws, and the vertical line with end caps marks the 95% bootstrap percentile CI. Numbers above bars report the mean. Metrics are C-Index (top) and AUROC at 6 years (bottom). The number of positive samples for each outcome is as follows: angina (704), cardiovascular disease death (128), congestive HF (190), coronary heart disease death (80), myocardial infarction (103) and stroke (95). All conditions are statistically significant with a P value <0.01 after Bonferroni correction.
SleepFM surpasses supervised baselines in disease prediction
We compare SleepFM against two supervised baselines: Demographics and End-to-End PSG. The demographics baseline is a multilayer perceptron (MLP) trained on structured clinical features (age, sex, race/ethnicity and body mass index (BMI)). This baseline includes more demographic variables than the SleepFM-based models, which only use age and sex. The End-to-End PSG model is trained directly on raw PSG data using the same architecture and parameter count as SleepFM, and it includes age and sex but does not use any pretraining. From Fig. 4, we observe that the percentage difference in AUROC between SleepFM and both baseline models ranges from 5% to 17%. The magnitude of improvement varies across disease categories; for example, gains are more pronounced in neurological and hematopoietic conditions, whereas in neoplasm-related conditions the improvements are comparatively modest. Supplementary Fig. 8 reports the overall test-set performance comparison between SleepFM and the baseline models across all disease phenotypes.
Fig. 4: Performance improvements of SleepFM over baseline models across disease categories on Stanford test set (n = 5,019 participants).
SleepFM and the End-to-End PSG model include age and sex demographic features, whereas the demographics-only model includes age, sex, BMI and race/ethnicity. Each box shows the distribution of disease-level percentage improvements of SleepFM relative to each baseline within the indicated disease category. Improvements are shown for both C-Index (top) and 6-year AUROC (bottom) metrics. Boxes represent the interquartile range (IQR), with whiskers extending to 1.5× IQR and outliers shown as points. Diamonds denote the mean improvement within each category. The horizontal dashed line at zero indicates no improvement.
Next, we evaluated three different variants of SleepFM using identical training configurations, as shown in Table 2 and Extended Data Table 5. SleepFM-LSTM (without Demo) uses SleepFM embeddings with a two-layer LSTM fine-tuning head but no demographic features. SleepFM-Linear uses SleepFM embeddings with a simple linear prediction head and includes age and sex. Finally, SleepFM-LSTM combines the pretrained SleepFM embeddings with a two-layer LSTM head and includes age and sex.
As seen in Table 2, the demographics-only baseline performs well, reflecting the fact that many diseases are associated strongly with age, sex, BMI and race/ethnicity. For example, in the Neoplasm category, older age is a strong predictor of cancer risk. Nevertheless, all SleepFM-based models, including the SleepFM-LSTM (without Demo) variant, consistently outperform the demographics and End-to-End PSG baselines across most disease categories. This demonstrates the benefit of using pretrained SleepFM embeddings for disease prediction. Furthermore, compared to the supervised demographics baseline, SleepFM-LSTM (without Demo) achieves over +5 AUROC points in 9 out of 14 conditions, whereas SleepFM-Linear and SleepFM-LSTM achieve over +5 AUROC points in 12 out of 14 conditions. As seen from the 95% CI bars, these improvements are robust, with most differences being larger than the uncertainty intervals. Finally, SleepFM-Linear performs comparably to SleepFM-LSTM, suggesting that the strength of the model lies in the pretrained embeddings rather than the complexity of the downstream head. Percentage improvement comparisons across models are provided in Supplementary Fig. 9, and a scatterplot comparison of all disease phenotypes across different fine-tuning architectures on top of SleepFM is shown in Supplementary Fig. 10.
To further examine disease-specific performance, full results are provided in Supplementary Tables 11, 12 and 13, and clinician-selected conditions are presented in Supplementary Fig. 11. These comparisons show that SleepFM achieves substantial gains across several neurological, mental, circulatory, endocrine/metabolic and respiratory conditions. For neurological and mental disorders, SleepFM attains higher C-Index scores for senile dementia (0.99 (0.98–1.00) versus 0.87 (0.75–0.96)), myoneural disorders (0.81 (0.73–0.88) versus 0.42 (0.28–0.55)) and developmental delays (0.80 (0.77–0.84) versus 0.58 (0.51–0.64)). For circulatory diseases, SleepFM outperforms in atherosclerosis (0.92 (0.88–0.95) versus 0.74 (0.64–0.89)) and acute pulmonary heart disease (0.80 (0.75–0.85) versus 0.74 (0.68–0.80)). Improvements in endocrine/metabolic conditions include diabetes type 2 with circulatory complications (0.87 (0.83–0.91) versus 0.79 (0.74–0.85)) and diabetic retinopathy (0.81 (0.77–0.85) versus 0.75 (0.69–0.80)). For respiratory conditions, SleepFM achieves higher C-Index in respiratory insufficiency (0.79 (0.72–0.85) versus 0.59 (0.51–0.67)) and failure (0.77 (0.73–0.80) versus 0.70 (0.65–0.74)). These findings highlight the versatility of SleepFM in predicting a broad range of diseases beyond what is captured by demographics alone.
Similarly, full comparisons with the End-to-End PSG model are provided in Supplementary Table 14. This comparison highlights the value of foundation model pretraining: although both models share similar architecture and input signals, SleepFM benefits from self-supervised pretraining, enabling more robust and informative representations. This advantage is reflected in consistent performance gains across neurological, circulatory, endocrine/metabolic and respiratory conditions. For neurological and mental disorders, SleepFM outperforms the end-to-end model in myoneural disorders (0.84 (0.75–0.91) versus 0.54 (0.40–0.69)), developmental delays (0.84 (0.79–0.87) versus 0.61 (0.52–0.69)) and speech/language disorders (0.83 (0.74–0.90) versus 0.71 (0.60–0.83)). For circulatory conditions, improvements are observed in atherosclerosis of native arteries of the extremities (0.95 (0.92–0.98) versus 0.65 (0.61–0.69)), atherosclerosis of the extremities (0.84 (0.75–0.90) versus 0.78 (0.71–0.85)) and acute pulmonary heart disease (0.84 (0.77–0.90) versus 0.76 (0.69–0.83)). In endocrine/metabolic disorders, SleepFM demonstrates stronger performance for predicting diabetes with circulatory complications (0.89 (0.85–0.93) versus 0.79 (0.70–0.87)), neurological manifestations (0.86 (0.81–0.90) versus 0.73 (0.67–0.78)) and diabetic retinopathy (0.84 (0.79–0.89) versus 0.76 (0.69–0.82)). Respiratory conditions also benefit, with better performance in predicting respiratory insufficiency (0.82 (0.72–0.91) versus 0.64 (0.54–0.73)) and respiratory failure (0.76 (0.71–0.82) versus 0.68 (0.62–0.74)). In predicting all-cause mortality, SleepFM achieves an AUROC of 0.85 (0.80–0.89), outperforming both the Demographic baseline and End-to-End PSG model, which achieve an AUROC of 0.78 (0.72–0.82).
Finally, we compare fine-tuning scalability by evaluating SleepFM alongside two baseline models as we increase the amount of fine-tuning data and measure performance on the same test set. These results are shown in Extended Data Fig. 4 for SHHS and Extended Data Fig. 5 and Supplementary Fig. 12 for SSC. In both plots, the key observation is that SleepFM consistently outperforms the supervised baselines, with its performance improving steadily as more data are used, remaining above the baseline curves for nearly all conditions. For SHHS, SleepFM surpasses the Demographics baseline in five out of six conditions across all data percentages, with particularly large improvements in smaller dataset splits. For example, SleepFM trained on just 10% of the data outperforms the Demographics baseline trained on five times more data across all conditions in SSC and four out of six conditions in SHHS (for example, cardiovascular disease death, congestive HF, myocardial infarction and stroke). SleepFM also outperforms the End-to-End PSG baseline in five out of six conditions, although the gap is slightly smaller than with the Demographics baseline. SleepFM exhibits stable scaling behavior across data percentages, with smoother performance improvements, whereas the baseline models show greater variability.
Discussion
This study presents a large-scale foundation model for sleep analysis, developed on more than 585,000 h of PSG data from 65,000 participants. Our work makes several contributions. First, we address challenges in sleep analysis by leveraging self-supervised learning to train a foundation model that learns from unlabeled data and is agnostic to channel type and number, enabling broad exploration of sleep data across diverse clinical settings. Second, through extensive evaluation across 1,041 disease phenotypes, we demonstrate sleep’s broad predictive power for diverse health outcomes. The model shows strong performance in predicting death (C-Index 0.84), dementia (0.85), HF (0.80) and CKD (0.79). Third, we demonstrated transfer learning capabilities through strong performance on the SHHS dataset. Despite SHHS being entirely excluded from pretraining, our model maintains robust predictive power for key outcomes such as stroke (C-Index 0.81), congestive HF (0.83) and death related to cardiovascular disease (0.86). Finally, SleepFM achieves competitive performance on standard sleep analysis tasks, including sleep staging and apnea detection, with mean F1 scores ranging from 0.70 to 0.78 across cohorts—comparable to state-of-the-art models such as U-Sleep7, GSSC25, STAGES10 and YASA24. Furthermore, in a fully external validation setting, SleepFM outperforms all models on DCSM (F1 = 0.68) and is competitive with the PhysioEx26 models. For apnea classification, SleepFM achieves 87% accuracy in MESA, MrOS and SHHS, comparable to state-of-the-art models8.
SleepFM predicts all-cause mortality more accurately than both the Demographics-based model and the End-to-End PSG model, achieving a higher C-Index of 0.84 (0.81–0.87), compared to 0.79 (0.75–0.82). This indicates that pretraining efficiently captures subtle mortality-related signals in the PSG data. Research shows strong association between all-cause mortality and sleep-related factors, including high arousal burden39, low REM sleep40, sleep-disordered breathing41, hypoxemia and low sleep efficiency42. Increased ‘brain age’ derived from EEG has also been identified as an important predictor of mortality3. SleepFM probably integrates these multifactorial contributors, capturing respiratory events, sleep fragmentation, arousal burden and sleep efficiency, along with markers of cardiovascular, metabolic and other diseases.
Predictive and prognostic models for neurological and mental disorders are advancing rapidly, offering the potential for earlier and more individualized treatment. Among the top conditions predicted by SleepFM were Alzheimer’s disease and Parkinson’s disease, with C-Indices of 0.91 (0.87–0.98) and 0.89 (0.85–0.92), respectively. Sleep disorders are