Introduction
Sleep state detection using accelerometer data is a crucial task in sleep research and health monitoring, aimed at identifying and classifying sleep-related events such as sleep onset (beginning of sleep) and wakeup (end of sleep)1. By leveraging wearable devices like wrist-worn accelerometers, this approach captures motion data that reflects a person’s activity levels, which can be analyzed to detect periods of inactivity indicative of sleep2. Accurate detection of these sleep events is vital for applications in health diagnostics, sleep disorder monitoring, and personalized health interventions3. While traditional sleep tracking relies on subjective logbooks, accelerometer-based methods offer an objective, continuous, and unobtrusive alternative for monitoring sleep patterns in real-world settings4. However, detecting sleep states from accelerometer data involves addressing challenges such as distinguishing sleep from sedentary activities, identifying interruptions in sleep, and accounting for periods when the device may not be worn, all contributing to the complexity of automated sleep state detection5.
Sleep stage detection is traditionally performed using polysomnography (PSG), which involves the simultaneous recording of physiological signals such as electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG), and electrocardiogram (ECG)6. Despite PSG being the gold standard, single-channel EEG has gained popularity in recent years for sleep monitoring due to its non-invasiveness and ease of use, especially for home-based sleep studies. PSG or single-channel EEG recordings are generally divided into 30-second epochs, each of which is manually reviewed by sleep specialists and classified into one of the five sleep stages: wake (W), rapid eye movement (REM), and three non-REM stages (N1, N2, and N3)7. However, this manual classification process is time-consuming, tedious, and subject to inter-rater variability. As a result, there is a growing need for automated sleep stage classification systems that can alleviate the burden on sleep specialists. In recent years, advances in machine learning and deep learning techniques have been employed to develop more accurate, efficient, and scalable solutions for automatic sleep stage classification8, reducing the reliance on manual intervention and improving the consistency of sleep stage detection. These systems not only offer faster processing but also contribute to better monitoring of large-scale sleep datasets, making them valuable tools in both clinical and research settings.
Many studies have applied conventional machine learning techniques to detect sleep states from physiological signal data. These methods typically involve two main steps: feature extraction and classification. In the first step, various features are extracted from the recorded signals, such as statistical, temporal, and spectral characteristics that correlate with sleep states. Feature selection techniques, like mutual information or principal component analysis, are often employed to retain the most relevant features. In the second step, the selected features are fed into traditional machine learning classifiers, such as Naive Bayes9, Support Vector Machines (SVM)10,11, Random Forest (RF)8,12, or ensemble-based models13. While these methods can perform well, they rely heavily on manual feature engineering, which requires domain knowledge and may limit their ability to adapt to different datasets or contexts without significant adjustments14,15. Recent work has also explored hybrid models that combine traditional classifiers with deep learning to improve sleep state detection16,17.
The proposed SE-ResNet-U-Net model combines the strengths of U-Net, Squeeze-and-Excitation (SE) blocks, and Residual blocks for robust and accurate sleep state detection. The U-Net serves as the base architecture, employing an encoder-decoder structure to extract and reconstruct features from 1D physiological data. SE blocks recalibrate channel-wise features dynamically, enabling the network to focus on the most relevant patterns for sleep stage classification. Residual blocks are integrated to enhance gradient flow and preserve critical low-level features, mitigating vanishing gradient issues. The methodology includes preprocessing steps to refine the input data, such as removing irrelevant stages, merging stages per standard practices, and focusing on key time intervals. Comprehensive experiments are conducted using publicly available datasets, evaluating the model against state-of-the-art approaches under multiple performance metrics. The main contributions of this approach are:
Novel Hybrid Architecture: Proposes a novel 1D U-Net-based architecture for sleep state detection, integrating U-Net, Squeeze-and-Excitation (SE) blocks, and residual blocks tailored for physiological data like EEG.
Residual Block Integration: Introduces residual blocks in the encoder to improve gradient flow, mitigate vanishing gradient issues, and preserve critical low-level temporal features, enhancing the model’s ability to handle sequential data.
Squeeze-and-Excitation (SE) Blocks: Utilizes SE blocks to dynamically recalibrate channel-wise features, enabling the model to focus on the most relevant signal patterns, thereby improving sleep stage classification accuracy.
Superior Performance: Demonstrates the model’s superior performance on multiple publicly available datasets (Sleep-EDF-20, Sleep-EDF-78, SHHS), achieving high accuracy, F1 scores, and Cohen’s kappa values compared to baseline methods.
Practical Applicability: Highlights the potential for the model to be used in scalable, efficient, and automated sleep state classification, suitable for both clinical and home-based sleep monitoring applications.
Related work
Recent advancements in sleep state detection have leveraged deep learning architectures for improved accuracy. Traditional approaches, such as feature-based classifiers, often rely on handcrafted features and lack generalizability across datasets. Hybrid models like U-Net and ResNet have demonstrated significant potential by combining segmentation and classification capabilities, particularly in biomedical signal analysis. The integration of SE modules further enhances these models by improving feature channel interdependencies, leading to more precise sleep state classification.
Olsen et al.18 present a deep learning architecture for classifying sleep stages using data from wrist-worn consumer sleep technologies, specifically accelerometer (ACC) and photoplethysmography (PPG) signals. The proposed deep neural network (DNN) processes multivariate time series data to predict sleep stages, achieving notable accuracy and kappa values. Key findings emphasize the importance of preprocessing and of combining the ACC and PPG modalities, particularly for detecting REM sleep. The study highlights the potential of consumer sleep-tracking devices for out-of-clinic monitoring while acknowledging limitations such as data loss. Chen et al.19 present a deep learning framework for sleep-wake detection that uses acceleration and heart rate variability (HRV) data and introduces a Local Feature-Based Long Short-Term Memory (LF-LSTM) approach for effective feature learning. The method achieves an accuracy of 95.1% and a G-mean of 0.884, outperforming traditional machine learning and other deep learning approaches. The inclusion of HRV data improves detection performance, particularly in handling unbalanced datasets.
Cho et al.20 present Deep-ACTINet, a hybrid deep learning model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks for sleep-wake detection using accelerometer data. The model processes raw activity signals without feature engineering and demonstrates superior performance over traditional algorithms, achieving an accuracy of 89.65%, recall of 92.99%, and precision of 92.09%. It highlights the limitations of conventional methods and emphasizes the model’s effectiveness in capturing significant information for sleep state classification. The study involved data from ten healthy volunteers and suggests potential applications in wearable devices. Future work may include the integration of additional physiological data for better accuracy. The study by Tuckwell et al.21 investigates the use of deep learning to classify recent sitting and sleep history from raw ACC data during simulated driving tasks. Using two convolutional neural networks, ResNet-18 and DixonNet, the research involved 84 participants and focused on a 20-minute rural driving task. ResNet-18 outperformed DixonNet, achieving a classification accuracy of 88.6% for both sitting and sleep history. The findings emphasize the potential of using ACC data to detect driver fatigue and the impact of prolonged sitting and inadequate sleep on driving performance. Limitations include a lower accelerometer sampling rate and a focus on young, healthy participants.
Rawan et al.22 investigate sleep posture monitoring using a neck-mounted accelerometer, focusing on detecting four sleep positions with high accuracy. They evaluate three machine learning models: decision trees, extra-trees classifiers, and long short-term memory neural networks, achieving mean F1-scores of 0.945, 0.975, and 0.965, respectively. The decision tree model is highlighted for its low memory usage and fast prediction time. The research supports the feasibility of low-power wearable monitoring systems for healthcare, particularly for epilepsy patients. Limitations include a small participant pool and the need for further validation in natural settings.
TinyUNet23 is a lightweight U-Net-based model for automated sleep stage classification using single-channel EEG and EOG signals. Designed to address generalization challenges and class imbalance (notably for rare stages like N1 and N3), the model integrates attention mechanisms (CSJA and SE blocks) and a custom Sparse Weighted Dice-Focal (SWDF) loss to prioritize hard-to-classify samples. Trained on a large, diverse dataset (9970 records from 7226 subjects), it achieves robust performance (84.6% accuracy, 79.6% macro F1-score), outperforming existing models like DeepSleepNet and U-Time, especially in N1 recognition. Its efficiency, adaptability to variable temporal resolutions, and compatibility with portable devices make it a practical tool for clinical and at-home sleep monitoring.
Another approach was a hybrid U-Net and Conv-LSTM24 model to detect student sleepiness in classrooms by analyzing body movement inactivity from video recordings. The modified U-Net architecture integrates Conv-LSTM blocks to capture both spatial and temporal features, addressing challenges like occlusions, varied poses, and lighting changes. Trained on video data from a 2MP camera installed at 2.5 meters, the model achieved 92% accuracy in identifying active (non-sleepy) students, outperforming standard U-Net in segmentation metrics (Dice and IoU scores). Ethical considerations around privacy are highlighted, and future work includes extending the framework for activity classification to enhance monitoring of student engagement.
U-Sleep25 is a deep-learning system for automated sleep staging designed to address the laborious and variable nature of manual sleep stage classification. It is a fully convolutional neural network trained on 15,660 participants across 16 diverse clinical studies, enabling robust performance across varying EEG/EOG setups and patient demographics. It predicts sleep stages at high temporal resolutions (up to 128 Hz), offering detailed insights beyond traditional 30-second intervals. Evaluations on unseen datasets show U-Sleep matches human expert accuracy and outperforms specialized models, even distinguishing conditions like obstructive sleep apnea more effectively using high-frequency data. The system is publicly available, aiming to streamline clinical workflows, reduce costs, and standardize sleep analysis globally.
Perslev et al.26 proposed U-Time, a fully convolutional neural network for automated sleep stage classification using EEG data. Inspired by the U-Net architecture for image segmentation, U-Time replaces recurrent layers with a feed-forward encoder-decoder design to simplify training and enhance robustness. The model processes entire sleep recordings in one pass, outputting sleep stages at flexible temporal resolutions. Evaluated across seven diverse datasets with fixed hyperparameters, U-Time matches or outperforms state-of-the-art models (e.g., CNN-LSTMs) and even approaches human expert accuracy in some cases. Its ability to handle varying input lengths and provide high-resolution predictions makes it a scalable, efficient solution for clinical sleep analysis, reducing reliance on manual scoring and improving diagnostic workflows.
TransUSleepNet27 is a deep learning model designed for automated sleep stage classification using single-channel EEG signals. It integrates U-Net’s encoder-decoder architecture, which captures multi-scale spatial features through downsampling, upsampling, and skip connections, with Transformer modules to model global dependencies via self-attention mechanisms. This hybrid approach addresses the limitations of prior CNN/RNN-based methods by enhancing feature extraction across different sleep stages while maintaining computational efficiency. Evaluated on the Sleep-EDF dataset, the model achieves state-of-the-art performance (89.8% accuracy) and outperforms existing methods, particularly in challenging stages like N1. However, the classification of the N1 stage remains suboptimal due to data imbalance and overlapping EEG characteristics, highlighting areas for future improvement.
Transformers, a type of deep learning architecture known for their ability to model long-range dependencies through self-attention mechanisms, have shown promising potential in sleep monitoring detection systems. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers can capture complex temporal patterns in sleep-related physiological signals, such as EEG, EOG, and ECG, without the need for sequential data processing28. By leveraging the attention mechanism, transformers enable the model to focus on the most informative portions of the signal, improving the accuracy of sleep stage classification29. Their scalability and ability to handle large, noisy datasets make transformers highly suitable for sleep monitoring applications, particularly for continuous, real-time sleep analysis, and personalized health interventions in home-based or clinical settings. Recent studies have applied transformers to automatic sleep stage scoring and sleep-wake detection, demonstrating their effectiveness in achieving higher performance metrics, such as accuracy and F1 scores, compared to traditional methods30.
Methodology
Proposed SE-ResNet-U-Net model
The proposed methodology for sleep state detection leverages a U-Net-based deep learning architecture designed for 1D data and is shown in Fig. 1. This model utilizes a combination of 1D convolutional layers, batch normalization, ReLU activation, and SE blocks to capture complex patterns in sleep stage data. The network consists of an encoder-decoder structure where the encoder progressively extracts features from the input signal while the decoder reconstructs the output with enhanced feature representations. The residual blocks, integrated within the encoder, allow for better gradient flow and the preservation of detailed temporal information, making the model more robust for sequential data processing. A key novel contribution is the incorporation of SE mechanisms, which help the network recalibrate channel-wise features, enabling the model to focus on the most relevant information during training. The use of global average pooling and adaptive kernel sizes allows the model to adaptively learn the most salient features of the input data, facilitating more accurate classification of sleep stages. This methodology provides a novel approach to sleep state detection, addressing challenges such as variability in the input signal and improving model performance by optimizing feature selection and enhancing spatial-temporal relationships in the data. The proposed approach demonstrates a significant improvement over traditional methods by combining convolutional neural networks with advanced attention mechanisms, offering a more efficient and interpretable model for sleep state classification.
Fig. 1
The architecture of the proposed deep learning model for sleep stage detection.
U-Net architecture
The 1D U-Net architecture is an adaptation of the U-Net model, originally designed for image segmentation, but here used for time-series data such as the physiological signals (e.g., EEG, EOG, ECG) in the context of sleep state detection. This architecture is particularly well-suited for sequential data, as it uses both contracting (downsampling) and expanding (upsampling) paths to capture both low-level features (such as rhythmic patterns in EEG signals) and high-level, context-dependent features (such as sleep stage transitions).
The 1D U-Net consists of two main parts:
- 1.
Contracting Path (Encoder)
This part progressively reduces the spatial dimension of the input signal while increasing the number of feature channels. It is responsible for capturing the abstract, global features of the input signal. This is done by a series of convolutional layers followed by downsampling operations, usually performed using max pooling or strided convolutions. The encoder in a 1D U-Net consists of multiple blocks, where each block typically includes a 1D convolution followed by batch normalization and ReLU activation. The output of each block is downsampled using max-pooling or strides in the convolutions. For the (i)-th layer in the encoder, let the input to the layer be (\mathbf{X}_i \in \mathbb {R}^{C_i \times L_i}), where (C_i) is the number of channels and (L_i) is the length of the input signal at that layer. The convolutional operation can be expressed as:
$$\begin{aligned} \mathbf{X}'_i = \text{Conv1d}(\mathbf{X}_i, W_i) + b_i \end{aligned}$$
Where:
(W_i \in \mathbb{R}^{k \times C_i \times C_{i+1}}) is the weight matrix of the (i)-th convolutional layer,
(b_i \in \mathbb{R}^{C_{i+1}}) is the bias term,
(k) is the kernel size,
(C_{i+1}) is the number of output channels for this layer.
The next operation in the contracting path is downsampling, which reduces the length of the feature map, typically by applying max pooling:
$$\begin{aligned} \mathbf{X}''_i = \text{MaxPool1d}(\mathbf{X}'_i) \end{aligned}$$
Where (\text {MaxPool1d}) reduces the spatial dimension (length) of the feature map.
- 2.
Expanding Path (Decoder)
This part upsamples the feature maps from the contracting path and refines them for precise localization. The expanding path increases the spatial dimension of the feature maps through transposed convolutions or interpolation, combining the upsampled features with those from the contracting path to improve context awareness. For the (i)-th layer in the decoder, the input is the concatenation of the corresponding encoder output and the previous decoder output. Let (\mathbf{X}_i \in \mathbb{R}^{C_i \times L_i}) represent the input feature map at the (i)-th decoder layer, and let the output be (\mathbf{X}'_i). The upsampling operation is typically performed using transposed convolutions:
$$\begin{aligned} \mathbf{X}'_i = \text{ConvTranspose1d}(\mathbf{X}_i, W_i) + b_i \end{aligned}$$
Where:
(\mathbf{X}_i) is the input feature map from the concatenated encoder-decoder connection,
(W_i) is the weight matrix for the transposed convolution,
(b_i) is the bias term.
The final step in the decoder is the output layer, which produces a 1D feature map representing the predicted sleep stage for each segment of the input signal. This is a classification task, so a softmax or sigmoid activation is typically used depending on whether the classification is multi-class or binary.
- 3.
Output Layer
Let (\mathbf{X}_{\text {out}} \in \mathbb {R}^{1 \times L}) be the output feature map of the network, which corresponds to the predicted sleep stages for each time step. The final output is computed using a 1D convolution:
$$\begin{aligned} \mathbf{X}_{\text {out}} = \text {Conv1d}(\mathbf{X}_\text {final}, W_{\text {out}}) + b_{\text {out}} \end{aligned}$$
Where:
(\mathbf{X}_\text{final}) is the concatenated or refined feature map from the decoder,
(W_{\text{out}}) and (b_{\text{out}}) are the weights and biases of the final output convolution.
The final prediction (\hat{y}_t) for each time step (t) is computed using a softmax or sigmoid activation function:
$$\begin{aligned} \hat{y}_t = \text {Softmax}(\mathbf{X}_{\text {out}}(t)) \end{aligned}$$
Where (\hat{y}_t) corresponds to the predicted sleep stage at time step (t).
In sleep state detection, the 1D U-Net learns to map raw or preprocessed physiological signals to discrete sleep stages. The encoder extracts relevant temporal and frequency features, while the decoder restores the resolution of the features and classifies the signal at each time step into one of the sleep stages. The skip connections between the encoder and decoder allow the model to retain both low-level and high-level features, making it highly effective for sequential data like sleep signals.
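The following PyTorch sketch illustrates how the contracting path, expanding path, and output layer described above can be assembled for 1D signals. The channel counts, kernel sizes, and two-level depth are illustrative assumptions, not the exact configuration of the proposed model.

```python
# Minimal 1D U-Net sketch (PyTorch). Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderBlock1D(nn.Module):
    """Conv1d -> BatchNorm -> ReLU, followed by max pooling (contracting path)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, x):
        skip = self.conv(x)            # features kept for the skip connection
        return self.pool(skip), skip   # downsampled output + skip tensor

class DecoderBlock1D(nn.Module):
    """Transposed-convolution upsampling, concatenation with the encoder skip, then refinement."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv1d(out_ch * 2, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # channel-wise concatenation (skip connection)
        return self.refine(x)

class TinyUNet1D(nn.Module):
    """Two-level 1D U-Net ending in a per-time-step softmax over sleep stages."""
    def __init__(self, in_ch=1, n_classes=5, base=16):
        super().__init__()
        self.enc1 = EncoderBlock1D(in_ch, base)
        self.enc2 = EncoderBlock1D(base, base * 2)
        self.bottleneck = nn.Conv1d(base * 2, base * 4, kernel_size=3, padding=1)
        self.dec2 = DecoderBlock1D(base * 4, base * 2)
        self.dec1 = DecoderBlock1D(base * 2, base)
        self.head = nn.Conv1d(base, n_classes, kernel_size=1)  # output layer

    def forward(self, x):                  # x: (batch, channels, length), length divisible by 4
        x, s1 = self.enc1(x)
        x, s2 = self.enc2(x)
        x = self.bottleneck(x)
        x = self.dec2(x, s2)
        x = self.dec1(x, s1)
        return torch.softmax(self.head(x), dim=1)  # stage probabilities per time step
```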
Squeeze and excitation block (SE)
The SE is a powerful mechanism designed to improve the representational capacity of deep learning models by adaptively recalibrating channel-wise features. In the context of sleep stage detection, SE blocks enable the model to focus on the most relevant features of sleep signals (e.g., EEG, EOG, ECG) that contribute to the accurate classification of sleep stages such as REM, light sleep, and deep sleep. The SE block operates first by performing a squeeze operation to capture global information across the input feature map, followed by an excitation operation to recalibrate the importance of each channel.
- 1.
Squeeze Operation
The first step in the SE block is the squeeze operation, which captures global information by performing global average pooling. For a feature map (\mathbf{X}) of shape (H \times W \times C), where (H) and (W) represent the height and width of the feature map, and (C) represents the number of channels, the squeeze operation aggregates the feature map along the spatial dimensions (height and width). This results in a vector (\mathbf{z} \in \mathbb {R}^C) representing the global context for each channel:
$$\begin{aligned} \mathbf{z}_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{X}_{ijc} \end{aligned}$$
Where:
(\mathbf{X}_{ijc}) is the value at position ((i, j)) in the (c)-th channel,
(\mathbf{z}_c) is the global context or “squeezed” value for channel (c).
- 2.
Excitation Operation
The second step is the excitation operation, where the network learns to assign weights to each channel based on its importance. This is done through two fully connected layers followed by a sigmoid activation, which outputs a set of channel-wise scaling factors (\mathbf{s}) that are applied to the input feature map to recalibrate its channels. Let (\mathbf{z}) be the vector obtained from the squeeze operation, and let the excitation operation be formulated as:
$$\begin{aligned} \mathbf{s} = \sigma (W_2 \delta (W_1 \mathbf{z})), \end{aligned}$$
Where:
(\mathbf{s} \in \mathbb{R}^C) is the scaling factor vector for each channel,
(\sigma) is the sigmoid activation function,
(W_1) and (W_2) are the weights of the fully connected layers,
(\delta) is the ReLU activation function applied to the intermediate layer.
- 3.
Final Recalibration
Finally, the recalibrated output is obtained by multiplying the original feature map (\mathbf{X}) with the scaling factor (\mathbf{s}), channel-wise. The output of the SE block is:
$$\begin{aligned} \mathbf{X}_{\text {out}} = \mathbf{X} \cdot \mathbf{s} \end{aligned}$$
Where:
(\mathbf{X}_{\text{out}}) is the output of the SE block after channel recalibration,
(\mathbf{X}) is the input feature map.
In the context of sleep stage detection, the SE block allows the model to focus on the most informative channels of the input sleep data. For example, certain frequency bands or temporal patterns in the EEG signals are more indicative of specific sleep stages (e.g., high-frequency activity in REM sleep). The SE block enables the model to enhance these important channels and suppress less relevant ones, improving the overall classification accuracy.
By recalibrating the channel-wise importance dynamically, the SE block helps the model better understand the complex temporal dependencies inherent in sleep stages, which are crucial for accurate sleep stage classification.
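A minimal 1D SE block can be sketched as follows; for the 1D signals used here, the squeeze step averages over the temporal dimension rather than over height and width, and the reduction ratio r is an assumed hyperparameter.

```python
# 1D squeeze-and-excitation block sketch, following the squeeze / excitation /
# recalibration steps above. The reduction ratio r is an illustrative assumption.
import torch.nn as nn

class SEBlock1D(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool1d(1)   # global average pooling over time
        self.excite = nn.Sequential(             # two FC layers: ReLU then sigmoid
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (batch, channels, length)
        z = self.squeeze(x).squeeze(-1)          # squeeze:  z in R^C per example
        s = self.excite(z).unsqueeze(-1)         # excitation: channel-wise scaling factors
        return x * s                             # recalibration: X_out = X * s
```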
Residual block
Residual blocks play a crucial role in improving the performance of deep learning models for tasks such as sleep stage detection; their standard structure is shown in Fig. 2. In these models, the input signal, which represents raw sleep data, passes through multiple convolutional layers designed to extract features relevant to different sleep stages. However, as the network depth increases, important low-level features can be lost, and training deep networks can become challenging due to vanishing gradients. Residual blocks address this issue by introducing skip connections, allowing the input to bypass one or more layers and be added directly to the output of those layers. This enables the model to preserve vital temporal information from the input data while learning higher-level, more complex representations of the sleep stages.
Mathematically, the output (\mathbf{y}) of a residual block is expressed as the sum of the input (\mathbf{x}) (which represents the sleep data) and the output (F(\mathbf{x}, \{W_i\})) of a series of convolutional layers, where (W_i) are the weights of the layers in the block. The equation for a residual block in the context of sleep stage detection is:
$$\begin{aligned} \mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x} \end{aligned}$$
Where:
(\mathbf{x}) is the input to the residual block, representing the raw or intermediate features of the sleep data,
(F(\mathbf{x}, \{W_i\})) is the output of the convolutional layers with learned weights (W_i), which represents the transformed features of the sleep data,
(\mathbf{y}) is the final output, which is the sum of the input and the learned residual features.
Fig. 2
The standard architecture of the residual block.
In the context of sleep stage detection, this structure allows the model to learn and retain essential low-level temporal patterns, such as the frequency and amplitude of specific sleep signals, while also capturing complex higher-level patterns relevant to different stages of sleep. The residual connections help mitigate the loss of critical features during the forward pass through multiple layers and enable more accurate classification of sleep stages, such as REM, light, and deep sleep. Moreover, the ability to learn identity mappings improves convergence rates, reduces overfitting, and enhances the interpretability of the model’s predictions.
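The residual computation (\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}) can be sketched for 1D signals as below; the 1x1 projection applied when input and output channel counts differ is an assumption where the text does not specify how such mismatches are handled.

```python
# Residual block sketch for 1D signals, matching y = F(x, {W_i}) + x above.
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.body = nn.Sequential(                       # F(x, {W_i})
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_ch, out_ch, kernel_size=k, padding=k // 2),
            nn.BatchNorm1d(out_ch),
        )
        # Identity skip when shapes match; otherwise a 1x1 projection (an assumed detail)
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))     # y = F(x) + x
```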
Experimental results
Dataset definition
Our experiments utilized three publicly available datasets: Sleep-EDF-20, Sleep-EDF-78, and the Sleep Heart Health Study (SHHS), as summarized in Table 1. Each dataset provided single-channel EEG data for model evaluation. Sleep-EDF-20, comprising data from 20 subjects, and Sleep-EDF-78, an expanded version with 78 subjects, were obtained from PhysioBank31. These datasets include two studies: Sleep Cassette (SC* files), which investigates the effects of aging on sleep among healthy participants aged 25 to 101 years, and Sleep Telemetry (ST* files), which examines the impact of temazepam on sleep in 22 Caucasian individuals without other medications. Both datasets contain two EEG channels (Fpz-Cz and Pz-Oz) recorded at a 100 Hz sampling rate, along with one EOG and one chin EMG channel. Consistent with prior research32,33,34,35,36, we focused on the Sleep Cassette study and used the Fpz-Cz EEG channel as input for the models in our experiments.
The SHHS dataset37,38 is a multi-center cohort study investigating the cardiovascular and other health impacts of sleep-disordered breathing. Participants included individuals with various conditions, such as lung, cardiovascular, and coronary diseases. To reduce the influence of these conditions, we followed the subject selection criteria39, focusing on individuals with regular sleep patterns (e.g., an Apnea Hypopnea Index or AHI below 5). From the initial 6441 subjects, 329 were selected for our experiments, utilizing the C4-A1 channel recorded at a 125 Hz sampling rate. Additional dataset details are provided in Section S.III of our supplementary materials. Preprocessing steps included excluding UNKNOWN stages not associated with any defined sleep stage, merging stages N3 and N4 into a single stage (N3) as per AASM standards, and limiting wake periods to 30 minutes before and after sleep to emphasize sleep stages40.
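A hedged sketch of the label preprocessing described above (dropping UNKNOWN epochs, merging N3 and N4, and trimming wake to 30 minutes on either side of sleep) is shown below; the 30-second epoch length and the string label encoding are assumptions for illustration.

```python
# Sketch of hypnogram cleaning: drop UNKNOWN epochs, merge N3/N4, trim wake periods.
# Epoch length and label encoding are illustrative assumptions.
import numpy as np

EPOCH_SEC = 30
TRIM_EPOCHS = 30 * 60 // EPOCH_SEC            # 30 minutes of wake kept on each side

def clean_hypnogram(labels):
    """labels: sequence of strings, e.g. ['W', 'N1', 'N4', 'UNKNOWN', ...]."""
    labels = np.array([l for l in labels if l != "UNKNOWN"])   # drop undefined epochs
    labels = np.where(labels == "N4", "N3", labels)            # merge N3 and N4 (AASM)
    sleep_idx = np.where(labels != "W")[0]
    if len(sleep_idx) == 0:
        return labels
    start = max(sleep_idx[0] - TRIM_EPOCHS, 0)                 # keep 30 min wake before sleep
    end = min(sleep_idx[-1] + TRIM_EPOCHS + 1, len(labels))    # and 30 min after
    return labels[start:end]
```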
Evaluation metrics
Accuracy
Accuracy measures the proportion of correctly classified instances (both true positives and true negatives) out of the total number of observations. It gives an overall sense of how well the model is performing.
$$\begin{aligned} \text {Accuracy} = \frac{\text {True Positives} + \text {True Negatives}}{\text {Total Observations}} \end{aligned}$$
(1)
Precision
Precision is the ratio of correctly predicted positive instances (true positives) to the total predicted positives (true positives and false positives). It reflects the model’s ability to avoid false positives.
$$\begin{aligned} \text {Precision} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Positives}} \end{aligned}$$
(2)
Recall
Recall, also known as sensitivity, is the ratio of correctly predicted positive instances (true positives) to all actual positives (true positives and false negatives). It measures the model’s ability to detect all relevant instances.
$$\begin{aligned} \text {Recall} = \frac{\text {True Positives}}{\text {True Positives} + \text {False Negatives}} \end{aligned}$$
(3)
F1 score
The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of a model’s performance, especially when there is an uneven class distribution.
$$\begin{aligned} F_1 = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(4)
Cohen’s Kappa
Cohen’s Kappa is a statistical measure used to assess the agreement or consistency between two raters or judges who classify items into mutually exclusive categories. It takes into account the agreement occurring by chance and provides a value between -1 and 1, where 1 indicates perfect agreement, 0 indicates no agreement beyond chance, and negative values suggest less agreement than would be expected by random chance. The formula for Cohen’s Kappa is given by:
$$\begin{aligned} \kappa = \frac{P_o - P_e}{1 - P_e} \end{aligned}$$
where:
(P_o) is the observed agreement (proportion of times the raters agree),
(P_e) is the expected agreement (proportion of times the raters would agree by chance).
To calculate (P_e), you use the marginal probabilities of the categories assigned by each rater.
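As a quick illustration, the metrics defined in Eqs. (1)-(4) and Cohen's kappa can be computed from per-epoch labels with scikit-learn; the macro averaging and the toy labels shown here are assumptions for the multi-class setting.

```python
# Evaluation-metric sketch using scikit-learn on per-epoch sleep stage labels.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

y_true = np.array([0, 1, 2, 2, 3, 4, 0, 2])   # expert-scored stages (toy example)
y_pred = np.array([0, 2, 2, 2, 3, 4, 0, 1])   # model predictions (toy example)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1 score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Kappa    :", cohen_kappa_score(y_true, y_pred))
```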
Model training parameters
Model training for sleep state detection involves carefully selecting and configuring several key parameters to ensure optimal performance. The model processes sequential physiological data to predict sleep stages. Training was conducted on an NVIDIA RTX 3090 GPU, which handles large-scale data and complex model architectures efficiently. The training process used a batch size of 128 and the Adam optimizer, starting with a learning rate of 1e-3, which was reduced to 1e-4 after 10 epochs. The Adam optimizer used a weight decay of 1e-3, beta values of (0.9, 0.999), an epsilon of 1e-8, and AMSGrad enabled. Gradient clipping and learning rate scheduling were also employed to enhance stability and convergence.
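A PyTorch sketch of this training configuration is given below; the StepLR scheduler, the clipping norm of 1.0, the epoch count, and the stand-in model are assumptions that realise the stated learning-rate drop and gradient clipping rather than the exact training script.

```python
# Optimiser/scheduler sketch matching the stated hyperparameters (assumed details noted above).
import torch

model = torch.nn.Conv1d(1, 5, kernel_size=3, padding=1)   # stand-in for the full network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-3,
    amsgrad=True,
)
# Drop the learning rate from 1e-3 to 1e-4 after 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    x = torch.randn(8, 1, 64)                   # dummy batch (batch, channel, length)
    out = model(x)
    loss = out.pow(2).mean()                    # placeholder loss for illustration
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```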
Results and discussion
Pre-processing
The preprocessing approach employed in this research is designed to extract meaningful features and prepare the data for sleep state detection, ensuring data quality and capturing temporal dependencies in physiological signals. The key steps are outlined below:
- 1.
Data Cleaning and Validation
The initial step focuses on data integrity by handling missing values and detecting duplicate entries. The dataset is joined with event data based on the timestamp column to ensure the alignment of event labels with the time-series data. Missing event values are replaced with zeros, and timestamps are converted into a standardized format. Duplicate rows are identified using key features such as anglez, enmo, and time, and flagged using a valid_flag column:
$$\text{valid\_flag} = \begin{cases} 1.0, & \text{if duplicate count} = 1, \\ 0.0, & \text{otherwise.} \end{cases}$$
The proportion of valid rows for each series_id is stored in a dictionary to quantify data quality.
- 2.
Feature Engineering
Temporal features are extracted to enhance the model’s ability to differentiate between sleep states:
Logarithmic Rolling Standard Deviation of Anglez: The rolling standard deviation of the anglez signal is calculated over a 12-time-step window and log-transformed to stabilize variance:
$$\text{log\_anglez\_std} = \log \left( \text{rolling\_std}(\text{anglez}, 12) + 1 \right).$$
Logarithmic ENMO: The enmo feature is log-transformed with a small offset:
$$\text{log\_enmo} = \log (\text{enmo} + 0.01).$$
These rolling features are pivoted to align each time step with its corresponding temporal context. To incorporate dependencies across consecutive days, the feature arrays are concatenated with shifted versions of themselves:
$$\begin{aligned} \text{feature\_array} = \text{concat}(\text{shift}_{-1}, \text{current}, \text{shift}_{+1}), \end{aligned}$$
where (\text {shift}_{-1}) and (\text {shift}_{+1}) represent the preceding and succeeding day’s data, respectively.
- 3.
Temporal Resampling
The dataset is resampled to 1-minute intervals to ensure uniform temporal granularity. Aggregation rules are applied:
$$\begin{aligned} \text{step} &= \text{mean}, \\ \text{event} &= \text{sum}, \\ \text{valid\_flag} &= \text{max}. \end{aligned}$$
This results in a standardized dataset with consistent temporal resolution.
- 4.
Target Computation
The event column is used to compute target labels, capturing the temporal influence of events. A weighted rolling window over 30 minutes is applied in both forward and backward directions, with weights decreasing exponentially:
$$w_j = \exp \left( -\frac{j}{\tau }\right) ,$$
where (\tau = 2.8) is the decay parameter, and (j) is the time step. The target values are computed as:
$$\text {target}_i = \sum _{j=1}^{30} w_j \cdot (\text {event}_{i+j} + \text {event}_{i-j}).$$
This reflects the gradual effect of events on sleep states.
- 5.
Final Preparation
The processed features and resampled data are stored in arrays and dataframes. The feature arrays encapsulate temporal and contextual information, while the 1-minute resampled data provides a structured dataset for model input.
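A compact pandas/NumPy sketch of the rolling features and the exponentially weighted target described above follows; the column names mirror the text, while the toy data and the wrap-around edge handling of np.roll are simplifying assumptions.

```python
# Sketch of the rolling features and exponentially weighted target described above.
# Toy data and boundary handling (np.roll wraps at the edges) are simplifying assumptions.
import numpy as np
import pandas as pd

tau = 2.8                                    # decay parameter from the text
df = pd.DataFrame({
    "anglez": np.random.randn(200),
    "enmo": np.abs(np.random.randn(200)) * 0.05,
    "event": np.zeros(200),
})
df.loc[100, "event"] = 1.0                   # one toy onset/wakeup event

# Feature engineering: log rolling std of anglez (12-step window) and log ENMO
df["log_anglez_std"] = np.log(df["anglez"].rolling(12, min_periods=1).std().fillna(0) + 1)
df["log_enmo"] = np.log(df["enmo"] + 0.01)

# Target computation: weights w_j = exp(-j / tau) applied 30 steps forward and backward
weights = np.exp(-np.arange(1, 31) / tau)
event = df["event"].to_numpy()
target = np.zeros_like(event)
for j, w in enumerate(weights, start=1):
    target += w * (np.roll(event, -j) + np.roll(event, j))   # event_{i+j} + event_{i-j}
df["target"] = target
```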
Results after preprocessing
The experiments were performed on the Fpz-Cz channel of the Sleep-EDF-20 and Sleep-EDF-78 datasets and the C4-A1 channel of the SHHS dataset. Each confusion matrix is obtained by summing the scoring values of the testing data across the 10 folds defined by the dataset splits. Each row represents the number of samples scored by experts, while each column corresponds to the number of epochs predicted by the model. The tables provide the per-class precision (PR), recall (RE), F1 score (F1), and G-mean (GM) values for each class. Table 2 presents a confusion matrix summarizing the classification performance of the model across five classes: W, N1, N2, N3, and REM. The diagonal elements (bolded) represent the correctly classified samples for each class, while off-diagonal elements indicate misclassifications. For example, 7113 instances of class W were correctly classified, whereas 523 were misclassified as N1. The table also includes the per-class metrics precision (PR), recall (RE), F1-score (F1), and geometric mean (GM), expressed as percentages. Class W achieved the highest recall (89.5%), while N1 had the lowest precision (48.2%). The model shows balanced performance for classes N2 and N3, with F1-scores and GMs above 90%.
Table 3 presents the performance of the proposed model on the Sleep-EDF-78 dataset using the Fpz-Cz channel, summarized in a confusion matrix alongside per-class metrics: precision (PR), recall (RE), F1-score (F1), and geometric mean (GM). The rows represent true classes, and the columns represent predicted classes. For instance, 60,234 instances of class W were correctly classified, while 3,721 were misclassified as N1. Class N2 achieved the highest recall (86.2%) and strong overall performance with an F1-score of 85.1%. Class N1 had the lowest precision (44.5%) and F1-score (42.2%), indicating it was the most challenging class to classify. Classes N3 and REM also performed well, with GMs of 89.6% and 86.0%, respectively. The table reflects the model’s robust classification capabilities, particularly for dominant classes like W and N2, while highlighting areas for improvement in distinguishing less prominent classes like N1.
Table 4 summarizes the performance of the proposed model on the SHHS dataset using the C4-A1 channel.