Introduction
Human auditory systems demonstrate remarkable selectivity in complex acoustic environments, a phenomenon prominently illustrated by the cocktail party effect, wherein listeners can concentrate on target speech streams despite the presence of competing voices1,2,3. This ability is often impaired in people with hearing disorders, as conventional hearing aids have difficulty separating speakers in noisy environments4. Recent advancements in neural decoding have unveiled robust cortical activation patterns associated with attentional modulation, thereby providing a neurophysiological foundation for auditory attention detection (AAD) systems aimed at overcoming challenges posed by cocktail party scenarios5.
Electroencephalography (EEG)6 has established itself as the primary modality for monitoring neural dynamics, particularly in comparison to other methods such as electrocorticography7 and magnetoencephalography8,9. EEG has undergone extensive validation regarding its efficacy in applications related to attention-related disorders10. Utilizing non-invasive electrode arrays, EEG captures fluctuations in scalp potentials that form non-linear temporal sequences. Using the Fast Fourier Transform (FFT), these signals are decomposed into five frequency bands11: (\delta) (1–3 Hz), (\theta) (4–7 Hz), (\alpha) (8–13 Hz), (\beta) (14–30 Hz), and (\gamma) (31–50 Hz)12. Each band exhibits unique spatial-topographic distributions corresponding to specific cognitive states, with differential entropy (DE)13 and power spectral density (PSD)14 emerging as particularly effective methodologies for feature extraction.
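For illustration, the band-wise PSD and DE features mentioned above can be computed along the following lines (a minimal sketch using SciPy; the sampling rate and the use of band power as a variance proxy for DE are assumptions, while the band limits follow the definitions above):

```python
import numpy as np
from scipy.signal import welch

# Frequency bands as defined above (Hz)
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def band_features(eeg, fs=128):
    """eeg: (n_channels, n_samples) array. Returns per-band, per-channel PSD and DE features."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs, axis=-1)   # power spectral density per channel
    psd_feat, de_feat = {}, {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs <= hi)
        band_power = psd[:, mask].mean(axis=-1)           # mean PSD within the band
        psd_feat[name] = band_power
        # Differential entropy of a Gaussian signal: 0.5 * ln(2 * pi * e * variance);
        # band power is used here as a variance proxy (a common approximation).
        de_feat[name] = 0.5 * np.log(2 * np.pi * np.e * band_power)
    return psd_feat, de_feat
```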
Modern auditory attention detection (AAD) research focuses on two primary objectives: speaker identity discrimination and spatial attention tracking15. While stimulus-reconstruction paradigms that leverage clean speech references demonstrate potential, their real-world applicability is limited due to the prevalence of overlapping sound sources in natural environments. This limitation necessitates the development of EEG-exclusive frameworks for detecting spatial attention16.
Traditional linear analytical methods often struggle to model non-linear neural interactions, requiring extended temporal windows for reliable inference17. Contemporary approaches employ convolutional neural networks (CNNs) to exploit spectral characteristics18, achieving improved performance through 2D topographic mapping of discriminative features. To address low signal-to-noise-ratio scenarios, Ref.19 draws on a multi-scale convolutional encoder-decoder structure that uses spatiotemporal convolutional networks (S-TCN) as bottlenecks to model long-term dependencies. To fully utilize multi-scale contextual information, Ref.20 proposed the TFADCSU-Net model, which has a built-in multi-scale feature extraction layer (MSDEL) that effectively captures global and local speech features. Ref.21 introduced a time-frequency attention (TFA) module after each multi-scale convolution block, which dynamically assigns weights to different time-frequency spectral components and enables the model to focus on key information. Ref.22 employed both local and global attention networks to jointly model speech signals, extracting useful information more comprehensively and outperforming a single self-attention network. Ref.23 proposed a novel approach utilizing microstate and recurrence quantification analysis features combined with a hybrid GRU-CNN architecture for AAD, demonstrating strong performance without requiring access to the auditory stimuli. Ref.24 proposed AADNet, an end-to-end architecture that directly maps EEG to attention state, demonstrating significantly improved generalization to unseen subjects. However, these methods typically overlook the temporal evolution of EEG patterns. In contrast, attention-based temporal models effectively capture dynamic variations but frequently neglect essential spectral-spatial correlations. This methodological dichotomy highlights the need for hybrid architectures that synergistically integrate temporal-spectral features through multimodal fusion, a largely unexplored frontier in AAD research.
The primary objectives of this study are threefold: First, to design a novel neural architecture that seamlessly integrates spatial-temporal filtering, dynamic multi-scale feature fusion, and efficient cross-channel attention into a unified framework for AAD. Second, to validate that this hybrid approach effectively captures the complex neural patterns of auditory attention that are often overlooked by methods focusing on isolated feature domains. Third, to demonstrate that the proposed model achieves superior decoding performance, particularly under the challenging condition of short decision windows, while simultaneously maintaining a parameter-efficient structure suitable for potential real-time applications. Based on these objectives, we formulate the following hypotheses:
1. A network that explicitly models the interplay between spatial, temporal, and cross-channel features will yield significantly higher AAD decoding accuracy compared to state-of-the-art models that do not integrate these aspects jointly.
2. The incorporation of a dynamic multi-scale fusion mechanism will enable the model to robustly handle EEG patterns across varying temporal resolutions, leading to notably improved performance in short decision windows (e.g., 0.1 s).
3. The proposed efficient cross-channel attention mechanism will enhance feature discriminability without incurring substantial computational overhead, resulting in a model that is both more accurate and more parameter-efficient than existing benchmarks.
The experimental design and evaluations presented in this paper are structured to rigorously test these hypotheses. The major contributions of this paper are outlined as follows:
1. We introduce a novel network architecture for auditory attention detection that comprises a spatial-temporal extraction module, a multi-scale adaptive fusion module, and a cross-channel attention module. The network effectively leverages multi-scale features as well as inter-channel correlations to decode EEG data.
2. The results indicate that our network achieves remarkable decoding accuracy within very short decision windows, surpassing existing state-of-the-art (SOTA) models by 2.5 points on the DTU dataset and 1.1 points on the KUL dataset under a 0.1-second decision window. Additionally, compared with the recent DBPNet model, our model has nearly 50% fewer parameters, which significantly improves inference efficiency.
The remainder of this paper is organized as follows. Section 2 provides a concise introduction to the proposed methodology. Section 3 introduces the dataset processing methods and model training details. In Section 4, we conduct comparative analyses between our network architecture and existing approaches, while empirically validating the efficacy of various constituent modules. Ablation studies are systematically presented in Section 5 to quantify individual component contributions. Finally, Section 6 concludes the paper with a comprehensive summary of findings.
Proposed approach
Existing methodologies in EEG-based auditory attention detection (AAD) have predominantly focused on the isolated analysis of either temporal or spectral characteristics, often overlooking the critical interplay between multi-scale features and inter-channel correlations in neural recordings. To address this limitation, we propose a novel channel-attention neural architecture (Fig. 1) comprising three synergistic components: a spatial-temporal extraction module, which captures spatial and temporal patterns from multi-channel EEG inputs; a dynamic multi-scale fusion module, which processes information across multiple temporal resolutions by utilizing adaptive convolutional kernels and weighted feature integration; and a cross-channel attention mechanism, which enhances local feature interactions while simultaneously establishing global dependencies through parallel attention branches.
Following standard preprocessing protocols, EEG signals are segmented into consecutive decision windows, each represented as a matrix (R = [r_1, \ldots , r_i, \ldots , r_T] \in \mathbb {R}^{N \times T}), where N denotes the number of electrode channels and T represents the number of temporal samples per window. Each temporal slice (r_i \in \mathbb {R}^{N}) corresponds to the multi-channel neural measurements at the i-th sample position, preserving both the spatial distribution and temporal evolution characteristics. The overall structure of the proposed network is illustrated in Fig. 1.
Fig. 1
Architecture of the proposed Hybrid Channel Attention Network for auditory attention detection. The network comprises three main modules: Spatial-Temporal Feature Extraction Module: Processes multi-channel EEG inputs to extract spatial and temporal features. Multi-Scale Adaptive Fusion Module (MSAFM): Integrates features across multiple temporal resolutions using adaptive convolutional kernels. Cross-Channel Attention Module (CCAM): Enhances local and global feature interactions through a partitioned self-attention mechanism. The model inputs are common spatial patterns (CSP) extracted from EEG signals, and the outputs are two predicted labels related to auditory attention.
Following previous methods25, we also employ the Common Spatial Pattern (CSP) technique for feature extraction to enhance the signal-to-noise ratio of the raw EEG signals26,27. Similar to28, and to avoid feature leakage, CSP feature extraction is performed only after dividing the data into training and testing sets. The corresponding formula is as follows:
$$\begin{aligned} \begin{aligned} F = CSP(EEG) \end{aligned} \end{aligned}$$
(1)
where (CSP( \cdot )) denotes the CSP operation and F denotes the features extracted from the EEG data.
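As a sketch of this step, CSP filters can be fitted on the training split only and then applied to both splits, e.g. using MNE's CSP implementation (the variable names, the split shown here, and the number of CSP components are illustrative assumptions; X and y denote the preprocessed decision windows and attention labels):

```python
import numpy as np
from mne.decoding import CSP

# X: (n_windows, n_channels, n_samples) EEG decision windows, y: attention labels (0/1)
# Split FIRST, then fit CSP on the training portion only, to avoid feature leakage.
n_train = int(0.8 * len(X))
X_train, y_train = X[:n_train], y[:n_train]
X_test = X[n_train:]

csp = CSP(n_components=16, transform_into="csp_space")  # keep full time courses
csp.fit(X_train, y_train)

F_train = csp.transform(X_train)   # (n_train, n_components, n_samples)
F_test = csp.transform(X_test)     # CSP filters estimated from training data only
```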
Spatial-temporal feature extraction module
EEG captures dynamic, time-varying electrical activity from neuronal cells, revealing functional patterns and interregional connectivity within the brain29. The analysis of auditory-evoked neural responses can be enhanced by integrating both the temporal and spatial attributes derived from these signals. However, prior research has disproportionately emphasized localized temporal dynamics in EEG datasets, often neglecting the spatial distribution characteristics. To address this limitation, we implemented a spatial filter before traditional temporal filters, enabling the synthesis of spatiotemporally enriched EEG representations.
Our module separately processes spatial information (across different EEG channels) and temporal information (changes over time). Attention mechanisms are embedded in both the spatial and temporal components, enabling dynamic calibration of the corresponding feature representations. The overall structure of the spatial-temporal feature extraction module is illustrated in Fig. 1.
Spatial convolution
The spatial convolution is employed to extract spatial features from the input EEG signals, explicitly modeling nonlinear relationships between variables. The spatial convolution consists of three main components: multi-level channel expansion, cross-variable aggregation, and the channel attention mechanism.
First, a (1 \times 1) convolution is applied to expand the number of channels by four times, enhancing the feature representation capability of the input EEG data. Subsequently, convolution is performed along the channel-wise dimension to achieve cross-variable feature fusion for multivariate time series. Finally, a channel attention module is introduced to adaptively calibrate channel importance, thereby enhancing the response of key features.
The process is described by the following formula:
$$\begin{aligned} \begin{aligned} {F_{f}} = GELU(Conv2d(GELU(Conv2d(Input)))) \end{aligned} \end{aligned}$$
(2)
Here, Conv2d represents the convolution operation, and GELU refers to the activation function.
The channel attention module combines information from all spatial locations to determine the importance of each EEG channel. By implementing a bottleneck structure with a reduction ratio of 16, the computational workload is reduced while maintaining performance and minimizing the number of parameters. The Sigmoid function outputs channel weights in the range of 0 to 1, enabling the soft selection of feature channels.
The process is described by the following formula:
$$\begin{aligned} {F_{s}} = {F_{f}} \times Sigmoid(Conv2d(Conv2d(GAP(F_{f})))) \in {R^{32 \times 64 \times 1 \times 128}} \end{aligned}$$
(3)
where Conv2d denotes the convolution operation and GAP denotes global (adaptive) average pooling.
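A minimal PyTorch sketch of this spatial block, following Eqs. (2) and (3), might look as follows (the input channel count and the 1×1 kernel used for the cross-variable convolution are assumptions; the output shape matches the 32 × 64 × 1 × 128 dimensions reported above):

```python
import torch
import torch.nn as nn

class SpatialConvBlock(nn.Module):
    """Channel expansion + cross-variable aggregation + channel attention (Eqs. 2-3)."""
    def __init__(self, in_ch=16, out_ch=64, reduction=16):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, 4 * in_ch, kernel_size=1)      # 1x1 expansion (x4)
        self.aggregate = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)  # cross-variable fusion
        self.act = nn.GELU()
        # Bottleneck channel attention with reduction ratio 16
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.att = nn.Sequential(
            nn.Conv2d(out_ch, out_ch // reduction, kernel_size=1),
            nn.Conv2d(out_ch // reduction, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, in_ch, 1, T)
        f = self.act(self.aggregate(self.act(self.expand(x))))   # Eq. (2)
        w = self.att(self.gap(f))              # per-channel weights in (0, 1)
        return f * w                           # Eq. (3): soft channel selection
```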
Temporal convolution
The temporal convolution is employed to balance local and global features. We adopted depthwise separable convolution, which is decomposed into depthwise convolution and pointwise convolution. This approach significantly reduces the number of parameters while preserving the receptive field. The process is described by the following formula:
$$\begin{aligned} {F_{t}} = GELU(Conv2d(Conv2d(F_{s}))) \in {R^{32 \times 64 \times 1 \times 128}} \end{aligned}$$
(4)
Here, (F_{s}) denotes the spatial features and Conv2d denotes the convolution operations (depthwise followed by pointwise).
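For illustration, the depthwise separable design can be sketched as follows (channel count and temporal kernel length are assumptions consistent with the shapes above):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise temporal convolution followed by a pointwise (1x1) convolution (Eq. 4)."""
    def __init__(self, channels=64, kernel_t=7):
        super().__init__()
        # Depthwise: one temporal filter per channel (groups == channels)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=(1, kernel_t),
                                   padding=(0, kernel_t // 2), groups=channels)
        # Pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):                  # x: (B, 64, 1, T)
        return self.act(self.pointwise(self.depthwise(x)))
```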
In addition, an attention mechanism has been incorporated. This mechanism generates attention weights along the time dimension through convolution, allowing the model to highlight key time-step features. Specifically, the attention mechanism produces attention maps along the timeline and adaptively learns the importance of different time points.
To reduce computational overhead, two convolutions are applied to compress the number of channels to 1. The operation x.mean(dim=3) is used to aggregate information across the time dimension, effectively capturing global temporal dependencies. The process is described by the following formula:
$$\begin{aligned} {F_{st}} = {F_{t}} \times Sigmoid(Conv2d(GELU(Conv2d(F_{t})))) \in {R^{32 \times 64 \times 1 \times 128}} \end{aligned}$$
(5)
Here, (F_{st}) means the spatial-temporal features.
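The temporal attention of Eq. (5) can be sketched as follows (illustrative only; the intermediate channel size is an assumption, and the global aggregation via x.mean(dim=3) described above is omitted for brevity):

```python
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Generates per-time-step weights and rescales the temporal features (Eq. 5)."""
    def __init__(self, channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 16, kernel_size=1)
        self.act = nn.GELU()
        self.expand = nn.Conv2d(channels // 16, 1, kernel_size=1)  # compress channels to 1
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                          # x: (B, 64, 1, T)
        a = self.sigmoid(self.expand(self.act(self.reduce(x))))    # (B, 1, 1, T) time weights
        return x * a                                               # highlight key time steps
```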
After the spatial-temporal features are fused and projected, the final output ({F_{st}}' \in {R^{32 \times 128 \times 16}}) is obtained by adding the position information.
Multi-scale adaptive fusion module
Figure 1 illustrates the structure of the proposed module. To effectively fuse long-term and local features, this module adopts three parallel branches, each employing a different convolution kernel for feature extraction. The three branches employ kernels of different sizes ((k_{1}), (k_{2}), (k_{3})), where the size of each kernel, (k_{i}), is determined by the following formula:
$$\begin{aligned} k_i = \max \left( 3, \left[ \frac{\log _2(C) + b \pm \Delta _b}{\gamma (1 \pm \alpha )} \right] \right) \end{aligned}$$
(6)
Where C represents the number of channels of the input signal, (\Delta _b) is the path-specific offset (e.g., (\Delta _b = 1) for the contraction path), and (\alpha) is the perturbation ratio (this module employs (\alpha = 0.2)). The kernel size of each branch is as follows:
$$\begin{aligned} \left\{ \begin{array}{ll} Branch_1 = k_1(\alpha = -0.2, \Delta _b = +1)\\ Branch_2 = k_2(\alpha = 0, \Delta _b = 0)\\ Branch_3 = k_3(\alpha = +0.2, \Delta _b = -1) \end{array} \right. \end{aligned}$$
(7)
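A small sketch of this kernel-size rule (the values of b and (\gamma), and the rounding to odd sizes, are assumptions, since they are not specified above):

```python
import math

def kernel_size(C, alpha, delta_b, b=1.0, gamma=2.0):
    """Channel-adaptive kernel size, per Eq. (6). b and gamma are assumed defaults."""
    k = round((math.log2(C) + b + delta_b) / (gamma * (1 + alpha)))
    k = max(3, k)
    return k if k % 2 == 1 else k + 1   # keep kernels odd so padding stays symmetric

# Three branches as in Eq. (7), for C = 64 channels:
branches = [kernel_size(64, -0.2, +1), kernel_size(64, 0.0, 0), kernel_size(64, +0.2, -1)]
```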
The extracted global contextual features and local features are fused using adaptive weights to enhance the dynamics of the receptive field. This module employs learnable relative weights, (w_{i}), to fuse the features from the three branches. This approach dynamically balances the feature contributions from different branches while automatically reinforcing the most important paths. The fusion method for the weights is provided in the following formula:
$$\begin{aligned} weights = Softmax \left( \frac{[w_1, w_2, w_3]}{\tau } \right) \end{aligned}$$
(8)
$$\begin{aligned} O = \sum \limits _{i = 1}^{3} weight_i \cdot y_i \end{aligned}$$
(9)
Where O represents the output of the multi-scale adaptive fusion module, (y_i) is the output of the i-th branch, and (\tau) is a learnable temperature parameter that automatically balances the specificity and robustness of the features.
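For illustration, the adaptive fusion of Eqs. (8) and (9) can be written as follows (a sketch; the branch kernel sizes and the parameterization of (\tau) are assumptions):

```python
import torch
import torch.nn as nn

class MultiScaleAdaptiveFusion(nn.Module):
    """Three parallel temporal convolutions fused with learnable softmax weights (Eqs. 8-9)."""
    def __init__(self, channels=64, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2))
            for k in kernels
        ])
        self.w = nn.Parameter(torch.zeros(len(kernels)))   # relative branch weights w_i
        self.log_tau = nn.Parameter(torch.zeros(1))        # tau kept positive via exp()

    def forward(self, x):                                  # x: (B, C, 1, T)
        ys = [branch(x) for branch in self.branches]       # y_i per branch
        tau = self.log_tau.exp()
        weights = torch.softmax(self.w / tau, dim=0)       # Eq. (8)
        return sum(w * y for w, y in zip(weights, ys))     # Eq. (9)
```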
Cross-channels attention module
The module structure is shown in Fig. 1. Instead of relying on single-sequence global attention, the module adopts a combination of local attention mechanisms. The input sequence is divided into four segments, and the correlations between each segment are calculated separately. A dynamic QKV (Query, Key, Value) mechanism is employed to extract the correlations between these segments, enhancing the local interaction of features, achieving cross-segment information fusion, and improving the model’s representational capacity.
The specific process is as follows: Given an input sequence (X \in \mathbb {R}^{B \times C \times T}), it is evenly divided into four sub-segments: (X \rightarrow [{X_1},{X_2},{X_3},{X_4}], \; {X_i} \in {R^{B \times C/4 \times T}}).
From these four sub-segments, combinations of three are selected to participate in the calculation. There are a total of (C(4,3) = 4) possible combinations. The attention calculation format for each group is as follows:
$$\begin{aligned} Attention(Q = X_{a}, K = X_{b}, V = X_{c}) = Softmax \left( \frac{Q W^{Q} (K W^{K})^{T}}{\sqrt{d}} \right) V W^{V} \end{aligned}$$
(10)
Among them, ((a, b, c)) is defined as follows:
$$\begin{aligned} (a,b,c) \in \left\{ (1,2,3),(1,2,4),(2,3,4),(1,3,4) \right\} \end{aligned}$$
(11)
Simultaneously employing block combination attention effectively reduces computational complexity. The module integrates the outputs of four multi-head attention mechanisms. The specific formula for this integration is as follows:
$$\begin{aligned} \begin{aligned} F = LayerNorm(\mathop \oplus \limits _{1 \le a< b < c \le 4} Attention({X_a},{X_b},{X_c})) \end{aligned} \end{aligned}$$
(12)
Where (\oplus) represents the feature concatenation operation. The extracted features are subjected to convolution and pooling operations to achieve resolution reduction and feature enhancement of sequence data.
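A condensed sketch of this partitioned attention is given below (illustrative only; the number of heads, the shared projection weights across groups, and the input layout are assumptions):

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Splits channels into 4 segments and attends over the 4 three-segment groups (Eqs. 10-12)."""
    def __init__(self, channels=64, heads=2):
        super().__init__()
        d = channels // 4                      # channel size of each segment
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.combos = [(0, 1, 2), (0, 1, 3), (1, 2, 3), (0, 2, 3)]   # (a, b, c) groups

    def forward(self, x):                      # x: (B, C, T)
        segs = torch.chunk(x, 4, dim=1)        # X -> [X1, X2, X3, X4], each (B, C/4, T)
        outs = []
        for a, b, c in self.combos:
            q, k, v = (segs[i].transpose(1, 2) for i in (a, b, c))   # (B, T, C/4)
            o, _ = self.attn(q, k, v)          # Eq. (10) for one (a, b, c) group
            outs.append(o)
        f = torch.cat(outs, dim=-1)            # Eq. (12): concatenate the 4 group outputs
        return self.norm(f).transpose(1, 2)    # back to (B, C, T)
```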
Experiments
Dataset
This section evaluates the performance of our proposed network on two benchmark datasets widely utilized in auditory attention detection research: KUL30,31 and DTU32,33. Both the KUL and DTU datasets exclusively consist of EEG recordings collected from audio-only experimental paradigms. Key characteristics of these datasets are systematically compared in Table 1.
KUL dataset
Neural recordings were acquired using a BioSemi ActiveTwo system from 16 healthy participants in an acoustically controlled environment. The experimental protocol involved dichotic listening tasks, wherein subjects selectively attended to one of two competing narratives delivered via in-ear headphones at 60 dB. Four Flemish-language stories narrated by male speakers were presented under two spatialization conditions: conventional dichotic playback with separate ear assignments, and HRTF-processed simulations positioning sound sources at 90° lateral angles. Each participant completed eight 6-minute trials. EEG signals were recorded using a 64-channel setup at a sampling rate of 8,192 Hz, while auditory content was bandwidth-limited to 4 kHz.
DTU dataset
This dataset also employed the BioSemi ActiveTwo system but with a reduced sampling rate of 512 Hz. EEG recordings were collected from 18 normoacoustic participants performing auditory attention selection tasks. Participants attended to target speech signals spatially separated at 60° azimuth angles, presented concurrently with distractor narratives through ER-2 insert earphones at 60 dB SPL. The stimulus set consisted of Danish audiobook excerpts narrated by six speakers (3 male and 3 female). Each participant completed 60 experimental trials, each lasting 50 seconds.
Data processing
To ensure equitable performance comparisons of our method, standardized preprocessing methods were implemented across the two benchmark datasets (KUL dataset and DTU dataset), tailored to their respective acquisition characteristics.
Preprocessing procedures differed between the two datasets. Processing of the KUL dataset commenced with re-referencing to the mastoids, followed by application of a 0.1–50 Hz bandpass filter, and concluded with down-sampling to 128 Hz. In contrast, the DTU data were initially filtered to suppress 50 Hz power line noise and harmonics. Ocular artifact suppression was subsequently performed through joint decorrelation, prior to re-referencing and final down-sampling to 64 Hz. Thus, each subject had 46080 points × 8 trials = 368640 points for the KUL dataset and 3500 points × 60 trials = 210000 points for the DTU dataset. The data segments were then obtained with a sliding decision window of length t1 and an overlap of 50%.
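The windowing step can be illustrated as follows (a sketch; the window length in samples depends on the dataset's sampling rate, e.g. 128 samples for a 1 s KUL window):

```python
import numpy as np

def sliding_windows(eeg, win_len, overlap=0.5):
    """Segment a (n_channels, n_samples) trial into decision windows with 50% overlap."""
    step = int(win_len * (1 - overlap))
    starts = range(0, eeg.shape[1] - win_len + 1, step)
    return np.stack([eeg[:, s:s + win_len] for s in starts])   # (n_windows, n_channels, win_len)

# Example: 1-second windows on a KUL trial sampled at 128 Hz
# windows = sliding_windows(trial_eeg, win_len=128, overlap=0.5)
```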
To assess performance, our method was rigorously benchmarked against several established auditory attention detection architectures: SSF-CNN34, MBSSFCC18, DBPNet25, and DARNet28. The evaluation employed standardized temporal resolutions using decision windows of 0.1 s, 1 s, and 2 s, enabling a multi-scale analysis framework. This framework provides a comprehensive characterization of temporal sensitivity across competing models.
Implementation details
In auditory attention detection (AAD) studies, classification accuracy has been widely recognized as the primary metric for evaluating model performance. Consistent with this established practice, we assessed our network framework employing two major standard datasets: the KUL dataset and the DTU dataset. The implementation methodology is exemplified through the KUL dataset, utilizing 1-second decision windows and detailing both training protocols and network architecture specifications.
The dataset was partitioned into training, validation, and test subsets at an 8:1:1 ratio, resulting in 4,600, 576, and 576 decision windows per subject, respectively. Optimization was carried out using the Adam algorithm, with a learning rate of 5e-4 and a weight decay of 3e-4, configured with a batch size of 32 and a maximum training duration of 80 epochs. An early stopping criterion was employed to halt training when the validation loss plateaued for 8 consecutive epochs. All implementations were executed in PyTorch.
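These settings correspond roughly to the following PyTorch configuration (a sketch; HybridChannelAttentionNet, the data loaders, and the evaluate helper are placeholders, and the early-stopping logic simply mirrors the description above):

```python
import torch

model = HybridChannelAttentionNet()            # placeholder for the proposed network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=3e-4)
criterion = torch.nn.CrossEntropyLoss()

best_val, patience, wait = float("inf"), 8, 0
for epoch in range(80):
    model.train()
    for x, y in train_loader:                  # batches of 32 decision windows
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)     # assumed helper returning mean validation loss
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                   # stop after 8 epochs without improvement
            break
```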
EEG signals underwent preprocessing via the Common Spatial Patterns (CSP) algorithm to extract initial features. To avoid feature leakage, CSP feature extraction is performed after dividing the data into training and testing sets. The spatial-temporal feature extraction module then extracts features from the spatial and temporal domains. Following processing by the spatial module, the EEG data is represented as (X \in {R^{32 \times 64 \times 1 \times 128}}); after processing through the temporal module, the data remains represented as (X \in {R^{32 \times 64 \times 1 \times 128}}); after projection and fusion, the output is represented as (X \in {R^{32 \times 128 \times 16}}). After being processed through the multi-scale fusion module and the cross-channel attention module, the data dimensions remain unchanged. The features are then concatenated into a fused feature (X \in {R^4}), which is passed through a final fully connected classification layer (input: 4, output: 2) to yield the auditory attention prediction.
Results
Performance of our network
Table 2 Comparative classification accuracy (%) of our method against state-of-the-art models on the DTU and KUL datasets under three decision window lengths.
To assess the efficacy of our network, we conducted extensive evaluations across three temporal resolutions (0.1s, 1s, and 2s), benchmarking our results against existing state-of-the-art models as detailed in Table 2, utilizing literature-reported metrics. Our framework demonstrates superior performance across all tested datasets (KUL dataset and DTU dataset), establishing new benchmarks in auditory attention detection.
For the KUL dataset, our network achieves mean classification accuracies of 90.3% (SD = 3.84%), 95.8% (SD = 4.17%), and 96.3% (SD = 3.56%) for the 0.1 s, 1 s, and 2 s windows, respectively. On the DTU dataset, we observe a similar trend, with accuracies of 77.1% (SD = 5.35%), 84.1% (SD = 4.76%), and 85.6% (SD = 4.49%) across the progressively longer windows.
These findings reveal two critical patterns: First, decoding accuracy shows a positive correlation with window length (from 0.1s to 2s), supporting previous studies36 that attribute this trend to enhanced contextual information and outlier mitigation in extended temporal segments. Second, the incorporation of multi-scale features and inter-channel correlations significantly improves detection accuracy.
Building upon preliminary analyses, we note that the classification performance on the DTU dataset exhibits an 11% reduction compared to the KUL dataset, a pattern consistently reported in prior research. Three key factors contribute to this divergence: the spatial configuration of acoustic inputs, where KUL stimuli are strictly positioned at 90° left/right angles, while DTU recordings are captured within a narrower 60° lateral range; the variable room reverberation levels present in the DTU dataset, contrasting with the reverberation-free conditions of the KUL dataset; and the fact that the KUL dataset exclusively employs male vocal samples, whereas the DTU dataset integrates both male and female speakers, introducing potential gender-related variability.
Notably, our network maintains exceptional baseline performance (77.1%) even under the challenging 0.1 s window constraint in the DTU dataset, demonstrating remarkable temporal sensitivity that surpasses that of comparative models.
Fig. 2
Subject-wise classification accuracy across decision window lengths for the DTU (top) and KUL (bottom) datasets.
Furthermore, we present the experimental results for different decision windows for each subject across the two datasets, as illustrated in Fig. 2. It is evident that the decoding accuracy on the KUL dataset surpasses that of the DTU dataset. Overall, as the decision window lengthens, the decoding accuracy tends to increase. However, in certain individual cases, such as Subject 15 in the DTU dataset and Subjects 9 and 10 in the KUL dataset, the decoding success rate remains unchanged despite the extended decision window. This phenomenon may be attributed to overfitting during the training process; thus, even with a longer decision window, the decoding accuracy did not improve.
Statistical significance analysis
To quantitatively assess the superiority of our proposed model beyond average performance metrics, we performed statistical significance testing.
We conducted pairwise paired t-tests to evaluate the effect of decision window length on the performance of our proposed model on the DTU and KUL datasets. Table 3 shows that the classification accuracy significantly increased from the 0.1 s window (77.06 ± 5.35%) to the 1 s window (84.11 ± 4.76%) and from the 0.1 s to the 2 s window (85.61 ± 4.49%) on the DTU dataset. A smaller but still significant improvement was observed between the 1 s and 2 s windows. These results demonstrate that while the most substantial performance gains occur when increasing the decision window from 0.1 s to 1 s, further extending the window to 2 s continues to provide statistically significant improvements, albeit with a smaller effect size. For the KUL dataset, classification accuracy showed significant improvements from the 0.1 s window (90.31 ± 3.84%) to the 1 s window (95.81 ± 4.17%) and from the 0.1 s to the 2 s window (96.31 ± 3.56%). However, no significant difference was observed between the 1 s and 2 s windows. These results indicate that while extending the decision window from 0.1 s to 1 s leads to substantial performance gains, further extension to 2 s does not provide additional significant improvement, suggesting a performance plateau at around 1 second on the KUL dataset.
Furthermore, to provide rigorous statistical evidence for our performance claims, we conducted paired-sample t-tests comparing the per-subject accuracy of our model against the MBSSFCC baseline on the DTU dataset. Among the compared methods, only MBSSFCC reported per-subject raw data, so we chose it as the baseline. The results are summarized in Table 4.
The analysis reveals that the improvements offered by our model are not only substantial in magnitude but also highly statistically significant. For the critical 0.1-second decision window, our model's improvement of more than 10 percentage points (10.82%) is significant with (p = 1.13\times 10^{-6}), indicating an extremely low probability that this result occurred by chance. This trend of significant improvement holds for longer windows as well: the 6.33% improvement at the 1-second window is also significant with (p = 2\times 10^{-4}), and the 4.95% improvement at the 2-second window remains statistically significant with (p = 1.3\times 10^{-2}).
Furthermore, we report the 95% confidence intervals for our model's mean accuracy. The intervals, calculated as mean ± (t-critical × standard error), provide a range within which we can be 95% confident the true population mean lies. The narrowness of these intervals, for example [74.44, 79.76] for the 0.1 s window, underscores the precision and reliability of our estimate.
For the KUL dataset, the statistical analysis revealed a trend consistent with the DTU dataset but with even higher levels of significance. For the 0.1-second window, our method demonstrated a 9.14% improvement over the MBSSFCC baseline, with this enhancement being highly statistically significant ((p = 4.37\times 10^{-6})). The improvements observed for both the 1-second and 2-second windows were also statistically significant (p < 0.001). It is noteworthy that on the KUL dataset, even the improvement at the 0.1-second window reached a high level of statistical significance, which aligns with but is more pronounced than the trend observed on the DTU dataset.
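The statistical comparisons reported here can be reproduced with standard tools along the following lines (a sketch; the per-subject accuracy arrays are hypothetical placeholders):

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject accuracies (%) for one window length on one dataset
rng = np.random.default_rng(0)
ours = 77 + 5 * rng.standard_normal(18)       # placeholder for the proposed model's accuracies
baseline = 66 + 6 * rng.standard_normal(18)   # placeholder for the MBSSFCC accuracies

# Paired t-test on per-subject differences
t_stat, p_value = stats.ttest_rel(ours, baseline)

# 95% confidence interval for the mean accuracy of the proposed model
mean = ours.mean()
sem = stats.sem(ours)
ci_low, ci_high = stats.t.interval(0.95, df=len(ours) - 1, loc=mean, scale=sem)
```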
In conclusion, the statistical analysis provides overwhelming evidence that our proposed hybrid channel attention network delivers a statistically significant and practically meaningful performance enhancement over the strong MBSSFCC baseline across all tested temporal resolutions.
Ablation study
Component contribution analysis
This section evaluates the effects of different modules within the network, specifically the spatial-temporal feature extraction module, the multi-scale adaptive fusion module, and the cross-channel attention module. We conducted an ablation study on two datasets. The experimental results demonstrate that each proposed module enhances decoding accuracy.
As shown in Table 5, the removal of any module significantly reduces the model's decoding accuracy. On the DTU dataset, the accuracy decreases by 1.5% (0.1 s), 2.8% (1 s), and 1.2% (2 s) after removing different modules under the same decision window. Similarly, on the KUL dataset, the accuracy decreases by 5.2% (0.1 s), 6.4% (1 s), and 3.1% (2 s). The numerical comparison indicates that the multi-scale module contributes the most among the three modules. Furthermore, the results confirm that accuracy tends to increase as the decision window length increases.
Module configuration
This section examines the impact of different model configurations on the final results, including different convolution kernel sizes, different numbers of branches, and different attention mechanisms.
From Table 6, it can be seen that we compare four different convolution kernel configurations, which fall into two categories: fixed convolution kernels and channel-based adaptive convolution kernels. To fully extract multi-scale features from the EEG signals, different combinations of fixed convolution kernels were tested, namely {3, 5, 7}, {5, 7, 9}, and {3, 5, 9}. Among these, the combination {3, 5, 7} achieved the best results, although the numerical differences between the fixed combinations are small. Our channel-adaptive approach improves performance by a further 0.6% over the best fixed combination, effectively demonstrating the superiority of the channel-adaptive convolution kernel approach.
We also tested the number of branches used for feature fusion. In the Proposed Approach section, we employed three branches for fusion; in this section, we compare the performance of two-branch and three-branch fusion. From Table 7, we can see that the parameter settings for the two-branch fusion are the same as in our proposed MSAFM. Among the two-branch configurations, the combination Branch2+Branch3 achieves the best accuracy, but it is still lower than that of the three-branch combination.
Finally, we tested different channel attention mechanisms. We compared our proposed cross-channel attention mechanism with the commonly used multi-head attention mechanism, and the experimental results in Table 8 show that our proposed mechanism achieves higher accuracy.
Discussion
Comparative analysis
To comprehensively evaluate the decoding effectiveness of our network, we compared it with several recent methods. Experimental results indicate that our model exhibits improved decoding accuracy relative to these algorithms. As shown in Table 2, on the DTU dataset, the proposed model achieves decoding accuracies of 77.1% (SD = 5.35%) under the 0.1-second decision window, 84.1% (SD = 4.76%) under the 1-second window, and 85.6% (SD = 4.49%) under the 2-second window. For the KUL dataset, the model's accuracies are 90.3% (SD = 3.84%) for the 0.1-second window, 95.8% (SD = 4.17%) for the 1-second window, and 96.3% (SD = 3.56%) for the 2-second window.
It can also be seen from Table 2 that, compared with other recent models (SSF-CNN, MBSSFCC, DBPNet, and DARNet), the decoding accuracy on the DTU dataset is improved by 14.6%, 10.2%, 3.1%, and 2.5%, respectively, across the decision windows. On the KUL dataset, the decoding accuracy is improved by 14.0%, 11.3%, 5.0%, and 1.1%, respectively. The longer the decision window, the higher the detection accuracy, because a longer decision window provides more information for the model to make judgments while also mitigating the impact of individual outliers on the prediction. The consistent improvement of our model across different decision windows further demonstrates the effectiveness of the proposed network.
Error analysis
The auditory input itself has periods of low energy or high ambiguity (e.g., pauses, overlapping speech, plosive sounds). During these segments, the auditory-evoked neural responses are inherently weaker and noisier. Consequently, the EEG correlates of attention become less distinct, leading to a higher probability of misclassification. This effect is likely more pronounced in the DTU dataset, which contains room reverberation, explaining its overall lower performance compared to the anechoic KUL recordings.

Subject-Specific Variability and Overfitting: The subject-wise results in Fig. 2 reveal that for certain individuals (e.g., Subject 15 in DTU, Subjects 9 and 10 in KUL), performance did not improve with longer decision windows. This suggests that the model might have overfitted to the specific neural patterns of the majority of subjects in the training set and failed to generalize to those with atypical neurophysiological responses. Factors such as individual anatomical differences, varying cognitive strategies for selective attention, or even suboptimal electrode contact for certain subjects could contribute to this variability.

Limitations of the Pure EEG Paradigm: Our model, like all current AAD methods, relies solely on EEG correlates. In extremely challenging acoustic scenarios where the brain itself struggles to segregate speakers (e.g., same-gender, spatially close speakers with similar vocal characteristics), the attentional modulation in the EEG might be too subtle to decode reliably. In these "failure cases" for the human brain, our model is also expected to fail.
In summary, despite the overall high performance on these two datasets, an analysis of misclassified trials suggests common failure cases. Segments of the auditory stimulus with a low signal-to-noise ratio (e.g., pauses, overlapping speech) provide weak neural cues that are challenging for the model to decode. This is particularly relevant for the DTU dataset with its reverberant environment.