Introduction
Artificial Neural Networks (ANNs) have been at the forefront of advances in machine learning and artificial intelligence, leading to breakthroughs in tasks such as image classification1, speech recognition2, and recommendation systems3. ANNs were originally inspired by the structure and function of biological neural networks, specifically neurons in the human brain4. However, despite this biological inspiration, traditional ANNs differ significantly from the brain’s highly efficient mechanisms. The human brain achieves high-throughput information processing with remarkably low power consumption5, processing extremely large amounts of information in a massively parallel and event-driven manner. In contrast, ANNs typically require substantial computational and energy resources, due to their reliance on dense, continuous, floating-point matrix operations.
To address these limitations, and more closely mimic the efficiency of biological brains, spiking neural networks (SNNs) have gained significant attention as a next-generation neural network model. Unlike ANNs, which rely on continuous-valued activations, SNNs communicate via binary spikes (0s and 1s)6, reflecting the event-driven nature of information processing in biological neural systems. This spike-based communication enables SNNs to operate in a sparse, asynchronous fashion, significantly reducing the number of multiplication operations a network is required to perform. Consequently, SNNs offer substantial computational and energy savings7,8, making them particularly well-suited for real-time, low-power applications, such as edge computing and neuromorphic systems.
Just as the biological brain comprises different types of neurons with distinct properties9, various spiking neuron models have been developed for use in SNNs, each balancing biological plausibility against computational efficiency. The Leaky Integrate-and-Fire (LIF) neuron10, for example, represents a foundational approach, capturing essential neuronal dynamics such as membrane potential decay and threshold-based spike generation with minimal computational overhead. The adaptive leaky integrate-and-fire (ALIF) neuron11 extends the LIF framework by incorporating mechanisms for adaptive threshold modulation or membrane potential adjustment, thereby increasing biological realism while maintaining relative simplicity. In contrast, more biophysically detailed models, such as the Izhikevich model12 and the Hodgkin–Huxley model13, offer higher fidelity by reproducing complex neuronal behaviors, including bursting, resonance, and various firing patterns, at the cost of significantly increased computational demands.
The selection of neuron models within an SNN plays a critical role in determining both the network's computational efficiency and its functional accuracy. While the Izhikevich and Hodgkin–Huxley models are well suited for applications focused on replicating specific aspects of biological networks, their high computational demands render them impractical for use in deep networks. Conversely, the LIF and ALIF models are commonly employed in SNNs designed to replicate tasks performed by ANNs, due to their lower computational requirements. Previous works14,15,16,17 adopted the LIF neuron model for its simplicity and demonstrated its effectiveness in achieving state-of-the-art results. However, the ALIF model has been shown to provide improved firing rate stability, with only a minor increase in energy consumption, while producing competitive accuracies18. Despite the proven success of both the LIF and ALIF neuron types independently, to the authors' knowledge, no existing work has investigated the integration of both neurons within a unified network architecture, highlighting a notable gap in the current literature and a promising area for research.
Despite the computational advantages SNNs provide, they face challenges in achieving the high accuracy observed in traditional ANNs. This remains true even for fundamental tasks like image classification, with the exception of networks trained on smaller-scale datasets19,20,21. To help alleviate this issue, researchers have once again turned to the human brain for inspiration. The biological brain, particularly the visual cortex, dynamically allocates resources to the most important parts of the visual field based on the task being performed or some external stimulus. This enables the brain to filter out unnecessary information, allowing humans to process complex environments efficiently and make decisions quickly22,23. Inspired by this phenomenon, attention mechanisms have been successfully integrated into ANNs24,25,26,27,28, leading to the development of highly successful models such as transformers and vision transformers (ViTs)29, which excel at prioritizing key features for improved task performance. The incorporation of attention into SNNs has been the subject of several recent research articles and has been shown to be an effective tool for optimizing spike generation and processing, leading to better performance and energy efficiency14,30,31,32,33,34. While attention has been successfully integrated into SNNs, the mechanisms used in most studies were designed for ANNs, leaving an open avenue of research into mechanisms that are more biologically plausible.
This paper explores the combination of a biologically inspired attention mechanism and SNNs for the task of image classification. Specifically, we propose a new 3-D spatial-channel attention mechanism for SNNs. The attention mechanism makes use of the spiking output of ALIF neurons to create a binary attention mask, which is applied to the input features to eliminate noisy or non-vital information. Our mechanism is inserted into an existing SNN built from LIF neurons, creating a new network capable of using multiple spiking neuron types. The proposed attention mechanism is further analyzed using explainable AI tools to enhance the interpretability of its effects on the decision-making process. Our Biologically Inspired Attention SNN (BIASNN) model is evaluated on three static image datasets (FMNIST, CIFAR-10, and CIFAR-100), achieving accuracies of 95.66%, 94.22%, and 75.40%, respectively. The main contributions of our work can be summarized as follows.
We create a 3D, spike-based attention mechanism that uses ALIF neurons for controlling attention within the spatial and channel dimensions of images.
We propose a new method for making use of multiple types of spiking neurons in an SNN.
We make use of a Grad-CAM-like method to further analyze how our proposed mechanism affects the classification of the input images.
Experimental results show that our new method obtains comparable results when measured against existing SNN models.
Methods
The goal of this work is to create a new, more biologically plausible form of attention, and integrate the proposed mechanism into an existing SNN architecture that currently uses LIF neurons. This combination is used to form the proposed BIASNN model. The subsections below discuss the details of the backbone architecture, the inner workings of our new attention mechanism, and the spiking ALIF block used for generating the final attention map.
Backbone architecture
The ResNet model1 is a widely adopted deep neural network architecture for image processing, originally developed to mitigate the degradation problem in deep networks through the use of identity-based residual connections. Building upon this concept, the MS-ResNet architecture17 was created for SNNs. In this framework, data encoding is performed via an initial convolutional layer that transforms static image inputs into a format suitable for spike-based processing. The encoded signals are then propagated through a series of residual blocks, each comprising two spiking neuron layers followed by a convolutional layer. This structure permits the exchange of floating-point feature maps between blocks, enabling improved representational capacity and learning stability. Due to its success in image classification, we adopt the MS-ResNet18 architecture as the backbone for our BIASNN network. Following the MS-ResNet18 design paradigm, our model consists of eight residual blocks, with the proposed attention mechanism inserted after every other block, except for the last.
The entire architecture of the proposed network is illustrated in Fig. 1. As depicted in Fig. 1a, the BIASNN model begins with an initial two-dimensional convolutional layer, configured with a kernel size of seven, a stride of one, and padding of three, which serves to encode the input data into a format suitable for downstream spike processing. This is followed by a sequence of residual blocks, the internal structure of which is detailed in Fig. 1b. Each residual block begins with a Leaky Integrate-and-Fire (LIF) neuron layer, which integrates synaptic input, in the form of weighted floating-point values, into the individual neurons' membrane potentials. When a membrane potential crosses a threshold, an output spike is generated. The resulting spike trains are propagated through a convolutional layer, followed by batch normalization to stabilize learning. The normalized output is subsequently fed into a second LIF layer, whose spiking activity is again processed through a convolutional layer and a second batch normalization step. A residual connection is added to the output of the final batch normalization operation, enabling gradient flow and promoting stable training. Architectural variations for the first convolutional layer in each residual block group are detailed in Table 1. All other convolutional layers throughout the network utilize a kernel size of three, a stride of one, and padding of one.
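To make the block structure concrete, the following is a minimal PyTorch sketch of one such residual block. It assumes a generic spiking-layer factory (a placeholder here) in place of the LIF layers defined later in this section; the channel counts, the projection used on the skip path, and the example shapes are illustrative assumptions rather than the exact MS-ResNet18 configuration.

```python
import torch
import torch.nn as nn

class MSResidualBlock(nn.Module):
    """Sketch of one residual block: LIF -> Conv -> BN -> LIF -> Conv -> BN -> + skip.

    `spike_layer` is a factory for a spiking layer (e.g. an LIF module); nn.Identity
    is used as a stand-in only so the sketch runs on its own.
    """

    def __init__(self, in_ch, out_ch, stride=1, spike_layer=nn.Identity):
        super().__init__()
        self.lif1 = spike_layer()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.lif2 = spike_layer()
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Project the skip path when the shape changes (one common choice).
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = self.bn1(self.conv1(self.lif1(x)))
        out = self.bn2(self.conv2(self.lif2(out)))
        return out + self.skip(x)  # floating-point feature maps pass between blocks

# Example: a block that downsamples from 64 to 128 channels.
block = MSResidualBlock(64, 128, stride=2)
y = block(torch.randn(1, 64, 32, 32))  # -> (1, 128, 16, 16)
```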
Fig. 1
Overview of the proposed BIASNN network. The BIASNN model (a) consists of MS-ResNet18 residual blocks (b) and our proposed attention mechanism (c). The attention mechanism combines the CBAM and Squeeze-and-Excitation methods with an ALIF block (d) to create a 3D attention mechanism. The ALIF block contains four layers: Channel Normalization, Data Inversion, ALIF neurons, and Spike Inversion. The combination of these layers is designed so that the attention mechanism will learn to eliminate the least important values from the input data.
For our LIF layers, we utilize the following equations:
$$V_{n}^{t} = \beta V_{n}^{t-1} + \sum_{i} W_{n,i} S_{i}^{t}$$
(1)
$$S_{n}^{t} = \theta \left( V_{n}^{t} - V_{th} \right)$$
(2)
$$V_{n}^{t} = V_{n}^{t} \left( 1 - S_{n}^{t} \right)$$
(3)
where (V_{n}^{t}) is the membrane potential of the nth postsynaptic neuron at time step t, and a time step is a single iteration through the network. The variable (\beta) is the decay constant of the neuron, (W_{n,i}) is the weight between the ith presynaptic neuron and the nth postsynaptic neuron, and (S_{i}^{t}) is the output spike value from the ith presynaptic neuron. (S_{n}^{t}) is the output spike value of the nth postsynaptic neuron, (\theta) represents the Heaviside function, and (V_{th}) is the voltage threshold of the postsynaptic neuron. In general, Eq. (1) represents the voltage update process of a neuron, Eq. (2) is used to determine whether a spike is generated, and Eq. (3) resets the voltage of the neuron if a spike occurs. A diagram depicting the process an LIF neuron in our model undergoes can be seen in Fig. 2a.
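As a concrete illustration, the sketch below implements Eqs. (1)–(3) for a single time step in PyTorch. The decay constant and threshold values shown are illustrative assumptions, not the settings used in our experiments, and the weighted input is assumed to be supplied by a preceding layer (e.g. a convolution).

```python
import torch

def lif_step(v, x, beta=0.9, v_th=1.0):
    """One LIF time step following Eqs. (1)-(3).

    v: membrane potential carried over from the previous time step.
    x: weighted presynaptic input (e.g. the output of a convolution).
    beta and v_th are illustrative values, not our experimental settings.
    """
    v = beta * v + x                 # Eq. (1): leaky integration of the input
    s = (v - v_th >= 0).float()      # Eq. (2): Heaviside step produces a spike
    v = v * (1.0 - s)                # Eq. (3): reset the potential where a spike fired
    return v, s

# Example: run a few time steps on random input currents for four neurons.
v = torch.zeros(4)
for t in range(3):
    v, s = lif_step(v, torch.rand(4))
```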
Fig. 2
Overview of the processes carried out by an LIF neuron and ALIF neuron. The LIF neuron, depicted in (a), takes the input from the previous layer and integrates it into the decayed (leaked) membrane potential. If the membrane potential reaches a specified threshold, the neuron will fire a spike and reset its membrane potential. Otherwise, it will remain silent, and the membrane potential will be transferred to the next time step as is. The ALIF neuron, depicted in (b), operates in a similar fashion to the LIF, but with one key difference. The ALIF neuron will adjust its firing threshold based on whether a spike occurred in the current time step. If the neuron generates a spike, its threshold will increase, and if a spike wasn’t generated, the threshold will decrease.
Our model makes use of the surrogate gradient method for training35. Using the surrogate gradient method allows for direct training of our SNN and allows us to easily use existing deep learning libraries. To overcome the issue of nondifferentiable spiking outputs from the LIF layers, we use the following equation17,36 during the backward pass:
$$\frac{\partial S_{n}^{t}}{\partial V_{n}^{t}} = \frac{1}{a}\, \mathrm{sign}\left( \left| V_{n}^{t} - V_{th} \right| \le \frac{a}{2} \right)$$
(4)
where the variable (a) is a constant used to keep the integral of the function set to 1.
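The sketch below shows one way Eq. (4) could be wired into PyTorch's autograd: the forward pass applies the Heaviside step of Eq. (2), while the backward pass substitutes the rectangular surrogate of Eq. (4). The width constant a = 1.0 is an illustrative assumption.

```python
import torch

class RectSurrogateSpike(torch.autograd.Function):
    """Heaviside spike (Eq. (2)) forward, rectangular surrogate (Eq. (4)) backward."""

    @staticmethod
    def forward(ctx, v_minus_th, a=1.0):
        ctx.save_for_backward(v_minus_th)
        ctx.a = a
        return (v_minus_th >= 0).float()               # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_th,) = ctx.saved_tensors
        a = ctx.a
        # Eq. (4): constant gradient 1/a inside the window |V - V_th| <= a/2, 0 outside.
        surrogate = (v_minus_th.abs() <= a / 2).float() / a
        return grad_output * surrogate, None           # no gradient for the constant a

# Usage inside an LIF layer: s = RectSurrogateSpike.apply(v - v_th)
```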
Attention mechanism
A detailed overview of the architecture of our attention mechanism can be seen in Fig. 3a, and its distinct steps are summarized in Algorithm 1. Our attention mechanism draws its inspiration from the CBAM37 and SE24 attention architectures and begins with the operations listed below.
$$P^{t} = \left[ \mathrm{MP}\left( X^{t} \right);\, \mathrm{AP}\left( X^{t} \right) \right]$$
(5)
$$C^{t} = f_{C/r}^{5,2}\left( f_{C}^{7,3}\left( P^{t} \right) \right)$$
(6)
Fig. 3
Overview of the proposed attention mechanism and ALIF block. Shown in (a) is the process data undergoes in the attention mechanism. Here, global average and max pooling are used to squeeze the data in the channel dimension. The DS convolutions are then used to gradually increase (excite) the number of channels back to that of the original input. Shown in (b) is a detailed look at the ALIF block. The colored squares in the cubes show the data being transformed at each step. In both figures, the final attention map consists of only black squares (0s) and white squares (1s).
In Eq. (5), (X^{t}) denotes the input data at time step t, while (P^{t} \in {\mathbb{R}}^{2 \times H \times W}) represents the concatenated feature maps obtained from average pooling (AP) and max pooling (MP) operations. In Eq. (6), (P^{t}) is processed by two depth-wise separable (DS) convolutional layers. The DS convolutions utilize kernel sizes of 5 and 7, strides of 2 and 3, and have output channel sizes of C/r and C, respectively, producing the output (C^{t} \in {\mathbb{R}}^{C \times H \times W}). These convolutional layers are designed to progressively excite the channel dimensionality to match that of the original input, with the rate of channel expansion governed by the hyperparameter r.
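The sketch below illustrates this squeeze-and-excite front end, Eqs. (5) and (6): the channel dimension is pooled to two maps, which two depth-wise separable convolutions then expand back to C channels. The pairing of kernel sizes with channel widths, the use of "same" padding so the spatial size matches (C^{t} \in {\mathbb{R}}^{C \times H \times W}), and the values of C and r are one plausible reading of the equations rather than our exact configuration.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depth-wise separable convolution: per-channel spatial conv + 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size, padding):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def attention_front_end(x, ds1, ds2):
    """Eqs. (5)-(6): squeeze the channel dimension, then excite it back to C channels.

    x: (B, C, H, W) input at one time step.
    """
    mp = x.max(dim=1, keepdim=True).values       # MP(X^t): (B, 1, H, W)
    ap = x.mean(dim=1, keepdim=True)             # AP(X^t): (B, 1, H, W)
    p = torch.cat([mp, ap], dim=1)               # P^t: (B, 2, H, W)
    return ds2(ds1(p))                           # C^t: (B, C, H, W)

# Illustrative configuration: C = 64 and r = 4 are assumptions for this sketch.
C, r = 64, 4
ds1 = DSConv(2, C // r, kernel_size=7, padding=3)   # first excitation stage (C/r channels)
ds2 = DSConv(C // r, C, kernel_size=5, padding=2)   # second stage back to C channels
c_t = attention_front_end(torch.randn(1, C, 32, 32), ds1, ds2)
```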
ALIF block
Once the data has been convolved, it is passed into the ALIF block. As can be seen in Fig. 3b, the ALIF block consists of four different steps. The first two steps are channel normalization18,38,39 and inversion, and the equations for these processes are:
$$M_{norm}^{t} = \frac{C^{t} - C_{min}^{t}}{C_{max}^{t} - C_{min}^{t} + \epsilon}$$
(7)
$$M_{inv}^{t} = 1 - M_{norm}^{t}$$
(8)
where (M_{norm}^{t} \in {\mathbb{R}}^{C \times H \times W}) denotes the normalized output values, (C_{min}^{t} \in {\mathbb{R}}^{C \times 1 \times 1}) and (C_{max}^{t} \in {\mathbb{R}}^{C \times 1 \times 1}) represent the per-channel minimum and maximum values of (C^{t}), and (\epsilon) is a small constant added to the denominator to prevent division by zero in the rare case that the minimum and maximum values are equal. After normalization to the range [0, 1], the data is inverted using Eq. (8), resulting in the output (M_{inv}^{t} \in {\mathbb{R}}^{C \times H \times W}). Once the data has been inverted, it is sent to a layer of ALIF neurons with adaptive thresholds18,40 to generate spikes. Our ALIF neurons make use of the following equations:
$$V_{n,inv}^{t} = \tau_{v} V_{n,inv}^{t-1} + M_{inv}^{t}$$
(9)
$$S_{n,inv}^{t} = \theta \left( V_{n,inv}^{t} - V_{th,n}^{t} \right)$$
(10)
$$V_{n,inv}^{t} = V_{n,inv}^{t} \left( 1 - S_{n,inv}^{t} \right)$$
(11)
$$a_{n}^{t+1} = \frac{1}{B}\sum_{i}^{B} \left[ a_{n}^{t} + \frac{1}{\tau_{a}}\left( -a_{n}^{t} + S_{i,n,inv}^{t}\, dt_{a} \right) \right]$$
(12)
$$V_{th,n}^{t+1} = V_{th,n}^{t} + a_{n}^{t+1}$$
(13)
where (V_{n,inv}^{t}) is the membrane potential of the nth neuron at time step t, (\tau_{v}) represents the membrane potential decay constant of the neuron, (S_{n,inv}^{t}) is the output spike value of the neuron, (\theta) is the Heaviside function, and (V_{th,n}^{t}) is the membrane potential voltage threshold of the neuron. The variable (a_{n}^{t+1}) is the update value for the threshold at the next time step, averaged over the batch dimension B. (\tau_{a}) is the scale factor for the update value, and (dt_{a}) is the update time constant. Overall, Eqs. (9)–(11) are used to update the membrane potential of the neuron, generate a spike when necessary, and reset the membrane potential of the neuron if a spike occurs. Equations (12) and (13) are used to update the threshold value of the ALIF neuron for the next time step. A diagram depicting the process an ALIF neuron in our attention mechanism undergoes can be seen in Fig. 2b.
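A minimal sketch of the ALIF block's computations follows, covering the channel normalization and inversion of Eqs. (7)–(8) and one ALIF time step following Eqs. (9)–(13). The time constants, the initial threshold, and the batch-averaging of the threshold update are written as one plausible reading of the equations; the numeric values are illustrative assumptions.

```python
import torch

def channel_minmax_norm(c, eps=1e-8):
    """Eqs. (7)-(8): per-channel min-max normalization followed by inversion."""
    flat = c.flatten(2)                                  # (B, C, H*W)
    c_min = flat.min(dim=2).values[..., None, None]      # (B, C, 1, 1)
    c_max = flat.max(dim=2).values[..., None, None]
    m_norm = (c - c_min) / (c_max - c_min + eps)
    return 1.0 - m_norm                                  # M_inv^t

def alif_step(m_inv, v, v_th, a, tau_v=0.9, tau_a=20.0, dt_a=1.0):
    """One ALIF time step following Eqs. (9)-(13); parameter values are illustrative."""
    v = tau_v * v + m_inv                                # Eq. (9): leaky integration
    s = (v - v_th >= 0).float()                          # Eq. (10): spike vs. adaptive threshold
    v = v * (1.0 - s)                                    # Eq. (11): reset where a spike fired
    # Eqs. (12)-(13): adapt the threshold, averaging the update over the batch dimension.
    a = (a + (1.0 / tau_a) * (-a + s * dt_a)).mean(dim=0, keepdim=True)
    v_th = v_th + a
    return s, v, v_th, a

# Example: one time step on a random, normalized-and-inverted input.
B, C, H, W = 2, 64, 32, 32
m_inv = channel_minmax_norm(torch.randn(B, C, H, W))
v = torch.zeros(B, C, H, W)
v_th = torch.ones(1, C, H, W)        # initial threshold (assumed value)
a = torch.zeros(1, C, H, W)          # adaptation variable
s, v, v_th, a = alif_step(m_inv, v, v_th, a)
```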
To overcome the non-differentiability of spike-based outputs, our ALIF layer makes use of the arctan surrogate method35,41 in the backpropagation step, which can be derived as seen below.
$$\frac{\partial S_{n}^{t}}{\partial V_{n}^{t}} = \left( \frac{1}{\pi} \right) \frac{1}{1 + \left( \pi V_{n}^{t} \frac{\alpha}{2} \right)^{2}}$$
(14)
Algorithm 1
Pseudo-code of the proposed attention mechanism
In Eq. (14), (\alpha) is a constant used to scale the output gradient. Once the spikes for the ALIF layer's neurons have been calculated, they are inverted using Eq. (15) below:
$$U^{t} = 1 - S_{n,inv}^{t}$$
(15)
where (U^{t}) contains the final output spike values for all ALIF neurons. The ALIF spikes are computed in this way to keep the output of the ALIF neurons sparse, as is typically desired from spiking neurons. Finally, using Eq. (16), the original attention mechanism input is multiplied by the inverted spikes to get the final output of the attention mechanism.
$$X_{attn}^{t} = \mathrm{STE}\left( X^{t} \odot U^{t} \right)$$
(16)
In Eq. (16) above, STE stands for the straight-through estimator42. The straight-through estimator is used to allow gradient calculations for all values of (X^{t}), even those that were removed by the attention mask.
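The sketch below shows Eqs. (15) and (16) in PyTorch, using a common formulation of the straight-through estimator in which the forward pass keeps the gated values while the backward pass treats the gating as identity; whether this matches the exact STE variant in our implementation is an assumption.

```python
import torch

def apply_binary_attention(x, s_inv):
    """Eqs. (15)-(16): invert the ALIF spikes into the mask U^t and gate the input.

    x: attention-mechanism input X^t; s_inv: ALIF spike tensor S_{n,inv}^t.
    """
    u = 1.0 - s_inv                      # Eq. (15): sparse binary attention mask
    gated = x * u                        # X^t elementwise-multiplied by U^t
    # Straight-through estimator: the forward pass uses the gated values, while the
    # backward pass treats the gating as identity so gradients reach masked values.
    return x + (gated - x).detach()

# Example with a random input and a random binary "inverted spike" tensor.
x = torch.randn(1, 64, 32, 32, requires_grad=True)
s_inv = (torch.rand_like(x) > 0.8).float()
x_attn = apply_binary_attention(x, s_inv)
```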
Experimental setup
We evaluate our newly proposed BIASNN on three standard datasets: FashionMNIST43, CIFAR1044, and CIFAR10044. For all experiments, the adaptation parameter (a_{n}^{t+1}) is reset at the end of each epoch. We incorporate this strategy to simulate the long-term, homeostatic effects of threshold variations typically seen in biological neurons45. Table 2 lists the common hyperparameter settings for the three datasets. Information pertaining to each dataset can be seen in Table 3. For training, each dataset is augmented with random horizontal flipping and random cropping with a padding of four. To ensure a fair comparison, we processed all three datasets through the MS-ResNet18 network using the same convolution settings employed by our BIASNN (see Table 1 for details).
These results are indicated with a superscript “a” in Table 4. All reported results correspond to the highest validation accuracies achieved for each dataset within 150 training epochs. All experiments are carried out on an NVIDIA RTX 4090 graphics card using the PyTorch 2.1 library.
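For reference, the augmentation described above can be expressed with torchvision transforms as in the sketch below. The CIFAR-10 normalization statistics shown are commonly used values and are an assumption, not taken from our training configuration.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Training-time augmentation as described above: random horizontal flip and a random
# crop with 4 pixels of padding. The crop size matches the dataset's image size
# (32 for CIFAR, 28 for FashionMNIST).
train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```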
Results
The results of our experiments on the FMNIST dataset are shown in Table 4. As can be seen in the table, our BIASNN model achieves an accuracy of 95.66% when the number of time steps (T) is set to 4. The proposed solution improves on the accuracy of the backbone MS-ResNet18 by 0.19%. Excluding the backbone network, the method with the next closest accuracy is that proposed by Dan et al.46, which achieved an accuracy 0.22% lower than ours while requiring two extra time steps. The Proxy Learning method47 shows the next closest accuracy to our BIASNN; however, it required 50 time steps and produced an accuracy 1.1% lower than BIASNN. Table 4 also lists the results of experiments on the CIFAR10 dataset. As can be seen in the table, the BIASNN network achieves an accuracy of 94.22%, a 0.38% increase over the backbone MS-ResNet18. Comparing our results with other methods on the CIFAR10 dataset, we see that the CQ Training48 and Wang et al.49 methods achieve accuracies only slightly lower than those of the proposed BIASNN. The CQ Training method achieves an accuracy 0.06% less than BIASNN; however, it relies on ANN-SNN conversion and requires 600 time steps to reach an accuracy similar to that of our four-time-step method. The method proposed by Wang et al. also makes use of ANN-SNN conversion and achieves results slightly lower than our BIASNN (by 0.13%) in four time steps with a VGG-16-based network. However, when applied to the typically more powerful ResNet-18 architecture, the accuracy of their method decreases to 93.27%, which is 0.95% less than the accuracy achieved by our proposed BIASNN. Beyond the decrease in accuracy, the CQ Training and Wang et al. methods still do not allow for direct training of the SNN, requiring an extra conversion step.
Table 4 also lists the results of experiments on the CIFAR100 dataset. When employing four time steps, our proposed BIASNN model achieves an accuracy of 75.40%, an increase of 0.42% over the backbone MS-ResNet18. In the case of the CIFAR100 dataset, the CQ Training method is the next closest competitor to the proposed BIASNN. Their method produces an accuracy of 71.84% with 300 time steps, a relatively large drop in performance (3.56%) compared to our BIASNN's results. The method proposed in Sun et al.50 demonstrates the next highest accuracy compared to the proposed BIASNN. It achieves an accuracy of 71.77%, again showing a large drop in performance (3.63%) compared to our BIASNN.
Discussion
To better analyze the effects the proposed attention mechanism has on the overall classification accuracy of the network, we make use of two different methods to visualize its impact. First, we compare the individual class accuracies with those of the backbone architecture, and second, we make use of a Grad-CAM-like method for generating heat maps of the spiking outputs of the attention layer.
Shown in Fig. 4a,b are the confusion matrices for the original MS-ResNet18 and our proposed BIASNN on the CIFAR10 dataset. We see that the proposed attention mechanism helps increase the accuracy of several classes, most notably the airplane (+2%), cat (+3%), and frog (+1%) classes. However, it does decrease the accuracy of some classes, such as dog (-2%) and horse (-2%). This may be a sign that the network is eliminating the wrong information, or too much of it, making some classes more difficult to distinguish.
Fig. 4
Confusion matrix results for the original MS-ResNet18 (a), and our proposed BIASNN (b) on the CIFAR10 dataset.
The Grad-CAM method59 is used for visualizing the effects individual layers have on the final output of a network. It makes use of activation values and their gradients to generate a heatmap, allowing researchers to better understand how different layers are affecting a model. Here, we use this same concept to study the effect the spiking layer in our attention mechanism has on our proposed network. However, unlike the original Grad-CAM, we do not use the gradients of the spiking layers for two reasons. First, the gradients of spiking outputs are generated using a surrogate gradient function, which is only an approximation of a spike’s gradient and can introduce errors into the final heatmap. Second, the surrogate functions are typically designed so that lower input values will have higher gradient values. This can result in misleading heatmaps, as the areas that caused spikes to occur, i.e., areas with the most information, will have the smallest gradients. In our specific case, since the final spike outputs were inverted, the highest gradients should correspond to the spike values for the attention mask; however, we prefer to use a method that can be more easily applied to any spiking layer in the network.
Instead, we make use of the total number of spikes for all time steps to create the final heat map. To generate the final heatmap for each image, the number of spikes is summed across all time steps, and then normalized to be in the range [0, 1] using Eqs. (17) and (18) below.
$$S^{T} = \sum_{t}^{T} S^{t}$$
(17)
$$S_{norm} = \frac{S^{T} - S_{min}^{T}}{S_{max}^{T} - S_{min}^{T}}$$
(18)
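A sketch of this heatmap computation, Eqs. (17) and (18), applied to spikes recorded from the last attention mechanism, is given below. Collapsing the channel dimension to obtain a 2-D map, and the small epsilon guarding the division, are assumptions made here for visualization purposes.

```python
import torch

def spike_count_heatmap(spikes):
    """Eqs. (17)-(18): sum inverted spikes over time steps and min-max normalize.

    spikes: (T, C, H, W) binary tensor recorded from the last attention mechanism.
    """
    s_total = spikes.sum(dim=0)                        # Eq. (17): (C, H, W)
    s_total = s_total.sum(dim=0)                       # collapse channels -> (H, W) map
    s_min, s_max = s_total.min(), s_total.max()
    return (s_total - s_min) / (s_max - s_min + 1e-8)  # Eq. (18), values in [0, 1]

# Example: heat = spike_count_heatmap(torch.randint(0, 2, (4, 64, 32, 32)).float())
```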
In Fig. 5, class-specific heatmaps were generated using the inverted spike outputs, computed via Eq. (15), from the final attention mechanism in the network. As can be seen in the figure, the number of spikes generated in areas of importance tends to be higher, while areas of less importance produce fewer spikes. This is an indication that our attention mechanism does help the network learn where to focus, and is demonstrated particularly well in the airplane, deer, and bird heatmaps.
Fig. 5
Heatmap images, and the corresponding input CIFAR10 image, for each of the ten classes in the CIFAR10 dataset. The heatmap images show the normalized spike counts generated from the inverted spikes, calculated using Eq. (15), in the last attention mechanism in our proposed BIASNN.
Our proposed attention mechanism makes use of ALIF neurons due to their more stable firing rate. However, the LIF neuron is generally a more popular choice when working with SNNs. To validate our use of an ALIF layer in our attention mechanism, we examine the results of the BIASNN model when an LIF layer is used as the spiking layer in the attention mechanism instead. We term this method the Leaky Attention SNN (LASNN) and compare its results to those of the proposed BIASNN model, in which an ALIF layer is used. From our experiments, we find that the BIASNN model achieves an accuracy of 94.22%, whereas the LASNN method produces an accuracy of 93.68%, a considerable drop in performance. In an effort to understand why the BIASNN method outperforms the LASNN method, we look at two different pieces of information generated by the network. The first is the number of neuron values that are eliminated at each time step by each attention mechanism. The number of eliminated values plays a crucial role in the amount of information that is allowed to pass through the rest of the network and will ultimately affect the final accuracy. As shown in Fig. 6a, the number of values eliminated by the LASNN method varies greatly across time steps in all attention mechanisms, and there is a considerable difference in the number of eliminated values between attention mechanisms. In contrast, the BIASNN method gradually increases the number of eliminated values between time steps for each mechanism, and the number eliminated by each attention mechanism is relatively similar.
Fig. 6
Graphs displaying BIASNN and LASNN comparisons. Depicted in (a), the percentage of neuron activations suppressed by the LIF-based attention mechanism (LASNN) and the proposed ALIF-based mechanism (BIASNN). Shown in (b), the percentage of spiking neurons in the LIF layers that follow the attention mechanisms.
Second, we look at the firing patterns of the LIF layers that directly follow the attention mechanisms. The firing patterns seen in these layers will affect the rest of the layers in the block, and ultimately the accuracy of the network, so stability becomes an important factor. Plotted in Fig. 6b, the number of spiking neurons in the LIF layers varies considerably for the LASNN method, which is to be expected given the greater variability in the number of eliminated neuron values in the attention mechanisms. In contrast, the BIASNN method produces LIF neurons that are gradually less activated over time, and the variation between the LIF layers is more consistent. This follows well with the gradual increase in the number of eliminated neuron values for each attention mechanism in the BIASNN method. Based on these two pieces of information, it appears that the instability of LASNN is what leads to its lower classification accuracy.
We also examine the effect that using the LASNN method has on the energy requirements of our proposed network. For our energy calculations, we follow60, where the network is assumed to be running on a 45 nm CMOS chip, and addition and multiplication operations require 0.9 pJ and 3.7 pJ of energy, respectively. Results show that the BIASNN method consumes 0.55 mJ of energy while the LASNN method requires 0.59 mJ, an outcome that seems counterintuitive considering the use of the more complex ALIF neurons in BIASNN. Examining the number of multiplication and addition operations within the attention mechanisms, we find that the LIF neurons require 229,376 multiplications and 458,752 additions, whereas the ALIF neurons require 458,755 multiplications and 688,688 additions. Although the ALIF neurons require approximately twice the number of multiplications and 1.5 times the number of additions, this adds minimal computational overhead to the network. In fact, this small increase in complexity is a result of our placement strategy for the attention mechanisms, as only three ALIF layers are added to the network, one for each attention mechanism. However, the small increase in complexity does not explain the difference in power consumption between the two models. Instead, the difference can be seen by examining the spiking rate of the LIF neurons outside of the attention mechanisms in both networks. The LASNN method shows an average spiking rate of 11.88%, while the BIASNN method maintains a lower rate of 9.64%. The reduced spike rate of BIASNN means it requires 54,338,180 fewer addition operations than LASNN, offsetting the extra cost of the ALIF neurons and accounting for the 0.05 mJ difference in energy consumption. A particularly clear example of the higher spike rate in LASNN is observed in LIF layer 9, shown in Fig. 6b. Based on reduced spiking rate of t