1 Bit is all we need: Binary Normalized Neural Networks

Eduardo L. L. Cabral
Mauá Institute of Technology, São Caetano do Sul, São Paulo, SP, Brazil
Nuclear and Energy Research Institute, São Paulo, SP, Brazil
elcabral@maua.br

Paulo Pirozelli
Mauá Institute of Technology, São Caetano do Sul, São Paulo, SP, Brazil
paulopirozelli@gmail.com

Larissa Driemeier
Department of Mechatronics and Mechanical Systems Engineering, Polytechnic School – University of São Paulo, São Paulo, SP, Brazil
driemeie@usp.br

Abstract

The increasing size of large neural network models, specifically language models and foundational image models, poses deployment challenges, prompting efforts to reduce memory requirements and enhance computational efficiency. These efforts are critical to ensure practical deployment and effective utilization of these models across various applications. In this work, a novel type of neural network layer and model is developed that uses only single-bit parameters. In these models, all parameters of all layers, including kernel weights and biases, take only the values zero or one. The models are built from layers named binary normalized layers. These layers can be of any type, such as fully connected, convolutional, or attention, and they are slight variations of the corresponding conventional layers. To show the effectiveness of the binary normalized layers, two different models are configured: one to solve a multiclass image classification problem and a language decoder to predict the next token of a sequence. The image classification model has convolutional and fully connected layers, and the language model is composed of transformer blocks with multi-head attention. The results show that models with binary normalized layers achieve almost the same results as equivalent models with real 32-bit parameters. The binary normalized layers make it possible to develop models that use 32 times less memory than current models with equivalent performance. In addition, the binary normalized layers can be easily implemented on current computers using 1-bit arrays and do not require the development of dedicated electronic hardware. This novel type of layer opens a new era for large neural network models with reduced memory requirements that can be deployed on simple and cheap hardware, such as mobile devices or CPUs alone.

Keywords Neural networks · binary parameters · binary normalized layers

1 Introduction

Recent advances in machine learning techniques and hardware have enabled extraordinary performance in a wide range of applications — from traditional tasks such as pattern recognition and natural language processing to more complex domains, including autonomous control systems and the discovery of new materials. These developments have been made possible thanks to improved neural network architectures, increased computational power, and the availability of large datasets Goodfellow et al. (2016).

However, despite such impressive progress, current artificial intelligence models still face serious limitations when applied to embedded systems. Most state-of-the-art AI solutions rely heavily on cloud computing infrastructure and high-performance specialized hardware. Deep neural networks, such as those used for image classification, often require billions of floating point operations to process a single input sample Henzinger et al. (2021). The increasing size of large-scale AI models, especially language models and foundational image models, introduces significant deployment challenges. These models impose computational, energy, and memory requirements that are difficult to satisfy outside data centers.

This dependency makes it impractical to implement these models in systems with limited computational resources, especially in contexts where connectivity is restricted or nonexistent. Applications in isolated environments, such as underwater, underground, aerospace, or agricultural systems, face severe communication challenges, including physical interference, high latency, and limited bandwidth. As discussed in Plastiras et al. (2018), local data processing becomes an evolutionary necessity to ensure the autonomy and responsiveness of such systems, particularly where rapid decision making is required under resource constraints.

To address these constraints, quantization has emerged as a key technique for optimizing neural networks in resource-limited environments Henzinger et al. (2021). Instead of relying on high-precision floating-point arithmetic (e.g., 32-bit), quantization uses lower-bit integer formats. It typically operates in the 2 to 8 bit range, which is widely adopted in industry with minimal impact on accuracy Henzinger et al. (2021). This approach enables significantly reduced memory usage and bandwidth requirements, with compression ratios ranging from 35× to 49×, as shown by Han et al. (2016). It also provides substantial speed-ups, achieving up to 3 times faster execution on standard CPUs and up to 10 times on specialized fixed point hardware such as Qualcomm DSPs with HVX support Jacob et al. (2018). Furthermore, energy efficiency is greatly improved, which is especially critical for mobile and edge devices Krishnamoorthi (2018); Banner et al. (2019).

However, quantization presents challenges, particularly in preserving accuracy due to the non-differentiable nature of quantization functions and the sensitivity to ranges of varying values in weights and activations Hubara et al. (2017). To address these issues, several techniques have been developed.

Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) are two main strategies to reduce the precision of parameters of neural network models. PTQ applies quantization after a model has been fully trained in high precision (typically 32-bit floating point), aiming to reduce model size and inference latency without retraining. In contrast, QAT simulates quantization effects during training, allowing the model to learn and adapt to the information loss introduced by low-precision representations. As a result, QAT generally achieves significantly higher accuracy than PTQ, particularly at lower bit-widths.

Several key studies have shaped the development of both PTQ and QAT. Jacob et al. (2018) presented a foundational quantization scheme that has been widely adopted in TensorFlow Lite, offering practical guidelines for deploying low-precision models in real-world applications. Krishnamoorthi (2018) provided a comprehensive white paper that systematically covers the principles and implementation details of both PTQ and QAT, serving as a central reference for researchers. For low-bit PTQ, Banner et al. (2019) proposed ACIQ (Analytical Clipping for Integer Quantization), a method for minimizing quantization error without retraining, particularly effective in the 4-bit setting.

Regarding QAT specifically, Hubara et al. (2017) introduced a seminal approach to training Quantized Neural Networks (QNNs) with low-precision weights and activations using the Straight-Through Estimator (STE) to handle non-differentiability. As proposed in Jacob et al. (2018), weights and biases are stored in 32-bit floating-point format to allow precise updates, but are quantized during the forward pass to simulate low-precision inference. Backpropagation still occurs using full-precision gradients, and the STE is employed to handle the non-differentiability of quantization. This dual approach ensures that small gradient updates are not lost, which would happen if parameters were permanently quantized during training. Once training is complete, only the low-precision parameters are retained for efficient inference. In van den Oord et al. (2018) the authors discussed the Vector Quantised-Variational AutoEncoder (VQ-VAE), a generative model that learns discrete latent representations. In Choi et al. (2019) the authors proposed PACT (Parameterized Clipping Activation), a method that improves activation quantization by learning optimal clipping thresholds during training. The field has also advanced with techniques such as SAWB (Statistical Aware Weight Binning) for weight quantization, which selects quantization ranges based on the weight distribution to reduce accuracy loss in low-precision networks. Zhuang et al. (2018) proposed training strategies for quantized networks, including two-stage optimization, progressive quantization, and the use of a full-precision teacher model to guide learning and improve final performance.

Recently, Cabral and Driemeier (2025) explored the trade-offs involved in using low-bit representations for network weights. Their study demonstrates that 2.32-bit weights—corresponding to five discrete levels—can provide a favorable balance between memory efficiency and model accuracy. They also observe that low-resolution models with fewer parameters may need more training epochs to reach the accuracy of 32-bit models, while larger models can achieve similar performance within typical training configurations.

Recent progress includes BitNet, a transformer architecture tailored for large language models using binary weights and low-precision activations while preserving full-precision states for optimizers and gradients. Its variant, BitNet 1.58, introduces ternary weights (–1, 0, +1), offering comparable performance to 16-bit models with reduced memory footprint, lower latency, and improved energy efficiency.

Particularly, 1-bit quantization refers to the technique of reducing the numerical precision of neural network weights and/or activations to just one bit, typically representing values as +1 or –1 Hubara et al. (2017). This extreme form of quantization, known as Binary Neural Networks (BNNs), enables the replacement of costly floating-point operations with highly efficient bitwise operations such as XNOR and bit-counting. The primary goal is to drastically reduce memory usage, energy consumption, and computational complexity, making deep learning models suitable for deployment on low-power, resource-constrained devices. In theory, 1-bit quantization can achieve up to 32× compression over 32-bit floating-point parameters and similarly large energy savings.

Early works by Hwang and Sung (2014); Courbariaux et al. (2015) demonstrated the feasibility of training deep models with binary weights. Hubara et al. (2016) extended this to both weights and activations, training BNNs on datasets like MNIST, CIFAR-10, SVHN, and even ImageNet. Later, Rastegari et al. (2016) introduced XNOR-Net, which used a scaling factor to improve the performance of binarized layers, achieving competitive top-1 accuracy on ImageNet. Despite these advances, 1-bit models still suffer from accuracy degradation on complex tasks. For instance, the BNN proposed by Hubara et al. (2016) achieved 41.8% top-1 accuracy on AlexNet with ImageNet, while XNOR-Net reached 44.2%.

In this work, we propose a novel class of neural network models built entirely with single-bit parameters, using binary normalized layers. Unlike traditional models that rely on 32-bit floating-point precision, our approach constrains all layer parameters—including kernel weights and biases—to a single bit of resolution. The binary normalized layer concept is versatile and can be applied across various architectures such as fully connected, convolutional, and attention layers. To demonstrate their effectiveness, we apply these layers to two distinct problems: multiclass image classification using a convolutional binary model, and a language decoder for next-token prediction in language sequences using a binary transformer model. Our results show that these binary models achieve performance comparable to their full-precision 32-bit counterparts, without exhibiting common training instabilities associated with low-resolution parameter networks. This significant memory reduction, of up to 32 times less than conventional models, combined with their straightforward implementation on standard hardware using 1-bit arrays, opens the door to deploying large-scale neural networks on resource-limited platforms such as mobile devices and CPUs. Moreover, the reduced memory footprint allows scaling to larger models, making advanced AI feasible on embedded systems.

Section 2 of this paper outlines the binary normalized layers. Section 3 details the convolutional model and the image dataset used to train it for a multiclass classification problem. Section 4 presents the language model and the dataset used to train it to predict the next token. The results of both models are compared with results obtained with the corresponding conventional models with 32-bit parameters. Finally, Section 5 summarizes the conclusions.

2 Binary normalized layers

In our binary normalized layers, including the kernel and bias, each parameter exists in two forms simultaneously during training: a full-precision 32-bit floating-point value ($p$) used for gradient updates, and its binarized counterpart ($p_b$) used for forward computations. The quantization process is straightforward and is performed as follows,


$$p_b = \begin{cases} 1, & \text{if } p > p_{mean} \\ 0, & \text{if } p \leq p_{mean} \end{cases} \qquad (1)$$

where $p_{mean}$ is the mean value of the parameters of the layer.
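As a minimal sketch of the thresholding rule in Equation (1), the snippet below binarizes a flat list of parameters around their mean. The function name `quantize` and the use of a flat Python list are illustrative assumptions, not part of the original implementation.

```python
def quantize(params):
    """Binarize a flat list of float parameters around their mean (Equation 1)."""
    p_mean = sum(params) / len(params)
    # Values strictly above the mean map to 1; all others (including ties) map to 0.
    return [1 if p > p_mean else 0 for p in params]


weights = [0.30, -0.12, 0.05, -0.40, 0.22]  # illustrative values only
print(quantize(weights))  # -> [1, 0, 1, 0, 1]
```

Because the threshold is the layer mean rather than zero, a layer whose parameters are all identical quantizes to all zeros under this rule.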

The 32-bit parameters ($p$) are essential because gradient updates during backpropagation are typically very small ($10^{-4}$ to $10^{-2}$), and would be completely lost if parameters were permanently binarized during training. During the forward pass, we use the binary parameters ($p_b$) to compute activations, but during backpropagation the full-precision parameters ($p$) are used to calculate the gradients.

This dual representation approach, inspired by VQ-VAE van den Oord et al. (2018) and related to QAT principles Jacob et al. (2018), allows effective training while ultimately delivering the benefits of 1-bit inference. After training completes, we discard the 32-bit parameters and retain only the 1-bit $p_b$ parameters for deployment.

Obviously, training the model with the help of 32-bit parameters requires a large amount of memory, the same as that required in conventional models. But the final trained model only has 1-bit parameters, thus demanding a much smaller amount of memory. Note that to avoid using 32-bit parameters during training, current neural network training methods based on gradient descent could not be used, and a new training method for neural networks with 1-bit parameters would have to be developed.
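To make the storage claim concrete, the following sketch packs binary parameters eight to a byte and recovers them again. The byte layout (most significant bit first) and the helper names `pack_bits` and `unpack_bits` are assumptions chosen for illustration; the text only states that 1-bit arrays are used.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 parameters per byte (MSB first)."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


def unpack_bits(packed, count):
    """Recover the first `count` 0/1 values from a packed byte string."""
    return [(packed[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


bits = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]   # ten binary parameters
packed = pack_bits(bits)                # stored in two bytes
print(len(packed), unpack_bits(packed, len(bits)) == bits)
```

The parameter count has to be stored alongside the packed bytes, since the last byte may be only partially used.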

The fundamental operation in any neural network layer consists of multiplying input data by the layer’s weights (kernel), adding biases, and applying an activation function. This linear transformation followed by nonlinear activation enables the network to learn complex patterns. However, when weights are constrained to only zeros and ones, this transformation exhibits two critical limitations. First, it disproportionately amplifies large positive and negative input values while suppressing small ones, making it inadequate for extracting complex features from the input data. Second, the binary nature significantly intensifies both vanishing and exploding gradient problems during backpropagation.

One effective way to address the challenges posed by using 1-bit parameters is to normalize the output of the linear transformation before applying the activation function. While this strategy is commonly employed in modern architectures through normalization layers, it becomes particularly crucial in the context of binary-weighted networks. In these models, normalization not only stabilizes training but also compensates for the severe limitations introduced by extreme quantization, enabling effective learning despite the low resolution of the parameters.

The motivation for this normalization shares similarities with its role in conventional networks but takes on greater importance in binary models due to their restricted representational capacity and sensitivity to input scale. Specifically:

• Equalizing feature influence: when input features vary in scale, the limited expressiveness of low-precision weights prevents the model from compensating for dominant features. Normalization ensures that all inputs contribute more equally to learning.
• Improving convergence stability: scale discrepancies in the input can lead to unstable or inefficient optimization. Normalization mitigates this by aligning feature scales, facilitating smoother convergence.
• Controlling gradient magnitudes: quantized parameters make gradient updates more sensitive to input scale. Normalizing inputs helps keep gradients within a stable range, avoiding saturation or stagnation during training.
• Avoiding biased learning: when features have unequal numerical ranges, the model may overemphasize those with larger absolute values. Normalization enforces fairer treatment of all features, improving generalization.
• Mitigating vanishing/exploding gradients: limited-precision models are more prone to unstable gradient propagation. Normalization helps maintain consistent signal flow across layers, especially in deeper networks.

Four types of normalized binary layers are implemented and used in different models: fully connected, convolutional, attention and embedding layers. These layers are described in the sections that follow.

2.1 Binary normalized fully connected layer (BNFCL)

Algorithm 1 illustrates the forward propagation process in a binary normalized fully connected layer (BNFCL). In Algorithm 1, Quant denotes the function that performs weight quantization (defined by equation 1); NoGradient is a placeholder function that prevents gradient calculation for its argument during model training; Normalize is the normalization function; Activation represents the chosen activation function for the layer; and trainable is a flag to indicate if the layer is in training or predicting. Note that the Normalize function normalizes the features of each example so that it has zero mean and unit standard deviation.

Algorithm 1 Forward propagation calculation process in a binary normalized fully connected layer (BNFCL)

1: Input: x, weights W, bias b, flag trainable, activation function

2: Output: activations a

3: if trainable then

4:   Quantize kernel for training: W_q = W + NoGradient(Quant(W) − W)

5:   Quantize bias for training: b_q = b + NoGradient(Quant(b) − b)

6: else

7:   Quantize kernel for inference: W_q = Quant(W)

8:   Quantize bias for inference: b_q = Quant(b)

9: end if

10: Apply linear transformation: z = W_q x + b_q

11: Normalize features of each example: z = Normalize(z)

12: Calculate activations: a = Activation(z)

13: return a

In Algorithm 1, $W$ and $W_q$ represent respectively the 32-bit float and binary kernel weights, while $b$ and $b_q$ are the corresponding 32-bit float and binary bias vectors. The arrays $W$ and $W_q$ have shape ($n_x$, $n_{units}$), where $n_x$ is the number of input features and $n_{units}$ is the number of neurons in the layer. The bias vectors $b$ and $b_q$ are one-dimensional, each containing $n_{units}$ elements.

During both training and inference, the binary weights are used to compute the layer activations. However, during training, when trainable == True, the gradients are computed and applied to the 32-bit floating-point weight matrix $W$ and bias vector $b$. These high-precision values are retained and updated throughout training, ensuring that no information is lost during the optimization process. This approach enables parameter updates using full-precision values while still performing forward passes with quantized weights. This scheme is adapted from the method proposed in Alcorn (2023), and is similar in spirit to Quantization Aware Training (QAT) Jacob et al. (2018).
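The effect of the expression W + NoGradient(Quant(W) − W) can be sketched without an automatic differentiation library: its value equals Quant(W), while its derivative with respect to W is one, so the gradient computed for the binary weights is applied unchanged to the float weights. The example below is a hand-derived illustration under simplifying assumptions: a single neuron, no bias, no normalization or activation, and a squared-error loss. The names `train_step`, `lr`, and `target` are hypothetical.

```python
def quantize(params):
    """Mean-threshold binarization (Equation 1), returned as floats."""
    p_mean = sum(params) / len(params)
    return [1.0 if p > p_mean else 0.0 for p in params]


def train_step(w, x, target, lr):
    """One gradient step on the float weights using a binary forward pass."""
    w_q = quantize(w)                                   # forward pass uses binary weights
    z = sum(wq_i * x_i for wq_i, x_i in zip(w_q, x))
    loss = (z - target) ** 2
    # Straight-through step: dL/dw is taken to be dL/dw_q, because
    # w_q = w + NoGradient(Quant(w) - w) has unit derivative with respect to w.
    grad = [2.0 * (z - target) * x_i for x_i in x]
    w_new = [w_i - lr * g_i for w_i, g_i in zip(w, grad)]
    return w_new, loss


w = [0.2, -0.1, 0.4, -0.3]          # latent 32-bit weights (illustrative values)
x = [1.0, 2.0, 3.0, 4.0]
w_new, loss = train_step(w, x, target=6.0, lr=0.01)
print(loss, w_new, quantize(w_new))
```

In this example the float weights move after one step while their binarized values stay the same, which is why the full-precision copy is needed: a bit flips only once enough small updates have accumulated to carry a weight across the layer mean.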

It is important to observe that after training only the quantized 1-bit parameters ($W_q$ and $b_q$) are needed for inference, and the calculations performed in the BNFC layer are modified according to Algorithm 2.

Algorithm 2 Forward propagation calculation process in a binary normalized fully connected layer (BNFCL) after training

1: Input: x, binary weights W_q, binary bias b_q, activation function

2: Output: activations a

3: Apply linear transformation: z = W_q x + b_q

4: Normalize features of each example: z = Normalize(z)

5: Calculate activations: a = Activation(z)

6: return a
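The inference-time computation of Algorithm 2 can be sketched for a single example as follows. This is an illustrative pure-Python version, not the authors' code: the small constant `eps` added for numerical safety, the use of the population standard deviation, and the choice of ReLU as default activation are assumptions. With a kernel of shape (n_x, n_units), the linear step is written here as the product of the input row vector with the kernel.

```python
import math


def normalize(z, eps=1e-6):
    """Scale one example's features to zero mean and unit standard deviation."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [(v - mean) / math.sqrt(var + eps) for v in z]


def relu(z):
    return [max(0.0, v) for v in z]


def bnfcl_inference(x, w_q, b_q, activation=relu):
    """Algorithm 2 for one example; w_q has shape (n_x, n_units), entries 0 or 1."""
    n_units = len(b_q)
    z = [sum(x[i] * w_q[i][j] for i in range(len(x))) + b_q[j]
         for j in range(n_units)]
    return activation(normalize(z))


x = [0.5, -1.0, 2.0]
w_q = [[1, 0, 1],
       [0, 1, 1],
       [1, 1, 0]]
b_q = [0, 1, 0]
print(bnfcl_inference(x, w_q, b_q))
```

Since every weight is 0 or 1, each pre-activation is simply the sum of the inputs whose weight is 1, plus a bias of 0 or 1; the normalization step then restores a usable scale before the activation.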

2.2 Binary normalized convolutional layer (BNCVL)

The only difference between the binary normalized convolutional layer (BNCVL) and the binary normalized fully connected layer (BNFCL) is that a convolution operation is used between the filters (kernel with binary weights) and the input tensor of the layer, rather than a simple matrix multiplication. Equation (2) performs a convolution operation in calculating activations in a BNCV layer.


$$z = \texttt{Conv}(W_q x) + b_q \qquad (2)$$

where $\texttt{Conv}(W_q x)$ performs the convolution of $x$ by $W_q$. In this case $W_q$ holds the binary kernel parameters of the layer, a four-dimensional array with dimensions ($n_H, n_W, n_C, n_F$), where $n_H$ and $n_W$ are respectively the height and the width of the filters, $n_C$ is the number of channels of the input data, and $n_F$ is the number of filters used in the convolution layer; $b_q$ is the binary bias vector with $n_F$ elements. The forward propagation process in a binary normalized convolutional layer (BNCVL) is defined in Algorithm 3. It should be noted that in the BNCVL, Equation (2) replaces the linear transformation in Algorithm 1.
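A minimal sketch of the linear step in Equation (2) is given below. It assumes that the first two axes of the kernel index the spatial extent of each filter, that the input is a single example stored as nested lists with shape (height, width, channels), and that the convolution is the unflipped (cross-correlation) form used by most deep learning frameworks, with stride 1 and no padding. The models described later use padding to preserve spatial size; padding, normalization and activation are omitted here for brevity. The function name `binary_conv2d` is hypothetical.

```python
def binary_conv2d(x, w_q, b_q):
    """Stride-1, unpadded convolution of x (H, W, C) with binary w_q (n_H, n_W, n_C, n_F)."""
    n_h, n_w = len(w_q), len(w_q[0])
    n_c, n_f = len(w_q[0][0]), len(w_q[0][0][0])
    out_h, out_w = len(x) - n_h + 1, len(x[0]) - n_w + 1
    # Start every output position from the (binary) bias of its filter.
    z = [[[float(b_q[f]) for f in range(n_f)] for _ in range(out_w)]
         for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            for i in range(n_h):
                for j in range(n_w):
                    for k in range(n_c):
                        for f in range(n_f):
                            # With 0/1 weights the multiply becomes a conditional add.
                            if w_q[i][j][k][f]:
                                z[r][c][f] += x[r + i][c + j][k]
    return z


# 3x3 single-channel input and one 2x2 binary filter selecting the main diagonal.
x = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
w_q = [[[[1]], [[0]]],
       [[[0]], [[1]]]]
b_q = [1]
print(binary_conv2d(x, w_q, b_q))  # -> [[[7.0], [9.0]], [[13.0], [15.0]]]
```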

Algorithm 3 Forward propagation calculation process in a binary normalized convolutional layer (BNCVL)

1: Input: x, weights W, bias b, flag

attention_output = BATL(emb_dim=emb_dim, num_heads=num_heads)(seq, seq, seq)

5: Add and normalize: add_norm = Normalize(seq + attention_output)

6: Apply binary normalized fully connected layers (BNFCL):

7:   ffn_output = BNFCL(units=ff_dim, activation='gelu')(add_norm)

8:   ffn_output = BNFCL(units=emb_dim)(ffn_output)

9: Add and normalize again: output = Normalize(add_norm + ffn_output)

10: return output

In Algorithm 5, BNFCL(units, activation) represents the binary normalized fully connected layer, whose calculation process is shown in Algorithm 1; note that the gelu activation function is used in the first BNFC layer; Normalize() is the function that normalizes the features of each example so that it has zero mean and unit standard deviation; and BATL(emb_dim, num_heads) represents the binary multi-head attention layer. The calculation process of this attention layer is presented in Algorithm 6.

Algorithm 6 presents the forward propagation process in a binary multi-head attention layer (BATL). The required inputs are the input token sequences query, key and value, the causal mask (mask), the embedding dimension (emb_dim), and the number of heads in the attention layer (num_heads).

The functions used in Algorithm 6 are: LengthOfSequence() is a function that retrieves the length of a sequence; Reshape() is a function that rearranges the elements of a tensor according to the provided shape list; Permute() denotes a function that swaps the axes of a tensor based on the provided order list; Matmul() is a function that performs tensor multiplication according to the rules of linear algebra; Where() is the standard where function that selects between two values element-wise according to a condition; and Softmax() is the standard softmax function. All the other functions and terms used in Algorithm 6 have been defined previously.

Algorithm 6 Forward propagation calculation process in a binary multi-head attention layer (BATL)

1: Input: query, key and value, causal mask mask, embedding dimension emb_dim, number of heads num_heads

2: Output: final linear projection projection

3: Calculate number of keys: num_key = emb_dim // num_heads

4: Apply linear projections to get Q, K, V

5:   Q = BNFCL(units=emb_dim)(query)

6:   K = BNFCL(units=emb_dim)(key)

7:   V = BNFCL(units=emb_dim)(value)

8: Get sequence length from query: seq_len = LengthOfSequence(query)

9: Split each tensor into num_heads to support multi-head attention:

10:   Q = Reshape(Q, shape=[−1, seq_len, num_heads, num_key])

11:   K = Reshape(K, shape=[−1, seq_len, num_heads, num_key])

12:   V = Reshape(V, shape=[−1, seq_len, num_heads, num_key])

13: Permute axes of Q, K, V to support multi-head attention

14:   Q = Permute(Q, order=[0, 2, 1, 3])

15:   K = Permute(K, order=[0, 2, 1, 3])

16:   V = Permute(V, order=[0, 2, 1, 3])

17: Compute scaled dot-product attention scores:

18:   scale_dot = Matmul(Q, Permute(K, order=[0, 1, 3, 2])) / Sqrt(num_key)

19: Apply causal mask: scale_dot = Where(mask == 0, −1.0e10, scale_dot)

20: Apply softmax to get attention probabilities: attn_prob = Softmax(scale_dot, axis=−1)

21: Calculate attention: A = Matmul(attn_prob, V)

22: Reshape attention back to the original dimension:

23:   A = Permute(A, order=[0, 2, 1, 3])

24:   A = Reshape(A, shape=[−1, seq_len, num_heads ∗ num_key])

25: Apply final linear projection: projection = BNFCL(units=emb_dim)(A)

26: return projection
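The core of Algorithm 6 (scaling, causal masking, softmax and the weighted sum) can be sketched for a single head and a single sequence as below. The binary projections, the head splitting and the final projection are left out, and the sketch assumes that masked positions are filled with a large-magnitude negative number so that their softmax probability vanishes; the default `mask_value` of -1.0e10 and the function names are assumptions made for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v, mask_value=-1.0e10):
    """Single-head scaled dot-product attention with a causal mask."""
    seq_len, num_key = len(q), len(q[0])
    out = []
    for t in range(seq_len):
        scores = []
        for s in range(seq_len):
            dot = sum(q[t][d] * k[s][d] for d in range(num_key)) / math.sqrt(num_key)
            # Position t may only attend to positions s <= t.
            scores.append(dot if s <= t else mask_value)
        probs = softmax(scores)
        out.append([sum(probs[s] * v[s][d] for s in range(seq_len))
                    for d in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(causal_attention(q, k, v)[0])  # first token sees only itself -> [1.0, 0.0]
```

A fill value close to zero would not mask anything: calling `causal_attention(q, k, v, mask_value=-1.0e-10)` lets the first token receive a substantial contribution from later tokens.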

3 Image classification problem

For the multiclass image classification problem, a binary convolutional model is configured and the Food-101 dataset Bossard et al. (2014) is used. The Food-101 dataset has a total of 101,000 images in varying resolutions, covering 101 categories of food. This dataset is used for identifying the type of food in a dish. The data is divided into two sets: the training data with 75,750 images and the validation data with 25,250 images.

3.1 Configuration of the models with convolutional layers

Algorithm 7 outlines the binary convolutional model (BCVNN) for the image classification task. The inputs of the model are an image (image) and the filter dimension used in the convolutional layers (f). The output of the model is the set of 101 class probabilities calculated for the input image (prob). In Algorithm 7 the following functions are used: BNCVL() represents the binary normalized convolutional layer presented in Algorithm 3; MAXPOOL2D() represents a max-pooling layer; BNFCL() represents the binary normalized fully connected layer presented in Algorithm 1; and GLOBALAVG() is a standard global average pooling layer that averages a three-axis tensor across the first two dimensions, resulting in a tensor with only one axis. All convolutional layers use the relu activation function, a stride of 1, and padding to maintain the width and height of the tensors. All max-pooling layers use 2×2 windows and a stride equal to 2. The first and second binary normalized fully connected layers use relu activation, and the output layer uses softmax activation.

Algorithm 7 Binary convolutional model used for the image classification problem (BCVNN)

1: Input: image, filter dimension f

2: Output: class probabilities prob

3: First block of convolutional layers

4:   a1 = BNCVL(units=32, (f, f), activation='relu', padding='same')(image)

5:   a1 = BNCVL(units=32, (f, f), activation='relu', padding='same')(a1)

6:   a1 = MAXPOOL2D(window=(2, 2), stride=(2, 2))(a1)

7: Second block of convolutional layers

8:   a2 = BNCVL(units=64, (f, f), activation='relu', padding='same')(a1)

9:   a2 = BNCVL(units=64, (f, f), activation='relu', padding='same')(a2)

10:   a2 = MAXPOOL2D(window=(2, 2), stride=(2, 2))(a2)

11: Third block of convolutional layers

12:   a3 = BNCVL(units=64, (f, f), activation='relu', padding='same')(a2)

13:   a3 = BNCVL(units=64, (f, f), activation='relu', padding='same')(a3)

14:   a3 = MAXPOOL2D(window=(2, 2), stride=(2, 2))(a3)

15: Fourth block of convolutional layers

16:   a4 = BNCVL(units=128, (f, f), activation='relu', padding='same')(a3)

17:   a4 = BNCVL(units=128, (f, f), activation='relu', padding='same')(a4)

18:   a4 = MAXPOOL2D(window=(2, 2), stride=(2, 2))(a4)

19: Fifth block of convolutional layers

20:   a5 = BNCVL(units=256, (f, f), activation='relu', padding='same')(a4)

21:   a5 = BNCVL(units=256, (f, f), activation='relu', padding='same')(a5)

22:   a6 = GLOBALAVG()(a5)

23: Classification layers

24:   a7 = BNFCL(units=256, activation='relu')(a6)

25:   a8 = BNFCL(units=256, activation='relu')(a7)

26:   prob = BNFCL(units=101, activation='softmax')(a8)

27: return prob

3.2 Convolutional models training

To verify whether the number of parameters of the models influences training stability and performance of the binary models, two models with different filter dimensions are configured and trained: 3×3 and 5×5. The model with 3×3 filters has 5,132,165 parameters and the model with 5×5 filters has 13,505,925 parameters.
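As a quick worked example of what these parameter counts imply for storage, the sketch below compares the bytes needed for the parameters alone at 32 bits and at 1 bit each. It ignores any container or metadata overhead, and the helper name `model_bytes` is hypothetical.

```python
def model_bytes(num_params, bits_per_param):
    """Storage for the parameters alone, rounded up to whole bytes."""
    return (num_params * bits_per_param + 7) // 8


for name, n in [("3x3 filters", 5_132_165), ("5x5 filters", 13_505_925)]:
    full = model_bytes(n, 32)
    binary = model_bytes(n, 1)
    print(f"{name}: {full / 1e6:.2f} MB at 32 bits, {binary / 1e6:.2f} MB at 1 bit")
# 3x3 filters: 20.53 MB at 32 bits, 0.64 MB at 1 bit
# 5x5 filters: 54.02 MB at 32 bits, 1.69 MB at 1 bit
```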

To verify if the binary models are effective, two models with 32-bit floating-point parameters ("standard" models) with the same configurations as the binary models are also configured and trained. In these "standard" models, dropout layers are introduced after the first and second fully connected layers, with dropout rates of 0.4 and 0.3, respectively. Dropout is necessary in the standard models to prevent excessive overfitting.

In all models, connection weights and biases are initialized using the standard methods: Glorot Uniform for weights and zeros for biases. No regularization methods or parameter constraints are applied in the binary models, and only dropout is used in the standard models. Table 1 presents the hyperparameters used for training the convolutional models.
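For reference, Glorot Uniform initialization draws each weight from a uniform distribution on [-limit, limit], with limit = sqrt(6 / (fan_in + fan_out)). The sketch below is a minimal standard-library illustration, not the authors' code. The helper names and the example layer (3×3 filters, 128 input and 128 output channels) are chosen for illustration, and the fan computation follows the common convention of multiplying the channel counts by the receptive-field size.

```python
import math
import random


def glorot_limit(fan_in: int, fan_out: int) -> float:
    """Half-width of the uniform interval used by Glorot Uniform."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def conv_fans(kernel: int, in_channels: int, out_channels: int):
    """fan_in and fan_out of a square 2D convolution (common convention)."""
    receptive_field = kernel * kernel
    return receptive_field * in_channels, receptive_field * out_channels


def glorot_uniform(count: int, fan_in: int, fan_out: int, rng: random.Random):
    """Draw `count` weights uniformly from [-limit, limit]."""
    limit = glorot_limit(fan_in, fan_out)
    return [rng.uniform(-limit, limit) for _ in range(count)]


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
fan_in, fan_out = conv_fans(kernel=3, in_channels=128, out_channels=128)
weights = glorot_uniform(3 * 3 * 128 * 128, fan_in, fan_out, rng)
biases = [0.0] * 128  # biases start at zero

print(f"fan_in={fan_in}, fan_out={fan_out}, limit={glorot_limit(fan_in, fan_out):.5f}")
```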

Table 1: Hyperparameters used for training the convolutional models. A considerable number of training epochs is needed for the cost function to converge completely.

3.3 Results obtained with the models with convolutional layers

Figure 1 shows the training results for the convolutional models. The results from the standard models with 32-bit parameters are included as a benchmark for the desired performance. Multiple training runs were conducted for all models, and all results are very similar. These results are summarized in Table 2, which presents the best results obtained during training for each model.

Figure 1: Training results of the image classification problem with the convolutional models.

Table 2: Summary of the results of the image classification problem with the convolutional models.

Analyzing the training results of the convolutional models shown in Figure 1 and Table 2, the following observations can be made:

• The binary models are capable of training without any kind of instability, and their performance is almost equal to that of the standard models;
• The standard models learn more rapidly than the binary models, i.e., they need fewer epochs for training;
• The standard models present strong overfitting while the binary models do not show overfitting;
• The accuracies for the validation data of the standard models are slightly better than those of the binary models;
• The binary model with 5×5 filters presents better performance than the binary model with 3×3 filters;
• The results of the binary models are very good considering that they have only 1-bit parameters.

It is important to observe that the binary normalized layers are effective in solving the problems of training instability and low accuracy in models with binary parameters. According to the study by Cabral and Driemeier (2025), which analyzed the impact of low-resolution parameters on the performance of neural networks, models with binary parameters are not able to train effectively.
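To make the role of the normalization concrete, the following is a minimal sketch of a forward pass through a fully connected layer with 0/1 parameters. It is not the authors' implementation: the binarization rule (a parameter becomes 1 when it exceeds the mean of its tensor and 0 otherwise), the small epsilon, and all function names are assumptions made for illustration, and the layer definition in Algorithm 1 takes precedence. Training details, such as how gradients pass through the binarization, are omitted.

```python
import random
import statistics


def binarize(values):
    """Map real values to {0, 1}. Thresholding at the mean is an assumption."""
    threshold = statistics.fmean(values)
    return [1 if v > threshold else 0 for v in values]


def standardize(z, eps=1e-6):
    """Shift and scale one example's features to zero mean and unit std."""
    mean = statistics.fmean(z)
    std = statistics.pstdev(z)
    return [(v - mean) / (std + eps) for v in z]


def binary_fc_preactivation(x, weights, biases):
    """x: n_in inputs; weights: n_in rows of n_out real values; biases: n_out."""
    n_in, n_out = len(weights), len(weights[0])
    flat_q = binarize([w for row in weights for w in row])
    w_q = [flat_q[i * n_out:(i + 1) * n_out] for i in range(n_in)]
    b_q = binarize(biases)
    # With 0/1 weights each output is a plain sum of the selected inputs,
    # so its magnitude grows with the number of inputs.
    return [
        sum(x[i] * w_q[i][j] for i in range(n_in)) + b_q[j]
        for j in range(n_out)
    ]


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
n_in, n_out = 64, 8
x = [rng.uniform(0.0, 1.0) for _ in range(n_in)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
biases = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]

z = binary_fc_preactivation(x, weights, biases)
z_norm = standardize(z)
activations = [max(0.0, v) for v in z_norm]  # ReLU

print("mean before normalization:", statistics.fmean(z))
print("mean after normalization: ", statistics.fmean(z_norm))
```

In this toy setting the raw pre-activations carry a large positive offset because every weight is either 0 or 1; standardizing them per example removes the offset and fixes the scale before the activation is applied.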

4 Language decoder problem

For the language decoder problem, a binary transformer model is configured and the WikiText-103-raw dataset is used. This dataset was created by Salesforce Research (Merity et al., 2016) and is primarily sourced from English Wikipedia. Specifically, it was built from high-quality Wikipedia articles to provide clean and representative data for training language models. It includes 25,000 carefully selected Wikipedia articles, containing around 103 million words. The "raw" version preserves the original punctuation and basic formatting, unlike the tokenized version. The dataset was pre-processed into sentences, yielding 782,208 examples. The data was divided into a training dataset with 95% of the examples and a validation dataset with the remaining 5%.
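A minimal sketch of the 95%/5% split is shown below. Whether the examples were shuffled before splitting and how the boundary was rounded is not stated, so the shuffling, the seed, the rounding down of the training size, and the function name are all assumptions.

```python
import random

NUM_EXAMPLES = 782_208  # sentences obtained after pre-processing


def split_indices(num_examples: int, train_percent: int = 95, seed: int = 0):
    """Return (train, validation) index lists for a percentage split."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    n_train = num_examples * train_percent // 100  # assumption: round down
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(NUM_EXAMPLES)
print(len(train_idx), "training examples,", len(val_idx), "validation examples")
```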

The text is tokenized using the WordPieceTokenizer (Song et al., 2021), which uses a sub-word strategy. Its vocabulary size is 30,522, and any token not appearing in the vocabulary is replaced by [UNK] ("unknown").
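The sub-word strategy can be illustrated with the greedy longest-match-first procedure commonly used for WordPiece: each word is split into the longest vocabulary pieces available, continuation pieces carry a "##" prefix, and a word that cannot be fully covered is replaced by [UNK]. The sketch below is a simplified stand-in, not the tokenizer used in the experiments; the toy vocabulary and the function name are hypothetical, whereas the real vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ize", "##r", "##s", "bin", "##ary", "[UNK]"}

for word in ["tokenizers", "binary", "qubit"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```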

4.1 Configuration and training the language decoder

Algorithm 8 outlines the binary language decoder model (BLM). The inputs for the model are the sequence of tokens (seq), the maximum sequence length max_len, the embedding dimension emb_dim, the number of attention heads num_heads, the vocabulary size vocab_size, and the numbers of units in the MLP head, mlp_units_0 and mlp_units_1. The outputs of the model are the vocab_size probabilities calculated for the next token (probs). In Algorithm 8 the following functions are used: BEMB() represents the binary embedding layer presented in Algorithm 3; Normalize() is the function that normalizes the features of each example so that they have zero mean and unit standard deviation; BTFB() is the transformer block presented in Algorithm 5; BNFCL() represents the binary normalized fully connected layer presented in Algorithm 1. The activation function of the MLP head layers is gelu, and that of the last layer is softmax.
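The two activation functions named above can be written out directly. The sketch below uses the exact (erf-based) form of GELU and a numerically stable softmax; whether the experiments used the exact or the tanh-approximated GELU is not stated, so that choice is an assumption, as are the function names and the toy logits.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits):
    """Softmax with the maximum subtracted first to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for a hypothetical 4-token vocabulary.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

# The most probable next token is the index with the highest probability.
next_token = max(range(len(probs)), key=probs.__getitem__)
print("probabilities:", [round(p, 4) for p in probs], "next token id:", next_token)
```

In the decoder, the softmax is taken over vocab_size entries, so probs has one probability per vocabulary token.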

Algorithm 8 Binary language decoder model (BLM)

1: Input: token sequences seq, maximum sentence length max_len, embedding dimension emb_dim, number of attention heads num_heads, vocabulary size vocab_size, number of units in the MLP head layers mlp_units_0 and mlp_units_1

2: Output: class probabilities probs

3: Embedded coding of the token sequences: embs = BEMB(max_len, emb_dim, vocab_size)(seq)

4: Embedding normalization: x = Normalize(embs)

5: Pass through a sequence of transformer blocks:

6: for i from 1 to num_blocks do

7: x = BTFB(emb_dim, num_heads, ff_dim=2*emb_dim)(x)

8: end for

9: Process transformer output with fully connected layers (MLP head)

10: features = BNFCL(units=mlp_units_0, activation='gelu')(x)

11: features = BNFCL(units=mlp_units_1, activation='gelu')(features)

12: Final fully connected layer to calculate the probabilities

13: probs = BNFCL(units=vocab_size, activation='softmax')(fe
