Understanding the statistical mechanics: how moments and bias correction drive optimization
[Cover image. Source: Author]
At the heart of every deep learning model lies a simple goal: minimizing error. We measure this error using something called a cost function (or objective function). But knowing the error isn’t enough; the model needs a way to learn from it and improve over time. This is where optimization algorithms come in; they guide the model to update its internal parameters to get better results. Among the many algorithms available, one stands out as the absolute ‘star’ of the industry due to its efficiency and effectiveness: the Adam optimizer.
In almost all projects we use the Adam optimizer.
1. Do you ever wonder why this specific algorithm is the star among objective-function optimizers in deep learning?
2. Have you wondered how this algorithm works?
3. Are you interested in learning the mathematical magic behind this algorithm?
If you would like an answer to any of the questions above, then this article is for you. It heavily discusses the math behind the algorithm, and it is assumed that you are familiar with statistics, linear algebra, and calculus.
1. Why Adam?
In traditional ML models we use the well-known gradient descent optimizers, specifically the stochastic variant, to train our models. This works well and is the go-to for most ML projects. But deep learning poses a different challenge. 1️⃣ In deep learning we usually work with massive volumes of data, and we train on mini-batches instead of the whole dataset. Splitting the data into mini-batches and optimizing on each introduces uncertainty into the optimization. In other words, the optimization process becomes stochastic, varying from batch to batch: some batches contain samples that produce large gradients, while others produce smaller ones. This variability is loosely reminiscent of the exploding and vanishing gradients scenario. The gradients thus form a random process, and this is the main problem that Adam solves.
2️⃣ Another important problem with the neural network architecture is the shape of the cost function. If you are familiar with optimization and the conditions for optimality, you know that the cost function should be convex, or bowl-shaped, for optimization algorithms like stochastic gradient descent to work reliably. But due to the nested architecture of neural networks, the non-linearity introduced in the layers makes the cost function non-convex. So our usual optimization methods cannot be used as-is. Adam addresses this: it treats the task as a non-convex stochastic optimization problem and uses statistics to solve it. Apart from this, it is invariant to gradient scaling, requires little memory, and is computationally efficient. Next we discuss a few important prerequisites for understanding the algorithm. If you are familiar with moving averages and variances and their exponential counterparts, you can skip sections 2 and 3.
2. Exponential moving average:
The simple moving average method, which is different from the exponential moving average method we are about to discuss, uses a fixed-size window to average consecutive data points. An example calculation for a window size of 3 is given below.
[Image: SMA example calculation with window size 3. Source: Author]
As you can see, it gives equal weight to every observation in the window.
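To make the window mechanics concrete, here is a minimal plain-Python sketch of the simple moving average (the function name and sample values are my own illustration):

```python
def simple_moving_average(data, window=3):
    """Average over each consecutive window; every observation gets equal weight."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]

observations = [10, 12, 14, 16, 18]
print(simple_moving_average(observations))  # [12.0, 14.0, 16.0]
```

Note that every average needs all `window` observations in memory, which is exactly the cost the EMA below avoids.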
Now let us discuss the exponential moving average method and why it is more efficient than the simple moving average. The exponential moving average, or EMA, combines the average from the previous step with the observation at the current step to compute the moving average cheaply. The formula is given below.
EMAₜ = β · EMAₜ₋₁ + (1 − β) · θₜ  (Source: Author)
In the simple moving average, or SMA, method, we use a window of a particular size to control how much the old observations influence our average. In the EMA method we control this with the decay factor: if the decay factor is smaller, old observations fade out quickly and have less influence on the moving average, and vice versa. There are a few important benefits of using EMA over SMA. 1️⃣ Unlike SMA, which needs to keep all the observations of the window in memory, EMA only requires the EMA from the previous step; it is therefore memory efficient, and each update costs O(1). For these reasons, EMA outperforms SMA. 2️⃣ EMA is very sensitive to recent observations. As the older observations decay, the newer observations carry more weight, so a drastic change in the observations shows up quickly in the EMA. In SMA, since the observations are weighted equally, there is a lag before sudden changes become visible. An example plot is given below.
[Plot: EMA vs SMA response to a sudden change in the observations. Source: Author]
As you can see in the first plot, the magnitude of the observations is almost constant. In the second plot, the magnitude changes for the observations in the middle. With SMA, the third window cannot identify this drastic change in magnitude. With EMA, the change is detected as soon as we move to the observations in the middle: the previous EMA will be lower, and the current EMA will be higher.
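The EMA recursion and its quick reaction to a sudden jump can be sketched in a few lines of Python (the decay factor of 0.5 and the sample series are illustrative choices of my own, not values from the text):

```python
def ema(data, beta=0.9):
    """EMA_t = beta * EMA_{t-1} + (1 - beta) * x_t, seeded with the first observation."""
    out = [data[0]]
    for x in data[1:]:
        out.append(beta * out[-1] + (1 - beta) * x)
    return out

# Flat series with a sudden jump in the middle: the EMA reacts immediately,
# while a window-3 SMA would need a whole window of new points to catch up.
series = [1, 1, 1, 10, 10, 10]
print(ema(series, beta=0.5))  # [1, 1.0, 1.0, 5.5, 7.75, 8.875]
```

Only the previous EMA value is kept between steps, which is the O(1) memory property mentioned above.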
3. Exponential moving variance:
Exponential moving variance is similar to the exponential moving average: we control how much the spread of the old observations influences the estimate using the decay factor. The formula for EMV is given below.
EMVₜ = β · EMVₜ₋₁ + (1 − β) · θₜ²  (Source: Author)
Here, we square the current observation to avoid negative observations cancelling against positive ones. Squaring also amplifies large values, so any drastic change in the spread of the observations is quickly picked up by the EMV. The EMV, however, produces the uncentered variance. In a regular variance calculation we subtract the mean of the observations to center them. If we centered our observations before calculating the EMV, the EMV would lose its ability to react quickly to sudden variance changes; in other words, we must preserve the magnitude of the observations. Consider the example below.
[Image: example comparing EMV on raw vs mean-centered observations. Source: Author]
From the above example, you can see that if we center the observations, we lose the information about their large magnitude, and the EMV calculation becomes less sensitive to sudden changes. To avoid this, we calculate the second moment, E[Θ²]. (If you are unfamiliar with moments in statistics, it is worth reading up on them.) Similar to the EMA, the exponential moving variance detects sudden changes, but in the spread of the observations rather than in their level. The plot below is similar to the one we have seen, but for the spread. A regular moving-variance method fails to identify sudden changes in the spread quickly; the EMV picks them up as soon as the spread changes.
[Plot: EMV vs simple moving variance response to a sudden change in spread. Source: Author]
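To see the EMV react to a sudden widening of the spread, here is a small illustrative sketch (the decay factor of 0.5 and the sample series are my own choices):

```python
def emv(data, beta=0.9):
    """Exponential moving (uncentered) variance:
    v_t = beta * v_{t-1} + (1 - beta) * x_t**2, starting from v_0 = 0."""
    v, out = 0.0, []
    for x in data:
        v = beta * v + (1 - beta) * x * x
        out.append(v)
    return out

# The spread jumps from +-1 to +-5 halfway through. Squaring keeps the
# signs from cancelling, so the EMV shoots up as soon as the spread widens.
series = [1, -1, 1, -1, 5, -5, 5, -5]
print(emv(series, beta=0.5))
```

Because the observations are squared rather than centered, the sign-alternating values do not average out to zero, and the jump from 1 to 25 in the squared magnitude is visible after a single step.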
4. Adam algorithm:
The Adam algorithm is clever. Since we are not dealing with a convex cost function, it replaces convex-optimization machinery with statistics. In particular, it uses the first and the second moments of the gradients to set the intensity of each optimization step.
The Adam update rules, repeated until Θₜ converges (Source: Author):
gₜ = ∇Θ fₜ(Θₜ₋₁)
mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²
m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)
Θₜ = Θₜ₋₁ − α · m̂ₜ / (√v̂ₜ + ε)
From the algorithm above, we can recognize the points made in the previous sections. 1️⃣ Our cost function f(Θ) is stochastic; as discussed earlier, this is due to batching. 2️⃣ Our cost function is non-convex. Since it is also stochastic, the gradient is treated as an observation from a random process with unknown mean and variance.
Now let us analyze the algorithm in detail. At each step:
1️⃣ The algorithm first calculates the gradient of our objective function, gₜ. Remember that we treat the gradient as coming from a random process.
2️⃣ For that reason, the algorithm calculates the exponential moving average of the gradients, combining the gradient from the current step t with the EMA from the previous step t−1 to produce the first moment mₜ.
3️⃣ Next, it calculates the exponential moving variance of the gradients, combining the squared gradient (to avoid cancellation of negative values) from the current step t with the EMV from the previous step t−1 to produce the second moment vₜ.
4️⃣ The sensitivity of the EMA is controlled by the decay factor β₁, and the sensitivity of the EMV by β₂. The algorithm uses the default values β₁ = 0.9 and β₂ = 0.999.
5️⃣ The algorithm then calculates bias-corrected versions of the first moment (m̂ₜ) and the second moment (v̂ₜ). The EMA and EMV are biased towards the initial vector, which is a vector of 0s. With the decay factors given in point 4, the EMA and EMV are dominated by this initial vector of zeros, which shrinks the moving average and moving variance during the initial iterations. So bias-corrected first and second moments are calculated. More on this in section 5.
6️⃣ Finally, the step is computed by multiplying the learning rate α by the ratio of the bias-corrected first moment to the square root of the bias-corrected second moment. This effectively sets the size of the step to take. The intuition behind this is given in section 6.
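The six steps above can be sketched as a bare-bones NumPy implementation (an illustrative sketch following the update rules described in this section, not a production optimizer; the toy objective f(θ) = θ² is my own example):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and the updated moments."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMV of gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta**2, whose gradient is 2 * theta
theta = np.array([5.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.05)
print(theta)  # has moved from 5.0 to near the minimum at 0
```

Note that the step counter t starts at 1, otherwise the bias-correction denominators 1 − βᵗ would be zero on the first update.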
5. Bias correction:
In the previous section it was mentioned that the initial iterations of the gradient descent step are biased towards the initial moment vectors we choose, which are zero vectors for both the first and the second moment. The EMA calculation is highly sensitive to this initial vector because β₁ = 0.9: even if our current gradient is large, like [20, 15, 40], the EMA is [0, 0, 0]·(0.9) + [20, 15, 40]·(0.1) = [2, 1.5, 4]. We effectively lose the magnitude of our gradient vector and take tiny steps. The same goes for the EMV. These initial steps are crucial in deep learning models with billions of parameters, where wasted iterations incur high computational costs. To correct this, the algorithm uses bias correction. To understand it completely, we will take a short dive into statistics and estimators.
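The shrinkage described above, and how dividing by (1 − β₁ᵗ) undoes it, can be reproduced numerically (a small sketch reusing the [20, 15, 40] example; feeding the same gradient for three steps is my own extension):

```python
beta1 = 0.9
grads = [[20.0, 15.0, 40.0]] * 3   # the same large gradient, three steps in a row
m = [0.0, 0.0, 0.0]                # zero-initialized first moment

for t, g in enumerate(grads, start=1):
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    print(t, [round(x, 2) for x in m], [round(x, 2) for x in m_hat])
# m crawls up slowly (roughly [2.0, 1.5, 4.0] at t=1), while m_hat recovers
# the true gradient magnitude, roughly [20.0, 15.0, 40.0], at every step.
```

Without the correction, the early updates would be roughly ten times too small here.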
Since our gradients are stochastic and assumed to come from an unknown distribution with unknown mean and variance, we need estimators for the mean and the uncentered variance. In our case the estimators are mₜ and vₜ. But these estimators should be unbiased. If you are familiar with this statistical term, you already understand what bias correction is doing here.
Remember that the EMA and EMV methods are recursive. Following the Adam paper, we will unroll the exponential moving variance here. The formula for the EMV in recursive form is given below.
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · Θₜ²  (Source: Author)
This recursive form can be rewritten in the general, unrolled form below.
vₜ = (1 − β₂) Σᵢ₌₁ᵗ β₂^(t−i) · Θᵢ²  (Source: Author)
For the Adam algorithm we replace the notation Θ with our gradient notation g, as we are dealing with gradients.
vₜ = (1 − β₂) Σᵢ₌₁ᵗ β₂^(t−i) · gᵢ²  (Source: Author)
Remember that vₜ is the estimator for the uncentered variance of our gradient distribution. To check whether it is an unbiased estimator, we take the expectation E[vₜ].
E[vₜ] = E[(1 − β₂) Σᵢ₌₁ᵗ β₂^(t−i) · gᵢ²] = E[gₜ²] · (1 − β₂ᵗ) + ζ  (Source: Author)
From the calculation, E[gₜ²] − E[gᵢ²] is the error term: it measures how mistaken the current step is due to the influence of the i-th step. The paper argues that for the chosen values of the decay factor β this error term is negligible. We are left with
E[vₜ] ≈ E[gₜ²] · (1 − β₂ᵗ)  (Source: Author)
But this estimator is biased: its expectation is scaled down by the factor (1 − β₂ᵗ). We need to find an unbiased estimator v̂ₜ.
v̂ₜ = vₜ / (1 − β₂ᵗ), so that E[v̂ₜ] ≈ E[gₜ²]  (Source: Author)
From the above derivation, we have the unbiased estimator v̂ₜ, which corrects the bias introduced by the zero initialization during the early steps. The same derivation can be repeated to obtain the bias-corrected first moment m̂ₜ. This is one of the key parts of the algorithm.
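The expectation result E[vₜ] ≈ E[gₜ²] · (1 − β₂ᵗ) can also be sanity-checked by simulation: draw i.i.d. standard-normal gradients (so E[g²] = 1), run the EMV recursion, and compare the average vₜ with its bias-corrected version (a numerical check of my own, not part of the paper):

```python
import random

random.seed(0)
beta2, t_max, trials = 0.999, 50, 2000

v_sum = 0.0
for _ in range(trials):
    v = 0.0
    for t in range(1, t_max + 1):
        g = random.gauss(0, 1)               # gradient drawn i.i.d., so E[g^2] = 1
        v = beta2 * v + (1 - beta2) * g * g  # EMV recursion with zero init
    v_sum += v

avg_v = v_sum / trials
print(avg_v)                          # roughly (1 - beta2**t_max), i.e. about 0.049, not 1.0
print(avg_v / (1 - beta2 ** t_max))   # bias-corrected: roughly 1.0
```

After only 50 steps the raw vₜ underestimates the true second moment by a factor of about twenty, and dividing by (1 − β₂ᵗ) recovers it.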
6. Step size:
From the above section and the Adam algorithm, you might wonder why the first and the second moments are calculated and how they are used to control the momentum of the steps. Using the EMA and EMV, the algorithm identifies a sudden difference in the current gradient. Imagine we have been on a steep slope for the previous steps; our gradients point in almost the same direction, so there is no drastic difference in the EMA and EMV between the previous and the current steps. But as soon as the slope gets less steep, the gradient direction changes towards the sides of the level curve, either left or right. This drastic change is identified by the EMA and EMV, and the step is corrected to be shorter.
[Image: gradient directions on a steep slope vs near a flat region of the cost surface. Source: Author]
The step size is calculated using the unbiased estimators of the first moment (m̂ₜ) and the second moment (v̂ₜ). Their ratio gives the effective step size to take, depending on the gradient direction. 1️⃣ If the bias-corrected first moment (m̂ₜ) is large and the bias-corrected second moment (v̂ₜ) is small, the gradients agree and vary little, so it is safe to take long steps and converge to the optimum more quickly. 2️⃣ If the bias-corrected first moment (m̂ₜ) is small and the bias-corrected second moment (v̂ₜ) is large, the gradients vary considerably from step to step, so it is wise to take shorter steps.
Here, a tiny value ε is added to the denominator to avoid division by 0. The ratio m̂ₜ/√v̂ₜ is also called the signal-to-noise ratio. This process continues until our cost function converges.
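The two scenarios above can be checked numerically: a consistent gradient sequence gives a signal-to-noise ratio near 1 (long steps), while sign-flipping gradients give a ratio near 0 (short steps). This is an illustrative sketch of my own using the paper's default β values:

```python
import math

def snr(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Feed a gradient sequence and return |m_hat| / (sqrt(v_hat) + eps)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment
        v = beta2 * v + (1 - beta2) * g * g    # second moment
    m_hat = m / (1 - beta1 ** t)               # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return abs(m_hat) / (math.sqrt(v_hat) + eps)

print(snr([1.0] * 20))        # consistent gradients: ratio near 1, long steps
print(snr([1.0, -1.0] * 10))  # sign-flipping gradients: ratio near 0, short steps
```

Multiplied by the learning rate α, this ratio is exactly the magnitude of the Adam parameter update.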
Conclusion:
In this article, we journeyed through the mathematical foundations of the Adam optimizer, exploring how it elegantly combines the benefits of momentum (the first moment) and adaptive learning rates (the second moment). We saw how the bias correction mechanism ensures stability during the crucial early phases of training, and how the signal-to-noise-ratio intuition allows Adam to optimize complex, non-convex loss functions efficiently.
By treating gradients as random variables and estimating their moments, Adam provides a robust solution to the challenges of training deep neural networks. While newer optimizers continue to emerge, Adam’s balance of theoretical soundness and practical performance ensures it remains the “gold standard” in the deep learning toolbox.
Reference:
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
To discuss optimization algorithms or suggest future topics, connect with me on LinkedIn.