Understanding the Design of Optimizers with me

Ok, it’s midnight on Halloween, and I’m talking about optimizers. Such a thrill lol. The goal today is to show you how AdamW is calculated mathematically, and what the intention is behind its design.

So, what is an optimizer in the context of LLMs? First of all, the name is rather deceptive to me. An optimizer is less of an object and more of a verb/methodology/concept.

An optimizer is a WAY to update the parameters of our LLM during training. Specifically, it defines how each weight value gets updated during backprop, for example the weights in our attention matrices or in the linear layers of our MLP blocks.

In its simplest form, using the gradient we get from backprop via the chain rule, we have the classic optimizer called Stochastic Gradient Descent (SGD):

$$\theta_t = \theta_{t-1} - \eta \, \nabla_\theta L(\theta_{t-1})$$
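To make this concrete, here is a minimal sketch of one SGD step in plain NumPy. The function name, the toy parameter matrix, and the gradient values are all illustrative assumptions of mine, not code from any particular framework:

```python
import numpy as np

def sgd_step(theta: np.ndarray, grad: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One vanilla SGD update: theta_t = theta_{t-1} - lr * grad."""
    return theta - lr * grad

# Toy example: a small weight matrix and a pretend gradient dL/dtheta from backprop.
theta = np.array([[0.5, -0.3],
                  [0.1,  0.8]])
grad = np.array([[0.04, -0.02],
                 [0.01,  0.10]])

theta = sgd_step(theta, grad, lr=0.1)
print(theta)  # each weight moved a small step against its gradient
```

Every weight simply moves a small step in the direction that decreases the loss, scaled by the learning rate η. Everything that follows (momentum, Adam, AdamW) is a refinement of this one update rule.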
