
How mathematics from the age of Gauss explains the core mechanism of modern AI
The transformer sits at the heart of today’s most powerful AI systems. Whether it’s ChatGPT summarizing documents, Claude writing code, or Gemini generating images, one mechanism makes all of this possible: self-attention.
Yet despite its success, self-attention (or simply attention in what follows) feels like an engineered trick. We project inputs into query, key, and value vectors; take dot products; apply softmax; and mix the value vectors together. It works astonishingly well — but why?
The surprising answer is that attention is not an ad hoc invention at all. It is the modern incarnation of a classical statistical estimation technique developed by Carl Friedrich Gauss in the early 1800s to track celestial bodies.
Gauss’s problem: estimating an unknown quantity from noisy measurements
In 1801, the astronomer Giuseppe Piazzi discovered Ceres, the first known asteroid. He tracked it for 40 nights before it disappeared behind the Sun. The challenge: could anyone predict where Ceres would reappear months later, using only Piazzi's noisy observations?
A 24-year-old Gauss took on the challenge. His solution combined the noisy measurements into a best estimate that predicted — with stunning accuracy — when and where Ceres would reappear. His method was later formalized as weighted least squares (WLS).
The idea is simple. You have noisy measurements of an unknown quantity that you want to estimate. If the measurements are equally trustworthy, average them; if some are more reliable, give them more weight. That is WLS: the best estimate is a weighted average of the observations.
Note that WLS does not tell you what the weights should be; it only says to compute the weighted average once you have them. For the Ceres problem, Gauss derived suitable weights on his own, drawing on his deep understanding of how astronomical measurements are acquired.
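As a small, purely illustrative example (the numbers are made up; this is not Gauss's actual calculation), here is the classic inverse-variance choice of weights, where noisier measurements count for less:

```python
import numpy as np

# Four noisy measurements of the same unknown quantity,
# each with its own (known) noise variance.
measurements = np.array([10.2, 9.8, 10.5, 9.1])
variances = np.array([0.1, 0.1, 0.5, 2.0])  # larger variance = less trustworthy

# Inverse-variance weights, normalized to sum to 1.
weights = 1.0 / variances
weights /= weights.sum()

# The WLS estimate is simply the weighted average.
estimate = weights @ measurements
print(weights.round(3), round(estimate, 3))
```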
Since then, the idea of best estimates as weighted averages has become a workhorse of science and engineering. It appears in signal processing, control theory, economics, and countless other fields. And — as it turns out — in transformers.
What does attention do?
First, every token of the input prompt is embedded and then projected into query, key, and value vectors by learned weight matrices inside the pre-trained transformer.
Attention then updates each token’s representation by a weighted average of value vectors. For a given token, the weights are computed by applying softmax to the scaled dot products between that token’s query vector and the key vectors of all tokens in the sequence.
The result is a sequence of contextualized representations, each informed by the entire input sequence, ready for downstream tasks such as next-token prediction.
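In code, the whole mechanism fits in a few lines. The sketch below is a minimal, single-head version with toy dimensions; it leaves out the learned projection matrices, masking, multiple heads, and batching of a real transformer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Q, K, V: (n_tokens, d) arrays of query, key, and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # scaled query-key dot products
    weights = softmax(scores, axis=-1)   # one weight distribution per token
    return weights @ V                   # weighted average of the value vectors

# Toy example: 4 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
contextualized = self_attention(Q, K, V)  # shape (4, 8)
```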
Sound familiar?
For each token, attention computes a weighted average — just like a WLS estimate.
- The value vectors act as observations.
- The attention weights reflect how relevant each key is to the query: the greater the relevance, the greater the weight.
But unlike vanilla WLS, where the weights must be supplied beforehand, attention computes its weights on the fly: each prompt generates its own pattern of weights via the softmax of scaled dot products.
This raises a natural question: Does the softmax of query-key dot products produce weights that are optimal in some sense?
Entropy: the missing ingredient
Here’s the conceptual leap that ties everything together.
Imagine a WLS problem where you want the data itself to determine the weights. There are infinitely many possibilities. One extreme collapses all weights onto a single observation, driving the error to zero — a mathematically valid but useless solution. How do we avoid this?
One approach is to regularize WLS using Shannon entropy. This entropy-regularized WLS — which we call eWLS — minimizes estimation error while encouraging weights to spread across observations. The best estimate is still a weighted average, but the optimal weights take a specific form known as the Gibbs distribution.
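For readers who want the formula, here is one standard way to write the problem down (the notation is ours: v_i are the observations, c_i is the cost assigned to observation i, and T > 0 is a temperature controlling the strength of the entropy term):

```latex
\min_{w_i \ge 0,\ \sum_i w_i = 1} \;\; \sum_i w_i\, c_i \;-\; T\, H(w),
\qquad H(w) = -\sum_i w_i \log w_i .
```

A short Lagrange-multiplier calculation gives the optimal weights and the corresponding estimate:

```latex
w_i^{\star} = \frac{\exp(-c_i / T)}{\sum_j \exp(-c_j / T)},
\qquad \hat{x} = \sum_i w_i^{\star}\, v_i .
```

The weights here are exactly a Gibbs distribution over the costs.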
And guess what? softmax is the Gibbs distribution.
The equivalence: transformer attention is eWLS in disguise
Putting it together:
- Attention computes query–key dot products, applies softmax, and uses the resulting weights to mix the value vectors.
- eWLS takes a similarity score, converts it to Gibbs weights, and mixes the value vectors using these weights.
- If we set the eWLS cost function equal to the negative of attention’s query–key dot products, the resulting Gibbs weights become the softmax weights.
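A quick numerical check makes the equivalence concrete. In the sketch below (toy dimensions, random vectors, no learned projections), we set the eWLS cost of each observation to the negative scaled query-key dot product and the temperature to 1, and verify that the Gibbs weights coincide with the softmax attention weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 16
q = rng.normal(size=d)       # query vector of one token
K = rng.normal(size=(n, d))  # key vectors of all tokens
V = rng.normal(size=(n, d))  # value vectors of all tokens (the "observations")

# Attention weights: softmax of the scaled dot products.
scores = K @ q / np.sqrt(d)
attention_weights = np.exp(scores) / np.exp(scores).sum()

# eWLS weights: Gibbs distribution with cost = -score and temperature T = 1.
cost, T = -scores, 1.0
gibbs_weights = np.exp(-cost / T) / np.exp(-cost / T).sum()

assert np.allclose(attention_weights, gibbs_weights)  # identical weight patterns
estimate = gibbs_weights @ V  # the eWLS estimate = the attention output for this token
```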
Which means attention — despite its modern gloss — is the solution to an entropy-regularized WLS problem. Those “mysterious” steps — projections, dot products, softmax, weighted sums — are simply the machinery for solving a well-known optimization problem.
Gauss would’ve recognized it instantly.
Why this matters: interpretability and explainability
This equivalence isn’t just a curiosity — it has practical consequences.
First, it gives us interpretability:
- Query–key dot products define a learned similarity measure — a way to quantify which tokens matter for the current estimate.
- Softmax isn’t an arbitrary normalization trick; it’s the optimal way to convert similarities into weights while preventing collapse onto a single observation.
- Temperature controls the trade-off: lower temperature sharpens attention onto a few tokens; higher temperature spreads it more broadly.
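The last point is easy to see with a toy example (the scores below are made up):

```python
import numpy as np

def softmax_with_temperature(scores, T):
    z = np.exp(scores / T)
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(scores, T=0.1))   # sharp: almost all weight on one token
print(softmax_with_temperature(scores, T=10.0))  # broad: weights close to uniform
```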
Second, it explains why transformers use multiple attention heads. Each head learns its own similarity measure through its query, key, and value projections — one might capture syntax, another semantics, another positional patterns. In eWLS terms, each head solves a different estimation problem.
Why this matters: a design space unlocked
Once you see attention as eWLS, an entire design space opens up.
- Alternative losses: L1 for sparsity; Huber for robustness; Mahalanobis for structured similarity.
- Alternative entropies: Tsallis for heavier tails; sparse regularizers for exactly zero weights.
- Alternative kernels: Gaussian for locality; polynomials or wavelets for richer interactions; learned kernels that adapt to data.
- Efficient attention: if the kernel has structure (sparse, low-rank, state-space-like), the Gibbs weights can be computed without forming the full n×n matrix, offering a principled path to linear-time attention.
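As one concrete point in this design space, here is a hypothetical variant that replaces the dot-product similarity with a Gaussian (squared-distance) kernel while keeping the Gibbs/softmax weighting; the function name and bandwidth parameter are ours, not from any particular library:

```python
import numpy as np

def gaussian_kernel_attention(Q, K, V, bandwidth=1.0):
    """Attention with similarity = -||q - k||^2 / (2 * bandwidth^2) instead of a dot product."""
    # Squared Euclidean distance between every query and every key: (n, n) matrix.
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    scores = -sq_dists / (2.0 * bandwidth ** 2)
    # Gibbs / softmax normalization of the scores, row by row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With this choice, attention behaves like a local smoother: tokens whose queries and keys are close in embedding space dominate the average, with the bandwidth playing the role of temperature.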
Attention design becomes structured exploration, not trial-and-error.
Why this matters: connections to other ideas
The equivalence also reveals deep connections between attention and other areas of estimation and filtering.
- Kernel smoothing: Self-attention is an adaptive kernel smoother with learned bandwidth.
- Kalman filtering: The Kalman gain plays a role similar to self-attention weights. Recent architectures like Mamba exploit this connection.
- Bayesian inference: Softmax weights can be interpreted as approximate posteriors.
- Free-energy principle: eWLS minimizes a free-energy functional (expected cost minus entropy), connecting transformers to theoretical neuroscience.
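The kernel-smoothing connection, for instance, is easy to state. The classical Nadaraya-Watson estimator of a value at a query point x, given data points x_i with responses y_i, a kernel K, and bandwidth h, is

```latex
\hat{m}(x) \;=\; \frac{\sum_i K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_j K\!\left(\frac{x - x_j}{h}\right)} .
```

Read x as a query, the x_i as keys, the y_i as values, and K as the exponential of a scaled dot product, and this is precisely the attention formula, with the bandwidth h in the role of temperature.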
These are not just intellectually interesting connections — they bring decades of insight from other fields into attention design.
The deeper lesson
Innovation in deep learning often arises from rediscovering classical ideas and scaling them with modern compute. Batch normalization echoes whitening; ResNets mirror numerical ODE solvers; dropout resembles Bayesian model averaging.
When something works surprisingly well, an old idea is usually hiding inside it.
For attention, that old idea is weighted least squares: a way to optimally combine uncertain information. Gauss used it to find an asteroid. Today we use it to build machines that plan, reason, and think.
About the author:
Gordon is a mathematician, bioinformatician, AI researcher, and singer–songwriter who explores hidden mathematical patterns at the interface of machine learning, modern AI, biology, and music.