A deep dive into the math, mechanics, and variants of the Attention Mechanism.
The Problem with Memory
In older Natural Language Processing (NLP) models — like Recurrent Neural Networks (RNNs) or LSTMs — the network processed data sequentially. If you had a 50-word sentence, the model had to “remember” the first word by the time it processed the 50th. This created a bottleneck; as the distance grew, information was inevitably lost.
Attention solves this by fundamentally changing the rulebook. It says: “When looking at the current word, don’t rely on a compressed memory of the past. Look back at all the other words in the sentence at once, but decide which ones are important right now.”
The Analogy:
Imagine reading a complex sentence. When you see the word “bank,” your eyes might naturally flick back to the word “river” or “money” earlier in the sentence to understand which definition of “bank” is being used. That “flick back” is Attention.
The Technical Pivot: Breaking the Chain
The transition from RNNs to Transformers wasn’t just an architectural tweak; it was a mathematical solution to intrinsic limitations in sequential processing — specifically the Vanishing Gradient problem and the Sequential Bottleneck.
Here is the technical breakdown of why LSTMs hit a ceiling and how Attention shattered it.
1. The Vanishing Gradient Problem
The most severe limitation of vanilla RNNs was their inability to learn long-range dependencies due to Backpropagation Through Time (BPTT).
(The next few math lines are optional 🙃)
To update the weights at the beginning of a sentence (W_1) based on an error calculated at the end of the sentence (step t), the error gradient must flow backward through every intermediate step. Mathematically, this invokes the Chain Rule over the entire sequence. The critical term is a repeated product of step-to-step gradients:
∂h_t / ∂h_1 = ∏_{k=2}^{t} (∂h_k / ∂h_{k-1})
If the gradient at each step is small (e.g., <1, which is typical for sigmoid or tanh activation functions), multiplying it repeatedly causes the value to decay exponentially toward zero.
The Result: The signal from the end of the sentence vanishes before it reaches the beginning. The model physically cannot “remember” or update weights based on long-distance context.
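To see the decay concretely, here is a purely illustrative Python sketch; the per-step gradient value of 0.9 is made up, but the exponential shrinkage is the whole story:

```python
# Purely illustrative: assume the gradient magnitude at each step is 0.9
# (a made-up value < 1). The surviving signal decays exponentially.
per_step_gradient = 0.9

for distance in (5, 20, 50):
    surviving_signal = per_step_gradient ** distance
    print(f"after {distance:2d} steps: {surviving_signal:.6f}")

# after  5 steps: 0.590490
# after 20 steps: 0.121577
# after 50 steps: 0.005154
```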
2. Did LSTMs fix this?
Partially, but not structurally.
LSTMs (1997) and GRUs (2014) introduced Gating Mechanisms (such as the LSTM’s Cell State c_t). These gates acted as a “gradient superhighway,” allowing gradients to flow relatively unchanged through the network.
The Limit: While LSTMs extended the effective context window from ~10 tokens (RNN) to ~200+ tokens, they did not eliminate the distance itself. The gradient still had to traverse a path of length N. For very long sequences, the signal still degraded.
3. The Sequential Bottleneck (O(N))
Even with perfect gradient flow, recurrent models suffer from a fundamental computational constraint: Sequentiality.
To compute the hidden state h_t, the hardware must wait for h_{t-1} to complete. You cannot compute the 100th token until you have computed the 99th.
- No Parallelization: This creates a computational wall: the number of sequential steps scales linearly with sequence length (O(N)), leaving modern GPUs (designed for massive parallel processing) underutilized.
- The Information Bottleneck: The model attempts to compress the entire history of the sequence into a single, fixed-size vector. As the sequence grows, this vector becomes “saturated,” forcing the model to overwrite old information to make space for new input.
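Here is a minimal sketch of that constraint (NumPy, toy dimensions, random untrained weights); the only point is that each step of the loop has to wait for the previous one, and everything the model knows lives in one fixed-size vector h:

```python
import numpy as np

# Minimal RNN-style recurrence (toy dimensions, random untrained weights).
# Each hidden state needs the previous one, so the timestep loop is serial,
# and the whole history is squeezed into a single fixed-size vector h.
d_model, seq_len = 8, 100
rng = np.random.default_rng(0)

W_x = rng.normal(size=(d_model, d_model)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(d_model, d_model)) * 0.1   # hidden-to-hidden weights
x = rng.normal(size=(seq_len, d_model))           # the input sequence

h = np.zeros(d_model)
for t in range(seq_len):                  # O(N) steps that cannot run in parallel
    h = np.tanh(x[t] @ W_x + h @ W_h)     # h_t cannot be computed before h_{t-1}

print(h.shape)   # (8,) -- 100 tokens of history compressed into one vector
```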
The Solution: The Transformer Architecture
The Attention mechanism solves these problems by abandoning recurrence entirely. It treats a sequence not as a chain, but as a fully connected graph.
1. Path Length Reduction (O(N) to O(1))
In a Transformer, the distance between any two tokens — regardless of their position in the sentence — is exactly 1. The mechanism connects every token to every other token directly via matrix multiplication.
- Result: The Vanishing Gradient problem (between tokens) is structurally impossible, because the gradient no longer has to travel through a long chain of intermediate timesteps.
2. Massively Parallel Computation
Because there is no dependency on a previous state h_{t-1}, the Transformer processes the entire sequence simultaneously. The core calculation — Self-Attention — is effectively one massive matrix multiplication. This allows GPUs to process an entire document in parallel, rather than looping word by word.
3. Dynamic Context vs. Compression
- LSTM: “Here is a compressed summary of the last 500 words. Good luck.”
- Attention: “You have direct access to the raw vectors of all 500 previous words. Retrieve exactly the information you need.”
The Mechanics: Q, K, and V
How does the model know which words to attend to? We use three vectors: Query, Key, and Value.
The High-Level Analogy: The Filing System
Imagine a library filing system.
- Query (q): A sticky note you are holding that says, “I am looking for books about Rivers.”
- Key (k): The label on the spine of every book on the shelf (e.g., “Geography”, “Finance”).
- Value (v): The actual content inside the book.
The Process: You walk down the aisle and compare your Query to every Key. If they match (high similarity), you take the book and read the Value.
The Calculation
For every input word vector x (dimension d_{model}), the model learns three distinct weight matrices during training: W_Q, W_K, and W_V. We project the input x into three specialized subspaces: q = x · W_Q, k = x · W_K, and v = x · W_V.
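A shape-level sketch of those projections (toy dimensions, with random matrices standing in for the learned W_Q, W_K, W_V):

```python
import numpy as np

# Shape-level sketch: toy dimensions, random matrices standing in for the
# learned W_Q, W_K, W_V.
d_model, d_k = 8, 8
rng = np.random.default_rng(0)

x = rng.normal(size=(d_model,))        # embedding of one word, e.g. "bank"

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q = x @ W_Q   # "What do I need?"
k = x @ W_K   # "What do I offer?"
v = x @ W_V   # "What is my content?"

print(q.shape, k.shape, v.shape)       # (8,) (8,) (8,)
```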
Why Project Them?
Why not just compare x against x?
The Query Matrix (W_Q): “What do I need?”
- Takes the vector for “Bank” and rotates it to ask: “I am a noun potentially related to geography or finance. Are there modifiers nearby to clarify me?”
The Key Matrix (W_K): “What do I offer?”
- Takes the vector for “River” and rotates it to shout: “I am a nature term. I relate to geography.”
The Value Matrix (W_V): “What is my content?”
- If “Bank” matches “River,” we don’t just want to know they matched; we want to absorb the meaning of River. W_V preserves the semantic content (water, flow) to be passed along.
Putting it Together: The Formula
Now that we have Q, K, and V, we execute the famous attention formula:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
1. The Dot Product (Q · K^T): We multiply the Query of the current word by the Keys of all words. This creates a Score (similarity).
2. The Scaling (÷ √d_k): We divide the scores by the square root of the Key dimension, so the dot products don’t grow so large that the Softmax saturates.
3. The Softmax: The scores are normalized into probabilities (0 to 1).
- River: 0.9
- The: 0.1
4. The Weighted Sum (… × V): We multiply the probabilities by the Value vectors. The word “Bank” effectively absorbs 90% of the mathematical meaning of “River.”
Notice that the Query (q) and Key (k) determine the weights, but the Value (v) is what is actually summed up.
Note 1: Why separate them? If we used the same vector x for everything (x . x), a word would always pay maximum attention to itself (since a vector is always most similar to itself). By splitting them into Q, K, V, we allow the model to decouple the search (“What am I looking for?”) from the content (“What do I contain?”), enabling complex relationships like a word attending to a distant word that clarifies its meaning, rather than just staring at itself.
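Here is a minimal NumPy sketch of the whole formula, using a toy 4-token sequence and random matrices in place of real projected Q, K, and V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q @ K.T / sqrt(d_k)) @ V for one sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # steps 1-2: score + scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # step 3: softmax over keys
    return weights @ V, weights                        # step 4: weighted sum of Values

# Toy example: a 4-token sequence with 8-dimensional Q, K, V (random values).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)           # (4, 8): one context vector per token
print(weights.sum(axis=-1))   # each row of weights is a probability distribution
```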
Variants of Attention
Most variants differ in two ways: Source (where do Q, K, V come from?) and Scope (what can they see?).
1. Self-Attention (The Standard)
- Source: Q, K, and V all come from the same input sequence (X).
- Role: The “Internal Understanding” mechanism. It resolves dependencies like co-reference (knowing “it” refers to “robot”).
2. Cross-Attention (The Bridge)
- Source: Queries (Q) come from the Decoder (target sentence). Keys (K) and Values (V) come from the Encoder (source sentence).
- Role: The “Translation” mechanism. The Decoder asks, “I am generating the next word in English; let me check the French source sentence to see what is relevant.” This aligns two different sequences (e.g., Text-to-Image or Translation).
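Mechanically, nothing else changes. The sketch below reuses the scaled_dot_product_attention function from the previous section and only swaps where Q and K/V come from (toy shapes, random values):

```python
import numpy as np

# Reuses scaled_dot_product_attention from the sketch above; only the
# sources of Q and K/V change (toy shapes, random values).
rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 8))   # Queries: the sentence being generated
encoder_states = rng.normal(size=(5, 8))   # Keys/Values: the source sentence

output, weights = scaled_dot_product_attention(
    Q=decoder_states, K=encoder_states, V=encoder_states)

print(weights.shape)   # (3, 5): every target token scores every source token
print(output.shape)    # (3, 8): one context vector per target token
```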
3. Multi-Head Attention (The Parallelizer)
- Problem: A single attention head might focus only on syntax (Subject-Verb). We need to capture tone, logic, and grammar simultaneously.
- Solution: We slice the embedding dimension into h smaller chunks (“heads”). Each head has its own W_Q, W_K, W_V matrices. One head looks at syntax; another looks at semantic relations.
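A simplified sketch of the idea is below; splitting one d_model-wide projection into per-head chunks is equivalent to giving each head its own smaller W_Q, W_K, W_V, and real implementations add biases, dropout, and so on:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model into n_heads chunks, attend per head, then recombine."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project once, then view each projection as n_heads smaller chunks.
    Q = (X @ W_Q).reshape(seq_len, n_heads, d_head)
    K = (X @ W_K).reshape(seq_len, n_heads, d_head)
    V = (X @ W_V).reshape(seq_len, n_heads, d_head)

    heads = []
    for h in range(n_heads):                           # each head attends independently
        scores = Q[:, h] @ K[:, h].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V[:, h])

    return np.concatenate(heads, axis=-1) @ W_O        # merge heads back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads).shape)   # (4, 8)
```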
4. Masked Attention (The Time Traveler Block)
- Context: Used in Decoders (like GPT) during training.
- Problem: When predicting the 3rd word, the model shouldn’t “see” the 4th word. That’s cheating.
- Solution: We apply a mask (set to negative infinity) to the scores of future tokens. When Softmax is applied, their probability becomes exactly 0. This enforces causality.
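A minimal sketch of that mask, built on the same scaled dot-product computation as before (toy shapes, no learned weights):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (no-peeking) mask."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    # Positions above the diagonal are "the future": set them to -inf so
    # their softmax probability becomes exactly 0.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # stand-in for already-projected Q, K, V
_, weights = causal_self_attention(X, X, X)
print(np.round(weights, 2))            # upper triangle is all zeros: no looking ahead
```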
5. Efficiency Variants (Sparse Attention)
Standard Attention is O(N²). If sequence length doubles, computation quadruples. To fix this for long documents:
- Local Attention: A token only attends to neighbors within a fixed window (Sliding Window).
- Global Attention: Special tokens (like [CLS]) attend to everything and act as “hubs.”
- Sparse Attention: Only calculating specific blocks of the matrix to handle sequences of 100k+ tokens.
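As a sketch of the local-attention idea, here is one way a sliding-window mask could be built; positions outside the window would be set to negative infinity exactly like the causal mask above, so each token only pays for its window instead of the full N×N matrix:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Local attention: token i may only attend to tokens j with |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window   # True = allowed

# Positions marked False would be set to -inf, exactly like the causal mask.
print(sliding_window_mask(6, 1).astype(int))
# [[1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]
#  [0 0 0 0 1 1]]
```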
The Impact
The Attention mechanism didn’t just improve AI; it standardized it.
- GPT-4: Relies on Masked Self-Attention to maintain coherent conversations over thousands of words.
- Claude: A triumph of Optimized Self-Attention, capable of reading entire books in one prompt.
- Midjourney: Relies on Cross-Attention. Your text prompt acts as the Query, guiding the visual noise (Keys/Values) into an image.
- BERT: Uses Bidirectional Self-Attention to understand search queries for Google.
Note: We destroyed the order of the sequence to save the signal. How do Transformers know that “The cat ate the mouse” is different from “The mouse ate the cat”? That requires Positional Encoding (and newer techniques like RoPE), which we will cover in the next article.