A deep dive into the transformer architecture.
10 min read · Oct 21, 2025
Hey everyone,
In this article we're going to take a deep dive into the transformer architecture based on the paper “Attention Is All You Need”. Since it's quite a long paper with many important concepts, we're going to split it into two parts: the Encoder in this first part, and the Decoder along with training/inference in the second.
For some context, “Attention Is All You Need” is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence and a major contributor to the AI boom, as the transformer has become the dominant architecture for a wide variety of AI systems, such as large language models.
Now let's dive in!
Chapter 1: The Rise of Sequence Models
Before Transformers revolutionized deep learning, models like Recurrent Neural Networks (RNNs) and LSTMs dominated the field. They processed data sequentially, which meant each step depended on the previous one.
How RNNs work
RNNs take one token at a time, combine it with the hidden state from the previous step, and produce an output token. However, this sequential dependency introduces several critical limitations:
- Slow Computation: Like running a loop, each step waits for the previous one. Training long sequences becomes computationally expensive.
- Vanishing/Exploding Gradients: When the same gradients are multiplied repeatedly through many steps, they may become:
  - Tiny (vanishing) → updates barely happen.
  - Huge (exploding) → weights overshoot during training.
- Long-term Dependencies: As information moves through many time steps, the model forgets older context. Example: when translating long sentences, the beginning often gets “lost.”
Figure 1: Recurrent Neural Networks (RNN)
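To make the sequential bottleneck concrete, here's a minimal NumPy sketch of a vanilla RNN unrolled over a sequence; the weight names and sizes are purely illustrative. Notice that each step has to wait for the previous hidden state.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Unroll a vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim)               # initial hidden state
    outputs = []
    for x_t in x_seq:                      # sequential: step t depends on step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return np.stack(outputs)               # (seq_len, hidden_dim)

x_seq = np.random.randn(6, 512)            # six token embeddings
W_x = np.random.randn(256, 512) * 0.01
W_h = np.random.randn(256, 256) * 0.01
b = np.zeros(256)
print(rnn_forward(x_seq, W_x, W_h, b).shape)   # (6, 256)
```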
These problems set the stage for something **faster, parallel, and more powerful**: the Transformer.
Chapter 2: Enter the Transformer
Introduced in “Attention Is All You Need” *(Vaswani et al., 2017)*, the Transformer eliminated recurrence entirely. Instead, it uses attention mechanisms that let each word “look at” all other words in parallel.
The architecture has two main components:
- Encoder — reads and understands the input.
- Decoder — generates the output (like translation).
Let's look at each component in detail.
2.1 The Encoder
2.1.1 Input Embedder
The input embedder is the first step in the Transformer’s encoder. It converts raw text into numerical representations that the model can understand. The input embedder follows these steps:
- **Tokenization:** The original sentence is split into smaller units called *tokens*; these could be words, subwords, or even characters.
- **Numerical Mapping:** Each token is assigned a unique number based on its position in the vocabulary. For example, the word “love” might correspond to the index 3452.
- **Embedding Representation:** Each token index is then mapped to a dense embedding vector of size 512, often denoted as d_model = 512. This vector captures the semantic meaning of the token: words with similar meanings will have similar vector representations.
- **Trainable Parameters:** These embedding vectors are not fixed; they are learnt and updated during training based on the loss function. This allows the model to refine how it represents each word’s meaning in context.
**Figure 3:** The flow of the input embedder
In short, the input embedder transforms discrete words into continuous numerical vectors that carry rich contextual meaning, preparing them for further processing in the Transformer.
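To make this concrete, here's a minimal NumPy sketch of the input embedder; the toy vocabulary, the token indices, and the randomly initialized table are purely illustrative, since a real model learns the embedding table during training.

```python
import numpy as np

# Toy vocabulary: token -> index (illustrative only)
vocab = {"<pad>": 0, "your": 1, "cat": 2, "is": 3, "a": 4, "lovely": 5, "love": 6}
d_model = 512

# Trainable embedding table: one d_model-sized row per vocabulary entry.
# Randomly initialized here; in practice it is updated by backpropagation.
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

sentence = ["your", "cat", "is", "a", "lovely", "cat"]   # tokenization (already split)
token_ids = [vocab[w] for w in sentence]                 # numerical mapping
embeddings = embedding_table[token_ids]                  # embedding lookup
print(embeddings.shape)                                  # (6, 512)
```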
2.1.2 Positional Encoder
Positional encoding injects position information so the model can tell which tokens are near each other and which are far apart. For example, in the sentence *‘your cat is a lovely cat’*, a human can tell upon inspection that *your* is ‘far’ from *lovely*, but the transformer cannot. Enter the positional encoder. Here’s how it works.
- We create a positional vector for each position in the sequence. Each positional vector has the same dimensionality as the token embedding, d_model = 512. To the embedding vector (the one in orange in Figure 5), we add this size-512 positional vector. Unlike the embeddings, however, these values are fixed: they are not learned and do not change over time.
- To compute the positional vectors we use the classic trigonometric method. For position *pos* and dimension index *i* (0-based): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
- This uses sin on even indices and cos on odd indices at different frequencies, producing smooth, continuous patterns across positions.
**Figure 4:** The computation of the positional encoding values.
- Now add the positional vectors to the embeddings. For each token, take its learned embedding vector (size d_model) and add the positional vector for that token’s position.
Result: a combined vector that contains both meaning (embedding) and position (positional encoding). The final encoder input is the sum of both vectors.
**Figure 5:** The positional encoder in action
Why are we using trigonometric functions?
**Figure 6:** We can observe a pattern in the trigonometric functions.
Trigonometric functions like sin and cos naturally trace continuous, periodic patterns that the model can recognise, so relative positions become easier for it to pick up. Looking at a plot of these functions, we can see a regular pattern ourselves, so we can hypothesize that the model will see it too.
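Here's a minimal NumPy sketch of the sinusoidal positional encoding described above; the sequence length and d_model match our running example, and the commented-out last line shows how it would be combined with the embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)     # one frequency per pair of dims
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # sin on even indices
    pe[:, 1::2] = np.cos(angle)                        # cos on odd indices
    return pe

pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)                                        # (6, 512)
# encoder_input = embeddings + pe                      # element-wise sum with token embeddings
```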
2.1.3 Multi-headed attention
Before going into multi-headed attention, let's look at self-attention first to better appreciate it.
Self-attention is the mechanism that lets every token in a sequence relate to every other token, so each token’s representation becomes aware of the whole sentence. In this simple case we consider a sequence length of seq = 6 and d_model = d_k = 512. This is how it works step by step.
- Each token “looks at” the other tokens, measures how relevant they are, and then gathers information from them to produce a new representation. This lets the model capture context (syntax, semantics, dependencies) in a single parallel pass over the sequence.
- Setup:
  - Sequence length: seq = 6
  - Model / embedding size: d_model = 512
  - For clarity, call the attention key/query/value dimension d_k (often d_k = d_model / num_heads in multi-head attention; for this single-head example we use d_k = d_model).
We form three matrices from the input embeddings:
- Q (queries): shape (seq, d_model) → here (6, 512)
- K (keys): shape (seq, d_model) → (6, 512)
- V (values): shape (seq, d_model) → (6, 512)
- Compute the attention scores.
**Figure 7:** Formula for self-attention: Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V
Multiply Q by Kᵀ to get pairwise similarity scores, where each entry scores[i, j] measures how much token i should pay attention to token j. Divide the scores by sqrt(d_k) to stabilize gradients.
Since Q has dimension (6 × 512) and Kᵀ has dimension (512 × 6), their product is a matrix of dimension (6 × 6).
If we apply **softmax** to this 6 × 6 matrix, we notice that each row sums to 1.
- Now we multiply by V and get a (6 × 512) matrix again, in which each row captures not only the meaning (given by the embedding) and the position in the sentence (represented by the positional encoding), but also each word’s interaction with the other words.
Each entry in the matrix after this multiplication scores how intense the relationship of one word is with another. Why of that word with all the other words?
Perform the matrix multiplication shown here by hand and you’ll see why.
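To tie the steps together, here's a minimal NumPy sketch of this single-head self-attention computation; the input matrix is random, standing in for the embeddings plus positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (seq, seq) pairwise similarities
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # (seq, d_k) context-aware vectors

seq, d_model = 6, 512
X = np.random.randn(seq, d_model)               # embeddings + positional encodings
out = self_attention(X, X, X)                   # pure self-attention: Q = K = V = X
print(out.shape)                                # (6, 512)
```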
Properties of Self-Attention
- Permutation Invariance (without positional encoding): Self-attention is order-agnostic by design. It focuses purely on relationships between tokens rather than their positions. Suppose our input matrix A has shape (6 × 512) and we obtain an output A' of the same shape after applying attention. If we swap two rows in A (say, tokens a and b), the corresponding rows in A' will also swap, but their actual values remain unchanged. This shows that the computation depends only on how tokens relate to one another, not on their order. The sequence order is later incorporated using positional encodings.
- No Additional Parameters (in pure self-attention): The basic self-attention computation does not add any new trainable parameters. The attention weights are derived entirely from the token embeddings and their positional encodings. Only when we move to multi-head attention do we introduce trainable projection matrices (W_Q, W_K, and W_V).
- High Diagonal Values (Self-Focus): Each token tends to attend most strongly to itself because the query and key vectors of the same token are highly correlated. This results in larger values along the diagonal of the attention matrix, showing that each token gives maximum importance to its own features before considering others.
- Masking Unwanted Interactions: Sometimes we need to block specific tokens from interacting: for example, in the decoder, a word should not attend to future words (to maintain causality), and padding tokens should be ignored. This is achieved by setting the unwanted entries in the attention score matrix to -∞ before applying softmax, ensuring those probabilities become zero. For instance, if we do not want “your” and “cat” to interact, we set their score to -∞, preventing the model from learning that relationship.
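As a quick illustration, here's what that masking looks like in a NumPy sketch; the scores are random stand-ins for Q Kᵀ / sqrt(d_k), and the masked pair is chosen arbitrarily.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq = 6
scores = np.random.randn(seq, seq)          # stand-in for Q Kᵀ / sqrt(d_k)
mask = np.zeros((seq, seq), dtype=bool)
mask[0, 1] = True                           # e.g. forbid token 0 from attending to token 1
scores = np.where(mask, -np.inf, scores)    # blocked entries become -inf
weights = softmax(scores, axis=-1)          # masked positions get weight exactly 0
print(weights[0, 1])                        # 0.0
```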
Now let's go on to discuss multi-head attention.
Multi-head attention
**Figure 8:** Multi-head attention
Now, using *Figure 8* as reference, let's proceed step by step.
- Start with the input matrix of shape (seq, d_model) (tokens × embedding size).
- Copy this input: one copy proceeds to the Add & Norm block later (the residual connection), and three copies are used to form the Query (Q), Key (K) and Value (V) inputs, each of shape (seq, d_model).
- Project Q, K and V by multiplying with the learned parameter matrices W_Q, W_K, W_V, each of shape (d_model, d_model), producing Q', K', V' (each (seq, d_model)).
- Split Q', K', V' along the embedding dimension into h heads (i.e., split d_model into h chunks), so each head has dimension d_k = d_model / h. This yields Q₁, Q₂, …, Q_h (and similarly K_i, V_i), each of shape (seq, d_k).
- For each head i, compute scaled dot-product attention using the head’s Q_i, K_i, V_i: Attention(Q_i, K_i, V_i) = softmax((Q_i K_iᵀ) / sqrt(d_k)) V_i, producing Head_i of shape (seq, d_k).
- Collect all Head_i outputs (there are h of them); each Head_i encodes a different aspect of token relationships because the projections and splits make each head attend to a different subspace of the embedding.
- Concatenate the h heads along the embedding dimension to form a single matrix Concat(head₁, …, head_h) of shape (seq, h * d_k); because h * d_k = d_model, the concatenated result has shape (seq, d_model).
- Apply the final linear projection W_O of shape (d_model, d_model) to the concatenated heads: MultiHeadOutput = Concat(…) W_O, producing the multi-head attention output of shape (seq, d_model).
- The multi-head output (MH-A) is then passed onward (usually added to the residual connection and fed into Add & Norm, then the feed-forward block), giving the model a combined representation where each token has integrated multiple complementary attention perspectives.
The goal of this is that instead of calculating attention over the full Q', K', V' matrices, we split them into smaller matrices and calculate attention on each, so every head watches the full sentence but a different aspect of each word’s embedding. We do this because a word may have different meanings depending on the context, and we want each head to capture a different one of those aspects.
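Putting the steps above together, here's a minimal NumPy sketch of multi-head attention; the weights are random, and h = 8 so that d_k = 512 / 8 = 64.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project, split into h heads, attend per head, concatenate, project again."""
    seq, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # Q', K', V': (seq, d_model) each
    heads = []
    for i in range(h):                                     # each head sees one slice of d_model
        Qi = Q[:, i * d_k:(i + 1) * d_k]
        Ki = K[:, i * d_k:(i + 1) * d_k]
        Vi = V[:, i * d_k:(i + 1) * d_k]
        weights = softmax(Qi @ Ki.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ Vi)                         # Head_i: (seq, d_k)
    return np.concatenate(heads, axis=-1) @ W_O            # (seq, d_model)

seq, d_model, h = 6, 512, 8
X = np.random.randn(seq, d_model)                          # embeddings + positional encodings
W_Q, W_K, W_V, W_O = (np.random.randn(d_model, d_model) * 0.01 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (6, 512)
```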
Add and Norm
Moving on to the last part of the multi-head attention block, we reach the Add & Norm layer.
It performs two key functions:
- **Add (Residual Connection):** The output of a sublayer (such as Multi-Head Attention or the Feed Forward Network) is added to its original input. This helps prevent loss of information and mitigates the vanishing gradient problem, allowing gradients to flow easily through deep networks.
- **Norm (Layer Normalization):** After adding the residual connection, the result is normalized across its features to ensure that the data has a stable distribution. This makes learning faster and more reliable by keeping activations within a consistent range.
**Figure 9:** Calculating the mean and variance of each item independently
Here’s how it works:
- Take the input x (from the previous layer) and the output sublayer(x) (from the current block).
- Add them together to form the residual connection: y = x + sublayer(x)
- Compute the mean (μ) and variance (σ²) of all features for each token embedding in y.
- Normalize: ŷᵢ = (yᵢ − μ) / sqrt(σ² + ϵ), where ϵ is a small constant to prevent division by zero.
- Scale and shift the normalized values using the learnable parameters γ (gamma) and β (beta): LayerNorm(yᵢ) = γŷᵢ + β.
Why add gamma and beta?
These introduce some flexibility, because forcing every activation to have exactly zero mean and unit variance may be too restrictive for the network. The network learns to tune these two parameters and reintroduce variation when necessary. Essentially, these two parameters let the model stress when one word is more intense than another.
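Here's a minimal NumPy sketch of the Add & Norm step described above; the inputs are random, and γ and β start at 1 and 0 but would be learned during training.

```python
import numpy as np

def add_and_norm(x, sublayer_out, gamma, beta, eps=1e-6):
    """Residual connection followed by layer normalization over the feature axis."""
    y = x + sublayer_out                           # Add: residual connection
    mean = y.mean(axis=-1, keepdims=True)          # per-token mean over the d_model features
    var = y.var(axis=-1, keepdims=True)            # per-token variance
    y_norm = (y - mean) / np.sqrt(var + eps)       # normalize each token independently
    return gamma * y_norm + beta                   # learnable scale and shift

seq, d_model = 6, 512
x = np.random.randn(seq, d_model)                  # input to the sublayer
sub = np.random.randn(seq, d_model)                # e.g. the multi-head attention output
gamma, beta = np.ones(d_model), np.zeros(d_model)
print(add_and_norm(x, sub, gamma, beta).shape)     # (6, 512)
```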
That’s all for this post! It was a long one, so take some time to reread it to fully grasp the beautiful design of the encoder. We’ll pick up right where we left off with the decoder in the next part.
I’d like to give credit to Umar Jamil for his fantastic video on the “Attention Is All You Need” paper and for the accompanying visuals. His explanation made it much easier to understand these concepts. Be sure to check out his video yourself!