This is the third article in the “Evolution of Transformers” series. In this article, we will explore Rotary Positional Embeddings (RoPE), used in modern transformer architectures like DeepSeek, Llama, and Mistral. We will start by understanding the necessity of positional embeddings, their implementation in the vanilla Transformer, and its shortcomings. Then we will introduce RoPE, explore its mathematical intuition, and provide a PyTorch implementation.
Why Positional Embeddings?
In Pt. 2, we discussed how the Transformer architecture discards the sequential processing of traditional RNN or LSTM models in favor of globally parallel training. But wait a minute: if all tokens are processed in parallel, how do you preserve their order? You need a way to track it so the model can learn from context.
Self-attention is permutation-invariant: without positional information, the model treats a sequence of tokens as a set rather than an ordered list. It produces the same set of outputs for “The cat chases the mouse” and “The mouse chases the cat”, even though the two sentences have opposite meanings.
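As a quick sanity check, here is a minimal sketch (my own illustration, not from the article) showing that plain scaled dot-product attention with no positional information produces the same outputs for a shuffled sequence, just in shuffled order. The toy dimensions and single-head setup are arbitrary.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "token embeddings": 5 tokens of dimension 8, with no positional information added
x = torch.randn(5, 8)

# Shared projection weights for Q, K, V (a single attention head, no batching)
wq, wk, wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)

def attention(tokens):
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = (q @ k.T) / (8 ** 0.5)
    return F.softmax(scores, dim=-1) @ v

perm = torch.randperm(5)              # shuffle the token order
out_original = attention(x)
out_shuffled = attention(x[perm])

# Same outputs, just reordered: attention sees a set, not a sequence
print(torch.allclose(out_original[perm], out_shuffled, atol=1e-5))  # True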
Requirements of Positional Embeddings
- Uniqueness: The model should be able to distinguish each token position individually.
- Relative Distance: The model should be able to learn the concept of relative distance between tokens.
Vanilla Positional Embeddings
Vaswani et al. proposed the following formula for computing the positional embedding of the token at position pos. It uses a sine and cosine pair of the same frequency at the even (2i) and odd (2i+1) dimensions. The frequency of these pairs decreases exponentially with the dimension index i.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
PE: formula for the value of each dimension of the positional embedding
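For reference, here is a minimal sketch (my own; the function name sinusoidal_positional_embedding is just an illustrative choice) that builds the table directly from this formula, where pe[pos] is the vector added to the token at position pos.

import torch

def sinusoidal_positional_embedding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).float().unsqueeze(1)   # positions, shape (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()            # pair indices 2i, shape (d_model/2,)
    angle = pos / (10000 ** (i / d_model))             # frequency decays exponentially with i
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle)                     # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_embedding(seq_len=50, d_model=128)
print(pe.shape)  # torch.Size([50, 128])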
The intuition for the exponential change in frequency comes from binary counting, where the least significant bits (LSBs) flip at a much higher rate than the most significant bits (MSBs). Sinusoids are a smooth, continuous version of the same idea. This satisfies the uniqueness criterion.
Frequency of change of MSB and LSB bits
Note the pairs of sine and cosine at the same frequency or angle
Linear shift property:
This property allows the model to relate the token at position pos to the token at position (pos+k) through a linear transformation. For each sine-cosine pair, the transformation is a rotation whose angle depends only on the number of steps k, i.e. phi = k*w_i.
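To make this concrete, here is a small numerical check (my own sketch, not from the article): for each frequency w_i, the (sin, cos) pair at position pos+k equals the pair at position pos multiplied by a 2x2 rotation matrix whose angle phi = k*w_i depends only on the offset k.

import torch

d_model, pos, k = 8, 5, 3
i = torch.arange(0, d_model, 2).float()
w = 1.0 / (10000 ** (i / d_model))                 # per-pair frequencies w_i

def pe_pairs(p):
    # (sin(p*w_i), cos(p*w_i)) pairs, shape (d_model/2, 2)
    return torch.stack([torch.sin(p * w), torch.cos(p * w)], dim=-1)

phi = k * w                                        # rotation angle per pair, depends only on k
rot = torch.stack([
    torch.stack([torch.cos(phi), torch.sin(phi)], dim=-1),
    torch.stack([-torch.sin(phi), torch.cos(phi)], dim=-1),
], dim=-2)                                         # one 2x2 rotation matrix per pair, shape (d_model/2, 2, 2)

shifted = torch.einsum('fij,fj->fi', rot, pe_pairs(pos))
print(torch.allclose(shifted, pe_pairs(pos + k), atol=1e-6))  # True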
Resultant Vector:
The input to the model is the basic vector addition of the token embedding a(i) and the positional embedding p(i).
r(i) = a(i) + p(i)
Geometric diagram showing the resultant vector calculation
Limitations:
- Adding the positional embeddings “pollutes” the semantic meaning of the embeddings. It forces a single vector to represent two distinct types of information simultaneously.
- The shifted vector passes through multiple layers of transformation, which can lead to a loss of the semantic identity of the original vector.
- The model has to learn the hard concept of relative distance indirectly, from absolute positions and the linear shift property.
Rotation of a Vector
Back to linear algebra! This is a must-know concept for RoPE.
You can rotate a vector z by multiplying it with a special rotation matrix that depends on the angle of rotation theta.
Rotation of vector z by angle theta
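In 2D, the rotation matrix is R(theta) = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]. Below is a tiny sketch of my own (the helper name rotation_matrix is just illustrative) that rotates a vector and checks that its length is preserved while the angle to the original vector is exactly theta.

import math
import torch

def rotation_matrix(theta: float) -> torch.Tensor:
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

z = torch.tensor([2.0, 1.0])
theta = 0.7  # radians
z_rot = rotation_matrix(theta) @ z

print(torch.linalg.norm(z).item(), torch.linalg.norm(z_rot).item())   # length is preserved
cos_angle = torch.dot(z, z_rot) / (torch.linalg.norm(z) * torch.linalg.norm(z_rot))
print(torch.acos(cos_angle).item())                                   # ~0.7, the rotation angle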
Rotary Positional Embeddings
Rotary Positional Embeddings were introduced by Jianlin Su and his colleagues in a 2021 paper titled “RoFormer: Enhanced Transformer with Rotary Position Embedding”.
- Researchers noticed that models using RoPE trained faster and handled longer sequences much better than the vanilla Transformer.
Intuition:
Why don’t we just rotate the Query and Key vectors directly, instead of adding positional information to the token embeddings?
- A rotation that depends on the position of the token makes the concept of relative distance more explicit and easier to learn. See the derivation below to get the main idea.
Key Change:
- Rotation instead of addition
- Operate on the Q and K vectors instead of the token embeddings
Resultant Vectors:
RoPE rotates the query (Q) and key (K) vectors by an angle scaled by the position m of the token in the sentence.
Q’ = R(m*theta).Q
K’ = R(m*theta).K
Now, we will explore the mathematical details to understand the significance of this intuition. Let’s calculate the attention using the modified query and key vectors.
Using the fact that rotation matrices compose as R(a)^T . R(b) = R(b - a), the attention score between the query at position m and the key at position n becomes:
score(m, n) = (R(m*theta).Q)^T . (R(n*theta).K) = Q^T . R(m*theta)^T . R(n*theta) . K = Q^T . R((n-m)*theta) . K
As you can see, the (n-m) term in the attention formula makes the model explicitly aware of relative distance while remaining invariant to absolute position.
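Here is a quick numerical check of that identity (my own sketch; rotation_matrix is the same illustrative helper as above, redefined so the snippet runs on its own): the dot product between a query rotated by m*theta and a key rotated by n*theta equals the dot product between the unrotated query and the key rotated by (n-m)*theta.

import math
import torch

def rotation_matrix(theta: float) -> torch.Tensor:
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.8, 0.5])
theta, m, n = 0.1, 4, 9

lhs = torch.dot(rotation_matrix(m * theta) @ q, rotation_matrix(n * theta) @ k)
rhs = torch.dot(q, rotation_matrix((n - m) * theta) @ k)
print(lhs.item(), rhs.item())  # equal up to floating point, depending only on (n - m)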
Here is a simple implementation of RoPE that uses complex numbers (the polar form of 2D coordinates). It reuses the frequency formula from the vanilla positional embeddings across the different dimensions, and rotates the query and key vectors by angles determined by the positions of the tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoPE:
    @staticmethod
    def precompute_freqs_cis(dim: int, seq_len: int, theta: float = 10000.0):
        # 1. Compute frequencies for the 'rotation speeds' using the formula from vanilla positional embeddings
        # We only need dim//2 because each complex number handles 2 dimensions
        freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
        # 2. Create the position indices (0, 1, 2... seq_len)
        t = torch.arange(seq_len, device=freqs.device)
        # 3. Outer product to get angles for every position (0, theta, 2*theta, ...)
        freqs = torch.outer(t, freqs)  # (seq_len, dim//2)
        # 4. Convert to complex polar form: 1.0 * e^(i * freqs)
        # This is essentially [cos(f) + i*sin(f)], making the (cos, sin) point explained earlier
        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
        return freqs_cis

    @staticmethod
    def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor):
        # x shape: (Batch, Seq, Heads, Head_Dim)
        # 1. Reshape x to treat the last dimension as complex pairs
        # (B, S, H, D) -> (B, S, H, D//2) as complex numbers
        x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        # 2. Align freqs_cis for broadcasting
        # freqs_cis is (S, D//2), we need (1, S, 1, D//2)
        freqs_cis = freqs_cis[:x.shape[1]].view(1, x.shape[1], 1, -1)
        # 3. Rotate via complex multiplication
        x_rotated = x_complex * freqs_cis
        # 4. Convert back to real and flatten (B, S, H, D)
        x_out = torch.view_as_real(x_rotated).flatten(3)
        return x_out.type_as(x)

# --- Test Case ---
batch_size = 2
seq_len = 8
num_heads = 4
head_dim = 64  # Must be even

# 1. Mock Query and Key tensors
q = torch.randn(batch_size, seq_len, num_heads, head_dim)
k = torch.randn(batch_size, seq_len, num_heads, head_dim)

# 2. Precompute the "Rotation Table"
freqs_cis = RoPE.precompute_freqs_cis(head_dim, seq_len)

# 3. Rotate Q and K
q_rotated = RoPE.apply_rotary_emb(q, freqs_cis)
k_rotated = RoPE.apply_rotary_emb(k, freqs_cis)

print(f"Original Q shape: {q.shape}")
print(f"Rotated Q shape: {q_rotated.shape}")

# 4. Verification: Dot product attention
# (B, H, S, S)
scores = torch.matmul(q_rotated.transpose(1, 2), k_rotated.transpose(1, 2).transpose(-2, -1))
print(f"Attention scores shape: {scores.shape}")
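As a final sanity check (my own addition, meant to be appended to the script above since it reuses q, k, scores, head_dim, seq_len, and the RoPE class), we can shift every position by a constant offset and confirm that the attention scores do not change, because RoPE only encodes relative positions.

# Shift all positions by a constant offset; relative distances, and hence the scores, stay the same
offset = 3
freqs_long = RoPE.precompute_freqs_cis(head_dim, seq_len + offset)
q_shifted = RoPE.apply_rotary_emb(q, freqs_long[offset:])   # positions offset .. offset + seq_len - 1
k_shifted = RoPE.apply_rotary_emb(k, freqs_long[offset:])
scores_shifted = torch.matmul(q_shifted.transpose(1, 2), k_shifted.transpose(1, 2).transpose(-2, -1))
print((scores - scores_shifted).abs().max())  # ~0 (floating point noise)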
References:
- Attention Is All You Need. https://arxiv.org/pdf/1706.03762
- RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/pdf/2104.09864
- Why Rotating Vectors Solves Positional Encoding in Transformers https://www.youtube.com/watch?v=qKUobBR5R1A&start=0
- All about Sinusoidal Positional Encodings https://www.youtube.com/watch?v=bQCQ7VO-TWU&start=0