This is the third article in the “Evolution of Transformers” series. In this article, we will explore Rotary Positional Embeddings (RoPE), used in modern transformer architectures like DeepSeek, Llama, and Mistral. We will start by understanding the necessity of positional embeddings, their implementation in the vanilla Transformer, and its shortcomings. Then we will introduce RoPE, explore its mathematical intuition, and provide a PyTorch implementation.
Why Positional Embeddings?
In Pt. 2, we discussed how the Transformer architecture discards the sequential processing of traditional RNN or LSTM models in favor of globally parallel training. But wait a minute: if all tokens are processed in parallel, how do you preserve their order? You need a way to track it so the model can learn from context.
Self-attention is permutation-invariant: without positional information, the model treats a sequence of tokens as a set rather than an ordered list. It produces the same set of outputs for “The cat chases the mouse” and “The mouse chases the cat”, even though the two sentences have opposite meanings.
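As a quick sanity check, here is a minimal sketch (my own illustration, not from the article) showing that plain scaled dot-product attention with no positional information produces the same outputs for a shuffled sequence, just in shuffled order. The toy dimensions and single-head setup are arbitrary.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "token embeddings": 5 tokens of dimension 8, with no positional information added
x = torch.randn(5, 8)

# Shared projection weights for Q, K, V (a single attention head, no batching)
wq, wk, wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)

def attention(tokens):
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = (q @ k.T) / (8 ** 0.5)
    return F.softmax(scores, dim=-1) @ v

perm = torch.randperm(5)              # shuffle the token order
out_original = attention(x)
out_shuffled = attention(x[perm])

# Same outputs, just reordered: attention sees a set, not a sequence
print(torch.allclose(out_original[perm], out_shuffled, atol=1e-5))  # True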
Requirements of Positional Embeddings
- Uniqueness: The model should be able to distinguish each token position individually.
- Relative Distance: The model should be able to learn the concept of relative distance between tokens.
Vanilla Positional Embeddings
Vaswani et al. proposed the following formula for computing the positional embedding of the token at position pos. It uses a sine and cosine pair of the same frequency at the even (2i) and odd (2i+1) dimensions. The frequency of these pairs decreases exponentially with the dimension index i.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
PE: formula for the value of each dimension of the positional embedding
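For reference, here is a minimal sketch (my own; the function name sinusoidal_positional_embedding is just an illustrative choice) that builds the table directly from this formula, where pe[pos] is the vector added to the token at position pos.

import torch

def sinusoidal_positional_embedding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).float().unsqueeze(1)   # positions, shape (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()            # pair indices 2i, shape (d_model/2,)
    angle = pos / (10000 ** (i / d_model))             # frequency decays exponentially with i
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle)                     # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_embedding(seq_len=50, d_model=128)
print(pe.shape)  # torch.Size([50, 128])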
The intuition for the exponential change in frequency comes from binary counting, where the least significant bits (LSBs) flip at a much higher rate than the most significant bits (MSBs). Sinusoids are a smooth, continuous version of the same idea. This satisfies the uniqueness criterion.
Frequency of change of MSB and LSB bits
Note the pairs of sine and cosine at the same frequency or angle
Linear shift property:
This property allows the model to relate the token at position pos to the token at position (pos+k) through a linear transformation. For each sine-cosine pair, the transformation is a rotation whose angle depends only on the number of steps k, i.e. phi = k*w_i.
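To make this concrete, here is a small numerical check (my own sketch, not from the article): for each frequency w_i, the (sin, cos) pair at position pos+k equals the pair at position pos multiplied by a 2x2 rotation matrix whose angle phi = k*w_i depends only on the offset k.

import torch

d_model, pos, k = 8, 5, 3
i = torch.arange(0, d_model, 2).float()
w = 1.0 / (10000 ** (i / d_model))                 # per-pair frequencies w_i

def pe_pairs(p):
    # (sin(p*w_i), cos(p*w_i)) pairs, shape (d_model/2, 2)
    return torch.stack([torch.sin(p * w), torch.cos(p * w)], dim=-1)

phi = k * w                                        # rotation angle per pair, depends only on k
rot = torch.stack([
    torch.stack([torch.cos(phi), torch.sin(phi)], dim=-1),
    torch.stack([-torch.sin(phi), torch.cos(phi)], dim=-1),
], dim=-2)                                         # one 2x2 rotation matrix per pair, shape (d_model/2, 2, 2)

shifted = torch.einsum('fij,fj->fi', rot, pe_pairs(pos))
print(torch.allclose(shifted, pe_pairs(pos + k), atol=1e-6))  # True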
Resultant Vector:
The input to the model is the basic vector addition of the token embedding a(i) and the positional embedding p(i).
r(i) = a(i) + p(i)
Geometric diagram showing the resultant vector calculation
Limitations:
- Adding the positional embeddings “pollutes” the semantic meaning of the embeddings. It forces a single vector to represent two distinct types of information simultaneously.
- The shifted vector passes through multiple layers of transformation, which can lead to a loss of the semantic identity of the original vector.
- The model has to learn the hard concept of relative distance indirectly, from absolute positions and the linear shift property.
Rotation of a Vector
Back to linear algebra! This is a must-know concept for RoPE.
You can rotate a vector z by multiplying it with a special rotation matrix that depends on the angle of rotation theta.
Rotation of vector z by angle theta
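In 2D, the rotation matrix is R(theta) = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]. Below is a tiny sketch of my own (the helper name rotation_matrix is just illustrative) that rotates a vector and checks that its length is preserved while the angle to the original vector is exactly theta.

import math
import torch

def rotation_matrix(theta: float) -> torch.Tensor:
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

z = torch.tensor([2.0, 1.0])
theta = 0.7  # radians
z_rot = rotation_matrix(theta) @ z

print(torch.linalg.norm(z).item(), torch.linalg.norm(z_rot).item())   # length is preserved
cos_angle = torch.dot(z, z_rot) / (torch.linalg.norm(z) * torch.linalg.norm(z_rot))
print(torch.acos(cos_angle).item())                                   # ~0.7, the rotation angle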
Rotary Positional Embeddings
Rotary Positional Embeddings were introduced by Jianlin Su and his colleagues in a 2021 paper titled “RoFormer: Enhanced Transformer with Rotary Position Embedding”.
- Researchers noticed that models using RoPE trained faster and handled longer sequences much better than the vanilla Transformer.
Intuition:
Why don’t we just rotate the Query and Key vectors directly, instead of adding positional information to the token embeddings?
- A rotation that depends on the position of the token makes the concept of relative distance more explicit and easier to learn. See the derivation below to get the main idea.
Key Change:
- Rotation instead of addition
- Operate on the Q and K vectors instead of the token embeddings
Resultant Vectors:
RoPE rotates the query (Q) and key (K) vectors by an angle scaled by the position m of the token in the sentence.
Q’ = R(m*theta).Q
K’ = R(m*theta).K
Now, we will explore the mathematical details to understand the significance of this intuition. Let’s calculate the attention using the modified query and key vectors.
Using the fact that rotation matrices compose as R(a)^T . R(b) = R(b - a), the attention score between the query at position m and the key at position n becomes:
score(m, n) = (R(m*theta).Q)^T . (R(n*theta).K) = Q^T . R(m*theta)^T . R(n*theta) . K = Q^T . R((n-m)*theta) . K
As you can see, the (n-m) term in the attention formula makes the model explicitly aware of relative distance while remaining invariant to absolute position.
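Here is a quick numerical check of that identity (my own sketch; rotation_matrix is the same illustrative helper as above, redefined so the snippet runs on its own): the dot product between a query rotated by m*theta and a key rotated by n*theta equals the dot product between the unrotated query and the key rotated by (n-m)*theta.

import math
import torch

def rotation_matrix(theta: float) -> torch.Tensor:
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.8, 0.5])
theta, m, n = 0.1, 4, 9

lhs = torch.dot(rotation_matrix(m * theta) @ q, rotation_matrix(n * theta) @ k)
rhs = torch.dot(q, rotation_matrix((n - m) * theta) @ k)
print(lhs.item(), rhs.item())  # equal up to floating point, depending only on (n - m)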
Here is a simple implementation of RoPE that uses complex numbers (the polar form of 2D coordinates). It reuses the frequency formula from the vanilla positional embeddings across the different dimensions, and rotates the query and key vectors by angles determined by the positions of the tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoPE:
    @staticmethod
    def precompute_freqs_cis(dim: int, seq_len: int, theta: float = 10000.0):
        # 1. Compute frequencies for the 'rotation speeds' using the formula from vanilla positional embeddings
        # We only need dim//2 because each complex number handles 2 dimensions
        freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
        # 2. Create the position indices (0, 1, 2... seq_len)
        t = torch.arange(seq_len, device=freqs.device)
        # 3. Outer product to get angles for every position (0, theta, 2*theta, ...)
        freqs = torch.outer(t, freqs)  # (seq_len, dim//2)
        # 4. Convert to complex polar form: 1.0 * e^(i * freqs)
        # This is essentially [cos(f) + i*sin(f)], making the (cos, sin) point explained earlier
        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
        return freqs_cis

    @staticmethod
    def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor):
        # x shape: (Batch, Seq, Heads, Head_Dim)
        # 1. Reshape x to treat the last dimension as complex pairs
        # (B, S, H, D) -> (B, S, H, D//2) as complex numbers
        x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        # 2. Align freqs_cis for broadcasting
        # freqs_cis is (S, D//2), we need (1, S, 1, D//2)
        freqs_cis = freqs_cis[:x.shape[1]].view(1, x.shape[1], 1, -1)
        # 3. Rotate via complex multiplication
        x_rotated = x_complex * freqs_cis
        # 4. Convert back to real and flatten (B, S, H, D)
        x_out = torch.view_as_real(x_rotated).flatten(3)
        return x_out.type_as(x)

# --- Test Case ---
batch_size = 2
seq_len = 8
num_heads = 4
head_dim = 64  # Must be even

# 1. Mock Query and Key tensors
q = torch.randn(batch_size, seq_len, num_heads, head_dim)
k = torch.randn(batch_size, seq_len, num_heads, head_dim)

# 2. Precompute the "Rotation Table"
freqs_cis = RoPE.precompute_freqs_cis(head_dim, seq_len)

# 3. Rotate Q and K
q_rotated = RoPE.apply_rotary_emb(q, freqs_cis)
k_rotated = RoPE.apply_rotary_emb(k, freqs_cis)

print(f"Original Q shape: {q.shape}")
print(f"Rotated Q shape: {q_rotated.shape}")

# 4. Verification: Dot product attention
# (B, H, S, S)
scores = torch.matmul(q_rotated.transpose(1, 2), k_rotated.transpose(1, 2).transpose(-2, -1))
print(f"Attention scores shape: {scores.shape}")
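As a final sanity check (my own addition, meant to be appended to the script above since it reuses q, k, scores, head_dim, seq_len, and the RoPE class), we can shift every position by a constant offset and confirm that the attention scores do not change, because RoPE only encodes relative positions.

# Shift all positions by a constant offset; relative distances, and hence the scores, stay the same
offset = 3
freqs_long = RoPE.precompute_freqs_cis(head_dim, seq_len + offset)
q_shifted = RoPE.apply_rotary_emb(q, freqs_long[offset:])   # positions offset .. offset + seq_len - 1
k_shifted = RoPE.apply_rotary_emb(k, freqs_long[offset:])
scores_shifted = torch.matmul(q_shifted.transpose(1, 2), k_shifted.transpose(1, 2).transpose(-2, -1))
print((scores - scores_shifted).abs().max())  # ~0 (floating point noise)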
References:
- Attention Is All You Need. https://arxiv.org/pdf/1706.03762
- RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/pdf/2104.09864
- Why Rotating Vectors Solves Positional Encoding in Transformers https://www.youtube.com/watch?v=qKUobBR5R1A&start=0
- All about Sinusoidal Positional Encodings https://www.youtube.com/watch?v=bQCQ7VO-TWU&start=0