Photo by Vincent van Zalinge on Unsplash
Scaling context windows via PE extrapolation on unseen sequence lengths
Introduction
Positional encoding (PE) is a key component of the Transformer architecture: its attention mechanism is set-invariant, so it needs explicit positional information to process sequential data such as language.
However, traditional PE methods face significant challenges in extrapolating to sequences much longer than those seen during training, limiting the Transformer’s context window.
In this article, I’ll explore major PE methods and investigate their ability to handle longer sequences via extrapolation by instantiating and training a small Transformer on a synthetic long-sequence task.
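To see concretely why attention alone cannot tell token order apart, here is a minimal sketch (my own illustration, not code from the article's experiment, and assuming a plain PyTorch self-attention layer): permuting the input tokens of a self-attention layer without PE simply permutes its outputs, so the layer has no built-in notion of position.

```python
# Minimal sketch: self-attention without positional encoding is
# permutation-equivariant -- shuffling the tokens just shuffles the outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, seq_len = 16, 6
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # a "sentence" of 6 token embeddings
perm = torch.randperm(seq_len)         # shuffle the token order
x_perm = x[:, perm, :]

out, _ = attn(x, x, x)                 # self-attention, no positional encoding
out_perm, _ = attn(x_perm, x_perm, x_perm)

# The output for the shuffled input is just the shuffled original output:
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))  # True
```

Because the layer cannot distinguish the two orderings beyond a permutation of its rows, any information about where a token sits in the sequence has to be injected explicitly, which is exactly what the PE methods below do.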
Table of Contents
What is Positional Encoding
∘ Overcoming Set-Invariance in Attention
∘ The PE Mechanism
Key Types of Positional Encoding (PE) for LLMs
1) Absolute Positional Encoding
∘ Fixed Absolute Positional Encoding (FAPE)
∘ Learnable Positional Encoding (LPE)
∘ Time Absolute Position Encoding (tAPE)