Attention, Visualized
adaptive-ml.com·1d

Build an intuition for the attention mechanism, the core innovation behind transformers.

Building blocks

Language models are trained to predict the most likely next token. Let’s say we have a sequence of input tokens: "the", "dog", "was". Each of these tokens is mapped to a unique vector representing its meaning, called a word embedding. The embedding of "dog" encodes everything the model knows about dogs.
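The token-to-vector lookup can be sketched as a plain table indexed by token id. Everything here is a toy stand-in (the vocabulary, the embedding size, and the random values are made up for illustration):

```python
import numpy as np

# Toy vocabulary mapping each token to an integer id (illustrative only).
vocab = {"the": 0, "dog": 1, "was": 2, "hot": 3, "barking": 4}
embedding_dim = 4  # real models use hundreds or thousands of dimensions

# The embedding table: one row per token. In a trained model these values
# are learned; here they are random placeholders.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(token):
    """Look up the fixed vector for a single token."""
    return embeddings[vocab[token]]

sequence = ["the", "dog", "was"]
vectors = np.stack([embed(t) for t in sequence])
print(vectors.shape)  # one embedding per input token: (3, 4)
```

The key property is that the lookup depends only on the token itself, which is exactly what the rest of the post pokes at.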

These embeddings are fed to a neural network, which predicts the most likely next token—in this case, "barking".
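The prediction step can be caricatured as: pool the input embeddings, project them to one score per vocabulary token, and softmax the scores into a next-token distribution. This is a shape-level sketch only; the pooling and the random weights are placeholders, not the post's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dog", "was", "hot", "barking"]
d = 4
E = rng.normal(size=(len(vocab), d))   # toy embedding table
W = rng.normal(size=(d, len(vocab)))   # toy output projection

# Embed the context and pool it into a single vector (mean pooling is a
# deliberate oversimplification of what a real network does).
ids = [vocab.index(t) for t in ["the", "dog", "was"]]
context = E[ids].mean(axis=0)

# Project to vocabulary logits and normalize with a softmax.
logits = context @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
# probs is a distribution over the vocabulary; a trained model would put
# most of its mass on a plausible continuation like "barking".
```

With untrained random weights the distribution is meaningless, but the shapes (context vector of size d, one probability per vocabulary entry) match the real computation.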

Simple enough. Now consider: what about a hot dog?

The problem

We now have a new sequence: "the", "hot", "dog", "was".

The embedding of "dog" is the same as before, still encoding the meaning of dog. Similarly, "hot" gets its own e…
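This context-independence is easy to demonstrate: with a static lookup table, "dog" maps to the exact same vector whether or not "hot" precedes it. A minimal sketch (toy vocabulary, placeholder values):

```python
import numpy as np

# Static, non-contextual embeddings: the lookup sees only the token,
# never its neighbors. Values are arbitrary placeholders.
vocab = {"the": 0, "dog": 1, "was": 2, "hot": 3}
E = np.arange(16, dtype=float).reshape(4, 4)

def embed(seq):
    return [E[vocab[t]] for t in seq]

a = embed(["the", "dog", "was"])
b = embed(["the", "hot", "dog", "was"])

# "dog" is position 1 in the first sequence and position 2 in the second,
# yet it gets the identical vector in both:
same = np.array_equal(a[1], b[2])
print(same)  # True
```

Attention exists to fix exactly this: it lets the representation of "dog" be updated by its neighbors, so "hot dog" and "the dog" end up with different vectors in context.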
