The MLP Block Is a Representer Theorem

After the 3Blue1Brown attention video you can read half a transformer: you can see which token attends to which. The other half, the MLP block, stays a black box. But attention is legible because it is a kernel, a vote by similarity, and if you make the MLP a kernel too, its output becomes the same thing: a representer-theorem vote over learned prototypes. Then the whole transformer explains itself.
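
To make the "vote" concrete, here is a minimal NumPy sketch of both halves. It is an illustration under stated assumptions, not the article's implementation: the Gaussian kernel, the parameter `gamma`, and the names `attention_vote`, `kernel_mlp_vote`, `P` (prototypes) and `A` (per-prototype output vectors) are all hypothetical choices made here. Attention is written as a normalized exponential-kernel average over value vectors; the kernelized MLP is written in the representer-theorem form f(x) = sum_i k(x, p_i) a_i.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_vote(q, K, V):
    """Single-query attention as a similarity vote.

    The weights are a normalized exponential kernel between the query
    and each key; the output is the weighted average of the values.
    """
    d = q.shape[-1]
    w = softmax(K @ q / np.sqrt(d))   # one weight per token
    return w @ V, w

def kernel_mlp_vote(x, P, A, gamma=1.0):
    """Hypothetical kernelized MLP: f(x) = sum_i k(x, p_i) * a_i.

    Assumes a Gaussian kernel k(x, p) = exp(-gamma * ||x - p||^2).
    P holds the prototypes, A the output vector each prototype votes for.
    """
    k = np.exp(-gamma * np.sum((P - x) ** 2, axis=1))  # one weight per prototype
    return k @ A, k

rng = np.random.default_rng(0)
d, n_tokens, n_protos = 8, 5, 4
q = rng.normal(size=d)
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))
P = rng.normal(size=(n_protos, d))   # stand-ins for learned prototypes
A = rng.normal(size=(n_protos, d))

attn_out, attn_w = attention_vote(q, K, V)
# Feed the MLP an input that coincides with prototype 2.
mlp_out, mlp_w = kernel_mlp_vote(P[2], P, A)
```

In both functions the second return value is the part you can read: `attn_w` says how much each token contributed, and `mlp_w` says how much each prototype contributed. With the input set equal to `P[2]`, the largest entry of `mlp_w` is the one for prototype 2.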
