Attention from First Principles

Motivation

For a while my knowledge of ML was limited to what I'd learned in school: perceptrons, gradient descent, perhaps multiple perceptrons grouped into layers. Looking at the ML landscape from afar, I couldn't keep up with how many fundamentally new ideas were being developed. Conference papers are often written in a way that presents the idea, but not the intuition or the impetus for exploring that particular direction. Looking at the attention paper I was quite lost: why do we need all of Q, K, and V? What is their intuitive explanation? Why is this direction being explored at all?

Reading further did not make it simpler, with many new concepts introduced at once. FlashAttention seemed like an indecipherable rewrite. Mamba was voodoo magic.

For a long while, I wanted a blogpost to explai…
