Sure, You Can import transformers. Or You Could Just Rebuild Modern AI From Scratch, I Guess.
We’ve all done it. pip install transformers, from transformers import AutoModel, and... you’re a modern AI developer. It’s magic.
But what’s really happening under the hood? What’s going on in that “Attention Is All You Need” paper that everyone cites but maybe... didn’t fully read?
I decided to find out. I went on a quest to rebuild the Transformer Encoder from scratch in PyTorch. No nn.Transformer allowed.
My goal: To build a model that could perform Text Classification (specifically, sentiment analysis) on the IMDB movie review dataset. And, just maybe, to finally understand what q, k, and v really mean.
Spoiler: it worked. And it was a journey. Here’s how I did it…
Why Suffer? (The Real Goal)
Okay, sarcasm aside, why do this?
Because “using” an API and “understanding” an architecture are two different things. This project is about moving from being an “API user” to an “architect.” I wanted to know why the design works, and that meant building the “Lego bricks” myself.
The architecture I built is an Encoder-Only model, the same conceptual design used by classic models like BERT for text understanding.
The Blueprint: Rebuilding the Core Components
I built the entire model as a set of PyTorch nn.Module classes. You can’t just build one big class; you have to build the pieces first.
The Problem: The Model is “Order-Blind”
The attention mechanism sees all words at once, like pulling them from a bag. It has no idea “man bites dog” is different from “dog bites man.”
The Fix: PositionalEncoding.
How it Works: I had to build a class that creates a unique “position vector” for every word in the sequence using the famous sin and cos formulas from the paper. This isn’t a learned vector; it’s a fixed mathematical “fingerprint” for each position.
The Magic: You just add this position vector to the word’s “meaning vector” (its embedding). The final vector the model sees is Vector(“man”) + Vector(position=0).
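Here’s a minimal sketch of what that class can look like, following the sin/cos formulas from the paper. The argument names and the max_len cap are my own choices for illustration, not necessarily what’s in the repo:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Fixed sin/cos 'fingerprint' for each position, added onto the word embeddings."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions get sin
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions get cos
        # register_buffer: moves with the model to GPU, but is NOT a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> just add the position vector for each slot
        return x + self.pe[:, : x.size(1)]
```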
The Core: Multi-Head Attention (The “Head Chef”)
This is the heart of the Transformer. It’s not one giant “spotlight” of attention; it’s 4 (or 8, or 12) smaller, “specialist” spotlights working in parallel.
I built this class to be a “Head Chef” that manages the whole process:
- It hires 4 trainable “specialist” layers (w_q, w_k, w_v) to learn how to project the input vectors into Query, Key, and Value “subspaces.”
- It splits the main 256-dimension vector into 4 smaller “heads” (of 64-dimensions each).
- It delegates the real math to a simple scaled_dot_product_attention function.
- This is where the $softmax(\frac{QK^T}{\sqrt{d_k}})V$ formula lives.
- It calculates the scores, masks out padding, and creates the new “blended” vector.
- It stitches the 4 “specialist” reports back together and passes them through one final trainable w_o layer.
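Put together, the “Head Chef” looks roughly like this. This is a sketch under the dimensions described above (d_model=256, 4 heads of 64 each); the helper names and the exact mask shape are my assumptions, not necessarily the repo’s:

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V — the formula from the paper."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # padding positions get ~zero weight
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads          # 256 / 4 = 64 dims per head
        self.w_q = nn.Linear(d_model, d_model)      # the trainable "specialist" projections
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)      # mixes the heads back together

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))
        out = scaled_dot_product_attention(q, k, v, mask)
        # stitch the heads back: (batch, heads, seq, d_head) -> (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)
```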
With the attention “Head Chef” done, the EncoderLayer “Lego block” was straightforward: attention plus a small feed-forward network, each wrapped in a residual connection and LayerNorm, just like in the paper. From there, the rest was easy. I stacked 3 of them together, added an Embedding layer at the start, and a simple nn.Linear classifier head at the end.
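Here’s a sketch of how those pieces can stack, reusing the PositionalEncoding and MultiHeadAttention classes from above. The feed-forward size, dropout, and mean-pooling over tokens before the classifier are my assumptions (a CLS-style token would work just as well); the repo may differ:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One 'Lego block': attention + feed-forward, each with a residual connection and LayerNorm."""
    def __init__(self, d_model: int = 256, num_heads: int = 4, d_ff: int = 512, dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.dropout(self.attn(x, mask)))  # residual around attention
        x = self.norm2(x + self.dropout(self.ff(x)))          # residual around feed-forward
        return x

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, num_heads: int = 4,
                 num_layers: int = 3, num_classes: int = 2, pad_id: int = 0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        self.pos = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads) for _ in range(num_layers)])
        self.classifier = nn.Linear(d_model, num_classes)   # the simple linear head

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # padding mask: (batch, 1, 1, seq_len), broadcast over heads and query positions
        mask = (token_ids != self.pad_id).unsqueeze(1).unsqueeze(2)
        x = self.pos(self.embed(token_ids))
        for layer in self.layers:
            x = layer(x, mask)
        # mean-pool over tokens, then classify (assumption: pooling choice isn't stated in the post)
        return self.classifier(x.mean(dim=1))
```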
And... it actually worked.
I trained it on 5,000 movie reviews from the IMDB dataset. After just 3 epochs, my Transformer—built from nothing but raw PyTorch and the paper—achieved 75% accuracy on the test set.
It learned! You can find the full code at https://github.com/praveena0506/Transformer-from-scratch.