On 22 December 2024, I wrote:
Over the Christmas break (and probably beyond) I’m planning to work through Sebastian Raschka’s book “Build a Large Language Model (from Scratch)”. I’m expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I’ll post here about anything I find particularly interesting.
More than ten months and 26 blog posts later, I’ve reached the end of the main body of the book – there’s just the appendices to go. Even allowing for the hedging, my optimism was adorable.
I don’t want to put anyone off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I’d only been reading it and doing the exercises. On the other hand, writing things up takes a lot of time – much more than the actual learning does. It’s worth it for me, but probably isn’t for everyone.
So, what next? I’ve finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my “LLM from scratch” journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series?
There are a few sources of ideas for this – from the book itself and its supplementary material, from notes I’ve made as I went along, and from other things that I’ve kept on a mental checklist.
The appendices and supplementary material
There are five appendices:
- A: An introduction to PyTorch
- B: References and further reading
- C: Exercise solutions
- D: Adding bells and whistles to the training loop
- E: Parameter-efficient fine-tuning with LoRA
 
Raschka also gives a link at the end of chapter 7 to a notebook showing how to do further fine-tuning using Direct Preference Optimization, which also looks fascinating, and he’s working on a new project, “Build a reasoning model (from scratch)”.
Things I’ve deferred myself
While working through the book, I’ve deliberately deferred various things. I’d kind of lost track of all of them, so I gave ChatGPT the source markdown for all of the posts in this series, and asked it to find where I’d done that. It did an amazing job! There were three categories: long context and attention efficiency, maths, and optimisers.
Long context and attention efficiency
The model we’ve built in the book has a context length of 1,024 tokens, and is O(n²) in both space and time with respect to the number of tokens you feed it. There are lots of things that people do to work around that. Things I need to learn:
- The KV cache. This is basic stuff and I feel I sorta-kinda understand it, but I haven’t written about it so I can’t be sure. It’s a pretty obvious enhancement to avoid repeating work when generating autoregressively – that is, the normal setup where, in order to generate n tokens, we give the model its input, sample our first token from its predictions, then feed the whole thing – the input and that first token – back in for the second token, and so on. Because attention is causal, we’re doing exactly the same work every time for all of the tokens in each round apart from the last one, so it makes sense to cache things. The result is that generating the first token is still O(n²), but subsequent ones will be something more like O(n) each. That’s why real-world modern models tend to take a while pondering before they generate the first token but then speed up – they need to fill their cache. (There’s a small sketch of this after the list.)
- FlashAttention and related things: there are lots of ways people have found to reduce the cost of attention generally, but this seems to be the most popular one, or at least the best to get started with.
- Better positional embeddings: the context length of our GPT-2-style LLM is fixed in part because you need position embeddings for every possible input position. That means that we can never extend it. More modern LLMs use better ways to represent positions – Rotary Position Embeddings (RoPE) look like they’re very popular.
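To make the first of those concrete, here’s a minimal sketch of KV caching for a single attention head – hypothetical shapes and variable names, not the book’s code or any real model’s:

```python
import torch

def attend(q, k, v):
    # q: (1, d); k, v: (t, d). Causality is implicit here, because the
    # cache only ever contains past positions plus the current one.
    scores = q @ k.T / k.shape[-1] ** 0.5    # (1, t)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                        # (1, d)

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):
    # In a real model, x would be the newest token's embedding (plus
    # everything the earlier layers did to it); random here for brevity.
    x = torch.randn(1, d)
    q = x @ W_q
    # Only the new token's K and V are computed and appended...
    k_cache = torch.cat([k_cache, x @ W_k])
    v_cache = torch.cat([v_cache, x @ W_v])
    # ...so each step is O(t) attention over the cache, not O(t²) overall.
    out = attend(q, k_cache, v_cache)
```

The point is that each new token only costs one K/V projection plus attention over the cached rows, rather than recomputing everything for the whole sequence.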
 
Maths
I really want to understand softmax at a better level than “it’s a magic thing that turns logits into probabilities”. I’d also like to learn more about higher-order tensor operations – the ones we use in the book essentially treat the extra dimensions as batch dimensions, but I believe there’s more to it than that.
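As a starting point, here’s that “magic thing” written out by hand – a minimal softmax with the standard max-subtraction trick for numerical stability (which changes nothing mathematically, since softmax is invariant to adding a constant to every logit):

```python
import torch

def softmax(logits, dim=-1):
    # Subtract the max for numerical stability; softmax(x) == softmax(x + c).
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exps = shifted.exp()
    return exps / exps.sum(dim=dim, keepdim=True)

logits = torch.tensor([2.0, 1.0, 0.1])
print(softmax(logits))                 # tensor([0.6590, 0.2424, 0.0986])
print(torch.softmax(logits, dim=-1))   # PyTorch's version, same values
```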
Optimisers
I really want to understand in reasonable depth what optimisers do. I know that they make gradient updates work better than they do with simple gradient descent. But how?
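From what I’ve picked up so far, at least part of the answer looks like this – a simplified sketch of Adam’s update rule (single parameter tensor, no weight decay), contrasted with what plain gradient descent would do:

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Plain SGD would just do: param = param - lr * grad
    m = b1 * m + (1 - b1) * grad        # running average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2   # running average of its square
    m_hat = m / (1 - b1 ** t)           # bias correction: m and v start at
    v_hat = v / (1 - b2 ** t)           # zero, so early averages run low
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

# t is the 1-based step count; m and v start as torch.zeros_like(param).
```

The m term smooths the direction of the step (momentum-like), and dividing by the square root of v scales the step size per parameter – which is, as I understand it, a big part of why it beats plain gradient descent in practice.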
Those were the things I noted down as I wrote the posts so far, but a few more come to mind as I write this.
Automatic differentiation and the backward pass
In some comments that he made on posts in this series, Simon said that it seems like this book isn’t really “from scratch”, given that we rely on PyTorch’s magic to handle the backward pass.
He’s 100% right! I think I understand why it is that way, though. I can see two ways the book could have done it:
- Manually code a backward pass to go with the forward pass on each of our modules. Simon did this, and was kind enough to share his code with me – it looks like one of those things (like attention) that is pretty hard to get your head around initially, but once it clicks it’s super-clear. Definitely kudos to him for getting it all to work! The problem with this is that I don’t think any ML practitioners do it nowadays, because automatic differentiation is there in every popular framework. So it might be a good learning experience, but it might also nudge people in an unprofitable direction.
- Create our own automatic differentiation system. Andrej Karpathy pops up again when looking into this; he created micrograd, which handles back-propagation for scalar functions. That’s really clever – but it would be hard, and a bit of a side quest from the point of the book. Also, the most interesting stuff (at least from what little I know) about automatic differentiation is how you do it with non-scalars – the matrices and higher-order tensors that our LLM uses. From what Simon says, this is where you need the mysterious Jacobian matrices I’ve heard about in the context of back-propagation. (There’s a toy example after this list.)
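To give a flavour of that second option, here’s a toy scalar autodiff sketch in the spirit of micrograd – my own simplification, not Karpathy’s code. Each operation records a closure that knows how to push gradients back to its inputs, and backward() replays those closures in reverse topological order:

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                  # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x                             # z = xy + x
z.backward()
print(x.grad, y.grad)                     # 4.0 (y + 1) and 2.0 (x)
```

Extending that from scalars to matrices and higher-order tensors is, as I understand it from Simon, exactly where the Jacobians come in.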
 
I think I’d definitely like to revisit that at some point.
Tokenisers
Another one from Simon; while the book does explain how tokenisers work, even down to a high-level overview of byte-pair encoding, we don’t write our own. Again, I can see why this is – we load in the GPT-2 weights, so we need to use that model’s tokeniser. And there’s no point in writing our own if we’re just going to throw it away.
But perhaps a bit of time playing with one would be useful?
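It probably would – the core of BPE training, at least, is small enough to sketch. Here’s a toy version of the merge loop (my own simplification, not GPT-2’s actual tokeniser): repeatedly find the most frequent adjacent pair of symbols across the corpus and merge it into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = "".join(pair)
    new_words = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[tuple(out)] = freq
    return new_words

# Word frequencies, each word as a tuple of single-character symbols.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, "->", list(words))
```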
Trying to train the LLM as a base model
The book, quite reasonably, shows you how to train your LLM, does a basic training run on a small dataset, and then switches to downloading the “pre-cooked” weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch.
But given that I was getting a pretty good training speed on my own hardware, perhaps I could train a model really from scratch, maybe using one of the smaller FineWeb datasets? Even if I can’t do it locally, it might be doable on a rented cloud machine, like the Lambda Labs ones I used when fine-tuning Llama 3?
After all, Andrej Karpathy is training a full model that you can chat with for $100.
Building an LLM from scratch on my own
I don’t think I ever mentioned this on the blog, but one important plan for me is to try to build an LLM from scratch, only using my own blog posts and what I remember – no looking at the book. If I can do that, then I can be reasonably sure that I really have learned it all.
I’m also thinking that I’ll do that using a different library – that is, not PyTorch. That would stop me from regurgitating code that I’ve learned. If you’re reading this within a day or so of the post’s publication, I’m running a poll on X/Twitter about which framework to use. If you have an opinion, please do stop by and vote :-)
Mixture-of-experts
It feels like almost every new model these days is an MoE. I have read a lot around the subject and would love to build one. Essentially, instead of having just one feed-forward network after your attention heads, you have several. In front of them you have a router – a trainable network of some kind – that tells you which of these “expert” FFNs each token should be forwarded to. You then send the token to the top expert (or the top k experts), while leaving the others inactive. The result is that you have more space (in terms of parameters) for the LLM to know about things, but not all of those parameters are active during inference – so your model is smarter but still fast.
There’s a bunch of interesting stuff there, from how you build it in the first place, to how you handle the fact that you’re processing lots of tokens at once – multiple tokens in each sequence and multiple sequences in a batch.
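As a rough sketch of just the routing part – hypothetical sizes, and it ignores the load-balancing losses and capacity tricks that real implementations need – it might look something like this:

```python
import torch
import torch.nn as nn

n_experts, k, d = 8, 2, 64
router = nn.Linear(d, n_experts)
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    for _ in range(n_experts)
)

tokens = torch.randn(10, d)                 # 10 token vectors, batch flattened
scores = router(tokens)                     # (10, n_experts)
top_vals, top_idx = scores.topk(k, dim=-1)  # each token's k best experts
weights = torch.softmax(top_vals, dim=-1)   # mixing weights over just those k

out = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    # Which tokens chose expert e, and in which of their k slots?
    token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
    if token_ids.numel():                   # inactive experts do no work
        w = weights[token_ids, slot].unsqueeze(-1)
        out[token_ids] += w * expert(tokens[token_ids])
```

The loop over experts is also where the batching questions bite: each expert wants a nice contiguous batch of tokens, but which tokens go where is only known at runtime.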
It would be a pretty cool follow-on to the “my own LLM” series, thinking about it.
So, what next?
I definitely don’t think I need to do all of those things in order to wrap up this series. Here’s the subset I’m planning on doing:
- Training the full GPT-2 base model myself. I’m 100% going to try this.
- From the appendices – anything that surprises me from the one on PyTorch, and perhaps from the “bells and whistles” one on the training loop. The others I either won’t do, or will pick up later.
- Building my own LLM from scratch in a different framework, without using the book. That is, I think, essential, and would perhaps be the crowning post of this series. It would be a nice way to end it, wouldn’t it?
 
For the other things, I think there are some potential future series to write.
- Improving context length – RoPE and other tricks – sounds like an excellent series to start on when I’m done with this. AIs tell me that other interesting things to look into would be ALiBi, NTK/YaRN scaling, and positional interpolation.
- Improving performance: the KV cache, FlashAttention, and other performance enhancements likewise feel like they could make a good series.
- I also want to do a separate series on LoRA. In that, I’ll draw on appendix E from this book, but also on other tutorials.
- Likewise DPO, along with the other post-training techniques that can be used to make models more useful as chatbots, like reinforcement learning. I’d really like to spend some time understanding that area. (And Raschka’s upcoming reasoning-model book might fit into that category too.)
- Optimisers: Adam, AdamW, maybe Muon (though that last one scares me a bit).
- The maths – softmax and higher-order tensor calculations – also seems to belong in another series, perhaps an extension of the various “maths for AI” posts I’ve done in the past.
- Automatic differentiation and the backward pass; that would make a great series.
- A mixture-of-experts model would be excellent fun, I think.
- Tokenisers would be a great stand-alone post, at least at the level that I can see myself covering it. Perhaps that would develop into a series if I found myself getting sucked in.
 
I’m certainly not promising that I’ll write up all (or even any) of that second list, but they all seem really tempting to me right now. If you’re particularly interested in seeing my take on any of them, please do leave a comment below.
Coming up...
I think the next post in this series – maybe the next several posts – will be on trying to train the model code provided in the book from scratch to produce my own base model. Stay tuned!