From Chinchilla’s 20:1 rule to SmolLM3’s 3,700:1 ratio: how inference economics rewrote the training playbook
Training a language model is expensive. Really expensive. A single training run for a 70-billion-parameter model can cost millions of dollars in compute.
So before you spin up a cluster of GPUs and start burning through your budget, you need to answer a fundamental question: given a fixed amount of compute, should you train a larger model on less data, or a smaller model on more data?
This isn’t just an academic curiosity. Get it right, and you can punch way above your weight, creating models that compete with much more expensive alternatives.
This is where scaling laws come in. They’re empirical formulas that help us predict how model performance changes based on three key factors:
- Model size (parameters)
- Training data (tokens)
- Compute budget (FLOPs)
Think of them as the GPS for your training journey, helping you navigate the tradeoffs between these dimensions.
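To make the tradeoff concrete, here is a minimal Python sketch (not from any particular codebase) built on two widely used rules of thumb: training compute is roughly C ≈ 6 · N · D FLOPs for N parameters and D tokens, and Chinchilla's compute-optimal heuristic is roughly 20 tokens per parameter. The budget value and function names below are illustrative assumptions, not a definitive recipe.

```python
# A rough sketch of the compute/size/data tradeoff, assuming:
#   - training compute C ≈ 6 * N * D FLOPs (N params, D tokens)
#   - Chinchilla's compute-optimal heuristic of ~20 tokens per parameter
# The budget and model sizes below are illustrative only.

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens you can afford at a given model size, under C ≈ 6 * N * D."""
    return compute_flops / (6 * n_params)

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Model size and token count satisfying the ~20:1 token-to-parameter rule.

    Solving C = 6 * N * (tokens_per_param * N) gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

if __name__ == "__main__":
    budget = 1e23  # FLOPs; an illustrative compute budget

    # Same budget, two choices: the bigger model sees far fewer tokens.
    for n in (7e9, 70e9):
        d = tokens_for_budget(budget, n)
        print(f"{n/1e9:>5.0f}B params -> {d/1e9:,.0f}B tokens ({d/n:.0f} tokens/param)")

    # The compute-optimal point under the 20:1 heuristic.
    n_opt, d_opt = chinchilla_optimal(budget)
    print(f"Chinchilla-optimal: ~{n_opt/1e9:.1f}B params, ~{d_opt/1e9:.0f}B tokens")
```

One consequence of this heuristic worth noticing: because both N and D scale with the square root of C, doubling your compute budget grows the optimal model size and the optimal token count by roughly 1.4x each, rather than pouring everything into one dimension.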