From Chinchilla’s 20:1 rule to SmolLM3’s 3,700:1 ratio: how inference economics rewrote the training playbook
Training a language model is expensive. Really expensive. A single training run for a 70-billion-parameter model can cost millions of dollars in compute.
So before you spin up a cluster of GPUs and start burning through your budget, you need to answer a fundamental question: given a fixed amount of compute, should you train a larger model on less data, or a smaller model on more data?
This isn’t just an academic curiosity. Get it right, and you can punch way above your weight, creating models that compete with much more expensive alternatives.
This is where scaling laws come in. They’re empirical formulas that help us predict how model performance changes based on three key factors:
- Model size (parameters)
- Training data (tokens)
- Compute budget (FLOPs)
Think of them as the GPS for your training journey, helping you navigate the tradeoffs between these dimensions.
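To make the tradeoff concrete, here is a minimal Python sketch (not from any particular codebase) built on two widely used rules of thumb: training compute is roughly C ≈ 6 · N · D FLOPs for N parameters and D tokens, and Chinchilla's compute-optimal heuristic is roughly 20 tokens per parameter. The budget value and function names below are illustrative assumptions, not a definitive recipe.

```python
# A rough sketch of the compute/size/data tradeoff, assuming:
#   - training compute C ≈ 6 * N * D FLOPs (N params, D tokens)
#   - Chinchilla's compute-optimal heuristic of ~20 tokens per parameter
# The budget and model sizes below are illustrative only.

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens you can afford at a given model size, under C ≈ 6 * N * D."""
    return compute_flops / (6 * n_params)

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Model size and token count satisfying the ~20:1 token-to-parameter rule.

    Solving C = 6 * N * (tokens_per_param * N) gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

if __name__ == "__main__":
    budget = 1e23  # FLOPs; an illustrative compute budget

    # Same budget, two choices: the bigger model sees far fewer tokens.
    for n in (7e9, 70e9):
        d = tokens_for_budget(budget, n)
        print(f"{n/1e9:>5.0f}B params -> {d/1e9:,.0f}B tokens ({d/n:.0f} tokens/param)")

    # The compute-optimal point under the 20:1 heuristic.
    n_opt, d_opt = chinchilla_optimal(budget)
    print(f"Chinchilla-optimal: ~{n_opt/1e9:.1f}B params, ~{d_opt/1e9:.0f}B tokens")
```

One consequence of this heuristic worth noticing: because both N and D scale with the square root of C, doubling your compute budget grows the optimal model size and the optimal token count by roughly 1.4x each, rather than pouring everything into one dimension.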