Training Compute-Optimal Large Language Models

Artificial Intelligence

arXiv

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

29 Mar 2022 • 3 min read

Quick Insight

Better AI Comes From More Data, Not Just Bigger Models

Many large AI systems grew in size while being trained on about the same amount of data, which left them undertrained. New results show that, to get the most from a given compute budget, model size and training data should be scaled together: for every doubling of model size, the number of training tokens should double too. A smaller model trained on much more data beat far larger rivals, showing that rebalancing can yield better and cheaper AI. The star example, Chinchilla, uses a smaller network but sees more data, and it outperforms several much larger models while needing less compute at inference time. That means faster, cheaper AI for applications and more models that most teams can actually run. This flips a common idea on its head and points a clear way forward: spend compute on enough data, not only on size. People building the next generation of models will likely favor smarter training over sheer size.

Article Short Review

Reframing compute, size and data in language-model scaling

Context and high-level goal

At first glance this work challenges a dominant intuition about ever-larger models: that bigger is always better. The authors frame a competing hypothesis that, under a fixed compute budget, performance is governed as much by the number of training tokens as by model size, and that many contemporary models are effectively undertrained. One detail that stood out to me is how this reframing forces us to rethink the trade-off between parameter count and the amount of data a model sees: it calls for balance rather than an obsession with sheer scale. I find this approach promising because it makes downstream costs more tractable.

Scope and empirical breadth

Practically, the claim is backed by a large sweep of experiments rather than a handful of checkpoints: the paper reports training over 400 language models ranging from 70M to over 16B parameters on 5B to 500B tokens, with compute budgets measured in FLOPs. That breadth is important: it lets the authors fit trends across regimes rather than tuning their conclusions to a single one. Oddly enough, the scale of the sweep both reassures and makes me wonder about diminishing returns in unexplored corners, but the experimental span does lend weight to the central claim.

Empirical strategy and modeling

They do not rely on a single fitting trick. Instead, three complementary methods are used: direct power-law fits, cross-size/token comparisons, and a parametric loss model whose coefficients are fitted by minimizing a robust Huber loss with L-BFGS. Together these trace an efficient frontier in compute–size–data space. In practice this triangulation is convincing, since the agreement between methods reduces the chance that one fitting choice drove the conclusion, although it may obscure model-specific nuances.
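The third method can be sketched as follows. This is a minimal sketch, assuming the parametric form L(N, D) = E + A/N^alpha + B/D^beta described in the paper; the function names, the toy coefficients, and the delta value are illustrative assumptions. The sketch only builds the Huber objective that an optimizer such as L-BFGS would minimize; it does not run the optimizer.

```python
import math

def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Predicted loss: irreducible term plus model-size and data-size terms."""
    return E + A / n_params**alpha + B / n_tokens**beta

def huber(residual, delta=1e-3):
    """Huber loss: quadratic near zero, linear in the tails (robust to outliers)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def objective(theta, runs, delta=1e-3):
    """Sum of Huber losses on log-space residuals over all training runs.

    theta = (E, A, B, alpha, beta)
    runs  = [(n_params, n_tokens, observed_loss), ...]
    """
    E, A, B, alpha, beta = theta
    total = 0.0
    for n_params, n_tokens, observed in runs:
        predicted = parametric_loss(n_params, n_tokens, E, A, B, alpha, beta)
        total += huber(math.log(predicted) - math.log(observed), delta)
    return total

# Toy data generated from hypothetical coefficients (assumed for illustration only).
true_theta = (2.0, 300.0, 300.0, 0.3, 0.3)
runs = [(n, d, parametric_loss(n, d, *true_theta))
        for n in (7e7, 1e9, 1.6e10)
        for d in (5e9, 5e10, 5e11)]

at_truth = objective(true_theta, runs)                        # residuals are all zero
perturbed = objective((2.0, 300.0, 300.0, 0.25, 0.3), runs)   # wrong alpha
```

The objective is zero at the generating coefficients and grows when a coefficient is perturbed, which is the signal a numerical optimizer follows when fitting real runs.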

Key scaling conclusion

The central quantitative insight is straightforward and surprising: the compute-optimal trajectory increases model size and training tokens approximately equally, so that each doubling of parameters should be matched by a doubling of tokens, defining a compute-optimal scaling rule that departs from the earlier Kaplan et al. prescriptions. From another angle, this suggests many large models are oversized for the number of optimization steps they received, a structural observation that appears robust across their experimental grid.
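The doubling rule can be made concrete with a short sketch. It assumes that training compute grows in proportion to the product of parameters and tokens, so that doubling both quadruples compute; the exponents of exactly 0.5 are an idealization of "approximately equally", and the function name and starting allocation are hypothetical.

```python
def scale_allocation(n_params, n_tokens, compute_multiplier, a=0.5, b=0.5):
    """Scale a (params, tokens) allocation when compute grows by compute_multiplier.

    a and b are the exponents in N_opt proportional to C^a and
    D_opt proportional to C^b. Setting a = b = 0.5 encodes the
    "scale both equally" rule (an idealized assumption).
    """
    return (n_params * compute_multiplier**a,
            n_tokens * compute_multiplier**b)

# Hypothetical starting point: 1B parameters trained on 20B tokens.
base_n, base_d = 1e9, 2e10

# Quadrupling compute doubles both parameters and tokens.
new_n, new_d = scale_allocation(base_n, base_d, compute_multiplier=4.0)

# The product N * D, used here as a proxy for compute, grows fourfold.
compute_growth = (new_n * new_d) / (base_n * base_d)
```

With unequal exponents (for example, a larger share for parameters), the same function shows how quickly an allocation drifts toward large, data-starved models as compute grows.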

Chinchilla as a testbed for compute-optimality

Design choices and training recipe

To validate the iso-FLOP implication, the authors trained Chinchilla: a model with 70B parameters trained on about 1.4 trillion tokens using roughly the same compute budget as a prior, larger model, Gopher. Practical engineering choices include the use of AdamW and a modified SentencePiece tokenizer, which the paper presents as method refinements rather than radical departures. I find it notable, and useful, that these implementation details are spelled out, since they affect reproducibility and downstream transferability.
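A back-of-the-envelope check of the "roughly the same compute" claim, as a sketch. It assumes the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token; Gopher's size (280B parameters) and training set (300B tokens) come from the paper rather than from this review, and the function name is hypothetical.

```python
def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# Chinchilla: figures quoted in the review above.
chinchilla_flops = approx_train_flops(70e9, 1.4e12)

# Gopher: figures taken from the paper, not from this review.
gopher_flops = approx_train_flops(280e9, 300e9)

# How close the two budgets are under this approximation.
budget_ratio = chinchilla_flops / gopher_flops

# Tokens seen per parameter for each model.
chinchilla_tokens_per_param = 1.4e12 / 70e9
gopher_tokens_per_param = 300e9 / 280e9

print(f"Chinchilla ~{chinchilla_flops:.2e} FLOPs, Gopher ~{gopher_flops:.2e} FLOPs")
print(f"tokens per parameter: {chinchilla_tokens_per_param:.1f} vs {gopher_tokens_per_param:.1f}")
```

Under this approximation the two budgets land within about 20% of each other, while Chinchilla sees roughly 20 tokens per parameter against roughly 1 for Gopher.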

Benchmark performance and comparisons

Across a broad evaluation, Chinchilla consistently outperformed larger contemporaries: it reached a state-of-the-art average accuracy on MMLU (reported at 67.5%) and beat models such as Gopher and GPT-3 on a wide set of tasks, including BIG-bench and reading comprehension. The gains were particularly large on datasets like RACE, which suggests that more training data can yield substantive generalization improvements even when parameter counts are reduced.

Behavioral and safety analyses

The authors do not ignore risk: they evaluated gender bias (e.g., Winogender) and measured toxicity with tools like Perspective API, reporting only modest differences between Chinchilla and prior models. A striking point is that lower language-modeling loss did not translate into large drops in unconditional toxicity, which may indicate that dataset quality and curation are as important as model scaling. I found myself wondering whether targeted mitigation would shift this modest difference more than further scale adjustments would.

Interpretation, limitations, and implications

Interpretive synthesis

From a theoretical and practical vantage, the paper’s message is conservative but consequential: within a FLOP budget the efficient path appears to be to scale data alongside parameters, producing models that are both more performant and cheaper to serve. This speaks directly to compute-optimal training, the notion of an efficient frontier, and the observation that many models are undertrained relative to their parameterization — an observation that may reshape resource allocation for future research. One detail that stood out to me is how this realignment makes fine-tuning and inference less costly, which has real-world payoff.

Methodological caveats

The authors acknowledge several important limitations: the efficient frontier relies on a power-law assumption, the training runs covered less than one epoch of a large corpus, and the analysis presumes a similar data distribution across regimes. These are nontrivial caveats: they may limit extrapolation to regimes far outside the experimental grid or to models with fundamentally different architectures such as Mixture-of-Experts (MoE). This part seems less intuitive to me, and I would have liked a clearer sensitivity analysis.

Practical and ethical implications

Finally, the work offers immediate operational guidance: rebalancing toward more training tokens yields a smaller model for the same training budget, which lowers inference and fine-tuning costs while preserving or improving downstream accuracy. That is a tangible win. At the same time, scaling datasets raises questions about dataset quality, privacy, and reproducibility; the authors emphasize the need for careful curation and note that some biases and toxic outputs persist. From another angle, the methodology appears generalizable to other modalities and seems reproducible given the detailed reporting.
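The serving-cost argument can be illustrated with a rough sketch. It assumes the common approximation that a forward pass of a dense transformer costs about 2 FLOPs per parameter per token, and it ignores memory bandwidth, batching, and context-length effects; Gopher's 280B parameter count comes from the paper, and the function names are hypothetical.

```python
def approx_inference_flops_per_token(n_params):
    """Rule of thumb: about 2 FLOPs per parameter per token (forward pass only)."""
    return 2.0 * n_params

def relative_serving_cost(n_params_small, n_params_large):
    """Per-token inference cost of the smaller model as a fraction of the larger one."""
    return (approx_inference_flops_per_token(n_params_small)
            / approx_inference_flops_per_token(n_params_large))

# Chinchilla (70B) versus Gopher (280B, figure from the paper).
cost_fraction = relative_serving_cost(70e9, 280e9)
print(f"Chinchilla per-token cost is about {cost_fraction:.0%} of Gopher's")
```

Because this approximation is linear in parameter count, a fourfold reduction in parameters translates directly into a fourfold reduction in per-token compute; real deployments will differ once hardware and batching details are taken into account.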

Frequently Asked Questions

How does model size trade off with training tokens under fixed compute?

Under a fixed compute budget, the optimal trajectory increases model size and training tokens roughly equally: each doubling of parameters should be matched by a doubling of tokens. The review argues many large models are effectively undertrained because they received too few optimization steps for their parameter counts.

What experiments support the compute-optimal scaling claim in the review?

The claim rests on a broad sweep: over 400 language models ranging from 70M to 16B parameters trained on corpora of 5–500B tokens, with compute estimated via FLOPs. Multiple fitting approaches were used so the pattern is based on extensive cross-regime data rather than a single checkpoint.

What is Chinchilla and why was it trained?

Chinchilla is a 70B-parameter model trained on about 1.4 trillion tokens using roughly the same compute budget as the larger Gopher model, in order to validate the iso-FLOP implication. The training recipe relied on refinements such as AdamW and a modified SentencePiece tokenizer rather than radical architectural change, and these details are documented, which aids reproducibility.

How did Chinchilla perform compared to larger contemporaries?

Chinchilla consistently outperformed larger models, reaching near 67.5% average accuracy on MMLU and beating models such as Gopher and GPT‑3 across many tasks, including BIG-bench and reading comprehension. The largest gains appeared on datasets like RACE, suggesting more exposure to data improved generalization despite fewer parameters.

What methodological approaches were used to fit the scaling relationships?

The review describes three complementary methods: direct power-law fits, cross-size/token comparisons, and a parametric loss model whose coefficients are fitted by minimizing a Huber loss with L-BFGS. That triangulation helps ensure the efficient frontier in compute–size–data space is not an artifact of a single fitting choice.

What are the main limitations of the compute‑optimal analysis?

Key caveats include reliance on a power-law assumption, training runs that covered less than one epoch of a large corpus, and an implicit assumption of similar data distributions across regimes. These factors limit extrapolation to very different settings or architectures such as Mixture-of-Experts models, and the review notes a need for clearer sensitivity analysis.

How does scaling tokens affect safety, bias, and toxicity outcomes?

Evaluation of gender bias (e.g., Winogender) and of toxicity via Perspective API showed only modest differences between Chinchilla and prior models, so lower loss did not produce large reductions in unconditional toxicity. That pattern implies that dataset quality and curation may matter as much as scaling for reducing harmful outputs.
