[2203.15556] Training Compute-Optimal Large Language Models

Covered by 7 sources including DEV Community, blog.dougbelshaw.com

"We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size...

Covered in 7 articles

AI's energy problem is a systems problem

Dissolving the Deep Learning Sample Efficiency Gap

A curated, verified map of LLM theory — expressivity, scaling laws, ICL, alignment, interpretability, and open problems