Covers 3 stories including [2203.15556] Training Compute-Optimal Large Language Models. Covered by Data Science Weekly Newsletter. Discussed on Hacker News.

Scaling laws are among the most important empirical findings in deep learning. The observation is simple in form: the training loss $L$ decreases predictably as we scale up model size $N$, dataset size $D$, and compute $C$, following a power law that appears as a straight line on a log-log plot. We can view scaling laws as a framework for describing the relationship between compute, loss, model size, and data; at their core, they are about how to allocate precious compute optimally between ...
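
To make the "straight line on a log-log plot" remark concrete, here is a minimal sketch that fits a pure power law $L \approx a \cdot N^{-\alpha}$ by ordinary least squares on $\log N$ and $\log L$. The function name `fit_power_law`, the synthetic data points, and the constants `a = 10.0` and `alpha = 0.1` are all illustrative assumptions, not values from the article or from the cited paper. The sketch also assumes there is no irreducible loss term, which real fits often include.

```python
import math


def fit_power_law(xs, losses):
    """Fit losses ~ a * xs**(-alpha) via least squares in log-log space.

    Returns (a, alpha). Hypothetical helper for illustration only.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line: log L = intercept + slope * log N
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law with exponent -alpha has slope -alpha on a log-log plot.
    return math.exp(intercept), -slope


# Synthetic, made-up data: model sizes N and losses generated from a known power law.
model_sizes = [1e6, 1e7, 1e8, 1e9]
true_a, true_alpha = 10.0, 0.1
synthetic_losses = [true_a * n ** (-true_alpha) for n in model_sizes]

a_hat, alpha_hat = fit_power_law(model_sizes, synthetic_losses)
```

Because the synthetic losses lie exactly on a power law, the fit recovers the generating constants up to floating-point error; with measured losses, the residuals from the fitted line indicate how well the power-law form actually holds.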


Covered in 1 article: Data Science Weekly Newsletter, Issue 657

Discussed on Substack