Model Size Scaling in 2023-2031

Token generation speed is constrained by how fast the relevant HBM contents, mostly the weights and the KV-cache, can be read. Suppose a model is large enough that a single pass over the weights reads more than half of HBM, that the HBM is read in parallel within a scale-up system, and that N such systems are used in a pipeline. Then the time to generate a token (without speculative decoding) is at least N times the time it takes to read more than half of an HBM stack. If we target a...
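The lower bound described above can be sketched numerically. This is a minimal illustration, not the original author's calculation: the function names, the parameter names, and every number in the usage note are hypothetical placeholders chosen for round arithmetic. It assumes all HBM stacks in one scale-up system are read fully in parallel, so one stage takes as long as reading the relevant fraction of a single stack, and that the N pipeline stages run one after another for each token.

```python
def min_token_latency_s(stack_capacity_gb: float,
                        stack_bandwidth_gb_per_s: float,
                        fraction_read: float,
                        pipeline_stages: int) -> float:
    """Lower bound on seconds per generated token (no speculative decoding).

    Assumption (hypothetical model): stacks within a scale-up system are read
    in parallel, so a stage costs the time to read `fraction_read` of ONE
    stack; the pipeline stages are traversed sequentially per token.
    """
    if not 0.0 < fraction_read <= 1.0:
        raise ValueError("fraction_read must be in (0, 1]")
    if pipeline_stages < 1:
        raise ValueError("pipeline_stages must be at least 1")
    if stack_capacity_gb <= 0 or stack_bandwidth_gb_per_s <= 0:
        raise ValueError("capacity and bandwidth must be positive")

    # Time for one scale-up system to stream its share of weights and KV-cache.
    per_stage_s = fraction_read * stack_capacity_gb / stack_bandwidth_gb_per_s
    # Each token passes through every pipeline stage in turn.
    return pipeline_stages * per_stage_s


def max_tokens_per_s(stack_capacity_gb: float,
                     stack_bandwidth_gb_per_s: float,
                     fraction_read: float,
                     pipeline_stages: int) -> float:
    """Upper bound on per-sequence decode speed implied by the latency bound."""
    return 1.0 / min_token_latency_s(stack_capacity_gb,
                                     stack_bandwidth_gb_per_s,
                                     fraction_read,
                                     pipeline_stages)
```

For example, with placeholder values of a 32 GB stack read at 1000 GB/s, exactly half of the stack read per pass, and N = 4, `min_token_latency_s(32.0, 1000.0, 0.5, 4)` gives roughly 0.064 seconds per token. Because the text assumes more than half of HBM is read, passing 0.5 as the fraction gives a floor on latency rather than an estimate.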
