Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Quantization is widely used to reduce the inference cost of large language models, but its effect on reasoning models is not fully captured by final-answer accuracy or per-token latency. We show that low-bit post-training quantization can introduce a hidden test-time compute cost: quantized reasoning models often generate longer chains of thought even when they still answer correctly. Across mathematical reasoning, code generation, scientific qu...
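To make the idea of a hidden test-time cost concrete, the following is a minimal sketch of how token inflation could be measured and folded into an end-to-end speedup estimate. It is not the paper's method: the function names `token_inflation` and `effective_speedup`, the input layout, and all numbers are hypothetical and chosen only for illustration. The sketch assumes inflation is the mean ratio of generated tokens (quantized over baseline) on prompts that both models answer correctly, and that generation time scales roughly linearly with the number of generated tokens.

```python
from statistics import mean


def token_inflation(baseline, quantized):
    """Mean ratio of generated tokens (quantized / baseline).

    Each argument maps a prompt id to (num_generated_tokens, is_correct).
    Only prompts that BOTH models answer correctly are compared, so the
    ratio isolates extra reasoning length from accuracy changes.
    """
    shared = [
        p for p in baseline
        if p in quantized
        and baseline[p][1] and quantized[p][1]  # both answers correct
        and baseline[p][0] > 0                  # avoid division by zero
    ]
    if not shared:
        raise ValueError("no prompts answered correctly by both models")
    return mean(quantized[p][0] / baseline[p][0] for p in shared)


def effective_speedup(per_token_speedup, inflation):
    """End-to-end speedup once longer outputs are accounted for.

    Assumes latency is proportional to the number of generated tokens.
    """
    return per_token_speedup / inflation


# Toy numbers, invented for illustration only (not results from the paper).
baseline = {"q1": (400, True), "q2": (1000, True), "q3": (800, False)}
quantized = {"q1": (500, True), "q2": (1500, True), "q3": (900, True)}

inflation = token_inflation(baseline, quantized)  # q3 is excluded
speedup = effective_speedup(2.0, inflation)
print(f"inflation={inflation:.3f}, effective speedup={speedup:.3f}")
```

With these toy inputs, a nominal 2x per-token speedup shrinks once the longer outputs are counted, which is the kind of cost that final-answer accuracy and per-token latency alone would not reveal.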
