arXiv:2601.11663v1 Announce Type: cross Abstract: Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such as AWQ prioritize channels with large activation magnitudes, while second-order methods such as GPTQ allocate quantization error according to input covariance structure. Despite strong empirical performance, these approaches remain conceptually fragmented, and it is unclear what underlying quantity they are approximating. In this work, we present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss. Using a…
arXiv:2601.11663v1 Announce Type: cross Abstract: Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such as AWQ prioritize channels with large activation magnitudes, while second-order methods such as GPTQ allocate quantization error according to input covariance structure. Despite strong empirical performance, these approaches remain conceptually fragmented, and it is unclear what underlying quantity they are approximating. In this work, we present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss. Using a first-order Taylor expansion, we show that sensitivity naturally arises as the squared norm of gradient-weighted activations, yielding a principled measure of channel importance that captures both activation magnitude and downstream error propagation. Within this framework, AWQ and GPTQ can be interpreted as complementary approximations that recover sensitivity under distinct simplifying assumptions. We analyze the design space of sensitivity metrics, connect gradient-based saliency, Fisher information, and Hessian-based criteria, and clarify their relationships to classical pruning methods such as Optimal Brain Damage and Optimal Brain Surgeon. Rather than proposing a new quantization algorithm, this work provides a conceptual foundation for understanding and comparing post-training quantization methods through the lens of sensitivity.