Scaling AI Inference on Kubernetes: The Case for Token-Based Autoscaling

Request count is a poor scaling signal for LLM inference. Here's how token throughput, KV cache utilization, and latency combine into better autoscaling signals.
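
As a rough illustration of the idea, the sketch below computes a replica count from the three signals named above. It borrows the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × observed / target), taking the largest result across metrics). The target values, the `ReplicaMetrics` type, and the `desired_replicas` function are hypothetical names chosen for this example, and treating latency as proportional to load is a simplifying assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplicaMetrics:
    """Per-replica averages; field names are hypothetical, not from any library."""
    tokens_per_second: float      # prompt + generated tokens processed per second
    kv_cache_utilization: float   # fraction of KV cache in use, 0.0 to 1.0
    p95_latency_seconds: float    # observed p95 request latency


# Hypothetical targets; real values depend on the model, hardware, and SLOs.
TARGET_TOKENS_PER_SECOND = 2000.0
TARGET_KV_CACHE_UTILIZATION = 0.8
TARGET_P95_LATENCY_SECONDS = 2.0


def desired_replicas(current_replicas: int, avg: ReplicaMetrics) -> int:
    """Apply the HPA-style proportional rule to each signal and keep the largest."""
    ratios = [
        avg.tokens_per_second / TARGET_TOKENS_PER_SECOND,
        avg.kv_cache_utilization / TARGET_KV_CACHE_UTILIZATION,
        # Assumption: latency scales roughly linearly with load near the target.
        avg.p95_latency_seconds / TARGET_P95_LATENCY_SECONDS,
    ]
    # Never scale below one replica in this sketch.
    return max(1, math.ceil(current_replicas * max(ratios)))


if __name__ == "__main__":
    # Few requests, but long generations: token throughput is 1.5x the target.
    busy = ReplicaMetrics(tokens_per_second=3000.0,
                          kv_cache_utilization=0.4,
                          p95_latency_seconds=1.0)
    print(desired_replicas(4, busy))
```

With these assumed targets, four replicas averaging 3000 tokens per second would scale to six, even if the request count alone looked unremarkable.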