Scaling AI Inference on Kubernetes: The Case for Token-Based Autoscaling
Request count is a poor scaling signal for LLM inference. Here's how token throughput, KV cache utilization, and latency combine into smarter autoscaling signals.
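As a rough illustration of the idea, the sketch below applies the ratio rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current value / target value), taking the largest proposal across metrics) to the three signals named above. The signal names, targets, and observed values are hypothetical, chosen only for the example, and treating latency as a simple ratio metric is a simplifying assumption rather than a recommendation.

```python
import math
from dataclasses import dataclass


@dataclass
class Signal:
    name: str       # hypothetical metric name, for illustration only
    current: float  # observed average value per replica
    target: float   # desired average value per replica


def desired_replicas(current_replicas: int, signals: list[Signal],
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """HPA-style rule: each signal proposes ceil(replicas * current / target);
    the largest proposal wins, clamped to [min_replicas, max_replicas]."""
    proposals = [
        math.ceil(current_replicas * s.current / s.target) for s in signals
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Made-up readings for a deployment currently running 4 replicas.
signals = [
    Signal("tokens_per_second", current=1800.0, target=1200.0),
    Signal("kv_cache_utilization", current=0.9, target=0.75),
    Signal("p95_latency_seconds", current=2.0, target=2.5),
]

print(desired_replicas(4, signals))  # prints 6
```

Here token throughput is the binding signal: it alone asks for 6 replicas, KV cache utilization asks for 5, and latency is already under its target, so a request-count metric could easily have missed the pressure.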