Run real-time and async inference on the same infrastructure with GKE Inference Gateway

GKE Inference Gateway treats accelerator capacity as a fluid resource pool that serves both real-time workloads that need deterministic latency and async workloads that need high throughput.