Run real-time and async inference on the same infrastructure with GKE Inference Gateway
GKE Inference Gateway treats accelerator capacity as a fluid resource pool that serves both real-time workloads that need deterministic latency and async workloads that need high throughput.