Distributed LLM Inference with LLM-d

Discussed on Hacker News

An introduction to llm-d, an open-source, LLM-aware router that schedules requests across inference engines such as vLLM based on KV cache locality and GPU utilization.
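To make the routing idea concrete, here is a minimal Python sketch of a scorer that weighs KV cache locality against GPU utilization. It is an illustration only, not llm-d's actual algorithm or API: the `Replica` class, the `score` and `pick_replica` functions, the weights, and the sample numbers are all hypothetical, and cache locality is approximated as the fraction of the prompt's leading tokens that a replica has already cached.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical inference-engine replica (for example, one vLLM server)."""
    name: str
    gpu_utilization: float  # fraction in [0.0, 1.0]; assumed to come from metrics
    cached_prefixes: list = field(default_factory=list)  # token-ID lists held in KV cache


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score(replica, prompt_tokens, cache_weight=1.0, load_weight=1.0):
    """Higher is better: reward reusable KV cache, penalize a busy GPU.

    The weights are illustrative assumptions, not tuned values.
    """
    best = max(
        (shared_prefix_len(p, prompt_tokens) for p in replica.cached_prefixes),
        default=0,
    )
    hit_ratio = best / len(prompt_tokens) if prompt_tokens else 0.0
    return cache_weight * hit_ratio - load_weight * replica.gpu_utilization


def pick_replica(replicas, prompt_tokens):
    """Route the request to the replica with the highest score."""
    return max(replicas, key=lambda r: score(r, prompt_tokens))


replicas = [
    Replica("replica-a", gpu_utilization=0.30, cached_prefixes=[[1, 2, 3, 4]]),
    Replica("replica-b", gpu_utilization=0.10),
]

# Half of this prompt is already cached on replica-a, which outweighs its higher load.
print(pick_replica(replicas, [1, 2, 3, 4, 5, 6, 7, 8]).name)

# No replica has this prefix cached, so the less loaded replica wins.
print(pick_replica(replicas, [9, 9, 9, 9]).name)
```

A weighted sum is the simplest way to combine the two signals. A production router would likely track cache contents at block granularity and consider further signals such as queue depth, which this sketch leaves out.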
