Concordia: JIT-Compiled Persistent-Kernel Checkpointing for Fault-Tolerant LLM Inference (opens in new tab)

Long-running LLM agents keep valuable state resident on GPUs: KV caches, request schedulers, communication state, and sometimes online adapters. Losing this state after a GPU or communicator failure can discard minutes to hours of work, yet existing recovery mechanisms either restart the whole serving stack or require application-specific checkpoint logic inside every attention and runtime component. This paper argues that fault tolerance for ...

Read the original article