EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference

arXiv:2601.21758v1. Abstract: Serving Large Language Models (LLMs) under mixed workloads (short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests) poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS) policies suffer from severe head-of-line blocking, leading to high tail latency and underutilized hardware. We introduce EWSJF (Effective Workload-based Shortest Job First), an adaptive request-level scheduler that learns workload structure in real time to jointly improve fairness and throughput. EWSJF operates upstream of execution-level schedulers and integrates four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic …
