# Mini-SGLang

A lightweight yet high-performance inference framework for Large Language Models.

Mini-SGLang is a compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems. With a codebase of only ~5,000 lines of Python, it serves as both a capable inference engine and a transparent reference for researchers and developers.

## ✨ Key Features

- **High Performance**: Achieves state-of-the-art throughput and latency through advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
  - **Radix Cache**: Reuses the KV cache for shared prefixes across requests (see the sketch after this list).
  - **Chunked Prefill**: Reduces peak memory usage for long-context serving.
  - **Overlap Scheduling**: Hides CPU scheduling overhead behind GPU computation.
  - **Tensor Parallelism**: Scales inference across multiple GPUs.
  - **Optimized Kernels**: Integrates FlashAttention and FlashInfer for maximum efficiency.
  - ...
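To make the radix-cache idea concrete, here is a toy, token-level sketch of prefix reuse. It is illustrative only, not Mini-SGLang's actual implementation: `ToyPrefixCache` is a hypothetical name, a real radix tree compresses token runs into edges, and the production cache tracks GPU KV-cache blocks and evicts them with an LRU policy.

```python
# Illustrative only: a toy token-level prefix cache, not Mini-SGLang's
# RadixCache. A real radix tree compresses token runs into edges, stores
# references to GPU KV-cache blocks at each node, and evicts via LRU.
class ToyPrefixCache:
    def __init__(self) -> None:
        self.children: dict[int, "ToyPrefixCache"] = {}

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self, 0
        for t in tokens:
            if t not in node.children:
                break
            node, matched = node.children[t], matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV entries now exist for every prefix of `tokens`."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, ToyPrefixCache())


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])               # first request fills the KV cache
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: only one token left to prefill
```

The payoff is that requests sharing a long prefix (e.g. the same system prompt) pay its prefill cost only once.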
## 🚀 Quick Start

### 1. Environment Setup

We recommend using uv for a fast, reliable installation (uv does not conflict with conda).

```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```

**Prerequisites**: Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the NVIDIA CUDA Toolkit is installed and that its version is compatible with your driver; you can check the highest CUDA version your driver supports with nvidia-smi.
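As an additional sanity check (a generic snippet, assuming PyTorch is already installed in your environment), you can confirm which CUDA version PyTorch was built against and whether a GPU is visible:

```python
# Generic CUDA sanity check, assuming PyTorch is installed.
import torch

print("torch CUDA build:", torch.version.cuda)  # e.g. "12.4"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```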
### 2. Installation

Install Mini-SGLang directly from source:

```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
```
### 3. Online Serving

Launch an OpenAI-compatible API server with a single command.

```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with tensor parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```

Once the server is running, you can send requests using standard tools such as curl or any OpenAI-compatible client.
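For example, a minimal Python client (a sketch assuming the second server above, listening on port 30000; /v1/chat/completions is the standard OpenAI-compatible route, so adjust the host, port, and model name to match your deployment):

```python
# Minimal OpenAI-compatible request, assuming the server was launched
# with --port 30000 as in the example above.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```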
### 4. Interactive Shell

Chat with your model directly in the terminal by adding the --shell flag:

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

Inside the shell, use /reset to clear the chat history.
## Benchmark

### Offline Inference

See bench_nanovllm.py for details. Set MINISGL_DISABLE_OVERLAP_SCHEDULING=1 to run an ablation of overlap scheduling.

Test configuration:

- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total requests: 256 sequences
- Input length: randomly sampled between 100 and 1024 tokens
- Output length: randomly sampled between 100 and 1024 tokens
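A sketch of how such a workload can be generated (illustrative only; bench_nanovllm.py is the authoritative script):

```python
# Illustrative workload generator matching the configuration above;
# the real benchmark lives in bench_nanovllm.py.
import random

random.seed(0)  # fixed seed so the sampled lengths are reproducible

NUM_REQUESTS = 256
workload = [
    {
        "input_len": random.randint(100, 1024),   # prompt tokens
        "output_len": random.randint(100, 1024),  # generation budget
    }
    for _ in range(NUM_REQUESTS)
]

total_in = sum(r["input_len"] for r in workload)
total_out = sum(r["output_len"] for r in workload)
print(f"{NUM_REQUESTS} requests: {total_in} input / {total_out} output tokens")
```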
### Online Inference

See benchmark_qwen.py for details.

Test configuration:

- Hardware: 4x H200 GPUs, connected by NVLink
- Model: Qwen3-32B
- Dataset: Qwen trace, replaying the first 1000 requests

Launch commands:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer
```
## 📚 Learn More

- **Detailed Features**: Explore all available features and command-line arguments.
- **System Architecture**: Dive deep into the design and data flow of Mini-SGLang.