# Mini-SGLang

A lightweight yet high-performance inference framework for Large Language Models.

Mini-SGLang is a compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems. With a codebase of only ~5,000 lines of Python, it serves as both a capable inference engine and a transparent reference for researchers and developers.

## ✨ Key Features

- **High Performance**: Achieves state-of-the-art throughput and latency through advanced optimizations.
- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- **Advanced Optimizations**:
  - **Radix Cache**: Reuses the KV cache for shared prefixes across requests (see the sketch after this list).
  - **Chunked Prefill**: Reduces peak memory usage for long-context serving.
  - **Overlap Scheduling**: Hides CPU scheduling overhead behind GPU computation.
  - **Tensor Parallelism**: Scales inference across multiple GPUs.
  - **Optimized Kernels**: Integrates FlashAttention and FlashInfer for maximum efficiency.
  - ...
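To make the radix-cache idea concrete, here is a toy, token-level sketch of prefix reuse. It is illustrative only, not Mini-SGLang's actual implementation: `ToyPrefixCache` is a hypothetical name, a real radix tree compresses token runs into edges, and the production cache tracks GPU KV-cache blocks and evicts them with an LRU policy.

```python
# Illustrative only: a toy token-level prefix cache, not Mini-SGLang's
# RadixCache. A real radix tree compresses token runs into edges, stores
# references to GPU KV-cache blocks at each node, and evicts via LRU.
class ToyPrefixCache:
    def __init__(self) -> None:
        self.children: dict[int, "ToyPrefixCache"] = {}

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self, 0
        for t in tokens:
            if t not in node.children:
                break
            node, matched = node.children[t], matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV entries now exist for every prefix of `tokens`."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, ToyPrefixCache())


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])               # first request fills the KV cache
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: only one token left to prefill
```

The payoff is that requests sharing a long prefix (e.g. the same system prompt) pay its prefill cost only once.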
## 🚀 Quick Start

### 1. Environment Setup

We recommend using uv for a fast, reliable installation (uv does not conflict with conda).

```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```

**Prerequisites**: Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the NVIDIA CUDA Toolkit is installed and that its version is compatible with your driver; you can check the highest CUDA version your driver supports with nvidia-smi.
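As an additional sanity check (a generic snippet, assuming PyTorch is already installed in your environment), you can confirm which CUDA version PyTorch was built against and whether a GPU is visible:

```python
# Generic CUDA sanity check, assuming PyTorch is installed.
import torch

print("torch CUDA build:", torch.version.cuda)  # e.g. "12.4"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```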
### 2. Installation

Install Mini-SGLang directly from source:

```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
```
### 3. Online Serving

Launch an OpenAI-compatible API server with a single command.

```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with tensor parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```

Once the server is running, you can send requests using standard tools such as curl or any OpenAI-compatible client.
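For example, a minimal Python client (a sketch assuming the second server above, listening on port 30000; /v1/chat/completions is the standard OpenAI-compatible route, so adjust the host, port, and model name to match your deployment):

```python
# Minimal OpenAI-compatible request, assuming the server was launched
# with --port 30000 as in the example above.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```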
### 4. Interactive Shell

Chat with your model directly in the terminal by adding the --shell flag:

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

Inside the shell, use /reset to clear the chat history.
## Benchmark

### Offline Inference

See bench_nanovllm.py for details. Set MINISGL_DISABLE_OVERLAP_SCHEDULING=1 to run an ablation of overlap scheduling.

Test configuration:

- Hardware: 1x H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total requests: 256 sequences
- Input length: randomly sampled between 100 and 1024 tokens
- Output length: randomly sampled between 100 and 1024 tokens
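A sketch of how such a workload can be generated (illustrative only; bench_nanovllm.py is the authoritative script):

```python
# Illustrative workload generator matching the configuration above;
# the real benchmark lives in bench_nanovllm.py.
import random

random.seed(0)  # fixed seed so the sampled lengths are reproducible

NUM_REQUESTS = 256
workload = [
    {
        "input_len": random.randint(100, 1024),   # prompt tokens
        "output_len": random.randint(100, 1024),  # generation budget
    }
    for _ in range(NUM_REQUESTS)
]

total_in = sum(r["input_len"] for r in workload)
total_out = sum(r["output_len"] for r in workload)
print(f"{NUM_REQUESTS} requests: {total_in} input / {total_out} output tokens")
```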
### Online Inference

See benchmark_qwen.py for details.

Test configuration:

- Hardware: 4x H200 GPUs, connected by NVLink
- Model: Qwen3-32B
- Dataset: Qwen trace, replaying the first 1000 requests

Launch commands:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer
```
## 📚 Learn More

- **Detailed Features**: Explore all available features and command-line arguments.
- **System Architecture**: Dive deep into the design and data flow of Mini-SGLang.