# SEDAC: Safe Early-Exit & Speculative Decoding Toolkit
Introduction • Key Features • Installation • Quickstart • Performance • FAQ
Contact: jasonuzi12@gmail.com

## Introduction
SEDAC (Speculative Early-Exit Decoding with Adaptive Calibration) is a research toolkit designed to accelerate Large Language Model (LLM) inference without sacrificing generation quality.
It addresses a critical flaw in traditional Early-Exit mechanisms on Decoder-only architectures (like Llama, Qwen): KV Cache Corruption. By introducing a novel MLP-Skipping architecture (SEDAC v5), this toolkit achieves real speedups while maintaining bit-level accuracy (PPL ratio ~1.00) compared to the baseline.
## Key Features

### 🚀 SEDAC v5: Safe MLP-Skipping
Traditional Early-Exit methods skip entire layers, which leaves the KV Cache uninitialized for skipped layers. This causes catastrophic quality degradation (PPL > 10^5) for subsequent tokens.
SEDAC v5 solves this by:
- Always computing Attention: Ensuring KV Cache is perfectly maintained for every layer, every token.
- Skipping MLPs: If the model is confident (Entropy < Threshold), we skip the Feed-Forward Network (MLP) block, which accounts for ~65% of the parameters.
- Result: Zero PPL degradation with a measurable speedup (see the sketch below).
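
To make the control flow concrete, here is a minimal, self-contained sketch of the idea in plain PyTorch. It is illustrative only, not the actual SEDAC implementation: `SafeSkipDecoderLayer` and its `probe` are stand-in names, and real SEDAC gates vLLM's attention/MLP blocks rather than `nn.MultiheadAttention`.

```python
import torch
import torch.nn as nn

class SafeSkipDecoderLayer(nn.Module):
    """Toy decoder layer: attention always runs (so a KV cache would stay
    valid for every layer and token), while the MLP is gated on a probe's
    entropy estimate."""

    def __init__(self, dim: int, n_heads: int, threshold: float = 0.45):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.probe = nn.Linear(dim, 1)  # stand-in for SEDAC's trained probe
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention is ALWAYS computed: the KV cache is written at every layer.
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        # Probe estimates output entropy; if confident, skip the FFN entirely.
        if self.probe(h).mean().item() < self.threshold:
            return h            # cheap path: MLP skipped, residual passes through
        return h + self.mlp(h)  # full path
```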
### 🛡️ Adaptive Safety
- Max-Entropy Decision: In batched inference, SEDAC exits only if all tokens in the batch are confident (sketched below). This prevents "easy" tokens from forcing "hard" tokens to skip computation.
- Dynamic Thresholding: (Optional) Calibrates the exit threshold based on the prompt difficulty at runtime.
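
A minimal sketch of the max-entropy rule (illustrative helper, not the toolkit's API): the batch skips only when the worst-case entropy is still below the threshold.

```python
import torch

def batch_should_skip(logits: torch.Tensor, threshold: float = 0.45) -> bool:
    """Max-entropy rule: skip computation only if EVERY token in the batch
    is confident, i.e. the maximum per-token entropy is below threshold."""
    probs = torch.softmax(logits, dim=-1)                     # [batch, vocab]
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # [batch]
    return bool(entropy.max().item() < threshold)
```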
### 📊 Comprehensive Benchmarking
- End-to-End Suite: Tests TPS (Tokens/sec), Speedup, PPL (Perplexity), and Acceptance Rate.
- Multi-Mode: Supports Online (OpenAI-compatible HTTP server) and Offline (vLLM Engine) benchmarking.
- Metrics: Automatically exports Acceptance Rate (AR) and Token Recovery Rate (TRR) for speculative-decoding analysis (illustrative formulas below).
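
For orientation, a sketch of how these two metrics are commonly computed. The definitions below are assumptions, not necessarily the suite's exact formulas; check the exported reports for the authoritative values.

```python
def acceptance_rate(accepted_drafts: int, proposed_drafts: int) -> float:
    """AR: fraction of drafted tokens the target model accepted (assumed definition)."""
    return accepted_drafts / max(proposed_drafts, 1)

def token_recovery_rate(tokens_emitted: int, target_forward_passes: int) -> float:
    """TRR: tokens emitted per target-model forward pass (assumed definition);
    1.0 matches vanilla decoding, higher means speculation is paying off."""
    return tokens_emitted / max(target_forward_passes, 1)
```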
## Installation
Clone and Install Dependencies:
```bash
git clone https://github.com/your-org/SEDAC.git
cd SEDAC
pip install -r requirements.txt
```
(Optional) Suffix Decoding Backend: If you plan to use Suffix Decoding:
```bash
pip install arctic-inference==0.1.1
```
## Quickstart

### 1. Patch vLLM
Apply the SEDAC hooks to your vLLM installation. This script patches the Qwen2 model definition to support MLP skipping and probe injection.
```bash
python3 patch_vllm_surgical.py
```
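
For intuition, here is a generic sketch of the "surgical patch" pattern: wrapping an existing MLP's `forward` so it can be bypassed at runtime without editing library source. `layer.mlp` and `should_skip` are illustrative names, not the script's actual API, which rewrites vLLM's Qwen2 layer definition directly.

```python
import torch

def patch_mlp_skipping(layer, should_skip):
    """Wrap one decoder layer's MLP so it can be bypassed at runtime."""
    original_mlp_forward = layer.mlp.forward

    def gated_forward(hidden_states):
        if should_skip(hidden_states):
            # Returning zeros makes the residual add a no-op: x + 0 == x.
            return torch.zeros_like(hidden_states)
        return original_mlp_forward(hidden_states)

    layer.mlp.forward = gated_forward
```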
### 2. Start the Server
Launch an OpenAI-compatible server with SEDAC enabled.
- `--sedac-layer 24`: Start checking for exits at Layer 24 (of 36).
- `--sedac-threshold 0.45`: Conservative threshold for safe acceleration.
```bash
python3 sedac_start_server.py \
  --model Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4 \
  --sedac-layer 24 \
  --sedac-threshold 0.45 \
  --port 8000
```
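
Once the server is up, any OpenAI-compatible client can talk to it. A minimal smoke test with `requests` (the route and payload assume vLLM's standard `/v1/chat/completions` endpoint):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```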
### 3. Run Benchmark
Run the test suite to evaluate speed and quality.
```bash
# Run a quick speed test
python3 sedac_test_suite.py --config configs/test_matrix_speed.json --verbose
```
## Performance

### Qwen2.5-3B-Instruct (Int4)
Tested on a single GPU. Baseline TPS: ~36 tokens/s.
| Configuration | Speedup | PPL Ratio | Quality |
|---|---|---|---|
| Vanilla Baseline | 1.00x | 1.00 | Reference |
| SEDAC v5 (Safe MLP-Skip) | ~1.05x - 1.15x | 1.00 | Lossless |
| SEDAC (Aggressive Latch) | ~1.43x | >1000 | ❌ Broken (Model Collapse) |
Note: The speedup on 3B models is limited because they are often memory-bound or latency-bound. The overhead of the Python control plane competes with the small compute savings.
### Does Size Matter?
Yes. SEDAC’s speedup potential increases with model size.
- Small Models (3B): Hard to accelerate. The compute time saved by skipping an MLP is small (e.g., 0.5ms), which is comparable to the overhead of the decision logic.
- Large Models (14B/32B/70B): High Potential. The compute time for a single MLP block is significant (e.g., 5-10ms). Skipping it yields a net positive even with overhead.
- Recommendation: For production acceleration, target models >7B parameters (see the back-of-envelope sketch below).
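
A back-of-envelope calculation using the illustrative timings from the list above; the decision-logic overhead value is an assumption, not a measurement.

```python
# Net saving per skipped MLP block = MLP compute time - gate overhead.
decision_overhead_ms = 0.4  # assumed cost of the Python-side gate logic

for name, mlp_ms in [("3B", 0.5), ("32B", 7.0)]:
    net = mlp_ms - decision_overhead_ms
    print(f"{name}: save {mlp_ms:.1f} ms, pay {decision_overhead_ms:.1f} ms "
          f"-> net {net:+.1f} ms per skip")
```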
## FAQ
Q: Why did previous Early-Exit methods fail on vLLM? A: They skipped Attention layers. In vLLM’s PagedAttention, if you don’t write to the KV Cache for a token at Layer N, the next token will read garbage memory when it tries to attend to Layer N. SEDAC v5 fixes this by never skipping Attention.
Q: Can I use this with Speculative Decoding (Draft Models)? A: Yes! SEDAC is orthogonal to Speculative Decoding. You can combine them (e.g., ngram_sedac_adaptive config) to get speedups from both draft matching AND MLP skipping in the verification phase.
Q: How do I train a Probe? A: See train_probe.py. You need to collect hidden states from the target layer and train a small linear probe that predicts the entropy of the final output (a minimal training sketch follows below).
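
A minimal sketch of the regression variant of probe training; names and shapes are illustrative, and the repo's train_probe.py remains the reference.

```python
import torch
import torch.nn as nn

def train_entropy_probe(hidden: torch.Tensor, target_entropy: torch.Tensor,
                        epochs: int = 100, lr: float = 1e-3) -> nn.Linear:
    """Fit a linear probe mapping layer-N hidden states [N, d] to the
    entropy of the model's final output distribution [N]."""
    probe = nn.Linear(hidden.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(hidden).squeeze(-1), target_entropy)
        loss.backward()
        opt.step()
    return probe
```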
## License
MIT