Speculative Decoding
Less-relevant results
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
💬LLMs Content type: Academic
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
⚡Quantization Content type: News Content type: Blog
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss.
💬LLMs Content type: Code
Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair
🤖AI Content type: Academic