⚡ Speculative Decoding - ibrahimsharaf · Scour

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

📊Retrieval Evaluation Blog

databricks.com

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🔓Open Source AI Code

github.com · r/LocalLLaMA

OpenAI S-1 🇺🇸, Siri AI 📱, Xiaomi Ultraspeed ⚡

⚡Quantization

not much happened today | AINews

🔓Open Source AI

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

⚙️Transformers Academic

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

⚡Continuous Batching Code

github.com · Hacker News

[AINews] not much happened today

🔓Open Source AI News


[PoC] server: support requantizing kv cache by wadealexc · Pull Request #24134 · ggml-org/llama.cpp

🔓Open Source AI Code

github.com · r/LocalLLaMA
