⚡ Speculative Decoding - ibrahimsharaf · Scour

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

⚡Continuous Batching Code

github.com··Hacker News

The economics of speculative decoding

🚀LLM Deployment Blog

fergusfinn.com··Hacker News

A system programmer’s guide to LLM inference

⚡Continuous Batching Blog

blog.xiangpeng.systems··Hacker News

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

🚀LLM Deployment Academic

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

⚡Quantization Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA

Making LLMs faster and more efficient across multiple languages

🚀LLM Deployment

techxplore.com·

Nutrient control enables metabolic reconstruction of L. rhamnosus GG and analysis of secretions

⚡Quantization Academic

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

sleepingrobots.com·

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

⚡Quantization News

Making Local LLM Go Brrr

seanpedersen.github.io·

LLM Research Papers: The 2026 List (January to May)

🗣️NLP News

magazine.sebastianraschka.com··Hacker News

Speculators v0.5.0: DFlash support and online training

🚀LLM Deployment

developers.redhat.com·

Less-relevant results

Here's a llama.cpp CLI Command builder.

🔓Open Source AI

llamabuilding.com··r/LocalLLaMA

Part 3 — Implementation/Engine-Level: Choosing the Runtime That Gives You These for Free

⚡Continuous Batching

towardsai.net·

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

🔓Open Source AI Code

github.com··Hacker News

Nvidia Nemotron 3 Ultra

research.nvidia.com··Hacker News

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

⚡Quantization

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

⚡Quantization News Blog

kaitchup.substack.com··r/LocalLLaMA

What Arm-based innovations happened in May 2026?

💻Local AI Blog

newsroom.arm.com·

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

📊Retrieval Evaluation Blog

databricks.com·
