⚡ Inference - buckman

☁️GCP Blog

cloud.google.com··Hacker News

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

🤖AI Inference Code

github.com··Hacker News

Speculators v0.5.0: DFlash support and online training

🔓Open Source AI

developers.redhat.com·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🔓Open Source AI News Blog

blog.google··Hacker News

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the Answer

🤖AI Inference Blog

dev.to··DEV

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

📊Benchmarking Blog

databricks.com·

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

🤖Large Language Models Blog

dev.to··DEV

The Death of the Four Golden Signals: Designing Telemetry for Non-Deterministic Infrastructure

🔧SRE

devops.com·

KVarN, Cost.dev, headroom — the week the agent runtime bill got itemized

🤖AI Inference Blog

dev.to··DEV

The 4-layer voice-agent latency stack, traced with OTel spans

🔭OpenTelemetry Blog

dev.to··DEV

Less-relevant results

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

🖥️Local AI Blog

dev.to··DEV

Why Self-Hosted Claude Code Was 15× Slower Than It Should Be

🧠LLMs Blog

dev.to··DEV

Why most LLM VRAM calculators are wrong on modern models (and an open-source MIT fix)

🔓Open Source AI Blog

dev.to··DEV

I Benchmarked 3 Local LLMs on My Laptop — Here's What the Numbers Actually Show

🧠LLM Blog

dev.to··DEV

NVIDIA and Apple Solved the Hardware. Here's What's Left to Build.

⚡Quantization Blog

dev.to··DEV

Why TPUs Aren't Popular (Even Though They're Cheaper Per Token)

🖥️GPU Blog

dev.to··DEV

NVIDIA Showed an Agent Building Architecture on a Laptop

🏢Architecture Blog

dev.to··DEV

I Connected PewDiePie's Odysseus to a Cloud Memory Stack — Zero API Costs, Persistent Memory

🧠LLM Tooling Blog

dev.to··DEV


KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

Report: GKE Inference Gateway delivers up to 92% faster AI responses
