⏱️ Prefill Decoding - nayyara.airlangga

💾KV Cache Academic

arxiv.org · Hacker News

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

💾KV Cache Code

github.com · r/LocalLLaMA

Token4Token — pay-per-token inference on Gnosis + Swarm

🧠Inference Engineering

t4t.eth.link · Hacker News

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

🧠Inference Engineering Blog

lucebox.com · Hacker News

"North Mini Code"; open weights, 30B param, Canadian coding model

🎮GPU Computing Blog

cohere.com · Hacker News

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

🧠Inference Engineering Video

youtube.com

Machinic Psychopharmacology: Do LLMs Self-Medicate?

💾KV Cache

lesswrong.com · Hacker News

Youssof Altoukhi (@Youssofal_)

🧠Inference Engineering

xcancel.com · r/LocalLLaMA

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

🧠Inference Engineering Academic

arxiv.org

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

💾KV Cache Code

github.com · Hacker News

Architecting the Control Plane for Intelligence: System Design of an Enterprise AI Gateway

☁️Cloud Infrastructure Blog

medium.com

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🎮GPU Computing Blog

dnhkng.github.io

Build a local voice agent with Red Hat OpenShift AI

🎮GPU Computing

developers.redhat.com

The Memory Problem is Solved: How Google’s Memory Caching Makes RNNs Smart Again

⚡FlashAttention Blog

medium.com

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

🧠Inference Engineering Academic

arxiv.org

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🧠Inference Engineering Code

github.com · r/LocalLLaMA

Benchmarking dots.tts on Strix Halo

🎮GPU Computing

sleepingrobots.com

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

LLM Observability: What To Instrument and How To Act on It

Apple rebuilt its on-device AI stack at WWDC 2026

Breaking the Ice: Analyzing Cold Start Latency in vLLM
