⚡ FlashAttention - nayyara.airlangga

💾KV Cache Code

github.com··r/LocalLLaMA

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🧠Inference Engineering Blog

tilert.ai··Hacker News

Blurry Window Attention

🧠Inference Engineering Academic

arxiv.org·

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

💰Inference Cost

edn.com·

Gated DeltaNet, From First Principles

🧠Inference Engineering Blog

sankalp.bearblog.dev·

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

🧠Inference Engineering Blog

medium.com··Hacker News

Anatomy of a high-performance EP kernel

💰Inference Cost Blog

fergusfinn.com··Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

💰Inference Cost Blog

jimmysong.io·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

💾KV Cache Code

github.com··Hacker News

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

🧠Inference Engineering Academic

arxiv.org·

Benchmarking dots.tts on Strix Halo

🎮GPU Computing

sleepingrobots.com·

LLM Research Papers: The 2026 List (January to May)

💰Inference Cost News

magazine.sebastianraschka.com··Hacker News

Machinic Psychopharmacology: Do LLMs Self-Medicate?

💾KV Cache

lesswrong.com··Hacker News

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

🧠Inference Engineering Academic

arxiv.org·

OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision

⚙️ML Compilers

opencv.org··Hacker News

mingusb/transformer-golf: The Fully Unrolled Transformer: An experimental repository for architecture simplification and compilation. [2026]

⚗️Kernel Fusion Code

github.com··Hacker News

Towards Tight Bounds for Streaming Attention

🧠Inference Engineering Academic

arxiv.org·

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🧠Inference Engineering Blog

cloud.google.com··Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

Youssof Altoukhi (@Youssofal_)

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
