💾 KV Cache - moyutianzun

🔍RAG News

eetimes.com·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

⚡Inference Optimization Code

github.com··Hacker News

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🤖agentic system News

blocksandfiles.com·

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

⚡Inference Optimization Blog

tilert.ai··Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

⚡Inference Optimization

local-llm.utop.workers.dev··Hacker News

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

⚡Inference Optimization Academic

arxiv.org·

Where to Host Your Open-Source Model (Under 10B Parameters)

⚡Inference Optimization

digitalocean.com·

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

⚡CUDA Blog

dnhkng.github.io·

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

⚡Inference Optimization Blog

medium.com··Hacker News

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

⚡Inference Optimization Code

github.com··Hacker News

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

⚡Inference Optimization

vettedconsumer.com··Hacker News

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🔍RAG Blog

cloud.google.com··Hacker News

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

⚡Inference Optimization News Blog

kaitchup.substack.com··r/LocalLLaMA

Google's new open model DiffusionGemma generates text from noise instead of word by word

🔄Transformers

the-decoder.com

Integrate OpenShift AI and PG Airman MCP Server

⚡Inference Optimization

developers.redhat.com·

Machinic Psychopharmacology: Do LLMs Self-Medicate?

The Sequence Knowledge #874: Transformers or Not?

Latest technical articles & videos.

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Massive AI Storage Demand Creates a New Memory Wall
