🧠 Inference Engineering - nayyara.airlangga · Scour

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

💾KV Cache Academic

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

⏱️Prefill Decoding Code

Fixing a stuck Ollama runner and building a GPU watchdog

🧵Warp Scheduling

patrickmccanna.net · Hacker News

How we fight GPU scarcity without compromise

💾KV Cache Blog

equixly.com · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

💾KV Cache Blog

dnhkng.github.io

DiffusionGemma: 4x Faster Text Generation

🎮GPU Computing News Blog

blog.google · Hacker News, r/LocalLLaMA, r/singularity

Using Scikit-LLM with Open-Source LLMs

⚙️ML Compilers

machinelearningmastery.com

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

🗜️Quantization

alternativeto.net

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🎮GPU Computing

A system programmer’s guide to LLM inference

💰Inference Cost Blog

blog.xiangpeng.systems · Hacker News

Speculators v0.5.0: DFlash support and online training

🚀Speculative Decoding

developers.redhat.com

High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk

🧠HBM Bandwidth

ncnonline.net

Tales of an Ollama Honeypot (Part 3): More Traffic, More Findings

🔭Observability

posts.inthecyber.com

NVIDIA Nemotron 3 Ultra

⚗️Kernel Fusion Blog

Neo-X7/Neo-AI: A fully offline AI assistant powered by Ollama. Stores and retrieves conversations using SQLite + LanceDB vector search. No cloud. No API keys. Runs entirely on your machine.

⚙️MLOps Code

github.com · DEV

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

💰Inference Cost Blog

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

💰Inference Cost Academic

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

💰Inference Cost

local-llm.utop.workers.dev · Hacker News

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

💰Inference Cost Blog

tilert.ai · Hacker News

MLPerf and the rise of latency-aware LLM benchmarking

⏱️Prefill Decoding
