🔧 Systems-level optimizations for LLM serving - pleto · Scour

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🧠Large Language Models (LLMs) Code

github.com··Hacker News, r/LLM

Big Blue’s Redbook on Storage Scale KV Cache management

📊AI Performance Profiling News

blocksandfiles.com·

Intelligent inference scheduling with llm-d on Red Hat AI

🚀LLM serving frameworks

developers.redhat.com·

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

✨Model optimizations in LLMs Academic

DiffusionGemma: Discrete diffusion in a large language model

🧠Large Language Models (LLMs)

idlemachines.co.uk··Hacker News

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

🚀LLM serving frameworks Blog

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

🚀LLM serving frameworks

venturebeat.com·

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🚀LLM serving frameworks News

newsletter.semianalysis.com··Hacker News

How we fight GPU scarcity without compromise

🧠Large Language Models (LLMs) Blog

equixly.com··Hacker News

Anatomy of a high-performance EP kernel

📊AI Performance Profiling Blog

fergusfinn.com··Hacker News

Less-relevant results

Making FlashAttention-4 faster for inference

📊AI Performance Profiling Blog

modal.com··Hacker News

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

🧠Large Language Models (LLMs) Blog

bric.pe.kr··DEV

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🌐Distributed LLM Systems Blog

dnhkng.github.io·

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🧠Large Language Models (LLMs) Blog

cloud.google.com··Hacker News

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

✨Model optimizations in LLMs News Blog

kaitchup.substack.com··r/LocalLLaMA

Stop Treating Your Models Like Microservices

⚙️AI Infrastructure Automation

cloudnativenow.com·

The Inference Alpha: Maximizing Frontier Models on AMD

✨Model optimizations in LLMs Blog

digitalocean.com·

massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.

🚀LLM serving frameworks Code

github.com··Hacker News

A system programmer’s guide to LLM inference

✨Model optimizations in LLMs Blog

blog.xiangpeng.systems··Hacker News

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

✨Model optimizations in LLMs
