🧠 LLM Inference - linbolin1230 · Scour

llama.cpp vs. vLLM: Choosing the right local LLM inference engine

developers.redhat.com··Covers 7 stories

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

⚡KV Cache Blog

anyscale.com··Hacker News

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

⚡KV Cache Blog

thecybersidekick.beehiiv.com··DEV

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

⚡KV Cache Academic

DFlash and Spec V2 Decoding (14 minute read)

⚡KV Cache Blog

lmsys.org··Covers: Looking for a self-hosted alternative to Modal.com for running ML workloads, MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS +2 more

PagedAttention is more than virtual memory

thecomputersciencebook.com··Hacker News·Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention

Parallelize speculative decoding with P-EAGLE on Amazon SageMaker AI

⚡KV Cache Blog

aws.amazon.com

ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).

⚡KV Cache Code

github.com··Hacker News·Covers: uv

Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

💻Software Engineering Blog

rocm.blogs.amd.com··Hacker News

Most people use Ollama or llama.cpp for local LLMs, but these are the tools I switch to when it gets serious

xda-developers.com··Covers: vllm-project/vllm, sgl-project/sglang +2 more

Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…

⚡KV Cache Blog

RAG Observability with Langfuse, vLLM, and FAISS

pyimagesearch.com

vLLM Internalised: The Mechanics of Modern LLM Inference

⚡KV Cache Blog

Less-relevant results

GLM-5.2: Built for Long-Horizon Tasks

⚡KV Cache Blog

huggingface.co··Hacker News, r/LocalLLaMA·Cited by 1 article·Covers: New model GLM-Experimental is quite good (not local so far), GLM Coding Plan for Claude Code

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

vettedconsumer.com··Hacker News·Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention, DeepSeek-V2: A Strong, Economical, and Efficient MOE Language Model

Speculative Decoding | LM Studio

Green AI: Speculative Decoding as an Environmental Necessity

towardsdeeplearning.com

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

🔧MLOps Academic

Run a local coding model with pi and LM Studio

zarar.dev··Covers: Pi.dev: There are many coding agents, but this one is mine, Opencode – open-source alternative to Claude Code +3 more

A brief history of KV cache compression developments

⚡KV Cache Blog

martinalderson.com··Covers: TurboQuant: Redefining AI efficiency with extreme compression
