⚡ LLM Serving - rdksupe · Scour

🔬Deep Learning GitHub·

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

🕸️Multi-Agent Systems supercomputing-system-ai-lab.github.io·

VoltanaLLM: Energy-Efficient LLM Serving

Discussed on Hacker News

🖥️GPU Computing NVIDIA Technical Blog·

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Covers 3 stories including NVIDIA/TensorRT-LLM

🖥️GPU Computing Red Hat Developer·

Designing distributed AI inference: Core concepts and scaling dimensions

📈LLM Scaling arXiv·

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

🏗️Systems Design primeintellect.ai·

RL at 1T Scale: prime-rl Performance Deep Dive

Covers 6 stories including Kimi K2.7-Code: open-source coding model with better token efficiency

🏗️Systems Design Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

🖥️GPU Computing medium.com·

Debugging Deployments with Gemma 12B, TPU v6e-1, MCP, and Antigravity CLI

🔬Deep Learning ubuntu.com·

Developing web apps with local LLM inference

📈LLM Scaling IBM Research·

Running AI on mixed hardware for speed and affordability

Covers Introduction to llm-d Open-source Kubernetes-native Framework for Distributed LLM Inference | Ep 140 #cloudnativefm

⚙️MLOps thecybersidekick.beehiiv.com·

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

Discussed on DEV

⚙️MLOps Modal·

Modal Auto Endpoints: Optimized inference you own

Covers 2 stories including Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Discussed on Hacker News

🔬Deep Learning Fergus's blog·

Adaptive speculative decoding: picking draft lengths at runtime

Covers 4 stories including Looking for a self-hosted alternative to Modal.com for running ML workloads

Discussed on Hacker News

🏗️Systems Design Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

📊Machine Learning Gradient Ascent·

Groq on Endless Compute, Inside Claude's Mind, and GLM-5.2 Open Weights - The Tokenizer Edition #32

Covers 3 stories including alibaba/open-code-review: Battle-tested at Alibaba's scale. Hybrid architecture code review tool: deterministic pipelines + LLM Agent, precise line-level comments, built-in fine-tuned ruleset (NPE, thread-safety, XSS, SQL injection), OpenAI & Anthropic compatible.

🔬Deep Learning YouTube Video·

Token Injection: Crashing LLM Inference With Special Tokens

🖥️GPU Computing Baseten·

We built the fastest API for GLM-5.2 (280 TPS)

Covers GLM-5.2 (6 minute read)

Discussed on Hacker News

🧠Transformer Architecture fitservers.com·

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

📈LLM Scaling arXiv·

CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

📈LLM Scaling Network World·

Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK
