⚡ LLM Serving - rdksupe · Scour

🔬Deep Learning GitHub·

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

🔬Deep Learning ubuntu.com·

Developing web apps with local LLM inference

🖥️GPU Computing Red Hat Developer·

Designing distributed AI inference: Core concepts and scaling dimensions

🖥️GPU Computing medium.com·

Debugging Deployments with Gemma 12B, TPU v6e-1, MCP, and Antigravity CLI

🤖AI Agents medium.com·

vLLM, Function Calling, and World Models explained

🏗️Systems Design Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

📈LLM Scaling lmsys.org·

DFlash and Spec V2 Decoding (14 minute read)

Covers 6 stories including Looking for a self-hosted alternative to Modal.com for running ML workloads

Discussed on Hacker News

📊Machine Learning Gradient Ascent·

Groq on Endless Compute, Inside Claude's Mind, and GLM-5.2 Open Weights - The Tokenizer Edition #32

Covers 3 stories including alibaba/open-code-review: Battle-tested at Alibaba's scale. Hybrid architecture code review tool: deterministic pipelines + LLM Agent, precise line-level comments, built-in fine-tuned ruleset (NPE, thread-safety, XSS, SQL injection), OpenAI & Anthropic compatible.

Less-relevant results

🖥️GPU Computing graphsignal.com·

CUDA Profiler for Production Inference

Discussed on Hacker News

⚙️MLOps thecybersidekick.beehiiv.com·

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

Discussed on DEV

🤖AI Agents medium.com·

The Context Budget That Will Decide Everyday AI

🗄️Vector Databases moorcheh.ai·

Information-Theoretic Vector Search Is Having Its Moment

Covered by GitHub

Discussed on Hacker News

🖥️GPU Computing arxiv.org·

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

🏗️Systems Design Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

🧠Transformer Architecture whyopensource.ai·

A running list of reasons to move to open source

Covers 3 stories including Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Discussed on Hacker News

🔬Deep Learning youtube.com (Video)·

Token Injection: Crashing LLM Inference With Special Tokens

📈LLM Scaling portal.neuralwatt.com·

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Discussed on Hacker News

🧠Transformer Architecture fitservers.com·

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

✍️Prompt Engineering pi.dev·

Pi 0.79.9

🏗️Systems Design Blocks and Files·

Dell and data physics
