⚡ LLM Serving - rdksupe · Scour

📈LLM Scaling arXiv·

HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval

Less-relevant results

🗄️Vector Databases moorcheh.ai·

Information-Theoretic Vector Search Is Having Its Moment

Covered by GitHub

Discussed on Hacker News

🤖AI Agents medium.com·

vLLM, Function Calling, and World Models explained

🧠Transformer Architecture GitHub·

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Discussed on r/LocalLLaMA

🔬Deep Learning mstar.stanford.edu·

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Discussed on Hacker News

📈LLM Scaling venturebeat.com·

AI hit the memory wall — now it needs a new context tier

🖥️GPU Computing digitalocean.com·

The HBM Tax: Why Vision Encoders and Language Decoders Fight Over Your GPU

🖥️GPU Computing Rack to Cloud·

GPU Scarcity Isn't the Problem Anymore. GPU Allocation Governance Is.

Discussed on DEV

🧠Transformer Architecture whyopensource.ai·

A running list of reasons to move to open source

Covers 3 stories including Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Discussed on Hacker News

🔬Deep Learning towardsdeeplearning.com·

Green AI: Speculative Decoding as an Environmental Necessity

📈LLM Scaling portal.neuralwatt.com·

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Discussed on Hacker News

🏗️Systems Design Hugging Face·

225B-A23B

Covered by mail.bycloud.ai, news.smol.ai

Discussed on r/LocalLLaMA

📈LLM Scaling arXiv·

EnerInfer: Energy-Aware On-Device LLM Inference

📈LLM Scaling medium.com·

One Number Lies: How to Actually Measure LLM Inference

🧠Transformer Architecture medium.com·

The Transformer Pipeline: A Complete Mathematical and Visual Guide

🧠Transformer Architecture Red Hat Developer·

Connect EvalHub to protected production model servers

📈LLM Scaling machine-learning-made-simple.medium.com·

The Real Cost of Running AI: From FLOPs to GPUs to the KV Cache

🔬Deep Learning abhishek.it·

Running GLM-5.2 5x faster at 500tps with limitation

Discussed on Hacker News

📊Machine Learning GitHub·

Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon

Discussed on Hacker News

🤖AI Agents medium.com·

The Context Budget That Will Decide Everyday AI
