Fast AI Inference

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

 🧠LLM Inference  Content type: Code
github.com··Hacker News

Free vLLM Course: Inference, Compression, Benchmarks

 🧠Inference Serving

Breaking the Ice: Analyzing Cold Start Latency in vLLM

 🏗️LLM Infrastructure  Content type: Academic
arxiv.org·

LLM Inference Handbook 2026

 🤖AI
pub.towardsai.net·

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 🏗️LLM Infrastructure  Content type: Blog
tilert.ai··Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 🤖AI

Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai

 🧠Inference Serving  Content type: Blog
vllm.ai··Hacker News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

 🏗️LLM Infrastructure
huggingface.co··r/LocalLLaMA

Why I care so much about energy per token

 🤖AI  Content type: Blog
ziraph.com··Hacker News

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models
The model delivers 300 tokens per second on benchmar...

 🗄️Web Datasets
digg.com·

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

 🧩MoE  Content type: Blog

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 💾Prompt Caching  Content type: Code
github.com··r/LocalLLaMA

Making Local LLM Go Brrr

 🤖AI

Gemma 4 12B: A unified, encoder-free multimodal model

 🤖AI  Content type: Discussion

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

 🤖AI  Content type: Academic
arxiv.org·

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🤖AI

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🤖AI

Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway

 🌍Distributed Systems  Content type: Blog
cloud.google.com·

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

 🏗️LLM Infrastructure  Content type: Code
github.com··Hacker News

How we fight GPU scarcity without compromise

 🏗️LLM Infrastructure  Content type: Blog
equixly.com··Hacker News
