🚀 LLM Serving - sravindra · Scour

How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

⚙️ML Infrastructure

buy.polar.sh · DEV

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🔀Model Parallelism Academic

How we fight GPU scarcity without compromise

🔀Model Parallelism Blog

equixly.com · Hacker News

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🤝AI-Assisted Coding Blog

towardsai.net

DiffusionGemma: 4x Faster Text Generation

💾AI Hardware News Blog

blog.google · Hacker News, r/LocalLLaMA, r/singularity

Making LLMs faster and more efficient across multiple languages

⚙️ML Infrastructure

techxplore.com

"AI" Is Eating Platform Monopolist Free Cash Flow, Not the World: CHART OF THE DAY

🤝AI-Assisted Coding News Blog

braddelong.substack.com · Substack

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

💾AI Hardware Code

github.com · Hacker News

A system programmer’s guide to LLM inference

🖥️GPU Computing Blog

blog.xiangpeng.systems · Hacker News

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

🖥️GPU Computing

sleepingrobots.com

WWDC 2026: Foundation Models (& Anarlog)

🛠️Developer Tooling

skushagra.com

Where to Host Your Open-Source Model (Under 10B Parameters)

🖥️GPU Computing

digitalocean.com

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

⚙️ML Infrastructure Blog

databricks.com

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

⚙️ML Infrastructure

huggingface.co · r/LocalLLaMA

local AI agents for Cursor with pre-tuned marketplace/commu

🤝AI-Assisted Coding

locaible.com · Hacker News

Creating ADK Agent using locally running Gemma 4

🖥️GPU Computing Blog

fix(memory-core): filter stale recall entries in REM harness preview · openclaw/openclaw@92418fc

☁️GCP Code

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

⚙️ML Infrastructure Blog

ziraph.com · Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

📋Kueue Academic

How to Measure Time To First Token (TTFT) in AI Systems

🤝AI-Assisted Coding

qainsights.com · Hacker News
