🧠 Inference Engineering - nayyara.airlangga · Scour

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

💾KV Cache Code

github.com··Hacker News

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

zozo123.github.io··Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

💾KV Cache Academic

The Inference Alpha: Maximizing Frontier Models on AMD

💰Inference Cost Blog

digitalocean.com·

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

💾KV Cache News

newsletter.semianalysis.com··Hacker News

Infrastructure Options for Scalable AI Inference

☁️Cloud Infrastructure Blog

Running LLM Inference on Kubernetes: What It Actually Takes

💰Inference Cost Blog

fairwinds.com·

Big Blue’s Redbook on Storage Scale KV Cache management

⏱️Prefill Decoding News

blocksandfiles.com·

DiffusionGemma: The Developer Guide - Google Developers Blog

💾KV Cache Blog

developers.googleblog.com··r/LocalLLaMA

I've tested so many desktop AI tools, but Hermes with Ollama is my new favorite - here's why

🚀Model Serving News Tutorial

Token4Token — pay-per-token inference on Gnosis + Swarm

t4t.eth.link··Hacker News

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

💰Inference Cost

Massive AI Storage Demand Creates a New Memory Wall

🧠HBM Bandwidth News

NexusOS v2.0 – A zero-dependency pipeline streaming server chaos to Parquet

🧠HBM Bandwidth

huggingface.co··Hacker News

Improved performance and model support with GGUF

🗜️Quantization Blog

Anatomy of a high-performance EP kernel

💰Inference Cost Blog

fergusfinn.com··Hacker News

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

⏱️Prefill Decoding Blog


NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🎮GPU Computing Blog

blogs.nvidia.com·

Report: GKE Inference Gateway delivers up to 92% faster AI responses

💾KV Cache Blog

cloud.google.com··Hacker News

LLM Inference Engineering Room — Part 3: The Orchestration Layer

💾KV Cache Blog

vimal-dwarampudi.medium.com·
