🖥️ Inference Engineering - fungtion · Scour

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🗄️KV Cache Code

github.com··Hacker News, r/LLM

Big Blue’s Redbook on Storage Scale KV Cache management

🗄️KV Cache News

blocksandfiles.com·

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🗄️KV Cache

phoronix.com··r/artificial

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🗄️KV Cache Academic

Google's new open model DiffusionGemma generates text from noise instead of word by word

🎯Fine-tuning

the-decoder.com·

How we fight GPU scarcity without compromise

🗄️KV Cache Blog

equixly.com··Hacker News

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

🗄️KV Cache Blog

Intelligent inference scheduling with llm-d on Red Hat AI

🗄️KV Cache

developers.redhat.com·

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

🗄️KV Cache

uccl-project.github.io··Hacker News

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

💰API Pricing

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🗄️KV Cache Blog

dnhkng.github.io·

Google unveils DiffusionGemma, delivering up to 4x faster inference on dedicated GPUs

💰Compute Costs

alternativeto.net·

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🗄️KV Cache Blog

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🎯Fine-tuning Blog

blogs.nvidia.com·

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🎯Fine-tuning

vettedconsumer.com··Hacker News

DiffusionGemma 26B A4B results on my 5090

🗄️KV Cache

huggingface.co··r/LocalLLaMA

Putting a datacenter GPU in a gaming PC for £200 ($268)

💰Compute Costs Blog

blog.adafruit.com·

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

🗄️KV Cache News

decrypt.co··Hacker News

Running LLM Inference on Kubernetes: What It Actually Takes

🗄️KV Cache Blog

fairwinds.com·

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

🗄️KV Cache

zozo123.github.io··Hacker News
