🗄️ KV Cache - fungtion · Scour

The Sequence AI of the Week #875: Why Your Language Model Needs a Nap

💰Compute Costs News Blog

thesequence.substack.com

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

🖥️Inference Engineering

uccl-project.github.io · Hacker News

Latest technical articles & videos.

🎯Fine-tuning

certdepot.net

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🖥️Inference Engineering News Blog

blog.google · Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🖥️Inference Engineering Blog

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo

🖥️Inference Engineering News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🖥️Inference Engineering

local-llm.utop.workers.dev · Hacker News

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🖥️Inference Engineering Academic

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

🖥️Inference Engineering Code

github.com · Hacker News

Machinic Psychopharmacology: Do LLMs Self-Medicate?

🖥️Inference Engineering

lesswrong.com · Hacker News

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🖥️Inference Engineering Blog

tilert.ai · Hacker News

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

🖥️Inference Engineering Blog

Hacker News

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🖥️Inference Engineering News

blocksandfiles.com

Where to Host Your Open-Source Model (Under 10B Parameters)

🖥️Inference Engineering

digitalocean.com

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🖥️Inference Engineering Blog

dnhkng.github.io

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

🖥️Inference Engineering Code

github.com · r/LocalLLaMA

Google's new open model DiffusionGemma generates text from noise instead of word by word

🖥️Inference Engineering

the-decoder.com

Issue #390 - The ML Engineer 🤖

🖥️Inference Engineering News Blog

machinelearning.substack.com · Substack

Massive AI Storage Demand Creates a New Memory Wall

🖥️Inference Engineering News

Anatomy of a high-performance EP kernel

🖥️Inference Engineering Blog

fergusfinn.com · Hacker News
