🪟 Context Windows - SeanNg · Scour

Claude Fable 5 🚀, Gemini 3.5 Live Translate 📱, scaling test time compute 📈

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

⚡Inference Optimization Blog

tilert.ai · Hacker News

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

⚡Inference Optimization Code

github.com · Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

⚓Kubernetes Blog

A system programmer’s guide to LLM inference

🤖LLM Blog

blog.xiangpeng.systems · Hacker News

FOD#155: Continual Learning in LLMs: Why AI Models Need Sleep

turingpost.com

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

✨Gemini Academic

MLPerf and the rise of latency-aware LLM benchmarking

The Memory Problem is Solved: How Google’s Memory Caching Makes RNNs Smart Again

🤖Transformers Blog

See, Act, Correct: three levers for working with a code agent

🎮Reinforcement Learning Blog

blog.owulveryck.info · Hacker News

DeepSeek V4, LeCun's Bet Against LLMs, and Lovable's Self-Improving Agent - The Tokenizer Edition #30

newsletter.artofsaience.com

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

⚡Inference Optimization Academic

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

🤖AI Code

github.com · r/LocalLLaMA

OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine

👁️Computer Vision

The economics of speculative decoding

⚡Inference Optimization Blog

fergusfinn.com · Hacker News

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

⚡Inference Optimization Academic

Anthropic’s $965B Valuation: What $47B Revenue Says

🎭Anthropic Claude Blog Discussion

Where to Host Your Open-Source Model (Under 10B Parameters)

⚡Inference Optimization

digitalocean.com

End-to-End Context Compression at Scale

🤖Transformers Academic

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🤖AI Code

github.com · Hacker News
