🧠 LLM Inference
Specific: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 187,389 posts in 50.3 ms
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d
The Inference Economy: Token Use
💰 Tokenomics · frontierai.substack.com · 9h · Substack
Adaptive Thinking: Large Language Models Know When to Think in Latent Space
🏗️ LLM Infrastructure · machinelearning.apple.com · 2d
AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🏗️ LLM Infrastructure · github.com · 15h · DEV
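For orientation on what a drop-in KV-cache compressor does, below is a minimal sketch of symmetric per-tensor int8 quantization, the general idea behind such tools. The NumPy functions and names here are illustrative assumptions, not the kvquant repository's actual scheme or API.

```python
# Generic sketch of KV-cache compression via symmetric int8 quantization.
# Illustrative only; not the kvquant repository's actual algorithm or API.
import numpy as np

def quantize_kv(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Compress an fp32 KV tensor to int8 plus one fp scale (4x smaller at rest)."""
    scale = float(np.abs(x).max()) / 127.0
    scale = scale if scale > 0 else 1.0          # guard against all-zero tensors
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp32 tensor at attention time."""
    return q.astype(np.float32) * scale

# Usage: compress a cached key tensor of shape (heads, seq_len, head_dim).
k_cache = np.random.randn(8, 1024, 128).astype(np.float32)
q, s = quantize_kv(k_cache)
print("max reconstruction error:", np.abs(dequantize_kv(q, s) - k_cache).max())
```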
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
🧠 Inference Serving · lmsys.org · 5d · Hacker News

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d
shreyansh26/Speculative-Decoding: Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
📊 Model Serving Economics · github.com · 4d · r/LLM, r/LocalLLaMA
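As background for the techniques collected in that repo, here is a simplified sketch of the vanilla draft-and-verify loop behind speculative decoding. It drafts greedily and replaces the full residual-distribution resampling step with a greedy fallback; `draft_model` and `target_model` are hypothetical callables mapping a token sequence to a next-token distribution, not code from the repository.

```python
# Simplified sketch of vanilla draft-and-verify speculative decoding.
# `draft_model` / `target_model` are hypothetical: sequence -> {token: prob}.
import random

def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens with the cheap model, then verify with the target model."""
    seq = list(prefix)
    drafted = []                                  # (token, draft_prob) pairs
    for _ in range(k):                            # 1. cheap autoregressive drafting
        dist = draft_model(seq)
        tok = max(dist, key=dist.get)             # greedy draft for simplicity
        drafted.append((tok, dist[tok]))
        seq.append(tok)

    accepted, seq = [], list(prefix)
    for tok, q_prob in drafted:                   # 2. verify (one batched pass in practice)
        p_prob = target_model(seq).get(tok, 0.0)
        if random.random() < min(1.0, p_prob / max(q_prob, 1e-12)):
            accepted.append(tok)                  # target agrees often enough: keep it
            seq.append(tok)
        else:
            # Full method resamples from the normalized residual max(0, p - q);
            # we fall back to the target's greedy choice to keep the sketch short.
            dist = target_model(seq)
            accepted.append(max(dist, key=dist.get))
            break
    return accepted                               # at least 1 token per target pass
```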
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 23h

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
📱 Edge AI Optimization · arxiv.org · 2d

Efficient, VRAM-Constrained xLM Inference on Clients
🏗️ LLM Infrastructure · arxiv.org · 23h

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
🏗️ LLM Infrastructure · arxiv.org · 1d · Hacker News

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 23h

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Select to Think: Unlocking SLM Potential with Local Sufficiency
🏗️ LLM Infrastructure · arxiv.org · 23h

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
🏗️ LLM Infrastructure · arxiv.org · 23h

Anchored Variational Inference for Personalized Sequential Latent-State Models
🏗️ LLM Infrastructure · arxiv.org · 2d
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
🕯️ Candle · arxiv.org · 6d
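To make "multiplication-free" concrete: with weights constrained to {-1, 0, +1}, a matrix-vector product reduces to adds and subtracts. The NumPy sketch below illustrates that general trick under those assumptions; it is not FairyFuse's fused-kernel implementation.

```python
# Generic sketch of a multiplication-free ternary matvec (weights in {-1, 0, +1}).
# Illustrates the idea behind ternary kernels; not FairyFuse's actual code.
import numpy as np

def ternary_matvec(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x without multiplies: add x where w=+1, subtract where w=-1."""
    plus = np.where(w > 0, x, 0.0).sum(axis=1)    # sum of x at +1 positions
    minus = np.where(w < 0, x, 0.0).sum(axis=1)   # sum of x at -1 positions
    return plus - minus

w = np.sign(np.random.randn(4, 8)).astype(np.int8)  # toy ternary weight matrix
x = np.random.randn(8).astype(np.float32)
assert np.allclose(ternary_matvec(w, x), w.astype(np.float32) @ x)
```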
QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🔬 RaBitQ · arxiv.org · 1d