🗄️ KV Cache (specific feed)
Keywords: key-value cache, attention cache, LLM inference, paged attention

Scoured 184,158 posts in 17.5 ms

Google splits AI chips into training and inference TPUs, signaling shift toward workload-specialized AI infrastructure
🧠 Reasoning Models · digitimes.com · 6d

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
🧮 Cache-Oblivious Algorithms · arxiv.org · 1d

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🧠 Reasoning Models · arxiv.org · 1d

dog-qiuqiu/invincat: A native Python agent CLI built on DeepAgents CLI, featuring an independent memory Agent that captures learnings after each task and delivers efficient AI coding assistance through hierarchical memory management.
🤖 AI Agents · github.com · 4d · Hacker News

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
🧮 Cache-Oblivious Algorithms · arxiv.org · 2d

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
🧠 Reasoning Models · arxiv.org · 1d

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference
🧠 Reasoning Models · arxiv.org · 1d

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🌊 Streaming Algorithms · arxiv.org · 1d

PMZFX/intel-arc-pro-b70-benchmarks: Benchmark results and performance data for the Intel Arc Pro B70 GPU (Xe2/Battlemage) - LLM inference, video generation, dual-GPU scaling.
🛢️ Database Internals · github.com · 6d · Hacker News

PathRWKV: Enhancing Whole Slide Image Inference with Asymmetric Recurrent Modeling
🌊 Streaming Algorithms · arxiv.org · 2d

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation
🧠 LLMs · arxiv.org · 2d

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🧠 Reasoning Models · arxiv.org · 6d

NVLLM: A 3D NAND-Centric Architecture Enabling Edge On-Device LLM Inference
🧠 Reasoning Models · arxiv.org · 1d

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
🧠 LLMs · arxiv.org · 2d

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
🔧 SMT Solvers · arxiv.org · 6d

Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
🧠 Reasoning Models · arxiv.org · 2d

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🧠 Reasoning Models · arxiv.org · 6d

DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge
🧠 Reasoning Models · arxiv.org · 6d

SwarmDrive: Semantic V2V Coordination for Latency-Constrained Cooperative Autonomous Driving
🤝 Consensus Algorithms · arxiv.org · 2d

Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
🧠 LLMs · arxiv.org · 6d