⚡ Flash Attention - miterion · Scour

KV Cache and Flash Attention with interactive diagrams 🔲Loop Tiling

kvcache.cobanov.dev·21h·Hacker News

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention 👁️Attention Optimization

Luce Megakernal: Why nobody is talking about this? 🔍Nsight

github.com·5d·r/LocalLLaMA

LLM Inference 🎓Model Distillation

iop.systems·13h

Cerebras: The $56.4 Billion IPO Challenging NVIDIA’s Memory Wall ⚡CUDA Programming Patterns

artificialintelligencemadesimple.com·2d

Show HN: FlashAttention-2 in Cute, from Scratch ✂️CUTLASS

blog.echen.io·3d·Hacker News

Understanding KV Cache: The Hidden Memory Cost of Serving LLMs 👁️Attention Optimization

melchi.me·2d·Hacker News

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW) ⚡ONNX Runtime

semiengineering.com·22h

Less-relevant results

Intel leans on LPDDR5X to dodge global HBM crisis, leaked Crescent Island AI GPU pics reveal massive Xe3P core — chip sidesteps HBM shortage with 160GB of cheaper memory 📈Occupancy Optimization

tomshardware.com·5h

Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP ⏱️Benchmarking

lucebox.com·3d·r/LocalLLaMA

The $100 Billion HBM Trade: Korea Makes It, Taiwan Packages It, Japan Enables It ⏱️Benchmarking

ebc.com·2d·r/Economics

Micron's Management Has Just Shared 3 Game-Changing Insights (NASDAQ:MU) ⏱️Benchmarking

seekingalpha.com·43m

AI Bottlenecks, the chokepoint thesis for the AI buildout 🔍Nsight

aibottlenecks.app·8h

Inside SambaNova's Inference Architecture ⚡ONNX Runtime

viksnewsletter.com·1d

KV Cache Is Becoming the Memory Hierarchy of Inference 🧠CPU Architecture

touchdown-labs.com·3d

Introducing the Ettin Reranker Family 📉Model Quantization

huggingface.co·2d·r/LocalLLaMA

The BOOK II 🔄SIMD Programming

512pixels.net·19h

KV Cache Optimization: 3x Faster LLM Inference on 24GB VRAM 🎛️CUDA Optimization

tildalice.io·6d

sapientinc/HRM-Text: HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. 📜TorchScript

github.com·2d·r/singularity

T1 Energy spikes on record call volumes after Roth analyst calls short report a buying opportunity ⏱️Benchmarking

sherwood.news·20h
