👁️ Attention Optimization - miterion · Scour

Understanding KV Cache: The Hidden Memory Cost of Serving LLMs ⚡Flash Attention

melchi.me·2d·Hacker News

KV Cache and Flash Attention with interactive diagrams 🔲Loop Tiling

kvcache.cobanov.dev·21h·Hacker News

Luce Megakernel: Why is nobody talking about this? 🔍Nsight

github.com·5d·r/LocalLLaMA

SpecSA: Bridging Speculative Decoding and Sparse Attention for Efficient LLM Inference ⚡Flash Attention

LLM Inference 🎓Model Distillation

iop.systems·14h

KV Cache Optimization: 3x Faster LLM Inference on 24GB VRAM 🎛️CUDA Optimization

tildalice.io·6d

[Paper Walkthrough] DeepSeek-V4 🧮cuDNN

wkq9411.github.io·3h

Gemini 3.5 Flash ⚡️, Karpathy joins Anthropic 🧑‍💻, OpenAI Guaranteed Capacity ⚡ 🔄ONNX

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention ⚡Flash Attention

magazine.sebastianraschka.com·5d·Hacker News, r/LocalLLaMA

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate ⚡ONNX Runtime

pytorch.org·2d·Hacker News

Nvidia unveils its diffusion language model, "Nemotron-Labs-Diffusion" 🏎️TensorRT

huggingface.co·7h·Hacker News

What GPU kernels mean for your distributed inference 🎯GPU Kernels

developers.redhat.com·1d

Show HN: FlashAttention-2 in Cute, from Scratch ⚡Flash Attention

blog.echen.io·3d·Hacker News

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW) ⚡ONNX Runtime

semiengineering.com·23h

DeepSeek Agent Harness: Technical deep-dive & the open-source blueprint 🤖AI Coding Tools

dlcmh.github.io·15h·Hacker News

Show HN: The Name in the Bracket (a free book on naming tensor dimensions) 🔍Type Checkers

einlang.github.io·3d·Hacker News

sapientinc/HRM-Text: HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning. 📜TorchScript

github.com·2d·r/singularity

GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU 📈GPU Occupancy

theahmadosman.substack.com·20h·Substack, r/LocalLLaMA

Maker packs an opinionated, googly-eyed AI chatbot into a mobile suitcase, powered by an Nvidia Jetson — entirely local machine entity runs Gemma 4 E4B and can respond in 200ms ⚡Flash Attention

tomshardware.com·4d

Deploying inference endpoints with PD disaggregation on AMD GPUs ⏱️CUDA Events

dstack.ai·2h·Hacker News
