🔢 GEMM Optimization - nayyara.airlangga · Scour

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

🟢CUDA Code

github.com··Hacker News

Operator Fusion for LLM Inference on the Tensix Architecture

⚙️ML Compilers Academic

The economics of speculative decoding

🚀Speculative Decoding Blog

fergusfinn.com··Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026

💰Inference Cost Blog

ziraph.com··Hacker News

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

🟢CUDA Blog

tridao.me··Hacker News

Less-relevant results

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

💰Inference Cost

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🔢FP8 Training News Blog

developer.nvidia.com·

Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]

openjdk.org··r/java

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🎮GPU Computing Blog

blogs.nvidia.com·

sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models

💻Systems Programming Code

Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs

⚙️ML Compilers Academic

A system programmer’s guide to LLM inference

💰Inference Cost Blog

blog.xiangpeng.systems··Hacker News

Anatomy of a high-performance EP kernel

💰Inference Cost Blog

fergusfinn.com··Hacker News

Google's new open model DiffusionGemma generates text from noise instead of word by word

🎮GPU Computing

the-decoder.com·

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

💰Inference Cost Academic

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

💾KV Cache Code

github.com··Hacker News

RapydMark CPU benchmark

🔴ROCm Discussion

forums.anandtech.com·

Chrome Users Need To Update Now As Google Patches Another Active Zero-Day

🔴ROCm News

hothardware.com·

Running LLM Inference on Kubernetes: What It Actually Takes

🧠Inference Engineering Blog

fairwinds.com·

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

🧠Inference Engineering Blog

tilert.ai··Hacker News
