GEMM Optimization

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

🟢 CUDA | Content type: Code
github.com | Hacker News

Operator Fusion for LLM Inference on the Tensix Architecture

⚙️ ML Compilers | Content type: Academic
arxiv.org

The economics of speculative decoding

🚀 Speculative Decoding | Content type: Blog
fergusfinn.com | Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026

💰 Inference Cost | Content type: Blog
ziraph.com | Hacker News

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

🟢 CUDA | Content type: Blog
tridao.me | Hacker News

Less-relevant results

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

💰 Inference Cost
edn.com

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🔢 FP8 Training | Content type: News, Blog
developer.nvidia.com

Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]

🟢 CUDA
openjdk.org | r/java

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🎮 GPU Computing | Content type: Blog
blogs.nvidia.com

sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models

💻 Systems Programming | Content type: Code
github.com

Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs

⚙️ ML Compilers | Content type: Academic
arxiv.org

A system programmer’s guide to LLM inference

💰 Inference Cost | Content type: Blog

Anatomy of a high-performance EP kernel

💰 Inference Cost | Content type: Blog
fergusfinn.com | Hacker News

Google's new open model DiffusionGemma generates text from noise instead of word by word

🎮 GPU Computing
the-decoder.com

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

💰 Inference Cost | Content type: Academic
arxiv.org

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

💾 KV Cache | Content type: Code
github.com | Hacker News

RapydMark CPU benchmark

🔴 ROCm | Content type: Discussion
forums.anandtech.com

Chrome Users Need To Update Now As Google Patches Another Active Zero-Day

🔴 ROCm | Content type: News
hothardware.com

Running LLM Inference on Kubernetes: What It Actually Takes

馃Inference EngineeringContent type: Blog
fairwinds.com

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

馃Inference EngineeringContent type: Blog
tilert.aiHacker News
