GEMM Optimization

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

🟢 CUDA | Content type: Code
github.com | Hacker News

Operator Fusion for LLM Inference on the Tensix Architecture

⚙️ ML Compilers | Content type: Academic
arxiv.org

Less-relevant results

The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking

💰 Inference Cost
edn.com

The economics of speculative decoding

🚀 Speculative Decoding | Content type: Blog
fergusfinn.com | Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026

💰 Inference Cost | Content type: Blog
ziraph.com | Hacker News

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

🟢 CUDA | Content type: Blog
tridao.me | Hacker News

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🔢 FP8 Training | Content type: News, Blog
developer.nvidia.com

NVIDIA Accelerates Google DeepMind's DiffusionGemma for Local AI

🎮 GPU Computing | Content type: Blog
blogs.nvidia.com

Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]

🟢 CUDA
openjdk.org | r/java

sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models

💻 Systems Programming | Content type: Code
github.com

Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs

⚙️ ML Compilers | Content type: Academic
arxiv.org

Anatomy of a high-performance EP kernel

💰 Inference Cost | Content type: Blog
fergusfinn.com | Hacker News

Running LLM Inference on Kubernetes: What It Actually Takes

Inference Engineering | Content type: Blog
fairwinds.com

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

💾 KV Cache | Content type: Code
github.com | Hacker News

A system programmer's guide to LLM inference

💰 Inference Cost | Content type: Blog

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

💰 Inference Cost | Content type: Academic
arxiv.org

RapydMark CPU benchmark

🔴 ROCm | Content type: Discussion
forums.anandtech.com

Intel's mysterious new datacenter GPU is what Nvidia's Rubin CPX nearly was

🟢 CUDA
theregister.com

DiffusionGemma: 4x Faster Text Generation

Inference Engineering | Content type: News, Blog

Chrome Users Need To Update Now As Google Patches Another Active Zero-Day

🔴 ROCm | Content type: News
hothardware.com
