GEMM Optimization
Specific
matrix multiply, cuBLAS, GEMM kernel, tiling, compute-bound
Scoured 50 posts in 12.5 ms
RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.
🟢 CUDA · Content type: Code · github.com · 2 days ago · Hacker News
Operator Fusion for LLM Inference on the Tensix Architecture
⚙️ ML Compilers · Content type: Academic · arxiv.org · 14 hours ago
Less-relevant results
The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking
💰 Inference Cost · edn.com · 6 days ago
The economics of speculative decoding
🚀 Speculative Decoding · Content type: Blog · fergusfinn.com · 2 days ago · Hacker News
Apple rebuilt its on-device AI stack at WWDC 2026
💰 Inference Cost · Content type: Blog · ziraph.com · 1 day ago · Hacker News
Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon
🟢 CUDA · Content type: Blog · tridao.me · 1 day ago · Hacker News
Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
🔢 FP8 Training · Content type: News, Blog · developer.nvidia.com · 1 day ago
NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
🎮 GPU Computing · Content type: Blog · blogs.nvidia.com · 1 hour ago
Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]
🟢 CUDA · openjdk.org · 1 day ago · r/java
sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models
💻 Systems Programming · Content type: Code · github.com · 16 hours ago
Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs
⚙️ ML Compilers · Content type: Academic · arxiv.org · 14 hours ago
Anatomy of a high-performance EP kernel
💰 Inference Cost · Content type: Blog · fergusfinn.com · 18 hours ago · Hacker News
Running LLM Inference on Kubernetes: What It Actually Takes
Inference Engineering · Content type: Blog · fairwinds.com · 5 days ago
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
💾 KV Cache · Content type: Code · github.com · 3 days ago · Hacker News
A system programmer’s guide to LLM inference
💰 Inference Cost · Content type: Blog · blog.xiangpeng.systems · 2 days ago · Hacker News
SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
💰 Inference Cost · Content type: Academic · arxiv.org · 14 hours ago
RapydMark CPU benchmark
🔴 ROCm · Content type: Discussion · forums.anandtech.com · 1 day ago
Intel's mysterious new datacenter GPU is what Nvidia's Rubin CPX nearly was
🟢 CUDA · theregister.com · 6 days ago
DiffusionGemma: 4x Faster Text Generation
Inference Engineering · Content type: News, Blog · blog.google · 2 hours ago · Hacker News, r/LocalLLaMA, r/singularity
Chrome Users Need To Update Now As Google Patches Another Active Zero-Day
🔴 ROCm · Content type: News · hothardware.com · 1 day ago