GEMM Optimization
Specific
matrix multiply, cuBLAS, GEMM kernel, tiling, compute-bound
Scoured 50 posts in 8.6 ms

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.
🟢 CUDA
Content type: Code
github.com · 2 days ago · Hacker News

Operator Fusion for LLM Inference on the Tensix Architecture
⚙️ ML Compilers
Content type: Academic
arxiv.org · 20 hours ago

The economics of speculative decoding
🚀 Speculative Decoding
Content type: Blog
fergusfinn.com · 3 days ago · Hacker News

Apple rebuilt its on-device AI stack at WWDC 2026
💰 Inference Cost
Content type: Blog
ziraph.com · 1 day ago · Hacker News

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon
🟢 CUDA
Content type: Blog
tridao.me · 1 day ago · Hacker News

Less-relevant results
The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking
💰 Inference Cost
edn.com · 6 days ago

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
🔢 FP8 Training
Content type: News, Blog
developer.nvidia.com · 2 days ago

Exploiting GPU Tensor Cores from Java using Babylon [Juan Fumero]
🟢 CUDA
openjdk.org · 1 day ago · r/java

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
🎮 GPU Computing
Content type: Blog
blogs.nvidia.com · 8 hours ago

sgl-project/sglang-omni: SGLang Omni: High-Performance Multi-Stage Pipeline Framework for Omni Models
💻 Systems Programming
Content type: Code
github.com · 22 hours ago

Toward a Small ML Runtime Stack for Raspberry Pi 5 QPUs
⚙️ ML Compilers
Content type: Academic
arxiv.org · 20 hours ago

A system programmer’s guide to LLM inference
💰 Inference Cost
Content type: Blog
blog.xiangpeng.systems · 3 days ago · Hacker News

Anatomy of a high-performance EP kernel
💰 Inference Cost
Content type: Blog
fergusfinn.com · 1 day ago · Hacker News

Google's new open model DiffusionGemma generates text from noise instead of word by word
🎮 GPU Computing
the-decoder.com · 5 hours ago

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
💰 Inference Cost
Content type: Academic
arxiv.org · 20 hours ago

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
💾 KV Cache
Content type: Code
github.com · 3 days ago · Hacker News

RapydMark CPU benchmark
🔴 ROCm
Content type: Discussion
forums.anandtech.com · 1 day ago

Chrome Users Need To Update Now As Google Patches Another Active Zero-Day
🔴 ROCm
Content type: News
hothardware.com · 1 day ago

Running LLM Inference on Kubernetes: What It Actually Takes
Inference Engineering
Content type: Blog
fairwinds.com · 5 days ago

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
Inference Engineering
Content type: Blog
tilert.ai · 2 days ago · Hacker News