👁️ Attention Optimization
Flash Attention, Memory Efficient, Sparse Attention, Transformers
Understanding KV Cache: The Hidden Memory Cost of Serving LLMs
⚡ Flash Attention
melchi.me · 2d · Hacker News
KV Cache and Flash Attention with interactive diagrams
🔲 Loop Tiling
kvcache.cobanov.dev · 21h · Hacker News
Luce Megakernal: Why nobody is taking about this?
🔍 Nsight
github.com · 5d · r/LocalLLaMA
SpecSA: Bridging Speculative Decoding and Sparse Attention for Efficient LLM Inference
⚡ Flash Attention
arxiv.org · 1d
LLM Inference
🎓 Model Distillation
iop.systems · 14h
KV Cache Optimization: 3x Faster LLM Inference on 24GB VRAM
🎛️ CUDA Optimization
tildalice.io · 6d
[Paper Analysis] DeepSeek-V4
🧮 cuDNN
wkq9411.github.io · 3h
Gemini 3.5 Flash ⚡️, Karpathy joins Anthropic 🧑💻, OpenAI Guaranteed Capacity ⚡
🔄 ONNX
tldr.tech · 1d
Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
⚡ Flash Attention
magazine.sebastianraschka.com · 5d · Hacker News, r/LocalLLaMA
Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate
⚡ ONNX Runtime
pytorch.org · 2d · Hacker News
Nvidia unveils its spreading language model, "Nemotron-Labs-Diffusion"
🏎️ TensorRT
huggingface.co · 7h · Hacker News
What GPU kernels mean for your distributed inference
🎯 GPU Kernels
developers.redhat.com · 1d
Show HN: FlashAttention-2 in Cute, from Scratch
⚡ Flash Attention
blog.echen.io · 3d · Hacker News
Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW)
⚡ ONNX Runtime
semiengineering.com · 23h
DeepSeek Agent Harness: Technical deep-dive & the open-source blueprint
🤖 AI Coding Tools
dlcmh.github.io · 15h · Hacker News
Show HN: The Name in the Bracket (a free book on naming tensor dimensions)
🔍 Type Checkers
einlang.github.io · 3d · Hacker News
sapientinc/HRM-Text: HRM-Text is a 1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning.
📜 TorchScript
github.com · 2d · r/singularity
GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU
📈 GPU Occupancy
theahmadosman.substack.com · 20h · Substack, r/LocalLLaMA
Maker packs an opinionated, googly-eyed AI chatbot into a mobile suitcase, powered by an Nvidia Jetson — entirely local machine entity runs Gemma 4 E4B and can respond in 200ms
⚡ Flash Attention
tomshardware.com · 4d
Deploying inference endpoints with PD disaggregation on AMD GPUs
⏱️ CUDA Events
dstack.ai · 2h · Hacker News