KV Cache
Specific
attention cache, memory efficiency, inference optimization
Scoured 166 posts in 7.0 ms
The Sequence AI of the Week #875: Why Your Language Model Needs a Nap
💰 Compute Costs · Content type: News, Blog · thesequence.substack.com · 1 day ago · Substack
CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?
🖥️ Inference Engineering · uccl-project.github.io · 8 hours ago · Hacker News
Latest technical articles & videos.
🎯 Fine-tuning · certdepot.net · 5 days ago
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🖥️ Inference Engineering · Content type: News, Blog · blog.google · 5 days ago · Hacker News
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
🖥️ Inference Engineering · Content type: Blog · jimmysong.io · 2 days ago
[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo
🖥️ Inference Engineering · Content type: News · latent.space · 11 hours ago
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🖥️ Inference Engineering · local-llm.utop.workers.dev · 4 days ago · Hacker News
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
🖥️ Inference Engineering · Content type: Academic · arxiv.org · 1 day ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
🖥️ Inference Engineering · Content type: Code · github.com · 5 days ago · Hacker News
Machinic Psychopharmacology: Do LLMs Self-Medicate?
🖥️ Inference Engineering · lesswrong.com · 1 day ago · Hacker News
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
🖥️ Inference Engineering · Content type: Blog · tilert.ai · 2 days ago · Hacker News
Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified
🖥️ Inference Engineering · Content type: Blog · medium.com · 6 days ago · Hacker News
WEKA software speeds long context AI inferencing on Oracle’s public cloud
🖥️ Inference Engineering · Content type: News · blocksandfiles.com · 1 day ago
Where to Host Your Open-Source Model (Under 10B Parameters)
🖥️ Inference Engineering · digitalocean.com · 1 week ago
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
🖥️ Inference Engineering · Content type: Blog · dnhkng.github.io · 2 days ago
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🖥️ Inference Engineering · Content type: Code · github.com · 4 days ago · r/LocalLLaMA
Google's new open model DiffusionGemma generates text from noise instead of word by word
🖥️ Inference Engineering · the-decoder.com · 19 hours ago
Issue #390 - The ML Engineer 🤖
🖥️ Inference Engineering · Content type: News, Blog · machinelearning.substack.com · 4 days ago · Substack
Massive AI Storage Demand Creates a New Memory Wall
🖥️ Inference Engineering · Content type: News · eetimes.com · 1 day ago
Anatomy of a high-performance EP kernel
🖥️ Inference Engineering · Content type: Blog · fergusfinn.com · 1 day ago · Hacker News