KV Cache
Specific
attention cache, memory efficiency, inference optimization
Scoured 163 posts in 6.3 ms
Anatomy of a high-performance EP kernel
🖥️ Inference Engineering · Content type: Blog · fergusfinn.com · 1 day ago · Hacker News
Stateful Swarms: How Persistent Memory Beats Traditional Agent Architectures
🖥️ Inference Engineering · Content type: News · artificialintelligencemadesimple.com · 6 days ago
Report: GKE Inference Gateway delivers up to 92% faster AI responses
🖥️ Inference Engineering · Content type: Blog · cloud.google.com · 2 days ago · Hacker News
Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good
🖥️ Inference Engineering · Content type: Blog · towardsai.net · 3 days ago
Claude Fable 5 🚀, Gemini 3.5 Live Translate 📱, scaling test time compute 📈
🎯 Fine-tuning · tldr.tech · 1 day ago
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🖥️ Inference Engineering · Content type: Code · github.com · 4 days ago · Hacker News
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
🖥️ Inference Engineering · Content type: Academic · arxiv.org · 2 days ago
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
🖥️ Inference Engineering · Content type: News, Blog · kaitchup.substack.com · 5 days ago · r/LocalLLaMA
Integrate OpenShift AI and PG Airman MCP Server
🖥️ Inference Engineering · developers.redhat.com · 2 days ago
#065 - Claude writes 80% of Anthropic's own code, Cloudflare buys Vite, ChatGPT ships Dreaming memory
🖥️ Inference Engineering · indiehacker.news · 6 days ago
libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
🖥️ Inference Engineering · Content type: Code · github.com · 1 day ago
Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation
💰 API Pricing · Content type: Academic · arxiv.org · 12 hours ago
Efficient and Training-Free Single-Image Diffusion Models
🎯 Fine-tuning · haojunqiu.github.io · 5 days ago · Hacker News
Florian Brand, Prime Intellect research engineer, adopts Gemma 4 E4B 6-bit quantized as his primary local Mac LLM
🖥️ Inference Engineering · Content type: News · digg.com · 4 days ago · Hacker News
Making Local LLM Fast
🖥️ Inference Engineering · bogdan.nimblex.net · 1 day ago · Hacker News
What Arm-based innovations happened in May 2026?
🖥️ Inference Engineering · Content type: Blog · newsroom.arm.com · 6 days ago
Build a local voice agent with Red Hat OpenShift AI
🖥️ Inference Engineering · developers.redhat.com · 3 days ago
The economics of speculative decoding
🖥️ Inference Engineering · Content type: Blog · fergusfinn.com · 3 days ago · Hacker News
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🖥️ Inference Engineering · Content type: Code · github.com · 1 day ago · Hacker News
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
🖥️ Inference Engineering · Content type: Academic · arxiv.org · 12 hours ago