Scour
Context Windows
Specific
Long Context Models, Memory Management, Attention Patterns
Scoured 121 posts in 6.8 ms
RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
🤖 LLM
Content type: Academic
arxiv.org · 6 days ago
Big Blue’s Redbook on Storage Scale KV Cache management
🤖 Agent
Content type: News
blocksandfiles.com · 1 day ago
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🦙 Llama
Content type: Code
github.com · 15 hours ago · Hacker News
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference
⚡ Inference Optimization
Content type: Blog
medium.com · 2 days ago
Less-relevant results
The Inference Alpha: Maximizing Frontier Models on AMD
⚡ Inference Optimization
Content type: Blog
digitalocean.com · 16 hours ago
How we fight GPU scarcity without compromise
🤖 LLM
Content type: Blog
equixly.com · 5 days ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🦙 Llama
local-llm.utop.workers.dev · 3 days ago · Hacker News
DiffusionGemma: The Developer Guide
🎯 Fine-tuning
Content type: Blog
developers.googleblog.com · 1 day ago
Stateful Swarms: How Persistent Memory Beats Traditional Agent Architectures
💭 Context Management
Content type: News
artificialintelligencemadesimple.com · 5 days ago
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
📞 Function Calling
Content type: Blog
dnhkng.github.io · 2 days ago
Massive AI Storage Demand Creates a New Memory Wall
🤖 LLM
Content type: News
eetimes.com · 16 hours ago
Youssof Altoukhi (@Youssofal_)
⚡ Inference Optimization
xcancel.com · 3 days ago · r/LocalLLaMA
WEKA software speeds long context AI inferencing on Oracle’s public cloud
🤖 Agent
Content type: News
blocksandfiles.com · 16 hours ago
libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
🎯 Fine-tuning
Content type: Code
github.com · 1 day ago
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs
🤖 LLM
Content type: Academic
arxiv.org · 2 days ago
google/gemma-4-12B-it-qat-q4_0-gguf
⚡ Inference Optimization
huggingface.co · 5 days ago
Machinic Psychopharmacology: Do LLMs Self-Medicate?
🤖 LLM
lesswrong.com · 16 hours ago · Hacker News
Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified
💻 Cursor
Content type: Blog
medium.com · 6 days ago · Hacker News
Report: GKE Inference Gateway delivers up to 92% faster AI responses
🤖 LLM
Content type: Blog
cloud.google.com · 2 days ago · Hacker News
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
🤖 LLM
Content type: Academic
arxiv.org · 3 hours ago