Prefill Decoding
prefill phase, decode phase, chunked prefill, time-to-first-token
3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1
💰 Inference Cost · Content type: Blog · databricks.com · 6 days ago
LLM Observability: What To Instrument and How To Act on It
🔭 Observability · Content type: Blog · blog.n8n.io · 2 days ago
Apple rebuilt its on-device AI stack at WWDC 2026
🔢 GEMM Optimization · Content type: Blog · ziraph.com · 1 day ago · Hacker News
Breaking the Ice: Analyzing Cold Start Latency in vLLM
💾 KV Cache · Content type: Academic · arxiv.org · 2 days ago · Hacker News
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
💾 KV Cache · Content type: Code · github.com · 3 days ago · r/LocalLLaMA
Token4Token - pay-per-token inference on Gnosis + Swarm
🧠 Inference Engineering · t4t.eth.link · 1 day ago · Hacker News
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax
🧠 Inference Engineering · Content type: Blog · lucebox.com · 6 days ago · Hacker News
"North Mini Code"; open weights, 30B param, Canadian coding model
🎮 GPU Computing · Content type: Blog · cohere.com · 2 days ago · Hacker News
How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops
🧠 Inference Engineering · Content type: Video · youtube.com · 6 days ago
Machinic Psychopharmacology: Do LLMs Self-Medicate?
💾 KV Cache · lesswrong.com · 10 hours ago · Hacker News
Youssof Altoukhi (@Youssofal_)
🧠 Inference Engineering · xcancel.com · 3 days ago · r/LocalLLaMA
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
🧠 Inference Engineering · Content type: Academic · arxiv.org · 1 day ago
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
💾 KV Cache · Content type: Code · github.com · 6 days ago · Hacker News
Architecting the Control Plane for Intelligence: System Design of an Enterprise AI Gateway
☁️ Cloud Infrastructure · Content type: Blog · medium.com · 2 days ago
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
🎮 GPU Computing · Content type: Blog · dnhkng.github.io · 2 days ago
Build a local voice agent with Red Hat OpenShift AI
🎮 GPU Computing · developers.redhat.com · 3 days ago
The Memory Problem is Solved: How Google’s Memory Caching Makes RNNs Smart Again
⚡ FlashAttention · Content type: Blog · medium.com · 2 days ago
IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
🧠 Inference Engineering · Content type: Academic · arxiv.org · 20 hours ago
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss
🧠 Inference Engineering · Content type: Code · github.com · 1 day ago · r/LocalLLaMA
Benchmarking dots.tts on Strix Halo
🎮 GPU Computing · sleepingrobots.com · 3 days ago