LLM Inference
Specific: Model Serving, Quantization, vLLM, ONNX Runtime
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
🧠 LLM · Content type: Code · github.com · 5 days ago · Hacker News
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
🧠 LLM · Content type: Academic · arxiv.org · 18 hours ago
LLM Inference Handbook 2026
🧠 LLM · pub.towardsai.net · 2 days ago
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
📡 Observability · zozo123.github.io · 11 hours ago · Hacker News
A system programmer’s guide to LLM inference
🧠 LLM · Content type: Blog · blog.xiangpeng.systems · 2 days ago · Hacker News
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
💬 NLP · vettedconsumer.com · 4 days ago · Hacker News
Less-relevant results
DiffusionGemma: The Developer Guide- Google Developers Blog
📡 Information Theory · Content type: Blog · developers.googleblog.com · 22 hours ago · r/LocalLLaMA
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🧠 Context Engineering · huggingface.co · 2 days ago · r/LocalLLaMA
Youssof Altoukhi (@Youssofal_)
♟️ Game Theory · xcancel.com · 3 days ago · r/LocalLLaMA
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
🧠 LLM · Content type: Academic · arxiv.org · 18 hours ago
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
💬 LLMs · Content type: Code · github.com · 4 days ago · Hacker News
Optimizing Local LLM Inference on Constrained Hardware
🧠 LLM · pub.towardsai.net · 7 hours ago
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
🧠 LLM · Content type: Academic · arxiv.org · 1 day ago
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🧠 LLM · Content type: Code · github.com · 3 days ago · r/LocalLLaMA
Breaking the Ice: Analyzing Cold Start Latency in vLLM
⚡ Side-Channel Attacks · Content type: Academic · arxiv.org · 2 days ago · Hacker News
How LLM Quantization Works: INT8, INT4, GPTQ, and AWQ Explained
🧠 LLM · pub.towardsai.net · 4 days ago
SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
🧠 LLM · Content type: Academic · arxiv.org · 1 day ago
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
🎯 AI Agents · Content type: Code · github.com · 6 days ago · Hacker News
ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
💬 NLP · Content type: Academic · arxiv.org · 18 hours ago
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
🧠 LLM · Content type: Academic · arxiv.org · 2 days ago