🧠 LLM Inference
Topics: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 24834 posts in 52.4 ms
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
🏗️ LLM Infrastructure · arxiv.org · 13h

LLM inference engine from scratch in C++
🏗️ LLM Infrastructure · anirudhsathiya.com · 4d · Hacker News

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
💾 Prompt Caching · pub.towardsai.net · 1d

Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity
🗂️ Vector Indexes · fin.ai · 1d · Hacker News

Inference Arena – new benchmark of local inference and training
🏗️ LLM Infrastructure · kvark.github.io · 5d · Hacker News

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
🏗️ LLM Infrastructure · arxiv.org · 13h

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
🏗️ LLM Infrastructure · arxiv.org · 13h

Comparative Characterization of KV Cache Management Strategies for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
🏗️ LLM Infrastructure · arxiv.org · 13h

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
🔢 BitNet Inference · arxiv.org · 2d

Fast NF4 Dequantization Kernels for Large Language Model Inference
🔢 BitNet Inference · arxiv.org · 4d

MIPT-SSM: Scaling Language Models with $O(1)$ Inference Cache via Phase Transitions
🔢 BitNet Inference · arxiv.org · 13h

LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency
🏆 LLM Benchmarking · arxiv.org · 2d

Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
🔬 RaBitQ · arxiv.org · 13h

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
📦 Batch Embeddings · arxiv.org · 2d

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 4d

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
🏆 LLM Benchmarking · arxiv.org · 13h

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
🎯 Vector Quantization · arxiv.org · 1d

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
🏗️ LLM Infrastructure · arxiv.org · 1d

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
📦 Batch Embeddings · arxiv.org · 1d