Scour
🤖 LLM Inference (Specific): Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 8354 posts in 10.4 ms
MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
🧠 LLM · arxiv.org · 3d · …
alexziskind1/llm-inference-calculator
🧠 LLM · github.com · 1d · …
What is inference engineering? Deepdive
✍️ Prompt Engineering · newsletter.pragmaticengineer.com · 1d · …
What if AI doesn’t need more RAM but better math?
💬 LLMs · adlrocha.substack.com · 4d · Substack · …
Fast and Accurate Probing of In-Training LLMs' Downstream Performances
🧠 LLM · arxiv.org · 11h · …
Pure C implementation of the TurboQuant paper (ICLR 2026) for KV cache compression in LLM inference.
⚡ Assembly Language · github.com · 1d · r/LocalLLaMA · …
Executing as You Generate: Hiding Execution Latency in LLM Code Generation
⚙️ Compilers · arxiv.org · 11h · …
G-DriftMIA: Membership Inference via Gradient-Induced Feature Drift in LLMs
💬 LLMs · arxiv.org · 11h · …
SharpAI/SwiftLM: ⚡ Native Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.
💻 KVM · github.com · 1d · Hacker News · …
m0at/rvllm: rvLLM: High-performance LLM inference in Rust. Drop-in vLLM replacement.
🧠 LLM · github.com · 4d · Hacker News · …
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
🧠 LLM · arxiv.org · 1d · …
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
🧠 LLM · arxiv.org · 2d · …
TAMI-MPC: Trusted Acceleration of Minimal-Interaction MPC for Efficient Nonlinear Inference
✍️ Prompt Engineering · arxiv.org · 6d · …
Efficient Inference of Large Vision Language Models
👁️ Multimodal AI · arxiv.org · 2d · …
Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
💬 LLMs · arxiv.org · 3d · …
ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
🧠 LLM · arxiv.org · 2d · …
Compiling Code LLMs into Lightweight Executables
⚙️ Compilers · arxiv.org · 1d · …
Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
⚙️ Program Synthesis · arxiv.org · 2d · …
Multiple-Prediction-Powered Inference
🎲 Bayesian Inference · arxiv.org · 2d · …
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
🧠 LLM · arxiv.org · 1d · …