LLM Inference
Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 357 posts in 14.2 ms
LLM Inference
llama.cpp · iop.systems · 1h
KV Cache Optimization: 3x Faster LLM Inference on 24GB VRAM
llama.cpp · tildalice.io · 6d
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
Memory Allocators · supercomputing-system-ai-lab.github.io · 2d · Hacker News
InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents
llama.cpp · inferencebench.ai · 5h · Hacker News
Lever: Speculative LLM Inference on Smartphones
llama.cpp · arxiv.org · 2d
AMD says its $4K Ryzen AI Halo workstation practically pays for itself
⚙️ Zig · theregister.com · 16h
The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs
llama.cpp · cloudnativenow.com · 5d
tvall43/Qwen3.5-14B-A3B-Claude-4.6-Opus-Reasoning-Distilled-reap-gguf at main
llama.cpp · huggingface.co · 17h · r/LocalLLaMA
Understanding KV Cache: The Hidden Memory Cost of Serving LLMs
llama.cpp · melchi.me · 1d · Hacker News
GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU
Memory Allocators · theahmadosman.substack.com · 7h · Substack, r/LocalLLaMA
Command A+: Making sovereign agentic capabilities available to all
⚙️ Zig · cohere.com · 12h · Hacker News
chiennv2000/orthrus: Fast, lossless LLM inference via dual-view diffusion decoding.
llama.cpp · github.com · 5d · Hacker News
Coding Agent Inference Benchmark Revealed
AI · startuphub.ai · 1d
I tried 4 LLM speedup techniques on CPU. Three made it slower.
llama.cpp · deemwar-products.github.io · 9h · Hacker News
KV Cache Is Becoming the Memory Hierarchy of Inference
llama.cpp · touchdown-labs.com · 2d
https://www.together.ai/blog/coding-agent-benchmarks
llama.cpp · together.ai · 5d
KV Cache and Flash Attention with interactive diagrams
Memory Allocators · kvcache.cobanov.dev · 9h · Hacker News
What GPU kernels mean for your distributed inference
llama.cpp · developers.redhat.com · 1d
How LLM Inference Works
llama.cpp · arpitbhayani.me · 6d · Hacker News
How I Shipped an Autonomous Agentic System on a 2026 Serverless-GPU Stack
llama.cpp · medium.com · 2d