⚡ LLM Inference
Specific
inference engine, model serving, throughput, latency
Scoured 291 posts in 7.3 ms
MLPerf and the rise of latency-aware LLM benchmarking
🧠 KV Cache
edn.com · 5 days ago
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
🧠 KV Cache
Content type: Academic
arxiv.org · 1 day ago
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
📦 Parquet
Content type: Blog
ziraph.com · 5 days ago · Hacker News
Less-relevant results
🇳🇱 Go/Golang job: Senior Backend Engineer (Go) | Studio AI at Creative Fabrica (Amsterdam, Netherlands)
🕸️ Distributed Systems
golangprojects.com · 14 hours ago
1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM
🧠 KV Cache
smolhub.com · 2 days ago · r/LocalLLaMA
NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models
The model delivers 300 tokens per second on benchmar...
⚡ vLLM
digg.com · 6 days ago
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
⚡ vLLM
deemwar-products.github.io · 5 days ago · Hacker News
Vadzo Imaging Introduces HDR MIPI CSI-2 Embedded Cameras Recommended for Drone and UAV Applications
🌊 Stream Processing
Content type: News
einpresswire.com · 22 hours ago
Nemotron 3 Ultra now available on AI Gateway
⚡ vLLM
vercel.com · 6 days ago
Google open-sources speedy DiffusionGemma text diffusion model
⚡ vLLM
siliconangle.com · 4 hours ago
Mobile AI Compute Engine (MACE) inference framework — Vision SDK
🧠 KV Cache
Content type: Blog
mapbox.com · 2 days ago
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
🧠 KV Cache
sleepingrobots.com · 4 days ago
No Token Left Behind: Demystifying Token-in-Token-Out in Miles
🌊 Stream Processing
Content type: Blog
lmsys.org · 1 day ago · Hacker News
Google’s DiffusionGemma is 4x faster than its other Gemma models
🌲 LSM Trees
thenewstack.io · 11 hours ago
Making LLMs faster and more efficient across multiple languages
⚡ vLLM
techxplore.com · 6 days ago
Which is faster: Gemini 3.5 Flash or Kimi K2.6 on Cerebras
🌊 Stream Processing
Content type: Blog
cerebras.ai · 5 days ago
146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb
🌊 Stream Processing
Content type: Blog
adambien.blog · 1 day ago
Why I care so much about energy per token
🧠 KV Cache
Content type: Blog
ziraph.com · 3 days ago · Hacker News
libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
🧠 KV Cache
Content type: Code
github.com · 1 day ago
3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1
🧠 KV Cache
Content type: Blog
databricks.com · 6 days ago