LLM Inference
Specific
inference serving, vLLM, TensorRT, model serving, token generation
Scoured 283 posts in 12.6 ms
The Quantization Error of the Soul: Why Silicon Valley is Inverting the Promethean Fire
✍️ Prompt Engineering · Content type: Blog · medium.com · 18 hours ago
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
🤖 LLMs · Content type: News · decrypt.co · 3 days ago · Hacker News
Infrastructure Options for Scalable AI Inference
📈 Performance Engineering · Content type: Blog · mirantis.com · 2 days ago
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🤖 LLMs · Content type: Code · github.com · 5 days ago · Hacker News, r/LLM
SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 1 day ago
Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!
📈 Performance Engineering · gizchina.com · 3 days ago
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🤖 LLMs · deemwar-products.github.io · 6 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🤖 LLMs · t4t.eth.link · 2 days ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🤖 LLMs · local-llm.utop.workers.dev · 4 days ago · Hacker News
Your robot can’t be smart, fast, and free. Evolution solved that already.
📈 Performance Engineering · Content type: News · thenextweb.com · 15 hours ago
Apple WWDC On-Device AI Deep Dive - Google Docs
✍️ Prompt Engineering · gist.is · 1 day ago · Hacker News
Here's a llama.cpp CLI Command builder.
🤖 LLMs · llamabuilding.com · 3 days ago · r/LocalLLaMA
Google's new open-weights model brings image-generation tricks to AI text generation
🤖 LLMs · Content type: News · theregister.com · 13 hours ago
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
🤖 LLMs · Content type: Blog · ziraph.com · 6 days ago · Hacker News
Minimax M3 sm_120
🎮 GPU Computing · Content type: Code · github.com · 53 minutes ago · r/LocalLLaMA
gist:5b74b8c31e934ff50ce57aa653a343d5
✍️ Prompt Engineering · gist.github.com · 1 day ago · r/LocalLLaMA
Running LLM Inference on Kubernetes: What It Actually Takes
🤖 LLMs · Content type: Blog · fairwinds.com · 6 days ago
DiffusionGemma 26B A4B results on my 5090
🤖 LLMs · huggingface.co · 1 day ago · r/LocalLLaMA
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
The Death of the Four Golden Signals: Designing Telemetry for Non-Deterministic Infrastructure
🔭 Observability · devops.com · 6 days ago