Scour
🤖 LLM Inference: Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 17420 posts in 33.6 ms
Deep Dive into vLLM: How PagedAttention & Continuous Batching Revolutionized LLM Inference
  🦙 Ollama · dev.to · 1d · DEV

What if AI doesn’t need more RAM but better math?
  ⚡ Quantization · adlrocha.substack.com · 4d · Substack

OrionsLock/SALOMI: Research code for extreme low-bit transformer quantization and inference.
  🔍 Binary Diffing · github.com · 15h · Hacker News

What is inference engineering? Deep dive
  💸 Inference Costs · newsletter.pragmaticengineer.com · 2d

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent
  ⚙️ MLOps · medium.com · 6d

Speculative Decoding: How LLMs Generate Text 3x Faster
  🧠 LLM · analyticsvidhya.com · 1d

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads
  💸 Inference Costs · engineering.fb.com · 2d · Hacker News

Local LLM Inference in 2026: The Complete Guide to Tools, Hardware & Open-Weight Models
  🦙 Ollama · dev.to · 4d · DEV

alexziskind1/llm-inference-calculator
  🦙 Ollama · github.com · 1d

Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference
  🦙 Ollama · dev.to · 1d · DEV

SharpAI/SwiftLM: ⚡ Native Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.
  🦙 Ollama · github.com · 1d · Hacker News

Local LLM Acceleration: Quantization, TTS, and 1M Tokens/Sec
  🔀 Model Routing · dev.to · 6d · DEV

Local LLM Unleashed: Faster Inference, Instant Starts, & Open TTS
  🦙 Ollama · dev.to · 6d · DEV

Distributed LLM Inference Across NVIDIA Blackwell and Apple Silicon Over 10GbE
  🔀 Model Routing · dev.to · 2d · DEV

Why Inference Compression Compounds for Modular Agents
  🧠 Context Engineering · dev.to · 2d · DEV

Save money on AI using those permanent free LLM APIs
  🦙 Ollama · dev.to · 4d · DEV

Quantization — Deep Dive + Problem: Smallest Window Containing All Features
  ⚡ Quantization · dev.to · 2d · DEV

I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested
  🦙 Ollama · dev.to · 5d · DEV

Google's TurboQuant: How They Cut LLM Memory by 6x Without Losing Accuracy
  ⚡ Quantization · dev.to · 6d · DEV

KV Cache in LLMs
  🧠 LLM · dev.to · 6d · DEV