🤖 LLM Inference
Model Serving, Quantization, vLLM, ONNX Runtime

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
🧠 LLM · vldb.org · 6d

Introducing dotLLM - Building an LLM Inference Engine in C#
🦙 Ollama · kokosa.dev · 14h · Hacker News

amitshekhariitbhu/llm-internals: Learn LLM internals step by step - from tokenization to attention to inference optimization.
🧠 LLM · github.com · 1d · Hacker News

I-DLM: Introspective Diffusion Language Models
🧠 LLM · introspective-diffusion.github.io · 22h · Hacker News, r/LocalLLaMA

AMD makes a big splash with the MI355X in MLPerf Inference 6.0: Over one million tokens per second in multi-node inference
🚀 Performance · igorslab.de · 2h

The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of…
🔀 Model Routing · medium.com · 5d

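A note for skimmers: the two techniques this title names are the core of modern serving engines. Continuous batching admits new requests between decode iterations instead of waiting for a whole batch to drain, and PagedAttention stores each sequence's KV cache in fixed-size blocks tracked by a block table rather than in one contiguous buffer. A minimal sketch of the block-table idea in plain numpy; pool and block sizes here are illustrative, not taken from the article:

```python
import numpy as np

BLOCK_SIZE = 16          # tokens per KV block (vLLM's default is 16)
NUM_BLOCKS = 1024        # shared physical block pool
HEAD_DIM = 64

# One shared pool of physical KV blocks; sequences never own contiguous memory.
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    def __init__(self):
        self.block_table = []   # logical block index -> physical block id
        self.length = 0         # tokens written so far

    def append_kv(self, k, v):
        """Write one token's K/V pair, allocating a new block on demand."""
        if self.length % BLOCK_SIZE == 0:          # first token, or block full
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.length // BLOCK_SIZE]
        slot = self.length % BLOCK_SIZE
        kv_pool[block, slot, 0] = k
        kv_pool[block, slot, 1] = v
        self.length += 1

    def gather_keys(self):
        """Reassemble this sequence's keys for attention (a real
        paged-attention kernel reads blocks in place, skipping this copy)."""
        blocks = kv_pool[self.block_table]         # (n_blocks, BLOCK_SIZE, 2, HEAD_DIM)
        return blocks[:, :, 0].reshape(-1, HEAD_DIM)[: self.length]

seq = Sequence()
for _ in range(40):                                # 40 tokens -> 3 blocks
    seq.append_kv(np.random.randn(HEAD_DIM), np.random.randn(HEAD_DIM))
print(seq.block_table, seq.gather_keys().shape)    # [0, 1, 2] (40, 64)
```

Because blocks come on demand from a shared pool, memory is wasted only in each sequence's last partially filled block, which is what lets an engine pack many more concurrent sequences onto one GPU.
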
Stop benchmarking inference providers, a guide to easy evaluation
📊 Performance Tools · huggingface.co · 15h · r/LocalLLaMA

Four Reasons Why FPGAs Hit the Sweet Spot for LLM Inference
⚡ Hardware Acceleration · pub.towardsai.net · 15h

Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI
⚙️ MLOps · walsenburgtech.com · 3d · Hacker News

Model API Performance
🦙 Ollama · news.ycombinator.com · 19h · Hacker News

patilyashvardhan2002-byte/lazy-moe: The GPU-free LLM inference engine. Combines lazy expert loading + TurboQuant KV compression to run models that shouldn't fit on your hardware. Built from scratch, fully local, zero cloud.
🦙 Ollama · github.com · 2d · r/LocalLLaMA

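The repo description names two mechanisms; the first, lazy expert loading, is easy to picture. In a mixture-of-experts layer only the top-k routed experts run for a given token, so the rest of the expert weights can stay on disk and be paged in behind a small LRU cache. A rough sketch of that loading pattern; the one-matrix-per-expert layout, cache size, and shapes are assumptions, and the TurboQuant KV-compression half is not covered here:

```python
import numpy as np
from collections import OrderedDict

NUM_EXPERTS, TOP_K, CACHE_CAP = 64, 2, 8
D_MODEL = 512

class LazyExpertStore:
    """Keep at most CACHE_CAP expert weight matrices in RAM; evict LRU."""
    def __init__(self):
        self.cache = OrderedDict()

    def _load_from_disk(self, idx):
        # Stand-in for np.load(f"experts/expert_{idx}.npy") in a real layout.
        rng = np.random.default_rng(idx)
        return rng.standard_normal((D_MODEL, D_MODEL), dtype=np.float32)

    def get(self, idx):
        if idx in self.cache:
            self.cache.move_to_end(idx)          # mark as recently used
        else:
            if len(self.cache) >= CACHE_CAP:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[idx] = self._load_from_disk(idx)
        return self.cache[idx]

def moe_layer(x, router_logits, store):
    """Run only the TOP_K experts the router selects for this token."""
    top = np.argsort(router_logits)[-TOP_K:]
    gates = np.exp(router_logits[top]) / np.exp(router_logits[top]).sum()
    return sum(g * (store.get(i) @ x) for g, i in zip(gates, top))

store = LazyExpertStore()
x = np.random.randn(D_MODEL).astype(np.float32)
y = moe_layer(x, np.random.randn(NUM_EXPERTS), store)
print(y.shape, len(store.cache))   # (512,) with at most CACHE_CAP experts resident
```

The trade-off is that a cache miss stalls the token on disk I/O, so a design like this presumably works best when nearby tokens tend to reuse the same experts.
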
Inside the Token Factory: A First-Principles Comparison of vLLM and SGLang
🔌 LSP · hxu296.github.io · 3d · Hacker News

LLM inference, optimized for your Mac
🦙 Ollama · omlx.ai · 4d · Hacker News

LLM inference engine written ground-up natively in C#/.NET
🦙 Ollama · dotllm.dev · 13h · Hacker News

Tutorial: ZML - Understanding Deep Learning Inference: From Black Box to Bare Metal with ResNet-18
🧠 Deep Learning · neudinger.medium.com · 4d

We Put a Gaming Box in the Inference Loop
💸 Inference Costs · write.as · 6d

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
💸 Inference Costs · pub.towardsai.net · 6d

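Worth a one-screen summary for anyone not clicking through: prefill computes K and V for the whole prompt in one parallel pass, then decode generates one token at a time, each step reading the entire cache but appending only a single K/V row, which is why decode tends to be memory-bandwidth-bound rather than compute-bound. A toy single-head illustration in numpy; the shapes and the fake embedding feedback loop are mine, not the article's:

```python
import numpy as np

D = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * D**-0.5 for _ in range(3))

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# Prefill: process the whole prompt at once and cache every K/V row.
prompt = rng.standard_normal((10, D))        # 10 prompt "embeddings"
K_cache, V_cache = prompt @ Wk, prompt @ Wv  # one big matmul each

# Decode: one token per step, appending a single K/V row each time.
x = prompt[-1]
for _ in range(5):
    out = attend(x @ Wq, K_cache, V_cache)   # reads the *entire* cache...
    K_cache = np.vstack([K_cache, out @ Wk]) # ...but writes just one row,
    V_cache = np.vstack([V_cache, out @ Wv]) # so each step is bandwidth-bound
    x = out                                  # toy stand-in for the next embedding
print(K_cache.shape)                         # (15, 64): 10 prefill + 5 decode rows
```
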
milanm/AutoGrad-Engine: A complete GPT language model (training and inference) in ~600 lines of pure C#, zero dependencies
🧠 LLM · github.com · 5d · Hacker News

I Ran My KYB Engine at Three Quantization Levels. Accuracy Didn't Move. Cost Dropped 6x.
💸 Inference Costs · walsenburgtech.com · 5d · Hacker News

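A result like this is less surprising once you see the mechanics: post-training quantization maps float weights to low-bit integers with a per-tensor or per-channel scale, and for many tasks the rounding error is small enough that accuracy barely moves while memory and bandwidth costs drop with the bit width. A minimal symmetric int8 round trip; this is the textbook scheme, not the author's KYB pipeline:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ~= q * scale, q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()

print(f"bytes: {w.nbytes} -> {q.nbytes} (4x smaller)")  # 262144 -> 65536
print(f"mean abs rounding error: {err:.5f}")            # tiny next to mean |w| of ~0.8
```

Per-channel scales (one scale per output row) typically cut the error further at little runtime cost, which is why most deployment stacks prefer them for weights.
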
Beledarian/wgpu-llm: A from-scratch LLM inference engine that uses wgpu (the cross-platform WebGPU implementation) to dispatch WGSL compute shaders for every math operation a Transformer needs. No CUDA. No Python. No massive framework dependencies. Just Rust, raw shaders, and your GPU.
🦙 Ollama · github.com · 3d · Hacker News