🤖 AI Inference
Model Serving, Inference Optimization, ONNX, Model Deployment
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
💻 Local LLMs · Content type: Code · github.com · 4 days ago · Hacker News
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🏗️ AI Infrastructure · Content type: News · newsletter.semianalysis.com · 1 day ago · Hacker News
Infrastructure Options for Scalable AI Inference
🏗️ AI Infrastructure · Content type: Blog · mirantis.com · 17 hours ago
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
💻 Local LLMs · zozo123.github.io · 6 hours ago · Hacker News
Breaking the Ice: Analyzing Cold Start Latency in vLLM
💻 Local LLMs · Content type: Academic · arxiv.org · 2 days ago
Running LLM Inference on Kubernetes: What It Actually Takes
🧠 AI · Content type: Blog · fairwinds.com · 5 days ago
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠 AI · t4t.eth.link · 1 day ago · Hacker News
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🔥 PyTorch · Content type: Code · github.com · 3 days ago · Hacker News
LLM Inference Engineering Room — Part 3: The Orchestration Layer
💻 Local LLMs · Content type: Blog · vimal-dwarampudi.medium.com · 6 days ago
Build a Medical Report Analyzer on Dedicated Inference with Python
💻 Local LLMs · digitalocean.com · 6 days ago
Speculators v0.5.0: DFlash support and online training
🏗️ AI Infrastructure · developers.redhat.com · 6 days ago
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
🔥 Burn · Content type: Academic · arxiv.org · 1 day ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
🏗️ AI Infrastructure · Content type: Code · github.com · 4 days ago · Hacker News
ju4nv1e1r4/nlp_engine_inference: An inference engine for NLP models.
📝 NLP · Content type: Code · github.com · 6 days ago · r/rust
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
💻 Local LLMs · Content type: Code · github.com · 6 days ago · Hacker News
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
💻 Local LLMs · Content type: Code · github.com · 3 days ago · r/LocalLLaMA
zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability
💻 Local LLMs · Content type: Code · github.com · 6 days ago · Hacker News
fix(gateway): fail closed for unknown model auth · openclaw/openclaw@85343ea
🏗️ AI Infrastructure · Content type: Code · github.com · 5 days ago
FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location
⚡ Hardware Acceleration · Content type: Academic · arxiv.org · 6 days ago
mirkolenz/llmhop: Tiny, stateless Go router that dispatches OpenAI-compatible requests to single-model vLLM and sglang backends with zero external dependencies
🧠 AI · Content type: Code · github.com · 5 days ago · Hacker News