Scour
🧠 LLM Inference
Specific: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 26,031 posts in 17.6 ms
LLM Terminology Guide: Weights, Inference, Effective Sequence Length, and Self-Hosting Explained
devforth.io · 17h · Discuss: Hacker News
🏗️ LLM Infrastructure

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference
arxiv.org · 2d
🏗️ LLM Infrastructure
Mamba-3: An Inference-First State Space Model
blog.cartesia.ai · 1d
📱 Edge AI Optimization

EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
arxiv.org · 1d
💾 Prompt Caching
MacinAI Local: Building a Model-Agnostic LLM Inference Engine for Mac OS 9
oldapplestuff.com · 5h · Discuss: Hacker News
🦙 Ollama

Show HN: Llmtop – Htop for LLM Inference Clusters (vLLM, SGLang, Ollama, llama)
github.com · 3d · Discuss: Hacker News
🦙 Ollama
Introducing multi-cluster GKE Inference Gateway: Scale AI workloads around the world
cloud.google.com · 3d
🧠 Inference Serving
Less-relevant results
NVIDIA, Telecom Leaders Build AI Grids to Optimize Inference on Distributed Networks
blogs.nvidia.com · 3d
🖥 GPUs

Delay the Inference
aishwaryagoel.com · 5d · Discuss: Hacker News
💾 Prompt Caching
Building a Kubernetes-native pattern for AI infrastructure at scale
thenewstack.io · 1d
🏗️ LLM Infrastructure

SOTA Embedding Model for Agentic Workflows Now in Public Preview
databricks.com · 3d
🎨 Chroma
The Agentic AI Era: How NVIDIA Rubin, Vera CPU, Groq 3 LPUs, BlueField-4 Redefine the Inference Factory
buysellram.com · 4d · Discuss: Hacker News
🏗️ LLM Infrastructure

Multimodal ion-gated transistor based on 2D superionic conductor for in-memory computing in deep learning
nature.com · 2d
⚡ Hardware Acceleration
vLLM on Jetson Orin — pre-built wheel with Marlin GPTQ support (3.8x prefill speedup)
github.com · 6d · Discuss: r/LocalLLaMA
🏗️ LLM Infrastructure

Let the AI Out: Edge AI on a Microcontroller — From Zero to Inference in 90 Minutes
es617.github.io · 5d · Discuss: Hacker News
📱 Edge AI Optimization
Mamba-3
together.ai · 4d · Discuss: Hacker News, r/LocalLLaMA
🔢 BitNet Inference

How To Use YOLOv8 TensorRT 10 For 10x Faster Inference
eranfeit.net · 5d
⚡ Hardware Acceleration
How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale
developer.nvidia.com · 4d · Discuss: Hacker News
🏗️ LLM Infrastructure
Show HN: We Built Private Post-Training and Inference for Frontier Models
workshoplabs.ai · 4d · Discuss: Hacker News
🖥 GPUs

Together AI at NVIDIA GTC 2026: Explore our latest innovations across research and products
together.ai · 5d
🏗️ LLM Infrastructure