LLM Inference
Specific: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 180 posts in 33.2 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🏗️ LLM Infrastructure · Content type: Code · github.com · 3 days ago · Hacker News
LLM Inference Handbook 2026
🤖 AI · pub.towardsai.net · 1 day ago
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
⚡ Fast AI Inference · Content type: News · newsletter.semianalysis.com · 22 hours ago · Hacker News
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
🏗️ LLM Infrastructure · Content type: Academic · arxiv.org · 1 day ago
Improved performance and model support with GGUF
🤖 AI · Content type: Blog · ollama.com · 5 days ago
Less-relevant results
Token4Token — pay-per-token inference on Gnosis + Swarm
🤖 AI · t4t.eth.link · 23 hours ago · Hacker News
Fixing a stuck Ollama runner and building a GPU watchdog
🤖 AI · patrickmccanna.net · 1 day ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🤖 AI · local-llm.utop.workers.dev · 3 days ago · Hacker News
A system programmer’s guide to LLM inference
🤖 AI · Content type: Blog · blog.xiangpeng.systems · 2 days ago · Hacker News
NexusOS v2.0 – A zero-dependency pipeline streaming server chaos to Parquet
🤖 AI · huggingface.co · 2 days ago · Hacker News
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🤖 AI · Content type: News, Blog · blog.google · 4 days ago · Hacker News
How we fight GPU scarcity without compromise
🏗️ LLM Infrastructure · Content type: Blog · equixly.com · 5 days ago · Hacker News
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
⚡ Fast AI Inference · huggingface.co · 1 day ago · r/LocalLLaMA
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
🏗️ LLM Infrastructure · Content type: Code · github.com · 5 days ago · Hacker News
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
🤖 AI · Content type: Academic · arxiv.org · 2 days ago
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
🆕 New AI · Content type: News, Blog · kaitchup.substack.com · 4 days ago · r/LocalLLaMA
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
🤖 AI · vettedconsumer.com · 3 days ago · Hacker News
Youssof Altoukhi (@Youssofal_)
🤖 AI · xcancel.com · 2 days ago · r/LocalLLaMA
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🤖 AI · deemwar-products.github.io · 5 days ago · Hacker News
WWDC 2026: Foundation Models (& Anarlog)
🆕 New AI · skushagra.com · 1 day ago