🧠 LLM Inference
Specific
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 305 posts in 5.7 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠 Local llm
Content type: Code
github.com · 4 days ago · Hacker News
Intelligent inference scheduling with llm-d on Red Hat AI
☸️ Kubernetes
developers.redhat.com · 8 hours ago
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
⚡ LLM Quantization
Content type: Academic
arxiv.org · 1 day ago
The Inference Alpha: Maximizing Frontier Models on AMD
⚡ LLM Quantization
Content type: Blog
digitalocean.com · 18 hours ago
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🤖 Machine Learning
Content type: News
newsletter.semianalysis.com · 1 day ago · Hacker News
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🧠 Local llm
Content type: News, Blog
blog.google · 5 days ago · Hacker News
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
👁️ Observability
zozo123.github.io · 21 hours ago · Hacker News
Big Blue’s Redbook on Storage Scale KV Cache management
🐢 Turso
Content type: News
blocksandfiles.com · 1 day ago
CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?
⚡ LLM Quantization
uccl-project.github.io · 1 hour ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🧠 Local llm
local-llm.utop.workers.dev · 3 days ago · Hacker News
NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving
👁️ Observability
Content type: Code
github.com · 1 hour ago · Hacker News
Qwen 3.6 27B AutoRound GGUF, need your feedback
🧠 Local llm
huggingface.co · 1 day ago · r/LocalLLaMA
Massive AI Storage Demand Creates a New Memory Wall
📝 SQLite WAL
Content type: News
eetimes.com · 18 hours ago
A system programmer’s guide to LLM inference
🤖 Qwen
Content type: Blog
blog.xiangpeng.systems · 3 days ago · Hacker News
Report: GKE Inference Gateway delivers up to 92% faster AI responses
☸️ Kubernetes
Content type: Blog
cloud.google.com · 2 days ago · Hacker News
NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
⚡ LLM Quantization
Content type: Blog
blogs.nvidia.com · 16 hours ago
How we fight GPU scarcity without compromise
⚡ LLM Quantization
Content type: Blog
equixly.com · 5 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠 Local llm
t4t.eth.link · 1 day ago · Hacker News
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🧠 Local llm
Content type: Code
github.com · 16 hours ago · Hacker News
DiffusionGemma: The Developer Guide - Google Developers Blog
⚡ LLM Quantization
Content type: Blog
developers.googleblog.com · 1 day ago · r/LocalLLaMA