🧠 LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 294 posts in 7.2 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠 Local llm
Content type: Code
github.com · 4 days ago · Hacker News
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
⚡ LLM Quantization
Content type: Academic
arxiv.org · 1 day ago
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
👁️ Observability
zozo123.github.io · 19 hours ago · Hacker News
The Inference Alpha: Maximizing Frontier Models on AMD
⚡ LLM Quantization
Content type: Blog
digitalocean.com · 15 hours ago
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🤖 Machine Learning
Content type: News
newsletter.semianalysis.com · 1 day ago · Hacker News
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🧠 Local llm
Content type: News, Blog
blog.google · 5 days ago · Hacker News
Big Blue’s Redbook on Storage Scale KV Cache management
🐢 Turso
Content type: News
blocksandfiles.com · 1 day ago
Massive AI Storage Demand Creates a New Memory Wall
📝 SQLite WAL
Content type: News
eetimes.com · 15 hours ago
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🧠 Local llm
local-llm.utop.workers.dev · 3 days ago · Hacker News
Qwen 3.6 27B AutoRound GGUF, need your feedback
🧠 Local llm
huggingface.co · 1 day ago · r/LocalLLaMA
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🧠 Local llm
Content type: Code
github.com · 13 hours ago · Hacker News
A system programmer’s guide to LLM inference
🤖 Qwen
Content type: Blog
blog.xiangpeng.systems · 3 days ago · Hacker News
Report: GKE Inference Gateway delivers up to 92% faster AI responses
☸️ Kubernetes
Content type: Blog
cloud.google.com · 2 days ago · Hacker News
NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
⚡ LLM Quantization
Content type: Blog
blogs.nvidia.com · 13 hours ago
How we fight GPU scarcity without compromise
⚡ LLM Quantization
Content type: Blog
equixly.com · 5 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠 Local llm
t4t.eth.link · 1 day ago · Hacker News
Apple WWDC On-Device AI Deep Dive - Google Docs
🤖 Machine Learning
gist.is · 7 hours ago · Hacker News
DiffusionGemma: The Developer Guide - Google Developers Blog
⚡ LLM Quantization
Content type: Blog
developers.googleblog.com · 1 day ago · r/LocalLLaMA
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
⚡ LLM Quantization
vettedconsumer.com · 4 days ago · Hacker News
Machinic Psychopharmacology: Do LLMs Self-Medicate?
⚡ LLM Quantization
lesswrong.com · 15 hours ago · Hacker News