🧠 LLM Inference
Specific
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 85 posts in 5.6 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠
Local llm
Content type:
Code
github.com
·
4d
4 days ago
·
Hacker News
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
👁️
Observability
zozo123.github.io
·
17h
17 hours ago
·
Hacker News
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🤖
Machine Learning
Content type:
News
newsletter.semianalysis.com
·
1d
1 day ago
·
Hacker News
Qwen 3.6 27B AutoRound GGUF, need your feedback
🧠
Local llm
huggingface.co
·
1d
1 day ago
·
r/LocalLLaMA
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🧠
Local llm
Content type:
News, Blog
blog.google
·
5d
5 days ago
·
Hacker News
Apple WWDC On-Device AI Deep Dive - Google Docs
🤖
Machine Learning
gist.is
·
5h
5 hours ago
·
Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠
Local llm
t4t.eth.link
·
1d
1 day ago
·
Hacker News
Machinic Psychopharmacology: Do LLMs Self-Medicate?
⚡
LLM Quantization
lesswrong.com
·
13h
13 hours ago
·
Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🧠
Local llm
local-llm.utop.workers.dev
·
3d
3 days ago
·
Hacker News
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🧠
Local llm
Content type:
Code
github.com
·
12h
12 hours ago
·
Hacker News
DiffusionGemma: The Developer Guide- Google Developers Blog
⚡
LLM Quantization
Content type:
Blog
developers.googleblog.com
·
1d
1 day ago
·
r/LocalLLaMA
A system programmer’s guide to LLM inference
🤖
Qwen
Content type:
Blog
blog.xiangpeng.systems
·
3d
3 days ago
·
Hacker News
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
⚡
LLM Quantization
Content type:
Academic
arxiv.org
·
2d
2 days ago
·
Hacker News
How we fight GPU scarcity without compromise
⚡
LLM Quantization
Content type:
Blog
equixly.com
·
5d
5 days ago
·
Hacker News
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
⚡
LLM Quantization
Content type:
Blog
tilert.ai
·
2d
2 days ago
·
Hacker News
Less-relevant results
Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change
⚡
LLM Quantization
Content type:
News, Blog
andreaborio.substack.com
·
15h
15 hours ago
·
Substack
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
⚡
LLM Quantization
vettedconsumer.com
·
4d
4 days ago
·
Hacker News
Nex-N2-mini: A 35B Model Built for Autonomous Agents
🔌
Model Context Protocol
hackernoon.com
·
21h
21 hours ago
Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
⚡
LLM Quantization
aarushgupta.io
·
1d
1 day ago
·
Lobsters, Hacker News
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
🧠
Local llm
Content type:
News, Blog
kaitchup.substack.com
·
5d
5 days ago
·
r/LocalLLaMA