Scour
⚡ Inference
LLM inference, vLLM, speculative decoding, latency, throughput
Scoured 339 posts in 6.9 ms
Token4Token — pay-per-token inference on Gnosis + Swarm
💎 Token Economics · t4t.eth.link · 1 day ago · Hacker News
NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models
The model delivers 300 tokens per second on benchmar...
💎 Token Economics · digg.com · 6 days ago
MLPerf and the rise of latency-aware LLM benchmarking
🧠 LLMs · edn.com · 5 days ago
DiffusionGemma: 4x Faster Text Generation
🔬 AI Research · News, Blog · blog.google · 10 hours ago · Hacker News, r/LocalLLaMA, r/singularity
How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)
🔍 RAG · buy.polar.sh · 2 days ago · DEV
How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops
💻 AI Coding · Video · youtube.com · 6 days ago
Massive AI Storage Demand Creates a New Memory Wall
🧠 Reasoning Models · News · eetimes.com · 12 hours ago
Breaking the Ice: Analyzing Cold Start Latency in vLLM
🧠 LLMs · Academic · arxiv.org · 2 days ago · Hacker News
1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM
🧠 LLMs · smolhub.com · 2 days ago · r/LocalLLaMA
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
🔌 MCP · sleepingrobots.com · 4 days ago
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
💎 Token Economics · Blog · jimmysong.io · 1 day ago
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
💻 AI Coding · Code · github.com · 10 hours ago · Hacker News
Nemotron 3 Ultra now available on AI Gateway
💻 AI Coding · vercel.com · 6 days ago
How to Measure Time To First Token (TTFT) in AI Systems
🧠 LLMs · qainsights.com · 4 days ago · Hacker News
"AI" Is Eating Platform Monopolist Free Cash Flow, Not the World: CHART OF THE DAY
🔬 AI Research · News, Blog · braddelong.substack.com · 2 days ago · Substack
Making LLMs faster and more efficient across multiple languages
🧠 LLMs · techxplore.com · 6 days ago
Which is faster: Gemini 3.5 Flash or Kimi K2.6 on Cerebras
🧠 Reasoning Models · Blog · cerebras.ai · 5 days ago
WWDC 2026: Foundation Models (& Anarlog)
🧠 LLMs · skushagra.com · 2 days ago
Google open-sources speedy DiffusionGemma text diffusion model
🔬 AI Research · siliconangle.com · 2 hours ago
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
🔌 MCP · Academic · arxiv.org · 2 days ago