Inference Cost
GPU cost, inference pricing, cost per token, LLM economics
The Bill Arrives: How to Manage Agentic AI Costs at Scale
⚙️
MLOps
Content type:
Blog
cockroachlabs.com
·
18 hours ago
Google’s DiffusionGemma is 4x faster than its other Gemma models
🎮
GPU Computing
thenewstack.io
·
1 hour ago
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🧠
Inference Engineering
local-llm.utop.workers.dev
·
3 days ago
·
Hacker News
Why I care so much about energy per token
🚀
Speculative Decoding
Content type:
Blog
ziraph.com
·
2 days ago
·
Hacker News
146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb
🗜️
Quantization
Content type:
Blog
adambien.blog
·
15 hours ago
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🚀
Speculative Decoding
Content type:
Blog
mimo.xiaomi.com
·
2 days ago
·
Hacker News, r/LocalLLaMA
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
🗜️
Quantization
Content type:
News
Content type:
Blog
blog.google
·
5 days ago
·
Hacker News
Akash Systems brings diamond cooling to AI infrastructure
🎮
GPU Computing
siliconangle.com
·
4 hours ago
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
💾
KV Cache
Content type:
Code
github.com
·
6 days ago
·
Hacker News
What's in the Box? A Field Guide to AI Models
🧠
Inference Engineering
Content type:
Blog
iankduncan.com
·
1 day ago
CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster
🚀
Model Serving
Content type:
Blog
Content type:
Discussion
tildalice.io
·
4 days ago
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
🗜️
Quantization
Content type:
Academic
arxiv.org
·
14 hours ago
On-device AI is a margin decision
🧠
Inference Engineering
Content type:
Blog
ziraph.com
·
46 minutes ago
·
Hacker News
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
🚀
Speculative Decoding
Content type:
News
decrypt.co
·
1 day ago
LLM Inference Engineering Room — Part 3: The Orchestration Layer
🧠
Inference Engineering
Content type:
Blog
vimal-dwarampudi.medium.com
·
6 days ago
Autonomous AI worm uses local models to exploit networks and repair its own code
⚙️
ML Compilers
4sysops.com
·
1 day ago
Cohere open-sources a coding agent that runs on a single H100
🎮
GPU Computing
venturebeat.com
·
21 hours ago
Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!
🚀
Speculative Decoding
gizchina.com
·
1 day ago
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
🗜️
Quantization
androidauthority.com
·
4 days ago
Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good
🧠
Inference Engineering
Content type:
Blog
towardsai.net
·
2 days ago