Inference

Token4Token: pay-per-token inference on Gnosis + Swarm

 💎Token Economics

NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models
The model delivers 300 tokens per second on benchmar...

 💎Token Economics
digg.com·

MLPerf and the rise of latency-aware LLM benchmarking

 🧠LLMs
edn.com·

DiffusionGemma: 4x Faster Text Generation

 🔬AI Research  Content type: News  Content type: Blog

How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

 🔍RAG
buy.polar.sh··DEV

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

 💻AI Coding  Content type: Video
youtube.com·

Massive AI Storage Demand Creates a New Memory Wall

 🧠Reasoning Models  Content type: News
eetimes.com·

Breaking the Ice: Analyzing Cold Start Latency in vLLM

 🧠LLMs  Content type: Academic
arxiv.org··Hacker News

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 🧠LLMs
smolhub.com··r/LocalLLaMA

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

 🔌MCP
sleepingrobots.com·

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 💎Token Economics  Content type: Blog
jimmysong.io·

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

 💻AI Coding  Content type: Code
github.com··Hacker News

Nemotron 3 Ultra now available on AI Gateway

 💻AI Coding
vercel.com·

How to Measure Time To First Token (TTFT) in AI Systems

 🧠LLMs

"AI" Is Eating Platform Monopolist Free Cash Flow, Not the World: CHART OF THE DAY

 🔬AI Research  Content type: News  Content type: Blog

Making LLMs faster and more efficient across multiple languages

 🧠LLMs
techxplore.com·

Which is faster: Gemini 3.5 Flash or Kimi K2.6 on Cerebras

 🧠Reasoning Models  Content type: Blog
cerebras.ai·

WWDC 2026: Foundation Models (& Anarlog)

 🧠LLMs
skushagra.com·

Google open-sources speedy DiffusionGemma text diffusion model

 🔬AI Research
siliconangle.com·

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

 🔌MCP  Content type: Academic
arxiv.org·