Scour
LLM Inference
Specific
LLM serving, inference optimization, token generation, vLLM
Scoured 142 posts in 17.7 ms
All sorts of famous Attention Layers
💬 LLMs · Content type: Blog
harsh-ps-2003.bearblog.dev · 5 days ago
Less-relevant results
Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…
⚡ KV Cache · Content type: Blog
ammettw.medium.com · 2 days ago
Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes
⚡ KV Cache
venturebeat.com · 6 days ago · Covers: DiffusionGemma: 4x Faster Text Generation
Is anyone else not finding the Web UI on latest (b9680) of llama.cpp?
💬 LLMs · Content type: Discussion, Code
github.com · 23 hours ago · r/LocalLLaMA
How Public AI delivers sovereign LLM inference on AWS and Intel
⚡ KV Cache · Content type: Blog
aws.amazon.com · 2 days ago · Covers: Hugging Face – Fun chat with your own Artificial Intelligence, vLLM +1 more
How to Setup a Local Coding Agent on macOS
🔧 MLOps · Content type: Blog
ikyle.me · 6 days ago · Hacker News · Cited by 3 articles · Covers 6 stories
DiffusionGemma: Discrete diffusion in a large language model
⚡ KV Cache
idlemachines.co.uk · 6 days ago · Hacker News
zai-org/GLM-5.2 is here!
⚡ KV Cache
huggingface.co · 1 day ago · Hacker News, Hacker News, r/LocalLLaMA · Cited by 9 articles · Covers 7 stories
Friday Five — June 12, 2026
⚡ KV Cache
redhat.com · 6 days ago
[AINews] Satya on Loopcraft: Building Frontier Ecosystems
💬 LLMs · Content type: News
latent.space · 2 days ago
New comment by Greenpants in "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?"
💬 LLMs · Content type: Discussion
news.ycombinator.com · 2 days ago · Hacker News · Cited by 1 article · Covers: I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed.
SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing
⚡ KV Cache · Content type: Academic
arxiv.org · 2 days ago
Speculative Decoding: How to Get Free Tokens
💬 LLMs · Content type: Blog
medium.com · 3 days ago
Rust port of transformers (1M lines of code)
💬 LLMs · Content type: Code
github.com · 13 hours ago · Hacker News
Built Uber aggregator that tracks top AI researchers and leaders
💬 LLMs
brightray.ai · 23 hours ago · Hacker News
12B Gemma 4 QAT Deployment with NVIDIA L4, Cloud Run, MCP, and Antigravity CLI
⚡ KV Cache · Content type: Blog
medium.com · 6 days ago
Running local models is good now
🤖 AI Agents
vickiboykis.com · 3 days ago · Lobsters, Hacker News, Hacker News · Cited by 8 articles · Covers 9 stories
How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080
🗄️ Storage Engines
autodidacts.io · 4 days ago · Covers: Can your machine run AI models?
Coordinated Scheduling for MoE LLM Serving
⚡ KV Cache · Content type: Academic
arxiv.org · 2 days ago
I restarted a 10 year old Xeon 174 times to delete twelve flags and gain four tokens a second
🗄️ Storage Engines · Content type: Blog
point.free · 3 days ago · Hacker News