AI Engineering
AI infrastructure, model serving, inference, MLOps
Friday Five — June 12, 2026
🛡️ AI Safety · redhat.com · 1 day ago · Cited by 1 article
2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
🎮 GPU Programming · Content type: Blog · dnhkng.github.io · 5 days ago
[AINews] Fable and Mythos officially too dangerous to release
🧠 LLM Research · Content type: News · latent.space · 8 hours ago
Stop Treating Your Models Like Microservices
🔧 Backend Dev · cloudnativenow.com · 1 day ago
Your AI Factory Won't Scale to Inference: Here's Why | Ari Weil, Akamai
🧠 LLM Research · Content type: Video · youtube.com · 3 days ago
Making FlashAttention-4 faster for inference
🎮 GPU Programming · Content type: Blog · modal.com · 2 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🔧 Backend Dev · t4t.eth.link · 4 days ago · Hacker News
TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
🔩 ML Compilers · Content type: Academic · arxiv.org · 2 days ago
Unsloth Kimi-K2.7-Code-GGUF
🎯 Reinforcement Learning · huggingface.co · 3 hours ago · r/LocalLLaMA
AI Serving Platform That Adapts to Your Model
🔩 ML Compilers · Content type: Blog · databricks.com · 2 days ago
microsoft/LLMLingua: [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.
🧠 LLM Research · Content type: Code · github.com · 8 hours ago · DEV
Show HN: Ext-Infer
🦀 Rust · infer.displace.tech · 6 days ago · Hacker News · Cited by 2 articles
Kimi K2.7-Code: open-source coding model with better token efficiency
🎯 Reinforcement Learning · huggingface.co · 1 day ago · Hacker News, r/LocalLLaMA · Cited by 7 articles
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference
🗄️ Database Internals · Content type: Blog · medium.com · 4 days ago
Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out
🔩 ML Compilers · venturebeat.com · 14 hours ago
vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference
🔮 Multimodal AI · Content type: Blog · odsc.medium.com · 1 day ago
Anatomy of a high-performance EP kernel
⚙️ Hardware Architecture · Content type: Blog · fergusfinn.com · 3 days ago · Hacker News
I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.
🧠 LLM Research · saintlex.sbs · 2 days ago · DEV
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
🧠 LLM Research · vettedconsumer.com · 6 days ago · Hacker News
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
🎮 GPU Programming · Content type: Blog · jimmysong.io · 4 days ago