AI Engineering


Friday Five — June 12, 2026

 🛡️AI Safety

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

 🎮GPU Programming  Content type: Blog
dnhkng.github.io·

[AINews] Fable and Mythos officially too dangerous to release

 🧠LLM Research  Content type: News
latent.space·

Stop Treating Your Models Like Microservices

 🔧Backend Dev
cloudnativenow.com·

Your AI Factory Won't Scale to Inference: Here's Why | Ari Weil, Akamai

 🧠LLM Research  Content type: Video
youtube.com·

Making FlashAttention-4 faster for inference

 🎮GPU Programming  Content type: Blog
modal.com · Hacker News

Token4Token — pay-per-token inference on Gnosis + Swarm

 🔧Backend Dev
t4t.eth.link · Hacker News

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

 🔩ML Compilers  Content type: Academic
arxiv.org·

Unsloth Kimi-K2.7-Code-GGUF

 🎯Reinforcement Learning

AI Serving Platform That Adapts to Your Model

 🔩ML Compilers  Content type: Blog
databricks.com·

microsoft/LLMLingua: [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

 🧠LLM Research  Content type: Code
github.com · DEV

Show HN: Ext-Infer

 🦀Rust

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

 🗄️Database Internals  Content type: Blog
medium.com·

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

 🔩ML Compilers
venturebeat.com·

vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference

 🔮Multimodal AI  Content type: Blog
odsc.medium.com·

Anatomy of a high-performance EP kernel

 ⚙️Hardware Architecture  Content type: Blog

I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.

 🧠LLM Research
saintlex.sbs · DEV

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🧠LLM Research

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 🎮GPU Programming  Content type: Blog
jimmysong.io·
