🤖 AI Engineering - daemsc

🔩ML Compilers Academic

arxiv.org

Unsloth Kimi-K2.7-Code-GGUF

🎯Reinforcement Learning

huggingface.co · r/LocalLLaMA

AI Serving Platform That Adapts to Your Model

🔩ML Compilers Blog

databricks.com

microsoft/LLMLingua: [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

🧠LLM Research Code

github.com · DEV

Show HN: Ext-Infer

🦀Rust

infer.displace.tech · Hacker News · Cited by 2 articles

Kimi K2.7-Code: open-source coding model with better token efficiency

🎯Reinforcement Learning

huggingface.co · Hacker News, r/LocalLLaMA · Cited by 7 articles

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

🗄️Database Internals Blog

medium.com

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

🔩ML Compilers

venturebeat.com

vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference

🔮Multimodal AI Blog

odsc.medium.com

Anatomy of a high-performance EP kernel

⚙️Hardware Architecture Blog

fergusfinn.com · Hacker News

I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.

🧠LLM Research

saintlex.sbs · DEV

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🧠LLM Research

vettedconsumer.com · Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

🎮GPU Programming Blog

jimmysong.io

Friday Five — June 12, 2026

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

[AINews] Fable and Mythos officially too dangerous to release

Stop Treating Your Models Like Microservices

Your AI Factory Won't Scale to Inference: Here's Why | Ari Weil, Akamai

Making FlashAttention-4 faster for inference

Token4Token — pay-per-token inference on Gnosis + Swarm

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
