🚀 ML Inference - rishabh · Scour

[AINews] Fable and Mythos officially too dangerous to release

📄Systems Papers News

Token4Token — pay-per-token inference on Gnosis + Swarm

⚡Query Engines

t4t.eth.link · Hacker News

vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference

⚙️ML Systems Blog

odsc.medium.com

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🖥️GPU Computing Blog

dnhkng.github.io

DiffusionGemma: Discrete diffusion in a large language model

🧠Deep Learning

idlemachines.co.uk · Hacker News

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

⚙️ML Systems

OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine

🎥Video Analytics

Why are cached input tokens cheaper with AI services?

⚙️ML Systems

The economics of speculative decoding

⚙️ML Systems Blog

fergusfinn.com · Hacker News

vicharak-in/Gati: Gati Accelerates Your CNN Algorithms!

🧠Deep Learning Code

github.com · Hacker News

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

⚡Query Engines Blog

elastic.co · Cited by 1 article

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

🖥️GPU Computing Academic

OpenCV Introduces New DNN Inference Engine

🎥Video Analytics

i-programmer.info

How to Setup a Local Coding Agent on macOS

🦀Rust Blog

ikyle.me · Hacker News · Cited by 2 articles

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

🖥️GPU Computing

venturebeat.com

Quantization Was Never About the Bits

⚙️ML Systems Blog

The Inference Alpha: Maximizing Frontier Models on AMD

🖥️GPU Computing Blog

digitalocean.com

Lowest-Cost LLM Inference: The Complete OpenRouter Guide

⚡Query Engines Blog Discussion Tutorial

openrouter.ai

TFLite Edge Model Quantizer Snippet

🧠Deep Learning

itsevilduck.gumroad.com · DEV

Ollama's highest performance on Apple Silicon yet with MLX

⚡Query Engines Blog
