🚀 ML Inference - rishabh · Scour

UniSVQ: 2-bit Unified Scalar-Vector Quantization

⚙️ML Systems Academic

arxiv.org··Cited by 1 article

Less-relevant results

DiffusionGemma: 4x Faster Text Generation

🖥️GPU Computing News Blog 22

blog.google··Hacker News, r/LocalLLaMA, r/singularity·Cited by 22 articles

12B Gemma 4 QAT Deployment with NVIDIA L4, Cloud Run, MCP, and Antigravity CLI

🧠Deep Learning Blog

European Sovereign AI. Breakthrough Performance

⚡Query Engines

infercom.ai··Hacker News

massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.

🖥️GPU Computing Code

github.com··Hacker News

Intelligent inference scheduling with llm-d on Red Hat AI

⚡Query Engines

developers.redhat.com·

All sorts of famous Attention Layers

🧠Deep Learning Blog

harsh-ps-2003.bearblog.dev·

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🖥️GPU Computing News

newsletter.semianalysis.com··Hacker News·Cited by 1 article

Unsloth Kimi-K2.7-Code-GGUF

🛠️Compilers

huggingface.co··r/LocalLLaMA

Making FlashAttention-4 faster for inference

🖥️GPU Computing Blog

modal.com··Hacker News

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

⚡Query Engines

zozo123.github.io··Hacker News

Friday Five — June 12, 2026

🧠Memory Management

redhat.com··Cited by 1 article

Unlocking AI flexibility in Europe: A guide to cross-region inference for EU data processing and model access

⚡Query Engines Blog

aws.amazon.com·

Kimi K2.7-Code: open-source coding model with better token efficiency

⚙️ML Systems 8

huggingface.co··Hacker News, r/LocalLLaMA·Cited by 8 articles

Show HN: Quant Picker – which GGUF file fits your model and machine

📄Systems Papers

vettedconsumer.com··Hacker News

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

⚙️ML Systems Blog 10

mimo.xiaomi.com··Hacker News, r/LocalLLaMA·Cited by 10 articles

Metrics that Matter with Serverless Inference

⚙️ML Systems

digitalocean.com·

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

🖥️GPU Computing Blog

blogs.nvidia.com·

A system programmer’s guide to LLM inference

🖥️GPU Computing Blog

blog.xiangpeng.systems··Hacker News

Model2vec-zig: static text embeddings in pure Zig, in a single binary

⚙️ML Systems
