🚀 LLM Serving - sravindra

Neo-X7/Neo-AI: A fully offline AI assistant powered by Ollama. Stores and retrieves conversations using SQLite + LanceDB vector search. No cloud. No API keys. Runs entirely on your machine.

⚙️ML Infrastructure Code

github.com··DEV

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

⚙️ML Infrastructure

vettedconsumer.com··Hacker News

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

💾AI Hardware News

decrypt.co·

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

⚙️ML Infrastructure Academic

arxiv.org·

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

☁️GCP

gizchina.com·

Running LLM Inference on Kubernetes: What It Actually Takes

☸️Kubernetes Blog

fairwinds.com·

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

💾AI Hardware Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🦀Systems Programming News Blog

kaitchup.substack.com··r/LocalLLaMA

Less-relevant results

On-device AI is a margin decision

💾AI Hardware Blog

ziraph.com··Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

⚙️ML Infrastructure

huggingface.co··r/LocalLLaMA

NVIDIA Nemotron 3 Ultra

💾AI Hardware Blog

ollama.com·

Token4Token — pay-per-token inference on Gnosis + Swarm

☁️GCP

t4t.eth.link··Hacker News

Self-hosted remote access for Ollama without complicated setup

🌐Cilium

oab.arc-i.co.uk··r/selfhosted

What's in the Box? A Field Guide to AI Models

⚙️ML Infrastructure Blog

iankduncan.com·

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

💾AI Hardware Video

youtube.com·

How I benchmarked a 100% local RAG pipeline to 9/9 (zero API keys)

⚙️ML Infrastructure

buy.polar.sh··DEV

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🔀Model Parallelism Academic

arxiv.org·

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

Making Local LLM Go Brrr
