⚡ LLM Inference - calleum · Scour

The Edge LLM Offload Story

🧠LLM Training

semiengineering.com

Less-relevant results

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

⚡Zig Blog

mimo.xiaomi.com · Hacker News, r/LocalLLaMA

A field journal on Ray Data and Daft for multimodal data lake (14 minute read)

🕸️axum Blog

mehulbatra.medium.com

Why I care so much about energy per token

⚡Zig Blog

ziraph.com · Hacker News

MLPerf and the rise of latency-aware LLM benchmarking

🧠LLM Training

Intro — Sehastrajit

🖥️Self-Hosting Blog

Show HN: Ext-Infer

infer.displace.tech · Hacker News

fix(gateway): fail closed for unknown model auth · openclaw/openclaw@85343ea

🦀Rust Code

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

⚙️Systems Programming Academic

Latest technical articles & videos.

⚙️Systems Programming

certdepot.net

TGI(SG)F.

🔀Session Types News

KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++

⚡Zig Code

github.com · Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🧠LLM Training

local-llm.utop.workers.dev · Hacker News

Nvidia Nemotron 3 Ultra

🧠LLM Training

research.nvidia.com · Hacker News

Where to Host Your Open-Source Model (Under 10B Parameters)

🖥️Self-Hosting

digitalocean.com

Breaking the Ice: Analyzing Cold Start Latency in vLLM

⚙️Systems Programming Academic

mirkolenz/llmhop: Tiny, stateless Go router that dispatches OpenAI-compatible requests to single-model vLLM and sglang backends with zero external dependencies

🖥️Self-Hosting Code

github.com · Hacker News

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

🖥️Self-Hosting Video

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🧠LLM Training News Blog

blog.google · Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

⚙️Systems Programming Academic
