⚡ Quantization - Erdwig

🎲Probabilistic Inference Blog Discussion

tildalice.io·

TurboQuant in PostgreSQL

🗄Databases Blog

blog.mayflower.de·

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

🎲Probabilistic Inference Academic

arxiv.org·

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

🏗Datastructures Blog

elastic.co·

Apple WWDC On-Device AI Deep Dive - Google Docs

🎲Probabilistic Inference

gist.is··Hacker News

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

🔀CRDTs Code

github.com··Hacker News

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🏷️Named Entity Recognition News Blog

kaitchup.substack.com··r/LocalLLaMA

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

🎲Probabilistic Inference Academic

arxiv.org·

What's in the Box? A Field Guide to AI Models

🧮Hindley-Milner Blog

iankduncan.com·

AMD just reserved the right to disappoint handheld and Steam Machine gamers.

⚡Effect Systems News

theverge.com

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🎲Probabilistic Inference News Blog

developer.nvidia.com·

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

🗜️Compression Algorithms Code

github.com··r/LocalLLaMA

AMD: No Definitive Decision on FSR 4.1 Support for RDNA 3.5 APUs

🔀CRDTs

techpowerup.com··r/hardware

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

🎲Probabilistic Inference Academic

arxiv.org·

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

🔀CRDTs Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA

A system programmer’s guide to LLM inference

🧩Constraint Programming Blog

blog.xiangpeng.systems··Hacker News

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

🔀CRDTs Academic

arxiv.org·

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

Quality Is Not a Safety Proxy Under Quantization

Qwen 3.6 27B AutoRound GGUF, need your feedback

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

TurboQuant in PostgreSQL

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

Apple WWDC On-Device AI Deep Dive - Google Docs

RightNow-AI/AutoMegaKernel: An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

What's in the Box? A Field Guide to AI Models

AMD just reserved the right to disappoint handheld and Steam Machine gamers.

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

AMD: No Definitive Decision on FSR 4.1 Support for RDNA 3.5 APUs

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

A system programmer’s guide to LLM inference

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving