⚡ LLM Inference - cyberpsych12 · Scour

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

🤖LLMs Blog

bric.pe.kr··DEV

Less-relevant results

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

📈Performance Engineering

aarushgupta.io··Lobsters, Hacker News

Quantization Was Never About the Bits

🤖LLMs Blog

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🤖LLMs News Blog

kaitchup.substack.com··r/LocalLLaMA

DiffusionGemma: Discrete diffusion in a large language model

✍️Prompt Engineering

idlemachines.co.uk··Hacker News

Intelligent inference scheduling with llm-d on Red Hat AI

✍️Prompt Engineering

developers.redhat.com

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

vettedconsumer.com··Hacker News

How To Start Building Edge-Native AI

📈Performance Engineering

semiengineering.com

AI Serving Platform That Adapts to Your Model

📈Performance Engineering Blog

databricks.com

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

huggingface.co··r/LocalLLaMA

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

🧮Vector Databases Blog

Optimal Post-Training Quantization Scales and Where to Find Them

🤖LLMs Academic

Model2vec-zig: static text embeddings in pure Zig, in a single binary

The economics of speculative decoding

📈Performance Engineering Blog

fergusfinn.com··Hacker News

vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference

🤖LLMs Blog

odsc.medium.com

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🤖LLMs Blog

dnhkng.github.io

DiffusionGemma: The Developer Guide

🤖LLMs Blog

developers.googleblog.com··Hacker News

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

📈Performance Engineering Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

phoronix.com··r/artificial

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🤖LLMs News
