🧠 LLM Inference - akapaka

🧠Local llm News Blog

kaitchup.substack.com · r/LocalLLaMA

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

⚡LLM Quantization Blog

medium.com

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

⚡LLM Quantization Blog

elastic.co

Unsloth Gemma 4 QAT

🧠Local llm

unsloth.ai

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

⚡LLM Quantization News Blog

developer.nvidia.com

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

⚡LLM Quantization Code

github.com · Hacker News

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

⚡LLM Quantization Blog

tilert.ai · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🧠Local llm Blog

dnhkng.github.io

Less-relevant results

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🧠Local llm

phoronix.com · r/artificial

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

⚡LLM Quantization Academic

arxiv.org

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

🧠Local llm Blog

dnhkng.github.io

WEKA software speeds long context AI inferencing on Oracle’s public cloud

🐢Turso News

blocksandfiles.com

Running LLM Inference on Kubernetes: What It Actually Takes

☸️Kubernetes Blog

fairwinds.com

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

📊Prometheus

gizchina.com

DiffusionGemma: 4x Faster Text Generation

⚡LLM Quantization News Blog

blog.google · Hacker News, r/LocalLLaMA, r/singularity

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

🤖Machine Learning Blog Discussion

tildalice.io

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

🦀Rust

huggingface.co · r/LocalLLaMA

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

What's in the Box? A Field Guide to AI Models

Machinic Psychopharmacology: Do LLMs Self-Medicate?

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
