✨ Model optimizations in LLMs - pleto · Scour

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

📊AI Performance Profiling Academic

Qwen 3.6 27B AutoRound GGUF, need your feedback

🔢Quantization of LLMs

huggingface.co··r/LocalLLaMA

How Does Attention Work in LLMs? 2026 Deep Dive

🧠Large Language Models (LLMs) Blog

Orchestrate your LLM pipeline. Locally

🧠Large Language Models (LLMs)

llmforge.app··Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🚀LLM serving frameworks News

newsletter.semianalysis.com··Hacker News

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🔧Systems-level optimizations for LLM serving Code

github.com··Hacker News, r/LLM

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

🚀LLM serving frameworks

venturebeat.com

Friday Five — June 12, 2026

🔧Systems-level optimizations for LLM serving

DiffusionGemma: Discrete diffusion in a large language model

🧠Large Language Models (LLMs)

idlemachines.co.uk··Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🚀LLM serving frameworks News Blog

blog.google··Hacker News

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

⚡Real-time AI Systems

aarushgupta.io··Lobsters, Hacker News

Model2vec-zig: static text embeddings in pure Zig, in a single binary

🔢Quantization of LLMs

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🔢Quantization of LLMs

vettedconsumer.com··Hacker News

Quantization Was Never About the Bits

🔢Quantization of LLMs Blog

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

🔢Quantization of LLMs Blog Discussion

Domain-Specific Small Language Models (Manning)

🧠Large Language Models (LLMs)

i-programmer.info

Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

🧠Large Language Models (LLMs) News Blog

andreaborio.substack.com··Substack

Unsloth Gemma 4 QAT

🚀LLM serving frameworks

The Quantization Error of the Soul: Why Silicon Valley is Inverting the Promethean Fire

🔢Quantization of LLMs Blog

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

🔢Quantization of LLMs

androidauthority.com
