💻 Local LLMs - matmat

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🌀Brotli Internals Blog

towardsai.net·

Unsloth Gemma 4 QAT

📊Quantization

unsloth.ai·

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

⚡Parallel Computing Code

github.com··Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

📊Quantization News Blog

blog.google··Hacker News

google/gemma-4-12B-it-qat-q4_0-gguf

🎓Academic Torrents

huggingface.co·

Using Scikit-LLM with Open-Source LLMs

🧪Data science

machinelearningmastery.com·

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

🐚Shell Automation

deemwar-products.github.io··Hacker News

local llm on laptop 780M GPU using llama + gemma 4 qat

⚡Homebrew CPUs Blog

alper.bearblog.dev·

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

🎯Emulator Accuracy Blog

ziraph.com··Hacker News

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

📊Quantization

androidauthority.com·

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

⚡Homebrew CPUs Video

youtube.com·

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🎯Emulation Accuracy News Blog

kaitchup.substack.com··r/LocalLLaMA

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

⚡Parallel Computing Code

github.com··Hacker News

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

🔄Sync Engine Code

github.com·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

⚡Parallel Computing Code

github.com··Hacker News

Does anyone know what PCIe mode was used for these benchmarks?

⚡Parallel Computing Code

github.com··r/LocalLLaMA

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

lightmetal: GPU LLM Inference From a Single Java 25 JAR

Improved performance and model support with GGUF
