Local LLMs

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 📊Quantization

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

 🖥️Terminal Renaissance
alternativeto.net·

lightmetal: GPU LLM Inference From a Single Java 25 JAR

 📊Performance Profiling  Content type: Blog
adambien.blog·

Improved performance and model support with GGUF

 🔄Lens Laws  Content type: Blog
ollama.com·

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

 🌀Brotli Internals  Content type: Blog
towardsai.net·

Unsloth Gemma 4 QAT

 📊Quantization
unsloth.ai·

zhongkaifu/TensorSharp: A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access. It supports Windows/MacOS/Linux with full GPU capability

 Parallel Computing  Content type: Code
github.com··Hacker News

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 📊Quantization  Content type: News  Content type: Blog
blog.google··Hacker News

google/gemma-4-12B-it-qat-q4_0-gguf

 🎓Academic Torrents
huggingface.co·

Using Scikit-LLM with Open-Source LLMs

 🧪Data science

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🐚Shell Automation

local llm on laptop 780M GPU using llama + gemma 4 qat

 Homebrew CPUs  Content type: Blog
alper.bearblog.dev·

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🎯Emulator Accuracy  Content type: Blog
ziraph.com··Hacker News

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

 📊Quantization
androidauthority.com·

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

 Homebrew CPUs  Content type: Video
youtube.com·

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

 🎯Emulation Accuracy  Content type: News  Content type: Blog

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 Parallel Computing  Content type: Code
github.com··Hacker News

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

 🔄Sync Engine  Content type: Code
github.com·

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

 Parallel Computing  Content type: Code
github.com··Hacker News

Does anyone know what PCIe mode was used for these benchmarks?

 Parallel Computing  Content type: Code
github.com··r/LocalLLaMA
