TL;DR: Downloading TheBloke’s Q4_K_M and calling it a day is lazy, and you’re leaving massive performance on the table. I built LlamaPajamas (experimental / open-source), a pipeline that downloads full-precision models, converts them to the optimal format for your specific hardware (CoreML/TensorRT/ONNX for vision/STT, MLX/GGUF/TensorRT-LLM for LLMs), and then applies importance quantization with domain-specific calibration data. An 8B model quantized for YOUR use case beats a 70B general-purpose model for YOUR task. I also discovered that most quantization benchmarks are lying to you.
The problem with how everyone uses HuggingFace
Go to any r/LocalLLaMA thread. “What model should I download?” And everyone recommends some pre-quantized GGUF.
That’s fine for playing around. It’s completely wrong for production or for real workloads.
Here’s what you’re doing when you download a pre-quantized model:
Someone else decided which quantization format to use.
Someone else decided which calibration data to use (usually generic web text).
Someone else decided which weights to preserve and which to compress.
You have no idea if any of those decisions match your use case.
You’re running a model that was optimized for nobody in particular on hardware it wasn’t optimized for.
And then you wonder why your local setup feels worse than the APIs.
The approach that actually works
Download the full-precision model. Do your own conversion. Do your own quantization with your own calibration data.
Yes, it takes more time. Yes, it requires understanding what you’re doing. But you end up with a model that’s actually optimized for your hardware and your task instead of some generic middle ground.
That’s what LlamaPajamas does. It’s the pipeline for doing this properly.
Different model types need completely different backends
This is where most people screw up. They treat all AI models the same. “Just convert it to GGUF and run it.”
No. Different architectures run best on completely different backends.
Vision and Speech models (Whisper, YOLO, ViT, CLIP)
These are mostly matrix multiplications and convolutions. They’re well-suited for:
CoreML on Apple Silicon → Uses the Neural Engine and GPU properly. Whisper-tiny runs in 2 seconds for a 1-minute clip on M1 Max.
TensorRT on NVIDIA → Graph optimization and tensor cores. YOLO inference at 45ms per frame on an RTX 3090.
ONNX for CPU/AMD → Portable, runs everywhere, good enough performance.
You probably know this, but do NOT run vision models through GGUF or MLX. That’s not what those backends are for, and they really don’t support it (yet).
Large Language Models
LLMs have different compute patterns. Attention mechanisms, KV caches, sequential token generation. They need:
MLX on Apple Silicon → Apple’s ML framework built for LLMs on M-series chips. Way better than CoreML for text generation.
GGUF for CPU/universal → llama.cpp’s format. Works everywhere, highly optimized for CPU inference, and this is where you do importance quantization.
TensorRT-LLM on NVIDIA → Not regular TensorRT. TensorRT-LLM is specifically optimized for autoregressive generation, KV caching, and batched inference on NVIDIA GPUs.
Notice that CoreML isn’t in the LLM list. CoreML is great for vision but it’s not designed for the sequential generation pattern of LLMs. MLX is what you want on Apple Silicon for text.
Similarly, regular TensorRT is great for vision but you need TensorRT-LLM for language models. Different optimization strategies entirely.
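To make the routing concrete, here’s a minimal sketch of that decision logic. The function and names are illustrative only, not the LlamaPajamas API:

```python
# Illustrative sketch: encode the backend guidance above as a simple lookup.
# Hardware and model-type names are simplified for the example.

def pick_backend(model_type: str, hardware: str) -> str:
    """Pick a reasonable inference backend for a model type + hardware combo."""
    if model_type in {"vision", "speech"}:          # Whisper, YOLO, ViT, CLIP
        return {"apple": "CoreML", "nvidia": "TensorRT"}.get(hardware, "ONNX")
    if model_type == "llm":                         # autoregressive text generation
        return {"apple": "MLX", "nvidia": "TensorRT-LLM"}.get(hardware, "GGUF")
    raise ValueError(f"unknown model type: {model_type}")

print(pick_backend("speech", "apple"))   # CoreML
print(pick_backend("llm", "nvidia"))     # TensorRT-LLM
print(pick_backend("llm", "cpu"))        # GGUF
```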
The quantization stack: format first, then hyper-compress
Once you’ve got the right backend format, then you quantize. And for LLMs, you should be going way more aggressive than Q4_K_M.
The GGUF quantization ladder:
| Format | Compression | Use Case |
|---|---|---|
| F16 | 1x | Baseline, too big for most uses |
| Q8_0 | 2x | Overkill for most tasks |
| Q4_K_M | 4x | Where most people stop |
| IQ4_XS | 5x | Where you should start looking |
| IQ3_XS | 6x | Sweet spot for most use cases |
| IQ2_XS | 8x | Aggressive but works with good calibration |
Most people stop at Q4_K_M because that’s what the pre-quantized downloads offer. You’re missing the whole point.
IQ (importance quantization) uses calibration data to figure out which weights matter. Generic calibration preserves weights that matter for generic tasks. Domain-specific calibration preserves weights that matter for YOUR task.
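If you want to see roughly what happens under the hood, this is llama.cpp’s imatrix flow. A minimal sketch, assuming you’ve built llama.cpp so llama-imatrix and llama-quantize are on your PATH; paths and filenames here are illustrative:

```python
# Sketch: domain-calibrated importance quantization using llama.cpp's tools.
# Paths are illustrative; check your llama.cpp build for exact flag names.
import subprocess

f16_model = "models/qwen3-1.7b/gguf/F16/model.gguf"   # full-precision GGUF
calib_txt = "calibration/medical.txt"                  # YOUR domain calibration text
imatrix   = "models/qwen3-1.7b-medical.imatrix"
out_gguf  = "models/qwen3-1.7b-medical-IQ3_XS.gguf"

# 1) Run the calibration text through the model to measure which weights matter.
subprocess.run(["llama-imatrix", "-m", f16_model, "-f", calib_txt, "-o", imatrix], check=True)

# 2) Quantize to IQ3_XS, letting the importance matrix decide what to preserve.
subprocess.run(["llama-quantize", "--imatrix", imatrix, f16_model, out_gguf, "IQ3_XS"], check=True)
```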
Domain-specific calibration changes everything
This is the core insight that most people miss.
We created 7 calibration datasets:
| Domain | Use Case |
|---|---|
| General | Multi-purpose balanced |
| Tool Calling | Function/API calling |
| Summarization | Text compression |
| RAG | Document Q&A |
| Medical | Healthcare/diagnosis |
| Military | Defense/tactical |
| Tone Analysis | Sentiment/emotion |
Real results: A medical model quantized with medical calibration data maintains 95%+ task accuracy at IQ3_XS (900MB). The same model with general calibration drops to 85%.
That’s a 10-point accuracy difference from calibration data alone, at the same file size.
A well-calibrated IQ3_XS model for your specific domain will outperform a generic Q4_K_M for your task. Smaller file, better performance. That’s not magic, that’s just optimizing for what you actually care about instead of what some random person on the internet cared about.
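Calibration data is just plain text that looks like your workload. A minimal sketch of assembling a domain file from your own prompts; the prompts and filenames here are hypothetical:

```python
# Sketch: build a domain calibration file from prompts you actually run in production.
# Everything below is hypothetical; the point is that the text should mirror YOUR task.

medical_prompts = [
    "A 54-year-old presents with crushing chest pain radiating to the left arm. "
    "Give a differential diagnosis, the tests you would order, and an initial treatment plan.",
    "Summarize the contraindications for thrombolysis in acute ischemic stroke.",
    # ...many more full clinical scenarios, not one-liners
]

with open("calibration/medical.txt", "w") as f:
    f.write("\n\n".join(medical_prompts))
```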
The calibration lesson that cost us
We built all these calibration datasets and felt good about ourselves. Then tool_calling quantization completely failed.
Turns out llama-imatrix needs at least 4,096 tokens to generate a useful importance matrix. Our tool_calling dataset only had 1,650 tokens.
Had to rebuild everything. Medical prompts went from “diagnose chest pain” to full clinical scenarios with differential diagnosis, test ordering, and treatment plans. Each calibration file needs to hit that token threshold or your importance matrix is garbage.
Check your token counts before running quantization. Learned this the hard way.
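A cheap sanity check before you burn time on quantization: count the tokens. This sketch approximates the count with the model’s Hugging Face tokenizer (llama-imatrix tokenizes with the GGUF’s own tokenizer, but the numbers land close enough to catch a file that’s far too small):

```python
# Sketch: verify a calibration file clears the token threshold before quantizing.
from transformers import AutoTokenizer

MIN_TOKENS = 4096  # below this, the importance matrix wasn't useful for us

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
text = open("calibration/medical.txt").read()
n_tokens = len(tokenizer.encode(text))

print(f"{n_tokens} tokens in calibration file")
if n_tokens < MIN_TOKENS:
    raise SystemExit(f"Calibration file too small: {n_tokens} < {MIN_TOKENS} tokens")
```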
Your evaluation is lying to you
LlamaPajamas has a built-in evaluation tool, and the first time I built it I did it completely wrong (a lesson I am sure many have run into).
We were running evaluations and getting 90%+ accuracy on quantized models. Great! Ship it!
The evaluation was garbage.
Our “lenient mode” accepted any answer containing the right letter. Correct answer is “A”? We’d accept:
“A”
“A.”
“A) Because the mitochondria is the powerhouse of the cell”
“The answer is A”
In production, most of those are WRONG. If your system expects “A” and gets “A) Because...”, that’s a parsing failure.
We built strict mode. Exact matches only.
Accuracy dropped from 90% to ~50%.
That’s the truth. That’s what your model actually does. The 90% number was a lie that made us feel good.
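For reference, the gap between the two modes is roughly this (a simplified sketch, not the actual evaluator):

```python
# Simplified sketch of lenient vs strict answer checking for multiple choice.

def lenient_match(output: str, answer: str) -> bool:
    # What we did at first: pass if the right letter shows up anywhere.
    return answer.lower() in output.lower()

def strict_match(output: str, answer: str) -> bool:
    # What production actually needs: the exact expected string, nothing else.
    return output.strip() == answer

sample = "A) Because the mitochondria is the powerhouse of the cell"
print(lenient_match(sample, "A"))  # True  -- looks like a win
print(strict_match(sample, "A"))   # False -- a downstream parser would choke on this
print(strict_match("A", "A"))      # True
```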
We also built category-specific prompts:
Math: “Answer with ONLY the number. No units. No explanations.”
Multiple choice: “Answer with ONLY the letter. No punctuation.”
Tool calling: “Output ONLY the function name.”
If you’re not evaluating with strict exact-match, you don’t know what your model can actually do, especially in an agentic / tool-calling world.
Handling thinking models
Some models output reasoning in <think> tags:
<think>
The question asks about cellular respiration which is option B
</think>
B
Our regex broke when outputs got truncated mid-tag. Fixed it with two-pass extraction: remove complete tags first, then clean up unclosed tags.
Thinking models can reason all they want internally but still need exact final answers.
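The two-pass extraction is roughly this (a sketch of the idea, not the exact code in the repo):

```python
# Sketch: strip <think> blocks, surviving outputs that were truncated mid-tag.
import re

def extract_final_answer(output: str) -> str:
    # Pass 1: remove complete <think>...</think> blocks.
    cleaned = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    # Pass 2: if a <think> was never closed (truncated output), drop everything after it.
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()

print(extract_final_answer("<think>\nThe question asks about option B\n</think>\nB"))  # "B"
print(extract_final_answer("<think>reasoning that got cut off mid-"))  # "" (truncated, nothing usable, but no crash)
```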
Actual benchmark results
Vision (YOLO-v8n)
CoreML FP16: 6.2MB, 87ms per frame on M1 (my laptop)
TensorRT FP16: 6MB, 45ms per frame on RTX 3090
Speech (Whisper-Tiny)
CoreML INT8: 39MB, 2.1s for 1-minute audio
ONNX: 39MB, 3.8s same audio on CPU
LLM (Qwen3 1.7B)
| Format | Size | Strict Accuracy |
|---|---|---|
| F16 baseline | 3.8 GB | 78% |
| Q4_K_M | 1.2 GB | 75% |
| IQ3_XS (general) | 900 MB | 73% |
| IQ3_XS (domain) | 900 MB | 76% on domain tasks |
| IQ2_XS | 700 MB | 68% |
The sweet spot is IQ3_XS with domain calibration. You get 6x compression with minimal accuracy loss on your target task. For 8B models that’s 15GB down to 2.5GB.
How to use the pipeline
Install:
git clone https://github.com/llama-farm/llama-pajamas
cd llama-pajamas
curl -LsSf https://astral.sh/uv/install.sh | sh
./setup.sh
Download full model and convert to GGUF F16:
cd quant
uv run llama-pajamas-quant quantize \
--model Qwen/Qwen3-1.7B \
--format gguf \
--precision F16 \
--output ./models/qwen3-1.7b
IQ quantize with your domain calibration:
uv run llama-pajamas-quant iq quantize \
--model ./models/qwen3-1.7b/gguf/F16/model.gguf \
--domain medical \
--precision IQ3_XS \
--output ./models/qwen3-1.7b-medical-iq3
Evaluate with strict mode (no lying to yourself):
uv run llama-pajamas-quant evaluate llm \
--model-dir ./models/qwen3-1.7b-medical-iq3/*.gguf \
--num-questions 140
Convert vision model to CoreML:
uv run llama-pajamas-quant quantize \
--model yolov8n \
--format coreml \
--precision fp16 \
--output ./models/yolo-coreml
What we’re building next
Automatic calibration generation: Describe your use case, get calibration data generated automatically.
Quality prediction: Estimate accuracy at different quantization levels before running the full process.
Mobile export: Direct to CoreML for iOS, TFLite for Android.
The caveat: general-use GGUFs have their place
Look, there are a lot of great pre-quantized GGUFs out there. TheBloke did great work. Bartowski’s quants are solid. For playing around with different models and getting a feel for what’s out there, they’re fine.
But here’s my question: why are you running models locally for “general use”?
If you just want a general-purpose assistant, use Claude or ChatGPT. They’re better at it than any local model and you don’t have to manage infrastructure.
The reason to run locally is privacy, offline access, or specialization. And if you need privacy or offline access, you probably have a specific use case. And if you have a specific use case, you should be fine-tuning and using domain-specific iMatrix quantization to turn your model into a specialist.
A 3B model fine-tuned on your data and quantized with your calibration will destroy a generic 8B model for your task. Smaller, faster, better. That’s the whole point.
Stop downloading generic quants and hoping they work for your use case. Download full models, fine-tune if you can, and quantize with calibration data that matches what you’re actually trying to do.
That’s how you get local AI that actually competes with the APIs.
Links
GitHub: https://github.com/llama-farm/LlamaPajamas
Happy to answer questions about hardware-specific optimization, calibration data design, or why your current evaluation is probably lying to you.
P.S. Why “LlamaPajamas”? You shouldn’t make pajamas one size fits all; they need to be tailored to the hardware (the animal). Plus my daughter and son love the book :)