I trained a 67-million-parameter transformer end to end on an M4 Mac Mini using Apple Silicon MPS and achieved 93.94 percent exact-match accuracy on CLI command generation.
No discrete GPU. Twenty-four gigabytes of unified memory. A task where a single missing character counts as complete failure.
This project started as a constraint experiment. How far could a carefully built small model go if every part of the pipeline was designed around consumer hardware limits? That meant training from scratch, streaming data instead of downloading it, and being honest about what worked and what broke.
The answer surprised me. With modern architectural components like RoPE, RMSNorm, and SwiGLU, aggressive data efficiency, and roughly 13 hours of pretraining plus about four minutes of supervised fine-tuning, a model smaller than GPT-2 learned to generate syntactically correct shell commands nearly 94 percent of the time. The remaining 6 percent failed in ways that turned out to be more instructive than the successes.
This is not a benchmark paper, a claim about general intelligence, or a guide to replacing ChatGPT. It is a grounded look at what actually happens when you build a modern small language model from first principles, train it on real data, and ask it to do something unforgiving.
Here is what I learned.
1. The Constraint and the Result
The defining constraint of this project was hardware.
All training was done on an M4 Mac Mini with 24GB of unified memory using Apple Silicon’s Metal Performance Shaders backend. There was no discrete GPU, no CUDA, and no ability to hide inefficiencies behind massive batch sizes. Every design choice had to respect memory pressure and wall-clock time. If a training decision was inefficient, the M4 made that obvious within minutes, not hours.
The task choice amplified those constraints. CLI command generation is exacting by nature. Commands are short, compositional, and brittle. A missing flag, a truncated regex, or an incomplete pipe is not “mostly right.” It is wrong. That made exact-match accuracy the only metric that mattered and removed any ability to rely on subjective evaluation.
Within those limits, the final results were:
- Model size: 66.73 million parameters
- Training data: 204.8 million tokens
- Pretraining time: roughly 13 hours wall time
- Supervised fine-tuning time: approximately 4 minutes
- Electricity usage: roughly 1 kilowatt-hour, under $0.50 at typical US electricity rates
- Final accuracy: 93.94 percent exact match on a held-out CLI evaluation set
The most important point is not the accuracy number in isolation. It is that these results were achieved end to end on consumer hardware, using a model trained from scratch, with full visibility into the data pipeline, training dynamics, and failure modes.
That combination of consumer hardware, exact evaluation, and full transparency shaped every decision that follows.
2. Why Build a Tiny LLM
The decision to build a small language model from scratch was driven by the task, not by ideology.
CLI command generation is a correctness problem, not a creativity problem. Commands are short, structured, and compositional. They rely on precise syntax, ordering, and punctuation. A missing flag, a truncated regex, or an incomplete pipe does not degrade quality gracefully. It fails outright.
This creates a clear neuro-symbolic boundary. The problem is not about producing plausible language, but about generating exact symbolic structures that must execute correctly. That makes CLI commands an unusually strong stress test for small models and a poor fit for subjective evaluation.
Training from scratch also provided control. Owning the tokenizer, data pipeline, training loop, and evaluation logic made failures diagnosable. When the model broke, the cause could be traced to data coverage, architectural constraints, or training dynamics rather than opaque behavior inside a black box.
Just as important were the explicit non-goals:
- This was not an attempt to build a general-purpose assistant.
- It was not a benchmark against frontier models.
- It was not designed for multilingual generation.
- It was not meant to replace API-based systems for broad tasks.
The goal was narrower and more practical: build the smallest model that could reliably generate exact, structured commands under tight hardware constraints, and understand precisely why it succeeded or failed.
3. High-Level System
The system was designed end to end, with each stage shaped by the constraints of consumer hardware and the requirements of exact output.
At a high level, the pipeline looks like this:
Tokenizer → streaming data → pretraining → supervised fine-tuning → evaluation → continual learning
The tokenizer was trained first, with explicit support for instruction and command boundaries. This made it possible to separate natural language intent from structured output during both training and evaluation.
Wikipedia was streamed directly from Hugging Face rather than downloaded, avoiding tens of gigabytes of local storage. Text was tokenized incrementally, segmented into fixed-length sequences, and written into token shards sized to balance disk IO and memory usage. These shards were later consumed using memory-mapped loading, allowing the training loop to scale without exhausting RAM.
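The sharding and memory-mapped loading described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `write_shards` and `iter_sequences`, the `uint16` on-disk format, and the tiny shard size are all assumptions made for the example, and a plain iterator of integer ids stands in for the streamed, tokenized Wikipedia text.

```python
import mmap
import os
import tempfile
from array import array
from typing import Iterable, Iterator, List

SEQ_LEN = 512        # matches the model's maximum context length
SEQS_PER_SHARD = 4   # hypothetical and tiny, so the demo stays small


def _flush(buf: array, out_dir: str, index: int) -> str:
    """Write one shard of raw uint16 token ids and return its path."""
    path = os.path.join(out_dir, f"shard_{index:05d}.bin")
    with open(path, "wb") as f:
        buf.tofile(f)
    return path


def write_shards(token_stream: Iterable[int], out_dir: str) -> List[str]:
    """Consume a token stream incrementally and pack it into fixed-size shards."""
    paths: List[str] = []
    buf = array("H")  # unsigned 16-bit ids: enough for a 32,000-token vocabulary
    shard_tokens = SEQ_LEN * SEQS_PER_SHARD
    for tok in token_stream:
        buf.append(tok)
        if len(buf) == shard_tokens:
            paths.append(_flush(buf, out_dir, len(paths)))
            buf = array("H")
    # Keep only whole sequences from the tail; a trailing partial one is dropped.
    whole = len(buf) // SEQ_LEN * SEQ_LEN
    if whole:
        paths.append(_flush(buf[:whole], out_dir, len(paths)))
    return paths


def iter_sequences(path: str) -> Iterator[List[int]]:
    """Yield fixed-length sequences from a shard without reading it all into RAM."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = memoryview(mm)
        view = raw.cast("H")
        try:
            for start in range(0, len(view), SEQ_LEN):
                yield view[start:start + SEQ_LEN].tolist()
        finally:
            # Release both views so the mmap can close cleanly.
            view.release()
            raw.release()


with tempfile.TemporaryDirectory() as tmp:
    stream = (i % 32_000 for i in range(SEQ_LEN * 5 + 100))  # stand-in for tokenized text
    shard_paths = write_shards(stream, tmp)
    n_sequences = sum(1 for p in shard_paths for _ in iter_sequences(p))
    print(len(shard_paths), "shards,", n_sequences, "sequences")
```

Because the writer only ever holds one shard's worth of tokens and the reader only touches the pages it slices, peak memory stays roughly constant however large the corpus grows.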
Pretraining taught the model general language structure and syntax. Supervised fine-tuning then adapted the model to instruction-to-command mapping, with loss applied only to the command portion of each sequence.
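Applying loss only to the command portion usually comes down to masking the labels for the instruction tokens. The sketch below shows one common way to build such labels for next-token prediction. The function name `build_sft_example` and the toy token ids are hypothetical. The value `-100` is PyTorch's default `ignore_index` for cross-entropy, used here on the assumption that a similar convention applies.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def build_sft_example(
    prompt_ids: List[int], command_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only command tokens are supervised."""
    tokens = prompt_ids + command_ids
    input_ids = tokens[:-1]   # the model sees everything except the last token
    targets = tokens[1:]      # position i is trained to predict tokens[i + 1]
    # The first supervised position is the one whose target is the first command token.
    first_supervised = len(prompt_ids) - 1
    labels = [
        target if i >= first_supervised else IGNORE_INDEX
        for i, target in enumerate(targets)
    ]
    return input_ids, labels


# Toy ids: 1, 2, 3 stand for the instruction, 7, 8, 9 for the command.
inputs, labels = build_sft_example([1, 2, 3], [7, 8, 9])
print(inputs)
print(labels)
```

Masking this way keeps the gradient focused on producing the command, so the model is never rewarded for merely echoing the instruction.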
Evaluation was handled asynchronously and designed to be repeatable and strict. Exact-match accuracy was computed on a held-out set using a lightweight AsyncIO-based evaluation loop, making it easy to rerun tests and gate updates without manual intervention.
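A strict, rerunnable exact-match loop of this kind can be quite small. The following is a sketch under stated assumptions rather than the evaluation code used here: `exact_match_accuracy`, `fake_generate`, and the four example pairs are invented for illustration, and stripping surrounding whitespace before comparison is an assumption about what "exact" should tolerate.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

GenerateFn = Callable[[str], Awaitable[str]]


async def exact_match_accuracy(
    generate: GenerateFn,
    examples: List[Tuple[str, str]],
    max_concurrency: int = 4,
) -> float:
    """Score (instruction, expected_command) pairs; any character difference is a miss."""
    sem = asyncio.Semaphore(max_concurrency)  # cap in-flight generations

    async def score(instruction: str, expected: str) -> bool:
        async with sem:
            predicted = await generate(instruction)
        return predicted.strip() == expected.strip()

    results = await asyncio.gather(*(score(i, e) for i, e in examples))
    return sum(results) / len(results) if results else 0.0


async def fake_generate(instruction: str) -> str:
    """Hypothetical stand-in for the model: a lookup table with one truncated answer."""
    await asyncio.sleep(0)
    table = {
        "list files": "ls -la",
        "show disk usage": "df -h",
        "count lines in a.txt": "wc -l",  # missing argument, so it must not match
    }
    return table.get(instruction, "")


EXAMPLES = [
    ("list files", "ls -la"),
    ("show disk usage", "df -h"),
    ("count lines in a.txt", "wc -l a.txt"),
    ("find python files", "find . -name '*.py'"),
]

print(asyncio.run(exact_match_accuracy(fake_generate, EXAMPLES)))
```

The truncated `wc -l` case is the point of the metric: it is most of the right command and still scores zero.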
Finally, a small continual learning system wrapped the training process. New data could be introduced incrementally, but updates were only accepted if they improved performance without harming existing behavior.
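The acceptance rule for updates can be expressed as a simple gate over evaluation metrics. This is one possible reading of "improved performance without harming existing behavior", not the actual implementation: the function `accept_update`, the metric names, the tolerance parameter, and every score except the 0.9394 baseline are hypothetical.

```python
from typing import Dict


def accept_update(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    primary: str = "cli_exact_match",
    tolerance: float = 0.0,
) -> bool:
    """Accept a candidate only if the primary metric improves and nothing else regresses."""
    if candidate[primary] <= baseline[primary]:
        return False
    # Every metric tracked for the baseline must hold up, within the tolerance.
    return all(
        candidate[name] >= score - tolerance for name, score in baseline.items()
    )


baseline = {"cli_exact_match": 0.9394, "regression_suite": 0.90}   # second score is made up
improved = {"cli_exact_match": 0.9450, "regression_suite": 0.90}   # illustrative values
forgetful = {"cli_exact_match": 0.9500, "regression_suite": 0.80}  # illustrative values

print(accept_update(baseline, improved))
print(accept_update(baseline, forgetful))
```

The second candidate scores higher on the primary metric but is rejected because it forgets previously learned behavior, which is the failure mode the gate exists to catch.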
The complexity here comes from orchestrating simple pieces under tight constraints, not from any single exotic component.
3.5. The Training Run in Numbers
Before going deeper, it is worth pausing on what the full training run actually looked like on the M4.
All results in this post come from a single end-to-end run with the following characteristics:
- Model size: 66.73 million parameters
- Training tokens: 204.8 million
- Pretraining time: roughly 13 hours wall time
- Supervised fine-tuning time: approximately 4 minutes
- Pretraining loss: reduced from roughly 60 to 3.59
- Accelerator: Apple Silicon MPS (M4 Mac Mini, 24GB unified memory)
- Electricity usage: roughly 1 kilowatt-hour, costing under $0.50
#### The Throughput Reality Check
It is important to be realistic about how this compares to enterprise hardware.
A single NVIDIA A100 would likely complete this specific pretraining run in 20 to 30 minutes. The cloud cost for that window is relatively low, on the order of one to two dollars. This estimate assumes on-demand pricing and ignores setup, data transfer, and iteration overhead. From a pure throughput perspective, there is no competition here.
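As a rough sanity check on that estimate, the run's compute can be approximated with the common rule of thumb of about 6 FLOPs per parameter per training token. The parameter count, token count, and 13-hour wall time come from the run above. The A100 peak of 312 TFLOPS (dense BF16) is NVIDIA's published figure, and the utilization values are assumptions picked for illustration, since small models rarely come close to peak.

```python
N_PARAMS = 66.73e6       # from the run summary
N_TOKENS = 204.8e6       # from the run summary
M4_SECONDS = 13 * 3600   # "roughly 13 hours" of pretraining

# Rule-of-thumb training cost: ~6 FLOPs per parameter per token.
# It ignores attention-specific FLOPs, so treat the result as an order of magnitude.
train_flops = 6 * N_PARAMS * N_TOKENS

m4_tokens_per_s = N_TOKENS / M4_SECONDS
m4_flops_per_s = train_flops / M4_SECONDS

A100_PEAK_FLOPS = 312e12  # dense BF16 tensor-core peak


def a100_minutes(utilization: float) -> float:
    """Estimated A100 wall time in minutes at an assumed fraction of peak throughput."""
    return train_flops / (A100_PEAK_FLOPS * utilization) / 60


print(f"total training compute: {train_flops:.2e} FLOPs")
print(f"M4 throughput: {m4_tokens_per_s:,.0f} tokens/s, {m4_flops_per_s / 1e12:.2f} TFLOP/s effective")
for util in (0.15, 0.20):  # assumed utilization levels, not measurements
    print(f"A100 at {util:.0%} of peak: {a100_minutes(util):.0f} min")
```

Under these assumptions, utilization in the 15 to 20 percent range lands inside the 20 to 30 minute window. A better-tuned setup would finish sooner.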
But the advantage of training locally is not about beating an A100 in a sprint. It is about the cost of curiosity.
Training on local hardware fundamentally changes the developer’s relationship with iteration and failure:
- Near-zero marginal cost. In the cloud, every hyperparameter mistake, data-sharding bug, or aborted experiment has a price tag. On local hardware, the cost of a “failed” 13-hour run is roughly twenty-five cents of electricity.
- No cold-start overhead. There is no time spent provisioning instances, managing SSH keys, uploading data to remote volumes, or waiting for capacity. Training starts when you decide to start it.
- Persistence. You have a dedicated training appliance that is silent, draws less power than a lightbulb, and can iterate continuously without a ticking clock on your credit card.
This is the legitimacy checkpoint. These numbers reflect what actually happened on a single consumer machine. They show that for targeted, ~60M-parameter models, modern transformer architectures are no longer gated behind enterprise infrastructure. They are accessible at home, and that accessibility meaningfully changes how experimentation, debugging, and learning happen.
4. Architecture: Small but Modern
The model architecture was intentionally conservative.
TinyLLM uses a 12-layer transformer with a hidden dimension of 512, 8 attention heads, and a maximum context length of 512 tokens, for a total of 66.73 million parameters. There are no exotic blocks, no routing layers, and no architectural experiments designed to impress by novelty alone. Every component was chosen because it has demonstrated stable behavior at small scale under tight compute and memory constraints.
The core architectural choices were:
- Rotary positional embeddings (RoPE)
- RMSNorm instead of LayerNorm
- SwiGLU feed-forward layers
- Weight tying between input embeddings and the output projection
These choices reflect a well-understood modern recipe. They reduce parameter count, improve training stability, or both, without introducing additional complexity.
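For readers who have not met these components, here is a dependency-free sketch of what each one computes on a single vector. It follows the standard published formulations and is not taken from the TinyLLM code: the function names are illustrative, the RoPE base of 10000 and the interleaved pairing of dimensions are assumed defaults, and a real implementation would use batched tensor operations.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: rescale by the root mean square, with no mean subtraction and no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) ), three matrices, no biases."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def rope(x: Vector, position: int, base: float = 10000.0) -> Vector:
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    d = len(x)
    out: Vector = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


q, k = [0.3, -1.2, 0.8, 0.5], [1.0, 0.4, -0.7, 0.2]
# Same offset of 2 between query and key at different absolute positions.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The two printed attention scores agree up to floating-point error, which illustrates why RoPE encodes relative position: the score depends on the distance between query and key, not on where they sit in the sequence.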
The parameter breakdown makes the tradeoffs explicit:
- Token embeddings: 16.38M parameters (32,000 vocab × 512 dim)
- Attention blocks (12 layers): 12.58M parameters
- Feed-forward networks (12 layers): 37.75M parameters
- RMSNorm layers: ~0.01M parameters
- Total: 66.73M parameters
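The breakdown can be reproduced with a few lines of arithmetic. The vocabulary size, hidden dimension, and layer count are stated above. The feed-forward width of 2048, the absence of bias terms, and the count of two norms per block plus one final norm are assumptions, chosen because they reproduce the reported figures.

```python
VOCAB, D_MODEL, N_LAYERS = 32_000, 512, 12
D_FF = 2048  # assumption: not stated in the post, inferred from the 37.75M FFN figure

embeddings = VOCAB * D_MODEL                   # counted once: tied with the output projection
attention = N_LAYERS * 4 * D_MODEL * D_MODEL   # Q, K, V, and output projections, no biases
feed_forward = N_LAYERS * 3 * D_MODEL * D_FF   # SwiGLU gate, up, and down matrices
norms = (2 * N_LAYERS + 1) * D_MODEL           # assumed: two per block plus a final norm

total = embeddings + attention + feed_forward + norms

for name, count in [
    ("token embeddings", embeddings),
    ("attention", attention),
    ("feed-forward", feed_forward),
    ("RMSNorm", norms),
    ("total", total),
]:
    print(f"{name:>16}: {count / 1e6:6.2f}M")
```

RoPE adds no learned parameters, and weight tying avoids a second 16.38M-parameter matrix at the output, which would otherwise push the model past 83M.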