Introducing Bolmo: Byteifying the next generation of language models

December 15, 2025

Ai2

From chatbots to scientific assistants, nearly every modern language model today still speaks in subword tokens—those opaque chunks like ▁inter, national, or ization that sit between characters and words. Subword tokenization has been remarkably successful, but it comes with real costs: poor character-level understanding, awkward behavior around whitespace and rare words, a rigid vocabulary that struggles to serve all languages equally, and inflexible compute allocation that treats every token the same regardless of how much information it carries.

December 15, 2025

Ai2

Byte-level models offer a compelling alternative. By operating directly over raw UTF-8 bytes, they sidestep the need for a hand-engineered vocabulary entirely—unlocking better handling of spelling, edge cases, and multilingual text. But there’s a catch: building competitive byte-level models has traditionally meant training from scratch, a costly endeavor that struggles to keep pace with the rapid improvements in data curation, architecture, and post-training that have propelled subword models forward. As a result, byte-level approaches have remained a research curiosity rather than a viable choice.

Today, we’re introducing Bolmo, a new family of byte-level language models that takes a different path. Instead of starting from scratch, Bolmo byteifies our open Olmo 3 models—reusing the backbone and capabilities we’ve already invested in, and retrofitting them into a fast, flexible byte-level architecture with a relatively short additional training run. The result is, to our knowledge, the first fully open byte-level language models – Bolmo 7B and Bolmo 1B – that are on par with, and in some cases surpass, state-of-the-art subword models across a wide range of tasks, while remaining practical to train and deploy.

Bolmo is deeply integrated into Ai2’s broader open ecosystem: trained on our Dolma 3 data mix plus open code datasets and a small slice of targeted character-level data, built on our public Olmo 3 7B checkpoint, and evaluated on our new, comprehensive suite of purpose-built open benchmarks and evaluation pipelines. Our goal is to provide a reproducible, inspectable blueprint for byteifying strong subword language models in a way the community can adopt and extend.

How Bolmo works

Bolmo is a latent tokenizer language model that processes text as bytes while benefiting from the scale and sophistication of the Olmo 3 transformer backbone. At a high level, it works in three stages.

Each UTF-8 byte is first embedded and passed through a lightweight local encoder – an mLSTM stack – that builds contextual byte-level representations. A boundary predictor then decides where to place patch boundaries, pooling bytes into variable-length "patches" that feed into the global transformer. Unlike prior work, Bolmo’s boundary predictor is non-causal, using a small amount of future context to decide where to cut—mirroring how subword tokenizers actually behave in practice. Finally, the pooled patches are processed by the original Olmo 3 transformer, depooled back to bytes, and refined by a local decoder before predicting the next byte and the next boundary.

Architecturally, Bolmo follows the same broad family as recent latent tokenizer language models like DTP, BLT, and H-Net, but with key modifications designed specifically for reusing strong subword backbones.

Byteifying Olmo 3 instead of training from scratch

Training a competitive byte-level model entirely from scratch is expensive. Instead, we start from an existing Olmo 3 7B checkpoint and byteify it in two stages.

In the first stage, we freeze the Olmo 3 transformer and train only the local encoder, local decoder, boundary predictor, and language modeling head. This stage is designed to be cheap and fast, requiring only 9.8B tokens (approximately 43B bytes) to learn to mimic the subword model’s behavior. In the second stage, we unfreeze the entire model and train for 39.3B additional tokens (approximately 173B bytes), letting Bolmo fully exploit byte-level information while maintaining efficiency.

Crucially, the global model remains Olmo 3: we’re not discarding the investment in data curation, architecture design, or long-context training. We’re extending that work into the byte space.

Strong performance where bytes should win

We evaluated Bolmo on our eval suite – a broad collection of benchmarks spanning math, STEM reasoning, question answering, code, and general knowledge – augmented with character-focused benchmarks like CUTE and EXECUTE to directly test character-level understanding.

A few high-level findings emerge. With only a relatively short byteifying schedule, Bolmo 7B comes close to the original subword Olmo 3 7B across our broad evaluation suite, while substantially outperforming it on character-focused benchmarks like CUTE and EXECUTE. On our character aggregate, Bolmo 7B improves accuracy by nearly twenty points over Olmo 3.

Against other byte-level models of comparable size – including BLT 7B, TFree-Hat 7B, and EvaByte 6.5B – Bolmo 7B is the strongest overall across code, math, multiple-choice QA, and character-level understanding, with the only exception of a small gap on GenQA where it slightly trails TFree-Hat 7B.

We also byteify Olmo 2 1B into Bolmo 1B, which is competitive with prior byte-level models at this scale and significantly improves character understanding over the original Olmo 2 1B base model.

The result is a byte-level model that does what byte models are supposed to do – excel at fine-grained text structure – without paying a steep performance cost.

Fast, flexible inference

A common concern with byte-level models is speed: more tokens usually mean more work. Bolmo’s architecture and compression strategy are explicitly designed to address this.

Using mLSTM-based local models and dynamic pooling, Bolmo achieves competitive wall-clock decoding speeds—around 125 bytes per second compared to approximately 150 bytes per second for the corresponding subword model at the same compression. It can be sped up further by increasing the bytes-per-patch ratio.

A key advantage of the dynamic hierarchical setup is that compression becomes a toggleable knob. Subword models hit a limit where growing the vocabulary makes the softmax layer a bottleneck. Bolmo can continue to increase efficiency by raising the average bytes per patch without running into that ceiling, extending the Pareto frontier of performance versus compute. In practice, this means you can trade off speed and fidelity more smoothly, without redesigning the entire model.

Zero-cost upgrades for post-trained models

One of the most exciting aspects of byteifying is how naturally it plugs into an existing model ecosystem. After byteifying a base Olmo 3 model into Bolmo, we can import capabilities from post-trained checkpoints essentially for free.

We explored this using an Olmo 3 checkpoint post-trained with reinforcement learning for instruction following. By applying task arithmetic – adding the difference between the base and post-trained Olmo 3 transformer weights into Bolmo – we can transfer those instruction-following skills into the byte model without any additional training.

On the IFEval benchmark, a byteified Bolmo base starts below its subword counterpart (31.1% versus 35.4%), but after zero-cost post-training via weight merging, Bolmo jumps to 67.4%, essentially matching the original post-trained Olmo 3 checkpoint at 66.9%.

This suggests a powerful pattern: once you byteify a strong open model, you can reuse much of the surrounding ecosystem – RL runs, fine-tunes, domain adapters – via lightweight weight merging, instead of redoing all downstream work in the byte space. In our paper, we show this works because Olmo 3’s embeddings can be “reset” without losing performance; that compatibility is not guaranteed for every model family, making embedding-friendly post-training an important area for future work.

What’s next

Bolmo opens up a number of directions we’re excited about: exploring richer lookahead and alternative boundary predictors for even better tokenization behavior, scaling byteifying to larger models and more domains, using Bolmo as a foundation for multilingual and domain-specialized byte-level systems, and further letting byteified models inherit improvements from subword ecosystems.

We’re sharing checkpoints, code, and the full paper so you can try Bolmo, inspect how byteifying works under the hood, and build your own byte-level models on top of the Olmo ecosystem.

If you’re interested in tokenization or the future of open inspectable language models, we’d love for you to dig into Bolmo and help shape what byteifying looks like for the next generation of LLMs.

Download Bolmo 1B | 7B | Tech report | Data | Code