The Llama family of models comprises large language models released by Meta (formerly Facebook). These decoder-only transformer models are used for generation tasks. Almost all decoder-only models nowadays use the Byte-Pair Encoding (BPE) algorithm for tokenization. In this article, you will learn about BPE. In particular, you will learn:
- What BPE is compared to other tokenization algorithms
- How to prepare a dataset and train a BPE tokenizer
- How to use the tokenizer

Training a Tokenizer for Llama Model. Photo by Joss Woodhead. Some rights reserved.
Let's get started.
Overview
This article is divided into four parts; they are:
- Understanding BPE
- Training a BPE tokenizer with Hugging Face tokenizers library
- Training a BPE tokenizer with SentencePiece library
- Training a BPE tokenizer with tiktoken library
Understanding BPE
Byte-Pair Encoding (BPE) is a tokenization algorithm used to tokenize text into sub-word units. Instead of splitting text into only words and punctuation, BPE can further split the prefixes and suffixes of words so that prefixes, stems, and suffixes can each be associated with meaning in the language model. Without sub-word tokenization, a language model would find it difficult to learn that "happy" and "unhappy" are antonyms of each other.
BPE is not the only sub-word tokenization algorithm; WordPiece, the default for BERT, is another. A well-implemented BPE tokenizer does not need an "unknown" token in the vocabulary, and nothing is OOV (out of vocabulary). This is because BPE can start with the 256 possible byte values (hence the name byte-level BPE) and then repeatedly merge the most frequent pair of tokens into a new vocabulary entry until the desired vocabulary size is reached.
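To make the merge loop concrete, here is a minimal character-level sketch of BPE training on a toy corpus. It is plain Python with a made-up helper name, not the byte-level implementation any of the libraries below use:

```python
from collections import Counter

def train_toy_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges on a toy corpus: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a sequence of single-character symbols
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges

print(train_toy_bpe(["unhappy", "happy", "happily", "unhappily"], num_merges=5))
```

A real byte-level BPE works the same way but starts from the 256 byte values rather than characters, so any input can always be represented even if no merges apply.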
Nowadays, BPE is the tokenization algorithm of choice for most decoder-only models. However, you do not want to implement your own BPE tokenizer from scratch. Instead, you can use tokenizer libraries such as Hugging Face's tokenizers, OpenAI's tiktoken, or Google's sentencepiece.
Training a BPE tokenizer with Hugging Face tokenizers Library
To train a BPE tokenizer, you need to prepare a dataset so the tokenizer algorithm can determine the most frequent pair of tokens to merge. For decoder-only models, a subset of the modelβs training data is usually appropriate.
Training a tokenizer is time-consuming, especially for large datasets. However, unlike a language model, a tokenizer does not need to learn the language context of the text, only how often tokens appear in a typical text corpus. While you may need trillions of tokens to train a good language model, you only need a few million tokens to train a good tokenizer.
As mentioned in a previous article, there are several well-known text datasets for language model training. For a toy project, you may want a smaller dataset for faster experimentation. The HuggingFaceFW/fineweb dataset is a good choice for this purpose. In its full size, it is a 15 trillion token dataset, but it also has 10B, 100B, and 350B sizes for smaller projects. The dataset is derived from Common Crawl and filtered by Hugging Face to improve data quality.
Below is how you can print a few samples from the dataset:
```python
import datasets

dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
count = 0
for sample in dataset:
    print(sample)
    count += 1
    if count >= 5:
        break
```
Running this code will print the following:
```
{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||F...',
 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053',
 'date': '2013-05-18T05:48:59Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learne...',
 'id': '<urn:uuid:ba819eb7-e6e6-415a-87f4-0347b6a4f017>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://endogenousretrovirus.blogspot.com/2007/11/if-you-have-set-yourself-on...',
 'date': '2013-05-18T06:43:03Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.9737711548805237, 'token_count': 703}
...
```
For training a tokenizer (and even a language model), you only need the text field of each sample.
To train a BPE tokenizer using the tokenizers library, you simply feed the text samples to the trainer. Below is the complete code:
```python
from typing import Iterator

import datasets
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, normalizers

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Initialize a BPE model: either byte_fallback=True or set unk_token="[UNK]"
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False)
tokenizer.decoder = decoders.ByteLevel()

# Trainer
trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=True,
)

# Train and save the tokenizer to disk
texts = get_texts(dataset, limit=10_000)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Reload the tokenizer from disk
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
enc = tokenizer.encode(text)
print("Token IDs:", enc.ids)
print("Decoded:", tokenizer.decode(enc.ids))
```
When you run this code, you will see:
```
Resolving data files: 100%|███████████████████████| 27468/27468 [00:03<00:00, 7792.97it/s]
[00:00:01] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:02] Tokenize words           ████████████████████████████ 10000 / 10000
[00:00:00] Count pairs              ████████████████████████████ 10000 / 10000
[00:00:38] Compute merges           ████████████████████████████ 24799 / 24799
Token IDs: [3548, 277, 396, 1694, 14414, 227, 12060, 715, 9814, 180, 188]
Decoded: Let's have a pizza party! 🍕
```
To avoid loading the entire dataset at once, use the streaming=True argument in the load_dataset() function. The tokenizers library expects only text for training BPE, so the get_texts() function yields text samples one by one. The loop terminates when the limit is reached since the entire dataset is not needed to train a tokenizer.
To create byte-level BPE, set the byte_fallback=True argument in the BPE model and configure the ByteLevel pre-tokenizer and decoder. Adding a NFKC normalizer is also recommended to clean Unicode text for better tokenization.
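To see what the normalizer and the byte-level components do on their own, you can call them directly. The sketch below assumes the bpe_tokenizer.json file saved by the script above:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers

# NFKC folds compatibility characters, e.g. the "fi" ligature and full-width letters
print(normalizers.NFKC().normalize_str("ﬁve ｆｉｖｅ"))  # -> "five five"

# The ByteLevel pre-tokenizer maps raw UTF-8 bytes to printable characters,
# so even an emoji becomes a sequence of base-alphabet symbols before BPE merges
print(pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False).pre_tokenize_str("🍕"))

# Inspect the tokens the trained tokenizer produces for the test sentence
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
print(tokenizer.encode("Let's have a pizza party! 🍕").tokens)
```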
For a decoder-only model, you will also need special tokens such as <PAD>, <EOT>, and <MASK>. The <EOT> token signals the end of a text sequence, allowing the model to declare when sequence generation is complete.
Once the tokenizer is trained, save it to a file for later use. To use a tokenizer, call the encode() method to convert text into a sequence of token IDs, or the decode() method to convert token IDs back to text.
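If you need the special tokens at encoding time, you can look up the IDs they were assigned and, for example, pad a batch of texts to equal length. Below is a minimal sketch using the [PAD] token defined in the trainer above:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Look up the ID assigned to the [PAD] special token during training
pad_id = tokenizer.token_to_id("[PAD]")

# Pad a batch of encodings to equal length using that token
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")
batch = tokenizer.encode_batch(["a short sentence", "a somewhat longer sentence for padding"])
for enc in batch:
    print(enc.ids)
```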
Note that the code above sets a small vocabulary size of 25,000 and limits the training dataset to 10,000 samples for demonstration purposes, enabling training to complete in a reasonable time. In practice, use a larger vocabulary size and training dataset so the language model can capture the diversity of the language. As a reference, the vocabulary size of Llama 2 is 32,000 and that of Llama 3 is 128,256.
Training a BPE tokenizer with SentencePiece library
As an alternative to Hugging Face's tokenizers library, you can use Google's sentencepiece library. The library is written in C++ and is fast, though its API and documentation are less refined than those of the tokenizers library.
The previous code rewritten using the sentencepiece library is as follows:
```python
from typing import Iterator

import datasets
import sentencepiece as spm

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Train the BPE model; special tokens are assigned fixed IDs
spm.SentencePieceTrainer.Train(
    sentence_iterator=get_texts(dataset, limit=10_000),
    byte_fallback=True,
    model_prefix="sp_bpe",
    vocab_size=32_000,
    model_type="bpe",
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,  # set to -1 to disable
    character_coverage=1.0,
    input_sentence_size=10_000,
    shuffle_input_sentence=False,
)

# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
ids = sp.encode(text, out_type=int, enable_sampling=False)  # default: no special tokens
tokens = sp.encode(text, out_type=str, enable_sampling=False)
print("Tokens:", tokens)
print("Token IDs:", ids)
decoded = sp.decode(ids)
print("Decoded:", decoded)
```
When you run this code, you will see:
```
...
Tokens: ['▁Let', "'", 's', '▁have', '▁a', '▁pizza', '▁party', '!', '▁', '<0xF0>', '<0x9F>', '<0x8D>', '<0x95>']
Token IDs: [2703, 31093, 31053, 422, 261, 10404, 3064, 31115, 31046, 244, 163, 145, 153]
Decoded: Let's have a pizza party! 🍕
```
The trainer in SentencePiece is more verbose than the one in tokenizers, both in code and output. The key is to set byte_fallback=True in the SentencePieceTrainer; otherwise, the tokenizer may require an unknown token. The emoji in the test text serves as a corner case to verify that the tokenizer can handle unseen Unicode characters, which byte-level BPE should handle gracefully.
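You can also inspect the trained model to see the byte-fallback behavior explicitly; a short sketch using the sp_bpe.model file produced above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

print("Vocabulary size:", sp.get_piece_size())

# The emoji is not in the vocabulary, so it is represented with raw byte pieces
for piece in sp.encode("🍕", out_type=str):
    print(piece, "->", sp.piece_to_id(piece))

# The byte pieces still decode back to the original character
print(sp.decode(sp.encode("🍕", out_type=int)))
```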
Training a BPE tokenizer with tiktoken Library
The third library you can use for BPE tokenization is OpenAI's tiktoken library. While it is easy to load pre-trained tokenizers, training with this library is not recommended.
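For comparison, loading one of the pre-trained encodings takes only a few lines:

```python
import tiktoken

# Load the pre-trained encoding used by GPT-4 and GPT-3.5 models
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Let's have a pizza party! 🍕")
print(ids)
print(enc.decode(ids))
```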
The code in the previous sections can be rewritten using the tiktoken library as follows:
```python
import sys
from typing import Iterator

import datasets
import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if count >= limit:
            break

# Collect texts up to some manageable limit for tokenizer training
limit = 1_000
texts = "\n".join(get_texts(dataset, limit=limit))

# Train a simple BPE tokenizer
pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
enc_simple = SimpleBytePairEncoding.train(training_data=texts, vocab_size=300, pat_str=pat_str)

# Convert to a real tiktoken Encoding
enc = tiktoken.Encoding(
    name="my_bpe",
    pat_str=enc_simple.pat_str,  # same regex used during training
    mergeable_ranks=enc_simple.mergeable_ranks,
    special_tokens={},
)

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
tok_ids = enc.encode(text)
print("Token IDs:", tok_ids)
print("Decoded:", enc.decode(tok_ids))
```
When you run this code, you will see:
```
...
Token IDs: [76, 101, 116, 39, 115, 293, 97, 118, 101, 257, 278, 105, 122, 122, 97, 278, 286, 116, 121, 33, 32, 240, 159, 141, 149]
Decoded: Let's have a pizza party! 🍕
```
The tiktoken library does not have an optimized trainer. The only available module is a Python implementation of the BPE algorithm via the SimpleBytePairEncoding class. To train a tokenizer, you need to define how the input text should be split into words using the pat_str argument, which defines a "word" using a regular expression.
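To see how this regular expression chops up text before any merges are applied, you can run it on its own with the third-party regex package (the standard re module does not support \p{...} classes). The pattern below is the same one used in the training script:

```python
import regex

pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# BPE merges only ever happen inside these chunks, never across them
print(regex.findall(pat_str, "Let's have a pizza party! 🍕"))
# ['Let', "'s", ' have', ' a', ' pizza', ' party', '!', ' 🍕']
```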
The training output is a dictionary of mergeable ranks, which maps each learned token (as a byte sequence) to its rank, i.e., its merge priority. To create a tokenizer, simply pass the pat_str and mergeable_ranks arguments to the Encoding class.
Note that the tokenizer in tiktoken does not have a save function. Instead, save the pat_str and mergeable_ranks arguments if needed.
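One simple option is to write the mergeable_ranks in the same base64 rank-per-line layout that tiktoken's own .tiktoken files use, and rebuild the Encoding from it later. This is a hedged sketch: it assumes enc_simple and pat_str from the script above are still in scope, and the file name my_bpe.tiktoken is arbitrary:

```python
import base64

import tiktoken

# Continues from the training script above: enc_simple and pat_str are in scope.
# Save: one "<base64-encoded token bytes> <rank>" pair per line
with open("my_bpe.tiktoken", "w") as f:
    for token_bytes, rank in enc_simple.mergeable_ranks.items():
        f.write(f"{base64.b64encode(token_bytes).decode()} {rank}\n")

# Load: rebuild the rank dictionary and recreate the Encoding with the same regex
mergeable_ranks = {}
with open("my_bpe.tiktoken") as f:
    for line in f:
        b64_token, rank = line.split()
        mergeable_ranks[base64.b64decode(b64_token)] = int(rank)

enc = tiktoken.Encoding(name="my_bpe", pat_str=pat_str, mergeable_ranks=mergeable_ranks, special_tokens={})
```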
Since training is done in pure Python, it is very slow. Training your own tokenizer this way is not recommended.
Further Readings
Below are some resources that you may find useful:
- Andrej Karpathy, Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20
- Geiping & Goldstein (2022), Cramming: Training a language model on a single GPU in one day
- Firestone et al. (2025), UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
- tiktoken library on GitHub
- sentencepiece library on GitHub
- tokenizers library documentation
Summary
In this article, you learned about byte-level BPE and how to train a BPE tokenizer. Specifically, you learned how to train a BPE tokenizer with the tokenizers, sentencepiece, and tiktoken libraries. You also learned that a tokenizer can encode text into a list of integer token IDs and decode them back to text.