The Llama family of models comprises large language models released by Meta (formerly Facebook). These decoder-only transformer models are used for generation tasks. Almost all decoder-only models nowadays use the Byte-Pair Encoding (BPE) algorithm for tokenization. In this article, you will learn about BPE. In particular, you will learn:
- What BPE is compared to other tokenization algorithms
- How to prepare a dataset and train a BPE tokenizer
- How to use the tokenizer

Training a Tokenizer for Llama Model. Photo by Joss Woodhead. Some rights reserved.
Let's get started.
Overview
This article is divided into four parts; they are:
- Understanding BPE
- Training a BPE tokenizer with Hugging Face tokenizers library
- Training a BPE tokenizer with SentencePiece library
- Training a BPE tokenizer with tiktoken library
Understanding BPE
Byte-Pair Encoding (BPE) is a tokenization algorithm used to tokenize text into sub-word units. Instead of splitting text into only words and punctuation, BPE can further split the prefixes and suffixes of words so that prefixes, stems, and suffixes can each be associated with meaning in the language model. Without sub-word tokenization, a language model would find it difficult to learn that "happy" and "unhappy" are antonyms of each other.
BPE is not the only sub-word tokenization algorithm; WordPiece, the default for BERT, is another. A well-implemented BPE tokenizer does not need an "unknown" token in the vocabulary, and nothing is OOV (out of vocabulary). This is because BPE can start with the 256 possible byte values (hence the name byte-level BPE) and then repeatedly merge the most frequent pair of tokens into a new vocabulary entry until the desired vocabulary size is reached.
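To make the merge loop concrete, here is a minimal character-level sketch of BPE training on a toy corpus. It is plain Python with a made-up helper name, not the byte-level implementation any of the libraries below use:

```python
from collections import Counter

def train_toy_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges on a toy corpus: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a sequence of single-character symbols
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return merges

print(train_toy_bpe(["unhappy", "happy", "happily", "unhappily"], num_merges=5))
```

A real byte-level BPE works the same way but starts from the 256 byte values rather than characters, so any input can always be represented even if no merges apply.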
Nowadays, BPE is the tokenization algorithm of choice for most decoder-only models. However, you do not want to implement your own BPE tokenizer from scratch. Instead, you can use tokenizer libraries such as Hugging Face's tokenizers, OpenAI's tiktoken, or Google's sentencepiece.
Training a BPE tokenizer with Hugging Face tokenizers Library
To train a BPE tokenizer, you need to prepare a dataset so the tokenizer algorithm can determine the most frequent pair of tokens to merge. For decoder-only models, a subset of the modelβs training data is usually appropriate.
Training a tokenizer is time-consuming, especially for large datasets. However, unlike a language model, a tokenizer does not need to learn the language context of the text, only how often tokens appear in a typical text corpus. While you may need trillions of tokens to train a good language model, you only need a few million tokens to train a good tokenizer.
As mentioned in a previous article, there are several well-known text datasets for language model training. For a toy project, you may want a smaller dataset for faster experimentation. The HuggingFaceFW/fineweb dataset is a good choice for this purpose. In its full size, it is a 15 trillion token dataset, but it also has 10B, 100B, and 350B sizes for smaller projects. The dataset is derived from Common Crawl and filtered by Hugging Face to improve data quality.
Below is how you can print a few samples from the dataset:
```python
import datasets

dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
count = 0
for sample in dataset:
    print(sample)
    count += 1
    if count >= 5:
        break
```
Running this code will print the following:
```
{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||F...',
 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053',
 'date': '2013-05-18T05:48:59Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learne...',
 'id': '<urn:uuid:ba819eb7-e6e6-415a-87f4-0347b6a4f017>', 'dump': 'CC-MAIN-2013-20',
 'url': 'http://endogenousretrovirus.blogspot.com/2007/11/if-you-have-set-yourself-on...',
 'date': '2013-05-18T06:43:03Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war...',
 'language': 'en', 'language_score': 0.9737711548805237, 'token_count': 703}
...
```
For training a tokenizer (and even a language model), you only need the text field of each sample.
To train a BPE tokenizer using the tokenizers library, you simply feed the text samples to the trainer. Below is the complete code:
```python
from typing import Iterator

import datasets
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, normalizers

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Initialize a BPE model: either byte_fallback=True or set unk_token="[UNK]"
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False)
tokenizer.decoder = decoders.ByteLevel()

# Trainer
trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=True,
)

# Train and save the tokenizer to disk
texts = get_texts(dataset, limit=10_000)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Reload the tokenizer from disk
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
enc = tokenizer.encode(text)
print("Token IDs:", enc.ids)
print("Decoded:", tokenizer.decode(enc.ids))
```
When you run this code, you will see:
```
Resolving data files: 100%|███████████████████████| 27468/27468 [00:03<00:00, 7792.97it/s]
[00:00:01] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:02] Tokenize words           ████████████████████████████ 10000 / 10000
[00:00:00] Count pairs              ████████████████████████████ 10000 / 10000
[00:00:38] Compute merges           ████████████████████████████ 24799 / 24799
Token IDs: [3548, 277, 396, 1694, 14414, 227, 12060, 715, 9814, 180, 188]
Decoded: Let's have a pizza party! 🍕
```
To avoid loading the entire dataset at once, use the streaming=True argument in the load_dataset() function. The tokenizers library expects only text for training BPE, so the get_texts() function yields text samples one by one. The loop terminates when the limit is reached since the entire dataset is not needed to train a tokenizer.
To create byte-level BPE, set the byte_fallback=True argument in the BPE model and configure the ByteLevel pre-tokenizer and decoder. Adding a NFKC normalizer is also recommended to clean Unicode text for better tokenization.
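To see what the normalizer and the byte-level components do on their own, you can call them directly. The sketch below assumes the bpe_tokenizer.json file saved by the script above:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers

# NFKC folds compatibility characters, e.g. the "fi" ligature and full-width letters
print(normalizers.NFKC().normalize_str("ﬁve ｆｉｖｅ"))  # -> "five five"

# The ByteLevel pre-tokenizer maps raw UTF-8 bytes to printable characters,
# so even an emoji becomes a sequence of base-alphabet symbols before BPE merges
print(pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False).pre_tokenize_str("🍕"))

# Inspect the tokens the trained tokenizer produces for the test sentence
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
print(tokenizer.encode("Let's have a pizza party! 🍕").tokens)
```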
For a decoder-only model, you will also need special tokens such as <PAD>, <EOT>, and <MASK>. The <EOT> token signals the end of a text sequence, allowing the model to declare when sequence generation is complete.
Once the tokenizer is trained, save it to a file for later use. To use a tokenizer, call the encode() method to convert text into a sequence of token IDs, or the decode() method to convert token IDs back to text.
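If you need the special tokens at encoding time, you can look up the IDs they were assigned and, for example, pad a batch of texts to equal length. Below is a minimal sketch using the [PAD] token defined in the trainer above:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Look up the ID assigned to the [PAD] special token during training
pad_id = tokenizer.token_to_id("[PAD]")

# Pad a batch of encodings to equal length using that token
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")
batch = tokenizer.encode_batch(["a short sentence", "a somewhat longer sentence for padding"])
for enc in batch:
    print(enc.ids)
```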
Note that the code above sets a small vocabulary size of 25,000 and limits the training dataset to 10,000 samples for demonstration purposes, enabling training to complete in a reasonable time. In practice, use a larger vocabulary size and training dataset so the language model can capture the diversity of the language. As a reference, the vocabulary size of Llama 2 is 32,000 and that of Llama 3 is 128,256.
Training a BPE tokenizer with SentencePiece library
As an alternative to Hugging Face's tokenizers library, you can use Google's sentencepiece library. The library is written in C++ and is fast, though its API and documentation are less refined than those of the tokenizers library.
The previous code rewritten using the sentencepiece library is as follows:
```python
from typing import Iterator

import datasets
import sentencepiece as spm

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Train the BPE model; special tokens are assigned fixed IDs
spm.SentencePieceTrainer.Train(
    sentence_iterator=get_texts(dataset, limit=10_000),
    byte_fallback=True,
    model_prefix="sp_bpe",
    vocab_size=32_000,
    model_type="bpe",
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,  # set to -1 to disable
    character_coverage=1.0,
    input_sentence_size=10_000,
    shuffle_input_sentence=False,
)

# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
ids = sp.encode(text, out_type=int, enable_sampling=False)  # default: no special tokens
tokens = sp.encode(text, out_type=str, enable_sampling=False)
print("Tokens:", tokens)
print("Token IDs:", ids)
decoded = sp.decode(ids)
print("Decoded:", decoded)
```
When you run this code, you will see:
```
...
Tokens: ['▁Let', "'", 's', '▁have', '▁a', '▁pizza', '▁party', '!', '▁', '<0xF0>', '<0x9F>', '<0x8D>', '<0x95>']
Token IDs: [2703, 31093, 31053, 422, 261, 10404, 3064, 31115, 31046, 244, 163, 145, 153]
Decoded: Let's have a pizza party! 🍕
```
The trainer in SentencePiece is more verbose than the one in tokenizers, both in code and output. The key is to set byte_fallback=True in the SentencePieceTrainer; otherwise, the tokenizer may require an unknown token. The emoji in the test text serves as a corner case to verify that the tokenizer can handle unseen Unicode characters, which byte-level BPE should handle gracefully.
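You can also inspect the trained model to see the byte-fallback behavior explicitly; a short sketch using the sp_bpe.model file produced above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

print("Vocabulary size:", sp.get_piece_size())

# The emoji is not in the vocabulary, so it is represented with raw byte pieces
for piece in sp.encode("🍕", out_type=str):
    print(piece, "->", sp.piece_to_id(piece))

# The byte pieces still decode back to the original character
print(sp.decode(sp.encode("🍕", out_type=int)))
```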
Training a BPE tokenizer with tiktoken Library
The third library you can use for BPE tokenization is OpenAI's tiktoken library. While it is easy to load pre-trained tokenizers, training with this library is not recommended.
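For comparison, loading one of the pre-trained encodings takes only a few lines:

```python
import tiktoken

# Load the pre-trained encoding used by GPT-4 and GPT-3.5 models
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Let's have a pizza party! 🍕")
print(ids)
print(enc.decode(ids))
```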
The code in the previous sections can be rewritten using the tiktoken library as follows:
```python
import sys
from typing import Iterator

import datasets
import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if count >= limit:
            break

# Collect texts up to some manageable limit for tokenizer training
limit = 1_000
texts = "\n".join(get_texts(dataset, limit=limit))

# Train a simple BPE tokenizer
pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
enc_simple = SimpleBytePairEncoding.train(training_data=texts, vocab_size=300, pat_str=pat_str)

# Convert to a real tiktoken Encoding
enc = tiktoken.Encoding(
    name="my_bpe",
    pat_str=enc_simple.pat_str,  # same regex used during training
    mergeable_ranks=enc_simple.mergeable_ranks,
    special_tokens={},
)

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
tok_ids = enc.encode(text)
print("Token IDs:", tok_ids)
print("Decoded:", enc.decode(tok_ids))
```
When you run this code, you will see:
```
...
Token IDs: [76, 101, 116, 39, 115, 293, 97, 118, 101, 257, 278, 105, 122, 122, 97, 278, 286, 116, 121, 33, 32, 240, 159, 141, 149]
Decoded: Let's have a pizza party! 🍕
```
The tiktoken library does not have an optimized trainer. The only available module is a Python implementation of the BPE algorithm via the SimpleBytePairEncoding class. To train a tokenizer, you need to define how the input text should be split into words using the pat_str argument, which defines a "word" using a regular expression.
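To see how this regular expression chops up text before any merges are applied, you can run it on its own with the third-party regex package (the standard re module does not support \p{...} classes). The pattern below is the same one used in the training script:

```python
import regex

pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# BPE merges only ever happen inside these chunks, never across them
print(regex.findall(pat_str, "Let's have a pizza party! 🍕"))
# ['Let', "'s", ' have', ' a', ' pizza', ' party', '!', ' 🍕']
```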
The training output is a dictionary of mergeable ranks, which maps each learned token (as a byte sequence) to its rank, i.e., its merge priority. To create a tokenizer, simply pass the pat_str and mergeable_ranks arguments to the Encoding class.
Note that the tokenizer in tiktoken does not have a save function. Instead, save the pat_str and mergeable_ranks arguments if needed.
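One simple option is to write the mergeable_ranks in the same base64 rank-per-line layout that tiktoken's own .tiktoken files use, and rebuild the Encoding from it later. This is a hedged sketch: it assumes enc_simple and pat_str from the script above are still in scope, and the file name my_bpe.tiktoken is arbitrary:

```python
import base64

import tiktoken

# Continues from the training script above: enc_simple and pat_str are in scope.
# Save: one "<base64-encoded token bytes> <rank>" pair per line
with open("my_bpe.tiktoken", "w") as f:
    for token_bytes, rank in enc_simple.mergeable_ranks.items():
        f.write(f"{base64.b64encode(token_bytes).decode()} {rank}\n")

# Load: rebuild the rank dictionary and recreate the Encoding with the same regex
mergeable_ranks = {}
with open("my_bpe.tiktoken") as f:
    for line in f:
        b64_token, rank = line.split()
        mergeable_ranks[base64.b64decode(b64_token)] = int(rank)

enc = tiktoken.Encoding(name="my_bpe", pat_str=pat_str, mergeable_ranks=mergeable_ranks, special_tokens={})
```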
Since training is done in pure Python, it is very slow. Training your own tokenizer this way is not recommended.
Further Readings
Below are some resources that you may find useful:
- Andrej Karpathy, Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20
- Geiping & Goldstein (2022), Cramming: Training a language model on a single GPU in one day
- Firestone et al. (2025), UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
- tiktoken library on GitHub
- sentencepiece library on GitHub
- tokenizers library documentation
Summary
In this article, you learned about byte-level BPE and how to train a BPE tokenizer. Specifically, you learned how to train a BPE tokenizer with the tokenizers, sentencepiece, and tiktoken libraries. You also learned that a tokenizer can encode text into a list of integer token IDs and decode them back to text.