Even though Large Language Models (LLMs) appear to expertly understand and generate language, the truth is, like any other computer technology, they only deal in zeros and ones. So, how can an LLM, which only knows 0s and 1s, engage in such deep conversations with us using language? The first piece of that puzzle lies in what we’re talking about today: the ‘Token’.
In this post, we’re going to dive into what tokens (the basic cells of an LLM, so to speak) actually are, why they are so crucial, and how these small units impact AI performance, cost, and even linguistic fairness.
As always, we’ll keep the math to a minimum.
1. Token: The LEGO Block of Language
Let’s start by uncovering the identity of the token.
Simply put, a token is the basic unit an LLM uses to process text. Still a bit vague?
Take the sentence, “I am hungry.” If you want to ‘feed’ this to a computer, what do you need to do? Since the full sentence is too large for the computer to digest in one bite, we need to chop it up into smaller, consumable pieces.
The simplest approach is to split it by ‘word’ unit:
["I", "am", "hungry"]
That looks plausible. In fact, early natural language processing (NLP) often used this method. But here’s the problem. The words “hungry,” “hungrier,” and “hungriest” are all closely related in meaning, but the computer recognizes them as entirely different words. Unless you register every single variation of “hungry” in a dictionary, the model will see a word like “hungriest” for the first time and be confused—this is known as the OOV (Out-Of-Vocabulary) problem. Solving this by including millions of words in the vocabulary dictionary is highly inefficient.
What if we go the opposite way and split by ‘character’?
["I", " ", "a", "m", " ", "h", "u", "n", "g", "r", "y"]
This solves the OOV problem. You only need to know a few dozen character combinations (A-Z, 0-9, etc.). But now we have another issue: we’ve chopped it too finely. While the model learns to combine “h,” “u,” “n,” and “g” to form “hung,” it loses too much information to easily grasp the meaning of “hungry” as a whole unit. The model has to go through an incredibly long and arduous learning process just to learn the meaning of the word “hungry.”
The solution that emerged at a brilliant middle ground is the ‘Subword’ method. And this subword is what most modern LLMs mean when they talk about a ‘token’.
The ‘Subword’ tokenizer (more on this later) divides the sentence not into words or characters, but into ‘meaningful chunks’ somewhere in between. For example, the word ‘unhappiness’ might be broken into three tokens: ["un", "happi", "ness"].
- un (negative prefix)
- happi (the root of “happy”)
- ness (suffix indicating a state)
See the advantage? By splitting it this way, when the model learns the token ‘un’, it can apply that meaning not just to ‘unhappiness’ but also to ‘unusual,’ ‘unbelievable,’ and other words. The same goes for ‘ness’. If it knows ‘happiness’, it can easily infer ‘sadness’.
This approach simultaneously tackles the OOV problem (encountering a new word) and the efficiency problem (chopping things too finely). Even if the never-before-seen word ‘unbelievableness’ shows up, it can be broken down into something like ["un", "believable", "ness"], allowing the model to (at least faintly) infer, “Ah, this is the ‘state’ of something ‘unbelievable’ in a ‘negative’ context.”
The tool that performs this magic is called a ‘Tokenizer’. The tokenizer statistically analyzes a massive amount of text data to create a list of frequently occurring and meaningful pieces (tokens), such as ["un", "happi", "ness"]—a list called the ‘Vocabulary’. Algorithms such as BPE and WordPiece, and tools like SentencePiece, are common ways to build these tokenizers; for now, it’s enough to know they exist.
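If you’d like to see this in action, here is a minimal sketch using OpenAI’s open-source tiktoken library (my choice purely for illustration; the post isn’t tied to any particular tool, and a Hugging Face tokenizer would show the same idea). The exact pieces you get depend entirely on the vocabulary the tokenizer was trained with.

```python
# A minimal tokenization sketch using the open-source "tiktoken" library
# (pip install tiktoken). "cl100k_base" is one publicly available BPE
# vocabulary; other tokenizers will split the same word differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "unhappiness"
token_ids = enc.encode(word)                        # text -> list of integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # decode each ID to see its subword piece

print(token_ids)   # a short list of integers (the IDs depend on the vocabulary)
print(pieces)      # the subword pieces, e.g. something like ['un', 'happi', 'ness']
```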
📌 Key Takeaway 1:
- A token is the fundamental ‘LEGO block’ that LLMs use to understand language.
- The primary form used is the ‘Subword,’ which is smaller than a ‘word’ but larger than a ‘character.’
- The ‘Tokenizer’ is the machine that builds these ‘LEGO blocks,’ and the ‘Vocabulary’ is the set of all possible ‘LEGO blocks’ that can be built.
2. Why Do We Need Tokens? The ‘Translator’ that Converts Language into Numbers
“Okay, I get that text is chopped into ‘LEGO blocks.’ But why is that necessary?”
As mentioned earlier, computers only understand numbers. Even after chopping, the computer doesn’t actually understand letters like ‘un’ or ‘happi’. To feed text into the massive neural network that is an LLM, these tokens must ultimately be converted into numbers that the computer understands.
Tokens are the bridge connecting the ‘world of language’ and the ‘world of numbers’.
This process is generally divided into two steps.
Step 1: Integer Encoding or Indexing
The ‘Vocabulary’ created by the tokenizer is essentially a giant dictionary (a list of tokens). In this dictionary, every token ‘LEGO block’ is assigned a unique number (ID).
"un"-> 24"happi"-> 2341"ness"-> 342- ...
"cat"-> 5555"dog"-> 9812
(The numbers are just examples.)
Now, let’s assume the sentence “I am hungry” goes through the tokenizer and is split into the tokens ["I", " am", " hungry"]. (It can be split more complexly in reality.)
These tokens are then converted into their unique ID (number) using the ‘Vocabulary’ dictionary.
["I", " am", " hungry"] -> [100, 200, 300]
Finally, we have a ‘sequence of numbers’ that the computer can understand!
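Here is that round trip in code, again using tiktoken purely as a stand-in tokenizer (the ID values shown in the post are made up; real ones depend on the vocabulary):

```python
# Step 1 in practice: text -> token IDs, and back again.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("I am hungry")   # the 'sequence of numbers' the model actually receives
print(ids)                        # a short list of integers; exact values depend on the vocabulary
print(enc.decode(ids))            # "I am hungry" -- the ID mapping is fully reversible
```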
Step 2: Embedding
However, just because 100 is smaller than 300 doesn’t mean “I” is less important than “hungry.” These numbers are merely ‘identification numbers’ and carry no inherent meaning; being bigger or smaller signifies nothing.
This is where the real magic of the LLM begins. The model converts this ‘identification number (ID)’ into a ‘list of numbers that carry meaning,’ or a ‘Vector’. This is called ‘Embedding’.
- 100 (“I”) -> [0.1, -0.5, 1.2, ..., 0.8] (e.g., a list of 768 numbers)
- 300 (“hungry”) -> [0.4, 0.9, -0.1, ..., -0.3] (e.g., a list of 768 numbers)
This ‘vector’ is how the LLM understands the ‘meaning of the token’. In this vector space, the embeddings for ‘hungry’ and ‘famished’ will be located close to each other, while ‘apple’ and ‘banana’ will also be close, but ‘hungry’ and ‘boat’ will be far apart.
Embedding is a large and complex topic that deserves its own article. For today, it’s enough to know that it’s the numerical (vector) representation of the token, where embeddings with similar meanings are close together, and those with different meanings are far apart.
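To make the idea concrete, here is a toy sketch of Step 2. The embedding table below is just random numbers, whereas in a real LLM it is a learned matrix with one row per token ID; the point is only that ‘ID -> vector’ is a simple row lookup, and that closeness between trained vectors is usually measured with cosine similarity.

```python
# Toy embedding lookup: random table, purely for illustration.
import numpy as np

vocab_size, dim = 1_000, 768                          # tiny toy sizes; real vocabularies are 50K-250K+
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, dim))  # in a real model, these values are learned

token_ids = [100, 200, 300]                           # the example IDs from the post
vectors = embedding_table[token_ids]                  # ID -> vector is just a row lookup
print(vectors.shape)                                  # (3, 768)

def cosine_similarity(a, b):
    """How 'close in meaning' two trained embeddings are (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))      # near 0 here, since these vectors are random
```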
The first step in this entire process is the token. If the text is incorrectly divided into tokens, the entire subsequent process of numerical conversion and semantic learning (embedding) will be messed up.
📌 Key Takeaway 2:
- Tokens are the essential first step to convert ‘language’ into ‘numbers.’
- The flow is:
  Text -> Tokens (LEGO blocks) -> ID (Number) -> Embedding Vector (Meaningful List of Numbers)
- This process allows the computer (LLM) to finally ‘calculate’ language.
3. How Tokens Affect Model Training
Now, let’s delve a little deeper. We’ll explore how the concept of ‘tokens’ goes beyond just ‘inputting’ text and profoundly impacts the process of ‘building (training)’ an LLM model.
For LLM engineers training a model, ‘tokens’ represent ‘cost’ and ‘resources.’
1. The Vocabulary Size Dilemma
When creating a tokenizer, engineers must decide, “How many types of ‘LEGO blocks’ (tokens) should we create in total?” This is called the ‘Vocabulary Size’. For example, they might decide, “Let’s represent the world with 30,000 tokens,” or “Let’s use 100,000 tokens!” You’ll often see this referred to simply as the model’s vocab size.
- Small Vocabulary (e.g., 30,000):
  - Pro: The model has fewer ‘block types’ to memorize, so the model’s ‘brain’ (parameters) can be smaller. (Specifically, the size of the input and output layers decreases.) Training can be faster and lighter.
  - Con: Fewer ‘LEGO block’ types mean most words have to be chopped into tiny pieces. For instance, the proper noun “ChatGPT” might not have a dedicated ‘block,’ forcing it to be split into 3 pieces, like ["Chat", "G", "PT"]. More tokens are required to represent a single sentence.
- Large Vocabulary (e.g., 100,000):
  - Pro: Words like “ChatGPT” or “Massachusetts” are more likely to become a single ‘block,’ like ["ChatGPT"] or ["Massachusetts"]. This reduces the number of tokens needed to represent a sentence.
  - Con: The model has 100,000 ‘block types’ to memorize. The model’s ‘brain’ must be larger, requiring more memory and computational power, and training costs rise accordingly.
This is a classic trade-off with no single correct answer. Model developers must determine the appropriate Vocabulary Size based on their available data and objectives.
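You can observe this trade-off directly by running the same text through a smaller, older vocabulary and a larger, newer one. The sketch below uses two encodings that ship with tiktoken ("gpt2", roughly 50K tokens, and "cl100k_base", roughly 100K); the exact splits depend on each vocabulary’s merge rules, so treat the output as illustrative.

```python
# Compare how a ~50K vocabulary and a ~100K vocabulary split the same sentence.
import tiktoken

text = "ChatGPT was trained in Massachusetts."

for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} tokens -> {pieces}")
```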
2. Training Cost and Time
Training an LLM costs tens or hundreds of millions of dollars. This cost is mostly proportional to “how many tokens were learned.”
For example, let’s say we train a model on the equivalent of 1 million books (data).
- Tokenizer A: Splits the 1 million books into a total of 100 billion tokens.
- Tokenizer B: Splits the 1 million books into a total of 50 billion tokens.
Put another way, the same data requires 100 billion tokens with Tokenizer A and 50 billion tokens with Tokenizer B. Since the LLM understands in units of tokens, using Tokenizer B could theoretically allow the model to learn the same amount of knowledge with half the computational cost and time compared to A. The quality of tokenizer design is literally a matter of saving hundreds of millions of dollars.
Of course, one must carefully consider the trade-offs associated with Vocabulary Size, as discussed earlier.
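For a rough sense of the money involved, a common back-of-the-envelope rule puts training compute at about 6 × (parameters) × (training tokens) floating-point operations. The numbers below are entirely hypothetical; the point is simply that halving the token count roughly halves the compute for the same corpus.

```python
# Back-of-the-envelope: training FLOPs ~ 6 * parameters * training tokens (rough rule of thumb).
params = 70e9          # hypothetical 70B-parameter model

tokens_a = 100e9       # Tokenizer A: 100 billion tokens for the corpus
tokens_b = 50e9        # Tokenizer B: 50 billion tokens for the same corpus

flops_a = 6 * params * tokens_a
flops_b = 6 * params * tokens_b
print(f"Tokenizer A: {flops_a:.2e} FLOPs")
print(f"Tokenizer B: {flops_b:.2e} FLOPs ({flops_b / flops_a:.0%} of A)")
```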
Furthermore, LLMs have a limit on the length of tokens they can process at once (often called the Context Window size). For example, “I can only process 4,096 tokens at a time.”
A tokenizer that splits sentences more finely might break “Hello” into ["H", "e", "l", "l", "o"] (5 tokens), which bloats the token count without adding much content. This shortens the length of the story/information (in terms of actual character count) the model can read and learn in one go. Conversely, a tokenizer that splits sentences into larger chunks, like ["Hello"] (1 token), allows the model to learn much longer and richer contextual narratives within the same 4,096-token limit.
📌 Key Takeaway 3:
- Tokenizer design (Vocabulary Size, efficiency) directly impacts the LLM’s training costs (money, time) and learning efficiency.
- An efficient tokenizer packs more ‘meaning’ into fewer ‘tokens’, making training cheaper and faster.
4. How Tokens Affect Inference: A Matter of Wallet and Patience
We’ve discussed tokens from the perspective of the model builder. Now, how do tokens impact the ‘model user’? This is a story about your wallet and your patience.
1. It Threatens Your Wallet (API Costs)
When you use an LLM like OpenAI’s GPT, Google’s Gemini, or Anthropic’s Claude via an API call, what is the basis for billing? It’s the token count.
The cost is typically something like “$0.01 per 1,000 tokens.” This applies both to the tokens used in the question we send to the LLM (the Prompt tokens) and the tokens generated by the LLM in its answer (the Completion tokens). Different LLM services may even charge different rates for prompt and completion tokens.
A critical issue arises here: what if the language you use is ‘inefficient’ for the tokenizer? By ‘inefficient’ we mean that the tokenizer splits the same sentence into more, smaller tokens. That doesn’t make the tokenizer worse in an absolute sense; each token simply carries less text.
For example, let’s assume a tokenizer splits English and Korean sentences as follows:
- English: “Hello, how are you today?” -> ["Hello", ",", " how", " are", " you", " today", "?"] (7 tokens)
- Korean: “안녕하세요, 잘 지내세요?” (meaning “Hello, how are you today?”) -> ["안", "녕", "하", "세", "요", ",", " 잘", " ", "지", "내", "세", "요", "?"] (13 tokens)
In this example, the user whose language is Korean has to pay almost double the cost of the user whose language is English, just for saying the same thing.
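The billing math itself is simple enough to sketch. The price below is hypothetical, and the token counts you get will depend on the tokenizer, but the pattern holds: more tokens for the same message means a bigger bill.

```python
# Hypothetical billing sketch: count tokens, multiply by a made-up price.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
price_per_1k = 0.01   # hypothetical: $0.01 per 1,000 tokens

for label, text in [("English", "Hello, how are you today?"),
                    ("Korean", "안녕하세요, 잘 지내세요?")]:
    n_tokens = len(enc.encode(text))
    print(f"{label}: {n_tokens} tokens -> ${n_tokens / 1000 * price_per_1k:.5f}")
```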
2. Memory Limits (Context Window)
We briefly mentioned the Context Window earlier. It’s the total number of tokens an LLM can remember at one time. For example, the statement “GPT-4 has an 8K (8,192) token Context Window” means the model can only process and remember up to 8,192 tokens.
Imagine you upload two long reports to summarize. One is in English, and the other is in Spanish. Both are exactly 5,000 words long.
- English Report: 5,000 words -> 8,000 tokens
- Spanish Report: 5,000 words -> 12,000 tokens (Assuming an inefficient tokenizer for Spanish)
If you give the 8K-token-limited model the English report, it reads the entire thing and summarizes it without a problem.
But if you give it the Spanish report? The model says, “Oops! That’s over 8K tokens. I’ll only read up to 8,192 tokens,” and produces a summary that doesn’t reflect the entire content.
Despite giving the model documents with the same word count, the tokenization method puts the Spanish document at a disadvantage. The LLM’s memory essentially runs out faster for that language. This is one reason enormous context windows of 128K or even 1M tokens are emerging: to loosen this memory limitation.
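In code, ‘running out of context’ is nothing mysterious: once a document exceeds the token budget, the tail simply never reaches the model. A minimal sketch, assuming a plain 8K-token limit and no chunking strategy:

```python
# Minimal context-window sketch: anything past the token budget is cut off.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
context_window = 8192                 # e.g. an 8K-token model

document = "..."                      # imagine a long report loaded from a file
ids = enc.encode(document)

if len(ids) > context_window:
    ids = ids[:context_window]        # the model never sees what was truncated
    document = enc.decode(ids)

print(f"Tokens actually given to the model: {len(ids)}")
```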
3. Response Speed (Latency)
You know how ChatGPT generates its answer one word at a time, almost like typing? While it might look like the AI is ‘thinking,’ the LLM is actually generating the answer one token at a time. This is called the ‘Autoregressive’ method.
Let’s say the LLM is generating the sentence, “My name is John Doe.” The LLM doesn’t generate the whole sentence at once; it generates one token at a time, like this:
[My] -> [name] -> [is] -> [John] -> [Doe] -> [.]
This process creates the illusion of typing.
If an inefficient tokenizer had broken this sentence into 10 tokens, the model would have to go through 10 ‘thoughts’ (token generations) to complete the sentence. What if an efficient tokenizer broke it into just 3 tokens? The sentence is complete with only 3 thoughts, meaning generation finishes roughly three times faster.
More steps to generate the same sentence means a longer wait for the final response. The more inefficient the tokenization, the longer we have to wait.
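Since the model pays one generation step per output token, you can estimate the wait with simple arithmetic. The per-token time below is a made-up figure; real latency varies by model and hardware.

```python
# Rough latency estimate: one forward pass per generated token.
time_per_token_s = 0.03   # hypothetical: 30 ms per token

for label, n_tokens in [("inefficient tokenizer (10 tokens)", 10),
                        ("efficient tokenizer (3 tokens)", 3)]:
    print(f"{label}: ~{n_tokens * time_per_token_s:.2f} s to finish the sentence")
```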
📌 Key Takeaway 4:
- Tokens determine the cost of using an LLM (API fees).
- Tokens define the LLM’s ‘memory limit’ (Context Window).
- Tokens affect the LLM’s response speed (Latency).
- Inefficient tokenization is a disadvantage for the user in all these aspects.
5. Token Importance from a Multilingual Perspective: The ‘Token Inequality’ Problem
If you’ve used an LLM, you know it can handle multiple languages proficiently. A single LLM model understands and generates English, Spanish, French, and so on, without issue. While there are many techniques behind this, one crucial technology is a tokenizer that can understand multiple languages.
In previous sections, we kept using the term ‘inefficient tokenizer’. So, why is one language ‘efficient’ and another ‘inefficient’?
The answer lies in the ‘Training Data’.
The tokenizers for early LLM models like GPT-2 and GPT-3 built their ‘Vocabulary’ overwhelmingly based on English data.
As a result, the vocabulary was packed with fantastic, optimized tokens for English, such as ["the", "is", "ing", "Chat", "GPT"].
But what happens when you try to split or generate a sentence in another language, say Korean, with this English-centric vocabulary set? The vocabulary simply doesn’t contain rich tokens specific to that language.
Consequently, the tokenizer is forced to build the sentence out of the most basic ‘1x1’ LEGO blocks: individual characters or even raw bytes, much like spelling out an English sentence one letter at a time as [a], [b], [c], and so on.
This is the reality of ‘Token Inequality’.
- English: “The quick brown fox jumps over the lazy dog.” (9 words)
  - Tokens: ["The", " quick", " brown", " fox", " jumps", " over", " the", " lazy", " dog", "."] (10 tokens)
  - 1 Word ≈ 1.1 Tokens (Highly efficient)
- Korean: “빠른 갈색 여우가 게으른 개를 뛰어넘습니다.” (6 words, or eojeol)
  - Tokens: ["빠", "른", " ", "갈", "색", " ", "여", "우", "가", " ", "게", "으", "른", " ", "개", "를", " ", "뛰", "어", "넘", "습", "니", "다", "."] (24 tokens)
  - 1 Word ≈ 4.0 Tokens (Inefficient)
To express the same meaning, a language like Korean had to use more tokens than English.
This inefficiency caused the three problems mentioned in Section 4—higher fees, shorter memory, and slower speed—for users of non-English languages.
In using AI technology, people had to accept significantly higher costs and inconveniences just because they didn’t ‘speak’ English. To address this, some LLMs optimized for specific languages have been developed. For example, an LLM specialized in Korean means the training process used more Korean data, leading to a better understanding of the language and a tokenizer with a richer set of Korean tokens.
The Solution
Fortunately, the AI community recognized this serious problem some time ago.
The latest LLMs have made massive efforts to resolve this ‘Token Inequality’:
- Training with Massive Multilingual Data: From the start, when creating the tokenizer, they included a balanced mix of data from various global languages—not just English, but Spanish, Korean, German, Hindi, and more.
- Larger Vocabulary Size: The vocabulary size, which used to be around 30,000 to 50,000, has been significantly increased to 100,000, 250,000, or more.
As a result, the vocabularies of the latest models now include dedicated tokens for many other languages.
Thanks to this, the multilingual token efficiency of the latest models has dramatically improved compared to the GPT-3 era. While it may still be slightly less efficient than English, the gap is constantly shrinking.
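You can measure this improvement yourself by comparing an English-era vocabulary with a newer, much larger one on the same sentences. The sketch below uses two encodings available in tiktoken ("gpt2" versus "o200k_base", the large vocabulary used by recent OpenAI models); counts from other vendors’ tokenizers will differ, but the trend is the same.

```python
# Old vs. new vocabulary on the same sentences: watch the Korean token count shrink.
import tiktoken

english = "The quick brown fox jumps over the lazy dog."
korean = "빠른 갈색 여우가 게으른 개를 뛰어넘습니다."

for name in ["gpt2", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: English={len(enc.encode(english))} tokens, "
          f"Korean={len(enc.encode(korean))} tokens")
```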
This is more than just cost reduction; it means that AI technology is becoming more equitably accessible to people worldwide, rather than being confined to a specific linguistic region.
📌 Key Takeaway 5:
- Past LLMs were primarily trained on English data, making tokenization for non-English languages highly inefficient.
- This resulted in non-English users having to bear higher costs, shorter memory, and slower speeds (Token Inequality).
- The latest LLMs are greatly improving this inequality problem by using massive multilingual data and larger vocabularies.
6. Conclusion
Today, we’ve discussed the token, the most fundamental concept underpinning the vast technology of the LLM.
How this ‘LEGO block’ is split and defined (tokenizer design) is a key strategy that determines the cost efficiency of model training, where hundreds of millions of dollars are on the line.
Furthermore, we’ve confirmed that this token dictates our entire user experience—from the API fees we have to pay when we use AI, to the length of the conversation the AI can remember (Context Window), and even the response speed.
Finally, we addressed how the differing efficiency of tokens across languages led to the ‘Token Inequality’ problem becoming a topic of technical fairness, and how it is now being actively improved.
Now, when you watch an LLM generating text, ‘tap... tap...’ on your screen, it will no longer just look like simple typing. You’ll see the model predicting the next token one by one after an intense calculation.
Thank you for reading this long post.