# Introduction
Thanks to large language models (LLMs), we nowadays have impressive, incredibly useful applications like Gemini, ChatGPT, and Claude, to name a few. However, few people realize that the underlying architecture behind an LLM is called a transformer. This architecture is carefully designed to "think" — namely, to process data describing human language — in a very particular and somewhat special way. Are you interested in gaining a broad understanding of what happens inside these so-called transformers?
This article describes, using a gentle, understandable, and rather non-technical tone, how transformer models sitting behind LLMs analyze input information like user prompts and how they generate coherent, meaningful, and relevant output text word by word (or, slightly more technically, token by token).
# Initial Steps: Making Language Understandable by Machines
The first key concept to grasp is that AI models do not truly understand human language; they only operate on numbers, and the transformers behind LLMs are no exception. Therefore, human language (that is, text) must be converted into a form the transformer can fully understand before it can process it in any depth.
Put another way, the first few steps taking place before entering the core and innermost layers of the transformer primarily focus on turning this raw text into a numerical representation that preserves the key properties and characteristics of the original text under the hood. Let’s examine these three steps.
Making language understandable by machines
// Tokenization
The tokenizer is the first actor to come onto the scene. Working in tandem with the transformer model, it is responsible for chunking the raw text into small pieces called tokens. Depending on the tokenizer used, these tokens are often whole words, but they can also be parts of words or punctuation marks. Furthermore, each token in the vocabulary has a unique numerical identifier. This is the point at which text stops being text and becomes numbers, all at the token level, as shown in this example in which a simple tokenizer converts a text containing five words into five token identifiers, one per word:
Tokenization of text into token identifiers
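To make this concrete, here is a minimal, purely illustrative sketch of a word-level tokenizer in Python. The vocabulary and sentence are made up for the example; real tokenizers (such as the subword-based ones used by modern LLMs) split text into smaller pieces, but the core idea of mapping pieces of text to numerical IDs is the same.

```python
# Toy word-level tokenizer (illustrative only): map each known word to an ID.
vocab = {"the": 0, "cat": 1, "chased": 2, "mouse": 3}  # hypothetical vocabulary

def tokenize(text):
    # Lowercase the text, split on whitespace, and look up each word's ID.
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The cat chased the mouse"))  # -> [0, 1, 2, 0, 3]
```

Note how the repeated word "the" maps to the same identifier both times: the tokenizer only cares about which piece of text it sees, not where it appears.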
// Token Embeddings
Next, every token ID is transformed into a d-dimensional vector, which is a list of numbers of size d. This representation of a token as an embedding acts like a description of the token's overall meaning, be it a word, part of a word, or a punctuation mark. The magic lies in the fact that tokens associated with similar concepts or meanings, like queen and empress, will have embedding vectors that are similar to each other.
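The sketch below uses made-up 4-dimensional vectors (real models use hundreds or thousands of learned dimensions) to show how that similarity can be measured with cosine similarity: words with related meanings end up with vectors that point in similar directions.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; in a real model these are learned.
embeddings = {
    "queen":   np.array([0.8, 0.1, 0.7, 0.2]),
    "empress": np.array([0.7, 0.2, 0.6, 0.3]),
    "carrot":  np.array([0.1, 0.9, 0.0, 0.8]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in exactly the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["queen"], embeddings["empress"]))  # high
print(cosine_similarity(embeddings["queen"], embeddings["carrot"]))   # low
```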
// Positional Encoding
So far, each token embedding contains information in the form of a collection of numbers, yet that information still describes a single token in isolation. However, in a "piece of language" like a text sequence, it is important to know not only which words or tokens it contains, but also their position within the text. Positional encoding is a process that, by using mathematical functions, injects into each token embedding some extra information about its position in the original text sequence.
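One classic way to do this, introduced in the original transformer paper, is sinusoidal positional encoding, where sine and cosine waves of different frequencies encode each position. The snippet below is a small NumPy sketch of that scheme; many modern LLMs use other positional schemes (learned or rotary embeddings), so treat this as one representative example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # One row per position in the sequence, one column per embedding dimension.
    positions = np.arange(seq_len)[:, np.newaxis]      # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings, e.g.:
# embeddings_with_position = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(seq_len=5, d_model=8).shape)  # (5, 8)
```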
# The Transformation Through the Core of the Transformer Model
Now that each token’s numerical representation incorporates information about its position in the text sequence, it is time to enter the first layer of the main body of the transformer model. The transformer is a very deep architecture, with many stacked components replicated throughout the system. There are two types of transformer layers — the encoder layer and the decoder layer — but for the sake of simplicity, we will not make a nuanced distinction between them in this article. Just be aware for now that there are two types of layers in a transformer, even though they both have a lot in common.
The transformation through the core of the transformer model
// Multi-Headed Attention
This is the first major subprocess taking place inside a transformer layer, and perhaps the most impactful and distinctive feature of transformer models compared to other types of AI systems. Multi-headed attention is a mechanism that lets a token observe, or "pay attention to," the other tokens in the sequence, collecting useful contextual information and incorporating it into the token's own representation: linguistic aspects like grammatical relationships, long-range dependencies among words that are not necessarily next to each other in the text, or semantic similarities. In sum, thanks to this mechanism, diverse aspects of the relevance and relationships among parts of the original text are captured. After traveling through this component, each token ends up with a richer, more context-aware representation of itself and the text it belongs to.
Some transformer architectures built for specific tasks, like translating text from one language to another, also analyze via this mechanism possible dependencies among tokens, looking at both the input text and the output (translated) text generated thus far, as shown below:
Multi-headed attention in translation transformers
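Under the hood, each attention head boils down to a scaled dot-product computation: every token's query vector is compared against every other token's key vector, and the resulting weights decide how much of each token's value vector gets mixed into the new representation. The NumPy sketch below shows a single head with toy data and, for simplicity, skips the learned query/key/value projections and the multiple parallel heads a real model uses.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each row of Q attends over the rows of K/V (one row per token).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                # context-aware token representations

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
# In a real model, Q, K, and V come from separate learned projections of x;
# here we reuse x directly to keep the sketch minimal.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4): one enriched vector per token
```

A real multi-headed attention block runs several of these computations in parallel, each with its own learned projections, and then combines their outputs.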
// Feed-Forward Neural Network Sublayer
In simple terms, after passing through attention, the second common stage inside every replicated layer of the transformer is a set of chained neural network layers that further process the enriched token representations and help learn additional patterns in them. This process is akin to sharpening those representations further, identifying and reinforcing the features and patterns that are relevant. Ultimately, these layers are the mechanism used to gradually build a general, increasingly abstract understanding of the entire text being processed.
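In a transformer, this sublayer is typically a small two-layer network applied to each token's vector independently: expand to a larger hidden size, apply a non-linearity, and project back down. Here is a minimal NumPy sketch with made-up sizes; real models use far larger dimensions and weights learned during training.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Applied to each token's vector independently: expand, apply a
    # non-linearity (ReLU here), then project back to the model dimension.
    hidden = np.maximum(0, x @ W1 + b1)
    return hidden @ W2 + b2

d_model, d_hidden = 4, 16   # hypothetical sizes; real models are far larger
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))   # 3 tokens coming out of the attention sublayer
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 4)
```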
The process of going through multi-headed attention and feed-forward sublayers is repeated multiple times in that order: as many times as the number of replicated transformer layers we have.
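Conceptually, the stacking can be pictured as a simple loop like the one below, where `attention_sublayer` and `feed_forward_sublayer` stand in for the components sketched above. Real transformer layers also wrap each sublayer with residual (skip) connections and normalization, details omitted from the prose above, so this is only a rough schematic.

```python
def transformer_stack(x, num_layers, attention_sublayer, feed_forward_sublayer):
    # Repeat the attention + feed-forward pair once per layer in the stack.
    for _ in range(num_layers):
        x = x + attention_sublayer(x)     # residual connection around attention
        x = x + feed_forward_sublayer(x)  # residual connection around the FFN
    return x
```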
// Final Destination: Predicting the Next Word
After repeating the previous two steps in alternation multiple times, the token representations derived from the initial text should have allowed the model to acquire a very deep understanding, enabling it to recognize complex and subtle relationships. At this point, we reach the final component of the transformer stack: a special layer that converts the final representation into a probability for every possible token in the vocabulary. That is, based on all the information learned along the way, we calculate a probability for each token being the next one the transformer model (or the LLM) should output. The model finally chooses the token or word with the highest probability as the next one it generates as part of the output for the end user. The entire process repeats for every word to be generated as part of the model response.
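This final step is essentially a projection onto the vocabulary followed by a softmax. The sketch below uses random made-up weights just to show the mechanics; in a real LLM the projection matrix is learned, and the next token is often sampled from the distribution rather than always taking the single highest-probability one (the greedy choice shown here matches the description above).

```python
import numpy as np

def next_token_probabilities(final_hidden, W_vocab):
    # Project the final representation onto the vocabulary ("logits"),
    # then turn the logits into probabilities with a softmax.
    logits = final_hidden @ W_vocab
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

d_model, vocab_size = 4, 10                        # hypothetical sizes
rng = np.random.default_rng(2)
W_vocab = rng.normal(size=(d_model, vocab_size))   # learned in a real model
final_hidden = rng.normal(size=d_model)            # representation after all layers

probs = next_token_probabilities(final_hidden, W_vocab)
next_token_id = int(np.argmax(probs))  # greedy choice; LLMs often sample instead
print(next_token_id, probs[next_token_id])
```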
# Wrapping Up
This article provides a gentle and conceptual tour through the journey that text-based information experiences when it flows through the signature model architecture behind LLMs: the transformer. After reading it, we hope you have gained a better understanding of what goes on inside models like the ones behind ChatGPT.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.