Transformers, embeddings and attention: how modern LLMs really think
Welcome back to the series on LLM-based application development.
By now you already know the basics of how LLMs are built and what their key parameters mean. In this article we return to the architecture that kicked off the current wave of language models: the Transformer from the 2017 paper “Attention Is All You Need”. That work was a real turning point for natural language processing and it’s the foundation behind GPT and many other modern models.
My goal here is to walk through the main building blocks of a Transformer in a practical way, so that when you see the classic diagram, it’s not just a mysterious box anymore.
Figure from the paper "Attention Is All You Need" by Vaswani et al., 2017 [1]
In the diagram from the original paper you'll see two main blocks: an encoder on the left and a decoder on the right.
The encoder is the analysis department. It takes an input sequence, for example a sentence in Polish, and converts it into a sequence of numerical representations. You can think of these as internal codes the model can work with efficiently.
The decoder is the generation department. It receives that internal code from the encoder and combines it with its own previously generated outputs to produce a new sequence, word by word. In machine translation, that new sequence might be a sentence in English. In other tasks it might be an answer, a summary or a continuation of a text.
If you like analogies: the encoder is the first translator who reads the original text and writes very technical notes in a private shorthand. The decoder is the second translator who reads those notes and writes a clean, fluent output in the target language.
A key property of this setup is that it is autoregressive. Each new word depends on all previously generated words. When the model is writing, it isn’t picking words in isolation. Every next token has to fit into the whole story so far. It’s closer to writing a novel than to filling boxes in a form.
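To make that loop concrete, here is a minimal, library-agnostic sketch in Python. The `model` function is hypothetical: it stands for one forward pass that maps the tokens generated so far to a probability distribution over the next token.

```python
# A minimal sketch of autoregressive generation (not tied to any specific
# library). `model` is a hypothetical callable that, given the tokens so far,
# returns a dict mapping candidate next tokens to probabilities.

def generate(model, prompt_tokens, max_new_tokens=20, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token_probs = model(tokens)                              # depends on *all* tokens so far
        next_token = max(next_token_probs, key=next_token_probs.get)  # greedy pick of the most likely token
        tokens.append(next_token)                                     # the new token becomes part of the context
        if next_token == eos_token:
            break
    return tokens
```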
Before the model can do any of that, though, text has to be turned into numbers. Computers don’t “see” words, they see vectors.
The simplest idea is one-hot encoding. Imagine you have a vocabulary of 15 words. Each word becomes a vector of length 15. Exactly one position in that vector is 1, and all the others are 0. The position of the 1 tells you which word it is.
Take a small corpus:
Ada has a computer. Machine learning allows us to train a computer. To solve problems we just need a computer and a lot of data.
You build a vocabulary, for example: [ada, has, computer, machine, learning, ...].
Then the sentence "Ada has a computer" might be represented, by marking which vocabulary words it contains, as:
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
So far so good: the text is now a fixed-length numeric vector. But this encoding has a massive weakness. It doesn’t capture meaning or relationships between words at all. For this representation, “king” is just as different from “queen” as it is from “computer”. There is no concept of similarity.
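A quick sketch of this in plain Python. The 15-word vocabulary below is my own completion of the article's toy example (stop words like "a" are skipped so the count stays at 15):

```python
vocab = ["ada", "has", "computer", "machine", "learning", "allows", "us", "to",
         "train", "solve", "problems", "we", "just", "need", "data"]  # 15 words

def one_hot(word, vocab):
    """One-hot vector for a single word: a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def sentence_vector(sentence, vocab):
    """Mark which vocabulary words appear in the sentence (bag-of-words style)."""
    words = sentence.lower().replace(".", "").split()
    return [1 if w in words else 0 for w in vocab]

print(one_hot("computer", vocab))                     # [0, 0, 1, 0, ..., 0]
print(sentence_vector("Ada has a computer", vocab))   # [1, 1, 1, 0, ..., 0]
```

Note that every pair of distinct one-hot vectors is exactly as far apart as any other pair, which is precisely the "no concept of similarity" problem.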
Large language models need something much richer.
This is where embeddings come in.
Instead of sparse vectors full of zeros and a single one, each word gets its own dense vector of real numbers in a shared multi-dimensional space. The model learns these vectors so that words with similar meanings end up close to each other, while different types of relationships can be expressed as directions in that space.
That’s how we end up with classic examples like:
*king − man + woman ≈ queen*
You can see similar patterns for verb tenses or grammatical number:
- “walked” and “walk” relate to each other in a similar way as “ran” and “run”
- “we swim” and “I swim” differ along the same dimension as “we” and “I”
An embedding space behaves a bit like a map. Words with similar meanings are neighbors. Opposites may sit on the same line but on opposite sides. More complex relationships can be captured as particular movement directions across the map.
To store enough information about meaning, we use high-dimensional representations. Instead of 15 dimensions like in the one-hot example, we might use 100, 300, 1024 dimensions or more.
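To make the analogy concrete, here is a toy sketch with hand-crafted 3-dimensional vectors. The numbers are invented purely for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hand-crafted toy vectors. The dimensions here very roughly mean
# [royalty, masculinity, "tech-ness"]; real embedding dimensions are learned
# and have no such clean interpretation.
emb = {
    "king":     np.array([0.95, 0.90, 0.05]),
    "queen":    np.array([0.95, 0.05, 0.05]),
    "man":      np.array([0.10, 0.90, 0.05]),
    "woman":    np.array([0.10, 0.05, 0.05]),
    "computer": np.array([0.05, 0.45, 0.95]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
result = emb["king"] - emb["man"] + emb["woman"]
for word, vec in emb.items():
    print(f"{word:>8}: {cosine(result, vec):.3f}")
# "queen" gets the highest similarity, "computer" stays far away
```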
Imagine we have a 100-dimensional embedding for the word “go”. In the same space we have another vector for “away”. We can combine these vectors, for instance by averaging them: add them together and divide by two. The result is a new point in the space that represents the phrase “go away” in a more abstract way than just gluing two words together.
Operations like this aren’t just mathematical tricks. They are a sign that the embedding space captures useful structure. The model can use these representations to reason about similarity, analogy and context.
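A minimal sketch of that averaging step, with random NumPy vectors standing in for learned 100-dimensional embeddings:

```python
import numpy as np

dim = 100
rng = np.random.default_rng(0)

# Random stand-ins for the learned 100-dimensional embeddings of "go" and "away".
emb_go = rng.normal(size=dim)
emb_away = rng.normal(size=dim)

# One simple way to represent the phrase: the element-wise average.
emb_go_away = (emb_go + emb_away) / 2
print(emb_go_away.shape)  # (100,) - a new point in the same embedding space
```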
Embeddings solve the problem of representing individual word meanings, but that isn’t enough. In real language, order and context matter just as much.
“Going from home to work” is not the same as “going from work to home”, even though the words are the same. In Polish, the word “zamek” can mean a medieval castle, a zipper or a door lock, depending on the sentence around it.
One simple idea would be to assign each word a position number: first word is 1, second is 2, and so on. Unfortunately, that naive approach adds its own problems.
If we just pass position indexes directly, position 1000 is dramatically larger than position 1. From a learning perspective, it’s often easier and more stable to work with values that are normalized and live in a bounded range, such as between −1 and 1. Big jumps in scale can make training harder.
Transformers use a different trick to represent position, one that plays nicely with gradient-based learning.
The solution is sinusoidal positional encoding.
Each position in the sequence is represented by a vector built from sine and cosine functions with different frequencies. There are two important consequences:
- All values lie between −1 and 1, so the scale is well-behaved.
- Each position has a unique pattern of sine and cosine values, and nearby positions have similar patterns.
You can think of it as giving each token its own little melody based on where it appears in the sentence. Words next to each other have similar melodies. No two positions share exactly the same tune. Combined with the embedding, this lets the model know both what the word is and where it is.
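Here is a compact NumPy sketch of the sinusoidal encoding from the original paper. In practice the resulting matrix is simply added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as described in "Attention Is All You Need".

    Even dimensions use sine, odd dimensions use cosine, each with a different
    frequency, so every position gets a unique but smoothly varying pattern.
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (seq_len, d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd indices: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=128)
print(pe.min(), pe.max())   # all values stay within [-1, 1]
```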
We’re missing one last critical ingredient: Multi-Head Attention. This is the real core of the Transformer.
Imagine reading a sentence and tracking several things at once. One thread of attention follows the subject. Another pays attention to the main verbs. A third pays more attention to emotional tone. A fourth tries to understand cause-and-effect relationships.
Your brain shifts and blends these threads without much effort. Multi-head attention is a way for a model to do something similar in a structured, trainable way.
The attention mechanism uses three kinds of vectors: Query, Key and Value.
A library analogy works well here:
- The Query is what you’re looking for, like “information about cats”.
- The Key represents what each piece of content is about, similar to book titles on a shelf.
- The Value is the actual content in the books.
The model compares the query to all the keys, calculates how similar they are, and then uses those similarities as weights to combine the values. The result is a weighted mixture of pieces of information that are relevant to the query.
Unlike a traditional dictionary lookup, this is “soft” retrieval. The model doesn’t pick one exact match, it blends information from many places in proportion to how relevant they seem.
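The mechanics of this soft lookup fit in a few lines of NumPy. Below is a sketch of scaled dot-product attention, with random toy matrices standing in for the learned query, key and value projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Soft lookup: compare each query with every key, then blend the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # similarity of each query to each key
    weights = softmax(scores, axis=-1)          # rows sum to 1: how much to read from each position
    return weights @ V, weights                 # weighted mixture of values

# Toy example: 4 tokens, dimension 8 (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)              # (4, 8) (4, 4)
```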
If you’ve used dictionaries or hash maps in programming, the idea is familiar: keys point to values, and you query by key. In a Transformer, everything is continuous. Queries, keys and values are vectors, and matching is based on similarity, not equality.
Multi-head attention simply repeats this mechanism several times in parallel. Instead of one attention head, the model has many heads: 8, 12, 16 or more depending on the implementation. Each head sees the same input but learns to focus on different patterns.
One head might pay attention to short-range relations between neighbouring words. Another might specialize in long-range relations between the beginning and the end of a sentence. A third might capture grammatical structure. A fourth might focus on semantic roles.
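As a rough sketch (again with random matrices standing in for learned weights), multi-head attention splits the model dimension into per-head slices, runs the same attention computation in each slice, and concatenates the results:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Sketch of multi-head attention: run attention per head, then concatenate.

    X: (seq_len, d_model) token representations.
    Wq, Wk, Wv, Wo: projection matrices (learned in a real model, random here).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (seq_len, d_model) each

    head_outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]       # each head works on its own slice
        scores = q @ k.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        head_outputs.append(weights @ v)             # (seq_len, d_head)

    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
    return concat @ Wo                               # final output projection

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 64, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads).shape)  # (6, 64)
```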
This design gives the model some very important advantages:
- It can process many different aspects of the input in parallel.
- It builds a much richer understanding of context than simple recurrent networks.
- It increases the model’s capacity to represent complex patterns without changing the basic building block.
In practice, multi-head attention shows up in multiple layers stacked on top of each other. In models like BERT and GPT, this mechanism is a major reason why they perform so well on a wide range of NLP tasks.
This brings us to the end of the module focused on the internal mechanics of large language models. We’ve covered:
- the encoder–decoder structure
- embeddings and word representations
- positional encoding
- and the attention mechanism, especially multi-head attention
In the next modules we’ll move from theory to practice. We’ll start using this understanding while building real applications with LangChain and LangGraph. Knowing what happens “under the hood” will make it easier to design better systems and to reason about their behavior when something unexpected happens.
**See the next chapter**
**See the previous chapter**
**See the full code from this article in the GitHub repository**