In 2017, the world of Artificial Intelligence got a major makeover, thanks to some clever folks at Google who boldly declared, “Attention Is All You Need.” Sounds like a relationship advice column, right? They introduced the Transformer architecture, a fancy new model that ditched the old-school recurrent and convolutional neural networks like they were last year’s fashion.
The magic ingredient? **Self-attention!** That’s right — AI learned to pay attention (finally!). Thanks to this, models like GPT and BERT can understand context and relationships between words, even when they’re playing hide and seek in a sentence. It’s how AI figures out that “it” isn’t just some random pronoun, but rather a specific noun from ten words ago, proving that AI can remember better than we can after a long day!
In this blog post, we will demystify the self-attention mechanism. We’ll break down the complex mathematics into intuitive concepts, explore how it powers the Transformer model, and understand why it’s been a game-changer in modern Natural Language Processing (NLP). Get ready to look under the hood of the technology shaping the future of AI.
Transformer Architecture
Blog Roadmap: Why Transformers matter in NLP → Embeddings: What & why → Old Methods: One-hot, BoW, TF-IDF limits → Static Embeddings: Word2Vec/GloVe, their problems → Self-Attention: How it solves context/meaning → Mechanism: Q/K/V, weighted sum (math basics) → Conclusion: Self-attention’s impact on NLP
What is Embedding?
Embedding is the process of converting natural language into numerical representations, typically in the form of vectors. Since computers cannot directly understand human language, this transformation enables them to process and analyze text effectively. An embedding model performs this conversion by translating words, phrases, or sentences into numerical vectors that can be utilized by algorithms. These vector representations, known as embeddings, bridge the gap between human language and computational systems, allowing machines to interpret and work with text meaningfully.
Why Embeddings Are Important (in the Context of LLMs):
- Semantic Understanding: They convert words into numerical vectors where proximity indicates similar meaning, enabling the LLM to understand relationships between words. This allows the LLM to understand that “king” and “queen” are more related than “king” and “banana”, which is the foundation of language comprehension for machines (a toy sketch of this idea follows this list).
- Contextual Awareness: Modern embeddings are context-aware, meaning the same word can have different vector representations depending on the surrounding text (e.g., “river bank” versus “bank account”). This nuance is vital for LLMs to process complex human language and generate coherent, relevant responses.
- Computational Efficiency: They reduce high-dimensional data into a dense, lower-dimensional space, making processing faster.
- Task Versatility: They facilitate numerous NLP tasks beyond basic text generation, like semantic search and classification.
- Enhanced Knowledge Access (RAG): They allow LLMs to search external knowledge bases for relevant information to generate informed responses.
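To make the “proximity means similar meaning” point concrete, here is a minimal sketch with hand-made toy vectors (illustrative stand-ins, not real embeddings); it only assumes NumPy is installed.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means 'pointing the same way'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d embeddings: "king" and "queen" point in similar directions,
# "banana" points somewhere else entirely.
king = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.90, 0.05])
banana = np.array([0.10, 0.05, 0.95])

print(cosine_similarity(king, queen))   # high similarity
print(cosine_similarity(king, banana))  # low similarity
```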
Journey Towards Modern Embedding Models:
The journey of embedding models in machine learning and AI has progressed from simple, sparse representations to sophisticated, dense, and contextual ones, enabling machines to better understand and process complex, real-world data.
1. **Traditional/Sparse Methods (Pre-Deep Learning Era)**
Early methods focused on converting data into numerical formats but lacked the ability to capture semantic relationships.
- Basic Numbering: Assigning a unique integer ID to each word. This method failed to capture any relationship between words.
- One-Hot Encoding: Representing each unique item (e.g., a word in a vocabulary) as a high-dimensional binary vector, where only one dimension is “hot” (1) and all others are 0. This was computationally expensive for large vocabularies and still didn’t capture semantic similarity; “cat” and “dog” were just as distant from each other as “cat” and “car” (see the small sketch below).
Example Of One-Hot Encoding
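In code, one-hot encoding for a tiny vocabulary looks like the following sketch (pure Python, toy three-word vocabulary only). Notice that every pair of distinct words is equally far apart, which is exactly the problem described above.

```python
vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    """Return a binary vector with a 1 only at the word's position in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

for w in vocab:
    print(w, one_hot(w, vocab))
# cat [1, 0, 0]
# dog [0, 1, 0]
# car [0, 0, 1]
```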
- Bag of Words (BoW) & TF-IDF: These methods moved beyond individual words to count word frequencies in documents, often using Term Frequency-Inverse Document Frequency (TF-IDF) to emphasize important terms. This captured word importance but lost the critical information of word order and context, making sentences like “dog bites man” and “man bites dog” identical in representation.
Example Of BOW
Formula of TF-IDF
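Here is a rough sketch of the BoW/TF-IDF point above, assuming scikit-learn (1.0 or later) is installed. The two “bites” sentences end up with identical vectors because word order is discarded; TF-IDF only reweights the counts by how rare each term is across the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["dog bites man", "man bites dog"]

# Bag of Words: raw counts per document
bow = CountVectorizer()
counts = bow.fit_transform(docs).toarray()
print(bow.get_feature_names_out())  # ['bites' 'dog' 'man']
print(counts)                       # both rows are [1 1 1] -- word order is gone

# TF-IDF: counts reweighted by inverse document frequency
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())  # the two rows are still identical
```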
2. **Static Word Embeddings (The Breakthrough)**
The early 2010s saw a major breakthrough with the introduction of fixed-size, dense, low-dimensional vectors that captured semantic and syntactic relationships.
- Word2Vec (Mikolov et al., 2013): This model learned word embeddings by predicting a target word from its surrounding context (CBOW) or predicting the context words from a given word (skip-gram). It produced dense vectors where semantically similar words (like “king” and “queen”) were closer in the vector space, and meaningful analogies could be captured through vector arithmetic (e.g., “king - man + woman ≈ queen”); a short code sketch of this appears after this subsection.
- GloVe (Global Vectors for Word Representation): This method built on word co-occurrence statistics from large corpora, offering an alternative to Word2Vec that also effectively captured semantic relationships.
A key limitation of these methods was that each word had a single, static embedding, regardless of its context (e.g., the word “bank” would have the same embedding whether it referred to a financial institution or a riverbank).
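Here is a small sketch of that vector arithmetic using gensim’s downloader API. This assumes gensim is installed and that the pretrained “glove-wiki-gigaword-50” vectors can be downloaded on first run; the exact neighbours and scores you get may differ slightly.

```python
import gensim.downloader as api

# Load small pretrained GloVe vectors (downloaded on first use)
model = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" ≈ ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same static vector is returned for "bank" no matter which sentence it appears in
print(model["bank"][:5])
```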
3. **Contextual Embeddings (The Transformer Era)**
The 2017 paper “Attention Is All You Need” introduced the Transformer architecture, which enabled the development of dynamic, contextual embeddings that changed based on surrounding words.
In this blog, we discuss how self-attention transforms static embedding vectors into dynamic vectors that capture the context of a sentence.
Limitations of Static Embeddings:
- Out-of-Vocabulary (OOV) Words: These models cannot generate embeddings for words that were not present in their training corpus. This is a problem for specialized terminology, new words (like “COVID”), or morphologically rich languages.
- Lack of Nuance and Context: They struggle with homographs (words with the same spelling but different meanings, e.g., “bank” as a financial institution vs. a riverbank) and idiomatic expressions or sarcasm, where meaning is highly dependent on the surrounding text.
- Inherent Bias: Word embeddings learn from large text corpora and can, therefore, perpetuate real-world biases (e.g., gender or racial stereotypes in job titles) present in the training data, leading to biased results in downstream applications.
- Loss of Word Order and Syntax: Some word embedding models (like those used with a simple Bag-of-Words approach) can lose information about the order of words, which can change the entire meaning of a sentence (e.g., “the dog bit the man” vs. “the man bit the dog”).
- Domain Specificity: Pre-trained embeddings developed on general text (like news articles) may perform poorly when applied to specialized domains, such as medical or legal text, which contain unique terminology.
- Computational Complexity: Training high-quality embeddings on very large corpora can be computationally intensive and require significant memory resources.
How Self-Attention Solves the Problem:
Here is how self-attention addresses the primary limitations:
- Polysemy (Multiple Meanings): In static models, a word like “bank” has a single, fixed vector. Self-attention dynamically adjusts the representation of “bank” based on other words in the sentence (e.g., “river” vs. “money”). It does this by calculating attention scores between every word and the target word, effectively weighting the influence of relevant context words to create a specific, contextualized vector.
- Capturing Long-Range Dependencies: Traditional sequential models (like RNNs) struggle to link words that are far apart in a long sentence due to information loss over time. Self-attention, by contrast, directly connects every word to every other word in the sequence in a single layer. This parallel processing allows the model to capture relationships between distant words (e.g., correctly linking “it” to “cat” in a long sentence), regardless of their distance.
- Sequential Processing and Efficiency: Traditional models process words one by one, which is slow and prevents efficient parallelization on modern hardware. Self-attention allows for the entire input sequence to be processed simultaneously as a matrix operation, which makes training much faster and more scalable.
- Word Order: While self-attention processes words in parallel, it loses the inherent sequential information. This is solved by using positional encodings, which are numerical signals added to the input embeddings to inform the model about the absolute or relative position of each word in the sequence.
- Out-of-Vocabulary (OOV) Words: While self-attention is the mechanism for context, the OOV problem is largely solved through the use of subword tokenization methods (like Byte-Pair Encoding or WordPiece), which are commonly used with self-attention models (e.g., Transformers). Subword Units: The input text is broken down into smaller, common subword units rather than full words. Compositionality: Even if a full word has never been seen before (e.g., “unsuperviseable”), it can be broken down into known subwords (“un”, “supervise”, “able”) for which embeddings exist. The self-attention mechanism then helps combine these subword representations into a meaningful representation for the entire word.
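As a quick illustration of the subword idea, the sketch below uses the Hugging Face transformers library (assumed installed; the bert-base-uncased tokenizer files are downloaded on first use, and the exact split may differ from what you expect).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A word the tokenizer has almost certainly never seen as a whole:
print(tokenizer.tokenize("unsuperviseable"))
# WordPiece breaks it into smaller known pieces, so the model can still
# build a representation for the full word by combining them.
```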
Deep Dive into Self-Attention (Learn Like a Researcher):
Before self-attention, we already had word embedding models that produce embeddings, but those embeddings are static and do not capture the context of a sentence.
So researchers asked: how do we build a model that can capture the context of a sentence?
First Principles Approach:
Suppose you have two sentences:
- “Money bank grows”
- “River bank flows”
Here, the word “bank” appears in both sentences but has different contextual meanings (money/finance vs. river/waterbody).
Step-by-step First Principles Construction:
1. **Identify the Problem with Static Embeddings:**
- Static word embeddings assign the same vector to “bank” in both contexts.
- This fails to distinguish “bank” as a financial institution or as a river’s edge.
2. **Idea for Contextual Representation:**
- What if instead of representing “bank” with a fixed vector, we make its meaning depend on its context?
- For example, in “Money bank grows”, the embedding for “bank” should be influenced by “money” and “grows”.
- In “River bank flows”, the embedding for “bank” should be influenced by “river” and “flows”.
3. **Mathematical Framing:**
- Imagine representing “bank” not just by its own embedding, but as a weighted sum of the embeddings of all words in the sentence.
For "Money bank grows":New embedding of "bank" =w1⋅embedding(money)+w2⋅embedding(bank)+w3⋅embedding(grows)New embedding of "money"=w4⋅embedding(money)+w5⋅embedding(bank)+w6⋅embedding(grows)New embedding of "grows"=w7⋅embedding(money)+w8⋅embedding(bank)+w9⋅embedding(grows)For "River bank flows":New embedding of "bank" =w10⋅embedding(river)+w12⋅embedding(bank)+w13⋅embedding(flows)New embedding of "river" =w14⋅embedding(river)+w15⋅embedding(bank)+w16⋅embedding(flows)New embedding of "flows" =w17⋅embedding(river)+w18⋅embedding(bank)+w19⋅embedding(flows)
- Here, each w (weight) represents how much each word in the sentence influences the new embedding of the word in focus (in the first example, “bank”).
4. **How to Compute the Weights?**
- Use a similarity measure, like the dot product, between “bank” and every other word to get weights.
Similarity Score
# 1. Compute a similarity score between "bank" and each word in the sentence
score_i = embedding(bank) . embedding(word_i)
- Normalize these weights using softmax so they sum to 1, representing probabilities of attention.
# 2. Normalize the scores using softmax to get weights
w_i = exp(score_i) / sum_j(exp(score_j))
- Repeat for each word in the sentence, so contextual embeddings are generated for every word by considering every other word.
# 3. Compute the new contextual embedding for 'bank'
new_embedding(bank) = sum_i(w_i * embedding(word_i))   # i = 1 to n

5. **Intuitive Analogy:**
- Just like on a matchmaking site, you don’t use your entire autobiography everywhere — rather, you use focused profiles and specific queries.
- Similarly, convert each original embedding into three roles: Query, Key, and Value (by learned transformations) — so the dot products used for attention are meaningful and flexible for each role.
- Query: The word we are focusing on (e.g., “bank”). It wants to find meaning based on others; so, it queries the context.
- Key: Every word in the sentence (including itself) provides a key, which helps determine whether it’s relevant to the “bank” word’s context.
- Value: Each word also holds some value — information that may contribute to building the new (“contextual”) meaning of “bank”.
6. **Mathematically:**
- When calculating the attention for “bank” in the sentence:
- The query vector comes from “bank”.
- The key and value vectors come from every word in the sentence (including “bank” itself).
- score_i = query(bank) ⋅ key(word_i)
- Use the scores to weight the value(word_i) in the final sum.
Summary formula: for a sentence with word embeddings e1, e2, e3:
New embedding of “bank” = w1⋅e1 + w2⋅e2 + w3⋅e3
where each w_i = softmax(dot-product(query(bank), key(word_i))), computed over all words in the sentence.
This is the first principles construction of self-attention: every word looks at all other words, scores their relevance, and combines them dynamically for a contextual embedding.
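Putting the whole first-principles version together, here is a minimal NumPy sketch of the procedure (toy embeddings, no learnable parameters):

```python
import numpy as np

def simple_self_attention(E):
    """E: (n_words, d) matrix of static embeddings -> (n_words, d) contextual embeddings."""
    scores = E @ E.T                                        # dot-product similarity, shape (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)     # for numerical stability in softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return weights @ E                                      # weighted sum of all embeddings

# Made-up 4-d embeddings for "money", "bank", "grows"
E = np.array([[0.9, 0.1, 0.0, 0.3],
              [0.7, 0.3, 0.2, 0.1],
              [0.1, 0.8, 0.4, 0.0]])
print(simple_self_attention(E).round(3))
```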
7. **Limitations & Improvements:**
- The simplest first-principles version (above) does not have any learnable parameters; all weights are computed from fixed embeddings. If you build a model this way, the resulting embeddings cannot adapt to specific tasks such as translation, summarization, etc.
- To make the system learn contextual meanings depending on the task, introduce learnable transformations for query, key, and value vectors (using matrix multiplication and training).
Adding Learnable Parameters to the First-Principles Approach:
To add learnable parameters to the first principles approach for self-attention, you move from using the original static embeddings directly, to applying trainable linear transformations that produce the Query (Q), Key (K), and Value (V) vectors. These transformations introduce the learnable weights, allowing the attention mechanism to adapt and learn contextual relationships specific to the task during training.
Self Attention
1. Start with the Original Embeddings:
- Let each word in the sentence be represented by its embedding vector: embedding(word). (These come from an embedding layer or pre-trained word vectors.)
2. Apply Linear Projections (using Learnable Matrices):
- For each word, create three new vectors — Query (Q), Key (K), and Value (V) — by multiplying the original embedding with trainable weight matrices:
- Q = W_Q⋅embedding(word), K = W_K⋅embedding(word), V = W_V⋅embedding(word).
- Here, W_Q, W_K, and W_V are weight matrices (parameters) learned during training.
3. Use Q, K, and V for Self-Attention Calculation:
- For the word in focus (let’s say “bank”), use its Query vector Q_bank.
- For every word in the sentence (including itself), use their Key vectors (Ki) and Value vectors (Vi).
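Here is a minimal NumPy sketch of this step. The projection matrices below are random stand-ins for weights that would normally be learned during training, and the division by sqrt(d_k) is the scaling used in the scaled dot-product attention of the original Transformer paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(E, W_Q, W_K, W_V):
    """E: (n, d_model) embeddings -> (n, d_k) contextual vectors."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V   # each row of E is one word's embedding
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot-product scores, shape (n, n)
    weights = softmax(scores)             # one attention distribution per word
    return weights @ V                    # weighted sum of Value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                 # e.g. "money", "bank", "grows"
E = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(E, W_Q, W_K, W_V).shape)  # (3, 4)
```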
Summary of Self-Attention:
Self-attention is a core mechanism in transformers that enables each word in a sentence to dynamically focus on other words based on their contextual relevance. Technically, it works as follows: every input word embedding is transformed into three vectors — Query (Q), Key (K), and Value (V) — using learnable linear projections (i.e., multiplication by trainable matrices). For each word, its Query vector is compared (dot product) with all Key vectors in the sentence to compute similarity scores, which are normalized via softmax to get attention weights. These weights are then used to take a weighted sum over all Value vectors, generating a new contextual embedding for each word that reflects nuanced, task-specific relationships. This design, powered by data-driven learning of the transformation matrices, allows self-attention to capture complex dependencies and meaning shifts in different contexts, making it highly parallelizable and foundational for modern NLP architectures.
Conclusion:
Self-attention is the fundamental building block that enables Transformers to generate powerful contextual representations for words in any sequence. Through a step-by-step, first-principles approach, we have seen how moving from static word embeddings to dynamic, task-aware embeddings, using learnable Query, Key, and Value projections, solves the limitations of older NLP models and lays the foundation for modern deep learning systems. By efficiently capturing complex dependencies, self-attention allows Transformers to excel in a wide variety of AI tasks, and understanding this mechanism is crucial for both research and real-world implementation. I encourage you to appreciate the elegance and flexibility of self-attention and to explore its code-level details to build a deeper intuition for how advanced NLP models work.
— — — — — — — — — — — — — The End — — — — — — — — — — — —
I will be adding more content to this blog in the future. For now, it serves as a comprehensive resource for learning self-attention, from basic intuition to the full Query/Key/Value mechanism. I frequently referenced Nitish Sir’s videos while creating this blog.
If you enjoyed this blog, please follow me on Medium and LinkedIn. Thank you for your time!