Introduction:
Search technology represents one of the most fundamental and enduring challenges in computer science: how to find relevant information from massive, ever-expanding data collections efficiently. This is not merely a problem of data lookup but a complex endeavour spanning relevance, scale, and, increasingly, a deep understanding of human intent.
This comprehensive exploration delves into every aspect of modern search systems. It charts the complete journey from the foundational lexical approaches that powered the early web to the sophisticated vector embeddings, approximate nearest-neighbour algorithms, and hybrid retrieval architectures that enable today’s most advanced Artificial Intelligence (AI) applications. The report will demonstrate that modern search is not a choice between one method and another but an evolutionary process where new, semantically aware systems are layered upon and integrated with battle-hardened lexical foundations.
What Exactly Is AI-Powered Search?
The Three Layers of AI Search
AI-powered search can be broken down into three distinct layers:
1. AI + Search (Without Deep Learning)
Surprisingly, you can build intelligent search systems without machine learning. These include:
- Question-answer systems with hard-coded responses
- Rule-based relevancy systems
- Basic chatbots and virtual assistants
- Taxonomies and ontologies
While these might not seem “intelligent”, they technically qualify as AI — computer systems designed to act like humans would.
2. Machine Learning-Powered Search
This is where things get interesting. This layer includes:
- Signals boosting models: Learning from user behavioral signals to rank search results
- Learning to rank (LTR): Building ranking classifiers that generalize across all results
- Semantic search: Searching on meaning, not just keywords
- Collaborative filtering: “Users who purchased this also purchased…”
- Matrix factorization: generating behavioural embeddings and personalizing results
- Personalized search: Tailoring results to individual users
- Document classification and clustering
- NLP and entity resolution techniques
- Semantic knowledge graphs
3. Deep Learning-Powered Search
The cutting edge includes:
- LLMs and foundation models
- Neural search and vector search
- Embeddings: Representing meaning mathematically
- Multimodal search: Across images, videos, and text
- Retrieval Augmented Generation (RAG): Using search to ground LLM responses in facts
- Agentic RAG: Iterative reasoning cycles to find optimal results
Part 1: Lexical Search — The Foundational Architecture
Lexical search, also known as keyword search, is the bedrock of information retrieval. Its architecture is built upon a series of elegant heuristics and highly optimized data structures designed to match query terms to stored text with extreme speed and precision.
1.1 Understanding Documents and the Search Problem
In the context of search systems, the “document” is the fundamental unit of information. This is a critical abstraction. A document is not limited to a text file or a PDF; it is any searchable entity. This could be a product listing, a job posting, a passage from a book, or any piece of content to be made discoverable.
Technically, a document is often best conceptualised as a flat, denormalised JSON blob. This architectural choice is a deliberate trade-off: it sacrifices the storage efficiency and data integrity of a normalised database for the raw speed of retrieval. By denormalising, all information required to display a search result is contained in a single entity, eliminating the need for computationally expensive database “joins” at query time, which would be prohibitively slow.
For this analysis, we will use a running example based on two product documents:
- Document 0: “Size 11 running shoes”
- Document 1: “Split sole ballet shoes”
1.2 The Tokenization Pipeline: Breaking Down Text
The very first step in making text searchable is tokenization — the process of breaking continuous text into discrete, searchable units called tokens. This is a sophisticated, multi-stage pipeline, not a simple split on spaces.
The typical pipeline involves several key steps:
- Initial Tokenization: The raw text is broken into a list of preliminary tokens. For example, “Size 11 running shoes” becomes [“Size”, “11”, “running”, “shoes”].
- Case Normalization: Tokens are often converted to lowercase (e.g., “Size” becomes “size”) to ensure consistency. A search for “shoe” should not fail just because the document contains “Shoe”.
- Stemming and Lemmatization: This is the most complex step, designed to reduce words to their root form to handle linguistic variations.
- Stemming: An algorithmic process that strips prefixes and suffixes to find a common root. This is fast but can sometimes result in non-words. For example, “running” and “runner” both reduce to “run”, and “shoes” reduces to “shoe”.
- Lemmatization: A more advanced, dictionary-based process that considers a word’s context and part of speech to return its base dictionary form (its “lemma”). For example, “is,” “are,” and “am” would all be lemmatized to “be”.
This process is a classic trade-off. By intentionally destroying information (the semantic difference between “a runner” and “the act of running”), stemming and lemmatization aim to increase recall (finding all relevant documents) at the potential risk of slightly decreasing precision (finding irrelevant documents that share a root word). The choice of stemmer (e.g., Porter, Snowball) is language-specific and can significantly impact search quality.
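As a rough illustration, here is a minimal Python sketch of such a pipeline, using a simple regex tokenizer and NLTK’s Porter stemmer. The library choice and the regex are assumptions for illustration only; production analyzers (e.g., in Lucene) are far richer.
import re
from nltk.stem import PorterStemmer  # assumes the NLTK package is installed

stemmer = PorterStemmer()

def analyze(text):
    # 1. Initial tokenization: split raw text into preliminary tokens
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # 2. Case normalization: lowercase every token
    tokens = [t.lower() for t in tokens]
    # 3. Stemming: reduce each token to its root form
    return [stemmer.stem(t) for t in tokens]

print(analyze("Size 11 running shoes"))  # ['size', '11', 'run', 'shoe']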
1.3 The Inverted Index: Search’s Core Innovation
The inverted index is arguably the most important data structure in information retrieval, forming the core of systems like Lucene, Elasticsearch, and every major search engine. It is an innovation that shifts the primary computational burden of search from query time to index time.
A traditional “forward index” maps each document to its contents:
Doc 0 → [“size”, “11”, “run”, “shoe”]
Doc 1 → [“split”, “sole”, “ballet”, “shoe”]
We invert the relationship (hence “inverted” index):
“run” → [0]
“shoe” → [0, 1]
“size” → [0]
“11” → [0]
“split” → [1]
“sole” → [1]
“ballet” → [1]
This inversion enables lightning-fast retrieval. When a user searches for “runner shoes”:
- Tokenize and stem: [“run”, “shoe”]
- Look up “run” → get [0]
- Look up “shoe” → get [0, 1]
- Intersect for AND query → [0]
- Return Document 0
The Power of Set Operations: The inverted index naturally supports Boolean operations:
- AND: Intersection of document sets
- OR: Union of document sets
- NOT: Set difference operations
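A minimal sketch of this idea in Python, using plain sets as posting lists (real engines such as Lucene use compressed, sorted postings, but the set algebra is the same):
from collections import defaultdict

docs = {0: ["size", "11", "run", "shoe"],
        1: ["split", "sole", "ballet", "shoe"]}

# Index time: invert the document-to-terms mapping into term-to-documents
inverted = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        inverted[token].add(doc_id)

# Query time: "runner shoes" stems to ["run", "shoe"]
run_docs, shoe_docs = inverted["run"], inverted["shoe"]
print(run_docs & shoe_docs)  # AND (intersection) -> {0}
print(run_docs | shoe_docs)  # OR (union)         -> {0, 1}
print(shoe_docs - run_docs)  # NOT run            -> {1}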
1.4 Beyond Binary: Term Frequency and Document Statistics
Real-world search must rank results, not just return a binary match/no-match. The enhanced inverted index achieves this by storing term statistics alongside the document IDs.
The enhanced inverted index stores not just document IDs but also:
“run” → [(doc:0, tf:1), (doc:5, tf:2)]
“ballet” → [(doc:1, tf:1), (doc:7, tf:1)]
“shoe” → [(doc:0, tf:1), (doc:1, tf:1), (doc:5, tf:1)]
Here, tf is the Term Frequency — how many times the term appears in that specific document. The intuition is simple: Document 5, which mentions “run” twice, is likely more about running than Document 0, which mentions it once. This is the first and most basic signal of relevance.
1.5 The Specificity Principle: Not All Terms Are Equal
Consider a search for “ballet shoes”. Which term is more important?
- “shoes” appears in 10,000 documents (generic, common, low information content).
- “ballet” appears in 100 documents (specific, rare, high information content).
This introduces Document Frequency (DF) — the number of documents in the entire collection that contain a term. Users implicitly prefer specific matches. The ideal ranking order would be:
- Documents with both “ballet” AND “shoes” (best).
- Documents with just “ballet” (the specific term).
- Documents with just “shoes” (the generic term).
This heuristic, which values rare terms as a proxy for user intent, led directly to one of information retrieval’s most important formulas.
1.6 TF-IDF: The Classic Relevance Formula
TF-IDF (Term Frequency-Inverse Document Frequency) combines both insights:
TF-IDF = TF × IDF
Where IDF = log(N/DF)
- TF rewards terms appearing multiple times in a document
- IDF rewards rare, specific terms
- N is the total number of documents in the collection
Detailed Calculation Example: For “ballet” in Document 1:
- TF = 2 (appears twice)
- DF = 100 (appears in 100 documents total)
- N = 100,000 (total documents)
- IDF = log(100,000/100) = log(1000) = 3
- TF-IDF = 2 × 3 = 6
For “shoe” in Document 1:
- TF = 1 (appears once)
- DF = 10,000 (common term)
- IDF = log(100,000/10,000) = log(10) = 1
- TF-IDF = 1 × 1 = 1
The document scores 6 for “ballet” but only 1 for “shoe”, reflecting the higher value of specific terms.
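The calculation above can be reproduced in a few lines of Python (base-10 logarithm to match the worked example; real systems typically add smoothing and may use the natural log):
import math

def tf_idf(tf, df, n_docs):
    # IDF = log10(N / DF); TF-IDF = TF * IDF
    return tf * math.log10(n_docs / df)

print(tf_idf(tf=2, df=100, n_docs=100_000))     # "ballet": 2 * 3 = 6.0
print(tf_idf(tf=1, df=10_000, n_docs=100_000))  # "shoe":   1 * 1 = 1.0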
But are 10x more terms truly 10x more relevant?
1.7 The Saturation Problem: When More Isn’t Better
Early search systems discovered a critical problem with raw TF-IDF: linearity doesn’t match user expectations. Consider two documents:
- Document A: mentions “Python” 5 times
- Document B: mentions “Python” 50 times
Is Document B really 10 times more relevant? No. After a certain point, additional occurrences don’t add much signal—this is called the “we get it” effect.
The Saturation Curve: Relevance increases with term frequency but with diminishing returns. The mathematical form this takes is logarithmic or asymptotic, not linear.
Real-World Impact: This prevents keyword stuffing (malicious or accidental) from dominating results. A spammy page repeating “iPhone” 1000 times won’t outrank a genuine iPhone review.
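To see the shape of the curve, here is a tiny sketch of the saturating term-frequency component used by BM25 (introduced below); the k1 value is the conventional default and the term counts are purely illustrative:
k1 = 1.2

def saturated_tf(tf):
    # BM25-style term-frequency saturation (length normalization omitted here)
    return (tf * (k1 + 1)) / (tf + k1)

for tf in (1, 5, 50, 1000):
    print(tf, round(saturated_tf(tf), 2))
# 1 -> 1.0, 5 -> 1.77, 50 -> 2.15, 1000 -> 2.2 (approaches k1 + 1)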
1.8 Document Length Normalization
Document length also matters. A single match for “Skywalker” in a 20-word tweet is far more significant than a single match in a 100,000-word book. The tweet is about Skywalker; the book just mentioned him. Shorter documents must receive a relevance boost for term matches to normalize for this bias.
1.9 BM25: The State-of-the-Art Lexical Ranking
BM25 (Best Matching 25) combines all these insights into a sophisticated ranking function:
BM25 = IDF × [(TF × (k1 + 1)) / (TF + k1 × (1 − b + b × (|D|/avgDL)))]
Breaking down each component:
- IDF: Inverse document frequency (term specificity)
- TF: Term frequency in the document
- k1: Controls term frequency saturation (typically 1.2)
- Higher k1 = less saturation (more linear)
- Lower k1 = more saturation (flattens quickly)
- b: Controls length normalization (typically 0.75)
- b=1: Full length normalization
- b=0: No length normalization
- |D|: Current document length
- avgDL: Average document length in collection
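Putting the pieces together, a minimal per-term BM25 scorer might look like the sketch below. The IDF and document-length values are illustrative assumptions, and a full document score sums this quantity over all query terms:
def bm25_term_score(idf, tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # One term's contribution to a document's BM25 score
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# Same term statistics (idf=3, tf=2) in a short document vs. a long one
print(round(bm25_term_score(idf=3, tf=2, doc_len=50, avg_doc_len=100), 2))   # 4.8
print(round(bm25_term_score(idf=3, tf=2, doc_len=500, avg_doc_len=100), 2))  # 1.94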
The Tuning Process: These parameters (k1, b) are typically tuned using:
- Click-through data from real users
- Editorial relevance judgments
- A/B testing in production
1.10 Sparse Vector Representation
We can represent lexical search mathematically as operations on sparse vectors:
- Vocabulary size: 1,000,000 terms
- Document vector: 1,000,000 dimensions
- Most values are 0 (hence “sparse”)
- Non-zero values represent term presence/frequency
Example for query “apple juice”:
apple: [0, 0, 0, …, 1, …, 0, 0, 0] (position 42,337)
juice: [0, 0, 0, …, 1, …, 0, 0, 0] (position 598,221)
combined: [0, 0, 0, …, 1, …, 1, …, 0]
This sparse vector representation bridges lexical and vector search paradigms.
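In practice, sparse vectors are stored as maps from dimension index to weight rather than as million-element arrays. A small sketch, where the term positions and weights are invented for illustration:
# Sparse vectors as {dimension_index: weight} maps
apple_vec = {42_337: 1.0}
juice_vec = {598_221: 1.0}
query_vec = {**apple_vec, **juice_vec}          # "apple juice"

def sparse_dot(a, b):
    # Only dimensions that are non-zero in both vectors contribute
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

doc_vec = {42_337: 0.8, 598_221: 0.5, 7: 0.1}   # hypothetical document weights
print(sparse_dot(query_vec, doc_vec))            # 1.3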
Part 2: The Limitations of Lexical Search and the Rise of Semantics
2.1 The Vocabulary Mismatch Problem
Lexical search’s fatal flaw is its inability to understand meaning. Consider these failures:
- Searching “car” misses documents about “automobile”
- Searching “CNN” misses “Convolutional Neural Network”
- Searching “cheap” misses “affordable”, “budget”, “inexpensive”
This vocabulary mismatch problem affects 30–40% of queries in typical search systems.
2.2 The Ambiguity Challenge
The word “apple” illustrates the ambiguity problem:
- Apple (fruit) — nutrition, recipes, farming
- Apple (company) — technology, stocks, products
- “The Big Apple” — New York City
Lexical search treats all instances identically, unable to disambiguate based on context.
2.3 Dimensionality Reduction Through Stemming
Stemming can be viewed as primitive dimensionality reduction:
- “apple” and “apples” → “apple”
- “running”, “runner”, “runs” → “run”
- Reduces vocabulary size (dimensions)
- But loses semantic distinctions
This mechanical approach sometimes helps but often hurts by conflating distinct concepts.
Part 3: Dense Vector Embeddings — Encoding Meaning
To solve the problems of vocabulary mismatch and ambiguity, systems needed to move from representing words to representing meaning. This required a fundamental paradigm shift: from high-dimensional sparse vectors to low-dimensional dense vectors.
Sparse Vectors (Lexical):
- Dimensions: 1,000,000+ (vocabulary size)
- Values: Mostly 0s, few 1s
- Interpretation: Word presence/absence
- Example: [0, 0, 1, 0, 0, 0, 1, 0, …]
Dense Vectors (Semantic):
- Dimensions: 128–1536 (learned representations)
- Values: Continuous real numbers
- Interpretation: Semantic properties
- Example: [0.234, -0.567, 0.123, 0.891, …]
Dense vectors (or “embeddings”) are learned by neural networks and encode concepts as points in a multi-dimensional “semantic space”.
3.2 Semantic Dimensions: What Vectors Encode
To build intuition, consider an analogy for what these dense vectors capture. Instead of dimensions for words, imagine dimensions for concepts.
In this conceptual model, “Cheese Pizza” is no longer just a string of characters; it is a point in concept-space defined by its properties. This is what allows for semantic comparison.
3.3 Cosine Similarity: Measuring Semantic Distance
Cosine similarity measures the angle between vectors, not their magnitude:
Cosine Similarity = (A · B) / (||A|| × ||B||)
Geometric Interpretation:
- cos(0°) = 1.0: Identical meaning (same direction)
- cos(90°) = 0.0: Unrelated (orthogonal)
- cos(180°) = -1.0: Opposite meaning (opposite direction)
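A minimal pure-Python implementation of this formula, shown on toy 2-dimensional vectors:
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   #  1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   #  0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)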
When this metric is applied to our conceptual vectors, it yields semantic rankings for example queries such as “Cheese Pizza” and “Green Tea”.
3.4 The Continuous Semantic Space
A key emergent property of these vector spaces is that they are continuous. The space between vectors is meaningful.
A famous example is the “Puppy-Vader Experiment”:
- The vector for “puppy” represents concepts like [cute, small, furry…].
- The vector for “Darth Vader” represents a contrasting set of concepts.
- The midpoint vector (the average of the two) retrieves images of a “cute puppy dressed as Darth Vader.”
This is not a coincidence; it is a feature of the distributed representations learned by the neural network.
3.5 Learned Embeddings and Latent Features
Modern embeddings from large language models use latent features:
Example 768-dimensional embedding:
Dimension 1: 0.234 (partially size-related)
Dimension 2: -0.567 (partially color-related)
Dimension 3: 0.123 (unknown mixture)
…
Dimension 768: 0.087 (complex combination)
Key Insights:
- Individual dimensions aren’t interpretable
- Meaning emerges from combinations
- Networks discover optimal representations through training
- Different models learn different latent spaces
3.6 Multimodal Embeddings
Modern systems can embed multiple modalities into the same space:
- Text Embeddings: From documents, queries, descriptions
- Image Embeddings: Visual features, styles, objects
- Behavioral Embeddings: Click patterns, purchase history
- Audio Embeddings: Speech, music, sound effects
The Power of Unified Space: A text query “red sports car” can retrieve:
- Text documents about sports cars
- Images of red sports cars
- Products previously viewed by sports car enthusiasts
- Videos with sports car sounds
3.7 Vector Arithmetic and Conceptual Navigation
The semantic space supports meaningful arithmetic operations:
Superhero Example:
- “superhero flying” (singular) retrieves single heroes in flight
- “superheroes flying” (plural) retrieves group scenes
- Difference vector = concept of plurality
DeLorean + Superhero Example:
- DeLorean image embedding (sporty car, cool lighting)
- “superhero” text embedding
- = Images of sporty cars with superheroes
This arithmetic property enables:
- Query expansion and refinement
- Conceptual navigation
- Analogy completion (“king — man + woman = queen”)
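The classic analogy can be sketched with NumPy on toy, hand-invented 4-dimensional embeddings. Real models use hundreds of learned dimensions, so these numbers are assumptions purely for illustration:
import numpy as np

# Toy embeddings, invented for illustration only
king  = np.array([0.9, 0.8, 0.1, 0.3])
man   = np.array([0.1, 0.8, 0.1, 0.2])
woman = np.array([0.1, 0.1, 0.9, 0.2])
queen = np.array([0.9, 0.1, 0.9, 0.3])

analogy = king - man + woman   # "king - man + woman"

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy vector lands closest to "queen" among the candidates
for name, vec in [("king", king), ("man", man), ("woman", woman), ("queen", queen)]:
    print(name, round(cos(analogy, vec), 3))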
Part 4: Efficient Vector Retrieval at Scale
4.1 The Computational Challenge
We need data structures that enable sub-linear search time.
The naive approach to vector search is a brute-force k-Nearest Neighbors (k-NN) scan: compare the query vector to every single document vector in the collection. This is computationally intractable:
- 1 million documents × 768 dimensions × float32 = 3GB of vectors
- 1 million cosine similarity calculations per query.
- This is an O(N) operation, taking roughly 100 ms per query, which is far too slow for a production system.
The solution is Approximate Nearest Neighbor (ANN) search, a field dedicated to finding most of the nearest neighbors in sub-linear time (e.g., O(log N)). This involves trading a small amount of accuracy for a massive gain in speed.
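For intuition, a brute-force scan looks like the NumPy sketch below (scaled down to 100,000 vectors so it runs comfortably on a laptop; at 1 million documents the matrix alone is roughly 3 GB):
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(100_000, 768)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)  # unit length

query = rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force k-NN: one dot product (cosine) per document, O(N) per query
scores = doc_vectors @ query
top_10 = np.argsort(-scores)[:10]
print(top_10, scores[top_10])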
4.2 Grid-Based Indexing: The Geographic Analogy
A first approach is to divide the vector space into a uniform grid. At query time, the system finds the query’s grid cell and searches only within it.
Process:
- Divide space into uniform grid cells
- Assign each vector to its grid cell
- For queries, find the appropriate cell
- Search only within that cell
The Problem — Uneven Distribution: Like using USGS geographic survey grids:
- Grid “18TWL” in Manhattan: 10 million people
- Same-sized grid in Wyoming: 10 people
Vector spaces are “clumpy.” Like a geographic grid, some cells (“Manhattan”) will contain 10 million vectors, while others (“Wyoming”) contain 10. The search is still O(N) within the dense cells.
4.3 Clustering: Data-Aware Partitioning
A better approach, k-means clustering of the vectors, uses a clustering algorithm to create data-sensitive regions, analogous to ZIP codes. Dense areas get many small clusters; sparse areas get few large ones.
This enables a two-stage retrieval:
- Find the nearest cluster centroid (fast O(k) search, where k is the number of clusters).
- Search only within that cluster’s (much smaller) list of vectors.
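A minimal sketch of this two-stage retrieval using scikit-learn’s KMeans on a toy corpus (the corpus size, dimensionality, and cluster count are assumptions; production systems use inverted-file (IVF) indexes built on the same idea):
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)   # toy corpus
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)    # unit length, so dot product = cosine

# Index time: partition the space into data-aware clusters ("ZIP codes")
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(vectors)

query = rng.normal(size=(1, 64)).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: find the nearest centroid (compare against 100 centroids, not 10,000 vectors)
nearest_cluster = kmeans.predict(query)[0]

# Stage 2: brute-force search only inside that cluster's members
members = np.where(kmeans.labels_ == nearest_cluster)[0]
scores = vectors[members] @ query[0]
print(members[np.argsort(-scores)[:10]])   # top-10 document IDs within the cluster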
Imagine we have 10 million movie vectors in our database. A user’s taste profile is represented as a query vector, and we need to find movies they’ll like. The naive approach — calculating cosine similarity between the user’s vector and all 10 million movies — would take several seconds. This is unacceptable for real-time search.
The Geographic Insight: Think about how the US Postal Service solved a similar problem. They could have given every address a unique random number, but finding all addresses near you would require checking every single address in America. Instead, they created ZIP codes — a clustering system that groups nearby addresses together.
Step-by-Step: How ZIP Codes Inspire Vector Clustering
Step 1: Understanding Natural Clustering in Movies
Just as people don’t live uniformly across America (dense in cities, sparse in rural areas), movies aren’t distributed uniformly in vector space:
- Dense Region 1: Science Fiction cluster (Star Wars, Star Trek, Blade Runner, The Matrix, Interstellar, etc.)
- Dense Region 2: Romantic Comedies cluster (When Harry Met Sally, Sleepless in Seattle, You’ve Got Mail, etc.)
- Dense Region 3: Horror cluster (The Shining, Halloween, Nightmare on Elm Street, etc.)
- Sparse Regions: Experimental films, documentaries spread thinly
This is exactly like population distribution:
- Manhattan: 70,000 people per square mile → Many small ZIP codes (10001, 10002, 10003…)
- Wyoming rural area: 5 people per square mile → Few large ZIP codes covering vast areas
Step 2: Creating Movie “ZIP Codes” with K-Means
Let’s create 1,000 clusters (our “ZIP codes”) for our 10 million movies:
Initialization:
1. Randomly select 1,000 movies as initial cluster centers
- Cluster 001: "Star Wars" (happens to be in the Sci-Fi region)
- Cluster 002: "The Notebook" (in the Romance region)
- Cluster 003: "The Godfather" (in the Crime Drama region)
- …and so on
Assignment Phase (First Iteration):
For each of the 10 million movies:
- Calculate the distance to all 1,000 cluster centers
- Assign the movie to the nearest center
Example:
- "The Empire Strikes Back" → nearest to Cluster 001 (Star Wars)
- "Return of the Jedi" → also nearest to Cluster 001
- "Star Trek" → also Cluster 001 (similar sci-fi vectors)
- "Titanic" → nearest to Cluster 002 (romance elements)
After first assignment:
- Cluster 001 has 50,000 sci-fi movies (dense region!)
- Cluster 456 has only 200 avant-garde documentaries (sparse region)
Update Phase:
For each cluster:
- Calculate the mean vector of all assigned movies; this becomes the new cluster center
- Cluster 001 new center = average([Star Wars, Empire Strikes Back, Star Trek, …50,000 movies])
- This center now represents "quintessential sci-fi"
Iteration 2-N: The algorithm repeats:
- Reassign movies to nearest (possibly new) centers
- Recalculate centers
- Some movies switch clusters as centers move
- Continue until clusters stabilize (typically 10–50 iterations)
Final Result: Just like ZIP codes adapt to population density:
- Sci-Fi Dense Region: Gets 50+ small, specific clusters
- Cluster 001: Space Opera (Star Wars-like)
- Cluster 023: Hard Sci-Fi (Interstellar-like)
- Cluster 045: Cyberpunk (Matrix-like)
- Cluster 067: Time Travel (Back to the Future-like)
- Documentary Sparse Region: Gets just 2–3 large clusters
- Cluster 456: All nature documentaries
- Cluster 789: All historical documentaries
Step 3: The Search Process — Finding Movies for the User
Now, when the user (who loves sci-fi) arrives with their query vector:
Phase 1: Find the User’s Cluster (100 microseconds)
User’s query vector = [sci-fi elements, action, space themes…]
Compare it to the 1,000 cluster centroids (not 10 million movies!)
Nearest centroid = Cluster 001 (Space Opera)
This is like looking up which ZIP code contains an address — fast lookup in a small index.
Phase 2: Search Within the Cluster (10 milliseconds)
Cluster 001 contains 50,000 movies (not 10 million!)
Calculate cosine similarity with these 50,000
Return top 10:
1. Star Wars (0.98 similarity)
2. The Empire Strikes Back (0.97)
3. Battlestar Galactica (0.95)
4. The Orville (0.94)
5. Guardians of the Galaxy (0.93)
…
This is better, but it suffers from a hard boundary problem. If a query vector lands near the edge of Cluster A, but its true nearest neighbor is just across the boundary in Cluster B, this method will never find it.
4.4 Graph-Based Search: HNSW
The current state-of-the-art in ANN is Hierarchical Navigable Small Worlds (HNSW), a graph-based approach.
1. The Graph Layer (Navigable Small World — NSW)
The base layer is a proximity graph. Each vector (node) is connected to its k nearest neighbors (edges), creating a network of “local neighborhood roads”.
- Navigation: A search starts at a random entry point. It checks the node’s neighbors and “greedily” moves to the neighbor that is closest to the query vector. This is repeated until a local minimum is found (a node whose neighbors are all farther away from the query).
2. The Hierarchical Innovation
A single-layer graph is still too slow to traverse at a billion-node scale. HNSW adds a hierarchy of “highways” and “interstates”.
- Layer 0: All vectors, with “local road” connections.
- Layer 1: A subset (e.g., 10%) of vectors, with longer-range “highway” connections.
- Layer 2: A smaller subset (e.g., 1%) of vectors, with “interstate” connections.
The HNSW Search Algorithm:
- Start at the Top Layer (e.g., Layer 2, “interstates”). This graph has very few nodes and can be traversed quickly to find the general region of the query.
- Drop to Layer 1 (“highways”). Using the closest node from Layer 2 as the entry point, refine the search location.
- Drop to Layer 0 (“local roads”). Using the closest node from Layer 1, perform the final greedy navigation on the full graph to find the true nearest neighbors.
This hierarchical navigation provides O(log N) search and insertion complexity. It solves the boundary problem (graph edges can cross any region) and the traversal problem (the hierarchy provides fast, long-range “jumps”), making billion-scale vector search feasible.
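In practice, you rarely implement HNSW by hand; libraries such as hnswlib (and the HNSW indexes in FAISS, Lucene, and most vector databases) expose it directly. A small usage sketch with hnswlib, where the M, ef_construction, and ef values are common illustrative settings rather than prescribed defaults:
import numpy as np
import hnswlib   # assumes the hnswlib package is installed

dim, num_docs = 768, 100_000
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(num_docs, dim)).astype(np.float32)

# Build the hierarchical graph index
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, M=16, ef_construction=200)
index.add_items(doc_vectors, np.arange(num_docs))

# ef trades recall for speed at query time (breadth of the greedy search)
index.set_ef(50)

query = rng.normal(size=(1, dim)).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels[0], distances[0])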
Part 5: Hybrid Search — Combining Strengths
Neither lexical nor vector search is a complete solution. Lexical search is precise but brittle; vector search is semantic but fuzzy. The optimal solution, hybrid search, combines both to get the best of both worlds.
5.1 The Failure Modes
The “The Hobbit” query provides a perfect illustration of their respective failures.
Lexical Search Failure
- Query: “The Hobbit”
- Problem: Lexical search (like BM25) is high-precision, low-recall. It finds exact matches for “Hobbit” perfectly. But after exhausting those, its ranking degrades to matching the common stopword “The,” returning irrelevant junk like “The Good, The Bad, and The Ugly”. It also completely fails on conceptual queries.
Vector Search Failure
- Query: “The Hobbit”
- Problem: Vector search is high-recall, lower-precision. It correctly finds “The Hobbit” and the semantically related “Lord of the Rings.” However, it can “drift” into thematically similar but incorrect results, such as “Harry Potter” (fantasy, but the wrong franchise) or “Avatar”. It also struggles with precision for exact terms or numbers.
The inescapable conclusion is that a robust system must use both.
5.2 Reciprocal Rank Fusion (RRF)
Combining the two systems is difficult because their scores (e.g., BM25 scores and cosine similarity) are on completely different, incompatible scales.
Reciprocal Rank Fusion (RRF) is an elegant solution that ignores the scores and combines the results based only on their rank.
The RRF Formula
RRF_score(d) = Σ 1 / (k + rank_i(d)), summed over each retrieval system i, where rank_i(d) is the document’s rank in system i and k is a constant (typically 60).
The k constant is crucial. It acts as a dampening factor, giving a strong boost to documents that appear in the top ranks of multiple systems (a sign of consensus) without overly prioritizing a single #1 result.
Example Calculation: Document “Lord of the Rings”:
- Lexical rank: 8 (not exact match)
- Vector rank: 2 (semantically similar)
- RRF = 1/(60+8) + 1/(60+2) = 0.0147 + 0.0161 = 0.0308
This document scores higher in hybrid than in either system alone.
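RRF is only a few lines of code. The sketch below fuses two ranked lists of document IDs (the IDs themselves are illustrative):
def rrf_fuse(rankings, k=60):
    # rankings: one ranked list of doc IDs per retrieval system (best first)
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

lexical = ["the_hobbit", "the_good_the_bad_and_the_ugly", "the_thing"]
vector  = ["the_hobbit", "lord_of_the_rings", "harry_potter"]
print(rrf_fuse([lexical, vector]))   # the consensus pick "the_hobbit" ranks first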
5.3 Advanced Hybrid Strategies
The spectrum of retrieval strategies:
- Pure Lexical: Boolean, TF-IDF, BM25
- Lexical + Expansion: Synonyms, query rewriting
- Sparse + Dense Reranking: Lexical retrieval, neural reranking
- Weighted Hybrid: Linear combination of scores
- Learned Hybrid: ML model combines signals
- Pure Dense: Vector similarity only
5.4 Query-Dependent Strategy Selection
Different query types benefit from different strategies:
Known-Item Search (user knows exactly what they want):
- Example: “iPhone 13 Pro Max 256GB Sierra Blue”
- Best: Lexical search (exact matching crucial)
Exploratory Search (user discovering options):
- Example: “gift ideas for teenage nephew”
- Best: Vector search (semantic understanding)
Navigational Search (finding specific pages):
- Example: “Facebook login”
- Best: Lexical + popularity signals
Research Search (comprehensive information):
- Example: “climate change effects on agriculture”
- Best: Hybrid with diverse retrieval
Part 6: Reflected Intelligence — Learning from Users
6.1 The Feedback Loop
Every user interaction provides signal:
- Query: User expresses intent
- Results: System returns candidates
- Behavior: User clicks, ignores, purchases
- Learning: System updates understanding
This creates a virtuous cycle of continuous improvement.
6.2 Signals Boosting: Popularized Relevance
For frequent queries, aggregate behavior reveals relevance:
Algorithm:
for query in top_queries:
    results = get_search_results(query)      # current ranked results for this query
    clicks = get_historical_clicks(query)    # aggregated click signals from past users
    boost_factor = calculate_boost(clicks)   # e.g., normalize clicks per document
    rerank_results(results, boost_factor)    # promote documents users actually chose
Real Example: Query: “python”
- Without boosting: Python (snake) documentation ranks high
- With boosting: Python.org (programming language) ranks #1
- Reason: 99% of users want programming, not reptiles
6.3 Collaborative Filtering and Personalization
Build embeddings from user behavior:
User-Item Interaction Matrix:
        Item1  Item2  Item3  Item4
User1     1      0      1      0
User2     1      1      0      0
User3     0      0      1      1
Matrix Factorization:
- Decompose into User × Feature and Feature × Item matrices
- Learn latent features explaining interactions
- Generate user and item embeddings
Personalized Search:
score = α × relevance + β × user_embedding · item_embedding
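A minimal NumPy sketch of this idea, factorizing the toy interaction matrix above with a truncated SVD and blending the resulting user-item affinity into the relevance score (the α and β weights, and the use of SVD rather than ALS or another learned factorization, are illustrative assumptions):
import numpy as np

# Toy user-item interaction matrix (rows: users, columns: items)
interactions = np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1]], dtype=float)

# Factorize into user and item embeddings with 2 latent features
u, s, vt = np.linalg.svd(interactions, full_matrices=False)
user_emb = u[:, :2] * s[:2]     # one 2-d embedding per user
item_emb = vt[:2, :].T          # one 2-d embedding per item

def personalized_score(relevance, user_vec, item_vec, alpha=0.7, beta=0.3):
    # score = alpha * relevance + beta * (user . item)
    return alpha * relevance + beta * float(user_vec @ item_vec)

print(personalized_score(0.8, user_emb[0], item_emb[2]))   # User1 x Item3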
6.4 Learning to Rank with Click Models
Not all clicks are equal — position matters:
Position Bias Model:
P(click | position, relevance) = P(examined | position) × P(click | examined, relevance)
Observed Click-Through Rates by Position:
- Position 1: 35% CTR
- Position 2: 22% CTR
- Position 3: 15% CTR
- Position 10: 2% CTR
Debiasing Process:
- Estimate examination probability by position
- Infer true relevance from clicks
- Train ranking model on debiased data
- Deploy improved ranker
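A toy sketch of the inverse-propensity idea behind step 2; the examination probabilities and click counts below are invented for illustration:
# Estimated P(examined | position), e.g., learned from randomized experiments
examine_prob = {1: 0.95, 2: 0.70, 3: 0.55, 10: 0.10}

def debiased_relevance(clicks, impressions, position):
    # Divide observed CTR by the examination probability for that position
    ctr = clicks / impressions
    return ctr / examine_prob[position]

# A 2% CTR at position 10 implies more relevance than the raw CTR suggests
print(round(debiased_relevance(clicks=20, impressions=1000, position=10), 2))  # 0.2
print(round(debiased_relevance(clicks=350, impressions=1000, position=1), 2))  # 0.37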
6.5 Click Models and Relevance Inference
Different click models capture different user behaviors:
Cascade Model: Users examine results sequentially
P(click_i) = P(relevant_i) × Π(1 − P(relevant_j)) for j < i
Dependent Click Model: Clicks depend on previous clicks
P(click_i | click_history) = complex function of past behavior
Time-Aware Model: Dwell time indicates satisfaction
Relevance ∝ time_spent_on_result / expected_time
6.6 Online Learning and Exploration
Multi-Armed Bandit Approach:
- Explore: Show diverse results to learn
- Exploit: Show best-known results
- Balance: ε-greedy, Thompson sampling, UCB
A/B Testing at Scale:
- Interleaving: Mix results from two algorithms
- Preference inference from clicks
- Statistical significance with millions of queries
Part 7: Most Production Systems Use Multiple Approaches
- BM25 for precise keyword matching
- Vector search for semantic understanding
- Knowledge graphs for entity relationships
- Business rules for domain factors
Reflected Intelligence: The Feedback Loop
This is the key to building systems that improve over time.
The Process
- User searches for “iPad”
- System returns results based on current ranking model
- User interacts with results (clicks, purchases, ignores)
- System logs everything:
- User ID
- Query
- Results shown (and their order)
- Actions taken (clicks, add-to-cart, purchases)
- Signal processing and ML: Aggregate signals across many users to detect patterns
- Generate learned relevance model
- Plug model back into system to improve future searches
This creates a continuous cycle: search → results → actions → learning → improved search.
The Home Depot Analogy
Imagine asking a store associate where to find hammers. They say “Aisle 11.” You go there, don’t find hammers, wander around, and eventually find them elsewhere.
Most search engines are like that associate repeating “Aisle 11” thousands of times, never learning.
A self-learning system is like an associate who notices:
- People ask for hammers
- They go to Aisle 11
- They wander away
- They emerge from Aisle 7 with hammers
After seeing this pattern, the associate should start saying “Aisle 7.”
That’s reflected intelligence. The system notices its mistakes and improves.
Applicability to Agents
This applies to AI agents too. If an agent:
- Runs a search
- Gets poor results
- Tries variations repeatedly
- Eventually finds good results
Your search system can learn: “When I see this type of query with these filters and boost factors, results are better.” Future agent queries get good results faster, requiring fewer cycles.
Behavioral Signals and User Understanding
In Thursday’s session, we’ll dive deep into using behavioral data:
- Signals boosting: Learning to rank popular results
- Click models: Understanding what clicks mean
- Collaborative filtering: “Users who liked X also liked Y”
- Matrix factorization: Creating behavioral embeddings
- Personalization: Tailoring results per user
This is where theory meets reality. You take actual user interactions — messy, noisy, but rich with information — and extract patterns that dramatically improve relevance.
Key Takeaways
1. Domain Matters More Than Algorithms
The best generic algorithm can’t compete with good domain-specific relevance factors. Understand your domain deeply.
2. There Are Three Dimensions to Relevance
Content understanding, domain understanding, and user understanding. You need all three.
3. Embeddings Are Powerful But Not Magic
Generic LLM embeddings fail in specialized domains. You need domain-specific training or semantic knowledge graphs.
4. Self-Learning Is a Journey, Not a Destination
No one has truly “finished” building a self-learning search system. It’s continuous improvement.
5. Hybrid Approaches Win
Combine keyword search, vector search, knowledge graphs, business rules, and behavioral signals. Don’t rely on one technique.
6. User Behavior Is Gold
If you have user interaction data and you’re not using it to improve your system, you’re leaving massive performance gains on the table.
7. Start Simple, Iterate
Begin with basic keyword search + domain rules. Add sophistication incrementally. Measure everything.