Introduction
Search retrieval methods broadly fall into two streams: sparse lexical retrieval and dense semantic retrieval (see Fig. 1). Sparse methods, such as BM25, rely on exact keyword matches; they represent text as high-dimensional sparse vectors, where each dimension corresponds to a term, and only observed terms have non-zero weights. This yields precise, interpretable results but fails when the query and document use different words for the same concept. Dense methods, on the other hand, encode texts as low-dimensional dense vectors to capture semantic similarity beyond exact words. Dense retrieval can match synonyms or related concepts, but it sacrifices interpretability and often requires heavy neural infrastructure (ANN indexes, large models) to search efficiently.
Fig 1 — Sparse lexical retrieval and Dense semantic retrieval
Hybrid search (Fig. 2) is one approach that runs both sparse and dense models and merges their results to get the best of both worlds. For example, a hybrid system might retrieve candidates with BM25 and then re-rank them with a BERT-based semantic model. While effective, maintaining two separate systems increases complexity. Ideally, we want a single approach that combines lexical precision with semantic awareness.
Fig 2 — Hybrid Search
MiniCOIL (Mini Contextualized Inverted List) is proposed as such a hybrid solution in the sparse domain. It’s a sparse neural retriever that augments traditional term-based ranking with contextualized meaning. In essence, MiniCOIL retains the efficiency and transparency of lexical term matching while injecting a small dose of semantics into the term representation. The result is a lightweight model that can distinguish word meanings in context, improving ranking quality without requiring dense vector search. In this article, we’ll dive deep into how MiniCOIL works, why it was developed, and how to apply it in practice.
Limitations of Existing Approaches
Before MiniCOIL, researchers explored several avenues to make lexical search more semantic. Let’s discuss why those approaches haven’t been fully satisfactory:
- Traditional Sparse Retrieval (Lexical) — Methods like BM25 treat each term as an atomic unit with a single weight (Fig. 3a).
Fig 3a — Sparse Retrieval
- This works well for precise matching, but it cannot handle polysemy or synonyms. For example, a query for “bank” will retrieve documents containing “bank” whether they refer to a financial institution or a river bank — there’s no mechanism to tell the meanings apart. Likewise, if a user searches “heart attack symptoms” and a document uses the term “myocardial infarction”, pure lexical matching fails to connect the two. In summary, sparse lexical retrieval has high precision but low recall for semantically equivalent terms, and it’s oblivious to context or word sense (Fig. 3b illustrates this challenge).
Fig 3b — Sparse Neural Retrieval
- Dense Semantic Retrieval — Using dense vector embeddings (e.g. Sentence Transformers) can recall semantically related text even if no words overlap. However, dense methods typically require an approximate nearest neighbor index and significant compute at query time. They can also retrieve conceptually similar docs that may not contain the query terms, which sometimes harms precision. Moreover, dense results are not easily interpretable (you can’t highlight why a result matched since it’s not via specific terms). Many real-world systems resort to a hybrid of BM25 + dense reranker to balance these aspects, accepting the extra complexity.
- Neural Sparse Retrievers — A recent line of research tries to marry the two above by learning to output sparse vectors that include semantic signals. Examples include models like SPLADE, TILDE, and uniCOIL. These models still produce a vector over the vocabulary space (so retrieval uses an inverted index), but they learn to adjust weights or add new terms using transformer encoders. SPLADE in particular expands queries and documents by predicting additional related terms (e.g. adding synonym tokens) with learned weights, thereby improving recall. In the classic example, an SPLADE-expanded query for “fruit bat” might automatically include terms like “flying fox” or “vampire”, which helps match documents that use those terms. This expansion trick allows SPLADE to beat BM25 on many benchmarks by bridging the vocabulary gap. However, the improvements come at a cost: SPLADE’s inference is heavy and slow (it has to predict many possible expansions with a transformer). The expanded representations are high-dimensional (making the index large), and blindly adding synonyms can introduce false positives if word sense isn’t handled (e.g. adding “apple” as a synonym for “fruit” might retrieve documents about the company Apple erroneously). In short, SPLADE and similar neural sparse methods often end up not truly lightweight or robust — they either depend on large models at query time or are tuned to specific training data.
- Out-of-Domain Generalization — Another limitation is that many learned sparse models are trained with a supervised relevance objective on a particular dataset (e.g. MS MARCO). They tend to overfit to the language and query patterns of that domain. As a result, when you apply them to a different domain (say legal or medical text), their performance can degrade significantly. The model might have baked in assumptions from the training set that don’t hold generally. This is a known issue for dense retrievers too, but for sparse models it’s exacerbated when the training involved relevance labels: the model may learn dataset-specific term importance that doesn’t transfer. Ideally, we want a method that can work zero-shot on new domains without costly fine-tuning.
Motivation for MiniCOIL
One fundamental limitation of BM25 is that it cannot differentiate ambiguous terms or polysemous words. If a query contains a word with multiple meanings, BM25 will retrieve any document containing that word, regardless of context. For example, search engines often struggle with queries like “python” (programming language vs. snake) or “mercury” (planet vs. chemical element vs. Roman god). BM25 has no way to prioritize documents with the intended sense — it treats the term as one and the same (Fig. 4). This leads to poor precision when the query term is ambiguous.
Fig 4 — Problem with BM25
A natural idea is to somehow incorporate word meaning into the retrieval representation. If our index could represent not just that a document contains the word “bank”, but also whether that “bank” is used in the finance sense or the river edge sense, we could dramatically improve relevancy. Essentially, we want to teach the retriever to perform word-sense disambiguation. Prior attempts to solve this in a sparse framework included manually maintaining domain-specific synonym lists or using context-sensitive query expansion — but these are labor-intensive or error-prone.
MiniCOIL’s core idea is to give each term a bit more expressive power in the index. Instead of representing each word by a single scalar weight, MiniCOIL represents each occurrence of a word by a small vector of numbers that encodes the context (Fig. 5).
Fig 5 — More expressive power
In other words, each token in text is assigned a meaning vector rather than just a weight. This vector is learned to reflect the nuanced sense of the word in that specific sentence. For example, the word “bat” might be associated with one vector when it appears in a sentence about baseball, and a very different vector when it appears in a sentence about nocturnal animals.
How big are these meaning vectors? The design is to keep them tiny — e.g. 4 to 8 dimensions per word — just enough to capture key semantic “shades”. Early experiments found that a small vector (like 4 numbers) per token can encode several distinct senses if trained properly. This is a huge reduction from the 768-dimensional contextual embeddings output by BERT, for instance. By limiting the vector size, we ensure the index doesn’t blow up in size and remains efficient.
Figure 6 illustrates this concept: rather than one dimension for term importance, a word gets a handful of dimensions to represent multiple aspects of meaning. Think of it like giving each term a few “slots” to fill with different context signals. Most words won’t use all slots — they might concentrate on one or two dimensions corresponding to their specific sense in the document. Another way to view it: BM25 uses a single number (like a TF-IDF weight) to say how important this term is in this document, whereas MiniCOIL uses a small vector to say in what sense this term is being used. The importance (term frequency, etc.) is still accounted for separately (by BM25), orthogonal to this meaning vector.
Fig 6 — How to express meaning: a meaning vector per term
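To make the contrast from Fig. 6 concrete, here is a tiny sketch (the numbers are made up for illustration) of what a document representation looks like under each scheme:

```python
# BM25-style sparse representation: one importance weight per term
bm25_doc = {"bank": 2.3, "approved": 1.1, "loan": 1.7}

# MiniCOIL-style representation: a small (k=4) meaning vector per term, saying
# *in what sense* the term is used; term importance is still handled by BM25.
minicoil_doc = {
    "bank":     [0.8, 0.0, 0.1, 0.6],   # the financial sense of "bank"
    "approved": [0.7, 0.5, 0.4, 0.2],
    "loan":     [0.6, 0.4, 0.0, 0.7],
}
```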
By enriching terms with a mini vector, we can continue to use an inverted index but now with semantic discrimination. The challenge then becomes: how to integrate these vectors into the retrieval scoring? MiniCOIL solves this by a clever encoding strategy which we explain next.
Understanding COIL and Its Limitations
MiniCOIL was inspired by an earlier model called COIL (Contextualized Inverted List). To appreciate MiniCOIL’s design, let’s first see what COIL did and why it fell short.
COIL Recap: COIL was a breakthrough attempt to attach contextual semantic information to inverted index entries. Under COIL, each token in the corpus is processed in context by a transformer (e.g. BERT). But instead of collapsing the entire representation to one weight, COIL assigns each token a vector (e.g. 32-dimensional) representing its contextualized embedding. These token vectors are then stored in the index postings. At query time, COIL does the following for scoring: for each query token q, look at the document’s tokens with the same term, take the dot product between the query token’s vector and each document token’s vector, and use the maximum (or sum) of those as the token’s match score. Finally, sum up scores for all query terms to get a relevance score. Essentially, COIL performs a late interaction: matching query and document tokens only if they are lexically the same, but computing similarity based on their contextual vectors rather than a static weight.
This approach means COIL can distinguish, for example, a query term “bark” in the context “dog’s bark” vs. “tree bark”. The query token’s vector will encode the context (dog vs tree). When matching to a document that contains “bark”, the dot product will be high only if the document’s “bark” token vector encodes a similar context. If the document’s “bark” is the wrong sense, the vectors will be dissimilar and contribute little to the score. This was a big improvement in principle — COIL preserves a lot of BERT’s semantic richness at the token level, unlike simple bag-of-words.
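As a rough sketch of COIL’s late-interaction scoring (with hand-made 4-dimensional vectors for illustration; real COIL uses higher-dimensional token vectors derived from BERT):

```python
import numpy as np

def coil_score(query_tokens, doc_tokens):
    """Toy COIL-style scoring: match only lexically identical tokens, compare
    their contextual vectors, take the max per query token, then sum."""
    score = 0.0
    for q_term, q_vec in query_tokens:
        sims = [float(np.dot(q_vec, d_vec)) for d_term, d_vec in doc_tokens if d_term == q_term]
        if sims:
            score += max(sims)
    return score

# "bark" in a dog context vs. a tree context (hand-made vectors for illustration)
query    = [("bark", np.array([0.9, 0.1, 0.0, 0.2]))]   # query: "dog's bark"
doc_dog  = [("bark", np.array([0.8, 0.2, 0.1, 0.1]))]   # document about dogs
doc_tree = [("bark", np.array([0.1, 0.1, 0.9, 0.7]))]   # document about trees

print(coil_score(query, doc_dog))    # ~0.76: same sense, strong match
print(coil_score(query, doc_tree))   # ~0.24: different sense, weak match
```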
Why COIL Fell Short: In practice, COIL faced several issues that prevented wide adoption:
- Index Storage and Complexity: Storing a 32-d vector for every term occurrence massively increases index size and complicates the index structure. Traditional inverted index implementations (like in Elasticsearch or Lucene) aren’t built to store and compare vectors in postings. COIL essentially requires custom code to perform those dot products at query time, iterating through many vectors. This is much slower than a simple integer weight comparison and negates the usual efficiency of posting lists.
- Incompatibility with Standard Inverted Indexes: Because of the above, COIL can’t be easily plugged into existing search engines. It’s neither pure sparse nor pure dense — it needs hybrid handling (vector comparisons per posting), which typical search engines do not support out-of-the-box. This limited COIL to research settings or very custom implementations.
- Tokenization Issues: COIL operated at the level of BERT subword tokens. This can be problematic (Fig. 7). For example, a word like “internationalization” might be split into “international ##ization” as two tokens — COIL would assign each a vector. A query for “internationalization” might similarly split. If the splits don’t align perfectly or if a word is OOV (out-of-vocabulary), the matching becomes inconsistent. Also, matching at the subword level is less interpretable — you’d prefer to match whole words. COIL’s use of subwords made it harder to reason about which “word” was actually being matched in results, and could introduce mismatch errors if a concept was broken into pieces.
Fig 7- Tokenization problem with COIL
- Supervised Training Requirement: COIL (and many sparse neural IR models of that era) were trained with a relevance objective — typically using labeled query-document pairs to learn those token vectors (often via a distillation or pairwise training against BM25). This means COIL’s effectiveness depended on the availability of training data, and it could overfit to the style of queries it saw during training. As noted, such models often don’t generalize well out-of-domain, because they learn term weights specific to the training corpus or query set.
In summary, COIL introduced the notion of contextual token vectors in retrieval, which was powerful, but it wasn’t practical at scale. The memory overhead and complexity were too high, and reliance on labeled data hurt robustness. MiniCOIL set out to solve exactly these issues: how can we flatten or compress the COIL idea into something that fits in a normal inverted index, and can we train it without needing explicit relevance labels?
MiniCOIL Architecture
```
QUERY                      DOCUMENT A                  DOCUMENT B
"bank loan"                "bank approved loan"        "sat by river bank"
      |                          |                           |
[ BERT + PROJECTION ]      [ BERT + PROJECTION ]       [ BERT + PROJECTION ]
      |                          |                           |
      v                          v                           v
Query Vector               Doc A Vector                Doc B Vector
For "bank": [0.9, 0.1]     For "bank": [0.8, 0.1]      For "bank": [0.1, 0.9]
(Financial)                (Financial)                 (Nature)
      |                          |                           |
[ FLATTENING PROCESS: MAPPING VECTORS TO INDEX TERMS ]
      |                          |                           |
      v                          v                           v
Query Terms                Doc A Index                 Doc B Index
"bank_0": 0.9              "bank_0": 0.8               "bank_0": 0.1
"bank_1": 0.1              "bank_1": 0.1               "bank_1": 0.9
```
MiniCOIL’s architecture can be seen as a streamlined, more feasible version of COIL. It keeps the concept of per-token semantic vectors but makes them much smaller and integrates them with standard indexing. The two main innovations are vector flattening and a novel training mechanism (covered in the next section). Here, we focus on how MiniCOIL represents and indexes tokens.
Flattening Contextual Vectors (Fig. 8): The term “flattening” refers to converting each token’s small meaning vector into multiple separate index terms so that we can use a regular inverted index for storage and scoring. Concretely, suppose we decide on a vector dimension k (say k=4) for MiniCOIL. For each vocabulary word w, we assign k distinct “slots” or sub-terms: w_1, w_2, …, w_k. These can be thought of as unique token identifiers in the index. When a document is indexed:
Fig 8 — Flattening of COIL
- We run a transformer encoder over the text to get contextual embeddings for each token.
- For each token (word) w in the document, we apply a small trained projection to produce a k-dimensional meaning vector v_w = [v_{w,1}, v_{w,2}, …, v_{w,k}].
- Instead of storing w with a single weight, we flatten this by creating k index entries: one for each slot w_i with an associated weight v_{w,i}.
The document’s sparse representation thus includes these w_i terms. Most of them will have non-zero weight since the vector is dense (albeit small). The same process is done for queries at query time to produce the query’s representation.
Detailed Example
Traditional inverted indexes store: term → (document_id, weight). But neural models produce multi-dimensional vectors for each token, which don’t fit this structure directly.
Vector Flattening: A Concrete Example
Let’s say we have k=4 dimensions and we’re indexing the word “neural” in a document.
Step 1: Generate Contextual Vector
After passing the document through a transformer encoder, the token “neural” gets a 4-dimensional meaning vector:
v_neural = [0.8, 0.3, 0.1, 0.6]
but inverted indexes can only store:
- A single term (like “neural”)
- Pointing to documents with one weight per document
Problem: How do we store 4 different numbers (the vector dimensions) in a system that expects just one number per term?
Step 2: Creating Slot Identifiers (The Key Trick)
Think of it this way: Instead of treating “neural” as one thing in the index, we artificially create 4 versions of it.
Conceptual Mapping
```
Original vector:

neural → [0.8, 0.3, 0.1, 0.6]
           ↓    ↓    ↓    ↓
         dim1 dim2 dim3 dim4

We create 4 "fake" words:

neural → neural_1 (represents dimension 1)
         neural_2 (represents dimension 2)
         neural_3 (represents dimension 3)
         neural_4 (represents dimension 4)
```
Why Create These Slots?
Each dimension of the vector captures a different aspect of meaning. By creating separate index terms for each dimension, we can:
- Store all 4 numbers in a standard inverted index
- Query each dimension independently
- Use existing search infrastructure
Step 3: Flattening into Index Entries (The Transformation)
Now let’s see exactly how this gets stored in the inverted index.
Standard Inverted Index Format
A normal inverted index looks like:
```
Term       → List of (Document, Weight)
----------------------------------------
"neural"   → [(doc_5, 2.3), (doc_12, 1.8), ...]
"network"  → [(doc_5, 1.7), (doc_8, 2.1), ...]
```
MiniCOIL Inverted Index (After Flattening)
Instead, MiniCOIL creates:
```
Term      → List of (Document, Weight)
----------------------------------------
neural_1  → [(doc_5, 0.8), ...]
neural_2  → [(doc_5, 0.3), ...]
neural_3  → [(doc_5, 0.1), ...]
neural_4  → [(doc_5, 0.6), ...]
```
The Mapping in Detail
```
Vector Position    → Slot Name → Index Entry
---------------------------------------------------------
v_neural[0] = 0.8  → neural_1  → neural_1: (doc_5, weight=0.8)
v_neural[1] = 0.3  → neural_2  → neural_2: (doc_5, weight=0.3)
v_neural[2] = 0.1  → neural_3  → neural_3: (doc_5, weight=0.1)
v_neural[3] = 0.6  → neural_4  → neural_4: (doc_5, weight=0.6)
```
Complete Worked Example
Let’s index a full sentence: “neural networks” in document doc_5.
Original Process (What We Want to Store)
```
After transformer encoding:

Token: "neural"   → Vector: [0.8, 0.3, 0.1, 0.6]
Token: "networks" → Vector: [0.7, 0.2, 0.5, 0.4]
```
Flattening Process (How We Actually Store It)
For “neural”:
```
Original: "neural" with vector [0.8, 0.3, 0.1, 0.6]

Create slots:
  neural_1 = 0.8  (dimension 1 value)
  neural_2 = 0.3  (dimension 2 value)
  neural_3 = 0.1  (dimension 3 value)
  neural_4 = 0.6  (dimension 4 value)

Index entries:
  neural_1 → (doc_5, 0.8)
  neural_2 → (doc_5, 0.3)
  neural_3 → (doc_5, 0.1)
  neural_4 → (doc_5, 0.6)
```
For “networks”:
```
Original: "networks" with vector [0.7, 0.2, 0.5, 0.4]

Create slots:
  networks_1 = 0.7
  networks_2 = 0.2
  networks_3 = 0.5
  networks_4 = 0.4

Index entries:
  networks_1 → (doc_5, 0.7)
  networks_2 → (doc_5, 0.2)
  networks_3 → (doc_5, 0.5)
  networks_4 → (doc_5, 0.4)
```
Visual Representation of the Index:
```
INVERTED INDEX (Simplified View)
================================
Term         Documents & Weights
-------      -------------------
neural_1   → doc_5: 0.8, doc_12: 0.6, doc_23: 0.9
neural_2   → doc_5: 0.3, doc_12: 0.4, doc_23: 0.2
neural_3   → doc_5: 0.1, doc_12: 0.3, doc_23: 0.1
neural_4   → doc_5: 0.6, doc_12: 0.7, doc_23: 0.8
networks_1 → doc_5: 0.7, doc_8: 0.5
networks_2 → doc_5: 0.2, doc_8: 0.3
networks_3 → doc_5: 0.5, doc_8: 0.6
networks_4 → doc_5: 0.4, doc_8: 0.7
```
The Magic: How Querying Works
When you query “neural”:
1. The query term “neural” also gets encoded into a vector:

```
q_neural = [0.9, 0.2, 0.0, 0.5]
```

2. The system looks up all 4 slots:

```
Look up neural_1 in the index → finds (doc_5, 0.8)
Look up neural_2 in the index → finds (doc_5, 0.3)
Look up neural_3 in the index → finds (doc_5, 0.1)
Look up neural_4 in the index → finds (doc_5, 0.6)
```

3. Compute the score by multiplying the query vector with the document vector:

```
Score = (0.9 × 0.8) + (0.2 × 0.3) + (0.0 × 0.1) + (0.5 × 0.6)
      = 0.72 + 0.06 + 0.0 + 0.3
      = 1.08
```
Why This Is Clever
Before flattening: We had a vector that traditional search systems couldn’t handle.
After flattening: We have multiple traditional index terms that existing search infrastructure can process efficiently, while still preserving the semantic information from the neural model.
The vocabulary effectively expands by a factor of k (if k=4, “neural” becomes 4 terms), but we can now use fast inverted index lookups and scoring.
```python
# ============================================================================
# MiniCOIL Document Indexing: Walkthrough with Example
# ----------------------------------------------------------------------------
# Illustrative sketch: the encoder and per-token projections below are random
# stand-ins so the script runs end-to-end; the real MiniCOIL model uses a
# trained transformer and learned projection weights.
# ============================================================================
from collections import defaultdict
import numpy as np

K_DIMENSIONS = 4            # MiniCOIL uses a small k (e.g. 4 or 8)
SPARSITY_THRESHOLD = 0.05   # only keep weights above this threshold
DOC_ID = "doc_123"          # current document identifier


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer - a stand-in for a real word-level tokenizer."""
    return text.split()


class DummyEncoder:
    """Stand-in for a contextual encoder such as BERT.
    Returns a deterministic 768-d vector per token. Unlike a real encoder it
    ignores context, which is enough to demonstrate the indexing mechanics;
    with a real model, "bank" in "The bank approved the loan" and "bank" in
    "I sat by the river bank" would get different embeddings."""
    def encode(self, tokens: list[str]) -> list[np.ndarray]:
        return [np.random.default_rng(abs(hash(t))).standard_normal(768) for t in tokens]


def minicoil_projection(token: str, embedding: np.ndarray, k: int) -> np.ndarray:
    """Compress the 768-d contextual embedding into a k-d meaning vector.
    In the real model this is a learned, token-specific projection matrix;
    here it is simulated with a random (but token-deterministic) matrix."""
    rng = np.random.default_rng(abs(hash("proj_" + token)))
    projection_matrix = rng.standard_normal((k, len(embedding))) * 0.1
    return projection_matrix @ embedding


def relu(vector: np.ndarray) -> np.ndarray:
    """max(0, x) element-wise: zeroes out negative values to enforce sparsity."""
    return np.maximum(0, vector)


transformer = DummyEncoder()
inverted_index = defaultdict(list)   # term_id -> [(doc_id, weight), ...]


# STEPS 1-6: tokenize, encode, project, apply ReLU, flatten, add to the index
def index_document(doc_id: str, text: str) -> None:
    tokens = tokenize(text)                        # Step 1: tokenization
    embeddings = transformer.encode(tokens)        # Step 2: contextual embeddings
    for token, embedding in zip(tokens, embeddings):
        # Steps 3-4: token-specific projection to k dims, then ReLU for sparsity
        meaning_vec = relu(minicoil_projection(token, embedding, K_DIMENSIONS))
        # Step 5: flattening - each dimension becomes its own index term
        for dim_idx, weight in enumerate(meaning_vec):
            if weight > SPARSITY_THRESHOLD:        # skip near-zero weights
                inverted_index[f"{token}_{dim_idx}"].append((doc_id, float(weight)))


index_document(DOC_ID, "The bank approved the loan")

# STEP 7: visualize the final index
for term_id in sorted(inverted_index):
    print(f"{term_id:15s} -> {inverted_index[term_id]}")
# Output shape (actual weights depend on the random stand-in projections):
#   approved_0   -> [('doc_123', ...)]
#   bank_0       -> [('doc_123', ...)]
#   bank_3       -> [('doc_123', ...)]
#   ...


# BONUS: querying uses the exact same encode -> project -> flatten steps,
# then standard inverted-index lookups and weight multiplication.
def search(query_text: str) -> dict[str, float]:
    scores = defaultdict(float)                    # doc_id -> accumulated score
    tokens = tokenize(query_text)
    embeddings = transformer.encode(tokens)
    for token, embedding in zip(tokens, embeddings):
        query_vec = relu(minicoil_projection(token, embedding, K_DIMENSIONS))
        for dim_idx, q_weight in enumerate(query_vec):
            if q_weight > SPARSITY_THRESHOLD:
                for doc_id, d_weight in inverted_index.get(f"{token}_{dim_idx}", []):
                    scores[doc_id] += q_weight * d_weight   # dot product, one dim at a time
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))


print(search("bank loan"))   # {'doc_123': <positive score>} - the only indexed document

# Key takeaways:
#   1. Vector -> multiple terms: "bank" becomes bank_0 ... bank_3, one per dimension.
#   2. The index structure stays standard (term -> postings), but the terms now
#      carry neural semantic information.
#   3. ReLU plus a threshold keep the representation sparse and the index small.
#   4. Each word has its own learned projection ("bank" is compressed differently
#      than "loan").
#   5. Queries are processed exactly like documents, so standard inverted-index
#      lookup and scoring work unchanged.
```

Summary of the Code Flow

Input: “The bank approved the loan”

Output: index entries like:

```
bank_0 → (doc_123, 0.8)
bank_1 → (doc_123, 0.3)
loan_0 → (doc_123, 0.6)
...
```
By flattening in this way, we reduce the problem back to a standard term matching scenario. A query token w will also be projected to 𝐯_w and produce query terms w_1,…,w_k with certain weights. The retrieval score between a query and document can then use the normal BM25 formula but applied to these sub-terms. Essentially, the BM25 scoring will multiply term frequency with IDF as usual, but now the term frequency is represented by the value of the meaning vector component. In formula form, MiniCOIL’s scoring can be described as:
score(Q, D) = Σ_{q ∈ Q ∩ D} IDF(q) × BM25_importance(q, D) × MeaningSim(q, D)
Where:
- BM25_importance is the normal term frequency saturation factor from BM25 (and any document length normalization).
- MeaningSim represents the semantic similarity between the query term’s usage and the document term’s usage. For MiniCOIL, this is effectively the dot product of the query term’s meaning vector and the document term’s meaning vector: summing over the k flattened dimensions, ∑ᵢ₌₁ᵏ v^Q_{q,i} × v^D_{q,i}, computes exactly that dot product.
MeaningSim(q, D) = ∑ᵢ₌₁ᵏ v^Q_{q,i} × v^D_{q,i}
If the query and document use term q in the same sense, their small vectors will align and the dot product (MeaningSim) will be high (close to 1), thus boosting the score. If the senses differ, the vectors will be nearly orthogonal, yielding a low dot product, which effectively down-weights that term match. In the extreme, if a word was not seen in training (no learned vector), MiniCOIL defaults to treating MeaningSim as 1 (neutral), which reduces to standard BM25 for that term — meaning MiniCOIL will never perform worse than BM25 for exact matches, it only modulates them down when context suggests a mismatch.
By integrating this meaning similarity into the scoring formula, MiniCOIL achieves a hybrid of lexical and semantic matching. Yet importantly, it does so without requiring any exotic index structure: the flattened terms approach allows using a regular inverted index (with term IDs like “bank_0”) and a slightly modified scoring script. This flattening is the practical trick that makes MiniCOIL usable. The COIL model’s operations (vector dot products) are essentially pre-computed into these weighted term postings. The runtime search just does what BM25 would — intersect postings and sum up weights — albeit on a slightly expanded vocabulary (each real word spawns a few sub-tokens).
Example: Suppose “bank” has a 2-dimensional meaning vector in MiniCOIL. The word “bank” in a financial context might be represented as v_bank^(finance) = [0.9, 0.1]. In a river context, “bank” might be v_bank^(river) = [0.2, 0.8]. Index entries for a financial document containing “bank” would include something like ("bank_0": 0.9, "bank_1": 0.1). A query “river bank” might produce for “bank”: ("bank_0": 0.2, "bank_1": 0.8). When scoring, BM25 sees that the query term “bank_0” matches the doc’s “bank_0” with weight 0.9 (and similarly “bank_1” with 0.1 vs 0.8). The summed contribution ≈ 0.2 × 0.9 + 0.8 × 0.1 = 0.26 (times IDF etc.) is relatively low, indicating that the doc’s “bank” doesn’t match the query sense well. If the document were about rivers, its “bank” vector would have been closer to [0.2, 0.8], and the score contribution would be higher (0.2 × 0.2 + 0.8 × 0.8 = 0.68). This way, the retrieval naturally ranks the river-related document higher for the query about a river bank.
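The arithmetic above takes only a few lines to reproduce. In the sketch below, idf and bm25_importance default to 1.0 as placeholders for the usual BM25 statistics:

```python
import numpy as np

def minicoil_term_score(q_vec, d_vec, idf: float = 1.0, bm25_importance: float = 1.0) -> float:
    """Score contribution of one matched term: BM25-style importance and IDF,
    modulated by the meaning-vector dot product (MeaningSim)."""
    meaning_sim = float(np.dot(q_vec, d_vec))
    return idf * bm25_importance * meaning_sim

query_bank  = np.array([0.2, 0.8])   # "bank" in the query "river bank" (river sense)
doc_finance = np.array([0.9, 0.1])   # "bank" in a financial document
doc_river   = np.array([0.2, 0.8])   # "bank" in a river document

print(minicoil_term_score(query_bank, doc_finance))  # ≈ 0.26 -> sense mismatch, down-weighted
print(minicoil_term_score(query_bank, doc_river))    # ≈ 0.68 -> sense match, boosted
```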
Through this flattening strategy, MiniCOIL elegantly sidesteps COIL’s compatibility issues. Storage overhead is minimal: if we use 4 dimensions, we effectively quadruple the number of postings, but 4x a sparse index is still very manageable (and far less than storing full 768-d vectors!). Searching is fast because it’s just sparse dot-products like usual. Next, we’ll explore how MiniCOIL learns these meaning vectors in the first place — notably, it does so without needing any relevance labels.
Training Without Relevance Objective
One of the most novel aspects of MiniCOIL is its training methodology. Instead of training on query-document relevance pairs, MiniCOIL is trained in a self-supervised manner purely on unlabeled text. The goal of training is to teach the model to assign similar meaning vectors to a word when it appears in similar contexts, and dissimilar vectors when the context differs (Fig. 9). This approach eliminates the need for a predefined relevance objective or human-labeled data.
Fig 9- Training without Relevance Objective
The training scheme is inspired by the classic skip-gram or word2vec idea, but adapted to modern contextual embeddings. In word2vec, a model is trained to predict nearby words given a target word. In MiniCOIL, we instead try to predict the contextual meaning of the sentence given a target word’s mini-vector. More concretely, MiniCOIL uses a triplet loss setup (Fig. 10a and 10b):
Fig 10a — Getting Rid of the Relevance Objective
Fig 10b — Training Objective
Fig 10c — MiniCOIL Training Objective
- Anchor: Pick a specific word occurrence in some sentence as the anchor (e.g. the word “bat” in sentence A).
- Positive: Find another sentence that contains the same word (“bat”) used in a similar context/meaning as the anchor (sentence B where “bat” likely means the same thing as in A).
- Negative: Find a sentence with the same word (“bat”) but used in a different meaning (sentence C where “bat” has the opposite or a very different sense).
We want the MiniCOIL embedding of “bat” in sentence A (anchor) to be close to that in sentence B (positive) and far from that in sentence C (negative). By close and far, we mean in terms of cosine or dot-product similarity of the mini vectors. This can be optimized with a standard triplet loss: for anchor a, positive p, and negative n, we enforce sim(a, p) ≥ sim(a, n) + margin.
If the similarity difference is less than the margin, we incur a loss pushing the anchor and positive closer and/or anchor and negative farther apart.
The clever bit is how to choose these triplets at scale. The training data is a large corpus of unlabeled text (around 40 million sentences from OpenWebText). The authors pre-computed embeddings for all these sentences using a pretrained sentence encoder (a model that gives a single vector per sentence) and stored them in a vector database (Qdrant). Now, given a target word (say “bat”), they can query this database for sentences containing “bat” and retrieve their vector representations. Using the vector similarity between those sentence embeddings as a proxy for contextual similarity, they select positive and negative examples:
- The anchor is some random sentence with “bat”.
- The positive is chosen as another “bat” sentence whose sentence embedding is very close to the anchor’s embedding (meaning they talk about similar things, likely the same sense of “bat”).
- The negative is chosen as a “bat” sentence whose embedding is far from the anchor’s (likely a different topic/meaning).
This process effectively uses the sentence embedding model (which could be a model like Sentence-BERT or similar) to provide weak labels of “same meaning” vs “different meaning” for the word in question. It’s not perfect, but in aggregate it should group usage by sense. For example, sentences about flying animals with “bat” will cluster together, separate from sentences about baseball and “bat”. There’s an assumption that a good sentence-level semantic model can distinguish those contexts — which is generally true for a well-trained model on diverse data.
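A minimal sketch of this mining step, using a sentence-transformers encoder and an in-memory similarity search as stand-ins (the actual setup precomputed the embeddings and stored them in Qdrant):

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer  # any sentence encoder works here

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mine_triplet(word: str, corpus: list[str]):
    """Pick (anchor, positive, negative) sentences for `word`, using sentence-embedding
    similarity as a weak proxy for 'same sense' vs. 'different sense'."""
    candidates = [s for s in corpus if word in re.findall(r"[a-z]+", s.lower())]
    if len(candidates) < 3:
        return None
    embs = encoder.encode(candidates, normalize_embeddings=True)
    anchor_idx = np.random.randint(len(candidates))
    sims = embs @ embs[anchor_idx]          # cosine similarity to the anchor
    sims[anchor_idx] = -np.inf              # exclude the anchor from the positive pick
    positive_idx = int(np.argmax(sims))     # most similar context -> likely same sense
    sims[anchor_idx] = np.inf               # exclude the anchor from the negative pick
    negative_idx = int(np.argmin(sims))     # least similar context -> likely different sense
    return candidates[anchor_idx], candidates[positive_idx], candidates[negative_idx]

corpus = [
    "She hit a home run with the baseball bat.",
    "He swung the bat and missed the pitch.",
    "The bat flew out of the cave at night.",
]
print(mine_triplet("bat", corpus))
```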
Once the triplet (A, B, C) is selected, we feed those sentences into the MiniCOIL model (which itself uses a transformer under the hood) and extract the mini-vector for the target word from each. Then we apply the triplet loss, pulling the anchor and positive closer than anchor-negative. By sampling millions of such triplets across many words, the model gradually learns to position each word’s mini-embeddings in a space where different senses are separated.
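In PyTorch terms, one training step on such a triplet might look roughly like the sketch below, where `minicoil_model(word, sentence)` is assumed to return the k-dimensional meaning vector of `word` in that sentence (the real training may differ in details, e.g. how similarities and margins are defined):

```python
import torch
import torch.nn.functional as F

def triplet_step(minicoil_model, optimizer, word: str,
                 anchor_sent: str, positive_sent: str, negative_sent: str,
                 margin: float = 0.2) -> float:
    """One self-supervised step on an (anchor, positive, negative) triplet."""
    v_a = minicoil_model(word, anchor_sent)    # k-dim vector of `word` in the anchor
    v_p = minicoil_model(word, positive_sent)  # ... in the positive sentence
    v_n = minicoil_model(word, negative_sent)  # ... in the negative sentence

    sim_ap = F.cosine_similarity(v_a, v_p, dim=-1)
    sim_an = F.cosine_similarity(v_a, v_n, dim=-1)

    # Hinge-style triplet loss: penalize whenever sim(a, p) does not
    # exceed sim(a, n) by at least `margin`
    loss = torch.clamp(margin - sim_ap + sim_an, min=0).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```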
Fig 10d — Triplet Loss
Here’s a code sketch of the training pipeline for a given word, starting with the one-time data preparation:
```python
# ============================================================================
# MiniCOIL Training: Self-Supervised Learning Without Relevance Labels
# ============================================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Dict
import random

# ============================================================================
# STEP 0: DATA PREPARATION (One-Time Setup)
# ============================================================================
class SentenceCorpusPreprocessor:
    """
    Preprocess the corpus by:
      1. Extracting all sentences
      2. Building a word -> sentences index
      3. Computing sentence embeddings for similarity lookups

    EXAMPLE DATA FLOW
    -----------------
    Input corpus file contains:
        "The bank approved my loan application."
        "I sat by the river bank and watched the sunset."
        "She hit a home run with the baseball bat."
        "The bat flew out of the cave at night."
        ... (40 million more sentences)

    After preprocessing:
        self.sentences = [...]                  # all sentences, by index
        self.word_to_sentences = {
            "bank": [0, 1],                     # "bank" appears in sentences 0 and 1
            "bat":  [2, 3],                     # "bat" appears in sentences 2 and 3
            "approved": [0],
            "loan": [0],
            ...
        }
        self.sentence_embeddings = [...]        # one vector per sentence (e.g. 384-d)

    Note: sentences 0 and 1 (both containing "bank") get embeddings that
    reflect their semantic difference:
      - sentence 0 (financial): embedding focuses on finance/business
      - sentence 1 (river):     embedding focuses on nature/geography
    """

    def __init__(self, sentence_encoder_model):
        # sentence_encoder_model: a pretrained sentence encoder (e.g. Sentence-BERT)
        # that returns a single vector per sentence
        self.sentence_encoder = sentence_encoder_model
        self.word_to_sentences = defaultdict(list)   # word -> [sentence indices]
        self.sentences = []                          # list of all sentences
        self.sentence_embeddings = []                # precomputed embeddings

    def tokenize(self, sentence: str) -> List[str]:
        """Simple lowercase whitespace tokenizer."""
        return sentence.lower().split()

    def preprocess_corpus(self, corpus_path: str):
        """Read the corpus, index words, and compute sentence embeddings."""
        # [1/3] Read all sentences
        with open(corpus_path, "r") as f:
            for line in f:
                sentence = line.strip()
                if sentence:
                    self.sentences.append(sentence)
        print(f"Loaded {len(self.sentences):,} sentences")

        # [2/3] Build the word -> sentences index
        for sent_idx, sentence in enumerate(self.sentences):
            for word in set(self.tokenize(sentence)):   # each word once per sentence
                self.word_to_sentences[word].append(sent_idx)
            if (sent_idx + 1) % 1_000_000 == 0:
                print(f"Indexed {sent_idx + 1:,} sentences")
```