Before building any AI solution, take one minute to ask yourself: what is the underlying human process?
And if I've made you curious, let me ask you a few questions.
How many times have you felt that your technical decisions weren't good enough?
How many times have you felt impostor syndrome because you weren't sure about your decisions?
We're living in an era where everything moves at lightning speed. Everything must be done yesterday. We've stopped focusing on process, on understanding things in depth. We ship fast, we iterate fast, we burn out fast.
I've been there. Many times.
And I think this is part of my mission now: to tell you that it's okay not to be up to date with every new framework and model release. It's okay to feel impostor syndrome. What's not okay is letting that pressure push you into building things you don't understand, or building them purely because of hype.
This series is about slowing down just enough to build something that actually works.
PS: AI is not everything. We still bring our own brain, personality, and identity, and that remains irreplaceable.
In Part 1, I asked you to leave your technical ego at home. Now let's see what happens when you actually do that.
Last time, we talked about the Process Archaeologist framework, understanding the human process before building any AI solution. Today, we translate that understanding into something concrete: the data model.
Let's do an exercise. I'm going to show you a real system I built for manufacturing service manuals: 500+ page PDFs filled with torque specifications, hydraulic diagrams, fault codes, and maintenance procedures.
The kind of documents where getting the wrong answer could cost someone a finger. (Mostly a joke.)
But yes: complicated documents, with plenty of diagrams, tables, images, and cross-references between pages.
The data model I'll show you didn't come from a whiteboard session about "best practices." It came from watching technicians flip through manuals, curse at tables that span three pages, and squint at diagrams while grease dripped from their hands.
Let's think and create.
Before I show you any code, let me describe what we're dealing with.
A service manual for industrial telehandlers (think massive forklifts) contains:
Specifications tables: Torque values, fluid capacities, pressure ratings, often specific to model variants (642, 943, 1055, 1255)
Procedural instructions: Step-by-step maintenance and repair procedures
Diagrams and figures: Exploded parts views, hydraulic schematics, wiring diagrams
Fault codes: Diagnostic tables with SPN/FMI codes and descriptions
Safety warnings: CAUTION and WARNING blocks that must not be missed
Here's what makes this domain particularly challenging:
Content spans pages: A torque specification table might start on page 35 and continue through page 37.
Figures reference text, text references figures: "Tighten bolt to 85 Nm (see Figure 3-2)."
Model-specific variations: The same procedure differs between the 642 and 1255 models.
Tables are dense with critical data: Miss one value, and you've overtorqued a hydraulic fitting.
Traditional RAG approaches (chunk by 512 tokens, embed, retrieve) would destroy this content. A chunk boundary could land in the middle of a torque table. A figure reference would point to nothing. Model-specific instructions would get mixed together.
The data model has to understand this domain.
By the end of Part 2, you'll understand:
Why pages (not token chunks) are the right retrieval unit for technical documents
How a three-page sliding window preserves cross-page context without processing entire documents at once
A complete table extraction pipeline: structure preservation → metadata extraction → semantic flattening
How to construct embedding text that captures domain-specific relationships
Multi-vector indexing strategies for hybrid retrieval
Here's the system architecture we'll build. Let's unpack each decision.
The first architectural decision: what is your retrieval unit?
Most RAG tutorials say "chunk your documents into ~500 token pieces." This advice works fine for blog posts and documentation. It fails catastrophically for technical manuals.
Here's why I chose full pages as my retrieval unit:
Traditional RAG:
PDF → Split by tokens → Embed chunks → Retrieve fragments
Manufacturing RAG:
PDF Page → Extract structure → Tables + Figures + Text → Unified page metadata → Embed
Rationale:
Visual coherence: A page is how technicians actually consume information. They see the whole page, not a text fragment.
Structural integrity: Tables stay intact. Figure references stay near their figures.
Multimodal grounding: I can store the page image alongside extracted content, enabling visual verification.
The extraction pipeline processes each page independently, producing this folder structure:
scratch/service_manual_long/
└── page_36/
    ├── page_36_full.png               # Full page scan
    ├── metadata_page_36.json          # Basic extraction info
    ├── context_metadata_page_36.json  # Rich semantic metadata
    ├── tables/
    │   ├── table-36-1.html            # Structured table
    │   └── table-36-1.png             # Table image
    ├── images/
    │   ├── image-36-1.png             # Extracted figure
    │   └── image-36-2.png
    └── text/
        └── page_36_text.txt           # OCR text content
Each page becomes a self-contained unit with all its modalities preserved.
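To make "self-contained unit" concrete, here's a minimal sketch of how such a page record could be loaded back from that folder. PageRecord and load_page_record are hypothetical helpers I'm using for illustration, not code from the project:

from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class PageRecord:
    """One retrieval unit: a page with all of its modalities."""
    page_number: int
    image_path: Path   # full page scan
    text: str          # OCR text
    metadata: dict     # rich semantic metadata
    table_html: dict   # table_id -> structured HTML
    figure_paths: list # extracted figure images

def load_page_record(root: Path, page_number: int) -> PageRecord:
    page_dir = root / f"page_{page_number}"
    metadata = json.loads((page_dir / f"context_metadata_page_{page_number}.json").read_text())
    text = (page_dir / "text" / f"page_{page_number}_text.txt").read_text()
    tables = {p.stem: p.read_text() for p in (page_dir / "tables").glob("*.html")}
    figures = sorted((page_dir / "images").glob("*.png"))
    return PageRecord(
        page_number=page_number,
        image_path=page_dir / f"page_{page_number}_full.png",
        text=text,
        metadata=metadata,
        table_html=tables,
        figure_paths=figures,
    )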
Here's where domain understanding pays off.
When a technician opens a service manual to page 36, they don't see page 36 in isolation. They've been reading from page 34. They'll continue to page 37. Context flows across page boundaries.
A torque specification table that starts on page 35 doesn't suddenly become irrelevant when you turn to page 36; it's still the same table, still the same procedure.
This insight led to my most important data modeling decision: extract metadata using a three-page sliding window.
When processing page N, the system simultaneously analyzes:
Page N-1 (previous page)
Page N (current page)
Page N+1 (next page)
Here's the actual extraction function signature:
def extract_metadata_from_page(
    litellm_client: LitellmClient,
    image_path_n: str,                 # Current page image
    image_path_n_1: str,               # Previous page image
    image_path_n_plus_1: str,          # Next page image
    metadata_page_n_1_path: str,       # Previous page metadata
    metadata_page_n_path: str,         # Current page metadata
    metadata_page_n_plus_1_path: str,  # Next page metadata
    page_n_1_text_path: str,           # Previous page OCR text
    page_n_text_path: str,             # Current page OCR text
    page_n_plus_1_text_path: str,      # Next page OCR text
) -> str:
The LLM receives:
Three page images → visual context
Three OCR text files → textual context
Three metadata files → structural context (what tables/figures exist on each page)
This enables the model to understand:
"This table is a continuation from page 35"
"This procedure continues onto page 37"
"Figure 35-4 on the previous page is related to this text block"
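Before we look at the output, here's a sketch of how a driver loop might feed that function across a whole document, clamping the window at the first and last page. The loop and the file-naming convention are assumptions on my side; only extract_metadata_from_page comes from the actual pipeline:

def run_sliding_window_extraction(
    litellm_client: LitellmClient, page_root: str, total_pages: int
) -> None:
    """Run the three-page extraction for every page, clamping at document edges."""
    for n in range(1, total_pages + 1):
        prev_n = max(1, n - 1)            # page 1 reuses itself as "previous"
        next_n = min(total_pages, n + 1)  # last page reuses itself as "next"
        extract_metadata_from_page(
            litellm_client,
            image_path_n=f"{page_root}/page_{n}/page_{n}_full.png",
            image_path_n_1=f"{page_root}/page_{prev_n}/page_{prev_n}_full.png",
            image_path_n_plus_1=f"{page_root}/page_{next_n}/page_{next_n}_full.png",
            metadata_page_n_1_path=f"{page_root}/page_{prev_n}/metadata_page_{prev_n}.json",
            metadata_page_n_path=f"{page_root}/page_{n}/metadata_page_{n}.json",
            metadata_page_n_plus_1_path=f"{page_root}/page_{next_n}/metadata_page_{next_n}.json",
            page_n_1_text_path=f"{page_root}/page_{prev_n}/text/page_{prev_n}_text.txt",
            page_n_text_path=f"{page_root}/page_{n}/text/page_{n}_text.txt",
            page_n_plus_1_text_path=f"{page_root}/page_{next_n}/text/page_{next_n}_text.txt",
        )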
The extracted metadata includes explicit cross-page relationships:
{
  "content_elements": [
    {
      "type": "table",
      "element_id": "table-560-1",
      "title": "Electrical System Fault Codes",
      "cross_page_context": {
        "continued_from_previous_page": true,
        "continues_on_next_page": true,
        "related_content_from_previous_page": ["table-559-1"],
        "related_content_from_next_page": ["table-561-1"]
      }
    }
  ]
}
The magic here is that we use the knowledge gained from the discovery meeting to model the data around the client's needs.
The LLM explicitly determines whether content spans pages by comparing what it sees across the three-page window.
The extraction prompt explicitly instructs the model to handle cross-page relationships:
You will receive three consecutive PDF pages:
Previous page (N-1)
Current page (N) (Your primary focus)
Next page (N+1)
Use these to:
- Detect if any content on page N is continued from N-1
or continues onto N+1.
- Extract key entities, warnings, context, or model mentions
that may not be confined to a single page.
- Provide summaries for text blocks and relate them to
tables and figures accurately.
And the schema enforces it:
"cross_page_context": {
"continued_from_previous_page": "<true|false>",
"continues_on_next_page": "<true|false>",
"related_content_from_previous_page": ["<element IDs from N-1>"],
"related_content_from_next_page": ["<element IDs from N+1>"]
}
Why does this matter?
When a user asks "What are the fault codes for the Diesel Exhaust Fluid Dosing Unit?", the retrieval system doesn't just return page 560. It knows that page 560's table continues from 559 and onto 561. The answer generation can pull context from the entire spanning structure.
Here's a question I get asked: "If a table spans 10 pages, don't you lose coherence by treating each page separately?"
The answer is no, and the three-page window is why.
Consider a specification table that spans pages 24 through 27 (like our fluid capacities table). Each page gets processed with its neighbors:
Page 24: sees [23, 24, 25] → knows it starts a new table, continues to 25
Page 25: sees [24, 25, 26] → knows it continues from 24, continues to 26
Page 26: sees [25, 26, 27] → knows it continues from 25, continues to 27
Page 27: sees [26, 27, 28] → knows it continues from 26, table ends here
This creates a chain of overlapping context. Every page knows:
What came before it (from N-1)
What comes after it (from N+1)
That it's part of a larger structure
The metadata captures this explicitly:
// Page 25's metadata
"cross_page_context": {
  "continued_from_previous_page": true,
  "continues_on_next_page": true,
  "related_content_from_previous_page": ["table-24-1"],
  "related_content_from_next_page": ["table-26-1"]
}

// Page 26's metadata
"cross_page_context": {
  "continued_from_previous_page": true,
  "continues_on_next_page": true,
  "related_content_from_previous_page": ["table-25-1"],
  "related_content_from_next_page": ["table-27-1"]
}
At retrieval time, when page 25 is relevant, we can follow the chain: 25 → 24, 25 → 26 → 27.
The table is reconstructable even though each page was processed independently.
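Following that chain is mechanical. A minimal sketch, assuming the per-page metadata has been loaded into a dict keyed by page number; the helper names are mine, not from the project:

def _page_cross_context(page_metadata: dict) -> dict:
    """Merge the cross_page_context flags of every element on a page."""
    merged = {"continued_from_previous_page": False, "continues_on_next_page": False}
    for el in page_metadata.get("content_elements", []):
        ctx = el.get("cross_page_context", {})
        merged["continued_from_previous_page"] |= bool(ctx.get("continued_from_previous_page"))
        merged["continues_on_next_page"] |= bool(ctx.get("continues_on_next_page"))
    return merged

def expand_cross_page_chain(metadata_by_page: dict, start_page: int, max_hops: int = 5) -> list:
    """Follow continued_from / continues_on flags to collect every page a structure spans."""
    pages = {start_page}
    page = start_page
    for _ in range(max_hops):  # walk backwards while the content continues from N-1
        if not _page_cross_context(metadata_by_page.get(page, {}))["continued_from_previous_page"]:
            break
        page -= 1
        pages.add(page)
    page = start_page
    for _ in range(max_hops):  # walk forwards while the content continues onto N+1
        if not _page_cross_context(metadata_by_page.get(page, {}))["continues_on_next_page"]:
            break
        page += 1
        pages.add(page)
    return sorted(pages)

# expand_cross_page_chain(metadata_by_page, start_page=25)  ->  [24, 25, 26, 27]

When page 25 is the top hit, the answer generator can be handed pages 24 through 27 as one coherent context block.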
The same principle scales. For very complex documents, where context dependencies reach further, you can extend the window:
Page N sees: [N-2, N-1, N, N+1, N+2]
This gives each page awareness of content two pages away in either direction. The trade-off: each extraction call carries more context, but also more tokens, more latency, and more cost.
For manufacturing service manuals, the 3-page window captures 95% of cross-page relationships.
The key insight: you don't need to process a 10-page table as a single unit. The sliding window progression means each page carries enough context to reconstruct the whole.
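If you do go wider, the only thing that changes is the neighbor selection. A tiny sketch (window_indices is a hypothetical helper):

def window_indices(n: int, total_pages: int, radius: int = 1) -> list:
    """Pages in the sliding window around page n, clipped to the document bounds."""
    return [i for i in range(n - radius, n + radius + 1) if 1 <= i <= total_pages]

# window_indices(25, total_pages=600, radius=2)  ->  [23, 24, 25, 26, 27]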
Generic metadata (title, date, author) tells you nothing useful for retrieval in specialized domains.
The metadata schema I designed came directly from understanding how technicians think about their work:
Here's a real metadata entry from page 36:
{
  "document_metadata": {
    "document_title": "General Information and Specifications",
    "document_id": "31211033",
    "document_type": "Service Manual",
    "models_covered": ["642", "742", "943", "1043", "1055", "1255"]
  },
  "page_number": "36",
  "page_visual_description": "The page primarily features a large diagrammatic
      illustration of a vehicle, with callouts and symbols indicating specific
      parts and maintenance points. The top includes a header with title
      'General Information and Specifications' and subheading 'b. 1055, 1255'.
      Symbols A and B denote different maintenance actions.",
  "section": {
    "section_number": "2.5.2",
    "section_title": "250 Hour",
    "subsection_number": "",
    "subsection_title": ""
  },
  "content_elements": [
    {
      "type": "figure",
      "element_id": "figure-36-1",
      "title": "Vehicle Maintenance Points for Models 1055 and 1255",
      "summary": "Illustration showing key maintenance points on the vehicle
          for models 1055 and 1255, with callouts indicating lubrication
          and inspection areas.",
      "keywords": ["maintenance", "lubrication", "inspection", "1055", "1255"],
      "entities": ["Model 1055", "Model 1255"],
      "component_type": "Vehicle Maintenance",
      "model_applicability": ["1055", "1255"],
      "application_context": ["maintenance", "inspection"],
      "within_page_relations": {
        "related_figures": [],
        "related_tables": [],
        "related_text_blocks": []
      },
      "cross_page_context": {
        "continued_from_previous_page": true,
        "continues_on_next_page": true,
        "related_content_from_previous_page": ["figure-35-4"],
        "related_content_from_next_page": ["figure-37-1"]
      }
    }
  ]
}
Notice what's captured:
Visual description: the LLM describes what's actually on the page.
Model applicability: this specific figure applies to the 1055 and 1255 models only.
Section context: it's in the "250 Hour" maintenance schedule.
Cross-page flow: this figure is part of a sequence spanning pages 35-37.
Tables deserve their own section. In manufacturing documents, tables contain the most critical information: torque specs, fault codes, fluid capacities, part numbers. Yet they're the hardest content type for RAG systems to handle.
Let me walk you through exactly how we approached this.
Consider this fluid capacities table from page 25 of our service manual. This is what technicians actually deal with: not a simple grid, but a hierarchical specification table with nested categories:
Hydraulic System
└── System Capacity
    ├── 642
    │   ├── No Outriggers        40.2 gallons (152 liters)
    │   └── With Outriggers      41.7 gallons (158 liters)
    ├── 742
    │   └── No Outriggers        40.2 gallons (152 liters)
    └── Reservoir Capacity       23.8 gallons (90 liters)

Axles
└── Differential Housing Capacity
    ├── Front                    7.6 quarts (7.2 liters)
    ├── Rear                     7 quarts (6.6 liters)
    └── Friction Modifier        Not to Exceed 12.2 oz (360 ml)
                                 (Must be premixed with axle fluid)
Now imagine chunking this by tokens. The disasters multiply:
Hierarchy destruction: "642" ends up separated from "No Outriggers"; which model does 40.2 gallons belong to?
Unit confusion: "7.6 quarts" in one chunk, "(7.2 liters)" in another.
Conditional loss: "With Outriggers" separated from its value; now the capacity is ambiguous.
Cross-model mixing: Values for 642 and 742 interleaved nonsensically.
A technician searching "hydraulic fluid capacity for 642 with outriggers" would get garbage.
The first step is extracting tables with their structure intact. We use Docling with TableFormer in ACCURATE mode:
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

class DoclingOCRStrategy:
    def __init__(self):
        self.pipeline_options = PdfPipelineOptions()
        self.pipeline_options.do_table_structure = True
        self.pipeline_options.table_structure_options.do_cell_matching = True
        self.pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
This gives us:
Cell boundaries: exact row/column positions
Header detection: which rows are headers vs data
Merged cell handling: common in spec tables
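For context, this is roughly how those pipeline options plug into Docling's converter. A sketch based on Docling's documented API; the input path is a placeholder, and the HTML/PNG export step is only summarized in the final comment:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("service_manual.pdf")  # placeholder path
# result.document now carries the recognized table structure; from there the
# pipeline exports each table to HTML plus a PNG snapshot, as shown below.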
Every table gets saved in two formats:
page_560/
└── tables/
    ├── table-560-1.html  # Structured HTML
    └── table-560-1.png   # Visual snapshot
The HTML preserves the full structure:
<table>
  <tbody>
    <tr>
      <th>SPN</th>
      <th>FMI</th>
      <th>Fault Code</th>
      <th>SPN Description</th>
      <th>Description</th>
    </tr>
    <tr>
      <td>5745</td>
      <td>3</td>
      <td>4168</td>
      <td>Aftertreatment 1 Diesel Exhaust Fluid Dosing Unit Heater</td>
      <td>Voltage Above Normal, or Shorted to High</td>
    </tr>
    <!-- ... more rows ... -->
  </tbody>
</table>
Raw HTML isn't enough. We need semantic understanding of what the table means.
For each table, we call a vision LLM with both the table image and its HTML:
def generate_table_metadata(
    litellm_client: LitellmClient,
    html_content: str,
    pdf_page: str,  # Full page image for context
) -> dict:
    resp = litellm_client.chat(
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": GENERATE_TABLE_METADATA_PROMPT},
                {"type": "image_url", "image_url": {"url": pdf_page}},
                {"type": "text", "text": f"<html>{html_content}</html>"},
            ],
        }],
        response_format=TableMetadataResponse,
    )
    return resp
The prompt emphasizes context awareness:
Tables in technical documents often depend heavily on their
surrounding context for interpretation. You must analyze nearby
text to accurately understand and summarize the table's purpose.
Critical context may include:
- Section headers, labels, or chapter titles
- Units of measurement and engineering specifications
- Footnotes or annotations
- Product models, standards, or part numbers
- Mentions of illustrations or figures (e.g., "see Fig. 3")
The output schema captures domain-specific metadata:
from typing import List, Optional
from pydantic import BaseModel

class TableMetadataResponse(BaseModel):
    title: str                            # "Fault Codes for Aftertreatment Systems"
    summary: str                          # What the table shows, in context
    keywords: List[str]                   # ["fault codes", "DEF", "aftertreatment"]
    entities: List[str]                   # ["SPN 5745", "Selective Catalytic Reduction"]
    model_name: Optional[str]             # Which machine model
    component_type: Optional[str]         # "Aftertreatment System"
    application_context: List[str]        # ["troubleshooting", "diagnostics"]
    related_figures: List[RelatedFigure]  # Figures that explain this table
Here's actual metadata extracted from the fault codes table:
{
  "title": "Fault Codes for Aftertreatment Systems",
  "summary": "Lists fault codes related to various aftertreatment components,
      including sensors and heaters, indicating issues such as voltage
      abnormalities and data validity problems. Crucial for diagnosing
      and troubleshooting diesel exhaust systems.",
  "keywords": [
    "fault codes", "aftertreatment", "diesel exhaust",
    "sensors", "heaters", "voltage issues", "diagnostics"
  ],
  "entities": [
    "Selective Catalytic Reduction",
    "Diesel Exhaust Fluid Dosing Unit",
    "Outlet Soot Sensor"
  ],
  "component_type": "Aftertreatment System",
  "application_context": ["diesel engines", "emission control", "vehicle diagnostics"]
}
Now we know what this table is about, not just what it contains.
Here's the key insight: tables need to exist in two forms.
Structured: for display and precise lookup (see the lookup sketch below)
Flattened: for semantic embedding and search
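The structured form is what makes exact lookups possible. A minimal sketch of that side, assuming BeautifulSoup for parsing; find_fault_code is a hypothetical helper, not part of the article's pipeline:

from bs4 import BeautifulSoup

def find_fault_code(html: str, spn: str) -> dict:
    """Return the fault-code row whose SPN column matches exactly, as a header->value dict."""
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find_all("th")]
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells and cells[0] == spn:
            return dict(zip(headers, cells))
    return {}

# find_fault_code(html_content, "5745")
# -> {"SPN": "5745", "FMI": "3", "Fault Code": "4168", ...}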
The flattening prompt is deliberately verbose:
FLATTEN_TABLE_PROMPT = """
You are given a table in HTML format.
Your task:
- Flatten this table into a clear, human-readable paragraph.
- Include *every significant piece of data* from the table: titles,
component names, values, units, material names, codes, and properties.
- Preserve structure and associations, e.g. which value corresponds
to which component.
- Do not omit or generalize any rows or key values.
- Ensure no technical information from the table is lost.
Your outputs will be embedded into a vector index for retrieval.
"""
The result transforms hierarchical structure into searchable prose that preserves all relationships:
"The table describes various fluid capacities for different vehicle systems.
The hydraulic system capacities vary: for the 642 model with no outriggers,
it is 40.2 gallons (152 liters), and with outriggers, it is 41.7 gallons
(158 liters). For the 742 model with no outriggers, it remains 40.2 gallons
(152 liters), and the reservoir capacity to the full mark is 23.8 gallons
(90 liters).
The axles' differential housing capacity is 7.6 quarts (7.2 liters) for the
front and 7 quarts (6.6 liters) for the rear, with a friction modifier limit
of 12.2 ounces (360 milliliters) to be premixed with axle fluid. The wheel
end capacity is 1.2 quarts (1.1 liters) for the front and 1.4 quarts
(1.3 liters) for the rear..."
Notice what the flattening accomplished:
Model association preserved: "for the 642 model with no outriggers, it is 40.2 gallons"
Conditional context retained: "with outriggers, it is 41.7 gallons"
Units kept together: "7.6 quarts (7.2 liters)"
Nested relationships explicit: "friction modifier... to be premixed with axle fluid"
Now when someone searches "hydraulic capacity 642 with outriggers", the embedding model matches semantically and precisely.
In the final metadata, each table exists as a complete record with all representations:
{
  "flattened_tables": [
    {
      "table_id": "table-25-1",
      "html_file": "table-25-1.html",
      "html_content": "<table><tbody>...</tbody></table>",
      "flattened_content": "The table describes various fluid capacities..."
    }
  ],
  "table_metadata": [
    {
      "table_id": "table-25-1",
      "title": "Fluid and System Capacities for Vehicle Components",
      "summary": "Provides detailed fluid capacities for various vehicle systems,
          including fuel, cooling, hydraulic, and transmission. Also covers
          differential housing and wheel end capacities with model-specific values.",
      "keywords": [
        "fluid capacities", "hydraulic system", "differential housing",
        "wheel end capacity", "transmission system", "cooling system"
      ],
      "entities": ["642", "742", "ULS 110 HP", "130 HP"],
      "component_type": "Vehicle Systems",
      "application_context": ["vehicle maintenance", "fluid management"],
      "table_file": "table-25-1.html",
      "table_image": "table-25-1.png"
    }
  ],
  "cross_page_context": {
    "continued_from_previous_page": true,
    "continues_on_next_page": true,
    "related_content_from_previous_page": ["table-24-1"],
    "related_content_from_next_page": ["table-26-1"]
  }
}
Notice the cross_page_context: this capacities table actually spans pages 24-27, with different models on each page (642/742 on pages 24-25, 943/1043 on page 26, 1055/1255 on page 27).
Here's what this gives us in practice.
A technician types: "What's the hydraulic fluid capacity for a 642 with outriggers?"
With traditional chunking: Garbage. The table is fragmented. "642" is in one chunk, "With Outriggers" in another, "41.7 gallons" somewhere else. The model hallucinates a number or says "I don't know."
With our approach:
Flattened content matches "hydraulic capacity 642 outriggers" semantically.
Table metadata confirms this is about "Vehicle Systems" and "fluid management."
Model applicability filters to 642-relevant pages.
HTML structure lets us display the exact row with both gallons and liters.
Cross-page context tells us pages 26-27 have values for other models (943, 1043, 1055, 1255).
The answer: "The hydraulic system capacity for the 642 with outriggers is 41.7 gallons (158 liters). Without outriggers, it's 40.2 gallons (152 liters). The reservoir capacity to full mark is 23.8 gallons (90 liters)."
And if the technician follows up with "What about the 1055?", we know exactly where to look: page 27, table-27-1, because the cross-page context mapped the entire capacities section.
That's what good data modeling enables.
You might think: "Just embed the OCR text from each page." Here's what that looks like for page 36:
That's it. That's what the OCR extracted. No context about what "A" and "B" mean. No indication this is a maintenance schedule. No connection to the models it applies to.
A search for "250 hour maintenance 1055" might not even match this page; the semantic signal is too weak.
The metadata we extracted in previous steps contains rich semantic information. The question is: how do we get that into the embedding?
What actually gets embedded? Not the raw page text. Not the raw metadata JSON.
I construct a purpose-built embedding text that combines structured metadata with content:
def build_embedding_text_from_page_metadata(metadata: dict) -> str:
    """Extract structured text from page metadata for embedding generation."""
    doc = metadata.get("document_metadata", {})
    section = metadata.get("section", {})
    page_number = metadata.get("page_number", "")
    content_elements = metadata.get("content_elements", [])

    # Header
    header = [
        f"Document: {doc.get('document_title', '')} "
        f"({doc.get('manufacturer', '')}, Revision {doc.get('document_revision', '')})",
        f"Section: {section.get('section_number', '')} {section.get('section_title', '')}",
        f"Subsection: {section.get('subsection_number', '')} {section.get('subsection_title', '')}",
        f"Page: {page_number}",
    ]

    # Content summaries
    body = []
    for el in content_elements:
        el_type = el.get("type", "")
        title = el.get("title", "")
        summary = el.get("summary", "")
        if el_type == "text_block":
            body.append(f"Text Block: {title}\nSummary: {summary}")
        elif el_type == "figure":
            body.append(f"Figure: {title} – {summary}")
        elif el_type == "table":
            body.append(f"Table: {title} – {summary}")

    # Aggregated metadata for semantic matching
    all_entities = set()
    all_keywords = set()
    all_warnings = set()
    all_models = set()
    for el in content_elements:
        all_entities.update(el.get("entities", []))
        all_keywords.update(el.get("keywords", []))
        all_warnings.update(el.get("warnings", []))
        all_models.update(el.get("model_applicability", []))

    tail = [
        f"Entities: {', '.join(sorted(all_entities))}" if all_entities else "",
        f"Warnings: {', '.join(sorted(all_warnings))}" if all_warnings else "",
        f"Keywords: {', '.join(sorted(all_keywords))}" if all_keywords else "",
        f"Model Applicability: {', '.join(sorted(all_models))}" if all_models else "",
    ]

    return "\n\n".join(part for part in (header + body + tail) if part)
This produces embedding text like:
Document: General Information and Specifications (Revision )
Section: 2.5.2 250 Hour
Page: 36
Figure: Vehicle Maintenance Points for Models 1055 and 1255 –
Illustration showing key maintenance points on the vehicle for
models 1055 and 1255, with callouts indicating lubrication and
inspection areas.
Entities: Model 1055, Model 1255
Keywords: inspection, lubrication, maintenance, 1055, 1255
Model Applicability: 1055, 1255
The embedding captures:
What the page contains (figures, tables, text)
Where it sits in the document structure (section 2.5.2)
Which models it applies to (1055, 1255)
What concepts it covers (lubrication, inspection, maintenance)
Watch these three real queries hit the same system:
"hydraulic system problems" → Semantic. The user wants conceptually related content.
"SPN 5745" → Exact match. This is a specific fault code. Close isn't good enough.
"DEF dosing unit voltage high" → Hybrid. Some terms are exact ("DEF", "voltage"), some are semantic ("dosing unit" ≈ "injection system").
A single embedding strategy forces a tradeoff:
Dense embeddings excel at query 1, fail at query 2 (SPN 5745 embeds similarly to SPN 5746)
Sparse/keyword matching nails query 2, misses query 1 (no exact word overlap)
Neither handles query 3 well alone
You need all three, and a way to combine them.
A single embedding strategy isn't enough for technical content. Different query types need different matching approaches.
I index each page with five vector types:
vectors_config={
    # Dense semantic matching
    "dense": models.VectorParams(
        size=384,  # MiniLM-L6
        distance=models.Distance.COSINE,
    ),
    # Late interaction for precise matching
    "colbert": models.VectorParams(
        size=128,  # ColBERTv2
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
    # OpenAI embeddings at different granularities
    "small-embedding": models.VectorParams(size=128, ...),
    "large-embedding": models.VectorParams(size=1024, ...),
},
# Keyword matching
sparse_vectors_config={
    "sparse": models.SparseVectorParams(...)  # BM42
}
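To give a sense of how a page lands in that collection, here's a sketch of embedding and upserting one page with named vectors. The fastembed model names, collection name, and index_page helper are my assumptions (and I've left out the OpenAI vectors for brevity):

from fastembed import LateInteractionTextEmbedding, SparseTextEmbedding, TextEmbedding
from qdrant_client import QdrantClient, models

dense_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")          # 384-dim
sparse_model = SparseTextEmbedding("Qdrant/bm42-all-minilm-l6-v2-attentions")  # BM42
colbert_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")         # 128-dim multivector

client = QdrantClient(url="http://localhost:6333")  # placeholder

def index_page(page_number: int, embedding_text: str, payload: dict) -> None:
    """Embed the page's purpose-built text and upsert it with all named vectors."""
    dense = next(dense_model.embed(embedding_text)).tolist()
    sparse = next(sparse_model.embed(embedding_text))
    colbert = next(colbert_model.embed(embedding_text)).tolist()
    client.upsert(
        collection_name="service_manual",  # placeholder
        points=[
            models.PointStruct(
                id=page_number,
                vector={
                    "dense": dense,
                    "colbert": colbert,
                    "sparse": models.SparseVector(
                        indices=sparse.indices.tolist(), values=sparse.values.tolist()
                    ),
                },
                payload=payload,  # page metadata, flattened tables, file paths, ...
            )
        ],
    )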
The retrieval pipeline uses hybrid search:
def retrieve(self, query: str, limit: int = 5) -> List[Dict]:
    # Generate all query embeddings
    dense_vector = next(self.dense_model.embed(query)).tolist()
    sparse_vector = next(self.sparse_model.embed(query))
    colbert_vector = next(self.colbert_model.embed(query)).tolist()

    # Prefetch with dense and sparse
    prefetch = [
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(query=sparse_vector, using="sparse", limit=20),
    ]

    # Rerank with ColBERT
    results = self.client.query_points(
        collection_name=self.collection_name,
        prefetch=prefetch,
        query=colbert_vector,
        using="colbert",
        limit=limit,
    )
    return self._format_results(results)
Here's query 3 ("DEF dosing unit voltage high") flowing through the pipeline:
Dense found the right neighborhood. Sparse ensured exact terms weren't missed. ColBERT picked the winner by matching at token granularity.
No single vector type could do this alone.
Complex questions need decomposition before retrieval.
When a technician asks: "What's the difference in axle fluid capacity between models 943 and 1255?"
This isn't one retrieval query. It's multiple:
USER_QUESTION_DECOMPOSITION_PROMPT = """
You are an expert assistant that breaks down complex technical questions.
Decompose the question into reasoning-enhancing sub-questions:
- Fact-finding sub-questions (specifications, values)
- Comparative sub-questions (model differences)
- Contextual sub-questions (maintenance implications)
Each sub-question must be mapped to the correct section of the manual.
"""
The decomposition output:
[
  {
    "sub_question": "What are the front axle differential housing capacities for models 943 and 1255?",
    "section_number": 2,
    "section_title": "General Information and Specifications",
    "matched_chapters": ["Fluid and Lubricant Capacities"]
  },
  {
    "sub_question": "What are the rear axle differential housing capacities for models 943 and 1255?",
    "section_number": 2,
    "matched_chapters": ["Fluid and Lubricant Capacities"]
  },
  {
    "sub_question": "How does the difference in axle fluid capacity affect lubrication frequency?",
    "section_number": 2,
    "matched_chapters": ["Lubrication Schedule", "Service and Maintenance Schedules"]
  }
]
Each sub-question gets retrieved independently, and the results are combined for answer generation.
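A sketch of that fan-out, assuming the retriever from the previous section; the helper name and the shape of each hit (page_number, score) are assumptions on my side:

def retrieve_for_decomposed_question(retriever, sub_questions: list, limit_per_subquestion: int = 5) -> list:
    """Run each sub-question through hybrid retrieval, then merge and deduplicate the hits."""
    merged = {}
    for sq in sub_questions:
        for hit in retriever.retrieve(sq["sub_question"], limit=limit_per_subquestion):
            page_id = hit.get("page_number")
            # keep the best-scoring occurrence of each page across sub-questions
            if page_id not in merged or hit.get("score", 0) > merged[page_id].get("score", 0):
                merged[page_id] = hit
    return sorted(merged.values(), key=lambda h: h.get("score", 0), reverse=True)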
Let's step back and see what we've created.
We started with a problem: 500+ page service manuals filled with spanning tables, cross-referenced figures, and model-specific specifications. Traditional RAG would have shredded this content into meaningless fragments.
Instead, we built a data model that respects the domain.
None of these decisions came from a "RAG best practices" blog post. They came from understanding how technicians actually use service manuals: flipping between pages, tracing figure references, comparing specifications across models.
That's the Process Archaeologist approach in practice. The human workflow shaped every technical choice.
Here's something I want you to sit with: we haven't written a single retrieval query yet.
This entire article, all seven decisions, covers only the ingestion and indexing pipeline. We've taken raw PDFs and transformed them into a richly structured, multi-vector indexed collection in Qdrant.
The retrieval and answer generation? That's where this data model pays off. But it's also where new challenges emerge:
How do you follow cross-page chains at query time?
When do you retrieve the table HTML vs. the flattened text?
How do you handle queries that span multiple document sections?
What does the actual retrieval → reranking → generation pipeline look like in code?
The table extraction pipeline I showed you in Decision 4? That was the overview. In Part 3, we go deeper.
Youβll see:
The complete extraction code, from PDF page to structured HTML to flattened prose
Edge cases that break naive approaches: merged cells, nested headers, tables that span 10+ pages
The prompts that actually work, including the failures that taught me what to include
A working table extraction module you can adapt for your own domain
If this article helped you think differently about RAG data modeling, share it with someone building document AI systems. And if you want to discuss your own domain challenges, I'm always up for a conversation; sometimes the best insights come from seeing how these principles apply to problems I haven't encountered yet.