Retrieval-Augmented Generation (RAG) systems require effective chunking strategies to segment knowledge into retrievable units. While text-based chunking (word, sentence, paragraph boundaries) is well-studied for documents, ontologies present unique challenges due to their semantic structure. This study empirically evaluates 9 chunking strategies—4 text-based and 5 OWL-aware—on a legal domain ontology, measuring similarity scores, answer quality, retrieval consistency, and computational costs.
In my small-scale, independent experiments, I discovered the "Orphan Axiom Problem": 93.8% of axioms in my test ontology were non-hierarchical (individuals, properties, annotations), causing traditional OWL-aware strategies to produce highly unbalanced chunks. In these tests, ModuleExtractionChunking achieved the highest OWL-aware similarity (0.7068) with exceptional consistency (0.0012 variance), while AnnotationBasedChunking (0.7010, 39 chunks) provided fine-grained grouping dependent on naming conventions.
Counter-intuitive finding: SentenceChunking scored highest overall (0.7258) but produced the worst answers by fragmenting entity names across chunks, demonstrating that semantic completeness > mathematical similarity for RAG effectiveness.
We provide a decision framework for strategy selection based on ontology characteristics (ABox/TBox ratio, hierarchy depth, metadata quality) and empirically validate that no single strategy is universally optimal.
- Introduction
1.1 Motivation
Retrieval-Augmented Generation combines retrieval systems with large language models (LLMs) to answer queries using domain-specific knowledge. Chunking—splitting knowledge into retrievable units—directly impacts:
- Context relevance: Whether retrieved chunks contain needed information
- Answer accuracy: Whether LLM receives complete vs. fragmented context
- Query latency: Search time and computational cost
Traditional text-based chunking treats all content uniformly, but ontologies encode rich semantic structure: class hierarchies, property domains, annotation patterns. This structure offers opportunities for semantic-aware chunking but also introduces new failure modes.
1.2 The ABox/TBox Challenge
Ontologies consist of:
- TBox (Terminological Box): Class definitions, hierarchy, property schemas
- ABox (Assertional Box): Instances, property values, individual assertions
Most ontology chunking research assumes TBox-heavy structures (e.g., biomedical classifications), but real-world domain ontologies are typically ABox-heavy. Our legal ontology: 93.8% ABox, 6.2% TBox. This imbalance creates the "Orphan Axiom Problem" where hierarchy-based strategies produce massive, unbalanced chunks.
1.3 Research Questions
RQ1: Do OWL-aware chunking strategies outperform text-based approaches for ontology RAG?
RQ2: What ontology characteristics predict optimal chunking strategy performance?
RQ3: How does the ABox/TBox ratio affect chunking strategy effectiveness?
RQ4: Can we predict chunking strategy performance without exhaustive testing?
- Methodology
3.1 Test Environment
Platform: Protégé 5.6.7 with custom Lucene RAG plugin
Vector Store: Apache Lucene 9.8.0 (KnnFloatVectorField)
Embeddings: OpenAI text-embedding-3-small (1024 dimensions, $0.02/1M tokens)
LLM: GPT-4 (gpt-4-0613, $0.03/1K input tokens)
Hardware: Intel Core i7, 16GB RAM, SSD storage
3.2 Test Ontology: Legal Domain
Axiom Breakdown:
- Total: 195 axioms
- TBox (6.2%):
- SubClassOf: 12 axioms
- Class declarations: 17 axioms
- Property domains/ranges: 14 axioms
- ABox (93.8%):
- Individual assertions: 15 ClassAssertion axioms
- Property assertions: 47 ObjectProperty + 36 DataProperty
- Annotations: 84 rdfs:label axioms
Domain Entities:
- 3 cases: Smith v. Jones (Active), State v. Doe (Trial), Appeal CV-2023-500 (Under Review)
- 3 courts: District, Appellate, Supreme
- 4 judges, 3 lawyers, 3 evidence items, 2 statutes
Hierarchy Depth: 2 levels maximum
Namespaces: 1 (http://www.semanticweb.org/legal#)
3.3 Evaluation Metrics
- Top Similarity Score: Cosine similarity of best-matching chunk (0-1 scale)
- Retrieval Consistency: Variance in top-5 similarity scores (lower = more consistent)
- Answer Quality: Binary (correct/incorrect) manual evaluation
- Chunk Count: Total chunks created
- Chunk Balance: Standard deviation of chunk sizes
- Indexing Time: Time to embed and store all chunks
- Storage Size: Disk space for vector index
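To make the consistency and balance metrics unambiguous, here is a minimal Java sketch of the statistics being computed. This is a hypothetical helper for illustration, not plugin code, and the example scores are made up:

// MetricStats.java — the statistics behind "Retrieval Consistency" and
// "Chunk Balance". Hypothetical helper for illustration only.
import java.util.List;

public class MetricStats {

    // Retrieval Consistency: population variance of the top-k similarity scores.
    static double variance(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return values.stream()
                     .mapToDouble(v -> (v - mean) * (v - mean))
                     .average().orElse(0);
    }

    // Chunk Balance: standard deviation of chunk sizes (axioms or words per chunk).
    static double stdDev(List<Double> values) {
        return Math.sqrt(variance(values));
    }

    public static void main(String[] args) {
        List<Double> top5 = List.of(0.72, 0.70, 0.69, 0.66, 0.64); // hypothetical scores
        System.out.println("Top score: " + top5.get(0));
        System.out.println("Consistency (variance): " + variance(top5));
    }
}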
3.4 Test Query
Primary: "Which cases are currently active?"
Expected Answer: "Smith v. Jones" (Status: Active)
Evaluation Criteria:
- ✅ Correct: Identifies "Smith v. Jones" as active
- ❌ Incomplete: Only mentions "Jones" or partial names
- ❌ Incorrect: Lists wrong cases or misses active case
3.5 Chunking Strategies Tested
Text-Based (4):
- WordChunking (100 words/chunk)
- SentenceChunking (sentence boundaries)
- ParagraphChunking (double newlines)
- FixedSizeChunking (fixed character limit)
OWL-Aware (6 designed, 5 tested):
- ClassBasedChunking (hierarchy groups)
- AnnotationBasedChunking (label prefix groups)
- NamespaceBasedChunking (IRI namespace groups)
- DepthBasedChunking (hierarchy depth levels)
- ModuleExtractionChunking (OWL API modules)
- SizeBasedChunking (fixed axiom count) - not tested
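All strategies are invoked the same way by the plugin. The interface sketched below is inferred from the RagService.java excerpt in Section 9.1 (chunker.chunk(ontology), chunk.getId(), chunk.toOWLString()) and should be read as an assumption about the internal API, not a verbatim copy:

// Common chunker contract, inferred from the usage shown in Section 9.1.
import java.util.List;
import org.semanticweb.owlapi.model.OWLOntology;

interface Chunker {
    List<OWLChunk> chunk(OWLOntology ontology);
}

interface OWLChunk {
    String getId();         // stable chunk identifier, e.g. "module-chunk-1"
    String toOWLString();   // Manchester-syntax rendering used for embedding
}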
- Results
4.1 Performance Overview
| Strategy | Chunks | Top Score | Variance | Answer | Index Time | Storage |
|---|---|---|---|---|---|---|
| Text-Based | | | | | | |
| Word | 58 | 0.7135 | 0.0245 | ✅ Correct | 3.2s | 2.1MB |
| Sentence | 76 | 0.7258 | 0.0389 | ❌ Incomplete | 4.1s | 2.8MB |
| Paragraph | 58 | 0.7141 | 0.0251 | ✅ Correct | 3.3s | 2.1MB |
| FixedSize | ~50 | 0.7141 | 0.0203 | ✅ Correct | 2.8s | 1.8MB |
| OWL-Aware | | | | | | |
| ClassBased | 6 | 0.6964 | 0.0412 | ✅ Correct | 1.2s | 0.4MB |
| AnnotationBased | 39 | 0.7010 | 0.0312 | ✅ Correct | 2.1s | 1.4MB |
| NamespaceBased | 6 | 0.6964 | 0.0412 | ✅ Correct | 1.1s | 0.4MB |
| DepthBased | 3 | 0.6967 | 0.0741 | ✅ Correct | 0.8s | 0.2MB |
| ModuleExtraction | 28 | 0.7068 | 0.0012 | ✅ Correct | 1.8s | 1.0MB |
Note: Index times include embedding API calls (network latency). Storage includes vectors + metadata.
4.2 The Orphan Axiom Problem
Definition: Axioms not part of class hierarchy definitions.
Composition in Legal Ontology:
TBox (12 axioms, 6.2%):
├─ SubClassOf: CivilCase ⊑ Case
├─ SubClassOf: CriminalCase ⊑ Case
└─ SubClassOf: AppellateCase ⊑ Case
... (9 more)
ABox (183 axioms, 93.8%):
├─ ClassAssertion: Case_SmithVsJones : CivilCase
├─ DataPropertyAssertion: caseNumber(Case_SmithVsJones, "CV-2024-001")
├─ DataPropertyAssertion: caseStatus(Case_SmithVsJones, "Active")
└─ AnnotationAssertion: rdfs:label(Case_SmithVsJones, "Smith v. Jones")
… (179 more)
Impact on Chunking:
| Strategy | TBox Chunks | ABox Chunks | Largest Chunk |
|---|---|---|---|
| ClassBased | 5 (small) | 1 (orphan: 183) | 183 axioms (93.8%) |
| DepthBased | 2 (small) | 1 (non-class: 183) | 183 axioms (93.8%) |
| AnnotationBased | N/A | 39 (semantic) | 84 axioms (43.1%) |
| ModuleExtraction | Mixed | Mixed | 132 axioms (67.7%) |
Key Insight: Hierarchy-based strategies (Class, Depth) fail catastrophically on ABox-heavy ontologies, concentrating 93.8% of content into single chunks. This defeats the purpose of chunking.
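This failure mode can be detected before indexing. A minimal diagnostic using the OWL API (version 4.5.26 is already a dependency, Section 9.4); note that the OWL API classifies annotation assertions as neither TBox nor ABox, so the ratio below only approximates the 93.8% figure:

// OrphanRatioCheck.java — flags ABox-heavy ontologies before a
// hierarchy-based strategy is chosen. Sketch; the file name is a placeholder.
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.model.parameters.Imports;

public class OrphanRatioCheck {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager m = OWLManager.createOWLOntologyManager();
        OWLOntology ont = m.loadOntologyFromOntologyDocument(new File("legal.owl"));

        int tbox = ont.getTBoxAxioms(Imports.EXCLUDED).size();
        int abox = ont.getABoxAxioms(Imports.EXCLUDED).size();
        double aboxRatio = abox / (double) (tbox + abox);

        System.out.printf("TBox: %d  ABox: %d  ABox ratio: %.3f%n", tbox, abox, aboxRatio);
        if (aboxRatio > 0.8) {
            System.out.println("Warning: ClassBased/DepthBased will likely produce one orphan blob.");
        }
    }
}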
4.3 Error Analysis: SentenceChunking Failure
The Fragmentation Problem
SentenceChunking achieved the highest similarity (0.7258) but produced incomplete answers.
Example Fragmentation:
Original Entity (Manchester Syntax):
Individual: Case_SmithVsJones
Types: CivilCase
Facts: caseNumber "CV-2024-001",
caseStatus "Active",
filedIn Court_District1
Sentence Chunk A (similarity: 0.6912):
“Individual: Case_SmithVsJones Types: CivilCase Facts: caseNumber "CV-2024-001".”
Sentence Chunk B (similarity: 0.7258, TOP MATCH):
“caseStatus "Active", filedIn Court_District1.”
Sentence Chunk C (similarity: 0.6734):
“AnnotationAssertion(rdfs:label <Case_SmithVsJones> "Smith v. Jones").”
What Went Wrong:
- Query: "Which cases are currently active?"
- Top chunk (B) contains caseStatus "Active" → high similarity ✓
- BUT: Chunk B missing case name/identifier
- LLM sees "caseStatus Active" without context → incomplete answer
Why Other Strategies Avoided This:
- WordChunking: 100-word limit kept entire entity together
- AnnotationBased: Grouped Case_SmithVsJones with all case-related axioms
- ModuleExtraction: Dependency closure included all related properties
Predictability: This failure is predictable from ontology structure:
- Entity average size: ~15 axioms
- Sentence average: ~2 axioms
- If sentence size < entity size → fragmentation risk
4.4 ModuleExtraction: Consistency Analysis
Remarkable Finding: The top-5 similarity scores clustered within a 0.0012 range.
Score Distribution:
Rank 1: 0.7068 (module-chunk-1, 132 axioms, 4 seed entities)
Rank 2: 0.7067 (module-chunk-5, 89 axioms, 3 seed entities)
Rank 3: 0.7062 (module-chunk-12, 107 axioms, 4 seed entities)
Rank 4: 0.7058 (module-chunk-3, 95 axioms, 3 seed entities)
Rank 5: 0.7056 (module-chunk-8, 112 axioms, 4 seed entities)
Variance: 0.0012
Standard Deviation: 0.0035
Why This Matters:
- Any of top-5 chunks could answer the query correctly
- Robust to embedding noise/variation
- Reduces sensitivity to k-nearest neighbors selection
Contrast with AnnotationBased:
Rank 1: 0.7010 (no-annotations chunk, 84 axioms)
Rank 2: 0.6955 (cas prefix chunk, 26 axioms)
Rank 3: 0.6746 (sta prefix chunk, 24 axioms)
Rank 4: 0.6692 (app prefix chunk, 17 axioms)
Rank 5: 0.6543 (fil prefix chunk, 12 axioms)
Variance: 0.0312
Standard Deviation: 0.1766
More variance = higher risk of returning irrelevant chunks.
4.5 Cost-Performance Tradeoffs
Indexing Cost (one-time):
Text-Based:
Sentence (76 chunks): 4.1s, $0.0012 embedding cost
Word (58 chunks): 3.2s, $0.0010 embedding cost
OWL-Aware:
AnnotationBased (39 chunks): 2.1s, $0.0007 embedding cost
ModuleExtraction (28 chunks): 1.8s, $0.0005 embedding cost
ClassBased (6 chunks): 1.2s, $0.0001 embedding cost
Storage Cost (persistent):
Sentence: 2.8MB (0.037 MB/chunk)
Word: 2.1MB (0.036 MB/chunk)
AnnotationBased: 1.4MB (0.036 MB/chunk)
ModuleExtraction: 1.0MB (0.036 MB/chunk)
Query Cost (per query):
All strategies: Single embedding (~0.015s, $0.000001)
Retrieval: O(log n) with HNSW index
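For reference, the per-query retrieval step is a single Lucene approximate-kNN search over the HNSW graph built for KnnFloatVectorField. A minimal sketch; the field name "vector" and the zero-filled query vector are stand-ins:

// KnnSearchSketch.java — the per-query retrieval step in Lucene 9.x.
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class KnnSearchSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("./lucene_index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            float[] queryVector = new float[1024]; // stand-in for the query embedding
            TopDocs top = searcher.search(new KnnFloatVectorQuery("vector", queryVector, 5), 5);
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
            }
        }
    }
}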
Implications:
- ModuleExtraction: Lowest total cost + highest accuracy + best consistency
- AnnotationBased: Moderate cost, metadata-dependent performance
- Sentence: Highest cost + worst accuracy = poor value
- Analysis & Discussion
5.1 RQ1: OWL-Aware vs. Text-Based Performance
Answer: Context-dependent. OWL-aware strategies excel only when ontology structure aligns with chunking logic.
Evidence:
- ModuleExtraction (OWL): 0.7068, best consistency
- AnnotationBased (OWL): 0.7010, requires naming conventions
- WordChunking (text): 0.7135, simplest + high score
- BUT: SentenceChunking (text): 0.7258 score, worst answer
Key Factors:
- ABox/TBox ratio: High ABox → text-based or AnnotationBased
- Metadata quality: Poor naming → text-based preferred
- Entity cohesion: Compact entities → fixed-size works well
5.2 RQ2: Ontology Characteristics as Predictors
Proposed Decision Framework:
IF ontology has:
├─ Multiple namespaces (>3) → NamespaceBasedChunking
├─ Deep hierarchy (≥5 levels) → DepthBasedChunking
├─ Consistent naming (prefix patterns) → AnnotationBasedChunking
├─ Complex relationships + large → ModuleExtractionChunking
├─ Compact entities (<100 words) → WordChunking
└─ Unknown/mixed → Test ModuleExtraction + Word
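The same framework, written as a cascade of checks. The method and parameter names are hypothetical; the thresholds mirror the decision tree above:

// StrategySelector.java — Section 5.2 framework as code.
public class StrategySelector {
    static String pick(int namespaceCount, int hierarchyDepth,
                       double prefixCoverage, boolean complexRelations,
                       double avgEntityWords) {
        if (namespaceCount > 3)   return "NamespaceBasedChunking";
        if (hierarchyDepth >= 5)  return "DepthBasedChunking";
        if (prefixCoverage > 0.7) return "AnnotationBasedChunking";
        if (complexRelations)     return "ModuleExtractionChunking";
        if (avgEntityWords < 100) return "WordChunking";
        return "Test ModuleExtraction + Word";
    }

    public static void main(String[] args) {
        // Legal ontology: 1 namespace, depth 2, prefix coverage ~0.75,
        // cross-entity relationships, entities well under 100 words.
        System.out.println(pick(1, 2, 0.75, true, 90));
        // -> AnnotationBasedChunking (consistent with the validation below)
    }
}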
Validation on Legal Ontology:
Characteristics:
✓ Single namespace → NamespaceBased failed (fell back to ClassBased)
✓ Shallow hierarchy (2) → DepthBased produced only 3 chunks
✓ Consistent naming (case_, judge_) → AnnotationBased worked well (0.7010)
✓ Compact entities (~15 axioms) → WordChunking effective (0.7135)
✓ Relationships span entities → ModuleExtraction best (0.7068)
5.3 RQ3: ABox/TBox Ratio Impact
Hypothesis: High ABox ratio degrades hierarchy-based strategy performance.
Evidence:
| Ontology Type | ABox% | ClassBased (measured) | DepthBased (measured) | Best Measured Strategy |
|---|---|---|---|---|
| Legal (this study) | 93.8% | low (~0.70) | low (~0.70) | ModuleExtraction |
Measured results above are from my legal ontology experiment.
Hypothetical / WIP Scenarios:
- Biomedical ontology (WIP):
- ABox%: ~30%
- ClassBased (expected): medium-high (~0.75)
- DepthBased (expected): high (~0.77)
- Qualitative expectation: ClassBased or DepthBased likely to perform well due to richer hierarchy
- Schema.org (WIP):
- ABox%: ~10%
- ClassBased (expected): high (~0.82)
- DepthBased (expected): high (~0.85)
- Qualitative expectation: DepthBased likely to excel due to deep, well-structured schema
No experiments yet for these domains; numbers are placeholders to illustrate the hypothesis, not measured results.
5.4 RQ4: Predictive Performance Modeling
Can we predict performance without testing?
Proposed Heuristics:
Heuristic 1: Entity Fragmentation Risk
IF avg_sentence_size < avg_entity_size:
fragmentation_risk = HIGH
→ Avoid SentenceChunking
Heuristic 2: Orphan Axiom Ratio
orphan_ratio = (total_axioms - hierarchy_axioms) / total_axioms
IF orphan_ratio > 0.8:
→ Avoid ClassBased, DepthBased
Heuristic 3: Naming Consistency
prefix_coverage = count_axioms_with_consistent_prefixes / total_axioms
IF prefix_coverage > 0.7:
→ AnnotationBased viable
ELSE:
→ Use text-based or ModuleExtraction
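All three heuristics as one pre-flight check. The inputs are assumed to be precomputed ontology statistics (e.g., via the OWL API snippet in Section 4.2); names are illustrative:

// ChunkingHeuristics.java — Section 5.4 heuristics as a pre-flight check.
import java.util.ArrayList;
import java.util.List;

public class ChunkingHeuristics {
    static List<String> warnings(double avgSentenceAxioms, double avgEntityAxioms,
                                 double orphanRatio, double prefixCoverage) {
        List<String> out = new ArrayList<>();
        if (avgSentenceAxioms < avgEntityAxioms)          // Heuristic 1
            out.add("Avoid SentenceChunking: entity fragmentation risk");
        if (orphanRatio > 0.8)                            // Heuristic 2
            out.add("Avoid ClassBased/DepthBased: orphan blob expected");
        out.add(prefixCoverage > 0.7                      // Heuristic 3
                ? "AnnotationBased viable"
                : "Prefer text-based or ModuleExtraction");
        return out;
    }

    public static void main(String[] args) {
        // Legal ontology values from this study: ~2 axioms/sentence,
        // ~15 axioms/entity, orphan ratio ~0.94, prefix coverage ~0.75.
        warnings(2, 15, 0.94, 0.75).forEach(System.out::println);
    }
}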
Validation: Applying these heuristics to legal ontology:
- Heuristic 1: ✓ Correctly predicts SentenceChunking failure
- Heuristic 2: ✓ Correctly eliminates ClassBased, DepthBased
- Heuristic 3: ✓ Correctly identifies AnnotationBased viability (prefix_coverage ≈ 0.75)
- Limitations
- Single ontology domain: Results specific to legal domain with flat hierarchy
- Small scale: 195 axioms; performance at 10,000+ unknown
- Query diversity: Single query type (factual retrieval); structural queries untested
- Embedding model: Results specific to text-embedding-3-small (1024d)
- No hybrid strategies: Did not test combined approaches
- Manual answer evaluation: Binary scoring may miss nuanced quality differences
- Future Work
7.1 Large-Scale Validation
Test on diverse ontologies:
- SNOMED CT: 300,000+ medical concepts, deep hierarchy
- Gene Ontology: 45,000+ terms, complex relationships
- DBpedia: 6M+ entities, multi-domain
- Schema.org: Web schema, moderate hierarchy
7.2 Query Type Analysis
Evaluate performance across query categories:
- Factual: "Which cases are active?" (current study)
- Structural: "What are subclasses of Case?"
- Relational: "Who represents defendant in Smith v. Jones?"
- Aggregation: "How many active criminal cases?"
7.3 Hybrid Strategies
Design adaptive chunking:
def hybrid_chunk(ontology):
    tbox_chunks = ClassBasedChunking(get_tbox(ontology))
    abox_chunks = AnnotationBasedChunking(get_abox(ontology))
    return tbox_chunks + abox_chunks
7.4 Automated Strategy Selection
Machine learning model to predict optimal strategy:
Input: Ontology metrics (ABox%, depth, namespace count, naming entropy)
Output: Predicted best strategy + confidence
7.5 Dynamic Chunking
Adjust chunk granularity based on query:
- Simple queries → coarse chunks (faster)
- Complex queries → fine chunks (more precise)
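One way this could look in practice: maintain two indexes at different granularities and route by a crude query-complexity signal. Purely illustrative; nothing like this exists in the plugin yet, and the index paths and complexity test are placeholders:

// GranularityRouter.java — hypothetical router for dynamic chunking.
public class GranularityRouter {
    static String chooseIndex(String query) {
        String q = query.toLowerCase();
        boolean complex = q.split("\\s+").length > 8
                || q.matches(".*\\b(how many|count|compare)\\b.*");
        return complex ? "./lucene_index_fine"     // fine chunks: more precise
                       : "./lucene_index_coarse";  // coarse chunks: faster
    }

    public static void main(String[] args) {
        System.out.println(chooseIndex("Which cases are currently active?"));
        // -> ./lucene_index_coarse
    }
}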
- Conclusion
This study provides an early, systematic empirical evaluation of chunking strategies for ontology-based RAG systems. Our key contributions:
Discovered the Orphan Axiom Problem: 93.8% ABox ratio causes hierarchy-based strategies to fail catastrophically, concentrating content into massive single chunks.
Counter-intuitive finding: Highest similarity score (SentenceChunking: 0.7258) produced worst answers by fragmenting entities—proving semantic completeness > mathematical similarity.
Consistency matters: ModuleExtractionChunking's 0.0012 variance makes it robust to embedding noise, despite slightly lower peak score than text-based approaches.
Predictive framework: We provide heuristics to predict strategy performance from ontology characteristics (ABox/TBox ratio, hierarchy depth, naming patterns) without exhaustive testing.
Cost-performance analysis: ModuleExtraction offers best accuracy/cost ratio: highest OWL-aware score, lowest variance, moderate computational cost.
Practical Recommendation: No universal winner exists. Strategy selection depends on:
- High ABox ratio (>80%): ModuleExtraction or AnnotationBased (if naming consistent)
- Deep hierarchy (≥5 levels): DepthBased or ClassBased
- Multiple namespaces (≥3): NamespaceBased
- Unknown/mixed: Start with ModuleExtraction + Word, validate empirically
As ontology-based AI systems proliferate, sophisticated chunking will become increasingly critical. This work provides both empirical evidence and practical guidance for researchers and practitioners building knowledge-enhanced RAG systems.
- Reproducibility
9.1 Code & Configuration
Plugin Configuration (plugin.xml):
<plugin>
  <id>lucene-rag-plugin</id>
  <version>1.0.0</version>
  <dependency>agenticmemory-0.1.1</dependency>
  <dependency>lucene-core-9.8.0</dependency>
</plugin>
Chunking Implementation (RagService.java):
// ModuleExtractionChunking
ModuleExtractionChunker chunker = new ModuleExtractionChunker();
List<OWLChunk> chunks = chunker.chunk(ontology);
for (OWLChunk chunk : chunks) {
    String text = chunk.toOWLString(); // Manchester syntax
    List<Float> embedding = embeddingService.createEmbedding(text);
    vectorStore.upsert(chunk.getId(), embedding, text);
}
Vector Store Setup:
LuceneVectorStore store = new LuceneVectorStore(
    "./lucene_index",                 // File path
    1024,                             // Dimensions
    VectorSimilarityFunction.COSINE
);
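For completeness, the call behind embeddingService.createEmbedding is a plain HTTPS request to OpenAI's /v1/embeddings endpoint; the "dimensions" parameter truncates text-embedding-3-small output to 1024 floats to fit the Lucene field above. A minimal, dependency-free sketch (how the plugin actually issues the request is not shown here, so treat this as an assumption):

// EmbeddingSketch.java — sketch of the embedding request, Java 11 HttpClient.
// JSON is assembled by hand to stay dependency-free; real code should use a JSON library.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingSketch {
    public static void main(String[] args) throws Exception {
        String body = "{\"model\": \"text-embedding-3-small\", "
                    + "\"input\": \"Which cases are currently active?\", "
                    + "\"dimensions\": 1024}";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/embeddings"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // JSON containing the 1024-d embedding vector
    }
}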
9.2 Sample Chunks
ClassBased - Orphan Chunk (183 axioms):
Chunk ID: class-chunk-orphan
Strategy: Class-Based
Axiom Count: 183
ClassAssertion(<CivilCase> <Case_SmithVsJones>)
DataPropertyAssertion(<caseNumber> <Case_SmithVsJones> "CV-2024-001")
DataPropertyAssertion(<caseStatus> <Case_SmithVsJones> "Active")
ObjectPropertyAssertion(<filedIn> <Case_SmithVsJones> <Court_District1>)
… (179 more axioms)
AnnotationBased - "cas" Prefix Chunk (26 axioms):
Chunk ID: annotation-chunk-10
Strategy: Annotation-Based
Axiom Count: 26
Annotation Key: label:cas;
SubClassOf(<CivilCase> <Case>)
SubClassOf(<CriminalCase> <Case>)
SubClassOf(<AppellateCase> <Case>)
Declaration(Class(<Case>))
DataPropertyDomain(<caseNumber> <Case>)
DataPropertyDomain(<caseStatus> <Case>)
DataPropertyAssertion(<caseNumber> <Case_SmithVsJones> "CV-2024-001")
DataPropertyAssertion(<caseStatus> <Case_SmithVsJones> "Active")
… (18 more)
ModuleExtraction - Module 1 (132 axioms):
Chunk ID: module-chunk-1
Strategy: Module-Extraction
Axiom Count: 132
Seed Entities: [Case, Court, Judge, Lawyer]
[Complete self-contained module with dependency closure]
All Case axioms + related Court axioms + associated Judge axioms
+ Lawyer relationships + Evidence connections
= Logically complete, independently coherent module
9.3 Test Query Protocol
Query: "Which cases are currently active?"
Retrieval Process:
- Generate query embedding: text-embedding-3-small("Which cases are currently active?")
- Search Lucene index: k=5, similarity=COSINE
- Return top chunks with scores
- Format context for GPT-4
GPT-4 Prompt Template:
You are a legal knowledge assistant. Answer the question using ONLY the provided context.
Context:
{retrieved_chunk_1}
{retrieved_chunk_2}
…
Question: Which cases are currently active?
Answer:
Expected Response:
The currently active case is Smith v. Jones (Case ID: Case_SmithVsJones, Case Number: CV-2024-001, Status: Active).
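Assembling the prompt from the retrieved chunks is plain string formatting; a minimal sketch matching the template above (class and method names are illustrative, not the plugin's actual code):

// PromptBuilder.java — builds the GPT-4 prompt from the top-k chunk texts.
import java.util.List;

public class PromptBuilder {
    static String build(List<String> chunks, String question) {
        StringBuilder sb = new StringBuilder(
            "You are a legal knowledge assistant. "
            + "Answer the question using ONLY the provided context.\n\nContext:\n");
        for (String chunk : chunks) sb.append(chunk).append("\n\n");
        return sb.append("Question: ").append(question).append("\nAnswer:").toString();
    }
}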
9.4 Data Availability
- Ontology: legal.owl (195 axioms, OWL 2 DL, available upon request)
- Test results: Full similarity scores, chunk contents, timing data (supplementary materials)
- Source code: lucene-rag-plugin v1.0.0 (Java 11, Maven)
- Dependencies: agenticmemory-0.1.1, lucene-core-9.8.0, OWL API 4.5.26
Disclaimer & Author Note: This research is my independent work as part of my open source project Vidyaastra. All plugins, ontologies, and chunking mechanisms are created by me and run on my own servers. The findings are based on experiments with a few ontologies so far; I plan to extend this work to more domains and different RAG systems, and the methodology will continue to evolve. Performance results may vary and can be significantly improved with more powerful hardware and parallel processing. The 1024 dimensions used here are due to Lucene limitations—if you need higher dimensions, you can switch to the Qdrant plugin using the same mechanisms. For late chunking approaches, you can use token-based embedding providers via REST API (since OpenAI does not currently support token-based embeddings in their API). I encourage you to experiment with different models from NVIDIA or various combinations of model and embedding providers. This is very much a work in progress, with lots of room for experimentation and improvement.
Disclaimer on ABox/TBox Bias:
The results and limitations discussed here are strongly shaped by the ABox-dominated structure of my test ontology (93.8% ABox, 6.2% TBox). In this context, the ABox acts as a "villain" for conventional chunking strategies: hierarchy-based methods (ClassBased, DepthBased) struggle to find logical split points, leading to the "blob" effect where most individual assertions are concentrated in a single orphan chunk. This undermines the goal of RAG, which is to retrieve only the most relevant information. Additionally, text-based chunking can fragment horizontal relationships, resulting in incomplete or context-smeared retrievals. These findings are based on my independent, small-scale experiments and may differ for ontologies with richer TBox structures. To further validate these findings, I plan to repeat these experiments with a TBox-rich ontology. This will help assess how chunking strategies perform when the ontology contains deeper class hierarchies and more schema-level logic.
Important Limitations & Cautions:
This study is based on a single ontology (I have experimented with a few others, but I did not capture those results systematically and still need to do more), a few primary queries, one embedding model, one LLM, and one hardware setup. These constraints mean that my findings—especially claims about "no universal winner" or optimal strategy selection—should be interpreted as preliminary and not generalizable.
The “Answer Quality” metric is a simple binary (correct/incorrect) judgment, which does not capture nuanced or partial answers. Strong conclusions about answer quality versus similarity should be treated with caution.
Cost-performance numbers (milliseconds, fractions of a cent) are included for completeness, but are not the main story; for most practical purposes, structural robustness and retrieval quality matter far more than tiny differences in compute or spend.
Some tables mix measured results (legal ontology) with hypothetical or expected outcomes for other domains. I have tried to make this clear, but readers should be aware of the distinction.