Retrieval-Augmented Generation (RAG) systems require effective chunking strategies to segment knowledge into retrievable units. While text-based chunking (word, sentence, paragraph boundaries) is well-studied for documents, ontologies present unique challenges due to their semantic structure. This study empirically evaluates 9 chunking strategies—4 text-based and 5 OWL-aware—on a legal domain ontology, measuring similarity scores, answer quality, retrieval consistency, and computational costs.
In my small-scale, independent experiments, I discovered the "Orphan Axiom Problem": 93.8% of axioms in my test ontology were non-hierarchical (individuals, properties, annotations), causing traditional OWL-aware strategies to produce highly unbalanced chunks. In these tests, ModuleExtractionChunking achieved the highest OWL-aware similarity (0.7068) with exceptional consistency (0.0012 variance), while AnnotationBasedChunking (0.7010, 39 chunks) provided fine-grained grouping dependent on naming conventions.
Counter-intuitive finding: SentenceChunking scored highest overall (0.7258) but produced the worst answers by fragmenting entity names across chunks, demonstrating that semantic completeness > mathematical similarity for RAG effectiveness.
We provide a decision framework for strategy selection based on ontology characteristics (ABox/TBox ratio, hierarchy depth, metadata quality) and empirically validate that no single strategy is universally optimal.
- Introduction
1.1 Motivation
Retrieval-Augmented Generation combines retrieval systems with large language models (LLMs) to answer queries using domain-specific knowledge. Chunking—splitting knowledge into retrievable units—directly impacts:
- Context relevance: Whether retrieved chunks contain needed information
- Answer accuracy: Whether LLM receives complete vs. fragmented context
- Query latency: Search time and computational cost
Traditional text-based chunking treats all content uniformly, but ontologies encode rich semantic structure: class hierarchies, property domains, annotation patterns. This structure offers opportunities for semantic-aware chunking but also introduces new failure modes.
1.2 The ABox/TBox Challenge
Ontologies consist of:
- TBox (Terminological Box): Class definitions, hierarchy, property schemas
- ABox (Assertional Box): Instances, property values, individual assertions
Most ontology chunking research assumes TBox-heavy structures (e.g., biomedical classifications), but real-world domain ontologies are typically ABox-heavy. Our legal ontology: 93.8% ABox, 6.2% TBox. This imbalance creates the "Orphan Axiom Problem" where hierarchy-based strategies produce massive, unbalanced chunks.
1.3 Research Questions
RQ1: Do OWL-aware chunking strategies outperform text-based approaches for ontology RAG?
RQ2: What ontology characteristics predict optimal chunking strategy performance?
RQ3: How does the ABox/TBox ratio affect chunking strategy effectiveness?
RQ4: Can we predict chunking strategy performance without exhaustive testing?
- Methodology
3.1 Test Environment
Platform: Protégé 5.6.7 with custom Lucene RAG plugin
Vector Store: Apache Lucene 9.8.0 (KnnFloatVectorField)
Embeddings: OpenAI text-embedding-3-small (1024 dimensions, $0.02/1M tokens)
LLM: GPT-4 (gpt-4-0613, $0.03/1K input tokens)
Hardware: Intel Core i7, 16GB RAM, SSD storage
3.2 Test Ontology: Legal Domain
Axiom Breakdown:
- Total: 195 axioms
- TBox (6.2%):
- SubClassOf: 12 axioms
- Class declarations: 17 axioms
- Property domains/ranges: 14 axioms
- ABox (93.8%):
- Individual assertions: 15 ClassAssertion axioms
- Property assertions: 47 ObjectProperty + 36 DataProperty
- Annotations: 84 rdfs:label axioms
Domain Entities:
- 3 cases: Smith v. Jones (Active), State v. Doe (Trial), Appeal CV-2023-500 (Under Review)
- 3 courts: District, Appellate, Supreme
- 4 judges, 3 lawyers, 3 evidence items, 2 statutes
Hierarchy Depth: 2 levels maximum
Namespaces: 1 (http://www.semanticweb.org/legal#)
3.3 Evaluation Metrics
- Top Similarity Score: Cosine similarity of best-matching chunk (0-1 scale)
- Retrieval Consistency: Variance in top-5 similarity scores (lower = more consistent)
- Answer Quality: Binary (correct/incorrect) manual evaluation
- Chunk Count: Total chunks created
- Chunk Balance: Standard deviation of chunk sizes
- Indexing Time: Time to embed and store all chunks
- Storage Size: Disk space for vector index
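To make the consistency and balance metrics unambiguous, here is a minimal Java sketch of the statistics being computed. This is a hypothetical helper for illustration, not plugin code, and the example scores are made up:

// MetricStats.java — the statistics behind "Retrieval Consistency" and
// "Chunk Balance". Hypothetical helper for illustration only.
import java.util.List;

public class MetricStats {

    // Retrieval Consistency: population variance of the top-k similarity scores.
    static double variance(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return values.stream()
                     .mapToDouble(v -> (v - mean) * (v - mean))
                     .average().orElse(0);
    }

    // Chunk Balance: standard deviation of chunk sizes (axioms or words per chunk).
    static double stdDev(List<Double> values) {
        return Math.sqrt(variance(values));
    }

    public static void main(String[] args) {
        List<Double> top5 = List.of(0.72, 0.70, 0.69, 0.66, 0.64); // hypothetical scores
        System.out.println("Top score: " + top5.get(0));
        System.out.println("Consistency (variance): " + variance(top5));
    }
}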
3.4 Test Query
Primary: "Which cases are currently active?"
Expected Answer: "Smith v. Jones" (Status: Active)
Evaluation Criteria:
- ✅ Correct: Identifies "Smith v. Jones" as active
- ❌ Incomplete: Only mentions "Jones" or partial names
- ❌ Incorrect: Lists wrong cases or misses active case
3.5 Chunking Strategies Tested
Text-Based (4):
- WordChunking (100 words/chunk)
- SentenceChunking (sentence boundaries)
- ParagraphChunking (double newlines)
- FixedSizeChunking (fixed character limit)
OWL-Aware (6 designed, 5 tested):
- ClassBasedChunking (hierarchy groups)
- AnnotationBasedChunking (label prefix groups)
- NamespaceBasedChunking (IRI namespace groups)
- DepthBasedChunking (hierarchy depth levels)
- ModuleExtractionChunking (OWL API modules)
- SizeBasedChunking (fixed axiom count) - not tested
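All strategies are invoked the same way by the plugin. The interface sketched below is inferred from the RagService.java excerpt in Section 9.1 (chunker.chunk(ontology), chunk.getId(), chunk.toOWLString()) and should be read as an assumption about the internal API, not a verbatim copy:

// Common chunker contract, inferred from the usage shown in Section 9.1.
import java.util.List;
import org.semanticweb.owlapi.model.OWLOntology;

interface Chunker {
    List<OWLChunk> chunk(OWLOntology ontology);
}

interface OWLChunk {
    String getId();         // stable chunk identifier, e.g. "module-chunk-1"
    String toOWLString();   // Manchester-syntax rendering used for embedding
}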
- Results
4.1 Performance Overview
| Strategy | Chunks | Top Score | Variance | Answer | Index Time | Storage |
|---|---|---|---|---|---|---|
| Text-Based | | | | | | |
| Word | 58 | 0.7135 | 0.0245 | ✅ Correct | 3.2s | 2.1MB |
| Sentence | 76 | 0.7258 | 0.0389 | ❌ Incomplete | 4.1s | 2.8MB |
| Paragraph | 58 | 0.7141 | 0.0251 | ✅ Correct | 3.3s | 2.1MB |
| FixedSize | ~50 | 0.7141 | 0.0203 | ✅ Correct | 2.8s | 1.8MB |
| OWL-Aware | | | | | | |
| ClassBased | 6 | 0.6964 | 0.0412 | ✅ Correct | 1.2s | 0.4MB |
| AnnotationBased | 39 | 0.7010 | 0.0312 | ✅ Correct | 2.1s | 1.4MB |
| NamespaceBased | 6 | 0.6964 | 0.0412 | ✅ Correct | 1.1s | 0.4MB |
| DepthBased | 3 | 0.6967 | 0.0741 | ✅ Correct | 0.8s | 0.2MB |
| ModuleExtraction | 28 | 0.7068 | 0.0012 | ✅ Correct | 1.8s | 1.0MB |
Note: Index times include embedding API calls (network latency). Storage includes vectors + metadata.
4.2 The Orphan Axiom Problem
Definition: Axioms not part of class hierarchy definitions.
Composition in Legal Ontology:
TBox (12 axioms, 6.2%):
├─ SubClassOf: CivilCase ⊑ Case
├─ SubClassOf: CriminalCase ⊑ Case
└─ SubClassOf: AppellateCase ⊑ Case
... (9 more)
ABox (183 axioms, 93.8%):
├─ ClassAssertion: Case_SmithVsJones : CivilCase
├─ DataPropertyAssertion: caseNumber(Case_SmithVsJones, "CV-2024-001")
├─ DataPropertyAssertion: caseStatus(Case_SmithVsJones, "Active")
└─ AnnotationAssertion: rdfs:label(Case_SmithVsJones, "Smith v. Jones")
… (179 more)
Impact on Chunking:
| Strategy | TBox Chunks | ABox Chunks | Largest Chunk |
|---|---|---|---|
| ClassBased | 5 (small) | 1 (orphan: 183) | 183 axioms (93.8%) |
| DepthBased | 2 (small) | 1 (non-class: 183) | 183 axioms (93.8%) |
| AnnotationBased | N/A | 39 (semantic) | 84 axioms (43.1%) |
| ModuleExtraction | Mixed | Mixed | 132 axioms (67.7%) |
Key Insight: Hierarchy-based strategies (Class, Depth) fail catastrophically on ABox-heavy ontologies, concentrating 93.8% of content into single chunks. This defeats the purpose of chunking.
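This failure mode can be detected before indexing. A minimal diagnostic using the OWL API (version 4.5.26 is already a dependency, Section 9.4); note that the OWL API classifies annotation assertions as neither TBox nor ABox, so the ratio below only approximates the 93.8% figure:

// OrphanRatioCheck.java — flags ABox-heavy ontologies before a
// hierarchy-based strategy is chosen. Sketch; the file name is a placeholder.
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.model.parameters.Imports;

public class OrphanRatioCheck {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager m = OWLManager.createOWLOntologyManager();
        OWLOntology ont = m.loadOntologyFromOntologyDocument(new File("legal.owl"));

        int tbox = ont.getTBoxAxioms(Imports.EXCLUDED).size();
        int abox = ont.getABoxAxioms(Imports.EXCLUDED).size();
        double aboxRatio = abox / (double) (tbox + abox);

        System.out.printf("TBox: %d  ABox: %d  ABox ratio: %.3f%n", tbox, abox, aboxRatio);
        if (aboxRatio > 0.8) {
            System.out.println("Warning: ClassBased/DepthBased will likely produce one orphan blob.");
        }
    }
}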
4.3 Error Analysis: SentenceChunking Failure
The Fragmentation Problem
SentenceChunking achieved the highest similarity (0.7258) but produced incomplete answers.
Example Fragmentation:
Original Entity (Manchester Syntax):
Individual: Case_SmithVsJones
Types: CivilCase
Facts: caseNumber "CV-2024-001",
caseStatus "Active",
filedIn Court_District1
Sentence Chunk A (similarity: 0.6912):
“Individual: Case_SmithVsJones Types: CivilCase Facts: caseNumber "CV-2024-001".”
Sentence Chunk B (similarity: 0.7258, TOP MATCH):
“caseStatus "Active", filedIn Court_District1.”
Sentence Chunk C (similarity: 0.6734):
“AnnotationAssertion(rdfs:label <Case_SmithVsJones> "Smith v. Jones").”
What Went Wrong:
- Query: "Which cases are currently active?"
- Top chunk (B) contains caseStatus "Active" → high similarity ✓
- BUT: Chunk B missing case name/identifier
- LLM sees "caseStatus Active" without context → incomplete answer
Why Other Strategies Avoided This:
- WordChunking: 100-word limit kept entire entity together
- AnnotationBased: Grouped Case_SmithVsJones with all case-related axioms
- ModuleExtraction: Dependency closure included all related properties
Predictability: This failure is predictable from ontology structure:
- Entity average size: ~15 axioms
- Sentence average: ~2 axioms
- If sentence size < entity size → fragmentation risk
4.4 ModuleExtraction: Consistency Analysis
Remarkable Finding: The top-5 similarity scores clustered within a 0.0012 range.
Score Distribution:
Rank 1: 0.7068 (module-chunk-1, 132 axioms, 4 seed entities)
Rank 2: 0.7067 (module-chunk-5, 89 axioms, 3 seed entities)
Rank 3: 0.7062 (module-chunk-12, 107 axioms, 4 seed entities)
Rank 4: 0.7058 (module-chunk-3, 95 axioms, 3 seed entities)
Rank 5: 0.7056 (module-chunk-8, 112 axioms, 4 seed entities)
Variance: 0.0012
Standard Deviation: 0.0035
Why This Matters:
- Any of top-5 chunks could answer the query correctly
- Robust to embedding noise/variation
- Reduces sensitivity to k-nearest neighbors selection
Contrast with AnnotationBased:
Rank 1: 0.7010 (no-annotations chunk, 84 axioms)
Rank 2: 0.6955 (cas prefix chunk, 26 axioms)
Rank 3: 0.6746 (sta prefix chunk, 24 axioms)
Rank 4: 0.6692 (app prefix chunk, 17 axioms)
Rank 5: 0.6543 (fil prefix chunk, 12 axioms)
Variance: 0.0312
Standard Deviation: 0.1766
More variance = higher risk of returning irrelevant chunks.
4.5 Cost-Performance Tradeoffs
Indexing Cost (one-time):
Text-Based:
Sentence (76 chunks): 4.1s, $0.0012 embedding cost
Word (58 chunks): 3.2s, $0.0010 embedding cost
OWL-Aware:
AnnotationBased (39 chunks): 2.1s, $0.0007 embedding cost
ModuleExtraction (28 chunks): 1.8s, $0.0005 embedding cost
ClassBased (6 chunks): 1.2s, $0.0001 embedding cost
Storage Cost (persistent):
Sentence: 2.8MB (0.037 MB/chunk)
Word: 2.1MB (0.036 MB/chunk)
AnnotationBased: 1.4MB (0.036 MB/chunk)
ModuleExtraction: 1.0MB (0.036 MB/chunk)
Query Cost (per query):
All strategies: Single embedding (~0.015s, $0.000001)
Retrieval: O(log n) with HNSW index
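For reference, the per-query retrieval step is a single Lucene approximate-kNN search over the HNSW graph built for KnnFloatVectorField. A minimal sketch; the field name "vector" and the zero-filled query vector are stand-ins:

// KnnSearchSketch.java — the per-query retrieval step in Lucene 9.x.
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class KnnSearchSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("./lucene_index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            float[] queryVector = new float[1024]; // stand-in for the query embedding
            TopDocs top = searcher.search(new KnnFloatVectorQuery("vector", queryVector, 5), 5);
            for (ScoreDoc sd : top.scoreDocs) {
                System.out.println("doc=" + sd.doc + " score=" + sd.score);
            }
        }
    }
}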
Implications:
- ModuleExtraction: Lowest total cost + highest accuracy + best consistency
- AnnotationBased: Moderate cost, metadata-dependent performance
- Sentence: Highest cost + worst accuracy = poor value
- Analysis & Discussion
5.1 RQ1: OWL-Aware vs. Text-Based Performance
Answer: Context-dependent. OWL-aware strategies excel only when ontology structure aligns with chunking logic.
Evidence:
- ModuleExtraction (OWL): 0.7068, best consistency
- AnnotationBased (OWL): 0.7010, requires naming conventions
- WordChunking (text): 0.7135, simplest + high score
- BUT: SentenceChunking (text): 0.7258 score, worst answer
Key Factors:
- ABox/TBox ratio: High ABox → text-based or AnnotationBased
- Metadata quality: Poor naming → text-based preferred
- Entity cohesion: Compact entities → fixed-size works well
5.2 RQ2: Ontology Characteristics as Predictors
Proposed Decision Framework:
IF ontology has:
├─ Multiple namespaces (>3) → NamespaceBasedChunking
├─ Deep hierarchy (≥5 levels) → DepthBasedChunking
├─ Consistent naming (prefix patterns) → AnnotationBasedChunking
├─ Complex relationships + large → ModuleExtractionChunking
├─ Compact entities (<100 words) → WordChunking
└─ Unknown/mixed → Test ModuleExtraction + Word
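The same framework, written as a cascade of checks. The method and parameter names are hypothetical; the thresholds mirror the decision tree above:

// StrategySelector.java — Section 5.2 framework as code.
public class StrategySelector {
    static String pick(int namespaceCount, int hierarchyDepth,
                       double prefixCoverage, boolean complexRelations,
                       double avgEntityWords) {
        if (namespaceCount > 3)   return "NamespaceBasedChunking";
        if (hierarchyDepth >= 5)  return "DepthBasedChunking";
        if (prefixCoverage > 0.7) return "AnnotationBasedChunking";
        if (complexRelations)     return "ModuleExtractionChunking";
        if (avgEntityWords < 100) return "WordChunking";
        return "Test ModuleExtraction + Word";
    }

    public static void main(String[] args) {
        // Legal ontology: 1 namespace, depth 2, prefix coverage ~0.75,
        // cross-entity relationships, entities well under 100 words.
        System.out.println(pick(1, 2, 0.75, true, 90));
        // -> AnnotationBasedChunking (consistent with the validation below)
    }
}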
Validation on Legal Ontology:
Characteristics:
✓ Single namespace → NamespaceBased failed (fell back to ClassBased)
✓ Shallow hierarchy (2) → DepthBased produced only 3 chunks
✓ Consistent naming (case_, judge_) → AnnotationBased worked well (0.7010)
✓ Compact entities (~15 axioms) → WordChunking effective (0.7135)
✓ Relationships span entities → ModuleExtraction best (0.7068)
5.3 RQ3: ABox/TBox Ratio Impact
Hypothesis: High ABox ratio degrades hierarchy-based strategy performance.
Evidence:
| Ontology Type | ABox% | ClassBased (measured) | DepthBased (measured) | Best Measured Strategy |
|---|---|---|---|---|
| Legal (this study) | 93.8% | low (~0.70) | low (~0.70) | ModuleExtraction |
Measured results above are from my legal ontology experiment.
Hypothetical / WIP Scenarios:
- Biomedical ontology (WIP):
- ABox%: ~30%
- ClassBased (expected): medium-high (~0.75)
- DepthBased (expected): high (~0.77)
- Qualitative expectation: ClassBased or DepthBased likely to perform well due to richer hierarchy
- Schema.org (WIP):
- ABox%: ~10%
- ClassBased (expected): high (~0.82)
- DepthBased (expected): high (~0.85)
- Qualitative expectation: DepthBased likely to excel due to deep, well-structured schema
No experiments yet for these domains; numbers are placeholders to illustrate the hypothesis, not measured results.
5.4 RQ4: Predictive Performance Modeling
Can we predict performance without testing?
Proposed Heuristics:
Heuristic 1: Entity Fragmentation Risk
IF avg_sentence_size < avg_entity_size:
fragmentation_risk = HIGH
→ Avoid SentenceChunking
Heuristic 2: Orphan Axiom Ratio
orphan_ratio = (total_axioms - hierarchy_axioms) / total_axioms
IF orphan_ratio > 0.8:
→ Avoid ClassBased, DepthBased
Heuristic 3: Naming Consistency
prefix_coverage = count_axioms_with_consistent_prefixes / total_axioms
IF prefix_coverage > 0.7:
→ AnnotationBased viable
ELSE:
→ Use text-based or ModuleExtraction
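All three heuristics as one pre-flight check. The inputs are assumed to be precomputed ontology statistics (e.g., via the OWL API snippet in Section 4.2); names are illustrative:

// ChunkingHeuristics.java — Section 5.4 heuristics as a pre-flight check.
import java.util.ArrayList;
import java.util.List;

public class ChunkingHeuristics {
    static List<String> warnings(double avgSentenceAxioms, double avgEntityAxioms,
                                 double orphanRatio, double prefixCoverage) {
        List<String> out = new ArrayList<>();
        if (avgSentenceAxioms < avgEntityAxioms)          // Heuristic 1
            out.add("Avoid SentenceChunking: entity fragmentation risk");
        if (orphanRatio > 0.8)                            // Heuristic 2
            out.add("Avoid ClassBased/DepthBased: orphan blob expected");
        out.add(prefixCoverage > 0.7                      // Heuristic 3
                ? "AnnotationBased viable"
                : "Prefer text-based or ModuleExtraction");
        return out;
    }

    public static void main(String[] args) {
        // Legal ontology values from this study: ~2 axioms/sentence,
        // ~15 axioms/entity, orphan ratio ~0.94, prefix coverage ~0.75.
        warnings(2, 15, 0.94, 0.75).forEach(System.out::println);
    }
}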
Validation: Applying these heuristics to legal ontology:
- Heuristic 1: ✓ Correctly predicts SentenceChunking failure
- Heuristic 2: ✓ Correctly eliminates ClassBased, DepthBased
- Heuristic 3: ✓ Correctly identifies AnnotationBased viability (prefix_coverage ≈ 0.75)
- Limitations
- Single ontology domain: Results specific to legal domain with flat hierarchy
- Small scale: 195 axioms; performance at 10,000+ unknown
- Query diversity: Single query type (factual retrieval); structural queries untested
- Embedding model: Results specific to text-embedding-3-small (1024d)
- No hybrid strategies: Did not test combined approaches
- Manual answer evaluation: Binary scoring may miss nuanced quality differences
- Future Work
7.1 Large-Scale Validation
Test on diverse ontologies:
- SNOMED CT: 300,000+ medical concepts, deep hierarchy
- Gene Ontology: 45,000+ terms, complex relationships
- DBpedia: 6M+ entities, multi-domain
- Schema.org: Web schema, moderate hierarchy
7.2 Query Type Analysis
Evaluate performance across query categories:
- Factual: "Which cases are active?" (current study)
- Structural: "What are subclasses of Case?"
- Relational: "Who represents defendant in Smith v. Jones?"
- Aggregation: "How many active criminal cases?"
7.3 Hybrid Strategies
Design adaptive chunking:
def hybrid_chunk(ontology):
    tbox_chunks = ClassBasedChunking(get_tbox(ontology))
    abox_chunks = AnnotationBasedChunking(get_abox(ontology))
    return tbox_chunks + abox_chunks
7.4 Automated Strategy Selection
Machine learning model to predict optimal strategy:
Input: Ontology metrics (ABox%, depth, namespace count, naming entropy)
Output: Predicted best strategy + confidence
7.5 Dynamic Chunking
Adjust chunk granularity based on query:
- Simple queries → coarse chunks (faster)
- Complex queries → fine chunks (more precise)
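One way this could look in practice: maintain two indexes at different granularities and route by a crude query-complexity signal. Purely illustrative; nothing like this exists in the plugin yet, and the index paths and complexity test are placeholders:

// GranularityRouter.java — hypothetical router for dynamic chunking.
public class GranularityRouter {
    static String chooseIndex(String query) {
        String q = query.toLowerCase();
        boolean complex = q.split("\\s+").length > 8
                || q.matches(".*\\b(how many|count|compare)\\b.*");
        return complex ? "./lucene_index_fine"     // fine chunks: more precise
                       : "./lucene_index_coarse";  // coarse chunks: faster
    }

    public static void main(String[] args) {
        System.out.println(chooseIndex("Which cases are currently active?"));
        // -> ./lucene_index_coarse
    }
}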
- Conclusion
This study provides an early, systematic empirical evaluation of chunking strategies for ontology-based RAG systems. Our key contributions:
Discovered the Orphan Axiom Problem: 93.8% ABox ratio causes hierarchy-based strategies to fail catastrophically, concentrating content into massive single chunks.
Counter-intuitive finding: Highest similarity score (SentenceChunking: 0.7258) produced worst answers by fragmenting entities—proving semantic completeness > mathematical similarity.
Consistency matters: ModuleExtractionChunking's 0.0012 variance makes it robust to embedding noise, despite slightly lower peak score than text-based approaches.
Predictive framework: We provide heuristics to predict strategy performance from ontology characteristics (ABox/TBox ratio, hierarchy depth, naming patterns) without exhaustive testing.
Cost-performance analysis: ModuleExtraction offers best accuracy/cost ratio: highest OWL-aware score, lowest variance, moderate computational cost.
Practical Recommendation: No universal winner exists. Strategy selection depends on:
- High ABox ratio (>80%): ModuleExtraction or AnnotationBased (if naming consistent)
- Deep hierarchy (≥5 levels): DepthBased or ClassBased
- Multiple namespaces (≥3): NamespaceBased
- Unknown/mixed: Start with ModuleExtraction + Word, validate empirically
As ontology-based AI systems proliferate, sophisticated chunking will become increasingly critical. This work provides both empirical evidence and practical guidance for researchers and practitioners building knowledge-enhanced RAG systems.
- Reproducibility
9.1 Code & Configuration
Plugin Configuration (plugin.xml):
<plugin>
  <id>lucene-rag-plugin</id>
  <version>1.0.0</version>
  <dependency>agenticmemory-0.1.1</dependency>
  <dependency>lucene-core-9.8.0</dependency>
</plugin>
Chunking Implementation (RagService.java):
// ModuleExtractionChunking
ModuleExtractionChunker chunker = new ModuleExtractionChunker();
List<OWLChunk> chunks = chunker.chunk(ontology);
for (OWLChunk chunk : chunks) {
    String text = chunk.toOWLString(); // Manchester syntax
    List<Float> embedding = embeddingService.createEmbedding(text);
    vectorStore.upsert(chunk.getId(), embedding, text);
}
Vector Store Setup:
LuceneVectorStore store = new LuceneVectorStore(
    "./lucene_index",                 // File path
    1024,                             // Dimensions
    VectorSimilarityFunction.COSINE
);
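For completeness, the call behind embeddingService.createEmbedding is a plain HTTPS request to OpenAI's /v1/embeddings endpoint; the "dimensions" parameter truncates text-embedding-3-small output to 1024 floats to fit the Lucene field above. A minimal, dependency-free sketch (how the plugin actually issues the request is not shown here, so treat this as an assumption):

// EmbeddingSketch.java — sketch of the embedding request, Java 11 HttpClient.
// JSON is assembled by hand to stay dependency-free; real code should use a JSON library.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingSketch {
    public static void main(String[] args) throws Exception {
        String body = "{\"model\": \"text-embedding-3-small\", "
                    + "\"input\": \"Which cases are currently active?\", "
                    + "\"dimensions\": 1024}";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/embeddings"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body()); // JSON containing the 1024-d embedding vector
    }
}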
9.2 Sample Chunks
ClassBased - Orphan Chunk (183 axioms):
Chunk ID: class-chunk-orphan
Strategy: Class-Based
Axiom Count: 183
ClassAssertion(<CivilCase> <Case_SmithVsJones>)
DataPropertyAssertion(<caseNumber> <Case_SmithVsJones> "CV-2024-001")
DataPropertyAssertion(<caseStatus> <Case_SmithVsJones> "Active")
ObjectPropertyAssertion(<filedIn> <Case_SmithVsJones> <Court_District1>)
… (179 more axioms)
AnnotationBased - "cas" Prefix Chunk (26 axioms):
Chunk ID: annotation-chunk-10
Strategy: Annotation-Based
Axiom Count: 26
Annotation Key: label:cas;
SubClassOf(<CivilCase> <Case>)
SubClassOf(<CriminalCase> <Case>)
SubClassOf(<AppellateCase> <Case>)
Declaration(Class(<Case>))
DataPropertyDomain(<caseNumber> <Case>)
DataPropertyDomain(<caseStatus> <Case>)
DataPropertyAssertion(<caseNumber> <Case_SmithVsJones> "CV-2024-001")
DataPropertyAssertion(<caseStatus> <Case_SmithVsJones> "Active")
… (18 more)
ModuleExtraction - Module 1 (132 axioms):
Chunk ID: module-chunk-1
Strategy: Module-Extraction
Axiom Count: 132
Seed Entities: [Case, Court, Judge, Lawyer]
[Complete self-contained module with dependency closure]
All Case axioms + related Court axioms + associated Judge axioms
+ Lawyer relationships + Evidence connections
= Logically complete, independently coherent module
9.3 Test Query Protocol
Query: "Which cases are currently active?"
Retrieval Process:
- Generate query embedding: text-embedding-3-small("Which cases are currently active?")
- Search Lucene index: k=5, similarity=COSINE
- Return top chunks with scores
- Format context for GPT-4
GPT-4 Prompt Template:
You are a legal knowledge assistant. Answer the question using ONLY the provided context.
Context:
{retrieved_chunk_1}
{retrieved_chunk_2}
…
Question: Which cases are currently active?
Answer:
Expected Response:
The currently active case is Smith v. Jones (Case ID: Case_SmithVsJones, Case Number: CV-2024-001, Status: Active).
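Assembling the prompt from the retrieved chunks is plain string formatting; a minimal sketch matching the template above (class and method names are illustrative, not the plugin's actual code):

// PromptBuilder.java — builds the GPT-4 prompt from the top-k chunk texts.
import java.util.List;

public class PromptBuilder {
    static String build(List<String> chunks, String question) {
        StringBuilder sb = new StringBuilder(
            "You are a legal knowledge assistant. "
            + "Answer the question using ONLY the provided context.\n\nContext:\n");
        for (String chunk : chunks) sb.append(chunk).append("\n\n");
        return sb.append("Question: ").append(question).append("\nAnswer:").toString();
    }
}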
9.4 Data Availability
- Ontology: legal.owl (195 axioms, OWL 2 DL, available upon request)
- Test results: Full similarity scores, chunk contents, timing data (supplementary materials)
- Source code: lucene-rag-plugin v1.0.0 (Java 11, Maven)
- Dependencies: agenticmemory-0.1.1, lucene-core-9.8.0, OWL API 4.5.26
Disclaimer & Author Note: This research is my independent work as part of my open source project Vidyaastra. All plugins, ontologies, and chunking mechanisms are created by me and run on my own servers. The findings are based on experiments with a few ontologies so far; I plan to extend this work to more domains and different RAG systems, and the methodology will continue to evolve. Performance results may vary and can be significantly improved with more powerful hardware and parallel processing. The 1024 dimensions used here are due to Lucene limitations—if you need higher dimensions, you can switch to the Qdrant plugin using the same mechanisms. For late chunking approaches, you can use token-based embedding providers via REST API (since OpenAI does not currently support token-based embeddings in their API). I encourage you to experiment with different models from NVIDIA or various combinations of model and embedding providers. This is very much a work in progress, with lots of room for experimentation and improvement.
Disclaimer on ABox/TBox Bias:
The results and limitations discussed here are strongly shaped by the ABox-dominated structure of my test ontology (93.8% ABox, 6.2% TBox). In this context, the ABox acts as a "villain" for conventional chunking strategies: hierarchy-based methods (ClassBased, DepthBased) struggle to find logical split points, leading to the "blob" effect where most individual assertions are concentrated in a single orphan chunk. This undermines the goal of RAG, which is to retrieve only the most relevant information. Additionally, text-based chunking can fragment horizontal relationships, resulting in incomplete or context-smeared retrievals. These findings are based on my independent, small-scale experiments and may differ for ontologies with richer TBox structures. To further validate these findings, I plan to repeat these experiments with a TBox-rich ontology. This will help assess how chunking strategies perform when the ontology contains deeper class hierarchies and more schema-level logic.
Important Limitations & Cautions:
This study is based on a single ontology (I have experimented with a few others, but I did not capture those results systematically and still need to do more), a few primary queries, one embedding model, one LLM, and one hardware setup. These constraints mean that my findings—especially claims about "no universal winner" or optimal strategy selection—should be interpreted as preliminary and not generalizable.
The “Answer Quality” metric is a simple binary (correct/incorrect) judgment, which does not capture nuanced or partial answers. Strong conclusions about answer quality versus similarity should be treated with caution.
Cost-performance numbers (milliseconds, fractions of a cent) are included for completeness, but are not the main story; for most practical purposes, structural robustness and retrieval quality matter far more than tiny differences in compute or spend.
Some tables mix measured results (legal ontology) with hypothetical or expected outcomes for other domains. I have tried to make this clear, but readers should be aware of the distinction.