This article discusses recursive, semantic, hierarchical, and hybrid chunking approaches to build Agentic AI and RAG systems.
A GIF of how context windows tend to overflow if content is not chunked well
Your AI agent just crashed because it tried to stuff 50,000 tokens of conversation history into an 8,000-token context window. You implemented basic RAG, but users complain about slow responses and irrelevant context. Your cloud bill exploded because every query burns through 16,000 tokens.
These aren’t edge cases. Agentic systems require persistence, statefulness, and learning across sessions, which demands a fundamentally different memory architecture than stateless LLM calls provide. This guide compares four chunking architectures with concrete implementation guidance and decision frameworks. You’ll learn when simple approaches suffice, when to invest in complexity, and how production systems achieve 85–90% token reduction while maintaining accuracy.
Table of Contents
- Why Fixed-Size Chunking Fails for Agentic Systems
- Recursive Character Splitting
- Semantic Chunking
- Hierarchical Parent-Child Architecture
- Hybrid Search and Reranking Stack
- Decision Framework: Choosing Your Architecture
- Production Considerations: Cost, Latency, and Scale
Why Fixed-Size Chunking Fails for Agentic Systems
Fixed-size chunking splits documents at arbitrary boundaries, such as 512 tokens, regardless of content structure. For agents that need to recall specific experiences and make decisions based on history, it creates four critical failure modes.
**Context Fragmentation at Decision Boundaries** — An agent analyzing a legal contract finds it split mid-sentence: “The contractor shall indemnify the client against all claims arising from…” lands in chunk A, while “…negligence, breach of warranty, or intellectual property infringement” goes to chunk B. When the agent retrieves chunk A to assess liability, it lacks the crucial qualifier. The decision fails because arbitrary boundary placement destroyed a semantic unit.
**Memory Poisoning Through Imprecise Retrieval** — A 512-token chunk contains three unrelated topics. When querying for Topic A, the agent retrieves this chunk due to Topic A’s strong embedding signal, but Topics B and C contaminate the context. Over multiple retrievals, this noise accumulates and memory becomes unreliable.
**Inability to Navigate Structured Knowledge** — Fixed-size splitting destroys hierarchical relationships: function-to-class connections, section-to-appendix references. The agent retrieves code snippets but can’t traverse to parent definitions or related configurations, making it blind to the knowledge graph implicit in structured documents.
**Catastrophic Token Economics at Scale** — Your agent handles 1,000 queries daily. Each retrieves 5 chunks at 512 tokens (2,560 tokens), plus 200 tokens for query/response. That’s 2,760,000 tokens daily — about $3.45/day at GPT-5 input pricing ($0.00125 per 1K tokens), or roughly $1,260 annually. Production systems using advanced architectures report 85–90% token reduction, translating to roughly $125–190 annual cost for the same workload.
These problems require architectural solutions, not better embedding models.
Recursive Character Splitting
Recursive character splitting uses cascading separators to find natural boundaries, making it the default choice for 80% of production systems. The algorithm maintains priority-ordered separators: paragraph breaks, sentence endings, clause boundaries, line breaks. It attempts to split at the highest-priority separator keeping chunks within target size.
Configure chunk_size to 600 tokens, chunk_overlap to 120 tokens (20%), and separators ["\n\n", "\n", ". ", ", ", " "]. The system processes documents by attempting paragraph splits first, dropping to sentence-level where paragraphs exceed limits. The 20% overlap ensures context continuity; empirically, it is the sweet spot between 10% (which loses context) and 30% (which inflates cost).
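A minimal sketch of that configuration, assuming LangChain’s langchain-text-splitters package with a tiktoken-based length function (the file path is a placeholder):

```python
# Sketch only: assumes langchain-text-splitters and tiktoken are installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("example_doc.md").read()    # placeholder: any raw document string

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",                 # measure chunk_size in tokens, not characters
    chunk_size=600,                              # target chunk size
    chunk_overlap=120,                           # 20% overlap for context continuity
    separators=["\n\n", "\n", ". ", ", ", " "],  # highest-priority separator first
)

chunks = splitter.split_text(document_text)
```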
This approach improves retrieval precision by 25–40% compared to fixed-size splitting because each chunk represents a coherent semantic unit with consistent vector representation.
When to Use Recursive Splitting
Choose this for general knowledge bases with heterogeneous content: documentation, blog posts, conversation logs, mixed formats. It requires no specialized infrastructure and provides the best accuracy-to-effort ratio. Implementation takes hours, not weeks.
However, recursive splitting can’t capture hierarchical relationships in legal contracts, technical specs, or source code. When agents need to traverse from sections to related sections, or functions to class definitions, this approach lacks the navigation mechanism.
Implementation Considerations
For factoid queries seeking specific values, use 400-token chunks. For analytical queries requiring broader context, use 800-token chunks. Production systems often chunk documents twice at different granularities, routing queries to the appropriate index based on query classification.
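One way to sketch that routing; the two index handles and the keyword-based classifier are hypothetical stand-ins for whatever index and query classifier you actually use:

```python
# Sketch: factoid_index and analytical_index are hypothetical handles over two
# chunkings of the same corpus (400-token and 800-token); both expose .search().
def classify_query(query: str) -> str:
    """Toy heuristic; production systems typically use an LLM or a trained classifier."""
    factoid_markers = ("what is", "when did", "how many", "which version")
    return "factoid" if query.lower().startswith(factoid_markers) else "analytical"

def route_query(query: str, factoid_index, analytical_index, top_k: int = 5):
    index = factoid_index if classify_query(query) == "factoid" else analytical_index
    return index.search(query, top_k=top_k)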
Semantic Chunking
Semantic chunking splits exactly where topics change, using embedding similarity to detect conceptual shifts. This eliminates arbitrary boundaries but introduces computational overhead.
The process splits documents into atomic sentences, encodes each into vector embeddings, then computes cosine similarity between adjacent sentences. Where similarity drops below a threshold (typically 0.75–0.85), the algorithm inserts a chunk boundary.
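A compact sketch of that loop, assuming the sentence-transformers package for embeddings and a naive regex splitter standing in for a real sentence tokenizer; the threshold and model name are assumptions:

```python
# Sketch: requires sentence-transformers and numpy.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.8) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)   # unit-length vectors

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(emb[i - 1], emb[i]))         # cosine similarity
        if similarity < threshold:                             # topic shift: start new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```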
The computational cost is substantial: O(n) embedding calls for n sentences. For a 10,000-token document with 400 sentences, that’s 400 API calls. At OpenAI pricing, this costs roughly $0.004 per document: negligible for hundreds of documents but prohibitive for millions.
When to Use Semantic Chunking
Semantic chunking excels for conversational logs and transcripts where topics shift organically without structural markers. A customer support chat jumping from billing to features to troubleshooting needs precise topic boundary detection.
This approach proves valuable when retrieval precision impacts user experience more than latency does. For a medical diagnosis assistant, false positives could mislead a diagnosis; semantic chunking’s 15–25% precision improvement justifies the latency cost.
Avoid semantic chunking when cost or latency constraints are tight. The 3–5x computational overhead makes real-time ingestion impractical. Also, it struggles with highly technical content where term frequency overwhelms semantic structure because source code or dense jargon produces embedding patterns that don’t align with human-perceived boundaries.
Hierarchical Parent-Child Architecture
Hierarchical architecture resolves the inherent tension: retrieval precision favors small chunks, but LLM context quality demands large chunks. This two-tier system separates retrieval from context delivery.
How It Works
The system creates two parallel chunk hierarchies. Child chunks are small (200–400 tokens), optimized purely for retrieval precision. Parent chunks are large (800–1000 tokens), designed to provide comprehensive context. Every child stores a parent_id pointer.
A GIF showing how the hierarchical parent-child architecture works
The retrieval flow: first, the query is embedded and searched against the child chunk index. These small chunks provide precise semantic matching, maximizing the probability of finding the exact relevant information. Second, once the top-K child chunks are identified, the system uses parent_id pointers to fetch the corresponding parent chunks. These larger parents contain the child content plus surrounding context, and they are what gets delivered to the LLM.
This decouples optimization objectives. Child chunks can be tiny without worrying about insufficient context. Parents can be large without worrying about retrieval noise.
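A minimal in-memory sketch of the two tiers; child_index and parent_store are illustrative names, not a specific library’s API:

```python
# Sketch: child_index is any vector index over small chunks; parent_store maps
# parent_id -> parent text. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ChildChunk:
    text: str        # 200-400 token retrieval unit
    parent_id: str   # pointer to the enclosing parent chunk

def retrieve_with_parents(query: str, child_index, parent_store: dict[str, str],
                          top_k: int = 5) -> list[str]:
    children = child_index.search(query, top_k=top_k)    # precise matching on small chunks
    seen, parents = set(), []
    for child in children:
        if child.parent_id not in seen:                   # dedupe: siblings share a parent
            seen.add(child.parent_id)
            parents.append(parent_store[child.parent_id])
    return parents                                        # larger chunks handed to the LLM
```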
When to Use Hierarchical Architecture
Hierarchical systems excel for technical documentation and structured knowledge bases where precision and context both matter critically. A developer agent helping with API integration must retrieve the exact authentication endpoint (precision) but also provide parameter definitions, error codes, and examples (context).
Essential for legal and compliance applications. Retrieving a contract clause requires pinpoint accuracy, but evaluating it requires understanding related clauses and definitions. A 200-token child captures the specific clause; its 1000-token parent provides the interpretive framework.
However, hierarchical systems add significant complexity. You’re managing two parallel indexes, maintaining parent-child pointers, and coordinating retrieval across both tiers. Storage requirements roughly double. Avoid for simple Q&A over homogeneous content where the complexity doesn’t justify marginal gains.
Implementation Considerations
Testing shows 200 tokens as a practical minimum for child chunks. Parent sizing should target 3–5x the child size: for 200-token children, use 600–800 token parents. This ratio ensures parents provide sufficient context without becoming unwieldy.
Store LLM-generated summaries of parent chunks as metadata on child chunks. When retrieving a child, check the parent_summary field to assess parent relevance before fetching full content. This adds 10–15% to ingestion cost but reduces retrieval token consumption by 30–40%.
Hybrid Search and Reranking Stack
Advanced chunking provides high-quality segments. Production systems require multi-layered retrieval stacks balancing speed, precision, and cost.
Hybrid Search
Hybrid search runs two parallel retrieval paths. Dense retrieval uses vector embeddings for semantic similarity. Sparse retrieval uses inverted indexes (BM25) for exact keyword matches, ensuring queries for specific API names or error codes retrieve relevant chunks even if embeddings miss them.
A GIF showcasing hybrid search and reranking in RAG
Reciprocal Rank Fusion (RRF) combines results based on rank position rather than raw scores. A chunk ranked 2nd in dense and 5th in sparse scores better than one ranked 1st in dense but 50th in sparse.
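RRF itself is only a few lines. The sketch below assumes each retrieval path returns an ordered list of chunk IDs and uses the conventional k = 60 constant:

```python
# Sketch: fuse ranked chunk-ID lists from dense and sparse retrieval with RRF.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a chunk ranked well in both lists outranks one that is 1st in only one.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["c", "d", "b"]])
```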
Implementation requires databases supporting both indexes. Qdrant and Weaviate provide native hybrid search. Single-model solutions like BAAI’s BGE-M3 generate both dense and sparse vectors from one encoding pass, simplifying infrastructure.
Reranking
Reranking introduces a second-stage precision filter. Hybrid search retrieves a broad set (top-50), optimized for recall. The reranker applies a computationally expensive cross-encoder model to this smaller set, producing refined ranking focused on precision.
Cross-encoders process query and chunk jointly through transformer attention layers, capturing fine-grained interaction patterns beyond simple keyword or semantic matching. Hybrid search returns 50 candidates; the reranker processes each paired with the query through a model like bge-reranker-large, producing relevance scores. The system selects top-5–10 highest-scoring chunks for LLM processing.
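A sketch of that second stage, assuming the sentence-transformers CrossEncoder wrapper around bge-reranker-large:

```python
# Sketch: requires sentence-transformers; candidates are the ~50 chunks from hybrid search.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")   # load once, reuse across queries

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```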
This balances speed and accuracy. Bi-encoder retrieval operates in milliseconds via simple vector operations. Cross-encoder reranking takes 100–300ms but only applies to pre-filtered candidates.
Reranking improves precision by 30–40% in production systems. For agentic systems where incorrect context leads to cascading errors across multi-turn interactions, this precision boost justifies the added latency.
When to Use
Hybrid search becomes essential when your knowledge base contains both natural language and technical terminology. Developer docs include prose and code identifiers. Pure semantic search misses technical terms; pure keyword search misses conceptual relationships.
Reranking proves critical when context quality directly impacts downstream tasks. Poor context at step 1 propagates errors through subsequent steps. Production systems classify queries by complexity and apply reranking selectively to complex queries, optimizing cost and latency.
Decision Framework: Choosing Your Architecture
Decision Logic
Start with content structure. If documents have explicit hierarchical organization, such as technical specs with sections, contracts with clauses, or code with class-function structure, hierarchical parent-child architecture provides the highest accuracy by preserving relationships.
If content lacks structure but shifts between distinct topics, such as customer support transcripts or meeting notes, semantic chunking captures topic boundaries better than rule-based methods. The 3–5x computational cost is acceptable if retrieval accuracy directly impacts user experience.
If content is heterogeneous with no dominant pattern (mixed blog posts, documentation, varied formats), recursive splitting provides the best effort-to-accuracy ratio. Use this as your starting point.
For query patterns: if queries involve specific technical terms, error codes, function names that must match exactly, hybrid search becomes necessary. If queries are purely conceptual, pure dense retrieval suffices.
If users report “system didn’t find relevant information,” you need better initial retrieval. If users report “too much irrelevant information,” you need reranking.
Migration Path
Begin with recursive splitting at 600–800 tokens with 20% overlap. Implement hybrid dense + sparse search if you have technical content, otherwise dense-only. Deploy and collect metrics for 2–4 weeks.
Analyze failure modes. If precision queries fail (searches for specific terms return wrong chunks), add BM25. If recall issues dominate (relevant chunks exist but weren’t retrieved), reduce chunk size or add semantic chunking for problematic document types. If context quality issues emerge (correct chunks retrieved but LLM responses lack depth), implement hierarchical parent-child.
Add reranking when logs show retrieval returns relevant chunks outside top-5 results. If correct information consistently appears at positions 8–15, reranking moves it to top-5.
Scale complexity only where justified by data. Many production systems use a hybrid: recursive splitting for 70% of content, semantic chunking for conversations, hierarchical for technical docs.
Production Considerations: Cost, Latency, and Scale
Cost Analysis
Traditional RAG retrieving 5 chunks of 800 tokens each consumes 4,000 tokens per query just for context. At GPT-5 pricing ($0.00125 per 1K input tokens), this costs $0.005 per query. For 10,000 queries daily: $50/day or $18,250 annually.
Advanced architectures target 85–90% token reduction. Hierarchical systems might retrieve 5 child chunks at 200 tokens (1,000 tokens), then fetch 2–3 parents at 800 tokens (1,600–2,400 tokens), totaling 2,600–3,400 tokens. Further compression reduces this to about 2,000–2,400 tokens. This achieves roughly $0.003 per query, or about $11,000 annually, saving $7,250 or about 40%.
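The arithmetic fits in a small helper, using the GPT-5 input rate quoted above as an assumption:

```python
# Sketch: back-of-the-envelope context cost at the $0.00125 / 1K input-token rate above.
def annual_context_cost(tokens_per_query: int, queries_per_day: int,
                        price_per_1k: float = 0.00125) -> float:
    return tokens_per_query / 1000 * price_per_1k * queries_per_day * 365

baseline = annual_context_cost(4_000, 10_000)      # ~$18,250 for traditional RAG
hierarchical = annual_context_cost(2_400, 10_000)  # ~$10,950 after compression
```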
Reranking adds marginal costs. Cloud APIs like Cohere charge approximately $0.002 per 1,000 reranked documents. For queries reranking 50 candidates: $0.0001 per query, which is negligible compared to LLM costs.
Recursive splitting costs effectively nothing. Semantic chunking costs about $0.0025 per 10,000-token document for embeddings. Hierarchical systems with LLM-generated summaries add $0.01–0.02 per document. For 100,000 documents: negligible to $1,000–2,000.
Latency Characteristics
Dense vector retrieval using HNSW indexes completes in 10–30ms for indexes under 1 million vectors. Sparse retrieval via BM25 completes in 5–15ms. Hybrid search runs parallel, bounded by the slower path (typically 20–30ms).
Reranking introduces 100–300ms depending on model size. Lightweight models process 50 candidates in 80–120ms. Large models require 200–300ms but deliver superior precision. For latency-sensitive applications, rerank only top-20 candidates rather than top-50, trading slight quality loss for 40–50% latency reduction.
LLM generation dominates total latency at 1–3 seconds. Retrieval and reranking overheads of 100–300ms represent only 10–15% of total latency and are an acceptable tradeoff for substantially improved context quality.
Scaling Patterns
Vector databases like Pinecone and Qdrant provide native sharding. For self-hosted solutions, implement application-level sharding by partitioning documents into separate indexes (by type, time period, or tenant).
HNSW indexes consume significant RAM, approximately 4–8 bytes per vector dimension per chunk (float32 vectors plus graph overhead). For 4096-dimension embeddings at 1 million chunks: 16–32 GB RAM. Production systems use HNSW for hot indexes (recent data, frequently queried) and IVFFlat for cold indexes (archived, rarely accessed).
Multi-tenancy isolation prevents memory contamination. Store user_id or tenant_id metadata on every chunk. Filter retrieval queries to include only chunks matching the request’s context.
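A sketch of that filter using the qdrant-client API; the collection name and payload field are illustrative:

```python
# Sketch: assumes qdrant-client; "chunks" and "tenant_id" are illustrative names.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

def tenant_scoped_search(query_vector: list[float], tenant_id: str, top_k: int = 5):
    return client.search(
        collection_name="chunks",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]
        ),
        limit=top_k,
    )
```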
Conclusion
Memory architecture determines whether your agentic AI operates as a forgetful assistant or an intelligent partner that learns and improves. The progression from fixed-size chunking to hierarchical systems with hybrid retrieval represents fundamental capability differences: precisely locating information while maintaining rich context, navigating structured knowledge, and operating within practical cost and latency budgets.
The architectural principle is clear: optimize during ingestion to minimize operational costs. Front-load computational complexity through advanced chunking, metadata enrichment, and hierarchical structuring. This investment enables 85–90% token reduction during retrieval, translating to order-of-magnitude cost savings.
Start with recursive splitting as your baseline, then layer complexity strategically where data justifies it. Implement hierarchical parent-child for structured technical content, semantic chunking for conversational logs, hybrid search for mixed terminology, and reranking where precision critically impacts downstream tasks. Monitor token consumption, retrieval precision, and user satisfaction to guide architectural evolution.