
Chunking strategies for RAG: fixed-size, recursive, semantic, late chunking

How to slice documents for RAG: fixed-size, recursive, semantic, document-based and late chunking compared, with rules of thumb for contracts, tables and multilingual texts.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is chunking?

Chunking is the slicing of long documents into pieces (chunks) that get indexed in a vector database. Each chunk is stored as one embedding, retrieved individually, and inserted individually into the prompt of a language model. As a rough rule of thumb, the chunking strategy accounts for 30 to 50 percent of the answer quality of a RAG pipeline, often more than the choice of embedding model.

As of May 2026, five strategies are established. Fixed-size slices every N tokens, blind to structure. Recursive character splitting (the LangChain default) tries progressively finer separators (paragraph, line, sentence, word) until each piece fits the target size. Semantic chunking computes the embedding similarity of successive sentences and cuts at semantic breaks. Document-based chunking respects file structure: Markdown headers, PDF bookmarks, HTML sections. Late chunking (Jina AI, 2024) embeds the full document in context first and only slices afterwards, so every chunk carries the meaning of the overall context.

The choice depends on document type: contracts and guidelines benefit from document-based chunking because their structure carries meaning. Long prose benefits from semantic chunking. Tabular data often needs no chunking but a separate indexing path.

Why it matters

Bad chunking shows up not in code but in wrong answers. Anyone who slices a 30-page lease into 500-token blocks without respecting chapter boundaries risks paragraph 4.2 (utilities) and paragraph 4.3 (indexation clause) landing in the same chunk, or a table being cut down the middle. The language model then sees half-sentences and invents the rest.

For fiduciary, legal and insurance contexts, the consequence is more than an inconvenience. An incorrectly answered client question about a contract notice period can carry liability. Chunking is therefore a compliance topic, not just an engineering topic. Anyone running audit-ready RAG systems under Art. 957a CO or under professional secrecy (Art. 321 SCC) must document the chunking strategy and trigger re-indexing on strategy change.

Cost-wise, chunking matters too. Chunks that are too small (under 200 tokens) mean more embedding calls and more vector database storage without quality gain. Chunks that are too large (over 2000 tokens) dilute the embedding because too many topics are mixed, and they often blow the context window during top-k retrieval.
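The cost effect can be estimated with simple arithmetic. The sketch below counts how many chunks (and therefore embedding calls and stored vectors) a sliding window produces; the 15,000-token document length is an assumption chosen for illustration, not a figure from a real lease.

```python
import math


def chunk_count(total_tokens, size, overlap):
    """Number of sliding-window chunks for a document of `total_tokens` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_tokens <= 0:
        return 0
    # Each chunk after the first contributes (size - overlap) new tokens.
    return max(1, math.ceil((total_tokens - overlap) / (size - overlap)))


doc_tokens = 15_000  # assumed length of a long contract
medium = chunk_count(doc_tokens, size=800, overlap=100)  # 22 chunks
small = chunk_count(doc_tokens, size=150, overlap=20)    # 116 chunks
```

Under these assumptions the small setting produces roughly five times as many vectors for the same document.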

How it works

Fixed-size: a token or character counter counts to N and cuts. Simple, deterministic, fast. As of May 2026, only recommended for purely technical logs or as a fallback.
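A minimal sketch of the counter-and-cut idea. Whitespace-separated words stand in for real tokens here, which is an assumption; a production pipeline would use the tokenizer of its embedding model. The `overlap` parameter is optional and defaults to none.

```python
def fixed_size_chunks(tokens, size, overlap=0):
    """Cut a token list every `size` tokens, repeating `overlap` tokens between neighbours."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace "tokens" stand in for a real tokenizer (assumption).
tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
```

The function never looks at the content of a token, which is exactly why this strategy is blind to structure.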

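Recursive character splitting: a list of separators, ordered from coarse to fine (\n\n, \n, ". ", ", ", space), is tried recursively. The text is split at the coarsest separator first; any piece still above the target size is split again with the next finer one. LangChain RecursiveCharacterTextSplitter and LlamaIndex SentenceSplitter are the de-facto standards. Note that the LangChain default list contains only \n\n, \n, space and the empty string, so sentence punctuation has to be added explicitly. Rule of thumb: 500 to 1000 tokens chunk size with 10 to 20 percent overlap (50 to 200 tokens).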
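The recursion can be sketched in a few lines of plain Python. This is a simplified illustration, not the LangChain or LlamaIndex implementation: it measures size in characters rather than tokens, has no overlap, and drops the separator at a recursive cut. The separator list is an assumption.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_len` characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # Still too long: retry this part with the next finer separator.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = ("Para one is short.\n\n"
        "Para two is a bit longer than the limit allows. It has two sentences.")
pieces = recursive_split(text, max_len=40)
```

The short first paragraph survives intact, while the long second paragraph is broken down by sentence and then by word until every piece fits.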

Semantic chunking: sentences are embedded individually and the similarity between consecutive sentences is measured. Where the distance between two neighbours (1 minus cosine similarity) exceeds a threshold, e.g. the 95th percentile of all distances in the document, the cut happens. Greg Kamradt's 5-levels notebook (2024) is the reference implementation. Pro: thematic boundaries respected. Con: extra embedding cost during indexing.
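A minimal sketch of the breakpoint logic, cutting where the distance between neighbouring sentences exceeds a percentile of all observed distances. The bag-of-words `toy_embed` function is a stand-in so the example runs without a model; it is an assumption and far cruder than a real sentence embedding.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts; a placeholder for a real embedding model (assumption)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, p):
    """Linearly interpolated percentile of a non-empty list."""
    ranked = sorted(values)
    pos = (len(ranked) - 1) * p / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_chunks(sentences, p=95, embed=toy_embed):
    """Group sentences, cutting where the distance to the next one is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, p)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "The lease starts in May.",
    "The lease ends in April.",
    "Bananas are rich in potassium.",
    "Bananas are yellow.",
]
groups = semantic_chunks(sentences)
```

With these four sentences the only cut falls between the lease sentences and the banana sentences. Every sentence is embedded once on top of the final chunk embeddings, which is where the extra indexing cost comes from.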

Document-based chunking: structure markers (Markdown headers, PDF bookmarks, DOCX style headings, HTML tags) are extracted first, then each section is sliced. Tools: MarkdownHeaderTextSplitter (LangChain), Unstructured.io, MarkItDown (Microsoft, May 2026). With clean sources, this is the highest-quality strategy.
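For Markdown sources, the idea can be sketched without any library. The function below is an illustration, not the MarkdownHeaderTextSplitter API: it cuts at every header and tags each section with its header path. It assumes ATX-style headers (`#` to `######`) and does not handle `#` lines inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown):
    """Return one section per Markdown header, each tagged with its header path."""
    sections = []
    path = []  # stack of (level, title) leading to the current position
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the section that ends at this header
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling and deeper sections
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


doc = "\n".join([
    "# Lease",
    "Intro text.",
    "## 4.2 Utilities",
    "Tenant pays utilities.",
    "## 4.3 Indexation",
    "Rent follows the index.",
])
sections = split_by_headers(doc)
```

Storing the header path next to the text keeps clauses such as 4.2 and 4.3 in separate chunks and lets the retriever show where a passage came from.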

Late chunking: the entire document goes first into a long-context embedding model (Jina embeddings v3, up to 8192 tokens). Instead of pooling the whole document into one vector, the model's output is read per token position and then pooled per chunk span, typically by averaging. Pro: each chunk vector "knows" what came before and after in the document. As of May 2026 still experimental but measurably better on legal text (retrieval recall +5 to +12 percent).
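The aggregation step can be sketched as follows. The token vectors are hypothetical stand-ins for the per-token output of a long-context embedding model, and mean pooling is assumed as the aggregation; the model call itself is out of scope here.

```python
def late_chunk_vectors(token_vectors, spans):
    """Mean-pool token vectors over each (start, end) span, end exclusive."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


# Hypothetical model output: 4 tokens, 2 dimensions each.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries as token positions
chunk_vectors = late_chunk_vectors(token_vectors, spans)
```

The difference to ordinary chunking lies in the order of operations: the boundaries in `spans` are applied after the model has seen the whole document, so each token vector already reflects its surroundings.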

For PDFs with tables we recommend a hybrid path: layout recognition first (Marker, Unstructured Hi-Res), tables serialised separately as Markdown, prose chunked document-based. Long contracts beyond 100 pages need multi-granularity: one chunk per paragraph plus a summary chunk per chapter. Multilingual documents (DE/FR/IT in one PDF) must not be chunked across language boundaries, otherwise the embedding collapses.
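Serialising an extracted table as Markdown is the simplest part of the hybrid path. The sketch below assumes the layout step has already produced a header row and data rows as Python lists; the example values are invented.

```python
def table_to_markdown(header, rows):
    """Render a header and rows as a Markdown table, escaping pipe characters."""
    def fmt(cells):
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Invented example values.
markdown_table = table_to_markdown(
    ["Item", "CHF"],
    [["Rent", 1800], ["Utilities", 220]],
)
```

Keeping the table as one unit avoids the failure mode of a chunk boundary running through the middle of it.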

Chunking workflow in 6 steps

01 Classify documents: guideline, contract, mail, table, code. One pipeline per class.
02 Per class, choose the matching strategy: recursive as default, document-based for markup, semantic for prose, late chunking for long-context needs.
03 Parameterise chunk size and overlap: default 800 tokens plus 100 tokens overlap, then iterate against an eval set.
04 Attach metadata to every chunk: source, page, client, date, confidentiality tier. Filterable in Qdrant payload.
05 Build an eval set with 30 to 50 question/answer pairs and measure retrieval recall@5. Switch strategies only when recall rises.
06 Define a re-indexing strategy: full rebuild on schema change, only affected chunks on single-document update (idempotent pipeline required).
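Steps 04 to 06 can be sketched in plain Python. The record layout, the field names and the content-hash ID scheme are assumptions for illustration, not a Qdrant API; recall@k is computed here under the simplifying assumption of exactly one relevant chunk per question.

```python
import hashlib


def chunk_id(source, text):
    """Deterministic ID, so re-indexing unchanged text overwrites instead of duplicating."""
    digest = hashlib.sha256(f"{source}\x00{text}".encode("utf-8"))
    return digest.hexdigest()[:16]


def make_record(text, source, page, client, date, tier):
    """Bundle a chunk with the filterable metadata from step 04."""
    return {
        "id": chunk_id(source, text),
        "text": text,
        "payload": {"source": source, "page": page, "client": client,
                    "date": date, "confidentiality": tier},
    }


def recall_at_k(retrieved, gold, k=5):
    """Share of questions whose gold chunk ID appears among the top-k retrieved IDs."""
    if not gold:
        return 0.0
    hits = sum(1 for question, chunk in gold.items()
               if chunk in retrieved.get(question, [])[:k])
    return hits / len(gold)


# Hypothetical values throughout.
record = make_record("Notice period: three months.", "lease.pdf", 12,
                     "client-a", "2026-05-01", "confidential")
retrieved = {"q1": ["c7", "c2", "c9"], "q2": ["c4", "c1"]}
gold = {"q1": "c9", "q2": "c5"}
score = recall_at_k(retrieved, gold, k=5)
```

Because the ID depends only on source and text, a single-document update touches only the chunks whose content actually changed, which is one way to get the idempotent pipeline step 06 asks for.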

When to use which strategy

Recursive character splitting with 500 to 1000 tokens and 10 to 20 percent overlap is the safe default for nearly any start. If you do not know what you are doing, start here.

Document-based chunking pays off as soon as the sources carry structured markup: Markdown guidelines, technical documentation with bookmarks, contracts with clear paragraph numbering. Setup cost is higher (parser choice), but answer quality rises significantly.

Semantic chunking works best on long unstructured prose: meeting minutes, interview transcripts, long mail threads. Structure markers are missing here, so thematic breaks deliver the best cut.

Late chunking is worth trying on legal or medical text where terms are unambiguous only in context ("the defendant" refers to paragraph 1 but appears only in paragraph 17). Precondition: a long-context embedding model and sufficient compute.

Fixed-size chunking is suitable only for log analysis, telemetry or other machine-generated data without linguistic structure, or as a fallback.

When not to chunk

Small document collections under 50 pages total: everything fits in the prompt of a modern long-context model (context windows on the order of 1M tokens, e.g. Claude Sonnet or Gemini 2.5 Pro). Chunking saves nothing in this setup and makes the pipeline more fragile.

Tables and structured data: a balance-sheet table does not belong in an embedding but in an SQL table or as a JSON document with filter fields in Qdrant payload. RAG via unstructured embedding is the wrong tool here.

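Code repositories: function and class boundaries are the right chunks, not token counts. Syntax-aware splitters are superior here, either tree-sitter-based (e.g. LlamaIndex CodeSplitter) or driven by language-specific separators (e.g. LangChain RecursiveCharacterTextSplitter.from_language).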
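To illustrate chunking at function and class boundaries, the sketch below uses Python's built-in `ast` module instead of tree-sitter. That is a deliberate substitution to keep the example dependency-free; it only handles Python source and only top-level definitions.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # Exact source text of the definition (decorators not included).
                "code": ast.get_source_segment(source, node),
            })
    return chunks


source = (
    "import os\n"
    "\n"
    "def f(x):\n"
    "    return x + 1\n"
    "\n"
    "class A:\n"
    "    def g(self):\n"
    "        return 2\n"
)
code_chunks = split_python_source(source)
```

Each chunk is a complete syntactic unit, so the model never sees half a function body.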

Images, plans, sketches: need vision embeddings (CLIP, Cohere Embed v3 multimodal) or a dedicated OCR pipeline, not text chunking.

Trade-offs

STRENGTHS

Document-based: highest answer quality on structured sources
Recursive: robust default, very fast, well documented
Semantic: respects thematic breaks in prose
Late chunking: keeps global context per chunk vector

WEAKNESSES

Document-based: parser-dependent, fails on poorly structured PDFs
Recursive: ignores meaning, sometimes cuts mid-list
Semantic: doubles embedding cost at index time
Late chunking: experimental, requires a long-context embedding model

FAQ

What is the optimal chunk size?

Rule of thumb: 500 to 1000 tokens with 10 to 20 percent overlap. For Q&A systems, research (Anthropic Contextual Retrieval, 2024) leans toward smaller chunks (200 to 400 tokens) plus a context prefix. For summarisation, larger (1000 to 1500). Measure with recall@5 on an eval set, do not guess.

Do I need overlap between chunks?

Yes, but in moderation. 10 to 20 percent suffices to avoid cut sentences. More overlap raises storage and inference costs without notably improving retrieval recall. With document-based chunking, overlap can be skipped because structure markers provide clean boundaries.

How do I chunk PDFs with tables?

Two parallel paths. Layout recognition (Marker, Unstructured Hi-Res, Microsoft Table Transformer) extracts tables as Markdown or JSON. They go into a separate collection with structured filter fields. Prose is chunked document-based. A cross-reference field in the payload links table and explanatory text.

When is late chunking worth it?

For long, context-dependent texts (contracts, court rulings, scientific papers) and when you already use a long-context embedding model such as Jina embeddings v3 (8192 tokens). For short or thematically separated documents the gain is negligible.

Sources

LangChain - Text Splitters documentation · 2026-05
Jina AI - Late Chunking in Long-Context Embedding Models · 2026-05
Anthropic - Contextual Retrieval · 2026-05
Unstructured.io - Chunking Strategies for RAG · 2026-05
