Text Splitting Methodology
This page provides a comprehensive explanation of how Vector Data Loader splits documents into chunks before embedding. Understanding this process helps you tune chunking settings for optimal search quality in your RAG applications.
Overview
When documents are synced to your vector store, they go through a multi-stage splitting process:
1. Content Detection - The system analyzes the document to determine its format
2. Splitter Selection - An appropriate splitting algorithm is chosen based on content type
3. Chunking - The document is split into pieces of the target token size at natural boundaries
4. Metadata Enrichment - Each chunk receives contextual metadata (headers, positions)
5. Embedding - Chunks are converted to vectors and stored
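A minimal sketch of these stages in Python (function names are hypothetical, not Vector Data Loader's actual API; stages 1-2 are reduced to stubs for brevity):

```python
# Illustrative pipeline sketch only; not the real implementation.

def detect_format(doc: str) -> str:
    # Stage 1 (Content Detection), reduced to a single heuristic.
    return "markdown" if doc.lstrip().startswith("#") else "text"

def select_splitter(fmt: str):
    # Stage 2 (Splitter Selection), stubbed: a real splitter is
    # header-aware or recursive depending on the detected format.
    return lambda doc: [p for p in doc.split("\n\n") if p.strip()]

def process_document(doc: str) -> list[dict]:
    fmt = detect_format(doc)          # 1. Content Detection
    split = select_splitter(fmt)      # 2. Splitter Selection
    pieces = split(doc)               # 3. Chunking
    return [                          # 4. Metadata Enrichment
        {"content": p, "index": i, "splitter_type": fmt}
        for i, p in enumerate(pieces)
    ]                                 # 5. Embedding happens downstream
```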
Splitter Types
Vector Data Loader uses three specialized splitters, automatically selected based on document content:
| Content Type | Splitter | Strategy | Best For |
|---|---|---|---|
| HTML (Confluence, websites) | Header-Aware HTML Splitter | Splits at `<h1>`-`<h6>` tags | Structured web content |
| Markdown (Notion, .md files) | Header-Aware Markdown Splitter | Splits at `#` through `######` headers | Documentation, notes |
| Plain text, PDF, JSON | Recursive Text Splitter | Character-based with natural boundaries | Unstructured content |
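Conceptually, the selection step is a simple dispatch on the detected format. A sketch (format keys are illustrative; the splitter names match the Splitter Type metadata values documented below):

```python
# Hypothetical dispatch table, not the actual implementation.
SPLITTER_BY_FORMAT = {
    "html": "html_header",          # Header-Aware HTML Splitter
    "markdown": "markdown_header",  # Header-Aware Markdown Splitter
    "text": "recursive",            # Recursive Text Splitter
    "json": "recursive",
    "pdf": "recursive",
}
```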
Header-Aware Splitting (HTML & Markdown)
For structured documents with headers, the system:
- Identifies header boundaries - Detects heading tags (HTML) or `#` prefixes (Markdown)
- Tracks header hierarchy - Maintains parent headers (h1 > h2 > h3) as context
- Splits at section boundaries - Creates chunks that respect document structure
- Propagates metadata - Each chunk includes its header hierarchy for search context
If a chunk comes from a section under "Authentication" (h1) > "OAuth Setup" (h2) > "Token Refresh" (h3), all three headers are stored in the chunk's metadata. This enables search results to show contextual breadcrumbs like "Authentication > OAuth Setup > Token Refresh".
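For example, a chunk from that "Token Refresh" section might carry metadata along these lines (a sketch; field names are illustrative, see the Chunk Metadata table below for the documented fields):

```python
chunk = {
    "content": "Refresh tokens expire after...",  # hypothetical chunk text
    "headers": {
        "h1": "Authentication",
        "h2": "OAuth Setup",
        "h3": "Token Refresh",
    },
    "splitter_type": "markdown_header",
}

# Contextual breadcrumb for search results:
print(" > ".join(chunk["headers"].values()))
# Authentication > OAuth Setup > Token Refresh
```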
Handling Large Sections:
When a section exceeds 1.5× the target chunk size, it gets sub-split using the recursive splitter while preserving the parent header metadata. This ensures:
- No chunk is excessively large
- Sub-chunks still have full header context
- Semantic grouping is maintained where possible
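A minimal sketch of the 1.5× rule, assuming the ~4 characters/token estimate described under Token-Based Sizing; the naive paragraph sub-split stands in for the real recursive splitter:

```python
def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def split_section(section: str, headers: dict, chunk_size: int = 512) -> list[dict]:
    # Sections up to 1.5x the target chunk size are kept whole.
    if count_tokens(section) <= 1.5 * chunk_size:
        return [{"content": section, "headers": headers}]
    # Oversized sections are sub-split (naively by paragraph here);
    # every sub-chunk inherits the full parent header hierarchy.
    return [
        {"content": piece, "headers": headers}
        for piece in section.split("\n\n")
        if piece.strip()
    ]
```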
Recursive Text Splitting
For unstructured content (PDFs, plain text, JSON), the system uses a recursive character-based approach:
Separator Hierarchy:
The splitter tries each separator in order of preference:
1. `\n\n` (paragraph breaks)
2. `\n` (line breaks)
3. `.` (sentence endings)
4. `!` (exclamations)
5. `?` (questions)
6. `;` (semicolons)
7. `,` (commas)
8. ` ` (spaces)
9. `""` (empty string; character-level splitting as a last resort)
How It Works:
- The algorithm looks for the last occurrence of each separator within the target chunk size
- A separator is used only if it occurs after the first 30% of the target chunk size (to avoid tiny fragments)
- If no good split point is found, it tries the next separator in the hierarchy
- This continues until a natural boundary is found or character-level splitting is used
This approach ensures chunks break at natural language boundaries (sentences, paragraphs) rather than mid-word or mid-sentence, preserving semantic coherence.
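A simplified sketch of the algorithm (character-based and without overlap for brevity; the real implementation sizes chunks by tokens):

```python
SEPARATORS = ["\n\n", "\n", ".", "!", "?", ";", ",", " ", ""]

def find_split_point(text: str, limit: int) -> int:
    """Return a cut position at or before `limit`, preferring natural boundaries."""
    for sep in SEPARATORS:
        if sep == "":
            return limit  # last resort: hard character-level split
        cut = text.rfind(sep, 0, limit)
        # Only accept a separator found after 30% of the chunk,
        # so splitting never produces tiny fragments.
        if cut > int(limit * 0.3):
            return cut + len(sep)

def recursive_split(text: str, chunk_size: int) -> list[str]:
    chunks = []
    while len(text) > chunk_size:
        cut = find_split_point(text, chunk_size)
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```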
Automatic Content Detection
The system automatically detects document format without requiring manual specification:
| Format | Detection Method | Confidence |
|---|---|---|
| HTML | `<html>` tag, `<!DOCTYPE html>`, or 5+ HTML tags with body structure | High (95%) |
| Markdown | 2+ `#` headers, or a combination of headers, links (`[text](url)`), and code blocks | High (85%) |
| JSON | Valid JSON structure starting with `{` or `[` | High (95%) |
| Plain Text | No specific format detected | Default fallback |
When the Preserve Document Structure setting is disabled, all content uses the recursive splitter regardless of detected format.
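A simplified sketch of these heuristics (the detector described above also weighs HTML tag counts and header/link/code-block combinations; this stub checks only the strongest signals):

```python
import json
import re

def detect_format(text: str) -> str:
    """Simplified sketch of the detection heuristics in the table above."""
    head = text.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Two or more #-style headers is treated as a strong Markdown signal.
    if len(re.findall(r"^#{1,6} ", text, flags=re.M)) >= 2:
        return "markdown"
    return "text"  # default fallback
```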
Token-Based Sizing
Why Tokens Instead of Characters?
Embedding models have token limits (e.g., OpenAI's embedding models accept up to 8,191 tokens). Using token-based sizing ensures:
- Chunks fit within model limits
- Consistent semantic density across chunks
- Predictable embedding costs
Token Counting Method
Vector Data Loader uses the cl100k_base tokenizer (the same encoding used by GPT-4 and GPT-3.5-turbo) for consistent token counting across all embedding providers.
Token Estimation:
- Average: ~4 characters per token for English text
- Code and technical content: may have different ratios
- Non-English text: varies by language
The cl100k_base tokenizer provides a universal standard. While your embedding model may use a slightly different tokenization, cl100k_base offers a reliable approximation that works well across providers (OpenAI, Cohere, Gemini, etc.).
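Counting tokens the same way in your own code is straightforward with the tiktoken library:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "To configure OAuth, first register your application."
tokens = enc.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")
# English prose usually lands close to the ~4 characters/token estimate.
```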
Chunk Metadata
Each chunk includes rich metadata for RAG applications:
| Field | Description | Example |
|---|---|---|
| Content | The actual chunk text | "To configure OAuth, first..." |
| Index | Position within the document (0, 1, 2...) | 3 |
| Token Range | Start and end token positions | Start: 1024, End: 1536 |
| Header Hierarchy | Parent headers for context | h1: "Setup", h2: "Authentication" |
| Splitter Type | Algorithm used | "html_header", "markdown_header", or "recursive" |
| Character Range | Start and end character positions | Start: 4096, End: 6144 |
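In application code, a chunk record with these fields might be modeled like this (a sketch; attribute names mirror the table above, not a published schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str         # the actual chunk text
    index: int           # position within the document (0, 1, 2...)
    token_start: int     # token range
    token_end: int
    char_start: int      # character range
    char_end: int
    splitter_type: str   # "html_header", "markdown_header", or "recursive"
    headers: dict = field(default_factory=dict)  # e.g. {"h1": "Setup", "h2": "Authentication"}
```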
Using Metadata in RAG
This metadata enables powerful RAG features:
- Breadcrumb Citations: Display "Section: Authentication > OAuth" in search results
- Context Windows: Fetch adjacent chunks using token/character positions
- Filtering: Search within specific sections using header metadata
- Quality Scoring: Weight results by structural position (h1 sections may be more important)
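Two of these patterns as minimal sketches, assuming chunks shaped like the metadata table (plain dicts here):

```python
def breadcrumb(chunk: dict) -> str:
    """Build a citation trail such as 'Authentication > OAuth Setup'."""
    return " > ".join(chunk["headers"].values())

def with_neighbors(chunks: list[dict], hit_index: int, window: int = 1) -> list[dict]:
    """Expand a search hit into a context window using the stored chunk index."""
    lo = max(0, hit_index - window)
    return chunks[lo : hit_index + window + 1]
```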
Why This Architecture Matters for RAG
| Benefit | Explanation |
|---|---|
| Semantic Coherence | Chunks respect document structure, keeping related content together |
| Contextual Retrieval | Header metadata provides context even for isolated chunks |
| Consistent Sizing | Token-based sizing prevents embedding truncation |
| Format Optimization | Different splitters optimize for different content types |
| Search Quality | Natural boundaries improve retrieval relevance |
Tuning Recommendations
Different use cases benefit from different chunking configurations:
| Use Case | Chunk Size | Overlap | Preserve Structure | Notes |
|---|---|---|---|---|
| Precise Q&A | 256-384 | 32-48 | On | Smaller chunks for focused answers |
| General Search | 512 (default) | 64 | On | Balanced for most use cases |
| Long-form Context | 1024-2048 | 128-256 | On | Larger chunks preserve more context |
| Code Documentation | 384-512 | 48-64 | On | Medium chunks for code examples |
| Unstructured Text | 512 | 64 | Off | Use recursive splitting only |
| Dense Technical Docs | 256-384 | 48-64 | On | Smaller for precise retrieval |
Chunk Size Trade-offs
Smaller chunks (e.g., 256-384 tokens):
- Pros: More precise search results; better for specific questions
- Cons: May lose surrounding context; more chunks to process

Larger chunks (e.g., 1024-2048 tokens):
- Pros: More context preserved; fewer chunks; better for broad topics
- Cons: Less precise matching; may include irrelevant content
Overlap Recommendations
- General rule: Set overlap to ~12% of chunk size
- High overlap (20-25%): Better for content where concepts span paragraphs
- Low overlap (5-10%): Better for clearly segmented content
- Zero overlap: Only when chunks are completely independent (rare)
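As a quick worked example, the ~12% rule reproduces the default settings:

```python
chunk_size = 512
overlap = round(chunk_size * 0.125)  # ~12% of chunk size -> 64 tokens (the default)
```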
Settings Reference
Configure these settings in Settings > Sync Settings > Document Chunking:
| Setting | Default | Range | Effect |
|---|---|---|---|
| Preserve Document Structure | On | Toggle | Enables header-aware splitting for HTML/Markdown |
| Chunk Size (tokens) | 512 | 128-2048 | Target size for each chunk |
| Chunk Overlap (tokens) | 64 | 0-256 | Tokens shared between consecutive chunks |
| Semantic Splitting Threshold | 1000 | 100-5000 | Minimum document size (tokens) for header-aware splitting |
Changes to chunking settings only affect newly synced documents. To apply new settings to existing documents, you must re-sync them.
See Also
- Sync Settings - Configure chunking parameters
- Embedding Model - Choose your embedding provider
- Document Statistics - View chunk details for synced documents