Text Splitting Methodology
This page provides a comprehensive explanation of how Vector Data Loader splits documents into chunks before embedding. Understanding this process helps you tune chunking settings for optimal search quality in your RAG applications.
Overview
When documents are synced to your vector store, they go through a multi-stage splitting process:
1. Content Detection - The system analyzes the document to determine its format
2. Splitter Selection - An appropriate splitting algorithm is chosen based on content type
3. Chunking - The document is split into pieces of the target token size at natural boundaries
4. Metadata Enrichment - Each chunk receives contextual metadata (headers, positions)
5. Embedding - Chunks are converted to vectors and stored
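A minimal sketch of these stages in Python (function names are hypothetical, not Vector Data Loader's actual API; stages 1-2 are reduced to stubs for brevity):

```python
# Illustrative pipeline sketch only; not the real implementation.

def detect_format(doc: str) -> str:
    # Stage 1 (Content Detection), reduced to a single heuristic.
    return "markdown" if doc.lstrip().startswith("#") else "text"

def select_splitter(fmt: str):
    # Stage 2 (Splitter Selection), stubbed: a real splitter is
    # header-aware or recursive depending on the detected format.
    return lambda doc: [p for p in doc.split("\n\n") if p.strip()]

def process_document(doc: str) -> list[dict]:
    fmt = detect_format(doc)          # 1. Content Detection
    split = select_splitter(fmt)      # 2. Splitter Selection
    pieces = split(doc)               # 3. Chunking
    return [                          # 4. Metadata Enrichment
        {"content": p, "index": i, "splitter_type": fmt}
        for i, p in enumerate(pieces)
    ]                                 # 5. Embedding happens downstream
```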
Splitter Types
Vector Data Loader uses three specialized splitters, automatically selected based on document content:
| Content Type | Splitter | Strategy | Best For |
|---|---|---|---|
| HTML (Confluence, websites) | Header-Aware HTML Splitter | Splits at `<h1>`-`<h6>` tags | Structured web content |
| Markdown (Notion, .md files) | Header-Aware Markdown Splitter | Splits at `#` through `######` headers | Documentation, notes |
| Plain text, PDF, JSON | Recursive Text Splitter | Character-based with natural boundaries | Unstructured content |
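Conceptually, the selection step is a simple dispatch on the detected format. A sketch (format keys are illustrative; the splitter names match the Splitter Type metadata values documented below):

```python
# Hypothetical dispatch table, not the actual implementation.
SPLITTER_BY_FORMAT = {
    "html": "html_header",          # Header-Aware HTML Splitter
    "markdown": "markdown_header",  # Header-Aware Markdown Splitter
    "text": "recursive",            # Recursive Text Splitter
    "json": "recursive",
    "pdf": "recursive",
}
```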
Header-Aware Splitting (HTML & Markdown)
For structured documents with headers, the system:
- Identifies header boundaries - Detects heading tags (HTML) or `#` prefixes (Markdown)
- Tracks header hierarchy - Maintains parent headers (h1 > h2 > h3) as context
- Splits at section boundaries - Creates chunks that respect document structure
- Propagates metadata - Each chunk includes its header hierarchy for search context
If a chunk comes from a section under "Authentication" (h1) > "OAuth Setup" (h2) > "Token Refresh" (h3), all three headers are stored in the chunk's metadata. This enables search results to show contextual breadcrumbs like "Authentication > OAuth Setup > Token Refresh".
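For example, a chunk from that "Token Refresh" section might carry metadata along these lines (a sketch; field names are illustrative, see the Chunk Metadata table below for the documented fields):

```python
chunk = {
    "content": "Refresh tokens expire after...",  # hypothetical chunk text
    "headers": {
        "h1": "Authentication",
        "h2": "OAuth Setup",
        "h3": "Token Refresh",
    },
    "splitter_type": "markdown_header",
}

# Contextual breadcrumb for search results:
print(" > ".join(chunk["headers"].values()))
# Authentication > OAuth Setup > Token Refresh
```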
Handling Large Sections:
When a section exceeds 1.5× the target chunk size, it gets sub-split using the recursive splitter while preserving the parent header metadata. This ensures:
- No chunk is excessively large
- Sub-chunks still have full header context
- Semantic grouping is maintained where possible
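A minimal sketch of the 1.5× rule, assuming the ~4 characters/token estimate described under Token-Based Sizing; the naive paragraph sub-split stands in for the real recursive splitter:

```python
def count_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def split_section(section: str, headers: dict, chunk_size: int = 512) -> list[dict]:
    # Sections up to 1.5x the target chunk size are kept whole.
    if count_tokens(section) <= 1.5 * chunk_size:
        return [{"content": section, "headers": headers}]
    # Oversized sections are sub-split (naively by paragraph here);
    # every sub-chunk inherits the full parent header hierarchy.
    return [
        {"content": piece, "headers": headers}
        for piece in section.split("\n\n")
        if piece.strip()
    ]
```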
Recursive Text Splitting
For unstructured content (PDFs, plain text, JSON), the system uses a recursive character-based approach:
Separator Hierarchy:
The splitter tries each separator in order of preference:
1. `\n\n` (paragraph breaks)
2. `\n` (line breaks)
3. `.` (sentence endings)
4. `!` (exclamations)
5. `?` (questions)
6. `;` (semicolons)
7. `,` (commas)
8. ` ` (spaces)
9. `""` (empty string; character-level splitting as a last resort)
How It Works:
- The algorithm looks for the last occurrence of each separator within the target chunk size
- A separator is used only if it occurs after the first 30% of the target chunk size (to avoid tiny fragments)
- If no good split point is found, it tries the next separator in the hierarchy
- This continues until a natural boundary is found or character-level splitting is used
This approach ensures chunks break at natural language boundaries (sentences, paragraphs) rather than mid-word or mid-sentence, preserving semantic coherence.
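A simplified sketch of the algorithm (character-based and without overlap for brevity; the real implementation sizes chunks by tokens):

```python
SEPARATORS = ["\n\n", "\n", ".", "!", "?", ";", ",", " ", ""]

def find_split_point(text: str, limit: int) -> int:
    """Return a cut position at or before `limit`, preferring natural boundaries."""
    for sep in SEPARATORS:
        if sep == "":
            return limit  # last resort: hard character-level split
        cut = text.rfind(sep, 0, limit)
        # Only accept a separator found after 30% of the chunk,
        # so splitting never produces tiny fragments.
        if cut > int(limit * 0.3):
            return cut + len(sep)

def recursive_split(text: str, chunk_size: int) -> list[str]:
    chunks = []
    while len(text) > chunk_size:
        cut = find_split_point(text, chunk_size)
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```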
Automatic Content Detection
The system automatically detects document format without requiring manual specification:
| Format | Detection Method | Confidence |
|---|---|---|
| HTML | `<html>` tag, `<!DOCTYPE html>`, or 5+ HTML tags with body structure | High (95%) |
| Markdown | 2+ `#` headers, or a combination of headers, links (`[text](url)`), and code blocks | High (85%) |
| JSON | Valid JSON structure starting with `{` or `[` | High (95%) |
| Plain Text | No specific format detected | Default fallback |
When the Preserve Document Structure setting is disabled, all content uses the recursive splitter regardless of detected format.
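A simplified sketch of these heuristics (the detector described above also weighs HTML tag counts and header/link/code-block combinations; this stub checks only the strongest signals):

```python
import json
import re

def detect_format(text: str) -> str:
    """Simplified sketch of the detection heuristics in the table above."""
    head = text.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Two or more #-style headers is treated as a strong Markdown signal.
    if len(re.findall(r"^#{1,6} ", text, flags=re.M)) >= 2:
        return "markdown"
    return "text"  # default fallback
```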
Token-Based Sizing
Why Tokens Instead of Characters?
Embedding models have token limits (e.g., OpenAI's embedding models accept up to 8,191 tokens). Using token-based sizing ensures:
- Chunks fit within model limits
- Consistent semantic density across chunks
- Predictable embedding costs
Token Counting Method
Vector Data Loader uses the cl100k_base tokenizer (the same encoding used by GPT-4 and GPT-3.5-turbo) for consistent token counting across all embedding providers.
Token Estimation:
- Average: ~4 characters per token for English text
- Code and technical content: may have different ratios
- Non-English text: varies by language
The cl100k_base tokenizer provides a universal standard. While your embedding model may use a slightly different tokenization, cl100k_base offers a reliable approximation that works well across providers (OpenAI, Cohere, Gemini, etc.).
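Counting tokens the same way in your own code is straightforward with the tiktoken library:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "To configure OAuth, first register your application."
tokens = enc.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")
# English prose usually lands close to the ~4 characters/token estimate.
```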
Chunk Metadata
Each chunk includes rich metadata for RAG applications:
| Field | Description | Example |
|---|---|---|
| Content | The actual chunk text | "To configure OAuth, first..." |
| Index | Position within the document (0, 1, 2...) | 3 |
| Token Range | Start and end token positions | Start: 1024, End: 1536 |
| Header Hierarchy | Parent headers for context | h1: "Setup", h2: "Authentication" |
| Splitter Type | Algorithm used | "html_header", "markdown_header", or "recursive" |
| Character Range | Start and end character positions | Start: 4096, End: 6144 |
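In application code, a chunk record with these fields might be modeled like this (a sketch; attribute names mirror the table above, not a published schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str         # the actual chunk text
    index: int           # position within the document (0, 1, 2...)
    token_start: int     # token range
    token_end: int
    char_start: int      # character range
    char_end: int
    splitter_type: str   # "html_header", "markdown_header", or "recursive"
    headers: dict = field(default_factory=dict)  # e.g. {"h1": "Setup", "h2": "Authentication"}
```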
Using Metadata in RAG
This metadata enables powerful RAG features:
- Breadcrumb Citations: Display "Section: Authentication > OAuth" in search results
- Context Windows: Fetch adjacent chunks using token/character positions
- Filtering: Search within specific sections using header metadata
- Quality Scoring: Weight results by structural position (h1 sections may be more important)
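Two of these patterns as minimal sketches, assuming chunks shaped like the metadata table (plain dicts here):

```python
def breadcrumb(chunk: dict) -> str:
    """Build a citation trail such as 'Authentication > OAuth Setup'."""
    return " > ".join(chunk["headers"].values())

def with_neighbors(chunks: list[dict], hit_index: int, window: int = 1) -> list[dict]:
    """Expand a search hit into a context window using the stored chunk index."""
    lo = max(0, hit_index - window)
    return chunks[lo : hit_index + window + 1]
```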
Why This Architecture Matters for RAG
| Benefit | Explanation |
|---|---|
| Semantic Coherence | Chunks respect document structure, keeping related content together |
| Contextual Retrieval | Header metadata provides context even for isolated chunks |
| Consistent Sizing | Token-based sizing prevents embedding truncation |
| Format Optimization | Different splitters optimize for different content types |
| Search Quality | Natural boundaries improve retrieval relevance |
Tuning Recommendations
Different use cases benefit from different chunking configurations:
| Use Case | Chunk Size | Overlap | Preserve Structure | Notes |
|---|---|---|---|---|
| Precise Q&A | 256-384 | 32-48 | On | Smaller chunks for focused answers |
| General Search | 512 (default) | 64 | On | Balanced for most use cases |
| Long-form Context | 1024-2048 | 128-256 | On | Larger chunks preserve more context |
| Code Documentation | 384-512 | 48-64 | On | Medium chunks for code examples |
| Unstructured Text | 512 | 64 | Off | Use recursive splitting only |
| Dense Technical Docs | 256-384 | 48-64 | On | Smaller for precise retrieval |
Chunk Size Trade-offs
Smaller chunks (e.g., 256-384 tokens):
- Pros: More precise search results; better for specific questions
- Cons: May lose surrounding context; more chunks to process

Larger chunks (e.g., 1024-2048 tokens):
- Pros: More context preserved; fewer chunks; better for broad topics
- Cons: Less precise matching; may include irrelevant content
Overlap Recommendations
- General rule: Set overlap to ~12% of chunk size
- High overlap (20-25%): Better for content where concepts span paragraphs
- Low overlap (5-10%): Better for clearly segmented content
- Zero overlap: Only when chunks are completely independent (rare)
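As a quick worked example, the ~12% rule reproduces the default settings:

```python
chunk_size = 512
overlap = round(chunk_size * 0.125)  # ~12% of chunk size -> 64 tokens (the default)
```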
Settings Reference
Configure these settings in Settings > Sync Settings > Document Chunking:
| Setting | Default | Range | Effect |
|---|---|---|---|
| Preserve Document Structure | On | Toggle | Enables header-aware splitting for HTML/Markdown |
| Chunk Size (tokens) | 512 | 128-2048 | Target size for each chunk |
| Chunk Overlap (tokens) | 64 | 0-256 | Tokens shared between consecutive chunks |
| Semantic Splitting Threshold | 1000 | 100-5000 | Minimum document size (tokens) for header-aware splitting |
Changes to chunking settings only affect newly synced documents. To apply new settings to existing documents, you must re-sync them.
See Also
- Sync Settings - Configure chunking parameters
- Embedding Model - Choose your embedding provider
- Document Statistics - View chunk details for synced documents