Text Splitting Methodology

This page provides a comprehensive explanation of how Vector Data Loader splits documents into chunks before embedding. Understanding this process helps you tune chunking settings for optimal search quality in your RAG applications.

Overview

When documents are synced to your vector store, they go through a multi-stage splitting process:

  1. Content Detection - The system analyzes the document to determine its format
  2. Splitter Selection - An appropriate splitting algorithm is chosen based on content type
  3. Chunking - The document is split into chunks of a target token size, breaking at natural boundaries
  4. Metadata Enrichment - Each chunk receives contextual metadata (headers, positions)
  5. Embedding - Chunks are converted to vectors and stored
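
The sketch below walks through those five stages in plain Python. Every function here is a hypothetical stand-in for illustration, not Vector Data Loader's actual API.

```python
# Illustrative sketch of the five sync stages. All names are hypothetical
# stand-ins, not Vector Data Loader's actual API.

def detect_format(doc: str) -> str:
    return "markdown" if doc.lstrip().startswith("#") else "text"

def split(doc: str, fmt: str) -> list[str]:
    # The real splitter selection is format-aware; see "Splitter Types" below.
    return doc.split("\n\n") if fmt == "markdown" else [doc]

def enrich(chunks: list[str]) -> list[dict]:
    return [{"content": c, "index": i} for i, c in enumerate(chunks)]

def embed_and_store(chunks: list[dict]) -> None:
    for chunk in chunks:
        ...  # call your embedding provider, then write to the vector store

doc = "# Title\n\nSome body text."
embed_and_store(enrich(split(doc, detect_format(doc))))
```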

Splitter Types

Vector Data Loader uses three specialized splitters, automatically selected based on document content:

| Content Type | Splitter | Strategy | Best For |
| --- | --- | --- | --- |
| HTML (Confluence, websites) | Header-Aware HTML Splitter | Splits at <h1>-<h6> tags | Structured web content |
| Markdown (Notion, .md files) | Header-Aware Markdown Splitter | Splits at # through ###### headers | Documentation, notes |
| Plain text, PDF, JSON | Recursive Text Splitter | Character-based with natural boundaries | Unstructured content |

Header-Aware Splitting (HTML & Markdown)

For structured documents with headers, the system:

  1. Identifies header boundaries - Detects heading tags (HTML) or # prefixes (Markdown)
  2. Tracks header hierarchy - Maintains parent headers (h1 > h2 > h3) as context
  3. Splits at section boundaries - Creates chunks that respect document structure
  4. Propagates metadata - Each chunk includes its header hierarchy for search context

Example: Header Hierarchy

If a chunk comes from a section under "Authentication" (h1) > "OAuth Setup" (h2) > "Token Refresh" (h3), all three headers are stored in the chunk's metadata. This enables search results to show contextual breadcrumbs like "Authentication > OAuth Setup > Token Refresh".
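
As a concrete illustration, the stored hierarchy and the breadcrumb built from it might look like this (field names are hypothetical, not the exact stored schema):

```python
# Hypothetical chunk metadata; field names are illustrative.
chunk = {
    "content": "Refresh tokens can be rotated by...",
    "headers": {"h1": "Authentication", "h2": "OAuth Setup", "h3": "Token Refresh"},
}

# Build the contextual breadcrumb shown in search results.
breadcrumb = " > ".join(chunk["headers"][h] for h in ("h1", "h2", "h3"))
print(breadcrumb)  # Authentication > OAuth Setup > Token Refresh
```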

Handling Large Sections:

When a section exceeds 1.5× the target chunk size, it gets sub-split using the recursive splitter while preserving the parent header metadata. This ensures:

  • No chunk is excessively large
  • Sub-chunks still have full header context
  • Semantic grouping is maintained where possible
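
A minimal sketch of the oversize rule, measured in characters for brevity (the real splitter works in tokens, and recursive_split here is a crude stand-in; a fuller sketch follows in the next section):

```python
def recursive_split(text: str, target: int) -> list[str]:
    # Crude stand-in for the recursive splitter described below.
    return [text[i:i + target] for i in range(0, len(text), target)]

def split_section(text: str, headers: dict, target: int = 2048) -> list[dict]:
    if len(text) <= 1.5 * target:
        return [{"content": text, "headers": headers}]
    # Oversized section: sub-split, but keep parent header metadata on every piece.
    return [{"content": piece, "headers": headers}
            for piece in recursive_split(text, target)]
```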

Recursive Text Splitting

For unstructured content (PDFs, plain text, JSON), the system uses a recursive character-based approach:

Separator Hierarchy:

The splitter tries each separator in order of preference:

  1. \n\n (paragraph breaks)
  2. \n (line breaks)
  3. . (sentence endings)
  4. ! (exclamations)
  5. ? (questions)
  6. ; (semicolons)
  7. , (commas)
  8. " " (a single space)
  9. "" (the empty string; character-level, last resort)

How It Works:

  1. The algorithm looks for the last occurrence of each separator within the target chunk size
  2. It only uses a separator if it appears after the first 30% of the chunk (to avoid tiny fragments)
  3. If no good split point is found, it tries the next separator in the hierarchy
  4. This continues until a natural boundary is found or character-level splitting is used
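
A simplified sketch of that loop, with overlap and token counting omitted; it approximates the behavior described above rather than reproducing the shipped implementation:

```python
SEPARATORS = ["\n\n", "\n", ".", "!", "?", ";", ",", " ", ""]

def recursive_split(text: str, target: int) -> list[str]:
    """Approximate the separator-hierarchy behavior described above."""
    if len(text) <= target:
        return [text]
    window = text[:target]
    for sep in SEPARATORS:
        if sep == "":
            cut = target                 # last resort: character-level split
        else:
            cut = window.rfind(sep)
            if cut < 0.3 * target:       # would leave a tiny fragment; try next separator
                continue
            cut += len(sep)              # keep the separator with the left-hand piece
        return [text[:cut]] + recursive_split(text[cut:], target)
    return [text]  # unreachable: "" always produces a cut
```
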
Why This Matters

This approach ensures chunks break at natural language boundaries (sentences, paragraphs) rather than mid-word or mid-sentence, preserving semantic coherence.

Automatic Content Detection

The system automatically detects document format without requiring manual specification:

| Format | Detection Method | Confidence |
| --- | --- | --- |
| HTML | <html> tag, <!DOCTYPE html>, or 5+ HTML tags with body structure | High (95%) |
| Markdown | 2+ # headers, or a combination of headers + links [text](url) + code blocks | High (85%) |
| JSON | Valid JSON structure starting with { or [ | High (95%) |
| Plain Text | No specific format detected | Default fallback |

When the Preserve Document Structure setting is disabled, all content uses the recursive splitter regardless of detected format.
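
A heuristic sketch that mirrors the detection table; the thresholds come from the table, but the shipped detection logic may differ in detail:

```python
import json
import re

def detect_format(doc: str) -> str:
    head = doc.lstrip()[:200].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    if len(re.findall(r"</?[a-z][a-z0-9]*", doc, re.I)) >= 5 and "<body" in doc.lower():
        return "html"
    if doc.lstrip().startswith(("{", "[")):
        try:
            json.loads(doc)
            return "json"
        except ValueError:
            pass
    headers = re.findall(r"^#{1,6} ", doc, flags=re.M)
    links = re.findall(r"\[[^\]]+\]\([^)]+\)", doc)
    code_fence = "`" * 3  # built dynamically to avoid a literal fence in this snippet
    if len(headers) >= 2 or (headers and (links or code_fence in doc)):
        return "markdown"
    return "text"
```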

Token-Based Sizing

Why Tokens Instead of Characters?

Embedding models have token limits (e.g., OpenAI's embedding models accept up to 8,191 tokens per input). Using token-based sizing ensures:

  • Chunks fit within model limits
  • Consistent semantic density across chunks
  • Predictable embedding costs

Token Counting Method

Vector Data Loader uses the cl100k_base tokenizer (the same encoding used by GPT-4 and GPT-3.5-turbo) for consistent token counting across all embedding providers.

Token Estimation:

  • Average: ~4 characters per token for English text
  • Code and technical content: may have different ratios
  • Non-English text: varies by language
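
For example, you can reproduce the count used for chunk sizing with the tiktoken library:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "To configure OAuth, first register your application."
tokens = enc.encode(text)
print(len(tokens))              # token count used for chunk sizing
print(len(text) / len(tokens))  # chars per token; ~4 is typical for English
```
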
Info:

The cl100k_base tokenizer provides a universal standard. While your embedding model may use a slightly different tokenization, cl100k_base offers a reliable approximation that works well across providers (OpenAI, Cohere, Gemini, etc.).

Chunk Metadata

Each chunk includes rich metadata for RAG applications:

| Field | Description | Example |
| --- | --- | --- |
| Content | The actual chunk text | "To configure OAuth, first..." |
| Index | Position within the document (0, 1, 2...) | 3 |
| Token Range | Start and end token positions | Start: 1024, End: 1536 |
| Header Hierarchy | Parent headers for context | h1: "Setup", h2: "Authentication" |
| Splitter Type | Algorithm used | "html_header", "markdown_header", or "recursive" |
| Character Range | Start and end character positions | Start: 4096, End: 6144 |
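
Put together, a single chunk record might look like the following (key names are illustrative; the stored schema may differ):

```python
chunk = {
    "content": "To configure OAuth, first...",
    "index": 3,
    "token_range": {"start": 1024, "end": 1536},
    "char_range": {"start": 4096, "end": 6144},
    "headers": {"h1": "Setup", "h2": "Authentication"},
    "splitter_type": "html_header",
}
```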

Using Metadata in RAG

This metadata enables powerful RAG features:

  • Breadcrumb Citations: Display "Section: Authentication > OAuth" in search results
  • Context Windows: Fetch adjacent chunks using token/character positions
  • Filtering: Search within specific sections using header metadata
  • Quality Scoring: Weight results by structural position (h1 sections may be more important)
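
As one example, context-window expansion can be built on the Index field alone; this sketch assumes the illustrative record shape shown above:

```python
def with_neighbors(chunks: list[dict], hit_index: int, radius: int = 1) -> str:
    """Return the matched chunk plus its neighbors, in document order."""
    window = sorted(
        (c for c in chunks if abs(c["index"] - hit_index) <= radius),
        key=lambda c: c["index"],
    )
    return "\n".join(c["content"] for c in window)
```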

Why This Architecture Matters for RAG

| Benefit | Explanation |
| --- | --- |
| Semantic Coherence | Chunks respect document structure, keeping related content together |
| Contextual Retrieval | Header metadata provides context even for isolated chunks |
| Consistent Sizing | Token-based sizing prevents embedding truncation |
| Format Optimization | Different splitters optimize for different content types |
| Search Quality | Natural boundaries improve retrieval relevance |

Tuning Recommendations

Different use cases benefit from different chunking configurations:

| Use Case | Chunk Size | Overlap | Preserve Structure | Notes |
| --- | --- | --- | --- | --- |
| Precise Q&A | 256-384 | 32-48 | On | Smaller chunks for focused answers |
| General Search | 512 (default) | 64 | On | Balanced for most use cases |
| Long-form Context | 1024-2048 | 128-256 | On | Larger chunks preserve more context |
| Code Documentation | 384-512 | 48-64 | On | Medium chunks for code examples |
| Unstructured Text | 512 | 64 | Off | Use recursive splitting only |
| Dense Technical Docs | 256-384 | 48-64 | On | Smaller for precise retrieval |

Chunk Size Trade-offs

Smaller Chunks (128-384 tokens)

Pros: More precise search results, better for specific questions
Cons: May lose surrounding context, more chunks to process

Larger Chunks (1024-2048 tokens)

Pros: More context preserved, fewer chunks, better for broad topics
Cons: Less precise matching, may include irrelevant content

Overlap Recommendations

  • General rule: Set overlap to ~12% of chunk size
  • High overlap (20-25%): Better for content where concepts span paragraphs
  • Low overlap (5-10%): Better for clearly segmented content
  • Zero overlap: Only when chunks are completely independent (rare)
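
For example, at the default chunk size of 512 tokens, 12% works out to about 61 tokens, which lines up with the default overlap of 64.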

Settings Reference

Configure these settings in Settings > Sync Settings > Document Chunking:

| Setting | Default | Range | Effect |
| --- | --- | --- | --- |
| Preserve Document Structure | On | Toggle | Enables header-aware splitting for HTML/Markdown |
| Chunk Size (tokens) | 512 | 128-2048 | Target size for each chunk |
| Chunk Overlap (tokens) | 64 | 0-256 | Tokens shared between consecutive chunks |
| Semantic Splitting Threshold | 1000 | 100-5000 | Minimum document size (tokens) for header-aware splitting |

Warning:

Changes to chunking settings only affect newly synced documents. To apply new settings to existing documents, you must re-sync them.
