
Sync Settings

The Sync Settings page contains two configuration sections: Document Chunking and Auto-Sync Settings.

Document Chunking

Configure how documents are split into chunks for embedding and search:

| Setting | Description | Default | Range |
|---|---|---|---|
| Preserve Document Structure | When enabled, HTML and Markdown documents are split at header boundaries (h1, h2, etc.), preserving semantic structure | On | Toggle |
| Chunk Size | Target size for each chunk, in tokens | 512 | 128-2048 |
| Chunk Overlap | Overlapping tokens between consecutive chunks | 64 | 0-256 |
| Semantic Splitting Threshold | Minimum document size (tokens) for header-aware splitting | 1000 | 100-5000 |
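
To make the interaction between Chunk Size and Chunk Overlap concrete, here is a minimal sketch of fixed-window chunking over a list of tokens. The `ChunkSettings` class and `window_tokens` function are hypothetical names used only for illustration, and the rule that overlap must be smaller than chunk size is an assumption; the product's actual splitter may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical container mirroring the documented defaults and ranges."""
    chunk_size: int = 512
    chunk_overlap: int = 64

    def __post_init__(self):
        if not 128 <= self.chunk_size <= 2048:
            raise ValueError("chunk_size must be between 128 and 2048")
        if not 0 <= self.chunk_overlap <= 256:
            raise ValueError("chunk_overlap must be between 0 and 256")
        # Assumption: the overlap has to be smaller than the chunk itself,
        # otherwise the window would never advance.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")


def window_tokens(tokens, settings):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    step = settings.chunk_size - settings.chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + settings.chunk_size])
        if start + settings.chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


chunks = window_tokens(list(range(1000)), ChunkSettings())
print([len(c) for c in chunks])
```

With the defaults, each window starts 448 tokens after the previous one, so the final 64 tokens of one chunk reappear at the start of the next.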

How Chunking Works

  • HTML documents (Confluence, websites) are split at header tags (h1-h6)
  • Markdown documents (Notion, .md files) are split at # headers
  • Other documents (PDFs, text files) use recursive character-based splitting
  • Header hierarchy is preserved in chunk metadata for search context
  • Changes only affect newly synced documents
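
The header-aware behavior described above can be sketched for the Markdown case as follows. This is an illustrative approximation, not the product's implementation: `choose_strategy` and `split_markdown_by_headers` are hypothetical names, the fallback to recursive splitting for documents below the threshold is an assumption, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def choose_strategy(doc_format, token_count, preserve_structure=True, threshold=1000):
    """Pick a splitting strategy (assumed fallback: recursive splitting)."""
    if preserve_structure and doc_format in ("html", "markdown") and token_count >= threshold:
        return "header"
    return "recursive"


def split_markdown_by_headers(text):
    """Split Markdown at # headers, recording the header path of each section."""
    sections = []
    path = []   # stack of (level, title) for the headers currently in scope
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": [title for _, title in path], "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section under the old header path
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nintro\n## Setup\nstep one\n## Usage\nrun it\n# FAQ\nq1"
for section in split_markdown_by_headers(doc):
    print(" > ".join(section["headers"]), "|", section["text"])
```

Keeping the full header path with each section is what lets a chunk under "Setup" still be associated with its parent "Guide" at search time.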

Tip

Smaller chunks (128-256 tokens) match queries more precisely but may lose surrounding context. Larger chunks (1024-2048 tokens) preserve more context but may be less focused. The default of 512 tokens provides a good balance for most use cases.
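
As a rough way to see the trade-off in numbers, the following sketch estimates how many chunks a document produces at different chunk sizes. It assumes simple fixed-size windows with overlap, which only approximates header-aware or recursive splitting; `estimate_chunk_count` is a hypothetical helper, not part of the product.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size=512, chunk_overlap=64):
    """Estimate the number of fixed-size windows needed to cover a document."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    step = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return max(1, math.ceil((total_tokens - chunk_overlap) / step))


for size in (128, 512, 2048):
    print(size, estimate_chunk_count(10_000, chunk_size=size))
```

Under these assumptions, a 10,000-token document with a 64-token overlap yields about 156 chunks at size 128, 23 at size 512, and 6 at size 2048.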

See Also

For a comprehensive deep dive into the splitting algorithms, MIME type detection, token counting, and tuning recommendations, see Text Splitting Methodology.

Image Extraction (Enhanced Sync)

When using Enhanced sync with a configured Mistral API key, images are automatically extracted from documents during processing:

  • Automatic extraction: Images found in documents are extracted and stored in organization-scoped storage
  • AI annotations: Each image receives AI-generated annotations including type classification (photograph, chart, diagram), short description, detailed summary, and key data points
  • Searchable content: Annotation text is appended to relevant document chunks, making visual content searchable with any embedding provider
  • Sync metrics: Image extraction counts appear in the sync completion summary
Works with All Providers

Image extraction and annotation work with all embedding providers (OpenAI, Cohere, Gemini, Ollama). You do not need Gemini to benefit from image annotations: the annotation text makes images searchable regardless of your embedding model.

For full details on multimodal capabilities, see Multimodal Embeddings.
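
The idea of making images searchable through text can be sketched as below. The `ImageAnnotation` fields follow the annotation parts listed above, but the class, the `append_annotations` function, and the exact text layout are assumptions made for illustration; the real annotation format is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnnotation:
    """Hypothetical shape for one AI-generated image annotation."""
    image_type: str            # e.g. "photograph", "chart", "diagram"
    short_description: str
    detailed_summary: str
    key_data_points: list = field(default_factory=list)

    def to_text(self):
        """Render the annotation as plain text that any embedding model can index."""
        parts = [f"[Image: {self.image_type}] {self.short_description}", self.detailed_summary]
        if self.key_data_points:
            parts.append("Key data points: " + "; ".join(self.key_data_points))
        return "\n".join(parts)


def append_annotations(chunk_text, annotations):
    """Append annotation text to the chunk that contains the images."""
    if not annotations:
        return chunk_text
    rendered = "\n\n".join(a.to_text() for a in annotations)
    return chunk_text.rstrip() + "\n\n" + rendered


chart = ImageAnnotation(
    image_type="chart",
    short_description="Quarterly revenue by region",
    detailed_summary="Bar chart comparing revenue across four regions.",
    key_data_points=["EMEA is the largest region"],
)
print(append_annotations("Revenue overview for the year.", [chart]))
```

Because the result is ordinary text, a text-only embedding model indexes the image description along with the rest of the chunk.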

Auto-Sync Settings

Configure automatic detection and syncing of stale documents:

| Setting | Description | Options |
|---|---|---|
| Enable Auto-Sync | Automatically check for stale documents based on your frequency setting | On/Off |
| Check Frequency | How often to check for stale documents | Every 12 hours, Daily (24h), Weekly, Every 2 weeks, Monthly |
| Staleness Threshold | Days without sync before a document is considered stale | 1-30 days |
| When Documents Become Stale | Action to take when staleness is detected | Mark only, Auto-sync, Notify |
| Max Documents Per Run | Limit for auto-sync operations | 1-25 |
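
The staleness check implied by these settings can be sketched as follows. This is an illustration under stated assumptions, not the product's logic: `find_stale_documents` is a hypothetical function, a document is treated as stale once its last sync is at least the threshold number of days old, and the oldest documents are picked first when the per-run limit applies.

```python
from datetime import datetime, timedelta, timezone


def find_stale_documents(last_synced, threshold_days, max_per_run, now=None):
    """Return IDs of stale documents, oldest first, capped at max_per_run.

    last_synced maps a document ID to its last sync time (timezone-aware).
    """
    if not 1 <= threshold_days <= 30:
        raise ValueError("threshold_days must be between 1 and 30")
    if not 1 <= max_per_run <= 25:
        raise ValueError("max_per_run must be between 1 and 25")
    now = now or datetime.now(timezone.utc)
    threshold = timedelta(days=threshold_days)
    # Assumption: "stale" means the last sync is at least threshold_days old.
    stale = [doc_id for doc_id, ts in last_synced.items() if now - ts >= threshold]
    # Assumption: process the longest-unsynced documents first.
    stale.sort(key=lambda doc_id: last_synced[doc_id])
    return stale[:max_per_run]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
docs = {
    "a": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "b": datetime(2024, 1, 25, tzinfo=timezone.utc),
    "c": datetime(2024, 1, 10, tzinfo=timezone.utc),
}
print(find_stale_documents(docs, threshold_days=7, max_per_run=25, now=now))
```

What happens to the returned documents would then depend on the When Documents Become Stale setting (Mark only, Auto-sync, or Notify).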