Sync Settings
The Sync Settings page contains two configuration sections: Document Chunking and Auto-Sync Settings.
Document Chunking
Configure how documents are split into chunks for embedding and search:
| Setting | Description | Default | Range |
|---|---|---|---|
| Preserve Document Structure | When enabled, HTML and Markdown documents are split at header boundaries (h1, h2, etc.), which preserves their semantic structure | On | Toggle |
| Chunk Size | Target size for each chunk in tokens | 512 | 128-2048 |
| Chunk Overlap | Overlapping tokens between consecutive chunks | 64 | 0-256 |
| Semantic Splitting Threshold | Minimum document size (tokens) for header-aware splitting | 1000 | 100-5000 |
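The defaults and ranges in the table can be captured in a small validation helper. The sketch below is illustrative only: the class name `ChunkingSettings` and its field names are hypothetical and are not part of the product's API. The rule that the overlap must be smaller than the chunk size is also an assumption, added because the two documented ranges overlap.

```python
from dataclasses import dataclass

# Ranges copied from the settings table above.
CHUNK_SIZE_RANGE = (128, 2048)
CHUNK_OVERLAP_RANGE = (0, 256)
SEMANTIC_THRESHOLD_RANGE = (100, 5000)


@dataclass(frozen=True)
class ChunkingSettings:
    """Hypothetical container for the Document Chunking settings."""

    preserve_structure: bool = True   # default: On
    chunk_size: int = 512             # tokens
    chunk_overlap: int = 64           # tokens
    semantic_threshold: int = 1000    # tokens

    def __post_init__(self):
        checks = [
            ("chunk_size", self.chunk_size, CHUNK_SIZE_RANGE),
            ("chunk_overlap", self.chunk_overlap, CHUNK_OVERLAP_RANGE),
            ("semantic_threshold", self.semantic_threshold, SEMANTIC_THRESHOLD_RANGE),
        ]
        for name, value, (low, high) in checks:
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside the range {low}-{high}")
        # Assumption (not stated in the docs): overlap must be smaller than chunk size.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")


settings = ChunkingSettings(chunk_size=256, chunk_overlap=32)
```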
How Chunking Works
- HTML documents (Confluence, websites) are split at header tags (h1-h6)
- Markdown documents (Notion, .md files) are split at # headers
- Other documents (PDFs, text files) use recursive character-based splitting
- Header hierarchy is preserved in chunk metadata for search context
- Changes to these settings affect only newly synced documents
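To make the header-aware behavior concrete, here is a minimal sketch of splitting Markdown at `#` headers while recording the header hierarchy for each section. It is not the product's implementation: the function name `split_markdown_by_headers` and the output shape are hypothetical, and the sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

# ATX-style Markdown headers: one to six '#' characters followed by a title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(text):
    """Split Markdown at # headers; each section keeps its header path."""
    sections = []
    path = []    # stack of (level, title) for the headers currently in scope
    buffer = []  # body lines collected since the last header

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"headers": [title for _, title in path], "text": body})
        buffer.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return sections


doc = "# Guide\nIntro\n## Setup\nInstall it\n## Usage\nRun it\n# FAQ\nAnswers"
for section in split_markdown_by_headers(doc):
    print(" > ".join(section["headers"]), "|", section["text"])
```

Storing the full header path with each section is one way to provide the search context mentioned above, since a chunk under "Setup" still records that it belongs to "Guide".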
Smaller chunks (128-256 tokens) give more precise search matches but may lose surrounding context. Larger chunks (1024-2048 tokens) preserve more context but may be less focused. The default of 512 tokens is a good balance for most use cases.
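The interaction between chunk size and overlap can be illustrated with a plain sliding window. This is a simplified sketch and not the recursive character-based splitter described above: it assumes the document has already been converted to a list of tokens, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


tokens = list(range(1000))           # stand-in for 1000 real tokens
chunks = chunk_tokens(tokens)        # defaults: 512-token chunks, 64-token overlap
print([len(chunk) for chunk in chunks])
```

With the defaults, each window advances by 448 tokens, so the last 64 tokens of one chunk reappear as the first 64 tokens of the next.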
For a comprehensive deep dive into the splitting algorithms, MIME type detection, token counting, and tuning recommendations, see Text Splitting Methodology.
Image Extraction (Enhanced Sync)
When using Enhanced sync with a configured Mistral API key, images are automatically extracted from documents during processing:
- Automatic extraction: Images found in documents are extracted and stored in organization-scoped storage
- AI annotations: Each image receives AI-generated annotations including type classification (photograph, chart, diagram), short description, detailed summary, and key data points
- Searchable content: Annotation text is appended to relevant document chunks, making visual content searchable with any embedding provider
- Sync metrics: Image extraction counts appear in the sync completion summary
Image extraction and annotation work with all embedding providers (OpenAI, Cohere, Gemini, Ollama). You do not need Gemini to benefit from image annotations, because the annotation text makes images searchable regardless of your embedding model.
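As a rough illustration of why annotations work with any text embedding model, the sketch below renders an annotation as plain text and appends it to a chunk. The `ImageAnnotation` class, its field names, and the text layout are all hypothetical; the actual annotation format is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnnotation:
    """Hypothetical shape for the annotation fields listed above."""

    image_type: str                 # e.g. "photograph", "chart", "diagram"
    short_description: str
    detailed_summary: str
    key_data_points: list = field(default_factory=list)


def annotation_to_text(annotation):
    """Render an annotation as plain text that any embedding model can index."""
    lines = [
        f"[Image: {annotation.image_type}] {annotation.short_description}",
        annotation.detailed_summary,
    ]
    if annotation.key_data_points:
        lines.append("Key data points: " + "; ".join(annotation.key_data_points))
    return "\n".join(lines)


def append_annotation(chunk_text, annotation):
    """Append the rendered annotation to the chunk that contains the image."""
    return chunk_text.rstrip() + "\n\n" + annotation_to_text(annotation)


chart = ImageAnnotation(
    image_type="chart",
    short_description="Quarterly revenue by region",
    detailed_summary="Bar chart comparing revenue across four regions.",
    key_data_points=["EMEA is the largest region"],
)
print(append_annotation("Revenue grew in every region this year.", chart))
```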
For full details on multimodal capabilities, see Multimodal Embeddings.
Auto-Sync Settings
Configure automatic detection and syncing of stale documents:
| Setting | Description | Options |
|---|---|---|
| Enable Auto-Sync | Automatically check for stale documents based on your frequency setting | On/Off |
| Check Frequency | How often to check for stale documents | Every 12 hours, Daily (24h), Weekly, Every 2 weeks, Monthly |
| Staleness Threshold | Days without sync before a document is considered stale | 1-30 days |
| When Documents Become Stale | Action to take when staleness is detected | Mark only, Auto-sync, Notify |
| Max Documents Per Run | Limit for auto-sync operations | 1-25 |
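The staleness threshold and the per-run limit can be combined as in the following sketch. It is an illustration under stated assumptions, not the product's scheduler: the function name `select_stale_documents` is hypothetical, a document is treated as stale once its last sync is at least the threshold number of days old, and the oldest documents are processed first.

```python
from datetime import datetime, timedelta, timezone


def select_stale_documents(last_synced, now, threshold_days, max_per_run):
    """Return the IDs of stale documents, oldest first, capped at max_per_run.

    last_synced maps a document ID to the datetime of its last sync.
    """
    if not 1 <= threshold_days <= 30:
        raise ValueError("threshold_days must be between 1 and 30")
    if not 1 <= max_per_run <= 25:
        raise ValueError("max_per_run must be between 1 and 25")
    cutoff = now - timedelta(days=threshold_days)
    # Assumption: a document synced exactly at the cutoff counts as stale.
    stale = [doc_id for doc_id, synced_at in last_synced.items() if synced_at <= cutoff]
    # Assumption: the documents that have gone longest without a sync go first.
    stale.sort(key=lambda doc_id: last_synced[doc_id])
    return stale[:max_per_run]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
last_synced = {
    "handbook": now - timedelta(days=10),
    "release-notes": now - timedelta(days=2),
    "api-guide": now - timedelta(days=20),
}
print(select_stale_documents(last_synced, now, threshold_days=7, max_per_run=25))
```

What happens to the selected documents then depends on the "When Documents Become Stale" setting: they are marked, re-synced, or reported in a notification.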