Multimodal Embeddings
Vector Data Loader supports a two-tier architecture for visual document intelligence:
- AI Image Extraction with Annotations — Works with all embedding providers (OpenAI, Cohere, Gemini, Ollama). Images are extracted during Enhanced sync and annotated with rich text descriptions that get appended to document chunks, making visual content searchable through text.
- True Multimodal Embeddings — Available exclusively with Gemini gemini-embedding-2-preview. Actual images are sent alongside text to the embedding model, producing vectors that encode both visual and textual semantics for native visual similarity search.
Even without Gemini, you get significant value from image annotations alone. Any embedding provider can search over the annotation text, giving your RAG pipeline access to visual content described in natural language.
How It Works
During Enhanced sync, the Mistral OCR processor extracts images from documents and generates structured BBox annotations for each image. These annotations include classification (image type), a short description, a detailed summary, and extracted data points.
The annotation text is appended to the relevant document chunks, making visual content searchable with any embedding model. This is the first tier — text-based image search.
When using Gemini gemini-embedding-2-preview as your embedding provider, the system goes further: actual extracted images are sent alongside text in embedding requests, producing true multimodal vectors. Gemini supports up to 6 images per embedding request, and images are matched to chunks based on page position.
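The page-position matching and the 6-image cap can be sketched as follows. The function name, the `(page, image_ref)` tuple shape, and the request dictionary are illustrative assumptions, not the actual Gemini API payload:

```python
def build_multimodal_request(chunk_text, chunk_page, images, max_images=6):
    """Pair a chunk with images from the same page, capped at the provider limit.

    `images` is a list of (page, image_ref) tuples. The 6-image cap mirrors
    Gemini's per-request limit described above; the returned dict is a sketch,
    not the real embedding request format.
    """
    # Keep only images whose page matches this chunk, then apply the cap.
    matched = [ref for page, ref in images if page == chunk_page][:max_images]
    return {"text": chunk_text, "images": matched}
```

If a page contains more than six images, the extras are simply not embedded alongside that chunk; their annotation text still makes them searchable through the first tier.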
Prerequisites
| Requirement | Tier 1 (Annotations) | Tier 2 (Multimodal Embeddings) |
|---|---|---|
| Sync method | Enhanced | Enhanced |
| Mistral API key | Required | Required |
| Embedding provider | Any (OpenAI, Cohere, Gemini, Ollama) | Gemini only |
| Embedding model | Any supported model | gemini-embedding-2-preview |
Image extraction requires a configured Mistral API key for the OCR processor. Without it, Enhanced sync still processes documents but skips image extraction entirely.
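The gating rule above can be summarized in a small sketch (function and key names are hypothetical, not the product's internal API):

```python
def plan_enhanced_sync(mistral_api_key: "str | None") -> dict:
    """Decide which Enhanced-sync stages run, per the rule above:
    documents are always processed, but image extraction is skipped
    entirely when no Mistral API key is configured."""
    return {
        "process_documents": True,                      # always runs
        "extract_images": mistral_api_key is not None,  # needs the OCR processor
    }
```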
Image Extraction
Images are automatically extracted from documents during Enhanced sync:
- Storage: Images are stored in an organization-scoped Supabase Storage bucket
- Linking: Each image is linked to document chunks by page position
- Formats: Supports images embedded in PDFs, DOCX, HTML, and other document formats
- Deduplication: Identical images within a document are stored once
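Content-hash deduplication combined with page-position linking might look like the sketch below. The data shapes are assumptions for illustration; the actual storage layout lives in the organization-scoped Supabase bucket:

```python
import hashlib

def store_images(extracted):
    """Deduplicate identical images within a document and record their pages.

    `extracted` is a list of (page, image_bytes) pairs. Identical bytes hash
    to the same key, so each unique image is stored once while every page
    occurrence is still linked for chunk matching.
    """
    stored = {}
    for page, data in extracted:
        key = hashlib.sha256(data).hexdigest()
        entry = stored.setdefault(key, {"pages": [], "size": len(data)})
        entry["pages"].append(page)
    return stored
```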
Extracted images count toward your organization's storage usage. Storage is billed at 200 credits per GiB per month. See Credits & Usage for details.
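As a quick estimate of that rate, a helper like this (hypothetical, for illustration) converts stored bytes into monthly credits:

```python
def monthly_storage_credits(bytes_stored: int, rate_per_gib: int = 200) -> float:
    """Estimate monthly credits for image storage at 200 credits per GiB."""
    gib = bytes_stored / (1024 ** 3)
    return gib * rate_per_gib
```

For example, 512 MiB of extracted images is 0.5 GiB, or 100 credits per month.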
Annotation Types
Each extracted image receives AI-generated BBox annotations with the following fields:
| Field | Description | Example |
|---|---|---|
| image_type | Classification of the image content | photograph, chart, diagram, table, screenshot, logo, illustration |
| short_description | Brief one-line summary | "Q3 2025 revenue bar chart" |
| detailed_summary | Comprehensive description of the image content | "Bar chart showing quarterly revenue from Q1-Q4 2025 with year-over-year comparison..." |
| data_points | Key data values extracted from the image | "Q3 revenue: $4.2M, YoY growth: 18%" |
These annotation fields are concatenated and appended to the document chunk that corresponds to the image's page position, enriching the chunk with visual context.
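The concatenation step can be sketched like this. The field names match the table above, but the template and helper names are assumptions, not the product's exact formatting:

```python
# Hypothetical template; the real concatenation format is not documented here.
ANNOTATION_TEMPLATE = (
    "[{image_type}] {short_description}\n"
    "{detailed_summary}\n"
    "Data points: {data_points}"
)

def annotation_text(fields: dict) -> str:
    """Concatenate the four BBox annotation fields into searchable text."""
    return ANNOTATION_TEMPLATE.format(**fields)

def append_to_chunk(chunk_text: str, fields: dict) -> str:
    """Enrich the page-matched chunk with the image's annotation text."""
    return chunk_text + "\n\n" + annotation_text(fields)
```

Because the result is ordinary text inside the chunk, every embedding provider in the first tier can index it without any special handling.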
RAG with Images
When retrieved chunks contain image annotations, the RAG pipeline includes visual context automatically:
- Annotation text is included in the LLM context alongside the chunk text
- All LLMs (Claude, GPT-4, etc.) can reference and reason about visual content through the text annotations
- Signed URLs with a 10-minute TTL provide temporary access to the original extracted images when needed
- Multimodal LLMs can use both the annotation text and the actual image for richer responses
Because annotations are plain text, any LLM — including text-only models — can understand and reference the visual content in your documents. You do not need a multimodal LLM to benefit from image annotations.
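Putting the retrieval behavior together, a context-assembly sketch under assumed chunk and URL shapes (the real signed URLs come from Supabase Storage; minting is mocked here):

```python
from datetime import datetime, timedelta, timezone

def build_rag_context(chunks, multimodal_llm: bool) -> dict:
    """Assemble LLM context from retrieved chunks.

    Annotation text travels inside each chunk's text, so it always reaches
    the prompt. Image references, with the 10-minute TTL described above,
    are attached only when the target LLM is multimodal. Dict keys are
    illustrative assumptions, not the product's schema.
    """
    context_parts = []
    image_refs = []
    for chunk in chunks:
        context_parts.append(chunk["text"])  # includes appended annotations
        if multimodal_llm:
            for path in chunk.get("image_paths", []):
                image_refs.append({
                    "path": path,
                    "expires_at": datetime.now(timezone.utc) + timedelta(minutes=10),
                })
    return {"prompt_context": "\n\n".join(context_parts), "images": image_refs}
```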
Sync Progress
After a sync operation completes, the summary displays image-related metrics:
- Images extracted: Total number of images found and stored from the document
- Images annotated: Number of images that received AI-generated BBox annotations
These counts appear in the sync completion toast and in the document's sync details.
Usage Metrics
Image extraction metrics are available in Settings > Usage:
| Metric | Description |
|---|---|
| Extracted Images | Total count of images extracted across all documents |
| Image Storage | Total storage size consumed by extracted images |
These metrics help you monitor storage consumption and understand the visual richness of your document library.
For embedding model configuration including Gemini multimodal models, see Embedding Model Configuration. For sync method configuration, see Sync Settings.