Multimodal Embeddings

Vector Data Loader supports a two-tier architecture for visual document intelligence:

  1. AI Image Extraction with Annotations — Works with all embedding providers (OpenAI, Cohere, Gemini, Ollama). Images are extracted during Enhanced sync and annotated with rich text descriptions that get appended to document chunks, making visual content searchable through text.

  2. True Multimodal Embeddings — Available exclusively with Gemini gemini-embedding-2-preview. Actual images are sent alongside text to the embedding model, producing vectors that encode both visual and textual semantics for native visual similarity search.

Start with Annotations

Even without Gemini, you get significant value from image annotations alone. Any embedding provider can search over the annotation text, giving your RAG pipeline access to visual content described in natural language.

How It Works

During Enhanced sync, the Mistral OCR processor extracts images from documents and generates structured BBox annotations for each image. These annotations include classification (image type), a short description, a detailed summary, and extracted data points.

The annotation text is appended to the relevant document chunks, making visual content searchable with any embedding model. This is the first tier — text-based image search.

When using Gemini gemini-embedding-2-preview as your embedding provider, the system goes further: actual extracted images are sent alongside text in embedding requests, producing true multimodal vectors. Gemini supports up to 6 images per embedding request, and images are matched to chunks based on page position.
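The page-position matching and the six-image cap described above can be sketched as follows. This is an illustrative sketch, not the actual Vector Data Loader internals; the `page` field on the image records is an assumption.

```python
MAX_IMAGES_PER_REQUEST = 6  # Gemini's per-request image limit noted above

def images_for_chunk(chunk_page: int, images: list[dict]) -> list[dict]:
    """Select the images on the chunk's page, capped at the request limit.

    `images` is assumed to be a list of dicts with a `page` key; real
    records would also carry storage paths and annotation data.
    """
    matched = [img for img in images if img["page"] == chunk_page]
    return matched[:MAX_IMAGES_PER_REQUEST]
```

If a page carries more than six images, the extras are still searchable through their text annotations even though they cannot all ride along in the embedding request.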

Prerequisites

| Requirement | Tier 1 (Annotations) | Tier 2 (Multimodal Embeddings) |
| --- | --- | --- |
| Sync method | Enhanced | Enhanced |
| Mistral API key | Required | Required |
| Embedding provider | Any (OpenAI, Cohere, Gemini, Ollama) | Gemini only |
| Embedding model | Any supported model | gemini-embedding-2-preview |

Mistral API Key Required

Image extraction requires a configured Mistral API key for the OCR processor. Without it, Enhanced sync still processes documents but skips image extraction entirely.

Image Extraction

Images are automatically extracted from documents during Enhanced sync:

  • Storage: Images are stored in an organization-scoped Supabase Storage bucket
  • Linking: Each image is linked to document chunks by page position
  • Formats: Supports images embedded in PDFs, DOCX, HTML, and other document formats
  • Deduplication: Identical images within a document are stored once
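Deduplication of identical images is commonly done by hashing the image bytes and storing one copy per distinct hash. A minimal sketch, with the storage keying being an assumption:

```python
import hashlib

def store_unique_images(images: list[bytes]) -> dict[str, bytes]:
    """Keep one copy per distinct image, keyed by SHA-256 of its bytes."""
    stored: dict[str, bytes] = {}
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        stored.setdefault(digest, data)  # identical bytes hash to the same key
    return stored
```

Chunks that reference the same image can then all point at the single stored copy, which is what keeps repeated logos or headers from inflating storage.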

Storage Costs

Extracted images count toward your organization's storage usage. Storage is billed at 200 credits per GiB per month. See Credits & Usage for details.
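At 200 credits per GiB per month, the monthly cost scales linearly with stored bytes; for example, 512 MiB of extracted images costs 100 credits per month:

```python
CREDITS_PER_GIB_MONTH = 200  # rate stated above
GIB = 1024 ** 3

def monthly_storage_credits(bytes_stored: int) -> float:
    """Monthly credit cost for extracted-image storage."""
    return bytes_stored / GIB * CREDITS_PER_GIB_MONTH
```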

Annotation Types

Each extracted image receives AI-generated BBox annotations with the following fields:

| Field | Description | Example |
| --- | --- | --- |
| image_type | Classification of the image content | photograph, chart, diagram, table, screenshot, logo, illustration |
| short_description | Brief one-line summary | "Q3 2025 revenue bar chart" |
| detailed_summary | Comprehensive description of the image content | "Bar chart showing quarterly revenue from Q1-Q4 2025 with year-over-year comparison..." |
| data_points | Key data values extracted from the image | "Q3 revenue: $4.2M, YoY growth: 18%" |

These annotation fields are concatenated and appended to the document chunk that corresponds to the image's page position, enriching the chunk with visual context.
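Conceptually, the enrichment looks like the sketch below. The field names match the table above; the exact separators and layout are assumptions, not the product's actual output format.

```python
def annotation_text(ann: dict) -> str:
    """Flatten one BBox annotation's fields into searchable text."""
    parts = [
        f"[Image: {ann.get('image_type', 'unknown')}]",
        ann.get("short_description", ""),
        ann.get("detailed_summary", ""),
        ann.get("data_points", ""),
    ]
    return "\n".join(p for p in parts if p)

def enrich_chunk(chunk_text: str, annotations: list[dict]) -> str:
    """Append each image annotation to the chunk sharing its page position."""
    if not annotations:
        return chunk_text
    return chunk_text + "\n\n" + "\n\n".join(annotation_text(a) for a in annotations)
```

Because the enriched chunk is plain text, any embedding model indexes the visual content with no special handling.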

RAG with Images

When retrieved chunks contain image annotations, the RAG pipeline includes visual context automatically:

  • Annotation text is included in the LLM context alongside the chunk text
  • All LLMs (Claude, GPT-4, etc.) can reference and reason about visual content through the text annotations
  • Signed URLs with a 10-minute TTL provide temporary access to the original extracted images when needed
  • Multimodal LLMs can use both the annotation text and the actual image for richer responses

Works with Any LLM

Because annotations are plain text, any LLM — including text-only models — can understand and reference the visual content in your documents. You do not need a multimodal LLM to benefit from image annotations.

Sync Progress

After a sync operation completes, the summary displays image-related metrics:

  • Images extracted: Total number of images found and stored from the document
  • Images annotated: Number of images that received AI-generated BBox annotations

These counts appear in the sync completion toast and in the document's sync details.

Usage Metrics

Image extraction metrics are available in Settings > Usage:

| Metric | Description |
| --- | --- |
| Extracted Images | Total count of images extracted across all documents |
| Image Storage | Total storage size consumed by extracted images |

These metrics help you monitor storage consumption and understand the visual richness of your document library.

note

For embedding model configuration including Gemini multimodal models, see Embedding Model Configuration. For sync method configuration, see Sync Settings.