Multimodal Embeddings

Vector Data Loader supports a two-tier architecture for visual document intelligence:

  1. AI Image Extraction with Annotations — Works with all embedding providers (OpenAI, Cohere, Gemini, Ollama). Images are extracted during Enhanced sync and annotated with rich text descriptions that get appended to document chunks, making visual content searchable through text.

  2. True Multimodal Embeddings — Available exclusively with Gemini gemini-embedding-2-preview. Actual images are sent alongside text to the embedding model, producing vectors that encode both visual and textual semantics for native visual similarity search.

Start with Annotations

Even without Gemini, you get significant value from image annotations alone. Any embedding provider can search over the annotation text, giving your RAG pipeline access to visual content described in natural language.

How It Works

During Enhanced sync, the Mistral OCR processor extracts images from documents and generates structured BBox annotations for each image. These annotations include classification (image type), a short description, a detailed summary, and extracted data points.

The annotation text is appended to the relevant document chunks, making visual content searchable with any embedding model. This is the first tier — text-based image search.

When using Gemini gemini-embedding-2-preview as your embedding provider, the system goes further: actual extracted images are sent alongside text in embedding requests, producing true multimodal vectors. Gemini supports up to 6 images per embedding request, and images are matched to chunks based on page position.
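The page-position matching and the six-image cap described above can be sketched as follows. This is an illustrative sketch, not the actual Vector Data Loader internals; the `page` field on the image records is an assumption.

```python
MAX_IMAGES_PER_REQUEST = 6  # Gemini's per-request image limit noted above

def images_for_chunk(chunk_page: int, images: list[dict]) -> list[dict]:
    """Select the images on the chunk's page, capped at the request limit.

    `images` is assumed to be a list of dicts with a `page` key; real
    records would also carry storage paths and annotation data.
    """
    matched = [img for img in images if img["page"] == chunk_page]
    return matched[:MAX_IMAGES_PER_REQUEST]
```

If a page carries more than six images, the extras are still searchable through their text annotations even though they cannot all ride along in the embedding request.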

Prerequisites

| Requirement | Tier 1 (Annotations) | Tier 2 (Multimodal Embeddings) |
| --- | --- | --- |
| Sync method | Enhanced | Enhanced |
| Mistral API key | Required | Required |
| Embedding provider | Any (OpenAI, Cohere, Gemini, Ollama) | Gemini only |
| Embedding model | Any supported model | gemini-embedding-2-preview |

Mistral API Key Required

Image extraction requires a configured Mistral API key for the OCR processor. Without it, Enhanced sync still processes documents but skips image extraction entirely.

Image Extraction

Images are automatically extracted from documents during Enhanced sync:

  • Storage: Images are stored in an organization-scoped Supabase Storage bucket
  • Linking: Each image is linked to document chunks by page position
  • Formats: Supports images embedded in PDFs, DOCX, HTML, and other document formats
  • Deduplication: Identical images within a document are stored once
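Deduplication of identical images is commonly done by hashing the image bytes and storing one copy per distinct hash. A minimal sketch, with the storage keying being an assumption:

```python
import hashlib

def store_unique_images(images: list[bytes]) -> dict[str, bytes]:
    """Keep one copy per distinct image, keyed by SHA-256 of its bytes."""
    stored: dict[str, bytes] = {}
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        stored.setdefault(digest, data)  # identical bytes hash to the same key
    return stored
```

Chunks that reference the same image can then all point at the single stored copy, which is what keeps repeated logos or headers from inflating storage.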

Storage Costs

Extracted images count toward your organization's storage usage. Storage is billed at 200 credits per GiB per month. See Credits & Usage for details.
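At 200 credits per GiB per month, the monthly cost scales linearly with stored bytes; for example, 512 MiB of extracted images costs 100 credits per month:

```python
CREDITS_PER_GIB_MONTH = 200  # rate stated above
GIB = 1024 ** 3

def monthly_storage_credits(bytes_stored: int) -> float:
    """Monthly credit cost for extracted-image storage."""
    return bytes_stored / GIB * CREDITS_PER_GIB_MONTH
```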

Annotation Types

Each extracted image receives AI-generated BBox annotations with the following fields:

| Field | Description | Example |
| --- | --- | --- |
| image_type | Classification of the image content | photograph, chart, diagram, table, screenshot, logo, illustration |
| short_description | Brief one-line summary | "Q3 2025 revenue bar chart" |
| detailed_summary | Comprehensive description of the image content | "Bar chart showing quarterly revenue from Q1-Q4 2025 with year-over-year comparison..." |
| data_points | Key data values extracted from the image | "Q3 revenue: $4.2M, YoY growth: 18%" |

These annotation fields are concatenated and appended to the document chunk that corresponds to the image's page position, enriching the chunk with visual context.
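Conceptually, the enrichment looks like the sketch below. The field names match the table above; the exact separators and layout are assumptions, not the product's actual output format.

```python
def annotation_text(ann: dict) -> str:
    """Flatten one BBox annotation's fields into searchable text."""
    parts = [
        f"[Image: {ann.get('image_type', 'unknown')}]",
        ann.get("short_description", ""),
        ann.get("detailed_summary", ""),
        ann.get("data_points", ""),
    ]
    return "\n".join(p for p in parts if p)

def enrich_chunk(chunk_text: str, annotations: list[dict]) -> str:
    """Append each image annotation to the chunk sharing its page position."""
    if not annotations:
        return chunk_text
    return chunk_text + "\n\n" + "\n\n".join(annotation_text(a) for a in annotations)
```

Because the enriched chunk is plain text, any embedding model indexes the visual content with no special handling.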

RAG with Images

When retrieved chunks contain image annotations, the RAG pipeline includes visual context automatically:

  • Annotation text is included in the LLM context alongside the chunk text
  • All LLMs (Claude, GPT-4, etc.) can reference and reason about visual content through the text annotations
  • Signed URLs with a 10-minute TTL provide temporary access to the original extracted images when needed
  • Multimodal LLMs can use both the annotation text and the actual image for richer responses

Works with Any LLM

Because annotations are plain text, any LLM — including text-only models — can understand and reference the visual content in your documents. You do not need a multimodal LLM to benefit from image annotations.

Sync Progress

After a sync operation completes, the summary displays image-related metrics:

  • Images extracted: Total number of images found and stored from the document
  • Images annotated: Number of images that received AI-generated BBox annotations

These counts appear in the sync completion toast and in the document's sync details.

Usage Metrics

Image extraction metrics are available in Settings > Usage:

| Metric | Description |
| --- | --- |
| Extracted Images | Total count of images extracted across all documents |
| Image Storage | Total storage size consumed by extracted images |

These metrics help you monitor storage consumption and understand the visual richness of your document library.

note

For embedding model configuration including Gemini multimodal models, see Embedding Model Configuration. For sync method configuration, see Sync Settings.