Document Statistics

The Document Statistics page provides detailed information about how a document was processed.

Overview Section

| Metric | Description |
| --- | --- |
| Content Size | Size of the extracted text content in bytes |
| Estimated Tokens | Approximate token count for LLM context planning |
| Vector Chunks | Number of chunks created in the vector store |
| Sync Duration | Time taken for the last sync operation |
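
The Estimated Tokens figure is a heuristic for context planning, not an exact tokenizer count. A minimal sketch of the kind of estimate involved, assuming the common rule of thumb of roughly 4 characters per token (the exact heuristic the app uses is not documented here):

```python
def estimate_tokens(content: str) -> int:
    """Rough token estimate for LLM context planning.

    Assumes the common ~4 characters-per-token heuristic for English
    text; a real tokenizer (e.g., tiktoken) will give different counts.
    """
    return max(1, len(content) // 4)

content = "Extracted text content from a synced document..."
print(f"Content size: {len(content.encode('utf-8'))} bytes")
print(f"Estimated tokens: {estimate_tokens(content)}")
```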

Chunking Details

Shows how the document was split:

| Field | Description |
| --- | --- |
| Splitter Type | Algorithm used for chunking: Header-Aware (HTML), Header-Aware (Markdown), or Recursive Text |
| Contextual Retrieval | Whether enhanced context was generated for each chunk. Shows "Enhanced" with the LLM cost if used, or "Standard" if not |
| Total Chunks | Number of chunks created from the document |
| Average Chunk Size | Mean character count per chunk |
| Content Size | Total size of the extracted text |
| Est. Token Count | Approximate token count for the document |
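
Most of these fields are simple aggregates over the chunk list. A short sketch, assuming the chunks are available as a list of strings (the names here are illustrative, not the app's internal code):

```python
chunks = ["First chunk of text...", "Second chunk...", "Third chunk..."]

total_chunks = len(chunks)
content_size = sum(len(c.encode("utf-8")) for c in chunks)   # bytes
avg_chunk_size = sum(len(c) for c in chunks) / total_chunks  # mean chars
est_tokens = content_size // 4  # same ~4 bytes/token heuristic as above

print(f"Total chunks: {total_chunks}")
print(f"Average chunk size: {avg_chunk_size:.0f} characters")
print(f"Content size: {content_size} bytes (~{est_tokens} tokens)")
```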

Splitter Types Explained

| Type | Description | Best For |
| --- | --- | --- |
| Header-Aware (HTML) | Splits at HTML header tags (h1, h2, etc.), preserving hierarchy | Web pages, Confluence, HTML docs |
| Header-Aware (Markdown) | Splits at Markdown headers (#, ##, etc.) | Markdown files, GitHub READMEs |
| Recursive Text | Splits at natural boundaries (paragraphs, sentences) | PDFs, plain text, unstructured content |
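
To illustrate how choosing a splitter by content type can work, here is a sketch using the LangChain text splitters, which implement the same three strategies. This assumes a LangChain-style pipeline; it is not necessarily what the app uses internally:

```python
from langchain_text_splitters import (
    HTMLHeaderTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def pick_splitter(content_type: str):
    """Choose a chunking strategy based on document structure."""
    if content_type == "text/html":
        # Split at h1/h2/h3 tags, keeping the header hierarchy as metadata.
        return HTMLHeaderTextSplitter(
            headers_to_split_on=[("h1", "h1"), ("h2", "h2"), ("h3", "h3")]
        )
    if content_type == "text/markdown":
        # Split at #/##/### headers.
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
        )
    # Fallback: split at natural boundaries (paragraphs, then sentences).
    return RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

chunks = pick_splitter("text/markdown").split_text("# Title\n\nSome body text.")
```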

Contextual Retrieval Status

When a document is synced with Contextual Retrieval enabled, the stats page shows:

  • Enhanced label with a sparkle icon
  • LLM cost in parentheses (e.g., $0.002)

Documents synced in standard mode show Standard in muted text.
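
With contextual retrieval, each chunk is prefixed with a short LLM-generated summary situating it in the whole document before it is embedded, which is where the per-document LLM cost comes from. A hedged sketch of the idea, with a hypothetical `complete()` callable standing in for whatever LLM client the app actually uses:

```python
CONTEXT_PROMPT = """\
<document>
{document}
</document>

Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>

Write 1-2 sentences situating this chunk within the overall document,
to improve search retrieval of the chunk."""

def enhance_chunk(document: str, chunk: str, complete) -> str:
    """Prepend LLM-generated context to a chunk before embedding.

    `complete` is a hypothetical text-completion callable; substitute
    the actual LLM client in use.
    """
    context = complete(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context}\n\n{chunk}"
```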

Embedding Details

Information about the embedding process:

  • Model Used: Which embedding model processed this document
  • Dimensions: Vector dimensionality (e.g., 1536 for OpenAI)
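
For example, embedding a chunk with the OpenAI Python SDK looks like the sketch below; `text-embedding-3-small` returns 1536-dimensional vectors, matching the example dimensionality above. This is illustrative only; the model actually used is the one shown on the stats page.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input="A chunk of document text to embed.",
)
vector = response.data[0].embedding
print(len(vector))  # 1536
```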

Vector Store References

Shows where chunks are stored:

  • Provider: Active vector store (Supabase, Pinecone, etc.)
  • Collection/Table: Specific storage location
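
As an illustration of how a chunk ends up at a specific provider and collection, here is a sketch of an upsert with the Pinecone Python SDK. The index name and metadata fields are hypothetical, and other providers (e.g., a Supabase table with pgvector) follow the same pattern:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")   # provider credentials
index = pc.Index("documents")  # collection/table name (hypothetical)

vector = [0.0] * 1536          # embedding from the previous step

index.upsert(vectors=[{
    "id": "doc-42-chunk-0",    # stable per-chunk ID (hypothetical scheme)
    "values": vector,
    "metadata": {"document_id": "doc-42", "chunk_index": 0},
}])
```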