
Website Crawler

The website crawler allows you to ingest content from public web pages into your vector store. It supports crawling single pages or following links to discover related content.
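Conceptually, link discovery amounts to fetching a page and keeping only the links that stay on the same domain. The sketch below illustrates that idea with Python's standard library; it is not the crawler's actual implementation, and the start URL is a placeholder.

```python
# Minimal sketch of link discovery: fetch one page and collect the links
# that stay on the same domain. Illustrative only; not the crawler's code.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute = urljoin(self.base_url, value)
                    # Keep only internal links (same domain as the start URL)
                    if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
                        self.links.add(absolute)

start_url = "https://docs.example.com"  # placeholder URL
with urlopen(start_url) as response:
    collector = LinkCollector(start_url)
    collector.feed(response.read().decode("utf-8", errors="replace"))
print(sorted(collector.links))
```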

Adding a Website Source

  1. Navigate to Sources from the sidebar
  2. Click Add Source and select the Website card
  3. Enter the target URL in the text field (e.g., https://docs.example.com)
  4. Click Add URL to add the page to your selection (a quick URL validity check is sketched after these steps)
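Before a URL is added, it helps to confirm that it is well formed and uses a web scheme. Here is a minimal sketch of such a check using Python's standard library; the Website source performs its own validation, so this is only an illustration:

```python
# Quick pre-flight check for a URL before adding it as a source.
# Illustrative only; the Website source performs its own validation.
from urllib.parse import urlparse

def looks_like_valid_source(url: str) -> bool:
    parsed = urlparse(url)
    # Require an http(s) scheme and a hostname, e.g. https://docs.example.com
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(looks_like_valid_source("https://docs.example.com"))  # True
print(looks_like_valid_source("not-a-url"))                 # False
```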

Browsing and Managing Website Pages

Once connected, click Browse on your Website source card to open the Website Manager:

| Section | Description |
| --- | --- |
| URL Input | Text field for adding new URLs to sync |
| Previously Synced Pages | List of pages already synced to your vector store |
| Selection Count | Shows how many pages are currently selected |
| Sync Limit Warning | Displays the maximum number of pages allowed per immediate sync |

Syncing Website Pages

Immediate Sync (up to 15 pages)

  1. Enter URLs or select previously synced pages
  2. Review the selection count at the bottom
  3. Click Sync X Pages to sync immediately
Warning: Immediate sync is limited to 15 pages per operation. This ensures reliable processing and prevents timeouts.

Queue-Based Sync (for large batches)

When you need to sync more than 15 pages, use the queue-based sync feature:

  1. Select all desired pages (even if more than 15)
  2. When selection exceeds 15 pages, a warning appears with a Queue button
  3. Click Queue X Pages to add them to the background sync queue
  4. The system processes queued pages automatically every 5 minutes (see the sketch after the comparison table below)
| Sync Method | Page Limit | Processing | Best For |
| --- | --- | --- | --- |
| Immediate Sync | 15 pages | Instant | Small batches, quick updates |
| Queue Sync | Unlimited | Background (every 5 minutes) | Large site crawls, bulk imports |
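The following sketch illustrates how a background queue worker of this kind could operate: each cycle it drains a batch of queued pages and syncs them. All names here are hypothetical, and the per-cycle batch size is an assumption for illustration; this is not the product's implementation.

```python
# Conceptual sketch of a background sync queue: every 5 minutes the worker
# takes a batch of queued pages and syncs them. The sync_page() helper and
# the per-cycle batch size of 15 are assumptions for illustration only.
import time
from collections import deque

BATCH_SIZE = 15          # assumed batch per cycle (matches the immediate-sync cap)
INTERVAL_SECONDS = 300   # queued pages are processed every 5 minutes

queue = deque(f"https://docs.example.com/page-{i}" for i in range(40))

def sync_page(url: str) -> None:
    """Hypothetical placeholder for syncing one page to the vector store."""
    print(f"synced {url}")

while queue:
    batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
    for url in batch:
        sync_page(url)
    if queue:
        time.sleep(INTERVAL_SECONDS)  # wait for the next processing cycle
```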

Crawl Behavior

The crawler:

  • Follows internal links within the same domain
  • Respects robots.txt directives
  • Extracts the main content while filtering out navigation and ads
  • Maintains URL references for source tracking
  • Deduplicates pages to prevent duplicate syncs
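These rules can be expressed compactly. The sketch below uses Python's standard library to illustrate the same-domain, robots.txt, and deduplication checks; it is a conceptual illustration, not the crawler's actual code.

```python
# Conceptual sketch of the crawl rules listed above: same-domain links only,
# robots.txt respected, and visited URLs deduplicated. Not the product's code.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

start_url = "https://docs.example.com"  # placeholder URL
domain = urlparse(start_url).netloc

robots = RobotFileParser()
robots.set_url(f"https://{domain}/robots.txt")
robots.read()  # fetch and parse robots.txt once per domain

visited = set()  # dedup: each page is synced at most once

def should_crawl(url: str) -> bool:
    return (
        urlparse(url).netloc == domain   # internal links only
        and robots.can_fetch("*", url)   # respect robots.txt directives
        and url not in visited           # skip already-crawled pages
    )
```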

Security Considerations

The crawler includes SSRF (Server-Side Request Forgery) protection to prevent access to internal or private network resources.
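A common way to implement this kind of guard is to resolve the target hostname and refuse to fetch anything that maps to a loopback, private, link-local, or reserved address. The sketch below illustrates the idea; it is not the crawler's actual implementation.

```python
# Sketch of a typical SSRF guard: resolve the hostname and reject URLs
# that point at internal or private network addresses. Illustrative only.
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_to_fetch(url: str) -> bool:
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        # Resolve every address the hostname maps to
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        # Strip any IPv6 zone suffix (e.g. "%eth0") before parsing
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        # Reject loopback, private (RFC 1918), link-local, and reserved ranges
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_to_fetch("http://127.0.0.1/admin"))  # False: loopback address
```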