Website Crawler
The website crawler allows you to ingest content from public web pages into your vector store. It supports crawling single pages or following links to discover related content.
Adding a Website Source
- Navigate to Sources from the sidebar
- Click Add Source and select the Website card
- Enter the target URL in the text field (e.g., https://docs.example.com)
- Click Add URL to add the page to your selection
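If you want to sanity-check a URL before adding it, a quick script can confirm that it parses as http(s) and that the server responds. The snippet below is an optional, illustrative Python pre-check and not part of the product; the is_probably_crawlable helper name is made up for this example.

```python
from urllib.parse import urlparse
import urllib.request

def is_probably_crawlable(url: str) -> bool:
    """Rough pre-check: the URL parses as http(s) and the server responds."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    try:
        # A HEAD request keeps the check lightweight; any non-error status counts as reachable.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status < 400
    except Exception:
        return False

# Example usage (substitute a real URL; docs.example.com is a placeholder):
print(is_probably_crawlable("https://docs.example.com"))
```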
Browsing and Managing Website Pages
Once connected, click Browse on your Website source card to open the Website Manager:
| Section | Description |
|---|---|
| URL Input | Text field to add new URLs for syncing |
| Previously Synced Pages | List of pages already synced to your vector store |
| Selection Count | Shows how many pages are currently selected |
| Sync Limit Warning | Displays the maximum pages allowed per immediate sync |
Syncing Website Pages
Immediate Sync (up to 15 pages)
- Enter URLs or select previously synced pages
- Review the selection count at the bottom
- Click Sync X Pages to sync immediately
Warning: Immediate sync is limited to 15 pages per operation. This keeps processing reliable and prevents timeouts.
Queue-Based Sync (for large batches)
When you need to sync more than 15 pages, use the queue-based sync feature:
- Select all desired pages (even if more than 15)
- When selection exceeds 15 pages, a warning appears with a Queue button
- Click Queue X Pages to add them to the background sync queue
- The system processes queued pages automatically every 5 minutes
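Conceptually, the queue behaves like a scheduled worker that wakes up on an interval and drains whatever is waiting. The sketch below only illustrates that pattern; it is not the product's implementation, and fetch_queued_pages and sync_page are hypothetical stand-ins.

```python
import time

SYNC_INTERVAL_SECONDS = 5 * 60  # queued pages are processed every 5 minutes

def fetch_queued_pages() -> list[str]:
    """Hypothetical helper: return URLs currently waiting in the sync queue."""
    return []

def sync_page(url: str) -> None:
    """Hypothetical helper: crawl one page and write it to the vector store."""
    print(f"synced {url}")

def run_queue_worker() -> None:
    """Drain the queue on a fixed interval; one bad page never blocks the batch."""
    while True:
        for url in fetch_queued_pages():
            try:
                sync_page(url)
            except Exception as err:
                print(f"failed to sync {url}: {err}")
        time.sleep(SYNC_INTERVAL_SECONDS)
```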
| Sync Method | Page Limit | Processing | Best For |
|---|---|---|---|
| Immediate Sync | 15 pages | Instant | Small batches, quick updates |
| Queue Sync | Unlimited | Background (every 5 min) | Large site crawls, bulk imports |
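A simple way to think about the choice: if the selection fits within the immediate limit, sync it now; otherwise queue it. A minimal sketch of that rule, with hypothetical sync_now and enqueue stand-ins for the two buttons:

```python
IMMEDIATE_SYNC_LIMIT = 15  # documented maximum for a single immediate sync

def sync_now(urls: list[str]) -> None:
    """Hypothetical stand-in for the Sync X Pages action."""
    print(f"syncing {len(urls)} pages immediately")

def enqueue(urls: list[str]) -> None:
    """Hypothetical stand-in for the Queue X Pages action."""
    print(f"queued {len(urls)} pages for background processing")

def sync_selection(urls: list[str]) -> None:
    """Apply the documented rule: 15 or fewer pages sync now, larger batches go to the queue."""
    if len(urls) <= IMMEDIATE_SYNC_LIMIT:
        sync_now(urls)
    else:
        enqueue(urls)
```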
Crawl Behavior
The crawler:
- Follows internal links within the same domain
- Respects robots.txt directives
- Extracts main content while filtering navigation/ads
- Maintains URL references for source tracking
- Deduplicates pages to prevent duplicate syncs
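To make these rules concrete, the sketch below shows a generic breadth-first crawler that checks robots.txt, stays on the starting domain, and keeps a visited set so each page is fetched only once. It is an illustrative Python example, not the product's crawler, and it skips the content-extraction step.

```python
import re
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

def fetch(url: str) -> str:
    """Minimal page fetch; a real crawler would add retries and error handling."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html: str) -> list[str]:
    """Very rough href extraction; stands in for a proper HTML parser."""
    return re.findall(r'href="([^"#]+)"', html)

def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """Breadth-first crawl restricted to the starting URL's domain."""
    domain = urlparse(start_url).netloc

    # Respect robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()

    visited: set[str] = set()   # deduplication: each URL is fetched at most once
    pages: dict[str, str] = {}  # url -> raw HTML (content filtering omitted here)
    queue = deque([start_url])

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited or not robots.can_fetch("*", url):
            continue
        visited.add(url)

        html = fetch(url)
        pages[url] = html  # the real crawler also filters navigation/ads at this point

        # Follow internal links only (same domain as the starting page).
        for link in extract_links(html):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)

    return pages
```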
Security Considerations
The crawler includes SSRF (Server-Side Request Forgery) protection that blocks requests to internal or private network resources.
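SSRF protection of this kind typically resolves the hostname and rejects any address in a private, loopback, or link-local range before making a request. The following is a minimal Python sketch of that idea, not the product's actual implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    """Reject URLs whose host resolves to a private, loopback, or link-local address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
    except (socket.gaierror, ValueError):
        return False  # unresolvable or malformed addresses are rejected
    return True

print(is_safe_public_url("https://example.com"))      # public host: typically True
print(is_safe_public_url("http://169.254.169.254/"))  # link-local metadata address: False
```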