Website Crawler
The website crawler allows you to ingest content from public web pages into your vector store. It supports crawling single pages or following links to discover related content.
Adding a Website Source
- Navigate to Sources from the sidebar
- Click Add Source and select the Website card
- Enter the target URL in the text field (e.g., https://docs.example.com)
- Click Add URL to add the page to your selection
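If you want to sanity-check a URL before adding it, a quick script can confirm that it parses as http(s) and that the server responds. The snippet below is an optional, illustrative Python pre-check and not part of the product; the is_probably_crawlable helper name is made up for this example.

```python
from urllib.parse import urlparse
import urllib.request

def is_probably_crawlable(url: str) -> bool:
    """Rough pre-check: the URL parses as http(s) and the server responds."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    try:
        # A HEAD request keeps the check lightweight; any non-error status counts as reachable.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status < 400
    except Exception:
        return False

# Example usage (substitute a real URL; docs.example.com is a placeholder):
print(is_probably_crawlable("https://docs.example.com"))
```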
Browsing and Managing Website Pages
Once connected, click Browse on your Website source card to open the Website Manager:
| Section | Description |
|---|---|
| URL Input | Text field to add new URLs for syncing |
| Previously Synced Pages | List of pages already synced to your vector store |
| Selection Count | Shows how many pages are currently selected |
| Sync Limit Warning | Displays the maximum pages allowed per immediate sync |
Syncing Website Pages
Immediate Sync (up to 15 pages)
- Enter URLs or select previously synced pages
- Review the selection count at the bottom
- Click Sync X Pages to sync immediately
Warning: Immediate sync is limited to 15 pages per operation. This keeps processing reliable and prevents timeouts.
Queue-Based Sync (for large batches)
When you need to sync more than 15 pages, use the queue-based sync feature:
- Select all desired pages (even if more than 15)
- When selection exceeds 15 pages, a warning appears with a Queue button
- Click Queue X Pages to add them to the background sync queue
- The system processes queued pages automatically every 5 minutes
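Conceptually, the queue behaves like a scheduled worker that wakes up on an interval and drains whatever is waiting. The sketch below only illustrates that pattern; it is not the product's implementation, and fetch_queued_pages and sync_page are hypothetical stand-ins.

```python
import time

SYNC_INTERVAL_SECONDS = 5 * 60  # queued pages are processed every 5 minutes

def fetch_queued_pages() -> list[str]:
    """Hypothetical helper: return URLs currently waiting in the sync queue."""
    return []

def sync_page(url: str) -> None:
    """Hypothetical helper: crawl one page and write it to the vector store."""
    print(f"synced {url}")

def run_queue_worker() -> None:
    """Drain the queue on a fixed interval; one bad page never blocks the batch."""
    while True:
        for url in fetch_queued_pages():
            try:
                sync_page(url)
            except Exception as err:
                print(f"failed to sync {url}: {err}")
        time.sleep(SYNC_INTERVAL_SECONDS)
```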
| Sync Method | Page Limit | Processing | Best For |
|---|---|---|---|
| Immediate Sync | 15 pages | Instant | Small batches, quick updates |
| Queue Sync | Unlimited | Background (every 5 min) | Large site crawls, bulk imports |
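A simple way to think about the choice: if the selection fits within the immediate limit, sync it now; otherwise queue it. A minimal sketch of that rule, with hypothetical sync_now and enqueue stand-ins for the two buttons:

```python
IMMEDIATE_SYNC_LIMIT = 15  # documented maximum for a single immediate sync

def sync_now(urls: list[str]) -> None:
    """Hypothetical stand-in for the Sync X Pages action."""
    print(f"syncing {len(urls)} pages immediately")

def enqueue(urls: list[str]) -> None:
    """Hypothetical stand-in for the Queue X Pages action."""
    print(f"queued {len(urls)} pages for background processing")

def sync_selection(urls: list[str]) -> None:
    """Apply the documented rule: 15 or fewer pages sync now, larger batches go to the queue."""
    if len(urls) <= IMMEDIATE_SYNC_LIMIT:
        sync_now(urls)
    else:
        enqueue(urls)
```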
Crawl Behavior
The crawler:
- Follows internal links within the same domain
- Respects robots.txt directives
- Extracts main content while filtering navigation/ads
- Maintains URL references for source tracking
- Deduplicates pages to prevent duplicate syncs
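To make these rules concrete, the sketch below shows a generic breadth-first crawler that checks robots.txt, stays on the starting domain, and keeps a visited set so each page is fetched only once. It is an illustrative Python example, not the product's crawler, and it skips the content-extraction step.

```python
import re
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

def fetch(url: str) -> str:
    """Minimal page fetch; a real crawler would add retries and error handling."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html: str) -> list[str]:
    """Very rough href extraction; stands in for a proper HTML parser."""
    return re.findall(r'href="([^"#]+)"', html)

def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """Breadth-first crawl restricted to the starting URL's domain."""
    domain = urlparse(start_url).netloc

    # Respect robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()

    visited: set[str] = set()   # deduplication: each URL is fetched at most once
    pages: dict[str, str] = {}  # url -> raw HTML (content filtering omitted here)
    queue = deque([start_url])

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited or not robots.can_fetch("*", url):
            continue
        visited.add(url)

        html = fetch(url)
        pages[url] = html  # the real crawler also filters navigation/ads at this point

        # Follow internal links only (same domain as the starting page).
        for link in extract_links(html):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)

    return pages
```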
Security Considerations
The crawler includes SSRF (Server-Side Request Forgery) protection that blocks requests to internal or private network resources.
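SSRF protection of this kind typically resolves the hostname and rejects any address in a private, loopback, or link-local range before making a request. The following is a minimal Python sketch of that idea, not the product's actual implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    """Reject URLs whose host resolves to a private, loopback, or link-local address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
    except (socket.gaierror, ValueError):
        return False  # unresolvable or malformed addresses are rejected
    return True

print(is_safe_public_url("https://example.com"))      # public host: typically True
print(is_safe_public_url("http://169.254.169.254/"))  # link-local metadata address: False
```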