-
Notifications
You must be signed in to change notification settings - Fork 9
Choosing Best Configuration
This guide helps you choose the optimal configuration for your specific use case. We'll look at common scenarios and provide recommended configurations for each.
If your website relies heavily on JavaScript for content rendering (e.g., Single Page Applications, dynamic content), use the Puppeteer crawler:
{
"crawler_type": "puppeteer",
"strategy": "default",
"max_concurrency": 4,
"max_requests_per_minute": 120
}
Note: Lower concurrency is recommended for Puppeteer due to higher resource usage.
To crawl only specific language versions of your site, use URL patterns in the configuration:
{
"start_urls": ["https://example.com/en"],
"urls_to_exclude": [
"https://example.com/fr/**",
"https://example.com/de/**"
]
}
For Retrieval Augmented Generation (RAG) systems, use the markdown strategy to get clean, structured content:
{
"strategy": "markdown",
"crawler_type": "cheerio",
"meilisearch_settings": {
"searchableAttributes": [
"content",
"title",
"description"
]
}
}
For sites using Schema.org markup, use the schema strategy:
{
"strategy": "schema",
"schema_settings": {
"convert_dates": true,
"only_type": "Product"
},
"meilisearch_settings": {
"filterableAttributes": [
"price",
"category",
"brand"
],
"sortableAttributes": ["price"]
}
}
For custom content extraction needs, use the custom strategy with specific selectors:
{
"strategy": "custom",
"selectors": {
"title": "h1.main-title",
"content": ["article p", "div.content"],
"category": "span.category-tag"
}
}
For documentation sites, especially those using DocSearch-compatible frontends:
{
"strategy": "docssearch",
"crawler_type": "cheerio",
"meilisearch_settings": {
"searchableAttributes": [
"content",
"hierarchy_lvl1",
"hierarchy_lvl2",
"hierarchy_lvl3"
]
}
}
To respect rate limits and prevent overloading the server:
{
"max_concurrency": 2,
"max_requests_per_minute": 60,
"batch_size": 100,
"additional_request_headers": {
"User-Agent": "Friendly Bot/1.0"
}
}
- Memory Usage: Puppeteer/Playwright crawlers use more memory than Cheerio
- Speed: Cheerio is fastest but can't handle JavaScript rendering
- Concurrency: Start with lower values and increase based on server response
- Batch Size: Larger batches improve indexing speed but use more memory
- Start Simple: Begin with the default strategy and adjust as needed
- Test First: Run a small test crawl before full deployment
- Monitor Logs: Watch for errors or performance issues
- Respect Robots.txt: Ensure your crawling follows site guidelines
- Use Webhooks: Enable webhook notifications for monitoring long crawls
{
"webhook_url": "https://your-monitoring.com/webhook",
"webhook_payload": {
"environment": "production",
"source": "docs-crawler"
}
}