Choosing Best Configuration

Choose the Best Configuration

This guide helps you choose the optimal configuration for your specific use case. We'll look at common scenarios and provide recommended configurations for each.

Common Scenarios

"My website uses JavaScript rendering"

If your website relies heavily on JavaScript for content rendering (e.g., Single Page Applications, dynamic content), use the Puppeteer crawler:

{
  "crawler_type": "puppeteer",
  "strategy": "default",
  "max_concurrency": 4,
  "max_requests_per_minute": 120
}

Note: Lower concurrency is recommended for Puppeteer due to higher resource usage.

"I want to index only one language of my website"

To crawl only specific language versions of your site, use URL patterns in the configuration:

{
  "start_urls": ["https://example.com/en"],
  "urls_to_exclude": [
    "https://example.com/fr/**",
    "https://example.com/de/**"
  ]
}

"I want to build a RAG system with my documentation"

For Retrieval Augmented Generation (RAG) systems, use the markdown strategy to get clean, structured content:

{
  "strategy": "markdown",
  "crawler_type": "cheerio",
  "meilisearch_settings": {
    "searchableAttributes": [
      "content",
      "title",
      "description"
    ]
  }
}

"I'm crawling an e-commerce site with structured data"

For sites using Schema.org markup, use the schema strategy:

{
  "strategy": "schema",
  "schema_settings": {
    "convert_dates": true,
    "only_type": "Product"
  },
  "meilisearch_settings": {
    "filterableAttributes": [
      "price",
      "category",
      "brand"
    ],
    "sortableAttributes": ["price"]
  }
}

"I need precise control over what content is extracted"

For custom content extraction needs, use the custom strategy with specific selectors:

{
  "strategy": "custom",
  "selectors": {
    "title": "h1.main-title",
    "content": ["article p", "div.content"],
    "category": "span.category-tag"
  }
}

"I'm crawling a documentation site with DocSearch"

For documentation sites, especially those using DocSearch-compatible frontends:

{
  "strategy": "docssearch",
  "crawler_type": "cheerio",
  "meilisearch_settings": {
    "searchableAttributes": [
      "content",
      "hierarchy_lvl1",
      "hierarchy_lvl2",
      "hierarchy_lvl3"
    ]
  }
}

"My site has rate limiting"

To respect rate limits and prevent overloading the server:

{
  "max_concurrency": 2,
  "max_requests_per_minute": 60,
  "batch_size": 100,
  "additional_request_headers": {
    "User-Agent": "Friendly Bot/1.0"
  }
}

Performance Considerations

Memory Usage: Puppeteer/Playwright crawlers use more memory than Cheerio
Speed: Cheerio is fastest but can't handle JavaScript rendering
Concurrency: Start with lower values and increase based on server response
Batch Size: Larger batches improve indexing speed but use more memory

Best Practices

Start Simple: Begin with the default strategy and adjust as needed
Test First: Run a small test crawl before full deployment
Monitor Logs: Watch for errors or performance issues
Respect Robots.txt: Ensure your crawling follows site guidelines
Use Webhooks: Enable webhook notifications for monitoring long crawls

{
  "webhook_url": "https://your-monitoring.com/webhook",
  "webhook_payload": {
    "environment": "production",
    "source": "docs-crawler"
  }
}

Overview

Provide feedback

Saved searches

Use saved searches to filter your results more quickly