Config Reference

The configuration file (JSON format) supports the following options:

Required Configuration

{
  meilisearch_index_uid: string;
  meilisearch_url: string;
  meilisearch_api_key: string;
  start_urls: string[];
}

meilisearch_index_uid: Unique identifier for the Meilisearch index
meilisearch_url: URL of the Meilisearch server instance
meilisearch_api_key: API key for Meilisearch authentication
start_urls: Initial URLs to begin crawling from

Crawler Configuration

Crawler Type

{
  crawler_type?: 'cheerio' | 'puppeteer' | 'playwright';
  strategy?: 'docssearch' | 'default' | 'schema' | 'markdown' | 'custom';
}

crawler_type (default: "cheerio"): Web scraping engine to use
- cheerio: Fast HTML parser for static sites
- puppeteer: Full Chrome automation for dynamic sites
- playwright: Modern cross-browser automation (beta)
strategy (default: "default"): Content extraction strategy
- default: General-purpose hierarchical content extraction
- docssearch: DocSearch plugin compatibility
- schema: Schema.org structured data extraction
- markdown: Markdown conversion
- custom: User-defined selectors

URL Control

{
  urls_to_exclude?: string[];
  urls_to_index?: string[];
  urls_to_not_index?: string[];
  use_sitemap?: boolean;
  sitemap_urls?: string[];
}

urls_to_exclude: URLs to skip during crawling
urls_to_index: Specific URLs to index (overrides start_urls)
urls_to_not_index: URLs to exclude from indexing but still crawl
use_sitemap (default: false): Whether to use sitemaps for URL discovery
sitemap_urls: Optional custom sitemap URLs to use instead of auto-discovery

The crawler will attempt to discover and use sitemaps in the following ways:

If sitemap_urls is provided, it will use those URLs directly
If use_sitemap is true (default), it will attempt to find sitemaps at common locations:
- /sitemap.xml
- /sitemap_index.xml
- /sitemap
- /robots.txt (will extract Sitemap: directives)
If no sitemap is found or if use_sitemap is false, it will fall back to using start_urls

Example usage:

{
  "urls_to_exclude": ["https://example.com/private"],
  "urls_to_index": ["https://example.com/products"],
  "urls_to_not_index": ["https://example.com/login"],
  "use_sitemap": true,
  "sitemap_urls": [
    "https://example.com/custom-sitemap.xml",
    "https://example.com/playground-sitemap.xml"
  ]
}

Performance Settings

{
  max_concurrency?: number;
  max_requests_per_minute?: number;
  batch_size?: number;
}

max_concurrency (default: Infinity): Maximum concurrent requests
max_requests_per_minute (default: Infinity): Request rate limit
batch_size (default: 1000): Documents per indexing batch

Example usage:

{
  "max_concurrency": 5,
  "max_requests_per_minute": 60,
  "batch_size": 500
}

Meilisearch Configuration

{
  primary_key?: string;
  keep_settings?: boolean;
  meilisearch_settings?: {
    searchableAttributes?: string[];
    filterableAttributes?: string[];
    distinctAttribute?: string;
    sortableAttributes?: string[];
    rankingRules?: string[];
    stopWords?: string[];
    synonyms?: Record<string, string[]>;
    ...
  };
}

primary_key: Unique identifier field for documents
keep_settings (default: true): Whether to keep existing index settings
meilisearch_settings: Custom Meilisearch index settings

Request Configuration

 {
  additional_request_headers?: Record<string, string>;
  user_agents?: string[];
  launch_options?: Record<string, any>;
}

additional_request_headers: Custom HTTP headers for requests
user_agents: Custom User-Agent strings to rotate (default: [])
launch_options: Custom Puppeteer/Playwright launch options (default: null)

Webhook Configuration

{
  webhook_url?: string;
  webhook_payload?: Record<string, any>;
}

webhook_url: URL for webhook notifications
webhook_payload: Custom data for webhook payloads

Webhook notification types:

started: Crawling begins
active: Progress updates
paused: Crawling paused
completed: Successful completion
failed: Error occurred

Error Detection

{
  not_found_selectors?: string[];
}

not_found_selectors: CSS selectors for identifying not found pages

Strategy-Specific Settings

Schema Strategy

{
  schema_settings?: {
    convert_dates?: boolean;
    only_type?: string;
  };
}

schema_settings: Schema-specific settings
- convert_dates: Convert dates to timestamp format
- only_type: Limit extraction to specific schema type

Custom Strategy

{
  selectors?: {
    [key: string]: string | string[];
  };
}

selectors: CSS selectors for content extraction

Overview

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Config Reference

Config Reference

Required Configuration

Crawler Configuration

Crawler Type

URL Control

Performance Settings

Meilisearch Configuration

Request Configuration

Webhook Configuration

Error Detection

Strategy-Specific Settings

Schema Strategy

Custom Strategy

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Documentation

References

Clone this wiki locally