Skip to content

Config Reference

qdequele edited this page Nov 30, 2024 · 5 revisions

Config Reference

The configuration file (JSON format) supports the following options:

Required Configuration

{
  meilisearch_index_uid: string;
  meilisearch_url: string;
  meilisearch_api_key: string;
  start_urls: string[];
}
  • meilisearch_index_uid: Unique identifier for the Meilisearch index
  • meilisearch_url: URL of the Meilisearch server instance
  • meilisearch_api_key: API key for Meilisearch authentication
  • start_urls: Initial URLs to begin crawling from

Crawler Configuration

Crawler Type

{
  crawler_type?: 'cheerio' | 'puppeteer' | 'playwright';
  strategy?: 'docssearch' | 'default' | 'schema' | 'markdown' | 'custom';
}
  • crawler_type (default: "cheerio"): Web scraping engine to use

    • cheerio: Fast HTML parser for static sites
    • puppeteer: Full Chrome automation for dynamic sites
    • playwright: Modern cross-browser automation (beta)
  • strategy (default: "default"): Content extraction strategy

    • default: General-purpose hierarchical content extraction
    • docssearch: DocSearch plugin compatibility
    • schema: Schema.org structured data extraction
    • markdown: Markdown conversion
    • custom: User-defined selectors

URL Control

{
  urls_to_exclude?: string[];
  urls_to_index?: string[];
  urls_to_not_index?: string[];
  use_sitemap?: boolean;
  sitemap_urls?: string[];
}
  • urls_to_exclude: URLs to skip during crawling
  • urls_to_index: Specific URLs to index (overrides start_urls)
  • urls_to_not_index: URLs to exclude from indexing but still crawl
  • use_sitemap (default: false): Whether to use sitemaps for URL discovery
  • sitemap_urls: Optional custom sitemap URLs to use instead of auto-discovery

The crawler will attempt to discover and use sitemaps in the following ways:

  1. If sitemap_urls is provided, it will use those URLs directly
  2. If use_sitemap is true (default), it will attempt to find sitemaps at common locations:
    • /sitemap.xml
    • /sitemap_index.xml
    • /sitemap
    • /robots.txt (will extract Sitemap: directives)
  3. If no sitemap is found or if use_sitemap is false, it will fall back to using start_urls

Example usage:

{
  "urls_to_exclude": ["https://example.com/private"],
  "urls_to_index": ["https://example.com/products"],
  "urls_to_not_index": ["https://example.com/login"],
  "use_sitemap": true,
  "sitemap_urls": [
    "https://example.com/custom-sitemap.xml",
    "https://example.com/playground-sitemap.xml"
  ]
}

Performance Settings

{
  max_concurrency?: number;
  max_requests_per_minute?: number;
  batch_size?: number;
}
  • max_concurrency (default: Infinity): Maximum concurrent requests
  • max_requests_per_minute (default: Infinity): Request rate limit
  • batch_size (default: 1000): Documents per indexing batch

Example usage:

{
  "max_concurrency": 5,
  "max_requests_per_minute": 60,
  "batch_size": 500
}

Meilisearch Configuration

{
  primary_key?: string;
  keep_settings?: boolean;
  meilisearch_settings?: {
    searchableAttributes?: string[];
    filterableAttributes?: string[];
    distinctAttribute?: string;
    sortableAttributes?: string[];
    rankingRules?: string[];
    stopWords?: string[];
    synonyms?: Record<string, string[]>;
    ...
  };
}
  • primary_key: Unique identifier field for documents
  • keep_settings (default: true): Whether to keep existing index settings
  • meilisearch_settings: Custom Meilisearch index settings

Request Configuration

 {
  additional_request_headers?: Record<string, string>;
  user_agents?: string[];
  launch_options?: Record<string, any>;
}
  • additional_request_headers: Custom HTTP headers for requests
  • user_agents: Custom User-Agent strings to rotate (default: [])
  • launch_options: Custom Puppeteer/Playwright launch options (default: null)

Webhook Configuration

{
  webhook_url?: string;
  webhook_payload?: Record<string, any>;
}
  • webhook_url: URL for webhook notifications
  • webhook_payload: Custom data for webhook payloads

Webhook notification types:

  • started: Crawling begins
  • active: Progress updates
  • paused: Crawling paused
  • completed: Successful completion
  • failed: Error occurred

Error Detection

{
  not_found_selectors?: string[];
}
  • not_found_selectors: CSS selectors for identifying not found pages

Strategy-Specific Settings

Schema Strategy

{
  schema_settings?: {
    convert_dates?: boolean;
    only_type?: string;
  };
}
  • schema_settings: Schema-specific settings
    • convert_dates: Convert dates to timestamp format
    • only_type: Limit extraction to specific schema type

Custom Strategy

{
  selectors?: {
    [key: string]: string | string[];
  };
}
  • selectors: CSS selectors for content extraction