Config Reference
qdequele edited this page Nov 30, 2024 · 5 revisions
The configuration file (JSON format) supports the following options:
{
  meilisearch_index_uid: string;
  meilisearch_url: string;
  meilisearch_api_key: string;
  start_urls: string[];
}
- meilisearch_index_uid: Unique identifier for the Meilisearch index
- meilisearch_url: URL of the Meilisearch server instance
- meilisearch_api_key: API key for Meilisearch authentication
- start_urls: Initial URLs to begin crawling from
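Example usage (a minimal sketch; the index name, server URL, API key, and start URL are placeholder values chosen for illustration, not defaults):

```json
{
  "meilisearch_index_uid": "docs",
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "start_urls": ["https://example.com/docs"]
}
```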
{
  crawler_type?: 'cheerio' | 'puppeteer' | 'playwright';
  strategy?: 'docssearch' | 'default' | 'schema' | 'markdown' | 'custom';
}
- crawler_type (default: "cheerio"): Web scraping engine to use
  - cheerio: Fast HTML parser for static sites
  - puppeteer: Full Chrome automation for dynamic sites
  - playwright: Modern cross-browser automation (beta)
- strategy (default: "default"): Content extraction strategy
  - default: General-purpose hierarchical content extraction
  - docssearch: DocSearch plugin compatibility
  - schema: Schema.org structured data extraction
  - markdown: Markdown conversion
  - custom: User-defined selectors
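Example usage (a sketch for a JavaScript-rendered documentation site; this particular pairing of engine and strategy is an illustrative assumption, not a recommendation from the project):

```json
{
  "crawler_type": "puppeteer",
  "strategy": "docssearch"
}
```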
{
  urls_to_exclude?: string[];
  urls_to_index?: string[];
  urls_to_not_index?: string[];
  use_sitemap?: boolean;
  sitemap_urls?: string[];
}
- urls_to_exclude: URLs to skip during crawling
- urls_to_index: Specific URLs to index (overrides start_urls)
- urls_to_not_index: URLs to exclude from indexing but still crawl
- use_sitemap (default: false): Whether to use sitemaps for URL discovery
- sitemap_urls: Optional custom sitemap URLs to use instead of auto-discovery
The crawler will attempt to discover and use sitemaps in the following ways:
- If sitemap_urls is provided, it will use those URLs directly
- If use_sitemap is true, it will attempt to find sitemaps at common locations:
  - /sitemap.xml
  - /sitemap_index.xml
  - /sitemap
  - /robots.txt (will extract Sitemap: directives)
- If no sitemap is found or if use_sitemap is false, it will fall back to using start_urls
Example usage:
{
  "urls_to_exclude": ["https://example.com/private"],
  "urls_to_index": ["https://example.com/products"],
  "urls_to_not_index": ["https://example.com/login"],
  "use_sitemap": true,
  "sitemap_urls": [
    "https://example.com/custom-sitemap.xml",
    "https://example.com/playground-sitemap.xml"
  ]
}
{
  max_concurrency?: number;
  max_requests_per_minute?: number;
  batch_size?: number;
}
- max_concurrency (default: Infinity): Maximum concurrent requests
- max_requests_per_minute (default: Infinity): Request rate limit
- batch_size (default: 1000): Documents per indexing batch
Example usage:
{
  "max_concurrency": 5,
  "max_requests_per_minute": 60,
  "batch_size": 500
}
{
  primary_key?: string;
  keep_settings?: boolean;
  meilisearch_settings?: {
    searchableAttributes?: string[];
    filterableAttributes?: string[];
    distinctAttribute?: string;
    sortableAttributes?: string[];
    rankingRules?: string[];
    stopWords?: string[];
    synonyms?: Record<string, string[]>;
    ...
  };
}
- primary_key: Unique identifier field for documents
- keep_settings (default: true): Whether to keep existing index settings
- meilisearch_settings: Custom Meilisearch index settings
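Example usage (a sketch only; the field names "uid", "title", "content", and "url" are hypothetical and depend on the documents your chosen strategy produces, and setting keep_settings to false here assumes you want the custom settings applied rather than the existing ones kept):

```json
{
  "primary_key": "uid",
  "keep_settings": false,
  "meilisearch_settings": {
    "searchableAttributes": ["title", "content"],
    "filterableAttributes": ["url"],
    "stopWords": ["the", "a", "an"],
    "synonyms": {
      "js": ["javascript"],
      "javascript": ["js"]
    }
  }
}
```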
{
  additional_request_headers?: Record<string, string>;
  user_agents?: string[];
  launch_options?: Record<string, any>;
}
- additional_request_headers: Custom HTTP headers for requests
- user_agents (default: []): Custom User-Agent strings to rotate
- launch_options (default: null): Custom Puppeteer/Playwright launch options
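Example usage (a sketch with placeholder values; the header value and User-Agent string are made up for illustration, and "headless" is shown as one launch option that both Puppeteer and Playwright accept):

```json
{
  "additional_request_headers": {
    "Authorization": "Bearer <your-token>"
  },
  "user_agents": [
    "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"
  ],
  "launch_options": {
    "headless": true
  }
}
```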
{
  webhook_url?: string;
  webhook_payload?: Record<string, any>;
}
- webhook_url: URL for webhook notifications
- webhook_payload: Custom data for webhook payloads
Webhook notification types:
- started: Crawling begins
- active: Progress updates
- paused: Crawling paused
- completed: Successful completion
- failed: Error occurred
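Example usage (a sketch; the endpoint URL and the keys inside webhook_payload are hypothetical, and the shape of the notification body the crawler sends is not shown here because this page does not specify it):

```json
{
  "webhook_url": "https://example.com/hooks/crawler",
  "webhook_payload": {
    "project": "docs-site",
    "environment": "staging"
  }
}
```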
{
  not_found_selectors?: string[];
}
- not_found_selectors: CSS selectors for identifying not found pages
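Example usage (a sketch; both selectors are hypothetical and should be replaced with elements that appear only on your site's own not-found page):

```json
{
  "not_found_selectors": ["h1.error-404", ".page-not-found"]
}
```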
{
  schema_settings?: {
    convert_dates?: boolean;
    only_type?: string;
  };
}
- schema_settings: Schema-specific settings
  - convert_dates: Convert dates to timestamp format
  - only_type: Limit extraction to specific schema type
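Example usage (a sketch; pairing these settings with the "schema" strategy is an assumption based on the strategy descriptions, and "Product" is just one Schema.org type picked for illustration):

```json
{
  "strategy": "schema",
  "schema_settings": {
    "convert_dates": true,
    "only_type": "Product"
  }
}
```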
{
  selectors?: {
    [key: string]: string | string[];
  };
}
- selectors: CSS selectors for content extraction
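Example usage (a sketch; the key names "title" and "content" and the CSS selectors are hypothetical, and combining selectors with the "custom" strategy is an assumption based on that strategy's description as "User-defined selectors"):

```json
{
  "strategy": "custom",
  "selectors": {
    "title": "h1",
    "content": ["article p", "article li"]
  }
}
```

Each key accepts either a single selector string or an array of selectors, matching the string | string[] type above.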