API Reference
qdequele edited this page Nov 9, 2024 · 1 revision
The Scrapix HTTP server provides several endpoints for crawling websites and handling webhooks. The server listens on port 8080 by default, or on the port given in the `PORT` environment variable.
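The port-selection behavior described above can be sketched in a couple of lines (Python is used here purely for illustration; the server itself is Node-based):

```python
import os

def resolve_port(env: dict = os.environ) -> int:
    # Scrapix listens on PORT when the variable is set, otherwise on 8080.
    return int(env.get("PORT", "8080"))
```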
POST /crawl/async
Adds a crawling task to the queue for asynchronous processing.
Request Body:
```jsonc
{
  // Required fields
  "meilisearch_index_uid": "string",  // Unique identifier for the Meilisearch index
  "meilisearch_url": "string",        // URL of the Meilisearch server instance
  "meilisearch_api_key": "string",    // API key for Meilisearch authentication
  "start_urls": ["string"],           // Initial URLs to begin crawling from

  // Optional fields
  "crawler_type": "cheerio" | "puppeteer" | "playwright", // Web scraping engine to use (default: "cheerio")
  "urls_to_exclude": ["string"],      // URLs to skip during crawling
  "urls_to_index": ["string"],        // Specific URLs to index (overrides start_urls)
  "urls_to_not_index": ["string"],    // URLs to exclude from indexing but still crawl
  "additional_request_headers": {     // Custom HTTP headers for requests
    "header_name": "value"
  },
  "primary_key": "string",            // Unique identifier field for documents
  "batch_size": number,               // Documents per indexing batch (default: 1000)
  "strategy": "docssearch" | "default" | "schema" | "markdown" | "custom", // Content extraction strategy
  "schema_settings": {                // Settings for schema-based extraction
    "convert_dates": boolean,         // Convert dates to timestamp format
    "only_type": "string"             // Only extract a specific schema.org type
  },
  "user_agents": ["string"],          // Custom User-Agent strings to rotate
  "webhook_url": "string",            // URL for webhook notifications
  "webhook_payload": object,          // Custom data for webhook payloads
  "selectors": {                      // Custom CSS selectors for content extraction
    "selector_name": "value"
  },
  "max_concurrency": number,          // Maximum concurrent requests (default: Infinity)
  "max_requests_per_minute": number,  // Rate limit for requests (default: Infinity)
  "not_found_selectors": ["string"]   // Selectors indicating 404 pages
}
```
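A minimal request body uses only the four required fields; all values below are placeholders:

```jsonc
{
  "meilisearch_index_uid": "docs",
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "start_urls": ["https://example.com/docs/"]
}
```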
Response:
- Message: "Crawling task queued"
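As a sketch, queueing a crawl from Python might look like the following. The helper name `build_request` is illustrative, the server is assumed to be running locally on its default port 8080, and the endpoint is assumed to accept a JSON-encoded POST body:

```python
import json
import urllib.request

# The four required fields from the request body schema above.
REQUIRED = ("meilisearch_index_uid", "meilisearch_url",
            "meilisearch_api_key", "start_urls")

def build_request(base_url: str, config: dict) -> urllib.request.Request:
    """Build a POST request to /crawl/async, checking required fields first."""
    missing = [field for field in REQUIRED if field not in config]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return urllib.request.Request(
        f"{base_url}/crawl/async",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a running Scrapix server):
# resp = urllib.request.urlopen(build_request("http://localhost:8080", config))
```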
Executes a crawling task synchronously, returning only once the crawl has completed.
Request Body: Same as /crawl/async
Response:
- Message: "Crawling finished"
- Number of documents indexed