API Reference

qdequele edited this page Nov 9, 2024 · 1 revision

The Scrapix HTTP server provides several endpoints for crawling websites and handling webhooks. The server listens on port 8080 by default, or on the port set in the PORT environment variable.

Endpoints

POST /crawl/async

Adds a crawling task to the queue for asynchronous processing.

Request Body:

{
  // Required fields
  "meilisearch_index_uid": "string", // Unique identifier for the Meilisearch index
  "meilisearch_url": "string",       // URL of the Meilisearch server instance
  "meilisearch_api_key": "string",   // API key for Meilisearch authentication
  "start_urls": ["string"],          // Initial URLs to begin crawling from

  // Optional fields
  "crawler_type": "cheerio" | "puppeteer" | "playwright", // Web scraping engine to use (default: "cheerio")
  "urls_to_exclude": ["string"],     // URLs to skip during crawling
  "urls_to_index": ["string"],       // Specific URLs to index (overrides start_urls)
  "urls_to_not_index": ["string"],   // URLs to exclude from indexing but still crawl
  
  "additional_request_headers": {     // Custom HTTP headers for requests
    "header_name": "value"
  },
  
  "primary_key": "string",           // Unique identifier field for documents
  "batch_size": number,              // Documents per indexing batch (default: 1000)
  
  "strategy": "docssearch" | "default" | "schema" | "markdown" | "custom", // Content extraction strategy
  
  "schema_settings": {               // Settings for schema-based extraction
    "convert_dates": boolean,        // Convert dates to timestamp format
    "only_type": "string"            // Only extract a specific schema.org type
  },
  
  "user_agents": ["string"],         // Custom User-Agent strings to rotate
  
  "webhook_url": "string",           // URL for webhook notifications
  "webhook_payload": object,         // Custom data for webhook payloads
  
  "selectors": {                     // Custom CSS selectors for content extraction
    "selector_name": "value"
  },
  
  "max_concurrency": number,         // Maximum concurrent requests (default: Infinity)
  "max_requests_per_minute": number, // Rate limit for requests (default: Infinity)
  "not_found_selectors": ["string"]  // Selectors indicating 404 pages
}

Response:

  • Message: "Crawling task queued"
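As a sketch of how a client might call this endpoint, the helper below builds the request body from the required fields plus any optional ones, and posts it to `/crawl/async` with Python's standard library. The function names and the default server URL are illustrative, not part of the Scrapix API.

```python
import json
import urllib.request


def build_crawl_payload(index_uid, meili_url, api_key, start_urls, **options):
    """Assemble the JSON body for a Scrapix crawl request.

    The four positional arguments map to the required fields; any
    optional field (crawler_type, batch_size, strategy, ...) can be
    passed as a keyword argument.
    """
    payload = {
        "meilisearch_index_uid": index_uid,
        "meilisearch_url": meili_url,
        "meilisearch_api_key": api_key,
        "start_urls": start_urls,
    }
    payload.update(options)
    return payload


def queue_crawl(server="http://localhost:8080", **fields):
    """POST the payload to /crawl/async and return the parsed response."""
    body = json.dumps(build_crawl_payload(**fields)).encode()
    req = urllib.request.Request(
        f"{server}/crawl/async",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

On success the server responds with the message "Crawling task queued" and processes the crawl in the background.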

POST /crawl/sync

Executes a crawling task synchronously and waits for completion.

Request Body: Same as /crawl/async

Response:

  • Message: "Crawling finished"
  • The number of documents indexed
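Because the synchronous endpoint blocks until the crawl completes, a client should send the same body as for `/crawl/async` but allow a generous timeout. The sketch below only constructs the request, so the sending step (which needs a running server) is shown separately; the server URL and timeout value are placeholders.

```python
import json
import urllib.request


def make_sync_request(payload, server="http://localhost:8080"):
    """Build, but do not send, the POST request for /crawl/sync.

    `payload` is the same JSON body accepted by /crawl/async.
    """
    return urllib.request.Request(
        f"{server}/crawl/sync",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending (requires a running Scrapix server; may take as long as the crawl):
# with urllib.request.urlopen(make_sync_request(payload), timeout=3600) as resp:
#     result = json.load(resp)  # message plus the number of documents indexed
```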