Web Crawler API

A high-performance, production-ready web crawling service that extracts structured data from any HTTP/HTTPS page, including JavaScript-heavy Single Page Applications (SPAs). Built with Node.js, Express, and Playwright for reliable, scalable web data extraction.

Overview

This service provides comprehensive web crawling capabilities through a RESTful API, offering both simple GET endpoints for quick data extraction and a powerful POST endpoint for advanced crawling configurations. The system is designed following SOLID principles with enterprise-grade features including security middleware, rate limiting, response size management, and graceful error handling.

Key Features

  • Universal Page Support: Renders JavaScript-heavy SPAs, dynamic content, and traditional websites
  • Structured Data Extraction: Text content, links, images, metadata, JSON-LD, OpenGraph, and more
  • Performance Monitoring: HAR-like network logs, timing metrics, and resource usage analysis
  • Security Compliance: Respects robots.txt, implements rate limiting, and includes security headers
  • Response Size Management: Intelligent truncation with configurable character limits
  • Production Ready: CORS support, graceful shutdown, shared browser instances, and comprehensive error handling

Architecture

  • Runtime: Node.js LTS with ES modules
  • Web Framework: Express.js with security middleware
  • Rendering Engine: Playwright (Chromium) with optional Puppeteer support
  • Package Manager: Yarn for dependency management

Installation and Setup

Prerequisites

  • Node.js LTS (18.x or higher)
  • Yarn package manager
  • Chrome/Chromium browser (automatically installed by Playwright)

Quick Start

# Clone and setup
cd 0001-web-crawler
yarn install

# Start the development server
yarn dev

# Server will be available at http://localhost:3000

Basic Usage Example

# Test the service with a simple POST request
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "mode": "both",
    "opts": {
      "timeoutMs": 20000,
      "maxWaitMs": 8000,
      "idleNetworkMs": 800,
      "captureScreenshot": "viewport",
      "downloadImages": "thumb"
    }
  }' | jq .

API Reference

Core Endpoint

POST /capture

The primary endpoint for advanced web crawling with full configuration control.

Request Body:

{
  "url": "https://example.com",
  "mode": "data" | "meta" | "both",
  "opts": { /* see Configuration Options */ },
  "charLimit": 5000  // Optional response size limit
}

Response:

  • 200 OK: Complete crawl results with data and/or metadata
  • 400 Bad Request: Invalid request parameters
  • 408 Request Timeout: Crawl operation timed out
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server-side processing error
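As a sketch, the request above can also be issued from Node.js. The helper below only builds the fetch options; the endpoint URL and the option names (mode, opts, charLimit) follow the request body shown in this section, and localhost:3000 assumes the default development port:

```javascript
// Build the fetch options for a POST /capture call.
// Option names (mode, opts, charLimit) mirror the request body above.
function buildCaptureRequest(url, { mode = "both", opts = {}, charLimit } = {}) {
  const body = { url, mode, opts };
  if (charLimit !== undefined) body.charLimit = charLimit;
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Usage (assumes the service is running on localhost:3000):
async function capture(url, options) {
  const res = await fetch("http://localhost:3000/capture", buildCaptureRequest(url, options));
  if (!res.ok) throw new Error(`capture failed: ${res.status}`);
  return res.json();
}
```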

Convenience GET Endpoints

Simplified endpoints for common use cases. All endpoints support the charLimit query parameter for response size management.

Data Extraction Endpoints

  • GET /quick?url=<target>

    • Complete crawl with balanced performance settings
    • Returns both data and metadata with optimized defaults
  • GET /data?url=<target>

    • Comprehensive data extraction including text, links, images, and structured data
    • Includes thumbnail generation for images
  • GET /preview?url=<target>

    • Compact summary with title, description, sample content, and key metrics
    • Ideal for generating page previews and cards

Specialized Extraction Endpoints

  • GET /links?url=<target>

    • Extract all page links with metadata (visibility, same-origin status, rel attributes)
  • GET /images?url=<target>

    • Extract all images with thumbnails, dimensions, and content hashes
  • GET /jsonld?url=<target>

    • Extract structured data in JSON-LD format
  • GET /readability?url=<target>

    • Clean article content using Mozilla's Readability algorithm
  • GET /screenshot?url=<target>&type=viewport|full

    • Capture page screenshots in base64 format

Metadata and Analysis Endpoints

  • GET /meta?url=<target>

    • Performance metrics, network analysis, and security information
  • GET /policies?url=<target>

    • Robots.txt content and sitemap discovery
  • GET /llm-summary?url=<target>

    • AI-optimized content summary with key page information

Response Size Management

All endpoints support response size limiting to prevent oversized payloads:

# Limit response to 2000 characters
curl "http://localhost:3000/data?url=https://example.com&charLimit=2000"

# POST endpoint with size limit
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "charLimit": 2000}'

Usage Examples

# Quick comprehensive crawl
curl "http://localhost:3000/quick?url=https://example.com" | jq '{title: .data.title, links: (.data.links | length)}'

# Extract only links with size limit
curl "http://localhost:3000/links?url=https://news.ycombinator.com&charLimit=1500" | jq '.links[0:5]'

# Get page screenshot
curl "http://localhost:3000/screenshot?url=https://example.com&type=viewport" | jq '{hasScreenshot: (.screenshot != null)}'

# Performance analysis
curl "http://localhost:3000/meta?url=https://example.com" | jq '{loadTime: .meta.timings.loadEventMs, requests: (.meta.network | length)}'

# Structured data extraction
curl "http://localhost:3000/jsonld?url=https://example.com" | jq '.jsonld[] | select(.["@type"] == "Organization")'

Configuration Options

The following options can be passed in the opts object for the POST /capture endpoint:

Timing and Performance

  • timeoutMs (25000): Maximum time allowed for the entire crawl operation
  • maxWaitMs (12000): Maximum time to wait for page rendering to settle
  • idleNetworkMs (800): Duration of network inactivity required before considering page loaded
  • settleStrategy ("auto"): Strategy for determining when page is ready ("auto" | "networkIdle" | "domStable" | "customSelector")
  • customSelector (null): CSS selector to wait for when using "customSelector" strategy
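For example, a crawl that should wait for a specific element before extracting content might pass (the selector value here is purely illustrative):

```json
{
  "url": "https://example.com",
  "mode": "data",
  "opts": {
    "settleStrategy": "customSelector",
    "customSelector": "#app .content-loaded",
    "maxWaitMs": 10000
  }
}
```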

Browser Configuration

  • viewport ({width:1366, height:768, deviceScaleFactor:1}): Browser viewport dimensions
  • userAgent ("auto"): User agent string ("auto" | "desktop" | "mobile" | custom string)
  • locale ("en-US"): Browser locale setting
  • timezoneId ("Australia/Sydney"): Timezone for the browser session
  • javascriptEnabled (true): Enable or disable JavaScript execution

Network and Security

  • headers ({}): Custom HTTP headers to send with requests
  • cookies ([]): Array of cookie objects to set before navigation
  • followRedirects (true): Whether to follow HTTP redirects
  • redirectLimit (7): Maximum number of redirects to follow
  • proxy (null): Proxy configuration (e.g., "http://user:pass@host:port")
  • robotsPolicy ("respect"): How to handle robots.txt ("respect" | "ignore")

Content Extraction

  • domDepth ("full"): Level of DOM analysis ("shallow" | "full")
  • downloadImages ("none"): Image processing level ("none" | "thumb" | "all")
  • imageMaxBytes (2000000): Maximum size for individual image downloads
  • captureScreenshot ("none"): Screenshot capture mode ("none" | "viewport" | "full")
  • extractReadability (true): Enable Mozilla Readability content extraction
  • shadowDom (true): Include Shadow DOM content in extraction

Advanced Features

  • infiniteScroll ({enabled:false, maxPages:3, scrollDelayMs:600, stopOnSelector:null}): Infinite scroll handling configuration
  • blockNoise (["analytics","ads"]): Block common tracking and advertising domains
  • retry ({max:2, varyUA:true, toggleJSOnLast:true}): Retry configuration for failed requests
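Putting several of these options together, an opts object for crawling an infinite-scroll feed might look like the following (values are illustrative; the defaults listed above apply to anything omitted):

```json
{
  "infiniteScroll": { "enabled": true, "maxPages": 5, "scrollDelayMs": 800 },
  "blockNoise": ["analytics", "ads"],
  "retry": { "max": 3, "varyUA": true, "toggleJSOnLast": true },
  "downloadImages": "thumb",
  "captureScreenshot": "viewport"
}
```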

Security and Ethics

Robots.txt Compliance

By default, the service respects robots.txt directives. If a target path is disallowed, the service returns a BLOCKED_BY_ROBOTS error. While you can override this with opts.robotsPolicy: "ignore", please consider website policies and ethical crawling practices.

Response Schema

Successful Response Structure

All successful API responses follow a consistent structure:

{
  "version": "2.1.0",
  "url": "https://example.com",
  "finalUrl": "https://example.com",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "mode": "both",
  "agent": {
    "engine": "playwright",
    "ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "locale": "en-US"
  },
  "redirects": [],
  "data": { /* Data payload when mode includes "data" */ },
  "meta": { /* Metadata payload when mode includes "meta" */ },
  "notes": []
}

Data Payload Structure

When mode is "data" or "both", the response includes:

{
  "data": {
    "title": "Page Title",
    "description": "Page description",
    "language": "en",
    "text": {
      "full": "Complete page text content",
      "readability": {
        "title": "Clean article title",
        "byline": "Author information",
        "contentText": "Clean article text",
        "contentHtml": "Clean article HTML"
      }
    },
    "links": [
      {
        "url": "https://example.com/link",
        "text": "Link text",
        "rel": ["nofollow"],
        "visible": true,
        "sameOrigin": false
      }
    ],
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "naturalSize": {"w": 800, "h": 600},
        "bytes": 45678,
        "sha256": "abc123...",
        "thumb": "base64-encoded-thumbnail"
      }
    ],
    "structured": {
      "jsonld": [{"@type": "Organization", "name": "Example"}],
      "opengraph": {"og:title": "Page Title"},
      "twitter": {"twitter:card": "summary"},
      "meta": {"description": "Meta description"},
      "canonical": "https://example.com"
    }
  }
}

Metadata Payload Structure

When mode is "meta" or "both", the response includes:

{
  "meta": {
    "timings": {
      "ttfbMs": 150,
      "domContentLoadedMs": 800,
      "loadEventMs": 1200,
      "firstPaintMs": 600,
      "firstContentfulPaintMs": 650
    },
    "weights": {
      "requests": {"document": 1, "script": 15, "style": 3, "image": 25},
      "bytes": {"document": 50000, "script": 500000, "style": 100000, "image": 2000000}
    },
    "network": [
      {
        "url": "https://example.com",
        "method": "GET",
        "status": 200,
        "mime": "text/html",
        "timing": {},
        "reqHeaders": {},
        "resHeaders": {},
        "size": {"contentLength": 50000},
        "blocked": false
      }
    ],
    "security": {
      "https": true,
      "mixedContent": false,
      "hsts": true,
      "cookies": []
    },
    "screenshot": {
      "type": "viewport",
      "data": "base64-encoded-image-data"
    }
  }
}

Error Handling

Error Response Format

All errors return a consistent error structure:

{
  "error": {
    "code": "NAVIGATION_TIMEOUT",
    "message": "Page failed to load within the specified timeout",
    "hint": "Try increasing the timeoutMs value",
    "details": {
      "timeout": 25000,
      "url": "https://example.com"
    }
  }
}

Error Codes and HTTP Status Mapping

Error Code           HTTP Status   Description
BAD_REQUEST          400           Invalid request parameters or malformed URL
UNSUPPORTED_SCHEME   400           URL scheme not supported (only HTTP/HTTPS allowed)
BLOCKED_BY_ROBOTS    403           Request blocked by robots.txt directives
BLOCKED_BY_SITE      403           Site actively blocking automated requests
NAVIGATION_TIMEOUT   408           Page failed to load within timeout period
REDIRECT_LOOP        409           Infinite redirect loop detected
SIZE_LIMIT           413           Response size exceeds configured limits
RENDER_FAILURE       500           Browser rendering or JavaScript execution failed
UNKNOWN              500           Unexpected error occurred

Common Error Scenarios

  • Bot Detection: Sites may block automated requests. The service includes retry logic with user agent variation
  • Timeout Issues: Large or slow-loading pages may exceed timeout limits. Increase timeoutMs and maxWaitMs values
  • JavaScript Errors: Pages with broken JavaScript may fail to render properly. Try disabling JavaScript with javascriptEnabled: false
  • Network Issues: SSL certificate problems or network connectivity issues result in RENDER_FAILURE
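A client can branch on the error.code field to decide whether retrying is worthwhile. The sketch below is one possible client-side policy built from the error table above, not part of the service itself:

```javascript
// Error codes from the table above that a client may reasonably retry.
const RETRYABLE = new Set(["NAVIGATION_TIMEOUT", "RENDER_FAILURE", "UNKNOWN"]);

function shouldRetry(errorCode) {
  return RETRYABLE.has(errorCode);
}

// Generic retry wrapper: `attempt` is an async function that resolves with a
// result or rejects with an object shaped like the error responses above,
// i.e. { error: { code, message, ... } }.
async function withRetries(attempt, maxRetries = 2) {
  let lastErr;
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastErr = err;
      const code = err && err.error && err.error.code;
      if (!shouldRetry(code)) throw err; // permanent failure: give up immediately
    }
  }
  throw lastErr;
}
```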

Production Deployment

Environment Configuration

Set these environment variables for production deployment:

# Server configuration
PORT=3000
NODE_ENV=production

# Security settings
CORS_ORIGIN=https://yourdomain.com

# Rate limiting
RATE_LIMIT_WINDOW_MS=60000
RATE_LIMIT_MAX=120

Deployment Checklist

  1. Reverse Proxy Setup

    • Use NGINX, Apache, or CloudFront for TLS termination
    • Configure X-Forwarded-* headers for proper client IP detection
    • Enable gzip compression for API responses
  2. Security Configuration

    • Set restrictive CORS origins in production
    • Configure rate limiting based on expected traffic
    • Use HTTPS certificates for secure communication
  3. Resource Management

    • Monitor memory usage (Playwright browsers can be resource-intensive)
    • Configure process managers (PM2, systemd) for automatic restarts
    • Set up log rotation for server logs
  4. Monitoring and Observability

    • Implement health checks using the /health endpoint
    • Monitor response times and error rates
    • Set up alerts for high memory usage or frequent crashes

Performance Optimization

  • Browser Reuse: The service maintains a shared browser instance across requests for optimal performance
  • Response Caching: Implement Redis or similar caching for frequently requested URLs
  • Image Processing: Use CDN for image thumbnails and optimize image processing settings
  • Rate Limiting: Tune rate limits based on server capacity and expected load
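As a minimal illustration of the caching idea (a sketch only; a production deployment would more likely use Redis, as noted above), an in-memory TTL cache keyed by URL could look like this:

```javascript
// Minimal in-memory TTL cache keyed by URL. Entries expire after ttlMs.
class CrawlCache {
  constructor(ttlMs = 60_000) {
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }
  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (Date.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(url); // expired
      return undefined;
    }
    return entry.value;
  }
  set(url, value) {
    this.entries.set(url, { value, storedAt: Date.now() });
  }
}

// Wrap any async crawl function so repeated requests for the same URL
// within the TTL window are served from the cache.
function cached(crawlFn, cache = new CrawlCache()) {
  return async (url) => {
    const hit = cache.get(url);
    if (hit !== undefined) return hit;
    const result = await crawlFn(url);
    cache.set(url, result);
    return result;
  };
}
```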

Testing and Validation

Health Check

Verify the service is running:

curl http://localhost:3000/health

Expected response:

{
  "status": "ok",
  "browser": "connected",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "uptime": 3600,
  "memory": {...},
  "version": "v18.17.0"
}

Integration Testing

Test all endpoints with a sample URL:

# Test each endpoint type
for endpoint in quick data meta preview links images jsonld readability screenshot; do
  echo "Testing /$endpoint"
  curl -s "http://localhost:3000/$endpoint?url=https://example.com" | jq -r 'keys[]'
done

Load Testing

Use tools like Apache Bench or Artillery for load testing:

# Simple load test
ab -n 100 -c 10 "http://localhost:3000/quick?url=https://example.com"
