Web Crawler API

A high-performance, production-ready web crawling service that extracts structured data from any HTTP/HTTPS page, including JavaScript-heavy Single Page Applications (SPAs). Built with Node.js, Express, and Playwright for reliable, scalable web data extraction.

Overview

This service provides comprehensive web crawling capabilities through a RESTful API, offering both simple GET endpoints for quick data extraction and a powerful POST endpoint for advanced crawling configurations. The system is designed following SOLID principles with enterprise-grade features including security middleware, rate limiting, response size management, and graceful error handling.

Key Features

  • Universal Page Support: Renders JavaScript-heavy SPAs, dynamic content, and traditional websites
  • Structured Data Extraction: Text content, links, images, metadata, JSON-LD, OpenGraph, and more
  • Performance Monitoring: HAR-like network logs, timing metrics, and resource usage analysis
  • Security Compliance: Respects robots.txt, implements rate limiting, and includes security headers
  • Response Size Management: Intelligent truncation with configurable character limits
  • Production Ready: CORS support, graceful shutdown, shared browser instances, and comprehensive error handling

Architecture

  • Runtime: Node.js LTS with ES modules
  • Web Framework: Express.js with security middleware
  • Rendering Engine: Playwright (Chromium) with optional Puppeteer support
  • Package Manager: Yarn for dependency management

Installation and Setup

Prerequisites

  • Node.js LTS (18.x or higher)
  • Yarn package manager
  • Chrome/Chromium browser (automatically installed by Playwright)

Quick Start

# Clone and setup
cd 0001-web-crawler
yarn install

# Start the development server
yarn dev

# Server will be available at http://localhost:3000

Basic Usage Example

# Test the service with a simple POST request
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "mode": "both",
    "opts": {
      "timeoutMs": 20000,
      "maxWaitMs": 8000,
      "idleNetworkMs": 800,
      "captureScreenshot": "viewport",
      "downloadImages": "thumb"
    }
  }' | jq .

API Reference

Core Endpoint

POST /capture

The primary endpoint for advanced web crawling with full configuration control.

Request Body:

{
  "url": "https://example.com",
  "mode": "data" | "meta" | "both",
  "opts": { /* see Configuration Options */ },
  "charLimit": 5000  // Optional response size limit
}

Response:

  • 200 OK: Complete crawl results with data and/or metadata
  • 400 Bad Request: Invalid request parameters
  • 408 Request Timeout: Crawl operation timed out
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server-side processing error
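As a sketch, the request above can also be issued from Node.js. The helper below only builds the fetch options; the endpoint URL and the option names (mode, opts, charLimit) follow the request body shown in this section, and localhost:3000 assumes the default development port:

```javascript
// Build the fetch options for a POST /capture call.
// Option names (mode, opts, charLimit) mirror the request body above.
function buildCaptureRequest(url, { mode = "both", opts = {}, charLimit } = {}) {
  const body = { url, mode, opts };
  if (charLimit !== undefined) body.charLimit = charLimit;
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Usage (assumes the service is running on localhost:3000):
async function capture(url, options) {
  const res = await fetch("http://localhost:3000/capture", buildCaptureRequest(url, options));
  if (!res.ok) throw new Error(`capture failed: ${res.status}`);
  return res.json();
}
```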

Convenience GET Endpoints

Simplified endpoints for common use cases. All endpoints support the charLimit query parameter for response size management.

Data Extraction Endpoints

  • GET /quick?url=<target>

    • Complete crawl with balanced performance settings
    • Returns both data and metadata with optimized defaults
  • GET /data?url=<target>

    • Comprehensive data extraction including text, links, images, and structured data
    • Includes thumbnail generation for images
  • GET /preview?url=<target>

    • Compact summary with title, description, sample content, and key metrics
    • Ideal for generating page previews and cards

Specialized Extraction Endpoints

  • GET /links?url=<target>

    • Extract all page links with metadata (visibility, same-origin status, rel attributes)
  • GET /images?url=<target>

    • Extract all images with thumbnails, dimensions, and content hashes
  • GET /jsonld?url=<target>

    • Extract structured data in JSON-LD format
  • GET /readability?url=<target>

    • Clean article content using Mozilla's Readability algorithm
  • GET /screenshot?url=<target>&type=viewport|full

    • Capture page screenshots in base64 format

Metadata and Analysis Endpoints

  • GET /meta?url=<target>

    • Performance metrics, network analysis, and security information
  • GET /policies?url=<target>

    • Robots.txt content and sitemap discovery
  • GET /llm-summary?url=<target>

    • AI-optimized content summary with key page information

Response Size Management

All endpoints support response size limiting to prevent oversized payloads:

# Limit response to 2000 characters
curl "http://localhost:3000/data?url=https://example.com&charLimit=2000"

# POST endpoint with size limit
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "charLimit": 2000}'

Usage Examples

# Quick comprehensive crawl
curl "http://localhost:3000/quick?url=https://example.com" | jq '{title: .data.title, links: (.data.links | length)}'

# Extract only links with size limit
curl "http://localhost:3000/links?url=https://news.ycombinator.com&charLimit=1500" | jq '.links[0:5]'

# Get page screenshot
curl "http://localhost:3000/screenshot?url=https://example.com&type=viewport" | jq '{hasScreenshot: (.screenshot != null)}'

# Performance analysis
curl "http://localhost:3000/meta?url=https://example.com" | jq '{loadTime: .meta.timings.loadEventMs, requests: (.meta.network | length)}'

# Structured data extraction
curl "http://localhost:3000/jsonld?url=https://example.com" | jq '.jsonld[] | select(.["@type"] == "Organization")'

Configuration Options

The following options can be passed in the opts object for the POST /capture endpoint:

Timing and Performance

  • timeoutMs (25000): Maximum time allowed for the entire crawl operation
  • maxWaitMs (12000): Maximum time to wait for page rendering to settle
  • idleNetworkMs (800): Duration of network inactivity required before considering page loaded
  • settleStrategy ("auto"): Strategy for determining when page is ready ("auto" | "networkIdle" | "domStable" | "customSelector")
  • customSelector (null): CSS selector to wait for when using "customSelector" strategy
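For example, a crawl that should wait for a specific element before extracting content might pass (the selector value here is purely illustrative):

```json
{
  "url": "https://example.com",
  "mode": "data",
  "opts": {
    "settleStrategy": "customSelector",
    "customSelector": "#app .content-loaded",
    "maxWaitMs": 10000
  }
}
```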

Browser Configuration

  • viewport ({width:1366, height:768, deviceScaleFactor:1}): Browser viewport dimensions
  • userAgent ("auto"): User agent string ("auto" | "desktop" | "mobile" | custom string)
  • locale ("en-US"): Browser locale setting
  • timezoneId ("Australia/Sydney"): Timezone for the browser session
  • javascriptEnabled (true): Enable or disable JavaScript execution

Network and Security

  • headers ({}): Custom HTTP headers to send with requests
  • cookies ([]): Array of cookie objects to set before navigation
  • followRedirects (true): Whether to follow HTTP redirects
  • redirectLimit (7): Maximum number of redirects to follow
  • proxy (null): Proxy configuration (e.g., "http://user:pass@host:port")
  • robotsPolicy ("respect"): How to handle robots.txt ("respect" | "ignore")

Content Extraction

  • domDepth ("full"): Level of DOM analysis ("shallow" | "full")
  • downloadImages ("none"): Image processing level ("none" | "thumb" | "all")
  • imageMaxBytes (2000000): Maximum size for individual image downloads
  • captureScreenshot ("none"): Screenshot capture mode ("none" | "viewport" | "full")
  • extractReadability (true): Enable Mozilla Readability content extraction
  • shadowDom (true): Include Shadow DOM content in extraction

Advanced Features

  • infiniteScroll ({enabled:false, maxPages:3, scrollDelayMs:600, stopOnSelector:null}): Infinite scroll handling configuration
  • blockNoise (["analytics","ads"]): Block common tracking and advertising domains
  • retry ({max:2, varyUA:true, toggleJSOnLast:true}): Retry configuration for failed requests
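Putting several of these options together, an opts object for crawling an infinite-scroll feed might look like the following (values are illustrative; the defaults listed above apply to anything omitted):

```json
{
  "infiniteScroll": { "enabled": true, "maxPages": 5, "scrollDelayMs": 800 },
  "blockNoise": ["analytics", "ads"],
  "retry": { "max": 3, "varyUA": true, "toggleJSOnLast": true },
  "downloadImages": "thumb",
  "captureScreenshot": "viewport"
}
```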

Security and Ethics

Robots.txt Compliance

By default, the service respects robots.txt directives. If a target path is disallowed, the service returns a BLOCKED_BY_ROBOTS error. While you can override this with opts.robotsPolicy: "ignore", please consider website policies and ethical crawling practices.

Response Schema

Successful Response Structure

All successful API responses follow a consistent structure:

{
  "version": "2.1.0",
  "url": "https://example.com",
  "finalUrl": "https://example.com",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "mode": "both",
  "agent": {
    "engine": "playwright",
    "ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "locale": "en-US"
  },
  "redirects": [],
  "data": { /* Data payload when mode includes "data" */ },
  "meta": { /* Metadata payload when mode includes "meta" */ },
  "notes": []
}

Data Payload Structure

When mode is "data" or "both", the response includes:

{
  "data": {
    "title": "Page Title",
    "description": "Page description",
    "language": "en",
    "text": {
      "full": "Complete page text content",
      "readability": {
        "title": "Clean article title",
        "byline": "Author information",
        "contentText": "Clean article text",
        "contentHtml": "Clean article HTML"
      }
    },
    "links": [
      {
        "url": "https://example.com/link",
        "text": "Link text",
        "rel": ["nofollow"],
        "visible": true,
        "sameOrigin": false
      }
    ],
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "naturalSize": {"w": 800, "h": 600},
        "bytes": 45678,
        "sha256": "abc123...",
        "thumb": "base64-encoded-thumbnail"
      }
    ],
    "structured": {
      "jsonld": [{"@type": "Organization", "name": "Example"}],
      "opengraph": {"og:title": "Page Title"},
      "twitter": {"twitter:card": "summary"},
      "meta": {"description": "Meta description"},
      "canonical": "https://example.com"
    }
  }
}

Metadata Payload Structure

When mode is "meta" or "both", the response includes:

{
  "meta": {
    "timings": {
      "ttfbMs": 150,
      "domContentLoadedMs": 800,
      "loadEventMs": 1200,
      "firstPaintMs": 600,
      "firstContentfulPaintMs": 650
    },
    "weights": {
      "requests": {"document": 1, "script": 15, "style": 3, "image": 25},
      "bytes": {"document": 50000, "script": 500000, "style": 100000, "image": 2000000}
    },
    "network": [
      {
        "url": "https://example.com",
        "method": "GET",
        "status": 200,
        "mime": "text/html",
        "timing": {},
        "reqHeaders": {},
        "resHeaders": {},
        "size": {"contentLength": 50000},
        "blocked": false
      }
    ],
    "security": {
      "https": true,
      "mixedContent": false,
      "hsts": true,
      "cookies": []
    },
    "screenshot": {
      "type": "viewport",
      "data": "base64-encoded-image-data"
    }
  }
}

Error Handling

Error Response Format

All errors return a consistent error structure:

{
  "error": {
    "code": "NAVIGATION_TIMEOUT",
    "message": "Page failed to load within the specified timeout",
    "hint": "Try increasing the timeoutMs value",
    "details": {
      "timeout": 25000,
      "url": "https://example.com"
    }
  }
}

Error Codes and HTTP Status Mapping

Error Code           HTTP Status   Description
BAD_REQUEST          400           Invalid request parameters or malformed URL
UNSUPPORTED_SCHEME   400           URL scheme not supported (only HTTP/HTTPS allowed)
BLOCKED_BY_ROBOTS    403           Request blocked by robots.txt directives
BLOCKED_BY_SITE      403           Site actively blocking automated requests
NAVIGATION_TIMEOUT   408           Page failed to load within timeout period
REDIRECT_LOOP        409           Infinite redirect loop detected
SIZE_LIMIT           413           Response size exceeds configured limits
RENDER_FAILURE       500           Browser rendering or JavaScript execution failed
UNKNOWN              500           Unexpected error occurred

Common Error Scenarios

  • Bot Detection: Sites may block automated requests. The service includes retry logic with user agent variation
  • Timeout Issues: Large or slow-loading pages may exceed timeout limits. Increase timeoutMs and maxWaitMs values
  • JavaScript Errors: Pages with broken JavaScript may fail to render properly. Try disabling JavaScript with javascriptEnabled: false
  • Network Issues: SSL certificate problems or network connectivity issues result in RENDER_FAILURE
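A client can branch on the error.code field to decide whether retrying is worthwhile. The sketch below is one possible client-side policy built from the error table above, not part of the service itself:

```javascript
// Error codes from the table above that a client may reasonably retry.
const RETRYABLE = new Set(["NAVIGATION_TIMEOUT", "RENDER_FAILURE", "UNKNOWN"]);

function shouldRetry(errorCode) {
  return RETRYABLE.has(errorCode);
}

// Generic retry wrapper: `attempt` is an async function that resolves with a
// result or rejects with an object shaped like the error responses above,
// i.e. { error: { code, message, ... } }.
async function withRetries(attempt, maxRetries = 2) {
  let lastErr;
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastErr = err;
      const code = err && err.error && err.error.code;
      if (!shouldRetry(code)) throw err; // permanent failure: give up immediately
    }
  }
  throw lastErr;
}
```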

Production Deployment

Environment Configuration

Set these environment variables for production deployment:

# Server configuration
PORT=3000
NODE_ENV=production

# Security settings
CORS_ORIGIN=https://yourdomain.com

# Rate limiting
RATE_LIMIT_WINDOW_MS=60000
RATE_LIMIT_MAX=120

Deployment Checklist

  1. Reverse Proxy Setup

    • Use NGINX, Apache, or CloudFront for TLS termination
    • Configure X-Forwarded-* headers for proper client IP detection
    • Enable gzip compression for API responses
  2. Security Configuration

    • Set restrictive CORS origins in production
    • Configure rate limiting based on expected traffic
    • Use HTTPS certificates for secure communication
  3. Resource Management

    • Monitor memory usage (Playwright browsers can be resource-intensive)
    • Configure process managers (PM2, systemd) for automatic restarts
    • Set up log rotation for server logs
  4. Monitoring and Observability

    • Implement health checks using the /health endpoint
    • Monitor response times and error rates
    • Set up alerts for high memory usage or frequent crashes

Performance Optimization

  • Browser Reuse: The service maintains a shared browser instance across requests for optimal performance
  • Response Caching: Implement Redis or similar caching for frequently requested URLs
  • Image Processing: Use CDN for image thumbnails and optimize image processing settings
  • Rate Limiting: Tune rate limits based on server capacity and expected load
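As a minimal illustration of the caching idea (a sketch only; a production deployment would more likely use Redis, as noted above), an in-memory TTL cache keyed by URL could look like this:

```javascript
// Minimal in-memory TTL cache keyed by URL. Entries expire after ttlMs.
class CrawlCache {
  constructor(ttlMs = 60_000) {
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }
  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (Date.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(url); // expired
      return undefined;
    }
    return entry.value;
  }
  set(url, value) {
    this.entries.set(url, { value, storedAt: Date.now() });
  }
}

// Wrap any async crawl function so repeated requests for the same URL
// within the TTL window are served from the cache.
function cached(crawlFn, cache = new CrawlCache()) {
  return async (url) => {
    const hit = cache.get(url);
    if (hit !== undefined) return hit;
    const result = await crawlFn(url);
    cache.set(url, result);
    return result;
  };
}
```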

Testing and Validation

Health Check

Verify the service is running:

curl http://localhost:3000/health

Expected response:

{
  "status": "ok",
  "browser": "connected",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "uptime": 3600,
  "memory": {...},
  "version": "v18.17.0"
}

Integration Testing

Test all endpoints with a sample URL:

# Test each endpoint type
for endpoint in quick data meta preview links images jsonld readability screenshot; do
  echo "Testing /$endpoint"
  curl -s "http://localhost:3000/$endpoint?url=https://example.com" | jq -r 'keys[]'
done

Load Testing

Use tools like Apache Bench or Artillery for load testing:

# Simple load test
ab -n 100 -c 10 "http://localhost:3000/quick?url=https://example.com"
