A high-performance, production-ready web crawling service that extracts structured data from any HTTP/HTTPS page, including JavaScript-heavy Single Page Applications (SPAs). Built with Node.js, Express, and Playwright for reliable, scalable web data extraction.
This service provides comprehensive web crawling capabilities through a RESTful API, offering both simple GET endpoints for quick data extraction and a powerful POST endpoint for advanced crawling configurations. The system is designed following SOLID principles with enterprise-grade features including security middleware, rate limiting, response size management, and graceful error handling.
- Universal Page Support: Renders JavaScript-heavy SPAs, dynamic content, and traditional websites
- Structured Data Extraction: Text content, links, images, metadata, JSON-LD, OpenGraph, and more
- Performance Monitoring: HAR-like network logs, timing metrics, and resource usage analysis
- Security Compliance: Respects robots.txt, implements rate limiting, and includes security headers
- Response Size Management: Intelligent truncation with configurable character limits
- Production Ready: CORS support, graceful shutdown, shared browser instances, and comprehensive error handling
- Runtime: Node.js LTS with ES modules
- Web Framework: Express.js with security middleware
- Rendering Engine: Playwright (Chromium) with optional Puppeteer support
- Package Manager: Yarn for dependency management
- Node.js LTS (18.x or higher)
- Yarn package manager
- Chrome/Chromium browser (automatically installed by Playwright)
```bash
# Clone and setup
cd 0001-web-crawler
yarn install

# Start the development server
yarn dev

# Server will be available at http://localhost:3000
```

Test the service with a simple POST request:

```bash
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "mode": "both",
    "opts": {
      "timeoutMs": 20000,
      "maxWaitMs": 8000,
      "idleNetworkMs": 800,
      "captureScreenshot": "viewport",
      "downloadImages": "thumb"
    }
  }' | jq .
```

POST /capture
The primary endpoint for advanced web crawling with full configuration control.
Request Body:

```jsonc
{
  "url": "https://example.com",
  "mode": "data" | "meta" | "both",
  "opts": { /* see Configuration Options */ },
  "charLimit": 5000 // Optional response size limit
}
```

Response:

- 200 OK: Complete crawl results with data and/or metadata
- 400 Bad Request: Invalid request parameters
- 408 Request Timeout: Crawl operation timed out
- 429 Too Many Requests: Rate limit exceeded
- 500 Internal Server Error: Server-side processing error
Simplified endpoints for common use cases. All endpoints support the charLimit query parameter for response size management.
- `GET /quick?url=<target>`
  - Complete crawl with balanced performance settings
  - Returns both data and metadata with optimized defaults
- `GET /data?url=<target>`
  - Comprehensive data extraction including text, links, images, and structured data
  - Includes thumbnail generation for images
- `GET /preview?url=<target>`
  - Compact summary with title, description, sample content, and key metrics
  - Ideal for generating page previews and cards
- `GET /links?url=<target>`
  - Extract all page links with metadata (visibility, same-origin status, rel attributes)
- `GET /images?url=<target>`
  - Extract all images with thumbnails, dimensions, and content hashes
- `GET /jsonld?url=<target>`
  - Extract structured data in JSON-LD format
- `GET /readability?url=<target>`
  - Clean article content using Mozilla's Readability algorithm
- `GET /screenshot?url=<target>&type=viewport|full`
  - Capture page screenshots in base64 format
- `GET /meta?url=<target>`
  - Performance metrics, network analysis, and security information
- `GET /policies?url=<target>`
  - Robots.txt content and sitemap discovery
- `GET /llm-summary?url=<target>`
  - AI-optimized content summary with key page information
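For programmatic access, the GET endpoints above can be wrapped in a small URL-building helper. A minimal sketch, assuming the default local port; the `buildEndpointUrl` helper itself is illustrative and not part of the service:

```javascript
// Build a request URL for one of the GET convenience endpoints.
// `endpoint` is one of: quick, data, preview, links, images, jsonld,
// readability, screenshot, meta, policies, llm-summary.
function buildEndpointUrl(endpoint, targetUrl, { charLimit, type } = {}, baseUrl = "http://localhost:3000") {
  const u = new URL(`/${endpoint}`, baseUrl);
  u.searchParams.set("url", targetUrl); // target URL is percent-encoded automatically
  if (charLimit) u.searchParams.set("charLimit", String(charLimit));
  if (type) u.searchParams.set("type", type); // only used by /screenshot
  return u.toString();
}

console.log(buildEndpointUrl("links", "https://example.com", { charLimit: 1500 }));
// → "http://localhost:3000/links?url=https%3A%2F%2Fexample.com&charLimit=1500"
```

Using `URL` and `searchParams` avoids hand-rolled percent-encoding of the target URL.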
All endpoints support response size limiting to prevent oversized payloads:
```bash
# Limit response to 2000 characters
curl "http://localhost:3000/data?url=https://example.com&charLimit=2000"

# POST endpoint with size limit
curl -X POST http://localhost:3000/capture \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "charLimit": 2000}'
```

More usage examples:

```bash
# Quick comprehensive crawl
curl "http://localhost:3000/quick?url=https://example.com" | jq '{title: .data.title, links: (.data.links | length)}'

# Extract only links with size limit
curl "http://localhost:3000/links?url=https://news.ycombinator.com&charLimit=1500" | jq '.links[0:5]'

# Get page screenshot
curl "http://localhost:3000/screenshot?url=https://example.com&type=viewport" | jq '{hasScreenshot: (.screenshot != null)}'

# Performance analysis
curl "http://localhost:3000/meta?url=https://example.com" | jq '{loadTime: .meta.timings.loadEventMs, requests: (.meta.network | length)}'

# Structured data extraction
curl "http://localhost:3000/jsonld?url=https://example.com" | jq '.jsonld[] | select(.["@type"] == "Organization")'
```

The following options can be passed in the `opts` object for the POST /capture endpoint:
- timeoutMs (25000): Maximum time allowed for the entire crawl operation
- maxWaitMs (12000): Maximum time to wait for page rendering to settle
- idleNetworkMs (800): Duration of network inactivity required before considering page loaded
- settleStrategy ("auto"): Strategy for determining when page is ready ("auto" | "networkIdle" | "domStable" | "customSelector")
- customSelector (null): CSS selector to wait for when using "customSelector" strategy
- viewport ({width:1366, height:768, deviceScaleFactor:1}): Browser viewport dimensions
- userAgent ("auto"): User agent string ("auto" | "desktop" | "mobile" | custom string)
- locale ("en-US"): Browser locale setting
- timezoneId ("Australia/Sydney"): Timezone for the browser session
- javascriptEnabled (true): Enable or disable JavaScript execution
- headers ({}): Custom HTTP headers to send with requests
- cookies ([]): Array of cookie objects to set before navigation
- followRedirects (true): Whether to follow HTTP redirects
- redirectLimit (7): Maximum number of redirects to follow
- proxy (null): Proxy configuration (e.g., "http://user:pass@host:port")
- robotsPolicy ("respect"): How to handle robots.txt ("respect" | "ignore")
- domDepth ("full"): Level of DOM analysis ("shallow" | "full")
- downloadImages ("none"): Image processing level ("none" | "thumb" | "all")
- imageMaxBytes (2000000): Maximum size for individual image downloads
- captureScreenshot ("none"): Screenshot capture mode ("none" | "viewport" | "full")
- extractReadability (true): Enable Mozilla Readability content extraction
- shadowDom (true): Include Shadow DOM content in extraction
- infiniteScroll ({enabled:false, maxPages:3, scrollDelayMs:600, stopOnSelector:null}): Infinite scroll handling configuration
- blockNoise (["analytics","ads"]): Block common tracking and advertising domains
- retry ({max:2, varyUA:true, toggleJSOnLast:true}): Retry configuration for failed requests
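Only the options you want to override need to be sent; everything else falls back to the defaults above. A sketch of how a client might merge its overrides onto those defaults before calling POST /capture; `DEFAULT_OPTS` simply restates a subset of the documented defaults, and the `buildOpts` helper is illustrative, not part of the service:

```javascript
// A subset of the documented defaults for POST /capture opts.
const DEFAULT_OPTS = {
  timeoutMs: 25000,
  maxWaitMs: 12000,
  idleNetworkMs: 800,
  settleStrategy: "auto",
  viewport: { width: 1366, height: 768, deviceScaleFactor: 1 },
  robotsPolicy: "respect",
  downloadImages: "none",
  captureScreenshot: "none",
};

// Shallow-merge caller overrides onto the defaults.
function buildOpts(overrides = {}) {
  return { ...DEFAULT_OPTS, ...overrides };
}

const opts = buildOpts({ captureScreenshot: "viewport", downloadImages: "thumb" });
console.log(opts.captureScreenshot); // "viewport"
console.log(opts.timeoutMs);         // 25000 (default preserved)
```

Note that the merge is shallow: overriding `viewport` replaces the whole object, not individual fields.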
By default, the service respects robots.txt directives. If a target path is disallowed, the service returns a `BLOCKED_BY_ROBOTS` error. You can override this with `opts.robotsPolicy: "ignore"`, but please consider website policies and ethical crawling practices before doing so.
All successful API responses follow a consistent structure:

```jsonc
{
  "version": "2.1.0",
  "url": "https://example.com",
  "finalUrl": "https://example.com",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "mode": "both",
  "agent": {
    "engine": "playwright",
    "ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "locale": "en-US"
  },
  "redirects": [],
  "data": { /* Data payload when mode includes "data" */ },
  "meta": { /* Metadata payload when mode includes "meta" */ },
  "notes": []
}
```

When mode is "data" or "both", the response includes:
```json
{
  "data": {
    "title": "Page Title",
    "description": "Page description",
    "language": "en",
    "text": {
      "full": "Complete page text content",
      "readability": {
        "title": "Clean article title",
        "byline": "Author information",
        "contentText": "Clean article text",
        "contentHtml": "Clean article HTML"
      }
    },
    "links": [
      {
        "url": "https://example.com/link",
        "text": "Link text",
        "rel": ["nofollow"],
        "visible": true,
        "sameOrigin": false
      }
    ],
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "alt": "Image description",
        "naturalSize": {"w": 800, "h": 600},
        "bytes": 45678,
        "sha256": "abc123...",
        "thumb": "base64-encoded-thumbnail"
      }
    ],
    "structured": {
      "jsonld": [{"@type": "Organization", "name": "Example"}],
      "opengraph": {"og:title": "Page Title"},
      "twitter": {"twitter:card": "summary"},
      "meta": {"description": "Meta description"},
      "canonical": "https://example.com"
    }
  }
}
```

When mode is "meta" or "both", the response includes:
```json
{
  "meta": {
    "timings": {
      "ttfbMs": 150,
      "domContentLoadedMs": 800,
      "loadEventMs": 1200,
      "firstPaintMs": 600,
      "firstContentfulPaintMs": 650
    },
    "weights": {
      "requests": {"document": 1, "script": 15, "style": 3, "image": 25},
      "bytes": {"document": 50000, "script": 500000, "style": 100000, "image": 2000000}
    },
    "network": [
      {
        "url": "https://example.com",
        "method": "GET",
        "status": 200,
        "mime": "text/html",
        "timing": {},
        "reqHeaders": {},
        "resHeaders": {},
        "size": {"contentLength": 50000},
        "blocked": false
      }
    ],
    "security": {
      "https": true,
      "mixedContent": false,
      "hsts": true,
      "cookies": []
    },
    "screenshot": {
      "type": "viewport",
      "data": "base64-encoded-image-data"
    }
  }
}
```

All errors return a consistent error structure:
```json
{
  "error": {
    "code": "NAVIGATION_TIMEOUT",
    "message": "Page failed to load within the specified timeout",
    "hint": "Try increasing the timeoutMs value",
    "details": {
      "timeout": 25000,
      "url": "https://example.com"
    }
  }
}
```

| Error Code | HTTP Status | Description |
|---|---|---|
| `BAD_REQUEST` | 400 | Invalid request parameters or malformed URL |
| `UNSUPPORTED_SCHEME` | 400 | URL scheme not supported (only HTTP/HTTPS allowed) |
| `BLOCKED_BY_ROBOTS` | 403 | Request blocked by robots.txt directives |
| `BLOCKED_BY_SITE` | 403 | Site actively blocking automated requests |
| `NAVIGATION_TIMEOUT` | 408 | Page failed to load within timeout period |
| `REDIRECT_LOOP` | 409 | Infinite redirect loop detected |
| `RENDER_FAILURE` | 500 | Browser rendering or JavaScript execution failed |
| `SIZE_LIMIT` | 413 | Response size exceeds configured limits |
| `UNKNOWN` | 500 | Unexpected error occurred |
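Because every error carries a machine-readable `error.code`, clients can branch on the code rather than parsing messages. A minimal sketch; the `isRetryable` helper and its choice of retryable codes are illustrative, not part of the service:

```javascript
// Error codes for which retrying (possibly with adjusted options) can make sense.
const RETRYABLE = new Set(["NAVIGATION_TIMEOUT", "BLOCKED_BY_SITE", "RENDER_FAILURE", "UNKNOWN"]);

// Decide how to react to an error body of the documented shape.
function isRetryable(errorBody) {
  return RETRYABLE.has(errorBody?.error?.code);
}

console.log(isRetryable({ error: { code: "NAVIGATION_TIMEOUT" } })); // true
console.log(isRetryable({ error: { code: "BLOCKED_BY_ROBOTS" } }));  // false
```

Codes like `BAD_REQUEST` or `BLOCKED_BY_ROBOTS` indicate problems a retry will not fix, so they are deliberately excluded.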
- Bot Detection: Sites may block automated requests. The service includes retry logic with user agent variation.
- Timeout Issues: Large or slow-loading pages may exceed timeout limits. Increase the `timeoutMs` and `maxWaitMs` values.
- JavaScript Errors: Pages with broken JavaScript may fail to render properly. Try disabling JavaScript with `javascriptEnabled: false`.
- Network Issues: SSL certificate problems or network connectivity issues result in a `RENDER_FAILURE` error.
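The timeout advice above can be automated by retrying with progressively larger time budgets. A sketch, assuming a `crawl(body)` function that POSTs to /capture and rejects with the documented error body; the function name and the doubling strategy are illustrative:

```javascript
// Retry a crawl, doubling timeoutMs/maxWaitMs after each NAVIGATION_TIMEOUT.
async function crawlWithBackoff(crawl, url, attempts = 3) {
  let timeoutMs = 25000; // start from the documented defaults
  let maxWaitMs = 12000;
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await crawl({ url, mode: "both", opts: { timeoutMs, maxWaitMs } });
    } catch (err) {
      lastErr = err;
      // Only timeouts benefit from a bigger budget; rethrow everything else.
      if (err?.error?.code !== "NAVIGATION_TIMEOUT") throw err;
      timeoutMs *= 2;
      maxWaitMs *= 2;
    }
  }
  throw lastErr;
}
```

Non-timeout errors (robots blocks, bad requests) are rethrown immediately, since a larger budget cannot fix them.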
Set these environment variables for production deployment:
```bash
# Server configuration
PORT=3000
NODE_ENV=production

# Security settings
CORS_ORIGIN=https://yourdomain.com

# Rate limiting
RATE_LIMIT_WINDOW_MS=60000
RATE_LIMIT_MAX=120
```

- Reverse Proxy Setup
  - Use NGINX, Apache, or CloudFront for TLS termination
  - Configure `X-Forwarded-*` headers for proper client IP detection
  - Enable gzip compression for API responses
- Security Configuration
  - Set restrictive CORS origins in production
  - Configure rate limiting based on expected traffic
  - Use HTTPS certificates for secure communication
- Resource Management
  - Monitor memory usage (Playwright browsers can be resource-intensive)
  - Configure process managers (PM2, systemd) for automatic restarts
  - Set up log rotation for server logs
- Monitoring and Observability
  - Implement health checks using the `/health` endpoint
  - Monitor response times and error rates
  - Set up alerts for high memory usage or frequent crashes
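For the reverse-proxy step, one possible NGINX server block looks like the following sketch; the domain, certificate paths, and upstream port are placeholders to adjust for your deployment:

```nginx
server {
    listen 443 ssl;
    server_name crawler.yourdomain.com;

    ssl_certificate     /etc/ssl/certs/crawler.pem;
    ssl_certificate_key /etc/ssl/private/crawler.key;

    # Compress JSON API responses
    gzip on;
    gzip_types application/json;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```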
- Browser Reuse: The service maintains a shared browser instance across requests for optimal performance
- Response Caching: Implement Redis or similar caching for frequently requested URLs
- Image Processing: Use CDN for image thumbnails and optimize image processing settings
- Rate Limiting: Tune rate limits based on server capacity and expected load
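The response-caching suggestion can be prototyped in-process before introducing Redis. A sketch of a small TTL cache keyed by target URL; the class name and TTL value are illustrative, not part of the service:

```javascript
// Minimal TTL cache for crawl responses, keyed by target URL.
class TtlCache {
  constructor(ttlMs = 60000) {
    this.ttlMs = ttlMs;
    this.store = new Map();
  }
  get(key) {
    const hit = this.store.get(key);
    if (!hit) return undefined;
    if (Date.now() - hit.at > this.ttlMs) { // entry expired
      this.store.delete(key);
      return undefined;
    }
    return hit.value;
  }
  set(key, value) {
    this.store.set(key, { value, at: Date.now() });
  }
}

const cache = new TtlCache(5 * 60 * 1000); // 5-minute TTL
cache.set("https://example.com", { title: "Example Domain" });
console.log(cache.get("https://example.com").title); // "Example Domain"
```

Swapping the `Map` for a Redis client with `SET ... EX` would give the same semantics across multiple server processes.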
Verify the service is running:
```bash
curl http://localhost:3000/health
```

Expected response:

```jsonc
{
  "status": "ok",
  "browser": "connected",
  "timestamp": "2025-09-13T08:44:02.090Z",
  "uptime": 3600,
  "memory": {...},
  "version": "v18.17.0"
}
```

Test all endpoints with a sample URL:
```bash
# Test each endpoint type
for endpoint in quick data meta preview links images jsonld readability screenshot; do
  echo "Testing /$endpoint"
  curl -s "http://localhost:3000/$endpoint?url=https://example.com" | jq -r 'keys[]'
done
```

Use tools like Apache Bench or Artillery for load testing:

```bash
# Simple load test
ab -n 100 -c 10 "http://localhost:3000/quick?url=https://example.com"
```