
Web Scraper All in One Insights MCP Server

A Model Context Protocol (MCP) server that opens websites and captures all HTTP requests those sites make. It is well suited for analyzing web traffic patterns, tracking third-party integrations, and understanding the network behavior of websites.

Features

  • Complete Request Capture: Captures all HTTP requests made by a website including XHR, fetch, and resource requests
  • Detailed Analysis: Provides comprehensive data including:
    • Request URLs, methods, headers
    • Response status codes and headers
    • Resource types (document, script, stylesheet, image, etc.)
    • Timestamps for timing analysis
    • Domain analysis showing all external services
  • Configurable Options:
    • Adjustable wait times for dynamic content
    • Option to include/exclude image and media requests
    • Custom viewport and user agent settings
  • Built with Patchright: Uses Patchright (enhanced Playwright) for reliable browser automation
  • Bun Runtime: Built to run on the Bun JavaScript runtime
  • Dual Transport Support: Supports both STDIO (for local/CLI integration) and HTTP (for remote/network access)

Requirements

  • Bun runtime
  • Google Chrome installed
  • Git

Installation

  1. Clone and Install Dependencies

    git clone https://github.com/xilistudios/mcp-scraper-aio.git
    cd mcp-scraper-aio
    bun install
  2. Test the Server

    # Run the server in STDIO mode
    bun src/index.ts
    
    # Run the server in HTTP mode (listens on http://127.0.0.1:8080)
    bun src/index.ts --http
    
    # Run the server in HTTP mode on a custom port (e.g., 3000)
    bun src/index.ts --http --port 3000
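
To confirm the server actually speaks MCP before wiring it into a client, a short script can list the advertised tools. This is a minimal sketch using the client classes from @modelcontextprotocol/sdk (the SDK this server is built on, installed by bun install); the file name smoke-test.ts is just an example:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over STDIO, the same way "bun src/index.ts" runs it.
const transport = new StdioClientTransport({ command: "bun", args: ["src/index.ts"] });
const client = new Client({ name: "smoke-test", version: "1.0.0" });

await client.connect(transport);
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name)); // expect analyze_website_requests, fetch, etc.
await client.close();

Run it from the repository root with bun smoke-test.ts.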

Docker

The project includes a docker-compose.yml file for convenient deployment using Docker Compose. It builds the Bun-native image (bundling an undetected Chromium from patchright-nodejs), mounts the ./logs directory for persistence, and runs the MCP server in HTTP mode by default (accessible at http://localhost:3031).

Quick Start

  1. Build the image:

    docker compose build
  2. Run the server:

    docker compose up

    For detached mode:

    docker compose up -d
  3. Stop the server:

    docker compose down
  4. Update your mcp.json:

{
  "mcpServers": {
    "mcp-scraper-aio": {
      "url": "http://localhost:3031/sse"
    }
  }
}
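
Before pointing a client at the container, you can check that it is serving (the URL matches the config above; -N keeps curl from buffering the event stream, press Ctrl-C to exit):

curl -N http://localhost:3031/sse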

Run in STDIO Mode

To run in STDIO mode for local/CLI integration (keeps STDIN/STDOUT attached for MCP clients):

docker compose run --rm mcp bun src/index.ts

Append flags like --verbose as needed (e.g., bun src/index.ts --verbose).
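
For clients that spawn the server themselves, the same invocation can go directly into an mcp.json-style config. A sketch, assuming a checkout at /path/to/mcp-scraper-aio (adjust the compose file path; some clients may also need TTY allocation disabled with -T):

{
  "mcpServers": {
    "mcp-scraper-aio-stdio": {
      "command": "docker",
      "args": [
        "compose", "-f", "/path/to/mcp-scraper-aio/docker-compose.yml",
        "run", "--rm", "mcp", "bun", "src/index.ts"
      ]
    }
  }
}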

Run MCP Self-Test

To verify the containerized build responds to tool invocations:

docker compose run --rm mcp bun run mcp-test

Custom Configuration

  • Logs: ./logs is mounted to /app/logs to persist logs/server.log between runs.

  • Environment Variables: The image sets PLAYWRIGHT_BROWSERS_PATH and PATCHRIGHT_BROWSER_PATH to /ms-playwright. Override them inline:

    docker compose run --rm \
      -e PLAYWRIGHT_BROWSERS_PATH=/custom/playwright \
      -e PATCHRIGHT_BROWSER_PATH=/custom/playwright \
      mcp bun src/index.ts
  • Custom Port: To use a different port (e.g., 3000):

    1. Update docker-compose.yml:
      • Change ports: - "3000:3000"
      • Update command: ["bun", "src/index.ts", "--http", "--port", "3000"]
    2. Run docker compose up.

The server will log the listening address (e.g., http://127.0.0.1:3000).
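
If you would rather not edit the tracked file, Docker Compose also merges a docker-compose.override.yml automatically. A sketch of the same port change (the mcp service name matches the compose commands used throughout this README):

# docker-compose.override.yml
services:
  mcp:
    command: ["bun", "src/index.ts", "--http", "--port", "3000"]
    ports:
      - "3000:3000"

Note that Compose appends, rather than replaces, port mappings from the base file, so the original mapping remains published but unused.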

Usage

Available Tools

The MCP server exposes the following tools. Each entry includes the parameter schema, a concrete input example, and an example of the output the tool returns.


  1. analyze_website_requests Description: Open a website in a real browser, capture all HTTP requests, and store a site analysis. The ListTools schema advertises the client-side defaults: waitTime 3000 ms, includeImages false, quickMode false. See the implementation in src/server.ts.

Parameters:

  • url (string, required): The website URL to analyze (must include http:// or https://)
  • waitTime (number, optional): Additional wait time in milliseconds for dynamic content (default: 3000, max: 10000)
  • includeImages (boolean, optional): Whether to include image and media requests (default: false)
  • quickMode (boolean, optional): Use quick loading mode with minimal waiting (default: false)

Example input:

{
  "url": "https://example.com",
  "waitTime": 3000,
  "includeImages": false,
  "quickMode": false
}

Example output (stored analysis summary):

{
  "websiteInfo": {
    "url": "https://example.com",
    "title": "Example Domain",
    "analysisTimestamp": "2025-08-15T18:00:00.000Z",
    "renderMethod": "client"
  },
  "requestSummary": {
    "totalRequests": 12,
    "uniqueDomains": 4,
    "requestsByType": {
      "document": 1,
      "script": 6,
      "stylesheet": 2,
      "xhr": 2,
      "font": 1
    }
  },
  "domains": [
    "example.com",
    "cdn.example.com",
    "analytics.google.com"
  ],
  "antiBotDetection": {
    "detected": false
  },
  "browserStorage": {
    "cookies": [
      {
        "name": "sessionid",
        "value": "abc123",
        "domain": "example.com",
        "path": "/",
        "expires": 1692136800,
        "httpOnly": true,
        "secure": true,
        "sameSite": "Lax"
      }
    ],
    "localStorage": {
      "theme": "dark",
      "pref": "1"
    },
    "sessionStorage": {
      "lastVisited": "/home"
    }
  }
}

  2. get_requests_by_domain Description: Retrieve all captured requests for a specific domain from a previously stored analysis. This reads the stored analysis results; run analyze_website_requests first. See handler schema in src/handlers/schemas.ts.

Parameters:

  • url (string, required): The original URL that was analyzed
  • domain (string, required): The domain to filter requests for (e.g., "api.example.com")

Example input:

{
  "url": "https://example.com",
  "domain": "api.example.com"
}

Example output:

{
  "url": "https://example.com",
  "domain": "api.example.com",
  "totalRequests": 3,
  "requests": [
    {
      "id": "req-1",
      "url": "https://api.example.com/v1/data",
      "method": "GET",
      "resourceType": "xhr",
      "status": 200,
      "timestamp": "2025-08-15T18:00:00.200Z"
    },
    {
      "id": "req-2",
      "url": "https://api.example.com/v1/auth",
      "method": "POST",
      "resourceType": "xhr",
      "status": 201,
      "timestamp": "2025-08-15T18:00:00.450Z"
    }
  ]
}

  3. get_request_details Description: Return full details for a single captured request (headers, body if captured, response, timings). Requires both the analyzed URL and the requestId present in the stored analysis. See the tool registration in src/server.ts.

Parameters:

  • url (string, required): The original URL that was analyzed
  • requestId (string, required): The unique ID of the request to retrieve

Example input:

{
  "url": "https://example.com",
  "requestId": "req-2"
}

Example output:

{
  "id": "req-2",
  "url": "https://api.example.com/v1/auth",
  "method": "POST",
  "requestHeaders": {
    "content-type": "application/json"
  },
  "requestBody": "{\"username\":\"user\",\"password\":\"***\"}",
  "status": 201,
  "responseHeaders": {
    "content-type": "application/json"
  },
  "responseBody": "{\"token\":\"abc123\"}",
  "resourceType": "xhr",
  "timestamp": "2025-08-15T18:00:00.450Z",
  "timings": {
    "start": 1692136800.45,
    "end": 1692136800.47,
    "durationMs": 20
  }
}

  4. get_request_summary Description: Return the stored analysis summary for a previously analyzed URL. The handler returns the same summary object produced by analyze_website_requests. See src/handlers/analysis.ts.

Parameters:

  • url (string, required): The website URL that was previously analyzed

Example input:

{
  "url": "https://example.com"
}

Example output: the same stored analysis summary object shown above for analyze_website_requests.

  5. extract_html_elements Description: Extract important HTML elements (text, images, links, scripts) with their CSS selectors and basic metadata. Uses the page-level extraction logic in src/services/page_analyzer.ts.

Parameters:

  • url (string, required): The page URL to analyze
  • filterType (string, required): One of: text, image, link, script

Example input:

{
  "url": "https://example.com",
  "filterType": "text"
}

Example output:

[
  {
    "content": "Hello world",
    "selector": "#intro",
    "type": "text",
    "tag": "p",
    "attributes": {
      "id": "intro",
      "data-test": "x"
    }
  },
  {
    "content": "Title here",
    "selector": ".title",
    "type": "text",
    "tag": "h1",
    "attributes": {}
  }
]

  6. fetch Description: Perform a direct HTTP request (no browser rendering) and return the status, headers, and body. The schema is defined in src/handlers/schemas.ts.

Parameters:

  • url (string, required): The URL to fetch
  • method (string, optional): HTTP method, one of GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS (default: GET)
  • headers (object, optional): Key-value map of request headers
  • body (string, optional): Request body for POST/PUT/PATCH

Example input:

{
  "url": "https://api.example.com/data",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json"
  },
  "body": "{\"key\":\"value\"}"
}

Example output:

{
  "status": 200,
  "statusText": "OK",
  "headers": {
    "content-type": "application/json"
  },
  "body": "{\"result\":\"ok\"}",
  "durationMs": 123
}
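
Outside of an AI assistant, the same tools can be invoked programmatically with the MCP TypeScript SDK client. A minimal sketch that chains two of the tools above (the client API comes from @modelcontextprotocol/sdk; the server is spawned over STDIO as in the installation section):

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({ command: "bun", args: ["src/index.ts"] });
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// 1. Analyze the site; the server stores the results for follow-up queries.
await client.callTool({
  name: "analyze_website_requests",
  arguments: { url: "https://example.com", waitTime: 3000 },
});

// 2. Filter the stored capture down to a single domain.
const byDomain = await client.callTool({
  name: "get_requests_by_domain",
  arguments: { url: "https://example.com", domain: "api.example.com" },
});
console.log(byDomain.content);

await client.close();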

Integration with AI Assistants

Claude Desktop

Add this to your Claude Desktop MCP settings:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "web-scraper-analytics": {
      "command": "bun",
      "args": ["/path/to/mcp_scraper_analytics/src/index.ts"],
      "disabled": false
    }
  }
}

For HTTP mode, configure as a remote server:

{
  "mcpServers": {
    "web-scraper-analytics-http": {
      "url": "http://localhost:8080",
      "disabled": false
    }
  }
}

VSCode with Claude Dev

Edit ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json:

{
  "mcpServers": {
    "web-scraper-analytics": {
      "command": "bun",
      "args": ["/path/to/mcp_scraper_analytics/src/index.ts"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

For HTTP mode, add a remote configuration:

{
  "mcpServers": {
    "web-scraper-analytics-http": {
      "url": "http://localhost:8080",
      "disabled": false,
      "autoApprove": []
    }
  }
}

Use Cases

1. Security Analysis

  • Identify all third-party services a website connects to
  • Discover potential data leakage through external requests
  • Track analytics and advertising integrations

2. Performance Optimization

  • Analyze request patterns and timing
  • Identify heavy resource loading
  • Find opportunities for caching and optimization

3. Competitive Analysis

  • Understand what services competitors use
  • Discover third-party integrations and tools
  • Analyze traffic patterns and dependencies

4. Compliance Auditing

  • Track external data sharing
  • Identify GDPR/privacy compliance issues
  • Document all network communications

5. Development Debugging

  • Debug API calls and network issues
  • Analyze timing and performance problems
  • Understand complex application architectures

Example Queries for AI Assistants

"Analyze the HTTP requests made by https://techcrunch.com and show me all the advertising and analytics services they use"

"Check what third-party services https://shopify.com loads and organize them by category"

"Analyze https://github.com and tell me about their performance characteristics based on the network requests"

"Extract all text and image elements from https://example.com and return their CSS selectors and brief content summaries"

Technical Details

  • Browser Engine: Chromium via Patchright
  • Request Capture: Real browser automation with full JavaScript execution
  • Network Monitoring: Captures both requests and responses
  • Resource Types: Documents, scripts, stylesheets, images, fonts, XHR, fetch, and more
  • Timeout Handling: Configurable timeouts for different scenarios
  • Error Handling: Comprehensive error handling and logging
  • Transport: Dual support for STDIO and HTTP via StreamableHTTPServerTransport; a client sketch for the HTTP transport follows below
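
For the HTTP transport, the SDK's matching client class connects over the network instead of spawning a process. A sketch; the endpoint path here is an assumption, so check the server's startup log for the exact URL:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Assumed endpoint; the server logs its listening address on startup.
const transport = new StreamableHTTPClientTransport(new URL("http://127.0.0.1:8080/mcp"));
const client = new Client({ name: "http-smoke-test", version: "1.0.0" });
await client.connect(transport);
console.log((await client.listTools()).tools.map((t) => t.name));
await client.close();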

Development

Clean build artifacts

bun run clean

Testing

Run the full test suite:

bun run test

Run tests in watch mode (for development):

bun run test:watch

Generate a coverage report:

bun run test:coverage

Project Structure

  • src/server.ts: Main MCP server implementation
  • src/handlers/: Tool request handlers
  • src/services/: Core analysis services (browser, analyzer, etc.)
  • src/__tests__/: Unit and integration tests

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes and add tests
  4. Run bun run test to ensure everything passes
  5. Submit a pull request

License

MIT License - see LICENSE for details.

Troubleshooting

Common Issues

  • Browser not found: Ensure Chromium is installed via bunx patchright@latest install chromium
  • Permission errors: Run with appropriate user permissions for browser automation
  • Network timeouts: Increase waitTime parameter for slow-loading sites

Debug Mode

Run with verbose logging:

bun src/index.ts --verbose

Or in Docker:

docker compose run --rm mcp bun src/index.ts --verbose

Roadmap

  • Add support for more browser engines (Firefox, WebKit)
  • Implement request/response body capture for all resource types
  • Add performance metrics and waterfall charts
  • Support for authenticated sessions and cookie management
  • Integration with popular security scanning tools
  • API for programmatic analysis without MCP
