WPSpider

WPSpider is a robust, standalone command-line tool designed to crawl WordPress websites via their publicly available REST API. It efficiently paginates through endpoints (posts, pages, media, etc.) and archives the data into a structured SQLite database for analysis, backup, or migration.

Features

  • Intelligent Discovery: Automatically handles API path construction from simple domains (e.g., example.com → https://example.com/wp-json/wp/v2/).
  • Smart Pagination: Iterates through all available pages using the maximum page size to ensure complete data retrieval (see the sketch after this list).
  • Resilient Crawling: Handles rate limiting, missing endpoints, and API errors gracefully without crashing.
  • Flexible Output: Saves all data to a portable SQLite database (.db) with metadata tracking.
  • Standalone Executable: Distributed as a single .exe file requiring no Python installation for end-users.
  • Configurable: Fully controlled via a simple config.json file.
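Both the discovery and pagination behaviors are small enough to sketch. The code below is illustrative only, not the tool's actual source: the function names are hypothetical, and it assumes the requests library, the WP REST API's per_page cap of 100, and the X-WP-TotalPages response header.

# Illustrative sketch; wpspider's internals may differ.
import requests

def api_root(target: str) -> str:
    """Normalize a bare domain or URL into a WordPress REST API v2 root."""
    if not target.startswith(("http://", "https://")):
        target = "https://" + target
    return target.rstrip("/") + "/wp-json/wp/v2/"

def fetch_all(target: str, endpoint: str, user_agent: str = "WPSpider/1.0"):
    """Page through an endpoint at the maximum page size (100 for the WP REST API)."""
    url = api_root(target) + endpoint
    headers = {"User-Agent": user_agent}
    page, records = 1, []
    while True:
        resp = requests.get(url, params={"per_page": 100, "page": page},
                            headers=headers, timeout=30)
        if resp.status_code == 400:  # WordPress answers 400 once the page number runs past the end
            break
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        # The API advertises the total page count, so we can stop without an extra empty request.
        if page >= int(resp.headers.get("X-WP-TotalPages", page)):
            break
        page += 1
    return records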

Installation

For End Users

  1. Download the latest release binary (wpspider.exe).
  2. Ensure config.json is present in the same directory.

For Developers

  1. Clone the repository.
  2. Initialize the environment (Windows/PowerShell):
    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install -r requirements.txt

Usage

1. Configuration

Create or edit the config.json file in the application directory.

Minimal Example:

{
    "target": "https://techcrunch.com",
    "db_name": null
}
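With this minimal configuration the crawler uses the default endpoint list, logs to wpspider.log, and saves the database to the current working directory under a name derived from the target domain, as described in the settings list below.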

Full Configuration Example:

{
    "target": "https://example.com",
    "endpoints": [
        "posts",
        "pages",
        "media",
        "categories",
        "users"
    ],
    "db_name": null,
    "output_directory": null,
    "user_agent": "WPSpider/1.0 (Nebula Crawler; +https://wpspider.local)",
    "log_file": "wpspider.log"
}
Settings (with defaults):

  • target: The URL or domain of the WordPress site. Required.
  • endpoints: List of API endpoints to crawl. Default: ['posts', 'pages', 'media', ...]
  • db_name: Output SQLite file. If null, the filename is derived from the target domain. Default: null
  • output_directory: Output directory, used only when db_name is null. Default: null (current working directory)
  • user_agent: Custom User-Agent string. Default: WPSpider/1.0 (Nebula Crawler; +https://wpspider.local)
  • log_file: Path to save the execution log. Default: wpspider.log
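The defaults above amount to a small merge step at startup. The following is a sketch, not the actual wpspider source: the setting names come from this list, while the helper name and the exact filename derivation (domain with dots replaced by dashes) are assumptions.

# Sketch of config handling; defaults mirror the settings list above.
import json
from pathlib import Path
from urllib.parse import urlparse

DEFAULTS = {
    "endpoints": ["posts", "pages", "media", "categories", "users"],
    "db_name": None,
    "output_directory": None,
    "user_agent": "WPSpider/1.0 (Nebula Crawler; +https://wpspider.local)",
    "log_file": "wpspider.log",
}

def load_config(path: str = "config.json") -> dict:
    cfg = {**DEFAULTS, **json.loads(Path(path).read_text(encoding="utf-8"))}
    if not cfg.get("target"):
        raise ValueError("config.json must define 'target'")
    if cfg["db_name"] is None:
        # Derive a filename from the target domain; placed in output_directory (or the CWD).
        # The exact naming scheme used by wpspider may differ.
        target = cfg["target"] if "://" in cfg["target"] else "https://" + cfg["target"]
        domain = urlparse(target).netloc
        out_dir = Path(cfg["output_directory"] or ".")
        cfg["db_name"] = str(out_dir / f"{domain.replace('.', '-')}.db")
    return cfg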

2. Running the Crawler

Run the executable (or Python script) from your terminal:

.\wpspider.exe

Or from source:

python -m wpspider.main

Optional CLI parameters:

  • --target, -t, --url, --site, --domain
  • --output, -o, --db, --database, --db-name
  • --directory, -d, --outdirectory, --outputdirectory
  • --useragent, --user-agent, -u

The tool will display progress as it connects to the target, discovers endpoints, and fetches records.
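The flag aliases above suggest an argument parser along these lines. This is a hypothetical reconstruction of the CLI surface: only the flag names are taken from this README, and it assumes each flag simply overrides the corresponding config.json setting.

# Sketch of the CLI, not the actual wpspider source.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wpspider")
    parser.add_argument("--target", "-t", "--url", "--site", "--domain",
                        dest="target", help="WordPress site to crawl (overrides config.json)")
    parser.add_argument("--output", "-o", "--db", "--database", "--db-name",
                        dest="db_name", help="Output SQLite file")
    parser.add_argument("--directory", "-d", "--outdirectory", "--outputdirectory",
                        dest="output_directory", help="Directory for the derived database name")
    parser.add_argument("--useragent", "--user-agent", "-u",
                        dest="user_agent", help="Custom User-Agent string")
    return parser

# Example: python -m wpspider.main --target example.com -o example.db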

Output Structure

Data is saved to a SQLite database specified in your config.

Metadata Table (targets)

Tracks the history of crawls.

  • id: Primary Key
  • url: The full API URL used.
  • domain: The target domain.
  • date_crawled: Timestamp of the operation.

Data Tables

Each endpoint gets its own table (e.g., posts, users). To ensure 100% data fidelity across different WordPress versions and plugin schemas, every table uses the same minimal layout:

  • id: The WordPress object ID.
  • data: The full raw JSON object stored as text.
  • crawled_at: Timestamp for the specific record.
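The schema is simple enough to recreate or query by hand. The sketch below shows what writing one batch of records could look like: the table and column names come from this section, while the function name and INSERT OR REPLACE strategy are assumptions.

# Illustrative writer for the schema described above; wpspider's actual code may differ.
import json
import sqlite3
from datetime import datetime, timezone

def save_records(db_path: str, endpoint: str, records: list[dict]) -> None:
    conn = sqlite3.connect(db_path)
    # One table per endpoint: WordPress object ID, raw JSON payload, and a per-record timestamp.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{endpoint}" '
        "(id INTEGER PRIMARY KEY, data TEXT, crawled_at TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        f'INSERT OR REPLACE INTO "{endpoint}" (id, data, crawled_at) VALUES (?, ?, ?)',
        [(rec["id"], json.dumps(rec), now) for rec in records],
    )
    conn.commit()
    conn.close()

# Reading back: json.loads() on the data column returns the original WordPress object.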

Development

Structure

  • src/: Source code.
  • tests/: Unit and integration tests.
  • scripts/: PowerShell automation scripts.
  • docs/: Project documentation and schemas.

Building the Executable

To bundle the application into a standalone .exe:

# Install build dependencies
pip install pyinstaller

# Run the build script
.\scripts\build.ps1

The output will be located in the dist/ folder.

License

MIT License
