WPSpider is a robust, standalone command-line tool designed to crawl WordPress websites via their publicly available REST API. It efficiently paginates through endpoints (posts, pages, media, etc.) and archives the data into a structured SQLite database for analysis, backup, or migration.
- Intelligent Discovery: Automatically handles API path construction from simple domains (e.g., `example.com` → `https://example.com/wp-json/wp/v2/`).
- Smart Pagination: Iterates through all available pages using the maximum page size to ensure complete data retrieval.
- Resilient Crawling: Handles rate limiting, missing endpoints, and API errors gracefully without crashing.
- Flexible Output: Saves all data to a portable SQLite database (`.db`) with metadata tracking.
- Standalone Executable: Distributed as a single `.exe` file requiring no Python installation for end users.
- Configurable: Fully controlled via a simple `config.json` file.
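The discovery and pagination behaviour described above can be sketched roughly as follows. This is an illustrative sketch using only the standard library; `build_api_root` and `fetch_all` are assumed names, not WPSpider's actual internals:

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def build_api_root(target: str) -> str:
    """Derive the REST API root from a bare domain or a full URL."""
    if not target.startswith(("http://", "https://")):
        target = "https://" + target
    return target.rstrip("/") + "/wp-json/wp/v2/"

def fetch_all(api_root: str, endpoint: str, user_agent: str = "WPSpider/1.0"):
    """Yield every record from an endpoint, requesting 100 items per page."""
    page = 1
    while True:
        query = urllib.parse.urlencode({"per_page": 100, "page": page})
        req = urllib.request.Request(f"{api_root}{endpoint}?{query}",
                                     headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                batch = json.load(resp)
        except urllib.error.HTTPError as exc:
            if exc.code == 400:  # WordPress signals "past the last page" with 400
                break
            raise
        if not batch:
            break
        yield from batch
        page += 1
```

For example, `build_api_root("example.com")` yields `https://example.com/wp-json/wp/v2/`, and `fetch_all(root, "posts")` keeps requesting pages until the API reports no more results.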
- Download the latest release binary (`wpspider.exe`).
- Ensure `config.json` is present in the same directory.
- Clone the repository.
- Initialize the environment (Windows/PowerShell):
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Create or edit the config.json file in the application directory.
Minimal Example:
```json
{
  "target": "https://techcrunch.com",
  "db_name": null
}
```

Full Configuration Example:
```json
{
  "target": "https://example.com",
  "endpoints": [
    "posts",
    "pages",
    "media",
    "categories",
    "users"
  ],
  "db_name": null,
  "output_directory": null,
  "user_agent": "WPSpider/1.0 (Nebula Crawler; +https://wpspider.local)",
  "log_file": "wpspider.log"
}
```

| Setting | Description | Default |
|---|---|---|
| `target` | The URL or domain of the WordPress site. | Required |
| `endpoints` | List of API endpoints to crawl. | `['posts', 'pages', 'media', ...]` |
| `db_name` | Output SQLite file. If `null`, the filename is derived from the target domain. | `null` |
| `output_directory` | Output directory, used only when `db_name` is `null`. | `null` (current working directory) |
| `user_agent` | Custom User-Agent string. | `WPSpider/1.0 (Nebula Crawler; +https://wpspider.local)` |
| `log_file` | Path to save the execution log. | `wpspider.log` |
Run the executable (or Python script) from your terminal:
```powershell
.\wpspider.exe
```

Or from source:

```powershell
python -m wpspider.main
```

Optional CLI parameters:

- `--target`, `-t`, `--url`, `--site`, `--domain`
- `--output`, `-o`, `--db`, `--database`, `--db-name`
- `--directory`, `-d`, `--outdirectory`, `--outputdirectory`
- `--useragent`, `--user-agent`, `-u`
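These alias groups could be wired up with `argparse` roughly as below. `build_parser` is a hypothetical name and the real entry point may differ; the sketch only shows how one destination can accept several option strings:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags mirroring the alias groups listed above."""
    parser = argparse.ArgumentParser(prog="wpspider")
    parser.add_argument("--target", "-t", "--url", "--site", "--domain",
                        dest="target")
    parser.add_argument("--output", "-o", "--db", "--database", "--db-name",
                        dest="db_name")
    parser.add_argument("--directory", "-d", "--outdirectory",
                        "--outputdirectory", dest="output_directory")
    parser.add_argument("--useragent", "--user-agent", "-u", dest="user_agent")
    return parser
```

Any flag left unset would presumably fall back to the corresponding `config.json` value.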
The tool will display progress as it connects to the target, discovers endpoints, and fetches records.
Data is saved to a SQLite database specified in your config.
Tracks the history of crawls.
- `id`: Primary key.
- `url`: The full API URL used.
- `domain`: The target domain.
- `date_crawled`: Timestamp of the operation.
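One plausible shape for this table, using Python's built-in `sqlite3` module. The DDL is a sketch consistent with the columns listed above, not necessarily the exact schema WPSpider emits:

```python
import sqlite3

def init_crawl_log(conn: sqlite3.Connection) -> None:
    """Create the crawl-history table if it does not exist yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS crawls (
            id           INTEGER PRIMARY KEY AUTOINCREMENT,
            url          TEXT NOT NULL,      -- full API URL used
            domain       TEXT NOT NULL,      -- target domain
            date_crawled TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
```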
Each endpoint gets its own table (e.g., `posts`, `users`).

To ensure complete data fidelity across different WordPress versions and plugin schemas, each row stores:

- `id`: The WordPress object ID.
- `data`: The full raw JSON object stored as text.
- `crawled_at`: Timestamp for the specific record.
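Storing each record as raw JSON keeps the per-endpoint tables stable no matter which plugins extend the API schema. A sketch of the write path, keyed on the WordPress object ID (`save_records` is an illustrative name, not the tool's actual function):

```python
import json
import sqlite3

def save_records(conn: sqlite3.Connection, endpoint: str, records) -> None:
    """Store raw JSON objects in a per-endpoint table, keyed by WordPress ID."""
    # Sanitize the endpoint name before using it as a table identifier.
    table = "".join(ch for ch in endpoint if ch.isalnum() or ch == "_")
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            id         INTEGER PRIMARY KEY,   -- WordPress object ID
            data       TEXT NOT NULL,         -- full raw JSON as text
            crawled_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (id, data) VALUES (?, ?)",
        ((rec["id"], json.dumps(rec)) for rec in records),
    )
    conn.commit()
```

`INSERT OR REPLACE` makes repeated crawls idempotent: re-crawling the same post simply refreshes its stored JSON.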
- `src/`: Source code.
- `tests/`: Unit and integration tests.
- `scripts/`: PowerShell automation scripts.
- `docs/`: Project documentation and schemas.
To bundle the application into a standalone `.exe`:

```powershell
# Install build dependencies
pip install pyinstaller

# Run the build script
.\scripts\build.ps1
```

The output will be located in the `dist/` folder.