Scrapyd

Scrapyd is a service for deploying and running Scrapy spiders.

It enables you to upload Scrapy projects and control their spiders through a JSON API, making it well suited to production web scraping deployments.

Features

🚀 Easy Deployment - Deploy Scrapy projects as Python eggs via HTTP API

📊 Web Dashboard - Monitor running jobs, view logs, and manage projects via web interface

🔧 RESTful API - Complete JSON API for programmatic spider management

⚡ Concurrent Processing - Run multiple spiders simultaneously with configurable concurrency

📈 Job Management - Schedule, monitor, and cancel spider jobs with a persistent queue

🔒 Authentication - Built-in HTTP basic authentication support

📂 Project Versioning - Deploy and manage multiple versions of your scraping projects

📊 Logging & Monitoring - Comprehensive logging and status monitoring capabilities

Quick Start

Installation

pip install scrapyd

Basic Usage

  1. Start Scrapyd:

    scrapyd

    The web interface will be available at http://localhost:6800

  2. Deploy a Project (see the scrapy.cfg sketch after this list):

    # Using scrapyd-client
    pip install scrapyd-client
    scrapyd-deploy

  3. Schedule a Spider:

    curl http://localhost:6800/schedule.json \
         -d project=myproject \
         -d spider=myspider

  4. Monitor Jobs:

    Visit http://localhost:6800 or use the API:

    curl http://localhost:6800/daemonstatus.json
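
As noted in step 2, scrapyd-deploy reads its deployment target from a [deploy] section in the project's scrapy.cfg. A minimal sketch, with placeholder URL and project name:

[deploy]
url = http://localhost:6800/
project = myproject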

API Endpoints

Core endpoints for spider management:

  • /daemonstatus.json - Get daemon status and job counts
  • /listprojects.json - List all deployed projects
  • /listspiders.json - List spiders in a project
  • /listjobs.json - List pending/running/finished jobs
  • /schedule.json - Schedule a spider to run
  • /cancel.json - Cancel a running job
  • /addversion.json - Deploy a new project version
  • /delversion.json - Delete a project version
  • /delproject.json - Delete an entire project
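
For endpoints not covered in the example below, such as /addversion.json and /cancel.json, here is a minimal sketch using the requests library; the egg path, version, and job ID are placeholders:

import requests

# Deploy a packaged egg directly (what scrapyd-deploy does under the hood).
with open('myproject.egg', 'rb') as egg:  # placeholder path to a built egg
    response = requests.post('http://localhost:6800/addversion.json',
                             data={'project': 'myproject', 'version': '1.0'},
                             files={'egg': egg})
print(response.json())

# Cancel a running job by its job ID.
response = requests.post('http://localhost:6800/cancel.json',
                         data={'project': 'myproject', 'job': 'JOB_ID'})
print(response.json())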

Example API Usage

import requests

# Check status
response = requests.get('http://localhost:6800/daemonstatus.json')
print(response.json())

# Schedule a spider
response = requests.post('http://localhost:6800/schedule.json', data={
    'project': 'myproject',
    'spider': 'myspider',
    'setting': 'DOWNLOAD_DELAY=2'
})
job_id = response.json()['jobid']

# Monitor the job
response = requests.get('http://localhost:6800/listjobs.json?project=myproject')
jobs = response.json()
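
The listjobs.json response groups jobs into pending, running, and finished lists, each entry carrying an 'id' field. A minimal polling sketch (assuming the server and job_id from the example above) waits for the job to finish:

import time
import requests

def wait_for_job(project, job_id, base_url='http://localhost:6800', interval=5):
    # Poll listjobs.json until the job ID appears in the finished list.
    while True:
        jobs = requests.get(f'{base_url}/listjobs.json',
                            params={'project': project}).json()
        if any(job['id'] == job_id for job in jobs.get('finished', [])):
            return
        time.sleep(interval)

wait_for_job('myproject', job_id)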

Configuration

Create a scrapyd.conf to customize settings (Scrapyd looks for it in /etc/scrapyd/scrapyd.conf, ~/.scrapyd.conf, and the current working directory):

[scrapyd]
bind_address = 0.0.0.0
http_port = 6800
max_proc_per_cpu = 4
username = admin
password = secret
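
Because bind_address = 0.0.0.0 exposes the API to the whole network, the username/password pair enables HTTP basic authentication for clients. A minimal sketch with requests, assuming the credentials above:

import requests

# Credentials must match username/password in scrapyd.conf.
response = requests.get('http://localhost:6800/daemonstatus.json',
                        auth=('admin', 'secret'))
print(response.json())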

Docker Support

# Run with Docker
docker run -p 6800:6800 scrapy/scrapyd

# Or build from source
docker build -t scrapyd .
docker run -p 6800:6800 scrapyd

Scrapy Ecosystem Integration

Scrapyd is part of the larger Scrapy ecosystem. Here's how the main components work together:

🕷️ Scrapy - The Core Framework: The foundational web scraping framework for Python that provides the tools to build spiders, handle requests/responses, and extract data from websites.

🚀 Scrapyd - Deployment & Management Server: A service that lets you deploy Scrapy projects and run spiders remotely via HTTP API. Suited to production deployments where you need to manage multiple projects and schedule spider execution.

📦 scrapyd-client - Deployment Tools: Command-line tools that simplify deploying Scrapy projects to Scrapyd servers:

  • scrapyd-deploy - Builds and uploads project eggs to Scrapyd
  • scrapyd-client - Programmatic Python client for the Scrapyd API
  • Handles versioning, dependencies, and configuration management

⚡ Scrapyrt - Real-time HTTP API: A lightweight HTTP interface for Scrapy that enables real-time scraping requests. Unlike Scrapyd (designed for long-running spiders), Scrapyrt is optimized for quick, on-demand scraping tasks.

Typical Workflow

  1. Development: Create spiders using Scrapy framework
  2. Deployment: Use scrapyd-client to deploy projects to Scrapyd
  3. Execution: Schedule and monitor spiders via Scrapyd API
  4. Real-time Tasks: Use Scrapyrt for immediate scraping needs

# 1. Create Scrapy project
scrapy startproject myproject

# 2. Deploy to Scrapyd
scrapyd-deploy

# 3. Schedule spider via Scrapyd
curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider

# 4. Real-time scraping via Scrapyrt (alternative approach)
curl "localhost:9080/crawl.json?spider_name=myspider&url=http://example.com"

When to Use Which Tool

| Use Case           | Scrapy            | Scrapyd               | Scrapyrt            |
| ------------------ | ----------------- | --------------------- | ------------------- |
| Development        | ✅ Core framework | ❌ Not needed         | ❌ Not needed       |
| Local Testing      | scrapy crawl      | ❌ Overkill           | ✅ Quick HTTP tests |
| Production Batches | ✅ Spider logic   | ✅ Job scheduling     | ❌ Not suitable     |
| Long-running Jobs  | ✅ Spider logic   | ✅ Process management | ❌ Not recommended  |
| Real-time API      | ✅ Spider logic   | ❌ Too heavy          | ✅ Perfect fit      |
| Multiple Projects  | ✅ Individual dev | ✅ Centralized mgmt   | ❌ Single project   |
| Job Monitoring     | ❌ Limited        | ✅ Full dashboard     | ❌ Limited          |

Documentation

📚 Full Documentation: https://scrapyd.readthedocs.io/

Community

  • Issues: Report bugs and request features on GitHub Issues
  • Discussions: Join conversations on GitHub Discussions
  • Stack Overflow: Ask questions with the scrapy tag

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

BSD 3-Clause License. See LICENSE for details.
