A Python tool to automatically crawl and archive entire subdomains in the Internet Archive's Wayback Machine.
Wayback Archiver helps preserve web content by crawling all pages within a specified subdomain and submitting them to the Internet Archive's Wayback Machine. This is particularly useful for:
- Preserving content from websites that might be shut down
- Creating complete historical snapshots of blogs or documentation sites
- Ensuring important information remains available even if the original site changes
- Archiving personal projects, portfolios, or academic websites
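Under the hood, each discovered page is submitted to the Wayback Machine's Save Page Now endpoint. As a rough illustration of that underlying request (a sketch of the general API pattern, not the archiver's exact internals):

```python
import requests
from typing import Optional

def save_page(url: str) -> Optional[str]:
    """Ask the Wayback Machine's Save Page Now endpoint to capture `url`.

    Returns the snapshot location on success, or None on failure.
    """
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    if resp.ok:
        # Successful captures typically point at the new snapshot via a
        # header; exact response details vary between API versions.
        return resp.headers.get("Content-Location", resp.url)
    return None

if __name__ == "__main__":
    print(save_page("https://example.com"))
```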
Features:
- User-friendly Web Interface: Easy-to-use UI for configuring and monitoring archiving jobs
- Recursive Crawling: Automatically discovers and follows links within the target subdomain
- Smart Filtering: Excludes common paths that would result in duplicate content (like tag pages, categories, etc.)
- Media Handling: Excludes image files from archiving by default (configurable)
- Configurable Parameters:
  - Cap the total number of pages crawled with a max pages limit
  - Limit recursive crawl depth to prevent excessive crawling
  - Set delays between requests for API politeness
  - Custom exclude patterns for site-specific requirements
  - HTTPS-only mode (enabled by default)
  - Image exclusion option (enabled by default)
- Batch Processing: Handles large sites by processing URLs in batches with configurable pauses
- Resilient Operation:
  - Persistent connections with DNS caching to reduce connection issues
  - Retry mechanism with exponential backoff for failed archive attempts (sketched after this list)
  - Graceful error handling and interruption recovery
  - Ability to resume archiving from previously failed URLs
- Robots.txt Respect: Honors robots.txt directives for ethical crawling
- Detailed Logging: Comprehensive logs and output files to track progress and results
- Security Features:
  - CSRF protection in the web interface
  - Secure credential handling options
  - Input validation and sanitization
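The retry mechanism listed above can be pictured as a small loop: try, wait, and double the wait after each failure. A minimal sketch using the documented defaults (3 retries, backoff factor 1.5); the tool's actual implementation may differ:

```python
import time
import requests

def archive_with_retries(url: str, max_retries: int = 3, backoff_factor: float = 1.5) -> bool:
    """Try to archive `url`, waiting exponentially longer after each failure."""
    for attempt in range(max_retries + 1):
        try:
            resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            if attempt == max_retries:
                return False  # all attempts exhausted
            # Exponential backoff: waits 1.5s, 3s, 6s, ... between attempts.
            time.sleep(backoff_factor * (2 ** attempt))
    return False
```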
Requirements:
- Python 3.8+
- Main dependencies (installed automatically via requirements.txt):
  - `requests`: For HTTP operations
  - `beautifulsoup4`: For HTML parsing
  - `flask`: For the web interface
  - `urllib3` & `dnspython`: For improved networking and DNS resolution
  - `gunicorn`: For production deployment
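Recursive crawling boils down to parsing each fetched page for links and keeping only those on the target subdomain. A minimal sketch using the `requests` and `beautifulsoup4` dependencies listed above (illustrative only, not the tool's actual crawler):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def extract_internal_links(page_url: str) -> set:
    """Return absolute links on `page_url` that stay within its host."""
    host = urlparse(page_url).netloc
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])
        if urlparse(absolute).netloc == host:  # stay on the subdomain
            links.add(absolute)
    return links
```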
Installation:
- Clone or download this repository:
  ```bash
  git clone https://github.com/nickheynen/wayback-archiver.git
  cd wayback-archiver
  ```
- Install the required packages:
  ```bash
  pip install -r requirements.txt
  ```
Usage (web interface):
- Start the application by running the `web_interface.py` script:
  ```bash
  python3 web_interface.py
  ```
- Open your browser and navigate to `http://127.0.0.1:5000`.
- Fill out the form with your target subdomain and configuration options:
  - Basic settings: URL, email, delay, page limits, exclude patterns
  - Options: Controls for robots.txt, HTTPS-only mode, and image exclusion
  - Authentication: Enter S3 credentials directly or specify a config file path
- Click "Start Archiving" and monitor the progress in real time.
You can also run the archiver directly from the command line:
```bash
python3 wayback_archiver.py https://example.com --email [email protected] --delay 15 --max-pages 500 --max-depth 10
```
Common command line options:
```
--delay DELAY                    Delay between archive requests in seconds (default: 15)
--max-pages MAX_PAGES            Maximum number of pages to crawl (default: unlimited)
--max-depth MAX_DEPTH            Maximum crawl depth from the starting URL (default: 10)
--max-retries MAX_RETRIES        Maximum retry attempts for failed archives (default: 3)
--backoff-factor BACKOFF_FACTOR  Exponential backoff factor for retries (default: 1.5)
--batch-size BATCH_SIZE          Process URLs in batches of this size (default: 150)
--batch-pause BATCH_PAUSE        Seconds to pause between batches (default: 180)
--verbose, -v                    Enable verbose logging
```
A minimal invocation needs only the target URL and a contact email:
```bash
python3 wayback_archiver.py https://example.com --email [email protected]
```
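To make `--batch-size` and `--batch-pause` concrete, the batching pattern looks roughly like this (an illustrative sketch; `archive_one` is a stand-in for whatever archives a single URL):

```python
import time

def archive_in_batches(urls, archive_one, batch_size=150, batch_pause=180):
    """Process `urls` in fixed-size batches, pausing between batches
    so large sites don't hammer the Wayback Machine API."""
    for start in range(0, len(urls), batch_size):
        for url in urls[start:start + batch_size]:
            archive_one(url)
        if start + batch_size < len(urls):
            time.sleep(batch_pause)  # default: 180 seconds between batches
```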
There are three ways to provide S3 credentials, in order of security preference:
- Environment Variables (Recommended for advanced users):
  ```bash
  # Mac/Linux: Set the variables
  export IA_S3_ACCESS_KEY="your_access_key"
  export IA_S3_SECRET_KEY="your_secret_key"

  # Windows: Set the variables
  set IA_S3_ACCESS_KEY=your_access_key
  set IA_S3_SECRET_KEY=your_secret_key

  # Run the tool with the --use-env-keys flag
  python3 wayback_archiver.py https://example.com --use-env-keys
  ```
- Configuration File (Recommended for beginners):
  Step-by-step guide:
  a) Create a config file with your favorite text editor.
     Mac/Linux:
     ```bash
     # Create the file
     touch ~/.ia_credentials.ini
     # Set secure permissions (only you can read it)
     chmod 600 ~/.ia_credentials.ini
     # Edit with your preferred editor
     nano ~/.ia_credentials.ini
     ```
     Windows: Create a file named `.ia_credentials.ini` in your user folder (e.g., `C:\Users\yourusername\.ia_credentials.ini`).
  b) Add the following content to the file, replacing the placeholders with your actual keys (a sketch of how this INI format is parsed follows this list):
     ```ini
     [default]
     s3_access_key = your_access_key
     s3_secret_key = your_secret_key
     ```
  c) Run the archiver with your config file:
     ```bash
     # Mac/Linux
     python3 wayback_archiver.py https://example.com --config-file ~/.ia_credentials.ini
     # Windows
     python3 wayback_archiver.py https://example.com --config-file C:\Users\yourusername\.ia_credentials.ini
     ```
- Command Line (Not recommended for security reasons):
  ```bash
  python3 wayback_archiver.py https://example.com --s3-access-key YOUR_ACCESS_KEY --s3-secret-key YOUR_SECRET_KEY
  ```
Note: Credentials saved in the config file are kept for future runs, so you only need to set them up once.
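For reference, the credentials file used by the configuration-file method above is standard INI syntax, which Python's built-in `configparser` reads directly. A minimal parsing sketch (the archiver's own loading code may differ):

```python
import configparser
from pathlib import Path

config = configparser.ConfigParser()
config.read(Path.home() / ".ia_credentials.ini")

# Pull the keys out of the [default] section shown above.
access_key = config["default"]["s3_access_key"]
secret_key = config["default"]["s3_secret_key"]
```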
By default, the tool only archives HTTPS URLs. To include HTTP URLs:
```bash
python3 wayback_archiver.py https://example.com --include-http
```
By default, the tool respects robots.txt. To override:
```bash
python3 wayback_archiver.py https://example.com --ignore-robots-txt
```
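Respecting robots.txt means checking each discovered URL against the site's rules before fetching it. A minimal sketch with Python's standard `urllib.robotparser` (illustrative only; the user-agent string here is hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Skip any URL the site disallows for our user agent.
if rp.can_fetch("WaybackArchiver", "https://example.com/some/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```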
By default, image files (jpg, png, gif, etc.) are excluded from archiving. To include them:
```bash
python3 wayback_archiver.py https://example.com --include-images
```
For all available options:
```bash
python3 wayback_archiver.py --help
```
The web interface can be deployed online using various hosting services.

PythonAnywhere:
- Create a free account on PythonAnywhere
- Upload the project files or clone the repository
- Set up a web app with Flask
- Configure your WSGI file to point to the `web_interface.py` application

Heroku:
- Add a `Procfile` with the content: `web: gunicorn web_interface:app`
- Add `gunicorn` to your requirements.txt
- Deploy to Heroku
By default, output files are saved to the `wayback_results` directory:
- Log Files:
  - `wayback_archiver.log`: Contains detailed logs of the archiving process
  - `wayback_web.log`: Contains logs from the web interface (when using the web UI)
- Results:
  - `wayback_results/successful_urls_<domain>_<timestamp>.json`: Contains all successfully archived URLs
  - `wayback_results/failed_urls_<domain>_<timestamp>.json`: Contains URLs that failed to archive

You can use the `/results` endpoint in the web interface to view archived results.
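Because failed URLs land in a JSON file, resuming is essentially a matter of feeding that file back in. A hedged sketch of reading it (the file name below is an example; the flat list-of-strings layout is an assumption about the JSON structure):

```python
import json

# Example path; the real <domain> and <timestamp> parts vary per run.
with open("wayback_results/failed_urls_example.com_20240101-120000.json") as f:
    failed_urls = json.load(f)  # assumed: a flat JSON list of URL strings

for url in failed_urls:
    print("would retry:", url)
```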
This tool uses the Internet Archive's Wayback Machine API. Please use it responsibly:
- Set reasonable delays between requests (the default is 15 seconds)
- Provide your email or S3 authentication when archiving
- Use S3 authentication if you're a frequent contributor (contact Internet Archive for credentials)
- Keep HTTPS-only mode enabled when possible for better web security
- Respect robots.txt directives (enabled by default)
- Respect the terms of service of both the Internet Archive and target websites
- Consider donating to the Internet Archive if you find this tool valuable
This project is licensed under the MIT License. See the `LICENSE` file for details.
Contributions are welcome! To contribute:
- Fork the repository
- Create a new branch for your feature
- Add your changes
- Submit a pull request
For bugs, questions, or feature requests, please open an issue on GitHub.