A simple tool to save copies of websites to the Internet Archive's Wayback Machine.
This tool helps you save entire websites to the Wayback Machine. It:
- Starts at any webpage you specify
- Finds all the links on that page
- Follows those links to find more pages on the same site
- Saves each page to the Wayback Machine
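
Under the hood, the idea is a simple same-site crawl combined with the Wayback Machine's public "Save Page Now" endpoint (`https://web.archive.org/save/<url>`). The sketch below is only an illustration of that approach, not this tool's actual code; it assumes the third-party `requests` and `beautifulsoup4` packages and a fixed politeness delay.

```python
# Minimal sketch of the crawl-and-save idea (illustration only, not the tool's code).
# Assumes: requests, beautifulsoup4, and the public Save Page Now endpoint.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_and_save(start_url, delay=15, max_pages=100):
    domain = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # Ask the Wayback Machine to save a snapshot of this page.
        requests.get("https://web.archive.org/save/" + url, timeout=60)
        # Fetch the page ourselves and queue links that stay on the same site.
        page = requests.get(url, timeout=30)
        for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                queue.append(link)
        time.sleep(delay)  # be polite between requests
```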
This is useful when you want to:
- Save a website that might be taken down soon
- Create a backup of your blog or personal site
- Preserve important documentation or articles
- Archive a project before it changes
1. Make sure you have Python 3.8 or newer installed on your computer.

2. Download this tool:

   ```bash
   git clone https://github.com/nickheynen/wayback-archiver.git
   cd wayback-archiver
   ```

3. Install the required software:

   ```bash
   pip3 install -r requirements.txt
   ```

4. Run it (replace the URL with the website you want to save):

   ```bash
   python3 wayback_archiver.py https://example.com
   ```
You can archive multiple websites by providing a text file with one URL per line. For example, if you have a file called `URLs to archive.txt`:

```text
https://example.com/
https://another-site.org/
https://mysite.net/
```

Run the tool with the `-f` or `--file` option:

```bash
python3 wayback_archiver.py -f "URLs to archive.txt"
```

All other options (like delays, depth, etc.) can be used as usual. The tool will process each URL in the file, one after another.
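
If you ever want to drive the same file-based workflow from your own script, it amounts to little more than looping over the lines of the file. This is a sketch, not the tool's code; `crawl_and_save` is the hypothetical helper from the sketch above.

```python
# Sketch: archive every URL listed in a text file, one URL per line.
def archive_from_file(path):
    with open(path, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        crawl_and_save(url)  # hypothetical helper; see the earlier sketch
```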
Save a website, waiting 15 seconds between each page (recommended):

```bash
python3 wayback_archiver.py https://example.com --delay 15
```

Save only the first 100 pages found:

```bash
python3 wayback_archiver.py https://example.com --max-pages 100
```

Save pages but skip image files (recommended):

```bash
python3 wayback_archiver.py https://example.com  # Images are skipped by default
```

When you run the tool without any extra options, these are the default settings:
| Setting | Default Value | Description | Change with |
|---|---|---|---|
| Delay between requests | 15 seconds | Time to wait between saving pages | --delay 20 |
| Maximum depth | 10 levels | How many clicks deep to follow links | --max-depth 5 |
| Batch size | 150 pages | Pages to process before taking a longer break | --batch-size 100 |
| Batch pause | 180 seconds | Length of break between batches | --batch-pause 300 |
| Maximum retries | 3 times | Times to retry if a page fails | --max-retries 5 |
| Retry backoff | 1.5x | Multiplier for delay between retries | --backoff-factor 2 |
| HTTPS only | Yes | Only save HTTPS pages (safer) | --include-http |
| Exclude images | Yes | Skip saving image files | --include-images |
| Respect robots.txt | Yes | Follow website crawling rules | --ignore-robots-txt |
| URL patterns excluded | Common patterns* | Skip certain types of URLs | --exclude |
*Default excluded patterns:

- `/tag/`, `/category/` - Tag and category pages
- `/author/`, `/page/` - Author and pagination pages
- `/comment-page-`, `/wp-json/` - WordPress system pages
- `/feed/`, `/wp-content/cache/` - Feed and cache files
- `/wp-admin/`, `/search/` - Admin and search pages
- `/login/`, `/register/` - User account pages
- `/signup/`, `/logout/` - User account pages
- `/privacy-policy/` - Standard policy pages
- `/404/`, `/error/` - Error pages
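
To make the retry defaults concrete, here is a small worked example of how a 1.5x backoff factor could grow the wait between attempts. It assumes the factor multiplies the base delay on each successive retry, which is the usual convention but is not confirmed from this tool's source.

```python
# Illustration only: how a 1.5x backoff factor might grow retry waits,
# assuming each retry multiplies the base delay by the factor.
base_delay = 15       # --delay
backoff_factor = 1.5  # --backoff-factor
max_retries = 3       # --max-retries

for attempt in range(1, max_retries + 1):
    wait = base_delay * backoff_factor ** (attempt - 1)
    print(f"retry {attempt}: wait {wait:.1f}s")
# retry 1: wait 15.0s, retry 2: wait 22.5s, retry 3: wait 33.8s
```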
It's good practice to include your email when using the Wayback Machine:

```bash
python3 wayback_archiver.py https://example.com --email [email protected]
```

The tool will follow links to find pages. You can control how many "clicks" deep it goes:

```bash
python3 wayback_archiver.py https://example.com --max-depth 5
```

For large sites, the tool can take breaks between groups of pages:

```bash
python3 wayback_archiver.py https://example.com --batch-size 50 --batch-pause 180
```

This will process 50 pages, then pause for 3 minutes before continuing.
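
For a rough picture of what batching means here, the logic is essentially "process a chunk, then sleep". The following is an illustrative sketch, not the tool's implementation; `save` is a hypothetical stand-in for the per-page archiving step.

```python
# Sketch of the batch/pause idea: process pages in chunks,
# pausing between chunks to reduce load on the site and the archive.
import time

def archive_in_batches(urls, batch_size=50, batch_pause=180, save=print):
    for i in range(0, len(urls), batch_size):
        for url in urls[i:i + batch_size]:
            save(url)  # stand-in for the real "save this page" step
        if i + batch_size < len(urls):
            time.sleep(batch_pause)  # e.g. 180s = a 3-minute break
```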
If some pages fail to save, the tool creates a file in the `wayback_results` folder. You can retry these pages:

```bash
python3 wayback_archiver.py --retry-file wayback_results/failed_urls_example.com_20240220_123456.json
```

The tool creates several files in a folder called `wayback_results`:

- `successful_urls_[domain]_[timestamp].json` - List of successfully saved pages
- `failed_urls_[domain]_[timestamp].json` - List of pages that failed to save
- `wayback_archiver.log` - Detailed log of what happened during the process
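
If you want to peek inside one of the results files yourself, standard JSON tooling is enough. The snippet below only assumes the file is valid JSON (which the `.json` extension suggests); the exact structure inside isn't documented here, so inspect it before relying on specific fields.

```python
# Inspect a results file from wayback_results/ with plain JSON tooling.
# The internal structure is not documented here, so just load and look.
import json

with open("wayback_results/failed_urls_example.com_20240220_123456.json") as f:
    data = json.load(f)

print(type(data).__name__)  # e.g. list or dict
print(data)                 # see what is actually inside before relying on fields
```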
- "Too Many Requests" Error
  - Increase the delay between requests: `--delay 30`
  - Use smaller batch sizes: `--batch-size 50`
  - Add longer pauses between batches: `--batch-pause 300`
- "Connection Error" Messages
  - The site might be blocking rapid requests; try increasing delays
  - Check if the site is accessible in your browser
  - Check your internet connection
- Takes Too Long
  - Limit the number of pages: `--max-pages 500`
  - Reduce how deep it goes: `--max-depth 5`
  - Skip image files (this is actually the default)
  - Focus on specific sections by starting from a subpage
- Be considerate: Use reasonable delays between requests (15 seconds or more)
- Some websites don't want to be archived - respect their robots.txt rules
- The tool skips certain paths by default (like login pages and search results)
- For best results, start with a small section of a site before trying to archive everything
- The tool works best with static websites and blogs
- Large, dynamic sites with lots of JavaScript might not archive properly
- Use `python3 wayback_archiver.py --help` to see all options
- Create an issue on GitHub if you find a bug or need help
- Check the log file (`wayback_archiver.log`) for detailed information about any problems
This project is free to use under the MIT License.