An 'archival sliver' of the web. A bit like a 'data lifeboat' for making or replicating web archives of small sets of pages. Uses shot-scraper to drive a web browser that generates screenshots of your URLs, but runs it through a pywb web proxy so it can produce a high-quality archival version of what you download.
As well as archiving live web pages, this tool can leverage pywb's support for neatly extracting URLs from other web archives and recording items with all the appropriate provenance information (see below for an example). This means it can work like hartator/wayback-machine-downloader but retain the additional information that the WARC and WACZ web archiving formats support (see Why WARC/WACZ? below).
You can open WARC and WACZ files using ReplayWeb.page.
For high-quality web archiving, you could also try:
- ArchiveWeb.page (for manual crawling via a browser extension)
- harvard-lil/scoop: high-fidelity, browser-based, single-page web archiving library and CLI (built with Node.js; more sophisticated than this tool, but does not support replicating records from existing web archives)
- Browsertrix (for larger-scale, high-quality crawling, with an extensible set of per-site behaviour scripts to improve archiving, running on Kubernetes)
There's also ArchiveBox, which provides a GUI as well as command-line tools, but (as far as I can tell?) does not support direct browser-behind-a-web-proxy archiving. You can try it out by running it on PikaPods.
You can find out more about web archives and web archiving tools and services via iipc/awesome-web-archiving (an awesome list for getting started with web archiving) and the Web Archiving Community page of the ArchiveBox wiki.
Web archives use the WARC format rather than just mirroring the files from a website on disk. This is primarily because the WARC format also stores all the HTTP response headers, which you sometimes need for playback to work reliably with more complex sites. For example, the Content-Type response header might be the only way the format of a file can be reliably determined. WARCs also store lots of contextual and provenance information.
There is also the newer WACZ format, which wraps WARCs in a ZIP file, with additional metadata and indexes that make playback easier.
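To give a rough idea of what that wrapping adds, here is an illustrative look inside a WACZ (the file name archive.wacz is hypothetical, and the exact layout and index format vary between tools):

unzip -l archive.wacz
# archive/data.warc.gz         the captured WARC records
# indexes/index.cdx.gz         CDX index used for fast URL lookup during playback
# pages/pages.jsonl            list of top-level pages, used to drive the page list in ReplayWeb.page
# datapackage.json             metadata describing the package and its contents
# datapackage-digest.json      digest of datapackage.json, for integrity checking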
Please note that your use of this tool should take into account your legal context and the terms of use of the web sites and web archives you are working with.
Set up a Python environment with sliver installed. This setup is based on using uv and assumes you already have that installed.
uv tool install -p python3.11 https://github.com/anjackson/sliver.git
Note that later versions of Python are not yet supported by the upstream dependency pywb.
You should now be able to run e.g.
uvx sliver --help
Now create a directory to work in, to keep the archival files together as you work.
mkdir my-collection
cd my-collection
Create a list of URLs you want to archive.
For crawls from web archives, you can use the sliver lookup command for this, and then edit the file down so it's just URLs.
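For example, a hand-written urls.txt is just one URL per line (the URLs here are placeholders):

cat > urls.txt <<'EOF'
https://example.com/
https://example.com/about/
EOF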
Run sliver fetch to run the screenshotting process via the archiving proxy (running on port 8080, so that port needs to be free).
uvx sliver fetch urls.txt
Alternatively, you can fetch records from a web archive, specifying a target timestamp for the records that should be retrieved:
uvx sliver fetch --source ia --timestamp 20050101000000 urls.txt
During this process, the archives and screenshots are collected in subfolders of a local directory called ./collections/mementos/. If you re-run the command, any new resources will be fetched and added to a new WARC file. Check the screenshots you have produced to see if they are good enough, and re-run sliver fetch if needed.
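As a rough guide to where things end up (a sketch based on the standard pywb collection layout, so exact names may differ):

find collections/mementos
# collections/mementos/archive/   WARC files written by the recording proxy
# collections/mementos/indexes/   CDX(J) indexes that pywb maintains for the collection
# plus a subfolder holding the screenshots; check your own output for its exact name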
TBD If you want to drive the crawl yourself, use sliver proxy to run the web proxy and configure your browser to use it.
Connecting a browser to the proxy can be tricky, as the HTTPS connections that most sites now require have to be mediated using auto-generated proxy SSL certificates, and configuring your own web browser to cope with this can be fiddly. One pragmatic workaround is sketched below.
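For example, the same Chromium flags that the fetch step passes via shot-scraper can be used to launch a throwaway browser by hand (the binary name and profile path are illustrative, and --ignore-certificate-errors means you should only browse things you intend to archive):

chromium --user-data-dir=/tmp/sliver-profile \
  --proxy-server=http://localhost:8080 \
  --ignore-certificate-errors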
TBD Running sliver interact launches a dedicated, pre-configured browser using shot-scraper's interactive mode. Anything you do in that browser window will be recorded by the archiving proxy, and a screenshot will be taken at the end. This helps extend and patch tricky crawls, but you shouldn't enter any personal information or passwords unless you don't mind them being archived along with the web pages!
TBD Run sliver package to package the WARCs, screenshots, etc. into a WACZ web archive zip package.
uvx sliver package
This sub-command has not been implemented yet. In the meantime, you can run this:
uvx --with setuptools wacz create -o archive.wacz -t -d ./collections/mementos/archive/*.warc.gz
There seems to be an undeclared dependency on setuptools, hence the --with setuptools. You could also install wacz separately.
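Once you have a WACZ, the same wacz tool can check that the package is well-formed; I believe the flag is -f, but check wacz validate --help to confirm:

uvx --with setuptools wacz validate -f archive.wacz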
Check that the final WACZ package works using ReplayWeb.page.
If you want, upload the package to a static site, as per Embedding ReplayWeb.page.
TBD Describe an example, e.g. using Storj+RClone or Glitch or ...
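In the meantime, here is a rough sketch of the embedding side. Treat it as illustrative and follow the ReplayWeb.page embedding docs for the full details (a small sw.js service-worker file is also needed alongside the page): host the WACZ next to an HTML page that loads the ReplayWeb.page UI and points a replay-web-page element at the package.

cat > index.html <<'EOF'
<!doctype html>
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
<replay-web-page source="archive.wacz" url="https://example.com/"></replay-web-page>
EOF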
An example WARC record, as produced when a resource is pulled from a source archive via the Memento API:
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:e082c042-e56d-11ef-bbba-d982d0cb2308>
WARC-Target-URI: https://example.com/
WARC-Date: 2025-02-07T00:00:31Z
WARC-Source-URI: https://web.archive.org/web/20250207000031id_/https://example.com/
WARC-Creation-Date: 2025-02-07T16:09:15Z
WARC-IP-Address: 207.241.237.3
Content-Type: application/http; msgtype=response
Content-Length: 2239
WARC-Payload-Digest: sha1:YXWQ7LLPPIZ7CVO6DVQ4U3Y2IO5M42AG
WARC-Block-Digest: sha1:HBKFWTAU4TPQWD5CNKVIJVW2CANQUTRN
The following notes describe how the initial attempt at patching things together worked. These steps are being moved into the code.
python3 -m venv .venv
source .venv/bin/activate
pip install hatch
curl -o out.cdx -g "https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=prefix&limit=10000&filter=statuscode:[23]..&showResumeKey=true"
Then extract the URLs from the CDX file.
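Assuming the default CDX field order (urlkey, timestamp, original, ...), something like this pulls out the original URLs and drops the blank line and resume key at the end of the output:

awk 'NF >= 3 { print $3 }' out.cdx | grep '^http' | sort -u > urls.txt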
Run the proxy that will pull and record:
# Create a collection to record things into:
hatch run wb-manager init mementos
# Set the proxy running (it's on localhost:8080):
hatch run wayback --threads 12 > wayback.log 2>&1 &
The <./config.yaml> was pretty fiddly to get right for this!
Set up shot-scraper
hatch run shot-scraper install -b chromium
Take the URLs from the CDX file and use them to populate a shots.yml file.
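A minimal shots.yml for shot-scraper multi is just a list of url entries; giving each an explicit output name also sidesteps the auto-naming issue noted below (the URLs here are placeholders):

cat > shots.yml <<'EOF'
- url: https://example.com/
  output: example-com.png
- url: https://example.com/about/
  output: example-com-about.png
EOF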
Then run shot-scraper with the right settings, so everything goes via the proxy:
hatch run shot-scraper multi -b chromium --browser-arg '--ignore-certificate-errors' --browser-arg '--proxy-server=http://localhost:8080' --timeout 65000 shots.yml
Ran this against about 80 Twitter URLs. There were a handful of errors, presumably due to rate limiting by the source archive, but it's difficult to tell. The results were generally reasonable, but long waits (30s+) were needed between pages to try to minimise blocking.
- Using the multi mode and config is cumbersome, as you need to repeat a lot of config.
- It will auto-generate names for the screenshots, but then seems to repeat screenshots and give them new names with .1 etc., which kinda defeats the purpose.
- Having to set all the browser options etc. at the command line is obviously rather brittle.
- No video, presumably because Chromium?
- This only gets one instance per URL, the earliest one.
So, it would make sense to wrap this all up as a new command that would launch the proxy, run the shots, and gather the results. e.g.
$ sliver collection-urls.txt
Perhaps using https://pywb.readthedocs.io/en/latest/manual/warcserver.html#custom-warcserver-deployments rather than a config file, so it's all in code. Or maybe export PYWB_CONFIG_FILE=...../config.yaml ... yes, that works and is easier to manage.
With this generating a collection-urls.wacz by default.
Made a WACZ:
wacz create -o anjackson-net-2025-02-08.wacz -t -d collections/mementos/archive/MLB-20250208201638089003-EMEOIDCD.warc.gz
Copied it up to an S3 store (https://european-alternatives.eu/category/object-storage-providers, https://www.s3compare.io/) that I've made accessible over the web (https://storj.dev/dcs/code/static-site-hosting):
rclone copy -v slivers dr:slivers # or sync???
uplink share --dns slivers.anjackson.dev sj://slivers --not-after=none
Resulting in https://slivers.anjackson.dev/anjackson-net-2025-02-08/
hatch run wb-manager init mementos
export PYWB_CONFIG_FILE=../../config.yaml
hatch run wayback > wayback.log 2>&1 &
hatch run shot-scraper multi -b chromium --browser-arg '--ignore-certificate-errors' --browser-arg '--proxy-server=http://localhost:8080' shots.yaml
hatch run wacz create -o example-com.wacz -t -d collections/mementos/archive/SLIVER-20250208210345321032-57CQFSUN.warc.gz
Then test in https://replayweb.page/
Then clean up and upload