Sliver

An 'archival sliver' of the web. A bit like a 'data lifeboat' for making or replicating web archives of small sets of pages. It uses shot-scraper to drive a web browser that takes screenshots of your URLs, but routes everything through a pywb web proxy so it can produce a high-quality archival version of what you download.

As well as archiving live web pages, this tool can leverage pywb's support for neatly extracting URLs from other web archives and recording items with all the appropriate provenance information (see below for an example). This means it can work like hartator/wayback-machine-downloader but retain the additional information that the WARC and WACZ web archiving formats support (see Why WARC/WACZ? below).

Other Tools

You can open WARC and WACZ files using ReplayWeb.page.

For high-quality web archiving, you could also try:

There's also ArchiveBox, which provides a GUI as well as command-line tools, but (as far as I can tell) does not support direct browser-behind-a-web-proxy archiving. You can try it out by running it on PikaPods.

You can find out more about web archives and web archiving tools and services via iipc/awesome-web-archiving (An Awesome List for getting started with web archiving) and the Web Archiving Community page of the ArchiveBox wiki.

Why WARC/WACZ?

Web archives use the WARC format rather than just mirroring the files from a website on disk. This is primarily because the WARC format also stores all the HTTP response headers that you sometimes need for playback to work reliably with more complex sites. For example, the Content-Type response header might be the only way the format of a file can be reliably determined. WARCs also store lots of contextual and provenance information.

There is also the newer WACZ format, which wraps WARCs in a ZIP file, with additional metadata and indexes that make playback easier.

Usage

Please note that your use of this tool should take into account your legal context and the terms of use of the web sites and web archives you are working with.

Setup

Set up a Python environment with sliver installed. This setup is based on using uv and assumes you already have that installed.

uv tool install -p python3.11 git+https://github.com/anjackson/sliver.git

Note that later versions of Python are not yet supported by the upstream dependency pywb.

You should now be able to run e.g.

uvx sliver --help

Now create a directory to work in, to keep the archival files together as you work.

mkdir my-collection
cd my-collection

Create a list of URLs

Create a list of URLs you want to archive.

For crawls from web archives, you can use the sliver lookup command for this, and then edit the file down so it's just URLs.
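Whichever approach you take, the end result should just be a plain-text urls.txt with one URL per line, for example (these URLs are purely illustrative):

https://example.com/
https://example.com/about/
https://example.com/news/index.html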

Fetch the URLs

Run sliver fetch to start the screenshotting process via the archiving proxy (which runs on port 8080, so that port needs to be free).

uvx sliver fetch urls.txt

Alternatively, you can fetch records from a web archive, specifying a target timestamp for the records that should be retrieved:

uvx sliver fetch --source ia --timestamp 20050101000000 urls.txt

During this process, the archives and screenshots are collected in subfolders of a local directory called ./collections/mementos/
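For reference, wb-manager lays out a pywb collection roughly like this (a sketch only; the screenshots end up in a subfolder of the collection as well, though the exact name may differ):

collections/
  mementos/
    archive/     # the WARC files, e.g. SLIVER-20250208210345321032-57CQFSUN.warc.gz
    indexes/     # CDXJ indexes generated by pywb
    static/
    templates/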

If you re-run the command, any new resources will be fetched and added to a new WARC file. Check the screenshots you have produced to see if they are good enough. Re-run sliver fetch if needed.

Use the proxy to add to your archive

TBD If you want to drive the crawl yourself, use sliver proxy to run the web proxy and configure your browser to use it.
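Once the proxy is running on localhost:8080, one way to sanity-check that requests are going through it is to send one with curl; the --insecure flag is needed because the proxy re-signs HTTPS traffic with its own auto-generated certificate, and example.com is just an illustrative URL:

curl --proxy http://localhost:8080 --insecure -s -o /dev/null -w "%{http_code}\n" https://example.com/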

Interact with the browser to add to your archive

Connecting your own browser to the proxy can be difficult, as the HTTPS connections most sites now require have to be mediated using auto-generated proxy SSL certificates, and configuring a browser to cope with this can be fiddly.

TBD Running sliver interact launches a dedicated, pre-configured browser using shot-scraper's interactive mode. Anything you do in that browser window will be recorded by the archiving proxy, and a screenshot will be taken at the end. This helps extend and patch tricky crawls, but you shouldn't enter any personal information or passwords unless you don't mind them being archived along with the web pages!

Package the results

This sub-command has not been implemented yet. In the meantime, you can run this:

uvx --with setuptools wacz create -o archive.wacz -t -d ./collections/mementos/archive/*.warc.gz

There seems to be an undeclared dependency on setuptools, hence the --with setuptools. You could also install wacz separately.

TBD Run sliver package to package the WARCs and screenshots etc. into a WACZ web archive zip package.

uvx sliver package

Using the WACZ

Check the final WACZ package works using ReplayWeb.page.

If you want, upload the package to a static site as per Embedding ReplayWeb.page.

TBD Describe an example, e.g. using Storj+RClone or Glitch or ...

Extracted WARC Records

An example WARC record header block, as produced when a URL is pulled from a source archive via the Memento API:

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:e082c042-e56d-11ef-bbba-d982d0cb2308>
WARC-Target-URI: https://example.com/
WARC-Date: 2025-02-07T00:00:31Z
WARC-Source-URI: https://web.archive.org/web/20250207000031id_/https://example.com/
WARC-Creation-Date: 2025-02-07T16:09:15Z
WARC-IP-Address: 207.241.237.3
Content-Type: application/http; msgtype=response
Content-Length: 2239
WARC-Payload-Digest: sha1:YXWQ7LLPPIZ7CVO6DVQ4U3Y2IO5M42AG
WARC-Block-Digest: sha1:HBKFWTAU4TPQWD5CNKVIJVW2CANQUTRN

Initial Prototype Notes

The following notes describe how the initial attempt at patching things together worked. These steps are being moved into the code.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install hatch

Generating a list of URLs

curl -o out.cdx -g "https://web.archive.org/cdx/search/cdx?url=example.com&collapse=urlkey&matchType=prefix&limit=10000&filter=statuscode:[23]..&showResumeKey=true"

Then extract the URLs from the CDX output.
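A minimal sketch, assuming the default space-separated CDX output where the original URL is the third field (the grep drops the resume key that showResumeKey=true appends):

awk '{print $3}' out.cdx | grep '^http' | sort -u > urls.txt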

Set pywb running

Run the proxy that will pull and record:

# Create a collection to record things into:
hatch run wb-manager init mementos
# Set the proxy running (it's on localhost:8080):
hatch run wayback --threads 12 > wayback.log 2>&1 &

The ./config.yaml was pretty fiddly to get right for this!
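I won't reproduce the whole file here, but based on the pywb documentation for recording proxy mode, the core of it looks something like this (a sketch rather than the exact file):

proxy:
  coll: mementos     # proxied traffic is served from / recorded into the 'mementos' collection
  recording: true    # write everything that passes through the proxy to WARCs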

Run the screenshot capture

Set up shot-scraper

hatch run shot-scraper install -b chromium

Take the URLs from the CDX file and populate a shots.yml file.
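shot-scraper multi takes a YAML file listing the pages to shoot; a minimal sketch (URLs and output names are illustrative) looks like:

- url: https://example.com/
  output: example-com.png
- url: https://example.com/about/
  output: example-com-about.png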

Then run shot-scraper with the right settings, so everything goes via the proxy:

hatch run shot-scraper multi -b chromium --browser-arg '--ignore-certificate-errors' --browser-arg '--proxy-server=http://localhost:8080' --timeout 65000 shots.yml

I ran this against about 80 Twitter URLs. There were a handful of errors, presumably due to rate limiting by the source archive, but it's difficult to tell. The results were generally reasonable, but long waits (30s+) were needed between pages to try to ensure minimal blocking.

  • Using the multi mode and config is cumbersome as you need to repeat a lot of config.
  • It will auto-generate names for the screenshots, but then seems to repeat screenshots and give them new names with .1 etc., which kinda defeats the purpose.
  • Having to set all the browser options etc. at the command-line is obviously rather brittle.
  • No video, presumably because it's Chromium?
  • This only gets one instance per URL, the earliest one.

So, it would make sense to wrap this all up as a new command that would launch the proxy, run the shots, and gather the results, e.g.

$ sliver collection-urls.txt

Perhaps using https://pywb.readthedocs.io/en/latest/manual/warcserver.html#custom-warcserver-deployments rather than a config file so it's all in code. Or maybe export PYWB_CONFIG_FILE=...../config.yaml... yes that works and is easier to manage.

With this generating a collection-urls.wacz by default.

WACZ Creation & Access

Made a WACZ

wacz create -o anjackson-net-2025-02-08.wacz -t -d collections/mementos/archive/MLB-20250208201638089003-EMEOIDCD.warc.gz

Copied it up to an S3 store (https://european-alternatives.eu/category/object-storage-providers, https://www.s3compare.io/) that I've made accessible over the web (https://storj.dev/dcs/code/static-site-hosting):

rclone copy -v slivers dr:slivers # or sync???
uplink share --dns slivers.anjackson.dev sj://slivers --not-after=none

Resulting in https://slivers.anjackson.dev/anjackson-net-2025-02-08/

Example Command Sequence

hatch run wb-manager init mementos
export PYWB_CONFIG_FILE=../../config.yaml 
hatch run wayback > wayback.log 2>&1 &
hatch run shot-scraper multi -b chromium --browser-arg '--ignore-certificate-errors' --browser-arg '--proxy-server=http://localhost:8080' shots.yaml 
hatch run wacz create -o example-com.wacz -t -d collections/mementos/archive/SLIVER-20250208210345321032-57CQFSUN.warc.gz

Then test in https://replayweb.page/

Then clean up and upload
