Our crawls regularly archive pages that embed some pretty large files that don’t change much. The most extreme case is a 1.3 GB video from NASA/JPL on a NOAA climate data page, but there are several other videos, podcast recordings, and giant images in the same boat.
We should replace the HTTP response records with revisit records in the WARC files this project produces. Ideally, we’d also use ETag and/or If-Modified-Since headers to avoid getting all that data over the wire in the first place, but that’s probably more complicated (I think the easiest path that works with Browsertrix would be to have a proxy that returns data from a previous crawl instead of from the source).
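For the conditional-request piece, the mechanics are simple enough to sketch: build `If-None-Match` / `If-Modified-Since` headers from the metadata we saved on the previous crawl, and a server that supports them answers `304 Not Modified` with no body. The helper name and the shape of the cached-metadata dict below are made up for illustration:

```python
def conditional_headers(cached):
    """Build conditional-request headers from a previous crawl's
    cached metadata (a hypothetical dict with optional 'etag' and
    'last_modified' entries)."""
    headers = {}
    if cached.get("etag"):
        # If-None-Match carries the ETag exactly as the server sent it
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers
```

On a 304 response we’d reuse the stored payload (or write a revisit record directly) rather than re-downloading.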
Anyway, we should probably:
- After crawling, run through all the records in the WARC to transform them.
- If a record is larger than some threshold (5MB? 1MB?)…
- If we have previously stored a response with the same digest, etag, etc., replace the record with a revisit record pointing to the WARC record where we previously saved it.
- If not, save the digest, etag, last-modified date, and WARC pointer in some persistent storage somewhere we can use for next time.
- Replace the old WARC with the new, transformed one.
- Save and import as normal.
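The per-record decision above could be sketched roughly like this. This is a pure-Python sketch, not a real WARC transform: the record dict, index shape, and threshold are all assumptions, and an actual implementation would iterate real WARC records with a library like warcio and emit proper revisit records.

```python
import hashlib

SIZE_THRESHOLD = 5 * 1024 * 1024  # 5 MB? 1 MB? Tune as needed.

def transform_record(record, index):
    """Replace a large response with a revisit stub if we've seen the
    same payload before; otherwise remember it for next time.

    `record` is a hypothetical dict with 'url', 'payload', 'etag',
    'last_modified', and 'warc_pointer' keys; `index` maps payload
    digests to info about where the payload was first stored.
    """
    if len(record["payload"]) < SIZE_THRESHOLD:
        return record  # small records pass through untouched

    digest = "sha1:" + hashlib.sha1(record["payload"]).hexdigest()
    previous = index.get(digest)
    if previous is not None:
        # Seen before: emit a revisit record pointing at the WARC
        # record where we previously saved the payload.
        return {
            "type": "revisit",
            "url": record["url"],
            "refers_to": previous["warc_pointer"],
            "digest": digest,
        }

    # First sighting: save the digest, ETag, last-modified date, and
    # WARC pointer so future crawls can deduplicate against it.
    index[digest] = {
        "etag": record.get("etag"),
        "last_modified": record.get("last_modified"),
        "warc_pointer": record["warc_pointer"],
    }
    return record
```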
Besides some custom code for processing the WARC, we need to set up some persistent external storage. Probably one of:
- A special file in S3, like we do for the “unplaybackable cache” in Wayback imports. Infrastructurally simple, but the code is a little more complicated.
- Redis. Easier to code against and designed for this kind of thing, but more infrastructure to manage. Also not great to be connecting to across datacenters, since our crawls currently run in a different place (GH Actions) than everything else (AWS us-west-2).
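For the S3-file option, the index could just be a JSON blob fetched at the start of a crawl and written back at the end. A minimal sketch, with a local file standing in for S3 (the real version would be a `get_object`/`put_object` pair on a well-known key; the filename and index shape here are assumptions):

```python
import json
from pathlib import Path

def load_index(path):
    """Load the digest -> {etag, last_modified, warc_pointer} index,
    returning an empty index if it doesn't exist yet."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {}

def save_index(path, index):
    """Write the updated index back for the next crawl to use."""
    Path(path).write_text(json.dumps(index, indent=2))
```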