
pwallace/digital-archive-helpers


digital-archive-helpers

A collection of small Python utilities for tasks related to building and maintaining Internet Archive collections, batch processing files and metadata, verifying data integrity, and automating other repetitive work in a digital archives context.

Each tool lives in its own subdirectory with a detailed README.


Internet Archive tools

Finds Internet Archive identifiers present in a local metadata corpus (one or more CSVs exported via IA's Advanced Search) that are absent from a source list. Useful for identifying items that exist in a local metadata export but have not yet been uploaded to, or confirmed in, an IA collection.
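
A minimal sketch of the comparison logic, assuming the Advanced Search exports contain an `identifier` column and the source list is one identifier per line (both assumptions; the actual tool's column names and options may differ):

```python
import csv

def missing_identifiers(corpus_csvs, source_list_path, id_column="identifier"):
    """Return identifiers present in the metadata CSVs but absent from the source list."""
    corpus_ids = set()
    for path in corpus_csvs:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                corpus_ids.add(row[id_column].strip())
    with open(source_list_path, encoding="utf-8") as f:
        source_ids = {line.strip() for line in f if line.strip()}
    # Set difference: in the local metadata export but not in the source list.
    return sorted(corpus_ids - source_ids)
```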

Searches Internet Archive using a Lucene query and exports the metadata of all matching items to a CSV. Configurable fields; defaults to identifier, title, creator, date, and description. Useful for retrieving custom metadata fields not returned via normal Advanced Search.
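
The export side of that workflow can be sketched as below: given an iterable of metadata dicts (as a search client such as the `internetarchive` library's `search_items` would yield), write the configured fields to CSV. The function name and default field list mirror the description; everything else is illustrative:

```python
import csv

DEFAULT_FIELDS = ["identifier", "title", "creator", "date", "description"]

def write_search_results(rows, out_csv, fields=DEFAULT_FIELDS):
    """Write an iterable of metadata dicts to CSV, one column per configured field.

    Missing fields are written as empty strings; extra fields are ignored.
    """
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            writer.writerow({k: row.get(k, "") for k in fields})
```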

Reads a plain-text list of IA identifiers, fetches the file listing for each item via the IA API, and writes results to a CSV where each row is an item and each column is a file extension. Standard IA-generated files are automatically excluded.
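
The per-item aggregation might look like the sketch below, given a file listing such as `internetarchive.get_item(identifier).files` would return. The exclusion suffixes are illustrative, not the tool's exact list:

```python
from collections import Counter

# Illustrative suffixes of standard IA-generated files; the real exclusion list may differ.
IA_GENERATED_SUFFIXES = ("_meta.xml", "_files.xml", "_meta.sqlite", "__ia_thumb.jpg")

def extension_row(filenames):
    """Map one item's file listing to {extension: count}, skipping IA-generated files."""
    counts = Counter()
    for name in filenames:
        if name.endswith(IA_GENERATED_SUFFIXES):
            continue
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
        counts[ext] += 1
    return dict(counts)
```

Each resulting dict becomes one CSV row, with the union of extensions across all items as the column set.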

Lists files, sets file-level metadata, downloads files, or exports files.xml for individual files within specific IA items. Supports filtering by filename pattern (regex) or file format. Operates on a single item or a batch list. Only files with source="original" are processed.

Generates a IIIF Presentation API 2.x Collection manifest JSON from a CSV of IA identifiers and item labels. Intended for uploading to GitHub and ingesting into a transcription platform such as FromThePage.
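
The skeleton of such a Collection document can be sketched as follows. The manifest URL pattern is an assumption (IA's IIIF manifest endpoint may differ), and `items` stands in for the parsed CSV rows:

```python
def build_collection(collection_uri, label, items,
                     manifest_url="https://iiif.archive.org/iiif/{id}/manifest.json"):
    """Build a IIIF Presentation 2.x Collection dict from (identifier, label) pairs.

    `manifest_url` is a hypothetical endpoint pattern; substitute the real one.
    """
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": collection_uri,
        "@type": "sc:Collection",
        "label": label,
        "manifests": [
            {"@id": manifest_url.format(id=ident), "@type": "sc:Manifest", "label": lbl}
            for ident, lbl in items
        ],
    }
```

Serialize the result with `json.dumps(..., indent=2)` before committing it to GitHub.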


Archive deduplication & integrity

These tools form a workflow for comparing files on a local filesystem (DarkArchive) against a checksum export from Archive-It Vault. A higher-level overview and recommended workflow are in archive-deduping/README.md.

Recursively walks a directory tree, computes checksums for all files, and writes results to a CSV (checksum, path, filename). Read-only; does not modify the source filesystem.
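
The core walk-and-hash loop can be sketched as below; the header row and MD5 default are assumptions about the tool's output, and hashing is chunked so large files are not read into memory at once:

```python
import csv
import hashlib
import os

def write_checksums(root, out_csv, algo="md5", chunk=1 << 20):
    """Walk `root`, hash every file, and write (checksum, path, filename) rows."""
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["checksum", "path", "filename"])
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                h = hashlib.new(algo)
                with open(os.path.join(dirpath, name), "rb") as f:
                    # Read in 1 MiB chunks until EOF.
                    for block in iter(lambda: f.read(chunk), b""):
                        h.update(block)
                writer.writerow([h.hexdigest(), dirpath, name])
```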

Converts a plain-text checksum export (as downloaded from Archive-It Vault) into the same three-column CSV format used by csv-checksum-lister and csv-checksum-comparator.

Compares two checksum CSVs (e.g., local vs. Vault) and produces a differences CSV listing files missing from one side or the other. Useful for verifying filesystem migrations and validating backups.
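
The comparison reduces to a set difference over the checksum column, assuming both CSVs use the three-column format above (keying on checksums rather than paths is an assumption about the tool's behavior):

```python
import csv

def diff_checksums(csv_a, csv_b):
    """Return (only_in_a, only_in_b) as sets of checksum values."""
    def load(path):
        with open(path, newline="", encoding="utf-8") as f:
            return {row["checksum"] for row in csv.DictReader(f)}
    a, b = load(csv_a), load(csv_b)
    return a - b, b - a
```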

Compares two plain-text path lists (local storage and vault storage) and produces a CSV showing which files appear in both and which appear in only one. Top-level directory names are ignored; matching is based on remaining path segments.
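
A sketch of that matching rule, assuming `/`- or `\`-separated paths whose first segment is the storage root to discard:

```python
def relative_key(path):
    """Drop the top-level directory so 'local/a/b.txt' and 'vault/a/b.txt' match."""
    parts = path.strip().replace("\\", "/").split("/")
    return "/".join(parts[1:]) if len(parts) > 1 else parts[0]

def compare_path_lists(local_paths, vault_paths):
    """Return (in_both, local_only, vault_only) as sets of root-stripped paths."""
    local = {relative_key(p) for p in local_paths}
    vault = {relative_key(p) for p in vault_paths}
    return local & vault, local - vault, vault - local
```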


File discovery & matching

Locates files or directories from a path list within a large directory tree. Extracts the lowest-level element from each path, searches recursively, and writes matches to files-output.csv (filename matches) and dir-output.csv (directory matches).

Matches a target list of files or directories against a recursive directory listing using a cascade of strategies: name-only, case-sensitive, case-insensitive, and fuzzy (via fuzzywuzzy, threshold ≥ 60). Outputs results to CSV. Useful when file paths don't match exactly.
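
The cascade can be sketched as below. The actual tool uses fuzzywuzzy with a 0–100 score threshold; `difflib.SequenceMatcher` stands in here (0–1 ratio, so the threshold becomes 0.60) to keep the sketch stdlib-only:

```python
import os
from difflib import SequenceMatcher

def match_target(target, candidates, threshold=0.60):
    """Try exact, case-insensitive, then fuzzy basename matching.

    Returns (strategy, matched_candidate) or None if nothing clears the threshold.
    """
    name = os.path.basename(target)
    names = {os.path.basename(c): c for c in candidates}
    if name in names:
        return ("exact", names[name])
    lowered = {k.lower(): v for k, v in names.items()}
    if name.lower() in lowered:
        return ("case-insensitive", lowered[name.lower()])
    ratio = lambda k: SequenceMatcher(None, name.lower(), k.lower()).ratio()
    best = max(names, key=ratio, default=None)
    if best is not None and ratio(best) >= threshold:
        return ("fuzzy", names[best])
    return None
```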


Metadata extraction

Recursively scans a directory for TIFF files and writes their embedded metadata to a CSV. Case-insensitive extension matching (.tif, .tiff). Does not modify source files.

Recursively scans a directory for .txt, .docx, and .doc files, performs a case-insensitive substring search, and reports matches with line numbers or paragraph indices. Optionally writes results to a file and copies matched files to a folder.
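
For the plain-text case, the search loop might look like this (the .docx/.doc branches, which report paragraph indices instead of line numbers, are omitted since they need python-docx):

```python
from pathlib import Path

def search_txt_files(root, term):
    """Case-insensitive substring search over .txt files.

    Yields (path, line_number, line_text) for every matching line.
    """
    needle = term.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        with open(path, encoding="utf-8", errors="replace") as f:
            for number, line in enumerate(f, 1):
                if needle in line.lower():
                    yield (str(path), number, line.rstrip("\n"))
```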


Data cleaning & transformation

Two utilities for preparing file path lists:

  • list-cleaner.py — strips leading file sizes from a recursive directory listing (e.g., rclone ls output) and sorts the result; optionally filters to files or directories only.
  • convert-dos-to-linux-paths.py — converts DOS-style backslash paths to Linux-style forward-slash paths.
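
The two core transformations can be sketched as below (the sorting and file/directory filtering that list-cleaner.py also performs are omitted, and the size-column regex assumes `rclone ls`-style output of a number, whitespace, then the path):

```python
import re

def strip_size(line):
    """Drop a leading size column from an `rclone ls`-style listing line."""
    return re.sub(r"^\s*\d+\s+", "", line.rstrip("\n"))

def dos_to_linux(path):
    """Convert DOS-style backslash separators to Linux-style forward slashes."""
    return path.replace("\\", "/")
```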

Removes all empty lines from a text file and writes the result to a new file. The original file is not modified.
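
A minimal sketch of that filter; treating whitespace-only lines as empty is an assumption about the tool's behavior:

```python
def remove_empty_lines(in_path, out_path):
    """Copy in_path to out_path, dropping empty (or whitespace-only) lines."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(line)
```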


Audio & video processing

Discovers audio files in a directory (sorted alphabetically) and concatenates them into a single output file using pydub. Requires ffmpeg.

Two scripts for video processing. See video-tools/README.md for details.

  • video-concatenator — transcodes all video files in a directory to a standard format and concatenates them into a single output file.
  • video-keyframe-splitter — transcodes a video file to insert periodic keyframes, then splits it into segments at those boundaries.

Transcribes an audio file to text using OpenAI's Whisper model. Automatically splits files larger than 25 MB into chunks and joins the results. Supports all Whisper model sizes (tiny through large).


Requirements

Most tools require only Python 3.x and the standard library. Exceptions:

| Tool | Additional requirements |
| --- | --- |
| doc-scanner | python-docx |
| fuzzy-file-finder | fuzzywuzzy, python-Levenshtein |
| audio-track-concatenator | pydub, ffmpeg |
| video-concatenator, video-keyframe-splitter | ffmpeg |
| whisper-transcriber | openai-whisper, pydub, ffmpeg |
| ia-* tools | internetarchive |
