A collection of small Python utilities for tasks related to building and maintaining Internet Archive collections, batch processing files and metadata, verifying data integrity, and automating other repetitive work in a digital archives context.
Each tool lives in its own subdirectory with a detailed README.
Finds Internet Archive identifiers present in a local metadata corpus (one or more CSVs exported via IA's Advanced Search) that are absent from a source list. Useful for identifying items that exist in a local metadata export but have not yet been uploaded to, or confirmed in, an IA collection.
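The core comparison can be sketched as a set difference. This is a minimal illustration, assuming the metadata CSVs have an `identifier` column and the source list holds one identifier per line; the actual script's column names and interface may differ:

```python
import csv

def find_missing_identifiers(metadata_csv_paths, source_list_path):
    """Return identifiers present in the metadata CSVs but absent from the source list."""
    corpus = set()
    for path in metadata_csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                corpus.add(row["identifier"].strip())
    with open(source_list_path, encoding="utf-8") as f:
        source = {line.strip() for line in f if line.strip()}
    return sorted(corpus - source)
```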
Searches Internet Archive using a Lucene query and exports the metadata of all matching items to a CSV. Configurable fields; defaults to identifier, title, creator, date, and description. Useful for retrieving custom metadata fields not returned via normal Advanced Search.
Reads a plain-text list of IA identifiers, fetches the file listing for each item via the IA API, and writes results to a CSV where each row is an item and each column is a file extension. Standard IA-generated files are automatically excluded.
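The per-item extension tally can be sketched as below, given a plain list of filenames for one item. The excluded extensions shown are illustrative stand-ins; the actual tool's exclusion list for IA-generated files is more specific:

```python
from collections import Counter

def extension_counts(filenames, excluded=(".xml", ".sqlite")):
    """Count files per extension for one item, skipping extensions in
    `excluded` (a hypothetical stand-in for IA-generated sidecar files)."""
    counts = Counter()
    for name in filenames:
        dot = name.rfind(".")
        if dot == -1:
            continue  # skip files with no extension
        ext = name[dot:].lower()
        if ext not in excluded:
            counts[ext] += 1
    return counts
```

Each item's `Counter` would then become one CSV row, with the union of all observed extensions as the column set.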
Lists files, sets metadata, downloads files, or exports files.xml for individual files within specific IA items. Supports filtering by filename pattern (regex) or file format. Operates on a single item or a batch list. Only files with source="original" are processed.
Generates an IIIF Presentation API 2.x Collection manifest JSON from a CSV of IA identifiers and item labels. Intended for uploading to GitHub and ingesting into a transcription platform such as FromThePage.
These tools form a workflow for comparing files on a local filesystem (DarkArchive) against a checksum export from Archive-It Vault. A higher-level overview and recommended workflow are in archive-deduping/README.md.
Recursively walks a directory tree, computes checksums for all files, and writes results to a CSV (checksum, path, filename). Read-only; does not modify the source filesystem.
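A minimal sketch of the walk-and-hash step, using the standard library. The MD5 default and column order here are assumptions for illustration, not the tool's documented behavior:

```python
import csv
import hashlib
import os

def write_checksum_csv(root, out_csv, algorithm="md5"):
    """Walk `root` read-only, hash every file, and write one
    (checksum, path, filename) row per file."""
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["checksum", "path", "filename"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                h = hashlib.new(algorithm)
                with open(os.path.join(dirpath, name), "rb") as f:
                    # Read in 1 MiB blocks so large files don't exhaust memory.
                    for block in iter(lambda: f.read(1 << 20), b""):
                        h.update(block)
                writer.writerow([h.hexdigest(), dirpath, name])
```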
Converts a plain-text checksum export (as downloaded from Archive-It Vault) into the same three-column CSV format used by csv-checksum-lister and csv-checksum-comparator.
Compares two checksum CSVs (e.g., local vs. Vault) and produces a differences CSV listing files missing from one side or the other. Useful for verifying filesystem migrations and validating backups.
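The comparison logic reduces to two set differences over the checksum column. A minimal sketch, operating on already-parsed rows (the real script reads the three-column CSVs described above):

```python
def diff_checksums(rows_a, rows_b):
    """Given two sequences of (checksum, path, filename) tuples, return
    the rows whose checksum appears on only one side."""
    sums_a = {row[0] for row in rows_a}
    sums_b = {row[0] for row in rows_b}
    only_a = [row for row in rows_a if row[0] not in sums_b]
    only_b = [row for row in rows_b if row[0] not in sums_a]
    return only_a, only_b
```

Matching on checksum rather than path means renamed or relocated files still count as present, which is what you want when verifying a migration.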
Compares two plain-text path lists (local storage and vault storage) and produces a CSV showing which files appear in both and which appear in only one. Top-level directory names are ignored; matching is based on remaining path segments.
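The "ignore the top-level directory" matching can be sketched as follows; the function names are illustrative, not the script's actual interface:

```python
def strip_top_level(path):
    """Drop the first segment of a /-separated path: 'local/a/b.txt' -> 'a/b.txt'."""
    parts = path.strip("/").split("/")
    return "/".join(parts[1:]) if len(parts) > 1 else parts[0]

def compare_path_lists(local_paths, vault_paths):
    """Return (in both, local only, vault only), compared on the
    path segments remaining after the top-level directory is removed."""
    local = {strip_top_level(p) for p in local_paths}
    vault = {strip_top_level(p) for p in vault_paths}
    return sorted(local & vault), sorted(local - vault), sorted(vault - local)
```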
Locates files or directories from a path list within a large directory tree. Extracts the lowest-level element from each path, searches recursively, and writes matches to files-output.csv (filename matches) and dir-output.csv (directory matches).
Matches a target list of files or directories against a recursive directory listing using a cascade of strategies: name-only, case-sensitive, case-insensitive, and fuzzy (via fuzzywuzzy, threshold ≥ 60). Outputs results to CSV. Useful when file paths don't match exactly.
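The cascade idea can be sketched with the standard library, using `difflib.SequenceMatcher` (scaled to 0–100) as a stand-in for fuzzywuzzy's ratio; the exact strategy order and scoring in the real tool may differ:

```python
import os
from difflib import SequenceMatcher

def match_target(target, candidates, threshold=60):
    """Return (strategy, match) from the first strategy that succeeds:
    case-sensitive name match, case-insensitive name match, then fuzzy."""
    name = os.path.basename(target)
    for path in candidates:                      # 1. case-sensitive name match
        if os.path.basename(path) == name:
            return "exact", path
    lowered = name.lower()
    for path in candidates:                      # 2. case-insensitive name match
        if os.path.basename(path).lower() == lowered:
            return "case-insensitive", path
    best, best_score = None, 0                   # 3. fuzzy, threshold >= 60
    for path in candidates:
        score = SequenceMatcher(
            None, lowered, os.path.basename(path).lower()
        ).ratio() * 100
        if score > best_score:
            best, best_score = path, score
    if best_score >= threshold:
        return "fuzzy", best
    return "none", None
```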
Recursively scans a directory for TIFF files and writes their embedded metadata to a CSV. Case-insensitive extension matching (.tif, .tiff). Does not modify source files.
Recursively scans a directory for .txt, .docx, and .doc files, performs a case-insensitive substring search, and reports matches with line numbers or paragraph indices. Optionally writes results to a file and copies matched files to a folder.
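For plain .txt files, the search step amounts to a case-insensitive scan with line numbers, sketched below; the .docx/.doc handling (via python-docx and paragraph indices) is omitted here, and the function name is illustrative:

```python
def search_text_file(path, needle):
    """Case-insensitive substring search over a text file.
    Returns a list of (line_number, line) pairs for matching lines."""
    needle = needle.lower()
    hits = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if needle in line.lower():
                hits.append((lineno, line.rstrip("\n")))
    return hits
```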
Two utilities for preparing file path lists:
- `list-cleaner.py` — strips leading file sizes from a recursive directory listing (e.g., `rclone ls` output) and sorts the result; optionally filters to files or directories only.
- `convert-dos-to-linux-paths.py` — converts DOS-style backslash paths to Linux-style forward-slash paths.
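The path conversion is essentially a backslash-to-slash rewrite. A minimal sketch; dropping the drive letter is an assumption here, and the actual script may handle it differently:

```python
def dos_to_linux(path):
    """Convert a DOS-style path to a Linux-style forward-slash path,
    dropping a leading drive letter if present (an illustrative choice)."""
    if len(path) >= 2 and path[1] == ":":
        path = path[2:]
    return path.replace("\\", "/")
```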
Removes all empty lines from a text file and writes the result to a new file. The original file is not modified.
Discovers audio files in a directory (sorted alphabetically) and concatenates them into a single output file using pydub. Requires ffmpeg.
Two scripts for video processing. See video-tools/README.md for details.
- `video-concatenator` — transcodes all video files in a directory to a standard format and concatenates them into a single output file.
- `video-keyframe-splitter` — transcodes a video file to insert periodic keyframes, then splits it into segments at those boundaries.
Transcribes an audio file to text using OpenAI's Whisper model. Automatically splits files larger than 25 MB into chunks and joins the results. Supports all Whisper model sizes (tiny through large).
Most tools require only Python 3.x and the standard library. Exceptions:
| Tool | Additional requirements |
|---|---|
| doc-scanner | python-docx |
| fuzzy-file-finder | fuzzywuzzy, python-Levenshtein |
| audio-track-concatenator | pydub, ffmpeg |
| video-concatenator, video-keyframe-splitter | ffmpeg |
| whisper-transcriber | openai-whisper, pydub, ffmpeg |
| ia-* tools | internetarchive |