A python CLI to identify file formats and bulk convert files. It is designed for digital preservation workflows and is essentially a python wrapper around several programs: pygfried (a CPython extension for siegfried), ffmpeg, imagemagick (optionally inkscape) and LibreOffice. If you don't use fileidentification often and don't want to install these programs, you can run the script in a docker container. There is a dockerfile ready; the current docker image is still heavy though (1.1 GB).
The most likely use case: you need to test and possibly convert a huge number of files and you don't know in advance what file types you are dealing with. It features:
- file format identification and extraction of technical metadata with pygfried, ffprobe and imagemagick
- file probing with ffmpeg and imagemagick
- file conversion with ffmpeg, imagemagick and LibreOffice using a JSON file as a protocol
- detailed logging
Build the image, make the bash script executable, and link it to a bin directory that appears in PATH (e.g. $HOME/.local/bin):
docker build -t fileidentification .
chmod +x ./fidr.sh
ln -s "$(pwd)"/fidr.sh $HOME/.local/bin/fidr
Generate policies for your files:
fidr path/to/directory
This creates a folder __fileidentification inside the target directory with a _policies.json file. Optionally review and edit it to customize conversion rules.
Test the files on errors and apply the policies:
fidr path/to/directory -iar
The first argument has to be the root folder of the files to process; beyond that, combine flags and arguments as you wish.
See Options and Examples below for more available flags.
Install ffmpeg (used to check/convert video and audio files), imagemagick (used to check/convert images), LibreOffice (used to convert Office documents and PDFs), and ghostscript (used by imagemagick when dealing with PDFs):
brew install ffmpeg
brew install imagemagick
brew install ghostscript
brew install --cask libreoffice
Depending on your distribution:
On Debian/Ubuntu:
sudo apt-get update
sudo apt-get install ffmpeg imagemagick ghostscript libreoffice
If you don't have uv installed, install it with
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, you can use uv run to run the fileidentification script.
This creates a venv and installs all necessary python dependencies:
uv run identify.py --help
uv run identify.py path/to/directory
This generates a folder __fileidentification inside the target directory with two JSON files:
_log.json : The technical metadata of all the files in the folder
_policies.json : A file conversion protocol for each file format that was encountered in the folder according to the default policies. Edit it to customize conversion rules.
In the console output there is a table with an overview of the encountered file formats. The rows are colored according to this color code:
- White: Files of that format are processed according to the given policy
- Yellow: Files of that format are skipped (either blank policy or policy is missing)
- Red: Files of that format are removed (when run with the flag -s, --strict)
Possible values of the "Policy" column:
- ffmpeg|magick|soffice: Files of this format are going to be converted with the indicated program
- blank: A blank policy (template) was generated
- missing: No policy for this file type
(-i | --assert-file-integrity)
uv run identify.py path/to/directory -i
Probe the files for errors and move corrupted files to the folder __fileidentification/_REMOVED.
Rename files with extension mismatch.
Optionally add the flag -v (--verbose) for more detailed inspection (see Options below).
NOTE: Currently only audio/video and image files are inspected.
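The extension-mismatch handling can be sketched roughly as follows. This is an illustration of the idea, not the tool's actual internals: the mapping excerpt and the helper name are hypothetical (the real tool identifies files with pygfried and ships its mapping in fileidentification/definitions/fmt2ext.json).

```python
from pathlib import Path

# Hypothetical excerpt of a PUID -> valid-extensions mapping
# (stand-in for fileidentification/definitions/fmt2ext.json).
FMT2EXT = {
    "fmt/5": ["avi"],
    "fmt/13": ["png"],
    "fmt/199": ["mp4", "m4v"],
}

def fix_extension_mismatch(path: Path, puid: str) -> Path:
    """Return the corrected filename if the extension doesn't match the PUID."""
    valid = FMT2EXT.get(puid, [])
    ext = path.suffix.lstrip(".").lower()
    if valid and ext not in valid:
        target = path.with_suffix("." + valid[0])
        # path.rename(target)  # the actual rename is disabled in this sketch
        return target
    return path

# A file identified as PNG (fmt/13) but named .jpg gets a .png name:
print(fix_extension_mismatch(Path("scan.jpg"), "fmt/13"))  # scan.png
```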
(-a | --apply)
uv run identify.py path/to/directory -a
Apply the policies defined in __fileidentification/_policies.json (or in the policies passed with -p) and convert
files into their target file format.
The converted files are temporarily stored in __fileidentification with the log output
of the program used as log.txt next to it.
(-r | --remove-tmp)
uv run identify.py path/to/directory -r
Delete all temporary files and folders and move the converted files next to their original.
The _log.json keeps track of all modifications in the target folder.
On each execution, the script checks whether such a log already exists and reads from and appends to it.
Iterations of file conversions such as A -> B, B -> C, ... are logged in the same file.
If you want a simpler CSV output, add the flag --csv any time you run the script;
it maps the _log.json to a CSV file.
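The JSON-to-CSV mapping can be illustrated with a short sketch. The log structure and field names below are simplified assumptions for illustration only, not the tool's actual schema:

```python
import csv
import io

# Hypothetical, simplified _log.json shape: one entry per file, with a
# list of modifications (iterated conversions such as A -> B -> C).
log = {
    "videos/clip.avi": [
        {"action": "converted", "target": "videos/clip.mp4", "bin": "ffmpeg"},
        {"action": "converted", "target": "videos/clip2.mp4", "bin": "ffmpeg"},
    ],
    "images/scan.png": [{"action": "accepted", "target": "", "bin": "magick"}],
}

def log_to_csv(log: dict) -> str:
    """Flatten the nested log into one CSV row per modification."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file", "action", "target", "bin"])
    for path, entries in log.items():
        for e in entries:
            writer.writerow([path, e["action"], e["target"], e["bin"]])
    return buf.getvalue()

print(log_to_csv(log))
```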
You can also create your own policies, and with that, customise the file conversion output.
Simply edit the generated default file __fileidentification/_policies.json before applying, or pass a customised
policies file with the parameter -p.
If you want to start from scratch, run uv run identify.py path/to/directory -b to create a
blank policies template with all the file formats encountered in the folder.
A policy for a file type consists of the following fields and uses the format's PRONOM Unique Identifier (PUID) as a key:
| Field | Type | Required |
|---|---|---|
| format_name | str | optional |
| bin | str | required |
| accepted | bool | required |
| target_container | str | required if field accepted is false |
| processing_args | str | required if field accepted is false |
| expected | list[str] | required if field accepted is false |
| remove_original | bool | optional (default is false) |
- format_name: The name of the file format.
- bin: The program used to convert or test the file. Literal["", "magick", "ffmpeg", "soffice"]. (Testing is currently only supported for image/audio/video files, i.e. magick and ffmpeg.)
- accepted: false if the file needs to be converted, true if it doesn't.
- target_container: The container (file extension) of the conversion target, e.g. "mp4".
- processing_args: The arguments used with bin. Can be an empty string if no arguments are needed.
- expected: The expected file format(s) of the converted file, as PUIDs.
- remove_original: Whether to remove the original of a converted file from the directory (it is moved to __fileidentification/_REMOVED); default is false.
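To illustrate how these fields fit together, here is a rough sketch of how a conversion command could be assembled from a policy entry. This only illustrates the policy semantics (bin + processing_args + target_container determine the call); it is not the tool's actual implementation, and the soffice case is omitted:

```python
import shlex
from pathlib import Path

def build_command(src: str, policy: dict) -> list[str]:
    """Assemble a conversion command from a policy entry (illustrative sketch)."""
    if policy["accepted"]:
        return []  # accepted as is: nothing to convert
    src_path = Path(src)
    dst = src_path.with_suffix("." + policy["target_container"])
    args = shlex.split(policy["processing_args"])
    if policy["bin"] == "ffmpeg":
        return ["ffmpeg", "-i", str(src_path), *args, str(dst)]
    if policy["bin"] == "magick":
        return ["magick", str(src_path), *args, str(dst)]
    raise ValueError(f"unsupported bin in this sketch: {policy['bin']}")

avi_policy = {
    "bin": "ffmpeg",
    "accepted": False,
    "target_container": "mp4",
    "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
}
print(build_command("clip.avi", avi_policy))
```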
A policy for the Audio/Video Interleaved Format (avi) that needs to be transcoded to MPEG-4 Media File (Codec: AVC/H.264, Audio: AAC) looks like this:
{
"fmt/5": {
"format_name": "Audio/Video Interleaved Format",
"bin": "ffmpeg",
"accepted": false,
"target_container": "mp4",
"processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
"expected": [
"fmt/199"
],
"remove_original": false
}
}
A policy for Portable Network Graphics that is accepted as it is:
{
"fmt/13": {
"format_name": "Portable Network Graphics",
"bin": "magick",
"accepted": true
}
}
Policy Testing:
You can test the outcome of the conversion policies with
uv run identify.py path/to/directory -t
The script takes the smallest file for each conversion policy and converts it.
If you just want to test a specific policy, append f and the PUID:
uv run identify.py path/to/directory -tf fmt/XXX
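The candidate selection can be sketched like this. The data shapes here are hypothetical stand-ins for the identification results; the point is simply "smallest file per conversion policy":

```python
def smallest_per_puid(files: list[tuple[str, str, int]]) -> dict[str, str]:
    """Pick the smallest file for each PUID.

    files: (path, puid, size_in_bytes) triples -- sizes would normally
    come from Path.stat().st_size on the identified files.
    """
    best: dict[str, tuple[str, int]] = {}
    for path, puid, size in files:
        if puid not in best or size < best[puid][1]:
            best[puid] = (path, size)
    return {puid: path for puid, (path, _) in best.items()}

files = [
    ("a.avi", "fmt/5", 9_000_000),
    ("b.avi", "fmt/5", 2_000_000),
    ("c.png", "fmt/13", 50_000),
]
print(smallest_per_puid(files))  # {'fmt/5': 'b.avi', 'fmt/13': 'c.png'}
```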
-i | --assert-file-integrity
Remove corrupt files, and try to fix minor errors
-v | --verbose
Catch more warnings on video and image files during the tests.
This can take significantly longer depending on the files you have.
-a | --apply
Apply the policies
-r | --remove-tmp
Remove all temporary items and add the converted files next to their original.
-x | --remove-original
This overrides the remove_original value in the policies and sets it to true when removing the tmp files.
The original files are moved to the __fileidentification/_REMOVED folder.
When used in generating policies, it sets remove_original in the policies to true (default false).
--tmp-dir
Use a custom tmp directory instead of the default __fileidentification
-p | --policies-path
Load a custom policies JSON file instead of generating one out of the default policies.
-e | --extend-policies
Use with -p:
Append file types found in the directory that are missing from the custom policies, and generate a new policies JSON.
-s | --strict
Move the files whose format is not listed in the policies file to the folder _REMOVED
(instead of emitting a warning).
When used in generating policies, do not add blank policies for formats that are not mentioned in DEFAULTPOLICIES.
-b | --blank
Create a blank policies file based on the file types encountered in the given directory.
-q | --quiet
Just print errors and warnings
--inspect
Just inspect the target folder without any modification
--csv
Get output as CSV, in addition to the _log.json
--convert
Re-convert the files that failed during file conversion
Use case: you have defined a set of rules in an external policies file and want to remove files of any format that
is not listed in the external policies.
To get an overview: fidr path/to/directory -s -p path/to/external_policies.json
To apply it directly:
fidr path/to/directory -asr -p path/to/external_policies.json
- load an external policies JSON
- apply the policies (in strict mode, i.e. remove the files whose file type is not listed in the policies)
- remove temporary files
Use case: Your files are on an external storage drive, you have limited disk space left, and you want to keep only the converted files.
fidr path/to/directory --tmp-dir path/to/tmp_dir -ivarx
- use a custom tmp_dir to write files to (instead of the default path/to/directory/__fileidentification)
- probe the files in verbose mode and apply the policies
- remove temporary files and the original of the converted files
Update the file format names and extensions of the PUIDs according to https://www.nationalarchives.gov.uk/.
uv sync --extra update_fmt && uv run update.py
This creates an updated version of fileidentification/definitions/fmt2ext.json.
If you use the Docker-based version, don't forget to rebuild the Docker image after updating the PUIDs.
- A good resource to query for file formats: nationalarchives.gov.uk
- The homepage of siegfried: itforarchivists.com/siegfried/
- List of file signatures on Wikipedia
- Preservation recommendations: KOST, Bundesarchiv