# MergerFS Cross-Drive Deduplicator

A Python script to find and eliminate duplicate files across multiple physical drives in a mergerfs pool. This tool automates the process of creating hardlinks for duplicates that exist on different filesystems, reclaiming significant amounts of disk space.
## The Problem

Standard deduplication tools like jdupes can create hardlinks to save space, but only if the duplicate files reside on the same physical drive (filesystem). When using mergerfs to pool several drives, it is common for identical files to exist on separate disks, preventing traditional hardlinking and wasting terabytes of space.
## Features

- **Cross-Drive Hardlinking:** Implements a "delete-and-link" strategy to consolidate duplicates that reside on different physical drives.
- **Failsafe Operation:** Uses a stateful CSV log to track every operation (PENDING, DELETED, LINKED). If the script is interrupted, it can be safely resumed and will automatically recover failed operations.
- **Dry Run Mode:** By default, the script runs in a safe, read-only "dry run" mode that shows what actions would be taken without changing any files.
- **Primary Path Logic:** Allows you to designate a "master" path (e.g., your curated media library) to ensure the correct copy of a file is always preserved.
- **Dependency Checks:** Automatically verifies that required tools are installed before running.
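A dependency check like the one described above can be sketched with Python's standard library alone; `check_dependencies` is a hypothetical helper name for illustration, not necessarily what dedupe.py uses:

```python
import shutil

def check_dependencies(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Warn about anything missing before doing real work.
for tool in check_dependencies(["jdupes", "jq"]):
    print(f"warning: required tool not found: {tool}")
```

`shutil.which` mirrors the shell's `which`, so this works without spawning subprocesses.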
## Requirements

- `python3`
- `jdupes`: the core engine for finding duplicate files.
- `jq`: a command-line JSON processor used for creating a human-readable summary (optional but recommended).
## Installation

On Debian-based systems (like Ubuntu), you can install the required tools with the following command:

```sh
sudo apt-get update && sudo apt-get install jdupes jq
```
## Workflow & Usage

The process is broken down into two main phases: scanning for duplicates, then executing the linking script.
### Phase 1: Scan for Duplicates

First, generate a duplicates.json manifest file. This file is the "work queue" for the script. Since this scan can take many hours, it is highly recommended to run it inside a tmux session to protect it from network disconnects.

```sh
tmux new -s jdupes_scan
jdupes --recurse --json /path/to/your/media /path/to/your/downloads > /path/to/save/duplicates.json
```
Once the scan is complete, you can optionally generate a quick human-readable summary from the JSON file using jq:

```sh
jq -r '"Found \(.matchSets | length) sets of duplicates."' /path/to/save/duplicates.json > /path/to/save/summary.log
```
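Before running the deduplication script, it can help to sanity-check the manifest yourself. The sketch below assumes the jdupes JSON shape (`matchSets`, `fileList`, `filePath`, `fileSize` keys); field names can vary between jdupes versions, so inspect your own manifest first:

```python
import json

# Illustrative manifest shape only -- the key names are assumptions
# based on jdupes' JSON output; check your own duplicates.json.
manifest = {
    "matchSets": [
        {
            "fileSize": 734003200,
            "fileList": [
                {"filePath": "/mnt/pool/disk1/Media/movie.mkv"},
                {"filePath": "/mnt/pool/disk2/downloads/movie.mkv"},
            ],
        }
    ]
}

def summarize(manifest):
    """Count duplicate sets and bytes reclaimable by keeping one copy per set."""
    sets = manifest.get("matchSets", [])
    reclaimable = sum(s["fileSize"] * (len(s["fileList"]) - 1) for s in sets)
    return len(sets), reclaimable

count, reclaimable = summarize(manifest)
print(f"{count} duplicate set(s), ~{reclaimable / 1e9:.1f} GB reclaimable")
```

In real use you would load the file with `json.load(open("duplicates.json"))` instead of the inline example.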
### Phase 2: Run the Deduplication Script

With the duplicates.json manifest created, you can now run the Python script (dedupe.py).
#### Dry Run (Recommended First Step)

Always perform a dry run first to ensure the script is configured correctly and will perform the expected actions. This command will not delete or link any files.

```sh
python3 dedupe.py \
    --json-file /path/to/save/duplicates.json \
    --log-file /path/to/save/dedupe_log.csv \
    --primary-path /mnt/storage/Media/ \
    --pool-root /mnt/pool/
```
Review the output and the script_run.log file to verify its proposed actions.
#### Execute for Real

Once you are confident, run the command again with the `--perform-actions` flag. This will begin the actual delete-and-link process. It is also recommended to run this inside a tmux session.

```sh
python3 dedupe.py \
    --json-file /path/to/save/duplicates.json \
    --log-file /path/to/save/dedupe_log.csv \
    --primary-path /mnt/storage/Media/ \
    --pool-root /mnt/pool/ \
    --perform-actions
```
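If a run is interrupted, the CSV log described under Features is what makes resuming safe. A hypothetical sketch of that recovery logic, assuming the CSV columns are `state, keep_path, dup_path` (not the script's actual code):

```python
import csv

def pending_recovery(log_path):
    """Return operations whose most recent logged state is not LINKED.

    PENDING means the delete may or may not have happened; DELETED means
    the duplicate is gone but the hardlink was never created. A resumed
    run would verify and finish these before processing new sets.
    """
    last_state = {}
    with open(log_path, newline="") as f:
        for state, keep, dup in csv.reader(f):
            last_state[(keep, dup)] = state
    return {op: s for op, s in last_state.items() if s != "LINKED"}
```

Because only the *last* state per file pair matters, replaying the whole log on startup is cheap even after many runs.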
## Script Arguments

- `--json-file`: (Required) Path to the duplicates.json manifest file generated by jdupes.
- `--log-file`: (Required) Path to the CSV log file used for tracking the state of each operation. It will be created if it doesn't exist.
- `--primary-path`: (Required) The primary path where "master" files are kept. Any file within this path will be preserved, and duplicates elsewhere will be replaced with hardlinks to it.
- `--pool-root`: (Required) The root directory that contains your individual disk mounts (e.g., /mnt/pool/, which contains /mnt/pool/disk1, /mnt/pool/disk2, etc.).
- `--perform-actions`: (Optional) When this flag is included, the script will actually perform delete and link operations. Without it, the script runs in a safe, read-only dry-run mode.
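The core delete-and-link step can be illustrated as follows. This is a hypothetical sketch, not the script's actual implementation; in practice both paths would be pool-level (mergerfs) paths, so that the new hardlink lands on the same underlying branch as the surviving file (exact behavior depends on your mergerfs link policy):

```python
import csv
import os

def delete_and_link(keep_path, dup_path, log_path, perform_actions=False):
    """Replace dup_path with a hardlink to keep_path, logging each state
    transition (PENDING -> DELETED -> LINKED) so an interrupted run can
    be resumed later."""
    def log(state):
        with open(log_path, "a", newline="") as f:
            csv.writer(f).writerow([state, keep_path, dup_path])

    if not perform_actions:
        # Dry-run mode: report only, touch nothing.
        print(f"[DRY RUN] would link {dup_path} -> {keep_path}")
        return
    log("PENDING")
    os.remove(dup_path)           # free the space held by the duplicate...
    log("DELETED")
    os.link(keep_path, dup_path)  # ...then recreate the name as a hardlink
    log("LINKED")
```

Note the ordering: the delete is logged before and after it happens, so a crash between `os.remove` and `os.link` leaves a DELETED row that recovery can finish.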
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Disclaimer

This script performs file deletion operations. While it is designed to be failsafe, you should always have backups of your important data. The author is not responsible for any data loss. Use at your own risk.