MergerFS Cross-Drive Deduplicator

A Python script to find and eliminate duplicate files across multiple physical drives in a mergerfs pool. This tool automates the process of creating hardlinks for duplicates that exist on different filesystems, reclaiming significant amounts of disk space.

The Problem

Standard deduplication tools like jdupes can create hardlinks to save space, but only if the duplicate files reside on the same physical drive (filesystem). When using mergerfs to pool several drives, it is common for identical files to exist on separate disks, preventing traditional hardlinking and wasting terabytes of space.
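
In practice, "same filesystem" simply means both paths report the same device ID. A quick Python check illustrates why cross-drive hardlinking fails; the paths below are illustrative placeholders:

```python
import os

def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Illustrative paths: two copies of the same file on different mergerfs
# branches. os.link() between them would fail with OSError
# (EXDEV, "Invalid cross-device link").
print(same_filesystem("/mnt/pool/disk1/Media/movie.mkv",
                      "/mnt/pool/disk2/downloads/movie.mkv"))
```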

Features

Cross-Drive Hardlinking: Implements a "delete-and-link" strategy to consolidate duplicates that are on different physical drives (a sketch of the idea follows this list).

Failsafe Operation: Uses a stateful CSV log to track every operation (PENDING, DELETED, LINKED). If the script is interrupted, it can be safely resumed and will automatically recover failed operations.

Dry Run Mode: By default, the script runs in a safe, read-only "dry run" mode that shows what actions would be taken without changing any files.

Primary Path Logic: Allows you to designate a "master" path (e.g., your curated media library) to ensure the correct copy of a file is always preserved.

Dependency Checks: Automatically verifies that required tools are installed before running.
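
The delete-and-link idea, including the dry-run guard and the PENDING/DELETED/LINKED log states described above, can be sketched as follows. The names (dedupe_pair, keeper_path, dup_path) are hypothetical, and this is an illustration of the approach rather than the script's actual code; it assumes both paths are given through the mergerfs pool mount so that the new link lands on the branch holding the keeper:

```python
import csv
import os

def dedupe_pair(keeper_path: str, dup_path: str, log_path: str,
                perform_actions: bool = False) -> None:
    """Replace dup_path with a hardlink to keeper_path, logging each state.

    Logging DELETED before LINKED is what makes an interrupted run
    recoverable: a row stuck at DELETED tells a resumed run that the
    link is still missing.
    """
    with open(log_path, "a", newline="") as log:
        writer = csv.writer(log)
        writer.writerow([keeper_path, dup_path, "PENDING"])

        if not perform_actions:
            print(f"[DRY RUN] would replace {dup_path} with a link to {keeper_path}")
            return

        os.remove(dup_path)             # step 1: delete the cross-drive duplicate
        writer.writerow([keeper_path, dup_path, "DELETED"])

        os.link(keeper_path, dup_path)  # step 2: recreate it as a hardlink
        writer.writerow([keeper_path, dup_path, "LINKED"])
```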

Requirements

python3

jdupes: The core engine for finding duplicate files.

jq: A command-line JSON processor used for creating a human-readable summary (optional but recommended).

Installation

On Debian-based systems (such as Ubuntu), you can install the required tools with the following command:

sudo apt-get update && sudo apt-get install jdupes jq

Workflow & Usage

The process is broken into two main phases: scanning for duplicates, then executing the linking script.

Phase 1: Scan for Duplicates

First, generate a duplicates.json manifest file. This file is the "work queue" for the script. Since the scan can take many hours, it is highly recommended to run it inside a tmux session to protect it from network disconnects.

Start a new tmux session

tmux new -s jdupes_scan

Run the jdupes scan. This is a read-only operation.

jdupes --recurse --json /path/to/your/media /path/to/your/downloads > /path/to/save/duplicates.json

Once the scan is complete, you can optionally generate a quick human-readable summary from the JSON file using jq:

jq '"Found (.matchSets | length) sets of duplicates."' /path/to/save/duplicates.json > /path/to/save/summary.log

Phase 2: Run the Deduplication Script

With the duplicates.json manifest created, you can now run the Python script (dedupe.py).

  1. Dry Run (Recommended First Step)

Always perform a dry run first to ensure the script is configured correctly and will perform the expected actions. This command will not delete or link any files.

python3 dedupe.py \
  --json-file /path/to/save/duplicates.json \
  --log-file /path/to/save/dedupe_log.csv \
  --primary-path /mnt/storage/Media/ \
  --pool-root /mnt/pool/

Review the output and the script_run.log file to verify its proposed actions.

  2. Execute for Real

Once you are confident, run the command again with the --perform-actions flag. This will begin the actual delete-and-link process. It is also recommended to run this inside a tmux session.

python3 dedupe.py \
  --json-file /path/to/save/duplicates.json \
  --log-file /path/to/save/dedupe_log.csv \
  --primary-path /mnt/storage/Media/ \
  --pool-root /mnt/pool/ \
  --perform-actions

Script Arguments

The script accepts the following arguments; an illustrative sketch of this interface follows the list.

--json-file: (Required) Path to the duplicates.json manifest file generated by jdupes.

--log-file: (Required) Path to the CSV log file used for tracking the state of each operation. It will be created if it doesn't exist.

--primary-path: (Required) The primary path where "master" files are kept. Any file within this path will be preserved, and duplicates will be linked to it.

--pool-root: (Required) The root directory that contains your individual disk mounts (e.g., /mnt/pool/ which contains /mnt/pool/disk1, /mnt/pool/disk2, etc.).

--perform-actions: (Optional) When this flag is included, the script will actually perform delete and link operations. Without it, the script runs in a safe, read-only dry-run mode.
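
For reference, the argument surface above maps onto a straightforward argparse declaration. This is only an illustrative sketch of how dedupe.py's interface could be defined, not necessarily its actual code:

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Hardlink cross-drive duplicates in a mergerfs pool."
    )
    parser.add_argument("--json-file", required=True,
                        help="duplicates.json manifest produced by jdupes")
    parser.add_argument("--log-file", required=True,
                        help="CSV state log (created if it does not exist)")
    parser.add_argument("--primary-path", required=True,
                        help="path whose copies are always preserved")
    parser.add_argument("--pool-root", required=True,
                        help="directory containing the individual disk mounts")
    parser.add_argument("--perform-actions", action="store_true",
                        help="actually delete and link; omit for a dry run")
    return parser.parse_args()
```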

License

This project is licensed under the MIT License. See the LICENSE file for details.

Disclaimer

This script performs file deletion operations. While it is designed to be failsafe, you should always have backups of your important data. The author is not responsible for any data loss. Use at your own risk.
