Somehow related to #29.
I'm using rdfind on a large, messy backup directory. There are plenty of small duplicated files. The statistics are:
Now scanning ".", found 1898156 files.
Now have 1898156 files in total.
Removed 2868 files due to nonunique device and inode.
Total size is 434307786245 bytes or 404 GiB
Removed 52470 files due to unique sizes from list. 1842818 files left.
Now eliminating candidates based on first bytes: removed 631798 files from list. 1211020 files left.
Now eliminating candidates based on last bytes: removed 49901 files from list. 1161119 files left.
Now eliminating candidates based on sha1 checksum: removed 61578 files from list. 1099541 files left.
It seems like you have 1099541 files that are not unique
Although the "Now eliminating candidates based on first bytes" pass removed quite a few files, I think in some cases it would be better to skip these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are stored in blocks (usually 4 KB today), and the disk/OS reads a 4 KB block at essentially the same speed as 64 bytes (the current SomeByteSize value). So for files that fit in a single block, computing the checksum directly would potentially save two reads per file at the cost of some extra CPU. Looking at htop, rdfind is almost always in the D state (waiting for the disk).
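To illustrate the idea, here is a minimal, self-contained sketch (in C++, since rdfind is written in C++) of the proposed heuristic, not rdfind's actual code: files that fit in one filesystem block go straight to a full-content hash, while larger files still get the cheap 64-byte first-bytes pass. The 4096-byte block size, the function names, and the use of std::hash as a stand-in for SHA-1 are all assumptions for illustration.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

constexpr std::uintmax_t kBlockSize = 4096;  // assumed typical filesystem block size
constexpr std::size_t kSomeByteSize = 64;    // rdfind's partial-read size (SomeByteSize)

// Hash only the first `n` bytes of a file (stand-in for the first-bytes pass).
std::size_t hashFirstBytes(const fs::path& p, std::size_t n) {
  std::ifstream in(p, std::ios::binary);
  std::string buf(n, '\0');
  in.read(buf.data(), static_cast<std::streamsize>(n));
  buf.resize(static_cast<std::size_t>(in.gcount()));
  return std::hash<std::string>{}(buf);
}

// Hash the whole file (stand-in for the SHA-1 checksum pass).
std::size_t hashWholeFile(const fs::path& p) {
  std::ifstream in(p, std::ios::binary);
  std::string buf((std::istreambuf_iterator<char>(in)),
                  std::istreambuf_iterator<char>());
  return std::hash<std::string>{}(buf);
}

int main(int argc, char** argv) {
  std::vector<fs::path> files(argv + 1, argv + argc);
  for (const auto& f : files) {
    const auto size = fs::file_size(f);
    if (size <= kBlockSize) {
      // Small file: reading 64 bytes already costs one full block read,
      // so skip the partial passes and checksum the whole file directly.
      std::cout << f << " -> full hash " << hashWholeFile(f) << '\n';
    } else {
      // Large file: the cheap 64-byte pass can still eliminate candidates
      // before the expensive full read.
      std::cout << f << " -> first-bytes hash "
                << hashFirstBytes(f, kSomeByteSize) << '\n';
    }
  }
}
```

With 1099541 small candidates left after the byte passes, this would trade two extra partial reads per small file for one full read plus hashing, which should help when the process is I/O-bound as described above.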