
Do not eliminate candidates using first/last bytes for smaller files. #114


Description

@fziglio

Somewhat related to #29.

I'm using rdfind on a large, messy backup directory. There are plenty of small duplicated files. The statistics are:

Now scanning ".", found 1898156 files.
Now have 1898156 files in total.
Removed 2868 files due to nonunique device and inode.
Total size is 434307786245 bytes or 404 GiB
Removed 52470 files due to unique sizes from list. 1842818 files left.
Now eliminating candidates based on first bytes: removed 631798 files from list. 1211020 files left.
Now eliminating candidates based on last bytes: removed 49901 files from list. 1161119 files left.
Now eliminating candidates based on sha1 checksum: removed 61578 files from list. 1099541 files left.
It seems like you have 1099541 files that are not unique

Although the "Now eliminating candidates based on first bytes" step removed quite a few files, I think in some cases it would be better to skip these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are organized in blocks (usually 4 KB today), and the disk/OS reads those 4 KB at essentially the same speed as 64 bytes (the current SomeByteSize value). So for files that fit in a single block, we could compute the checksum directly and potentially save two reads per file at the cost of some extra CPU. Looking at htop, rdfind is almost always in the D state (waiting for the disk).
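To make the proposal concrete, here is a minimal, self-contained sketch of the routing decision. This is not rdfind code: the 4 KB `kBlockSize` is an assumption (ideally it would come from `statvfs()` per mount point), and the split logic is hypothetical. Files at or below one block would skip the first/last-byte passes and go straight to the checksum step.

```cpp
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Assumed filesystem block size. Files at or below this fit in a single
// block, so reading the first/last 64 bytes costs the same disk access
// as reading the whole file.
constexpr std::uintmax_t kBlockSize = 4096;

int main(int argc, char** argv) {
  std::vector<fs::path> needPrefilter;  // large: keep first/last-byte passes
  std::vector<fs::path> checksumOnly;   // small: hash the whole file directly

  for (int i = 1; i < argc; ++i) {
    fs::path p(argv[i]);
    std::error_code ec;
    const auto size = fs::file_size(p, ec);
    if (ec) continue;  // skip unreadable entries
    (size <= kBlockSize ? checksumOnly : needPrefilter).push_back(p);
  }

  std::cout << needPrefilter.size() << " files keep the byte prefilters, "
            << checksumOnly.size() << " files would be hashed directly\n";
}
```

Since candidates only compete against other files of identical size, routing the small size buckets straight to the checksum pass would not change the results, only the number of reads per file.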
