
Do not eliminate candidates using first/last bytes for smaller files. #114


Description

@fziglio

Somewhat related to #29.

I'm using rdfind on a large, messy backup directory. There are plenty of small duplicated files. The statistics are:

Now scanning ".", found 1898156 files.
Now have 1898156 files in total.
Removed 2868 files due to nonunique device and inode.
Total size is 434307786245 bytes or 404 GiB
Removed 52470 files due to unique sizes from list. 1842818 files left.
Now eliminating candidates based on first bytes: removed 631798 files from list. 1211020 files left.
Now eliminating candidates based on last bytes: removed 49901 files from list. 1161119 files left.
Now eliminating candidates based on sha1 checksum: removed 61578 files from list. 1099541 files left.
It seems like you have 1099541 files that are not unique

Although the "Now eliminating candidates based on first bytes" step removed quite a few files, I think in some cases it would be better to skip these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are organized in blocks (usually 4 KB today), and the disk/OS reads those 4 KB at essentially the same speed as 64 bytes (the current SomeByteSize value). So for files that fit in a single block, we could compute the checksum directly and potentially save two reads per file at the cost of some extra CPU. Looking at htop, rdfind is almost always in the D state (waiting for the disk).
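To make the proposal concrete, here is a minimal, self-contained sketch of the routing decision. This is not rdfind code: the 4 KB `kBlockSize` is an assumption (ideally it would come from `statvfs()` per mount point), and the split logic is hypothetical. Files at or below one block would skip the first/last-byte passes and go straight to the checksum step.

```cpp
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Assumed filesystem block size. Files at or below this fit in a single
// block, so reading the first/last 64 bytes costs the same disk access
// as reading the whole file.
constexpr std::uintmax_t kBlockSize = 4096;

int main(int argc, char** argv) {
  std::vector<fs::path> needPrefilter;  // large: keep first/last-byte passes
  std::vector<fs::path> checksumOnly;   // small: hash the whole file directly

  for (int i = 1; i < argc; ++i) {
    fs::path p(argv[i]);
    std::error_code ec;
    const auto size = fs::file_size(p, ec);
    if (ec) continue;  // skip unreadable entries
    (size <= kBlockSize ? checksumOnly : needPrefilter).push_back(p);
  }

  std::cout << needPrefilter.size() << " files keep the byte prefilters, "
            << checksumOnly.size() << " files would be hashed directly\n";
}
```

Since candidates only compete against other files of identical size, routing the small size buckets straight to the checksum pass would not change the results, only the number of reads per file.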
