Commit 94e87cc

allow setting the first/last bytes reading size
Author: Paul Dreik
1 parent 47b44bf, commit 94e87cc

5 files changed, +86 -4 lines changed

Options.cc

Lines changed: 19 additions & 0 deletions
@@ -30,6 +30,12 @@ options are (default choice within parentheses)

 Processing options:

+-firstbytessize N sets the size in bytes when comparing the
+beginning of files, prior to full checksumming.
+default is 64 byte. Use 0 to disable the stage.
+-lastbytessize N sets the size in bytes when comparing the
+end of files, prior to full checksumming.
+default is 64 byte. Use 0 to disable the stage.
 -checksum none | md5 |(sha1)| sha256 | sha512 | xxh128
 checksum type
 xxh128 is very fast, but is noncryptographic.
@@ -128,6 +134,19 @@ parseOptions(Parser& parser)
     o.remove_identical_inode = parser.get_parsed_bool();
   } else if (parser.try_parse_bool("-deterministic")) {
     o.deterministic = parser.get_parsed_bool();
+  } else if (parser.try_parse_string("-firstbytessize")) {
+    const auto tmp = std::stoll(parser.get_parsed_string());
+    if (tmp < 0) {
+      throw std::runtime_error(
+        "negative value of firstbytessize not allowed");
+    }
+    o.first_bytes_size = tmp;
+  } else if (parser.try_parse_string("-lastbytessize")) {
+    const auto tmp = std::stoll(parser.get_parsed_string());
+    if (tmp < 0) {
+      throw std::runtime_error("negative value of lastbytessize not allowed");
+    }
+    o.last_bytes_size = tmp;
   } else if (parser.try_parse_string("-checksum")) {
     if (parser.parsed_string_is("md5")) {
       o.usemd5 = true;
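
The new branches rely on `std::stoll`, which stops at the first non-numeric character (so `64abc` parses as 64) and itself throws `std::invalid_argument` or `std::out_of_range` on malformed input. As a rough sketch only, a stricter shared helper could look like the following. `parse_nonnegative_size` is a hypothetical name and is not part of rdfind or of this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not in rdfind): parses a byte count for an option
// such as -firstbytessize, rejecting negative values and trailing garbage.
long long parse_nonnegative_size(const std::string& text,
                                 const std::string& option_name)
{
  std::size_t consumed = 0;
  // std::stoll throws std::invalid_argument if no digits are found and
  // std::out_of_range if the value does not fit in a long long.
  const long long value = std::stoll(text, &consumed);
  if (consumed != text.size()) {
    throw std::runtime_error("trailing characters in value of " +
                             option_name);
  }
  if (value < 0) {
    throw std::runtime_error("negative value of " + option_name +
                             " not allowed");
  }
  assert(value >= 0); // postcondition: callers can store it as a size
  return value;
}
```

Under these assumptions, both option branches could call the helper with their own option name, so the negative-value check would live in one place.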

README.md

Lines changed: 2 additions & 2 deletions
@@ -86,9 +86,9 @@ Rdfind uses the following algorithm. If N is the number of files to search throu
 5. If flag -removeidentinode true: Remove items from the list which already are added, based on the combination of inode and device number. A group of files that are hardlinked to the same file are collapsed to one entry. Also see the comment on hardlinks under ”caveats below”!
 6. Sort files on size. Remove files from the list, which have unique sizes.
 7. Sort on device and inode(speeds up file reading). Read a few bytes from the beginning of each file (first bytes).
-8. Remove files from list that have the same size but different first bytes.
+8. Remove files from list that have the same size but different first bytes. (This step is possible to disable by using -firstbytessize 0).
 9. Sort on device and inode(speeds up file reading). Read a few bytes from the end of each file (last bytes).
-10. Remove files from list that have the same size but different last bytes.
+10. Remove files from list that have the same size but different last bytes. (This step is possible to disable by using -lastbytessize 0).
 11. Sort on device and inode(speeds up file reading). Perform a checksum calculation for each file (unless disabled with -checksum none).
 12. Only keep files on the list with the same size and checksum. These are duplicates.
 13. Sort list on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.
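
Steps 6 to 10 above amount to repeatedly dropping every candidate whose key (size, then size plus first bytes, then size plus last bytes) is not shared with any other candidate. The sketch below illustrates that idea on in-memory strings, assuming that a size of 0 skips the corresponding stage as the new options describe. `Candidate`, `drop_unique` and `filter_candidates` are hypothetical names; rdfind itself sorts and reads real files, and that code is not part of this commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a file on disk.
struct Candidate {
  std::string name;
  std::string content;
};

// Key for the first-bytes stage: file size plus at most n leading bytes.
std::string first_bytes_key(const Candidate& c, std::size_t n)
{
  return std::to_string(c.content.size()) + ":" + c.content.substr(0, n);
}

// Key for the last-bytes stage: file size plus at most n trailing bytes.
std::string last_bytes_key(const Candidate& c, std::size_t n)
{
  const std::size_t len = std::min(n, c.content.size());
  assert(len <= c.content.size());
  return std::to_string(c.content.size()) + ":" +
         c.content.substr(c.content.size() - len);
}

// Removes every candidate whose key is not shared with another candidate.
template <typename KeyFn>
void drop_unique(std::vector<Candidate>& list, KeyFn key)
{
  std::map<std::string, int> counts;
  for (const auto& c : list) {
    ++counts[key(c)];
  }
  list.erase(std::remove_if(list.begin(), list.end(),
                            [&](const Candidate& c) {
                              return counts[key(c)] < 2;
                            }),
             list.end());
}

// Runs the size, first-bytes and last-bytes stages; a size of 0 is assumed
// to skip the corresponding stage.
std::vector<Candidate> filter_candidates(std::vector<Candidate> list,
                                         std::size_t first_n,
                                         std::size_t last_n)
{
  drop_unique(list, [](const Candidate& c) {
    return std::to_string(c.content.size());
  });
  if (first_n > 0) {
    drop_unique(list, [first_n](const Candidate& c) {
      return first_bytes_key(c, first_n);
    });
  }
  if (last_n > 0) {
    drop_unique(list, [last_n](const Candidate& c) {
      return last_bytes_key(c, last_n);
    });
  }
  return list;
}
```

With two six-byte inputs that differ only in their middle two bytes, `filter_candidates(files, 2, 2)` keeps both, while a first-bytes size of 4 separates them. This mirrors what testcases/verify_skipfirstbytes.sh checks with -checksum none.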

rdfind.1

Lines changed: 8 additions & 0 deletions
@@ -91,6 +91,14 @@ for files, smaller or bigger can improve performance
 dependent on filesystem and checksum algorithm.
 The default is 1 MiB, the maximum allowed is 128MiB (inclusive).
 .TP
+.BR \-firstbytessize " " \fIN\fR
+Size in bytes when scanning the first bytes of each file, prior to full
+checksumming. Setting this to 0 means skipping the step entirely.
+.TP
+.BR \-lastbytessize " " \fIN\fR
+Size in bytes when scanning the last bytes of each file, prior to full
+checksumming. Setting this to 0 means skipping the step entirely.
+.TP
 .BR \-deterministic " " \fItrue\fR|\fIfalse\fR
 If set (the default), sort files of equal rank in an unspecified but
 deterministic order. This makes the behaviour independent of in which

rdfind.cc

Lines changed: 8 additions & 2 deletions
@@ -146,9 +146,15 @@ main(int narg, const char* argv[])
   // candidates. start looking at the contents.
   std::vector<std::pair<Fileinfo::readtobuffermode, const char*>> modes{
     { Fileinfo::readtobuffermode::NOT_DEFINED, "" },
-    { Fileinfo::readtobuffermode::READ_FIRST_BYTES, "first bytes" },
-    { Fileinfo::readtobuffermode::READ_LAST_BYTES, "last bytes" },
   };
+  if (o.first_bytes_size > 0) {
+    modes.emplace_back(Fileinfo::readtobuffermode::READ_FIRST_BYTES,
+                       "first bytes");
+  }
+  if (o.last_bytes_size > 0) {
+    modes.emplace_back(Fileinfo::readtobuffermode::READ_LAST_BYTES,
+                       "last bytes");
+  }
   if (o.usemd5) {
     modes.emplace_back(Fileinfo::readtobuffermode::CREATE_MD5_CHECKSUM,
                        "md5 checksum");

testcases/verify_skipfirstbytes.sh

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+#!/bin/sh
+# Ensures the skip first bytes step checks
+#
+
+set -e
+. "$(dirname "$0")/common_funcs.sh"
+
+FIRSTBYTES=1000
+MIDDLEBYTES=1000
+LASTBYTES=1000
+
+# make a file which is longer than "first bytes" and "last bytes" together,
+# so we can make two files that differ only in the middle and will
+# need checksumming to see they are different.
+makefiles() {
+  for f in a b; do
+    (
+      head -c$FIRSTBYTES </dev/zero
+      head -c$MIDDLEBYTES </dev/urandom
+      head -c$LASTBYTES </dev/zero
+    ) >$f
+  done
+}
+
+reset_teststate
+makefiles
+
+defaultfirst="-firstbytessize 64"
+defaultlast="-lastbytessize 64"
+
+# with no checksum, we should falsely believe the files are equal
+# shellcheck disable=SC2086
+$rdfind -checksum none $defaultfirst $defaultlast a* b* \
+  | grep "files that are not unique" >output.log
+verify [ "$(cat output.log)" = "It seems like you have 2 files that are not unique" ]
+
+# if we set the first bytes size to be very large, we will detect it
+# shellcheck disable=SC2086
+$rdfind -checksum none -firstbytessize $((FIRSTBYTES + MIDDLEBYTES)) $defaultlast a* b* \
+  | grep "files that are not unique" >output.log
+verify [ "$(cat output.log)" = "It seems like you have 0 files that are not unique" ]
+
+# if we set the last bytes size to be very large, we will also detect it
+# shellcheck disable=SC2086
+$rdfind -checksum none $defaultfirst -lastbytessize $((MIDDLEBYTES + LASTBYTES)) a* b* \
+  | grep "files that are not unique" >output.log
+verify [ "$(cat output.log)" = "It seems like you have 0 files that are not unique" ]
+
+dbgecho "all is good for the skip first bytes step check!"
