5 changes: 5 additions & 0 deletions ChangeLog
@@ -1,3 +1,8 @@
2021-04-21 Eoghan Murray
* change terminology to prefer 'exclude' over 'remove' when talking
about the file list, in case there's confusion between 'remove' and
'delete'
* as per above, rename option -removeidentinode to -excludeidentinode
2018-11-12 Paul Dreik <[email protected]>
* release of 1.4.1
* fixes build failure on 32 bit platforms
28 changes: 14 additions & 14 deletions README.md
@@ -43,13 +43,13 @@ Look for duplicate files in directory /home/pauls/bilder:
$ rdfind /home/pauls/bilder/
Now scanning "/home/pauls/bilder", found 3301 files.
Now have 3301 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size...removed 3 files
Excluded 0 files due to nonunique device and inode.
Now excluding files with zero size...excluded 3 files
Total size is 2861229059 bytes or 3 Gib
Now sorting on size:removed 3176 files due to unique sizes.122 files left.
Now eliminating candidates based on first bytes:removed 8 files.114 files left.
Now eliminating candidates based on last bytes:removed 12 files.102 files left.
Now eliminating candidates based on md5 checksum:removed 2 files.100 files left.
Now sorting on size:excluded 3176 files due to unique sizes.122 files left.
Now eliminating candidates based on first bytes:excluded 8 files.114 files left.
Now eliminating candidates based on last bytes:excluded 12 files.102 files left.
Now eliminating candidates based on md5 checksum:excluded 2 files.100 files left.
It seems like you have 100 files that are not unique
Totally, 24 Mib can be reduced.
Now making results file results.txt
@@ -77,15 +77,15 @@ Rdfind uses the following algorithm. If N is the number of files to search throu
2. For each argument, list the directory contents recursively and assign it to the file list. Assign a directory depth number, starting at 0 for every argument.
3. If the input argument is a file, add it to the file list.
4. Loop over the list, and find out the sizes of all files.
5. If flag -removeidentinode true: Remove items from the list which already are added, based on the combination of inode and device number. A group of files that are hardlinked to the same file are collapsed to one entry. Also see the comment on hardlinks under ”caveats below”!
6. Sort files on size. Remove files from the list, which have unique sizes.
5. If flag -excludeidentinode true: Exclude items already added, based on the combination of inode and device number. A group of files that are hardlinked to the same file are collapsed to one entry. Also see the comment on hardlinks under "caveats" below!
6. Sort files on size. Exclude files which have unique sizes.
7. Sort on device and inode(speeds up file reading). Read a few bytes from the beginning of each file (first bytes).
8. Remove files from list that have the same size but different first bytes.
8. Exclude files that have the same size but different first bytes.
9. Sort on device and inode(speeds up file reading). Read a few bytes from the end of each file (last bytes).
10. Remove files from list that have the same size but different last bytes.
10. Exclude files that have the same size but different last bytes.
11. Sort on device and inode(speeds up file reading). Perform a checksum calculation for each file.
12. Only keep files on the list with the same size and checksum. These are duplicates.
13. Sort list on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.
12. Exclude files whose combination of size and checksum is unique. The rest are duplicates.
13. Sort remaining duplicates on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.
14. If flag ”-makeresultsfile true”, then print results file (default).
15. If flag ”-deleteduplicates true”, then delete (unlink) duplicate files. Exit.
16. If flag ”-makesymlinks true”, then replace duplicates with a symbolic link to the original. Exit.
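The winnowing steps above (sizes, first bytes, last bytes, checksums) all follow the same pattern: compute a key for each file and exclude every file whose key is unique, since a file without a partner under the current key cannot be a duplicate. The following is an illustrative sketch of that pattern only, not rdfind's actual code; the `exclude_unique` helper is hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch (not rdfind's code): drop every file whose key
// (size, first bytes, checksum, ...) occurs only once in the list.
// Returns the number of excluded files.
template <typename T, typename Key>
std::size_t exclude_unique(std::vector<T>& files, Key key)
{
  std::map<decltype(key(files[0])), std::size_t> counts;
  for (const auto& f : files)
    ++counts[key(f)];
  const auto before = files.size();
  files.erase(std::remove_if(files.begin(), files.end(),
                             [&](const T& f) { return counts[key(f)] == 1; }),
              files.end());
  return before - files.size();
}
```

Applying this filter repeatedly with ever more expensive keys is what keeps the expensive checksum step small: each pass only sees files that survived all cheaper passes.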
@@ -153,7 +153,7 @@ Here is a small benchmark. Times are obtained from ”elapsed time” in the tim

### Caveats / Features

A group of hardlinked files to a single inode are collapsed to a single entry if `-removeidentinode true`. If you have two equal files (inodes) and two or more hardlinks for one or more of the files, the behaviour might not be what you think. Each group is collapsed to a single entry. That single entry will be hardlinked/symlinked/deleted depending on the options you pass to `rdfind`. This means that rdfind will detect and correct one file at a time. Running multiple times solves the situation. This has been discovered by a user who uses a ”hardlinks and rsync”-type of backup system. There are lots of such backup scripts around using that technique, Apple time machine also uses hardlinks. If a file is moved within the backuped tree, one gets a group of hardlinked files before the move and after the move. Running rdfind on the entire tree has to be done multiple times if -removeidentinode true. To understand the behaviour, here is an example demonstrating the behaviour:
A group of files hardlinked to a single inode is collapsed to a single entry if `-excludeidentinode true`. If you have two equal files (inodes) and two or more hardlinks for one or more of the files, the behaviour might not be what you expect. Each group is collapsed to a single entry, and that single entry will be hardlinked/symlinked/deleted depending on the options you pass to `rdfind`. This means that rdfind detects and corrects one file per group at a time; running it multiple times resolves the situation. This was discovered by a user with a "hardlinks and rsync"-style backup system. There are lots of backup scripts around using that technique, and Apple Time Machine also uses hardlinks. If a file is moved within the backed-up tree, one gets a group of hardlinked files from before the move and another from after the move. With `-excludeidentinode true`, rdfind has to be run on the entire tree multiple times. Here is an example demonstrating the behaviour:

$ echo abc>a
$ ln a a1
@@ -171,7 +171,7 @@ A group of hardlinked files to a single inode are collapsed to a single entry if

Everything is as expected.

$ rdfind -removeidentinode true -makehardlinks true ./a* ./b*
$ rdfind -excludeidentinode true -makehardlinks true ./a* ./b*
$ stat --format="name=%n inode=%i nhardlinks=%h" a* b*
name=a inode=58930 nhardlinks=4
name=a1 inode=58930 nhardlinks=4
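The collapsing behaviour can be sketched as keying each file by its (device, inode) pair and keeping one representative per key, which is why a tree containing several hardlink groups may need multiple rdfind runs. This is an illustrative sketch, not rdfind's code; the `Entry` struct and `collapse_hardlink_groups` helper are hypothetical:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch (not rdfind's code) of what -excludeidentinode true
// does: all hardlinks to the same inode collapse to a single entry, keyed
// by the (device, inode) pair. Returns the number of excluded entries.
struct Entry {
  int device;
  int inode;
};

std::size_t collapse_hardlink_groups(std::vector<Entry>& files)
{
  std::set<std::pair<int, int>> seen;
  const auto before = files.size();
  std::vector<Entry> kept;
  for (const auto& f : files)
    if (seen.insert({f.device, f.inode}).second) // true only for first sight
      kept.push_back(f);
  files = std::move(kept);
  return before - files.size();
}
```

Because only the surviving representative is ever relinked or deleted, the other links of its group still point at the old inode afterwards, and another run is needed to catch them.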
14 changes: 7 additions & 7 deletions Rdutil.cc
@@ -298,7 +298,7 @@ Rdutil::sort_on_depth_and_name(std::size_t index_of_first)
}

std::size_t
Rdutil::removeIdenticalInodes()
Rdutil::excludeIdenticalInodes()
{
// sort list on device and inode.
auto cmp = cmpDeviceInode;
@@ -319,7 +319,7 @@ Rdutil::removeIdenticalInodes()
}

std::size_t
Rdutil::removeUniqueSizes()
Rdutil::excludeUniqueSizes()
{
// sort list on size
auto cmp = cmpSize;
@@ -341,7 +341,7 @@ Rdutil::removeUniqueSizes()
}

std::size_t
Rdutil::removeUniqSizeAndBuffer()
Rdutil::excludeUniqSizeAndBuffer()
{
// sort list on size
const auto cmp = cmpSize;
@@ -420,7 +420,7 @@ std::size_t
Rdutil::cleanup()
{
const auto size_before = m_list.size();
auto it = std::remove_if(m_list.begin(), m_list.end(), [](const Fileinfo& A) {
return A.deleteflag();
});

@@ -432,17 +432,17 @@ Rdutil::cleanup()
}
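Note that `std::remove_if` in `cleanup()` is the standard C++ algorithm of the erase-remove idiom, not an rdfind identifier, so it is untouched by the option rename: it moves the elements to keep to the front and returns an iterator to the new logical end, which the following `erase` consumes. A standalone sketch of the idiom (the `Item` struct and `drop_flagged` helper are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical standalone sketch of the erase-remove idiom used in
// Rdutil::cleanup(): std::remove_if partitions the kept elements to the
// front and returns the new logical end; erase then shrinks the container.
// Returns the number of dropped elements.
struct Item {
  bool deleteflag;
};

std::size_t drop_flagged(std::vector<Item>& items)
{
  const auto before = items.size();
  items.erase(std::remove_if(items.begin(), items.end(),
                             [](const Item& a) { return a.deleteflag; }),
              items.end());
  return before - items.size();
}
```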
#if 0
std::size_t
Rdutil::remove_small_files(Fileinfo::filesizetype minsize)
Rdutil::exclude_small_files(Fileinfo::filesizetype minsize)
{
const auto size_before = m_list.size();
const auto begin = m_list.begin();
const auto end = m_list.end();
decltype(m_list.begin()) it;
if (minsize == 0) {
it =
std::remove_if(begin, end, [](const Fileinfo& A) { return A.isempty(); });
} else {
it = std::remove_if(begin, end, [=](const Fileinfo& A) {
return A.is_smaller_than(minsize);
});
}
20 changes: 10 additions & 10 deletions Rdutil.hh
@@ -44,21 +44,21 @@ public:
/**
* for each group of identical inodes, only keep the one with the highest
* rank.
* @return number of elements removed
* @return number of elements excluded
*/
std::size_t removeIdenticalInodes();
std::size_t excludeIdenticalInodes();

/**
* remove files with unique size from the list.
* exclude files with unique size from the list.
* @return
*/
std::size_t removeUniqueSizes();
std::size_t excludeUniqueSizes();

/**
* remove files with unique combination of size and buffer from the list.
* exclude files with unique combination of size and buffer from the list.
* @return
*/
std::size_t removeUniqSizeAndBuffer();
std::size_t excludeUniqSizeAndBuffer();

/**
* Assumes the list is already sorted on size, and all elements with the same
@@ -70,14 +70,14 @@ public:
*/
void markduplicates();

/// removes all items from the list, that have the deleteflag set to true.
/// excludes all items from the list that have the deleteflag set to true.
std::size_t cleanup();

/**
* Removes items with file size less than minsize
* @return the number of removed elements.
* Excludes items with file size less than minsize
* @return the number of excluded elements.
*/
std::size_t remove_small_files(Fileinfo::filesizetype minsize);
std::size_t exclude_small_files(Fileinfo::filesizetype minsize);

// read some bytes. note! destroys the order of the list.
// if lasttype is supplied, it does not reread files if they are shorter
4 changes: 2 additions & 2 deletions rdfind.1
@@ -72,8 +72,8 @@ is disabled.
.BR \-followsymlinks " " \fItrue\fR|\fIfalse\fR
Follow symlinks. Default is false.
.TP
.BR \-removeidentinode " " \fItrue\fR|\fIfalse\fR
Removes items found which have identical inode and device ID. Default
.BR \-excludeidentinode " " \fItrue\fR|\fIfalse\fR
Excludes items found which have identical inode and device ID. Default
is true.
.TP
.BR \-checksum " " \fImd5\fR|\fIsha1\fR|\fIsha256\fR
21 changes: 12 additions & 9 deletions rdfind.cc
@@ -58,7 +58,7 @@ usage()
<< " -maxsize N (N=0) ignores files with size N "
"bytes and larger (use 0 to disable this check).\n"
<< " -followsymlinks true |(false) follow symlinks\n"
<< " -removeidentinode (true)| false ignore files with nonunique "
<< " -excludeidentinode (true)| false ignore files with nonunique "
"device and inode\n"
<< " -checksum md5 |(sha1)| sha256\n"
<< " checksum type\n"
@@ -101,7 +101,7 @@ struct Options
bool deleteduplicates = false; // delete duplicate files
bool followsymlinks = false; // follow symlinks
bool dryrun = false; // only dryrun, dont destroy anything
bool remove_identical_inode = true; // remove files with identical inodes
bool exclude_identical_inode = true; // exclude files with identical inodes
bool usemd5 = false; // use md5 checksum to check for similarity
bool usesha1 = false; // use sha1 checksum to check for similarity
bool usesha256 = false; // use sha256 checksum to check for similarity
@@ -163,7 +163,10 @@ parseOptions(Parser& parser)
} else if (parser.try_parse_bool("-n")) {
o.dryrun = parser.get_parsed_bool();
} else if (parser.try_parse_bool("-removeidentinode")) {
o.remove_identical_inode = parser.get_parsed_bool();
// backwards compatibility
o.exclude_identical_inode = parser.get_parsed_bool();
} else if (parser.try_parse_bool("-excludeidentinode")) {
o.exclude_identical_inode = parser.get_parsed_bool();
} else if (parser.try_parse_bool("-deterministic")) {
o.deterministic = parser.get_parsed_bool();
} else if (parser.try_parse_string("-checksum")) {
@@ -334,17 +337,17 @@ main(int narg, const char* argv[])
// list.
gswd.markitems();

if (o.remove_identical_inode) {
// remove files with identical devices and inodes from the list
std::cout << dryruntext << "Removed " << gswd.removeIdenticalInodes()
if (o.exclude_identical_inode) {
// exclude files with identical devices and inodes from the list
std::cout << dryruntext << "Excluded " << gswd.excludeIdenticalInodes()
<< " files due to nonunique device and inode." << std::endl;
}

std::cout << dryruntext << "Total size is " << gswd.totalsizeinbytes()
<< " bytes or ";
gswd.totalsize(std::cout) << std::endl;

std::cout << "Removed " << gswd.removeUniqueSizes()
std::cout << "Excluded " << gswd.excludeUniqueSizes()
<< " files due to unique sizes from list. ";
std::cout << filelist.size() << " files left." << std::endl;

@@ -375,8 +378,8 @@ main(int narg, const char* argv[])
// read bytes (destroys the sorting, for disk reading efficiency)
gswd.fillwithbytes(it[0].first, it[-1].first, o.nsecsleep);

// remove non-duplicates
std::cout << "removed " << gswd.removeUniqSizeAndBuffer()
// exclude non-duplicates
std::cout << "excluded " << gswd.excludeUniqSizeAndBuffer()
<< " files from list. ";
std::cout << filelist.size() << " files left." << std::endl;
}
2 changes: 1 addition & 1 deletion testcases/checksum_speedtest.sh
@@ -23,7 +23,7 @@ fi

for checksumtype in md5 sha1 sha256; do
dbgecho "trying checksum $checksumtype"
time $rdfind -removeidentinode false -checksum $checksumtype speedtest/largefile1 speedtest/largefile2 > rdfind.out
time $rdfind -excludeidentinode false -checksum $checksumtype speedtest/largefile1 speedtest/largefile2 > rdfind.out
done

dbgecho "all is good in this test!"