Deduplication
WXY edited this page Apr 29, 2020 · 9 revisions
The program performs uniqueness checking via two kinds of hashing.
- Binary - Using the murmur3 algorithm, the service is able to definitively determine whether a file is an exact copy of another in your collection.
- Perceptual - With the PHash algorithm, specifically the MIT-licensed variant from the OpenCV project, the program checks images against other images to see if they look similar to each other.
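The difference between the two checks can be sketched as follows. This is a minimal illustration, not the program's own code: Python's standard library has no murmur3, so BLAKE2b stands in for the binary hash, and the perceptual hashes are hard-coded 64-bit values rather than real PHash output. The names `binary_digest`, `hamming_distance`, `looks_similar`, and the `max_distance` threshold are all assumptions made for the example.

```python
import hashlib


def binary_digest(data: bytes) -> str:
    # Stand-in for murmur3 (not in the standard library): a 128-bit BLAKE2b
    # digest, used here only to illustrate exact-copy detection.
    return hashlib.blake2b(data, digest_size=16).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    # Count the bits that differ between two 64-bit perceptual hashes.
    return bin(hash_a ^ hash_b).count("1")


def looks_similar(hash_a: int, hash_b: int, max_distance: int = 8) -> bool:
    # max_distance is a hypothetical threshold, not the program's setting.
    return hamming_distance(hash_a, hash_b) <= max_distance


original = b"example image bytes"
exact_copy = bytes(original)
edited = original + b"!"

# Binary check: equal digests mean the bytes are the same.
print(binary_digest(original) == binary_digest(exact_copy))  # True
print(binary_digest(original) == binary_digest(edited))      # False

# Perceptual check: a small bit distance means the images look alike.
phash_a = 0x09B55510302157E8
phash_b = phash_a ^ 0b101  # same hash with two bits flipped
print(hamming_distance(phash_a, phash_b))  # 2
print(looks_similar(phash_a, phash_b))     # True
```

The binary check is an equality test, while the perceptual check is a distance test, which is why a file can pass the first and still fail the second.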
1. Uniqueness checking occurs just before files are imported into the mono-collection
2. If any file fails either check, the process aborts and a work-in-progress (WIP) file is created
   - The WIP file is the program's way of asking for instructions on how it should proceed
   - The file will be named after, and placed in the same location as, the affected file / directory
3. When the conflict is resolved, the operator is required to rename the WIP file, setting its extension to ".continue"
4. The service then tries to resume the import process. If there are problems, a new WIP file is created and the interaction repeats from step 2
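The rename that signals a resolved conflict could be scripted along these lines. This is only a sketch: the helper `mark_resolved` and the WIP file name `new_file.jpg.wip` are hypothetical, since the page does not state which extension the program gives a WIP file.

```python
import tempfile
from pathlib import Path


def mark_resolved(wip_file: Path) -> Path:
    # Swap the file's last extension for ".continue" so the service
    # knows it may resume the import.
    target = wip_file.with_suffix(".continue")
    wip_file.rename(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    wip = Path(tmp) / "new_file.jpg.wip"  # hypothetical WIP file name
    wip.write_text("[properties]\n")
    resumed = mark_resolved(wip)
    print(resumed.name)  # new_file.jpg.continue
```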
This section explains the different kinds of conflicts and the correct response for each.
The quoted blocks are examples of the WIP file contents one can expect to encounter for each kind of conflict.
[properties]
source_path = /home/user/in/new_file.jpg
source_type = file
[conflicts]
new_file.jpg = #checksum 09b55510302157e82d41d4be2d41d4be
- checksum - This means the file already exists in the collection
  - The correct response here is to drop the duplicate file. Alternatively, specify "combine", which will merge the group membership of the existing file into the group for the newly imported file
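A WIP file like the checksum example above is laid out like an INI file, so it can be inspected with Python's `configparser`. The sketch below assumes the format is exactly as quoted on this page; the function name `parse_wip` and the shape of its return value are choices made for the example, not part of the program.

```python
import configparser

WIP_TEXT = """\
[properties]
source_path = /home/user/in/new_file.jpg
source_type = file
[conflicts]
new_file.jpg = #checksum 09b55510302157e82d41d4be2d41d4be
"""


def parse_wip(text: str) -> dict:
    # Only "=" is a delimiter, so file names containing ":" stay intact.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep file names case-sensitive
    parser.read_string(text)

    conflicts = {}
    for name, value in parser["conflicts"].items():
        # "#checksum <digest>" -> ("checksum", "<digest>")
        kind, digest = value.lstrip("#").rsplit(None, 1)
        conflicts[name] = (kind, digest)
    return {"properties": dict(parser["properties"]), "conflicts": conflicts}


result = parse_wip(WIP_TEXT)
print(result["properties"]["source_path"])  # /home/user/in/new_file.jpg
print(result["conflicts"]["new_file.jpg"])
# ('checksum', '09b55510302157e82d41d4be2d41d4be')
```

Splitting from the right keeps multi-word conflict kinds such as "peer checksum" in one piece.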
[properties]
source_path = /home/user/in/new_files
source_type = directory
[conflicts]
new_file1.jpg = #peer checksum 09b55510302157e82d41d4be2d41d4be
new_file2.jpg = #peer checksum 09b55510302157e82d41d4be2d41d4be
- peer checksum - Two or more of the files being imported seem to be identical
  - This typically happens when importing directories or archives containing duplicate files
  - The operator is expected to delete one of the files and remove the entry for the other from the WIP file
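When a directory contains many duplicates, it can help to see which entries share a checksum before deciding which file to delete. The following is a rough sketch that works on the `[conflicts]` entries shown above, given as a plain dictionary; `group_peers` is a hypothetical helper, not something the program provides.

```python
from collections import defaultdict

conflicts = {
    "new_file1.jpg": "#peer checksum 09b55510302157e82d41d4be2d41d4be",
    "new_file2.jpg": "#peer checksum 09b55510302157e82d41d4be2d41d4be",
}


def group_peers(entries: dict) -> dict:
    # Map each checksum to the files that share it.
    groups = defaultdict(list)
    for name, value in entries.items():
        digest = value.rsplit(None, 1)[-1]  # last token is the checksum
        groups[digest].append(name)
    # Only checksums shared by two or more files are peer conflicts.
    return {d: sorted(names) for d, names in groups.items() if len(names) > 1}


for digest, names in group_peers(conflicts).items():
    print(digest, "->", names)
# 09b55510302157e82d41d4be2d41d4be -> ['new_file1.jpg', 'new_file2.jpg']
```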
[properties]
source_path = /home/user/in/new_file.jpg
source_type = file
[conflicts]
new_file.jpg = #perceptual 09b55510302157e8
- perceptual - The file appears to be visually similar to a file that already exists in the mono-collection
  - The file has already passed the binary uniqueness check
  - If the visual difference is significant to the operator, this conflict may be dismissed by specifying the "ignore" action