
Deduplication

WXY edited this page Apr 29, 2020 · 9 revisions

The program performs uniqueness checking via two kinds of hashing.

  • Binary - Using the murmur3 algorithm, the program determines whether a file is an exact copy of another file in your collection.
  • Perceptual - Using the PHash algorithm, specifically the MIT-licensed variant from the OpenCV project, the program checks whether an image looks similar to other images in the collection.
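
To illustrate the perceptual check, the sketch below compares two hex-encoded 64-bit hashes by counting the bits in which they differ (the Hamming distance), which is a common way to compare perceptual hashes. The function names and the max_distance threshold are assumptions made for this example and are not taken from the program.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def looks_similar(hash_a: str, hash_b: str, max_distance: int = 8) -> bool:
    """Treat two images as similar when few hash bits differ.

    max_distance is an illustrative threshold, not the program's actual setting.
    """
    return hamming_distance(hash_a, hash_b) <= max_distance
```

A smaller distance means the images are more alike; a distance of zero means the perceptual hashes are identical even if the files differ byte for byte.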

Operating Procedure

  1. Uniqueness checking occurs just before files are imported into the mono-collection.
  2. If any file fails either check, the process aborts and a work-in-progress (WIP) file is created.
    • The WIP file is the program's way of asking for instructions on how it should proceed.
    • The file is named after the affected file or directory and placed in the same location.
  3. Once the conflict is resolved, the operator must rename the WIP file so that its extension is ".continue".
  4. The program then tries to resume the import. If problems remain, a new WIP file is created and the interaction repeats from step 2.
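
The rename in step 3 can be scripted. The following is a minimal sketch that swaps the last extension of a WIP file for ".continue". The helper name mark_resolved is hypothetical, and the sketch assumes the WIP file has some extension of its own to replace; the actual WIP file naming is not specified on this page.

```python
from pathlib import Path


def mark_resolved(wip_path: Path) -> Path:
    """Rename a WIP file so that its extension becomes ".continue"."""
    target = wip_path.with_suffix(".continue")
    if target.exists():
        # Refuse to overwrite an earlier, not yet processed ".continue" file.
        raise FileExistsError(str(target))
    wip_path.rename(target)
    return target
```

Only call this after the WIP file has been edited to describe the resolution, since the rename is what signals the program to resume.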

Resolving Conflicts

This section describes each kind of conflict and the correct response to it.

The quoted blocks below are examples of the WIP file contents you can expect for each kind of conflict.

[properties]
source_path = /home/user/in/new_file.jpg
source_type = file

[conflicts]
new_file.jpg = #checksum 09b55510302157e82d41d4be2d41d4be
  • checksum - The file already exists in the collection.
    • The correct response is to drop the duplicate file, or to specify combine, which merges the group membership of the existing file into the group of the newly imported file.
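
Because the WIP file uses an INI-like layout, it can be inspected with Python's standard configparser module. This is a sketch under the assumption that the file always contains a [properties] and a [conflicts] section as in the examples here; the function name read_wip and the returned structure are illustrative only.

```python
import configparser


def read_wip(text: str):
    """Parse WIP file text into (properties, conflicts).

    conflicts maps each file name to a (kind, hash) pair,
    for example ("checksum", "09b5...").
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep file names case-sensitive
    parser.read_string(text)
    properties = dict(parser["properties"])
    conflicts = {}
    for name, value in parser["conflicts"].items():
        # Values look like "#checksum <hash>" or "#peer checksum <hash>".
        kind, _, digest = value.lstrip("#").rpartition(" ")
        conflicts[name] = (kind, digest)
    return properties, conflicts
```

With the default parser settings, a "#" only starts a comment at the beginning of a line, so the "#checksum" marker inside a value is preserved.
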
[properties]
source_path = /home/user/in/new_files
source_type = directory

[conflicts]
new_file1.jpg = #peer checksum 09b55510302157e82d41d4be2d41d4be
new_file2.jpg = #peer checksum 09b55510302157e82d41d4be2d41d4be
  • peer checksum - Two or more of the files being imported appear to be identical.
    • This typically happens when importing directories or archives that contain duplicate files.
    • The operator is expected to delete one of the files and remove the entry for the other from the WIP file
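
To find such duplicates before importing, a directory can be scanned for files with identical content. The sketch below groups files by a 128-bit digest. It uses hashlib.blake2b from the standard library as a stand-in, not the murmur3 algorithm the program uses, so the digests will not match those shown in a WIP file. The function name find_peer_duplicates is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_peer_duplicates(directory: Path):
    """Return lists of files under directory that share identical content."""
    groups = defaultdict(list)
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            # blake2b is a stand-in here; the program itself uses murmur3.
            digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned list holds one set of identical files; keeping one file per list and deleting the rest avoids the peer checksum conflict.
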
[properties]
source_path = /home/user/in/new_file.jpg
source_type = file

[conflicts]
new_file.jpg = #perceptual 09b55510302157e8
  • perceptual - The file appears visually similar to a file that already exists in the mono-collection.
    • The file has already passed the binary uniqueness check, so it is not a byte-for-byte duplicate.
    • If the visual difference is significant to the operator, the conflict may be dismissed by specifying the ignore action.
