A simple Python script to find duplicate files within a directory. The function takes a directory as its argument and returns the duplicate files.
- Build image: docker build -t find-duplicate-files .
- Run image: docker run -v <path-directory-to-mount>:/test find-duplicate-files /test
- Group files with the same size
- Compare the files in each group by repeatedly checking the MD5 checksum of 1 MB chunks
- Print each set of duplicate files on a separate line
- Multithreading for each group of files with the same size
- Files that don't have the same size are not checked for duplicates
- Multiprocessing combined with multithreading
- For larger files the recursion depth can grow large and crash the program
- Dynamic chunk size based on the size of the files
- Duplicate files must have the same size
- The MD5 hex digests of a thousand different files are assumed not to collide
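One way to make the chunk size track the file size, as the point above suggests, is a small helper that keeps the number of hashing rounds roughly constant, which also bounds recursion depth for large files. This is a hypothetical sketch; `target_chunks` and the bounds are illustrative assumptions, not values from the repo.

```python
def chunk_size_for(file_size,
                   target_chunks=64,          # assumed: aim for ~64 rounds per file
                   minimum=1024 * 1024,       # never go below 1 MB
                   maximum=64 * 1024 * 1024): # never go above 64 MB
    """Pick a chunk size proportional to the file size so the number of
    read/compare rounds stays roughly constant regardless of file size."""
    size = max(minimum, file_size // target_chunks)
    return min(size, maximum)
```

With a fixed 1 MB chunk, a 10 GB file needs ~10,000 rounds; with this helper it needs about `target_chunks` rounds instead.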