Description
The problem is that these days, with huge hard disks being so cheap and so common, it is not unusual to forget about files you have already saved and copy them again into another folder, or to try to reorganize a lifetime of files in different ways.
As an example, take a 1 TB SSD I backed up for a friend of mine who inadvertently formatted it. It had 2 partitions: one where they kept files well organized in clearly named folders and files, and a second partition used as a "messy" storage area, full of the same data from old backups, which served as the source for organizing the data on the first partition.
I’ve tried using ckolivas/lrzip, which implements this kind of unlimited-range redundancy removal through its `-U` option: there's a noticeable difference when compressing big disk image backups with `zstd` versus `lrzip`. Of course `zstd` is faster, but removing redundancies through ckolivas/lrzip usually produces a smaller file in the end. For example, on a 127 GB disk image with 3 different partitions containing 3 different Windows versions, and with free space zeroed out:
- using `lrzip -U -l -p1` resulted in a 28.94 GB file
- using `zstd -T0 --long -19` resulted in a 33.66 GB file
I then tried its `-U -n` option (redundancy removal only, compression disabled) on the 1000.2 GB disk image I mentioned above: it produced a 732.4 GB file. Simply adding LZO compression with lrzip’s `-U -l` options resulted in a 695 GB file. Using zstd on the original image took many hours (~40) and only got it down to an 867.6 GB file.
I know I could simply use `lrzip -U -n` to remove the redundancies and then compress the output file with `zstd`, but I think this could be a very useful feature for zstd itself: I’m pretty sure its combination of high speed and high compression ratios makes it a very common choice for compressing disk images.
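A minimal sketch of that two-step workflow, assuming a 64-bit system, enough scratch space for the intermediate file, and purely illustrative file names:

```sh
# Step 1: long-range redundancy removal only, no back-end compression (lrzip -U -n).
lrzip -U -n -o disk.img.lrz disk.img

# Step 2: compress the de-duplicated intermediate with zstd's long-distance matching.
zstd -T0 --long -19 disk.img.lrz -o disk.img.lrz.zst
```

Restoring the image is just the reverse: `zstd -d` followed by `lrzip -d`.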
So I'd find it very useful if zstd implemented some sort of redundancy removal process:
- ckolivas/lrzip does it with a “sliding mmap” (I'm not sure I’ve fully understood the concept, but it should be a bit-by-bit redundancy removal process); it is quite slow, even if very effective
- another way that comes to mind could be some sort of magic-byte recognition algorithm, which could also remove redundancies created by deleted files (that would be faster, but it would be limited to non-fragmented files)
- another way could be to support common filesystems and reduce redundancy through some sort of “duplicate removal” process based on hashes. That might be even faster, but it would completely miss deleted files (see the sketch after this list).
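Purely to illustrate the hash-based idea (this is not how lrzip or zstd work today), here is a rough sketch that estimates how many fixed-size blocks of an image are exact duplicates, assuming GNU coreutils. The 4 MiB block size and the `disk.img` name are assumptions; a real implementation would have to record block references instead of just counting repeats, and would probably need content-defined chunking to cope with misaligned copies:

```sh
#!/bin/sh
# Hypothetical illustration only: count 4 MiB blocks of disk.img whose content
# appears more than once in the image (exact, block-aligned duplicates only).
BS=$((4 * 1024 * 1024))
SIZE=$(stat -c %s disk.img)
BLOCKS=$((SIZE / BS))

for i in $(seq 0 $((BLOCKS - 1))); do
    # Hash one block at a time; slow, but keeps the sketch self-contained.
    dd if=disk.img bs=$BS skip=$i count=1 2>/dev/null | sha256sum | cut -d' ' -f1
done | sort | uniq -c | awk '$1 > 1 { dup += $1 - 1 } END { print dup+0, "redundant blocks" }'
```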
I don’t really know whether any of this is feasible inside the current zstd framework, but I really think some sort of long-range redundancy removal should be part of it, considering its possible uses in commercial environments as well (virtual machine image compression, for example, is an area where this feature would surely be beneficial).
Thanks in advance, thanks for the great software and sorry for my pretty bad English, it’s not my mother tongue.