Speedup scan speed for '--patch-from' via rolling hashes

IPFS can deduplicate blocks across different types of files. This works through content-defined chunking, where a rolling hash algorithm decides where each block ends.

In IPFS you can select either rabin or buzhash for this task. Rabin is fairly slow, but buzhash is quite fast.

A rolling hash would make it possible to 'prescan' both files, derive cut points from the content, and then run a fast cryptographic hash algorithm such as blake2b over the resulting chunks.
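To make the prescan idea concrete, here is a minimal Python sketch. It uses a simple polynomial rolling hash as a stand-in for rabin or buzhash, and every constant and name in it (`WINDOW`, `MASK`, `chunk_offsets`, `chunk_digests`, the size limits) is an assumption chosen for illustration, not something taken from zstd or IPFS.

```python
import hashlib

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero (assumed value)
BASE = 257
MOD = 1_000_000_007
# Weight of the byte that slides out of the window, precomputed once.
OUT_WEIGHT = pow(BASE, WINDOW - 1, MOD)


def chunk_offsets(data: bytes, min_size: int = 32, max_size: int = 1024):
    """Yield (offset, length) pairs whose boundaries depend on the content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte leaving the window.
            h = (h - data[i - WINDOW] * OUT_WEIGHT) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            yield start, size
            start = i + 1
    if start < len(data):
        # Whatever is left after the last cut point forms the final chunk.
        yield start, len(data) - start


def chunk_digests(data: bytes):
    """Return (offset, length, blake2b digest) for every chunk of data."""
    return [
        (off, length,
         hashlib.blake2b(data[off:off + length], digest_size=16).digest())
        for off, length in chunk_offsets(data)
    ]
```

Because the cut decision only looks at the last `WINDOW` bytes, an insertion near the start of a file should shift later boundaries rather than change them, which is what lets chunks from A and B line up again after an edit.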

I think both operations are much cheaper than pattern matching. Chunks whose hashes match on both sides (A and B) are known to be equal, so all pattern matching attempts inside those blocks can be skipped.

The first layer of patching would just generate length+offset+move triples, which copy the matching blocks from the original file into a sparse file as the first patching operation.
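A rough Python sketch of that first layer follows. It assumes each file has already been reduced to a list of `(offset, length, digest)` chunks, and it reads the triple as `(length, old_offset, new_offset)`, which is my interpretation rather than a defined format. The names `copy_ops` and `apply_copy_ops` are hypothetical, and a zero-filled `bytearray` stands in for the sparse file.

```python
def copy_ops(old_chunks, new_chunks):
    """Return (length, old_offset, new_offset) triples for shared chunks."""
    index = {}
    for off, length, digest in old_chunks:
        # Keep the first occurrence of each chunk in the original file.
        index.setdefault((digest, length), off)

    ops = []
    for off, length, digest in new_chunks:
        src = index.get((digest, length))
        if src is None:
            continue  # chunk only exists in the new file; left as a gap
        if ops:
            prev_len, prev_src, prev_dst = ops[-1]
            # Merge with the previous copy if it is contiguous on both sides.
            if prev_src + prev_len == src and prev_dst + prev_len == off:
                ops[-1] = (prev_len + length, prev_src, prev_dst)
                continue
        ops.append((length, src, off))
    return ops


def apply_copy_ops(old: bytes, ops, new_size: int) -> bytearray:
    """Copy the known-equal blocks into a zero-filled output buffer."""
    out = bytearray(new_size)
    for length, src, dst in ops:
        out[dst:dst + length] = old[src:src + length]
    return out
```

Merging adjacent copies keeps the list of triples short when long runs of chunks are unchanged, which is the common case for a patch between two similar files.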

Pattern matching could then be applied on top of that to fill in the remaining gaps of the output file.
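As a sketch of how the remaining work could be located, the Python function below lists the output ranges that no copy covers. It assumes copies are `(length, old_offset, new_offset)` triples, and the name `uncovered_ranges` is hypothetical. Only these ranges would then need to go through the regular pattern matcher.

```python
def uncovered_ranges(ops, new_size: int):
    """Return (offset, length) ranges of the output not covered by any copy."""
    gaps = []
    pos = 0
    # Walk the copies in output order and record the holes between them.
    for length, _src, dst in sorted(ops, key=lambda op: op[2]):
        if dst > pos:
            gaps.append((pos, dst - pos))
        pos = max(pos, dst + length)
    if pos < new_size:
        gaps.append((pos, new_size - pos))
    return gaps
```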

_Originally posted by @RubenKelevra in https://github.com/facebook/zstd/issues/2063#issuecomment-616705733_

Speedup scan speed for '--patch-from' via rolling hashes #2189
