Commit ab4a540

Describe the relationship with Microsoft's RDC FilterMax
1 parent b1a1baf commit ab4a540

1 file changed: +17 -0 lines changed

README.md

Lines changed: 17 additions & 0 deletions
@@ -92,3 +92,20 @@ When chunking and deduplicating the Linux kernel source tarballs, we
observed that for that specific data set the optimal ratio between the
minimum and maximum chunk size was somewhere close to 4x. We therefore
recommend that this ratio is used as a starting point.

### Relationship to RDC FilterMax

Microsoft's [Remote Differential Compression algorithm](https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdc)
uses a content-defined chunking algorithm named FilterMax. Just like
MaxCDC, it attempts to insert cutting points at positions where the
value of a rolling hash function is a local maximum. The main difference
is that this is only checked within a small region, which the algorithm
calls the horizon. This results in a chunk size distribution that is
geometric, similar to traditional Rabin fingerprinting implementations.
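
The horizon check can be illustrated with a minimal Python sketch. It is
not the MS-RDC reference algorithm or this repository's implementation:
the Gear table is generated from a fixed seed purely for illustration,
the names `gear_hashes` and `horizon_cut_points` are hypothetical, and
the tie-breaking rule (strictly greater on the left, greater or equal on
the right) and the skipping of positions without a full window are
assumptions.

```python
import random

# Hypothetical Gear table: 256 pseudo-random 64-bit values from a fixed seed.
# Real implementations ship their own table; this one is illustrative only.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def gear_hashes(data: bytes) -> list[int]:
    """Return the Gear rolling hash value after consuming each byte."""
    h = 0
    hashes = []
    for b in data:
        h = ((h << 1) + GEAR[b]) & MASK64
        hashes.append(h)
    return hashes


def horizon_cut_points(data: bytes, horizon: int) -> list[int]:
    """Return positions whose hash is a local maximum within `horizon`.

    Position i qualifies if its hash is strictly greater than the
    `horizon` hashes before it and greater than or equal to the `horizon`
    hashes after it (assumed tie-breaking). Positions without a full
    window on both sides are skipped.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    hashes = gear_hashes(data)
    cuts = []
    for i in range(horizon, len(hashes) - horizon):
        left = hashes[i - horizon:i]
        right = hashes[i + 1:i + 1 + horizon]
        if hashes[i] > max(left) and hashes[i] >= max(right):
            cuts.append(i)
    return cuts


# Example: chunk 20,000 pseudo-random bytes with a horizon of 64 positions.
sample = bytes(random.Random(1).getrandbits(8) for _ in range(20_000))
sample_cuts = horizon_cut_points(sample, horizon=64)
```

Under these rules two cutting points are always more than `horizon`
positions apart, and each decision depends only on the hash values
inside its own window, so an edit far away from a cutting point leaves
that cutting point unchanged.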

We tested FilterMax in combination with the Gear hash function, using
the same methodology as described above. Deduplication yielded 398,967
unique chunks with a combined size of 4,031,959,354 bytes. This is 4.11%
worse than FastCDC8KB and 6.38% worse than MaxCDC. The average chunk
size was 10,105 bytes, which is similar to what was used for the
previous tests.
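
As a sanity check, the reported average chunk size is consistent with
dividing the combined size by the number of unique chunks and rounding
down. The sketch below reproduces that arithmetic. The `percent_worse`
helper is a hypothetical name, and its definition (relative increase in
combined size over a baseline) is an assumption about how the
percentages above were computed.

```python
unique_chunks = 398_967
combined_size = 4_031_959_354  # bytes

# Integer division: 4,031,959,354 / 398,967 is roughly 10,105.997.
average_chunk_size = combined_size // unique_chunks


def percent_worse(size: int, baseline: int) -> float:
    """Assumed definition: relative size increase over a baseline, in percent."""
    return (size / baseline - 1) * 100
```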
