Content Defined Chunking playground

This repository provides implementations of a small number of Content Defined Chunking (CDC) algorithms for the Go programming language. These implementations are intended for testing and benchmarking purposes.

This repository was created in response to Bazel remote-apis PR #282, where support for Content Defined Chunking is being considered for inclusion in the remote execution protocol that is used by tools like Bazel and Buildbarn.

MaxCDC: content defined chunking with lookahead

Algorithms like Rabin fingerprinting and FastCDC all perform a simple forward scan through the input data: the decision to introduce a cutting point between chunks is made at the moment the bytes immediately preceding it are hashed. This simple design has some limitations. For example:

  • If no cutting point is found before the maximum chunk size is reached, the algorithm is forced to make a cut at an undesirable offset. It will not be able to select a "close to optimal" cutting point.

  • When implemented trivially, the size of chunks is expected to follow a geometric distribution. This means that there is a relatively large spread in size between the smallest and largest chunks. For example, for FastCDC8KB the largest chunks can be 32 times as large as the smallest ones (2 KB vs 64 KB).

The MaxCDC algorithm attempts to address this by performing lookaheads. Instead of selecting cutting points on the fly, it always scans the input up to the maximum limit and only afterwards chooses the most desirable cutting point, namely the one at which the Gear hash has the highest value, hence the name MaxCDC.
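
As a rough illustration, the sketch below scans a full maximum-sized region and returns the chunk length at which the Gear hash was highest. It is a minimal, unoptimized sketch: the package and identifier names (cdcsketch, gearTable, maxCDCCut) are made up for illustration and do not necessarily match the API exposed by this repository.

```go
package cdcsketch

import "math/rand"

// gearTable holds 256 random 64-bit constants, as used by Gear-based
// rolling hashes such as the one in FastCDC.
var gearTable [256]uint64

func init() {
	r := rand.New(rand.NewSource(1))
	for i := range gearTable {
		gearTable[i] = r.Uint64()
	}
}

// maxCDCCut returns the length of the next chunk to cut from data, which is
// assumed to hold at least maxSize bytes. It scans all the way up to the
// maximum chunk size and only afterwards picks the length at which the Gear
// hash reached its highest value.
func maxCDCCut(data []byte, minSize, maxSize int) int {
	var h uint64
	bestLength, bestHash := maxSize, uint64(0)
	for i := 0; i < maxSize; i++ {
		h = (h << 1) + gearTable[data[i]]
		if length := i + 1; length >= minSize && h > bestHash {
			bestLength, bestHash = length, h
		}
	}
	return bestLength
}
```

With the parameters used in the benchmarks below, this would be invoked as maxCDCCut(data, 4096, 14785), producing chunks between 4,096 and 14,785 bytes in size.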

Runtime performance

Implementing MaxCDC trivially has the disadvantage that input data is hashed redundantly: the algorithm may process input up to the maximum limit and then select a cutting point close to the minimum limit, so any data between those two limits would be hashed again during the next iteration. To eliminate this overhead, we provide an optimized implementation that preserves potential future cutting points on a stack, allowing subsequent calls to reuse this information. The performance of this optimized implementation is nearly identical to that of plain FastCDC.
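
Continuing the sketch above, the fragment below shows the pruning rule that makes such a candidate stack work, under the assumption that the Gear hash at a given offset does not depend on where the previous chunk was cut: an older candidate whose hash does not exceed that of a newer one can be discarded, because every future search window containing the older offset also contains the newer one. This only illustrates the idea; the optimized implementation in this repository may differ in its details (for example in how ties are broken).

```go
// candidate records a potential future cutting point: its absolute offset in
// the input and the Gear hash value observed at that offset.
type candidate struct {
	offset int
	hash   uint64
}

// pushCandidate appends a newly scanned offset to the stack of potential
// cutting points, first discarding every older candidate whose hash does not
// exceed the new one. Such candidates can never be selected: any future
// search window that contains the older offset also contains the newer,
// higher-valued one. Ties are broken here in favor of the later offset.
func pushCandidate(stack []candidate, offset int, hash uint64) []candidate {
	for len(stack) > 0 && stack[len(stack)-1].hash <= hash {
		stack = stack[:len(stack)-1]
	}
	return append(stack, candidate{offset: offset, hash: hash})
}
```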

Deduplication performance

In order to validate the quality of the chunking performed by this algorithm, we have created uncompressed tarballs of 80 different versions of the Linux kernel (from v6.0 to v6.8, including all release candidates). Each of these tarballs is approximately 1.4 GB in size.

When chunking all of these tarballs with FastCDC8KB, we see that each tarball is split into about 145k chunks. When deduplicating chunks across all 80 versions, 383,093 chunks remain that have a total size of 3,872,754,501 bytes. Chunks thus have an average size of 10,109 bytes.

We then chunked the same tarballs using MaxCDC, using a minimum size of 4,096 bytes and a maximum size of 14,785 bytes. After deduplicating, this yielded 374,833 chunks having a total size of 3,790,013,152 bytes. The minimum and maximum chunk size were intentionally chosen so that the average chunk size was almost identical to that of FastCDC8KB, namely 10,111 bytes.

We therefore conclude that for this specific benchmark the MaxCDC-generated output consumes 2.14% less space than that of FastCDC8KB. Furthermore, the spread in chunk size is also far smaller when using MaxCDC (14,785 B / 4,096 B ≈ 3.61) than when using FastCDC8KB (64 KB / 2 KB = 32).
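
For reference, the 2.14% figure follows directly from the deduplicated totals above: 1 − 3,790,013,152 / 3,872,754,501 ≈ 0.0214.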

Tuning recommendations

Assume you use MaxCDC to chunk two similar, but not identical, files. Making the ratio between the minimum and maximum permitted chunk size too small leads to bad performance, because it causes the two streams of chunks to take longer to converge again after a differing region has been processed.

Conversely, making the ratio between the minimum and maximum permitted chunk size too large is also suboptimal. The reason is that large chunks have a lower probability of being deduplicated against others. This causes the average size of the chunks stored in a deduplicating data store to become higher than the average size produced by the chunking algorithm itself.

When chunking and deduplicating the Linux kernel source tarballs, we observed that for that specific data set the optimal ratio between the minimum and maximum chunk size was somewhere close to 4x. We therefore recommend that this ratio is used as a starting point.

Relationship to RDC FilterMax

Microsoft's Remote Differential Compression algorithm uses a content defined chunking algorithm named FilterMax. Just like MaxCDC, it attempts to insert cutting points at positions where the value of a rolling hash function is a local maximum. The main difference is that this is only checked within a small region that the algorithm calls the horizon. This results in a chunk size distribution that is geometric, similar to traditional Rabin fingerprinting implementations.
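
As a sketch of that difference, the helper below checks whether a position is a local maximum within a horizon rather than within an entire chunk. It continues the Go sketch above and operates on a precomputed slice of rolling hash values for simplicity; the function name and the tie-breaking are assumptions for illustration, not taken from the RDC specification.

```go
// isHorizonMaximum reports whether the rolling hash at position i is a local
// maximum within the surrounding horizon, i.e. greater than every hash value
// at most `horizon` positions before or after it.
func isHorizonMaximum(hashes []uint64, i, horizon int) bool {
	for j := i - horizon; j <= i+horizon; j++ {
		if j < 0 || j == i || j >= len(hashes) {
			continue
		}
		if hashes[j] >= hashes[i] {
			return false
		}
	}
	return true
}
```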

Some testing of this construct in combination with the Gear hash function was performed, using the same methodology as described above. Deduplicating yielded 398,967 unique chunks with a combined size of 4,031,959,354 bytes. This is 4.11% worse than FastCDC8KB and 6.38% worse than MaxCDC. The average chunk size was 10,105 bytes, which is similar to what was used for the previous tests.

RepMaxCDC: repeated application of MaxCDC

RepMaxCDC is a logical extension to MaxCDC. Namely, it behaves as if the MaxCDC algorithm is invoked repeatedly until chunks can no longer be decomposed. This reduces the ratio between the minimum and maximum chunk size to 2, which is optimal.

Like MaxCDC, RepMaxCDC takes a parameter that controls the amount of data that is read ahead. While MaxCDC uses it to control the maximum chunk size, in RepMaxCDC it only affects the quality of the chunking that is performed (i.e., it acts as a horizon size). Whereas MaxCDC performs poorly if the ratio between the maximum and minimum chunk size becomes too large, RepMaxCDC's horizon size can be increased freely without reducing quality, though beyond roughly 8 times the minimum chunk size there are diminishing returns.
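
One naive reading of this description, reusing the maxCDCCut helper sketched earlier, is a recursive splitter: keep cutting each piece at its highest Gear hash value until no piece can be split without producing a chunk smaller than the minimum size. This is only a sketch of the intended semantics; the simple and optimized implementations described below need not match it exactly. The horizon is assumed to be at least twice the minimum chunk size.

```go
// repMaxCDC splits data into chunks by recursively cutting each piece at its
// highest Gear hash value, until no piece can be decomposed further without
// producing a chunk smaller than minSize. horizon (assumed to be at least
// 2*minSize) bounds how far ahead each cut may look.
func repMaxCDC(data []byte, minSize, horizon int) [][]byte {
	if len(data) < 2*minSize {
		// Splitting would yield a piece smaller than minSize.
		return [][]byte{data}
	}
	// The cut must leave at least minSize bytes on either side, and is
	// additionally limited by the horizon.
	maxCut := len(data) - minSize
	if maxCut > horizon {
		maxCut = horizon
	}
	cut := maxCDCCut(data, minSize, maxCut)
	return append(repMaxCDC(data[:cut], minSize, horizon),
		repMaxCDC(data[cut:], minSize, horizon)...)
}
```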

It has been observed that RepMaxCDC provides a rate of deduplication that is indistinguishable from that of MaxCDC, but only under the condition that MaxCDC's chunk size ratio is chosen optimally. An advantage of RepMaxCDC over MaxCDC is therefore that less tuning is required.

Furthermore, for a given input it is also trivial to check whether it is already chunked, purely by looking at its size. This is a property that both FastCDC and MaxCDC lack. It makes it practical to integrate RepMaxCDC into existing Content Addressable Storage systems that were originally not designed with chunking in mind: existing storage APIs for uploading and downloading objects can unambiguously determine whether a request pertains to an individual chunk or to a file consisting of multiple chunks.
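
Assuming chunks produced by RepMaxCDC are always smaller than twice the minimum chunk size, that check could look as follows. The function name is made up for illustration, and the exact boundary condition is an assumption that depends on the implementation.

```go
// isSingleChunk reports whether a stored object of the given size must be an
// individual RepMaxCDC chunk rather than a file consisting of multiple
// chunks, under the assumption that chunks are always smaller than twice the
// minimum chunk size.
func isSingleChunk(size, minChunkSize int64) bool {
	return size < 2*minChunkSize
}
```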

The simple implementation of RepMaxCDC isn't a lot more complex than MaxCDC's. However, it tends to perform poorly due to repeated hashing of the input. The optimized implementation of RepMaxCDC addresses this, and provides the same throughput as FastCDC and MaxCDC.

The optimized implementation works by keeping track of all points in the input data at which the hash value exceeds all previously observed values, resulting in a staircase. Whenever the distance since the start of the last step becomes too long (i.e., it reaches the minimum chunk size), the algorithm knows it can select definitive cutting points up to the start of that step. The staircase may then be discarded.

Chunk size distribution

Below are some graphs showing the chunk size distribution of the algorithms discussed above.

FastCDC

[Graph: chunk size distribution for FastCDC]

FastCDC uses different thresholds for determining where to make cuts, depending on whether the expected chunk size of 8 KB is reached. It is important to note that the expected chunk size is not the average, nor the median.

You can see that FastCDC effectively tries to emulate a normal distribution by stitching two geometric distributions together.

MaxCDC

[Graph: chunk size distribution for MaxCDC]

With MaxCDC, the size distribution of chunks before deduplication is uniform. As smaller objects have a higher probability of getting deduplicated, you see that after deduplication larger objects occur somewhat more frequently.

RepMaxCDC

[Graph: chunk size distribution for RepMaxCDC]

With RepMaxCDC, the size distribution of chunks is clearly not uniform: the algorithm prefers creating chunks closer to the minimum chunk size. This can be explained by the fact that any chunk created by MaxCDC that is only slightly larger than twice the minimum chunk size will be broken up further by RepMaxCDC into two small chunks.

If one thinks of a file as the side of a long street, the minimum chunk size as the length of a car, and the start of each chunk as a position at which a car is parked, then this problem is analogous to Rényi's parking problem (1958). The expected density is known as Rényi's parking constant, which is approximately 0.7475979203. This seems to match our observations, where the mean chunk size (prior to deduplication) is the minimum chunk size multiplied by $1/0.74759\ldots \approx 1.3376\ldots$.

Future work

The MaxCDC and RepMaxCDC algorithms have at this point been stabilized, and their behavior will no longer be altered. However, this doesn't mean we can't add improved algorithms to this repository later on. Here are some topics we should try to explore:

  • The MaxCDC and RepMaxCDC algorithms hash data using h=(h<<1)+gear[b]. This was done to keep comparisons with FastCDC fair. However, the Linux kernel tarball tests show that h=(h>>1)+(gear[b]>>1) yields better results, as it better tolerates changes close to chunk boundaries (both update rules are sketched after this list). Should future versions of the algorithms use that function instead?

  • Maybe a high-quality rolling hash function should always evaluate to zero when all bytes in the window are equal to each other. That way, cutting points are only placed at such positions if all bytes in the chunk have the same value.

  • Is it possible to make SIMD-aware implementations of the MaxCDC and RepMaxCDC algorithms? If not, how can the algorithms be modified to accommodate this?

  • There exist CDC algorithms that don't depend on hashing at all, such as Asymmetric Extremum (AE). Could the scheme introduced by MaxCDC and RepMaxCDC be adjusted to work with such algorithms as well?
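
For concreteness, the two update rules from the first bullet look as follows in Go, reusing the gearTable from the earlier sketch. Which of them future versions should adopt is exactly the open question raised above.

```go
// gearShiftLeft is the Gear update currently used by the MaxCDC and
// RepMaxCDC implementations, matching the construction used by FastCDC.
func gearShiftLeft(h uint64, b byte) uint64 {
	return (h << 1) + gearTable[b]
}

// gearShiftRight is the alternative update that performed better on the
// Linux kernel tarball tests mentioned above.
func gearShiftRight(h uint64, b byte) uint64 {
	return (h >> 1) + (gearTable[b] >> 1)
}
```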
