|
36 | 36 | import org.apache.datasketches.memory.WritableMemory;
|
37 | 37 |
|
38 | 38 | /**
|
39 |
| - * This is a high performance implementation of Philippe Flajolet’s HLL sketch but with |
40 |
| - * significantly improved error behavior. If the ONLY use case for sketching is counting |
41 |
| - * uniques and merging, the HLL sketch is a reasonable choice, although the highest |
42 |
| - * performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). |
43 |
| - * For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the |
44 |
| - * Theta sketch family for the same accuracy. |
| 39 | + * The HllSketch is a collection of compact implementations of Philippe Flajolet’s HyperLogLog (HLL)
| 40 | + * sketch, with significantly improved error behavior and excellent speed.
45 | 41 | *
|
46 |
| - * <p>This implementation offers three different types of HLL sketch, each with different |
47 |
| - * trade-offs with accuracy, space and performance. These types are specified with the |
48 |
| - * {@link TgtHllType} parameter. |
| 42 | + * <p>If the use case for sketching is primarily counting uniques and merging, the HLL sketch is the second-highest
| 43 | + * performing in the DataSketches library in terms of accuracy for storage space consumed
| 44 | + * (the newer CPC sketch developed by Kevin J. Lang now beats HLL in terms of accuracy per unit of space).
| 45 | + * For large counts, serialized HLL sketches can be 2 to 8 times smaller than DataSketches Theta sketches for the
| 46 | + * same accuracy, but Theta sketches can do set intersections and differences while HLL and CPC cannot.
| 47 | + * The CPC sketch and HLL share similar use cases, but the CPC sketch is about 30 to 40% smaller than the HLL sketch
| 48 | + * when serialized and larger than the HLL sketch when active in memory. Choose your weapons!</p>
49 | 49 | *
|
50 |
| - * <p>In terms of accuracy, all three types, for the same <i>lgConfigK</i>, have the same error |
51 |
| - * distribution as a function of <i>n</i>, the number of unique values fed to the sketch. |
52 |
| - * The configuration parameter <i>lgConfigK</i> is the log-base-2 of <i>K</i>, |
53 |
| - * where <i>K</i> is the number of buckets or slots for the sketch. |
| 50 | + * <p>A new HLL sketch is created with a simple constructor:</p> |
| 51 | + * <pre>{@code |
| 52 | + * int lgK = 12; //This is log-base2 of k, so k = 4096. lgK can be from 4 to 21 |
| 53 | + * HllSketch sketch = new HllSketch(lgK); //TgtHllType.HLL_4 is the default |
| 54 | + * //OR |
| 55 | + * HllSketch sketch = new HllSketch(lgK, TgtHllType.HLL_6); |
| 56 | + * //OR |
| 57 | + * HllSketch sketch = new HllSketch(lgK, TgtHllType.HLL_8); |
| 58 | + * }</pre> |
54 | 59 | *
|
55 |
| - * <p>During warmup, when the sketch has only received a small number of unique items |
56 |
| - * (up to about 10% of <i>K</i>), this implementation leverages a new class of estimator |
57 |
| - * algorithms with significantly better accuracy. |
| 60 | + * <p>All three sketch types are targets, in that each sketch starts out in a warm-up mode that is small in
| 61 | + * size and grows as needed until the full HLL array is allocated. HLL_4, HLL_6 and HLL_8 represent
| 62 | + * different levels of compression of the final HLL array, where 4, 6 and 8 refer to the number of bits each
| 63 | + * bucket of the HLL array is compressed down to.
| 64 | + * HLL_4 is the most compressed but is generally slower than the other two, especially during union operations.</p>
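A rough way to see what the 4, 6 and 8 bits per bucket mean for storage is to compute the size of the final HLL array alone. The sketch below is only back-of-envelope arithmetic under stated assumptions: the class and method names are hypothetical and are not part of the DataSketches API, and the figures ignore any header bytes and any extra storage that HLL_4 may need for bucket values that do not fit in 4 bits.

```java
/**
 * Back-of-envelope size of the final HLL array only.
 * Hypothetical helper for illustration; not part of the DataSketches library,
 * and not the library's exact in-memory or serialized size.
 */
final class HllArraySizeSketch {

  /** Bytes needed to hold 2^lgK buckets at bitsPerBucket bits each, rounded up. */
  static long hllArrayBytes(int lgK, int bitsPerBucket) {
    // The Javadoc above states that lgK can be from 4 to 21.
    if (lgK < 4 || lgK > 21) {
      throw new IllegalArgumentException("lgK must be in [4, 21]: " + lgK);
    }
    long k = 1L << lgK;                  // number of buckets
    return (k * bitsPerBucket + 7) / 8;  // total bits, rounded up to whole bytes
  }

  public static void main(String[] args) {
    int lgK = 12; // k = 4096, as in the constructor example
    for (int bits : new int[] {4, 6, 8}) {
      System.out.println("HLL_" + bits + ": " + hllArrayBytes(lgK, bits) + " bytes");
    }
  }
}
```

For lgK = 12 this gives 2048, 3072 and 4096 bytes for 4, 6 and 8 bits per bucket respectively, so under these assumptions the HLL_4 array is half the size of the HLL_8 array.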
58 | 65 | *
|
59 |
| - * <p>This sketch also offers the capability of operating off-heap. Given a WritableMemory object |
60 |
| - * created by the user, the sketch will perform all of its updates and internal phase transitions |
61 |
| - * in that object, which can actually reside either on-heap or off-heap based on how it is |
62 |
| - * configured. In large systems that must update and merge many millions of sketches, having the |
63 |
| - * sketch operate off-heap avoids the serialization and deserialization costs of moving sketches |
64 |
| - * to and from off-heap memory-mapped files, for example, and eliminates big garbage collection |
65 |
| - * delays. |
| 66 | + * <p>All three types share the same API. Updating the HllSketch is very simple:</p> |
| 67 | + * |
| 68 | + * <pre>{@code |
| 69 | + * long n = 1000000; |
| 70 | + * for (int i = 0; i < n; i++) { |
| 71 | + * sketch.update(i); |
| 72 | + * } |
| 73 | + * }</pre> |
| 74 | + * |
| 75 | + * <p>Each of the integers presented above is first hashed into a 128-bit hash value that is used by the sketch's
| 76 | + * HLL algorithm, so the above loop is essentially equivalent to using a random number generator initialized with a
| 77 | + * seed, so that the sequence is deterministic yet pseudo-random.</p>
| 78 | + * |
| 79 | + * <p>Obtaining the cardinality results from the sketch is also simple:</p> |
| 80 | + * |
| 81 | + * <pre>{@code |
| 82 | + * double estimate = sketch.getEstimate(); |
| 83 | + * double estUB = sketch.getUpperBound(1.0); //the upper bound at 1 standard deviation. |
| 84 | + * double estLB = sketch.getLowerBound(1.0); //the lower bound at 1 standard deviation. |
| 85 | + * //OR |
| 86 | + * System.out.println(sketch.toString()); //will output a summary of the sketch. |
| 87 | + * }</pre> |
| 88 | + * |
| 89 | + * <p>This produces console output something like the following:</p>
| 90 | + * |
| 91 | + * <pre>{@code |
| 92 | + * ### HLL SKETCH SUMMARY: |
| 93 | + * Log Config K : 12 |
| 94 | + * Hll Target : HLL_4 |
| 95 | + * Current Mode : HLL |
| 96 | + * LB : 977348.7024560181 |
| 97 | + * Estimate : 990116.6007366662 |
| 98 | + * UB : 1003222.5095308956 |
| 99 | + * OutOfOrder Flag: false |
| 100 | + * CurMin : 5 |
| 101 | + * NumAtCurMin : 1 |
| 102 | + * HipAccum : 990116.6007366662 |
| 103 | + * }</pre> |
66 | 104 | *
|
67 | 105 | * @author Lee Rhodes
|
68 | 106 | * @author Kevin Lang
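The sample summary in the Javadoc above can be sanity-checked with plain arithmetic. The sketch below only does that: it copies the LB, Estimate and UB values from the sample output, compares them with the true count of 1,000,000 from the update loop, and uses no DataSketches calls. The class and method names are hypothetical and exist only for this illustration.

```java
/**
 * Sanity check of the sample summary values quoted in the Javadoc.
 * Hypothetical helper for illustration; it does not call the DataSketches library.
 */
final class HllSummaryCheck {

  /** Signed relative error of an estimate against the true value. */
  static double relativeError(double estimate, double truth) {
    return (estimate - truth) / truth;
  }

  /** True if value lies in the closed interval [lb, ub]. */
  static boolean within(double lb, double value, double ub) {
    return lb <= value && value <= ub;
  }

  public static void main(String[] args) {
    double n = 1_000_000.0;            // distinct values fed to the sketch
    double lb = 977348.7024560181;     // LB from the sample output
    double est = 990116.6007366662;    // Estimate from the sample output
    double ub = 1003222.5095308956;    // UB from the sample output

    System.out.printf("relative error of estimate: %.4f%%%n", 100 * relativeError(est, n));
    System.out.printf("upper bound above estimate: %.4f%%%n", 100 * (ub - est) / est);
    System.out.printf("lower bound below estimate: %.4f%%%n", 100 * (est - lb) / est);
    System.out.println("true count within [LB, UB]: " + within(lb, n, ub));
  }
}
```

With these numbers the estimate is about 1% below the true count, the one-standard-deviation bounds sit roughly 1.3% on either side of the estimate, and the true count falls inside them.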
|