
Commit 289fa2c

deploy: d40921a
1 parent e7c9045 commit 289fa2c

113 files changed

+16374
-0
lines changed

docs/5.2.0/.buildinfo

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
# Sphinx build info version 1
2+
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
3+
config: 86d4b171ba47c51d1a3d7f924e02560f
4+
tags: 645f666f9bcd5a90fca523b33c5a78b7
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
Compressed Probabilistic Counting (CPC)
2+
---------------------------------------
3+
A high-performance C++ implementation of the Compressed Probabilistic Counting (CPC) sketch.
4+
This is a unique-counting sketch that implements the Compressed Probabilistic Counting (CPC, a.k.a. FM85) algorithms developed by Kevin Lang in his paper
5+
`Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm <https://arxiv.org/abs/1708.06839>`_.
6+
This sketch is extremely space-efficient when serialized.
7+
In an apples-to-apples empirical comparison against compressed HyperLogLog sketches, this new algorithm simultaneously wins on both dimensions of the space/accuracy tradeoff and produces sketches that are smaller than the entropy of HLL, so no possible implementation of compressed HLL can match its space efficiency for a given accuracy. As described in the paper, this sketch implements a newly developed ICON estimator algorithm that survives unioning operations, whereas another well-known estimator, the Historical Inverse Probability (HIP) estimator, does not.
8+
The update speed of this sketch is quite fast and is comparable to that of HLL.
9+
The unioning (merging) capability of this sketch also allows for merging of sketches with different configurations of K.
10+
For additional security, this sketch can be configured with a user-specified hash seed.
11+
12+
13+
.. autoclass:: _datasketches.cpc_sketch
14+
:members:
15+
:undoc-members:
16+
:exclude-members: deserialize
17+
18+
.. rubric:: Static Methods:
19+
20+
.. automethod:: deserialize
21+
22+
.. rubric:: Non-static Methods:
23+
24+
.. automethod:: __init__
25+
26+
27+
.. autoclass:: _datasketches.cpc_union
28+
:members:
29+
:undoc-members:
30+
:exclude-members: deserialize
31+
32+
.. automethod:: __init__
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
HyperLogLog (HLL)
2+
-----------------
3+
This is a high-performance implementation of Philippe Flajolet's HLL sketch but with significantly improved error behavior.
4+
5+
If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.
6+
7+
This implementation offers three different types of HLL sketch, each with different trade-offs among accuracy, space, and performance.
8+
These types are specified with the target_hll_type parameter.
9+
10+
In terms of accuracy, all three types, for the same lg_config_k, have the same error distribution as a function of ``n``, the number of unique values fed to the sketch.
11+
The configuration parameter ``lg_config_k`` is the log-base-2 of ``k``, where ``k`` is the number of buckets or slots for the sketch.
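
To make the relationship between ``lg_config_k``, ``k``, and the three sketch types concrete, the following Python sketch computes the number of slots and the size of the bare slot array at the nominal bits per entry of each type. It is a back-of-the-envelope illustration, not library code: the helper names ``num_slots`` and ``raw_slot_bytes`` are hypothetical, and the figures ignore headers, exception tables, and any other overhead of the real in-memory or serialized forms.

```python
# Nominal bits per slot for each target type (HLL_4, HLL_6, HLL_8).
BITS_PER_ENTRY = {"HLL_4": 4, "HLL_6": 6, "HLL_8": 8}


def num_slots(lg_config_k: int) -> int:
    """Return k, the number of slots: k = 2 ** lg_config_k."""
    return 1 << lg_config_k


def raw_slot_bytes(lg_config_k: int, hll_type: str) -> int:
    """Size in bytes of the bare slot array (hypothetical helper, no overhead)."""
    bits = num_slots(lg_config_k) * BITS_PER_ENTRY[hll_type]
    return (bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for name in BITS_PER_ENTRY:
        print(name, num_slots(12), raw_slot_bytes(12, name))
```

Under these assumptions, raising ``lg_config_k`` by one doubles the slot array for every type, while moving between types at a fixed ``lg_config_k`` changes only the bits spent per slot.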
12+
13+
During warmup, when the sketch has only received a small number of unique items (up to about 10% of ``k``), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
14+
15+
16+
.. autoclass:: _datasketches.tgt_hll_type
17+
18+
.. autoattribute:: HLL_4
19+
:annotation: : 4 bits per entry
20+
21+
.. autoattribute:: HLL_6
22+
:annotation: : 6 bits per entry
23+
24+
.. autoattribute:: HLL_8
25+
:annotation: : 8 bits per entry
26+
27+
28+
.. autoclass:: _datasketches.hll_sketch
29+
:members:
30+
:undoc-members:
31+
:exclude-members: deserialize, get_max_updatable_serialization_bytes, get_rel_err
32+
33+
.. rubric:: Static Methods:
34+
35+
.. automethod:: deserialize
36+
.. automethod:: get_max_updatable_serialization_bytes
37+
.. automethod:: get_rel_err
38+
39+
.. rubric:: Non-static Methods:
40+
41+
.. automethod:: __init__
42+
43+
.. autoclass:: _datasketches.hll_union
44+
:members:
45+
:undoc-members:
46+
:exclude-members: get_rel_err
47+
48+
.. rubric:: Static Methods:
49+
50+
.. automethod:: get_rel_err
51+
52+
.. rubric:: Non-static Methods:
53+
54+
.. automethod:: __init__
55+
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
Distinct Counting
2+
=================
3+
4+
.. currentmodule:: datasketches
5+
6+
Distinct counting is one of the earliest tasks to which sketches were applied. The concept is simple:
7+
Provide an estimate of the number of unique elements in a set of data. One of the earliest solutions came
8+
from Flajolet and Martin in 1985 with their seminal work
9+
`Probabilistic Counting Algorithms for Data Base Applications <http://db.cs.berkeley.edu/cs286/papers/flajoletmartin-jcss1985.pdf>`_.
10+
11+
The DataSketches library offers several types of distinct counting sketches, each with different properties.
12+
13+
* :class:`hll_sketch`: HyperLogLog, a well-known sketch for distinct counting but no longer state-of-the-art.
14+
* :class:`cpc_sketch`: Provides a better accuracy-space trade-off than HLL, but with a somewhat larger footprint while in-memory.
15+
* :class:`theta_sketch`: Theta sketch, a type of k-minimum values sketch, which provides good performance with intersection and set difference operations.
16+
* :class:`tuple_sketch`: Tuple sketch, which is similar to a theta sketch but supports additional data stored with each key.
17+
18+
.. toctree::
19+
:maxdepth: 1
20+
21+
hyper_log_log
22+
cpc
23+
theta
24+
tuple
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
1+
Theta Sketch
2+
------------
3+
4+
.. currentmodule:: datasketches
5+
6+
Theta sketches are used for distinct counting.
7+
8+
The theta package contains the basic sketch classes that are members of the `Theta Sketch Framework <https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html>`_.
9+
There is a separate Tuple package for many of the sketches that are derived from the same algorithms defined in the Theta Sketch Framework paper.
10+
11+
The *Theta Sketch* is a space-efficient method for estimating cardinalities of sets.
12+
It can also easily handle set operations (such as union, intersection, difference) while maintaining good accuracy.
13+
Theta sketch is a practical variant of the K-Minimum Values sketch which avoids the need to sort the stored
14+
hash values on every insertion to the sketch.
15+
It has better error properties than the HyperLogLog sketch for set operations beyond the simple union.
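
As a rough illustration of the K-Minimum Values idea that the theta sketch builds on, the following pure-Python sketch keeps the ``k`` smallest hash values and estimates the distinct count from the ``k``-th smallest one. It is a teaching toy under stated assumptions, not the library's algorithm: the names ``hash01`` and ``kmv_estimate`` are hypothetical, SHA-256 stands in for the library's hash function, and a heap is used here where the real theta sketch uses its own update scheme.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int = 1024) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    retained = set()  # distinct hash values currently kept
    heap = []         # max-heap over the kept values (stored negated)
    for item in items:
        h = hash01(item)
        if h in retained:
            continue  # duplicate of a retained item: nothing to do
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:
            # New value beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: exact count
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

The intuition is that if ``n`` distinct hashes are spread uniformly over [0, 1), the ``k``-th smallest lands near ``k / n``, so inverting it recovers an estimate of ``n``. Larger ``k`` retains more values and tightens the estimate.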
16+
17+
Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.
18+
19+
Several `Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_similarity>`_
20+
measures can be computed between theta sketches with the :class:`theta_jaccard_similarity` class.
21+
22+
.. autoclass:: theta_sketch
23+
:members:
24+
:undoc-members:
25+
26+
.. autoclass:: update_theta_sketch
27+
:members:
28+
:undoc-members:
29+
30+
.. automethod:: __init__
31+
32+
33+
.. autoclass:: compact_theta_sketch
34+
:members:
35+
:undoc-members:
36+
:exclude-members: deserialize
37+
38+
.. rubric:: Static Methods:
39+
40+
.. automethod:: deserialize
41+
42+
.. rubric:: Non-static Methods:
43+
44+
.. automethod:: __init__
45+
46+
47+
.. autoclass:: theta_union
48+
:members:
49+
:undoc-members:
50+
51+
.. automethod:: __init__
52+
53+
54+
.. autoclass:: theta_intersection
55+
:members:
56+
:undoc-members:
57+
58+
.. automethod:: __init__
59+
60+
61+
.. autoclass:: theta_a_not_b
62+
:members:
63+
:undoc-members:
64+
65+
.. automethod:: __init__
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
Tuple Sketch
2+
------------
3+
4+
.. currentmodule:: datasketches
5+
6+
Tuple sketches are an extension of Theta sketches: they provide estimates of distinct counts and also
7+
allow arbitrary summaries to be kept with each retained key
8+
(for example, a count for every key). The use of a :class:`tuple_sketch` requires a :class:`TuplePolicy` which
9+
defines how summaries are created, updated, merged, or intersected. The library provides a few basic
10+
examples of :class:`TuplePolicy` implementations, but the right custom summary and policy can allow very
11+
complicated analysis to be performed quite easily.
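
The division of labor between a sketch and its policy can be illustrated with a small pure-Python toy. Everything here is hypothetical and is not the library's :class:`TuplePolicy` interface: ``CountPolicy`` and ``ToyTupleStore`` are made-up names, the method names are assumptions chosen to mirror the create/update/merge roles described above, and the toy keeps every key, whereas a real tuple sketch retains only a sample of keys.

```python
class CountPolicy:
    """Hypothetical policy: each key's summary is a running total."""

    def create_summary(self):
        return 0

    def update_summary(self, summary, update):
        return summary + update

    def merge_summaries(self, left, right):
        return left + right


class ToyTupleStore:
    """Keeps one summary per key and delegates all summary logic to the policy."""

    def __init__(self, policy):
        self.policy = policy
        self.summaries = {}

    def update(self, key, value):
        summary = self.summaries.get(key)
        if summary is None:
            summary = self.policy.create_summary()
        self.summaries[key] = self.policy.update_summary(summary, value)

    def merge(self, other):
        for key, summary in other.summaries.items():
            if key in self.summaries:
                self.summaries[key] = self.policy.merge_summaries(
                    self.summaries[key], summary
                )
            else:
                self.summaries[key] = summary
```

Swapping in a different policy (for example, one that tracks a maximum instead of a sum) changes what is stored per key without touching the container, which is the design point the policy abstraction serves.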
12+
13+
Set operations (union, intersection, A-not-B) are performed through the use of dedicated objects.
14+
15+
Several `Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_similarity>`_
16+
measures can be computed between tuple sketches with the :class:`tuple_jaccard_similarity` class.
17+
18+
.. note::
19+
Serializing and deserializing this sketch requires the use of a :class:`PyObjectSerDe`.
20+
21+
.. autoclass:: tuple_sketch
22+
:members:
23+
:undoc-members:
24+
25+
.. autoclass:: update_tuple_sketch
26+
:members:
27+
:undoc-members:
28+
29+
.. automethod:: __init__
30+
31+
32+
.. autoclass:: compact_tuple_sketch
33+
:members:
34+
:undoc-members:
35+
:exclude-members: deserialize
36+
37+
.. rubric:: Static Methods:
38+
39+
.. automethod:: deserialize
40+
41+
.. rubric:: Non-static Methods:
42+
43+
.. automethod:: __init__
44+
45+
46+
.. autoclass:: tuple_union
47+
:members:
48+
:undoc-members:
49+
50+
.. automethod:: __init__
51+
52+
53+
.. autoclass:: tuple_intersection
54+
:members:
55+
:undoc-members:
56+
57+
.. automethod:: __init__
58+
59+
60+
.. autoclass:: tuple_a_not_b
61+
:members:
62+
:undoc-members:
63+
64+
.. automethod:: __init__
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
CountMin Sketch
2+
---------------
3+
4+
The CountMin sketch, as described by Cormode and Muthukrishnan in
5+
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf,
6+
is used for approximate Frequency Estimation.
7+
For an item :math:`x` with frequency :math:`f_x`, the sketch provides an estimate, :math:`\hat{f_x}`,
8+
such that :math:`f_x \approx \hat{f_x}.`
9+
The sketch guarantees that :math:`f_x \le \hat{f_x}` and provides a probabilistic upper bound on :math:`\hat{f_x}` that depends on the size parameters.
10+
The sketch provides an estimate of the occurrence frequency for any queried item but, in contrast
11+
to the Frequent Items Sketch, this sketch does not provide a list of
12+
heavy hitters.
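
The one-sided guarantee described above can be seen in a minimal pure-Python CountMin table. This is a sketch under stated assumptions, not the library's implementation: ``ToyCountMin`` is a hypothetical class, SHA-256 salted with the row index stands in for the real hash functions, and only non-negative weights are considered.

```python
import hashlib


class ToyCountMin:
    """Hypothetical CountMin table: num_hashes rows of num_buckets counters."""

    def __init__(self, num_hashes: int = 4, num_buckets: int = 256):
        self.num_hashes = num_hashes
        self.num_buckets = num_buckets
        self.table = [[0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, row: int, item: str) -> int:
        # Salt the hash with the row index to get one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def update(self, item: str, weight: int = 1) -> None:
        # Add the weight to one counter in every row.
        for row in range(self.num_hashes):
            self.table[row][self._bucket(row, item)] += weight

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.num_hashes)
        )
```

In this toy, more buckets per row reduce how much collisions inflate each counter, and more rows reduce the chance that every row is inflated at once. The sketch answers point queries only; it keeps no record of which items were seen, which is why it cannot list heavy hitters by itself.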
13+
14+
.. currentmodule:: _datasketches
15+
16+
.. autoclass:: count_min_sketch
17+
:members:
18+
:undoc-members:
19+
:exclude-members: deserialize, suggest_num_buckets, suggest_num_hashes
20+
21+
.. rubric:: Static Methods:
22+
23+
.. automethod:: deserialize
24+
.. automethod:: suggest_num_buckets
25+
.. automethod:: suggest_num_hashes
26+
27+
.. rubric:: Non-static Methods:
