|
17 | 17 | /** |
18 | 18 | * A Frequent Distinct Tuples sketch. |
19 | 19 | * |
20 | | - * <p>Given a multiset of tuples with dimensions <i>{d1,d2, d3, ..., dN}</i>, and a primary subset of |
21 | | - * dimensions <i>M < N</i>, the task is to identify the combinations of <i>M</i> subset dimensions |
22 | | - * that have the most frequent number of distinct combinations of the <i>N-M</i> non-primary |
23 | | - * dimensions. |
| 20 | + * <p>Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the |
| 21 | + * IP addresses that have the most distinct User IDs. Or conversely, we would like to identify |
| 22 | + * the User IDs that have the most distinct IP addresses. This is a common challenge in the |
| 23 | + * analysis of big data and the FDT sketch helps solve this problem using probabilistic techniques. |
24 | 24 | * |
25 | | - * <p>We define a specific combination of the <i>M</i> primary dimensions as a <i>Primary Key</i> |
26 | | - * and all combinations of the <i>M</i> primary dimensions as the set of <i>Primary Keys</i>. |
| 25 | + * <p>More generally, given a multiset of tuples with dimensions <i>{d1,d2, d3, ..., dN}</i>, |
| 26 | + * and a primary subset of dimensions <i>M < N</i>, our task is to identify the combinations of |
| 27 | + * <i>M</i> subset dimensions that have the most frequent number of distinct combinations of |
| 28 | + * the <i>N-M</i> non-primary dimensions. |
27 | 29 | * |
28 | | - * <p>We define the set of all combinations of <i>N-M</i> non-primary dimensions associated with a |
29 | | - * single primary key as a <i>Group</i>. |
30 | | - * |
31 | | - * <p>For example, assume <i>N=3, M=2</i>, where the set of Primary Keys are defined by |
32 | | - * <i>{d1, d2}</i>. After populating the sketch with a stream of tuples all of size <i>N</i>, |
33 | | - * we wish to identify the Primary Keys that have the most frequent number of distinct occurrences |
34 | | - * of <i>{d3}</i>. Equivalently, we want to identify the Primary Keys with the largest Groups. |
35 | | - * |
36 | | - * <p>Alternatively, if we choose the Primary Key as <i>{d1}</i>, then we can identify the |
37 | | - * <i>{d1}</i>s that have the largest groups of <i>{d2, d3}</i>. The choice of |
38 | | - * which dimensions to choose for the Primary Keys is performed in a post-processing phase |
39 | | - * after the sketch has been populated. Thus, multiple queries can be performed against the |
40 | | - * populated sketch with different selections of Primary Keys. |
41 | | - * |
42 | | - * <p>As a simple concrete example, let's assume <i>N = 2</i> and let <i>d1 := IP address</i>, and |
43 | | - * <i>d2 := User ID</i>. |
44 | | - * Let's choose <i>{d1}</i> as the Primary Keys, then the sketch allows the identification of the |
45 | | - * <i>IP addresses</i> that have the largest populations of distinct <i>User IDs</i>. Conversely, |
46 | | - * if we choose <i>{d2}</i> as the Primary Keys, the sketch allows the identification of the |
47 | | - * <i>User IDs</i> with the largest populations of distinct <i>IP addresses</i>. |
48 | | - * |
49 | | - * <p>An important caveat is that if the distribution is too flat, there may not be any |
50 | | - * "most frequent" combinations of the primary keys above the threshold. Also, if one primary key |
51 | | - * is too dominant, the sketch may be able to only report on the single most frequent primary |
52 | | - * key combination, which means the possible existance of false negatives. |
53 | | - * |
54 | | - * <p>In this implementation the input tuples presented to the sketch are string arrays. |
| 30 | + * <p>Please refer to the web page |
| 31 | + * <a href="https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html"> |
| 32 | + * https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html</a> for a more |
| 33 | + * complete discussion about this sketch. |
55 | 34 | * |
56 | 35 | * @author Lee Rhodes |
57 | 36 | */ |
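To make the problem statement in the Javadoc concrete, here is a minimal sketch that computes the answer exactly by brute force. It is not the FDT sketch and does not use the DataSketches API: it keeps every distinct tuple in memory, which is the cost the probabilistic sketch is meant to avoid. The class name `ExactDistinctGroups` and the methods `project`, `groupSizes`, and `topKeys` are hypothetical names chosen for this illustration. It assumes tuples are string arrays whose values contain no commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: an exact, memory-hungry baseline, not the FDT sketch.
final class ExactDistinctGroups {

  // Joins the selected dimensions of a tuple into a single key string.
  // Assumes the values themselves contain no commas.
  static String project(String[] tuple, int[] dims) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < dims.length; i++) {
      if (i > 0) { sb.append(','); }
      sb.append(tuple[dims[i]]);
    }
    return sb.toString();
  }

  // For each primary key, counts the distinct tuples that share it. With the
  // primary-key dimensions fixed, this equals the number of distinct
  // combinations of the remaining (non-primary) dimensions.
  static Map<String, Integer> groupSizes(List<String[]> tuples, int[] priKeyDims) {
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] t : tuples) {
      String priKey = project(t, priKeyDims);
      groups.computeIfAbsent(priKey, k -> new HashSet<>()).add(String.join(",", t));
    }
    Map<String, Integer> sizes = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
      sizes.put(e.getKey(), e.getValue().size());
    }
    return sizes;
  }

  // Returns up to 'limit' primary keys, largest group first; ties broken by key.
  static List<String> topKeys(Map<String, Integer> sizes, int limit) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(sizes.entrySet());
    entries.sort((a, b) -> {
      int bySize = Integer.compare(b.getValue(), a.getValue());
      return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
    });
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < Math.min(limit, entries.size()); i++) {
      keys.add(entries.get(i).getKey());
    }
    return keys;
  }

  public static void main(String[] args) {
    // Stream of {IP address, User ID} pairs; the third pair is a duplicate.
    List<String[]> stream = Arrays.asList(
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.1", "bob"},
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.2", "alice"},
        new String[] {"10.0.0.1", "carol"});

    // Primary key = IP address (dimension 0): which IPs have the most distinct users?
    System.out.println(topKeys(groupSizes(stream, new int[] {0}), 2));
    // Primary key = User ID (dimension 1): which users have the most distinct IPs?
    System.out.println(topKeys(groupSizes(stream, new int[] {1}), 2));
  }
}
```

The primary-key dimensions are passed at query time, so the same input can be queried both ways (by IP address or by User ID), mirroring the "or conversely" case in the new Javadoc text.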
|