Commit 1847edd

Update javadocs to link to web site.
1 parent 02dc8f5 commit 1847edd

1 file changed

Lines changed: 12 additions & 33 deletions

src/main/java/com/yahoo/sketches/fdt/FdtSketch.java

@@ -17,41 +17,20 @@
 /**
  * A Frequent Distinct Tuples sketch.
  *
- * <p>Given a multiset of tuples with dimensions <i>{d1,d2, d3, ..., dN}</i>, and a primary subset of
- * dimensions <i>M &lt; N</i>, the task is to identify the combinations of <i>M</i> subset dimensions
- * that have the most frequent number of distinct combinations of the <i>N-M</i> non-primary
- * dimensions.
+ * <p>Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the
+ * IP addresses that have the most distinct User IDs. Or conversely, we would like to identify
+ * the User IDs that have the most distinct IP addresses. This is a common challenge in the
+ * analysis of big data and the FDT sketch helps solve this problem using probabilistic techniques.
  *
- * <p>We define a specific combination of the <i>M</i> primary dimensions as a <i>Primary Key</i>
- * and all combinations of the <i>M</i> primary dimensions as the set of <i>Primary Keys</i>.
+ * <p>More generally, given a multiset of tuples with dimensions <i>{d1,d2, d3, ..., dN}</i>,
+ * and a primary subset of dimensions <i>M &lt; N</i>, our task is to identify the combinations of
+ * <i>M</i> subset dimensions that have the most frequent number of distinct combinations of
+ * the <i>N-M</i> non-primary dimensions.
  *
- * <p>We define the set of all combinations of <i>N-M</i> non-primary dimensions associated with a
- * single primary key as a <i>Group</i>.
- *
- * <p>For example, assume <i>N=3, M=2</i>, where the set of Primary Keys are defined by
- * <i>{d1, d2}</i>. After populating the sketch with a stream of tuples all of size <i>N</i>,
- * we wish to identify the Primary Keys that have the most frequent number of distinct occurrences
- * of <i>{d3}</i>. Equivalently, we want to identify the Primary Keys with the largest Groups.
- *
- * <p>Alternatively, if we choose the Primary Key as <i>{d1}</i>, then we can identify the
- * <i>{d1}</i>s that have the largest groups of <i>{d2, d3}</i>. The choice of
- * which dimensions to choose for the Primary Keys is performed in a post-processing phase
- * after the sketch has been populated. Thus, multiple queries can be performed against the
- * populated sketch with different selections of Primary Keys.
- *
- * <p>As a simple concrete example, let's assume <i>N = 2</i> and let <i>d1 := IP address</i>, and
- * <i>d2 := User ID</i>.
- * Let's choose <i>{d1}</i> as the Primary Keys, then the sketch allows the identification of the
- * <i>IP addresses</i> that have the largest populations of distinct <i>User IDs</i>. Conversely,
- * if we choose <i>{d2}</i> as the Primary Keys, the sketch allows the identification of the
- * <i>User IDs</i> with the largest populations of distinct <i>IP addresses</i>.
- *
- * <p>An important caveat is that if the distribution is too flat, there may not be any
- * "most frequent" combinations of the primary keys above the threshold. Also, if one primary key
- * is too dominant, the sketch may be able to only report on the single most frequent primary
- * key combination, which means the possible existance of false negatives.
- *
- * <p>In this implementation the input tuples presented to the sketch are string arrays.
+ * <p>Please refer to the web page
+ * <a href="https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html">
+ * https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html</a> for a more
+ * complete discussion about this sketch.
  *
  * @author Lee Rhodes
  */
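
For intuition about the problem the Javadoc describes, the following is a minimal exact baseline, not the sketch itself and not the FdtSketch API. It assumes tuples arrive as string arrays and that a single column serves as the primary key; the class name `ExactDistinctGroups` and the method `topKeys` are hypothetical names chosen for illustration. It keeps every distinct tuple in memory, which is the cost a probabilistic sketch is meant to avoid.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical exact baseline for illustration; not part of the sketches library. */
public class ExactDistinctGroups {

  /**
   * For each value of the primary-key column, counts the distinct combinations of the
   * remaining columns, and returns (key, count) pairs sorted by descending count.
   * Assumes column values contain no tab characters (tab is used as a separator).
   */
  public static List<Map.Entry<String, Integer>> topKeys(List<String[]> tuples, int priKeyIndex) {
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] t : tuples) {
      StringBuilder rest = new StringBuilder();
      for (int i = 0; i < t.length; i++) {
        if (i != priKeyIndex) {
          rest.append(t[i]).append('\t');
        }
      }
      // A set per primary key removes duplicate non-primary combinations.
      groups.computeIfAbsent(t[priKeyIndex], k -> new HashSet<>()).add(rest.toString());
    }
    List<Map.Entry<String, Integer>> result = new ArrayList<>();
    for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
      result.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue().size()));
    }
    // Largest groups first; ties broken by key so the order is deterministic.
    result.sort((a, b) -> {
      int c = Integer.compare(b.getValue(), a.getValue());
      return c != 0 ? c : a.getKey().compareTo(b.getKey());
    });
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample stream of {IP address, User ID} pairs.
    List<String[]> stream = Arrays.asList(
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.1", "bob"},
        new String[] {"10.0.0.1", "alice"},
        new String[] {"10.0.0.1", "carol"},
        new String[] {"10.0.0.2", "alice"});
    System.out.println("By IP address: " + topKeys(stream, 0));
    System.out.println("By User ID:    " + topKeys(stream, 1));
  }
}
```

Passing index 0 ranks IP addresses by their number of distinct User IDs, and index 1 ranks User IDs by their number of distinct IP addresses, mirroring the two queries in the new Javadoc text.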
