Update javadocs to link to web site.

leerho · leerho · commit 1847eddd37be · 2019-04-25T18:46:13.000-07:00
diff --git a/src/main/java/com/yahoo/sketches/fdt/FdtSketch.java b/src/main/java/com/yahoo/sketches/fdt/FdtSketch.java
@@ -17,41 +17,20 @@
 /**
  * A Frequent Distinct Tuples sketch.
  *
- * <p>Given a multiset of tuples with dimensions <i>{d1,d2, d3, ..., dN}</i>, and a primary subset of
- * dimensions <i>M &lt; N</i>, the task is to identify the combinations of <i>M</i> subset dimensions
- * that have the most frequent number of distinct combinations of the <i>N-M</i> non-primary
- * dimensions.
+ * <p>Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the
+ * IP addresses that have the most distinct User IDs. Conversely, we may want to identify
+ * the User IDs that have the most distinct IP addresses. This is a common challenge in the
+ * analysis of big data, and the FDT sketch helps solve it using probabilistic techniques.
  *
- * <p>We define a specific combination of the <i>M</i> primary dimensions as a <i>Primary Key</i>
- * and all combinations of the <i>M</i> primary dimensions as the set of <i>Primary Keys</i>.
+ * <p>More generally, given a multiset of tuples with dimensions <i>{d1, d2, d3, ..., dN}</i>,
+ * and a primary subset of <i>M &lt; N</i> dimensions, our task is to identify the combinations of
+ * the <i>M</i> primary dimensions that have the largest number of distinct combinations of
+ * the <i>N-M</i> non-primary dimensions.
  *
- * <p>We define the set of all combinations of <i>N-M</i> non-primary dimensions associated with a
- * single primary key as a <i>Group</i>.
- *
- * <p>For example, assume <i>N=3, M=2</i>, where the set of Primary Keys are defined by
- * <i>{d1, d2}</i>. After populating the sketch with a stream of tuples all of size <i>N</i>,
- * we wish to identify the Primary Keys that have the most frequent number of distinct occurrences
- * of <i>{d3}</i>. Equivalently, we want to identify the Primary Keys with the largest Groups.
- *
- * <p>Alternatively, if we choose the Primary Key as <i>{d1}</i>, then we can identify the
- * <i>{d1}</i>s that have the largest groups of <i>{d2, d3}</i>. The choice of
- * which dimensions to choose for the Primary Keys is performed in a post-processing phase
- * after the sketch has been populated. Thus, multiple queries can be performed against the
- * populated sketch with different selections of Primary Keys.
- *
- * <p>As a simple concrete example, let's assume <i>N = 2</i> and let <i>d1 := IP address</i>, and
- * <i>d2 := User ID</i>.
- * Let's choose <i>{d1}</i> as the Primary Keys, then the sketch allows the identification of the
- * <i>IP addresses</i> that have the largest populations of distinct <i>User IDs</i>. Conversely,
- * if we choose <i>{d2}</i> as the Primary Keys, the sketch allows the identification of the
- * <i>User IDs</i> with the largest populations of distinct <i>IP addresses</i>.
- *
- * <p>An important caveat is that if the distribution is too flat, there may not be any
- * "most frequent" combinations of the primary keys above the threshold. Also, if one primary key
- * is too dominant, the sketch may be able to only report on the single most frequent primary
- * key combination, which means the possible existence of false negatives.
- *
- * <p>In this implementation the input tuples presented to the sketch are string arrays.
+ * <p>Please refer to the web page
+ * <a href="https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html">
+ * https://datasketches.github.io/docs/Frequency/FrequentDistinctTuplesSketch.html</a> for a more
+ * complete discussion about this sketch.
  *
  * @author Lee Rhodes
  */