Added large-data example

jbednar · jbednar · commit 6182a177c5fc · 2016-02-09T11:39:26.000-06:00
diff --git a/examples/plotting_problems.ipynb b/examples/plotting_problems.ipynb
@@ -41,7 +41,7 @@
    "source": [
     "## Overplotting\n",
     "\n",
-    "Let's consider plotting data that comes from two categories, here plotted in blue and red as A and B below.  When the two categories are overlaid, the result can be very different depending on which one is plotted first:"
+    "Let's consider plotting data that comes from two categories, here plotted in blue and red as **A** and **B** below.  When the two categories are overlaid, the result can be very different depending on which one is plotted first:"
    ]
   },
   {
@@ -67,7 +67,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Plots C and D shown the same distribution of points, yet they give a very different impression of which category is more common, which can lead to incorrect decisions based on this data.  Actually, both are equally common in this case.  The cause for this problem is simply occlusion:"
+    "Plots **C** and **D** show the same distribution of points, yet they give a very different impression of which category is more common, which can lead to incorrect decisions based on this data.  Actually, both are equally common in this case.  The cause for this problem is simply occlusion:"
    ]
   },
   {
@@ -91,7 +91,7 @@
     "\n",
     "## Saturation\n",
     "\n",
-    "You can reduce problems with overplotting by using transparency, via the alpha parameter provided in most plotting programs.  E.g. if alpha is 0.1, full brightness will be achieved only when 10 points overlap, reducing the effects of plot ordering but making it harder to see individual points:"
+    "You can reduce problems with overplotting by using transparency, via the alpha (opacity) parameter provided in most plotting programs.  E.g. if alpha is 0.1, full brightness will be achieved only when 10 points overlap, reducing the effects of plot ordering but making it harder to see individual points:"
    ]
   },
   {
@@ -110,7 +110,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here C and D look fairly similar (as they should, since the distributions are identical), but there are still a few locations that have reached **saturation**, a problem that will occur when more than 10 points overlap.  With multiple categories as here, saturation leads to overplotting problems, because only the last 10 points plotted will affect the final color.  With a single category, saturation simply obscures differences in density.  For instance, 10, 20, and 2000 points overlapping will all look the same visually, for alpha=0.1.\n",
+    "Here **C** and **D** look fairly similar (as they should, since the distributions are identical), but there are still a few locations that have reached **saturation**, a problem that will occur when more than 10 points overlap.  With multiple categories as here, saturation leads to overplotting problems, because only the last 10 points plotted will affect the final color.  With a single category, saturation simply obscures differences in density.  For instance, 10, 20, and 2000 points overlapping will all look the same visually, for alpha=0.1.\n",
     "\n",
     "The biggest problem with using alpha to avoid saturation is that the correct value depends on the dataset -- e.g. if there are more points overlapping, a manually adjusted alpha setting that worked well for a previous dataset will systematically misrepresent the new dataset:"
    ]
@@ -139,7 +139,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here C and D again look very different, yet represent the same distributions.  The correct alpha also depends on the dot size, because that affects the amount of overlap. With smaller dots, C and D look more similar, but the dots are now difficult to see because they are too transparent for this size:"
+    "Here **C** and **D** again look very different, yet represent the same distributions.  The correct alpha also depends on the dot size, because that affects the amount of overlap. With smaller dots, **C** and **D** look more similar, but the dots are now difficult to see because they are too transparent for this size:"
    ]
   },
   {
@@ -162,7 +162,7 @@
     "\n",
     "## Binning problems\n",
     "\n",
-    "For large enough datasets, plotting every point as above is not always practical.  2D histograms visualized as heatmaps offer a practical way to visualized data compactly, and can also address issues like saturation directly (by effectively auto-ranging the alpha parameter based on the bin with the highest count).  Heatmaps can approximate a probability density function, averaging out noise or irrelevant variations to reveal an underlying distribution.\n",
+    "For large enough datasets, plotting every point as above is not always practical.  2D histograms visualized as heatmaps offer a practical way to visualize data compactly, and can also address issues like saturation directly (by effectively auto-ranging the alpha parameter based on the bin with the highest count).  Heatmaps can approximate a probability density function, averaging out noise or irrelevant variations to reveal an underlying distribution.\n",
     "\n",
     "Here, let's consider a sum of two normal distributions slightly offset from each other:"
    ]
@@ -175,20 +175,31 @@
    },
    "outputs": [],
    "source": [
-    "%%opts Image.Blues (cmap=\"Blues\") Image.Reds (cmap=\"Reds\") Image (cmap=\"Blues\") Image {+axiswise} Points (s=2)\n",
+    "%%opts Image (cmap=\"Blues\") Image {+axiswise} Points (s=2)\n",
     "\n",
-    "num=600\n",
-    "np.random.seed(42)\n",
-    "offset=1.5\n",
-    "blue_coords = (np.random.normal( offset,size=num), np.random.normal( offset*0,size=num))\n",
-    "red_coords  = (np.random.normal(-offset,size=num), np.random.normal(-offset*0,size=num))\n",
-    "merged = (np.hstack((blue_coords[0],red_coords[0])),np.hstack((blue_coords[1],red_coords[1])))\n",
+    "def distribution(num=100):\n",
+    "    np.random.seed(42)\n",
+    "    offset=1.5\n",
+    "    blue_coords = (np.random.normal( offset,size=num), np.random.normal( offset*0,size=num))\n",
+    "    red_coords  = (np.random.normal(-offset,size=num), np.random.normal(-offset*0,size=num))\n",
+    "    merged = (np.hstack((blue_coords[0],red_coords[0])),np.hstack((blue_coords[1],red_coords[1])))\n",
+    "    return merged\n",
     "\n",
     "def heatmap(coords,bins=10):\n",
     "    hist= np.histogram2d(coords[0], coords[1], bins=bins)\n",
     "    return hv.Image(hist[0][:,::-1].T)\n",
     "\n",
-    "(hv.Points(merged) + [heatmap(merged,bins) for bins in [8,10,20,100,1000]]).cols(3)"
+    "dist=distribution(num=600)\n",
+    "(hv.Points(dist) + [heatmap(dist,bins) for bins in [8,10,20,100,1000]]).cols(3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As you can see, the distribution looks very different depending on the number of bins used -- at some heatmap resolutions, the two underlying groups can be distinguished, but in others the shape is unclear. In plot **F** the data is only dimly visible, because the small pixel-sized bins make multiple counts per bin unlikely.  I.e., in **F** nearly all the bins look empty, but in fact many are non-empty; they just happen to contain fewer dots than the few bins where datapoints randomly overlap.  This **undersaturation** problem, where values falsely appear to be zero (most of the white pixels in **F**) because of plotting parameter settings, can hide data just as seriously as **oversaturation**, leading to incorrect conclusions about the shape of the data.  (The data is certainly not just a few dozen points as it appears in **F**!)  Clearly, the bin size is now another important parameter that needs manual adjustment.\n",
+    "\n",
+    "A key insight is that these problems vary with the amount of data:"
    ]
   },
   {
@@ -199,23 +210,37 @@
    },
    "outputs": [],
    "source": [
-    "hv.archive.export()"
+    "%%opts Image (cmap=\"Blues\") Image {+axiswise} Points (s=2)\n",
+    "\n",
+    "dist=distribution(num=60000)\n",
+    "(hv.Points(dist) + [heatmap(dist,bins) for bins in [8,10,20,100,1000]]).cols(3)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As you can see, the distribution looks very different depending on the number of bins used -- at some heatmap resolutions, the two underlying groups can be distinguished, but in others the shape is unclear. In plot F the data is only dimly visible at all, due to the small pixel-sized bins that make multiple counts per bin unlikely.  In F nearly all the bins look like they are empty, but in fact there are many non-empty bins, they just happen to have fewer dots than a few other random bins with high overlap between datapoints.  This **undersaturation** problem, where values falsely appear to be zero because of plotting parameter settings, can hide data just as seriously as **oversaturation**, leading to incorrect conclusions about the shape of the data.\n",
+    "As you can see, the points-based scatterplot (**A**) has gotten worse with more data -- overplotting now completely obscures the shape of the distribution.  But most of the heatmap approaches improve with the additional data, because the extra points lead to a better approximation of the underlying distribution.  Even **F** now conveys some information about the shape, though it's still suffering from undersaturation.\n",
     "\n",
-    "Clearly, the bin size is now an important parameter that needs manual adjustment.  Yet for truly large data, there's not usually any plot like A available where all datapoints can be seen -- how do you know when your bin size is appropriate for a given dataset?  Usually the answer is \"trial and error\", which is awkward and time consuming for large data.  The result is...\n",
+    "Clearly, for truly large data, a scatterplot like **A** will rarely be useful or practical, so how does one know when the right bin size has been chosen, or whether there is enough data to reveal the distribution (as in **F**)?  Usually the answer is \"trial and error\", which is awkward and time-consuming for large data.  The result is...\n",
     "\n",
     "## For big data, you don't know when the viz is lying\n",
     "\n",
     "I.e., visualization is supposed to help you explore and understand your data, but if your visualizations are systematically misrepresenting your data because of overplotting, saturation, undersaturation, and inappropriate binning, then you won't be able to discover the real qualities of your data and will be unable to make the right decisions.\n",
     "\n",
     "The [`datashader`](https://github.com/bokeh/datashader) library has been designed to overcome many of the above problems, by automatically calculating appropriate parameters based on the data itself, and by allowing interactive visualizations of even truly large datasets with millions or billions of data points so that their structure can be revealed."
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "hv.archive.export()"
+   ]
   }
  ],
  "metadata": {