Commit 89e8238

Doc updates (#436)
1 parent c171ea0 commit 89e8238

File tree

- docs/source/implementation.md
- docs/source/user-stories/large-zonal-stats.ipynb
- flox/core.py
- flox/xarray.py

4 files changed: +63 -7 lines changed


docs/source/implementation.md

Lines changed: 20 additions & 0 deletions
@@ -110,6 +110,26 @@ width: 100%
 
 This approach allows grouping by a dask array so group labels can be discovered at compute time, similar to `dask.dataframe.groupby`.
 
+### reindexing to a sparse array
+
+For large numbers of groups, we might be reducing to a very sparse array (e.g. [this issue](https://github.com/xarray-contrib/flox/issues/428)).
+
+To control memory, we can instruct flox to reindex the intermediate results to a `sparse.COO` array using:
+
+```python
+from flox import ReindexArrayType, ReindexStrategy
+
+ReindexStrategy(
+    # do not reindex to the full output grid at the blockwise aggregation stage
+    blockwise=False,
+    # when combining intermediate results after blockwise aggregation, reindex to the
+    # common grid using a sparse.COO array type
+    array_type=ReindexArrayType.SPARSE_COO,
+)
+```
+
+See [this user story](user-stories/large-zonal-stats) for more discussion.
+
 ### Example
 
 For example, consider `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
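As a side note, here is a hedged sketch of how this strategy plugs into a reduction. The array, the labels, and the assumption that `groupby_reduce`'s `reindex` argument accepts a `ReindexStrategy` are illustrative, based on the linked user story rather than on this diff:

```python
# Hedged sketch: pass the strategy via groupby_reduce's `reindex` argument.
# Shapes and labels below are made up for illustration.
import dask.array as da
import numpy as np

from flox import ReindexArrayType, ReindexStrategy
from flox.core import groupby_reduce

array = da.ones((100_000,), chunks=10_000)
by = np.random.randint(0, 50_000, size=100_000)  # many groups -> sparse result

result, groups = groupby_reduce(
    array,
    by,
    func="sum",
    reindex=ReindexStrategy(
        blockwise=False,
        array_type=ReindexArrayType.SPARSE_COO,
    ),
)
```

With `blockwise=False`, each block's intermediate stays small; the sparse array type only comes into play when intermediates are combined onto the common grid.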

docs/source/user-stories/large-zonal-stats.ipynb

Lines changed: 41 additions & 5 deletions
@@ -9,7 +9,7 @@
     "\n",
     "\"Zonal statistics\" spans a large range of problems. \n",
     "\n",
-    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
+    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
     "\n",
     "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
    ]
@@ -44,7 +44,7 @@
     "from flox.xarray import xarray_reduce\n",
     "\n",
     "sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
-    "chunksizes = {\"y\": 2_000, \"x\": 2_000}\n",
+    "chunksizes = {\"y\": 10_000, \"x\": 10_000}\n",
     "dims = (\"y\", \"x\")\n",
     "shape = tuple(sizes[d] for d in dims)\n",
     "chunks = tuple(chunksizes[d] for d in dims)\n",
@@ -124,13 +124,13 @@
    "id": "8",
    "metadata": {},
    "source": [
-    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n",
+    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000, which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n",
     "\n",
-    "We end up with one humoungous 56GB chunk, that is mostly empty.\n",
+    "We end up with one humongous 56GB chunk that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.0003%).\n",
     "\n",
     "## We can do better using a sparse array\n",
     "\n",
-    "Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
+    "Since the results are very sparse, we can instruct flox to not construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
     "\n",
     "```python\n",
     "ReindexStrategy(\n",
@@ -174,6 +174,42 @@
     "\n",
     "The computation runs smoothly with low memory."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11",
+   "metadata": {},
+   "source": [
+    "## Why\n",
+    "\n",
+    "To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has five 2D chunks. Each color represents a group, and each square represents a value of the array; clearly, there are different groups in each chunk. \n",
+    "\n",
+    "\n",
+    "### reindex = True\n",
+    "\n",
+    "<img src=\"../../diagrams/new-map-reduce-reindex-True-annotated.svg\" width=100%>\n",
+    "\n",
+    "First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n",
+    "\n",
+    "### reindex = False with sparse intermediates\n",
+    "\n",
+    "<img src=\"../../diagrams/new-map-reduce-reindex-False-annotated.svg\" width=100%>\n",
+    "\n",
+    "First, the grouped-reduction is run on each chunk independently. Conceptually, the result after this step is an array with differently sized chunks. \n",
+    "\n",
+    "Next, results from neighbouring blocks are concatenated and the reduction is run again. Before concatenation, the intermediate results are aligned to a common grid of group labels; this is termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing; otherwise we would eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n",
+    "\n",
+    "\n",
+    "## Can we do better?\n",
+    "\n",
+    "Yes. \n",
+    "\n",
+    "1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n",
+    "2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n",
+    "\n",
+    "Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n",
+    "\n"
+   ]
   },
  ],
  "metadata": {

flox/core.py

Lines changed: 1 addition & 1 deletion
@@ -2478,7 +2478,7 @@ def groupby_reduce(
         array's dtype.
     method : {"map-reduce", "blockwise", "cohorts"}, optional
         Note that this arg is chosen by default using heuristics.
-        Strategy for reduction of dask arrays only:
+        Strategy for reduction of dask arrays only.
         * ``"map-reduce"``:
           First apply the reduction blockwise on ``array``, then
           combine a few neighbouring blocks, apply the reduction.
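For reference, a hedged sketch of passing `method` explicitly to `groupby_reduce`; the data is made up, and by default flox chooses the method heuristically, as the docstring notes:

```python
# Illustrative only: explicit method choice for a dask reduction.
import dask.array as da
import numpy as np

from flox.core import groupby_reduce

array = da.ones((120,), chunks=12)
by = np.tile(np.arange(12), 10)  # periodic labels suit "cohorts" well

result, groups = groupby_reduce(array, by, func="mean", method="cohorts")
```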

flox/xarray.py

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ def xarray_reduce(
         DType for the output. Can be anything that is accepted by ``np.dtype``.
     method : {"map-reduce", "blockwise", "cohorts"}, optional
         Note that this arg is chosen by default using heuristics.
-        Strategy for reduction of dask arrays only:
+        Strategy for reduction of dask arrays only.
         * ``"map-reduce"``:
           First apply the reduction blockwise on ``array``, then
           combine a few neighbouring blocks, apply the reduction.
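And the same choice through the xarray interface, again a sketch with made-up data:

```python
# Illustrative only: the same `method` choice via flox.xarray.xarray_reduce.
import dask.array as da
import numpy as np
import xarray as xr

from flox.xarray import xarray_reduce

data = xr.DataArray(da.ones((120,), chunks=12), dims="time", name="data")
month = xr.DataArray(np.tile(np.arange(1, 13), 10), dims="time", name="month")

result = xarray_reduce(data, month, func="mean", method="map-reduce")
```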
