Commit 89e8238

Doc updates (#436)
1 parent c171ea0 commit 89e8238

File tree

- docs/source/implementation.md
- docs/source/user-stories/large-zonal-stats.ipynb
- flox/core.py
- flox/xarray.py

4 files changed: +63 -7 lines changed


docs/source/implementation.md

Lines changed: 20 additions & 0 deletions
@@ -110,6 +110,26 @@ width: 100%
 
 This approach allows grouping by a dask array so group labels can be discovered at compute time, similar to `dask.dataframe.groupby`.
 
+### reindexing to a sparse array
+
+For large numbers of groups, we might be reducing to a very sparse array (e.g. [this issue](https://github.com/xarray-contrib/flox/issues/428)).
+
+To control memory, we can instruct flox to reindex the intermediate results to a `sparse.COO` array using:
+
+```python
+from flox import ReindexArrayType, ReindexStrategy
+
+ReindexStrategy(
+    # do not reindex to the full output grid at the blockwise aggregation stage
+    blockwise=False,
+    # when combining intermediate results after blockwise aggregation, reindex to the
+    # common grid using a sparse.COO array type
+    array_type=ReindexArrayType.SPARSE_COO,
+)
+```
+
+See [this user story](user-stories/large-zonal-stats) for more discussion.
+
 ### Example
 
 For example, consider `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
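As a side note, here is a hedged sketch of how this strategy plugs into a reduction. The array, the labels, and the assumption that `groupby_reduce`'s `reindex` argument accepts a `ReindexStrategy` are illustrative, based on the linked user story rather than on this diff:

```python
# Hedged sketch: pass the strategy via groupby_reduce's `reindex` argument.
# Shapes and labels below are made up for illustration.
import dask.array as da
import numpy as np

from flox import ReindexArrayType, ReindexStrategy
from flox.core import groupby_reduce

array = da.ones((100_000,), chunks=10_000)
by = np.random.randint(0, 50_000, size=100_000)  # many groups -> sparse result

result, groups = groupby_reduce(
    array,
    by,
    func="sum",
    reindex=ReindexStrategy(
        blockwise=False,
        array_type=ReindexArrayType.SPARSE_COO,
    ),
)
```

With `blockwise=False`, each block's intermediate stays small; the sparse array type only comes into play when intermediates are combined onto the common grid.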

docs/source/user-stories/large-zonal-stats.ipynb

Lines changed: 41 additions & 5 deletions
@@ -9,7 +9,7 @@
     "\n",
     "\"Zonal statistics\" spans a large range of problems. \n",
     "\n",
-    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
+    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
     "\n",
     "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
    ]
@@ -44,7 +44,7 @@
     "from flox.xarray import xarray_reduce\n",
     "\n",
     "sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
-    "chunksizes = {\"y\": 2_000, \"x\": 2_000}\n",
+    "chunksizes = {\"y\": 10_000, \"x\": 10_000}\n",
     "dims = (\"y\", \"x\")\n",
     "shape = tuple(sizes[d] for d in dims)\n",
     "chunks = tuple(chunksizes[d] for d in dims)\n",
@@ -124,13 +124,13 @@
    "id": "8",
    "metadata": {},
    "source": [
-    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n",
+    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000, which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n",
     "\n",
-    "We end up with one humoungous 56GB chunk, that is mostly empty.\n",
+    "We end up with one humongous 56GB chunk that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.0003%).\n",
     "\n",
     "## We can do better using a sparse array\n",
     "\n",
-    "Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
+    "Since the results are very sparse, we can instruct flox to not construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
     "\n",
     "```python\n",
     "ReindexStrategy(\n",
@@ -174,6 +174,42 @@
     "\n",
     "The computation runs smoothly with low memory."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11",
+   "metadata": {},
+   "source": [
+    "## Why\n",
+    "\n",
+    "To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has five 2D chunks. Each color represents a group, and each square represents a value of the array; clearly, there are different groups in each chunk. \n",
+    "\n",
+    "\n",
+    "### reindex = True\n",
+    "\n",
+    "<img src=\"../../diagrams/new-map-reduce-reindex-True-annotated.svg\" width=100%>\n",
+    "\n",
+    "First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n",
+    "\n",
+    "### reindex = False with sparse intermediates\n",
+    "\n",
+    "<img src=\"../../diagrams/new-map-reduce-reindex-False-annotated.svg\" width=100%>\n",
+    "\n",
+    "First, the grouped-reduction is run on each chunk independently. Conceptually, the result after this step is an array with differently sized chunks. \n",
+    "\n",
+    "Next, results from neighbouring blocks are concatenated and the reduction is run again. Before concatenation, the intermediate results are aligned to a common grid of group labels; this is termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing; otherwise we would eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n",
+    "\n",
+    "\n",
+    "## Can we do better?\n",
+    "\n",
+    "Yes. \n",
+    "\n",
+    "1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n",
+    "2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n",
+    "\n",
+    "Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n",
+    "\n"
+   ]
   },
  ],
  "metadata": {

flox/core.py

Lines changed: 1 addition & 1 deletion
@@ -2478,7 +2478,7 @@ def groupby_reduce(
         array's dtype.
     method : {"map-reduce", "blockwise", "cohorts"}, optional
         Note that this arg is chosen by default using heuristics.
-        Strategy for reduction of dask arrays only:
+        Strategy for reduction of dask arrays only.
         * ``"map-reduce"``:
           First apply the reduction blockwise on ``array``, then
           combine a few neighbouring blocks, apply the reduction.
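For reference, a hedged sketch of passing `method` explicitly to `groupby_reduce`; the data is made up, and by default flox chooses the method heuristically, as the docstring notes:

```python
# Illustrative only: explicit method choice for a dask reduction.
import dask.array as da
import numpy as np

from flox.core import groupby_reduce

array = da.ones((120,), chunks=12)
by = np.tile(np.arange(12), 10)  # periodic labels suit "cohorts" well

result, groups = groupby_reduce(array, by, func="mean", method="cohorts")
```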

flox/xarray.py

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ def xarray_reduce(
         DType for the output. Can be anything that is accepted by ``np.dtype``.
     method : {"map-reduce", "blockwise", "cohorts"}, optional
         Note that this arg is chosen by default using heuristics.
-        Strategy for reduction of dask arrays only:
+        Strategy for reduction of dask arrays only.
         * ``"map-reduce"``:
           First apply the reduction blockwise on ``array``, then
           combine a few neighbouring blocks, apply the reduction.
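And the same choice through the xarray interface, again a sketch with made-up data:

```python
# Illustrative only: the same `method` choice via flox.xarray.xarray_reduce.
import dask.array as da
import numpy as np
import xarray as xr

from flox.xarray import xarray_reduce

data = xr.DataArray(da.ones((120,), chunks=12), dims="time", name="data")
month = xr.DataArray(np.tile(np.arange(1, 13), 10), dims="time", name="month")

result = xarray_reduce(data, month, func="mean", method="map-reduce")
```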
