|
9 | 9 | "\n",
|
10 | 10 | "\"Zonal statistics\" spans a large range of problems. \n",
|
11 | 11 | "\n",
|
12 |
| - "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n", |
| 12 | + "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has a global extent on a 30m grid with shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n", |
13 | 13 | "\n",
|
14 | 14 | "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
|
15 | 15 | ]
|
|
44 | 44 | "from flox.xarray import xarray_reduce\n",
|
45 | 45 | "\n",
|
46 | 46 | "sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
|
47 |
| - "chunksizes = {\"y\": 2_000, \"x\": 2_000}\n", |
| 47 | + "chunksizes = {\"y\": 10_000, \"x\": 10_000}\n", |
48 | 48 | "dims = (\"y\", \"x\")\n",
|
49 | 49 | "shape = tuple(sizes[d] for d in dims)\n",
|
50 | 50 | "chunks = tuple(chunksizes[d] for d in dims)\n",
|
|
124 | 124 | "id": "8",
|
125 | 125 | "metadata": {},
|
126 | 126 | "source": [
|
127 |
| - "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n", |
| 127 | + "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000 which is much smaller than 23 x 5 x 7 x 248 x 86 x 854 = 14_662_360_160.\n", |
128 | 128 | "\n",
|
129 |
| - "We end up with one humoungous 56GB chunk, that is mostly empty.\n", |
| 129 | + "We end up with one humoungous 56GB chunk, that is mostly empty (sparsity ~ 48,000/14_662_360_160 ~ 0.2%).\n", |
130 | 130 | "\n",
|
131 | 131 | "## We can do better using a sparse array\n",
|
132 | 132 | "\n",
|
133 |
| - "Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n", |
| 133 | + "Since the results are very sparse, we can instruct flox to construct dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n", |
134 | 134 | "\n",
|
135 | 135 | "```python\n",
|
136 | 136 | "ReindexStrategy(\n",
|
|
174 | 174 | "\n",
|
175 | 175 | "The computation runs smoothly with low memory."
|
176 | 176 | ]
|
| 177 | + }, |
| 178 | + { |
| 179 | + "cell_type": "markdown", |
| 180 | + "id": "11", |
| 181 | + "metadata": {}, |
| 182 | + "source": [ |
| 183 | + "## Why\n", |
| 184 | + "\n", |
| 185 | + "To understand why you might do this, here is how flox runs reductions. In the images below, the `areas` array on the left has 5 2D chunks. Each color represents a group, each square represents a value of the array; clearly there are different groups in each chunk. \n", |
| 186 | + "\n", |
| 187 | + "\n", |
| 188 | + "### reindex = True\n", |
| 189 | + "\n", |
| 190 | + "<img src=\"../../diagrams/new-map-reduce-reindex-True-annotated.svg\" width=100%>\n", |
| 191 | + "\n", |
| 192 | + "First, the grouped-reduction is run on each chunk independently, and the results are constructed as _dense_ arrays on the full 23 x 5 x 7 x 248 x 86 x 854 output grid. This means that every chunk balloons to ~50GB. This method cannot work well.\n", |
| 193 | + "\n", |
| 194 | + "### reindex = False with sparse intermediates\n", |
| 195 | + "\n", |
| 196 | + "<img src=\"../../diagrams/new-map-reduce-reindex-False-annotated.svg\" width=100%>\n", |
| 197 | + "\n", |
| 198 | + "First, the grouped-reduction is run on each chunk independently. Conceptually the result after this step is an array with differently sized chunks. \n", |
| 199 | + "\n", |
| 200 | + "Next results from neighbouring blocks are concatenated and a reduction is run again. These results are first aligned or reindexed to a common grid of group labels, termed \"reindexing\". At this stage, we instruct flox to construct a _sparse array_ during reindexing, otherwise we will eventually end up constructing _dense_ reindexed arrays of shape 23 x 5 x 7 x 248 x 86 x 854.\n", |
| 201 | + "\n", |
| 202 | + "\n", |
| 203 | + "## Can we do better?\n", |
| 204 | + "\n", |
| 205 | + "Yes. \n", |
| 206 | + "\n", |
| 207 | + "1. Using the reindexing machinery to convert intermediates to sparse is a little bit hacky. A better option would be to aggregate directly to sparse arrays, potentially using a new `engine=\"sparse\"` ([issue](https://github.com/xarray-contrib/flox/issues/346)).\n", |
| 208 | + "2. The total number of GADM geometries for levels 0, 1, and 2 is ~48,000. A much more sensible solution would be to allow grouping by these _geometries_ directly. This would allow us to be smart about the reduction, by exploiting the ideas underlying the [`method=\"cohorts\"` strategy](../implementation.md#method-cohorts).\n", |
| 209 | + "\n", |
| 210 | + "Regardless, the ability to do such reindexing allows flox to scale to much larger grouper arrays than previously possible.\n", |
| 211 | + "\n" |
| 212 | + ] |
177 | 213 | }
|
178 | 214 | ],
|
179 | 215 | "metadata": {
|
|