
Commit 561378d

Add docs on method=None heuristics (#328)

* Update docs for `method=None` and spatial cohorts grouping. Closes #325
* Add docs on heuristics
* tweaks

1 parent: 3e0653f

6 files changed: +105 −29 lines
docs/diagrams/containment.png (12.8 KB)

docs/diagrams/nwm-cohorts.png (149 KB)

docs/source/implementation.md (+98 −18)
````diff
@@ -1,3 +1,12 @@
+---
+jupytext:
+  text_representation:
+    format_name: myst
+kernelspec:
+  display_name: Python 3
+  name: python3
+---
+
 (algorithms)=

 # Parallel Algorithms
````
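This frontmatter registers the page as a jupytext/MyST notebook, so the `{code-cell}` blocks introduced later in this diff are executed when the docs are built.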
````diff
@@ -7,10 +16,14 @@
 can be hard. Performance strongly depends on how the groups are distributed amongst the blocks of an array.

 `flox` implements 4 strategies for grouped reductions, each appropriate for a particular distribution of groups
-among the blocks of a dask array. Switch between the various strategies by passing `method`
-and/or `reindex` to either {py:func}`flox.groupby_reduce` or {py:func}`flox.xarray.xarray_reduce`.
+among the blocks of a dask array.
+
+```{tip}
+By default, `flox >= 0.9.0` will use [heuristics](method-heuristics) to choose a `method`.
+```

-Your options are:
+Switch between the various strategies by passing `method` and/or `reindex` to either {py:func}`flox.groupby_reduce`
+or {py:func}`flox.xarray.xarray_reduce`. Your options are:

 1. [`method="map-reduce"` with `reindex=False`](map-reindex-false)
 1. [`method="map-reduce"` with `reindex=True`](map-reindex-True)
````
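For concreteness, here is a minimal sketch (not part of this commit) of passing `method` through `flox.groupby_reduce`; the array size and chunking are illustrative assumptions.

```python
# Minimal sketch (not part of this commit): forcing a strategy via `method`.
# The array size and chunking are illustrative.
import dask.array as da
import numpy as np

import flox

labels = np.tile(np.arange(12), 12)                 # e.g. month-of-year labels
array = da.random.random((labels.size,), chunks=4)  # small chunks along time

# Let the heuristics pick (flox >= 0.9), or request a strategy explicitly:
result, groups = flox.groupby_reduce(array, labels, func="mean")
result, groups = flox.groupby_reduce(array, labels, func="mean", method="cohorts")
```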
````diff
@@ -20,18 +33,17 @@ Your options are:
 The most appropriate strategy for your problem will depend on the chunking of your dataset,
 and the distribution of group labels across those chunks.

-```{tip}
 Currently these strategies are implemented for dask. We would like to generalize to other parallel array types
 as appropriate (e.g. Ramba, cubed, arkouda). Please open an issue to discuss if you are interested.
-```

 (xarray-split)=

-## Background: Xarray's current GroupBy strategy
+## Background

-Xarray's current strategy is to find all unique group labels, index out each group,
-and then apply the reduction operation. Note that this only works if we know the group
-labels (i.e. you cannot use this strategy to group by a dask array).
+Without `flox` installed, Xarray's GroupBy strategy is to find all unique group labels,
+index out each group, and then apply the reduction operation. Note that this only works
+if we know the group labels (i.e. you cannot use this strategy to group by a dask array),
+and is basically an unvectorized, slow for-loop over groups.

 Schematically, this looks like (colors indicate group labels; separated groups of colors
 indicate different blocks of an array):
````
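The strategy described in this hunk boils down to a loop like the following sketch (illustrative only; not Xarray's actual code):

```python
# Sketch of the strategy described above: find the unique labels, index out
# each group, then reduce. Note the slow, unvectorized loop over groups.
import numpy as np

array = np.random.random((120,))
by = np.tile(np.arange(12), 10)  # group labels, known ahead of time

result = {label: array[by == label].mean() for label in np.unique(by)}
```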
````diff
@@ -208,23 +220,91 @@ One annoyance is that if the chunksize doesn't evenly divide the number of groups
 Consider our earlier example, `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
 ![cohorts-schematic](/../diagrams/cohorts-month-chunk4.png)

+```{code-cell}
+import flox
+import numpy as np
+
+labels = np.tile(np.arange(12), 12)
+chunks = (tuple(np.repeat(4, labels.size // 4)),)
+```
+
 `flox` can find these cohorts; below, it identifies the cohorts with labels `1,2,3,4`; `5,6,7,8`; and `9,10,11,12`.

-```python
->>> flox.find_group_cohorts(labels, array.chunks[-1]).values()
-[[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]] # 3 cohorts
+```{code-cell}
+preferred_method, chunks_cohorts = flox.core.find_group_cohorts(labels, chunks)
+chunks_cohorts.values()
 ```

 Now consider `chunksize=5`.
 ![cohorts-schematic](/../diagrams/cohorts-month-chunk5.png)

-```python
->>> flox.core.find_group_cohorts(labels, array.chunks[-1]).values()
-[[1], [2, 3], [4, 5], [6], [7, 8], [9, 10], [11], [12]] # 8 cohorts
+```{code-cell}
+labels = np.tile(np.arange(12), 12)
+chunks = (tuple(np.repeat(5, labels.size // 5)) + (4,),)
+preferred_method, chunks_cohorts = flox.core.find_group_cohorts(labels, chunks, merge=True)
+chunks_cohorts.values()
 ```

-We find 8 cohorts (note the original xarray strategy is equivalent to constructing 12 cohorts).
-In this case, it seems better to rechunk to a size of `4` along `time`.
-If you have ideas for improving this case, please open an issue.
+We find 7 cohorts (note the original xarray strategy is equivalent to constructing 12 cohorts).
+In this case, it seems better to rechunk to a size of `4` (or `6`) along `time`.
+
+Indeed, flox's heuristics think `"map-reduce"` is better for this case:
+
+```{code-cell}
+preferred_method
+```

 ### Example: spatial grouping
+
+Spatial groupings are particularly interesting for the `"cohorts"` strategy. Consider the problem of computing county-level
+aggregated statistics ([example blog post](https://xarray.dev/blog/flox)). There are ~3100 groups (counties), each marked by
+a different color. There are ~2300 chunks of size (350, 350) in (lat, lon). Many groups are confined to a small number of chunks:
+see the left panel, where the grid lines mark chunk boundaries.
+
+![cohorts-schematic](/../diagrams/nwm-cohorts.png)
+
+This seems like a good fit for `"cohorts"`: to get the answer for a county in the Northwest US, we needn't look at values
+for the Southwest US. How do we decide that automatically for the user?
+
+(method-heuristics)=
+
+## Heuristics
+
+`flox >= 0.9` will automatically choose `method` for you. To do so, we need to detect how each group
+label is distributed across the chunks of the array, and the degree to which the chunk distribution for a particular
+label overlaps with those of all other labels. The algorithm is as follows.
+
+1. First determine which labels are present in each chunk. The distribution of labels across chunks
+   is represented internally as a 2D boolean sparse array `S[chunks, labels]`. `S[i, j] = 1` when
+   label `j` is present in chunk `i`.
+
+1. Then we look for patterns in `S` to decide if we can use `"blockwise"`. The dark cells in the images below are `1` at that
+   position in `S`.
+   ![bitmask-patterns](/../diagrams/bitmask-patterns-perfect.png)
+
+   - On the left is a monthly grouping for a monthly time series with chunk size 4. There are 3 non-overlapping cohorts, so
+     `method="cohorts"` is perfect.
+   - On the right is a resampling problem of a daily time series with chunk size 10 to 5-daily frequency. Two 5-day periods
+     are exactly contained in one chunk, so `method="blockwise"` is perfect.
+
+1. The metric used for determining the degree of overlap between the chunks occupied by different labels is
+   [containment](http://ekzhu.com/datasketch/lshensemble.html). For each label `i` we can quickly compute containment against
+   all other labels `j` as `C = S.T @ S / number_chunks_per_label`. Here is `C` for a range of chunk sizes from 1 to 12, for computing
+   the monthly mean of a monthly time series problem [the title on each image is `(chunk size, sparsity)`].
+
+   ```python
+   chunks = np.arange(1, 13)
+   labels = np.tile(np.arange(1, 13), 30)
+   ```
+
+   ![cohorts-schematic](/../diagrams/containment.png)
+
+1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with
+   each other. We use _sparsity_ --- the number of non-zero elements in `C` divided by the number of elements in `C`, `C.nnz/C.size`.
+   When sparsity > 0.6, we choose `"map-reduce"` since there is decent overlap between (any) cohorts. Otherwise we use `"cohorts"`.
+
+Cool, isn't it?!
+
+For reference, here are `S` and `C` for the US county groupby problem:
+![county-bitmask](/../diagrams/counties-bitmask-containment.png)
+The sparsity of `C` is 0.006, so `"cohorts"` seems a good strategy here.
````
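The heuristic documented above can be sketched with dense NumPy arrays. flox's actual implementation is `flox.core.find_group_cohorts` and uses sparse arrays; the variable names below are illustrative.

```python
# Illustrative sketch of the heuristic above; flox's real implementation
# lives in flox.core.find_group_cohorts and uses sparse arrays.
import numpy as np

labels = np.tile(np.arange(12), 12)      # monthly labels for 12 years
chunks = np.repeat(4, labels.size // 4)  # chunk size 4 along time

# Step 1: bitmask S[chunks, labels]; S[i, j] = 1 when label j is in chunk i.
nlabels = labels.max() + 1
S = np.zeros((len(chunks), nlabels), dtype=int)
starts = np.concatenate([[0], np.cumsum(chunks)[:-1]])
for i, (start, size) in enumerate(zip(starts, chunks)):
    S[i, np.unique(labels[start : start + size])] = 1

# Step 3: containment C = S.T @ S / number_chunks_per_label.
chunks_per_label = S.sum(axis=0)
C = (S.T @ S) / chunks_per_label[np.newaxis, :]

# Step 4: the sparsity of C decides the method.
sparsity = (C > 0).sum() / C.size
method = "map-reduce" if sparsity > 0.6 else "cohorts"
print(sparsity, method)  # ~0.33 -> "cohorts" for this chunking
```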

flox/visualize.py (+7 −11)
````diff
@@ -139,35 +139,31 @@ def visualize_cohorts_2d(by, chunks):
     assert by.ndim == 2
     print("finding cohorts...")
     chunks = [chunks[ax] for ax in range(-by.ndim, 0)]
-    before_merged = find_group_cohorts(by, chunks, merge=False)
-    merged = find_group_cohorts(by, chunks, merge=True)
+    _, chunks_cohorts = find_group_cohorts(by, chunks)
     print("finished cohorts...")

     xticks = np.cumsum(chunks[-1])
     yticks = np.cumsum(chunks[-2])

-    f, ax = plt.subplots(1, 3, constrained_layout=True, sharex=False, sharey=False)
+    f, ax = plt.subplots(1, 2, constrained_layout=True, sharex=False, sharey=False)
     ax = ax.ravel()
     # ax[1].set_visible(False)
     # ax = ax[[0, 2, 3]]

     ngroups = len(_unique(by))
     h0 = ax[0].imshow(by, vmin=0, cmap=get_colormap(ngroups))
-    h1 = _visualize_cohorts(chunks, before_merged, ax=ax[1])
-    h2 = _visualize_cohorts(chunks, merged, ax=ax[2])
+    h2 = _visualize_cohorts(chunks, chunks_cohorts, ax=ax[1])

-    for axx in ax:
-        axx.grid(True, which="both")
+    ax[0].grid(True, which="both")
     for axx in ax[:1]:
         axx.set_xticks(xticks)
         axx.set_yticks(yticks)
-    for h, axx in zip([h0, h1, h2], ax):
+    for h, axx in zip([h0, h2], ax):
         f.colorbar(h, ax=axx, orientation="horizontal")

     ax[0].set_title(f"by: {ngroups} groups")
-    ax[1].set_title(f"{len(before_merged)} cohorts")
-    ax[2].set_title(f"{len(merged)} merged cohorts")
-    f.set_size_inches((12, 6))
+    ax[1].set_title(f"{len(chunks_cohorts)} cohorts")
+    f.set_size_inches((9, 6))


 def _visualize_cohorts(chunks, cohorts, ax=None):
````
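For context, a hypothetical usage sketch of the function being edited here; the 2D label array and dask-style `chunks` are made-up inputs, and matplotlib is required.

```python
# Hypothetical usage of visualize_cohorts_2d (assumed inputs; requires
# matplotlib). `chunks` uses dask's per-axis tuple-of-chunk-sizes layout.
import numpy as np
from flox.visualize import visualize_cohorts_2d

by = np.add.outer(np.arange(100) // 25, np.arange(100) // 25)  # 2D group labels
chunks = ((25,) * 4, (25,) * 4)  # chunk boundaries every 25 cells on each axis
visualize_cohorts_2d(by, chunks)
```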
