proposal: new distributed histogram type #168

@lgray

I've attached a notebook showing a proof-of-concept for a cluster-distributed histogram using dask-array that relies on the sparseness of per-chunk fills in dask to avoid materializing the histogram on any particular worker.
This implementation takes a number of shortcuts, fusing operations that would be spelled out more verbosely if this were built from dask's shuffle primitives.
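
To make the pattern concrete, here's a minimal hypothetical sketch (not the notebook's actual code; `fill_chunk` and all parameter values are illustrative): each chunk is histogrammed straight into a sparse COO array holding only the bins that chunk actually touched, and the per-chunk results are tree-reduced, so no worker ever materializes the dense histogram.

```python
import numpy as np
import sparse
import dask.array as da

# Hypothetical sketch, not the notebook's code: histogram each chunk into
# a sparse.COO holding only the bins it touched, then tree-reduce.

def fill_chunk(chunk, edges):
    shape = tuple(len(e) - 1 for e in edges)
    # bin index per dimension (clipping keeps edge values in range)
    idx = [
        np.clip(np.searchsorted(e, chunk[:, d], side="right") - 1, 0, len(e) - 2)
        for d, e in enumerate(edges)
    ]
    flat = np.ravel_multi_index(idx, shape)
    uniq, counts = np.unique(flat, return_counts=True)
    coo = sparse.COO(
        np.stack(np.unravel_index(uniq, shape)),
        counts.astype(np.float64),
        shape=shape,
    )
    return coo.reshape((1,) + shape)  # leading axis to reduce over

ndims, nbins = 3, 500
edges = [np.linspace(0.0, 1.0, nbins + 1)] * ndims
x = da.random.uniform(size=(10_000_000, ndims), chunks=(500_000, ndims))

per_chunk = da.map_blocks(
    fill_chunk, x, edges,
    dtype=np.float64,
    chunks=(1,) + (nbins,) * ndims,
    drop_axis=1,
    new_axis=list(range(1, ndims + 1)),
    meta=sparse.zeros((0,) * (ndims + 1)),
)
hist = per_chunk.sum(axis=0)  # tree reduction of sparse per-chunk fills
```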

Right now this only works for unweighted histograms without categories, but it's not too much of a stretch to extend it. The main trouble will be keeping the number of additional tasks created per category down.

I have scaled this to decent performance for O(billion)-bin histograms (checked up to 4D). It all fits on my laptop, fills in a reasonable amount of time compared to da.histogramdd, and uses significantly less memory than the present implementation of dask-histogram's boost backend. The histogram can then be reassembled on the client machine and quickly turned into boost-histogram or anything else suitable.
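
As an illustration of that last step, client-side reassembly could look something like this (a hedged sketch reusing the hypothetical `hist` and `edges` names from above, not the notebook's code):

```python
import boost_histogram as bh

# Pull the reduced result back to the client and pour it into boost-histogram.
h = bh.Histogram(*(bh.axis.Variable(e) for e in edges))
h[...] = hist.compute().todense()  # densify only once, on the client
```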

Here's the notebook with the proof of concept and validation against da.histogramdd:
distributed-histogram.ipynb.zip

Here are settings for scaling successfully to very large numbers of bins.
NOTE: the da.histogramdd check will be extremely sluggish with the following settings.
125M bins:

```python
ndims = 3
nbins = 500
hist_partitions = 50
n_hist_slices = 25
```

1B bins:

```python
ndims = 4
nbins = 178
hist_partitions = 100
n_hist_slices = 10
```

These settings will run pleasantly on a modern laptop. You're more or less limited by system memory for the output histogram (at 8 bytes per float64 bin, the 1B-bin result is roughly 8 GB once assembled densely).

@nsmith- @holzman @mapsacosta @Nanoemc @douglasdavis @martindurant
Thoughts on how to improve this or integrate it with the existing dask-histogram would be greatly appreciated.

Jotting down my own thoughts concerning moving forward:
For weighted filling we need to keep track of a tuple of (filled_bins, sum_weights, sum_weights_squared). That's not exactly congruent with the dask-array setup I'm using right now, but it is still congruent with blockwise; piping it all together sounds a little nasty, though.
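
A hypothetical per-chunk version of that triple, reusing the flattened-bin-index idea from the sketch above (nothing here is the notebook's API):

```python
import numpy as np

def fill_chunk_weighted(flat_bin_idx, weights):
    # One pass over the chunk yields all three accumulators, keyed by the
    # unique bins this chunk touched, so the whole triple can travel
    # through the graph as a single per-chunk payload.
    uniq, inv, counts = np.unique(flat_bin_idx, return_inverse=True,
                                  return_counts=True)
    sum_w = np.bincount(inv, weights=weights, minlength=len(uniq))
    sum_w2 = np.bincount(inv, weights=weights**2, minlength=len(uniq))
    return uniq, counts, sum_w, sum_w2
```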

For categories we'll need to be careful about when we actually create more tasks, since it's easy to generate an enormous task overhead with a category for each systematic variation. However, since we know when multiple .fill calls take the same coordinate collections and differ only in their weight collection and category label, we can fuse those calls when building the task graph and do a single vectorized fill over the weights (sketched below) instead of bloating the size of the graph.
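
A toy illustration of that fusion (hypothetical names; the `weight_matrix` columns stand in for the per-variation weight collections that share one set of coordinates):

```python
import numpy as np

def fill_chunk_multiweight(flat_bin_idx, weight_matrix):
    # weight_matrix: (n_events, n_categories), one column per systematic
    # variation. All categories are filled inside one task instead of
    # spawning a near-identical task per variation.
    uniq, inv = np.unique(flat_bin_idx, return_inverse=True)
    out = np.empty((weight_matrix.shape[1], len(uniq)))
    for c in range(weight_matrix.shape[1]):
        out[c] = np.bincount(inv, weights=weight_matrix[:, c],
                             minlength=len(uniq))
    return uniq, out  # touched bins and per-category sums, one graph task
```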
