proposal: new distributed histogram type #168

@lgray

I've attached a notebook showing a proof-of-concept for a cluster-distributed histogram using dask-array that relies on the sparseness of per-chunk fills in dask to avoid materializing the histogram on any particular worker.
This implementation takes a number of shortcuts, fusing operations that would be spelled out more verbosely if this were built from dask's shuffle primitives.
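
To make the pattern concrete, here's a minimal hypothetical sketch (not the notebook's actual code; `fill_chunk` and all parameter values are illustrative): each chunk is histogrammed straight into a sparse COO array holding only the bins that chunk actually touched, and the per-chunk results are tree-reduced, so no worker ever materializes the dense histogram.

```python
import numpy as np
import sparse
import dask.array as da

# Hypothetical sketch, not the notebook's code: histogram each chunk into
# a sparse.COO holding only the bins it touched, then tree-reduce.

def fill_chunk(chunk, edges):
    shape = tuple(len(e) - 1 for e in edges)
    # bin index per dimension (clipping keeps edge values in range)
    idx = [
        np.clip(np.searchsorted(e, chunk[:, d], side="right") - 1, 0, len(e) - 2)
        for d, e in enumerate(edges)
    ]
    flat = np.ravel_multi_index(idx, shape)
    uniq, counts = np.unique(flat, return_counts=True)
    coo = sparse.COO(
        np.stack(np.unravel_index(uniq, shape)),
        counts.astype(np.float64),
        shape=shape,
    )
    return coo.reshape((1,) + shape)  # leading axis to reduce over

ndims, nbins = 3, 500
edges = [np.linspace(0.0, 1.0, nbins + 1)] * ndims
x = da.random.uniform(size=(10_000_000, ndims), chunks=(500_000, ndims))

per_chunk = da.map_blocks(
    fill_chunk, x, edges,
    dtype=np.float64,
    chunks=(1,) + (nbins,) * ndims,
    drop_axis=1,
    new_axis=list(range(1, ndims + 1)),
    meta=sparse.zeros((0,) * (ndims + 1)),
)
hist = per_chunk.sum(axis=0)  # tree reduction of sparse per-chunk fills
```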

Right now this only works for unweighted histograms without categories, but it's not too much of a stretch to extend it. The main trouble will be keeping the number of additional tasks created per category down.

I have scaled this to decent performance for O(billion)-bin histograms (checked up to 4D). It all fits on my laptop, fills in a reasonable amount of time compared to da.histogramdd, and uses significantly less memory than the present implementation of dask-histogram's boost backend. The histogram can then be reassembled on the client machine and quickly turned into boost-histogram or anything else suitable.
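
As an illustration of that last step, client-side reassembly could look something like this (a hedged sketch reusing the hypothetical `hist` and `edges` names from above, not the notebook's code):

```python
import boost_histogram as bh

# Pull the reduced result back to the client and pour it into boost-histogram.
h = bh.Histogram(*(bh.axis.Variable(e) for e in edges))
h[...] = hist.compute().todense()  # densify only once, on the client
```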

Here's the notebook with the proof of concept and validation against da.histogramdd:
distributed-histogram.ipynb.zip

Here are settings for scaling successfully to very large numbers of bins.
NOTE: the da.histogramdd check will be extremely sluggish with the following settings.
125M bins:

```python
ndims = 3
nbins = 500
hist_partitions = 50
n_hist_slices = 25
```

1B bins:

```python
ndims = 4
nbins = 178
hist_partitions = 100
n_hist_slices = 10
```

These settings will run pleasantly on a modern laptop. You're more or less limited by system memory for the output histogram (at 8 bytes per float64 bin, the 1B-bin result is roughly 8 GB once assembled densely).

@nsmith- @holzman @mapsacosta @Nanoemc @douglasdavis @martindurant
Thoughts on how to improve this or integrate it with the existing dask-histogram would be greatly appreciated.

Jotting down my own thoughts concerning moving forward:
For weighted filling we need to keep track of a tuple of (filled_bins, sum_weights, sum_weights_squared). That's not exactly congruent with the dask-array setup I'm using right now, but it is still congruent with blockwise; piping it all together sounds a little nasty, though.
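
A hypothetical per-chunk version of that triple, reusing the flattened-bin-index idea from the sketch above (nothing here is the notebook's API):

```python
import numpy as np

def fill_chunk_weighted(flat_bin_idx, weights):
    # One pass over the chunk yields all three accumulators, keyed by the
    # unique bins this chunk touched, so the whole triple can travel
    # through the graph as a single per-chunk payload.
    uniq, inv, counts = np.unique(flat_bin_idx, return_inverse=True,
                                  return_counts=True)
    sum_w = np.bincount(inv, weights=weights, minlength=len(uniq))
    sum_w2 = np.bincount(inv, weights=weights**2, minlength=len(uniq))
    return uniq, counts, sum_w, sum_w2
```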

For categories we'll need to be careful about when we actually create more tasks, since it's easy to generate an enormous task overhead with a category for each systematic variation. However, since we know when multiple .fill calls take the same coordinate collections and differ only in their weight collection and category label, we can fuse those calls when building the task graph and do a single vectorized fill over the weights (sketched below) instead of bloating the size of the graph.
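
A toy illustration of that fusion (hypothetical names; the `weight_matrix` columns stand in for the per-variation weight collections that share one set of coordinates):

```python
import numpy as np

def fill_chunk_multiweight(flat_bin_idx, weight_matrix):
    # weight_matrix: (n_events, n_categories), one column per systematic
    # variation. All categories are filled inside one task instead of
    # spawning a near-identical task per variation.
    uniq, inv = np.unique(flat_bin_idx, return_inverse=True)
    out = np.empty((weight_matrix.shape[1], len(uniq)))
    for c in range(weight_matrix.shape[1]):
        out[c] = np.bincount(inv, weights=weight_matrix[:, c],
                             minlength=len(uniq))
    return uniq, out  # touched bins and per-category sums, one graph task
```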
