Speed with large arrays #7

@grovduck

Description

Thanks for this package, @mthh. This is more of a question than an issue. I'm using your code in a raster GIS context and trying to get natural breaks for large rasters (> 1,000,000 cells). I've built in equal interval and quantile classification modes for comparison. I'm curious whether you've ever done speed comparisons on large datasets and whether you'd advocate sampling to reduce classification times.

Here's a sample of what I'm seeing:

import numpy as np
import jenkspy

# Equal interval
def equal_intervals(arr, bin_count):
    return np.linspace(arr.min(), arr.max(), num=bin_count + 1)

# Quantile (approximate: handles duplicate bin edges caused by heavily tied values)
def approx_quantiles(arr, bin_count):
    if arr.size <= bin_count:
        return np.sort(arr)
    q = np.linspace(0, 1, bin_count + 1)
    bins = np.quantile(arr, q)
    uniq, counts = np.unique(bins, return_counts=True)
    dup = uniq[counts > 1]
    if len(dup):
        # Drop the tied value and recompute quantiles on the remainder
        new = arr[arr != dup[0]]
        return np.sort(
            np.hstack((dup[0], approx_quantiles(new, bin_count - 1)))
        )
    return bins

# Natural breaks
def jenks(arr, bin_count):
    return jenkspy.jenks_breaks(arr, bin_count)

# Sample array (n=10000)
arr = np.random.randint(0, 100, 10000)

When timing these, jenks is slower than I would expect, though I admit I haven't studied the algorithm carefully enough to know whether this is expected.

%timeit equal_intervals(arr, 6)
61.5 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit approx_quantiles(arr, 6)
431 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit jenks(arr, 6)
514 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sampling the original array obviously helps speed, but the bin boundaries aren't exact:

import math
arr = np.random.randint(0, 1000, 100000)
for i in range(10, 0, -1):
    size = math.floor(arr.size * i / 10)
    sample = np.random.choice(arr, size=size)  # note: draws with replacement
    %timeit jenks(sample, 6)
    print(f"{i*10}% sample", jenks(sample, 6))

49.6 s ± 864 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100% sample [0.0, 167.0, 335.0, 503.0, 670.0, 835.0, 999.0]

39.6 s ± 661 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
90% sample [0.0, 165.0, 331.0, 497.0, 664.0, 831.0, 999.0]

30.8 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
80% sample [0.0, 167.0, 333.0, 500.0, 667.0, 834.0, 999.0]

23.8 s ± 686 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70% sample [0.0, 170.0, 339.0, 507.0, 672.0, 834.0, 999.0]

17.7 s ± 708 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
60% sample [0.0, 168.0, 335.0, 500.0, 665.0, 831.0, 999.0]

12.1 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50% sample [0.0, 169.0, 334.0, 500.0, 665.0, 831.0, 999.0]

7.28 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
40% sample [0.0, 168.0, 335.0, 502.0, 668.0, 833.0, 999.0]

4.31 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30% sample [0.0, 168.0, 337.0, 503.0, 667.0, 832.0, 999.0]

2.05 s ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20% sample [0.0, 170.0, 338.0, 507.0, 674.0, 837.0, 999.0]

458 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10% sample [0.0, 167.0, 334.0, 500.0, 664.0, 830.0, 999.0]
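Incidentally, the two 100,000-element vs 10,000-element timings line up with quadratic scaling in the input size, which would match the O(k·n²) cost of the classic Jenks/Fisher dynamic program (an assumption on my part; I haven't read jenkspy's internals). A quick arithmetic check using the numbers above:

```python
# Do the measured times fit quadratic scaling in n?
t_small, n_small = 0.514, 10_000      # 514 ms for n = 10,000 (first benchmark)
t_large, n_large = 49.6, 100_000      # 49.6 s for n = 100,000 (100% sample)

observed = t_large / t_small          # roughly 96x slower
predicted = (n_large / n_small) ** 2  # 100x, if runtime grows as n^2
print(f"observed {observed:.0f}x vs predicted {predicted:.0f}x")
```

If that scaling holds, a full 1,000,000-cell raster would be another ~100x slower again, which is why I'm asking about sampling.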

Thanks for any pointers you may have on how to use jenkspy more efficiently.
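In case it helps, here is the direction I'm currently leaning: classify a fixed-size random sample, then snap the outer breaks to the true min/max so every cell still falls inside the returned range. This is only a sketch; the function name, the `max_n` default, and the pluggable `break_func` parameter are my own, not part of jenkspy:

```python
import numpy as np

def sampled_breaks(arr, bin_count, break_func, max_n=10_000, seed=0):
    """Classify a random sample of `arr`, then snap the outer breaks
    to the true min/max so the breaks cover the full data range.

    `break_func(values, bin_count)` is any classifier that returns
    bin_count + 1 sorted break values, e.g. jenkspy.jenks_breaks.
    """
    rng = np.random.default_rng(seed)
    if arr.size > max_n:
        # Sample without replacement so no cell is counted twice
        sample = rng.choice(arr, size=max_n, replace=False)
    else:
        sample = arr
    breaks = np.asarray(break_func(sample, bin_count), dtype=float)
    breaks[0] = arr.min()    # widen the first class to the true minimum
    breaks[-1] = arr.max()   # widen the last class to the true maximum
    return breaks

# e.g. breaks = sampled_breaks(raster_values, 6, jenkspy.jenks_breaks)
```

The inner breaks are still approximate, but the endpoints are exact, so no value lands outside the classification.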
