Speed with large arrays #7

@grovduck

Description

Thanks for this package, @mthh. This is more of a question than an issue. I'm using your code in a raster GIS context and trying to get natural breaks for large rasters (> 1,000,000 cells). I've built in equal interval and quantile classification modes for comparison. I'm curious whether you've ever done speed comparisons on large datasets and whether you'd advocate sampling to reduce classification times.

Here's a sample of what I'm seeing:

import numpy as np
import jenkspy

# Equal interval
def equal_intervals(arr, bin_count):
    return np.linspace(arr.min(), arr.max(), num=bin_count + 1)

# Quantile (approximate: handles duplicate bin edges caused by heavily tied values)
def approx_quantiles(arr, bin_count):
    if arr.size <= bin_count:
        return np.sort(arr)
    q = np.linspace(0, 1, bin_count + 1)
    bins = np.quantile(arr, q)
    uniq, counts = np.unique(bins, return_counts=True)
    dup = uniq[counts > 1]
    if len(dup):
        # Drop the tied value and recompute quantiles on the remainder
        new = arr[arr != dup[0]]
        return np.sort(
            np.hstack((dup[0], approx_quantiles(new, bin_count - 1)))
        )
    return bins

# Natural breaks
def jenks(arr, bin_count):
    return jenkspy.jenks_breaks(arr, bin_count)

# Sample array (n=10000)
arr = np.random.randint(0, 100, 10000)

When timing these, jenks is slower than I would expect, though I admit I haven't studied the algorithm carefully enough to know whether this is expected.

%timeit equal_intervals(arr, 6)
61.5 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit approx_quantiles(arr, 6)
431 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit jenks(arr, 6)
514 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sampling the original array obviously helps speed, but the bin boundaries aren't exact:

import math
arr = np.random.randint(0, 1000, 100000)
for i in range(10, 0, -1):
    size = math.floor(arr.size * i / 10)
    sample = np.random.choice(arr, size=size)  # note: draws with replacement
    %timeit jenks(sample, 6)
    print(f"{i*10}% sample", jenks(sample, 6))

49.6 s ± 864 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100% sample [0.0, 167.0, 335.0, 503.0, 670.0, 835.0, 999.0]

39.6 s ± 661 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
90% sample [0.0, 165.0, 331.0, 497.0, 664.0, 831.0, 999.0]

30.8 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
80% sample [0.0, 167.0, 333.0, 500.0, 667.0, 834.0, 999.0]

23.8 s ± 686 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70% sample [0.0, 170.0, 339.0, 507.0, 672.0, 834.0, 999.0]

17.7 s ± 708 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
60% sample [0.0, 168.0, 335.0, 500.0, 665.0, 831.0, 999.0]

12.1 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50% sample [0.0, 169.0, 334.0, 500.0, 665.0, 831.0, 999.0]

7.28 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
40% sample [0.0, 168.0, 335.0, 502.0, 668.0, 833.0, 999.0]

4.31 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30% sample [0.0, 168.0, 337.0, 503.0, 667.0, 832.0, 999.0]

2.05 s ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20% sample [0.0, 170.0, 338.0, 507.0, 674.0, 837.0, 999.0]

458 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10% sample [0.0, 167.0, 334.0, 500.0, 664.0, 830.0, 999.0]
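Incidentally, the two 100,000-element vs 10,000-element timings line up with quadratic scaling in the input size, which would match the O(k·n²) cost of the classic Jenks/Fisher dynamic program (an assumption on my part; I haven't read jenkspy's internals). A quick arithmetic check using the numbers above:

```python
# Do the measured times fit quadratic scaling in n?
t_small, n_small = 0.514, 10_000      # 514 ms for n = 10,000 (first benchmark)
t_large, n_large = 49.6, 100_000      # 49.6 s for n = 100,000 (100% sample)

observed = t_large / t_small          # roughly 96x slower
predicted = (n_large / n_small) ** 2  # 100x, if runtime grows as n^2
print(f"observed {observed:.0f}x vs predicted {predicted:.0f}x")
```

If that scaling holds, a full 1,000,000-cell raster would be another ~100x slower again, which is why I'm asking about sampling.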

Thanks for any pointers you may have on how to use jenkspy more efficiently.
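In case it helps, here is the direction I'm currently leaning: classify a fixed-size random sample, then snap the outer breaks to the true min/max so every cell still falls inside the returned range. This is only a sketch; the function name, the `max_n` default, and the pluggable `break_func` parameter are my own, not part of jenkspy:

```python
import numpy as np

def sampled_breaks(arr, bin_count, break_func, max_n=10_000, seed=0):
    """Classify a random sample of `arr`, then snap the outer breaks
    to the true min/max so the breaks cover the full data range.

    `break_func(values, bin_count)` is any classifier that returns
    bin_count + 1 sorted break values, e.g. jenkspy.jenks_breaks.
    """
    rng = np.random.default_rng(seed)
    if arr.size > max_n:
        # Sample without replacement so no cell is counted twice
        sample = rng.choice(arr, size=max_n, replace=False)
    else:
        sample = arr
    breaks = np.asarray(break_func(sample, bin_count), dtype=float)
    breaks[0] = arr.min()    # widen the first class to the true minimum
    breaks[-1] = arr.max()   # widen the last class to the true maximum
    return breaks

# e.g. breaks = sampled_breaks(raster_values, 6, jenkspy.jenks_breaks)
```

The inner breaks are still approximate, but the endpoints are exact, so no value lands outside the classification.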
