Skip to content

Use src/sampling.jl/seqsample_d!() and fix a bug in it #998

@lampretl

Description

@lampretl

The algorithm D in
Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984, page 716-17
is the main result of that article. In src/sampling.jl there are implementations of the A, C, D algorithms.

The main interface function sample! uses algorithms A, C, but not D. Yet in the article on page 704 in table I, we see that D is by far the most performant.

Using sample!(a,x) where n=length(a) is much larger than k=length(x) gives awful performance:

using StatsBase
n=10^14;  k=10^8;  a=1:n;  x=Vector{UInt}(undef,k);  @time sample!(a, x, replace=false, ordered=true)

Suggested correction (1):
Lines 507-511 should be replaced with just seqsample_d!(rng, a, x):

if n > 10 * k * k

Suggested correction (2):
In seqsample_d!(rng, a, x) there is a small but critical mistake: in lines 389, 390, 443, 447 code

alpha = 1 / 13 # choose alpha value
threshold = alpha * n
threshold -= alpha
seqsample_a!(rng, a[i+1:end], @view x[j+1:end])

must be replaced with

alphainv = 13.0 # choose alphainv value experimentally on specific CPU
threshold = alphainv * n
threshold -= alphainv
seqsample_a!(rng, (@view a[i+1:end]), (@view x[j+1:end]))

You can see this in the article on page 715: $alpha\simeq0.07$. And on page 716 in the algorithm: $threshold = alphainverse \times n$. Without this, the switch to seqsample_a!(rng, a, x) is never done:

seqsample_a!(rng, a[i+1:end], @view x[j+1:end])

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions