The algorithm D in
Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984, page 716-17
is the main result of that article. In src/sampling.jl there are implementations of the A, C, D algorithms.
The main interface function sample! uses algorithms A, C, but not D. Yet in the article on page 704 in table I, we see that D is by far the most performant.
Using sample!(a,x) where n=length(a) is much larger than k=length(x) gives awful performance:
using StatsBase
n=10^14; k=10^8; a=1:n; x=Vector{UInt}(undef,k); @time sample!(a, x, replace=false, ordered=true)
Suggested correction (1):
Lines 507-511 should be replaced with just seqsample_d!(rng, a, x):
Suggested correction (2):
In seqsample_d!(rng, a, x) there is a small but critical mistake: in lines 389, 390, 443, 447 code
alpha = 1 / 13 # choose alpha value
threshold = alpha * n
threshold -= alpha
seqsample_a!(rng, a[i+1:end], @view x[j+1:end])
must be replaced with
alphainv = 13.0 # choose alphainv value experimentally on specific CPU
threshold = alphainv * n
threshold -= alphainv
seqsample_a!(rng, (@view a[i+1:end]), (@view x[j+1:end]))
You can see this in the article on page 715: $alpha\simeq0.07$. And on page 716 in the algorithm: $threshold = alphainverse \times n$. Without this, the switch to seqsample_a!(rng, a, x) is never done:
|
seqsample_a!(rng, a[i+1:end], @view x[j+1:end]) |
The algorithm D in
Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984, page 716-17
is the main result of that article. In
src/sampling.jlthere are implementations of the A, C, D algorithms.The main interface function
sample!uses algorithms A, C, but not D. Yet in the article on page 704 in table I, we see that D is by far the most performant.Using
sample!(a,x)wheren=length(a)is much larger thank=length(x)gives awful performance:Suggested correction (1):
Lines 507-511 should be replaced with just
seqsample_d!(rng, a, x):StatsBase.jl/src/sampling.jl
Line 507 in 3e5382f
Suggested correction (2):
In
seqsample_d!(rng, a, x)there is a small but critical mistake: in lines 389, 390, 443, 447 codemust be replaced with
You can see this in the article on page 715:$alpha\simeq0.07$ . And on page 716 in the algorithm: $threshold = alphainverse \times n$ . Without this, the switch to
seqsample_a!(rng, a, x)is never done:StatsBase.jl/src/sampling.jl
Line 447 in 3e5382f