Skip to content

Reservoir sampling probability inaccuracy #6091

@chzhoo

Description

@chzhoo

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Your Environments (required)

  • OS: Linux 8cef9170aed4 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Thu Oct 22 00:20:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler: g++ (Nebula Graph Build) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

How To Reproduce(required)

Steps to reproduce the behavior:

#include "common/algorithm/ReservoirSampling.h"
......
for (size_t time = 0; time < 1024; time++) {
ReservoirSampling<int64_t> sampler(1);
sampler.sampling(1);
sampler.sampling(2);
auto result = sampler.samples();
//The sampling result only contains the number 2, not the number 1.
std::cout << "sample result: " <<result[0] << std::endl;
}

Expected behavior

The sampling result only contains the number 1 and the number 2.

Additional context

The <go_statement> SAMPLE <sample_list> statement employs the reservoir sampling algorithm, but the current implementation exhibits probability inaccuracies.

Metadata

Metadata

Assignees

No one assigned

    Labels

    affects/nonePR/issue: this bug affects none version.severity/noneSeverity of bugtype/bugType: something is unexpected

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions