Algorithm fails on small data #41

@nsutcliffe

As part of our CI/CD pipeline, we have a daily build that runs on a relatively small amount of data. It exposed an interesting bug. The method estimateTau contains the following line:

val y = DenseVector(estimators.map { case (_, d) => math.log(d) })

In this case, d is the average distance between points. We are finding that on the small data used in our daily build, beta can exceed 0. When this happens, yMax, which is defined as:

val yMax = breeze.linalg.max(y)

is below -1 (-1.3153582722102333 in our run) and is subsequently used as the bufferSize.
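To illustrate how yMax can end up below -1, here is a minimal sketch in plain Scala. The sample values are hypothetical (not taken from our data), and a plain Seq stands in for the breeze DenseVector so the snippet runs without dependencies:

```scala
object NegativeYMaxDemo {
  // Hypothetical (sample size, average distance) pairs; every distance is below 1/e.
  val estimators: Seq[(Int, Double)] = Seq((100, 0.10), (200, 0.18), (400, 0.25))

  // Same transformation as in estimateTau, with a Seq instead of a DenseVector.
  val y: Seq[Double] = estimators.map { case (_, d) => math.log(d) }

  // The maximum of the logs; it is below -1 whenever every distance is below 1/e.
  val yMax: Double = y.max

  def main(args: Array[String]): Unit = {
    println(s"yMax = $yMax")
    println(s"yMax below -1: ${yMax < -1.0}")
  }
}
```

Because yMax is a maximum of logarithms, it goes negative as soon as all average distances are below 1, and below -1 once they are all below 1/e (roughly 0.37).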

Specifically, the following appears in the log:

ERROR KNN: Unable to estimate Tau with positive beta: 0.1577160047542901. This maybe because data is too small.
Setting to -1.3153582722102333 which is the maximum average distance we found in the sample.
This may leads to poor accuracy. Consider manually set bufferSize instead.
You can also try setting balanceThreshold to zero so only metric trees are built.

(This message is only logged; the code does not stop at this point and continues.)

Exception in thread "main" java.lang.IllegalArgumentException: knn_2166a4d536d3 parameter bufferSize given invalid value -1.3153582722102333

The invalid bufferSize then raises the exception above, and the pipeline stops.

As far as I understand, sufficiently low average distances will always cause this error whenever beta exceeds 0.
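One possible guard is sketched below. It assumes that the intent, as the log message suggests, is to fall back to the maximum average distance itself rather than to its logarithm. The helper name fallbackBufferSize is hypothetical and is not part of the library; this is an illustration, not a proposed patch:

```scala
object BufferSizeFallback {
  // Hypothetical helper: returns a fallback bufferSize in distance space.
  // Assumes avgDistances holds the strictly positive average distances from the sample.
  def fallbackBufferSize(avgDistances: Seq[Double]): Double = {
    require(avgDistances.nonEmpty, "need at least one average distance")
    require(avgDistances.forall(_ > 0.0), "average distances must be positive")

    // Maximum in log space, as computed in estimateTau.
    val yMax = avgDistances.map(d => math.log(d)).max

    // Convert back from log space; the result is always positive.
    math.exp(yMax)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical small distances, for which yMax itself would be below -1.
    println(fallbackBufferSize(Seq(0.10, 0.18, 0.25)))
  }
}
```

Since math.exp never returns a negative number, a value derived this way could not hit the failure described here, although whether it is a sensible buffer size for a given dataset is a separate question.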
