
Jensen-Shannon distance should be computed using base 2 logarithm for it to be bound between 0 and 1 #494

@perezbecker


Your current implementation of the Jensen-Shannon distance (JSD) uses scipy.spatial.distance.jensenshannon, which by default computes the distance with natural (base e) logarithms. Under this definition, the JSD is bounded between 0 and sqrt(ln(2))=0.83255... (see, for example, the JSD Wikipedia article).
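For reference, the bound follows from the definition of the Jensen-Shannon divergence. The notation below is my own shorthand (D_b for the Kullback-Leibler divergence with logarithm base b, d_b for the resulting distance), not something taken from the evidently or SciPy documentation:

```latex
% Jensen-Shannon divergence with logarithm base b, where M is the mixture of P and Q
\mathrm{JS}_b(P, Q) = \tfrac{1}{2} D_b(P \,\|\, M) + \tfrac{1}{2} D_b(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)

% The divergence is bounded by \log_b 2, attained when P and Q have disjoint support
0 \le \mathrm{JS}_b(P, Q) \le \log_b 2

% The distance is the square root of the divergence
d_b(P, Q) = \sqrt{\mathrm{JS}_b(P, Q)} \in \left[0, \sqrt{\log_b 2}\,\right]

% b = e gives an upper bound of \sqrt{\ln 2} \approx 0.83255; b = 2 gives an upper bound of 1
```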

I believe you want to compute the JSD using base 2 logarithms, so that it is bounded between 0 and 1, as you state in your data drift detection blog post.

Here is code which reproduces this issue by computing the JSD between two extremely drifted distributions:

from evidently.calculations.stattests import jensenshannon_stat_test
import numpy as np
import pandas as pd

x = pd.Series(np.random.normal(100, 10, 100_000))
y = pd.Series(np.random.normal(10_000, 10, 100_000))

print(jensenshannon_stat_test(x, y, feature_type='num', threshold=0.1))
# Output: StatTestResult(drift_score=0.8325546111576977, drifted=True, actual_threshold=0.1)

Note that the drift_score is close to the theoretical maximum under base e logarithms, sqrt(ln(2))=0.83255..., instead of the desired value of 1.
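The effect of the logarithm base can be checked directly against SciPy, independently of evidently. The snippet below is only a minimal sketch: the two-bin distributions p and q are made up for illustration, and I have not looked at how evidently bins the data before calling SciPy, so this is not a patch for the library itself. It relies on the base parameter of scipy.spatial.distance.jensenshannon.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Illustrative distributions with disjoint support, i.e. maximal drift.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])

# Default behaviour: natural logarithm, so the maximum is sqrt(ln(2)).
d_nat = jensenshannon(p, q)

# Base-2 logarithm: the maximum becomes 1.
d_bit = jensenshannon(p, q, base=2)

# An already computed base-e distance can be rescaled by a constant factor,
# because changing the base divides the divergence by ln(2).
d_rescaled = d_nat / np.sqrt(np.log(2))

print(d_nat)       # about 0.83255
print(d_bit)       # about 1.0
print(d_rescaled)  # about 1.0
```

Either passing base=2 or dividing the current score by sqrt(ln(2)) would put the drift score on the 0 to 1 scale. Note that existing thresholds would then need to be rescaled by the same factor to keep their meaning.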

Labels: bug
