Description
In the implementation of get_binned_data, zero percentages are filled with a fixed value:

    np.place(reference_percents, reference_percents == 0, 0.0001)
By default, 0.0001 is used to replace any percent equal to 0. This is valid only when the smallest non-zero percent is greater than 0.0001. When the smallest non-zero percent in the data is less than or equal to 0.0001, for example 0.00001, the bins that were originally zero become 0.0001 after filling, which is larger than the existing minimum percent (0.00001), so the data distribution is wrong after filling zeros.
I think the fill value used here should be set dynamically according to the percentages in the data rather than being hard-coded. How to set it dynamically needs further discussion. For example, it could be one tenth of the minimum non-zero percent, or some other value, but it must be guaranteed to be smaller than the minimum non-zero percent; otherwise the result will be wrong.
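To make the proposal concrete, here is a minimal sketch of a dynamic fill. The helper name `fill_zero_percents` and its `fraction` and `fallback` parameters are hypothetical and not part of the library. It assumes "one tenth of the minimum non-zero percent" as the rule and falls back to the current constant when every bin is zero.

```python
import numpy as np

def fill_zero_percents(percents, fraction=0.1, fallback=0.0001):
    """Replace zeros with a value strictly below the smallest non-zero percent.

    Hypothetical helper: `fraction` and `fallback` are illustrative choices.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in (0, 1) so the fill stays below the minimum")
    filled = np.asarray(percents, dtype=float).copy()
    nonzero = filled[filled > 0]
    # With no non-zero bins there is no minimum to scale from, so use the fallback.
    fill_value = nonzero.min() * fraction if nonzero.size else fallback
    filled[filled == 0] = fill_value
    return filled

# Example: the smallest non-zero percent (0.00001) is below the fixed 0.0001.
reference_percents = np.array([0.0, 0.00001, 0.49999, 0.5])
filled = fill_zero_percents(reference_percents)
```

With this rule the filled bin stays the smallest bin in the distribution, whereas the fixed 0.0001 would make it ten times larger than the 0.00001 bin.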
I found this problem when I tried to calculate the Kullback-Leibler divergence drift score on my data. The minimum percent in my data was 0.00001, which is smaller than the default fill value, and this resulted in an incorrect Kullback-Leibler divergence drift score. For example, the score calculated with the default fill value on my data is 0.41. After I changed the fill value to one tenth of the minimum percent, the calculated Kullback-Leibler divergence drift score is 1.9.
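The effect on the score can be illustrated with a small synthetic example. The percentages below are made up and are not my real data, so the numbers will differ from 0.41 and 1.9. The `kl_divergence` and `fill` helpers are written here for illustration; the direction of the divergence (current relative to reference) and the normalization step are assumptions, not necessarily what the library does.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, normalized to sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def fill(percents, value):
    """Return a copy with zeros replaced by `value`."""
    out = np.asarray(percents, dtype=float).copy()
    out[out == 0] = value
    return out

# Synthetic percents: the reference has one empty bin and one very small bin.
reference = np.array([0.0, 0.00001, 0.49999, 0.5])
current = np.array([0.2, 0.1, 0.3, 0.4])

kl_fixed = kl_divergence(current, fill(reference, 0.0001))          # fixed default
kl_dynamic = kl_divergence(current, fill(reference, 0.00001 / 10))  # one tenth of the minimum

print(f"fixed fill:   {kl_fixed:.3f}")
print(f"dynamic fill: {kl_dynamic:.3f}")
```

In this sketch the fixed fill gives the empty reference bin more mass than it should have, so the divergence is understated compared with the dynamic fill.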