In Evidently 0.4.19 with Python 3.10, ClassificationQualityMetric() and ClassificationConfusionMatrix() throw an error when some of the data labels are numeric strings, even if the dataframe column dtype is explicitly set to string. These are the only two metrics I tested, but I suspect other metrics are affected as well.
See sample code below:
from evidently.report import Report
from evidently.metrics import *
import pandas as pd
label_target = ['foo', 'bar', 'fun', 'foo', 'fun', 'foo', '101', '102']
label_predict = ['foo', 'bar', 'fun', 'bar', 'fun', 'fun', '101', '101']
data_df = pd.DataFrame({'target': label_target, 'prediction': label_predict}, dtype="string")
report = Report(metrics=[
    ClassificationQualityMetric(),
    ClassificationConfusionMatrix(),
])
report.run(reference_data=None, current_data=data_df)
report
It ends up with the following error:
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/evidently/calculations/classification_performance.py:316, in calculate_matrix(target, prediction, labels)
315 def calculate_matrix(target: pd.Series, prediction: pd.Series, labels: List[Union[str, int]]) -> ConfusionMatrix:
--> 316 sorted_labels = sorted(labels)
317 matrix = metrics.confusion_matrix(target, prediction, labels=sorted_labels)
318 return ConfusionMatrix(labels=sorted_labels, values=[row.tolist() for row in matrix])
TypeError: '<' not supported between instances of 'str' and 'int'
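The traceback points at the `sorted(labels)` call, so the failure can be illustrated without Evidently. The sketch below assumes that the label list reaching that call contains a mix of `str` and `int` values. I have not confirmed where, or whether, '101' is converted to an integer inside the library, so `mixed_labels` and `try_sort` are hypothetical names used only for illustration.

```python
# Hypothetical label list: an assumption about what reaches sorted(),
# not something taken from Evidently's internals.
mixed_labels = ["foo", "bar", "fun", 101, 102]


def try_sort(labels):
    """Return the sorted labels, or the TypeError message if sorting fails."""
    try:
        return sorted(labels)
    except TypeError as exc:
        return str(exc)


# Plain sorting fails because Python 3 does not order str against int.
result = try_sort(mixed_labels)
print(result)

# Sorting by the string form avoids the comparison between str and int.
sorted_by_text = sorted(mixed_labels, key=str)
print(sorted_by_text)
```

Sorting with `key=str` is only one possible way to avoid the comparison error; it leaves the element types unchanged and does not address why the labels would be mixed in the first place.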
Appending a non-numeric character (such as a dot) to the numeric labels works around the issue:
label_target = ['foo', 'bar', 'fun', 'foo', 'fun', 'foo', '101.', '102-']
label_predict = ['foo', 'bar', 'fun', 'bar', 'fun', 'fun', '101.', '101.']
However, I do not think this is the expected behavior: the dataframe column dtype should be respected throughout the metric computation.
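To illustrate what respecting the column dtype would look like, here is a minimal sketch that uses only pandas and the sample data from above. It collects the labels directly from the string-typed columns and sorts them. This is not Evidently's code, and `sorted_labels` here is just an illustrative variable.

```python
import pandas as pd

label_target = ['foo', 'bar', 'fun', 'foo', 'fun', 'foo', '101', '102']
label_predict = ['foo', 'bar', 'fun', 'bar', 'fun', 'fun', '101', '101']
data_df = pd.DataFrame({'target': label_target, 'prediction': label_predict}, dtype="string")

# Collect the distinct labels from both columns without changing their type.
all_labels = pd.concat([data_df['target'], data_df['prediction']])
labels = all_labels.drop_duplicates().tolist()

# Every label is still a Python str, so sorting does not raise.
sorted_labels = sorted(labels)
print(sorted_labels)
```

As long as the labels stay strings, '101' and '102' sort alongside 'foo', 'bar' and 'fun' without any type error.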