
[Bug]: Feature name containing numbers may lead to an error in the ROC calculation process. #240

@pengDLDG

Description


Contact Details

[email protected]


Suppose I have a node named "A0516" with two categories (0 and 1) after discretization. Calling 'bn.predict_probability()' returns a DataFrame ('predictions') with two columns ('A0516_0' and 'A0516_1'). Unfortunately, the following code in the 'roc_auc()' function turns those two columns into four, leading to an error when we use roc_auc().

predictions = bn.predict_probability(data, node)
predictions.rename(columns=lambda x: x.lstrip(node + "_"), inplace=True)
predictions = predictions[sorted(predictions.columns)]

The original purpose of 'x.lstrip(node + "_")' was to convert 'A0516_0' and 'A0516_1' into '0' and '1'. However, str.lstrip() strips a set of characters rather than a prefix, and both '0' and '1' occur in the string "A0516_", so both column names collapse to the same empty string. With the names duplicated, selecting the sorted columns doubles the number of columns in "predictions" and leads to the subsequent error.
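The pitfall is easy to reproduce with plain Python. The fix sketched below slices off the exact prefix instead; this is a minimal illustration, not the library's actual patch (str.removeprefix would also work, but only exists from Python 3.9, hence the slicing):

```python
node = "A0516"
columns = ["A0516_0", "A0516_1"]

# str.lstrip treats its argument as a SET of characters, not a prefix:
# every leading character found in "A0516_" is removed, so both
# column names collapse to the empty string.
stripped = [c.lstrip(node + "_") for c in columns]
print(stripped)  # ['', '']

# Prefix-safe alternative: remove exactly len(node + "_") characters.
prefix = node + "_"
renamed = [c[len(prefix):] if c.startswith(prefix) else c for c in columns]
print(renamed)  # ['0', '1']
```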

CausalNex Version

0.12.1

Python Version

3.8.20

Relevant code snippet

from causalnex.structure.notears import from_pandas
from causalnex.network import BayesianNetwork
from causalnex.discretiser import Discretiser
from causalnex.evaluation import roc_auc

sm = from_pandas(df)
...
bn = BayesianNetwork(sm)

df_discrete = df.copy()
for col in df_discrete.columns:
    df_discrete[col] = Discretiser(method="quantile", num_buckets=2).fit_transform(df_discrete[col].values)

bn = bn.fit_node_states_and_cpds(df_discrete, method="BayesianEstimator", bayes_prior="K2")

roc, auc = roc_auc(bn, df_discrete, "A0516")
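The column doubling itself can be reproduced with pandas alone; here is a toy sketch with made-up probabilities and 3 rows (my actual data is larger, hence the [36, 72] in the traceback below):

```python
import pandas as pd

# Toy stand-in for bn.predict_probability(): 3 rows, one column per state.
predictions = pd.DataFrame({"A0516_0": [0.4, 0.7, 0.2],
                            "A0516_1": [0.6, 0.3, 0.8]})

# The rename performed inside roc_auc(): both names collapse to "".
predictions.rename(columns=lambda x: x.lstrip("A0516_"), inplace=True)
print(predictions.columns.tolist())  # ['', '']

# Selecting by the sorted (now duplicated) names matches every duplicate
# column once per list entry: 2 names x 2 matching columns = 4 columns.
predictions = predictions[sorted(predictions.columns)]
print(predictions.shape)  # (3, 4)

# The raveled predictions are now exactly twice as long as the one-hot
# ground truth, producing sklearn's "inconsistent numbers of samples"
# error (the factor-of-two mismatch reported as [36, 72]).
```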

Relevant log output

ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 roc, auc = roc_auc(bn, df_discrete, "A0516")
      2 print(auc)

File ~\.conda\envs\causenet_python\lib\site-packages\causalnex\evaluation\evaluation.py:106, in roc_auc(bn, data, node)
    103 predictions.rename(columns=lambda x: x.lstrip(node + "_"), inplace=True)
    104 predictions = predictions[sorted(predictions.columns)]
--> 106 fpr, tpr, _ = metrics.roc_curve(
    107     ground_truth.values.ravel(), predictions.values.ravel()
    108 )
    109 roc = list(zip(fpr, tpr))
    110 auc = metrics.auc(fpr, tpr)

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\utils\_param_validation.py:214, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\metrics\_ranking.py:1095, in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    993 @validate_params(
    994     {
    995         "y_true": ["array-like"],
   (...)
   1004     y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True
   1005 ):
   1006     """Compute Receiver operating characteristic (ROC).
   1007 
   1008     Note: this implementation is restricted to the binary classification task.
   (...)
   1093     array([ inf, 0.8 , 0.4 , 0.35, 0.1 ])
   1094     """
-> 1095     fps, tps, thresholds = _binary_clf_curve(
   1096         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
   1097     )
   1099     # Attempt to drop thresholds corresponding to points in between and
   1100     # collinear with other points. These are always suboptimal and do not
   1101     # appear on a plotted ROC curve (and thus do not affect the AUC).
   (...)
   1106     # but does not drop more complicated cases like fps = [1, 3, 7],
   1107     # tps = [1, 2, 4]; there is no harm in keeping too many thresholds.
   1108     if drop_intermediate and len(fps) > 2:

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\metrics\_ranking.py:806, in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    803 if not (y_type == "binary" or (y_type == "multiclass" and pos_label is not None)):
    804     raise ValueError("{0} format is not supported".format(y_type))
--> 806 check_consistent_length(y_true, y_score, sample_weight)
    807 y_true = column_or_1d(y_true)
    808 y_score = column_or_1d(y_score)

File ~\.conda\envs\causenet_python\lib\site-packages\sklearn\utils\validation.py:407, in check_consistent_length(*arrays)
    405 uniques = np.unique(lengths)
    406 if len(uniques) > 1:
--> 407     raise ValueError(
    408         "Found input variables with inconsistent numbers of samples: %r"
    409         % [int(l) for l in lengths]
    410     )

ValueError: Found input variables with inconsistent numbers of samples: [36, 72]

Code of Conduct

  • I agree to follow this project's Code of Conduct
