Skip to content

Conversation

@waridrox
Copy link
Contributor

@waridrox waridrox commented Sep 15, 2025

Closes #2029
CC: @thomass-dev

When debugging, in the data_accessor.py file:

if X is None:
raise ValueError(err_msg.format(f"X_{dataset}", data_source))
elif not sbd.is_dataframe(X):
X = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(X.shape[1])])
if with_y:
if y is None:
raise ValueError(err_msg.format(f"y_{dataset}", data_source))
if isinstance(y, pd.Series) and y.name is not None:
y = y.to_frame()
elif not sbd.is_dataframe(y):
if y.ndim == 1:
columns = ["Target"]
else:
columns = [f"Target {i}" for i in range(y.shape[1])]
y = pd.DataFrame(y, columns=columns)
return X, y

This method only converts numpy arrays to DataFrames with string column names, but doesn't handle the case if DataFrames already exist but have integer type column names.

This then further passes down to the skrub functions (which expects strings), and the final error of TypeError: cannot use a string pattern on a bytes-like object occurs because suggested_name is an integer from the RangeIndex columns.

def _get_new_name(suggested_name, forbidden_names):
    """Get a new name for a column."""
    # .......    
    tags = re.findall(tag_pattern, suggested_name)  # <==== 
    # .......

tag_pattern is a string regex pattern and re.findall(tag_pattern, suggested_name) method expects both arguments to be strings or both to be bytes-like objects.

I tried to overcome this by ensuring that the DataFrame object has string column names if it already exists to avoid issues with skrub later when passing down from data_accessor.py function.

Alternatively, should I simply just raise exception errors if the compute fails in subsequent steps instead of this?

@thomass-dev
Copy link
Collaborator

thomass-dev commented Sep 15, 2025

Thanks for your investigation and your contribution. Please add a dedicated test.

Copy link
Contributor

@auguste-probabl auguste-probabl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

Simplified the tests a bit.

@auguste-probabl auguste-probabl changed the title fix(skore): convert DataFrame column names to string type fix(skore): Convert DataFrame column names to strings Oct 7, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Oct 7, 2025

Coverage

Coverage Report for skore/
FileStmtsMissCoverMissing
skore/src/skore
   __init__.py230100% 
   _config.py310100% 
   exceptions.py440%4, 15, 19, 23
skore/src/skore/_sklearn
   __init__.py60100% 
   _base.py1981492%45, 58, 127, 130, 183, 186–187, 189–192, 225, 228–229
   find_ml_task.py610100% 
   types.py27196%28
skore/src/skore/_sklearn/_comparison
   __init__.py70100% 
   feature_importance_accessor.py39294%90, 109
   metrics_accessor.py178398%173, 253, 1215
   report.py1070100% 
   utils.py540100% 
skore/src/skore/_sklearn/_cross_validation
   __init__.py90100% 
   data_accessor.py45393%128, 131, 134
   feature_importance_accessor.py240100% 
   metrics_accessor.py182199%244
   report.py135199%487
skore/src/skore/_sklearn/_estimator
   __init__.py90100% 
   data_accessor.py66198%76
   feature_importance_accessor.py144298%223–224
   metrics_accessor.py356897%200, 202, 209, 300, 369, 373, 388, 423
   report.py167298%448–449
skore/src/skore/_sklearn/_plot
   __init__.py30100% 
   base.py102793%61–62, 200, 224–226, 230
   utils.py770100% 
skore/src/skore/_sklearn/_plot/data
   __init__.py20100% 
   table_report.py185199%706
skore/src/skore/_sklearn/_plot/metrics
   __init__.py60100% 
   confusion_matrix.py70494%92, 100, 122, 230
   feature_importance_display.py672168%88, 121–122, 124, 142–146, 148–155, 158–160, 162
   metrics_summary_display.py80100% 
   precision_recall_curve.py280598%455, 555, 559, 619, 751
   prediction_error.py227597%179, 186, 422, 505, 705
   roc_curve.py292897%385, 508, 513, 614, 619, 623, 692, 832
skore/src/skore/_sklearn/train_test_split
   __init__.py00100% 
   train_test_split.py580100% 
skore/src/skore/_sklearn/train_test_split/warning
   __init__.py80100% 
   high_class_imbalance_too_few_examples_warning.py19194%83
   high_class_imbalance_warning.py200100% 
   random_state_unset_warning.py100100% 
   shuffle_true_warning.py90100% 
   stratify_is_set_warning.py100100% 
   time_based_column_warning.py210100% 
   train_test_split_warning.py30100% 
skore/src/skore/_utils
   __init__.py6266%8, 13
   _accessor.py90396%34, 146, 190
   _environment.py270100% 
   _fixes.py80100% 
   _index.py50100% 
   _logger.py22481%15–17, 19
   _measure_time.py100100% 
   _parallel.py38392%23, 33, 124
   _patch.py13561%21, 23–24, 35, 37
   _progress_bar.py460100% 
   _repr_html.py80100% 
   _show_versions.py380100% 
   _testing.py550100% 
skore/src/skore/project
   __init__.py20100% 
   project.py480100% 
   summary.py75198%120
   widget.py165696%436, 439–441, 525–526
TOTAL400511897% 

Tests Skipped Failures Errors Time
1074 5 💤 0 ❌ 0 🔥 4m 21s ⏱️

@auguste-probabl auguste-probabl added this pull request to the merge queue Oct 7, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Oct 7, 2025

Documentation preview @ b33ad33

Merged via the queue into probabl-ai:main with commit 6e7292b Oct 7, 2025
32 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(skore): Data analyze fails when called with multi class dataset

3 participants