Conversation

@Muhammad-Rebaal
Contributor

Closes #1509

Hi @MarieSacksick, @auguste-probabl!

I've added a dataset_utils.py which consists of 2 functions:

compare_datasets: A function that compares two datasets and provides a detailed analysis of their differences, including shape, columns, and overlapping records.

check_data_leakage: A helper function that works with report objects to verify that new data doesn't overlap significantly with the training data.

I also exposed these new utility functions in __init__.py so users can access them.
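
For context, a minimal sketch of how these utilities might be used. The import path follows the description above, but the exact signatures and result keys are assumptions, not the final API:

```python
import pandas as pd

# Hypothetical usage; signatures and result keys are assumptions based on
# the PR description, not the final API.
from skore import compare_datasets, check_data_leakage

X_train = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 55_000, 72_000]})
X_new = pd.DataFrame({"age": [32, 51], "income": [55_000, 81_000]})

# Compare shapes, columns, overlapping rows and feature distributions.
results = compare_datasets(X_train, X_new)
print(results["column_comparison"])
print(results["overlap_analysis"])

# With a fitted report object (e.g. an EstimatorReport), verify that the
# "new" data does not overlap significantly with the training data.
# check_data_leakage(report, X_new)
```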

@auguste-probabl
Contributor

Hi @Muhammad-Rebaal, thanks, super interesting. Since #1509 is a big feature that requires some thought, you should not expect your PR to be merged as-is. However it's very cool as a proof of concept, much appreciated.

@Muhammad-Rebaal
Contributor Author

> Hi @Muhammad-Rebaal, thanks, super interesting. Since #1509 is a big feature that requires some thought, you should not expect your PR to be merged as-is. However it's very cool as a proof of concept, much appreciated.

Thanks for the feedback! It would be a pleasure for me to implement this, and I'd love to see it in action. Happy to hear if there is anything more to refine.

@thomass-dev
Collaborator

thomass-dev commented May 26, 2025

[automated comment] Please update your PR with main, so that the pytest workflow status will be reported.

@github-actions
Contributor

Coverage

Coverage Report for skore/
| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| venv/lib/python3.12/site-packages/skore | | | | |
| __init__.py | 24 | 0 | 100% | |
| _config.py | 28 | 0 | 100% | |
| dataset_utils.py | 92 | 34 | 63% | 79, 81, 137, 140–141, 147–148, 150, 152–153, 155, 160, 169–170, 184–187, 207, 211–215, 217–218, 220–222, 289, 291, 293, 302–303 |
| exceptions.py | 4 | 4 | 0% | 4, 15, 19, 23 |
| venv/lib/python3.12/site-packages/skore/project | | | | |
| __init__.py | 2 | 0 | 100% | |
| metadata.py | 67 | 0 | 100% | |
| project.py | 43 | 0 | 100% | |
| reports.py | 11 | 0 | 100% | |
| widget.py | 138 | 5 | 96% | 375–377, 447–448 |
| venv/lib/python3.12/site-packages/skore/sklearn | | | | |
| __init__.py | 6 | 0 | 100% | |
| _base.py | 169 | 14 | 91% | 45, 58, 126, 129, 182, 185–186, 188–191, 224, 227–228 |
| find_ml_task.py | 61 | 0 | 100% | |
| types.py | 20 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_comparison | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 205 | 3 | 98% | 165, 329, 1283 |
| report.py | 95 | 0 | 100% | |
| utils.py | 58 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_cross_validation | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 204 | 1 | 99% | 321 |
| report.py | 108 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_estimator | | | | |
| __init__.py | 7 | 0 | 100% | |
| feature_importance_accessor.py | 143 | 2 | 98% | 216–217 |
| metrics_accessor.py | 367 | 8 | 97% | 181, 183, 190, 281, 350, 354, 369, 404 |
| report.py | 155 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot | | | | |
| __init__.py | 2 | 0 | 100% | |
| base.py | 5 | 0 | 100% | |
| style.py | 28 | 0 | 100% | |
| utils.py | 118 | 5 | 95% | 50, 74–76, 80 |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot/metrics | | | | |
| __init__.py | 5 | 0 | 100% | |
| confusion_matrix.py | 69 | 4 | 94% | 90, 98, 120, 228 |
| precision_recall_curve.py | 174 | 2 | 98% | 521, 524 |
| prediction_error.py | 160 | 0 | 100% | |
| roc_curve.py | 242 | 4 | 98% | 380, 497, 598, 791 |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split | | | | |
| __init__.py | 0 | 0 | 100% | |
| train_test_split.py | 49 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split/warning | | | | |
| __init__.py | 8 | 0 | 100% | |
| high_class_imbalance_too_few_examples_warning.py | 17 | 1 | 94% | 80 |
| high_class_imbalance_warning.py | 18 | 0 | 100% | |
| random_state_unset_warning.py | 10 | 0 | 100% | |
| shuffle_true_warning.py | 10 | 1 | 90% | 46 |
| stratify_is_set_warning.py | 10 | 0 | 100% | |
| time_based_column_warning.py | 21 | 1 | 95% | 73 |
| train_test_split_warning.py | 4 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/utils | | | | |
| __init__.py | 6 | 2 | 66% | 8, 13 |
| _accessor.py | 52 | 2 | 96% | 67, 108 |
| _environment.py | 27 | 0 | 100% | |
| _fixes.py | 8 | 0 | 100% | |
| _index.py | 5 | 0 | 100% | |
| _logger.py | 22 | 4 | 81% | 15–17, 19 |
| _measure_time.py | 10 | 0 | 100% | |
| _parallel.py | 38 | 3 | 92% | 23, 33, 124 |
| _patch.py | 13 | 5 | 61% | 21, 23–24, 35, 37 |
| _progress_bar.py | 45 | 0 | 100% | |
| _show_versions.py | 33 | 2 | 93% | 65–66 |
| _testing.py | 12 | 0 | 100% | |
| TOTAL | 3238 | 107 | 96% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 788 | 5 💤 | 0 ❌ | 0 🔥 | 1m 5s ⏱️ |

@github-actions
Contributor

Documentation preview @ c3a26b7

@MarieSacksick changed the title enh : Add dataset_utils.py to control the "new" in "new data" → enh: Add dataset_utils.py to control the "new" in "new data" on May 30, 2025
Contributor

@MarieSacksick left a comment

Hello!
Sorry for the delay in reviewing.
What I like is that the compare_datasets function can also be helpful to check whether train and test are statistically different at some point.

I gave some comments.

Can you also add tests please?

Thanks :) !

```python
if results["column_comparison"]["only_in_dataset1"]:
    print(f"- Columns only in dataset 1: {results['column_comparison']['only_in_dataset1']}")
if results["column_comparison"]["only_in_dataset2"]:
    print(f"- Columns only in dataset 2: {results['column_comparison']['only_in_dataset2']}")
```
Contributor

It could be interesting here to also report which columns are common.

```python
overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
Contributor

If datasets 1 and 2 have very different lengths (as they can when we are dealing with a train and a test set), this percentage can be misleading. Can you either find a more explicit key (all the ideas I have in mind are too long, such as overlap_percentage_to_dataset1), or remove it please?

Contributor

Another idea: spell it out completely in the "message" key, and use that.

Contributor

I see now that you are using it below; let's find a clear name so we can keep it :)!
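
For illustration, a small helper with a more explicit key and message could look roughly like this (key names are suggestions in the spirit of the comments above, not what the PR currently uses):

```python
def _overlap_summary(merged_row_count: int, dataset1_row_count: int) -> dict:
    """Build the overlap entry with an explicit key name and message."""
    pct = round(merged_row_count / dataset1_row_count * 100, 2)
    return {
        "overlap_row_count": merged_row_count,
        # Name the reference dataset explicitly instead of a bare "overlap_percentage".
        "overlap_percentage_of_dataset1": pct,
        # Spell it out in the message so the output is unambiguous.
        "message": f"{merged_row_count} rows of dataset 1 ({pct}%) also appear in dataset 2",
    }
```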

```python
    results["overlap_analysis"] = {
        "message": "No shared columns to check for overlapping rows"
    }
except Exception as e:
```
Contributor

What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.

```python
for col in shared_columns:
    # Check if column is numeric
    if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
        try:
```
Contributor

What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.

```python
# Check if column is numeric
if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
    try:
        # Perform Kolmogorov-Smirnov test for distribution comparison
```
Contributor

Really nice; we can enrich this with other tests later, but it's a good start.
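
As an example of the kind of later enrichment, categorical columns could be compared with a chi-squared test; a minimal sketch, not part of this PR:

```python
import pandas as pd
from scipy import stats

def compare_categorical_column(s1: pd.Series, s2: pd.Series, alpha: float = 0.05) -> dict:
    """Compare category frequencies of two samples with a chi-squared test."""
    counts = pd.concat(
        [s1.value_counts(), s2.value_counts()],
        axis=1,
        keys=["dataset1", "dataset2"],
    ).fillna(0)
    chi2, p_value, _, _ = stats.chi2_contingency(counts.T.values)
    return {
        "chi2_statistic": chi2,
        "p_value": p_value,
        "same_distribution": p_value > alpha,
    }
```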

Comment on lines +123 to +134
```python
dataset1_subset = dataset1[shared_columns].drop_duplicates()
dataset2_subset = dataset2[shared_columns].drop_duplicates()

merged = pd.merge(
    dataset1_subset, dataset2_subset,
    on=shared_columns, how='inner'
)

overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
Contributor

Wherever the logic isn't completely straightforward, can you split the big function into sub-functions please? That way, it's much easier to test all the use cases, and also to read.
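
For instance, the overlap computation quoted above could be extracted into a helper along these lines (a sketch of the suggestion, not code from the PR):

```python
import pandas as pd

def _analyze_row_overlap(
    dataset1: pd.DataFrame, dataset2: pd.DataFrame, shared_columns: list
) -> dict:
    """Count the rows (restricted to the shared columns) present in both datasets."""
    dataset1_subset = dataset1[shared_columns].drop_duplicates()
    dataset2_subset = dataset2[shared_columns].drop_duplicates()
    merged = pd.merge(dataset1_subset, dataset2_subset, on=shared_columns, how="inner")
    overlap_count = len(merged)
    return {
        "overlap_row_count": overlap_count,
        "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2),
    }
```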

Comment on lines +155 to +168
```python
ks_stat, p_value = stats.ks_2samp(
    dataset1[col].dropna().values,
    dataset2[col].dropna().values
)

results["feature_distribution"][col] = {
    "ks_statistic": ks_stat,
    "p_value": p_value,
    "same_distribution": p_value > 0.05,  # Common threshold
    "dataset1_mean": dataset1[col].mean(),
    "dataset2_mean": dataset2[col].mean(),
    "dataset1_std": dataset1[col].std(),
    "dataset2_std": dataset2[col].std(),
}
```
Contributor

This could also be put in a sub-function.
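
A possible shape for that sub-function, following the quoted hunk (a sketch; the 0.05 threshold is kept as a parameter):

```python
import pandas as pd
from scipy import stats

def _compare_numeric_distributions(s1: pd.Series, s2: pd.Series, alpha: float = 0.05) -> dict:
    """Kolmogorov-Smirnov comparison of two numeric samples, as in the hunk above."""
    ks_stat, p_value = stats.ks_2samp(s1.dropna().values, s2.dropna().values)
    return {
        "ks_statistic": ks_stat,
        "p_value": p_value,
        "same_distribution": p_value > alpha,
        "dataset1_mean": s1.mean(),
        "dataset2_mean": s2.mean(),
        "dataset1_std": s1.std(),
        "dataset2_std": s2.std(),
    }
```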

```python
if hasattr(report, "X_train") and report.X_train is not None:
    # EstimatorReport case
    train_data = report.X_train
elif hasattr(report, "reports_") and len(report.reports_) > 0 and hasattr(report.reports_[0], "X_train"):
```
Contributor

I took too long to give you a review; something changed in ComparisonReport in the meantime: it's now possible to have EstimatorReports with different X_train, as long as the y is the same. Therefore there is no way to know which report the user wants to check.
I think this function, check_data_leakage, could apply only to EstimatorReport.
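
As an illustration of restricting it to EstimatorReport, the dispatch could reduce to something like this (a sketch only; the threshold parameter and error handling are assumptions, and compare_datasets is the other utility introduced in this PR):

```python
import warnings

from skore import compare_datasets  # exposed in __init__.py per the PR description

def check_data_leakage(report, X_new, threshold: float = 5.0) -> dict:
    """Check X_new against the training data of an EstimatorReport.

    `threshold` is the maximum tolerated percentage of overlapping rows
    (an illustrative parameter, not part of the PR).
    """
    if not hasattr(report, "X_train") or report.X_train is None:
        raise TypeError("check_data_leakage only supports an EstimatorReport with X_train set")
    results = compare_datasets(report.X_train, X_new)
    overlap = results.get("overlap_analysis", {})
    if overlap.get("overlap_percentage", 0) > threshold:
        # Surface the problem plainly instead of burying it in the output.
        warnings.warn("X_new overlaps significantly with the training data")
    return results
```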

@Muhammad-Rebaal
Contributor Author

> Hello! Sorry for the delay for the review. What I like is that the compare dataset function can also be helpful to check whether train and test are statistically different at some point.
>
> I gave some comments.
>
> Can you also add tests please?
>
> Thanks :) !

Hi @MarieSacksick!

No worries, I know it's a big feature and it will take some time.
First of all, thanks to @glemaitre for his vision. Along those lines, I'm also thinking we should take a more focused approach rather than making a global utility to be used everywhere: focus on reports, i.e. on an EstimatorReport (standalone, from a CrossValidationReport, or from a ComparisonReport), and design an API around the TableReport from skrub.

Would be happy to know your thoughts before moving forward.

Thank you!

Development

Successfully merging this pull request may close these issues.

enh: help to control the "new" in "new data"
