enh: Add dataset_utils.py to control the "new" in "new data"
#1661
Conversation
Hi @Muhammad-Rebaal, thanks, super interesting. Since #1509 is a big feature that requires some thought, you should not expect your PR to be merged as-is. However, it's very cool as a proof of concept, much appreciated.
Thanks for the feedback, it would be a pleasure for me to implement that and I'd love to see this in action. Happy to know if there is anything more to refine.
[automated comment] Please update your PR with main, so that the
Coverage Report for
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| venv/lib/python3.12/site-packages/skore | | | | |
| __init__.py | 24 | 0 | 100% | |
| _config.py | 28 | 0 | 100% | |
| dataset_utils.py | 92 | 34 | 63% | 79, 81, 137, 140–141, 147–148, 150, 152–153, 155, 160, 169–170, 184–187, 207, 211–215, 217–218, 220–222, 289, 291, 293, 302–303 |
| exceptions.py | 4 | 4 | 0% | 4, 15, 19, 23 |
| venv/lib/python3.12/site-packages/skore/project | | | | |
| __init__.py | 2 | 0 | 100% | |
| metadata.py | 67 | 0 | 100% | |
| project.py | 43 | 0 | 100% | |
| reports.py | 11 | 0 | 100% | |
| widget.py | 138 | 5 | 96% | 375–377, 447–448 |
| venv/lib/python3.12/site-packages/skore/sklearn | | | | |
| __init__.py | 6 | 0 | 100% | |
| _base.py | 169 | 14 | 91% | 45, 58, 126, 129, 182, 185–186, 188–191, 224, 227–228 |
| find_ml_task.py | 61 | 0 | 100% | |
| types.py | 20 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_comparison | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 205 | 3 | 98% | 165, 329, 1283 |
| report.py | 95 | 0 | 100% | |
| utils.py | 58 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_cross_validation | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 204 | 1 | 99% | 321 |
| report.py | 108 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_estimator | | | | |
| __init__.py | 7 | 0 | 100% | |
| feature_importance_accessor.py | 143 | 2 | 98% | 216–217 |
| metrics_accessor.py | 367 | 8 | 97% | 181, 183, 190, 281, 350, 354, 369, 404 |
| report.py | 155 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot | | | | |
| __init__.py | 2 | 0 | 100% | |
| base.py | 5 | 0 | 100% | |
| style.py | 28 | 0 | 100% | |
| utils.py | 118 | 5 | 95% | 50, 74–76, 80 |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot/metrics | | | | |
| __init__.py | 5 | 0 | 100% | |
| confusion_matrix.py | 69 | 4 | 94% | 90, 98, 120, 228 |
| precision_recall_curve.py | 174 | 2 | 98% | 521, 524 |
| prediction_error.py | 160 | 0 | 100% | |
| roc_curve.py | 242 | 4 | 98% | 380, 497, 598, 791 |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split | | | | |
| __init__.py | 0 | 0 | 100% | |
| train_test_split.py | 49 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split/warning | | | | |
| __init__.py | 8 | 0 | 100% | |
| high_class_imbalance_too_few_examples_warning.py | 17 | 1 | 94% | 80 |
| high_class_imbalance_warning.py | 18 | 0 | 100% | |
| random_state_unset_warning.py | 10 | 0 | 100% | |
| shuffle_true_warning.py | 10 | 1 | 90% | 46 |
| stratify_is_set_warning.py | 10 | 0 | 100% | |
| time_based_column_warning.py | 21 | 1 | 95% | 73 |
| train_test_split_warning.py | 4 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/utils | | | | |
| __init__.py | 6 | 2 | 66% | 8, 13 |
| _accessor.py | 52 | 2 | 96% | 67, 108 |
| _environment.py | 27 | 0 | 100% | |
| _fixes.py | 8 | 0 | 100% | |
| _index.py | 5 | 0 | 100% | |
| _logger.py | 22 | 4 | 81% | 15–17, 19 |
| _measure_time.py | 10 | 0 | 100% | |
| _parallel.py | 38 | 3 | 92% | 23, 33, 124 |
| _patch.py | 13 | 5 | 61% | 21, 23–24, 35, 37 |
| _progress_bar.py | 45 | 0 | 100% | |
| _show_versions.py | 33 | 2 | 93% | 65–66 |
| _testing.py | 12 | 0 | 100% | |
| TOTAL | 3238 | 107 | 96% | |
| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 788 | 5 💤 | 0 ❌ | 0 🔥 | 1m 5s ⏱️ |
MarieSacksick left a comment
Hello!
Sorry for the delay in getting to the review.
What I like is that the compare-datasets function can also be helpful for checking whether train and test are statistically different at some point.
I left some comments.
Can you also add tests, please?
Thanks :)!
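For illustration, a minimal sketch of using the proposed `compare_datasets` on a train/test split (assuming the export added in this PR; the result keys follow the code under review and may change):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from skore import compare_datasets  # export proposed in this PR

X, _ = load_diabetes(return_X_y=True, as_frame=True)
train, test = train_test_split(X, test_size=0.2, random_state=0)

# Shapes, shared columns, row overlap, and per-column KS tests
results = compare_datasets(train, test)
for col, col_stats in results["feature_distribution"].items():
    print(col, col_stats["p_value"], col_stats["same_distribution"])
```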
```python
if results["column_comparison"]["only_in_dataset1"]:
    print(f"- Columns only in dataset 1: {results['column_comparison']['only_in_dataset1']}")
if results["column_comparison"]["only_in_dataset2"]:
    print(f"- Columns only in dataset 2: {results['column_comparison']['only_in_dataset2']}")
```
It could be interesting here to specify which columns are common.
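For example (a hedged sketch; the `common_columns` key is a hypothetical name), the shared columns could be reported alongside the ones unique to each dataset:

```python
import pandas as pd

dataset1 = pd.DataFrame({"a": [1], "b": [2]})
dataset2 = pd.DataFrame({"b": [3], "c": [4]})
results = {"column_comparison": {}}

# Also report the columns common to both datasets
shared_columns = sorted(set(dataset1.columns) & set(dataset2.columns))
results["column_comparison"]["common_columns"] = shared_columns
if shared_columns:
    print(f"- Columns common to both datasets: {shared_columns}")
```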
```python
overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
If datasets 1 and 2 have very different lengths (as can happen when dealing with a train and test set), this percentage can be misleading. Can you either find a more explicit key (all the ideas I have in mind are too long, such as overlap_percentage_to_dataset1), or remove it, please?
Another idea: make it fully explicit in the "message" key, and use that.
I see now that you are using it below - let's find a clear name to keep it :)!
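One possible compromise (a hedged sketch; the key names are hypothetical) is to make the denominator explicit in the key itself:

```python
import pandas as pd

dataset1 = pd.DataFrame({"a": [1, 2, 3, 4]})
dataset2 = pd.DataFrame({"a": [3, 4]})

merged = pd.merge(dataset1.drop_duplicates(), dataset2.drop_duplicates(),
                  on=["a"], how="inner")
overlap_count = len(merged)

# Make it explicit which dataset the percentage is relative to
overlap_analysis = {
    "overlap_row_count": overlap_count,
    "overlap_pct_of_dataset1": round(overlap_count / len(dataset1) * 100, 2),
    "overlap_pct_of_dataset2": round(overlap_count / len(dataset2) * 100, 2),
}
print(overlap_analysis)
```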
```python
    results["overlap_analysis"] = {
        "message": "No shared columns to check for overlapping rows"
    }
except Exception as e:
```
What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.
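One way to surface the failure plainly (a hedged sketch; the warning message is illustrative) is to emit a warning instead of only storing the error in the result dict:

```python
import warnings

results = {}
try:
    raise ValueError("columns have incompatible dtypes")  # stand-in for a real failure
except Exception as exc:
    # Warn the user immediately rather than burying the error in the output
    warnings.warn(f"Overlap analysis could not be completed: {exc}",
                  UserWarning, stacklevel=2)
    results["overlap_analysis"] = {"error": str(exc)}
```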
```python
for col in shared_columns:
    # Check if column is numeric
    if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
        try:
```
What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.
```python
    # Check if column is numeric
    if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
        try:
            # Perform Kolmogorov-Smirnov test for distribution comparison
```
Really nice; we can enrich this with other tests later, but it's a good start.
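As one example of a later enrichment (a hedged sketch, not part of this PR), a chi-square test could cover categorical columns the same way the KS test covers numeric ones:

```python
import pandas as pd
from scipy import stats

def compare_categorical_column(s1: pd.Series, s2: pd.Series) -> dict:
    """Chi-square test on the value counts of a categorical column."""
    counts = pd.concat([s1.value_counts(), s2.value_counts()], axis=1).fillna(0)
    chi2, p_value, _, _ = stats.chi2_contingency(counts.T.values)
    return {"chi2_statistic": chi2, "p_value": p_value,
            "same_distribution": p_value > 0.05}

print(compare_categorical_column(pd.Series(list("xxyyz")), pd.Series(list("xyyzz"))))
```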
```python
dataset1_subset = dataset1[shared_columns].drop_duplicates()
dataset2_subset = dataset2[shared_columns].drop_duplicates()

merged = pd.merge(
    dataset1_subset, dataset2_subset,
    on=shared_columns, how='inner'
)

overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
Whenever it's not completely straightforward, can you split the big function into sub-functions, please? This way, it's much easier to test all the use cases, and also to read.
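For example (a hedged sketch; names are hypothetical), the overlap analysis above could become a small helper that is easy to test in isolation:

```python
import pandas as pd

def analyze_overlap(dataset1: pd.DataFrame, dataset2: pd.DataFrame,
                    shared_columns: list) -> dict:
    """Count rows that appear in both datasets on their shared columns."""
    merged = pd.merge(
        dataset1[shared_columns].drop_duplicates(),
        dataset2[shared_columns].drop_duplicates(),
        on=shared_columns, how="inner",
    )
    overlap_count = len(merged)
    return {
        "overlap_row_count": overlap_count,
        "overlap_pct_of_dataset1": round(overlap_count / len(dataset1) * 100, 2),
    }
```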
```python
ks_stat, p_value = stats.ks_2samp(
    dataset1[col].dropna().values,
    dataset2[col].dropna().values
)

results["feature_distribution"][col] = {
    "ks_statistic": ks_stat,
    "p_value": p_value,
    "same_distribution": p_value > 0.05,  # Common threshold
    "dataset1_mean": dataset1[col].mean(),
    "dataset2_mean": dataset2[col].mean(),
    "dataset1_std": dataset1[col].std(),
    "dataset2_std": dataset2[col].std(),
}
```
This can also be put in a sub-function.
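A hedged sketch of what that sub-function could look like (the name is hypothetical):

```python
import pandas as pd
from scipy import stats

def compare_numeric_column(s1: pd.Series, s2: pd.Series) -> dict:
    """Kolmogorov-Smirnov test plus summary statistics for one numeric column."""
    ks_stat, p_value = stats.ks_2samp(s1.dropna().values, s2.dropna().values)
    return {
        "ks_statistic": ks_stat,
        "p_value": p_value,
        "same_distribution": p_value > 0.05,  # common threshold
        "dataset1_mean": s1.mean(),
        "dataset2_mean": s2.mean(),
        "dataset1_std": s1.std(),
        "dataset2_std": s2.std(),
    }
```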
```python
if hasattr(report, "X_train") and report.X_train is not None:
    # EstimatorReport case
    train_data = report.X_train
elif hasattr(report, "reports_") and len(report.reports_) > 0 and hasattr(report.reports_[0], "X_train"):
```
I took too long to give you a review; something has changed in ComparisonReport: it's now possible to have EstimatorReports with different X_train, as long as the y is the same. Therefore there is no way to know which report the user wants to check.
I think this function, check_data_leakage, should apply only to EstimatorReport.
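A minimal sketch of restricting the check to `EstimatorReport` (hedged; `compare_datasets` is the function proposed in this PR, and the exact error messages are illustrative):

```python
from skore import EstimatorReport
from skore import compare_datasets  # export proposed in this PR

def check_data_leakage(report, new_data):
    # Only EstimatorReport has a single, unambiguous X_train to compare against
    if not isinstance(report, EstimatorReport):
        raise TypeError(
            "check_data_leakage only supports EstimatorReport, "
            f"got {type(report).__name__}"
        )
    if report.X_train is None:
        raise ValueError("The report was built without training data (X_train is None)")
    return compare_datasets(report.X_train, new_data)
```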
Hi @MarieSacksick! No worries, I know it's a big feature and it will take some time. I'd be happy to know your thoughts before moving forward. Thank you!
Closes #1509
Hi @MarieSacksick, @auguste-probabl!
I've added a `dataset_utils.py` which consists of 2 functions:
- `compare_datasets`: a function that compares two datasets and provides a detailed analysis of their differences, including shape, columns, and overlapping records.
- `check_data_leakage`: a helper function that works with report objects to verify that new data doesn't overlap significantly with training data.

I've also exposed these new utility functions in `__init__.py` for users.
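For reviewers, a hedged usage sketch of the two functions as proposed here (exact names, signatures, and result keys may change during review):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from skore import EstimatorReport, compare_datasets, check_data_leakage

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

report = EstimatorReport(
    LogisticRegression(),
    X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
)

# Compare two datasets directly ...
results = compare_datasets(X_train, X_test)

# ... or check that "new" data does not overlap significantly with the training data
leakage = check_data_leakage(report, X_test)
```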