Conversation

@Muhammad-Rebaal
Contributor

Closes #1509

Hi @MarieSacksick, @auguste-probabl!

I've added a dataset_utils.py which consists of 2 functions:

compare_datasets: A function that compares two datasets and provides a detailed analysis of their differences, including shape, columns, and overlapping records.

check_data_leakage: A helper function that works with report objects to verify that new data doesn't overlap significantly with the training data.

I also exposed these new utility functions in __init__.py so users can access them.
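
For context, a minimal sketch of how these utilities might be used. The import path follows the description above, but the exact signatures and result keys are assumptions, not the final API:

```python
import pandas as pd

# Hypothetical usage; signatures and result keys are assumptions based on
# the PR description, not the final API.
from skore import compare_datasets, check_data_leakage

X_train = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 55_000, 72_000]})
X_new = pd.DataFrame({"age": [32, 51], "income": [55_000, 81_000]})

# Compare shapes, columns, overlapping rows and feature distributions.
results = compare_datasets(X_train, X_new)
print(results["column_comparison"])
print(results["overlap_analysis"])

# With a fitted report object (e.g. an EstimatorReport), verify that the
# "new" data does not overlap significantly with the training data.
# check_data_leakage(report, X_new)
```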

@auguste-probabl
Contributor

Hi @Muhammad-Rebaal, thanks, super interesting. Since #1509 is a big feature that requires some thought, you should not expect your PR to be merged as-is. However it's very cool as a proof of concept, much appreciated.

@Muhammad-Rebaal
Contributor Author

> Hi @Muhammad-Rebaal, thanks, super interesting. Since #1509 is a big feature that requires some thought, you should not expect your PR to be merged as-is. However it's very cool as a proof of concept, much appreciated.

Thanks for the feedback! It would be a pleasure for me to implement this, and I'd love to see it in action. Happy to hear if there is anything more to refine.

@thomass-dev
Collaborator

thomass-dev commented May 26, 2025

[automated comment] Please update your PR with main, so that the pytest workflow status will be reported.

@github-actions
Contributor

Coverage

Coverage Report for skore/
| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| venv/lib/python3.12/site-packages/skore | | | | |
| __init__.py | 24 | 0 | 100% | |
| _config.py | 28 | 0 | 100% | |
| dataset_utils.py | 92 | 34 | 63% | 79, 81, 137, 140–141, 147–148, 150, 152–153, 155, 160, 169–170, 184–187, 207, 211–215, 217–218, 220–222, 289, 291, 293, 302–303 |
| exceptions.py | 4 | 4 | 0% | 4, 15, 19, 23 |
| venv/lib/python3.12/site-packages/skore/project | | | | |
| __init__.py | 2 | 0 | 100% | |
| metadata.py | 67 | 0 | 100% | |
| project.py | 43 | 0 | 100% | |
| reports.py | 11 | 0 | 100% | |
| widget.py | 138 | 5 | 96% | 375–377, 447–448 |
| venv/lib/python3.12/site-packages/skore/sklearn | | | | |
| __init__.py | 6 | 0 | 100% | |
| _base.py | 169 | 14 | 91% | 45, 58, 126, 129, 182, 185–186, 188–191, 224, 227–228 |
| find_ml_task.py | 61 | 0 | 100% | |
| types.py | 20 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_comparison | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 205 | 3 | 98% | 165, 329, 1283 |
| report.py | 95 | 0 | 100% | |
| utils.py | 58 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_cross_validation | | | | |
| __init__.py | 5 | 0 | 100% | |
| metrics_accessor.py | 204 | 1 | 99% | 321 |
| report.py | 108 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_estimator | | | | |
| __init__.py | 7 | 0 | 100% | |
| feature_importance_accessor.py | 143 | 2 | 98% | 216–217 |
| metrics_accessor.py | 367 | 8 | 97% | 181, 183, 190, 281, 350, 354, 369, 404 |
| report.py | 155 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot | | | | |
| __init__.py | 2 | 0 | 100% | |
| base.py | 5 | 0 | 100% | |
| style.py | 28 | 0 | 100% | |
| utils.py | 118 | 5 | 95% | 50, 74–76, 80 |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot/metrics | | | | |
| __init__.py | 5 | 0 | 100% | |
| confusion_matrix.py | 69 | 4 | 94% | 90, 98, 120, 228 |
| precision_recall_curve.py | 174 | 2 | 98% | 521, 524 |
| prediction_error.py | 160 | 0 | 100% | |
| roc_curve.py | 242 | 4 | 98% | 380, 497, 598, 791 |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split | | | | |
| __init__.py | 0 | 0 | 100% | |
| train_test_split.py | 49 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split/warning | | | | |
| __init__.py | 8 | 0 | 100% | |
| high_class_imbalance_too_few_examples_warning.py | 17 | 1 | 94% | 80 |
| high_class_imbalance_warning.py | 18 | 0 | 100% | |
| random_state_unset_warning.py | 10 | 0 | 100% | |
| shuffle_true_warning.py | 10 | 1 | 90% | 46 |
| stratify_is_set_warning.py | 10 | 0 | 100% | |
| time_based_column_warning.py | 21 | 1 | 95% | 73 |
| train_test_split_warning.py | 4 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/utils | | | | |
| __init__.py | 6 | 2 | 66% | 8, 13 |
| _accessor.py | 52 | 2 | 96% | 67, 108 |
| _environment.py | 27 | 0 | 100% | |
| _fixes.py | 8 | 0 | 100% | |
| _index.py | 5 | 0 | 100% | |
| _logger.py | 22 | 4 | 81% | 15–17, 19 |
| _measure_time.py | 10 | 0 | 100% | |
| _parallel.py | 38 | 3 | 92% | 23, 33, 124 |
| _patch.py | 13 | 5 | 61% | 21, 23–24, 35, 37 |
| _progress_bar.py | 45 | 0 | 100% | |
| _show_versions.py | 33 | 2 | 93% | 65–66 |
| _testing.py | 12 | 0 | 100% | |
| TOTAL | 3238 | 107 | 96% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 788 | 5 💤 | 0 ❌ | 0 🔥 | 1m 5s ⏱️ |

@github-actions
Contributor

Documentation preview @ c3a26b7

@MarieSacksick changed the title enh : Add dataset_utils.py to control the "new" in "new data" → enh: Add dataset_utils.py to control the "new" in "new data" on May 30, 2025
Contributor

@MarieSacksick left a comment

Hello!
Sorry for the delay in reviewing.
What I like is that the compare_datasets function can also be helpful to check whether train and test are statistically different at some point.

I gave some comments.

Can you also add tests please?

Thanks :) !

```python
if results["column_comparison"]["only_in_dataset1"]:
    print(f"- Columns only in dataset 1: {results['column_comparison']['only_in_dataset1']}")
if results["column_comparison"]["only_in_dataset2"]:
    print(f"- Columns only in dataset 2: {results['column_comparison']['only_in_dataset2']}")
```
Contributor

It could be interesting here to also report which columns are common.

```python
overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
Contributor

If datasets 1 and 2 have very different lengths (as they can when we are dealing with a train and a test set), this percentage can be misleading. Can you either find a more explicit key (all the ideas I have in mind are too long, such as overlap_percentage_to_dataset1), or remove it please?

Contributor

Another idea: spell it out completely in the "message" key, and use that.

Contributor

I see now that you are using it below; let's find a clear name so we can keep it :)!
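
For illustration, a small helper with a more explicit key and message could look roughly like this (key names are suggestions in the spirit of the comments above, not what the PR currently uses):

```python
def _overlap_summary(merged_row_count: int, dataset1_row_count: int) -> dict:
    """Build the overlap entry with an explicit key name and message."""
    pct = round(merged_row_count / dataset1_row_count * 100, 2)
    return {
        "overlap_row_count": merged_row_count,
        # Name the reference dataset explicitly instead of a bare "overlap_percentage".
        "overlap_percentage_of_dataset1": pct,
        # Spell it out in the message so the output is unambiguous.
        "message": f"{merged_row_count} rows of dataset 1 ({pct}%) also appear in dataset 2",
    }
```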

```python
    results["overlap_analysis"] = {
        "message": "No shared columns to check for overlapping rows"
    }
except Exception as e:
```
Contributor

What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.

```python
for col in shared_columns:
    # Check if column is numeric
    if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
        try:
```
Contributor

What are the difficulties that could trigger a failure? The user should be warned plainly if there is an interruption somewhere. As I understand it, here they would have to dig and read the whole output to be able to see it.

```python
# Check if column is numeric
if pd.api.types.is_numeric_dtype(dataset1[col]) and pd.api.types.is_numeric_dtype(dataset2[col]):
    try:
        # Perform Kolmogorov-Smirnov test for distribution comparison
```
Contributor

Really nice; we can enrich this with other tests later, but it's a good start.
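
As an example of the kind of later enrichment, categorical columns could be compared with a chi-squared test; a minimal sketch, not part of this PR:

```python
import pandas as pd
from scipy import stats

def compare_categorical_column(s1: pd.Series, s2: pd.Series, alpha: float = 0.05) -> dict:
    """Compare category frequencies of two samples with a chi-squared test."""
    counts = pd.concat(
        [s1.value_counts(), s2.value_counts()],
        axis=1,
        keys=["dataset1", "dataset2"],
    ).fillna(0)
    chi2, p_value, _, _ = stats.chi2_contingency(counts.T.values)
    return {
        "chi2_statistic": chi2,
        "p_value": p_value,
        "same_distribution": p_value > alpha,
    }
```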

Comment on lines +123 to +134
```python
dataset1_subset = dataset1[shared_columns].drop_duplicates()
dataset2_subset = dataset2[shared_columns].drop_duplicates()

merged = pd.merge(
    dataset1_subset, dataset2_subset,
    on=shared_columns, how='inner'
)

overlap_count = len(merged)
results["overlap_analysis"] = {
    "overlap_row_count": overlap_count,
    "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2)
```
Contributor

Wherever the logic isn't completely straightforward, can you split the big function into sub-functions please? That way, it's much easier to test all the use cases, and also to read.
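
For instance, the overlap computation quoted above could be extracted into a helper along these lines (a sketch of the suggestion, not code from the PR):

```python
import pandas as pd

def _analyze_row_overlap(
    dataset1: pd.DataFrame, dataset2: pd.DataFrame, shared_columns: list
) -> dict:
    """Count the rows (restricted to the shared columns) present in both datasets."""
    dataset1_subset = dataset1[shared_columns].drop_duplicates()
    dataset2_subset = dataset2[shared_columns].drop_duplicates()
    merged = pd.merge(dataset1_subset, dataset2_subset, on=shared_columns, how="inner")
    overlap_count = len(merged)
    return {
        "overlap_row_count": overlap_count,
        "overlap_percentage": round(overlap_count / len(dataset1) * 100, 2),
    }
```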

Comment on lines +155 to +168
```python
ks_stat, p_value = stats.ks_2samp(
    dataset1[col].dropna().values,
    dataset2[col].dropna().values
)

results["feature_distribution"][col] = {
    "ks_statistic": ks_stat,
    "p_value": p_value,
    "same_distribution": p_value > 0.05,  # Common threshold
    "dataset1_mean": dataset1[col].mean(),
    "dataset2_mean": dataset2[col].mean(),
    "dataset1_std": dataset1[col].std(),
    "dataset2_std": dataset2[col].std(),
}
```
Contributor

This could also be put in a sub-function.
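
A possible shape for that sub-function, following the quoted hunk (a sketch; the 0.05 threshold is kept as a parameter):

```python
import pandas as pd
from scipy import stats

def _compare_numeric_distributions(s1: pd.Series, s2: pd.Series, alpha: float = 0.05) -> dict:
    """Kolmogorov-Smirnov comparison of two numeric samples, as in the hunk above."""
    ks_stat, p_value = stats.ks_2samp(s1.dropna().values, s2.dropna().values)
    return {
        "ks_statistic": ks_stat,
        "p_value": p_value,
        "same_distribution": p_value > alpha,
        "dataset1_mean": s1.mean(),
        "dataset2_mean": s2.mean(),
        "dataset1_std": s1.std(),
        "dataset2_std": s2.std(),
    }
```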

```python
if hasattr(report, "X_train") and report.X_train is not None:
    # EstimatorReport case
    train_data = report.X_train
elif hasattr(report, "reports_") and len(report.reports_) > 0 and hasattr(report.reports_[0], "X_train"):
```
Contributor

I took too long to give you a review; something changed in ComparisonReport in the meantime: it's now possible to have EstimatorReports with different X_train, as long as the y is the same. Therefore there is no way to know which report the user wants to check.
I think this function, check_data_leakage, could apply only to EstimatorReport.
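
As an illustration of restricting it to EstimatorReport, the dispatch could reduce to something like this (a sketch only; the threshold parameter and error handling are assumptions, and compare_datasets is the other utility introduced in this PR):

```python
import warnings

from skore import compare_datasets  # exposed in __init__.py per the PR description

def check_data_leakage(report, X_new, threshold: float = 5.0) -> dict:
    """Check X_new against the training data of an EstimatorReport.

    `threshold` is the maximum tolerated percentage of overlapping rows
    (an illustrative parameter, not part of the PR).
    """
    if not hasattr(report, "X_train") or report.X_train is None:
        raise TypeError("check_data_leakage only supports an EstimatorReport with X_train set")
    results = compare_datasets(report.X_train, X_new)
    overlap = results.get("overlap_analysis", {})
    if overlap.get("overlap_percentage", 0) > threshold:
        # Surface the problem plainly instead of burying it in the output.
        warnings.warn("X_new overlaps significantly with the training data")
    return results
```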

@Muhammad-Rebaal
Contributor Author

> Hello! Sorry for the delay for the review. What I like is that the compare dataset function can also be helpful to check whether train and test are statistically different at some point.
>
> I gave some comments.
>
> Can you also add tests please?
>
> Thanks :) !

Hi @MarieSacksick!

No worries, I know it's a big feature and it will take some time.
First of all, thanks to @glemaitre for his vision. Along those lines, I'm also thinking we should take a more focused approach rather than making a global utility to be used everywhere: focus on reports, i.e. on an EstimatorReport (standalone, from a CrossValidationReport, or from a ComparisonReport), and design an API around the TableReport from skrub.

Would be happy to know your thoughts before moving forward.

Thank you!

Development

Successfully merging this pull request may close these issues.

enh: help to control the "new" in "new data"
