@glemaitre glemaitre commented Dec 20, 2024

closes #834

Investigate an API for an EstimatorReport.

TODO

  • Metrics
    • handle string metrics as specified in the accessor
    • handle callable metrics
    • handle scikit-learn scorers
    • use the cache as efficiently as possible
    • add testing for all of those features
    • allow passing a new validation set to functions instead of using the internal validation set
    • add a proper help and rich __repr__
  • Plots
    • add the roc curve display
    • add the precision recall curve display
    • add prediction error display for regressor
    • make proper testing for those displays
    • add a proper __repr__ for those displays
  • Documentation
    • (done for the checked part) add an example to showcase all the different features
    • find a way to show the accessors documentation in the page of EstimatorReport. It could be a bit tricky because they are only defined once the instance is created.
      • We need to have a look at the series.rst page from pandas to see how they document this sort of pattern.
    • check the autocompletion: when typing report.metrics.->tab it should provide the autocompletion. edit: having a stub file is actually working. I prefer this to type hints directly in the file.
  • Open questions
    • we use hashing to retrieve external set.
    • use the caching for the external validation set? To make it work we need to compute the hash of potentially big arrays. This might be more costly than making the model predict.
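
The trade-off raised in the last open question can be illustrated with a minimal sketch. It is not skore's implementation: `data_hash` and `CachedPredictor` are hypothetical names, and a standard-library `array` stands in for a NumPy array. The point is that the cache key requires a full pass over the data, which may cost as much as a cheap `predict` call.

```python
import hashlib
from array import array


def data_hash(values: array) -> str:
    # Hashing reads every byte, so its cost grows linearly with the array size.
    return hashlib.sha256(values.tobytes()).hexdigest()


class CachedPredictor:
    """Hypothetical wrapper that caches predictions per data hash."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._cache = {}
        self.n_calls = 0  # number of times the model was actually called

    def predict(self, values: array) -> list:
        key = data_hash(values)
        if key not in self._cache:
            self.n_calls += 1
            self._cache[key] = self._predict_fn(values)
        return self._cache[key]


# Stand-in "model": doubles every input value.
predictor = CachedPredictor(lambda xs: [2.0 * x for x in xs])
X_external = array("d", [0.5, 1.5, 2.5])

first = predictor.predict(X_external)
# Equal content in a new object still hits the cache.
second = predictor.predict(array("d", [0.5, 1.5, 2.5]))
```

Whether this pays off depends on how the hashing time compares with the prediction time for the model at hand, which is exactly the open question.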

Notes

This PR builds upon:

  • #962 to reuse the skore.console
  • #998 to be able to detect clusterers in a consistent manner

@glemaitre glemaitre marked this pull request as draft December 20, 2024 21:10
@glemaitre glemaitre changed the title from "feat: design of ModelReport" to "feat: Design of ModelReport" Dec 20, 2024
@@ -1,9 +1,11 @@
"""Enhance `sklearn` functions."""

from skore.sklearn._estimator import EstimatorReport
Collaborator

It's disturbing that you want to expose something from a private/protected module.
Shouldn't skore.sklearn.estimator be exposed too by removing _?

Member Author

Basically, I want the user to be able to do

skore.EstimatorReport

or

skore.sklearn.EstimatorReport

but I don't want to expose anything at a lower level. In scikit-learn (and other packages), whenever you don't want people to import from a private module, you add an _ even if it is a folder.

For instance, I would probably do the same for cross_validation.

However, it is something that we can discuss later.
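
A minimal sketch of the convention described above, under stated assumptions: `demo_pkg` is a hypothetical throwaway package built in a temporary directory, not skore's actual layout. The class lives in a private `_estimator` subpackage and is re-exported from the public package, so users never need to import from the underscored path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; `demo_pkg` is a hypothetical name.
root = Path(tempfile.mkdtemp())
private = root / "demo_pkg" / "_estimator"
private.mkdir(parents=True)

# The class is defined inside a private subpackage (leading underscore).
(private / "report.py").write_text("class EstimatorReport:\n    pass\n")
(private / "__init__.py").write_text(
    "from demo_pkg._estimator.report import EstimatorReport\n"
)

# The public package re-exports it at the top level.
(root / "demo_pkg" / "__init__.py").write_text(
    "from demo_pkg._estimator import EstimatorReport\n"
    "__all__ = ['EstimatorReport']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

demo_pkg = importlib.import_module("demo_pkg")
report_module = importlib.import_module("demo_pkg._estimator.report")

# The public name and the private definition are the same object.
same_class = demo_pkg.EstimatorReport is report_module.EstimatorReport
```

The underscore does not prevent deep imports; it only signals that the path is not part of the public API.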

"""Setup and teardown fixture for matplotlib.
This fixture checks if we can import matplotlib. If not, the tests will be
skipped. Otherwise, we close the figures before and after running the
Collaborator

FMI, why close the figures before the test, not just after?

Member Author

I don't have a definitive answer since I did not write this fixture in scikit-learn. My guess is that some test might fail and never reach the teardown, so closing the figures before the next test guarantees a clean start. However, I'm unsure.
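
One scenario where the leading cleanup matters can be sketched without matplotlib. This is only an illustration of the guess above, not the scikit-learn fixture: `open_figures` is a stand-in for a global figure registry and `clean_figures` is a hypothetical setup/teardown helper. Code that runs outside the helper can leave state behind, and only the setup step protects the next user from it.

```python
from contextlib import contextmanager

# Stand-in for a global figure registry (an assumption for this sketch).
open_figures = []


@contextmanager
def clean_figures():
    open_figures.clear()  # setup: drop anything leaked earlier
    try:
        yield open_figures
    finally:
        open_figures.clear()  # teardown: clean up after ourselves


# Some earlier code did not go through the helper and leaked a figure.
open_figures.append("leaked figure")

with clean_figures() as figures:
    # Thanks to the setup step, the block starts from an empty registry.
    seen_at_start = list(figures)
    figures.append("figure created in this block")
```

With a teardown-only helper, `seen_at_start` would contain the leaked figure.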

"estimator[/bold cyan]"
)

def _create_help_tree(self):
Collaborator

@thomass-dev thomass-dev Jan 9, 2025

Could you please add a representation of the reporter's attributes to the help?
For instance, it would help users to know that the reporter contains the fitted estimator.

Member Author

I added an ending branch listing all getter and init attributes.

image

)
)
# trigger the computation
list(
Contributor

Maybe we could have a list of indeterminate progress indicators instead of one progress bar that "jumps".

Member Author

Happy to see what we can do to improve the current state.

@@ -0,0 +1,168 @@
from typing import Any, Callable, Literal, Optional, Union
Collaborator

@thomass-dev thomass-dev Jan 9, 2025

To-do: check if removing the stub files breaks the auto-completion or not, and check if a work-around exists (ping @augustebaum).
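
For context, here is a minimal sketch of why auto-completion is hard for accessors and what a stub file contributes. All names (`_Accessor`, `register_accessor`, `Report`, `MetricsAccessor`) are hypothetical and do not reflect skore's actual code. The accessor is attached to the class at runtime through a descriptor, so a static tool reading the class body alone sees no `metrics` attribute.

```python
class _Accessor:
    """Descriptor that builds the accessor on attribute access."""

    def __init__(self, accessor_cls):
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class itself: return the accessor class.
            return self._accessor_cls
        return self._accessor_cls(obj)


def register_accessor(name, target_cls):
    """Attach `accessor_cls` to `target_cls` under the attribute `name`."""

    def decorator(accessor_cls):
        setattr(target_cls, name, _Accessor(accessor_cls))
        return accessor_cls

    return decorator


class Report:
    def __init__(self, y_true, y_pred):
        self.y_true = y_true
        self.y_pred = y_pred


@register_accessor("metrics", Report)
class MetricsAccessor:
    def __init__(self, parent):
        self._parent = parent

    def accuracy(self) -> float:
        pairs = zip(self._parent.y_true, self._parent.y_pred)
        return sum(t == p for t, p in pairs) / len(self._parent.y_true)


# A stub file (for instance a hypothetical report.pyi) could declare the
# attribute statically so that editors can complete it:
#
#     class Report:
#         metrics: MetricsAccessor

report = Report(y_true=[0, 1, 1, 0], y_pred=[0, 1, 0, 0])
score = report.metrics.accuracy()
```

An inline annotation on the class would give static tools the same information, which is the alternative the description weighs against a stub file.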

github-actions bot commented Jan 9, 2025

Documentation preview @ 82f6332

github-actions bot commented Jan 9, 2025

Coverage

Coverage Report for backend
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| venv/lib/python3.12/site-packages/skore | | | | |
| __init__.py | 12 | 0 | 100% | |
| __main__.py | 8 | 1 | 80% | 19 |
| exceptions.py | 3 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/cli | | | | |
| __init__.py | 5 | 0 | 100% | |
| cli.py | 33 | 3 | 85% | 104, 111, 117 |
| color_format.py | 43 | 3 | 90% | 35–>40, 41–43 |
| launch_dashboard.py | 26 | 15 | 39% | 36–57 |
| quickstart_command.py | 14 | 7 | 50% | 37–51 |
| venv/lib/python3.12/site-packages/skore/item | | | | |
| __init__.py | 21 | 0 | 100% | |
| cross_validation_item.py | 137 | 10 | 93% | 27–42, 370 |
| item.py | 41 | 13 | 68% | 85, 88, 92–112 |
| item_repository.py | 42 | 2 | 93% | 12–13 |
| media_item.py | 70 | 4 | 94% | 15–18 |
| numpy_array_item.py | 25 | 1 | 93% | 15 |
| pandas_dataframe_item.py | 34 | 1 | 95% | 15 |
| pandas_series_item.py | 34 | 1 | 95% | 15 |
| polars_dataframe_item.py | 32 | 1 | 94% | 15 |
| polars_series_item.py | 27 | 1 | 94% | 15 |
| primitive_item.py | 27 | 2 | 92% | 13–15 |
| sklearn_base_estimator_item.py | 33 | 1 | 95% | 15 |
| skrub_table_report_item.py | 10 | 1 | 86% | 11 |
| venv/lib/python3.12/site-packages/skore/persistence | | | | |
| __init__.py | 0 | 0 | 100% | |
| abstract_storage.py | 22 | 1 | 95% | 130 |
| disk_cache_storage.py | 33 | 1 | 95% | 44 |
| in_memory_storage.py | 20 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/project | | | | |
| __init__.py | 3 | 0 | 100% | |
| create.py | 52 | 8 | 88% | 116–122, 132–133, 140–141 |
| load.py | 23 | 3 | 89% | 43–45 |
| open.py | 14 | 0 | 100% | |
| project.py | 64 | 4 | 91% | 135, 149, 183, 187 |
| venv/lib/python3.12/site-packages/skore/sklearn | | | | |
| __init__.py | 4 | 0 | 100% | |
| find_ml_task.py | 35 | 1 | 95% | 41–>49, 50 |
| types.py | 2 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/_estimator | | | | |
| __init__.py | 10 | 0 | 100% | |
| base.py | 76 | 2 | 98% | 87–88 |
| metrics_accessor.py | 198 | 2 | 98% | 131, 266 |
| report.py | 165 | 1 | 97% | 145–>151, 147–>149, 150, 153–>155, 159–>163, 408–>413 |
| utils.py | 11 | 11 | 0% | 1–19 |
| venv/lib/python3.12/site-packages/skore/sklearn/_plot | | | | |
| __init__.py | 4 | 0 | 100% | |
| precision_recall_curve.py | 126 | 2 | 97% | 200–>203, 313–314 |
| prediction_error.py | 75 | 0 | 99% | 289–>297 |
| roc_curve.py | 95 | 3 | 94% | 156, 167–>170, 223–224 |
| utils.py | 77 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/sklearn/cross_validation | | | | |
| __init__.py | 2 | 0 | 100% | |
| cross_validation_helpers.py | 47 | 4 | 90% | 104–>136, 123–126 |
| cross_validation_reporter.py | 35 | 1 | 95% | 177 |
| venv/lib/python3.12/site-packages/skore/sklearn/cross_validation/plots | | | | |
| __init__.py | 0 | 0 | 100% | |
| compare_scores_plot.py | 29 | 1 | 92% | 10, 45–>48 |
| timing_plot.py | 29 | 1 | 94% | 10 |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split | | | | |
| __init__.py | 0 | 0 | 100% | |
| train_test_split.py | 34 | 2 | 94% | 15–16 |
| venv/lib/python3.12/site-packages/skore/sklearn/train_test_split/warning | | | | |
| __init__.py | 8 | 0 | 100% | |
| high_class_imbalance_too_few_examples_warning.py | 17 | 3 | 78% | 16–18, 80 |
| high_class_imbalance_warning.py | 18 | 2 | 88% | 16–18 |
| random_state_unset_warning.py | 11 | 1 | 87% | 15 |
| shuffle_true_warning.py | 9 | 0 | 91% | 44–>exit |
| stratify_is_set_warning.py | 11 | 1 | 87% | 15 |
| time_based_column_warning.py | 22 | 1 | 89% | 17, 69–>exit |
| train_test_split_warning.py | 5 | 1 | 80% | 21 |
| venv/lib/python3.12/site-packages/skore/ui | | | | |
| __init__.py | 0 | 0 | 100% | |
| app.py | 25 | 5 | 71% | 24, 53–58 |
| dependencies.py | 7 | 1 | 86% | 12 |
| project_routes.py | 50 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/utils | | | | |
| __init__.py | 0 | 0 | 100% | |
| _accessor.py | 7 | 0 | 100% | |
| _logger.py | 21 | 4 | 84% | 14–18 |
| _show_versions.py | 31 | 0 | 100% | |
| venv/lib/python3.12/site-packages/skore/view | | | | |
| __init__.py | 0 | 0 | 100% | |
| view.py | 5 | 0 | 100% | |
| view_repository.py | 16 | 2 | 83% | 8–9 |
| TOTAL | 2225 | 136 | 93% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 349 | 0 💤 | 0 ❌ | 0 🔥 | 44.190s ⏱️ |

@glemaitre
Member Author

OK. It should be good to go and we should be able to iterate.

Contributor

@sylvaincom sylvaincom left a comment

Many thanks to @glemaitre for this very useful PR, and to the whole team for reviewing it! Let's iterate on sub-issues if needed.

@thomass-dev thomass-dev merged commit 1a4151a into probabl-ai:main Jan 10, 2025
18 checks passed
waridrox pushed a commit to waridrox/skore that referenced this pull request Apr 15, 2025
closes probabl-ai#834

Investigate an API for an `EstimatorReport`.

#### TODO

- [x] Metrics
  - [x] handle string metrics as specified in the accessor
  - [x] handle callable metrics
  - [x] handle scikit-learn scorers
  - [x] use the cache as efficiently as possible
  - [x] add testing for all of those features
  - [x] allow passing a new validation set to functions instead of using the internal validation set
  - [x] add a proper help and rich `__repr__`
- [x] Plots
  - [x] add the roc curve display
  - [x] add the precision recall curve display
  - [x] add prediction error display for regressor
  - [x] make proper testing for those displays
  - [x] add a proper `__repr__` for those displays
- [x] Documentation
  - [x] (done for the checked part) add an example to showcase all the different features
  - [x] find a way to show the accessors documentation in the page of `EstimatorReport`. It could be a bit tricky because they are only defined once the instance is created.
    - We need to have a look at the `series.rst` page from pandas to see how they document this sort of pattern.
  - [x] check the autocompletion: when typing `report.metrics.->tab` it should provide the autocompletion. **edit**: having a stub file is actually working. I prefer this to type hints directly in the file.
- Open questions
  - [x] we use hashing to retrieve external set.
  - use the caching for the external validation set? To make it work we need to compute the hash of potentially big arrays. This might be more costly than making the model predict.

#### Notes

This PR builds upon:
- probabl-ai#962 to reuse the `skore.console`
- probabl-ai#998 to be able to detect clusterers in a consistent manner.

Development

Successfully merging this pull request may close these issues.

feat(back): Estimator Report

7 participants