Part of #1874.
As a reminder, today:
```python
# %%
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from skore import train_test_split
from skore import EstimatorReport, ComparisonReport

X, y = load_breast_cancer(return_X_y=True)
split_data = train_test_split(X=X, y=y, random_state=0, as_dict=True)

classifier = LogisticRegression(max_iter=10_000)
report_1 = EstimatorReport(classifier, **split_data)
report_2 = EstimatorReport(RandomForestClassifier(), **split_data)
comp = ComparisonReport([report_1, report_2])

# %%
display = comp.metrics.precision_recall()
display.frame()
```
`display.frame()` outputs:

*(frame screenshot)*

and the plot:

*(plot screenshot)*
## Frame
For the frame, let's add an extra `data_source` column and concatenate all the data into a single long dataframe.
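As a rough illustration of the target layout (a sketch only: the column names `estimator`, `threshold`, `precision` and `recall` below are placeholders, not necessarily the display's actual schema), the long frame could be obtained by tagging each per-source frame with a `data_source` column and concatenating:

```python
import pandas as pd

# Placeholder per-source frames; in practice these would come from the display.
train_frame = pd.DataFrame(
    {
        "estimator": "LogisticRegression",
        "threshold": [0.1, 0.5, 0.9],
        "precision": [0.97, 0.98, 0.99],
        "recall": [1.00, 0.98, 0.90],
    }
)
test_frame = pd.DataFrame(
    {
        "estimator": "LogisticRegression",
        "threshold": [0.1, 0.5, 0.9],
        "precision": [0.95, 0.97, 0.98],
        "recall": [0.99, 0.95, 0.88],
    }
)

# One long dataframe, with the data source made explicit as a column.
long_frame = pd.concat(
    [
        train_frame.assign(data_source="train"),
        test_frame.assign(data_source="test"),
    ],
    ignore_index=True,
)
print(long_frame)
```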
## Plot
For the plot, to avoid having too many lines on the same plot (only two models are compared in the example above, but we could easily have more), and to avoid comparing the train curve of one model with the test curve of another, which wouldn't make any sense, let's use the subplotting option once it's implemented in #1445 to get one subplot with all the train curves and one subplot with all the test curves.
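To make the intended layout concrete, here is a plain matplotlib sketch on dummy data (not the actual skore implementation, and #1445 may expose this differently): one axis gathering the train curves of all estimators, one gathering their test curves.

```python
import matplotlib.pyplot as plt
import numpy as np

# Dummy precision-recall curves for two estimators, just to populate the panels.
estimators = ["LogisticRegression", "RandomForestClassifier"]
fig, (ax_train, ax_test) = plt.subplots(
    ncols=2, sharex=True, sharey=True, figsize=(10, 4)
)

rng = np.random.default_rng(0)
for name in estimators:
    recall = np.linspace(0, 1, 50)
    precision_train = 1 - 0.2 * recall**3 + rng.normal(0, 0.01, 50)
    precision_test = 1 - 0.3 * recall**3 + rng.normal(0, 0.01, 50)
    ax_train.plot(recall, precision_train, label=name)
    ax_test.plot(recall, precision_test, label=name)

ax_train.set(title="Train", xlabel="Recall", ylabel="Precision")
ax_test.set(title="Test", xlabel="Recall")
ax_train.legend()
plt.show()
```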
Why not have one subplot per model, with train and test in the same place? It's not necessarily a bad idea, and we could think about adding an option to offer both possibilities. Yet, to start simple and iterate, one subplot per model seems mostly useful when investigating a single given model, and that is therefore the EstimatorReport's job rather than the ComparisonReport's.
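For that single-model investigation, the user would reach for the estimator-level display instead, something along these lines (reusing `report_1` from the snippet above; whether and how train and test curves are combined there is up to the EstimatorReport display, not this issue):

```python
# Per-model view: the estimator-level display, not the comparison one.
display_1 = report_1.metrics.precision_recall()
display_1.plot()
```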