docs(getting started): Change dataset to synthetic #1941

mohsinm-dev · 2025-07-27T15:07:40Z

Summary

Replace load_breast_cancer with make_classification using challenging parameters
Creates a more compelling demonstration of Skore's capabilities
The breast cancer dataset produces near-perfect results that don't effectively showcase the model
evaluation tools

Changes

Use make_classification with parameters: n_samples=1000, n_features=20, n_informative=10,
n_redundant=10, n_clusters_per_class=1, random_state=42
Convert to DataFrame format to maintain compatibility with existing code
Update comments to reflect synthetic dataset usage
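
The dataset change listed above could look roughly like the following. This is a minimal sketch, not the code from the PR: the parameter values are the ones given in the description, while the column names (`feature_0`, `feature_1`, ...) and the variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Parameter values as listed in the PR description.
X_array, y_array = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=10,
    n_clusters_per_class=1,
    random_state=42,
)

# Hypothetical column names; the PR does not state which names it uses.
feature_names = [f"feature_{i}" for i in range(X_array.shape[1])]

# Wrap the arrays in pandas objects so code written for a DataFrame keeps working.
X = pd.DataFrame(X_array, columns=feature_names)
y = pd.Series(y_array, name="target")

print(X.shape)
print(y.value_counts())
```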

Why this improves the example

The synthetic dataset provides more realistic performance metrics (ROC-AUC ~0.90 vs ~0.99) and
clearer differences between models. This makes it better for educational purposes and for
demonstrating when different models and techniques matter.
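
One way to check a claim like this is to cross-validate the same model on both datasets and compare the scores. The sketch below uses only scikit-learn. The choice of `RandomForestClassifier`, the 5-fold split, and the helper name `mean_roc_auc` are assumptions for illustration. The printed values depend on the model and settings, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def mean_roc_auc(X, y):
    """Return the mean 5-fold cross-validated ROC-AUC of a random forest."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    return scores.mean()


# Original dataset used by the example.
X_real, y_real = load_breast_cancer(return_X_y=True)

# Synthetic dataset with the parameters from the PR description.
X_synth, y_synth = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=10,
    n_clusters_per_class=1,
    random_state=42,
)

auc_real = mean_roc_auc(X_real, y_real)
auc_synth = mean_roc_auc(X_synth, y_synth)
print(f"breast cancer ROC-AUC: {auc_real:.3f}")
print(f"synthetic ROC-AUC:     {auc_synth:.3f}")
```

If the two numbers come out close to each other, the synthetic parameters are probably still too easy for the chosen model.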

Test plan

Verified EstimatorReport works with the synthetic dataset
Tested CrossValidationReport functionality
Confirmed ComparisonReport works correctly
Validated feature importance calculations
Ensured all metrics and visualizations display properly

Fixes #1861

examples/getting_started/plot_skore_getting_started.py

thomass-dev · 2025-07-28T07:23:26Z

Did you use an LLM to generate your PR? Please check the documentation.

The challenge is to understand the issue, and to be able to easily contribute in future.
Thanks.

examples/getting_started/plot_skore_getting_started.py

thomass-dev

I don't think the PR is sufficient to resolve the issue. @sylvaincom can you take a look? Thanks!

This is the current rendering:

github-actions · 2025-07-29T08:32:00Z

Documentation preview @ d973f3e

sylvaincom · 2025-07-29T10:03:58Z

Hi, I agree with @thomass-dev: this feels LLM-generated, and you did not take into account the description of the parent issue, especially the plots shown there.

comparator.metrics.roc().plot() should display a ROC curve that does not appear too easy

mohsinm-dev · 2025-07-29T11:13:30Z

@sylvaincom I’ve updated the example to use 10,000 samples with CrossValidationReport and matched the code structure to your reference. Can you please check this?

- Use make_classification with n_samples=10_000 and multiclass parameters
- Update estimator to LogisticRegression for better demonstration

Fixes probabl-ai#1861

sylvaincom · 2025-07-29T15:02:32Z

You do not seem to understand what you're doing: you're calling rf_report (RF: random forest) a model report that corresponds to a logistic regression.

- Change rf to lr and rf_report to lr_report for consistency

glemaitre

A couple of remarks.

examples/getting_started/plot_skore_getting_started.py

This reverts commit ce1da49.

This reverts commit 130d435.

auguste-probabl · 2025-08-28T08:21:53Z

Here's what you get with RandomForestClassifier:

CVReport:

ComparisonReport:

IMO it looks good enough for our purposes.

thomass-dev

Thanks @auguste-probabl for the follow-up, LGTM now.

github-actions bot assigned mohsinm-dev Jul 27, 2025

mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from 0458060 to 78684a1 Compare July 27, 2025 15:13

thomass-dev requested changes Jul 28, 2025

examples/getting_started/plot_skore_getting_started.py Outdated

mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from 78684a1 to f5df837 Compare July 29, 2025 00:19

thomass-dev reviewed Jul 29, 2025

examples/getting_started/plot_skore_getting_started.py

thomass-dev requested changes Jul 29, 2025

docs: Update dataset in getting started guide to use synthetic data

b78d496

- Use make_classification with n_samples=10_000 and multiclass parameters
- Update estimator to LogisticRegression for better demonstration

Fixes probabl-ai#1861

mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from f5df837 to b78d496 Compare July 29, 2025 11:20

mohsinm-dev added 2 commits July 29, 2025 21:09

docs: Fix variable naming to match LogisticRegression

130d435

- Change rf to lr and rf_report to lr_report for consistency

Fix variable naming to match LogisticRegression

ce1da49

glemaitre self-requested a review July 30, 2025 20:38

glemaitre reviewed Jul 30, 2025

examples/getting_started/plot_skore_getting_started.py Outdated

examples/getting_started/plot_skore_getting_started.py Outdated

examples/getting_started/plot_skore_getting_started.py Outdated

auguste-probabl changed the title ~~docs: Change dataset in getting started guide from breast cancer to s…~~ docs: Change dataset in getting started guide to synthetic dataset Aug 13, 2025

auguste-probabl changed the title ~~docs: Change dataset in getting started guide to synthetic dataset~~ docs(getting started): Change dataset to synthetic Aug 13, 2025

auguste-probabl added 4 commits August 28, 2025 10:11

merge

c0de3da

Revert "Fix variable naming to match LogisticRegression"

7475411

This reverts commit ce1da49.

Revert "docs: Fix variable naming to match LogisticRegression"

92dc2fc

This reverts commit 130d435.

Revert to RandomForestClassifier

d973f3e

auguste-probabl requested review from glemaitre and thomass-dev August 28, 2025 08:21

thomass-dev approved these changes Sep 1, 2025

thomass-dev added this pull request to the merge queue Sep 1, 2025

Merged via the queue into probabl-ai:main with commit 1e9a7b6 Sep 1, 2025
15 checks passed

5 participants
