Skip to content

Conversation

@mohsinm-dev
Copy link
Contributor

Summary

  • Replace load_breast_cancer with make_classification using challenging parameters
  • Creates a more compelling demonstration of Skore's capabilities
  • The breast cancer dataset produces near-perfect results that don't effectively showcase the model
    evaluation tools

Changes

  • Use make_classification with parameters: n_samples=1000, n_features=20, n_informative=10,
    n_redundant=10, n_clusters_per_class=1, random_state=42
  • Convert to DataFrame format to maintain compatibility with existing code
  • Update comments to reflect synthetic dataset usage

Why this improves the example

The synthetic dataset provides more realistic performance metrics (ROC-AUC ~0.90 vs ~0.99) and
clearer differences between models, making it better for educational purposes, and demonstrating
When different models/techniques matter.

Test plan

  • Verified EstimatorReport works with the synthetic dataset
  • Tested CrossValidationReport functionality
  • Confirmed ComparisonReport works correctly
  • Validated feature importance calculations
  • Ensured all metrics and visualizations display properly

Fixes #1861

@mohsinm-dev mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from 0458060 to 78684a1 Compare July 27, 2025 15:13
@thomass-dev
Copy link
Collaborator

Did you use an LLM to generate your PR? Please check the documentation.

The challenge is to understand the issue, and to be able to easily contribute in future.
Thanks.

@mohsinm-dev mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from 78684a1 to f5df837 Compare July 29, 2025 00:19
Copy link
Collaborator

@thomass-dev thomass-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the PR is sufficient to resolve the issue. @sylvaincom can you take a look? Thanks!

This is the current rendering:
image

@github-actions
Copy link
Contributor

github-actions bot commented Jul 29, 2025

Documentation preview @ d973f3e

@sylvaincom
Copy link
Contributor

Hi, I agree with @thomass-dev: this feels LLM generated and you did not take into account the description of the parent issue, especially the shown plots

comparator.metrics.roc().plot() should display a ROC curve that does not appear too easy

@mohsinm-dev
Copy link
Contributor Author

@sylvaincom I’ve updated the example to use 10,000 samples with CrossValidationReport and matched the code structure to your reference. Can you please check this ?

Screenshot 2025-07-29 at 4 13 00 PM

- Use make_classification with n_samples=10_000 and multiclass parameters
- Update estimator to LogisticRegression for better demonstration

Fixes probabl-ai#1861
@mohsinm-dev mohsinm-dev force-pushed the docs/improve-getting-started-dataset branch from f5df837 to b78d496 Compare July 29, 2025 11:20
@sylvaincom
Copy link
Contributor

You do not seem to understand what you're doing: you're calling rf_report (RF : random forest) a model report corresponding to a logistic regression

@glemaitre glemaitre self-requested a review July 30, 2025 20:38
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of remarks.

@auguste-probabl auguste-probabl changed the title docs: Change dataset in getting started guide from breast cancer to s… docs: Change dataset in getting started guide to synthetic dataset Aug 13, 2025
@auguste-probabl auguste-probabl changed the title docs: Change dataset in getting started guide to synthetic dataset docs(getting started): Change dataset to synthetic Aug 13, 2025
@auguste-probabl
Copy link
Contributor

Here's what you get with RandomForestClassifier:

CVReport:
2025-08-28T10_18_22_screenshot

ComparisonReport:
2025-08-28T10_19_20_screenshot

IMO it looks good enough for our purposes.

Copy link
Collaborator

@thomass-dev thomass-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @auguste-probabl for the follow-up, LGTM now.

@thomass-dev thomass-dev added this pull request to the merge queue Sep 1, 2025
Merged via the queue into probabl-ai:main with commit 1e9a7b6 Sep 1, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs: Change the data used in the getting started guide

5 participants