
varianceRatio SOP may lead to data leakage for ML applications #113

@misch91

Dear all

Of all the filters for MSDataset objects, the varianceRatio filter should be applied with caution, especially if the data will later be analysed with machine learning methods that involve a train/test split.

Reasoning: In ML, all feature selection steps should be performed after the train/test split, on the training data only, so that no information from the test data leaks into model building. Popular feature selection methods include applying a variance threshold (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) or selecting by F-score (ANOVA). Applying feature selection before the split is one of the most frequent errors among novices; it leads to overestimated model performance and incorrect subsequent interpretation (see DOI: 10.1021/acs.jproteome.2c00117 for an insightful explanation).
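To illustrate the order of operations, here is a minimal, dependency-free sketch. It assumes the filter compares the relative standard deviation (RSD) of study samples with that of QC samples; `variance_ratio_mask` and `apply_mask` are hypothetical helpers written for this example, not toolbox functions, and the data are synthetic. The point is only that the feature mask is derived from the training samples and then reused unchanged on the test samples.

```python
import random
import statistics


def rsd(values):
    """Relative standard deviation: sample SD divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def column(rows, j):
    """Extract feature j from a list of samples (rows)."""
    return [row[j] for row in rows]


def variance_ratio_mask(study_rows, qc_rows, threshold=1.1):
    """Hypothetical helper: keep a feature when RSD(study) / RSD(QC) >= threshold."""
    n_feat = len(study_rows[0])
    return [
        rsd(column(study_rows, j)) / rsd(column(qc_rows, j)) >= threshold
        for j in range(n_feat)
    ]


def apply_mask(rows, mask):
    """Drop the features whose mask entry is False."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


# Synthetic data: 40 study samples and 10 QC samples, 5 features each.
random.seed(0)
n_features = 5
study = [[random.gauss(100, 2 + 4 * j) for j in range(n_features)] for _ in range(40)]
qc = [[random.gauss(100, 8) for _ in range(n_features)] for _ in range(10)]

# 1. Split first.
random.shuffle(study)
train, test = study[:30], study[30:]

# 2. Derive the feature mask from the training samples only.
mask = variance_ratio_mask(train, qc)

# 3. Apply the same mask to both sets; the test samples never influence it.
train_f = apply_mask(train, mask)
test_f = apply_mask(test, mask)
```

Computing the mask on all 40 study samples before splitting would let the held-out samples influence which features the model gets to see, which is the leakage described above.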

Suggestion: Either add a warning to the documentation that applying the varianceRatio filter may compromise the validity of later ML analysis, or change the default value of varianceRatio to 1.0 (now: 1.1), which means turning off this filter by default.

By the way, all the other standard feature filters (corrThreshold, rsdThreshold, etc.) are not problematic, as they filter by robustness criteria, not by biological variance!
