Description
Describe the issue linked to the documentation
There is ongoing discussion about the usefulness of some (if not all) of the over- and under-sampling methods implemented in the imbalanced-learn package.
In particular, there is doubt about the usefulness of SMOTE:
- from researchers (To SMOTE or not to SMOTE?)
- from practitioners (see the weekly discussions on Kaggle, Data Science Stack Exchange, etc.)
- and even from one of the authors of the package (Learning from Imbalanced data: I was wrong but I was not the only one)
Basically, it seems that:
- these methods do not improve ranking (think AUC)
- these methods do break probability calibration (ECE / calibration curve)
I think it is a problem that these discussions are not more visible to newcomers (and that more experienced people have to deal with the consequences on a weekly basis).
Suggest a potential alternative/fix
It would be nice to have
- a clearer demonstration in the doc, because for the moment only the usage is described:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
This shows that the data was oversampled, but not that oversampling helps, either in terms of ranking (AUC) or probability calibration (ECE / calibration curve).
Could the doc be upgraded with a better example?
- a visible user warning regarding the discussions on usefulness of these methods.
While one of the authors has changed their mind about the usefulness of these methods, it seems that a younger crowd is still very eager to jump on these shiny methods. I think it would be helpful for the DS community if the package took a clearer stance.
I would suggest at least a very visible warning in the doc, like a red banner ('There is some discussion about the usefulness of these methods. See: XXX. Use with caution').
This could be extended with a UserWarning, which may be a bit brutal but could prevent a lot of trouble.
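For illustration, the UserWarning idea could look something like the sketch below. Everything named here is hypothetical and does not exist in imbalanced-learn: the `ResamplingCaveatWarning` category, the `warn_about_resampling` helper, and the message text. A dedicated subclass of `UserWarning` would let users who know what they are doing silence it with the standard `warnings` filters.

```python
import warnings


class ResamplingCaveatWarning(UserWarning):
    """Hypothetical warning category, so it can be filtered on its own."""


def warn_about_resampling(sampler_name):
    """Hypothetical helper a sampler could call, e.g. from fit_resample."""
    warnings.warn(
        f"{sampler_name}: there is some discussion about the usefulness of "
        "resampling methods for ranking and probability calibration. "
        "Use with caution.",
        category=ResamplingCaveatWarning,
        stacklevel=2,  # point at the caller, not at this helper
    )


# By default the warning is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_about_resampling("SMOTE")

print(caught[0].category.__name__, "-", caught[0].message)

# Users can opt out by ignoring this specific category.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.simplefilter("ignore", category=ResamplingCaveatWarning)
    warn_about_resampling("SMOTE")

print("Warnings after opting out:", len(silenced))
```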
Edit: not sure why the "good first issue" label was added automatically, but I'll take it.