CTGAN is a tabular GAN-based oversampling method that addresses class imbalance but suffers from class overlap. We combined CTGAN with the ENN (Edited Nearest Neighbours) under-sampling technique to overcome the class overlap. CTGAN-ENN reduced the amount of class overlap for each feature in all datasets.
- Best F1-Score (0.994) on the Mobile dataset with the Random Forest algorithm
- Best AUC (1.000) on the Mobile dataset with the XGBoost algorithm
- Best G-Mean (0.984) on the Telco 2 dataset with the Random Forest and Gradient Boosting algorithms
As the picture above shows, CTGAN-ENN clearly separates the two customer churn classes, blue (not churn) and red (churn), which makes them easier for a machine learning algorithm to learn.
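To illustrate how an ENN-style cleaning step can remove overlapping samples, here is a simplified pure-Python sketch. It is not the package's implementation: the function name, the toy points, and the rule of dropping every sample that disagrees with its neighbours are assumptions made for illustration, and library implementations of ENN may apply different removal rules.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Keep samples whose label matches the majority label of their k nearest neighbours."""
    kept_points, kept_labels = [], []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Sort all other samples by Euclidean distance and keep the k closest.
        neighbours = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points) if j != i
        )[:k]
        majority_label = Counter(lbl for _, lbl in neighbours).most_common(1)[0][0]
        if majority_label == label:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels

# Toy data: class 0 near the origin, class 1 near (5, 5),
# plus one class-1 sample sitting inside the class-0 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept_points, kept_labels = edited_nearest_neighbours(points, labels, k=3)
print(len(points), "samples before,", len(kept_points), "after cleaning")
```

In this toy example the overlapping class-1 sample at (0.5, 0.5) is dropped because all of its nearest neighbours belong to class 0, while the two clusters are left intact.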
Install CTGAN-ENN using pip:
pip install ctganenn
- minClass: the minority class rows of the dataset (dataframe).
- majClass: the majority class rows of the dataset (dataframe).
- genData: the number of samples to generate from the minority class (int).
- targetLabel: the name of the target label column in the dataset (string).
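A minimal sketch of how these four arguments might be prepared with pandas. The toy table, its column names, and the choice of genData (the gap between the two class sizes) are illustrative assumptions, as is the premise that both dataframes keep the target column; check the package source before relying on them.

```python
import pandas as pd

# Toy churn table; the column names are hypothetical placeholders.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "monthly_charges": [29.9, 56.9, 53.9, 42.3, 99.7, 89.1, 29.8, 104.8],
    "churn": [1, 0, 1, 0, 0, 0, 0, 0],
})

targetLabel = "churn"

# Find the rarer and the more common label value, then split the rows by class.
counts = df[targetLabel].value_counts()
minority_value = counts.idxmin()
majority_value = counts.idxmax()

minClass = df[df[targetLabel] == minority_value]
majClass = df[df[targetLabel] == majority_value]

# Assumed choice: generate enough rows to match the majority class size.
genData = len(majClass) - len(minClass)

print(len(minClass), len(majClass), genData)
```

The resulting minClass, majClass, genData, and targetLabel can then be passed to CTGANENN.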
from ctganenn import CTGANENN
X, y = CTGANENN(minClass, majClass, genData, targetLabel)
The method returns X and y:
- X: all features of your dataset
- y: the target label of your dataset
You can pass X and y on to the classification stage. For example, using a Decision Tree classifier:
from sklearn import tree
model = tree.DecisionTreeClassifier()
classification = model.fit(X, y)
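Once a model is trained, F1-Score and G-Mean can be computed from its predictions on held-out data. The sketch below is a pure-Python illustration, assuming the usual binary definitions (G-Mean as the square root of sensitivity times specificity); the function name and the toy labels are hypothetical and not part of the package.

```python
import math

def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-Mean) for binary labels, treating `positive` as the churn class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# Toy labels standing in for a test set and a model's predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

f1, gmean = f1_and_gmean(y_true, y_pred)
print(f"F1 = {f1:.3f}, G-Mean = {gmean:.3f}")
```

G-Mean is useful on imbalanced data because it drops toward zero whenever either class is predicted poorly, so a model cannot score well by favouring the majority class.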
In this version, CTGAN-ENN only works for binary classification.
This work was supported by the Khon Kaen University ASEAN GMS grant and is part of the AIDA (Applied Intelligence and Data Analytics) lab in the College of Computing, Khon Kaen University, Thailand.
@misc{ctganenn,
author = {I Nyoman Mahayasa Adiputra and Paweena Wanchai},
title = {CTGAN-ENN: A tabular GAN-based Hybrid Sampling Method for Imbalanced and Overlapped Data in Customer Churn Prediction},
year = {2024},
url = {https://doi.org/10.1186/s40537-024-00982-x}
}