- Introduction
- Data Wrangling
- Exploratory Data Analysis
- Data Preprocessing and Training
- Modeling
- Conclusions and Recommendations
- References
- Installation
- Technologies Used
- Contact
This project seeks to address the significant challenge of credit default faced by American Express by leveraging data science. By developing a predictive model, we aim to forecast the likelihood of a customer defaulting on their credit card payments, thereby aiding American Express in managing credit default risks more effectively.
Credit defaults pose substantial financial risks to American Express. The goal is to harness data science to create a classification model using 2021 customer data to predict credit card payment defaults, enabling proactive mitigation strategies and fostering a financially secure customer-issuer relationship.
We cleaned and prepared the data by loading it from a CSV file, standardizing column and category names, handling missing values, and profiling the dataset so that it was ready for the analysis steps that follow.
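As an illustration of these wrangling steps, here is a minimal sketch with pandas. The inline CSV, the column names, and the median fill are assumptions made for the example; they are not taken from the project's dataset or notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file; column names are illustrative.
raw_csv = io.StringIO(
    "Customer ID,Credit Limit,Net Yearly Income,Credit Card Default\n"
    "1,5000,40000,0\n"
    "2,,52000,1\n"
    "3,12000,,0\n"
)

df = pd.read_csv(raw_csv)

# Standardize column names: trimmed, lower-case, underscores instead of spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Profile missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()

# One simple option (an assumption here): fill numeric gaps with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(missing_counts)
print(df)
```

Median imputation is only one possible choice; dropping rows or using a model-based imputer may suit the real data better.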
We visualized the dataset to understand how the features are distributed and how they relate to one another, to identify multicollinearity, and to observe class imbalance. The last two findings drive the preprocessing and modeling steps that follow.
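The two checks mentioned above can also be done numerically. The sketch below uses synthetic data with hypothetical column names and an assumed correlation threshold of 0.9; the resulting correlation matrix could be passed to a Seaborn heatmap for the visual version.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50_000, 10_000, n)
df = pd.DataFrame({
    "net_yearly_income": income,
    # Nearly a linear function of income, so the two are highly correlated.
    "credit_limit": income * 0.5 + rng.normal(0, 500, n),
    "age": rng.integers(21, 70, n),
    "default": (rng.random(n) < 0.08).astype(int),
})

# Absolute pairwise correlations between the features (target excluded).
corr = df.drop(columns="default").corr().abs()

# List each feature pair whose correlation exceeds the assumed threshold.
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.9
]

# Share of each class in the target; a small minority share signals imbalance.
class_share = df["default"].value_counts(normalize=True)

print(high_pairs)
print(class_share)
```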
In this stage, multicollinearity was addressed by dropping redundant features, the dataset was scaled using MinMaxScaler, and the class imbalance was handled using SMOTE to ensure a balanced dataset for effective modeling.
We addressed multicollinearity by dropping one feature from each highly correlated group, which reduces redundancy in the inputs to the model.
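One common way to do this is to scan the upper triangle of the correlation matrix and drop the later feature of each highly correlated pair. The helper name `drop_correlated` and the 0.9 threshold below are assumptions for the sketch, not the project's actual settings.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Synthetic example: "a_copy" is almost a rescaled duplicate of "a".
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced = drop_correlated(X)
print(list(reduced.columns))
```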
We used MinMaxScaler to rescale every feature to a common range (by default [0, 1]), so that features with large numeric ranges do not dominate those with small ones.
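A minimal sketch of the scaling step with scikit-learn follows. The synthetic data and the 80/20 split are assumptions; the point it illustrates is that the scaler is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(50_000, 10_000, 500),  # e.g. an income-like feature
    rng.integers(21, 70, 500),        # e.g. an age-like feature
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit on the training data only, then apply the same transform to the test data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Test-set values can fall slightly outside [0, 1] because the minimum and maximum come from the training split.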
We used SMOTE to handle class imbalance. It generates synthetic instances of the minority class, giving a balanced training set.
The modeling phase involved evaluating different metrics, hyperparameter tuning, and selecting the XGBoost Classifier due to its high performance, efficiency, and suitability for handling imbalanced datasets.
We used AUC-ROC and AUC-PRC as the primary metrics for evaluating model performance. AUC-ROC served as the main basis for model selection because it summarizes performance across all classification thresholds rather than at a single cut-off.
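Both metrics can be computed from predicted probabilities with scikit-learn. This sketch makes two assumptions: a logistic regression on synthetic data stands in for the project's models, and `average_precision_score` is used as a common summary of the area under the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in model; any classifier exposing predict_proba works the same way.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both metrics need scores for the positive class, not hard 0/1 predictions.
scores = model.predict_proba(X_test)[:, 1]
auc_roc = roc_auc_score(y_test, scores)
auc_prc = average_precision_score(y_test, scores)

print(f"AUC-ROC: {auc_roc:.3f}  AUC-PRC: {auc_prc:.3f}")
```

A random classifier scores about 0.5 on AUC-ROC, while its AUC-PRC is close to the share of positive cases, so the two numbers are not directly comparable.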
We used RandomizedSearchCV for hyperparameter tuning. It samples a fixed number of parameter combinations instead of exhaustively searching a full grid, which keeps the computational cost manageable.
In this section, we evaluated the models on an unseen test set and selected the best-performing one.
The analysis underscores the critical challenge of credit defaults and highlights the efficacy of the selected model in addressing this issue. Recommendations include continuous model evaluation, further feature engineering, devising risk mitigation strategies, and ensuring model interpretability, fairness, and regulatory compliance.
The references section lists all the external resources and data sources referred to in the project.
The project was implemented in Python 3.8. To install the required packages, use the following command:
pip install -r requirements.txt
- Python
- NumPy
- Pandas
- scikit-learn
- Matplotlib
- Seaborn
- XGBoost
If you have any questions, comments, or would like to contribute, please feel free to contact me at [email protected].