Data Preprocessing, Exploratory Data Analysis (EDA), Feature Engineering, and ML Model Training Notebooks
This project focuses on developing a Credit Scoring Model for a buy-now-pay-later service in partnership with an eCommerce company. The dataset contains transaction records, and the aim is to analyze the data and engineer relevant features that will help in predicting credit risk.
Prerequisites:

- Python 3.x
- Jupyter Notebook or JupyterLab
- Required Python libraries (see below)
To set up the environment for these notebooks, follow these steps:

- Clone the repository: `git clone https://github.com/Atnabon/credit-scoring.git`
- Install the required libraries by running: `pip install -r requirements.txt`
- Start Jupyter Notebook with `jupyter notebook`, then open `scripts/data_preprocessing.ipynb` or `notebooks/EDA.ipynb` from the Jupyter interface.
- Notebook: `data_preprocessing.ipynb`
- Purpose: Clean and prepare the data for analysis and modeling.
- Load Data: Import data from CSV or other file formats.
- Display Basic Information: Use `info()` to display data types, non-null counts, and memory usage.
- Display the First Few Rows: Use `head()` to inspect the first few records.
- Display Column Names: Confirm column names with `columns` to ensure the data is loaded correctly.
- Check for Missing Values: Identify missing values using `isnull().sum()`.
- Drop Duplicate Rows: Remove duplicate rows, if any, using `drop_duplicates()`.
- Convert `transactionstarttime` to Datetime: Parse `transactionstarttime` into a datetime format for time-based analysis.
- Handle Missing Values from Invalid Datetime Conversion: Address any rows whose timestamps fail to parse.
- Visualize the Cleaned Data: Display the cleaned records to verify the result (a minimal pandas sketch of these steps follows this list).
- Save the Cleaned Data: Export the cleaned dataset for further analysis or model training.
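For reference, here is a minimal pandas sketch of the cleaning workflow described above. The file paths and the exact `transactionstarttime` column name are assumptions for illustration, not the notebook's exact code:

```python
import pandas as pd

# Load the raw transaction data (path is an assumed placeholder)
df = pd.read_csv("data/transactions.csv")

# Inspect structure, dtypes, the first rows, column names, and missing values
df.info()
print(df.head())
print(df.columns)
print(df.isnull().sum())

# Remove duplicate rows
df = df.drop_duplicates()

# Parse the transaction timestamp; values that cannot be parsed become NaT
df["transactionstarttime"] = pd.to_datetime(df["transactionstarttime"], errors="coerce")

# Drop rows whose timestamp could not be parsed
df = df.dropna(subset=["transactionstarttime"])

# Save the cleaned dataset for the later notebooks
df.to_csv("data/cleaned_transactions.csv", index=False)
```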
- Notebook: `EDA.ipynb`
- Purpose: Gain insights into the data's structure, distributions, and relationships between variables to guide modeling decisions.
- Processes Included:
- Summary statistics of numerical columns.
- Distribution analysis using histograms and box plots.
- Correlation analysis using heatmaps.
- Analysis of categorical variables with count plots and bar charts.
- Insights into time-series data trends if applicable.
- Identification of outliers and their impact on the dataset.
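A compact sketch of these EDA checks, assuming the cleaned dataset from the preprocessing step; column names such as `amount` and `productcategory` are illustrative placeholders:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/cleaned_transactions.csv")

# Summary statistics for the numerical columns
print(df.describe())

# Distribution analysis: histogram and box plot of an assumed numeric column
sns.histplot(df["amount"], bins=50)
plt.show()
sns.boxplot(x=df["amount"])
plt.show()

# Correlation heatmap over the numeric columns
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Count plot for an assumed categorical column
sns.countplot(data=df, x="productcategory")
plt.show()
```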
This notebook (`Feature_Engineering.ipynb`) focuses on feature engineering to enhance the dataset for modeling. Key tasks include (see the sketch after this list):
- Aggregate Features: Creating new features such as total transaction amount, average transaction amount, transaction count, and standard deviation of transaction amounts for each customer.
- Time-Based Features: Extracting features from the transaction timestamp (hour, day, month, year).
- Encoding Categorical Variables: Applying Weight of Evidence (WOE) transformation to categorical features for better model interpretability.
- Handling Missing Values: Implementing strategies for filling or removing missing values in the dataset.
- Normalization/Standardization: Scaling numerical features to ensure they are on a similar scale, improving model performance.
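A rough sketch of the aggregate, time-based, and scaling steps; `customerid`, `amount`, and the file path are assumed names, and the WoE encoding is sketched in the next section:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/cleaned_transactions.csv", parse_dates=["transactionstarttime"])

# Aggregate features per customer (column names are assumptions)
agg = df.groupby("customerid")["amount"].agg(
    total_amount="sum",
    avg_amount="mean",
    transaction_count="count",
    std_amount="std",
).reset_index()

# Time-based features extracted from the transaction timestamp
df["hour"] = df["transactionstarttime"].dt.hour
df["day"] = df["transactionstarttime"].dt.day
df["month"] = df["transactionstarttime"].dt.month
df["year"] = df["transactionstarttime"].dt.year

# Fill remaining missing numeric values, then standardize the aggregate features
agg = agg.fillna(agg.median(numeric_only=True))
num_cols = ["total_amount", "avg_amount", "transaction_count", "std_amount"]
agg[num_cols] = StandardScaler().fit_transform(agg[num_cols])
```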
This notebook (`Default_Estimator_and_WOE_Binning.ipynb`) focuses on feature engineering using the RFMS (Recency, Frequency, Monetary, Seniority) formalism and applies Weight of Evidence (WoE) binning for customer risk classification. The main steps include (see the sketch after this list):
- RFMS Feature Engineering: Calculating Recency, Frequency, Monetary, and Seniority features from the transaction data.
- Risk Label Assignment: Classifying customers as 'good' or 'bad' based on their RFMS score.
- WoE Binning: Transforming RFMS features using WoE based on the RiskLabel.
- Information Value (IV) Calculation: Evaluating the importance of each RFMS feature using IV to assess predictive power.
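A simplified sketch of WoE binning and IV computation for a single RFMS feature against a binary risk label. The quantile binning, the smoothing constant, and the assumption that the label encodes 'bad' as 1 are illustrative choices, not necessarily the notebook's exact implementation:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target, bins=5):
    """Compute per-bin WoE and the overall IV of `feature` against a 0/1 `target` (1 = 'bad')."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    grouped = df.groupby(binned, observed=True)[target].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]

    # Distributions of goods and bads per bin; the 0.5 constant avoids division by zero
    dist_good = (grouped["good"] + 0.5) / grouped["good"].sum()
    dist_bad = (grouped["bad"] + 0.5) / grouped["bad"].sum()

    grouped["woe"] = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * grouped["woe"]).sum()
    return grouped[["woe"]], iv

# Example usage with assumed column names:
# woe_table, iv = woe_iv(rfms_df, "Recency", "RiskLabel")
```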
This notebook (`Modelling.ipynb`) trains the machine learning models (see the sketch after this list):
- Model Selection: Random Forest and Gradient Boosting Machines.
- Data Splitting: Dividing data into training and testing sets.
- Model Training: Fitting the models to training data.
- Hyperparameter Tuning: Optimizing models using Grid and Random Search.
- Overfitting Prevention: Using cross-validation and regularization techniques.
- Model Evaluation: Assessing metrics like Accuracy, Precision, Recall, F1 Score, and ROC-AUC.
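A condensed sketch of the training and evaluation flow with scikit-learn. The synthetic data, parameter grid, and metric choices are illustrative; the notebook works with the engineered credit-risk features instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score

# Stand-in for the engineered feature matrix and binary risk label
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameter tuning for Random Forest via grid search with 5-fold cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="roc_auc"
)
rf_search.fit(X_train, y_train)

# Gradient Boosting as a second candidate model
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Evaluate both models on the held-out test set
for name, model in [("RandomForest", rf_search.best_estimator_), ("GradientBoosting", gb)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(name, "ROC-AUC:", roc_auc_score(y_test, proba))
    print(classification_report(y_test, model.predict(X_test)))
```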
The repository is structured as follows:
```
├── .vscode/
│   └── settings.json
├── .github/
│   └── workflows/
│       └── unittests.yml
├── app/
│   ├── main.py
│   └── requirements.txt
├── notebooks/
│   ├── __init__.py
│   ├── EDA.ipynb
│   ├── Feature_Engineering.ipynb
│   ├── Modelling.ipynb
│   └── Default_Estimator_and_WOE_Binning.ipynb
├── scripts/
│   ├── __init__.py
│   └── data_cleaning.ipynb
├── src/
│   └── __init__.py
├── tests/
│   └── __init__.py
├── Dockerfile
├── .dockerignore
├── .gitignore
├── README.md
└── requirements.txt
```
If you would like to contribute to this project:
- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Create a new Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.