This Jupyter notebook implements a machine learning pipeline to predict loan default risk. The analysis covers data collection, visualization, preprocessing, exploratory analysis, feature engineering, model development, and evaluation, and closes with recommendations for lenders.
- Data Collection
- Data Visualization
- Data Preprocessing
- Data Analysis
- Feature Engineering
- Model Development
- Model Evaluation
- Results & Recommendations
- Usage Instructions
- Contact
- Training Data: Contains borrower features (demographics, financials) and the target label `Risk_Flag` (0 = no default, 1 = default).
- Test Data: Contains the same input features with an identifier column `Id`, but no target label. Reserved for final predictions.
- Bar charts of default rates by categorical variables (Marital Status, House Ownership, Car Ownership, Profession, City, State).
- KDE plots comparing distributions of numeric features (Income, Age, Experience, Job Tenure, House Tenure) for defaulters vs. non-defaulters.
- Duplicate Removal & Column Normalization: Remove repeated records; standardize column names (lowercase, no whitespace).
- ID Column Removal: Drop the `Id` column to avoid leakage.
- Encoding: Convert categorical and boolean fields to numeric.
- Out-of-Fold Target Encoding: Apply to high-cardinality features (Profession, City, State) to generate risk scores without leakage.
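
A minimal sketch of the cleanup and out-of-fold target encoding steps, assuming a K-fold scheme with five folds. The helper `oof_target_encode`, the fold count, the underscore replacement for inner whitespace, and the sample data are all hypothetical; the notebook's actual implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(df, col, target, n_splits=5, seed=42):
    """Encode `col` as the mean of `target`, computed out of fold."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        # Category means come only from rows outside the fold being encoded.
        means = fit_part.groupby(col)[target].mean()
        fold_values = df.iloc[enc_idx][col].map(means)
        # Categories unseen in the fitting rows fall back to those rows' overall rate.
        encoded.iloc[enc_idx] = fold_values.fillna(fit_part[target].mean()).to_numpy()
    return encoded


# Hypothetical raw frame; column names and values are illustrative only.
raw = pd.DataFrame({
    "Id": range(1, 11),
    " Profession ": ["engineer", "doctor", "engineer", "artist", "doctor",
                     "engineer", "artist", "doctor", "engineer", "artist"],
    "Risk_Flag": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Standardize names (trim, lowercase, underscores), then drop duplicates and the ID.
raw.columns = raw.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
df = raw.drop_duplicates().drop(columns="id").reset_index(drop=True)

df["profession_risk"] = oof_target_encode(df, "profession", "risk_flag")
```

Because each row's score is built without that row's own label, the encoded column does not leak the target into training.
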
- Compute Pearson correlation between features and `Risk_Flag`.
- Assess non-linear relationships with mutual information.
- Bin Income and Job Tenure into quintiles and visualize default rates to justify interaction features.
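
The three analysis steps above could look roughly like this. The data is synthetic (the target is generated so that lower income raises default probability), and the lowercase column names are assumptions that follow the normalization step.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: default probability falls from 0.30 to 0.05 as income rises.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "income": rng.uniform(10_000, 100_000, n),
    "current_job_yrs": rng.integers(0, 15, n),
})
p_default = 0.30 - 0.25 * (df["income"] - 10_000) / 90_000
df["risk_flag"] = (rng.random(n) < p_default).astype(int)

features = ["income", "current_job_yrs"]

# Linear association with the target.
pearson = df[features].corrwith(df["risk_flag"])

# Mutual information also picks up non-linear dependence.
mi = pd.Series(
    mutual_info_classif(df[features], df["risk_flag"], random_state=0),
    index=features,
)

# Quintile bins (0 = lowest income) and the default rate inside each bin.
df["income_q"] = pd.qcut(df["income"], q=5, labels=False)
rate_by_quintile = df.groupby("income_q")["risk_flag"].mean()
```

Plotting `rate_by_quintile` as a bar chart gives the kind of view used to justify interaction features.
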
- Job Stability: `current_job_yrs / (experience + 1)`
- Residence Stability: `current_house_yrs * (1 + house_owned)`
- Age Buckets: Discretize age into cohorts to capture non-linear effects.
- Interaction Terms: `income * job_stability` and `income * profession_risk`.
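
As a sketch, the formulas above translate to pandas as shown below. The sample rows, the output column names, and the age bucket edges (30, 45, 60) are assumptions for illustration; the notebook may use different cut points.

```python
import pandas as pd

# Made-up borrowers; profession_risk stands in for the target-encoded score.
df = pd.DataFrame({
    "income": [50_000, 80_000, 30_000],
    "age": [23, 41, 67],
    "experience": [3, 15, 19],
    "current_job_yrs": [2, 8, 5],
    "current_house_yrs": [10, 12, 14],
    "house_owned": [0, 1, 0],
    "profession_risk": [0.10, 0.20, 0.15],
})

# Share of the career spent in the current job; +1 avoids division by zero.
df["job_stability"] = df["current_job_yrs"] / (df["experience"] + 1)

# Years at the current residence, doubled for owners.
df["residence_stability"] = df["current_house_yrs"] * (1 + df["house_owned"])

# Age cohorts (bucket edges are an assumption, right-inclusive).
df["age_bucket"] = pd.cut(
    df["age"], bins=[0, 30, 45, 60, 120], labels=["<=30", "31-45", "46-60", "60+"]
)

# Interaction terms.
df["income_x_job_stability"] = df["income"] * df["job_stability"]
df["income_x_profession_risk"] = df["income"] * df["profession_risk"]
```
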
- Data Split: 60% training (with SMOTE), 20% validation, 20% test (stratified).
- Hyperparameter Tuning: Manual validation curves for:
  - Logistic Regression (`C`)
  - Decision Tree (`max_depth`, `min_samples_leaf`, `max_features`)
  - Random Forest (`n_estimators`, `max_depth`, `min_samples_leaf`, `max_features`)
- Final Model: Random Forest with 300 trees, `max_depth=15`, `min_samples_leaf=5`, `max_features='sqrt'`.
- Metrics: ROC AUC, precision, recall, F1-score on training, validation, and test sets.
- Lift Curve: Shows concentration of defaulters across risk deciles.
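
One way to compute these metrics and a decile lift table is sketched here. The labels and scores are simulated, the 0.5 decision threshold is an assumption, and the numbers it produces are unrelated to the notebook's reported results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Simulated labels and scores: defaulters tend to receive higher scores.
rng = np.random.default_rng(0)
n = 1000
y_true = (rng.random(n) < 0.12).astype(int)
scores = 1 / (1 + np.exp(-(-2 + 3 * y_true + rng.normal(0, 1, n))))

auc = roc_auc_score(y_true, scores)
y_pred = (scores >= 0.5).astype(int)  # assumed threshold
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)

# Rank by score and cut into ten equal-sized groups (decile 1 = highest risk).
ranked = (
    pd.DataFrame({"y": y_true, "score": scores})
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
ranked["decile"] = ranked.index // (n // 10) + 1

by_decile = ranked.groupby("decile")["y"].agg(["mean", "sum"])
# Lift = default rate in the decile relative to the overall default rate.
by_decile["lift"] = by_decile["mean"] / y_true.mean()
# Cumulative share of all defaulters captured down to each decile.
by_decile["cum_capture"] = by_decile["sum"].cumsum() / y_true.sum()
```

A lift well above 1 in the first deciles means the model concentrates defaulters among its highest-scored borrowers.
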
- Key Risk Factors: Stability ratios, age cohorts, geographic/occupational risk scores, interaction terms.
- Model Performance: Training AUC ~0.98, Validation/Test AUC ~0.93/~0.92, recall of defaults ~0.74.
- Recommended Actions:
  - Integrate stability ratios into underwriting.
  - Tailor policies for youngest/oldest age buckets.
  - Adjust pricing/documentation by profession and region.
  - Monitor top risk deciles with targeted outreach.
  - Retrain model quarterly to adapt to new conditions.
- Install dependencies: `pip install pandas numpy scikit-learn imbalanced-learn matplotlib seaborn`