This project is a machine learning pipeline for predicting loan approval status based on applicant data. It uses Python, pandas, scikit-learn, imbalanced-learn, and visualization libraries to preprocess data, train multiple classifiers, tune hyperparameters, and make predictions on new user input.
- Overview
- Features
- Dataset
- Requirements
- Setup & Usage
- Project Structure
- Modeling Approach
- Results
- How to Predict for New Applicants
- References
This project aims to automate the process of loan approval by building a predictive model using historical loan application data. The workflow includes:
- Data cleaning and preprocessing
- Exploratory data analysis (EDA)
- Handling class imbalance
- Feature encoding and scaling
- Training and evaluating multiple classifiers (Logistic Regression, Random Forest, SVM)
- Hyperparameter tuning with GridSearchCV
- Feature importance analysis
- Interactive prediction for new applicants
- Data Cleaning: Handles missing values and encodes categorical variables.
- EDA: Visualizes distributions and relationships in the data.
- Imbalance Handling: Uses oversampling to balance approved/denied classes.
- Model Comparison: Compares Logistic Regression, Random Forest, and SVM.
- Hyperparameter Tuning: Optimizes Random Forest with GridSearchCV.
- Feature Importance: Visualizes which features matter most.
- User Prediction: Interactive CLI for predicting loan eligibility for new applicants.
- The dataset should be named `data.csv` and placed in the project root.
- It must include columns such as: `Loan_ID`, `Gender`, `Married`, `Dependents`, `Education`, `Self_Employed`, `ApplicantIncome`, `CoapplicantIncome`, `LoanAmount`, `Loan_Amount_Term`, `Credit_History`, `Property_Area`, `Loan_Status`.
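A quick way to confirm that a file has the expected columns before running the notebook might look like the sketch below. The helper name `missing_columns` is hypothetical, and the single CSV row is made up so the example runs without the real dataset.

```python
import io

import pandas as pd

# Column names copied from the dataset description above.
REQUIRED_COLUMNS = [
    "Loan_ID", "Gender", "Married", "Dependents", "Education",
    "Self_Employed", "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area", "Loan_Status",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Tiny in-memory stand-in for data.csv; the values are invented.
sample_csv = io.StringIO(
    "Loan_ID,Gender,Married,Dependents,Education,Self_Employed,"
    "ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,"
    "Credit_History,Property_Area,Loan_Status\n"
    "X001,Male,No,0,Graduate,No,5000,0,120,360,1,Urban,Y\n"
)
df = pd.read_csv(sample_csv)  # with the real file: pd.read_csv("data.csv")

print(missing_columns(df))
print(missing_columns(df.drop(columns=["Loan_Status"])))
```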
- Python 3.7+
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- imbalanced-learn
Install dependencies with:
    pip install -r requirements.txt

Example requirements.txt:

    pandas
    numpy
    matplotlib
    seaborn
    scikit-learn
    imbalanced-learn
- Clone the repository and navigate to the project directory.
- Place your `data.csv` file in the root directory.
- Run the notebook:
  - Open `main.ipynb` in Jupyter Notebook or VS Code.
  - Run all cells sequentially.
- For new predictions:
  - At the end of the notebook, follow the CLI prompts to enter applicant details and get a loan eligibility prediction.
.
├── main.ipynb # Main Jupyter notebook with all code
├── data.csv # Input dataset (not included)
└── README.md # Project documentation
- Data Loading & Inspection: Loads the CSV, checks for missing values, and inspects data types.
- EDA:
  - Visualizes class balance and feature distributions.
  - Uses count plots and pie charts for categorical variables.
- Data Cleaning:
  - Drops `Loan_ID`.
  - Fills missing values (mean/mode as appropriate).
  - Removes rows with missing `Credit_History`.
- Encoding:
  - Label encodes categorical variables.
- Feature Correlation:
  - Visualizes correlations with a heatmap.
- Balancing:
  - Uses `RandomOverSampler` to balance the classes.
- Splitting & Scaling:
  - Splits into train/test sets.
  - Scales features with `StandardScaler`.
- Model Training & Evaluation:
  - Trains Logistic Regression, Random Forest, and SVM.
  - Evaluates with accuracy, precision, recall, F1, ROC AUC, and cross-validation.
  - Plots confusion matrices.
- Model Comparison:
  - Plots a bar chart comparing model accuracies.
- Hyperparameter Tuning:
  - Uses `GridSearchCV` to optimize Random Forest.
- Feature Importance:
  - Plots feature importances for the best Random Forest model.
- User Prediction:
  - Collects user input via CLI.
  - Encodes and scales input.
  - Predicts eligibility and probability.
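The cleaning, encoding, splitting, scaling, and training steps above can be sketched roughly as follows. This is not the notebook's code: the DataFrame is synthetic, only a subset of the columns is used, the 80/20 split and all model settings are assumptions, and the balancing, plotting, and cross-validation steps are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for data.csv: values are random and carry no real meaning.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Loan_ID": [f"ID{i}" for i in range(n)],
    "Gender": rng.choice(["Male", "Female"], n),
    "Education": rng.choice(["Graduate", "Not Graduate"], n),
    "ApplicantIncome": rng.integers(1000, 10000, n).astype(float),
    "LoanAmount": rng.integers(50, 500, n).astype(float),
    "Credit_History": rng.choice([0.0, 1.0], n),
})
# Tie the label to credit history so the models have something to learn.
df["Loan_Status"] = np.where(df["Credit_History"] == 1.0, "Y", "N")
# Punch a few holes to mimic missing values.
df.loc[0:4, "LoanAmount"] = np.nan
df.loc[5:9, "Gender"] = np.nan
df.loc[10:14, "Credit_History"] = np.nan

# Data cleaning: drop the ID, drop rows without credit history, fill the rest.
df = df.drop(columns=["Loan_ID"])
df = df.dropna(subset=["Credit_History"])
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())  # mean for numeric
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])           # mode for categorical

# Encoding: label encode each categorical column (including the target).
for col in ["Gender", "Education", "Loan_Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Splitting and scaling: the scaler is fitted on the training set only.
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Training and evaluation of the three classifiers.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    # Use probabilities where available, otherwise the decision function.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test_s)[:, 1]
    else:
        scores = model.decision_function(X_test_s)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, scores),
    }
    print(name, results[name])
```

The `results` dictionary holds one row of metrics per model, which is the kind of structure a comparison bar chart could be drawn from.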
- Best Model: Random Forest (with tuned hyperparameters)
- Metrics:
  - Accuracy, Precision, Recall, F1, ROC AUC (see notebook output for details)
- Feature Importance:
  - Visualized in the notebook; shows which applicant features most influence approval.
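A tuned Random Forest and its feature importances could be obtained along the lines of the sketch below. The parameter grid, the 3-fold setting, and the synthetic data from `make_classification` are all assumptions made for illustration; the grid searched in the notebook may differ, and the feature names are simply borrowed from the dataset's numeric columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for the preprocessed loan data.
feature_names = [
    "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History",
]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

# The best estimator is refitted on all the data and exposes its importances.
best_rf = search.best_estimator_
importances = pd.Series(
    best_rf.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(search.best_params_)
print(importances)
```

Sorting the importances in descending order gives the ranking that a bar plot of feature importance would display.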
At the end of the notebook, you will be prompted to enter applicant details such as gender, marital status, dependents, education, employment, income, loan amount, term, credit history, and property area. The model will output:
- Eligibility: Whether the applicant is eligible for a loan.
- Probability: The model's confidence in the prediction.
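The prediction step can be pictured with the minimal sketch below, which takes a dictionary in place of the interactive prompts. The training rows, the `ENCODINGS` mapping, the helper `predict_applicant`, and the 0.5 threshold are all hypothetical and only show the encode, scale, and predict sequence on a reduced set of columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical encoding for one categorical field (alphabetical, as LabelEncoder would assign).
ENCODINGS = {"Education": {"Graduate": 0, "Not Graduate": 1}}

# Toy, already-encoded training data, invented for illustration.
train = pd.DataFrame({
    "Education": [1, 0, 1, 0, 1, 0],
    "ApplicantIncome": [2000, 8000, 3000, 9000, 2500, 7000],
    "LoanAmount": [200, 100, 250, 120, 220, 90],
    "Credit_History": [0, 1, 0, 1, 0, 1],
})
labels = [0, 1, 0, 1, 0, 1]  # 1 = approved, 0 = denied

scaler = StandardScaler().fit(train)
model = RandomForestClassifier(random_state=42).fit(scaler.transform(train), labels)


def predict_applicant(details: dict):
    """Return (eligible, probability of approval) for one applicant."""
    # Encode categorical answers; numeric answers pass through unchanged.
    encoded = {k: ENCODINGS.get(k, {}).get(v, v) for k, v in details.items()}
    # Build a one-row frame with the same column order used in training.
    row = pd.DataFrame([encoded])[train.columns]
    proba = model.predict_proba(scaler.transform(row))[0, 1]
    return bool(proba >= 0.5), float(proba)


eligible, probability = predict_applicant({
    "Education": "Graduate",
    "ApplicantIncome": 8500,
    "LoanAmount": 110,
    "Credit_History": 1,
})
print("Eligible" if eligible else "Not eligible", f"(p={probability:.2f})")
```

Reusing the scaler fitted on the training data, rather than fitting a new one on the single applicant, keeps the input on the same scale the model was trained on.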
- scikit-learn documentation
- imbalanced-learn documentation
- pandas documentation
- seaborn documentation