| title | emoji | colorFrom | colorTo | sdk | pinned | license | short_description |
|---|---|---|---|---|---|---|---|
| Home Credit Default Risk Prediction | 🍃 | indigo | purple | docker | true | mit | ML Classification models applied to Home Credit Risk dataset |

This project focuses on building a machine learning pipeline to predict a client's ability to repay a loan. It is a binary classification task that uses a real-world financial dataset to identify clients who may face payment difficulties.
The project goes beyond a standard model by including a practical application that:
- Preprocesses and cleans the dataset for model training.
- Trains a machine learning model to predict loan repayment risk.
- Deploys an interactive predictor app using Marimo, hosted on Hugging Face Spaces.
- Allows users to make predictions by providing the top 10 most influential features.
This work showcases a complete end-to-end workflow, transforming raw data into a functional, user-friendly tool for risk assessment.
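The preprocessing and training steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the data is synthetic, the column names (`income`, `credit_amount`, `contract_type`) are hypothetical, and a scikit-learn `LogisticRegression` stands in for the models the project trained, so the score it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical, not the project's.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "income": rng.lognormal(mean=11, sigma=0.5, size=n),
    "credit_amount": rng.lognormal(mean=12, sigma=0.6, size=n),
    "contract_type": rng.choice(["cash", "revolving"], size=n),
})

# Binary target: 1 = payment difficulties, made more likely by a high
# credit-to-income ratio plus noise (about 20% positives).
score = np.log(X["credit_amount"]) - np.log(X["income"]) + rng.normal(0, 1, size=n)
y = (score > np.quantile(score, 0.8)).astype(int)

numeric = ["income", "credit_amount"]
categorical = ["contract_type"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Chain preprocessing and the classifier so both are fit on training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities of the positive class.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC on the held-out split: {auc:.3f}")
```

Wrapping the preprocessing and the classifier in one `Pipeline` keeps the scaler and encoder from seeing the test split, which is one common way to avoid leakage when comparing several models on the same data.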
Important
- Check out the deployed app here: 👉️ Home Credit Default Risk Prediction App 👈️
- Check out the Jupyter Notebook for a detailed walkthrough of the project here: 👉️ Jupyter Notebook 👈️
- Model Selection: Four different models were trained and evaluated, with LightGBM selected as the final model due to its superior performance, achieving a ROC AUC score of 0.751 on the test set.
- Automated Preprocessing: The data preprocessing pipeline handles common tasks such as feature scaling and categorical encoding, ensuring the model receives clean and formatted data.
- Interactive Predictor: An application built with Marimo allows users to interact with the trained model directly. It uses the top 10 most important features—identified from the final LightGBM model—to generate real-time predictions.
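One way to pick the ten most important features from a tree-based model and predict from them is sketched below. It rests on several assumptions: the data comes from `make_classification`, the feature names are placeholders, and scikit-learn's `RandomForestClassifier` is swapped in for LightGBM (both expose a `feature_importances_` attribute after fitting). Refitting a second model on only the ten selected columns is also an assumption about how a reduced-input app could work, not a description of this project's implementation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (hypothetical, for illustration).
X_arr, y = make_classification(
    n_samples=500, n_features=25, n_informative=8, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

# Fit a tree ensemble on all features and rank them by importance.
full_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(full_model.feature_importances_, index=feature_names)
top_10 = importances.nlargest(10).index.tolist()

# Assumption: refit on the reduced set so a user only has to supply ten inputs.
app_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[top_10], y)

# Score one hypothetical applicant using only the ten selected features.
applicant = X[top_10].iloc[[0]]
risk = app_model.predict_proba(applicant)[0, 1]

print("Top 10 features:", top_10)
print(f"Predicted probability of the positive class: {risk:.2f}")
```

In an interactive app, the values in `applicant` would come from user inputs rather than from a row of the training data.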
This project was built using the following technologies and libraries:
Dashboard & Hosting:
- Marimo: A Python library for building interactive dashboards.
- Hugging Face Spaces: Used for hosting and sharing the interactive dashboard.
Data Analysis & Visualization:
- Pandas: For data manipulation and analysis.
- Matplotlib: For creating static visualizations.
- Seaborn: For creating statistical graphics.
Modeling & Training:
- Scikit-Learn: For machine learning tasks such as preprocessing, feature engineering, and model training.
- LightGBM: A gradient boosting framework that uses tree-based learning algorithms.
Development Tools:
This project uses the Home Credit Default Risk dataset from Kaggle, a public dataset containing details on over 246,000 individuals who have made payments on their loans.
- Source: Kaggle Dataset
