# Kenya Loan Analysis

A reproducible, testable microloan analysis pipeline for Kenyan loan applications with interactive visualization and robust data handling.
## Overview

This project provides a complete pipeline for analyzing microloan application data, with emphasis on:
- **Defensive data handling** — Robust numeric coercion with comprehensive diagnostics
- **Modular architecture** — Clear separation between ETL, modeling, analytics, and UI
- **Reproducibility** — Test coverage and deterministic results
- **Interactive exploration** — Streamlit UI for dynamic analysis and visualization
## Table of Contents

- Features
- Installation
- Quick Start
- Data Schema
- Project Structure
- Usage
- Testing
- Diagnostics
- Contributing
- License
## Features

- ✅ **Data Processing** — ETL pipeline with feature engineering (EMI, DTI ratios)
- ✅ **Machine Learning** — Model training with cross-validation and persistence
- ✅ **Advanced Analytics** — Clustering, temporal patterns, and risk analysis
- ✅ **Interactive UI** — Streamlit dashboard with Plotly visualizations
- ✅ **Comprehensive Testing** — Unit tests for core functionality
- ✅ **Quality Diagnostics** — Detailed reporting on data quality issues
## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager
### Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd kenya-loan-analysis
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv

   # On macOS/Linux
   source .venv/bin/activate

   # On Windows
   .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Quick Start

Launch the Streamlit application:

```bash
streamlit run src/app.py
```

Then:
- Upload an Excel or CSV file matching the expected schema
- Explore data quality diagnostics
- Train models and view performance metrics
- Run advanced analytics (clustering, temporal analysis, risk scoring)
- Generate and download insights reports
## Data Schema

The pipeline expects the following columns:
| Column | Type | Description |
|---|---|---|
| `Loan_ID` | string | Unique loan identifier |
| `Gender` | categorical | Applicant gender |
| `Married` | categorical | Marital status |
| `Dependents` | numeric/categorical | Number of dependents |
| `Education` | categorical | Education level |
| `Self_Employed` | categorical | Employment type |
| `ApplicantIncome` | numeric | Primary applicant income |
| `CoapplicantIncome` | numeric | Co-applicant income |
| `LoanAmount` | numeric | Requested loan amount |
| `Loan_Amount_Term` | numeric | Loan term in months |
| `Credit_History` | numeric | Credit history indicator |
| `Property_Area` | categorical | Property location type |
| `Loan_Status` | categorical | Approval status (Y/N variants) |
| `county` | categorical | Kenyan county |
| `application_date` | date | Application submission date |
- Date parsing: `application_date` supports day-first format
- Numeric coercion: Non-numeric values are converted to NaN with diagnostic reporting
- Loan status: Handles heterogeneous values (Y, N, Yes, No, sequences) with robust conversion logic
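As a sketch of what this coercion and status handling could look like (the function names are illustrative, not the actual `data_processor.py` API):

```python
import numpy as np
import pandas as pd

def coerce_numeric(series: pd.Series):
    """Coerce a column to numeric; report how many values became NaN."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Only count values that were present but unparseable, not pre-existing NaNs
    newly_nan = int(coerced.isna().sum() - series.isna().sum())
    return coerced, {"column": series.name, "coerced_to_nan": newly_nan}

def normalize_loan_status(value):
    """Map heterogeneous Y/N variants to 1/0; unknown values become NaN."""
    mapping = {"y": 1, "yes": 1, "n": 0, "no": 0}
    return mapping.get(str(value).strip().lower(), np.nan)
```

The point of the pattern: conversion never raises on bad data; it degrades to NaN and surfaces the count in the diagnostics dict.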
## Project Structure

```
.
├── src/
│   ├── data_processor.py   # ETL, cleaning, feature engineering
│   ├── model_trainer.py    # ML pipelines, evaluation, persistence
│   ├── analytics.py        # Clustering, temporal, risk analysis
│   └── app.py              # Streamlit UI application
├── tests/                  # Unit tests (pytest)
├── models/                 # Saved model artifacts
├── requirements.txt        # Python dependencies
└── README.md
```
### `data_processor.py`
- Load and validate input data
- Feature engineering (EMI, TotalIncome, DTI)
- Encoding and normalization
- Schema validation via the `DataSchema` helper
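The engineered features follow standard formulas. The sketch below assumes a fixed annual interest rate for the EMI calculation, since the schema has no rate column; column names match the schema, but the helper name and default rate are illustrative:

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame, annual_rate: float = 0.12) -> pd.DataFrame:
    """Add TotalIncome, EMI, and DTI columns (illustrative sketch)."""
    out = df.copy()  # defensive copy; avoids chained assignment
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]

    r = annual_rate / 12          # monthly rate (assumed, not in the data)
    n = out["Loan_Amount_Term"]   # term in months
    p = out["LoanAmount"]

    # Standard amortized-loan EMI: P * r * (1+r)^n / ((1+r)^n - 1)
    out["EMI"] = p * r * (1 + r) ** n / ((1 + r) ** n - 1)
    out["DTI"] = out["EMI"] / out["TotalIncome"]  # debt-to-income ratio
    return out
```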
### `model_trainer.py`
- Training pipeline configuration
- Cross-validation and evaluation
- Feature importance analysis
- Model serialization
### `analytics.py`
- KMeans clustering analysis
- Temporal pattern detection (monthly/quarterly)
- Risk scoring and segmentation
- Returns diagnostic dicts and Plotly figures
### `app.py`
- Streamlit UI components
- File upload handling
- Visualization rendering
- Report generation
## Usage

```python
from src.data_processor import load_and_process_data
from src.model_trainer import train_model
from src.analytics import run_clustering_analysis

# Load and process data
df, diagnostics = load_and_process_data('data.xlsx')

# Train model
model, metrics = train_model(df)

# Run analytics
cluster_results, fig = run_clustering_analysis(df)
```

```bash
# Run tests
pytest -v

# Run with coverage
pytest --cov=src --cov-report=html

# Run Streamlit app
streamlit run src/app.py --server.port 8501
```

## Testing

Run the test suite:

```bash
# All tests
pytest

# Verbose output
pytest -v

# Specific test file
pytest tests/test_analytics.py

# With coverage report
pytest --cov=src --cov-report=term-missing
```

Test coverage includes:
- Data coercion and validation
- Numeric conversion edge cases
- Diagnostic generation
- Feature engineering logic
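A representative test in the style of the suite (the helper and test names here are hypothetical; the real tests import from `src/`):

```python
import math
import pandas as pd

# Hypothetical helper standing in for the project's coercion function
def coerce_numeric(series):
    return pd.to_numeric(series, errors="coerce")

def test_non_numeric_becomes_nan():
    """Unparseable strings should coerce to NaN, valid numbers should survive."""
    result = coerce_numeric(pd.Series(["250", "n/a", "12.5"]))
    assert result.iloc[0] == 250.0
    assert math.isnan(result.iloc[1])
    assert result.iloc[2] == 12.5
```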
## Diagnostics

The pipeline provides comprehensive diagnostics at each stage:
- Counts of values coerced to NaN
- Invalid date formats
- Missing required columns
- Data type mismatches
- Original `Loan_Status` value distribution
- Conversion success/failure rates
- Rows excluded from clustering
- Feature missingness reports
- **Inputs:** DataFrame or Excel/CSV file matching the schema
- **Outputs:** Processed DataFrame, diagnostics dict, Plotly figures (where applicable)
- **Error handling:** Diagnostics are returned for data quality issues; exceptions are raised only for programming errors
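A diagnostics dict might look like the following (key names and values are illustrative, not the pipeline's exact ones):

```python
# Example shape of a diagnostics dict returned alongside a DataFrame
diagnostics = {
    "coerced_to_nan": {"LoanAmount": 3, "ApplicantIncome": 1},
    "invalid_dates": 2,
    "missing_columns": [],
    "loan_status": {
        "original_values": {"Y": 410, "N": 192, "Yes": 7},
        "conversion_failures": 0,
    },
    "clustering": {"rows_excluded": 12},
}
```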
## Contributing

Contributions are welcome! Please follow these guidelines:
1. Fork the repository and create a feature branch:

   ```bash
   git checkout -b feature/your-feature-name
   ```

2. Write tests for new functionality:

   ```bash
   pytest tests/
   ```

3. Follow conventions:
   - Use type hints where applicable
   - Return diagnostics for data operations
   - Keep functions small and focused
   - Use module-level logging instead of `print()`

4. Submit a pull request with:
   - Clear description of changes
   - Link to related issues
   - Test coverage for new code
### Code Style

- Return Plotly figure objects from analytics functions
- Let `app.py` handle rendering (avoid `fig.show()`)
- Use defensive DataFrame operations (avoid chained assignment)
- Provide structured diagnostics for debugging
## Roadmap

- Add anonymized sample dataset for demos
- Implement CI/CD with GitHub Actions
- Add pre-commit hooks for code quality
- Expand test coverage to >80%
- Add API documentation with Sphinx
- Support additional data formats (Parquet, JSON)
## License

This project is licensed under the terms specified in the LICENSE file.
Dependencies are listed in requirements.txt and subject to their respective licenses.
## Support

- Issues: Open an issue for bugs or feature requests
- Discussions: Start a discussion for questions or ideas
- Documentation: Check inline docstrings and module comments
---

*Made with 🇰🇪 for transparent, defensible loan analysis*