A modular Python pipeline for credit risk analysis with simulated real-time event ingestion, feature engineering, and ML-ready data preparation.
src/
├── pipeline.py # Main orchestrator (run_pipeline)
├── events/ # Event streaming simulation
│ ├── queue_manager.py # In-memory event queue
│ ├── producer.py # Event generation from raw data
│ └── consumer.py # Event consumption and storage
├── preprocessing/
│ ├── data_transformations.py # Cleaning, backfill, missing value tracking
│ └── feature_engineering.py # FeatureEngineer class with transforms
├── models/
│ └── credit_risk_model.py # Model data loading utilities
├── notebooks/
│ └── credit_risk_analysis.ipynb # EDA and analysis notebook
├── data/
│ ├── raw/ # Source Excel data
│ └── processed/ # Parquet outputs
└── viz/ # Visualization utilities
1. Install dependencies:

       pip install -r requirements.txt

2. Run the pipeline:

       python -m src.pipeline

3. Use in Python:

       from src.pipeline import run_pipeline, Paths
       full_df, modeling_df = run_pipeline(verbose=True)

| Stage | Description |
|---|---|
| 1. Event Ingestion | Producer-consumer pattern simulates real-time data arrival via queue |
| 2. Load & Transform | Backfill overdues, track missing values, clean data |
| 3. Feature Engineering | Log transforms, severity scoring, age normalization, KNN imputation, outlier detection |
| 4. Export | Save modeling-ready DataFrame to parquet |
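
The event ingestion stage can be pictured with a small producer-consumer sketch built on the standard library's `queue.Queue`. This is only an illustration of the pattern, not the code in `src/events/`: the function names, the `SENTINEL` marker, and the record fields (`customer_id`, `overdue_days`) are all assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not taken from the repository


def produce(records, event_queue):
    """Wrap each raw record in an event, enqueue it, then signal completion."""
    for event_id, record in enumerate(records):
        event_queue.put({"event_id": event_id, "payload": record})
    event_queue.put(SENTINEL)


def consume(event_queue, store):
    """Dequeue events until the sentinel arrives, storing each payload."""
    while True:
        event = event_queue.get()
        if event is SENTINEL:
            break
        store.append(event["payload"])


# Made-up records standing in for rows of the raw Excel data.
raw_records = [
    {"customer_id": 1, "overdue_days": 0},
    {"customer_id": 2, "overdue_days": 45},
    {"customer_id": 3, "overdue_days": 12},
]

event_queue = queue.Queue()
consumed = []

# The consumer runs in its own thread, so arrival and processing are decoupled.
consumer_thread = threading.Thread(target=consume, args=(event_queue, consumed))
consumer_thread.start()
produce(raw_records, event_queue)
consumer_thread.join()
```

Because the queue is first-in, first-out and there is a single consumer, `consumed` ends up holding the records in their original order.
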
- Event Streaming Simulation – Queue-based ingestion decouples data arrival from processing
- Feature Engineering – Log transforms, severity scoring, business scale, IQR outlier detection
- KNN Imputation – Fills missing values using the most similar records (nearest neighbors)
- Modular Architecture – Clean separation of concerns across preprocessing, models, and events
- Dual Output – Returns both full DataFrame (all columns) and modeling DataFrame (ML-ready features)
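
As a rough illustration of three of the transforms listed above (KNN imputation, a log transform, and IQR outlier detection), here is a minimal sketch using scikit-learn's `KNNImputer`, `numpy.log1p`, and pandas quantiles. The column names, the sample values, `n_neighbors=2`, and the conventional 1.5 × IQR fence are assumptions for the example; the actual `FeatureEngineer` class may do this differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: column names and values are illustrative only.
df = pd.DataFrame({
    "revenue": [1000.0, 1200.0, np.nan, 1100.0, 1300.0, 50000.0],
    "overdue_amount": [10.0, 20.0, 15.0, np.nan, 25.0, 500.0],
    "company_age": [2.0, 5.0, 3.0, 4.0, 6.0, 30.0],
})

# KNN imputation: each missing value is filled with the mean of that column
# over the 2 nearest rows. Features are left unscaled here for brevity;
# scaling first is usually advisable because distances depend on units.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# Log transform: log1p compresses the long right tail and is defined at zero.
imputed["log_revenue"] = np.log1p(imputed["revenue"])

# IQR outlier detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = imputed["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
imputed["revenue_outlier"] = ~imputed["revenue"].between(lower, upper)
```

In this toy data only the last row, with a revenue of 50000, falls outside the IQR fence.
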
The pipeline produces two DataFrames:
- full_df – Complete dataset with all engineered features
- modeling_df – Subset of features ready for ML training (~13 columns)
Saved to: src/data/processed/modeling_df.parquet
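
To pick the saved output up in a later session, something like the sketch below should work. `load_modeling_df` and `is_feature_subset` are hypothetical helpers, not part of the repository, and the in-memory frames use invented column names. Only the file path comes from this README.

```python
from pathlib import Path

import pandas as pd

# Output location stated in this README.
MODELING_PATH = Path("src/data/processed/modeling_df.parquet")


def load_modeling_df(path=MODELING_PATH):
    """Read the saved modeling DataFrame (hypothetical helper; needs fastparquet)."""
    return pd.read_parquet(path, engine="fastparquet")


def is_feature_subset(full_df, modeling_df):
    """Return True if every modeling column also exists in the full DataFrame."""
    return set(modeling_df.columns).issubset(full_df.columns)


# Tiny in-memory stand-ins for the two pipeline outputs (invented columns).
full_df = pd.DataFrame({
    "customer_id": [1, 2],
    "revenue": [1000.0, 2500.0],
    "log_revenue": [6.91, 7.82],
    "severity_score": [0, 2],
})
modeling_df = full_df[["log_revenue", "severity_score"]].copy()
```

After running the pipeline, `load_modeling_df()` reads the parquet file, and `is_feature_subset(full_df, modeling_df)` gives a quick check that the modeling columns are drawn from the full DataFrame.
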
- Python 3.9+
- pandas, numpy – Data manipulation
- scikit-learn – KNN imputation
- fastparquet – Efficient data storage
- matplotlib, seaborn – Visualization (notebook)
- Email: pesmithiii7@gmail.com
- Documentation: Milestone 1 | Milestone 2 | Milestone 3
- Repository: GitHub