# Architecture & Design Decisions

> How this project works and why every choice was made.

## Data Flow

```mermaid
flowchart LR
    subgraph Input["📦 Data In"]
        CSV[DataCo CSV\n180K orders]
    end

    subgraph Safety["🛡️ Leakage Shield"]
        DROP[Drop 3 post-shipment\ncolumns automatically]
    end

    subgraph Pipeline["🔄 ML Pipeline"]
        FEAT[Extract 12\npre-shipment features]
        SPLIT[Split FIRST\nthen encode]
        TRAIN[Train XGBoost\n+ LogReg]
    end

    subgraph Evaluate["📊 Evaluate & Explain"]
        EVAL[F1 / Precision\nRecall / AUC-ROC]
        SHAP[SHAP\nexplainability]
        MLF[MLflow\ntrack & version]
    end

    subgraph Output["🎯 Outputs"]
        DASH[Streamlit\nDashboard]
        REG[Model\nRegistry]
    end

    CSV --> DROP --> FEAT --> SPLIT --> TRAIN
    TRAIN --> EVAL --> MLF --> REG
    TRAIN --> SHAP --> DASH
    EVAL --> DASH
```

## Architecture Pattern

**Hexagonal (Ports & Adapters)**: dependencies point inward only.

```
adapters/    →  domain/  ←  application/
(frameworks)    (pure)      (orchestration)
```

- `domain/` imports ONLY Python stdlib. Zero sklearn, pandas, or numpy.
- Adapters implement domain Protocol interfaces.
- Swap XGBoost for LightGBM? One new adapter. Domain unchanged.

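The ports-and-adapters rule above can be illustrated with a minimal sketch. All names here (`Order`, `LateDeliveryModel`, `ThresholdRuleAdapter`, `flag_risky_orders`) are hypothetical and chosen for illustration; they are not taken from the project. The adapter is a trivial stand-in so the example runs on the standard library alone; a real adapter would wrap XGBoost or LightGBM behind the same method.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Order:
    """Hypothetical domain entity holding pre-shipment fields only."""
    shipping_mode: str
    scheduled_days: int


class LateDeliveryModel(Protocol):
    """Port: the only thing the domain knows about a classifier."""

    def predict_late_probability(self, orders: Sequence[Order]) -> list[float]: ...


class ThresholdRuleAdapter:
    """Stand-in adapter; a real one would delegate to XGBoost or LightGBM."""

    def predict_late_probability(self, orders: Sequence[Order]) -> list[float]:
        # Made-up rule purely so the sketch is runnable.
        return [0.8 if o.scheduled_days <= 2 else 0.3 for o in orders]


def flag_risky_orders(
    model: LateDeliveryModel, orders: Sequence[Order], threshold: float = 0.5
) -> list[Order]:
    """Domain service: depends on the port, never on a concrete library."""
    probabilities = model.predict_late_probability(orders)
    return [o for o, p in zip(orders, probabilities) if p >= threshold]


orders = [Order("Same Day", 0), Order("Standard Class", 4)]
risky = flag_risky_orders(ThresholdRuleAdapter(), orders)
print(risky)
```

Because `flag_risky_orders` is typed against the `Protocol`, replacing the model means writing another class with the same method; the domain function does not change.
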
## Key Design Decisions

| Decision | Choice | Why | Alternative Considered |
|----------|--------|-----|------------------------|
| Architecture | Hexagonal | Swappable adapters, testable domain | Flat scripts |
| Primary metric | F1 score | With a 55/45 class split, a majority-class baseline already reaches 55% accuracy | AUC-ROC |
| Explainability | SHAP | Additive, local + global, theoretically grounded | LIME |
| Tracking | MLflow (SQLite) | Industry standard, Model Registry | W&B |
| Encoding | Split-before-encode | Prevents preprocessing leakage | Encode-then-split |
| Clustering | K-Means + silhouette | Interpretable, good for segmentation | DBSCAN |
| Dashboard | Streamlit | Python-native, free hosting | Dash |
| Testing | pytest + Hypothesis | Property-based testing catches edge cases | unittest |
| Data source toggle | Strategy C hybrid | Full metrics + live sample predictions | All-sample or all-full |

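The primary-metric row can be made concrete with a little arithmetic. The sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts using only the standard library. The counts are illustrative, not project results: they describe a baseline that always predicts the majority class on 1,000 orders with a 55/45 split, and they assume (purely for the example) that the positive class is the 45% minority.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative baseline: always predict the majority (negative) class.
# 550 negatives are all "correct"; 450 positives are all missed.
baseline = classification_metrics(tp=0, fp=0, fn=450, tn=550)
print(baseline)
```

The baseline scores 0.55 accuracy while finding no positive cases at all, so its F1 is 0.0. Under the opposite assumption (positive class is the majority) the numbers differ, which is one reason to report precision and recall alongside F1.
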
## Leakage Protection (3 Layers)

| Layer | What It Does |
|-------|--------------|
| `LEAKAGE_COLUMNS` constant | Drops `Days for shipping (real)`, `Delivery Status`, and `shipping date` in the CSV adapter, before any feature extraction |
| Split-before-encode | Encoder is fitted on the training split only, so it never sees test data |
| Property-based tests | Hypothesis verifies that leakage column names never appear in the feature output |

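The first two layers can be sketched in plain Python. This is not the project's code, which presumably uses pandas and scikit-learn in its adapters: the helper names, the `Shipping Mode` feature, the `late` label, and the toy rows are all assumptions made so the example runs on the standard library. The final assertion stands in for the third layer; a Hypothesis test would check the same property over generated inputs.

```python
import random

LEAKAGE_COLUMNS = frozenset(
    {"Days for shipping (real)", "Delivery Status", "shipping date"}
)


def drop_leakage(rows):
    """Layer 1: remove post-shipment columns from every row."""
    return [{k: v for k, v in r.items() if k not in LEAKAGE_COLUMNS} for r in rows]


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then cut into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_category_codes(train_rows, column):
    """Layer 2: learn category codes from the training split only."""
    categories = sorted({r[column] for r in train_rows})
    return {category: code for code, category in enumerate(categories)}


def encode(rows, column, codes, unknown=-1):
    """Apply fitted codes; categories unseen in training map to `unknown`."""
    return [{**r, column: codes.get(r[column], unknown)} for r in rows]


# Toy rows with hypothetical values, for illustration only.
modes = ["Standard Class", "First Class", "Second Class", "Same Day"]
raw_rows = [
    {"Shipping Mode": modes[i % 4], "Delivery Status": "Late delivery",
     "Days for shipping (real)": 5, "shipping date": "1/1/2018", "late": i % 2}
    for i in range(20)
]

clean = drop_leakage(raw_rows)
train, test = split_rows(clean)                    # split FIRST
codes = fit_category_codes(train, "Shipping Mode")  # then fit on train only
train_enc = encode(train, "Shipping Mode", codes)
test_enc = encode(test, "Shipping Mode", codes)

# Layer 3 stand-in: no leakage column survives into the feature output.
assert all(LEAKAGE_COLUMNS.isdisjoint(r) for r in train_enc + test_enc)
```

Fitting the codes after the split means a category that appears only in the test rows is encoded as `unknown` rather than silently shaping the encoder.
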
## Dashboard Architecture (Strategy C)

```mermaid
flowchart TD
    subgraph Static["Pre-computed (instant load)"]
        FM[full_run_metrics.json\nF1, AUC, confusion matrices]
        FS[full_dataset_stats.json\nLate rates by mode/region]
        FP[full_pca_clusters.json\n2000-point PCA subsample]
    end

    subgraph Live["Sample-trained (cached ~5s)"]
        MODEL[XGBoost + LogReg\non 1K sample]
        SHAP2[SHAP explainers]
    end

    subgraph Toggle["Sidebar Toggle"]
        FULL[Full Dataset 180K]
        SAMPLE[Sample 1K]
    end

    FULL --> FM & FS & FP
    SAMPLE --> MODEL & SHAP2
    MODEL --> PREDICT[Risk Predictor\nalways live]
```

The **Risk Predictor** always uses the live sample-trained model so that it stays interactive. The stats tabs switch between pre-computed full-dataset metrics and live sample metrics via the sidebar toggle.
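
The toggle logic described above can be sketched as a small dispatch function. The function name `metrics_for`, the constants, and the placeholder JSON payload are assumptions for illustration, not the project's code; only the file name `full_run_metrics.json` and the two toggle labels come from the diagram. In the dashboard the `source` value would presumably come from a sidebar widget such as `st.sidebar.radio`, which is left out so the example runs without Streamlit.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

FULL = "Full Dataset 180K"
SAMPLE = "Sample 1K"


def metrics_for(source: str, precomputed_path: Path,
                compute_live: Callable[[], dict]) -> dict:
    """Return pre-computed metrics for the full dataset, live ones for the sample."""
    if source == FULL:
        # Static path: read the JSON produced by an earlier full training run.
        return json.loads(precomputed_path.read_text(encoding="utf-8"))
    if source == SAMPLE:
        # Live path: evaluate the sample-trained model on demand.
        return compute_live()
    raise ValueError(f"Unknown data source: {source!r}")


# Demo with a placeholder file; the payload is not real project output.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "full_run_metrics.json"
    path.write_text(json.dumps({"origin": "precomputed"}), encoding="utf-8")
    full = metrics_for(FULL, path, lambda: {"origin": "live"})
    sample = metrics_for(SAMPLE, path, lambda: {"origin": "live"})

print(full, sample)
```

Passing the live computation in as a callable keeps the expensive step lazy: it only runs when the sample source is selected, which fits the "cached ~5s" note in the diagram.
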