Merge pull request #26 from tirthjoship/chore/docs-restructure

tirthjoship · web-flow · commit 3bab957564c1 · 2026-05-14T23:23:43.000-07:00
docs: restructure — public ARCHITECTURE.md + private interview prep
diff --git a/.gitignore b/.gitignore
@@ -11,6 +11,7 @@ AGENTS.md
 docs/superpowers/plans/
 docs/superpowers/specs/
 docs/COMPLETION_CHECKLIST.md
+docs/PROJECT_DEEP_DIVE.md
 
 # Conda env — requirements.txt covers deps for Streamlit Cloud
 environment.yml
diff --git a/README.md b/README.md
@@ -13,9 +13,9 @@
   <a href="#quick-start">Quick Start</a> &bull;
   <a href="#dashboard">Dashboard</a> &bull;
   <a href="#model-comparison">Results</a> &bull;
-  <a href="#architecture">Architecture</a> &bull;
+  <a href="#how-it-works">Architecture</a> &bull;
   <a href="#explainability">Explainability</a> &bull;
-  <a href="docs/PROJECT_DEEP_DIVE.md">Deep Dive</a>
+  <a href="docs/ARCHITECTURE.md">Design Decisions</a>
 </p>
 
 <p align="center">
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -0,0 +1,100 @@
+# Architecture & Design Decisions
+
+> How this project works and why each design choice was made.
+
+## Data Flow
+
+```mermaid
+flowchart LR
+    subgraph Input["📦 Data In"]
+        CSV[DataCo CSV\n180K orders]
+    end
+
+    subgraph Safety["🛡️ Leakage Shield"]
+        DROP[Drop 3 post-shipment\ncolumns automatically]
+    end
+
+    subgraph Pipeline["🔄 ML Pipeline"]
+        FEAT[Extract 12\npre-shipment features]
+        SPLIT[Split FIRST\nthen encode]
+        TRAIN[Train XGBoost\n+ LogReg]
+    end
+
+    subgraph Evaluate["📊 Evaluate & Explain"]
+        EVAL[F1 / Precision\nRecall / AUC-ROC]
+        SHAP[SHAP\nexplainability]
+        MLF[MLflow\ntrack & version]
+    end
+
+    subgraph Output["🎯 Outputs"]
+        DASH[Streamlit\nDashboard]
+        REG[Model\nRegistry]
+    end
+
+    CSV --> DROP --> FEAT --> SPLIT --> TRAIN
+    TRAIN --> EVAL --> MLF --> REG
+    TRAIN --> SHAP --> DASH
+    EVAL --> DASH
+```
+
+## Architecture Pattern
+
+**Hexagonal (Ports & Adapters)** — dependencies point inward only.
+
+```
+adapters/     →  domain/  ←  application/
+(frameworks)     (pure)      (orchestration)
+```
+
+- `domain/` imports ONLY Python stdlib. Zero sklearn, pandas, or numpy.
+- Adapters implement domain Protocol interfaces.
+- Swap XGBoost for LightGBM? One new adapter. Domain unchanged.
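A minimal sketch of the dependency rule described above, using only the standard library. The names `DelayModel`, `Prediction`, `is_high_risk`, and `ConstantModelAdapter` are hypothetical and chosen for illustration; the project's actual Protocol interfaces may look different.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Prediction:
    """Domain value object (hypothetical): probability that an order ships late."""
    late_probability: float


class DelayModel(Protocol):
    """Port (hypothetical): the only thing the domain knows about a model."""

    def predict(self, features: Sequence[float]) -> Prediction: ...


def is_high_risk(model: DelayModel, features: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Domain logic: depends on the port only, never on a concrete ML library."""
    return model.predict(features).late_probability >= threshold


class ConstantModelAdapter:
    """Stand-in adapter for illustration; a real adapter would wrap
    XGBoost or LightGBM behind the same predict() method."""

    def __init__(self, probability: float) -> None:
        self._probability = probability

    def predict(self, features: Sequence[float]) -> Prediction:
        # Ignores the features and returns a fixed probability.
        return Prediction(self._probability)
```

Because `DelayModel` is a structural `Protocol`, an adapter satisfies it by having a matching `predict` method and does not need to inherit from anything in `domain/`. Swapping the underlying library then means writing one new adapter class, while `is_high_risk` stays untouched.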
+
+## Key Design Decisions
+
+| Decision | Choice | Why | Alternative Considered |
+|----------|--------|-----|----------------------|
+| Architecture | Hexagonal | Swappable adapters, testable domain | Flat scripts |
+| Primary metric | F1 score | With a 55/45 class split, a majority-class baseline already reaches 55% accuracy, so accuracy alone is misleading | AUC-ROC |
+| Explainability | SHAP | Additive, local + global, theoretically grounded | LIME |
+| Tracking | MLflow (SQLite) | Industry standard, Model Registry | W&B |
+| Encoding | Split-before-encode | Prevents preprocessing leakage | Encode-then-split |
+| Clustering | K-Means + silhouette | Interpretable, good for segmentation | DBSCAN |
+| Dashboard | Streamlit | Python-native, free hosting | Dash |
+| Testing | pytest + Hypothesis | Property-based catches edge cases | unittest |
+| Data source toggle | Strategy C hybrid | Full metrics + live sample predictions | All-sample or all-full |
+
+## Leakage Protection (3 Layers)
+
+| Layer | What It Does |
+|-------|-------------|
+| `LEAKAGE_COLUMNS` constant | Drops `Days for shipping (real)`, `Delivery Status`, and `shipping date` in the CSV adapter, before feature extraction |
+| Split-before-encode | Encoder never sees test data during fit |
+| Property-based tests | Hypothesis verifies leakage column names never appear in feature output |
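The first two layers in the table above can be sketched in plain Python as follows. Only the `LEAKAGE_COLUMNS` name and the three column names come from the table; the helper functions, the `-1` code for unseen categories, and the hand-rolled encoder are assumptions made for illustration and are not the project's actual adapter code.

```python
import random

# Column names taken from the table above.
LEAKAGE_COLUMNS = frozenset(
    {"Days for shipping (real)", "Delivery Status", "shipping date"}
)


def drop_leakage(row: dict) -> dict:
    """Layer 1: remove post-shipment columns from a single record."""
    return {key: value for key, value in row.items() if key not in LEAKAGE_COLUMNS}


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_encoder(train_rows, column):
    """Layer 2: learn the category-to-integer mapping from TRAIN rows only."""
    categories = sorted({row[column] for row in train_rows})
    return {category: index for index, category in enumerate(categories)}


def encode(rows, column, mapping, unknown=-1):
    """Apply a fitted mapping; categories unseen during fit get `unknown`."""
    return [mapping.get(row[column], unknown) for row in rows]
```

The ordering is the point: `fit_encoder` is called after `split_rows` and receives only the training rows, so a category that appears only in the test set cannot influence the fitted mapping. The third layer would assert, over generated inputs, that no key in `LEAKAGE_COLUMNS` survives `drop_leakage`.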
+
+## Dashboard Architecture (Strategy C)
+
+```mermaid
+flowchart TD
+    subgraph Static["Pre-computed (instant load)"]
+        FM[full_run_metrics.json\nF1, AUC, confusion matrices]
+        FS[full_dataset_stats.json\nLate rates by mode/region]
+        FP[full_pca_clusters.json\n2000-point PCA subsample]
+    end
+
+    subgraph Live["Sample-trained (cached ~5s)"]
+        MODEL[XGBoost + LogReg\non 1K sample]
+        SHAP2[SHAP explainers]
+    end
+
+    subgraph Toggle["Sidebar Toggle"]
+        FULL[Full Dataset 180K]
+        SAMPLE[Sample 1K]
+    end
+
+    FULL --> FM & FS & FP
+    SAMPLE --> MODEL & SHAP2
+    MODEL --> PREDICT[Risk Predictor\nalways live]
+```
+
+The **Risk Predictor** always uses the live sample-trained model for interactivity. The stats tabs switch between pre-computed full-dataset metrics and live sample metrics via the sidebar toggle.
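The toggle logic described above might be dispatched roughly as in this sketch. The function names, the `"full"`/`"sample"` source labels, and the callable-based model are hypothetical; only the idea of reading a pre-computed JSON file versus computing sample metrics on demand comes from the diagram.

```python
import json
from pathlib import Path
from typing import Callable


def load_stats(source: str, precomputed_path: Path,
               compute_sample_metrics: Callable[[], dict]) -> dict:
    """Return metrics for the stats tabs based on the sidebar toggle."""
    if source == "full":
        # Pre-computed path: read the JSON artifact, no training involved.
        return json.loads(precomputed_path.read_text())
    if source == "sample":
        # Live path: metrics come from the sample-trained model.
        return compute_sample_metrics()
    raise ValueError(f"unknown data source: {source!r}")


def predict_risk(sample_model: Callable[[dict], float], order: dict) -> float:
    """The Risk Predictor ignores the toggle and always uses the sample model."""
    return sample_model(order)
```

In a Streamlit app the sample path would typically sit behind a cached function so the model is trained once per session rather than on every rerun.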
diff --git a/docs/PROJECT_DEEP_DIVE.md b/docs/PROJECT_DEEP_DIVE.md