Commit 3bab957

Merge pull request #26 from tirthjoship/chore/docs-restructure
docs: restructure — public ARCHITECTURE.md + private interview prep
2 parents 5680980 + 73d030b commit 3bab957

4 files changed

Lines changed: 103 additions & 430 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ AGENTS.md
 docs/superpowers/plans/
 docs/superpowers/specs/
 docs/COMPLETION_CHECKLIST.md
+docs/PROJECT_DEEP_DIVE.md

 # Conda env — requirements.txt covers deps for Streamlit Cloud
 environment.yml

README.md

Lines changed: 2 additions & 2 deletions
@@ -13,9 +13,9 @@
 <a href="#quick-start">Quick Start</a> &bull;
 <a href="#dashboard">Dashboard</a> &bull;
 <a href="#model-comparison">Results</a> &bull;
-<a href="#architecture">Architecture</a> &bull;
+<a href="#how-it-works">Architecture</a> &bull;
 <a href="#explainability">Explainability</a> &bull;
-<a href="docs/PROJECT_DEEP_DIVE.md">Deep Dive</a>
+<a href="docs/ARCHITECTURE.md">Design Decisions</a>
 </p>

 <p align="center">

docs/ARCHITECTURE.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+# Architecture & Design Decisions
+
+> How this project works and why every choice was made.
+
+## Data Flow
+
+```mermaid
+flowchart LR
+subgraph Input["📦 Data In"]
+CSV[DataCo CSV\n180K orders]
+end
+
+subgraph Safety["🛡️ Leakage Shield"]
+DROP[Drop 3 post-shipment\ncolumns automatically]
+end
+
+subgraph Pipeline["🔄 ML Pipeline"]
+FEAT[Extract 12\npre-shipment features]
+SPLIT[Split FIRST\nthen encode]
+TRAIN[Train XGBoost\n+ LogReg]
+end
+
+subgraph Evaluate["📊 Evaluate & Explain"]
+EVAL[F1 / Precision\nRecall / AUC-ROC]
+SHAP[SHAP\nexplainability]
+MLF[MLflow\ntrack & version]
+end
+
+subgraph Output["🎯 Outputs"]
+DASH[Streamlit\nDashboard]
+REG[Model\nRegistry]
+end
+
+CSV --> DROP --> FEAT --> SPLIT --> TRAIN
+TRAIN --> EVAL --> MLF --> REG
+TRAIN --> SHAP --> DASH
+EVAL --> DASH
+```
+
+## Architecture Pattern
+
+**Hexagonal (Ports & Adapters)** — dependencies point inward only.
+
+```
+adapters/ → domain/ ← application/
+(frameworks) (pure) (orchestration)
+```
+
+- `domain/` imports ONLY Python stdlib. Zero sklearn, pandas, or numpy.
+- Adapters implement domain Protocol interfaces.
+- Swap XGBoost for LightGBM? One new adapter. Domain unchanged.
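
The ports-and-adapters bullets above can be illustrated with a minimal, stdlib-only sketch. Every name here (`Order`, `LateDeliveryModel`, `RuleBasedAdapter`, `flag_risky_orders`) is hypothetical and does not come from the repository; the adapter is a toy stand-in for what would presumably wrap XGBoost or LightGBM.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Order:
    """Domain entity (hypothetical fields): stdlib types only."""
    shipping_mode: str
    scheduled_days: int


class LateDeliveryModel(Protocol):
    """Port: the only thing the domain knows about any model."""

    def predict_late_probability(self, orders: Sequence[Order]) -> list[float]: ...


class RuleBasedAdapter:
    """Toy adapter for illustration; a real one would wrap a framework model."""

    def predict_late_probability(self, orders: Sequence[Order]) -> list[float]:
        # Made-up rule: tighter schedules are scored as riskier.
        return [0.8 if o.scheduled_days <= 2 else 0.3 for o in orders]


def flag_risky_orders(
    model: LateDeliveryModel, orders: Sequence[Order], threshold: float = 0.5
) -> list[Order]:
    """Application-layer use case: depends on the port, not on a framework."""
    scores = model.predict_late_probability(orders)
    return [o for o, s in zip(orders, scores) if s >= threshold]
```

Under this assumption, swapping the model means writing another class with the same `predict_late_probability` method; `flag_risky_orders` and `Order` stay untouched.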
+
+## Key Design Decisions
+
+| Decision | Choice | Why | Alternative Considered |
+|----------|--------|-----|----------------------|
+| Architecture | Hexagonal | Swappable adapters, testable domain | Flat scripts |
+| Primary metric | F1 score | 55/45 split makes accuracy misleading | AUC-ROC |
+| Explainability | SHAP | Additive, local + global, theoretically grounded | LIME |
+| Tracking | MLflow (SQLite) | Industry standard, Model Registry | W&B |
+| Encoding | Split-before-encode | Prevents preprocessing leakage | Encode-then-split |
+| Clustering | K-Means + silhouette | Interpretable, good for segmentation | DBSCAN |
+| Dashboard | Streamlit | Python-native, free hosting | Dash |
+| Testing | pytest + Hypothesis | Property-based catches edge cases | unittest |
+| Data source toggle | Strategy C hybrid | Full metrics + live sample predictions | All-sample or all-full |
+
+## Leakage Protection (3 Layers)
+
+| Layer | What It Does |
+|-------|-------------|
+| `LEAKAGE_COLUMNS` constant | Physically drops `Days for shipping (real)`, `Delivery Status`, `shipping date` at CSV adapter |
+| Split-before-encode | Encoder never sees test data during fit |
+| Property-based tests | Hypothesis verifies leakage column names never appear in feature output |
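
The first two layers in the table above can be sketched in plain Python. This is an illustration under assumptions, not the repository's code: only the `LEAKAGE_COLUMNS` name and the three column names come from the diff, while the helper functions and the row-as-dict representation are hypothetical.

```python
import random

# Column names as listed in the diff; everything else here is illustrative.
LEAKAGE_COLUMNS = ("Days for shipping (real)", "Delivery Status", "shipping date")


def drop_leakage(row: dict) -> dict:
    """Layer 1: remove post-shipment columns before anything else sees the row."""
    return {k: v for k, v in row.items() if k not in LEAKAGE_COLUMNS}


def split_rows(rows: list, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle deterministically, then cut into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_category_encoder(train_rows: list, column: str) -> dict:
    """Layer 2: learn the category-to-integer mapping from training rows only."""
    categories = sorted({r[column] for r in train_rows})
    return {c: i for i, c in enumerate(categories)}


def encode(rows: list, column: str, mapping: dict, unknown: int = -1) -> list:
    """Apply a fitted mapping; categories unseen during fit map to `unknown`."""
    return [mapping.get(r[column], unknown) for r in rows]
```

Because the mapping is fitted after the split, a category that appears only in the test rows is encoded as `unknown` instead of silently influencing the fitted encoder.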
+
+## Dashboard Architecture (Strategy C)
+
+```mermaid
+flowchart TD
+subgraph Static["Pre-computed (instant load)"]
+FM[full_run_metrics.json\nF1, AUC, confusion matrices]
+FS[full_dataset_stats.json\nLate rates by mode/region]
+FP[full_pca_clusters.json\n2000-point PCA subsample]
+end
+
+subgraph Live["Sample-trained (cached ~5s)"]
+MODEL[XGBoost + LogReg\non 1K sample]
+SHAP2[SHAP explainers]
+end
+
+subgraph Toggle["Sidebar Toggle"]
+FULL[Full Dataset 180K]
+SAMPLE[Sample 1K]
+end
+
+FULL --> FM & FS & FP
+SAMPLE --> MODEL & SHAP2
+MODEL --> PREDICT[Risk Predictor\nalways live]
+```
+
+**Risk Predictor** always uses live sample model for interactivity. Stats tabs switch between pre-computed full metrics and live sample metrics via sidebar toggle.
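
The Strategy C routing described above could look roughly like the following. This is a sketch under assumptions: the enum, the tab identifier `"risk_predictor"`, and the backend labels are hypothetical, and only the toggle labels and JSON file names are taken from the diagram.

```python
from enum import Enum


class DataSource(Enum):
    """Sidebar toggle values, labelled as in the diagram."""
    FULL = "Full Dataset 180K"
    SAMPLE = "Sample 1K"


# Pre-computed artifacts named in the diagram.
PRECOMPUTED_FILES = (
    "full_run_metrics.json",
    "full_dataset_stats.json",
    "full_pca_clusters.json",
)


def resolve_backend(tab: str, source: DataSource) -> str:
    """Decide where a dashboard tab gets its data (hypothetical labels)."""
    if tab == "risk_predictor":
        # The predictor stays on the live sample model whatever the toggle says.
        return "live_sample_model"
    if source is DataSource.FULL:
        return "precomputed_json"
    return "live_sample_model"
```

In a Streamlit app the `source` argument would presumably come from a sidebar widget, with the sample model cached so that toggling does not retrain it.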
