Modeled provincial healthcare investment efficiency across 10 Canadian provinces to uncover where funding breaks down, enabling evidence-based reallocation recommendations for system planners.
At the national level, higher provincial budgets correlate with shorter wait times (r = −0.50, p < 0.001). Break the data down by province, and the relationship reverses — provinces with the highest budgets often have the longest waits.
This is an instance of Simpson's Paradox. It is consistent with reactive government funding: provinces with structurally long wait times receive budget increases, but aging populations, physician shortages, and facility constraints absorb the investment without improving wait times.
The budget signal is real. It is not strong enough to be the primary lever.
| Metric | Value |
|---|---|
| Pearson correlation | r = −0.50, p < 0.001 |
| Variance explained (OLS) | R² = 0.205 — budget explains 20.5% of wait variation |
| Effect size | ~0.2 days per $1B (directional estimate, not a causal coefficient) |
| Unexplained variance | 79.5% — structural factors dominate |
| Observations | n = 54 province-years, out of a possible 60 (10 provinces × 6 years, 2013–2018) |
| Top predictive feature (XGBoost) | budget_rank — provincial structural position, not raw spend |
This analysis does not establish causality. All recommendations are directional hypotheses requiring quasi-experimental validation.
| Recommendation | Evidence Basis | Trade-off |
|---|---|---|
| Shift KPI to wait-days per $M CAD (efficiency, not total spend) | R² = 0.205: budget level alone is an insufficient performance signal | Efficiency metrics require consistent cost accounting across provinces |
| Target BC, NB, PEI first (highest unexplained within-province variance) | Province-level analysis shows these provinces diverge most from budget-predicted wait times | Requires provincial buy-in; not a funding conversation |
| Invest in structural capacity (physician supply, facility distribution) over aggregate budget transfers | 79.5% of variance is structural — this is where the ROI is | Longer payback cycle; harder to show as a political win |
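
As an illustration of the first recommendation, the efficiency KPI could be computed as wait days divided by budget in millions of CAD. This is only one possible reading of "wait-days per $M CAD". The function name and the wait figures below are hypothetical; the budget magnitudes echo the ON and PEI scales quoted in this README.

```python
def wait_days_per_million(wait_days: float, budget_millions_cad: float) -> float:
    """Efficiency KPI sketch: wait days per $1M CAD of budget."""
    if budget_millions_cad <= 0:
        raise ValueError("budget must be positive")
    return wait_days / budget_millions_cad

# Budgets in $M CAD; wait values are hypothetical placeholders.
examples = {
    "ON": {"budget_m": 59_000.0, "wait_days": 30.0},
    "PEI": {"budget_m": 680.0, "wait_days": 30.0},
}

kpi = {
    code: wait_days_per_million(row["wait_days"], row["budget_m"])
    for code, row in examples.items()
}
for code, value in kpi.items():
    print(f"{code}: {value:.5f} wait-days per $M")
```

With identical waits, the raw ratio differs only because of province size, so a cross-province comparison would probably need a per-capita or otherwise normalized denominator.
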

```
.
├── src/ # Python pipeline
│ ├── config.py # Path constants, province maps, analysis params
│ ├── data_ingestion.py # CIHI download + SQLite write; synthetic fallback
│ ├── data_cleaning.py # Budget + wait time cleaning + merge
│ ├── feature_engineering.py # 7 features: per-capita, lag, rank, trend
│ ├── modeling.py # 3-model strategy: OLS → Ridge/Lasso → XGBoost
│ ├── evaluation.py # Metrics, partial dependence, decision output
│ ├── run_pipeline.py # Orchestrator: run this
│ └── requirements.txt
│
├── tests/
│ ├── conftest.py # Shared pytest fixtures (synthetic data, session-scoped)
│ └── test_pipeline.py # 25 smoke tests: schema, merge ~60 rows, 7 features, R² gate
│
├── docs/
│ ├── executive_brief.md # 1-page standalone brief for system planners
│ ├── executive_one_pager.md # Single-slide summary for senior executives
│ ├── decision_output.md # Explicit recommendations with evidence + trade-offs
│ ├── program_narrative.md # TPM framing: decisions, trade-offs, stakeholders
│ ├── slide_deck_outline.md # 5-slide consulting-grade deck outline
│ ├── communications_guide.md # Audience-specific briefing summaries and analytical FAQ
│ └── program_delivery.md # Delivery plan: charter, WBS, gates, RACI, risk register
│
├── notebooks/
│ ├── canadian_healthcare_analysis.Rmd # R analysis
│ └── canadian_healthcare_analysis.md # Knitted markdown output
│
├── data/
│ ├── README.md # Data dictionary, schemas, province codes, assumptions
│ ├── input/ # Raw CIHI xlsx files (gitignored; download fresh)
│ └── processed/ # Cleaned CSVs (gitignored; regenerated by pipeline)
│
├── outputs/ # Rendered R outputs (PDF, HTML)
├── pipelines/
│ ├── README.md # Pipeline execution guide and deployment instructions
│ └── github_actions_pipeline.yml # Draft CI/CD workflow (Phase 2 reference)
├── pytest.ini # pytest configuration (testpaths = tests)
├── .gitignore
└── README.md
```

| Model | Features | Purpose | Limitation |
|---|---|---|---|
| Baseline OLS | Budget only | Replicates R analysis; sanity-check gate (R² ≈ 0.205) | Omitted variable bias; no non-linearity |
| Ridge / Lasso | All 7 engineered features | Stability with n = 54; Lasso auto-selects features | Less interpretable; penalised coefficients |
| XGBoost* | All 7 engineered features | Non-linear pattern exploration; feature importance | Pattern exploration only — not for prediction or deployment |
*XGBoost falls back to RandomForestRegressor if libomp is not installed (brew install libomp on macOS).
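
The fallback described in the footnote can be sketched with a guarded import. This is an assumed pattern, not the code in `modeling.py`, and `first_importable` is a hypothetical helper. It catches `Exception` rather than only `ImportError` because a missing native library can surface as a different error type at import time.

```python
import importlib

def first_importable(module_names):
    """Return the first module in module_names that imports cleanly, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except Exception:  # broad on purpose: native-library load errors vary
            continue
    return None

# Preference order mirrors the footnote: XGBoost first, scikit-learn second.
backend = first_importable(["xgboost", "sklearn.ensemble"])
if backend is None:
    print("neither xgboost nor scikit-learn is available")
else:
    print(f"tree-model backend: {backend.__name__}")
```

A caller would then build `XGBRegressor` or `RandomForestRegressor` depending on which module came back.
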
| | Simple baseline | This analysis |
|---|---|---|
| Features | Budget (raw, millions CAD) | 7 features capturing per-capita normalization, temporal dynamics, structural position |
| Problem | Confounded by province size; ON ($59B) vs PEI ($680M) not comparable | Per-capita budget addresses scale; lag addresses timing; rank addresses structural position |
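
A quick calculation shows why the per-capita correction matters. The budgets are the ON and PEI figures quoted in the table. The populations are 2016 Census counts entered by hand here (Ontario 13,448,494; Prince Edward Island 142,907); treat them as assumptions and verify them against the data dictionary before reuse.

```python
# Budgets from the table above (CAD); populations are hand-entered
# 2016 Census counts and should be treated as assumptions.
provinces = {
    "ON": {"budget_cad": 59_000_000_000, "population": 13_448_494},
    "PEI": {"budget_cad": 680_000_000, "population": 142_907},
}

def budget_per_capita(budget_cad: float, population: int) -> float:
    """Total budget divided by population, in CAD per person."""
    return budget_cad / population

per_capita = {
    code: budget_per_capita(p["budget_cad"], p["population"])
    for code, p in provinces.items()
}
raw_ratio = provinces["ON"]["budget_cad"] / provinces["PEI"]["budget_cad"]
pc_ratio = per_capita["ON"] / per_capita["PEI"]

print(f"raw budget ratio ON/PEI: {raw_ratio:.1f}x")
print(f"per-capita ratio ON/PEI: {pc_ratio:.2f}x")
```

On these inputs the raw budgets differ by nearly two orders of magnitude, while the per-capita figures land within roughly 10% of each other.
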
All 7 features:

| Feature | Addresses |
|---|---|
| `budget_per_capita` | Province size confound (most important correction) |
| `volume_per_capita` | Demand-side pressure differences across provinces |
| `budget_lag1` | Tests reactive vs. proactive funding hypothesis |
| `province_encoded` | Structural position (fiscal scale, ordinal) |
| `year_trend` | Secular time trend (aging, technology) |
| `budget_yoy_change` | Direction of investment, not just level |
| `budget_rank` | Relative provincial position within each year |

Population normalization uses static 2016 Census baseline to avoid introducing temporal bias from interpolated estimates.
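
The lag, year-over-year, and rank features can be sketched in plain Python. This is an illustrative reconstruction from the feature names, not the contents of `feature_engineering.py`. The record layout, the function name `add_budget_features`, the use of fractional change for `budget_yoy_change`, and rank 1 meaning the largest budget are all assumptions.

```python
from collections import defaultdict

def add_budget_features(rows):
    """Add budget_lag1, budget_yoy_change, budget_rank to copies of rows.

    rows: dicts with 'province', 'year', 'budget'. The first year of each
    province gets None for the lag-based features. Assumes no gaps in years.
    """
    out = [dict(r) for r in rows]

    # Lag and year-over-year change, computed within each province.
    by_province = defaultdict(list)
    for r in out:
        by_province[r["province"]].append(r)
    for series in by_province.values():
        series.sort(key=lambda r: r["year"])
        prev = None
        for r in series:
            r["budget_lag1"] = prev
            r["budget_yoy_change"] = (
                None if prev is None else (r["budget"] - prev) / prev
            )
            prev = r["budget"]

    # Rank within each year: 1 = largest budget that year (assumed direction).
    by_year = defaultdict(list)
    for r in out:
        by_year[r["year"]].append(r)
    for group in by_year.values():
        ordered = sorted(group, key=lambda r: r["budget"], reverse=True)
        for rank, r in enumerate(ordered, start=1):
            r["budget_rank"] = rank
    return out

# Toy values, not CIHI data.
toy = [
    {"province": "X", "year": 2013, "budget": 100.0},
    {"province": "X", "year": 2014, "budget": 110.0},
    {"province": "Y", "year": 2013, "budget": 50.0},
    {"province": "Y", "year": 2014, "budget": 120.0},
]
for row in add_budget_features(toy):
    print(row)
```

Note that any lag-based feature leaves the first year of each province undefined, so a pipeline has to decide whether to drop or impute those rows.
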

- Budget explains 20.5% of wait time variance: statistically significant, practically limited. The other 79.5% is the more important signal.
- Simpson's Paradox at the provincial level: the national negative trend reverses province by province, consistent with reactive government funding into structurally constrained systems.
- Low marginal return: approximately 0.2 days per $1B on the observed dataset. Large funding increases are associated with small changes in wait time.
- Structural position outperforms raw budget: XGBoost ranks `budget_rank` as more predictive than raw `Budget`, which is consistent with the structural hypothesis.
- Diminishing returns: partial dependence analysis suggests that beyond ~$5,000–6,000 per capita, additional spending is associated with minimal further wait time reduction (directional, n = 54, not a policy rule).

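
To put the marginal-return estimate in concrete terms, the arithmetic below extrapolates the ~0.2 days per $1B figure linearly. It is a back-of-envelope sketch that assumes a constant slope, which the diminishing-returns finding itself argues against at higher spend, and it inherits the non-causal caveat. The function names are hypothetical.

```python
DAYS_PER_BILLION = 0.2  # directional estimate quoted above, not a causal coefficient

def implied_wait_change(extra_budget_billions: float) -> float:
    """Linear extrapolation: days of wait reduction implied by extra budget."""
    return DAYS_PER_BILLION * extra_budget_billions

def budget_for_target(days_reduction: float) -> float:
    """Inverse: $B of extra budget implied for a target reduction in days."""
    return days_reduction / DAYS_PER_BILLION

print(implied_wait_change(5.0))  # days for an extra $5B
print(budget_for_target(10.0))   # $B for a 10-day reduction
```

On these numbers, an extra $5B maps to about one day, and a 10-day reduction would imply roughly $50B.
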
| Approach | Why Not Used |
|---|---|
| Deep learning | n = 54; no latent structure; no generalisation basis; overkill |
| Causal inference (IV / DiD / RDD) | No valid instrumental variable; no policy discontinuity; observational panel data |
| 50+ feature models | Noise risk overwhelms n = 54; parsimony is the correct call, not a limitation |
| Individual patient-level analysis | Not in CIHI public data; aggregate-to-individual inference is the ecological fallacy |
Choosing not to use a technique, with documented reasoning, is the senior analytical move.
| Document | Purpose |
|---|---|
| docs/executive_brief.md | 1-page brief — read this first |
| docs/executive_one_pager.md | Single-slide summary for senior executives |
| docs/decision_output.md | Full recommendation set with evidence and trade-offs |
| docs/program_narrative.md | TPM framing: analytical decisions, trade-offs, stakeholder context |
| docs/slide_deck_outline.md | 5-slide consulting-grade deck outline |
| docs/communications_guide.md | Audience-specific briefing summaries for system planners and technical reviewers |
| docs/program_delivery.md | Program delivery plan: charter, WBS, milestones, gate criteria, RACI, risk register |
| data/README.md | Data dictionary, sourcing, schemas, assumptions |
Prerequisites: Python 3.10+, pip

```bash
# 1. Clone the repo
git clone <repo-url>
cd <repo-name>

# 2. Install dependencies (all free / open-source)
pip install -r src/requirements.txt

# macOS only, required for XGBoost:
brew install libomp

# 3. Run the pipeline
python src/run_pipeline.py

# Optional: attempt live CIHI download
python src/run_pipeline.py --live
```

No server setup required. SQLite (Python stdlib). No .env file. No credentials.

Output:

- `data/healthcare.db`: SQLite database with raw and processed tables
- `data/processed/merged_final.csv`: read by the R notebook
- Terminal: model comparison table, feature importance, decision output

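
Since the table names inside `data/healthcare.db` are not listed here, one way to inspect the database after a run is to query SQLite's own catalogue. The helper name `list_tables` is hypothetical; `sqlite_master` is SQLite's built-in schema table.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return user table names in the connected SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

db_path = Path("data/healthcare.db")  # path from the Output list above
if db_path.exists():
    with closing(sqlite3.connect(db_path)) as conn:
        print(list_tables(conn))
else:
    print(f"{db_path} not found; run the pipeline first")
```
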
R notebook (optional):

```bash
# Run Python pipeline first to generate data/processed/merged_final.csv
# Then open notebooks/canadian_healthcare_analysis.Rmd in RStudio and knit
```

Running the test suite:

```bash
pytest
```

Tests run against synthetic data. No CIHI connection required. Expected: 25 tests pass in under 30 seconds.

Both datasets are free public data from the Canadian Institute for Health Information (CIHI).
| Dataset | Source |
|---|---|
| National Health Expenditure Trends | CIHI data catalogue |
| Wait Times for Priority Procedures | CIHI data catalogue |
- Correlational, not causal. This analysis does not establish causality. All policy recommendations are framed as directional hypotheses requiring quasi-experimental validation.
- Small sample. n = 54 province-year observations. Results are directional; cross-validated R² reflects generalisation limits.
- Pre-COVID data. 2013–2018 only. Post-2020 disruption likely changed these dynamics significantly.
- Aggregate level. Province-year aggregates mask within-province variation. Ecological fallacy risk prohibits individual-level inference.
- Budget is total expenditure, not procedure-specific. Targeted capacity investment analysis requires procedure-level budget data not in the CIHI public dataset.
Nammn Joshii | LinkedIn | GitHub
Provenance: Original analysis: October 2019. Repository structured for public portfolio: April 2026. The analytical findings, data, and code are unchanged from the original analysis. Documentation (program narrative, decision output, delivery plan) reflects structured retrospective framing of the 2019 work.