Business question: What drives subscriber churn in health plan services — and what can retention teams do about it?
Subscriber churn is one of the highest-cost problems in healthtech services. Acquiring a new client costs 5–7× more than retaining an existing one. This project uses a real-world churn dataset to:
- Identify which subscriber segments are most at risk (EDA + segmentation)
- Quantify the behavioral and contractual signals that precede cancellation (SQL cohort analysis)
- Build a classification model that flags high-risk subscribers before they leave (ML)
The analytical stack mirrors production work at healthtech companies: SQL for cohort logic, Python for modeling, and visual storytelling throughout.
health-churn-analysis/
│
├── README.md ← You are here
│
├── data/
│ └── subscribers.csv ← Public dataset (Telco Churn, reframed as health plan)
│
├── notebooks/
│ ├── 01_eda.ipynb ← Distributions, null analysis, churn by segment
│ ├── 02_viz.ipynb ← Seaborn + Plotly charts, visual storytelling
│ └── 03_model.ipynb ← Logistic Regression + Random Forest, AUC-ROC, SHAP
│
└── sql/
├── 01_churn_rate_by_cohort.sql ← Monthly cohort retention table
├── 02_retention_by_plan_type.sql ← Churn rate by contract type
└── 03_high_risk_segments.sql ← Window functions to rank at-risk segments
(Charts will be embedded here after notebooks are complete)
| Finding | Detail |
|---|---|
| Monthly plans churn 3× more than annual contracts | Switching users to semi-annual locks in retention |
| First 90 days are critical | 42% of churn happens within the first quarter — onboarding matters most |
| High-engagement users almost never churn | Users with 3+ service touchpoints/month have <5% churn rate |
| Model AUC-ROC | TBD after notebook 03 |
git clone https://github.com/vinicius-goulart/health-churn-analysis.git
cd health-churn-analysisuv syncjupyter notebookOpen and run: 01_eda.ipynb → 02_viz.ipynb → 03_model.ipynb
Queries in sql/ are written for PostgreSQL. To run against the dataset locally:
# Load data into a local PostgreSQL instance or use DuckDB
duckdb -c "CREATE TABLE subscribers AS SELECT * FROM 'data/subscribers.csv';"
duckdb -c ".read sql/01_churn_rate_by_cohort.sql"Source: Telco Customer Churn — IBM Sample Data
The dataset contains ~7,000 subscribers with features including:
- Contract type (month-to-month, one year, two year)
- Tenure (months as a subscriber)
- Monthly charges and total charges
- Service usage features
- Churn label (Yes/No)
Framing: Column names have been adapted to a health plan context (e.g.,
Contract→plan_type,tenure→months_as_subscriber) to make the business narrative more realistic and relevant to healthtech recruiting contexts.
| Layer | Tools |
|---|---|
| Data wrangling | Python, Pandas, NumPy |
| SQL analytics | PostgreSQL / DuckDB |
| Visualization | Seaborn, Plotly |
| ML modeling | scikit-learn (LogisticRegression, RandomForestClassifier) |
| Explainability | SHAP |
| Notebooks | Jupyter |
Based on the analysis:
-
Prioritize onboarding for month-to-month subscribers — the first 90 days are where most churn occurs. A structured 30/60/90 day check-in cadence can reduce early churn by an estimated 15–20%.
-
Incentivize annual contract upgrades — offering a discount or added benefit at the 3-month mark for monthly subscribers targets the highest-risk window.
-
Deploy the churn model as a weekly scoring job — flag subscribers with >60% predicted churn probability for proactive outreach by the CS team.
Vinícius Goulart Nardelli
Data Analyst · São Paulo, BR
LinkedIn · GitHub
This project is part of my data analytics portfolio. All data is public and used for educational purposes.