Merged
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -9,7 +9,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12"]
python-version: ["3.11", "3.12"]

steps:

10 changes: 10 additions & 0 deletions .gitignore
@@ -9,3 +9,13 @@

# OSX-specific
.DS_Store

# uv files
uv.lock
.venv

# AutoGluon training runs
autogluon*

# archive folder if present
archive
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -22,3 +22,7 @@ repos:
- id: ruff-check
args: [ --fix ]
- id: ruff-format
- repo: https://github.com/kynan/nbstripout
rev: 0.9.1
hooks:
- id: nbstripout
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Swiss Data Science Center

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
140 changes: 116 additions & 24 deletions README.md
@@ -1,57 +1,149 @@
# SME-KT-ZH Collaboration Forecasting
# SME-KT-ZH: Collaboration Forecasting

This project is made using [this template](https://github.com/sdsc-innovation/cookiecutter-python).
Next steps include:
This project leverages **survival analysis** on B2B and B2C sales transaction data to predict customer re-order timing. By estimating when a customer is likely to return, the model generates a ranked priority list to drive proactive outreach and strategic collaboration planning.

- [x] Create project from the Cookiecutter template.
- [ ] Create a virtual environment to work in an isolated Python installation.
- [ ] Install [pre-commit](https://pre-commit.com/) hooks.
- [ ] Keep either `.gitlab-ci.yml` or `.github`, according to your Git hosting platform.
- [ ] Update `authors` and `description`, in `pyproject.toml`.
- [ ] `requirements.txt` should contain the *exact* (a.k.a. pinned) versions of the dependencies used in development, including tools. However, do not include indirect dependencies.
- [ ] Add installation dependencies in `pyproject.toml`, with permissive version constraints.
- [ ] Add a `LICENSE` file, if applicable. This is *highly recommended* if the project is open source.
- [ ] Add a [`CITATION.cff`](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files), to ease citation of your work.
- [ ] Replace this `README.md` with a proper one. Among others, it must explain the overall context, the installation instructions, a quick start guide, and a repository structure description.
### Scope & Purpose
* **What this is:** A prototype developed during a week-long workshop within the [Canton of Zurich SME program](https://www.datascience.ch/innovation/canton-zurich-sme-program) (Step 2: *Practical Sessions and Prototyping*).
* **The Goal:** To provide a foundation for understanding survival analysis in a commercial setting.
* **Open Source:** You are encouraged to use this codebase as a starting point for your own data and experiments.

---

### Disclaimer
**This project is a proof-of-concept.** The code is intended for educational and prototyping purposes only. It is **not** production-ready and should not be deployed into live systems without significant refactoring and robust testing.

---

## Table of Contents

- [Data](#data)
- [Notebooks](#notebooks)
- [Source Modules](#source-modules)
- [Installation](#installation)
- [Development Tools](#development-tools)

---

## Data

The `data/` directory contains two files:

| File | Description |
|---|---|
| `sales_df.csv` | Transaction-level sales records with customer IDs, dates, and customer category/type attributes |
| `feiertage.csv` | Swiss public holiday calendar used as an external covariate |

> **Note:** `sales_df.csv` contains **synthetic data**, generated to reflect the statistical properties and patterns of real sales transactions, including realistic customer ordering cadences, seasonal effects, and the B2B/B2C customer mix. It does not contain any personal or commercially sensitive information.
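
As a rough illustration of how a holiday calendar such as `feiertage.csv` could be turned into a covariate, the sketch below parses a few rows in the file's column layout and flags whether a given date is a holiday. The helper names `load_holidays` and `is_holiday` are hypothetical and are not part of this repository. The sketch uses only the standard library so that it stays self-contained.

```python
import csv
import io
from datetime import date

# A few rows in the same column layout as data/feiertage.csv.
SAMPLE = """Year,Date,Holiday Name,Scope,Day of Week
2024,2024-01-01,Neujahrstag,Cantonal,Monday
2024,2024-04-15,Sechseläuten,Local (City),Monday
2024,2024-08-01,Bundesfeier,Federal,Thursday
"""


def load_holidays(handle):
    """Map each holiday date to its name (hypothetical helper)."""
    reader = csv.DictReader(handle)
    return {date.fromisoformat(row["Date"]): row["Holiday Name"] for row in reader}


def is_holiday(day, holidays):
    """Return 1 if `day` is in the calendar, else 0, for use as a binary covariate."""
    return int(day in holidays)


holidays = load_holidays(io.StringIO(SAMPLE))
print(is_holiday(date(2024, 8, 1), holidays))  # date of Bundesfeier in the sample
print(is_holiday(date(2024, 8, 2), holidays))  # ordinary day
```

To read the real file, the same function could be given an open file handle instead of the in-memory sample.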

---

## Notebooks

Notebooks are located in `notebooks/` and should be run **in the following order**. Each builds on the insights of the previous one.

### 1. `EDA.ipynb` — Exploratory Data Analysis

Start here. This notebook provides a thorough understanding of the data before any modeling is attempted:

- Transaction-level overview and customer segmentation (B2C vs. B2B)
- Temporal patterns: daily, weekly, monthly, and quarterly seasonality
- Multi-seasonal decomposition (MSTL) and stationarity tests (ADF/KPSS)
- Holiday correlation analysis
- AutoGluon time series forecasting benchmarks (with and without holiday covariates)

Running EDA first is essential because survival models are sensitive to data quality and distribution assumptions. Understanding customer ordering cadence, the degree of censoring, and data irregularities directly informs modeling choices.

---

### 2. `Lifelines_Modelling.ipynb` — Parametric Survival Models

Introduces survival analysis via the [`lifelines`](https://lifelines.readthedocs.io) library. Models the time between purchases as a survival problem, where a "next order" is the event and customers who have not yet reordered are right-censored.
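
The framing above can be sketched without any dependencies: each gap between two consecutive orders is an observed event, and the open gap after a customer's last order is right-censored at the end of the observation window. The function name `inter_purchase_durations`, the `(customer, order_date)` input shape, and the `observation_end` cutoff are assumptions made for illustration. They do not describe the actual interface of `modelling.py`.

```python
from collections import defaultdict
from datetime import date


def inter_purchase_durations(orders, observation_end):
    """Turn (customer, order_date) pairs into survival rows.

    Each row is (customer, duration_days, event). event=1 means the next order
    was observed; event=0 means the gap is right-censored at `observation_end`.
    """
    by_customer = defaultdict(list)
    for customer, day in orders:
        by_customer[customer].append(day)

    rows = []
    for customer, days in by_customer.items():
        days = sorted(set(days))  # at most one order event per customer per day
        # Completed gaps between consecutive orders: the event was observed.
        for previous, current in zip(days, days[1:]):
            rows.append((customer, (current - previous).days, 1))
        # Open gap after the last order: no reorder seen yet, so censored.
        rows.append((customer, (observation_end - days[-1]).days, 0))
    return rows


orders = [
    ("A", date(2024, 1, 1)),
    ("A", date(2024, 1, 15)),
    ("A", date(2024, 2, 14)),
    ("B", date(2024, 2, 1)),
]
rows = inter_purchase_durations(orders, observation_end=date(2024, 3, 1))
for row in rows:
    print(row)
```

Customer `B` has a single order, so it contributes only a censored row. Dropping such customers instead would bias the durations toward frequent buyers.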

Three models are fitted and compared:

| Model | Type | Key characteristic |
|---|---|---|
| **Cox Proportional Hazards (CoxPH)** | Semi-parametric | Makes no assumption about the baseline hazard shape; assumes covariate effects are multiplicative and constant over time (proportional hazards). Highly interpretable. |
| **Weibull AFT** | Parametric | Models time-to-event directly under a Weibull distribution. Assumes a monotone hazard (increasing, constant, or decreasing over time); covariates stretch or compress the time axis. |
| **Log-Normal AFT** | Parametric | Same AFT framework as Weibull but with a log-normal distribution, which allows a non-monotone hazard. |
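
To make the "stretch or compress the time axis" idea concrete, the sketch below evaluates a Weibull AFT survival function by hand. It follows the parameterization described in the `lifelines` documentation, $S(t \mid x) = \exp\left(-(t/\lambda(x))^{\rho}\right)$ with $\lambda(x) = \exp(\beta_0 + \beta^\top x)$. The coefficient values and the single binary covariate are made up for illustration and are not fitted on this project's data.

```python
import math


def weibull_aft_scale(x, intercept, coefs):
    """lambda(x) = exp(intercept + coefs . x), the covariate-dependent time scale."""
    return math.exp(intercept + sum(b * xi for b, xi in zip(coefs, x)))


def weibull_aft_survival(t, x, intercept, coefs, rho):
    """S(t | x) = exp(-(t / lambda(x)) ** rho)."""
    return math.exp(-((t / weibull_aft_scale(x, intercept, coefs)) ** rho))


def weibull_aft_median(x, intercept, coefs, rho):
    """Median time-to-event: the t at which S(t | x) = 0.5."""
    return weibull_aft_scale(x, intercept, coefs) * math.log(2) ** (1 / rho)


# Made-up parameters: one binary covariate (say 1 = B2B, 0 = B2C).
intercept, coefs, rho = 3.0, [math.log(2)], 1.5

median_b2c = weibull_aft_median([0], intercept, coefs, rho)
median_b2b = weibull_aft_median([1], intercept, coefs, rho)

# A coefficient of log(2) doubles the time scale for the covariate group.
print(median_b2b / median_b2c)
```

With these parameters the survival curve for the covariate group is the baseline curve with time stretched by a factor of two: `S(2t | x=1)` equals `S(t | x=0)` for every `t`.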

**Why start with Lifelines?** These models are interpretable, fast to fit, and provide a strong, explainable baseline. The Cox model in particular has well-understood diagnostics (proportional hazards assumption tests) that help validate whether survival analysis is appropriate for this data. Evaluation uses the concordance index (C-index) and recall@k on a held-out test set.
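
The two evaluation metrics can be sketched in a few lines. The C-index below is a plain quadratic-time version of Harrell's concordance, and `recall_at_k` is one plausible definition (overlap between the predicted and the actual top-k customers). The exact definitions used in the notebooks may differ, and both function signatures are assumptions made for illustration. Scores here are risk scores, where higher means an earlier expected reorder. `lifelines.utils.concordance_index` uses the opposite convention, where higher scores mean longer survival.

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when the shorter duration ended in an observed
    event. It is concordant when that subject also has the higher risk score.
    Ties in risk count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            if durations[i] < durations[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; the C-index is undefined")
    return concordant / comparable


def recall_at_k(predicted_ranking, actual_ranking, k):
    """Share of the actual top-k customers that appear in the predicted top-k."""
    return len(set(predicted_ranking[:k]) & set(actual_ranking[:k])) / k


durations = [5, 10, 15, 20]
events = [1, 1, 0, 1]            # third customer is right-censored
risk_scores = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(durations, events, risk_scores)
recall = recall_at_k(["A", "C", "B", "D"], ["A", "B", "C", "D"], k=2)
print(c_index, recall)
```

In this toy example five pairs are comparable and four are ordered correctly, so the C-index is 0.8. One of the two actual top customers appears in the predicted top two, so recall@2 is 0.5.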

---

### 3. `RSF_Modelling.ipynb` — Random Survival Forest

Fits a **Random Survival Forest (RSF)** using [`scikit-survival`](https://scikit-survival.readthedocs.io). RSF is a non-parametric ensemble method that:

- Makes no distributional assumptions about the event time
- Captures non-linear covariate effects and feature interactions automatically
- Supports richer feature engineering (recency, frequency, customer category dummies)
- Includes hyperparameter tuning

**Why RSF after Lifelines?** RSF is more complex and less interpretable than the Cox and AFT models. Running those models first establishes a performance baseline and validates the survival analysis framing. RSF is then used to explore whether more flexible modelling, at the cost of interpretability, improves customer ranking. Results from all three approaches are compared in a final summary table.

---

## Source Modules

Helper code is organized under `src/sme_kt_zh_collaboration_forecasting/`:

| Module | Purpose |
|---|---|
| `utils.py` | Data loading utility (`read_sales_data`): reads `sales_df.csv`, parses dates, and creates a numeric customer ID while preserving the original name |
| `EDA.py` | Reusable EDA functions: general sales time-series analysis (seasonality plots, MSTL decomposition, stationarity tests), holiday-lag correlation, and AutoGluon training-set builders |
| `modelling.py` | Survival analysis pipeline helpers: data preparation (inter-purchase durations, censoring), train/test splitting, and evaluation utilities (C-index, predicted vs. real priority ranking) for Cox, AFT, and RSF models |

---

## Installation

Install pinned development dependencies using:
**Important:** The project uses [`uv`](https://github.com/astral-sh/uv) for dependency management. A `uv.lock` file is included for fully reproducible installs. To use it:
>
> ```bash
> uv sync
> ```

```

Alternative installation methods follow:

```bash
pip install -r requirements.txt
```

If you are using Conda to manage your Python environments:

```
```bash
conda env create -f environment.yml
```

Alternatively, if you are using an existing environment, you can install the module in [editable mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html), which includes only minimal dependencies:

```
```bash
pip install -e .
```

---

## Development tools
## Development Tools

In order to use [pre-commit](https://pre-commit.com/) hooks, they need to be registered:
Register [pre-commit](https://pre-commit.com/) hooks after installation:

```
```bash
pre-commit install
```

It is a good practice to manually invoke hooks after installation, just in case:
Run hooks manually to verify the setup:

```
```bash
pre-commit run --all-files
```

Unit tests (using [pytest](https://pytest.org/)) are not executed as a pre-commit hook, to keep the overhead to a minimum. Instead, a CI/CD pipeline is configured to run tests after each commit. You can also execute them locally, manually:
Unit tests (via [pytest](https://pytest.org/)) can be run locally:

```
```bash
pytest
```

## License
Licensed under the [MIT License](LICENSE).
Empty file removed data/.gitkeep
Empty file.
145 changes: 145 additions & 0 deletions data/feiertage.csv
@@ -0,0 +1,145 @@
Year,Date,Holiday Name,Scope,Day of Week
2015,2015-01-01,Neujahrstag,Cantonal,Thursday
2015,2015-01-02,Berchtoldstag,Cantonal,Friday
2015,2015-04-03,Karfreitag,Cantonal,Friday
2015,2015-04-06,Ostermontag,Cantonal,Monday
2015,2015-04-13,Sechseläuten,Local (City),Monday
2015,2015-05-01,Tag der Arbeit,Cantonal,Friday
2015,2015-05-14,Auffahrt,Cantonal,Thursday
2015,2015-05-25,Pfingstmontag,Cantonal,Monday
2015,2015-08-01,Bundesfeier,Federal,Saturday
2015,2015-09-14,Knabenschiessen,Local (City),Monday
2015,2015-12-25,Weihnachtstag,Cantonal,Friday
2015,2015-12-26,Stephanstag,Cantonal,Saturday
2016,2016-01-01,Neujahrstag,Cantonal,Friday
2016,2016-01-02,Berchtoldstag,Cantonal,Saturday
2016,2016-03-25,Karfreitag,Cantonal,Friday
2016,2016-03-28,Ostermontag,Cantonal,Monday
2016,2016-04-18,Sechseläuten,Local (City),Monday
2016,2016-05-01,Tag der Arbeit,Cantonal,Sunday
2016,2016-05-05,Auffahrt,Cantonal,Thursday
2016,2016-05-16,Pfingstmontag,Cantonal,Monday
2016,2016-08-01,Bundesfeier,Federal,Monday
2016,2016-09-12,Knabenschiessen,Local (City),Monday
2016,2016-12-25,Weihnachtstag,Cantonal,Sunday
2016,2016-12-26,Stephanstag,Cantonal,Monday
2017,2017-01-01,Neujahrstag,Cantonal,Sunday
2017,2017-01-02,Berchtoldstag,Cantonal,Monday
2017,2017-04-14,Karfreitag,Cantonal,Friday
2017,2017-04-17,Ostermontag,Cantonal,Monday
2017,2017-04-24,Sechseläuten,Local (City),Monday
2017,2017-05-01,Tag der Arbeit,Cantonal,Monday
2017,2017-05-25,Auffahrt,Cantonal,Thursday
2017,2017-06-05,Pfingstmontag,Cantonal,Monday
2017,2017-08-01,Bundesfeier,Federal,Tuesday
2017,2017-09-11,Knabenschiessen,Local (City),Monday
2017,2017-12-25,Weihnachtstag,Cantonal,Monday
2017,2017-12-26,Stephanstag,Cantonal,Tuesday
2018,2018-01-01,Neujahrstag,Cantonal,Monday
2018,2018-01-02,Berchtoldstag,Cantonal,Tuesday
2018,2018-03-30,Karfreitag,Cantonal,Friday
2018,2018-04-02,Ostermontag,Cantonal,Monday
2018,2018-04-16,Sechseläuten,Local (City),Monday
2018,2018-05-01,Tag der Arbeit,Cantonal,Tuesday
2018,2018-05-10,Auffahrt,Cantonal,Thursday
2018,2018-05-21,Pfingstmontag,Cantonal,Monday
2018,2018-08-01,Bundesfeier,Federal,Wednesday
2018,2018-09-10,Knabenschiessen,Local (City),Monday
2018,2018-12-25,Weihnachtstag,Cantonal,Tuesday
2018,2018-12-26,Stephanstag,Cantonal,Wednesday
2019,2019-01-01,Neujahrstag,Cantonal,Tuesday
2019,2019-01-02,Berchtoldstag,Cantonal,Wednesday
2019,2019-04-19,Karfreitag,Cantonal,Friday
2019,2019-04-22,Ostermontag,Cantonal,Monday
2019,2019-04-08,Sechseläuten,Local (City),Monday
2019,2019-05-01,Tag der Arbeit,Cantonal,Wednesday
2019,2019-05-30,Auffahrt,Cantonal,Thursday
2019,2019-06-10,Pfingstmontag,Cantonal,Monday
2019,2019-08-01,Bundesfeier,Federal,Thursday
2019,2019-09-09,Knabenschiessen,Local (City),Monday
2019,2019-12-25,Weihnachtstag,Cantonal,Wednesday
2019,2019-12-26,Stephanstag,Cantonal,Thursday
2020,2020-01-01,Neujahrstag,Cantonal,Wednesday
2020,2020-01-02,Berchtoldstag,Cantonal,Thursday
2020,2020-04-10,Karfreitag,Cantonal,Friday
2020,2020-04-13,Ostermontag,Cantonal,Monday
2020,2020-04-20,Sechseläuten,Local (City),Monday
2020,2020-05-01,Tag der Arbeit,Cantonal,Friday
2020,2020-05-21,Auffahrt,Cantonal,Thursday
2020,2020-06-01,Pfingstmontag,Cantonal,Monday
2020,2020-08-01,Bundesfeier,Federal,Saturday
2020,2020-09-14,Knabenschiessen,Local (City),Monday
2020,2020-12-25,Weihnachtstag,Cantonal,Friday
2020,2020-12-26,Stephanstag,Cantonal,Saturday
2021,2021-01-01,Neujahrstag,Cantonal,Friday
2021,2021-01-02,Berchtoldstag,Cantonal,Saturday
2021,2021-04-02,Karfreitag,Cantonal,Friday
2021,2021-04-05,Ostermontag,Cantonal,Monday
2021,2021-04-19,Sechseläuten,Local (City),Monday
2021,2021-05-01,Tag der Arbeit,Cantonal,Saturday
2021,2021-05-13,Auffahrt,Cantonal,Thursday
2021,2021-05-24,Pfingstmontag,Cantonal,Monday
2021,2021-08-01,Bundesfeier,Federal,Sunday
2021,2021-09-13,Knabenschiessen,Local (City),Monday
2021,2021-12-25,Weihnachtstag,Cantonal,Saturday
2021,2021-12-26,Stephanstag,Cantonal,Sunday
2022,2022-01-01,Neujahrstag,Cantonal,Saturday
2022,2022-01-02,Berchtoldstag,Cantonal,Sunday
2022,2022-04-15,Karfreitag,Cantonal,Friday
2022,2022-04-18,Ostermontag,Cantonal,Monday
2022,2022-04-25,Sechseläuten,Local (City),Monday
2022,2022-05-01,Tag der Arbeit,Cantonal,Sunday
2022,2022-05-26,Auffahrt,Cantonal,Thursday
2022,2022-06-06,Pfingstmontag,Cantonal,Monday
2022,2022-08-01,Bundesfeier,Federal,Monday
2022,2022-09-12,Knabenschiessen,Local (City),Monday
2022,2022-12-25,Weihnachtstag,Cantonal,Sunday
2022,2022-12-26,Stephanstag,Cantonal,Monday
2023,2023-01-01,Neujahrstag,Cantonal,Sunday
2023,2023-01-02,Berchtoldstag,Cantonal,Monday
2023,2023-04-07,Karfreitag,Cantonal,Friday
2023,2023-04-10,Ostermontag,Cantonal,Monday
2023,2023-04-17,Sechseläuten,Local (City),Monday
2023,2023-05-01,Tag der Arbeit,Cantonal,Monday
2023,2023-05-18,Auffahrt,Cantonal,Thursday
2023,2023-05-29,Pfingstmontag,Cantonal,Monday
2023,2023-08-01,Bundesfeier,Federal,Tuesday
2023,2023-09-11,Knabenschiessen,Local (City),Monday
2023,2023-12-25,Weihnachtstag,Cantonal,Monday
2023,2023-12-26,Stephanstag,Cantonal,Tuesday
2024,2024-01-01,Neujahrstag,Cantonal,Monday
2024,2024-01-02,Berchtoldstag,Cantonal,Tuesday
2024,2024-03-29,Karfreitag,Cantonal,Friday
2024,2024-04-01,Ostermontag,Cantonal,Monday
2024,2024-04-15,Sechseläuten,Local (City),Monday
2024,2024-05-01,Tag der Arbeit,Cantonal,Wednesday
2024,2024-05-09,Auffahrt,Cantonal,Thursday
2024,2024-05-20,Pfingstmontag,Cantonal,Monday
2024,2024-08-01,Bundesfeier,Federal,Thursday
2024,2024-09-09,Knabenschiessen,Local (City),Monday
2024,2024-12-25,Weihnachtstag,Cantonal,Wednesday
2024,2024-12-26,Stephanstag,Cantonal,Thursday
2025,2025-01-01,Neujahrstag,Cantonal,Wednesday
2025,2025-01-02,Berchtoldstag,Cantonal,Thursday
2025,2025-04-18,Karfreitag,Cantonal,Friday
2025,2025-04-21,Ostermontag,Cantonal,Monday
2025,2025-04-28,Sechseläuten,Local (City),Monday
2025,2025-05-01,Tag der Arbeit,Cantonal,Thursday
2025,2025-05-29,Auffahrt,Cantonal,Thursday
2025,2025-06-09,Pfingstmontag,Cantonal,Monday
2025,2025-08-01,Bundesfeier,Federal,Friday
2025,2025-09-15,Knabenschiessen,Local (City),Monday
2025,2025-12-25,Weihnachtstag,Cantonal,Thursday
2025,2025-12-26,Stephanstag,Cantonal,Friday
2026,2026-01-01,Neujahrstag,Cantonal,Thursday
2026,2026-01-02,Berchtoldstag,Cantonal,Friday
2026,2026-04-03,Karfreitag,Cantonal,Friday
2026,2026-04-06,Ostermontag,Cantonal,Monday
2026,2026-04-20,Sechseläuten,Local (City),Monday
2026,2026-05-01,Tag der Arbeit,Cantonal,Friday
2026,2026-05-14,Auffahrt,Cantonal,Thursday
2026,2026-05-25,Pfingstmontag,Cantonal,Monday
2026,2026-08-01,Bundesfeier,Federal,Saturday
2026,2026-09-14,Knabenschiessen,Local (City),Monday
2026,2026-12-25,Weihnachtstag,Cantonal,Friday
2026,2026-12-26,Stephanstag,Cantonal,Saturday