|
1 | 1 | # SME-KT-ZH Collaboration Forecasting |
2 | 2 |
|
3 | | -This project is made using [this template](https://github.com/sdsc-innovation/cookiecutter-python). |
4 | | -Next steps include: |
| 3 | +This project applies **survival analysis** to B2B/B2C sales transaction data to predict when customers are likely to place their next order. The output is a ranked priority list of customers, enabling proactive outreach and collaboration planning. |
5 | 4 |
|
6 | | - - [x] Create project from the Cookiecutter template. |
7 | | - - [ ] Create a virtual environment to work in an isolated Python installation. |
8 | | - - [ ] Install [pre-commit](https://pre-commit.com/) hooks. |
9 | | - - [ ] Keep either `.gitlab-ci.yml` or `.github`, according to your Git hosting platform. |
10 | | - - [ ] Update `authors` and `description`, in `pyproject.toml`. |
11 | | - - [ ] `requirements.txt` should contain the *exact* (a.k.a. pinned) versions of the dependencies used development, including tools. However, do not include indirect dependencies. |
12 | | - - [ ] Add installation dependencies in `pyproject.toml`, with permissive version constraints. |
13 | | - - [ ] Add a `LICENSE` file, if applicable. This is *highly recommended* if the project is open source. |
14 | | - - [ ] Add a [`CITATION.cff`](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files), to ease citation of your work. |
15 | | - - [ ] Replace this `README.md` with a proper one. Among others, it must explain the overall context, the installation instructions, a quick start guide, and a repository structure description. |
| 5 | +--- |
16 | 6 |
|
| 7 | +## Table of Contents |
| 8 | + |
| 9 | +- [Data](#data) |
| 10 | +- [Notebooks](#notebooks) |
| 11 | +- [Source Modules](#source-modules) |
| 12 | +- [Installation](#installation) |
| 13 | +- [Development Tools](#development-tools) |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## Data |
| 18 | + |
| 19 | +The `data/` directory contains two files: |
| 20 | + |
| 21 | +| File | Description | |
| 22 | +|---|---| |
| 23 | +| `sales_df.csv` | Transaction-level sales records with customer IDs, dates, and customer category/type attributes | |
| 24 | +| `feiertage.csv` | Swiss public holiday calendar used as an external covariate | |
| 25 | + |
| 26 | +> **Note:** Both files contain **anonymized synthetic data**. The data is generated to reflect the statistical properties and patterns of real sales transactions, including realistic customer ordering cadences, seasonal effects, and B2B/B2C customer mix. It does not contain any personal or commercially sensitive information. |
| 27 | +
|
| 28 | +--- |
| 29 | + |
| 30 | +## Notebooks |
| 31 | + |
| 32 | +Notebooks are located in `notebooks/` and should be run **in the following order**. Each builds on the insights of the previous one. |
| 33 | + |
| 34 | +### 1. `EDA.ipynb` — Exploratory Data Analysis |
| 35 | + |
| 36 | +Start here. This notebook provides a thorough understanding of the data before any modeling is attempted: |
| 37 | + |
| 38 | +- Transaction-level overview and customer segmentation (B2C vs. B2B) |
| 39 | +- Temporal patterns: daily, weekly, monthly, and quarterly seasonality |
| 40 | +- Multi-seasonal decomposition (MSTL) and stationarity tests (ADF/KPSS) |
| 41 | +- Holiday correlation analysis |
| 42 | +- AutoGluon time series forecasting benchmarks (with and without holiday covariates) |
| 43 | + |
| 44 | +Running EDA first is essential because survival models are sensitive to data quality and distribution assumptions. Understanding customer ordering cadence, the degree of censoring, and data irregularities directly informs modeling choices. |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +### 2. `Lifelines_Modelling.ipynb` — Parametric Survival Models |
| 49 | + |
| 50 | +Introduces survival analysis via the [`lifelines`](https://lifelines.readthedocs.io) library. Models the time between purchases as a survival problem, where a "next order" is the event and customers who have not yet reordered are right-censored. |
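As an illustration of the framing above (not the repository's own code), the following minimal sketch turns raw transactions into inter-purchase durations with a right-censoring flag. The function name `build_durations`, the `observation_end` cutoff, and the toy orders are hypothetical and chosen only for this example; it uses the standard library rather than the project's helpers.

```python
from datetime import date


def build_durations(orders, observation_end):
    """Convert (customer_id, order_date) pairs into survival records.

    Each record is (customer_id, start_date, duration_days, event_observed).
    A gap between two consecutive orders is an observed event; the open gap
    after a customer's last order is right-censored at observation_end.
    """
    by_customer = {}
    for customer_id, order_date in orders:
        by_customer.setdefault(customer_id, []).append(order_date)

    records = []
    for customer_id, dates in by_customer.items():
        dates.sort()
        # Completed gaps: the next order was observed.
        for previous, current in zip(dates, dates[1:]):
            records.append((customer_id, previous, (current - previous).days, True))
        # Open gap since the last order: no reorder seen yet, so censored.
        last = dates[-1]
        records.append((customer_id, last, (observation_end - last).days, False))
    return records


# Hypothetical toy data: customer A ordered three times, customer B once.
orders = [
    ("A", date(2024, 1, 1)),
    ("A", date(2024, 1, 15)),
    ("A", date(2024, 2, 14)),
    ("B", date(2024, 2, 1)),
]
records = build_durations(orders, observation_end=date(2024, 3, 1))
for record in records:
    print(record)
```

In this sketch, several orders on the same day would produce zero-length durations; whether to merge them is a modeling decision the notebook may handle differently.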
| 51 | + |
| 52 | +Three models are fitted and compared: |
| 53 | + |
| 54 | +| Model | Type | Key characteristic | |
| 55 | +|---|---|---| |
| 56 | +| **Cox Proportional Hazards (CoxPH)** | Semi-parametric | Makes no assumption about the baseline hazard shape; assumes covariate effects are multiplicative and constant over time (proportional hazards). Highly interpretable. | |
| 57 | +| **Weibull AFT** | Parametric | Models time-to-event directly under a Weibull distribution. Assumes a monotone baseline hazard (increasing, decreasing, or constant, depending on the shape parameter); covariates stretch or compress the time axis. | |
| 58 | +| **Log-Normal AFT** | Parametric | Same AFT framework as Weibull but with a log-normal distribution, which allows a non-monotone hazard that rises and then falls. | |
| 59 | + |
| 60 | +**Why start with Lifelines?** These models are interpretable, fast to fit, and provide a strong, explainable baseline. The Cox model in particular has well-understood diagnostics (proportional hazards assumption tests) that help validate whether survival analysis is appropriate for this data. Evaluation uses the concordance index (C-index) and recall@k on a held-out test set. |
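To make the two evaluation metrics concrete, here is a simplified, standard-library sketch. It is not the notebook's implementation: the C-index below skips pairs with tied durations, it assumes that a higher risk score means an earlier expected reorder, and the `recall_at_k` definition (share of customers who actually reordered that appear in the top-k ranking) is an assumption, since the README does not define it. All names and numbers are hypothetical.

```python
def concordance_index(durations, events, risk_scores):
    """Simplified Harrell's C-index.

    A pair (i, j) is comparable when i has an observed event strictly before
    j's duration. It is concordant when i also has the higher risk score;
    tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            if events[i] and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def recall_at_k(scores_by_customer, relevant, k):
    """Share of relevant customers found among the k highest-risk customers."""
    if not relevant:
        raise ValueError("relevant set is empty")
    ranked = sorted(scores_by_customer, key=scores_by_customer.get, reverse=True)
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


# Hypothetical held-out sample: four customers, one of them censored.
durations = [5, 10, 20, 30]
events = [True, True, False, True]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(durations, events, risks)

scores = {"A": 0.9, "B": 0.4, "C": 0.5, "D": 0.1}
recall = recall_at_k(scores, relevant={"A", "B"}, k=2)
print(c_index, recall)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, which is why it suits a project whose output is a ranked priority list rather than an exact date.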
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +### 3. `RSF_Modelling.ipynb` — Random Survival Forest |
| 65 | + |
| 66 | +Fits a **Random Survival Forest (RSF)** using [`scikit-survival`](https://scikit-survival.readthedocs.io). RSF is a non-parametric ensemble method that: |
| 67 | + |
| 68 | +- Makes no distributional assumptions about the event time |
| 69 | +- Captures non-linear covariate effects and feature interactions automatically |
| 70 | +- Supports richer feature engineering (recency, frequency, customer category dummies) |
| 71 | +- Includes hyperparameter tuning |
| 72 | + |
| 73 | +**Why RSF after Lifelines?** RSF is more complex and less interpretable than Cox/AFT models. Running the Cox and AFT models first establishes a performance baseline and validates the survival analysis framing. RSF is then used to explore whether more flexible modelling, at the cost of interpretability, improves customer ranking. Results from all three approaches are compared in a final summary table. |
| 74 | + |
| 75 | +--- |
| 76 | + |
| 77 | +## Source Modules |
| 78 | + |
| 79 | +Helper code is organized under `src/sme_kt_zh_collaboration_forecasting/`: |
| 80 | + |
| 81 | +| Module | Purpose | |
| 82 | +|---|---| |
| 83 | +| `utils.py` | Data loading utility (`read_sales_data`): reads `sales_df.csv`, parses dates, and creates a numeric customer ID while preserving the original name | |
| 84 | +| `EDA.py` | Reusable EDA functions: general sales time-series analysis (seasonality plots, MSTL decomposition, stationarity tests), holiday-lag correlation, and AutoGluon training-set builders | |
| 85 | +| `modelling.py` | Survival analysis pipeline helpers: data preparation (inter-purchase durations, censoring), train/test splitting, and evaluation utilities (C-index, predicted vs. real priority ranking) for Cox, AFT, and RSF models | |
| 86 | + |
| 87 | +--- |
17 | 88 |
|
18 | 89 | ## Installation |
19 | 90 |
|
20 | 91 | Install pinned development dependencies using: |
21 | 92 |
|
22 | | -``` |
| 93 | +```bash |
23 | 94 | pip install -r requirements.txt |
24 | 95 | ``` |
25 | 96 |
|
26 | 97 | If you are using Conda to manage your Python environments: |
27 | 98 |
|
28 | | -``` |
| 99 | +```bash |
29 | 100 | conda env create -f environment.yml |
30 | 101 | ``` |
31 | 102 |
|
32 | 103 | Alternatively, if you are using an existing environment, you can install the module in [editable mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html), which includes only minimal dependencies: |
33 | 104 |
|
34 | | -``` |
| 105 | +```bash |
35 | 106 | pip install -e . |
36 | 107 | ``` |
37 | 108 |
|
| 109 | +> **Important:** The project uses [`uv`](https://github.com/astral-sh/uv) for dependency management. A `uv.lock` file is included for fully reproducible installs. To use it: |
| 110 | +> |
| 111 | +> ```bash |
| 112 | +> uv sync |
| 113 | +> ``` |
38 | 114 |
|
39 | | -## Development tools |
| 115 | +--- |
40 | 116 |
|
41 | | -In order to use [pre-commit](https://pre-commit.com/) hooks, they need to be registered: |
| 117 | +## Development Tools |
42 | 118 |
|
43 | | -``` |
| 119 | +Register [pre-commit](https://pre-commit.com/) hooks after installation: |
| 120 | +
|
| 121 | +```bash |
44 | 122 | pre-commit install |
45 | 123 | ``` |
46 | 124 |
|
47 | | -It is a good practice to manually invoke hooks after installation, just in case: |
| 125 | +Run hooks manually to verify the setup: |
48 | 126 |
|
49 | | -``` |
| 127 | +```bash |
50 | 128 | pre-commit run --all-files |
51 | 129 | ``` |
52 | 130 |
|
53 | | -Unit tests (using [pytest](https://pytest.org/)) are not executed as a pre-commit hook, to keep the overhead to a minimum. Instead, a CI/CD pipeline is configured to run tests after each commit. You can also execute them locally, manually: |
| 131 | +Unit tests (via [pytest](https://pytest.org/)) can be run locally: |
54 | 132 |
|
55 | | -``` |
| 133 | +```bash |
56 | 134 | pytest |
57 | 135 | ``` |