Commit b836943

Author: Oliver
feat: Add exploratory data analysis and modeling for sales data
- Implemented EDA functions in EDA.py to analyze transactional sales data, including seasonal decomposition, feature extraction, and visualization.
- Created modeling functions in modelling.py for survival analysis using Cox Proportional Hazards and Random Survival Forest models, including data preparation and evaluation metrics.
- Added utility functions in utils.py for reading and preprocessing sales data.
- Introduced fitting_scores.csv to store model fitting scores with and without holiday effects.
1 parent cf419a8 commit b836943

13 files changed

Lines changed: 83736 additions & 34 deletions

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -9,3 +9,13 @@
 
 # OSX-specific
 .DS_Store
+
+# uv files
+uv.lock
+.venv
+
+#autogluon training runs
+autogluon*
+
+# archive folder if present
+archive

README.md

Lines changed: 100 additions & 22 deletions
@@ -1,57 +1,135 @@
 # SME-KT-ZH Collaboration Forecasting
 
-This project is made using [this template](https://github.com/sdsc-innovation/cookiecutter-python).
-Next steps include:
+This project applies **survival analysis** to B2B/B2C sales transaction data to predict when customers are likely to place their next order. The output is a ranked priority list of customers, enabling proactive outreach and collaboration planning.
 
-- [x] Create project from the Cookiecutter template.
-- [ ] Create a virtual environment to work in an isolated Python installation.
-- [ ] Install [pre-commit](https://pre-commit.com/) hooks.
-- [ ] Keep either `.gitlab-ci.yml` or `.github`, according to your Git hosting platform.
-- [ ] Update `authors` and `description`, in `pyproject.toml`.
-- [ ] `requirements.txt` should contain the *exact* (a.k.a. pinned) versions of the dependencies used development, including tools. However, do not include indirect dependencies.
-- [ ] Add installation dependencies in `pyproject.toml`, with permissive version constraints.
-- [ ] Add a `LICENSE` file, if applicable. This is *highly recommended* if the project is open source.
-- [ ] Add a [`CITATION.cff`](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files), to ease citation of your work.
-- [ ] Replace this `README.md` with a proper one. Among others, it must explain the overall context, the installation instructions, a quick start guide, and a repository structure description.
+---
 
+## Table of Contents
+
+- [Data](#data)
+- [Notebooks](#notebooks)
+- [Source Modules](#source-modules)
+- [Installation](#installation)
+- [Development Tools](#development-tools)
+
+---
+
+## Data
+
+The `data/` directory contains two files:
+
+| File | Description |
+|---|---|
+| `sales_df.csv` | Transaction-level sales records with customer IDs, dates, and customer category/type attributes |
+| `feiertage.csv` | Swiss public holiday calendar used as an external covariate |
+
+> **Note:** Both files contain **anonymized synthetic data**. The data is generated to reflect the statistical properties and patterns of real sales transactions, including realistic customer ordering cadences, seasonal effects, and B2B/B2C customer mix. It does not contain any personal or commercially sensitive information.
+
+---
+
+## Notebooks
+
+Notebooks are located in `notebooks/` and should be run **in the following order**. Each builds on the insights of the previous one.
+
+### 1. `EDA.ipynb` — Exploratory Data Analysis
+
+Start here. This notebook provides a thorough understanding of the data before any modeling is attempted:
+
+- Transaction-level overview and customer segmentation (B2C vs. B2B)
+- Temporal patterns: daily, weekly, monthly, and quarterly seasonality
+- Multi-seasonal decomposition (MSTL) and stationarity tests (ADF/KPSS)
+- Holiday correlation analysis
+- AutoGluon time series forecasting benchmarks (with and without holiday covariates)
+
+Running EDA first is essential because survival models are sensitive to data quality and distribution assumptions. Understanding customer ordering cadence, the degree of censoring, and data irregularities directly informs modeling choices.
+
+---
+
+### 2. `Lifelines_Modelling.ipynb` — Parametric Survival Models
+
+Introduces survival analysis via the [`lifelines`](https://lifelines.readthedocs.io) library. Models the time between purchases as a survival problem, where a "next order" is the event and customers who have not yet reordered are right-censored.
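The framing described above can be sketched in plain Python. This is a minimal illustration and not the code in `modelling.py`: the function name `inter_purchase_durations`, the `cutoff` parameter, and the toy orders are hypothetical. Each gap between two consecutive orders is an observed event, and the open gap after a customer's last order is right-censored at the observation cutoff.

```python
from datetime import date


def inter_purchase_durations(orders, cutoff):
    """Turn per-customer order dates into (customer, duration_days, event) rows.

    event = 1: the gap ended with an observed next order.
    event = 0: right-censored, no reorder was seen up to `cutoff`.
    """
    rows = []
    for customer, dates in orders.items():
        dates = sorted(d for d in dates if d <= cutoff)
        # Observed gaps between consecutive orders.
        for prev, nxt in zip(dates, dates[1:]):
            rows.append((customer, (nxt - prev).days, 1))
        # The open gap from the last order to the cutoff is right-censored.
        if dates:
            rows.append((customer, (cutoff - dates[-1]).days, 0))
    return rows


# Hypothetical toy data, not taken from sales_df.csv.
orders = {
    "A": [date(2024, 1, 1), date(2024, 1, 15), date(2024, 2, 1)],
    "B": [date(2024, 1, 20)],
}
rows = inter_purchase_durations(orders, cutoff=date(2024, 3, 1))
# rows == [("A", 14, 1), ("A", 17, 1), ("A", 29, 0), ("B", 41, 0)]
```

Customer B has a single order, so its only row is censored. A survival model can still use that row, which is the main reason to prefer this framing over a plain regression on observed gaps.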
+
+Three models are fitted and compared:
+
+| Model | Type | Key characteristic |
+|---|---|---|
+| **Cox Proportional Hazards (CoxPH)** | Semi-parametric | Makes no assumption about the baseline hazard shape; assumes covariate effects are multiplicative and constant over time (proportional hazards). Highly interpretable. |
+| **Weibull AFT** | Parametric | Models time-to-event directly under a Weibull distribution. Assumes a specific hazard shape; covariates stretch or compress the time axis. |
+| **Log-Normal AFT** | Parametric | Same AFT framework as Weibull but with a log-normal distribution, allowing for non-monotone hazard. |
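For reference, the two model families in the table can be written in their standard textbook form. The symbols below ($h_0$, $\beta$, $\mu$, $\sigma$, $\varepsilon$) are notation chosen for this sketch and do not appear in the notebooks.

```latex
% Cox proportional hazards: baseline hazard h_0(t) is left unspecified,
% covariates x scale it by a factor that does not depend on t.
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^\top x\right)

% Accelerated failure time (AFT): covariates act on the log of the event time T.
\log T = \mu + \beta^\top x + \sigma\,\varepsilon
% Weibull AFT:    \varepsilon follows a standard (minimum) extreme-value distribution
% Log-Normal AFT: \varepsilon \sim \mathcal{N}(0, 1)
```

In the AFT form, $\exp(\beta^\top x)$ multiplies the time scale, which is what "stretch or compress the time axis" refers to. A Weibull hazard is monotone in $t$, whereas a log-normal hazard rises and then falls.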
+
+**Why start with Lifelines?** These models are interpretable, fast to fit, and provide a strong, explainable baseline. The Cox model in particular has well-understood diagnostics (proportional hazards assumption tests) that help validate whether survival analysis is appropriate for this data. Evaluation uses the concordance index (C-index) and recall@k on a held-out test set.
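The two evaluation metrics named above can be illustrated without any library. This is a sketch under stated assumptions, not the evaluation code in `modelling.py`: the function names and toy numbers are hypothetical, a higher risk score is assumed to mean an earlier expected reorder, and recall@k is taken to be the share of customers who actually reordered that appear among the k highest-ranked customers (the README does not define it further).

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs that the scores rank correctly.

    A pair (i, j) is comparable when subject i has an observed event and
    durations[i] < durations[j]. Tied durations are skipped to keep this short.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier one of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def recall_at_k(risk_scores, reordered, k):
    """Share of customers who reordered that are in the top-k by risk score."""
    total = sum(1 for flag in reordered if flag)
    if total == 0:
        raise ValueError("no positive customers")
    ranked = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if reordered[i])
    return hits / total


# Hypothetical toy values for four customers.
durations = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(durations, events, risk)              # 4 of 5 pairs -> 0.8
recall = recall_at_k(risk, [True, True, False, False], k=2)       # 1 of 2 -> 0.5
```

A C-index of 0.5 corresponds to random ranking and 1.0 to a perfect ordering. Library implementations may differ in how they handle ties and in the sign convention of the score, so their documentation should be checked before comparing numbers.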
+
+---
+
+### 3. `RSF_Modelling.ipynb` — Random Survival Forest
+
+Fits a **Random Survival Forest (RSF)** using [`scikit-survival`](https://scikit-survival.readthedocs.io). RSF is a non-parametric ensemble method that:
+
+- Makes no distributional assumptions about the event time
+- Captures non-linear covariate effects and feature interactions automatically
+- Supports richer feature engineering (recency, frequency, customer category dummies)
+- Includes hyperparameter tuning
+
+**Why RSF after Lifelines?** RSF is more complex and less interpretable than Cox/AFT models. Running the parametric models first establishes a performance baseline and validates the survival analysis framing. RSF is then used to explore whether more flexible modelling — at the cost of interpretability — improves customer ranking. Results from all three approaches are compared in a final summary table.
+
+---
+
+## Source Modules
+
+Helper code is organized under `src/sme_kt_zh_collaboration_forecasting/`:
+
+| Module | Purpose |
+|---|---|
+| `utils.py` | Data loading utility (`read_sales_data`): reads `sales_df.csv`, parses dates, and creates a numeric customer ID while preserving the original name |
+| `EDA.py` | Reusable EDA functions: general sales time-series analysis (seasonality plots, MSTL decomposition, stationarity tests), holiday-lag correlation, and AutoGluon training-set builders |
+| `modelling.py` | Survival analysis pipeline helpers: data preparation (inter-purchase durations, censoring), train/test splitting, and evaluation utilities (C-index, predicted vs. real priority ranking) for Cox, AFT, and RSF models |
+
+---
 
 ## Installation
 
 Install pinned development dependencies using:
 
-```
+```bash
 pip install -r requirements.txt
 ```
 
 If you are using Conda to manage your Python environments:
 
-```
+```bash
 conda env create -f environment.yml
 ```
 
 Alternatively, if you are using an existing environment, you can install the module in [editable mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html), which includes only minimal dependencies:
 
-```
+```bash
 pip install -e .
 ```
 
+> **Important:** The project uses [`uv`](https://github.com/astral-sh/uv) for dependency management. A `uv.lock` file is included for fully reproducible installs. To use it:
+>
+> ```bash
+> uv sync
+> ```
 
-## Development tools
+---
 
-In order to use [pre-commit](https://pre-commit.com/) hooks, they need to be registered:
+## Development Tools
 
-```
+Register [pre-commit](https://pre-commit.com/) hooks after installation:
+
+```bash
 pre-commit install
 ```
 
-It is a good practice to manually invoke hooks after installation, just in case:
+Run hooks manually to verify the setup:
 
-```
+```bash
 pre-commit run --all-files
 ```
 
-Unit tests (using [pytest](https://pytest.org/)) are not executed as a pre-commit hook, to keep the overhead to a minimum. Instead, a CI/CD pipeline is configured to run tests after each commit. You can also execute them locally, manually:
+Unit tests (via [pytest](https://pytest.org/)) can be run locally:
 
-```
+```bash
 pytest
 ```

data/feiertage.csv

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
+Year,Date,Holiday Name,Scope,Day of Week
+2015,2015-01-01,Neujahrstag,Cantonal,Thursday
+2015,2015-01-02,Berchtoldstag,Cantonal,Friday
+2015,2015-04-03,Karfreitag,Cantonal,Friday
+2015,2015-04-06,Ostermontag,Cantonal,Monday
+2015,2015-04-13,Sechseläuten,Local (City),Monday
+2015,2015-05-01,Tag der Arbeit,Cantonal,Friday
+2015,2015-05-14,Auffahrt,Cantonal,Thursday
+2015,2015-05-25,Pfingstmontag,Cantonal,Monday
+2015,2015-08-01,Bundesfeier,Federal,Saturday
+2015,2015-09-14,Knabenschiessen,Local (City),Monday
+2015,2015-12-25,Weihnachtstag,Cantonal,Friday
+2015,2015-12-26,Stephanstag,Cantonal,Saturday
+2016,2016-01-01,Neujahrstag,Cantonal,Friday
+2016,2016-01-02,Berchtoldstag,Cantonal,Saturday
+2016,2016-03-25,Karfreitag,Cantonal,Friday
+2016,2016-03-28,Ostermontag,Cantonal,Monday
+2016,2016-04-18,Sechseläuten,Local (City),Monday
+2016,2016-05-01,Tag der Arbeit,Cantonal,Sunday
+2016,2016-05-05,Auffahrt,Cantonal,Thursday
+2016,2016-05-16,Pfingstmontag,Cantonal,Monday
+2016,2016-08-01,Bundesfeier,Federal,Monday
+2016,2016-09-12,Knabenschiessen,Local (City),Monday
+2016,2016-12-25,Weihnachtstag,Cantonal,Sunday
+2016,2016-12-26,Stephanstag,Cantonal,Monday
+2017,2017-01-01,Neujahrstag,Cantonal,Sunday
+2017,2017-01-02,Berchtoldstag,Cantonal,Monday
+2017,2017-04-14,Karfreitag,Cantonal,Friday
+2017,2017-04-17,Ostermontag,Cantonal,Monday
+2017,2017-04-24,Sechseläuten,Local (City),Monday
+2017,2017-05-01,Tag der Arbeit,Cantonal,Monday
+2017,2017-05-25,Auffahrt,Cantonal,Thursday
+2017,2017-06-05,Pfingstmontag,Cantonal,Monday
+2017,2017-08-01,Bundesfeier,Federal,Tuesday
+2017,2017-09-11,Knabenschiessen,Local (City),Monday
+2017,2017-12-25,Weihnachtstag,Cantonal,Monday
+2017,2017-12-26,Stephanstag,Cantonal,Tuesday
+2018,2018-01-01,Neujahrstag,Cantonal,Monday
+2018,2018-01-02,Berchtoldstag,Cantonal,Tuesday
+2018,2018-03-30,Karfreitag,Cantonal,Friday
+2018,2018-04-02,Ostermontag,Cantonal,Monday
+2018,2018-04-16,Sechseläuten,Local (City),Monday
+2018,2018-05-01,Tag der Arbeit,Cantonal,Tuesday
+2018,2018-05-10,Auffahrt,Cantonal,Thursday
+2018,2018-05-21,Pfingstmontag,Cantonal,Monday
+2018,2018-08-01,Bundesfeier,Federal,Wednesday
+2018,2018-09-10,Knabenschiessen,Local (City),Monday
+2018,2018-12-25,Weihnachtstag,Cantonal,Tuesday
+2018,2018-12-26,Stephanstag,Cantonal,Wednesday
+2019,2019-01-01,Neujahrstag,Cantonal,Tuesday
+2019,2019-01-02,Berchtoldstag,Cantonal,Wednesday
+2019,2019-04-19,Karfreitag,Cantonal,Friday
+2019,2019-04-22,Ostermontag,Cantonal,Monday
+2019,2019-04-08,Sechseläuten,Local (City),Monday
+2019,2019-05-01,Tag der Arbeit,Cantonal,Wednesday
+2019,2019-05-30,Auffahrt,Cantonal,Thursday
+2019,2019-06-10,Pfingstmontag,Cantonal,Monday
+2019,2019-08-01,Bundesfeier,Federal,Thursday
+2019,2019-09-09,Knabenschiessen,Local (City),Monday
+2019,2019-12-25,Weihnachtstag,Cantonal,Wednesday
+2019,2019-12-26,Stephanstag,Cantonal,Thursday
+2020,2020-01-01,Neujahrstag,Cantonal,Wednesday
+2020,2020-01-02,Berchtoldstag,Cantonal,Thursday
+2020,2020-04-10,Karfreitag,Cantonal,Friday
+2020,2020-04-13,Ostermontag,Cantonal,Monday
+2020,2020-04-20,Sechseläuten,Local (City),Monday
+2020,2020-05-01,Tag der Arbeit,Cantonal,Friday
+2020,2020-05-21,Auffahrt,Cantonal,Thursday
+2020,2020-06-01,Pfingstmontag,Cantonal,Monday
+2020,2020-08-01,Bundesfeier,Federal,Saturday
+2020,2020-09-14,Knabenschiessen,Local (City),Monday
+2020,2020-12-25,Weihnachtstag,Cantonal,Friday
+2020,2020-12-26,Stephanstag,Cantonal,Saturday
+2021,2021-01-01,Neujahrstag,Cantonal,Friday
+2021,2021-01-02,Berchtoldstag,Cantonal,Saturday
+2021,2021-04-02,Karfreitag,Cantonal,Friday
+2021,2021-04-05,Ostermontag,Cantonal,Monday
+2021,2021-04-19,Sechseläuten,Local (City),Monday
+2021,2021-05-01,Tag der Arbeit,Cantonal,Saturday
+2021,2021-05-13,Auffahrt,Cantonal,Thursday
+2021,2021-05-24,Pfingstmontag,Cantonal,Monday
+2021,2021-08-01,Bundesfeier,Federal,Sunday
+2021,2021-09-13,Knabenschiessen,Local (City),Monday
+2021,2021-12-25,Weihnachtstag,Cantonal,Saturday
+2021,2021-12-26,Stephanstag,Cantonal,Sunday
+2022,2022-01-01,Neujahrstag,Cantonal,Saturday
+2022,2022-01-02,Berchtoldstag,Cantonal,Sunday
+2022,2022-04-15,Karfreitag,Cantonal,Friday
+2022,2022-04-18,Ostermontag,Cantonal,Monday
+2022,2022-04-25,Sechseläuten,Local (City),Monday
+2022,2022-05-01,Tag der Arbeit,Cantonal,Sunday
+2022,2022-05-26,Auffahrt,Cantonal,Thursday
+2022,2022-06-06,Pfingstmontag,Cantonal,Monday
+2022,2022-08-01,Bundesfeier,Federal,Monday
+2022,2022-09-12,Knabenschiessen,Local (City),Monday
+2022,2022-12-25,Weihnachtstag,Cantonal,Sunday
+2022,2022-12-26,Stephanstag,Cantonal,Monday
+2023,2023-01-01,Neujahrstag,Cantonal,Sunday
+2023,2023-01-02,Berchtoldstag,Cantonal,Monday
+2023,2023-04-07,Karfreitag,Cantonal,Friday
+2023,2023-04-10,Ostermontag,Cantonal,Monday
+2023,2023-04-17,Sechseläuten,Local (City),Monday
+2023,2023-05-01,Tag der Arbeit,Cantonal,Monday
+2023,2023-05-18,Auffahrt,Cantonal,Thursday
+2023,2023-05-29,Pfingstmontag,Cantonal,Monday
+2023,2023-08-01,Bundesfeier,Federal,Tuesday
+2023,2023-09-11,Knabenschiessen,Local (City),Monday
+2023,2023-12-25,Weihnachtstag,Cantonal,Monday
+2023,2023-12-26,Stephanstag,Cantonal,Tuesday
+2024,2024-01-01,Neujahrstag,Cantonal,Monday
+2024,2024-01-02,Berchtoldstag,Cantonal,Tuesday
+2024,2024-03-29,Karfreitag,Cantonal,Friday
+2024,2024-04-01,Ostermontag,Cantonal,Monday
+2024,2024-04-15,Sechseläuten,Local (City),Monday
+2024,2024-05-01,Tag der Arbeit,Cantonal,Wednesday
+2024,2024-05-09,Auffahrt,Cantonal,Thursday
+2024,2024-05-20,Pfingstmontag,Cantonal,Monday
+2024,2024-08-01,Bundesfeier,Federal,Thursday
+2024,2024-09-09,Knabenschiessen,Local (City),Monday
+2024,2024-12-25,Weihnachtstag,Cantonal,Wednesday
+2024,2024-12-26,Stephanstag,Cantonal,Thursday
+2025,2025-01-01,Neujahrstag,Cantonal,Wednesday
+2025,2025-01-02,Berchtoldstag,Cantonal,Thursday
+2025,2025-04-18,Karfreitag,Cantonal,Friday
+2025,2025-04-21,Ostermontag,Cantonal,Monday
+2025,2025-04-28,Sechseläuten,Local (City),Monday
+2025,2025-05-01,Tag der Arbeit,Cantonal,Thursday
+2025,2025-05-29,Auffahrt,Cantonal,Thursday
+2025,2025-06-09,Pfingstmontag,Cantonal,Monday
+2025,2025-08-01,Bundesfeier,Federal,Friday
+2025,2025-09-15,Knabenschiessen,Local (City),Monday
+2025,2025-12-25,Weihnachtstag,Cantonal,Thursday
+2025,2025-12-26,Stephanstag,Cantonal,Friday
+2026,2026-01-01,Neujahrstag,Cantonal,Thursday
+2026,2026-01-02,Berchtoldstag,Cantonal,Friday
+2026,2026-04-03,Karfreitag,Cantonal,Friday
+2026,2026-04-06,Ostermontag,Cantonal,Monday
+2026,2026-04-20,Sechseläuten,Local (City),Monday
+2026,2026-05-01,Tag der Arbeit,Cantonal,Friday
+2026,2026-05-14,Auffahrt,Cantonal,Thursday
+2026,2026-05-25,Pfingstmontag,Cantonal,Monday
+2026,2026-08-01,Bundesfeier,Federal,Saturday
+2026,2026-09-14,Knabenschiessen,Local (City),Monday
+2026,2026-12-25,Weihnachtstag,Cantonal,Friday
+2026,2026-12-26,Stephanstag,Cantonal,Saturday
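A calendar in this shape can be read with the standard library alone. The sketch below is illustrative and is not the loader used in `EDA.py`: `load_holidays` and `is_weekday_holiday` are hypothetical names, and the inline sample repeats four rows from the file so the snippet runs on its own. It also cross-checks the `Day of Week` column against the parsed date, which is a cheap guard against typos in a hand-maintained calendar.

```python
import csv
import io
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Four rows copied from data/feiertage.csv so the example is self-contained.
SAMPLE = """Year,Date,Holiday Name,Scope,Day of Week
2015,2015-01-01,Neujahrstag,Cantonal,Thursday
2015,2015-01-02,Berchtoldstag,Cantonal,Friday
2015,2015-04-03,Karfreitag,Cantonal,Friday
2015,2015-08-01,Bundesfeier,Federal,Saturday
"""


def load_holidays(handle):
    """Return {date: holiday name}, validating the 'Day of Week' column."""
    holidays = {}
    for row in csv.DictReader(handle):
        day = date.fromisoformat(row["Date"])
        if WEEKDAYS[day.weekday()] != row["Day of Week"]:
            raise ValueError(f"weekday mismatch for {row['Date']}")
        holidays[day] = row["Holiday Name"]
    return holidays


holidays = load_holidays(io.StringIO(SAMPLE))


def is_weekday_holiday(day):
    """True if `day` is a listed holiday that falls on Monday to Friday."""
    return day in holidays and day.weekday() < 5
```

To read the real file instead of the sample, pass an open handle, for example `open("data/feiertage.csv", encoding="utf-8", newline="")`. The explicit UTF-8 encoding matters because names such as Sechseläuten contain non-ASCII characters. Separating weekday holidays from weekend ones is a design choice of this sketch: a holiday on a Saturday or Sunday is less likely to shift ordering behaviour than one that closes a business day.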
