feat: Add exploratory data analysis and modeling for sales data
- Implemented EDA functions in EDA.py to analyze transactional sales data, including seasonal decomposition, feature extraction, and visualization.
- Created modeling functions in modelling.py for survival analysis using Cox Proportional Hazards and Random Survival Forest models, including data preparation and evaluation metrics.
- Added utility functions in utils.py for reading and preprocessing sales data.
- Introduced fitting_scores.csv to store model fitting scores with and without holiday effects.
This project is made using [this template](https://github.com/sdsc-innovation/cookiecutter-python).
4
-
Next steps include:
3
+
This project applies **survival analysis** to B2B/B2C sales transaction data to predict when customers are likely to place their next order. The output is a ranked priority list of customers, enabling proactive outreach and collaboration planning.
5
4
6
-
- [x] Create project from the Cookiecutter template.
7
-
- [ ] Create a virtual environment to work in an isolated Python installation.
- [ ] Keep either `.gitlab-ci.yml` or `.github`, according to your Git hosting platform.
10
-
- [ ] Update `authors` and `description` in `pyproject.toml`.
11
-
- [ ] `requirements.txt` should contain the *exact* (a.k.a. pinned) versions of the dependencies used in development, including tools. However, do not include indirect dependencies.
12
-
- [ ] Add installation dependencies in `pyproject.toml`, with permissive version constraints.
13
-
- [ ] Add a `LICENSE` file, if applicable. This is *highly recommended* if the project is open source.
14
-
- [ ] Add a [`CITATION.cff`](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files), to ease citation of your work.
15
-
- [ ] Replace this `README.md` with a proper one. Among other things, it must explain the overall context, the installation instructions, a quick start guide, and a repository structure description.
5
+
---
16
6
7
+
## Table of Contents
8
+
9
+
- [Data](#data)
10
+
- [Notebooks](#notebooks)
11
+
- [Source Modules](#source-modules)
12
+
- [Installation](#installation)
13
+
- [Development Tools](#development-tools)
14
+
15
+
---
16
+
17
+
## Data
18
+
19
+
The `data/` directory contains two files:
20
+
21
+
| File | Description |
22
+
|---|---|
23
+
| `sales_df.csv` | Transaction-level sales records with customer IDs, dates, and customer category/type attributes |
24
+
| `feiertage.csv` | Swiss public holiday calendar used as an external covariate |
25
+
26
+
> **Note:** Both files contain **anonymized synthetic data**. The data is generated to reflect the statistical properties and patterns of real sales transactions, including realistic customer ordering cadences, seasonal effects, and B2B/B2C customer mix. It does not contain any personal or commercially sensitive information.
27
+
28
+
---
29
+
30
+
## Notebooks
31
+
32
+
Notebooks are located in `notebooks/` and should be run **in the following order**. Each builds on the insights of the previous one.
33
+
34
+
### 1. `EDA.ipynb` — Exploratory Data Analysis
35
+
36
+
Start here. This notebook provides a thorough understanding of the data before any modeling is attempted:
37
+
38
+
- Transaction-level overview and customer segmentation (B2C vs. B2B)
39
+
- Temporal patterns: daily, weekly, monthly, and quarterly seasonality
40
+
- Multi-seasonal decomposition (MSTL) and stationarity tests (ADF/KPSS)
41
+
- Holiday correlation analysis
42
+
- AutoGluon time series forecasting benchmarks (with and without holiday covariates)
43
+
44
+
Running EDA first is essential because survival models are sensitive to data quality and distribution assumptions. Understanding customer ordering cadence, the degree of censoring, and data irregularities directly informs modeling choices.
Introduces survival analysis via the [`lifelines`](https://lifelines.readthedocs.io) library. Models the time between purchases as a survival problem, where a "next order" is the event and customers who have not yet reordered are right-censored.
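
The survival framing described above can be illustrated with a minimal, standard-library sketch. It is not the repository's implementation: the function name `inter_purchase_durations`, the tuple layout, and the `observation_end` cutoff are assumptions made for illustration. Each gap between two consecutive orders becomes an observed event, and the open interval after a customer's last order becomes a right-censored record.

```python
from datetime import date


def inter_purchase_durations(orders, observation_end):
    """Turn (customer_id, order_date) pairs into survival records.

    Returns a list of (customer_id, start_date, duration_days, event_observed).
    Hypothetical helper for illustration only.
    """
    # Group order dates per customer, keeping first-seen customer order.
    by_customer = {}
    for customer_id, order_date in orders:
        by_customer.setdefault(customer_id, []).append(order_date)

    records = []
    for customer_id, dates in by_customer.items():
        dates = sorted(set(dates))  # several orders on one day count once
        # Gap between consecutive orders: the "next order" event was observed.
        for start, end in zip(dates, dates[1:]):
            records.append((customer_id, start, (end - start).days, True))
        # Time since the last order: no reorder seen yet, so right-censored.
        last = dates[-1]
        records.append((customer_id, last, (observation_end - last).days, False))
    return records


orders = [
    ("A", date(2024, 1, 1)),
    ("A", date(2024, 1, 11)),
    ("B", date(2024, 2, 1)),
]
records = inter_purchase_durations(orders, observation_end=date(2024, 3, 1))
```

In this toy input, customer `A` contributes one observed 10-day gap plus one censored interval, and customer `B` contributes only a censored interval.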
51
+
52
+
Three models are fitted and compared:
53
+
54
+
| Model | Type | Key characteristic |
55
+
|---|---|---|
56
+
| **Cox Proportional Hazards (CoxPH)** | Semi-parametric | Makes no assumption about the baseline hazard shape; assumes covariate effects are multiplicative and constant over time (proportional hazards). Highly interpretable. |
57
+
| **Weibull AFT** | Parametric | Models time-to-event directly under a Weibull distribution. Assumes a specific hazard shape; covariates stretch or compress the time axis. |
58
+
| **Log-Normal AFT** | Parametric | Same AFT framework as Weibull but with a log-normal distribution, allowing for a non-monotone hazard. |
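
For reference, the two model families in the table are commonly written as follows. This is textbook notation rather than anything taken from the notebooks, and library parameterizations may differ in sign and scaling. Here $x$ is the covariate vector, $\beta$ the coefficients, $h_0$ the baseline hazard, and $S_0$ the baseline survival function.

$$
h(t \mid x) = h_0(t)\,\exp(\beta^\top x) \quad \text{(Cox PH)}
$$

$$
\log T = \mu + \beta^\top x + \sigma\,\varepsilon
\;\Longleftrightarrow\;
S(t \mid x) = S_0\!\left(t\,e^{-\beta^\top x}\right) \quad \text{(AFT)}
$$

In the AFT form, the Weibull model corresponds to $\varepsilon$ following a standard minimum extreme-value (Gumbel) distribution, and the log-normal model to $\varepsilon$ being standard normal.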
59
+
60
+
**Why start with Lifelines?** These models are interpretable, fast to fit, and provide a strong, explainable baseline. The Cox model in particular has well-understood diagnostics (proportional hazards assumption tests) that help validate whether survival analysis is appropriate for this data. Evaluation uses the concordance index (C-index) and recall@k on a held-out test set.
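
The two evaluation metrics can be sketched in plain Python. Both functions are hand-rolled illustrations, not the library or repository implementations. The C-index below ignores tied durations for simplicity, and the recall@k definition (share of customers who actually ordered that appear in the top-k predictions) is an assumption, since the exact definition used in the notebooks is not stated here.

```python
def concordance_index(durations, risk_scores, events):
    """Harrell's C-index for risk scores (higher score = sooner expected event).

    A pair (i, j) is comparable when i had an observed event and a strictly
    shorter duration than j. Tied durations are skipped in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def recall_at_k(ranked_customers, relevant_customers, k):
    """Share of relevant customers found in the top-k of a predicted ranking."""
    relevant = set(relevant_customers)
    if not relevant:
        raise ValueError("relevant_customers must not be empty")
    hits = sum(1 for customer in ranked_customers[:k] if customer in relevant)
    return hits / len(relevant)


c_index = concordance_index(
    durations=[5, 10, 15, 20],
    risk_scores=[0.9, 0.4, 0.6, 0.1],
    events=[True, True, False, True],
)
recall = recall_at_k(["A", "B", "C", "D"], {"A", "C", "E"}, k=2)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to a perfect ordering. Some libraries expect scores where higher means longer survival, so the sign convention should be checked before comparing numbers.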
61
+
62
+
---
63
+
64
+
### 3. `RSF_Modelling.ipynb` — Random Survival Forest
65
+
66
+
Fits a **Random Survival Forest (RSF)** using [`scikit-survival`](https://scikit-survival.readthedocs.io). RSF is a non-parametric ensemble method that:
67
+
68
+
- Makes no distributional assumptions about the event time
69
+
- Captures non-linear covariate effects and feature interactions automatically
**Why RSF after Lifelines?** RSF is more complex and less interpretable than Cox/AFT models. Running the Cox and AFT models first establishes a performance baseline and validates the survival analysis framing. RSF is then used to explore whether more flexible modelling, at the cost of interpretability, improves customer ranking. Results from all three approaches are compared in a final summary table.
74
+
75
+
---
76
+
77
+
## Source Modules
78
+
79
+
Helper code is organized under `src/sme_kt_zh_collaboration_forecasting/`:
80
+
81
+
| Module | Purpose |
82
+
|---|---|
83
+
| `utils.py` | Data loading utility (`read_sales_data`): reads `sales_df.csv`, parses dates, and creates a numeric customer ID while preserving the original name |
| `modelling.py` | Survival analysis pipeline helpers: data preparation (inter-purchase durations, censoring), train/test splitting, and evaluation utilities (C-index, predicted vs. real priority ranking) for Cox, AFT, and RSF models |
86
+
87
+
---
17
88
18
89
## Installation
19
90
20
91
Install pinned development dependencies using:
21
92
22
-
```
93
+
```bash
pip install -r requirements.txt
```
25
96
26
97
If you are using Conda to manage your Python environments:
27
98
28
-
```
99
+
```bash
conda env create -f environment.yml
```
31
102
32
103
Alternatively, if you are using an existing environment, you can install the module in [editable mode](https://setuptools.pypa.io/en/latest/userguide/development_mode.html), which includes only minimal dependencies:
33
104
34
-
```
105
+
```bash
pip install -e .
```
37
108
109
+
> **Important:** The project uses [`uv`](https://github.com/astral-sh/uv) for dependency management. A `uv.lock` file is included for fully reproducible installs. To use it:
110
+
>
111
+
> ```bash
112
+
> uv sync
113
+
> ```
38
114
39
-
## Development tools
115
+
---
40
116
41
-
In order to use [pre-commit](https://pre-commit.com/) hooks, they need to be registered:
117
+
## Development Tools
42
118
43
-
```
119
+
Register [pre-commit](https://pre-commit.com/) hooks after installation:
120
+
121
+
```bash
pre-commit install
```
46
124
47
-
It is a good practice to manually invoke hooks after installation, just in case:
125
+
Run hooks manually to verify the setup:
48
126
49
-
```
127
+
```bash
pre-commit run --all-files
```
52
130
53
-
Unit tests (using [pytest](https://pytest.org/)) are not executed as a pre-commit hook, to keep the overhead to a minimum. Instead, a CI/CD pipeline is configured to run tests after each commit. You can also execute them locally, manually:
131
+
Unit tests (via [pytest](https://pytest.org/)) can be run locally: