|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "ccc87551", |
| 5 | + "id": "0", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | 8 | "# Exploratory Data Analysis (EDA): SME KT ZH Collaboration Forecasting\n", |
|
26 | 26 | { |
27 | 27 | "cell_type": "code", |
28 | 28 | "execution_count": null, |
29 | | - "id": "a158e87f", |
| 29 | + "id": "1", |
30 | 30 | "metadata": {}, |
31 | 31 | "outputs": [], |
32 | 32 | "source": [ |
|
45 | 45 | { |
46 | 46 | "cell_type": "code", |
47 | 47 | "execution_count": null, |
48 | | - "id": "004d97cd", |
| 48 | + "id": "2", |
49 | 49 | "metadata": {}, |
50 | 50 | "outputs": [], |
51 | 51 | "source": [ |
52 | | - "df = su.read_sales_data()" |
| 52 | + "df = su.read_sales_data(\"../data/sales_df.csv\")" |
53 | 53 | ] |
54 | 54 | }, |
55 | 55 | { |
56 | 56 | "cell_type": "markdown", |
57 | | - "id": "7d04a405", |
| 57 | + "id": "3", |
58 | 58 | "metadata": {}, |
59 | 59 | "source": [ |
60 | 60 | "## Basic transaction-level overview\n", |
|
70 | 70 | { |
71 | 71 | "cell_type": "code", |
72 | 72 | "execution_count": null, |
73 | | - "id": "531913f3", |
| 73 | + "id": "4", |
74 | 74 | "metadata": {}, |
75 | 75 | "outputs": [], |
76 | 76 | "source": [ |
|
87 | 87 | { |
88 | 88 | "cell_type": "code", |
89 | 89 | "execution_count": null, |
90 | | - "id": "39e01043", |
| 90 | + "id": "5", |
91 | 91 | "metadata": {}, |
92 | 92 | "outputs": [], |
93 | 93 | "source": [ |
|
104 | 104 | { |
105 | 105 | "cell_type": "code", |
106 | 106 | "execution_count": null, |
107 | | - "id": "fb0a1dbf", |
| 107 | + "id": "6", |
108 | 108 | "metadata": {}, |
109 | 109 | "outputs": [], |
110 | 110 | "source": [ |
|
127 | 127 | { |
128 | 128 | "cell_type": "code", |
129 | 129 | "execution_count": null, |
130 | | - "id": "f4faefd6", |
| 130 | + "id": "7", |
131 | 131 | "metadata": {}, |
132 | 132 | "outputs": [], |
133 | 133 | "source": [ |
|
151 | 151 | }, |
152 | 152 | { |
153 | 153 | "cell_type": "markdown", |
154 | | - "id": "bdc4b5ef", |
| 154 | + "id": "8", |
155 | 155 | "metadata": {}, |
156 | 156 | "source": [ |
157 | 157 | "## General sales EDA\n", |
|
171 | 171 | { |
172 | 172 | "cell_type": "code", |
173 | 173 | "execution_count": null, |
174 | | - "id": "6f751ed9", |
| 174 | + "id": "9", |
175 | 175 | "metadata": {}, |
176 | 176 | "outputs": [], |
177 | 177 | "source": [ |
|
180 | 180 | }, |
181 | 181 | { |
182 | 182 | "cell_type": "markdown", |
183 | | - "id": "3771b60e", |
| 183 | + "id": "10", |
184 | 184 | "metadata": {}, |
185 | 185 | "source": [ |
186 | 186 | "## Segmenting by customer type\n", |
|
190 | 190 | { |
191 | 191 | "cell_type": "code", |
192 | 192 | "execution_count": null, |
193 | | - "id": "3d7a0272", |
| 193 | + "id": "11", |
194 | 194 | "metadata": {}, |
195 | 195 | "outputs": [], |
196 | 196 | "source": [ |
|
204 | 204 | }, |
205 | 205 | { |
206 | 206 | "cell_type": "markdown", |
207 | | - "id": "77d506b6", |
| 207 | + "id": "12", |
208 | 208 | "metadata": {}, |
209 | 209 | "source": [ |
210 | 210 | "## Customer ordering cadence\n", |
|
213 | 213 | }, |
214 | 214 | { |
215 | 215 | "cell_type": "markdown", |
216 | | - "id": "c3612493", |
| 216 | + "id": "13", |
217 | 217 | "metadata": {}, |
218 | 218 | "source": [ |
219 | 219 | "Order frequency varies widely across customers, which makes aggregate demand harder to predict. The histogram below summarizes each customer's average inter-order time in days per order." |
|
222 | 222 | { |
223 | 223 | "cell_type": "code", |
224 | 224 | "execution_count": null, |
225 | | - "id": "0b95b6dc", |
| 225 | + "id": "14", |
226 | 226 | "metadata": {}, |
227 | 227 | "outputs": [], |
228 | 228 | "source": [ |
|
247 | 247 | }, |
248 | 248 | { |
249 | 249 | "cell_type": "markdown", |
250 | | - "id": "4d12be23", |
| 250 | + "id": "15", |
251 | 251 | "metadata": {}, |
252 | 252 | "source": [ |
253 | 253 | "The scatter plot below compares each customer's mean inter-order time with its standard deviation for customers with more than three orders. Points below the 45-degree line are relatively regular, while points above it are more bursty or irregular. B2B and B2C appear broadly similar, with no sharply separated clusters." |
|
256 | 256 | { |
257 | 257 | "cell_type": "code", |
258 | 258 | "execution_count": null, |
259 | | - "id": "28a5ba87", |
| 259 | + "id": "16", |
260 | 260 | "metadata": {}, |
261 | 261 | "outputs": [], |
262 | 262 | "source": [ |
|
313 | 313 | }, |
314 | 314 | { |
315 | 315 | "cell_type": "markdown", |
316 | | - "id": "29e2214d", |
| 316 | + "id": "17", |
317 | 317 | "metadata": {}, |
318 | 318 | "source": [ |
319 | 319 | "## Holidays versus sales volume\n", |
320 | | - "We plot weekly order counts and overlay cantonal and federal holidays as vertical lines. The visual pattern is weak, and the lagged correlation analysis below suggests that any holiday effect is small relative to normal week-to-week variability." |
| 320 | + "We plot weekly order counts and overlay cantonal and federal holidays as vertical lines. The visual pattern is weak, and the models fitted below with and without a holiday covariate reach similar MAE, which further suggests that any holiday effect is small relative to normal week-to-week variability." |
321 | 321 | ] |
322 | 322 | }, |
323 | 323 | { |
324 | 324 | "cell_type": "code", |
325 | 325 | "execution_count": null, |
326 | | - "id": "2f04af4a", |
| 326 | + "id": "18", |
327 | 327 | "metadata": {}, |
328 | 328 | "outputs": [], |
329 | 329 | "source": [ |
|
338 | 338 | { |
339 | 339 | "cell_type": "code", |
340 | 340 | "execution_count": null, |
341 | | - "id": "000770aa", |
| 341 | + "id": "19", |
342 | 342 | "metadata": {}, |
343 | 343 | "outputs": [], |
344 | 344 | "source": [ |
|
363 | 363 | }, |
364 | 364 | { |
365 | 365 | "cell_type": "markdown", |
366 | | - "id": "43ae44f3", |
367 | | - "metadata": {}, |
368 | | - "source": [ |
369 | | - "Across the tested lags, correlations between the holiday indicator and daily sales remain close to zero. Even when some p-values become small because of sample size, the effect size is negligible relative to ordinary day-to-day variability, so the notebook does not find a meaningful standalone holiday signal at the daily level." |
370 | | - ] |
371 | | - }, |
372 | | - { |
373 | | - "cell_type": "code", |
374 | | - "execution_count": null, |
375 | | - "id": "39eba19a", |
376 | | - "metadata": {}, |
377 | | - "outputs": [], |
378 | | - "source": [ |
379 | | - "best, results_df = se.simple_daily_correlation(bb_df, holiday_df, 180)\n", |
380 | | - "\n", |
381 | | - "fig = px.line(results_df, x=\"lag\", y=\"correlation\")\n", |
382 | | - "fig.add_scatter(x=results_df[\"lag\"], y=results_df[\"p_value\"], name=\"p-value\")\n", |
383 | | - "fig.update_layout(title=\"Lag versus p-value and correlation\")\n", |
384 | | - "fig.data[0].name = \"correlation\"\n", |
385 | | - "fig.data[0].showlegend = True\n", |
386 | | - "fig.show()" |
387 | | - ] |
388 | | - }, |
389 | | - { |
390 | | - "cell_type": "markdown", |
391 | | - "id": "b86c47da", |
| 366 | + "id": "20", |
392 | 367 | "metadata": {}, |
393 | 368 | "source": [ |
394 | 369 | "## Forecasting with and without holiday covariates\n", |
|
398 | 373 | { |
399 | 374 | "cell_type": "code", |
400 | 375 | "execution_count": null, |
401 | | - "id": "ebe54f3d", |
| 376 | + "id": "21", |
402 | 377 | "metadata": {}, |
403 | 378 | "outputs": [], |
404 | 379 | "source": [ |
|
412 | 387 | { |
413 | 388 | "cell_type": "code", |
414 | 389 | "execution_count": null, |
415 | | - "id": "6137ea03", |
| 390 | + "id": "22", |
416 | 391 | "metadata": {}, |
417 | 392 | "outputs": [], |
418 | 393 | "source": [ |
|
514 | 489 | }, |
515 | 490 | { |
516 | 491 | "cell_type": "markdown", |
517 | | - "id": "4c9f50fe", |
| 492 | + "id": "23", |
518 | 493 | "metadata": {}, |
519 | 494 | "source": [ |
520 | 495 | "\n", |
521 | | - "Across this parameter sweep, the best models achieve similar validation MAE with and without the holiday covariate. Including `is_holiday` can improve average performance by helping the model represent closure or low-activity periods, but the gains remain modest overall." |
| 496 | + "Across this parameter sweep, the best models achieve similar validation MAE with and without the holiday covariate. Including `is_holiday` may improve average performance by helping the model represent closure or low-activity periods, but the gains remain modest overall." |
522 | 497 | ] |
523 | 498 | }, |
524 | 499 | { |
525 | 500 | "cell_type": "code", |
526 | 501 | "execution_count": null, |
527 | | - "id": "2085460b", |
| 502 | + "id": "24", |
528 | 503 | "metadata": {}, |
529 | 504 | "outputs": [], |
530 | 505 | "source": [ |
|
537 | 512 | }, |
538 | 513 | { |
539 | 514 | "cell_type": "markdown", |
540 | | - "id": "de9df146", |
| 515 | + "id": "25", |
541 | 516 | "metadata": {}, |
542 | 517 | "source": [ |
543 | 518 | "## Example weekly forecast\n", |
|
547 | 522 | { |
548 | 523 | "cell_type": "code", |
549 | 524 | "execution_count": null, |
550 | | - "id": "46a72d2c", |
| 525 | + "id": "26", |
551 | 526 | "metadata": {}, |
552 | 527 | "outputs": [], |
553 | 528 | "source": [ |
|
593 | 568 | { |
594 | 569 | "cell_type": "code", |
595 | 570 | "execution_count": null, |
596 | | - "id": "7949bc33", |
| 571 | + "id": "27", |
597 | 572 | "metadata": {}, |
598 | 573 | "outputs": [], |
599 | 574 | "source": [ |
|
607 | 582 | " predictions,\n", |
608 | 583 | " max_history_length=200,\n", |
609 | 584 | " item_ids=[0],\n", |
610 | | - ")" |
| 585 | + ")\n", |
| 586 | + "pass" |
611 | 587 | ] |
612 | 588 | } |
613 | 589 | ], |
614 | 590 | "metadata": { |
615 | 591 | "kernelspec": { |
616 | | - "display_name": "sme-kt-zh-collaboration-forecasting", |
| 592 | + "display_name": "sme-kt-zh-collaboration-forecasting (3.12.3)", |
617 | 593 | "language": "python", |
618 | 594 | "name": "python3" |
619 | 595 | }, |
|
627 | 603 | "name": "python", |
628 | 604 | "nbconvert_exporter": "python", |
629 | 605 | "pygments_lexer": "ipython3", |
630 | | - "version": "3.11.14" |
| 606 | + "version": "3.12.3" |
631 | 607 | } |
632 | 608 | }, |
633 | 609 | "nbformat": 4, |
|