# Retail KPI Analytics

End-to-end retail analytics system — SQL ETL pipeline, Bronze-Silver-Gold architecture, star schema data model, and Power BI KPI dashboard with DAX, RLS, and deployment pipeline. Processes 715,536 transactional records across 10 stores, 42 products, and 2 years of trading data.
**Business Impact:** Reporting preparation time reduced from 45 minutes to under 5 minutes. Automated shrinkage anomaly detection flagged 7 critical incidents, including a theft spike at Manchester Central with a Z-score of 6.68 — £33,140 above baseline.
- Business Context
- Architecture
- Key Business Questions Answered
- Tech Stack
- Project Structure
- Data Model
- SQL Modules
- Key Findings
- Setup and Usage
- Test Suite
- Skills Demonstrated
## Business Context

Retail operations generate high-volume transactional data across stores, products, and time periods. Without a structured analytics infrastructure, reporting is manual, inconsistent, and slow — leaving managers without the insight to act on labour overspend or loss events in time.
This system builds a complete analytics layer on top of raw retail data:
- ETL pipeline generates and validates 715K+ transactions across 10 UK stores
- Bronze-Silver-Gold architecture mirrors enterprise data warehouse patterns
- SQL KPI layer calculates weekly sales, labour cost %, week-on-week variance, and Z-score shrinkage anomaly detection
- Power BI dashboard delivers self-serve reporting with row-level security by region
## Architecture

```
Synthetic Data Generator (Python)
        │
        ▼
CSV Files → data/generated/
        │
        ▼
ETL Loader (load_to_database.py → SQLAlchemy)
        │
        ▼
SQLite Database — Bronze Schema
├── fact_transactions (715,536 rows)
├── fact_labour (7,310 rows)
├── fact_shrinkage (5,697 rows)
├── dim_date (731 rows)
├── dim_store (10 rows)
└── dim_product (42 rows)
        │
        ▼
SQL Analytics Layer
├── 01_create_schema.sql — Bronze/Silver/Gold DDL
├── 02_data_quality_validation.sql — 7 automated DQ checks
├── 03_kpi_labour_cost.sql — Labour cost %, WoW variance, basket value
└── 04_shrinkage_anomaly_detection.sql — Z-score rolling baseline detection
        │
        ▼
Power BI Dashboard
├── DAX measures (time intelligence, YTD, WoW variance)
├── Row-Level Security (region-based)
└── Deployment pipeline configuration
```
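The ETL load step in the diagram above can be sketched as follows. The real loader (`python/load_to_database.py`) uses Pandas and SQLAlchemy; this stdlib-only version shows the same shape, and the column names here are illustrative rather than the project's exact schema.

```python
# Minimal sketch of the CSV-to-SQLite load step (illustrative columns;
# the real loader is python/load_to_database.py with Pandas + SQLAlchemy).
import csv
import io
import sqlite3

def load_transactions(csv_text: str, conn: sqlite3.Connection) -> int:
    """Parse CSV text and bulk-insert it into the Bronze fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_transactions ("
        " transaction_id INTEGER PRIMARY KEY,"
        " store_id INTEGER,"
        " net_sales REAL)"
    )
    rows = [
        (int(r["transaction_id"]), int(r["store_id"]), float(r["net_sales"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO fact_transactions VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM fact_transactions").fetchone()[0]

conn = sqlite3.connect(":memory:")
sample = "transaction_id,store_id,net_sales\n1,101,7.58\n2,101,12.40\n3,102,3.99\n"
rows_loaded = load_transactions(sample, conn)  # → 3
```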
## Key Business Questions Answered

| # | Business Question | Module |
|---|---|---|
| 1 | What are weekly net sales by store and region? | 03_kpi_labour_cost.sql — KPI 1 |
| 2 | Which stores have labour cost % above the 14% threshold? | 03_kpi_labour_cost.sql — KPI 2 |
| 3 | What is the week-on-week revenue variance per store? | 03_kpi_labour_cost.sql — KPI 3 |
| 4 | Which categories and regions drive the most sales? | 03_kpi_labour_cost.sql — KPI 4 |
| 5 | What are the top 20 products by net sales? | 03_kpi_labour_cost.sql — KPI 5 |
| 6 | Which stores show Z-score shrinkage anomalies? | 04_shrinkage_anomaly_detection.sql |
| 7 | What is shrinkage by cause code (theft/waste/damage)? | 04_shrinkage_anomaly_detection.sql |
| 8 | Are there data quality issues before BI consumption? | 02_data_quality_validation.sql |
## Tech Stack

| Layer | Technology |
|---|---|
| Data Generation | Python, NumPy, Faker — realistic seasonal patterns, UK bank holidays |
| ETL Pipeline | Python, Pandas, SQLAlchemy |
| Database | SQLite (dev) with Bronze/Silver/Gold schema pattern |
| Data Modelling | Star schema — 3 fact tables + 3 dimension tables |
| SQL Analytics | CTEs, Window Functions (LAG, RANK, STDDEV), Z-score anomaly detection |
| BI & Reporting | Power BI — DAX, Row-Level Security, Deployment Pipeline |
| Testing | pytest — schema, data quality, KPI, ETL |
| CI/CD | GitHub Actions |
## Project Structure

```
retail-kpi-analytics/
├── .github/
│   └── workflows/
│       └── ci.yml                        # CI — pytest on every push
├── architecture/
│   └── pipeline_diagram.png              # System architecture diagram
├── data/
│   └── generated/                        # CSV outputs from data generator
├── docs/                                 # Supporting documentation
├── powerbi/
│   └── screenshots/                      # Dashboard screenshots
├── python/
│   ├── generate_synthetic_data.py        # 715K+ row retail data generator
│   └── load_to_database.py               # ETL loader — CSV to SQLite
├── sql/
│   ├── 01_create_schema.sql              # Bronze/Silver/Gold DDL
│   ├── 02_data_quality_validation.sql    # 7 automated DQ checks
│   ├── 03_kpi_labour_cost.sql            # Core KPI calculations
│   └── 04_shrinkage_anomaly_detection.sql # Z-score anomaly detection
├── tests/
│   └── test_retail_pipeline.py           # pytest suite
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
└── requirements.txt
```
## Data Model

| Table | Rows | Description |
|---|---|---|
| fact_transactions | 715,536 | Every retail transaction — store, product, quantity, price |
| fact_labour | 7,310 | Daily labour hours and cost per store |
| fact_shrinkage | 5,697 | Weekly shrinkage incidents by cause code |
| dim_date | 731 | Full date dimension with fiscal weeks, bank holidays, weekends |
| dim_store | 10 | 10 UK stores across 5 regions (London, North, Midlands, Wales, South) |
| dim_product | 42 | Products across 5 categories (Ambient, Fresh, Frozen, BWS, Non-Food) |
Total sales value: £5,422,618 across 2024–2025. Average basket value: £7.58.
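A star schema keeps facts narrow and pushes descriptive attributes into dimensions, so reporting queries are simple joins. The sketch below shows the pattern with a tiny in-memory database; table and column names are illustrative, and the project's real KPI queries live in `sql/03_kpi_labour_cost.sql`.

```python
# Illustrative star-schema query: net sales by region, joining the
# transaction fact table to dim_store (column names assumed).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE fact_transactions (transaction_id INTEGER, store_id INTEGER, net_sales REAL);
INSERT INTO dim_store VALUES (1, 'Manchester Central', 'North'), (2, 'Cardiff Bay', 'Wales');
INSERT INTO fact_transactions VALUES (10, 1, 7.58), (11, 1, 12.42), (12, 2, 5.00);
""")
result = conn.execute("""
    SELECT s.region, ROUND(SUM(t.net_sales), 2) AS net_sales
    FROM fact_transactions t
    JOIN dim_store s ON s.store_id = t.store_id
    GROUP BY s.region
    ORDER BY net_sales DESC
""").fetchall()
# result → [('North', 20.0), ('Wales', 5.0)]
```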
## SQL Modules

### 01_create_schema.sql

Defines Bronze, Silver, and Gold schemas mirroring enterprise data warehouse architecture. Bronze holds raw ingested data, Silver holds validated data, and Gold holds the star schema for Power BI consumption.
### 02_data_quality_validation.sql

Seven automated DQ checks run before any KPI calculations:
- NULL checks on critical columns with percentage reporting
- Orphan record detection across all foreign keys
- Duplicate detection on transaction IDs and labour records
- Impossible value detection (negative prices, future dates, outliers)
- Date range completeness — gaps in trading days
- Store coverage completeness — flags stores below 90% trading day coverage
- Labour vs sales sanity check — flags labour cost outside 3–25% band
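One of the check patterns above — orphan-record detection — is typically an anti-join against the dimension table. This is a minimal sketch with hypothetical tables; the project's real checks are in `sql/02_data_quality_validation.sql`.

```python
# Orphan-record detection via LEFT JOIN ... IS NULL: find fact rows whose
# foreign key has no matching dimension row (tables are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY);
CREATE TABLE fact_transactions (transaction_id INTEGER, product_id INTEGER);
INSERT INTO dim_product VALUES (1), (2);
INSERT INTO fact_transactions VALUES (100, 1), (101, 2), (102, 99);  -- 99 has no dim row
""")
orphans = conn.execute("""
    SELECT t.transaction_id
    FROM fact_transactions t
    LEFT JOIN dim_product p ON p.product_id = t.product_id
    WHERE p.product_id IS NULL
""").fetchall()
# orphans → [(102,)]
```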
### 03_kpi_labour_cost.sql

Five core KPI queries using CTEs and window functions:
- KPI 1: Weekly net sales by store with average daily sales
- KPI 2: Labour cost % by store with RAG status (target: 8–14%)
- KPI 3: Week-on-week variance using the `LAG()` window function
- KPI 4: Sales by category and region with average basket value
- KPI 5: Top 20 products by net sales with % of total contribution
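The `LAG()` pattern behind KPI 3 compares each week's sales to the previous week within the same store partition. A minimal sketch, with illustrative table and column names rather than the project's exact schema:

```python
# Week-on-week variance with LAG(): partition by store, order by fiscal
# week, and compare each row to the prior week (names are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weekly_sales (store_id INTEGER, fiscal_week INTEGER, net_sales REAL);
INSERT INTO weekly_sales VALUES (1, 1, 1000.0), (1, 2, 1100.0), (1, 3, 990.0);
""")
rows = conn.execute("""
    SELECT fiscal_week,
           net_sales,
           ROUND(100.0 * (net_sales - LAG(net_sales) OVER w)
                 / LAG(net_sales) OVER w, 1) AS wow_pct
    FROM weekly_sales
    WINDOW w AS (PARTITION BY store_id ORDER BY fiscal_week)
""").fetchall()
# rows → [(1, 1000.0, None), (2, 1100.0, 10.0), (3, 990.0, -10.0)]
```

The first week has no prior row, so `LAG()` returns NULL and the variance is undefined rather than zero — a deliberate distinction for downstream RAG logic.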
### 04_shrinkage_anomaly_detection.sql

Statistical anomaly detection using a 12-week rolling per-store baseline — not a national average. Design rationale: London and Welsh stores have structurally different shrinkage patterns, so a shared benchmark produces false alerts.
Z-score thresholds:
- Z ≥ 3.0 → CRITICAL — Investigate immediately
- Z ≥ 2.5 → HIGH ALERT — Review this week
- Z ≥ 2.0 → ELEVATED — Monitor closely
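The per-store Z-score logic above reduces to comparing the current week against that store's own rolling mean and standard deviation. A minimal sketch (window length and figures here are illustrative, not the project's data):

```python
# Z-score against a per-store rolling baseline: (current - mean) / stdev,
# computed only over that store's own prior weeks.
import statistics

def z_score(history: list[float], current: float) -> float:
    """Z-score of the current week's shrinkage vs a rolling baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample std dev over the window
    return (current - mean) / stdev

baseline = [200.0, 210.0, 190.0, 205.0, 195.0]   # prior weeks' shrinkage (£)
z = z_score(baseline, 260.0)
status = ("CRITICAL" if z >= 3.0
          else "HIGH ALERT" if z >= 2.5
          else "ELEVATED" if z >= 2.0
          else "NORMAL")  # → "CRITICAL"
```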
## Key Findings

| Metric | Result |
|---|---|
| Total transactions | 715,536 |
| Total sales value | £5,422,618 |
| Average basket value | £7.58 |
| Stores monitored | 10 across 5 UK regions |
| Trading days | 731 (2024–2025 including weekends) |
| Shrinkage anomalies detected | 7 (5 CRITICAL, 1 HIGH, 1 ELEVATED) |
| Highest Z-score | 6.68 — Manchester Central fiscal week 20 (£33,140 above baseline) |
| Labour overspend | Southampton 19.45%, Birmingham Moseley 19.01% in week 8 |
## Setup and Usage

Prerequisites:

- Python 3.10+

```bash
git clone https://github.com/anandi-mahure/retail-kpi-analytics.git
cd retail-kpi-analytics
pip install -r requirements.txt

# Step 1 — Generate 715K+ synthetic retail transactions
python python/generate_synthetic_data.py

# Step 2 — Load to SQLite database
python python/load_to_database.py
```

Open `data/retail_kpi.db` in any SQLite client (DB Browser, DBeaver) and execute the scripts in order:

1. `01_create_schema.sql`
2. `02_data_quality_validation.sql`
3. `03_kpi_labour_cost.sql`
4. `04_shrinkage_anomaly_detection.sql`
## Test Suite

```bash
pytest tests/ -v
pytest tests/ -v --cov=python --cov-report=term-missing
```

| Category | Tests |
|---|---|
| Schema validation | 4 tests — all tables and structure |
| Data quality | 8 tests — nulls, negatives, orphans, duplicates |
| KPI calculations | 6 tests — revenue, labour %, shrinkage, WoW |
| ETL logic | 5 tests — record counts, integrity, load accuracy |
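The in-memory SQLite pattern these tests rely on builds a tiny throwaway database per test and asserts on query results. Shown below without pytest so the sketch is self-contained; in the real suite the connection would be provided by a pytest fixture, and the table here is illustrative.

```python
# In-memory SQLite test pattern: build a minimal database, run the query
# under test, assert on the result. No files touched, fast, isolated.
import sqlite3

def make_test_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with known seed data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fact_labour (store_id INTEGER, labour_cost REAL);
        INSERT INTO fact_labour VALUES (1, 120.0), (1, 80.0);
    """)
    return conn

def labour_total(conn: sqlite3.Connection) -> float:
    return conn.execute("SELECT SUM(labour_cost) FROM fact_labour").fetchone()[0]

# The assertion a test function would make:
assert labour_total(make_test_db()) == 200.0
```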
## Skills Demonstrated

| Category | Skills |
|---|---|
| SQL | CTEs, Window Functions (LAG, RANK, STDDEV), Z-score anomaly detection, Bronze/Silver/Gold schema |
| Python | Pandas ETL, SQLAlchemy, synthetic data generation with seasonal patterns and UK bank holidays |
| Statistics | Z-score methodology, 12-week rolling baseline, per-store normalisation |
| BI | Power BI DAX (time intelligence, WoW variance, basket value), Row-Level Security, Deployment Pipeline |
| Data Engineering | Star schema — 3 fact tables + 3 dimensions, data quality validation layer |
| Testing | pytest fixtures, in-memory SQLite testing, CI/CD via GitHub Actions |
---

**Anandi Mahure** — Data Analyst | MSc Data Science, University of Bath (Dean's Award 2025)
Part of a production-grade data analytics portfolio targeting top DA/Analytics Engineer roles across UK, Dubai, and Australia.