Click the picture to see the presentation video!
This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of enterprise-grade data warehouse modeling, integrating e-commerce order data with relevant business themes through standardized dimension modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, the deployment of a big data cluster via Docker containers simplifies environment management and operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iterations while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining the following critical features:
| Core Feature | Core Highlights | Deliverables |
|---|---|---|
| 1. Data Warehouse Modeling and Documentation | - Full dimensional modeling process (Star Schema / Snowflake Schema)<br>- Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling)<br>- Business matrix: defining & managing dimensions & fact tables | - Data warehouse design document (Markdown)<br>- Hive SQL modeling code<br>- DWH dimensional modelling architecture diagram |
| 2. Self-Built Distributed Big Data Platform | - Fully containerized deployment with Docker for quick replication<br>- High-availability environment: Hadoop + Hive + Spark + ZooKeeper + ClickHouse | - Docker images (open-sourced on GitHub Container Registry)<br>- docker-compose.yml (one-click cluster startup)<br>- Infrastructure configuration files for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Container-internal scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Commonly used snippets for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow |
| 3. Distributed Batch Processing | - ETL processing using PySpark<br>- Data ETL jobs: OLTP to DWH and DWH to OLAP<br>- Data warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS<br>- Batch job scheduling with Airflow | - PySpark and Spark SQL code<br>- Code: data pipeline (OLTP → DWH, DWH → OLAP)<br>- Code: batch processing (DWH internal transforms)<br>- Code: scheduling based on Airflow (DAGs) |
| 4. CI/CD Automation | - Automated data platform cluster launch and shutdown | - GitHub Actions workflow pipeline .yaml<br>- CI/CD code and documentation<br>- Sample log screenshots |
| 5. Storage & Computation Optimization | - SQL optimization (dynamic partitioning, indexing, storage partitioning)<br>- Spark tuning: salting, skew join hints, broadcast joins, reduceByKey vs. groupByKey (see the sketch after this table)<br>- Hive tuning: Z-Order sorting (boosts ClickHouse queries), Parquet + Snappy compression | - Pre- and post-optimization performance comparison<br>- Spark optimization code<br>- SQL execution plan screenshots |
| 6. DevOps: Monitoring and Alerting | - Prometheus + Grafana performance monitoring for the Hadoop cluster and MySQL<br>- AlertManager for alerting and email notifications | - Code: monitoring services configuration files (Prometheus, Grafana, AlertManager)<br>- Code: monitoring services start/stop scripts (Prometheus, Grafana, AlertManager)<br>- Code: container metrics exporter start/stop scripts (my-start-node-exporter.sh & my-stop-node-exporter.sh)<br>- Key screenshots |
| 7. Business Intelligence & Visualization | - PowerBI dashboards for data analysis<br>- Real business-driven visualizations<br>- Actionable business insights | - PowerBI visualization screenshots<br>- PowerBI .pbix file<br>- Key business metric explanations (BI insights) |
this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.
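Feature 5 above mentions broadcast joins and key salting among the Spark tuning techniques. The sketch below illustrates both in PySpark; the table and column names (`dwd.fact_orders`, `dim.dim_users`, `user_id`, `order_amount`) are placeholders, not the project's actual schema.

```python
# Minimal sketch of two Spark tuning techniques from feature 5:
# a broadcast join for a small dimension table, and key salting for a skewed aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-sketch").enableHiveSupport().getOrCreate()

fact_orders = spark.table("dwd.fact_orders")   # large, potentially skewed fact table
dim_users   = spark.table("dim.dim_users")     # small dimension table

# 1) Broadcast join: ship the small dimension to every executor,
#    avoiding a full shuffle of the large fact table.
enriched = fact_orders.join(F.broadcast(dim_users), on="user_id", how="left")

# 2) Salting: spread a hot key across N buckets, aggregate twice.
N = 16
salted_totals = (
    fact_orders
    .withColumn("salt", (F.rand() * N).cast("int"))
    .groupBy("user_id", "salt")
    .agg(F.sum("order_amount").alias("partial_amount"))
    .groupBy("user_id")
    .agg(F.sum("partial_amount").alias("total_amount"))
)
```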
This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.
Click to Show DWH Dimensional Modelling Documents and Code
Data Warehouse Development Specification
- Data Warehouse Layering Specification
- Table Naming Conventions
- Data Warehouse Column Naming Conventions
- Data Table Lifecycle Management Specification
- DWH Modelling Architecture Diagram
Code - Hive DDL for all data warehouse layers: ods, dwd, dwm, dws, dwt, dim, ads (Operational Data Store, DW Detail, DW Middle, DW Summary, DW Theme, DW Dimension, Analytical Data Storage on ClickHouse)
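To make the layering and naming conventions concrete, here is a minimal, hypothetical ODS-layer DDL issued through PySpark with Hive support. The database, table, columns, and storage location are illustrative only and not taken from the project's actual DDL.

```python
# Illustrative ODS-layer external table following the layer-prefixed naming
# convention (ods_/dwd_/dwm_/dws_/dwt_/dim_/ads_) described in the SOP.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_orders_full (
        order_id     BIGINT         COMMENT 'order primary key',
        user_id      BIGINT         COMMENT 'buyer id',
        order_amount DECIMAL(16,2)  COMMENT 'order amount',
        order_status STRING         COMMENT 'order status code',
        create_time  TIMESTAMP      COMMENT 'order creation time'
    )
    COMMENT 'ODS layer: raw order snapshots'
    PARTITIONED BY (dt STRING COMMENT 'partition date, yyyy-MM-dd')
    STORED AS PARQUET
    LOCATION '/warehouse/ods/ods_orders_full'
""")
```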
Figure 1: DWH Dimensional Modelling SOP
Figure 2: DWH Dimensional Modelling Methodology Diagram
Figure 3: DWH Dimensional Modelling Architecture
I built this distributed data platform entirely from scratch. Starting with a base Ubuntu 20.04 Docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized and features a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as a data source for the data warehouse. All container images are open-sourced and published to the GitHub Container Registry, making it easy for anyone to deploy the same platform locally.
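As a rough sketch of how a PySpark job would attach to this platform (YARN for resource management, the MySQL-backed Hive metastore for table metadata), a session could be created as follows; the application name and resource settings are placeholders, not the project's actual configuration.

```python
# Sketch: attach a PySpark session to the YARN cluster and the Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-etl")
    .master("yarn")                        # submit to the Hadoop HA cluster
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "2g")
    .enableHiveSupport()                   # use the Hive metastore for warehouse tables
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()         # quick smoke test against the metastore
```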
Code - Docker Compose File
Figure 1: All Containers Window
Figure 2: Data Platform Architecture
This project implements a robust distributed batch processing architecture using PySpark for computation and Apache Airflow for orchestration. The batch layer focuses on high-throughput, scalable ETL workflows and integrates seamlessly with the overall data warehouse design. The core functionalities are structured as follows:
A PySpark-based incremental extraction process is used to ingest new records from the Oracle OLTP database into the data warehouse. Additionally, downstream scripts handle the transformation and export of analytical and result-layer datasets from the data warehouse into external OLAP systems, enabling fast access by BI tools (e.g., Power BI, Tableau).
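The following is a minimal sketch of such an incremental pull, assuming an Oracle JDBC source and a date-partitioned ODS Hive table; the connection URL, credentials, and table/column names are placeholders rather than the project's actual configuration.

```python
# Sketch: incremental extraction from an Oracle OLTP source into a date-partitioned ODS table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oltp-to-dwh").enableHiveSupport().getOrCreate()

batch_date = "2024-06-01"   # normally injected by the scheduler

incremental_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-oltp:1521/ORCLPDB1")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("user", "etl_user")
    .option("password", "********")
    .option("query", f"""
        SELECT order_id, user_id, order_amount, order_status, create_time
        FROM orders
        WHERE TRUNC(create_time) = TO_DATE('{batch_date}', 'YYYY-MM-DD')
    """)
    .load()
)

# Write the day's records into the partition directory of the ODS table.
(
    incremental_orders
    .write
    .mode("overwrite")
    .parquet(f"/warehouse/ods/ods_orders_full/dt={batch_date}")
)

# Register the new partition so Hive and Spark SQL can query it.
spark.sql(f"""
    ALTER TABLE ods.ods_orders_full
    ADD IF NOT EXISTS PARTITION (dt = '{batch_date}')
    LOCATION '/warehouse/ods/ods_orders_full/dt={batch_date}'
""")
```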
Multi-stage transformations are implemented using Spark SQL within PySpark jobs to process data across warehouse layers, such as: ODS (Operational Data Store) → DWD (Data Warehouse Detail) and DWD → DIM (Dimension Tables). These transformations ensure structured, cleaned, and query-optimized data for analytical use cases.
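For example, an ODS → DWD step can be expressed as a single Spark SQL statement inside a PySpark job; the cleansing rules and table/column names below are illustrative only.

```python
# Sketch: one ODS -> DWD transformation step expressed in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ods-to-dwd").enableHiveSupport().getOrCreate()
batch_date = "2024-06-01"

spark.sql(f"""
    INSERT OVERWRITE TABLE dwd.dwd_fact_orders PARTITION (dt = '{batch_date}')
    SELECT
        order_id,
        user_id,
        CAST(order_amount AS DECIMAL(16,2)) AS order_amount,
        UPPER(TRIM(order_status))           AS order_status,
        CAST(create_time AS TIMESTAMP)      AS create_time
    FROM ods.ods_orders_full
    WHERE dt = '{batch_date}'
      AND order_id IS NOT NULL              -- drop malformed records at the DWD layer
""")
```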
The entire batch workflow is automated via Apache Airflow, with DAGs scheduled to run nightly at 2:00 AM. The scheduler coordinates the extraction, transformation, and loading tasks, handles dependencies, and ensures timely creation of new partitions and ingestion of the latest data into the warehouse.
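A condensed sketch of what such a DAG could look like on Airflow 2.x, with the three stages wired in sequence and scheduled nightly at 2:00 AM; the script paths and task IDs are placeholders.

```python
# Sketch: nightly batch DAG chaining extraction, transformation, and OLAP export.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_dwh_batch",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",                   # every night at 2:00 AM
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
) as dag:

    extract_oltp = BashOperator(
        task_id="extract_oltp_to_ods",
        bash_command="spark-submit /opt/jobs/oltp_to_ods.py --dt {{ ds }}",
    )

    ods_to_dwd = BashOperator(
        task_id="transform_ods_to_dwd",
        bash_command="spark-submit /opt/jobs/ods_to_dwd.py --dt {{ ds }}",
    )

    export_to_olap = BashOperator(
        task_id="export_ads_to_clickhouse",
        bash_command="spark-submit /opt/jobs/dwh_to_olap.py --dt {{ ds }}",
    )

    extract_oltp >> ods_to_dwd >> export_to_olap
```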
Figure 1: ETL Data Pipeline
Figure 2: Airflow Web UI
- GitHub Actions Code
Code - workflows.main YAML
- Key Screenshots
Figure 1: Data platform launching and stop automation
Figure 2: Sample Log Screenshot I
Figure 3: Sample Log Screenshot II
Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager
Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager
Figure 1: Prometheus
Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master
Figure 3: Grafana-MySQLD
Link - PowerBI Public Access (expirable)
Use Microsoft PowerBI to connect to ClickHouse and extract data from the analytical data storage layer.
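PowerBI itself connects through its ClickHouse connector; as a quick way to sanity-check the ADS-layer tables the dashboard reads, a small script like the one below could query the same data. It assumes the `clickhouse-connect` Python client and hypothetical host and table names.

```python
# Sketch: query the ADS layer in ClickHouse directly to verify what the dashboard will show.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.azure.com",   # placeholder host
    port=8443,
    username="bi_reader",
    password="********",
    secure=True,
)

result = client.query(
    "SELECT * FROM ads.ads_order_daily_summary ORDER BY dt DESC LIMIT 10"
)
for row in result.result_rows:
    print(row)
```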

Figure 1: PowerBI Dashboard Demo
This project sets up a high-availability big data platform, including the following components:
| Component | Role | Version |
|---|---|---|
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | OLTP Data Source (simulated business backend) | 19.0.0 |
| Azure Cloud ClickHouse | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |
/bigdata-datawarehouse-project
├── /.github/workflows              # CI/CD automation workflows via GitHub Actions
├── /docs                           # docs (all business and technology documents for this project)
├── /src
│   ├── /data_pipeline              # ETL flow: OLTP2DWH & DWH2OLAP
│   ├── /warehouse_modeling         # DWH modelling (Hive SQL, etc.)
│   ├── /batch_processing           # data batch processing (PySpark + Spark SQL)
│   ├── /scheduler                  # task scheduling (Airflow DAGs)
│   ├── /infra                      # infrastructure deployment (Docker, configuration files)
│   ├── /snippets                   # commonly used commands and snippets
│   ├── /scripts                    # container-internal shell scripts
│   ├── /bi                         # PowerBI dashboard .pbix file
│   ├── /README                     # source code usage instruction Markdown files
│   ├── README.md                   # navigation of the source code usage instructions
│   ├── main_data_pipeline.py       # main entry point for the data pipeline module
│   └── main_batch_processing.py    # main entry point for the batch processing module
├── /tests                          # unit-testing snippets for small features (DWH modelling, data pipeline, DAGs, etc.)
├── README.md                       # project introduction
├── docker-compose-bigdata.yml      # Docker Compose file to launch the Docker cluster
├── .env                            # published on purpose, used by the docker-compose file
├── .gitignore                      # directories excluded from the remote repo
├── .gitattributes                  # Git repository attributes config
├── LICENSE                         # copyright for this project
├── mysql-metadata-restore.sh       # container operational script: restore MySQL container metadata
├── mysql-metastore-dump.sh         # container operational script: dump MySQL container metadata
├── push-to-ghcr.sh                 # container operational script: push the images to GitHub Container Registry
├── start-data-clients.sh           # container operational script: start Hive, Spark, etc.
├── start-hadoop-cluster.sh         # container operational script: start the Hadoop HA cluster
├── start-other-services.sh         # container operational script: start Airflow, Prometheus, Grafana, etc.
├── stop-data-clients.sh            # container operational script: stop Hive, Spark, etc.
├── stop-hadoop-cluster.sh          # container operational script: stop the Hadoop HA cluster
└── stop-other-services.sh          # container operational script: stop Airflow, Prometheus, Grafana, etc.

Source Code Instruction for Use: /src
- Data Warehouse Development Specification
- SQL Development Specification
- Troubleshooting: NodeManager Disk Space Issue Preventing YARN Registration
- Troubleshooting: Error Handling Log: YARN Web UI Log Loading Failures
- Troubleshooting: Spark Task - Issue with Writing ODS Layer Avro Data to HDFS but Hive Cannot Read
- Troubleshooting: JSON-like Dictionary Representation in Python Script Causes Execution Failure
- Troubleshooting: Resolving "Unrecognized column type: DATE_TYPE" Issue on ods_orders_ipd Table
- Troubleshooting: Hive Unable to Read Parquet Files Written by Spark SQL
- Troubleshooting: NameNode Startup Failure in Hadoop HA Environment
- Preventing SIGHUP from Killing Background Processes When Using `docker exec -it`
This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.


