
πŸ‘† click the picture to see the presentation video!

Enterprise-Grade On-premise Data Warehouse Solution for E-Commerce



This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.

The core value of this project lies in its enterprise-grade data warehouse modeling: e-commerce order data is integrated with the relevant business themes through standardized dimensional modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, deploying the big data cluster via Docker containers simplifies environment management and reduces operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. The project also incorporates CI/CD automation, enabling rapid iteration while maintaining the stability and reliability of the data pipeline. Storage and computation are highly optimized to maximize hardware resource utilization.

To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.

The project combines the following critical features:

| βœ… Core Feature | πŸ”₯ Core Highlights | πŸ“¦ Deliverables |
| --- | --- | --- |
| 1. Data Warehouse Modeling and Documentation | - Full dimensional modeling process (Star Schema / Snowflake Schema)<br>- Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling)<br>- Business Matrix: defining & managing dimensions & fact tables | - Data warehouse design document (Markdown)<br>- Hive SQL modeling code<br>- DWH Dimensional Modelling Architecture Diagram |
| 2. A Self-Built Distributed Big Data Platform | - Fully containerized deployment with Docker for quick replication<br>- High-availability environment: Hadoop + Hive + Spark + Zookeeper + ClickHouse | - Docker images (open-sourced on GitHub Container Registry)<br>- docker-compose.yml (one-click cluster startup)<br>- Infra configuration files for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Container internal scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Commonly used snippets for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow |
| 3. Distributed Batch Processing | - ETL processing using PySpark<br>- Data ETL jobs: OLTP to DWH & DWH to OLAP<br>- Data warehouse internal processing: ODS β†’ DWD β†’ DIM/DWM β†’ DWS β†’ ADS<br>- Batch processing job scheduling with Airflow | - PySpark and Spark SQL code<br>- Code - Data Pipeline (OLTP -> DWH, DWH -> OLAP)<br>- Code - Batch Processing (DWH Internal Transform)<br>- Code - Scheduling based on Airflow (DAGs) |
| 4. CI/CD Automation | - Automated data platform cluster start and stop | - GitHub Actions workflow pipeline .yaml<br>- CI/CD code and documentation<br>- Sample log screenshots |
| 5. Storage & Computation Optimization | - SQL optimization (dynamic partitioning, indexing, storage partitioning)<br>- Spark tuning: salting, skew join hints, broadcast joins, reduceByKey vs. groupByKey<br>- Hive tuning: Z-Order sorting (boosts ClickHouse queries), Parquet + Snappy compression | - Pre- & post-optimization performance comparison<br>- Spark optimization code<br>- SQL execution plan screenshots |
| 6. DevOps - Monitoring and Alerting | - Prometheus + Grafana for performance monitoring of the Hadoop cluster and MySQL<br>- AlertManager for alerting and email notifications | - Code - Monitoring services configuration files: Prometheus, Grafana, AlertManager<br>- Code - Monitoring services start & stop scripts: Prometheus, Grafana, AlertManager<br>- Code - Container metrics exporter start & stop scripts: my-start-node-exporter.sh & my-stop-node-exporter.sh<br>- Key screenshots |
| 7. Business Intelligence & Visualization | - PowerBI dashboards for data analysis<br>- Real business-driven visualizations<br>- Actionable business insights | - PowerBI visualization screenshots<br>- PowerBI .pbix file<br>- Key business metric explanations (BI Insights) |

Together, these features deliver a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.

βš™οΈ Core Deliverables

1. Data Warehouse Modeling and Documentation

This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.

πŸ”— Click to Show DWH Dimensional Modelling Documents and Code

Data Warehouse Development Specification

πŸ”¨ Code - Hive DDL (for all data warehouse layers: ods, dwd, dwm, dws, dwt, dim, i.e. Operational Data Store, DW Detail, DW Middle, DW Summary, DW Theme, DW Dimension; the analytical data storage (ADS) layer lives in ClickHouse)
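
To give a concrete flavour of the layered naming and storage conventions, here is a minimal sketch of one ODS-layer DDL issued through PySpark; the table, columns, and HDFS location are illustrative assumptions, and the authoritative DDL is in the linked Hive code.

```python
# Minimal sketch (table, columns, and HDFS location are assumptions):
# an ODS-layer table following the layer-prefix naming convention and the
# daily-partition / Parquet storage conventions described above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-ddl-sketch")
    .enableHiveSupport()          # register the table in the Hive metastore
    .getOrCreate()
)

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_sales_order_full (
        order_id      BIGINT          COMMENT 'order primary key',
        user_id       BIGINT          COMMENT 'buyer id',
        order_amount  DECIMAL(16, 2)  COMMENT 'order amount',
        order_status  STRING          COMMENT 'order status code'
    )
    COMMENT 'ODS layer: raw order records ingested from the OLTP source'
    PARTITIONED BY (dt STRING COMMENT 'batch date, yyyy-MM-dd')
    STORED AS PARQUET
    LOCATION '/warehouse/ods/ods_sales_order_full'
""")
```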


Figure 1: DWH Dimensional Modelling SOP


Figure 2: DWH Dimensional Modelling Methodology Diagram


Figure 3: DWH Dimensional Modelling Architecture

2. A Self-Built Distributed Big Data Platform

This distributed data platform was built entirely from scratch. Starting from a base Ubuntu 20.04 Docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized, featuring a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as a data source for the data warehouse. All container images are open-sourced and published to πŸ”¨ GitHub Container Registry, making it easy for anyone to deploy the same platform locally.
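
For orientation, the sketch below shows how a PySpark job might attach to this containerized cluster; the master setting, metastore hostname, and port are assumptions for illustration, not the project's actual configuration.

```python
# Minimal sketch (hostname and port are assumptions): a PySpark job submitted
# to the HA YARN cluster with Hive metastore access enabled, so that warehouse
# tables created in Hive are visible to Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("platform-smoke-test")
    .master("yarn")                                                # run on the HA YARN cluster
    .config("hive.metastore.uris", "thrift://hadoop-master:9083")  # assumed metastore endpoint
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()   # quick check that Hive metadata is reachable
```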

πŸ”¨ Code - Docker Compose File

πŸ”¨ Code - Configuration Files for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow

πŸ”¨ Code - Container Internal Scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow

πŸ”¨ Code - Commonly Used Snippets for Cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow


Figure 1: All Containers Window


Figure 2: Data Platform Architecture

3. Distributed Batch Processing

This project implements a robust distributed batch processing architecture using PySpark for computation and Apache Airflow for orchestration. The batch layer focuses on high-throughput, scalable ETL workflows and integrates seamlessly with the overall data warehouse design. The core functionalities are structured as follows:

  1. πŸ”¨ Code - Data Pipeline (OLTP -> DWH, DWH -> OLAP)

A PySpark-based incremental extraction process is used to ingest new records from the Oracle OLTP database into the data warehouse. Additionally, downstream scripts handle the transformation and export of analytical and result-layer datasets from the data warehouse into external OLAP systems, enabling fast access by BI tools (e.g., Power BI, Tableau).
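
A minimal sketch of this step is shown below; the JDBC URL, credentials, and table/column names are assumptions for illustration, and the real logic lives in the linked data pipeline code.

```python
# Minimal sketch (connection details and table/column names are assumptions):
# incrementally pull one batch date of new orders from the Oracle OLTP source
# over JDBC and overwrite the matching ODS partition in the warehouse.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oltp-to-dwh-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

batch_date = "2024-01-01"   # in production this is passed in by the Airflow run

incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-oltp:1521/ORCLPDB1")   # assumed host/service
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("user", "etl_user")
    .option("password", "***")
    .option("query",
            f"SELECT order_id, user_id, order_amount, order_status "
            f"FROM sales.orders WHERE TRUNC(create_time) = DATE '{batch_date}'")
    .load()
)

incremental_df.createOrReplaceTempView("stg_orders_increment")
spark.sql(f"""
    INSERT OVERWRITE TABLE ods.ods_sales_order_full PARTITION (dt = '{batch_date}')
    SELECT order_id, user_id, order_amount, order_status
    FROM stg_orders_increment
""")
```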

  2. πŸ”¨ Code - Batch Processing (DWH Internal Transform)

Multi-stage transformations are implemented using Spark SQL within PySpark jobs to process data across warehouse layers, such as: ODS (Operational Data Store) β†’ DWD (Data Warehouse Detail) and DWD β†’ DIM (Dimension Tables). These transformations ensure structured, cleaned, and query-optimized data for analytical use cases.
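
The sketch below illustrates one such layer-to-layer transform; the table and column names are assumptions kept consistent with the earlier sketches, not the project's actual schema.

```python
# Minimal sketch (schema names are assumptions): a Spark SQL transform that
# cleanses one partition of ODS order records and loads it into the DWD layer.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ods-to-dwd-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

batch_date = "2024-01-01"

spark.sql(f"""
    INSERT OVERWRITE TABLE dwd.dwd_fact_sales_order PARTITION (dt = '{batch_date}')
    SELECT
        order_id,
        user_id,
        CAST(order_amount AS DECIMAL(16, 2))  AS order_amount,
        UPPER(TRIM(order_status))             AS order_status
    FROM ods.ods_sales_order_full
    WHERE dt = '{batch_date}'
      AND order_id IS NOT NULL                -- drop malformed rows
""")
```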

  3. πŸ”¨ Code - Scheduling based on Airflow (DAGs)

The entire batch workflow is automated via Apache Airflow, with DAGs scheduled to run nightly at 2:00 AM. The scheduler coordinates the extraction, transformation, and loading tasks, handles dependencies, and ensures timely creation of new partitions and ingestion of the latest data into the warehouse.
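
A minimal sketch of such a DAG follows; the DAG id, spark-submit paths, command-line flags, and operator choice are assumptions, while the 2:00 AM schedule and the two main entry-point scripts match the project layout described later in this README.

```python
# Minimal sketch (DAG id, paths, flags, and operators are assumptions): a
# nightly DAG that runs the OLTP -> DWH extraction first, then the
# in-warehouse transforms, every day at 2:00 AM.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ecom_dwh_nightly_batch",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",        # 2:00 AM every night
    catchup=False,
) as dag:
    oltp_to_dwh = BashOperator(
        task_id="oltp_to_dwh",
        bash_command="spark-submit /opt/src/main_data_pipeline.py --date {{ ds }}",
    )
    dwh_internal_transform = BashOperator(
        task_id="dwh_internal_transform",
        bash_command="spark-submit /opt/src/main_batch_processing.py --date {{ ds }}",
    )

    oltp_to_dwh >> dwh_internal_transform   # transforms wait for the fresh extract
```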


Figure 1: ETL Data Pipeline


Figure 2: Airflow Web UI

4. CI/CD Automation

  1. GitHub Actions Code

πŸ”¨ Code - workflows.main YAML

  2. Key Screenshots

Figure 1: Data platform start/stop automation


Figure 2: Sample Log Screenshot I


Figure 3: Sample Log Screenshot II

  3. πŸ”— Link - Automation Workflow Web UI

5. Storage & Computation Optimization
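
The Spark-side techniques called out in the feature matrix (broadcast joins, salting for skewed keys) can be illustrated with a short, hedged sketch; the table names below are assumptions, and the measured before/after comparison belongs to the optimization deliverables.

```python
# Minimal sketch (table names are assumptions): two of the tunings listed above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-tuning-sketch").enableHiveSupport().getOrCreate()

orders = spark.table("dwd.dwd_fact_sales_order")
dim_user = spark.table("dim.dim_user")

# 1) Broadcast join: ship the small dimension table to every executor so the
#    large fact table is never shuffled for the join.
enriched = orders.join(F.broadcast(dim_user), "user_id", "left")

# 2) Salting: split a skewed aggregation key across N buckets, aggregate
#    partially per bucket, then combine, so no single task handles a hot key.
N = 16
totals = (
    orders
    .withColumn("salt", (F.rand() * N).cast("int"))
    .groupBy("user_id", "salt")
    .agg(F.sum("order_amount").alias("partial_amount"))
    .groupBy("user_id")
    .agg(F.sum("partial_amount").alias("total_amount"))
)
```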

6. DevOps - Monitoring and Alerting

πŸ”¨ Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager

πŸ”¨ Code - Monitoring Services Start&Stop Scripts: Prometheus, Grafana, AlertManager

πŸ”¨ Code - Container Metrics Exporter Start&Stop Scripts: my-start-node-exporter.sh & my-stop-node-exporter.sh
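
As a small illustration of how the monitoring stack can be queried programmatically, the sketch below hits the standard Prometheus HTTP API for the `up` metric; the hostname and port are assumptions for illustration.

```python
# Minimal sketch (host/port are assumptions): ask Prometheus which scrape
# targets (e.g. node exporters started by the scripts above) are currently up.
import requests

resp = requests.get(
    "http://prometheus:9090/api/v1/query",   # assumed container hostname / default port
    params={"query": "up"},
    timeout=5,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    value = series["value"][1]                # "1" means the target is up, "0" means down
    print(f"{instance}: {'UP' if value == '1' else 'DOWN'}")
```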


Figure 1: Prometheus


Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master


Figure 3: Grafana-MySQLD

7. Business Intelligence & Visualization

πŸ”— Link - PowerBI Public Access (expirable)

Microsoft PowerBI connects to ClickHouse and extracts data from the analytical data storage (ADS) layer.

Figure 1: PowerBI Dashboard Demo

Tech Stack

This project sets up a high-availability big data platform, including the following components:


| Components | Features | Version |
| --- | --- | --- |
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | OLTP Data Source | 19.0.0 |
| ClickHouse (Azure Cloud) | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |

πŸ“ Project Directory

/bigdata-datawarehouse-project
│── /.github/workflows            # CI/CD automation workflows via GitHub Actions
│── /docs                         # docs (all business and technologies documents about this project)
│── /src
    │── /data_pipeline            # ETL flow: OLTP2DWH & DWH2OLAP
    │── /warehouse_modeling       # DWH modelling (Hive SQL etc.)
    │── /batch_processing         # Data Batch processing (PySpark + SparkSQL)
    │── /scheduler                # Task Scheduler (Airflow DAGs)
    │── /infra                    # infrastructure deployment (Docker, configuration files)
    │── /snippets                 # commonly used commands and snippets
    │── /scripts                  # container internal shell scripts
    │── /bi                       # PowerBI Dashboard pbix file
    │── /README                   # Source Code Use Instruction Markdown Files
    │── README.md                 # Navigation of Source Code Use Instruction
    │── main_data_pipeline.py     # main entry point for the data pipeline module
    │── main_batch_processing.py  # main entry point for the batch processing module
│── /tests                        # all small features unit testing snippets (DWH modelling, data pipeline, dags etc.) 
│── README.md                     # Introduction about project
│── docker-compose-bigdata.yml    # Docker Compose to launch the docker cluster
│── .env                          # public on purpose; required by the docker-compose file
│── .gitignore                    # directories excluded from the remote repository
│── .gitattributes                # Git repository attributes config
│── LICENSE                       # COPYRIGHT for this project
│── mysql-metadata-restore.sh     # container operational level scripts: restore mysql container metadata
│── mysql-metastore-dump.sh       # container operational level scripts: dump mysql container metadata
│── push-to-ghcr.sh               # container operational level scripts: push the images to GitHub Container Registry
│── start-data-clients.sh         # container operational level scripts: start hive, spark etc
│── start-hadoop-cluster.sh       # container operational level scripts: start hadoop HA cluster 
│── start-other-services.sh       # container operational level scripts: start airflow, prometheus, grafana etc
│── stop-data-clients.sh          # container operational level scripts: stop hive, spark etc
│── stop-hadoop-cluster.sh        # container operational level scripts: stop hadoop HA cluster 
│── stop-other-services.sh        # container operational level scripts: stop airflow, prometheus, grafana etc

Data Pipeline Instruction

Batch Processing Instruction

Warehouse Modelling Instruction

Automation Scheduler Instruction

Container DevOps Scripts Instruction

πŸ“Œ Project Documents /docs

1. Tech Architecture

2. Development Specification

3. Troubleshooting

Development

Testing

Optimization

License

This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.
