Click the picture to see the presentation video!
This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data. By leveraging Docker containers to simulate a big data platform, it achieves a complete workflow from ETL processing to data warehouse modeling, OLAP analysis, and data visualization.
The core value of this project lies in its implementation of enterprise-grade data warehouse modeling, integrating e-commerce order data with relevant business themes through standardized dimension modeling and fact table design, ensuring data accuracy, consistency, and traceability. Meanwhile, the deployment of a big data cluster via Docker containers simplifies environment management and operational costs, offering a flexible deployment model for distributed batch processing powered by Spark. Additionally, the project incorporates CI/CD automation, enabling rapid iterations while maintaining the stability and reliability of the data pipeline. Storage and computation are also highly optimized to maximize hardware resource utilization.
To monitor and manage the system effectively, a Grafana-based cluster monitoring system has been implemented, providing real-time insights into cluster health metrics and assisting in performance tuning and capacity planning. Finally, by integrating business intelligence (BI) and visualization solutions, the project transforms complex data warehouse analytics into intuitive dashboards and reports, allowing business teams to make data-driven decisions more efficiently.
By combining the following critical features:
| Core Feature | Core Highlights | Deliverables |
|---|---|---|
| 1. Data Warehouse Modeling and Documentation | - Full dimensional modeling process (Star Schema / Snowflake Schema)<br>- Standardized development norms (ODS/DWD/DWM/DWS/DWT/ADS six-layer modeling)<br>- Business matrix: defining & managing dimensions & fact tables | - Data warehouse design document (Markdown)<br>- Hive SQL modeling code<br>- DWH dimensional modelling architecture diagram |
| 2. Self-Built Distributed Big Data Platform | - Fully containerized deployment with Docker for quick replication<br>- High-availability environment: Hadoop + Hive + Spark + ZooKeeper + ClickHouse | - Docker images (open-sourced on GitHub Container Registry)<br>- docker-compose.yml (one-click cluster startup)<br>- Infrastructure configuration files for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Container-internal scripts: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow<br>- Commonly used snippets for the cluster: Hadoop, ZooKeeper, Hive, MySQL, Spark, Prometheus & Grafana, Airflow |
| 3. Distributed Batch Processing | - ETL processing using PySpark<br>- Data ETL jobs: OLTP to DWH and DWH to OLAP<br>- Data warehouse internal processing: ODS → DWD → DIM/DWM → DWS → ADS<br>- Batch job scheduling with Airflow | - PySpark and Spark SQL code<br>- Code: data pipeline (OLTP → DWH, DWH → OLAP)<br>- Code: batch processing (DWH internal transforms)<br>- Code: scheduling based on Airflow (DAGs) |
| 4. CI/CD Automation | - Automated data platform cluster launch and shutdown | - GitHub Actions workflow pipeline .yaml<br>- CI/CD code and documentation<br>- Sample log screenshots |
| 5. Storage & Computation Optimization | - SQL optimization (dynamic partitioning, indexing, storage partitioning)<br>- Spark tuning: salting, skew join hints, broadcast joins, reduceByKey vs. groupByKey (see the sketch after this table)<br>- Hive tuning: Z-Order sorting (boosts ClickHouse queries), Parquet + Snappy compression | - Pre- and post-optimization performance comparison<br>- Spark optimization code<br>- SQL execution plan screenshots |
| 6. DevOps: Monitoring and Alerting | - Prometheus + Grafana performance monitoring for the Hadoop cluster and MySQL<br>- AlertManager for alerting and email notifications | - Code: monitoring services configuration files (Prometheus, Grafana, AlertManager)<br>- Code: monitoring services start/stop scripts (Prometheus, Grafana, AlertManager)<br>- Code: container metrics exporter start/stop scripts (my-start-node-exporter.sh & my-stop-node-exporter.sh)<br>- Key screenshots |
| 7. Business Intelligence & Visualization | - PowerBI dashboards for data analysis<br>- Real business-driven visualizations<br>- Actionable business insights | - PowerBI visualization screenshots<br>- PowerBI .pbix file<br>- Key business metric explanations (BI insights) |
this project delivers a professional, robust, and highly efficient solution for enterprises dealing with large-scale data processing and analytics.
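Feature 5 above mentions broadcast joins and key salting among the Spark tuning techniques. The sketch below illustrates both in PySpark; the table and column names (`dwd.fact_orders`, `dim.dim_users`, `user_id`, `order_amount`) are placeholders, not the project's actual schema.

```python
# Minimal sketch of two Spark tuning techniques from feature 5:
# a broadcast join for a small dimension table, and key salting for a skewed aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning-sketch").enableHiveSupport().getOrCreate()

fact_orders = spark.table("dwd.fact_orders")   # large, potentially skewed fact table
dim_users   = spark.table("dim.dim_users")     # small dimension table

# 1) Broadcast join: ship the small dimension to every executor,
#    avoiding a full shuffle of the large fact table.
enriched = fact_orders.join(F.broadcast(dim_users), on="user_id", how="left")

# 2) Salting: spread a hot key across N buckets, aggregate twice.
N = 16
salted_totals = (
    fact_orders
    .withColumn("salt", (F.rand() * N).cast("int"))
    .groupBy("user_id", "salt")
    .agg(F.sum("order_amount").alias("partial_amount"))
    .groupBy("user_id")
    .agg(F.sum("partial_amount").alias("total_amount"))
)
```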
This project demonstrates my ability to build a data warehouse from the ground up following enterprise-grade standards. I independently designed and documented a complete SOP for data warehouse development, covering every critical step in the modeling roadmap. From initial business data research to final model delivery, I established a standardized methodology that ensures clarity, scalability, and maintainability. The SOP includes detailed best practices on data warehouse layering, table naming conventions, field naming rules, and lifecycle management for warehouse tables. For more information, please refer to the documentation below.
Click to Show DWH Dimensional Modelling Documents and Code
Data Warehouse Development Specification
- Data Warehouse Layering Specification
- Table Naming Conventions
- Data Warehouse Column Naming Conventions
- Data Table Lifecycle Management Specification
- DWH Modelling Architecture Diagram
Code - Hive DDL for all data warehouse layers: ods, dwd, dwm, dws, dwt, dim, ads (Operational Data Store, DW Detail, DW Middle, DW Summary, DW Theme, DW Dimension, Analytical Data Storage on ClickHouse)
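To make the layering and naming conventions concrete, here is a minimal, hypothetical ODS-layer DDL issued through PySpark with Hive support. The database, table, columns, and storage location are illustrative only and not taken from the project's actual DDL.

```python
# Illustrative ODS-layer external table following the layer-prefixed naming
# convention (ods_/dwd_/dwm_/dws_/dwt_/dim_/ads_) described in the SOP.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_orders_full (
        order_id     BIGINT         COMMENT 'order primary key',
        user_id      BIGINT         COMMENT 'buyer id',
        order_amount DECIMAL(16,2)  COMMENT 'order amount',
        order_status STRING         COMMENT 'order status code',
        create_time  TIMESTAMP      COMMENT 'order creation time'
    )
    COMMENT 'ODS layer: raw order snapshots'
    PARTITIONED BY (dt STRING COMMENT 'partition date, yyyy-MM-dd')
    STORED AS PARQUET
    LOCATION '/warehouse/ods/ods_orders_full'
""")
```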
Figure 1: DWH Dimensional Modelling SOP
Figure 2: DWH Dimensional Modelling Methodology Diagram
Figure 3: DWH Dimensional Modelling Architecture
I built this distributed data platform entirely from scratch. Starting with a base Ubuntu 20.04 Docker image, I manually installed and configured each component step by step, ultimately creating a fully functional three-node Hadoop cluster with distributed storage and computing capabilities. The platform is fully containerized and features a highly available HDFS and YARN architecture. It supports Hive for data warehousing, Spark for distributed computing, Airflow for workflow orchestration, and Prometheus + Grafana for performance monitoring. A MySQL container manages metadata for both Hive and Airflow and is also monitored by Prometheus. An Oracle container simulates the backend of a business system and serves as a data source for the data warehouse. All container images are open-sourced and published to the GitHub Container Registry, making it easy for anyone to deploy the same platform locally.
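As a rough sketch of how a PySpark job would attach to this platform (YARN for resource management, the MySQL-backed Hive metastore for table metadata), a session could be created as follows; the application name and resource settings are placeholders, not the project's actual configuration.

```python
# Sketch: attach a PySpark session to the YARN cluster and the Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dwh-etl")
    .master("yarn")                        # submit to the Hadoop HA cluster
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "2g")
    .enableHiveSupport()                   # use the Hive metastore for warehouse tables
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()         # quick smoke test against the metastore
```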
Code - Docker Compose File
Figure 1: All Containers Window
Figure 2: Data Platform Architecture
This project implements a robust distributed batch processing architecture using PySpark for computation and Apache Airflow for orchestration. The batch layer focuses on high-throughput, scalable ETL workflows and integrates seamlessly with the overall data warehouse design. The core functionalities are structured as follows:
A PySpark-based incremental extraction process is used to ingest new records from the Oracle OLTP database into the data warehouse. Additionally, downstream scripts handle the transformation and export of analytical and result-layer datasets from the data warehouse into external OLAP systems, enabling fast access by BI tools (e.g., Power BI, Tableau).
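The following is a minimal sketch of such an incremental pull, assuming an Oracle JDBC source and a date-partitioned ODS Hive table; the connection URL, credentials, and table/column names are placeholders rather than the project's actual configuration.

```python
# Sketch: incremental extraction from an Oracle OLTP source into a date-partitioned ODS table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oltp-to-dwh").enableHiveSupport().getOrCreate()

batch_date = "2024-06-01"   # normally injected by the scheduler

incremental_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-oltp:1521/ORCLPDB1")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("user", "etl_user")
    .option("password", "********")
    .option("query", f"""
        SELECT order_id, user_id, order_amount, order_status, create_time
        FROM orders
        WHERE TRUNC(create_time) = TO_DATE('{batch_date}', 'YYYY-MM-DD')
    """)
    .load()
)

# Write the day's records into the partition directory of the ODS table.
(
    incremental_orders
    .write
    .mode("overwrite")
    .parquet(f"/warehouse/ods/ods_orders_full/dt={batch_date}")
)

# Register the new partition so Hive and Spark SQL can query it.
spark.sql(f"""
    ALTER TABLE ods.ods_orders_full
    ADD IF NOT EXISTS PARTITION (dt = '{batch_date}')
    LOCATION '/warehouse/ods/ods_orders_full/dt={batch_date}'
""")
```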
Multi-stage transformations are implemented using Spark SQL within PySpark jobs to process data across warehouse layers, such as: ODS (Operational Data Store) → DWD (Data Warehouse Detail) and DWD → DIM (Dimension Tables). These transformations ensure structured, cleaned, and query-optimized data for analytical use cases.
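For example, an ODS → DWD step can be expressed as a single Spark SQL statement inside a PySpark job; the cleansing rules and table/column names below are illustrative only.

```python
# Sketch: one ODS -> DWD transformation step expressed in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ods-to-dwd").enableHiveSupport().getOrCreate()
batch_date = "2024-06-01"

spark.sql(f"""
    INSERT OVERWRITE TABLE dwd.dwd_fact_orders PARTITION (dt = '{batch_date}')
    SELECT
        order_id,
        user_id,
        CAST(order_amount AS DECIMAL(16,2)) AS order_amount,
        UPPER(TRIM(order_status))           AS order_status,
        CAST(create_time AS TIMESTAMP)      AS create_time
    FROM ods.ods_orders_full
    WHERE dt = '{batch_date}'
      AND order_id IS NOT NULL              -- drop malformed records at the DWD layer
""")
```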
The entire batch workflow is automated via Apache Airflow, with DAGs scheduled to run nightly at 2:00 AM. The scheduler coordinates the extraction, transformation, and loading tasks, handles dependencies, and ensures timely creation of new partitions and ingestion of the latest data into the warehouse.
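A condensed sketch of what such a DAG could look like on Airflow 2.x, with the three stages wired in sequence and scheduled nightly at 2:00 AM; the script paths and task IDs are placeholders.

```python
# Sketch: nightly batch DAG chaining extraction, transformation, and OLAP export.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_dwh_batch",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",                   # every night at 2:00 AM
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
) as dag:

    extract_oltp = BashOperator(
        task_id="extract_oltp_to_ods",
        bash_command="spark-submit /opt/jobs/oltp_to_ods.py --dt {{ ds }}",
    )

    ods_to_dwd = BashOperator(
        task_id="transform_ods_to_dwd",
        bash_command="spark-submit /opt/jobs/ods_to_dwd.py --dt {{ ds }}",
    )

    export_to_olap = BashOperator(
        task_id="export_ads_to_clickhouse",
        bash_command="spark-submit /opt/jobs/dwh_to_olap.py --dt {{ ds }}",
    )

    extract_oltp >> ods_to_dwd >> export_to_olap
```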
Figure 1: ETL Data Pipeline
Figure 2: Airflow Web UI
- GitHub Actions Code
Code - workflows.main YAML
- Key Screenshots
Figure 1: Data platform launching and stop automation
Figure 2: Sample Log Screenshot I
Figure 3: Sample Log Screenshot II
Code - Monitoring Services Configuration Files: Prometheus, Grafana, AlertManager
Code - Monitoring Services Start & Stop Scripts: Prometheus, Grafana, AlertManager
Figure 1: Prometheus
Figure 2: Grafana-Hadoop-Cluster-instance-hadoop-master
Figure 3: Grafana-MySQLD
Link - PowerBI Public Access (expirable)
Use Microsoft PowerBI to connect to ClickHouse and extract data from the analytical data storage layer.
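PowerBI itself connects through its ClickHouse connector; as a quick way to sanity-check the ADS-layer tables the dashboard reads, a small script like the one below could query the same data. It assumes the `clickhouse-connect` Python client and hypothetical host and table names.

```python
# Sketch: query the ADS layer in ClickHouse directly to verify what the dashboard will show.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.azure.com",   # placeholder host
    port=8443,
    username="bi_reader",
    password="********",
    secure=True,
)

result = client.query(
    "SELECT * FROM ads.ads_order_daily_summary ORDER BY dt DESC LIMIT 10"
)
for row in result.result_rows:
    print(row)
```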

Figure 1: PowerBI Dashboard Demo
This project sets up a high-availability big data platform, including the following components:
| Component | Role | Version |
|---|---|---|
| Apache Hadoop | Big Data Distributed Framework | 3.2.4 |
| Apache Zookeeper | High Availability | 3.8.4 |
| Apache Spark | Distributed Computing | 3.3.0 |
| Apache Hive | Data Warehousing | 3.1.3 |
| Apache Airflow | Workflow Scheduling | 2.7.2 |
| MySQL | Metastore | 8.0.39 |
| Oracle Database | OLTP Data Source (simulated business backend) | 19.0.0 |
| Azure Cloud ClickHouse | OLAP Analysis | 24.12 |
| Microsoft PowerBI | BI Dashboard | latest |
| Prometheus | Monitoring | 2.52.0 |
| Grafana | Monitoring GUI | 10.3.1 |
| Docker | Containerization | 28.0.1 |
/bigdata-datawarehouse-project
├── /.github/workflows              # CI/CD automation workflows via GitHub Actions
├── /docs                           # docs (all business and technology documents for this project)
├── /src
│   ├── /data_pipeline              # ETL flow: OLTP2DWH & DWH2OLAP
│   ├── /warehouse_modeling         # DWH modelling (Hive SQL, etc.)
│   ├── /batch_processing           # data batch processing (PySpark + Spark SQL)
│   ├── /scheduler                  # task scheduling (Airflow DAGs)
│   ├── /infra                      # infrastructure deployment (Docker, configuration files)
│   ├── /snippets                   # commonly used commands and snippets
│   ├── /scripts                    # container-internal shell scripts
│   ├── /bi                         # PowerBI dashboard .pbix file
│   ├── /README                     # source code usage instruction Markdown files
│   ├── README.md                   # navigation of the source code usage instructions
│   ├── main_data_pipeline.py       # main entry point for the data pipeline module
│   └── main_batch_processing.py    # main entry point for the batch processing module
├── /tests                          # unit-testing snippets for small features (DWH modelling, data pipeline, DAGs, etc.)
├── README.md                       # project introduction
├── docker-compose-bigdata.yml      # Docker Compose file to launch the Docker cluster
├── .env                            # published on purpose, used by the docker-compose file
├── .gitignore                      # directories excluded from the remote repo
├── .gitattributes                  # Git repository attributes config
├── LICENSE                         # copyright for this project
├── mysql-metadata-restore.sh       # container operational script: restore MySQL container metadata
├── mysql-metastore-dump.sh         # container operational script: dump MySQL container metadata
├── push-to-ghcr.sh                 # container operational script: push the images to GitHub Container Registry
├── start-data-clients.sh           # container operational script: start Hive, Spark, etc.
├── start-hadoop-cluster.sh         # container operational script: start the Hadoop HA cluster
├── start-other-services.sh         # container operational script: start Airflow, Prometheus, Grafana, etc.
├── stop-data-clients.sh            # container operational script: stop Hive, Spark, etc.
├── stop-hadoop-cluster.sh          # container operational script: stop the Hadoop HA cluster
└── stop-other-services.sh          # container operational script: stop Airflow, Prometheus, Grafana, etc.

Source Code Instruction for Use: /src
- Data Warehouse Development Specification
- SQL Development Specification
- Troubleshooting: NodeManager Disk Space Issue Preventing YARN Registration
- Troubleshooting: Error Handling Log: YARN Web UI Log Loading Failures
- Troubleshooting: Spark Task - Issue with Writing ODS Layer Avro Data to HDFS but Hive Cannot Read
- Troubleshooting: JSON-like Dictionary Representation in Python Script Causes Execution Failure
- Troubleshooting: Resolving "Unrecognized column type: DATE_TYPE" Issue on ods_orders_ipd Table
- Troubleshooting: Hive Unable to Read Parquet Files Written by Spark SQL
- Troubleshooting: NameNode Startup Failure in Hadoop HA Environment
- Preventing SIGHUP from Killing Background Processes When Using `docker exec -it`
This project is licensed under the MIT License - see the LICENSE file for details.
Created and maintained by Smars-Bin-Hu.


