A data lakehouse project built with Spark, Airflow, Docker, MinIO and much more!
In this project, data is ingested from a simulated quick-commerce transactional database (similar to BigBasket) and processed through multiple data pipelines.
- Data is processed using Spark running on a 3-node cluster deployed locally with Docker.
- Data is stored in a medallion architecture with backfill and incremental load strategies, using MinIO as the object storage system.
- Delta Lake is used as the storage layer, with Apache Parquet as the underlying file format (see the Spark session sketch after this list).
- The gold layer provides business-ready data for BI/DA/ML, stored as SCD Type 2 tables, along with pre-computed aggregate tables.
- Data pipeline workflows are orchestrated using Apache Airflow DAGs.
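
A minimal sketch of how a Spark session can be wired up for Delta Lake on MinIO; the master URL, endpoint, credentials and paths below are placeholders, not the project's exact configuration.

```python
from pyspark.sql import SparkSession

# Sketch only: service names and credentials are hypothetical.
spark = (
    SparkSession.builder
    .appName("lakehouse")
    .master("spark://spark-master:7077")  # assumed Docker service name for the standalone master
    # Enable Delta Lake as the table format.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A filesystem at MinIO instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Delta tables in the lake are then addressable via s3a:// paths, e.g.
# spark.read.format("delta").load("s3a://lakehouse/bronze/orders")
```
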
Python scripts generate fake data using the Faker library.
- Data is generated for multiple users residing in various cities in India.
- The product dataset is sourced from Kaggle.
- Order data is generated as users order products from the quick-commerce platform.
- The data is then inserted into PostgreSQL tables (customers, stores, products, orders, etc.), ready to be extracted by the Spark bronze layer pipeline (a generator sketch follows this list).
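
A small sketch of the kind of generator script this describes, assuming the Faker `en_IN` locale and a hypothetical `customers` table; the real schema and connection details may differ.

```python
import psycopg2
from faker import Faker

fake = Faker("en_IN")  # Indian locale so names and cities match the use case

# Hypothetical connection settings and table layout.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="quick_commerce",
    user="postgres", password="postgres",
)

with conn, conn.cursor() as cur:
    for _ in range(1000):
        cur.execute(
            "INSERT INTO customers (name, city, phone, created_at) "
            "VALUES (%s, %s, %s, %s)",
            (fake.name(), fake.city(), fake.phone_number(),
             fake.date_time_this_year()),
        )
conn.close()
```
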
- Containerization - Docker
- Batch Processing - Spark
- Orchestration - Airflow (a minimal DAG sketch follows this list)
- OLTP DB - PostgreSQL
- Data Lake - MinIO, Delta Lake
- Languages - Python, SQL
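
As a sketch of the orchestration layer, a minimal Airflow DAG could chain the three medallion jobs via `spark-submit`; the task names and job paths are assumptions, not the project's actual DAG.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lakehouse_medallion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Placeholder job paths; each task submits one layer's Spark pipeline.
    bronze = BashOperator(task_id="bronze_ingest",
                          bash_command="spark-submit /opt/jobs/bronze_ingest.py")
    silver = BashOperator(task_id="silver_transform",
                          bash_command="spark-submit /opt/jobs/silver_transform.py")
    gold = BashOperator(task_id="gold_build",
                        bash_command="spark-submit /opt/jobs/gold_build.py")

    bronze >> silver >> gold
```
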
Bronze Layer:
- Raw data is extracted from the PostgreSQL source tables (backfill and incremental loads) and landed in MinIO as Delta tables, as sketched below.
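
A sketch of an incremental bronze extract, assuming a hypothetical `updated_at` watermark column on the source table and placeholder connection details (a backfill run would simply drop the watermark filter).

```python
# The watermark would normally come from pipeline state (e.g. the latest value
# already present in the bronze Delta table); hard-coded here for illustration.
last_watermark = "2024-01-01 00:00:00"

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/quick_commerce")
    .option("dbtable",
            f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') AS src")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Append the new slice to the bronze Delta table on MinIO.
orders.write.format("delta").mode("append").save("s3a://lakehouse/bronze/orders")
```
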

Silver Layer:

Gold Layer:
- Tables are modeled as a star schema with SCD Type 2 dimensions, making them analysis-ready.
- Merge (upsert) strategies are used to apply changes (see the SCD-2 sketch after this list).
- Data quality checks are written to keep the pipelines idempotent, making the data more reliable.
- Pre-aggregated tables are built for frequently queried metrics such as daily store sales and daily customer spend.
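
A sketch of how the SCD Type 2 upsert into a gold dimension could look with Delta Lake's MERGE, assuming hypothetical column names (`customer_id`, `city`, `is_current`, `valid_from`, `valid_to`) and a `silver_customers` DataFrame holding the latest cleaned snapshot.

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

dim_path = "s3a://lakehouse/gold/dim_customer"  # placeholder path
dim = DeltaTable.forPath(spark, dim_path)
updates = silver_customers  # latest cleaned snapshot from the silver layer

# Step 1: close out current rows whose tracked attribute has changed.
(dim.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.city <> s.city",
        set={"is_current": "false", "valid_to": "current_timestamp()"},
    )
    .execute())

# Step 2: append new current versions for changed or brand-new customers.
current = spark.read.format("delta").load(dim_path).filter("is_current = true")
new_rows = (
    updates.join(current.select("customer_id", "city"),
                 ["customer_id", "city"], "left_anti")
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
)
new_rows.write.format("delta").mode("append").save(dim_path)
```
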

A lot can still be done:
- Add data quality gates such as the Write-Audit-Publish pattern.
- Include CI/CD.
- Create additional dimensional models, KPIs and views for additional business processes.
- Add visualisations.
I would like to thank the datatalks.club and dataexpert.io bootcamps for providing the courses that enabled me to build this project and apply the various tools I learnt there.


