
Quick Commerce Data LakeHouse

Architecture

[Architecture diagram: quickcommercify-arch]

A data lakehouse project built with Spark, Airflow, Docker, MinIO, and more!

Description

Objective

In this project, data is ingested from a simulated quick-commerce transactional database (modeled after BigBasket) and processed through multiple data pipelines.

  • Data is processed using Spark running on a 3-node cluster deployed locally with Docker.
  • Data is stored in a medallion architecture with backfill and incremental load strategies, using MinIO as the object storage system (a typical session configuration is sketched after this list).
  • Delta Lake is used as the storage layer, with Apache Parquet as the underlying file format.
  • The gold layer provides business-ready data stored as SCD Type 2 tables for BI/DA/ML, along with pre-computed aggregate tables.
  • The data pipeline workflow is orchestrated using Apache Airflow DAGs.
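For orientation, the sketch below shows one way such a SparkSession can be wired up for Delta Lake on MinIO over the S3A connector. The endpoint, credentials, bucket, and app name are placeholder assumptions, not values taken from this repo, and it assumes the delta-spark and hadoop-aws packages are on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical session config: Delta Lake extensions plus S3A settings
# pointing at a local MinIO container (endpoint/credentials are placeholders).
spark = (
    SparkSession.builder
    .appName("quickcommerce-lakehouse")
    # Enable Delta Lake SQL support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Talk to MinIO through the S3A filesystem
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```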

Dataset

Python scripts generate fake data using the Faker library.

  • Data is generated for multiple users residing in various cities in India.
  • The product dataset is sourced from Kaggle.
  • Order data is generated as users order products from the quick-commerce platform.
  • The data is then inserted into PostgreSQL tables (customers, stores, products, orders, etc.), ready to be extracted by the Spark bronze-layer pipeline (a minimal generator sketch follows below).
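As a rough illustration of the generator, here is a minimal Faker-plus-psycopg2 sketch; the table, columns, row count, and connection details are hypothetical, not the repo's actual schema.

```python
import psycopg2
from faker import Faker

fake = Faker("en_IN")  # Indian locale, so names and cities look plausible

# Hypothetical connection details and schema, for illustration only.
conn = psycopg2.connect(host="localhost", dbname="quickcommerce",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    for _ in range(1000):
        cur.execute(
            "INSERT INTO customers (name, city, email) VALUES (%s, %s, %s)",
            (fake.name(), fake.city(), fake.email()),
        )
conn.close()
```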

Tools & Technologies

  • Containerization - Docker
  • Batch Processing - Spark
  • Orchestration - Airflow
  • OLTP DB - PostgreSQL
  • Data Lake - MinIO, Delta Lake
  • Languages - Python, SQL

Data Pipeline

  • Bronze Layer:

    • Data is extracted from PostgreSQL and loaded into the bronze layer as raw data, partitioned by date for the big tables (sketched below).

      [DAG screenshot: quickcommerce-bronze-dag]
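A bronze-layer extract along these lines might look like the following, reusing the `spark` session sketched earlier; the JDBC URL, table, and bucket paths are assumptions, and it presumes the PostgreSQL JDBC driver is on the Spark classpath.

```python
# Read one OLTP table over JDBC (connection details are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/quickcommerce")
    .option("dbtable", "orders")
    .option("user", "postgres")
    .option("password", "postgres")
    .load()
)

# Land it in the bronze zone, partitioned by date for the large table.
(orders.write.format("delta")
 .mode("append")
 .partitionBy("order_date")
 .save("s3a://lakehouse/bronze/orders"))
```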
  • Silver Layer:

    • Data is moved to the silver layer with an incremental load strategy (sketched below).
    • Data is cleaned, with appropriate data types and column names applied.
    • Data is conformed to Delta Lake tables.

      [DAG screenshot: quickcommerce-silver-dag]
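One way the silver step could look, under assumed names: filter bronze rows past a stored watermark, fix types and column names, and write to a Delta table. The watermark handling here is simplified to a literal.

```python
from pyspark.sql import functions as F

# In practice the last processed watermark would be read from state;
# a literal stands in here for brevity.
last_ts = "2024-01-01 00:00:00"

bronze = spark.read.format("delta").load("s3a://lakehouse/bronze/orders")

silver = (
    bronze
    .where(F.col("ingested_at") > F.to_timestamp(F.lit(last_ts)))  # incremental
    .withColumn("order_amount", F.col("order_amount").cast("decimal(10,2)"))
    .withColumnRenamed("cust_id", "customer_id")  # conform column names
)

(silver.write.format("delta")
 .mode("append")
 .save("s3a://lakehouse/silver/orders"))
```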
  • Gold Layer:

    • Tables are created analysis-ready as SCD Type 2 dimensions in a star schema (a merge sketch follows below).
    • Strategies like MERGE are used.
    • Data quality checks are written to keep the pipeline idempotent, making the data more reliable.
    • Pre-aggregated tables are created for frequently queried metrics like daily store sales, daily customer spend, etc.

      [DAG screenshot: quickcommerce-bronze-dag]
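A simplified SCD Type 2 merge with Delta Lake's Python API could look like this; the dimension, columns, and change condition are illustrative, and the filter that would restrict step 2 to changed or brand-new keys is omitted for brevity.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forPath(spark, "s3a://lakehouse/gold/dim_customer")
updates = spark.read.format("delta").load("s3a://lakehouse/silver/customers")

# Step 1: expire the current row for any customer whose tracked attribute changed.
(dim.alias("t")
 .merge(updates.alias("s"),
        "t.customer_id = s.customer_id AND t.is_current = true")
 .whenMatchedUpdate(
     condition="t.city <> s.city",
     set={"is_current": "false", "end_date": "current_date()"})
 .execute())

# Step 2: append fresh versions (filtering to changed/new keys is omitted here).
new_rows = (updates
            .withColumn("is_current", F.lit(True))
            .withColumn("start_date", F.current_date())
            .withColumn("end_date", F.lit(None).cast("date")))
new_rows.write.format("delta").mode("append").save(
    "s3a://lakehouse/gold/dim_customer")
```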

How can I make this better?

A lot can still be done.

  • Write data quality checks like Write-Audit-Publish (see the sketch after this list).
  • Include CI/CD.
  • Create additional dimensional models, KPIs, and views for additional business processes.
  • Add visualisations.
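For the Write-Audit-Publish idea in particular, the pattern is roughly: write to a staging location, run audits there, and only copy to the production path if they pass. Everything below (paths, checks, and the `df` aggregate) is hypothetical.

```python
# `df` stands for a freshly computed aggregate (e.g. daily store sales).
staging = "s3a://lakehouse/staging/daily_store_sales"
prod = "s3a://lakehouse/gold/daily_store_sales"

df.write.format("delta").mode("overwrite").save(staging)      # 1. write

audited = spark.read.format("delta").load(staging)            # 2. audit
assert audited.count() > 0, "staging table is empty"
assert audited.filter("total_sales < 0").count() == 0, "negative sales found"

audited.write.format("delta").mode("overwrite").save(prod)    # 3. publish
```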

Special Mentions

I would like to thank the datatalks.club and dataexpert.io bootcamps for providing the courses that enabled me to build this project and implement the various tools I learned there.
