🛒 Brazilian E-Commerce Data Engineering Pipeline

This project showcases a full end-to-end data engineering pipeline built using the Brazilian Olist e-commerce dataset, deployed using modern data tools and cloud services. It is structured around the Medallion Architecture to simulate production-grade pipelines and transformations.


📊 Project Overview

This project aims to ingest, clean, transform, and model data from multiple sources into an analytics-ready format. It mirrors real-world complexity by integrating structured and semi-structured data from cloud storage, relational databases, and NoSQL stores.


🗂 Dataset Summary

The dataset is Olist's public Brazilian E-Commerce dataset. Most of it ships as CSV files, with two tables staged in MongoDB and MySQL to simulate heterogeneous sources; the tables cover:

  • Customers
  • Orders
  • Products
  • Payments
  • Sellers
  • Reviews
  • Product Category Translations (MongoDB)
  • Order Payments (MySQL)

🧰 Tech Stack

Area            Tool / Technology
Languages       PySpark, SQL
Orchestration   Azure Data Factory
Storage         Azure Data Lake Storage Gen2
Processing      Azure Databricks, Google Colab
Data Sources    CSV (GitHub), MySQL, MongoDB
Data Warehouse  Azure Synapse Analytics

🧱 Architecture: Medallion Pattern

External Sources (GitHub CSVs, MySQL, MongoDB)
      |
[Azure Data Factory]
      |
ADLS Gen2 - Bronze (Raw Files)
      |
[Azure Databricks: PySpark ETL]
      |
ADLS Gen2 - Silver (Cleaned, Joined)
      |
[Azure Synapse: SQL CTAS, Views]
      |
ADLS Gen2 - Gold (Analytics-Ready Data)

⚙️ Pipeline Stages

🔹 1. Ingestion (Bronze Layer)

  • GitHub CSV Files:

    • Ingested via Data Factory using a parameterized pipeline.
    • Uses a JSON-based config and a Lookup + ForEach + Copy pattern.
  • MySQL (Order Payments):

    • Loaded using a custom Data Factory pipeline.
    • MySQL database hosted and accessed securely.
  • MongoDB (Product Category Names):

    • Loaded using a Colab notebook and pymongo (a minimal sketch follows this list).
    • Accessed directly from Azure Databricks during ETL.
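
The MongoDB pull can be reproduced in a few lines of Python. This is a minimal sketch, assuming a hosted cluster with the category translations in an "olist" database; the connection URI, database, and collection names below are placeholders rather than the project's actual values.

# Hypothetical sketch: pull the product category translations from MongoDB
# and stage them for the Bronze layer. URI, database, and collection names
# are placeholders.
import pandas as pd
from pymongo import MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster-host>/"           # placeholder

client = MongoClient(MONGO_URI)
collection = client["olist"]["product_category_name_translation"]       # assumed names

# Fetch every document, dropping MongoDB's internal _id field.
docs = list(collection.find({}, {"_id": 0}))
df_categories = pd.DataFrame(docs)

# Stage locally; in this project the file then lands in the Bronze container on ADLS Gen2.
df_categories.to_csv("product_category_name_translation.csv", index=False)
print(f"Staged {len(df_categories)} category translations")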

🔸 2. Transformation (Silver Layer)

  • All source files are read in Azure Databricks using PySpark.

  • Transformations include:

    • Schema harmonization
    • Date/time format standardization
    • Removal of nulls and duplicates
    • Merging/joining tables with orders as the central fact
    • MongoDB product categories are cleaned and matched to product data
  • Final intermediate dataset (df_final) is saved in Parquet format to the Silver layer (see the sketch below).
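
A condensed PySpark sketch of this step is shown below. It assumes a Databricks notebook with the Bronze CSVs already reachable over abfss://; the storage account, paths, and the subset of columns follow the public Olist schema and are illustrative, not the actual databricks_etl.ipynb code.

# Illustrative Silver-layer transformation (placeholder paths and storage account).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = "abfss://bronze@<storage-account>.dfs.core.windows.net"   # placeholder
silver = "abfss://silver@<storage-account>.dfs.core.windows.net"   # placeholder

orders    = spark.read.csv(f"{bronze}/olist_orders_dataset.csv", header=True, inferSchema=True)
customers = spark.read.csv(f"{bronze}/olist_customers_dataset.csv", header=True, inferSchema=True)

# Standardize timestamps and remove duplicates/nulls on the join keys.
orders_clean = (
    orders
    .withColumn("order_purchase_timestamp", F.to_timestamp("order_purchase_timestamp"))
    .dropDuplicates(["order_id"])
    .dropna(subset=["order_id", "customer_id"])
)

# Orders act as the central fact table; dimensions are joined onto it.
df_final = orders_clean.join(customers, on="customer_id", how="left")

df_final.write.mode("overwrite").parquet(f"{silver}/df_final")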


🟑 3. Modeling & Serving (Gold Layer)

  • Silver data is ingested into Azure Synapse SQL Pools.

  • SQL-based transformation logic is applied using:

    • CTAS (Create Table As Select)
    • Views for business logic and reporting KPIs
  • Final outputs are written back to ADLS Gen2 (Gold directory) in Parquet format (see the sketch below).
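
For illustration only, a CTAS statement of this kind can be submitted to a Synapse SQL pool from Python with pyodbc, as sketched below. The server, credentials, schemas, table, and column names are placeholders; writing the result back out to ADLS Gen2 as Parquet is typically done with the closely related CETAS (CREATE EXTERNAL TABLE AS SELECT) pattern.

# Hedged sketch: run a CTAS against a Synapse dedicated SQL pool via pyodbc.
# Server, database, credentials, schemas, and columns are placeholders.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<synapse-workspace>.sql.azuresynapse.net;"   # placeholder
    "Database=<sql_pool>;UID=<user>;PWD=<password>"      # placeholders
)

ctas_sql = """
CREATE TABLE gold.orders_enriched
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT order_id, customer_state, payment_type, price
FROM   silver.df_final;
"""

with pyodbc.connect(conn_str, autocommit=True) as conn:
    conn.execute(ctas_sql)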


📁 Project Directory Structure

/olist-ecommerce-pipeline/
├── notebooks/
│   ├── ingestion_github.ipynb
│   ├── ingestion_mysql_colab.ipynb
│   ├── ingestion_mongodb_colab.ipynb
│   ├── databricks_etl.ipynb
│   └── synapse_final_transforms.sql
├── pipeline_configs/
│   └── github_file_list.json
├── resources/
│   └── architecture_diagram.png (optional)
└── README.md

📈 Sample Business KPIs

These KPIs can be generated with SQL or PySpark, or served to a dashboarding tool like Power BI (an example aggregation follows the list):

  • 🛍 Total orders per state
  • ⏱ Average delivery time per seller
  • 💳 Distribution of payment methods
  • 🧾 Revenue per product category
  • ⭐ Average review score over time
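
As a taste of the logic behind these KPIs, the first one can be computed directly over the curated Parquet output with PySpark; the path and column names below are assumptions based on the public Olist schema, not the project's actual objects.

# Example KPI: total orders per customer state (path and columns are assumptions).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "abfss://gold@<storage-account>.dfs.core.windows.net/orders_enriched"   # placeholder
)

orders_per_state = (
    df.groupBy("customer_state")
      .agg(F.countDistinct("order_id").alias("total_orders"))
      .orderBy(F.desc("total_orders"))
)

orders_per_state.show(10)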

🔒 Security

  • All services deployed under the same Azure Resource Group.
  • Databricks-to-ADLS connection uses Azure App Registration (Service Principal OAuth); see the configuration sketch below.
  • Role-Based Access Control (RBAC) used for secure access between services.
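
For reference, the Service Principal (OAuth) connection from Databricks to ADLS Gen2 is typically wired up with Spark configuration like the following. The storage account, secret scope/key names, and tenant ID are placeholders, and the snippet assumes a Databricks notebook where spark and dbutils are predefined.

# Typical Service Principal (OAuth) configuration for ADLS Gen2 access from
# Databricks. All identifiers below are placeholders; secrets come from a
# Databricks secret scope rather than source code.
storage_account = "<storage-account>"                                    # placeholder
tenant_id       = "<tenant-id>"                                          # placeholder
client_id       = dbutils.secrets.get("kv-scope", "sp-client-id")        # assumed scope/keys
client_secret   = dbutils.secrets.get("kv-scope", "sp-client-secret")    # assumed scope/keys

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")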

🚀 Future Improvements

  • Add data quality checks (e.g., with Great Expectations)
  • CI/CD setup using GitHub Actions or Azure DevOps
  • Implement data partitioning for performance
  • Add streaming ingestion (Kafka/Event Hub)

🧾 Notebooks

All work is encapsulated in notebooks located under the /notebooks/ folder and can be run in Google Colab, Databricks, or Azure Synapse where applicable.


Author: Data Engineering Team
Note: External data links are used for ingestion simulation only and are not committed to this repository.

