This project showcases a full end-to-end data engineering pipeline built using the Brazilian Olist e-commerce dataset, deployed using modern data tools and cloud services. It is structured around the Medallion Architecture to simulate production-grade pipelines and transformations.
The pipeline ingests, cleans, transforms, and models data from multiple sources into an analytics-ready format. It mirrors real-world complexity by integrating structured and semi-structured data from cloud storage, relational databases, and NoSQL stores.
The dataset is the popular Brazilian e-commerce dataset published by Olist, spread across multiple source files and tables representing:
- Customers
- Orders
- Products
- Payments
- Sellers
- Reviews
- Product Category Translations (MongoDB)
- Order Payments (MySQL)
| Area | Tool / Technology |
|---|---|
| Languages | PySpark, SQL |
| Orchestration | Azure Data Factory |
| Storage | Azure Data Lake Storage Gen2 |
| Processing | Azure Databricks, Google Colab |
| Data Sources | CSV (GitHub), MySQL, MongoDB |
| Data Warehouse | Azure Synapse Analytics |
External Sources (GitHub CSVs, MySQL, MongoDB)
|
[Azure Data Factory]
|
ADLS Gen2 - Bronze (Raw Files)
|
[Azure Databricks: PySpark ETL]
|
ADLS Gen2 - Silver (Cleaned, Joined)
|
[Azure Synapse: SQL CTAS, Views]
|
ADLS Gen2 - Gold (Analytics-Ready Data)
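For concreteness, the three ADLS Gen2 layers in the flow above could map to storage paths along these lines (a sketch only; the container and storage account names are assumptions):

```python
# Illustrative mapping of the Medallion layers to ADLS Gen2 paths used in the
# sketches below; container and storage account names are placeholders.
STORAGE_ACCOUNT = "<storage_account>"

LAYER_PATHS = {
    "bronze": f"abfss://bronze@{STORAGE_ACCOUNT}.dfs.core.windows.net",  # raw files
    "silver": f"abfss://silver@{STORAGE_ACCOUNT}.dfs.core.windows.net",  # cleaned, joined
    "gold":   f"abfss://gold@{STORAGE_ACCOUNT}.dfs.core.windows.net",    # analytics-ready
}
```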
- GitHub CSV Files:
  - Ingested via Data Factory using a parameterized pipeline.
  - Uses a JSON-based config (`pipeline_configs/github_file_list.json`) and a Lookup + ForEach + Copy pattern.
- MySQL (Order Payments):
  - Loaded using a custom Data Factory pipeline.
  - MySQL database hosted and accessed securely.
- MongoDB (Product Category Names):
  - Loaded using a Colab notebook and `pymongo` (see the sketch after this list).
  - Accessed directly from Azure Databricks during ETL.
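For reference, a minimal sketch of the `pymongo` load step is shown below. The connection string, database, and collection names are placeholders, not the project's actual values.

```python
# Minimal sketch of the MongoDB ingestion step (Colab / Databricks).
# The connection string, database, and collection names are placeholders.
import pandas as pd
from pymongo import MongoClient

MONGO_URI = "mongodb+srv://<user>:<password>@<cluster>/"  # hypothetical connection string

client = MongoClient(MONGO_URI)
db = client["olist"]                              # assumed database name
collection = db["product_category_translation"]  # assumed collection name

# Pull the category translation documents and drop Mongo's internal _id field.
docs = list(collection.find({}, {"_id": 0}))
category_df = pd.DataFrame(docs)
print(category_df.head())
```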
- All source files are read in Azure Databricks using PySpark.
- Transformations include:
  - Schema harmonization
  - Date/time format standardization
  - Removal of nulls and duplicates
  - Merging/joining tables with `orders` as the central fact table
  - MongoDB product categories are cleaned and matched to product data
- The final intermediate dataset (`df_final`) is saved in Parquet format to the Silver layer (a condensed sketch follows this list).
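A condensed sketch of this PySpark flow is shown below; the storage paths, file names, and column choices are illustrative assumptions, and the full logic lives in `notebooks/databricks_etl.ipynb`.

```python
# Condensed sketch of the Databricks ETL step (run in a Databricks notebook,
# where `spark` is available). Storage account, container names, and file
# names are assumptions based on the Olist dataset.
from pyspark.sql import functions as F

bronze = "abfss://bronze@<storage_account>.dfs.core.windows.net"
silver = "abfss://silver@<storage_account>.dfs.core.windows.net"

# Read raw CSVs from the Bronze layer.
orders = spark.read.csv(f"{bronze}/olist_orders_dataset.csv", header=True, inferSchema=True)
customers = spark.read.csv(f"{bronze}/olist_customers_dataset.csv", header=True, inferSchema=True)
payments = spark.read.csv(f"{bronze}/olist_order_payments_dataset.csv", header=True, inferSchema=True)

# Basic cleaning: standardize timestamps, drop nulls on keys, deduplicate.
orders_clean = (
    orders
    .withColumn("order_purchase_timestamp", F.to_timestamp("order_purchase_timestamp"))
    .dropna(subset=["order_id", "customer_id"])
    .dropDuplicates(["order_id"])
)

# Aggregate payments to one row per order before joining.
payments_agg = payments.groupBy("order_id").agg(F.sum("payment_value").alias("payment_value"))

# Join with orders as the central fact table.
df_final = (
    orders_clean
    .join(customers, "customer_id", "left")
    .join(payments_agg, "order_id", "left")
)

# Persist the cleaned, joined dataset to the Silver layer as Parquet.
df_final.write.mode("overwrite").parquet(f"{silver}/df_final")
```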
- Silver data is ingested into Azure Synapse SQL Pools.
- SQL-based transformation logic is applied using:
  - CTAS (Create Table As Select)
  - Views for business logic and reporting KPIs
- Final outputs are written back to ADLS Gen2 (Gold directory) in Parquet format (see the sketch after this list).
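The kind of CTAS and view logic involved is sketched below, wrapped in `pyodbc` so it can be submitted from Python; in the project the actual statements live in `notebooks/synapse_final_transforms.sql`, and all server, schema, table, and column names here are placeholders.

```python
# Illustrative only: submitting CTAS/view statements to a Synapse SQL pool
# from Python. Server, credentials, schema, and column names are placeholders;
# the project's real SQL lives in synapse_final_transforms.sql.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;"
    "DATABASE=<sql_pool>;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# CTAS: materialize a curated fact table from the Silver data.
cursor.execute("""
CREATE TABLE gold.fact_orders
WITH (DISTRIBUTION = HASH(order_id))
AS
SELECT order_id, customer_id, customer_state,
       order_status, order_purchase_timestamp, payment_value
FROM   silver.df_final;
""")

# View encapsulating one reporting KPI (total orders per state).
cursor.execute("""
CREATE VIEW gold.vw_orders_per_state AS
SELECT customer_state, COUNT(DISTINCT order_id) AS total_orders
FROM   gold.fact_orders
GROUP BY customer_state;
""")
```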
/olist-ecommerce-pipeline/
├── notebooks/
│   ├── ingestion_github.ipynb
│   ├── ingestion_mysql_colab.ipynb
│   ├── ingestion_mongodb_colab.ipynb
│   ├── databricks_etl.ipynb
│   └── synapse_final_transforms.sql
├── pipeline_configs/
│   └── github_file_list.json
├── resources/
│   └── architecture_diagram.png (optional)
└── README.md
These KPIs can be generated via SQL or connected to a dashboarding tool like Power BI (an example query is shown after this list):
- Total orders per state
- Average delivery time per seller
- Distribution of payment methods
- Revenue per product category
- Average review score over time
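As an illustration, one of these KPIs (total orders per state) might be computed in PySpark over the curated Parquet output; the path and column names below are assumptions based on the Olist schema.

```python
# Hypothetical example of one KPI (total orders per state) computed with
# PySpark over the Gold Parquet output; path and column names are assumptions.
from pyspark.sql import functions as F

gold = "abfss://gold@<storage_account>.dfs.core.windows.net"
df = spark.read.parquet(f"{gold}/fact_orders")

orders_per_state = (
    df.groupBy("customer_state")
      .agg(F.countDistinct("order_id").alias("total_orders"))
      .orderBy(F.desc("total_orders"))
)
orders_per_state.show()
```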
- All services deployed under the same Azure Resource Group.
- Databricks-to-ADLS connection uses Azure App Registration (Service Principal OAuth); a configuration sketch follows this list.
- Role-Based Access Control (RBAC) used for secure access between services.
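A sketch of that Service Principal OAuth setup in a Databricks notebook is shown below; the secret scope and key names are assumptions, and in practice the credentials would come from Azure Key Vault or a Databricks secret scope rather than literals.

```python
# Sketch of the Databricks-to-ADLS OAuth setup with a service principal.
# The secret scope and key names below are assumptions.
storage_account = "<storage_account>"
client_id = dbutils.secrets.get(scope="olist", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="olist", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="olist", key="sp-tenant-id")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```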
- Add Data Quality checks (e.g. with Great Expectations)
- CI/CD setup using GitHub Actions or Azure DevOps
- Implement data partitioning for performance (see the sketch after this list)
- Add streaming ingestion (Kafka/Event Hub)
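As a rough sketch of the partitioning idea, the Silver output could be rewritten partitioned by purchase year and month so downstream readers can prune files; the path and column names are assumptions.

```python
# Hypothetical partitioning improvement: rewrite the Silver dataset
# partitioned by purchase year and month. Path and column names are assumed.
from pyspark.sql import functions as F

silver = "abfss://silver@<storage_account>.dfs.core.windows.net"
df_final = spark.read.parquet(f"{silver}/df_final")  # assumed Silver output path

(df_final
    .withColumn("order_year", F.year("order_purchase_timestamp"))
    .withColumn("order_month", F.month("order_purchase_timestamp"))
    .write.mode("overwrite")
    .partitionBy("order_year", "order_month")
    .parquet(f"{silver}/df_final_partitioned"))
```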
All work is encapsulated in notebooks located under the /notebooks/ folder and can be run in Google Colab, Databricks, or Azure Synapse where applicable.
Author: Data Engineering Team
Note: External data links used for ingestion simulation only. Not committed to this repository.