This project implements an end-to-end ETL pipeline in Databricks using PySpark and Spark SQL to analyze e-commerce order data. The pipeline follows a bronze–silver–gold architecture to ingest raw data, apply data quality checks and business transformations, and produce analytics-ready marketing and discount metrics.
The input dataset represents online store orders and includes fields such as:
- OrderID, Date, CustomerID
- Product, Quantity, UnitPrice, TotalPrice
- CouponCode, ReferralSource
- PaymentMethod, OrderStatus, TrackingNumber
Only CouponCode is allowed to contain null values.
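The README lists the field names but not their types, so the following is a minimal, illustrative schema sketch; the column types and the `order_schema` name are assumptions inferred from the field names, and the actual notebook relies on schema inference instead:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType
)

# Assumed types only; the notebook itself infers the schema from the CSV.
order_schema = StructType([
    StructField("OrderID", StringType(), nullable=False),
    StructField("Date", DateType(), nullable=False),
    StructField("CustomerID", StringType(), nullable=False),
    StructField("Product", StringType(), nullable=False),
    StructField("Quantity", IntegerType(), nullable=False),
    StructField("UnitPrice", DoubleType(), nullable=False),
    StructField("TotalPrice", DoubleType(), nullable=False),
    StructField("CouponCode", StringType(), nullable=True),  # only nullable field
    StructField("ReferralSource", StringType(), nullable=False),
    StructField("PaymentMethod", StringType(), nullable=False),
    StructField("OrderStatus", StringType(), nullable=False),
    StructField("TrackingNumber", StringType(), nullable=False),
])
```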
Bronze → Silver → Gold
- Bronze: Raw CSV ingestion with schema inference
- Silver: Cleaned and enriched data with derived columns and data quality enforcement
- Gold: Aggregated business metrics built using Spark SQL
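A minimal sketch of the bronze-to-silver flow, assuming a placeholder CSV path and illustrative derived columns (`HasCoupon`, `OrderMonth`); the real paths and transformations live in the notebook:

```python
from pyspark.sql import functions as F

# Bronze: raw CSV ingestion with schema inference (path is a placeholder).
bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/orders.csv")  # update to match your workspace
)

# Silver: enforce the null rule (only CouponCode may be null)
# and add example derived columns.
non_nullable = [c for c in bronze_df.columns if c != "CouponCode"]
silver_df = (
    bronze_df
    .dropna(subset=non_nullable)
    .withColumn("HasCoupon", F.col("CouponCode").isNotNull())
    .withColumn("OrderMonth", F.date_format(F.col("Date"), "yyyy-MM"))
)

# Register a view so the gold layer can query it with Spark SQL.
silver_df.createOrReplaceTempView("silver_orders")
```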
The gold-layer metrics below are built using Spark SQL on top of the silver layer:
Coupon Usage Metrics
- Total orders with coupons
- Coupon usage rate
- Average order value by coupon usage
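For example, the coupon usage metrics can be computed with a single Spark SQL aggregation over the silver view (the `silver_orders` view name follows the sketch above and is an assumption):

```python
coupon_metrics = spark.sql("""
    SELECT
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END) AS orders_with_coupon,
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END)
            / COUNT(*)                                     AS coupon_usage_rate,
        AVG(CASE WHEN CouponCode IS NOT NULL THEN TotalPrice END)
                                                           AS avg_order_value_with_coupon,
        AVG(CASE WHEN CouponCode IS NULL THEN TotalPrice END)
                                                           AS avg_order_value_without_coupon
    FROM silver_orders
""")
coupon_metrics.show()
```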
Referral & Marketing Attribution Metrics
- Total orders per referral source
- Total revenue per referral source
- Coupon usage rate per referral source
- Average order value per referral source
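Similarly, the referral and attribution metrics amount to a grouped aggregation; this sketch assumes the same `silver_orders` view and column names as above:

```python
referral_metrics = spark.sql("""
    SELECT
        ReferralSource,
        COUNT(*)                                           AS total_orders,
        SUM(TotalPrice)                                    AS total_revenue,
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END)
            / COUNT(*)                                     AS coupon_usage_rate,
        AVG(TotalPrice)                                    AS avg_order_value
    FROM silver_orders
    GROUP BY ReferralSource
    ORDER BY total_revenue DESC
""")
referral_metrics.show()
```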
Tech Stack
- Databricks
- PySpark
- Spark SQL
- Python
How to Run
- Open the notebook in Databricks
- Update the input file path if needed
- Run the cells from top to bottom
- Inspect the silver-layer data and gold-layer aggregations