This project implements an end-to-end ETL pipeline in Databricks using PySpark and Spark SQL to analyze e-commerce order data. The pipeline follows a bronze–silver–gold architecture to ingest raw data, apply data quality checks and business transformations, and produce analytics-ready marketing and discount metrics.
The input dataset represents online store orders and includes fields such as:
- OrderID, Date, CustomerID
- Product, Quantity, UnitPrice, TotalPrice
- CouponCode, ReferralSource
- PaymentMethod, OrderStatus, TrackingNumber
Only CouponCode is allowed to contain null values.
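The README lists the field names but not their types, so the following is a minimal, illustrative schema sketch; the column types and the `order_schema` name are assumptions inferred from the field names, and the actual notebook relies on schema inference instead:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType
)

# Assumed types only; the notebook itself infers the schema from the CSV.
order_schema = StructType([
    StructField("OrderID", StringType(), nullable=False),
    StructField("Date", DateType(), nullable=False),
    StructField("CustomerID", StringType(), nullable=False),
    StructField("Product", StringType(), nullable=False),
    StructField("Quantity", IntegerType(), nullable=False),
    StructField("UnitPrice", DoubleType(), nullable=False),
    StructField("TotalPrice", DoubleType(), nullable=False),
    StructField("CouponCode", StringType(), nullable=True),  # only nullable field
    StructField("ReferralSource", StringType(), nullable=False),
    StructField("PaymentMethod", StringType(), nullable=False),
    StructField("OrderStatus", StringType(), nullable=False),
    StructField("TrackingNumber", StringType(), nullable=False),
])
```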
Bronze → Silver → Gold
- Bronze: Raw CSV ingestion with schema inference
- Silver: Cleaned and enriched data with derived columns and data quality enforcement
- Gold: Aggregated business metrics built using Spark SQL
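A minimal sketch of the bronze-to-silver flow, assuming a placeholder CSV path and illustrative derived columns (`HasCoupon`, `OrderMonth`); the real paths and transformations live in the notebook:

```python
from pyspark.sql import functions as F

# Bronze: raw CSV ingestion with schema inference (path is a placeholder).
bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/orders.csv")  # update to match your workspace
)

# Silver: enforce the null rule (only CouponCode may be null)
# and add example derived columns.
non_nullable = [c for c in bronze_df.columns if c != "CouponCode"]
silver_df = (
    bronze_df
    .dropna(subset=non_nullable)
    .withColumn("HasCoupon", F.col("CouponCode").isNotNull())
    .withColumn("OrderMonth", F.date_format(F.col("Date"), "yyyy-MM"))
)

# Register a view so the gold layer can query it with Spark SQL.
silver_df.createOrReplaceTempView("silver_orders")
```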
The gold-layer metrics below are built using Spark SQL on top of the silver layer:
Coupon Usage Metrics
- Total orders with coupons
- Coupon usage rate
- Average order value by coupon usage
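For example, the coupon usage metrics can be computed with a single Spark SQL aggregation over the silver view (the `silver_orders` view name follows the sketch above and is an assumption):

```python
coupon_metrics = spark.sql("""
    SELECT
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END) AS orders_with_coupon,
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END)
            / COUNT(*)                                     AS coupon_usage_rate,
        AVG(CASE WHEN CouponCode IS NOT NULL THEN TotalPrice END)
                                                           AS avg_order_value_with_coupon,
        AVG(CASE WHEN CouponCode IS NULL THEN TotalPrice END)
                                                           AS avg_order_value_without_coupon
    FROM silver_orders
""")
coupon_metrics.show()
```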
Referral & Marketing Attribution Metrics
- Total orders per referral source
- Total revenue per referral source
- Coupon usage rate per referral source
- Average order value per referral source
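Similarly, the referral and attribution metrics amount to a grouped aggregation; this sketch assumes the same `silver_orders` view and column names as above:

```python
referral_metrics = spark.sql("""
    SELECT
        ReferralSource,
        COUNT(*)                                           AS total_orders,
        SUM(TotalPrice)                                    AS total_revenue,
        COUNT(CASE WHEN CouponCode IS NOT NULL THEN 1 END)
            / COUNT(*)                                     AS coupon_usage_rate,
        AVG(TotalPrice)                                    AS avg_order_value
    FROM silver_orders
    GROUP BY ReferralSource
    ORDER BY total_revenue DESC
""")
referral_metrics.show()
```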
Tech Stack
- Databricks
- PySpark
- Spark SQL
- Python
How to Run
- Open the notebook in Databricks
- Update the input file path if needed
- Run the cells from top to bottom
- Inspect the silver-layer data and gold-layer aggregations