This project implements an end-to-end E-Commerce Analytics Platform using Databricks, PySpark, Delta Lake, and SQL.
The solution follows the Medallion Architecture (Bronze → Silver → Gold) to ingest, clean, transform, and analyze e-commerce sales data. The final business-ready datasets are visualized through Databricks Dashboards to provide actionable insights into customer behavior, sales performance, product performance, and regional trends.
The objective of this project is to build a scalable analytics platform capable of:
- Ingesting raw e-commerce datasets
- Performing data cleansing and transformation
- Creating analytical fact tables
- Generating business KPIs
- Supporting dashboard-based decision making
- Demonstrating modern Data Engineering best practices
Dataset Used:
Brazilian E-Commerce Public Dataset by Olist
The dataset contains information related to:
- Customers
- Orders
- Order Items
- Products
- Payments
- Reviews
- Sellers
- Product Categories
| Technology | Purpose |
|---|---|
| Databricks | Data Platform |
| PySpark | Data Processing |
| Delta Lake | Storage Layer |
| SQL | Data Analysis |
| Databricks Dashboard | Business Reporting |
| GitHub | Version Control |
Raw datasets were ingested into Delta tables without major transformations.
Tables:
- bronze.customers
- bronze.orders
- bronze.order_items
- bronze.products
- bronze.payments
- bronze.reviews
- bronze.sellers
- bronze.geolocation
- bronze.category_translation
Purpose:
- Preserve source data
- Enable traceability
- Support reprocessing
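The Bronze load described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the CSV file names follow the public Olist release, and the `raw_dir` location, the `_ingested_at` audit column, and the helper names are assumptions made for the example.

```python
# Bronze table name -> raw CSV file. File names follow the public Olist
# release and are an assumption; adjust them to the files actually uploaded.
BRONZE_SOURCES = {
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "products": "olist_products_dataset.csv",
    "payments": "olist_order_payments_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "category_translation": "product_category_name_translation.csv",
}


def source_path(raw_dir: str, table: str) -> str:
    """Build the full path of the raw CSV that feeds a Bronze table."""
    return f"{raw_dir.rstrip('/')}/{BRONZE_SOURCES[table]}"


def ingest_bronze(spark, raw_dir: str) -> None:
    """Load every raw CSV unchanged and save it as a Delta table in `bronze`."""
    # Imported lazily so the helpers above can be used without a Spark runtime.
    from pyspark.sql import functions as F

    spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
    for table in BRONZE_SOURCES:
        df = (
            spark.read
            .option("header", True)
            .option("multiLine", True)  # free-text review comments may span lines
            .option("escape", '"')
            .csv(source_path(raw_dir, table))  # no inferSchema: keep raw strings
            # Hypothetical audit column to support traceability.
            .withColumn("_ingested_at", F.current_timestamp())
        )
        df.write.format("delta").mode("overwrite").saveAsTable(f"bronze.{table}")
```

Reading every column as a string and deferring type casting to the Silver layer keeps the Bronze tables close to the source files, which is what makes reprocessing possible later.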
Data cleansing and enrichment were performed on the Bronze tables.
Examples:
- Null handling
- Type casting
- Date conversions
- Derived columns
- Data quality improvements
Tables:
- silver.customers
- silver.orders
- silver.order_items
- silver.products
- silver.payments
- silver.reviews
- silver.sellers
- silver.category_translation
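As an illustration of the cleansing steps listed above, the sketch below shows how `bronze.orders` might be turned into `silver.orders`. The column names come from the public Olist schema, and the derived columns `order_purchase_date` and `delivery_days` are hypothetical examples; the project's actual rules may differ.

```python
# Timestamp columns in the Olist orders file (assumed from the public schema).
ORDER_TIMESTAMP_COLUMNS = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]


def clean_orders(bronze_orders):
    """Return a cleaned orders DataFrame suitable for the Silver layer."""
    # Imported lazily so this module can be loaded without a Spark runtime.
    from pyspark.sql import functions as F

    # Null handling and de-duplication on the business key.
    df = (
        bronze_orders
        .filter(F.col("order_id").isNotNull())
        .dropDuplicates(["order_id"])
    )

    # Type casting / date conversions: Bronze stores these as strings.
    for column in ORDER_TIMESTAMP_COLUMNS:
        df = df.withColumn(column, F.to_timestamp(F.col(column)))

    return (
        df
        # Data quality: normalise the status text.
        .withColumn("order_status", F.lower(F.trim(F.col("order_status"))))
        # Derived columns (hypothetical names).
        .withColumn("order_purchase_date", F.to_date("order_purchase_timestamp"))
        .withColumn(
            "delivery_days",
            F.datediff("order_delivered_customer_date", "order_purchase_timestamp"),
        )
    )
```

On Databricks this could be applied with `clean_orders(spark.table("bronze.orders")).write.format("delta").mode("overwrite").saveAsTable("silver.orders")`.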
Business-ready analytical models.
The sales fact table contains sales transactions enriched with:
- Customer information
- Product information
- Payment information
- Revenue metrics
- Order details
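One way such a fact table could be assembled from the Silver tables is sketched below. The join keys and column names are taken from the public Olist schema; the grain (one row per order item), the `item_revenue` definition (price plus freight), and the `order_payment_value` column are assumptions for illustration only.

```python
# Columns kept in the fact table (hypothetical selection).
FACT_COLUMNS = [
    "order_id",
    "order_item_id",
    "order_purchase_timestamp",
    "order_status",
    "customer_unique_id",
    "customer_state",
    "product_id",
    "product_category_name_english",
    "price",
    "freight_value",
    "item_revenue",
    "order_payment_value",
]


def build_fact_sales(spark):
    """Join the Silver tables into one row per order item."""
    # Imported lazily so this module can be loaded without a Spark runtime.
    from pyspark.sql import functions as F

    # Collapse payments to one row per order so the join cannot fan out items.
    payments = (
        spark.table("silver.payments")
        .groupBy("order_id")
        .agg(F.sum("payment_value").alias("order_payment_value"))
    )

    return (
        spark.table("silver.order_items")
        .join(spark.table("silver.orders"), "order_id", "inner")
        .join(spark.table("silver.customers"), "customer_id", "left")
        .join(spark.table("silver.products"), "product_id", "left")
        .join(spark.table("silver.category_translation"), "product_category_name", "left")
        .join(payments, "order_id", "left")
        # Assumed revenue definition: item price plus freight.
        .withColumn("item_revenue", F.col("price") + F.col("freight_value"))
        .select(FACT_COLUMNS)
    )
```

Because `order_payment_value` is an order-level total repeated on every item row, it should not be summed across items; `item_revenue` is the additive measure in this sketch.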
Tracks revenue growth over time.
Table:
gold.kpi_monthly_revenue

Identifies top-performing customer regions.
Table:
gold.kpi_state_sales

Analyzes category-level sales performance.
Table:
gold.kpi_category_sales

Highlights best-selling products.
Table:
gold.kpi_top_products

Executive KPIs:
- Total Revenue
- Total Orders
- Total Customers
- Average Order Value (AOV)
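The four executive KPIs can be derived from a single aggregation over the fact table. The sketch below assumes a fact DataFrame with `item_revenue`, `order_id`, and `customer_unique_id` columns (hypothetical names) and takes AOV to mean total revenue divided by the number of distinct orders.

```python
def average_order_value(total_revenue: float, total_orders: int) -> float:
    """AOV = total revenue / number of orders (0.0 when there are no orders)."""
    return total_revenue / total_orders if total_orders else 0.0


def executive_kpis(fact_sales) -> dict:
    """Compute the headline KPIs from a fact DataFrame (column names assumed)."""
    # Imported lazily so average_order_value works without a Spark runtime.
    from pyspark.sql import functions as F

    row = fact_sales.agg(
        F.sum("item_revenue").alias("total_revenue"),
        F.countDistinct("order_id").alias("total_orders"),
        # In the Olist data customer_id is issued per order, so the
        # person-level key customer_unique_id is counted instead.
        F.countDistinct("customer_unique_id").alias("total_customers"),
    ).first()

    total_revenue = float(row["total_revenue"] or 0.0)
    total_orders = int(row["total_orders"])
    return {
        "total_revenue": total_revenue,
        "total_orders": total_orders,
        "total_customers": int(row["total_customers"]),
        "average_order_value": average_order_value(total_revenue, total_orders),
    }
```

For example, 300.0 in revenue across 4 orders gives an AOV of 75.0.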
Visualizations:
- Monthly Revenue Trend
- Revenue by State
- Revenue by Category
- Top Products Analysis
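Each of these charts reads from one of the Gold KPI tables. As a rough sketch of how those tables might be refreshed with Spark SQL, the helper below generates one aggregation query per table. The source table name `gold.fact_sales`, the `item_revenue` measure, and the grouping columns are assumptions; only the KPI table names come from this project.

```python
# KPI table -> (grouping expression, output column name). Grouping columns
# are assumed from the public Olist schema.
KPI_DIMENSIONS = {
    "gold.kpi_monthly_revenue": (
        "date_format(order_purchase_timestamp, 'yyyy-MM')",
        "order_month",
    ),
    "gold.kpi_state_sales": ("customer_state", "customer_state"),
    "gold.kpi_category_sales": (
        "product_category_name_english",
        "product_category_name_english",
    ),
    "gold.kpi_top_products": ("product_id", "product_id"),
}


def kpi_query(table: str, fact_table: str = "gold.fact_sales") -> str:
    """Build the Spark SQL statement that rebuilds one KPI table."""
    expr, alias = KPI_DIMENSIONS[table]
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT {expr} AS {alias}, "
        "ROUND(SUM(item_revenue), 2) AS revenue, "
        "COUNT(DISTINCT order_id) AS orders "
        f"FROM {fact_table} GROUP BY {expr}"
    )


def refresh_kpis(spark) -> None:
    """Rebuild every KPI table from the fact table."""
    for table in KPI_DIMENSIONS:
        spark.sql(kpi_query(table))
```

A dashboard query for the Top Products chart could then select from `gold.kpi_top_products` with `ORDER BY revenue DESC LIMIT 10`, which keeps the ranking cut-off in the visualization rather than in the stored table.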
Planned improvements:
- Incremental Data Loading
- Slowly Changing Dimension (SCD Type 2)
- dbt Integration
- Automated Data Quality Tests
- Workflow Orchestration
- CI/CD Pipeline
- Real-time Streaming Ingestion
Through this project I gained hands-on experience in:
- Medallion Architecture
- PySpark Data Transformations
- Delta Lake
- Fact Table Modeling
- KPI Development
- Dashboard Design
- GitHub Version Control
- End-to-End Data Engineering Workflows
Lakshika Bhagat
Data Analytics | Data Engineering | Python | PySpark | Delta Lake | Databricks | SQL | Git | GitHub