
# PySpark ETL Pipeline

## Overview

A PySpark-based ETL pipeline that extracts transaction data from a MySQL database, cleans and transforms it, aggregates monthly sales per customer, and writes the processed data to an S3 bucket in Parquet format.

## Workflow

1. Extract → Fetches transaction data from MySQL over JDBC, using a partitioned read for performance.
2. Transform → Cleans the data by handling missing values and renaming columns.
3. Aggregate → Groups the data by customer_id and month to compute monthly sales per customer.
4. Load → Writes the aggregated dataset to an S3 bucket in Parquet format for efficient storage and retrieval.

Illustrative sketches of each step follow; column names, credentials, and paths are placeholders, not values taken from this repo.
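
A minimal sketch of the extract step, assuming a local MySQL instance, a `transactions` table, and a numeric `transaction_id` column to partition on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-sales-etl").getOrCreate()

# Partitioned JDBC read: Spark issues numPartitions parallel queries,
# splitting the partitionColumn range [lowerBound, upperBound] into slices.
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/sales_db")  # placeholder host/database
    .option("dbtable", "transactions")                      # placeholder table name
    .option("user", "etl_user")                             # placeholder credentials
    .option("password", "etl_password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("partitionColumn", "transaction_id")            # assumed numeric key
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
```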
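
The transform step could look like the following; the source column names (`cust_id`, `txn_amount`) are assumptions for illustration:

```python
# Rename columns to a consistent schema, then handle missing values.
cleaned = (
    transactions
    .withColumnRenamed("cust_id", "customer_id")    # assumed source column names
    .withColumnRenamed("txn_amount", "amount")
    .dropna(subset=["customer_id", "amount", "transaction_date"])  # drop rows missing key fields
)
```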
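
One way to express the aggregation, building on the `cleaned` frame from the previous step:

```python
from pyspark.sql import functions as F

# Derive a month key from the transaction date, then sum sales
# per (customer_id, month) pair.
monthly_sales = (
    cleaned
    .withColumn("month", F.date_format(F.col("transaction_date"), "yyyy-MM"))
    .groupBy("customer_id", "month")
    .agg(F.sum("amount").alias("monthly_sales"))
)
```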
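
And a sketch of the load step, writing Parquet to S3 via the s3a connector; the bucket name and prefix are placeholders, and the cluster needs the hadoop-aws connector plus AWS credentials configured:

```python
# Write Parquet to S3; partitioning by month keeps single-month reads cheap.
(
    monthly_sales.write
    .mode("overwrite")
    .partitionBy("month")                             # optional layout choice
    .parquet("s3a://example-bucket/monthly-sales/")   # placeholder bucket/prefix
)
```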

## Features

- ✅ Optimized Data Extraction → Uses a partitioned JDBC read to handle large tables efficiently.
- ✅ Modular & Scalable Design → Follows the Single Responsibility Principle (SRP) for maintainability.
- ✅ Error Handling & Logging → Surfaces failures with detailed logging (see the sketch below).
- ✅ Parallel Processing → Leverages Spark's distributed execution for high performance.
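
A hedged sketch of how the error handling and logging might be wired; the stage functions named in the comment are hypothetical, not this repo's actual module layout:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("pyspark-etl")

def run_pipeline() -> None:
    """Run extract → transform → aggregate → load with logging around the run."""
    try:
        logger.info("Starting ETL pipeline")
        # extract(), transform(), aggregate(), load() are hypothetical stage
        # functions; under SRP each would own exactly one responsibility.
        logger.info("ETL pipeline completed successfully")
    except Exception:
        logger.exception("ETL pipeline failed")
        raise
```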
