End-to-End Data Engineering Pipeline

Project Overview

This project implements an end-to-end data engineering pipeline that processes raw data and converts it into analytics-ready datasets. The pipeline uses Apache Spark for distributed data processing, Apache Airflow for workflow orchestration, Delta Lake for reliable storage, and Docker for environment setup.

The system automatically ingests raw datasets, performs data cleaning and transformations, and stores the processed data in Delta format.

Architecture

Raw Data ↓ Input Directory ↓ Airflow DAG (Scheduler / Event Trigger) ↓ Spark Job (PySpark) ↓ Data Cleaning & Transformation ↓ Delta Lake Storage ↓ Processed Dataset for Analytics

System Design (Phase 1)

Components

Data source (CSV files): Raw files land in data/input/ (bind-mounted into containers).
Orchestrator (Apache Airflow):
- Scheduled pipeline: runs daily at 5:00 AM to process the latest input data.
- Event-driven pipeline: watches the input directory and triggers processing when new files arrive.
- Airflow UI + logs: provides run history, task retries, and debugging output.
Processing engine (Apache Spark / PySpark):
- Reads input CSVs, performs cleaning/transformation, and writes results.
- Can run in a distributed mode (Spark master/workers) inside Docker.
Storage layer (Delta Lake):
- Output is written in Delta format to data/output/ to provide ACID transactions, consistent reads, and scalable file-based storage.

Data Flow

Ingest: CSV files are placed into data/input/.
Trigger:
- Daily DAG triggers at 5:00 AM, or
- Event DAG triggers when new files appear in data/input/.
Process (Spark job):
- Read CSVs (header + schema inference)
- Remove duplicates
- Handle missing/null values
- Add processing_timestamp
Persist (Delta): Write the curated dataset as a Delta table under data/output/ using overwrite mode.
Consume: Analytics tools / downstream jobs read from the Delta output.

Reliability & Idempotency

Idempotent writes: The Spark job writes with overwrite, so reruns produce the same deterministic output for the same input snapshot.
Airflow retries: transient failures are handled via task retries (configured in DAGs), with full logs available in the UI.
Delta Lake guarantees: Delta provides transactional consistency and protects against partial writes.

Scalability

Horizontal scaling: Spark can scale out via worker containers to handle larger datasets.
Schema handling: Schema inference is enabled for quick prototyping; schema enforcement/evolution can be added as an enhancement for production hardening.

Observability

Airflow UI: run status, retries, task logs.
Spark logs/UI: job stages, performance metrics, executor details (when enabled in the Spark setup).

Technology Stack

Python
Apache Spark (PySpark)
Apache Airflow
Delta Lake
Docker

Project Structure

data-engineering-pipeline-hackathon
│
├── dags
│   ├── daily_spark_pipeline.py
│   └── event_trigger_pipeline.py
│
├── spark_jobs
│   └── process_data.py
│
├── data
│   ├── input
│   └── output
│
├── docker
│   └── docker-compose.yml
│
├── requirements.txt
└── README.md

Setup Instructions

1. Clone Repository

git clone https://github.com/ShreyasDamle2805/data-engineering-pipeline-hackathon.git
cd data-engineering-pipeline-hackathon

2. Start Docker Environment

docker compose up

3. Install Python Dependencies

pip install -r requirements.txt

Running the Pipeline

Place the dataset inside:

data/input/

Open Airflow UI

http://localhost:8081

Enable the DAG:

daily_spark_pipeline
event_trigger_pipeline

The Spark job will run and process the data.

Data Processing Steps

The Spark job performs the following transformations:

Removes duplicate records
Handles missing values
Adds a processing timestamp column
Writes processed data in Delta Lake format

Output

The processed dataset is stored in:

data/output/processed/

The output follows the Delta Lake structure.

Optional Enhancements

Additional improvements that can be implemented:

Schema evolution
Data partitioning
Data quality validation
Logging and monitoring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

End-to-End Data Engineering Pipeline

Project Overview

Architecture

Raw Data ↓ Input Directory ↓ Airflow DAG (Scheduler / Event Trigger) ↓ Spark Job (PySpark) ↓ Data Cleaning & Transformation ↓ Delta Lake Storage ↓ Processed Dataset for Analytics

System Design (Phase 1)

Components

Data Flow

Reliability & Idempotency

Scalability

Observability

Technology Stack

Project Structure

Setup Instructions

1. Clone Repository

2. Start Docker Environment

3. Install Python Dependencies

Running the Pipeline

Data Processing Steps

Output

Optional Enhancements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
dags		dags
data		data
docker		docker
spark_jobs		spark_jobs
.gitignore		.gitignore
README.md		README.md
architecture.png		architecture.png
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

End-to-End Data Engineering Pipeline

Project Overview

Architecture

Raw Data ↓ Input Directory ↓ Airflow DAG (Scheduler / Event Trigger) ↓ Spark Job (PySpark) ↓ Data Cleaning & Transformation ↓ Delta Lake Storage ↓ Processed Dataset for Analytics

System Design (Phase 1)

Components

Data Flow

Reliability & Idempotency

Scalability

Observability

Technology Stack

Project Structure

Setup Instructions

1. Clone Repository

2. Start Docker Environment

3. Install Python Dependencies

Running the Pipeline

Data Processing Steps

Output

Optional Enhancements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages