This repository builds an ETL pipeline that loads JSON data from AWS S3, processes it into Parquet files, and saves them to another S3 bucket.
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its data warehouse to a data lake. You are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.
- etl.py reads data from S3, processes it using Spark, and writes the resulting tables back to S3 (see the sketch after this list)
- dl.cfg contains AWS credentials
- img folder contains an image used in this README
- data folder contains sample data for a quick preview of the functionality
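As a rough illustration, the core of what etl.py does can be sketched as the PySpark snippet below. The bucket names, JSON paths, and column choices here are assumptions for illustration, not taken from the actual script.

```python
# Minimal sketch of the etl.py flow, assuming the usual Sparkify layout
# (song_data/*/*/*/*.json paths and the column names are assumptions).
from pyspark.sql import SparkSession


def main():
    spark = (SparkSession.builder
             .appName("sparkify-etl-sketch")
             .getOrCreate())

    input_path = "s3a://<input-bucket>/"    # hypothetical input bucket
    output_path = "s3a://<output-bucket>/"  # your own output bucket

    # Read raw JSON song data from S3
    song_df = spark.read.json(input_path + "song_data/*/*/*/*.json")

    # Select the columns for the songs dimension and write it back to S3
    # as Parquet, partitioned by year and artist_id (a common choice).
    songs_table = (song_df
                   .select("song_id", "title", "artist_id", "year", "duration")
                   .dropDuplicates(["song_id"]))
    songs_table.write.mode("overwrite") \
               .partitionBy("year", "artist_id") \
               .parquet(output_path + "songs/")

    spark.stop()


if __name__ == "__main__":
    main()
```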
- In your AWS account, create an S3 bucket for the output data
- In the 'dl.cfg' file, fill in the input and output data paths as well as the AWS user credentials (the user should have full access to the S3 buckets that will be used); see the example below
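A 'dl.cfg' might look roughly like the following. The section and key names here are assumptions; match them to whatever etl.py actually reads.

```ini
# Hypothetical dl.cfg layout (section and key names are assumptions)
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>

[S3]
INPUT_DATA = s3a://<input-bucket>/
OUTPUT_DATA = s3a://<your-output-bucket>/
```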
In a terminal window, run the 'etl.py' file (e.g. python etl.py)
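One common way a script like etl.py can pick up the credentials from 'dl.cfg' before creating the Spark session is shown below. This is a sketch only; the section and key names are assumptions.

```python
# Load AWS credentials from dl.cfg and expose them as environment variables
# so the hadoop-aws connector used by Spark can reach S3.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```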
The resulting tables look like the following screenshots (a sketch of how one of them might be derived follows the list):
- Songs input file, songs table:
- Songs input file, artists table:
- Logs input file, users table:
- Logs input file, time table:
- Songplays table:
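For example, the users table above might be derived from the log data roughly as follows. The column names follow the usual Sparkify event schema and are assumptions here, not taken from etl.py.

```python
# Sketch: build the users dimension table from the raw log JSON and write it
# back to S3 as Parquet (bucket names and columns are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-table-sketch").getOrCreate()

log_df = spark.read.json("s3a://<input-bucket>/log_data/*/*/*.json")

users_table = (log_df
               .filter(log_df.page == "NextSong")
               .select("userId", "firstName", "lastName", "gender", "level")
               .dropDuplicates(["userId"]))

users_table.write.mode("overwrite").parquet("s3a://<output-bucket>/users/")
```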