This repository builds an ETL pipeline that loads JSON data from AWS S3, processes it into Parquet files, and saves them to another S3 bucket.
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its data warehouse to a data lake. You are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.
- etl.py reads data from S3, processes it using Spark, and writes the resulting tables back to S3 (see the sketch after this list)
- dl.cfg contains AWS credentials
- img folder contains an image used in this README
- data folder contains sample data for a quick preview of the functionality
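As a rough illustration, the core of what etl.py does can be sketched as the PySpark snippet below. The bucket names, JSON paths, and column choices here are assumptions for illustration, not taken from the actual script.

```python
# Minimal sketch of the etl.py flow, assuming the usual Sparkify layout
# (song_data/*/*/*/*.json paths and the column names are assumptions).
from pyspark.sql import SparkSession


def main():
    spark = (SparkSession.builder
             .appName("sparkify-etl-sketch")
             .getOrCreate())

    input_path = "s3a://<input-bucket>/"    # hypothetical input bucket
    output_path = "s3a://<output-bucket>/"  # your own output bucket

    # Read raw JSON song data from S3
    song_df = spark.read.json(input_path + "song_data/*/*/*/*.json")

    # Select the columns for the songs dimension and write it back to S3
    # as Parquet, partitioned by year and artist_id (a common choice).
    songs_table = (song_df
                   .select("song_id", "title", "artist_id", "year", "duration")
                   .dropDuplicates(["song_id"]))
    songs_table.write.mode("overwrite") \
               .partitionBy("year", "artist_id") \
               .parquet(output_path + "songs/")

    spark.stop()


if __name__ == "__main__":
    main()
```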
- In your AWS account, create an S3 bucket for the output data
- In the 'dl.cfg' file, fill in the input and output data paths as well as the AWS user credentials (the user should have full access to the S3 buckets that will be used); see the example below
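A 'dl.cfg' might look roughly like the following. The section and key names here are assumptions; match them to whatever etl.py actually reads.

```ini
# Hypothetical dl.cfg layout (section and key names are assumptions)
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>

[S3]
INPUT_DATA = s3a://<input-bucket>/
OUTPUT_DATA = s3a://<your-output-bucket>/
```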
In a terminal window, run the 'etl.py' file (e.g. python etl.py)
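One common way a script like etl.py can pick up the credentials from 'dl.cfg' before creating the Spark session is shown below. This is a sketch only; the section and key names are assumptions.

```python
# Load AWS credentials from dl.cfg and expose them as environment variables
# so the hadoop-aws connector used by Spark can reach S3.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```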
The resulting tables look like the following screenshots (a sketch of how one of them might be derived follows the list):
- Songs input file, songs table:
- Songs input file, artists table:
- Logs input file, users table:
- Logs input file, time table:
- Songplays table:
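For example, the users table above might be derived from the log data roughly as follows. The column names follow the usual Sparkify event schema and are assumptions here, not taken from etl.py.

```python
# Sketch: build the users dimension table from the raw log JSON and write it
# back to S3 as Parquet (bucket names and columns are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("users-table-sketch").getOrCreate()

log_df = spark.read.json("s3a://<input-bucket>/log_data/*/*/*.json")

users_table = (log_df
               .filter(log_df.page == "NextSong")
               .select("userId", "firstName", "lastName", "gender", "level")
               .dropDuplicates(["userId"]))

users_table.write.mode("overwrite").parquet("s3a://<output-bucket>/users/")
```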