Table of Contents
- Discuss the purpose of this database in the context of the startup, Sparkify, and their analytical goals.
- State and justify your database schema design and ETL pipeline.
- [Optional] Provide example queries and results for song play analysis.
A music streaming app startup "Sparkify" is collecting the data about user activity on their app. the user activities are collected as JSON logs on their stream server.
To understand the users demand and behaviour to improve the app and make it more user relevant, the analytics team wants to build a database to load these JSON user activity logs and able to write queries on adhoc basis to analyze the captured data.
we decided to build the database using PostgreSQL. As it is open source Database aswell and the requirement to satisfy the adhoc analytical queries and aggregation requirement can be best supported by a RDBM database
- conceptual modelling As part of the data modelling excersie we identified following entities to be required
- Songs
- users
- Artists
- time
Then we decided to build the data model using star schema as this is best model faster analytical capabilities for adhoc queries and also best suited for analysis.
we will use Python to build our ETL pipeline as this is new language which provides lot of functionalities to read data and also integrate in future into any orchestration tool like airflow
- go to terminal run the below command to create database and required tables


