Building An Azure Data Lakehouse for Bike Share Data Analytics

Project Overview

Divvy is a bike-sharing program in Chicago, Illinois, USA. Riders purchase a pass at a kiosk or use a mobile application to unlock a bike at a station in the city and use it for a specified amount of time. Bikes can be returned to the same station or to another one. The City of Chicago makes the anonymized trip data publicly available for analysis projects like this one.

Because the Divvy data are anonymous, fake rider and account profiles, along with fake payment data, have been generated to accompany them. The dataset looks like this:

[Image: dataset overview]

Business Requirements

Analyze the duration of each ride:
- Based on date and time factors such as day of the week and time of day
- Based on the starting and/or ending station
- Based on the rider's age at the time of the ride
- Based on whether the rider is a member or casual rider
Analyze the cost:
- Per month, quarter, and year
- Per member, based on the rider's age at account start
Analyze the cost per member:
- Based on the number of rides the rider averages per month
- Based on the number of minutes the rider spends on a bike per month
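
The ride-level requirements above mostly come down to a few derived values per trip: the ride duration, the day of the week and hour of the start time, and the rider's age on the ride date. The following is a minimal sketch in plain Python of how those values could be computed. The function names and the example rider are hypothetical and are not taken from the project's notebooks, which run on Azure Databricks.

```python
from datetime import date, datetime

# Day names indexed by datetime.weekday() (0 = Monday), kept explicit to
# avoid any dependence on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def ride_duration_minutes(started_at: datetime, ended_at: datetime) -> float:
    """Return the length of a ride in minutes."""
    return (ended_at - started_at).total_seconds() / 60.0


def age_on(birthday: date, on: date) -> int:
    """Return a rider's age in whole years on the given date."""
    had_birthday = (on.month, on.day) >= (birthday.month, birthday.day)
    return on.year - birthday.year - (0 if had_birthday else 1)


def day_of_week(ts: datetime) -> str:
    """Return the English day name for a timestamp."""
    return DAY_NAMES[ts.weekday()]


def ride_features(started_at: datetime, ended_at: datetime, birthday: date) -> dict:
    """Collect the derived values used to slice ride duration."""
    return {
        "duration_minutes": ride_duration_minutes(started_at, ended_at),
        "day_of_week": day_of_week(started_at),
        "hour_of_day": started_at.hour,
        "rider_age": age_on(birthday, started_at.date()),
    }


# Hypothetical ride: 08:30:00 to 08:52:30 on 14 June 2021.
example = ride_features(
    datetime(2021, 6, 14, 8, 30),
    datetime(2021, 6, 14, 8, 52, 30),
    date(1990, 9, 1),
)
```

Computing the age against the ride date, rather than today's date, is what makes "age at the time of the ride" stable when the analysis is rerun later.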

Technology Stack

Azure Databricks
Azure Data Lake Storage Gen2
Azure Key Vault
Azure Active Directory
Azure Storage Explorer

Solution Architecture

Dataflow

Transfer on-premises raw files to the landing zone container inside Azure Data Lake Storage Gen2 using Azure Storage Explorer.
Ingest and process the raw data with Azure Databricks.
Use a medallion architecture for storage that organizes data into layers:
- Bronze: Holds the raw data, stamped with its ingestion time, in Parquet format.
- Silver: Contains cleaned and filtered data in Delta format.
- Gold: Stores aggregated data for business analytics in Delta format.
Answer the business requirements with queries in Azure Databricks SQL Analytics.
For data governance:
- An Azure AD application gives Azure Databricks access to Azure Data Lake Storage Gen2 through a service principal.
- Azure Key Vault securely manages secrets, keys, and certificates.
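
The storage layout and the service principal access described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual setup code: the storage account name, the container names, and the helper function names are hypothetical. The configuration keys are the standard Hadoop ABFS OAuth settings for client-credential authentication.

```python
# Hypothetical layer-to-format mapping, following the dataflow above.
LAYER_FORMATS = {
    "landing": "raw files",
    "bronze": "parquet",
    "silver": "delta",
    "gold": "delta",
}


def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path in an ADLS Gen2 container."""
    host = f"{storage_account}.dfs.core.windows.net"
    return f"abfss://{container}@{host}/{path.lstrip('/')}"


def service_principal_conf(storage_account: str, client_id: str,
                           client_secret: str, tenant_id: str) -> dict:
    """Return the Spark settings for OAuth access via a service principal."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder values only. In a Databricks notebook the real values would
# typically be read from a Key Vault-backed secret scope with
# dbutils.secrets.get(scope=..., key=...), and each pair applied with
# spark.conf.set(key, value).
conf = service_principal_conf("examplelake", "<client-id>",
                              "<client-secret>", "<tenant-id>")
bronze_trips = abfss_uri("bronze", "examplelake", "/trips")
```

Keeping the client secret in Key Vault means the notebooks only ever reference a secret name, never the secret value itself.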

STAR Schema Design - Gold layer

The star schema consists of two fact tables (Fact Trip and Fact Payment) and three dimension tables (Dim Calendar, Dim Rider, Dim Station). The star schema diagram is also available as a PDF in the repository.
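
To illustrate how a calendar dimension supports the per-month, per-quarter, and per-year requirements, here is a minimal sketch that generates one row per day. The column names and the integer `date_key` format are assumptions made for this example; the project's actual Dim Calendar columns are defined in its star schema diagram.

```python
from datetime import date, timedelta


def calendar_rows(start: date, end: date) -> list:
    """Generate one calendar-dimension row per day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20210614
            "date": current,
            "day_of_week": current.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "month": current.month,
            "quarter": (current.month - 1) // 3 + 1,
            "year": current.year,
        })
        current += timedelta(days=1)
    return rows


# One row for every day of 2021.
dim_calendar = calendar_rows(date(2021, 1, 1), date(2021, 12, 31))
```

A fact row then only needs to carry a `date_key`; grouping payments by quarter or rides by day of the week becomes a join against this table instead of repeated date arithmetic.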

Folders and files

- _src
- analyse
- images
- ingest_and_load
- pdf
- setup
- transform
- .gitignore
- README.md
- bike-share-project.dbc

fabiansum/bike-share-analytics-data-lakehouse