This project builds a complete end-to-end Azure Data Engineering solution, covering Data Ingestion, Data Transformation, and Data Loading through to Data Visualization.
The tech stack used includes the following resources:
- Azure Data Factory: This resource was used for Data Ingestion in this project by building data pipelines. Azure Data Factory Documentation
- Azure Data Lake Storage Gen2: Used to store the data at each stage, in Parquet (raw) and Delta (transformed) format respectively. Azure Data Lake Storage Gen2 Documentation
- Azure Databricks: For data transformation using Python. Azure Databricks Documentation
- Azure Synapse Analytics: For data warehousing (a serverless database was chosen over a dedicated one because of the small volume of data) and for querying the data. Azure Synapse Analytics Documentation
- Azure Key Vault: Used to store and manage sensitive information such as secrets, keys, and certificates, e.g. the database username and password and the Databricks workspace access token. Azure Key Vault Documentation
- Microsoft Entra ID (previously known as Azure Active Directory): Used for governance and user management of the resource group by creating security groups. Microsoft Entra ID Documentation
- Microsoft Power BI: For data modeling and visualization. Microsoft Power BI Documentation
Prerequisites
- You need a Microsoft Azure account.
- You need an Azure subscription or an Azure for Students subscription.
Project Architecture
Note: First, you need to create a resource group in Azure that houses all the resources listed above (aside from Microsoft Power BI).
Pipeline 1
Pipeline 2
The Data Ingestion process starts by connecting to the data source (the AdventureWorksLT2017 database, which you can download here) with the help of the Microsoft Integration Runtime.
Note: Upon creating the Azure Data Factory resource, an AutoResolve integration runtime is created automatically, which you can use to connect to other cloud resources such as Azure Synapse Analytics. To connect to an on-premises data source, however, you need the Microsoft Integration Runtime (a self-hosted integration runtime).
- The first step in this data pipeline uses the Lookup activity to find all the tables belonging to a particular schema by running the SQL query below.
```sql
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE s.name = 'SalesLT'
```
- The second step uses the ForEach activity to dynamically run a SELECT * against each table returned by the Lookup activity and copy the data into the as-is-staging container in Azure Data Lake Storage Gen2 (in Parquet format).
Note: To mount Azure Data Lake Storage Gen2 to DBFS, credential passthrough was used. You can read more on this here.
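As an illustration, a minimal Databricks notebook sketch of such a mount might look like the following. The storage account name (mystorageaccount) and mount point are placeholders, and spark and dbutils are the objects Databricks provides in every notebook.

```python
# Sketch: mount the as-is-staging container on ADLS Gen2 to DBFS using
# Azure AD credential passthrough. Run inside an Azure Databricks notebook.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

# "mystorageaccount" is a placeholder; the call fails if the path is already mounted.
dbutils.fs.mount(
    source="abfss://as-is-staging@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/as-is-staging",
    extra_configs=configs,
)
```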
- The third step uses an Azure Databricks notebook to perform the first-level transformation, which uses Python to convert all the date columns to a date format and stores the result in the firstleveltransformation container in Azure Data Lake Storage Gen2 (in Delta format). A sketch of this transformation is shown below.
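A minimal PySpark sketch of this kind of notebook, assuming hypothetical mount paths and a single table for illustration (the real notebook would loop over every table):

```python
from pyspark.sql.functions import col
from pyspark.sql.types import DateType, TimestampType

# Read one raw table from the staging container (path/table are placeholders).
df = spark.read.parquet("/mnt/as-is-staging/SalesLT/Address")

# Cast every timestamp/date-typed column to a plain date.
for field in df.schema.fields:
    if isinstance(field.dataType, (TimestampType, DateType)):
        df = df.withColumn(field.name, col(field.name).cast("date"))

# Write the result to the first-level container in Delta format.
df.write.format("delta").mode("overwrite").save(
    "/mnt/firstleveltransformation/SalesLT/Address"
)
```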
- The fourth step uses an Azure Databricks notebook to perform the second-level transformation, which uses Python to convert all the column headers from PascalCase to snake_case and stores the result in the secondleveltransformation container in Azure Data Lake Storage Gen2 (in Delta format). A sketch of this renaming step is shown below.
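A minimal PySpark sketch of the renaming notebook, again assuming placeholder mount paths and a simple PascalCase-to-snake_case rule:

```python
import re

# Read a first-level Delta table (path is a placeholder).
df = spark.read.format("delta").load("/mnt/firstleveltransformation/SalesLT/Address")

def to_snake_case(name: str) -> str:
    """Turn a PascalCase column name such as 'ModifiedDate' into 'modified_date'."""
    # Insert an underscore only between a lowercase/digit and an uppercase letter,
    # so runs of capitals like 'ID' stay together ('SalesOrderID' -> 'sales_order_id').
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Rename every column, then write the result to the second-level container in Delta format.
for old_name in df.columns:
    df = df.withColumnRenamed(old_name, to_snake_case(old_name))

df.write.format("delta").mode("overwrite").save(
    "/mnt/secondleveltransformation/SalesLT/Address"
)
```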
Note: The first container in Azure Data Lake Storage Gen2 holds the data in Parquet format because that is the raw format coming from the data source, while the other two containers store the transformed data in Delta format because Delta keeps version history.
- In this step the data is loaded from the secondleveltransformation container in Azure Data Lake Storage Gen2 into Azure Synapse Analytics (serverless database) using a stored procedure that dynamically creates a view for each table.
- Using the Azure Synapse Analytics connector in Power BI, the data loaded into Azure Synapse Analytics is imported into Power BI to build simple interactive dashboards that provide real-time insights. You can view the dashboard here.
Dashboard
- Data Modeling is done in Power BI to communicate connections/relationships between data points and structures.
Data Model
- Microsoft Entra ID helps secure the data platform and implement governance by creating security groups, so that specific roles can be granted to those groups.
- Azure Key Vault helps to store and manage sensitive information such as secrets, keys, and certificates, as sketched below.
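For example, assuming the project wires Key Vault to Databricks through a Key Vault-backed secret scope (the scope and secret names below are placeholders), a notebook could read the database credentials like this instead of hard-coding them:

```python
# Sketch: fetch the SQL Server credentials from a Key Vault-backed Databricks
# secret scope. "keyvault-scope", "sql-username" and "sql-password" are placeholders.
db_username = dbutils.secrets.get(scope="keyvault-scope", key="sql-username")
db_password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")
```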
- Triggers in ADF are used to schedule the data refresh by running the data pipeline at the specified time and interval.
Triggers