
Azure Databricks Data Lake PoC

[Architecture diagram]

About

A simple proof-of-concept data pipeline utilising Azure Databricks and Azure Data Lake, with infrastructure managed by Terraform.

Publicly available HL7 FHIR data is ingested from the HAPI FHIR server by a Databricks Spark job and landed in the bronze layer of the data lake as Delta Lake tables in Azure's ADLS Gen2 storage.

dbt connects to a Databricks SQL Warehouse via the dbt-databricks adapter and performs the transformations from the bronze layer to the silver and gold layers.

Video Overview

Prerequisites

  • Python 3.10+
  • Poetry (used below to install the Python dependencies)

Install CLI tools via Homebrew:

brew install azure-cli

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

brew tap databricks/tap
brew install databricks

Install Python dependencies and activate the virtual environment:

poetry install
source $(poetry env info --path)/bin/activate
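
To confirm the tools are available on your PATH, check the versions:

az --version
terraform -version
databricks -v
python3 --version  # should report 3.10 or newer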

Deployment

  1. Authenticate to Azure and select your subscription:
    az login
  2. Copy terraform/terraform_example.tfvars to terraform/terraform.tfvars and set your subscription_id (see the sketch after this list for looking it up).
  3. Initialise the remote backend for Terraform state:
    bash terraform/tf_backend_setup.sh
  4. Deploy the infrastructure:
    cd terraform
    terraform init
    terraform plan   # review the changes
    terraform apply  # provision resources
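
For step 2, the active subscription ID can be read from the Azure CLI; the tfvars line below is illustrative, with a placeholder value:

az account show --query id -o tsv   # prints the active subscription ID

# in terraform/terraform.tfvars:
# subscription_id = "00000000-0000-0000-0000-000000000000"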

Terraform will output a databricks_workspace_url; open it and sign in with your Microsoft account. You should see the Unity Catalog with bronze, silver, and gold schemas in the left-hand navigation.
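
The workspace URL can be retrieved again at any time without re-running apply:

terraform output -raw databricks_workspace_url   # run from the terraform/ directory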

Tip: Update the resource group and storage account names in terraform.tfvars to something unique; storage account names must be globally unique across Azure, so the defaults may already be taken.

dbt Setup

Set up environment variables for dbt to connect to Databricks:

source dbt_project/setup_dbt_env.sh

This script:

  • Fetches DBT_DATABRICKS_HOST and DBT_DATABRICKS_HTTP_PATH from Terraform outputs
  • Generates a Databricks personal access token via the CLI
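
Roughly, the script runs the equivalent of the commands below. The Terraform output names, token lifetime, and use of jq here are assumptions for illustration; check setup_dbt_env.sh for the exact details:

# assumed output names; the script defines the real ones
export DBT_DATABRICKS_HOST=$(terraform -chdir=terraform output -raw databricks_workspace_url)
export DBT_DATABRICKS_HTTP_PATH=$(terraform -chdir=terraform output -raw sql_warehouse_http_path)
# mint a personal access token via the Databricks CLI (requires jq)
export DBT_DATABRICKS_TOKEN=$(databricks tokens create --comment dbt --lifetime-seconds 86400 | jq -r .token_value)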

Verify the variables are set:

echo $DBT_DATABRICKS_HOST
echo $DBT_DATABRICKS_HTTP_PATH
echo $DBT_DATABRICKS_TOKEN

Run dbt commands from the dbt_project/ directory:

cd dbt_project
dbt debug  # test connection
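
Once dbt debug reports a successful connection, the usual workflow builds and tests the silver and gold models:

dbt run   # build the models
dbt test  # run schema and data tests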

Teardown

Unity Catalog schemas and external locations are protected from accidental deletion by default. To destroy all resources, pass force_destroy_catalog=true:

cd terraform
terraform apply -var="force_destroy_catalog=true"   # update state with force_destroy flags
terraform destroy -var="force_destroy_catalog=true"  # destroy all resources

Note: The Terraform backend resource group (rg-azuredbpoc-tfstate-dev) is not managed by this Terraform configuration, so terraform destroy leaves it in place.
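
If you also want to remove the backend, delete its resource group manually. This is irreversible and deletes the remote Terraform state:

az group delete --name rg-azuredbpoc-tfstate-dev --yes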
