A simple proof of concept data pipeline utilising Azure Databricks and Azure Data Lake. Infrastructure is managed using Terraform.
Publicly available HL7 FHIR data is ingested from the public HAPI FHIR server by a Databricks Spark job and landed in the bronze layer of the data lake as Delta Lake tables in Azure ADLS Gen2 storage.
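
For a concrete picture of the ingestion step, the sketch below shows one way such a job could look. It is illustrative rather than the repo's actual job: the resource type, page size, and table name are assumptions, and it expects to run on a Databricks cluster (where `saveAsTable` writes Delta tables backed by the ADLS Gen2 storage provisioned below).

```python
# Illustrative bronze-layer ingestion sketch (not the repo's actual job).
# Assumes a Databricks cluster; the resource type, page size, and target
# table name are hypothetical.
import json

import requests
from pyspark.sql import SparkSession

HAPI_BASE_URL = "https://hapi.fhir.org/baseR4"  # public HAPI FHIR test server
RESOURCE_TYPE = "Patient"                       # hypothetical resource to ingest
BRONZE_TABLE = "bronze.patient_raw"             # hypothetical bronze-layer table

spark = SparkSession.builder.getOrCreate()

# Fetch one page of FHIR resources as a Bundle and keep the raw JSON payloads.
response = requests.get(
    f"{HAPI_BASE_URL}/{RESOURCE_TYPE}", params={"_count": 50}, timeout=30
)
response.raise_for_status()
bundle = response.json()

rows = [
    (entry["resource"]["id"], json.dumps(entry["resource"]))
    for entry in bundle.get("entry", [])
]

# Land the raw resources in the bronze schema; on Databricks this is stored
# as a Delta table in ADLS Gen2.
df = spark.createDataFrame(rows, schema="resource_id STRING, resource_json STRING")
df.write.mode("append").saveAsTable(BRONZE_TABLE)
```
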
dbt is connected to Databricks SQL Warehouse compute using the dbt-databricks adapter. Transformations from bronze to silver and gold layers are performed using dbt.
- Python 3.10+
Install CLI tools via Homebrew:
```bash
brew install azure-cli
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
brew tap databricks/tap
brew install databricks
```

Install Python dependencies and activate the virtual environment:
```bash
poetry install
source $(poetry env info --path)/bin/activate
```

- Authenticate to Azure and select your subscription:

```bash
az login
```
- Copy `terraform/terraform_example.tfvars` to `terraform/terraform.tfvars` and set your `subscription_id`.
- Initialise the remote backend for Terraform state:

```bash
bash terraform/tf_backend_setup.sh
```
- Deploy the infrastructure:

```bash
cd terraform
terraform init
terraform plan   # review the changes
terraform apply  # provision resources
```
Terraform will output a `databricks_workspace_url`. Open it and sign in with your Microsoft account. You should see the Unity Catalog with bronze, silver, and gold schemas in the left-hand navigation.
Tip: Update the resource group and storage account names in `terraform.tfvars` to something unique to avoid naming conflicts.
Set up environment variables for dbt to connect to Databricks:
```bash
source dbt_project/setup_dbt_env.sh
```

This script:

- Fetches `DBT_DATABRICKS_HOST` and `DBT_DATABRICKS_HTTP_PATH` from Terraform outputs
- Generates a Databricks personal access token via the CLI
Verify the variables are set:

```bash
echo $DBT_DATABRICKS_HOST
echo $DBT_DATABRICKS_HTTP_PATH
echo $DBT_DATABRICKS_TOKEN
```
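
As an optional sanity check outside dbt, you can open a session against the SQL Warehouse with the same variables. The snippet below is a sketch: it assumes the `databricks-sql-connector` package is available (it is not necessarily a project dependency) and that `DBT_DATABRICKS_HOST` holds the bare workspace hostname, as the dbt-databricks adapter expects.

```python
# Optional sanity check: connect to the SQL Warehouse using the same environment
# variables dbt will use. Assumes the databricks-sql-connector package is installed
# and DBT_DATABRICKS_HOST contains the bare workspace hostname.
import os

from databricks import sql

with sql.connect(
    server_hostname=os.environ["DBT_DATABRICKS_HOST"],
    http_path=os.environ["DBT_DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DBT_DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())  # the catalog and default schema in use
```
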
Run dbt commands from the `dbt_project/` directory:

```bash
cd dbt_project
dbt debug  # test connection
```
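
If you prefer driving dbt from Python (for example from an orchestration script), the runs can also be invoked programmatically. This is a sketch assuming dbt-core 1.5+ (which provides the `dbtRunner` API) and that the environment variables from the previous step are exported.

```python
# Sketch of invoking dbt programmatically instead of via the CLI.
# Assumes dbt-core >= 1.5 and the dbt-databricks adapter are installed, and that
# the DBT_DATABRICKS_* environment variables are set.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Roughly equivalent to running `dbt build` inside dbt_project/.
result: dbtRunnerResult = runner.invoke(["build", "--project-dir", "dbt_project"])

if not result.success:
    raise SystemExit(f"dbt build failed: {result.exception}")
```
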
Unity Catalog schemas and external locations are protected from accidental deletion by default. To destroy all resources, pass `force_destroy_catalog=true`:

```bash
cd terraform
terraform apply -var="force_destroy_catalog=true"    # update state with force_destroy flags
terraform destroy -var="force_destroy_catalog=true"  # destroy all resources
```

Note: The Terraform backend resource group (`rg-azuredbpoc-tfstate-dev`) is not managed by `terraform destroy` and will be preserved.
