
Azure Databricks Data Lake PoC

[Architecture diagram]

About

A simple proof-of-concept data pipeline utilising Azure Databricks and Azure Data Lake, with infrastructure managed by Terraform.

Publicly available HL7 FHIR data is ingested from the HAPI FHIR server by a Databricks Spark job and landed in the bronze layer of the data lake as Delta Lake tables in Azure's ADLS Gen2 storage.

dbt connects to a Databricks SQL Warehouse via the dbt-databricks adapter and performs the transformations from the bronze layer to the silver and gold layers.

Video Overview

Prerequisites

  • Python 3.10+
  • Poetry (used below to install the Python dependencies)

Install CLI tools via Homebrew:

brew install azure-cli

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

brew tap databricks/tap
brew install databricks

Install Python dependencies and activate the virtual environment:

poetry install
source $(poetry env info --path)/bin/activate
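
To confirm the tools are available on your PATH, check the versions:

az --version
terraform -version
databricks -v
python3 --version  # should report 3.10 or newer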

Deployment

  1. Authenticate to Azure and select your subscription:
    az login
  2. Copy terraform/terraform_example.tfvars to terraform/terraform.tfvars and set your subscription_id (see the sketch after this list for looking it up).
  3. Initialise the remote backend for Terraform state:
    bash terraform/tf_backend_setup.sh
  4. Deploy the infrastructure:
    cd terraform
    terraform init
    terraform plan   # review the changes
    terraform apply  # provision resources
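
For step 2, the active subscription ID can be read from the Azure CLI; the tfvars line below is illustrative, with a placeholder value:

az account show --query id -o tsv   # prints the active subscription ID

# in terraform/terraform.tfvars:
# subscription_id = "00000000-0000-0000-0000-000000000000"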

Terraform will output a databricks_workspace_url; open it and sign in with your Microsoft account. You should see the Unity Catalog with bronze, silver, and gold schemas in the left-hand navigation.
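
The workspace URL can be retrieved again at any time without re-running apply:

terraform output -raw databricks_workspace_url   # run from the terraform/ directory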

Tip: Update the resource group and storage account names in terraform.tfvars to something unique; storage account names must be globally unique across Azure, so the defaults may already be taken.

dbt Setup

Set up environment variables for dbt to connect to Databricks:

source dbt_project/setup_dbt_env.sh

This script:

  • Fetches DBT_DATABRICKS_HOST and DBT_DATABRICKS_HTTP_PATH from Terraform outputs
  • Generates a Databricks personal access token via the CLI
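
Roughly, the script runs the equivalent of the commands below. The Terraform output names, token lifetime, and use of jq here are assumptions for illustration; check setup_dbt_env.sh for the exact details:

# assumed output names; the script defines the real ones
export DBT_DATABRICKS_HOST=$(terraform -chdir=terraform output -raw databricks_workspace_url)
export DBT_DATABRICKS_HTTP_PATH=$(terraform -chdir=terraform output -raw sql_warehouse_http_path)
# mint a personal access token via the Databricks CLI (requires jq)
export DBT_DATABRICKS_TOKEN=$(databricks tokens create --comment dbt --lifetime-seconds 86400 | jq -r .token_value)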

Verify the variables are set:

echo $DBT_DATABRICKS_HOST
echo $DBT_DATABRICKS_HTTP_PATH
echo $DBT_DATABRICKS_TOKEN

Run dbt commands from the dbt_project/ directory:

cd dbt_project
dbt debug  # test connection
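
Once dbt debug reports a successful connection, the usual workflow builds and tests the silver and gold models:

dbt run   # build the models
dbt test  # run schema and data tests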

Teardown

Unity Catalog schemas and external locations are protected from accidental deletion by default. To destroy all resources, pass force_destroy_catalog=true:

cd terraform
terraform apply -var="force_destroy_catalog=true"   # update state with force_destroy flags
terraform destroy -var="force_destroy_catalog=true"  # destroy all resources

Note: The Terraform backend resource group (rg-azuredbpoc-tfstate-dev) is not managed by this Terraform configuration, so terraform destroy leaves it in place.
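
If you also want to remove the backend, delete its resource group manually. This is irreversible and deletes the remote Terraform state:

az group delete --name rg-azuredbpoc-tfstate-dev --yes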
