
# Provisioning Databricks Managed File Events on AWS

This example uses the `aws-managed-file-events` module.

This template deploys the AWS infrastructure required for Databricks Managed File Events, enabling Auto Loader's file notification mode with automatically provisioned S3 event notifications and SQS queues.

## How to use

1. Reference this module using one of the supported Terraform module source types.
2. Add a `variables.tf` file with the same content as this example's `variables.tf`.
3. Add a `terraform.tfvars` file and provide a value for each defined variable.
4. Configure authentication to your Databricks workspace and AWS account.
5. Add an `output.tf` file.
6. (Optional) Configure your remote backend.
7. Run `terraform init` to initialize Terraform and download the required providers.
8. Run `terraform apply` to create the resources.
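As a sketch of step 3, a `terraform.tfvars` file for this example might look like the following. Every value below is a placeholder (hypothetical prefix, region, and account IDs), not a real default:

```hcl
# terraform.tfvars -- example values only; replace with your own.
prefix                = "demo"         # used in generated resource names
region                = "us-east-1"    # AWS region to deploy to
aws_account_id        = "123456789012" # placeholder AWS account ID
databricks_account_id = "00000000-0000-0000-0000-000000000000" # placeholder

tags = {
  Environment = "dev"
  ManagedBy   = "terraform"
}
```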

## Complete Example with All Options

The following shows all available module options:

```hcl
module "managed_file_events" {
  source = "../../modules/aws-managed-file-events"

  # Required variables
  prefix                = var.prefix
  region                = var.region
  aws_account_id        = var.aws_account_id
  databricks_account_id = var.databricks_account_id

  # S3 Configuration
  create_bucket        = true                    # Set to false to use existing bucket
  existing_bucket_name = null                    # Required if create_bucket = false
  bucket_name          = "my-custom-bucket-name" # Custom bucket name (default: prefix-file-events)
  s3_path_prefix       = "data/incoming"         # Path prefix within the bucket
  force_destroy_bucket = false                   # Allow bucket deletion with objects

  # External Location Configuration
  external_location_name  = "my-external-location"  # Custom name (default: prefix-file-events-location)
  storage_credential_name = "my-storage-credential" # Custom name (default: prefix-file-events-credential)

  # Catalog Configuration (Optional)
  create_catalog         = true
  catalog_name           = "my_catalog"
  catalog_owner          = "data-engineers@company.com"
  catalog_isolation_mode = "OPEN"  # OPEN or ISOLATED

  # Grants Configuration
  external_location_grants = [
    {
      principal  = "data-engineers@company.com"
      privileges = ["READ_FILES", "WRITE_FILES"]
    }
  ]

  storage_credential_grants = [
    {
      principal  = "data-engineers@company.com"
      privileges = ["CREATE_EXTERNAL_LOCATION"]
    }
  ]

  catalog_grants = [
    {
      principal  = "data-engineers@company.com"
      privileges = ["USE_CATALOG", "CREATE_SCHEMA"]
    },
    {
      principal  = "analysts@company.com"
      privileges = ["USE_CATALOG"]
    }
  ]

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
    Project     = "data-platform"
  }
}
```

## Using with Auto Loader

Once deployed, you can use Auto Loader with managed file events in your Databricks notebooks:

```python
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .load("s3://your-bucket/path")
```

Or in Lakeflow Declarative Pipelines:

```python
from pyspark import pipelines as dp

@dp.table
def my_table():
    # Ingesting from a volume that points to your S3 bucket is more
    # performant than reading the S3 location directly.
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .option("cloudFiles.useManagedFileEvents", "true") \
        .load("/Volumes")
```

## Reference

### Requirements

| Name | Version |
|------|---------|
| aws | >= 5.0 |
| databricks | >= 1.65.0 |

### Providers

No providers.

### Modules

| Name | Source | Version |
|------|--------|---------|
| managed_file_events | ../../modules/aws-managed-file-events | n/a |

### Resources

No resources.

### Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| aws_account_id | (Required) AWS Account ID | string | n/a | yes |
| databricks_account_id | (Required) Databricks Account ID | string | n/a | yes |
| databricks_client_id | (Required) Databricks service principal client ID | string | n/a | yes |
| databricks_client_secret | (Required) Databricks service principal client secret | string | n/a | yes |
| databricks_host | (Required) Databricks workspace URL (e.g., https://xxx.cloud.databricks.com) | string | n/a | yes |
| databricks_pat_token | (Required) Databricks personal access token | string | n/a | yes |
| prefix | (Required) Prefix for resource naming | string | n/a | yes |
| region | (Required) AWS region to deploy to | string | n/a | yes |
| aws_profile | (Optional) AWS CLI profile name for authentication | string | null | no |
| tags | (Optional) Tags to add to created resources | map(string) | {} | no |

### Outputs

| Name | Description |
|------|-------------|
| bucket_name | Name of the S3 bucket |
| external_location_name | Name of the external location |
| external_location_url | S3 URL of the external location |
| iam_role_arn | ARN of the IAM role |
| storage_credential_name | Name of the storage credential |