## Athena Databricks Connector

This connector enables Amazon Athena to query data stored in Databricks Unity Catalog using JDBC. It allows you to perform federated queries on Databricks tables directly from Athena.

## Connector Status: Preview

The Databricks Athena connector is currently in **preview** and available only as source code for building locally. This is not a production-ready release.

We welcome questions, suggestions, and contributions from the community.

## What is the Databricks Connector?

The Databricks Connector is a JDBC-based Athena federated query connector that enables querying data in Databricks Unity Catalog. It implements both metadata and record handling capabilities to:

1. Provide schema information about Databricks databases, tables, and columns
2. Read data from Databricks tables for query processing via JDBC
3. Authenticate using personal access tokens stored in AWS Secrets Manager

The connector consists of:

- **DatabricksMetadataHandler**: Handles metadata operations (listing schemas and tables, and retrieving table definitions, partitions, and splits)
- **DatabricksRecordHandler**: Handles data reading operations from Databricks via JDBC
- **DatabricksCompositeHandler**: Combines both handlers into a single Lambda function

## Prerequisites

Before deploying this connector, ensure you have:

- [Proper permissions/policies to deploy/use Athena Federated Queries](https://docs.aws.amazon.com/athena/latest/ug/federated-query-iam-access.html)
- An S3 bucket for spilling large query results
- A Databricks workspace with Unity Catalog enabled
- A Databricks personal access token stored in AWS Secrets Manager
- An Athena workgroup configured to use Athena engine version 3

## How To Deploy

### Build the Connector

From the repository root, initialize the submodule and build:

```bash
git submodule update --init
mvn clean package -DskipTests -f connectors/pom.xml
```

The parent POM builds the `athena-jdbc` dependency from the submodule first, then the Databricks connector.

### Deploy Using SAM CLI

```bash
sam build -t connectors/athena-databricks-connector/athena-databricks-connector.yaml && \
sam deploy --guided -t connectors/athena-databricks-connector/athena-databricks-connector.yaml
```

### CloudFormation Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| AthenaCatalogName | Lambda function name (must match pattern: `^[a-z0-9-_]{1,64}$`) | databricks |
| SpillBucket | S3 bucket name for spilling data (bucket name only, not URI or ARN) | Required |
| SpillPrefix | Prefix within SpillBucket | athena-spill |
| LambdaTimeout | Maximum Lambda invocation runtime (1-900 seconds) | 900 |
| LambdaMemory | Lambda memory in MB (128-3008) | 1024 |
| DatabricksHost | Databricks workspace hostname (e.g. `dbc-59ed3753-5cf0.cloud.databricks.com`) | Required |
| SecretName | Name of the Secrets Manager secret containing the Databricks personal access token | Required |
| DatabricksDefaultDatabase | Default Databricks Unity Catalog database (`catalog.schema`) | default |
| DatabricksFetchSize | Number of rows fetched per JDBC round trip | 10000 |
| EnableArrow | Enable Arrow-based result serialization (Cloud Fetch); requires more Lambda memory | 0 |
| DisableSpillEncryption | Disable encryption for spilled data | false |
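
As a quick local sanity check, a candidate `AthenaCatalogName` can be tested against the pattern from the table above. This is an illustrative sketch; note the character class is written as `[a-z0-9_-]` (hyphen last) so `grep -E` does not read it as a range:

```bash
# Validate candidate Lambda function names against ^[a-z0-9_-]{1,64}$
pattern='^[a-z0-9_-]{1,64}$'
for name in databricks my-databricks-connector Databricks; do
  if printf '%s' "$name" | grep -Eq "$pattern"; then
    echo "$name: valid"
  else
    echo "$name: invalid"
  fi
done
```

CloudFormation rejects a non-matching name at deploy time, so checking locally just saves a failed stack operation.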

### Update Lambda Function

For subsequent code updates after initial deployment, build and push the Docker image manually:

```bash
cd connectors && mvn clean package -DskipTests && \
cd athena-databricks-connector && \
finch build -t databricks-connector . && \
finch tag databricks-connector:latest <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest && \
finch push <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest && \
aws lambda update-function-code \
  --function-name databricks \
  --image-uri <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest \
  --region <region>
```

## Secrets Manager Configuration

The connector authenticates with Databricks using a personal access token (PAT) stored in AWS Secrets Manager. The token is retrieved at runtime by the Federation SDK — it is never embedded in code or environment variables.

### Security Best Practices

We recommend storing your Databricks personal access token in AWS Secrets Manager rather than as a plaintext Lambda environment variable. Reference the secret ARN in the `DATABRICKS_TOKEN` environment variable using dynamic references:

```
{{resolve:secretsmanager:your-secret-name:SecretString:token}}
```

### How it works

The connector's JDBC connection string contains a `${secret-name}` placeholder. At runtime, the SDK:

1. Extracts the secret name from the placeholder
2. Calls Secrets Manager to retrieve the secret value
3. Injects the `username` and `password` into the JDBC connection properties
4. Strips the placeholder from the URL before connecting
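
The placeholder handling in steps 1 and 4 can be sketched in shell. The URL below is purely illustrative (the real connection string is assembled inside the connector), but the extract-and-strip logic is the same:

```bash
# Illustrative JDBC URL containing a ${secret-name} placeholder
url='jdbc:databricks://dbc-59ed3753-5cf0.cloud.databricks.com:443/default;${my-databricks-secret}'

# Step 1: extract the secret name from the placeholder
secret_name=$(printf '%s' "$url" | sed -n 's/.*\${\([^}]*\)}.*/\1/p')
echo "$secret_name"    # my-databricks-secret

# Step 4: strip the placeholder before connecting
clean_url=$(printf '%s' "$url" | sed 's/\${[^}]*}//')
echo "$clean_url"
```

Steps 2 and 3 (fetching the secret and injecting the credentials) happen inside the Federation SDK and are not reproduced here.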

### Create the secret

The secret must be a JSON object with `username` and `password` fields. For Databricks PAT auth, the username is always `token`:

```bash
aws secretsmanager create-secret \
  --name my-databricks-secret \
  --secret-string '{"username": "token", "password": "<your-databricks-personal-access-token>"}' \
  --region <region>
```

### Update the secret

To rotate or update the token:

```bash
aws secretsmanager put-secret-value \
  --secret-id my-databricks-secret \
  --secret-string '{"username": "token", "password": "<new-token>"}' \
  --region <region>
```

No redeployment needed — the connector reads the secret on each invocation.

## Run Queries

Once deployed, query Databricks data through Athena:

```sql
-- List schemas
SHOW DATABASES IN `lambda:databricks`;

-- List tables in a schema
SHOW TABLES IN `lambda:databricks`.default;

-- Describe table layout (column names and types)
SHOW COLUMNS IN `lambda:databricks`.default.test_table;

-- Query a table
SELECT * FROM `lambda:databricks`."default"."your_table" LIMIT 10;
```

You can run queries from the Athena console or the AWS CLI:

```bash
# Start a query
aws athena start-query-execution \
  --query-string 'SELECT * FROM `lambda:databricks`."default"."your_table" LIMIT 10' \
  --work-group primary \
  --region <region>

# Fetch results (use the QueryExecutionId from the previous command)
aws athena get-query-results \
  --query-execution-id <query-execution-id> \
  --region <region>
```

## JDBC Driver Configuration

### Arrow and Cloud Fetch (Default: Disabled)

The Databricks JDBC driver supports [Cloud Fetch](https://docs.databricks.com/en/integrations/jdbc/capability.html#cloud-fetch-in-jdbc), which downloads query results as ~20MB Arrow-serialized chunks in parallel from DBFS. While this is faster than row-by-row streaming, each in-flight chunk consumes Lambda memory. With the default thread pool of 16, this can easily exceed Lambda's memory limit (1–3GB) on large result sets.
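
A rough back-of-the-envelope check of that claim, using the ~20MB chunk size and 16-thread pool cited above:

```bash
# Worst-case in-flight Cloud Fetch buffers: 16 threads x ~20MB Arrow chunks
threads=16
chunk_mb=20
echo "$(( threads * chunk_mb )) MB"    # prints "320 MB"
```

320MB of chunk buffers alone is roughly a third of the default 1024MB Lambda allocation, before the JVM heap, Arrow deserialization, or the spill path use any memory.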

This connector disables Arrow by default (`EnableArrow=0`) so results stream row-by-row via Thrift instead. Memory usage is bounded by `DatabricksFetchSize` (default: 10,000 rows per JDBC round trip).

To re-enable Cloud Fetch for higher throughput, set the `EnableArrow` parameter to `1` during deployment. You may also need to increase `LambdaMemory` to accommodate the larger in-flight buffers.

### Fetch Size (Default: 10,000)

`DatabricksFetchSize` controls how many rows the JDBC driver buffers per round trip. Higher values reduce network round trips but use more memory. The default of 10,000 is safe for Lambda at 1GB with typical row sizes (~1KB). Lower it for tables with very wide rows.
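
Under the ~1KB-per-row assumption above, the per-round-trip buffer can be estimated as:

```bash
# Approximate JDBC buffer per round trip: fetch size x average row size
fetch_size=10000
avg_row_bytes=1024
echo "$(( fetch_size * avg_row_bytes / 1048576 )) MB"    # prints "9 MB"
```

At roughly 10MB per batch this fits comfortably in a 1GB Lambda, but a table with 100KB rows at the same fetch size would need about 1GB for the buffer alone, which is why wide rows call for a lower `DatabricksFetchSize`.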

## Troubleshooting

- **No partitioning support**: All data is read in a single split. For large tables, use `LIMIT` or `WHERE` clauses to avoid Lambda timeouts or out-of-memory errors.
- **Check Lambda logs**: `aws logs tail /aws/lambda/databricks --follow --format short --region <region>`
- **Verify permissions**: Ensure the Lambda execution role has access to Secrets Manager and the spill bucket.

## Additional Resources

- [Athena Federated Query Documentation](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html)
- [AWS Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation)
- [Databricks JDBC Driver](https://docs.databricks.com/aws/en/integrations/jdbc-oss/)