Commit 3f8aaba

Authored by andy-k-improving, acarbonetto, and mykola-pereyma

FEAT: Databricks connector (#133)
This commit squashes the following pull requests:

* Databricks connector - Module skeleton (#106): initial Java module, JDBC driver import, connection string, POM refactor, tests, Javadoc, and README
* Databricks Connector - Deployment Instruction (#107): SAM template (athena-databricks-connector.yaml), Dockerfile, and documentation
* Databricks Connector - Metadata handler (#108): JDBC metadata handler implementation and API signature updates
* Databricks connector - RecordHandler implementation (#110): record handler with unit tests, configurable fetch size, and pushdown support
* Notebook: Databricks connector (#128): import_databricks_demo.ipynb demo notebook; NEV made optional
* Databricks connector: Deployment instruction fixes (#130): JDBC string, default options, log statements, parameter reordering, and example host name
* Connector: Directory restructure (#132): moved the s3 vector connector under connectors/, path updates, GitHub Action and doc updates

Co-authored-by: Andrew Carbonetto <andrew.carbonetto@improving.com>
Co-authored-by: mykola-pereyma <pereymam@amazon.com>
1 parent a66e66f commit 3f8aaba

File tree

49 files changed: +2441 -37 lines changed


.github/dependabot.yml

Lines changed: 5 additions & 1 deletion

```diff
@@ -5,7 +5,11 @@ updates:
     schedule:
       interval: "weekly"
   - package-ecosystem: "maven"
-    directory: "/athena-s3vector-connector"
+    directory: "/connectors/athena-databricks-connector"
+    schedule:
+      interval: "weekly"
+  - package-ecosystem: "maven"
+    directory: "/connectors/athena-s3vector-connector"
     schedule:
       interval: "weekly"
   - package-ecosystem: "pip"
```
Lines changed: 36 additions & 0 deletions (new workflow file)

```yaml
name: Databricks Connector CI

on:
  pull_request:
    paths:
      - 'connectors/athena-databricks-connector/**'
  push:
    branches:
      - main
    paths:
      - 'connectors/athena-databricks-connector/**'

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        java-version: [ '11', '17' ]
    env:
      AWS_REGION: us-west-2
      AWS_DEFAULT_REGION: us-west-2
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          submodules: true
      - name: Set up JDK ${{ matrix.java-version }}
        uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0
        with:
          distribution: 'corretto'
          java-version: ${{ matrix.java-version }}
      - name: Build with Maven
        run: mvn install -Dcheckstyle.skip=true
        working-directory: connectors/athena-databricks-connector
```

.github/workflows/athena-s3vector-connector.yml

Lines changed: 3 additions & 3 deletions

```diff
@@ -3,12 +3,12 @@ name: Java CI Push
 on:
   pull_request:
     paths:
-      - 'athena-s3vector-connector/**'
+      - 'connectors/athena-s3vector-connector/**'
   push:
     branches:
       - main
     paths:
-      - 'athena-s3vector-connector/**'
+      - 'connectors/athena-s3vector-connector/**'
 
 permissions:
   contents: read
@@ -31,4 +31,4 @@
         java-version: ${{ matrix.java-version }}
       - name: Build with Maven
         run: mvn install
-        working-directory: athena-s3vector-connector
+        working-directory: connectors/athena-s3vector-connector
```

.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -171,6 +171,7 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
+*.iml
 nx-neptune/
 
 
@@ -183,5 +184,6 @@ nx-neptune/
 
 # Java
 .jqwik-database
+dependency-reduced-pom.xml
 
 
```

.gitmodules

Lines changed: 4 additions & 0 deletions (new file)

```ini
[submodule "connectors/aws-athena-query-federation"]
	path = connectors/aws-athena-query-federation
	url = https://github.com/awslabs/aws-athena-query-federation.git
	ignore = dirty
```

connectors/README.md

Lines changed: 66 additions & 0 deletions (new file)

# Athena Connectors

This directory contains Athena federated query connectors and their shared dependencies.

## Structure

```
connectors/
├── pom.xml                          # Parent POM (multi-module build)
├── aws-athena-query-federation/     # Git submodule (pinned to a specific version)
│   └── athena-jdbc/                 # JDBC base module used by the Databricks connector
├── athena-databricks-connector/     # Databricks Unity Catalog connector
└── athena-s3vector-connector/       # S3 Vector connector
```

## Why the Submodule?

The `athena-databricks-connector` depends on the `athena-jdbc` module from the [AWS Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation). This module provides the base classes for JDBC-based connectors (`JdbcMetadataHandler`, `JdbcRecordHandler`, connection management, etc.).

The `athena-jdbc` module is **not published to Maven Central**, so it cannot be pulled in as a regular Maven dependency. The submodule lets us build it from source as part of the multi-module Maven build without copying source files into our repository.
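The parent `pom.xml` itself is not shown in this commit, but the layout above implies a module list along these lines. This is a hypothetical sketch; the coordinates and module paths are assumptions inferred from the directory tree, not the actual file:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <!-- Hypothetical coordinates, for illustration only -->
  <groupId>com.example.connectors</groupId>
  <artifactId>connectors-parent</artifactId>
  <version>0.1.0</version>
  <packaging>pom</packaging>

  <modules>
    <!-- Built first: the SDK's JDBC base module, compiled from the git submodule -->
    <module>aws-athena-query-federation/athena-jdbc</module>
    <module>athena-databricks-connector</module>
    <module>athena-s3vector-connector</module>
  </modules>
</project>
```

Maven's reactor orders modules by their inter-module dependencies, so `athena-jdbc` is built and installed into the local repository before the connectors that depend on it.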
## Build

From the repository root:

```bash
# Initialize submodule (first time or after clone)
git submodule update --init

# Build all modules
mvn -f connectors/pom.xml clean package -DskipTests

# Build only the Databricks connector (uses the cached athena-jdbc from ~/.m2)
mvn -f connectors/pom.xml package -DskipTests -pl :athena-databricks-connector

# Build the Databricks connector plus its dependencies
mvn -f connectors/pom.xml package -DskipTests -pl :athena-databricks-connector -am
```
## Updating the Federation SDK Version

1. Check the available tags:
   ```bash
   cd connectors/aws-athena-query-federation
   git fetch --tags
   git tag | grep v2026
   ```

2. Check out the desired version:
   ```bash
   git checkout v2026.12.0
   cd ../..
   ```

3. Update the `athena-sdk.version` property in `athena-databricks-connector/pom.xml` to match.

4. Commit the submodule update:
   ```bash
   git add connectors/aws-athena-query-federation
   git commit -m "Bump federation-sdk to v2026.12.0"
   ```

5. Rebuild to verify:
   ```bash
   mvn -f connectors/pom.xml clean package -DskipTests
   ```
Lines changed: 8 additions & 0 deletions (new file)

```dockerfile
ARG JAVA_VERSION=11
FROM public.ecr.aws/lambda/java:${JAVA_VERSION}

ARG JAR_VERSION=0.1.0
COPY target/athena-databricks-connector-${JAR_VERSION}.jar ${LAMBDA_TASK_ROOT}
RUN jar xf ${LAMBDA_TASK_ROOT}/athena-databricks-connector-${JAR_VERSION}.jar

CMD ["com.amazonaws.athena.connectors.databricks.DatabricksCompositeHandler"]
```
File renamed without changes.
Lines changed: 185 additions & 0 deletions (new file)

## Athena Databricks Connector

This connector enables Amazon Athena to query data stored in Databricks Unity Catalog over JDBC, so you can run federated queries against Databricks tables directly from Athena.

## Connector Status: Preview

The Databricks Athena connector is currently in **preview** and is available only as source code for building locally. This is not a production-ready release.

We welcome questions, suggestions, and contributions from the community.

## What is the Databricks Connector?

The Databricks Connector is a JDBC-based Athena federated query connector for Databricks Unity Catalog. It implements both metadata and record handling to:

1. Provide schema information about Databricks databases, tables, and columns
2. Read data from Databricks tables for query processing via JDBC
3. Authenticate using personal access tokens stored in AWS Secrets Manager

The connector consists of:

- **DatabricksMetadataHandler**: Handles metadata operations (list schemas and tables; get table definitions, partitions, and splits)
- **DatabricksRecordHandler**: Handles data reads from Databricks via JDBC
- **DatabricksCompositeHandler**: Combines both handlers into a single Lambda function

## Prerequisites

Before deploying this connector, ensure you have:

- [Proper permissions/policies to deploy and use Athena Federated Queries](https://docs.aws.amazon.com/athena/latest/ug/federated-query-iam-access.html)
- An S3 bucket for spilling large query results
- A Databricks workspace with Unity Catalog enabled
- A Databricks personal access token stored in AWS Secrets Manager
- An Athena workgroup configured to use Athena engine version 3

## How To Deploy

### Build the Connector

From the repository root, initialize the submodule and build:

```bash
git submodule update --init
mvn clean package -DskipTests -f connectors/pom.xml
```

The parent POM builds the `athena-jdbc` dependency from the submodule first, then the Databricks connector.

### Deploy Using SAM CLI

```bash
sam build -t connectors/athena-databricks-connector/athena-databricks-connector.yaml && \
sam deploy --guided -t connectors/athena-databricks-connector/athena-databricks-connector.yaml
```

### CloudFormation Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| AthenaCatalogName | Lambda function name (must match the pattern `^[a-z0-9-_]{1,64}$`) | databricks |
| SpillBucket | S3 bucket name for spilling data (bucket name only, not a URI or ARN) | Required |
| SpillPrefix | Prefix within SpillBucket | athena-spill |
| LambdaTimeout | Maximum Lambda invocation runtime (1-900 seconds) | 900 |
| LambdaMemory | Lambda memory in MB (128-3008) | 1024 |
| DatabricksHost | Databricks workspace hostname (e.g. `dbc-59ed3753-5cf0.cloud.databricks.com`) | Required |
| SecretName | Name of the Secrets Manager secret containing the Databricks personal access token | Required |
| DatabricksDefaultDatabase | Default Databricks Unity Catalog database (`catalog.schema`) | default |
| DatabricksFetchSize | Number of rows fetched per JDBC round trip | 10000 |
| EnableArrow | Enable Arrow-based result serialization (Cloud Fetch); requires more Lambda memory | 0 |
| DisableSpillEncryption | Disable encryption for spilled data | false |
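The `AthenaCatalogName` pattern from the table can be checked before deploying. A minimal sketch in Python (the regex is copied from the table above; the helper function is ours, not part of the connector):

```python
import re

# Pattern quoted in the CloudFormation parameter table for AthenaCatalogName.
CATALOG_NAME_PATTERN = re.compile(r"^[a-z0-9-_]{1,64}$")

def is_valid_catalog_name(name: str) -> bool:
    """Return True if `name` is an acceptable AthenaCatalogName value."""
    return CATALOG_NAME_PATTERN.fullmatch(name) is not None

print(is_valid_catalog_name("databricks"))    # True
print(is_valid_catalog_name("My-Connector"))  # False: uppercase not allowed
print(is_valid_catalog_name("a" * 65))        # False: longer than 64 characters
```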
### Update Lambda Function

For subsequent code updates after the initial deployment, build and push the Docker image manually:

```bash
cd connectors && mvn clean package -DskipTests && \
cd athena-databricks-connector && \
finch build -t databricks-connector . && \
finch tag databricks-connector:latest <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest && \
finch push <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest && \
aws lambda update-function-code \
  --function-name databricks \
  --image-uri <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>:latest \
  --region <region>
```
## Secrets Manager Configuration

The connector authenticates with Databricks using a personal access token (PAT) stored in AWS Secrets Manager. The token is retrieved at runtime by the Federation SDK; it is never embedded in code or environment variables.

### Security Best Practices

We recommend storing your Databricks personal access token in AWS Secrets Manager rather than as a plaintext Lambda environment variable. Reference the secret in the `DATABRICKS_TOKEN` environment variable using a dynamic reference:

```
{{resolve:secretsmanager:your-secret-name:SecretString:token}}
```

### How it works

The connector's JDBC connection string contains a `${secret-name}` placeholder. At runtime, the SDK:

1. Extracts the secret name from the placeholder
2. Calls Secrets Manager to retrieve the secret value
3. Injects the `username` and `password` into the JDBC connection properties
4. Strips the placeholder from the URL before connecting
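The four steps above can be sketched as plain string handling. This is an illustrative Python sketch of the behavior, not the SDK's actual Java implementation; the URL shape and the property names (`UID`/`PWD`) are assumptions:

```python
import json
import re

# Matches the ${secret-name} placeholder embedded in the JDBC connection string.
SECRET_PLACEHOLDER = re.compile(r"\$\{(?P<name>[^}]+)\}")

def resolve_secret_in_url(jdbc_url: str, secret_string: str) -> tuple:
    """Mimic the four steps: extract the secret name, inject credentials,
    and strip the placeholder from the URL."""
    match = SECRET_PLACEHOLDER.search(jdbc_url)
    if match is None:
        return jdbc_url, {}
    # Step 1: the secret name comes from the placeholder itself.
    secret_name = match.group("name")  # passed to Secrets Manager in the real SDK
    # Step 2 (stubbed): the real connector calls Secrets Manager with
    # secret_name here; this sketch takes the secret JSON as an argument.
    secret = json.loads(secret_string)
    # Step 3: credentials become JDBC connection properties (names assumed).
    properties = {"UID": secret["username"], "PWD": secret["password"]}
    # Step 4: strip the placeholder before connecting.
    clean_url = SECRET_PLACEHOLDER.sub("", jdbc_url, count=1)
    return clean_url, properties

url = "jdbc:databricks://dbc-59ed3753-5cf0.cloud.databricks.com:443/default;${my-databricks-secret}"
clean, props = resolve_secret_in_url(url, '{"username": "token", "password": "dapi-example"}')
print(clean)   # URL with the placeholder removed
print(props)   # {'UID': 'token', 'PWD': 'dapi-example'}
```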
### Create the secret

The secret must be a JSON object with `username` and `password` fields. For Databricks PAT authentication, the username is always `token`:

```bash
aws secretsmanager create-secret \
  --name my-databricks-secret \
  --secret-string '{"username": "token", "password": "<your-databricks-personal-access-token>"}' \
  --region <region>
```
### Update the secret

To rotate or update the token:

```bash
aws secretsmanager put-secret-value \
  --secret-id my-databricks-secret \
  --secret-string '{"username": "token", "password": "<new-token>"}' \
  --region <region>
```

No redeployment is needed; the connector reads the secret on each invocation.
## Run Queries

Once deployed, query Databricks data through Athena:

```sql
-- List schemas
SHOW DATABASES IN `lambda:databricks`;

-- List tables in a schema
SHOW TABLES IN `lambda:databricks`.default;

-- Describe the table layout (column names and types)
SHOW COLUMNS IN `lambda:databricks`.default.test_table;

-- Query a table
SELECT * FROM `lambda:databricks`."default"."your_table" LIMIT 10;
```

You can run queries from the Athena console or the AWS CLI:

```bash
# Start a query
aws athena start-query-execution \
  --query-string 'SELECT * FROM `lambda:databricks`."default"."your_table" LIMIT 10' \
  --work-group primary \
  --region <region>

# Fetch the results (use the QueryExecutionId from the previous command)
aws athena get-query-results \
  --query-execution-id <query-execution-id> \
  --region <region>
```
## JDBC Driver Configuration

### Arrow and Cloud Fetch (Default: Disabled)

The Databricks JDBC driver supports [Cloud Fetch](https://docs.databricks.com/en/integrations/jdbc/capability.html#cloud-fetch-in-jdbc), which downloads query results as ~20MB Arrow-serialized chunks in parallel from DBFS. While this is faster than row-by-row streaming, each in-flight chunk consumes Lambda memory. With the default thread pool of 16, this can easily exceed Lambda's memory limit (1-3GB) on large result sets.

This connector disables Arrow by default (`EnableArrow=0`), so results stream row by row via Thrift instead. Memory usage is bounded by `DatabricksFetchSize` (default: 10,000 rows per JDBC round trip).

To re-enable Cloud Fetch for higher throughput, set the `EnableArrow` parameter to `1` during deployment. You may also need to increase `LambdaMemory` to accommodate the larger in-flight buffers.

### Fetch Size (Default: 10,000)

`DatabricksFetchSize` controls how many rows the JDBC driver buffers per round trip. Higher values reduce network round trips but use more memory. The default of 10,000 is safe for Lambda at 1GB with typical row sizes (~1KB). Lower it for tables with very wide rows.
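The memory trade-off described in the two sections above is straightforward arithmetic. A rough back-of-the-envelope sketch using the figures quoted above (approximate values, not measurements):

```python
MB = 1024 * 1024
KB = 1024

# Cloud Fetch (EnableArrow=1): parallel in-flight Arrow chunks.
chunk_size = 20 * MB    # ~20MB Arrow-serialized chunks
thread_pool = 16        # default driver thread pool
cloud_fetch_peak = thread_pool * chunk_size

# Thrift streaming (EnableArrow=0): one fetch buffer at a time.
fetch_size = 10_000     # DatabricksFetchSize default (rows per round trip)
row_size = 1 * KB       # typical row size assumed above
streaming_peak = fetch_size * row_size

print(f"Cloud Fetch peak buffers: ~{cloud_fetch_peak // MB} MB")  # ~320 MB
print(f"Streaming fetch buffer:   ~{streaming_peak // MB} MB")    # ~9 MB
```

At the default 1024MB `LambdaMemory`, roughly 320MB of in-flight chunks on top of the JVM heap leaves little headroom, which is why Arrow is off by default.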
## Troubleshooting

- **No partitioning support**: All data is read in a single split. For large tables, use `LIMIT` or `WHERE` clauses to avoid Lambda timeouts or out-of-memory errors.
- **Check the Lambda logs**: `aws logs tail /aws/lambda/databricks --follow --format short --region <region>`
- **Verify permissions**: Ensure the Lambda execution role has access to Secrets Manager and the spill bucket.

## Additional Resources

- [Athena Federated Query Documentation](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html)
- [AWS Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation)
- [Databricks JDBC Driver](https://docs.databricks.com/aws/en/integrations/jdbc-oss/)
