62 changes: 34 additions & 28 deletions website/docs/acknowledgements/index.md
@@ -51,31 +51,37 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT

## Rust Crates

- ahash 0.7.8, Apache-2.0 OR MIT
<br/>https://github.com/tkaitchuck/ahash

- ahash 0.8.12, Apache-2.0 OR MIT
<br/>https://github.com/tkaitchuck/ahash

- ansi_term 0.12.1, MIT
<br/>https://github.com/ogham/rust-ansi-term

- anyhow 1.0.95, Apache-2.0 OR MIT
<br/>https://github.com/dtolnay/anyhow

- arrow 54.2.1, Apache-2.0
- arrow 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-buffer 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-cast 54.2.1, Apache-2.0
- arrow-cast 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-csv 54.2.1, Apache-2.0
- arrow-csv 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-flight 54.2.1, Apache-2.0
- arrow-flight 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-ipc 54.2.1, Apache-2.0
- arrow-ipc 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-json 54.2.1, Apache-2.0
- arrow-json 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- arrow-odbc 16.0.0, MIT
@@ -84,10 +90,10 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- arrow-schema 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- async-graphql 7.0.15, Apache-2.0 OR MIT
- async-graphql 7.0.16, Apache-2.0 OR MIT
<br/>https://github.com/async-graphql/async-graphql

- async-graphql-axum 7.0.13, Apache-2.0 OR MIT
- async-graphql-axum 7.0.16, Apache-2.0 OR MIT
<br/>https://github.com/async-graphql/async-graphql

- async-openai 0.28.0, MIT
@@ -114,7 +120,10 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- axum 0.7.9, MIT
<br/>https://github.com/tokio-rs/axum

- axum-extra 0.9.6, MIT
- axum 0.8.3, MIT
<br/>https://github.com/tokio-rs/axum

- axum-extra 0.10.1, MIT
<br/>https://github.com/tokio-rs/axum

- azure_core 0.21.0, MIT
Expand Down Expand Up @@ -159,7 +168,7 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- charset 0.1.5, Apache-2.0 OR MIT
<br/>https://github.com/hsivonen/charset

- chrono 0.4.39, Apache-2.0 OR MIT
- chrono 0.4.41, Apache-2.0 OR MIT
<br/>https://github.com/chronotope/chrono

- chrono-tz 0.8.6, Apache-2.0 OR MIT
Expand Down Expand Up @@ -192,7 +201,10 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- dashmap 6.1.0, MIT
<br/>https://github.com/xacrimon/dashmap

- datafusion 45.0.0, Apache-2.0
- datafusion 46.0.1, Apache-2.0
<br/>https://github.com/apache/datafusion

- datafusion-datasource 46.0.1, Apache-2.0
<br/>https://github.com/apache/datafusion

- datafusion-federation 0.1.6, Apache-2.0
@@ -201,13 +213,13 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- datafusion-federation-sql 0.1.6, Apache-2.0
<br/>

- datafusion-functions-json 0.45.0, Apache-2.0
- datafusion-functions-json 0.46.0, Apache-2.0
<br/>https://github.com/datafusion-contrib/datafusion-functions-json/

- datafusion-table-providers 0.1.0,
<br/>https://github.com/datafusion-contrib/datafusion-table-providers

- delta_kernel 0.9.0, Apache-2.0
- delta_kernel 0.10.0, Apache-2.0
<br/>https://github.com/delta-io/delta-kernel-rs

- dirs 5.0.1, Apache-2.0 OR MIT
Expand Down Expand Up @@ -348,15 +360,6 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- mailparse 0.15.0, 0BSD
<br/>https://github.com/staktrace/mailparse

- mcp-client 1.0.7, MIT
<br/>https://github.com/modelcontextprotocol/rust-sdk/

- mcp-core 1.0.7, MIT
<br/>https://github.com/modelcontextprotocol/rust-sdk/

- mcp-server 1.0.7, MIT
<br/>https://github.com/modelcontextprotocol/rust-sdk/

- mediatype 0.19.18, MIT
<br/>https://github.com/picoHz/mediatype

Expand Down Expand Up @@ -423,7 +426,7 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- opentelemetry_sdk 0.29.0, Apache-2.0
<br/>https://github.com/open-telemetry/opentelemetry-rust/tree/main/opentelemetry-sdk

- parquet 54.2.1, Apache-2.0
- parquet 54.3.1, Apache-2.0
<br/>https://github.com/apache/arrow-rs

- paste 1.0.15, Apache-2.0 OR MIT
Expand Down Expand Up @@ -486,12 +489,15 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- reqwest 0.11.27, Apache-2.0 OR MIT
<br/>https://github.com/seanmonstar/reqwest

- reqwest 0.12.12, Apache-2.0 OR MIT
- reqwest 0.12.15, Apache-2.0 OR MIT
<br/>https://github.com/seanmonstar/reqwest

- reqwest-eventsource 0.6.0, Apache-2.0 OR MIT
<br/>https://github.com/jpopesculian/reqwest-eventsource

- rmcp 0.1.5, Apache-2.0 OR MIT
<br/>https://github.com/modelcontextprotocol/rust-sdk/

- rstest 0.25.0, Apache-2.0 OR MIT
<br/>https://github.com/la10736/rstest

Expand Down Expand Up @@ -606,7 +612,7 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- tokenizers 0.21.0, Apache-2.0
<br/>https://github.com/huggingface/tokenizers

- tokio 1.43.0, MIT
- tokio 1.45.0, MIT
<br/>https://github.com/tokio-rs/tokio

- tokio-postgres 0.7.13, Apache-2.0 OR MIT
Expand Down Expand Up @@ -669,13 +675,13 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- utoipa 5.3.1, Apache-2.0 OR MIT
<br/>https://github.com/juhaku/utoipa

- utoipa-swagger-ui 8.1.0, Apache-2.0 OR MIT
- utoipa-swagger-ui 9.0.1, Apache-2.0 OR MIT
<br/>https://github.com/juhaku/utoipa

- uuid 0.8.2, Apache-2.0 OR MIT
<br/>https://github.com/uuid-rs/uuid

- uuid 1.13.1, Apache-2.0 OR MIT
- uuid 1.16.0, Apache-2.0 OR MIT
<br/>https://github.com/uuid-rs/uuid

- winver 1.0.0, MIT
@@ -690,5 +696,5 @@ gopkg.in/yaml.v3, https://github.com/go-yaml/yaml/blob/v3.0.1/LICENSE, MIT
- zip 1.1.4, MIT
<br/>https://github.com/zip-rs/zip2.git

- zip 2.3.0, MIT
- zip 2.6.1, MIT
<br/>https://github.com/zip-rs/zip2.git
81 changes: 60 additions & 21 deletions website/docs/components/catalogs/databricks.md
@@ -11,7 +11,7 @@ tags:
- data-connectors
---

Connect to a [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) as a catalog provider for federated SQL query using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html) or directly from [Delta Lake](https://delta.io/) tables.
Connect to a [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) as a catalog provider for federated SQL query using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html), directly from [Delta Lake](https://delta.io/) tables, or using the [SQL Statement Execution API](https://docs.databricks.com/aws/en/dev-tools/sql-execution-tutorial).

## Configuration

@@ -22,7 +22,7 @@ catalogs:
include:
- '*.my_table_name' # include only the "my_table_name" tables
params:
mode: delta_lake # or spark_connect
mode: delta_lake # or spark_connect or sql_warehouse
databricks_endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
dataset_params:
# delta_lake S3 parameters
@@ -32,6 +32,8 @@ catalogs:
databricks_aws_endpoint: s3.us-west-2.amazonaws.com
# spark_connect parameters
databricks_cluster_id: 1234-567890-abcde123
# sql_warehouse parameters
databricks_sql_warehouse_id: 2b4e24cff378fb24
```

## `from`
@@ -48,14 +50,23 @@ Use the `include` field to specify which tables to include from the catalog. The

## `params`

The `params` field is used to configure the connection to the Databricks Unity Catalog. The following parameters are supported:
The following parameters are supported for configuring the connection to the Databricks Unity Catalog:

- `mode`: The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:
- `spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.
- `delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided.
- `databricks_endpoint`: The Databricks workspace endpoint, e.g. `dbc-a12cd3e4-56f7.cloud.databricks.com`.
- `databricks_token`: The Databricks API token to authenticate with the Unity Catalog API. Use the [secret replacement syntax](../secret-stores/index.md) to reference a secret, e.g. `${secrets:my_databricks_token}`.
- `databricks_use_ssl`: If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`.
| Parameter Name | Definition |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `mode`                | The execution mode for querying against Databricks. `spark_connect` uses Spark Connect to query against Databricks and requires a Spark cluster to be available. `delta_lake` queries directly from Delta Tables and requires the object store credentials to be provided. `sql_warehouse` queries through the SQL Statement Execution API and requires a SQL Warehouse. Default is `spark_connect`. |
| `databricks_endpoint` | The Databricks workspace endpoint, e.g. `dbc-a12cd3e4-56f7.cloud.databricks.com`. |
| `databricks_token` | The Databricks API token to authenticate with the Unity Catalog API. Use the [secret replacement syntax](../secret-stores/index.md) to reference a secret, e.g. `${secrets:my_databricks_token}`. |
| `databricks_use_ssl` | If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`. |

To locate the Databricks endpoint, do the following:

1. Log in to your Databricks workspace.
2. In the sidebar, click Compute.
3. In the list of available clusters, click the target cluster's name.
4. On the Configuration tab, expand Advanced options.
5. Click the JDBC/ODBC tab.
6. The endpoint is the Server Hostname.

## Authentication

Expand Down Expand Up @@ -100,18 +111,42 @@ The `dataset_params` field is used to configure the dataset-specific parameters

### Spark Connect parameters

- `databricks_cluster_id`: The ID of the compute cluster in Databricks to use for the query. e.g. `1234-567890-abcde123`.
| Dataset Parameter Name | Definition |
| ----------------------- | ---------------------------------------------------------------------------------------------- |
| `databricks_cluster_id` | The ID of the compute cluster in Databricks to use for the query. e.g. `1234-567890-abcde123`. |

To locate the cluster ID, do the following:

1. Log in to your Databricks workspace.
2. In the sidebar, click Compute.
3. In the list of available clusters, click the target cluster's name.
4. On the Configuration tab, expand Advanced options.
5. Click the JDBC/ODBC tab.
6. The cluster ID is the prefix of the Server Hostname.
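
As a rough sketch of how the cluster ID fits into a catalog definition under `mode: spark_connect`, the fragment below combines the parameters described above. The `from` and `name` values (`databricks:my_uc_catalog`, `uc_catalog`) are placeholders assumed for illustration and do not come from this change; the endpoint, token secret name, and cluster ID reuse the sample values shown earlier in this page.

```yaml
catalogs:
  - from: databricks:my_uc_catalog # assumed placeholder for the Unity Catalog name
    name: uc_catalog # assumed placeholder for the local catalog name
    params:
      mode: spark_connect
      databricks_endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      databricks_token: ${secrets:my_databricks_token}
    dataset_params:
      # Cluster ID found on the cluster's JDBC/ODBC tab
      databricks_cluster_id: 1234-567890-abcde123
```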

### Delta Lake object store parameters

Configure the connection to the object store when using `mode: delta_lake`. Use the [secret replacement syntax](../secret-stores/index.md) to reference a secret, e.g. `${secrets:aws_access_key_id}`.

### SQL Warehouse parameters

- `databricks_sql_warehouse_id`: The ID of the SQL Warehouse in Databricks to use for the query. e.g. `2b4e24cff378fb24`.

To locate your SQL Warehouse ID, do the following:

1. Log in to your Databricks workspace.
2. In the sidebar, click SQL -> SQL Warehouses.
3. In the list of available warehouses, click the target warehouse's name.
4. Next to the **Name** field, the ID follows the name in parentheses. For example: `My Serverless Warehouse (ID: 2b4e24cff378fb24)`
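
A minimal sketch of a catalog definition using `mode: sql_warehouse` might look like the following. The `from` and `name` values are assumed placeholders, not values taken from this change; the endpoint, token secret name, and warehouse ID reuse the sample values shown elsewhere on this page, and `databricks_sql_warehouse_id` is placed under `dataset_params` to match the configuration example near the top.

```yaml
catalogs:
  - from: databricks:my_uc_catalog # assumed placeholder for the Unity Catalog name
    name: uc_catalog # assumed placeholder for the local catalog name
    params:
      mode: sql_warehouse
      databricks_endpoint: dbc-a12cd3e4-56f7.cloud.databricks.com
      databricks_token: ${secrets:my_databricks_token}
    dataset_params:
      # Warehouse ID shown in parentheses next to the warehouse name
      databricks_sql_warehouse_id: 2b4e24cff378fb24
```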

#### AWS S3

- `databricks_aws_region`: The AWS region for the S3 object store. E.g. `us-west-2`.
- `databricks_aws_access_key_id`: The access key ID for the S3 object store.
- `databricks_aws_secret_access_key`: The secret access key for the S3 object store.
- `databricks_aws_endpoint`: The endpoint for the S3 object store. E.g. `s3.us-west-2.amazonaws.com`.
| Dataset Parameter Name | Definition |
| ---------------------------------- | ------------------------------------------------------------------------ |
| `databricks_aws_region` | The AWS region for the S3 object store. E.g. `us-west-2`. |
| `databricks_aws_access_key_id` | The access key ID for the S3 object store. |
| `databricks_aws_secret_access_key` | The secret access key for the S3 object store. |
| `databricks_aws_endpoint` | The endpoint for the S3 object store. E.g. `s3.us-west-2.amazonaws.com`. |

Example:

Expand Down Expand Up @@ -141,12 +176,14 @@ One of the following auth values must be provided for Azure Blob:
- `databricks_azure_storage_sas_key`.
:::

- `databricks_azure_storage_account_name`: The Azure Storage account name.
- `databricks_azure_storage_account_key`: The Azure Storage master key for accessing the storage account.
- `databricks_azure_storage_client_id`: The service principal client id for accessing the storage account.
- `databricks_azure_storage_client_secret`: The service principal client secret for accessing the storage account.
- `databricks_azure_storage_sas_key`: The shared access signature key for accessing the storage account.
- `databricks_azure_storage_endpoint`: The endpoint for the Azure Blob storage account.
| Dataset Parameter Name | Definition |
| ---------------------------------------- | ---------------------------------------------------------------------- |
| `databricks_azure_storage_account_name` | The Azure Storage account name. |
| `databricks_azure_storage_account_key` | The Azure Storage master key for accessing the storage account. |
| `databricks_azure_storage_client_id` | The service principal client id for accessing the storage account. |
| `databricks_azure_storage_client_secret` | The service principal client secret for accessing the storage account. |
| `databricks_azure_storage_sas_key` | The shared access signature key for accessing the storage account. |
| `databricks_azure_storage_endpoint` | The endpoint for the Azure Blob storage account. |

Example:

@@ -167,7 +204,9 @@ catalogs:

#### Google Storage (GCS)

- `google_service_account`: Filesystem path to the Google service account JSON key file.
| Dataset Parameter Name | Definition |
| ------------------------ | ------------------------------------------------------------ |
| `google_service_account` | Filesystem path to the Google service account JSON key file. |

Example:

17 changes: 15 additions & 2 deletions website/docs/components/data-connectors/databricks.md
@@ -9,7 +9,7 @@ tags:
- delta-lake
---

Databricks as a connector for federated SQL query against Databricks using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html) or directly from [Delta Lake](https://delta.io/) tables.
Databricks as a connector for federated SQL query against Databricks using [Spark Connect](https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html), directly from [Delta Lake](https://delta.io/) tables, or using the [SQL Statement Execution API](https://docs.databricks.com/aws/en/dev-tools/sql-execution-tutorial).

```yaml
datasets:
Expand Down Expand Up @@ -62,6 +62,7 @@ Use the [secret replacement syntax](../secret-stores/index.md) to reference a se
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `mode`                     | The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:<br /> <ul><li>`spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.</li><li>`delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided.</li><li>`sql_warehouse`: Query through the SQL Statement Execution API. Requires a SQL Warehouse.</li></ul> |
| `databricks_endpoint`      | The endpoint of the Databricks instance. Required for all modes. |
| `databricks_sql_warehouse_id` | The ID of the SQL Warehouse in Databricks to use for the query. Only valid when `mode` is `sql_warehouse`. |
| `databricks_cluster_id` | The ID of the compute cluster in Databricks to use for the query. Only valid when `mode` is `spark_connect`. |
| `databricks_use_ssl` | If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`. |
| `client_timeout`           | Optional. Applicable only in `delta_lake` mode. Specifies the timeout for object store operations. Default value is `30s`. E.g. `client_timeout: 60s`. |
Expand Down Expand Up @@ -157,6 +158,18 @@ Configure the connection to the object store when using `mode: delta_lake`. Use
databricks_token: ${secrets:my_token}
```

### SQL Warehouse

```yaml
- from: databricks:spiceai.datasets.my_table # A reference to a table in the Databricks Unity Catalog
  name: my_table
  params:
    mode: sql_warehouse
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_sql_warehouse_id: 2b4e24cff378fb24
    databricks_token: ${secrets:my_token}
```

### Delta Lake (S3)

```yaml
Expand Down Expand Up @@ -259,4 +272,4 @@ Memory limitations can be mitigated by storing acceleration data on disk, which

## Cookbook

- A cookbook recipe to configure Databricks as data connector in Spice under `delta_lake` mode. [Spice on Databricks (mode: delta_lake)](https://github.com/spiceai/cookbook/tree/trunk/databricks/delta_lake#readme)
- A cookbook recipe to configure Databricks as a data connector in Spice. [Spice on Databricks](https://github.com/spiceai/cookbook/tree/trunk/databricks)