Closed
25 commits
db83140
fix: Add nanosecond timestamp range limitation to MSSQL and Oracle ve…
Apr 18, 2026
4ead185
fix: Correct PostgreSQL accelerator connection pool parameter name an…
Apr 19, 2026
577c5dd
fix: Correct IMAP ssl_mode default from tls to auto
Apr 19, 2026
fb7b1cb
fix: Correct DynamoDB default time_format to include milliseconds
Apr 19, 2026
99740c9
fix: Use lowercase security_protocol values in Kafka and Debezium docs
Apr 18, 2026
a1f1dd1
fix: Add missing DynamoDB parameters to versioned docs
Apr 18, 2026
85601a8
docs: add production deployment guides for RC and GA components (#1524)
lukekim Apr 18, 2026
98bed1a
fix: Correct "Dashbord" typo in Datadog monitoring docs
Apr 20, 2026
159864a
fix: Remove nonexistent duckdb_connection_string and fix param name
Apr 20, 2026
54f8b2d
fix: Remove nonexistent dremio_token parameter from deployment guide
Apr 20, 2026
f4523ed
fix: Correct GraphQL deployment guide parameter names
Apr 20, 2026
8ac9c5a
fix: Add missing connector prefix to azure_storage_client_secret in a…
claudespice Apr 20, 2026
0ed660f
fix: Correct Postgres connector deployment guide defaults and params …
claudespice Apr 20, 2026
da0a4e3
fix: Correct multiple inaccuracies in MySQL deployment guide (#1534)
claudespice Apr 20, 2026
a8a0223
fix: Document PostgreSQL replication parameters for WAL streaming (#1…
claudespice Apr 20, 2026
59d80fa
faq: add incremental ingestion Q and native PostgreSQL replication CD…
lukekim Apr 21, 2026
6986a0b
docs(otel): document runtime.telemetry.metric_prefix, properties, and…
phillipleblanc Apr 21, 2026
338331f
fix: Note scylladb_ssl parameter is not yet implemented
Apr 21, 2026
edb1f90
fix: Correct pg_sslmode default from verify-full to prefer
Apr 21, 2026
166b8b6
Document multi-vector embeddings and late-interaction search
lukekim Apr 21, 2026
c6bbd20
Replace KaTeX expression with plain text (no math plugin configured)
lukekim Apr 21, 2026
ff77178
Release Post for v.2.0.0-rc.3 (#1546)
wyattwenzel Apr 21, 2026
d172a4a
fix: Correct IMAP ssl_mode default from `tls` to `auto` (#1529)
claudespice Apr 21, 2026
384c4b8
fix: Correct PostgreSQL accelerator connection pool parameter name an…
claudespice Apr 21, 2026
2fe576e
fix: Restore MSSQL and Oracle nanosecond timestamp notes from trunk
Copilot Apr 21, 2026
2 changes: 1 addition & 1 deletion website/docs/components/catalogs/databricks.md
@@ -188,7 +188,7 @@ catalogs:
One of the following auth values must be provided for Azure Blob:

- `databricks_azure_storage_account_key`,
-- `databricks_azure_storage_client_id` and `azure_storage_client_secret`, or
+- `databricks_azure_storage_client_id` and `databricks_azure_storage_client_secret`, or
- `databricks_azure_storage_sas_key`.
:::

128 changes: 128 additions & 0 deletions website/docs/components/catalogs/unity-catalog/deployment.md
@@ -0,0 +1,128 @@
---
title: 'Unity Catalog Catalog Connector Deployment Guide'
sidebar_label: 'Deployment Guide'
description: 'Operating guide for the Unity Catalog catalog connector in production: workspace authentication, table-type filtering, effective-permissions flow, and observability.'
sidebar_position: 10
pagination_prev: null
pagination_next: null
tags:
- catalogs
- unity-catalog
- observability
---

Production operating guide for the Unity Catalog catalog connector — discovering Databricks Unity Catalog tables and federating them through Spice.

For Databricks-specific operational concerns (SQL Warehouse resilience, metrics, permissions flow as applied to Databricks workspaces), see the [Databricks Deployment Guide](../../data-connectors/databricks/deployment) — the Unity Catalog logic described there applies directly when the catalog connector targets a Databricks workspace.

## Authentication & Secrets

| Parameter | Description |
| ---------------------- | ------------------------------------------------------------------------------------ |
| `unity_catalog_token` | Bearer token for the Unity Catalog API. Use `${secrets:...}` from a secret store. |

The catalog URL must match the pattern `https://<host>/api/2.1/unity-catalog/catalogs/<catalog_id>` and is parsed into the endpoint and catalog identifier at startup. Mismatched URLs are rejected as configuration errors.
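
As a rough illustration of that split, the sketch below parses a catalog URL into an endpoint and a catalog identifier. `parse_catalog_url` and the regular expression are hypothetical names written for this example. They are not Spice's parser, and details such as rejecting a trailing slash are assumptions.

```python
import re

# Hypothetical pattern mirroring the documented URL shape; the real parser may differ.
CATALOG_URL = re.compile(
    r"(?P<endpoint>https://[^/]+)/api/2\.1/unity-catalog/catalogs/(?P<catalog_id>[^/?#]+)"
)


def parse_catalog_url(url: str) -> tuple[str, str]:
    """Split a catalog URL into (endpoint, catalog_id) or raise ValueError."""
    match = CATALOG_URL.fullmatch(url)
    if match is None:
        raise ValueError(f"URL does not match the Unity Catalog pattern: {url!r}")
    return match.group("endpoint"), match.group("catalog_id")
```

Running a check like this against a configured URL before deployment can catch a wrong API version or a missing catalog segment early.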

The token is optional — when unset, the catalog connector issues unauthenticated requests, suitable for locally-hosted Unity Catalog deployments (OSS UC) with permissive access. For Databricks workspaces, the token is always required.

Secrets must be sourced from a [secret store](../../secret-stores/) in production. Rotate tokens from the UC / Databricks console and update the secret store.

## Resilience Controls

### HTTP Retry Policy

The Unity Catalog client uses the shared `resilient_http` helper with these defaults:

- Maximum retries: **3**
- Backoff: fibonacci
- Retriable conditions: HTTP `408`, `429`, `5xx`, and transient network errors (connect, timeout)
- Respects `Retry-After`, `retry-after-ms`, `x-retry-after-ms` headers
- Maximum backoff: 300 seconds

These are not exposed as user-tunable parameters on the Unity Catalog connector itself.
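
To make the defaults concrete, here is a minimal sketch of a Fibonacci backoff schedule with a cap, plus the retriable-status check. It is not the `resilient_http` implementation: the function names, the one-second base unit, and the sequence starting at 1, 1 are assumptions, and header handling (`Retry-After` and friends) is left out.

```python
def is_retriable(status: int) -> bool:
    """HTTP statuses the guide lists as retriable: 408, 429, and any 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def fibonacci_delays(max_retries: int = 3,
                     base_seconds: float = 1.0,
                     max_backoff_seconds: float = 300.0) -> list[float]:
    """Return one delay per retry, growing as 1, 1, 2, 3, 5, ... times the base.

    base_seconds is an assumed unit; the guide only states the retry count,
    the Fibonacci shape, and the 300-second ceiling.
    """
    delays = []
    current, following = 1, 1
    for _ in range(max_retries):
        delays.append(min(current * base_seconds, max_backoff_seconds))
        current, following = following, current + following
    return delays
```

Under these assumptions, three retries add only a few seconds of waiting in total, so the 300-second ceiling matters mainly when a server sends a long retry hint.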

### Discovery Concurrency

The connector fans out schema and table enumeration with bounded concurrency to avoid a thundering herd against the UC API:

- Schema refresh: up to **5** concurrent requests (`buffer_unordered(5)`)
- Permission checks: up to **5** concurrent requests (`buffer_unordered(5)`)

For catalogs with thousands of tables, initial discovery can take minutes while the connector respects these limits.
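
The bounded fan-out can be pictured with the following Python sketch, which limits in-flight work to five tasks with a semaphore. It is an analogy for `buffer_unordered(5)`, not the connector's code: `bounded_gather` and the `worker` callback are hypothetical, and unlike `buffer_unordered` this version returns results in input order.

```python
import asyncio


async def bounded_gather(items, worker, limit=5):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run(item) for item in items))
```

With a fixed limit, total discovery time grows roughly linearly with the number of tables, which is why large catalogs take minutes rather than seconds.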

## Table Type and Permission Handling

### Table Type Filtering

| Table Type | Supported | Notes |
| ------------------- | --------- | -------------------------------------- |
| `MANAGED` | Yes | Standard Delta tables |
| `EXTERNAL` | Yes | Tables with external storage locations |
| `FOREIGN` | Yes | Lakehouse Federation foreign tables |
| `MATERIALIZED_VIEW` | Yes | Materialized views |
| `VIEW` | No | Skipped during discovery |
| `STREAMING_TABLE` | No | Skipped during discovery |

Unsupported table types are skipped during catalog discovery. Referencing an unsupported table directly returns an error.

### Effective Permissions

Before creating a table provider, the connector checks permissions via `GET /api/2.1/unity-catalog/effective-permissions/table/{catalog.schema.table}`. The following privileges grant read access:

- `SELECT`
- `ALL_PRIVILEGES` / `ALL PRIVILEGES`
- `OWNER` / `OWNERSHIP`

**Behavior**:

- **Discovery**: Tables without read permission are skipped.
- **Direct reference**: An `InsufficientPermissions` error is returned.
- **Foreign tables**: The precheck is skipped (`requires_read_permission_validation = false`) because Lakehouse Federation access can be valid when the UC effective-permissions endpoint does not report a table-level privilege. Access is still enforced by Databricks at query time.
- **Graceful degradation**: If the UC API is unreachable or returns an error for the permissions endpoint, discovery proceeds with a warning — table providers are still created, and any per-query authorization failures surface at query time.
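
The discovery rules above can be summarised in a short sketch. The function name and the exact-match, upper-case comparison are assumptions made for illustration. Graceful degradation on API errors is not modelled here.

```python
SUPPORTED_TABLE_TYPES = {"MANAGED", "EXTERNAL", "FOREIGN", "MATERIALIZED_VIEW"}
READ_PRIVILEGES = {"SELECT", "ALL_PRIVILEGES", "ALL PRIVILEGES", "OWNER", "OWNERSHIP"}


def is_discoverable(table_type: str, privileges: list[str]) -> bool:
    """Decide whether a table would be exposed during discovery (hypothetical helper)."""
    if table_type not in SUPPORTED_TABLE_TYPES:
        return False  # VIEW and STREAMING_TABLE are skipped
    if table_type == "FOREIGN":
        return True  # permission precheck is skipped for foreign tables
    # Any one read-granting privilege is enough.
    return any(privilege in READ_PRIVILEGES for privilege in privileges)
```

For example, a `MANAGED` table whose principal holds only `MODIFY` would be skipped, while a `FOREIGN` table is exposed regardless of the reported privileges.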

## Capacity & Sizing

- **Initial discovery**: Scales with the number of schemas × tables. Bounded concurrency caps throughput; plan 5–30 minutes for catalogs with thousands of tables on a cold start.
- **Refresh**: Catalog refresh re-enumerates schemas and tables at the configured interval. For very large catalogs, refresh less frequently (every few hours) unless schemas change rapidly.
- **Permission-check cost**: One API call per table, with at most 5 checks in flight at a time.
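
A back-of-the-envelope lower bound for the permission-check phase follows from those numbers. The sketch assumes a constant per-request latency, which is an illustrative input and not a measured value, and it ignores retries, rate limiting, and the schema and table listing calls.

```python
import math


def permission_check_seconds(table_count: int, latency_ms: int, concurrency: int = 5) -> float:
    """Lower-bound estimate: checks run in waves of `concurrency` requests."""
    waves = math.ceil(table_count / concurrency)
    return waves * latency_ms / 1000
```

With 5,000 tables and an assumed 200 ms per call, the estimate is 200 seconds for permission checks alone. Real discovery takes longer once listing calls and retries are included.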

## Metrics

The Unity Catalog connector does not currently register UC-specific OpenTelemetry metric instruments. When used via the Databricks connector, the shared SQL Warehouse and UC spans produce task-history records that can be aggregated for operational insight.

Monitor via:

- Spice query execution metrics (`query_duration_ms`, `query_processed_rows`) from `runtime.metrics`.
- Task-history spans listed below.
- Databricks / UC workspace audit logs for API-level visibility.

See [Component Metrics](../../../features/observability/component_metrics) for general configuration.

## Task History

Unity Catalog operations emit the following [task history](../../../reference/task_history) spans:

| Span | Input | Description |
| ------------------------------ | ----------------------------- | ---------------------------------------- |
| `uc_get_table` | Fully-qualified table name | Fetch table metadata from Unity Catalog. |
| `uc_get_catalog` | Catalog ID | Fetch catalog metadata. |
| `uc_list_schemas` | Catalog ID | List schemas in a catalog. |
| `uc_list_tables` | `catalog_id.schema_name` | List tables in a schema. |
| `uc_get_effective_permissions` | Fully-qualified table name | Check effective permissions for a table. |

## Known Limitations

- **VIEW and STREAMING_TABLE are skipped**: Only queryable table types are exposed.
- **No UC write-back**: The connector is read-only; writes to UC are not supported through Spice.
- **HTTP retry/concurrency parameters not exposed**: The resilient-HTTP defaults (3 retries, fibonacci backoff, concurrency 5) are not currently user-tunable on the UC connector.
- **Graceful degradation on permission-endpoint failures**: If UC effective-permissions is unreachable, Spice proceeds; authorization errors surface at query time rather than discovery time.

## Troubleshooting

| Symptom | Likely cause | Resolution |
| ----------------------------------------------------------------------- | ------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| `401 Unauthorized` on catalog list | Missing, expired, or wrong-workspace token. | Regenerate token in UC / Databricks; update secret store. |
| Table visible in UC but missing from the Spice catalog | Table type is VIEW / STREAMING_TABLE or permissions were denied. | Confirm table type is supported and that the principal has `SELECT` (or equivalent). |
| `InsufficientPermissions` on direct table reference | Role lacks read privilege on the table. | Grant `SELECT` on the table in UC. |
| Slow catalog discovery on thousands of tables | Bounded concurrency + permission checks per table. | Expected behavior; schedule discovery during low-traffic windows and cache via accelerated datasets. |
| Tables from a Lakehouse Federation source missing                       | FOREIGN precheck is skipped; Databricks denied access at query time. | Verify the Databricks workspace has federation privileges granted to the principal.                                         |
@@ -66,7 +66,7 @@ The `dataset_params` field is used to configure the dataset-specific parameters
One of the following auth values must be provided for Azure Blob:

- `unity_catalog_azure_storage_account_key`,
-- `unity_catalog_azure_storage_client_id` and `azure_storage_client_secret`, or
+- `unity_catalog_azure_storage_client_id` and `unity_catalog_azure_storage_client_secret`, or
- `unity_catalog_azure_storage_sas_key`.
:::

69 changes: 69 additions & 0 deletions website/docs/components/data-accelerators/arrow/deployment.md
@@ -0,0 +1,69 @@
---
title: 'Arrow Data Accelerator Deployment Guide'
sidebar_label: 'Deployment Guide'
description: 'Operating guide for the Arrow (in-memory) data accelerator in production: memory sizing, indexes, and observability.'
sidebar_position: 10
pagination_prev: null
pagination_next: null
tags:
- data-accelerators
- arrow
- observability
---

Production operating guide for the Arrow in-memory data accelerator covering memory sizing, optional hash indexes, and observability.

## Authentication & Secrets

The Arrow accelerator is an in-process, in-memory engine. There is no external storage and no authentication or secret management required.

## Resilience & Durability

The Arrow accelerator is **not durable**. Data is held in RAM and is lost on process restart; every restart re-materializes the dataset from the source connector.

- **Crash recovery**: None — on restart, the dataset is refreshed from scratch.
- **File modes**: File-mode acceleration is rejected at startup; Arrow is memory-only. Use [DuckDB](../duckdb/deployment), [SQLite](../sqlite/deployment), [PostgreSQL](../postgres/deployment), or [Cayenne](../cayenne/deployment) when durability or spill is required.
- **Concurrency**: Arrow reads are lock-free. Refresh cadence is controlled by the runtime refresh semaphore, not by the accelerator itself.

## Capacity & Sizing

- **Memory**: Plan for 1.0–1.5× the raw row-oriented size of the source data, plus overhead for string dictionaries. Use the source connector's schema and row count to estimate.
- **Hash index**: Optional, disabled by default. When enabled via `hash_index: enabled`, a hash map is built over the primary-key columns. Build time scales linearly with rows; memory overhead is approximately 24–48 bytes per row plus the key size.
- **Startup cost**: Full-dataset materialization happens on startup. For tables larger than ~1 GB, consider a durable accelerator to avoid repeated full refresh on every restart.
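
The sizing guidance above can be turned into a rough range. The helper below is a hypothetical sketch that applies the 1.0–1.5× multiplier and the 24–48 bytes-per-row index overhead. It does not model string-dictionary overhead or query working memory.

```python
def arrow_memory_estimate(raw_bytes: int, rows: int, key_bytes: int = 0,
                          hash_index: bool = False) -> tuple[int, int]:
    """Return a (low, high) byte estimate for an Arrow-accelerated dataset."""
    low = raw_bytes                # 1.0x the raw row-oriented size
    high = raw_bytes * 3 // 2      # 1.5x the raw row-oriented size
    if hash_index:
        # 24 to 48 bytes per row, plus the key itself.
        low += rows * (24 + key_bytes)
        high += rows * (48 + key_bytes)
    return low, high
```

For a 2 GB source with 10 million rows and an 8-byte key, the estimate is about 2.3 GB to 3.6 GB with the hash index enabled, before runtime and query overhead.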

## Metrics

Generic acceleration metrics are available with the `dataset_acceleration_` prefix. Hash-index operations emit dedicated metrics when the index is enabled:

| Metric | Type | Description |
| ---------------------------------- | --------- | --------------------------------------------------------- |
| `hash_index_builds` | Counter | Total hash-index builds (one per refresh). |
| `hash_index_build_duration_ms` | Histogram | Time to build the hash index. |
| `hash_index_entries` | Gauge | Number of entries in the index. |
| `hash_index_memory_bytes` | Gauge | Approximate memory footprint of the index. |
| `hash_index_lookups` | Counter | Total hash-index lookups performed by queries. |
| `hash_index_lookup_rows` | Counter | Total rows returned via hash-index lookups. |

See [Component Metrics](../../../features/observability/component_metrics) for enabling and exporting metrics. Refresh metrics are described in [Acceleration](../../../features/data-acceleration/).

## Task History

Arrow acceleration operations (refresh, query) participate in [task history](../../../reference/task_history) through the shared acceleration spans (`accelerated_table_refresh`, `sql_query`). No Arrow-specific spans are emitted — the accelerator is a thin wrapper over Arrow memory.

## Known Limitations

- **No persistence**: Every restart refreshes from the source.
- **No traditional indexes**: Arrow does not support B-tree indexes. Hash index provides point-lookup acceleration but not range or sort-order optimization.
- **Only primary-key hash index**: The hash index requires a `primary_key` constraint; `unique` constraints alone do not enable the index.
- **Memory pressure**: If the dataset exceeds available RAM, the runtime will OOM; no spill-to-disk mechanism exists in the Arrow accelerator itself.
- **`partition_by`**: Not applicable; the Arrow accelerator holds a single in-memory representation.

## Troubleshooting

| Symptom | Likely cause | Resolution |
| ------------------------------------------------ | ------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| OOM on refresh | Source dataset larger than RAM. | Switch to a durable accelerator (DuckDB / SQLite / Cayenne) that supports spill to disk. |
| Long startup time | Full-dataset refresh runs on boot. | Switch to a durable accelerator so refresh is incremental, not full, on restart. |
| `hash_index` ignored                             | No primary-key constraint on the dataset.               | Add `primary_key:` to the dataset definition; the hash index is then built when `hash_index: enabled` is set. |
| Query slow for point lookups | Hash index disabled or wrong key column. | Enable `hash_index: enabled`; ensure the query filter matches the primary-key columns. |
| Accelerator refuses to start with file mode | Arrow rejects file-mode acceleration. | Switch `engine:` to `duckdb`, `sqlite`, `postgres`, or `cayenne`. |
@@ -65,7 +65,7 @@ See [Hash Index](../../features/data-acceleration/hash-index) for configuration

When accelerating a dataset using the In-Memory Arrow Data Accelerator, some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

-In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb) and [`sqlite`](./sqlite) accelerators by specifying `mode: file`.
+In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](duckdb) and [`sqlite`](sqlite) accelerators by specifying `mode: file`.

:::
