Commit 15ea2f1

docs: add production deployment guides for RC and GA components (#1524)
1 parent f582c1a commit 15ea2f1

47 files changed

Lines changed: 2206 additions & 26 deletions

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
---
2+
title: 'Unity Catalog Catalog Connector Deployment Guide'
3+
sidebar_label: 'Deployment Guide'
4+
description: 'Operating guide for the Unity Catalog catalog connector in production: workspace authentication, table-type filtering, effective-permissions flow, and observability.'
5+
sidebar_position: 10
6+
pagination_prev: null
7+
pagination_next: null
8+
tags:
9+
- catalogs
10+
- unity-catalog
11+
- observability
12+
---
13+
14+
Production operating guide for the Unity Catalog catalog connector — discovering Databricks Unity Catalog tables and federating them through Spice.
15+
16+
For Databricks-specific operational concerns (SQL Warehouse resilience, metrics, permissions flow as applied to Databricks workspaces), see the [Databricks Deployment Guide](../../data-connectors/databricks/deployment) — the Unity Catalog logic described there applies directly when the catalog connector targets a Databricks workspace.
17+
18+
## Authentication & Secrets
19+
20+
| Parameter | Description |
21+
| ---------------------- | ------------------------------------------------------------------------------------ |
22+
| `unity_catalog_token` | Bearer token for the Unity Catalog API. Use `${secrets:...}` from a secret store. |
23+
24+
The catalog URL must match the pattern `https://<host>/api/2.1/unity-catalog/catalogs/<catalog_id>` and is parsed into the endpoint and catalog identifier at startup. Mismatched URLs are rejected as configuration errors.
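The URL rule above can be sketched as a small parser. This is an illustration only: `CATALOG_URL`, `CatalogTarget`, and `parse_catalog_url` are hypothetical names, and the exact matching rules the connector applies (trailing slashes, query strings) are assumptions not confirmed by this guide.

```python
import re
from typing import NamedTuple

# Hypothetical pattern mirroring the documented shape:
#   https://<host>/api/2.1/unity-catalog/catalogs/<catalog_id>
CATALOG_URL = re.compile(
    r"^(?P<endpoint>https://[^/]+)"
    r"/api/2\.1/unity-catalog/catalogs/"
    r"(?P<catalog_id>[^/?#]+)$"
)


class CatalogTarget(NamedTuple):
    endpoint: str    # scheme and host, e.g. https://<host>
    catalog_id: str  # final path segment


def parse_catalog_url(url: str) -> CatalogTarget:
    """Split a catalog URL into endpoint and catalog identifier.

    Raises ValueError for URLs that do not match the documented pattern,
    standing in for the configuration error described above.
    """
    match = CATALOG_URL.match(url)
    if match is None:
        raise ValueError(f"catalog URL does not match the expected pattern: {url}")
    return CatalogTarget(match["endpoint"], match["catalog_id"])
```

Running a check like this in CI against the configured URL can catch a wrong API version or a missing catalog segment before the runtime rejects it at startup.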
25+
26+
The token is optional. When unset, the catalog connector issues unauthenticated requests, which suits locally hosted Unity Catalog deployments (OSS UC) with permissive access. For Databricks workspaces, the token is always required.
27+
28+
Secrets must be sourced from a [secret store](../../secret-stores/) in production. Rotate tokens from the UC / Databricks console and update the secret store.
29+
30+
## Resilience Controls
31+
32+
### HTTP Retry Policy
33+
34+
The Unity Catalog client uses the shared `resilient_http` helper with these defaults:
35+
36+
- Maximum retries: **3**
37+
- Backoff: fibonacci
38+
- Retriable conditions: HTTP `408`, `429`, `5xx`, and transient network errors (connect, timeout)
39+
- Respects `Retry-After`, `retry-after-ms`, `x-retry-after-ms` headers
40+
- Maximum backoff: 300 seconds
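To make the retry defaults concrete, here is a minimal sketch of how such a policy could pick its next delay. It is not the `resilient_http` implementation: the function names are hypothetical, the 1-second base interval is an assumption (the guide does not state it), and the header precedence and the choice to cap server hints at the maximum backoff are also assumptions.

```python
from typing import Mapping, Optional

MAX_RETRIES = 3            # documented default
MAX_BACKOFF_SECS = 300.0   # documented cap
BASE_DELAY_SECS = 1.0      # assumption: the real base interval is not documented


def is_retriable(status: int) -> bool:
    """HTTP 408, 429, and any 5xx are treated as retriable."""
    return status in (408, 429) or 500 <= status <= 599


def fibonacci_delay(attempt: int) -> float:
    """Delay before retry `attempt` (1-based): 1, 1, 2, 3, 5, ... base units."""
    a, b = 1, 1
    for _ in range(attempt - 1):
        a, b = b, a + b
    return min(a * BASE_DELAY_SECS, MAX_BACKOFF_SECS)


def server_hint_secs(headers: Mapping[str, str]) -> Optional[float]:
    """Read a delay hint from the three headers named above, if present."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name in ("retry-after-ms", "x-retry-after-ms"):
        if name in lowered:
            try:
                return float(lowered[name]) / 1000.0
            except ValueError:
                pass  # malformed value: fall through to the next header
    if "retry-after" in lowered:
        try:
            return float(lowered["retry-after"])
        except ValueError:
            return None  # the HTTP-date form is not handled in this sketch
    return None


def next_delay(attempt: int, status: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return seconds to wait before retry `attempt`, or None to give up."""
    if attempt > MAX_RETRIES or not is_retriable(status):
        return None
    hint = server_hint_secs(headers)
    delay = hint if hint is not None else fibonacci_delay(attempt)
    return min(delay, MAX_BACKOFF_SECS)
```

Under these assumptions, three retries without server hints add only a few seconds of waiting, so a sustained `429` storm shows up as failed discovery rather than a long stall.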
41+
42+
These are not exposed as user-tunable parameters on the Unity Catalog connector itself.
43+
44+
### Discovery Concurrency
45+
46+
The connector fans out schema and table enumeration with bounded concurrency to avoid thundering-herd on the UC API:
47+
48+
- Schema refresh: up to **5** concurrent requests (`buffer_unordered(5)`)
49+
- Permission checks: up to **5** concurrent requests (`buffer_unordered(5)`)
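The bounded fan-out can be illustrated with a semaphore in Python's `asyncio`. This is an analogy to `buffer_unordered(5)`, not the connector's code: `bounded_map` is a hypothetical helper, and unlike `buffer_unordered`, `asyncio.gather` returns results in input order rather than completion order.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 5,  # mirrors the documented concurrency of 5
) -> List[R]:
    """Run `worker` over `items` with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    return list(await asyncio.gather(*(run(item) for item in items)))
```

With a fixed limit, total time grows roughly linearly with the number of schemas or tables, which is why large catalogs take minutes rather than seconds to enumerate.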
50+
51+
For catalogs with thousands of tables, initial discovery can take minutes while the connector respects these limits.
52+
53+
## Table Type and Permission Handling
54+
55+
### Table Type Filtering
56+
57+
| Table Type | Supported | Notes |
58+
| ------------------- | --------- | -------------------------------------- |
59+
| `MANAGED` | Yes | Standard Delta tables |
60+
| `EXTERNAL` | Yes | Tables with external storage locations |
61+
| `FOREIGN` | Yes | Lakehouse Federation foreign tables |
62+
| `MATERIALIZED_VIEW` | Yes | Materialized views |
63+
| `VIEW` | No | Skipped during discovery |
64+
| `STREAMING_TABLE` | No | Skipped during discovery |
65+
66+
Unsupported table types are skipped during catalog discovery. Referencing an unsupported table directly returns an error.
67+
68+
### Effective Permissions
69+
70+
Before creating a table provider, the connector checks permissions via `GET /api/2.1/unity-catalog/effective-permissions/table/{catalog.schema.table}`. The following privileges grant read access:
71+
72+
- `SELECT`
73+
- `ALL_PRIVILEGES` / `ALL PRIVILEGES`
74+
- `OWNER` / `OWNERSHIP`
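The privilege list above amounts to a set-membership test. The sketch below uses hypothetical names (`READ_PRIVILEGES`, `grants_read`), and the whitespace and case normalization is an assumption, since the guide does not say how privilege strings are compared.

```python
from typing import Iterable

# Privileges the guide lists as granting read access, including both spellings.
READ_PRIVILEGES = frozenset(
    {"SELECT", "ALL_PRIVILEGES", "ALL PRIVILEGES", "OWNER", "OWNERSHIP"}
)


def grants_read(privileges: Iterable[str]) -> bool:
    """Return True if any effective privilege grants read access.

    Normalizing whitespace and case is an assumption made for this sketch.
    """
    return any(p.strip().upper() in READ_PRIVILEGES for p in privileges)
```

A helper like this is handy when auditing the output of the effective-permissions endpoint by hand to predict which tables discovery will expose.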
75+
76+
**Behavior**:
77+
78+
- **Discovery**: Tables without read permission are skipped.
79+
- **Direct reference**: An `InsufficientPermissions` error is returned.
80+
- **Foreign tables**: The precheck is skipped (`requires_read_permission_validation = false`) because Lakehouse Federation access can be valid when the UC effective-permissions endpoint does not report a table-level privilege. Access is still enforced by Databricks at query time.
81+
- **Graceful degradation**: If the UC API is unreachable or returns an error for the permissions endpoint, discovery proceeds with a warning — table providers are still created, and any per-query authorization failures surface at query time.
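The four behaviors above can be read as one decision function. The sketch below is a model of the documented outcomes, not the connector's code: `Outcome` and `resolve` are hypothetical names, `has_read=None` stands for a failed or unreachable permissions endpoint, and treating a direct reference the same as discovery in that case is an assumption.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    EXPOSE = "expose"                  # table provider is created
    SKIP = "skip"                      # table is left out of discovery
    ERROR = "InsufficientPermissions"  # error returned to the caller


def resolve(table_type: str, has_read: Optional[bool], direct_reference: bool) -> Outcome:
    """Map a permission precheck result to the documented behavior."""
    # FOREIGN tables skip the precheck; Databricks enforces access at query time.
    if table_type == "FOREIGN":
        return Outcome.EXPOSE
    # Endpoint failure: proceed (with a warning in the real connector).
    if has_read is None:
        return Outcome.EXPOSE
    if has_read:
        return Outcome.EXPOSE
    # No read privilege: skip during discovery, error on a direct reference.
    return Outcome.ERROR if direct_reference else Outcome.SKIP
```

The practical consequence is that an exposed table is not proof of access: for foreign tables and after endpoint failures, the first authorization check happens when a query runs.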
82+
83+
## Capacity & Sizing
84+
85+
- **Initial discovery**: Scales with the number of schemas × tables. Bounded concurrency caps throughput; plan 5–30 minutes for catalogs with thousands of tables on a cold start.
86+
- **Refresh**: Catalog refresh re-enumerates schemas and tables at the configured interval. For very large catalogs, refresh less frequently (every few hours) unless schemas change rapidly.
87+
- **Permission-check cost**: One API call per table. The buffer of 5 caps concurrency.
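For planning, a rough lower bound on cold-start discovery time follows from the call counts and the concurrency limit. The sketch below is a back-of-the-envelope model with several assumptions: one list-tables call per schema, one permission check per table, the two phases running one after the other, and a uniform per-call latency supplied by the reader. It ignores pagination, retries, and table-metadata fetches, so real discovery will take longer.

```python
import math


def discovery_floor_seconds(
    schemas: int,
    tables: int,
    latency_s: float,      # assumed average latency per UC API call
    concurrency: int = 5,  # documented concurrency limit
) -> float:
    """Estimate a lower bound on discovery time under the stated assumptions."""
    list_waves = math.ceil(schemas / concurrency)       # list-tables calls
    permission_waves = math.ceil(tables / concurrency)  # one check per table
    return (list_waves + permission_waves) * latency_s
```

For example, 50 schemas and 5,000 tables at an assumed 200 ms per call give a floor of roughly 200 seconds, which sits below the 5 to 30 minute planning range quoted above, as expected for a best-case estimate.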
88+
89+
## Metrics
90+
91+
The Unity Catalog connector does not currently register UC-specific OpenTelemetry metric instruments. When used via the Databricks connector, the shared SQL Warehouse and UC spans produce task-history records that can be aggregated for operational insight.
92+
93+
Monitor via:
94+
95+
- Spice query execution metrics (`query_duration_ms`, `query_processed_rows`) from `runtime.metrics`.
96+
- Task-history spans listed below.
97+
- Databricks / UC workspace audit logs for API-level visibility.
98+
99+
See [Component Metrics](../../../features/observability/component_metrics) for general configuration.
100+
101+
## Task History
102+
103+
Unity Catalog operations emit the following [task history](../../../reference/task_history) spans:
104+
105+
| Span | Input | Description |
106+
| ------------------------------ | ----------------------------- | ---------------------------------------- |
107+
| `uc_get_table` | Fully-qualified table name | Fetch table metadata from Unity Catalog. |
108+
| `uc_get_catalog` | Catalog ID | Fetch catalog metadata. |
109+
| `uc_list_schemas` | Catalog ID | List schemas in a catalog. |
110+
| `uc_list_tables` | `catalog_id.schema_name` | List tables in a schema. |
111+
| `uc_get_effective_permissions` | Fully-qualified table name | Check effective permissions for a table. |
112+
113+
## Known Limitations
114+
115+
- **VIEW and STREAMING_TABLE are skipped**: Only the supported table types (`MANAGED`, `EXTERNAL`, `FOREIGN`, `MATERIALIZED_VIEW`) are exposed.
116+
- **No UC write-back**: The connector is read-only; writes to UC are not supported through Spice.
117+
- **HTTP retry/concurrency parameters not exposed**: The resilient-HTTP retry defaults (3 retries, fibonacci backoff) and the discovery concurrency limit (5) are not currently user-tunable on the UC connector.
118+
- **Graceful degradation on permission-endpoint failures**: If UC effective-permissions is unreachable, Spice proceeds; authorization errors surface at query time rather than discovery time.
119+
120+
## Troubleshooting
121+
122+
| Symptom | Likely cause | Resolution |
123+
| ----------------------------------------------------------------------- | ------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
124+
| `401 Unauthorized` on catalog list | Missing, expired, or wrong-workspace token. | Regenerate token in UC / Databricks; update secret store. |
125+
| Table visible in UC but missing from the Spice catalog | Table type is VIEW / STREAMING_TABLE or permissions were denied. | Confirm table type is supported and that the principal has `SELECT` (or equivalent). |
126+
| `InsufficientPermissions` on direct table reference | Role lacks read privilege on the table. | Grant `SELECT` on the table in UC. |
127+
| Slow catalog discovery on thousands of tables | Bounded concurrency + permission checks per table. | Expected behavior; schedule discovery during low-traffic windows and cache via accelerated datasets. |
128+
| Tables from a Lakehouse Federation source missing | The permission precheck is skipped for FOREIGN tables, and Databricks denied access at query time. | Verify the Databricks workspace has federation privileges granted to the principal. |
File renamed without changes.
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
---
2+
title: 'Arrow Data Accelerator Deployment Guide'
3+
sidebar_label: 'Deployment Guide'
4+
description: 'Operating guide for the Arrow (in-memory) data accelerator in production: memory sizing, indexes, and observability.'
5+
sidebar_position: 10
6+
pagination_prev: null
7+
pagination_next: null
8+
tags:
9+
- data-accelerators
10+
- arrow
11+
- observability
12+
---
13+
14+
Production operating guide for the Arrow in-memory data accelerator covering memory sizing, optional hash indexes, and observability.
15+
16+
## Authentication & Secrets
17+
18+
The Arrow accelerator is an in-process, in-memory engine. There is no external storage and no authentication or secret management required.
19+
20+
## Resilience & Durability
21+
22+
The Arrow accelerator is **not durable**. Data is held in RAM and is lost on process restart; every restart re-materializes the dataset from the source connector.
23+
24+
- **Crash recovery**: None — on restart, the dataset is refreshed from scratch.
25+
- **File modes**: File-mode acceleration is rejected at startup; Arrow is memory-only. Use [DuckDB](../duckdb/deployment), [SQLite](../sqlite/deployment), [PostgreSQL](../postgres/deployment), or [Cayenne](../cayenne/deployment) when durability or spill is required.
26+
- **Concurrency**: Arrow reads are lock-free. Refresh cadence is controlled by the runtime refresh semaphore, not by the accelerator itself.
27+
28+
## Capacity & Sizing
29+
30+
- **Memory**: Plan for 1.0–1.5× the raw row-oriented size of the source data, plus overhead for string dictionaries. Use the source connector's schema and row count to estimate.
31+
- **Hash index**: Optional, disabled by default. When enabled via `hash_index: enabled`, a hash map is built over the primary-key columns. Build time scales linearly with rows; memory overhead is approximately 24–48 bytes per row plus the key size.
32+
- **Startup cost**: Full-dataset materialization happens on startup. For tables larger than ~1 GB, consider a durable accelerator to avoid repeated full refresh on every restart.
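The sizing guidance above can be turned into a quick range estimate. This sketch applies only the figures quoted in the bullets (1.0 to 1.5 times the raw size, and 24 to 48 bytes per row plus the key size for the hash index); it does not model string-dictionary overhead or query working memory, and `estimate_arrow_memory` is a hypothetical helper name.

```python
from typing import Tuple


def estimate_arrow_memory(
    raw_bytes: int,              # raw row-oriented size of the source data
    rows: int,
    key_bytes: int = 0,          # combined size of the primary-key columns
    hash_index: bool = False,    # the index is disabled by default
) -> Tuple[int, int]:
    """Return a (low, high) memory estimate in bytes."""
    low = raw_bytes              # 1.0x the raw size
    high = raw_bytes * 3 // 2    # 1.5x the raw size
    if hash_index:
        low += rows * (24 + key_bytes)
        high += rows * (48 + key_bytes)
    return low, high
```

For 10 million rows, 2 GB of raw data, and an 8-byte key with the index enabled, this gives a range of about 2.3 GB to 3.6 GB before query and runtime overhead.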
33+
34+
## Metrics
35+
36+
Generic acceleration metrics are available with the `dataset_acceleration_` prefix. Hash-index operations emit dedicated metrics when the index is enabled:
37+
38+
| Metric | Type | Description |
39+
| ---------------------------------- | --------- | --------------------------------------------------------- |
40+
| `hash_index_builds` | Counter | Total hash-index builds (one per refresh). |
41+
| `hash_index_build_duration_ms` | Histogram | Time to build the hash index. |
42+
| `hash_index_entries` | Gauge | Number of entries in the index. |
43+
| `hash_index_memory_bytes` | Gauge | Approximate memory footprint of the index. |
44+
| `hash_index_lookups` | Counter | Total hash-index lookups performed by queries. |
45+
| `hash_index_lookup_rows` | Counter | Total rows returned via hash-index lookups. |
46+
47+
See [Component Metrics](../../../features/observability/component_metrics) for enabling and exporting metrics. Refresh metrics are described in [Acceleration](../../../features/data-acceleration/).
48+
49+
## Task History
50+
51+
Arrow acceleration operations (refresh, query) participate in [task history](../../../reference/task_history) through the shared acceleration spans (`accelerated_table_refresh`, `sql_query`). No Arrow-specific spans are emitted — the accelerator is a thin wrapper over Arrow memory.
52+
53+
## Known Limitations
54+
55+
- **No persistence**: Every restart refreshes from the source.
56+
- **No traditional indexes**: Arrow does not support B-tree indexes. Hash index provides point-lookup acceleration but not range or sort-order optimization.
57+
- **Only primary-key hash index**: The hash index requires a `primary_key` constraint; `unique` constraints alone do not enable the index.
58+
- **Memory pressure**: If the dataset exceeds available RAM, the runtime will OOM; no spill-to-disk mechanism exists in the Arrow accelerator itself.
59+
- **`partition_by`**: Not applicable — Arrow accelerator holds a single in-memory representation.
60+
61+
## Troubleshooting
62+
63+
| Symptom | Likely cause | Resolution |
64+
| ------------------------------------------------ | ------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
65+
| OOM on refresh | Source dataset larger than RAM. | Switch to a durable accelerator (DuckDB / SQLite / Cayenne) that supports spill to disk. |
66+
| Long startup time | Full-dataset refresh runs on boot. | Switch to a durable accelerator so refresh is incremental, not full, on restart. |
67+
| `hash_index` ignored | No primary-key constraint on the dataset. | Add `primary_key:` to the dataset definition; with `hash_index: enabled` set, the index then activates. |
68+
| Query slow for point lookups | Hash index disabled or wrong key column. | Enable `hash_index: enabled`; ensure the query filter matches the primary-key columns. |
69+
| Accelerator refuses to start with file mode | Arrow rejects file-mode acceleration. | Switch `engine:` to `duckdb`, `sqlite`, `postgres`, or `cayenne`. |

website/docs/components/data-accelerators/arrow.md renamed to website/docs/components/data-accelerators/arrow/index.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ See [Hash Index](../../features/data-acceleration/hash-index) for configuration
65 65

66 66
When accelerating a dataset using the In-Memory Arrow Data Accelerator, some or all of the dataset is loaded into memory. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.
67 67

68-
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb) and [`sqlite`](./sqlite) accelerators by specifying `mode: file`.
68+
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](duckdb) and [`sqlite`](sqlite) accelerators by specifying `mode: file`.
69 69

70 70
:::
71 71
