Commit 1b1c5f4

Merge branch 'trunk' into lukim/vector-engine-elastic-search
2 parents 4e2aae5 + 4557b0d commit 1b1c5f4

23 files changed

Lines changed: 201 additions & 32 deletions

File tree

website/docs/components/data-connectors/imap.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -100,7 +100,7 @@ The IMAP connector supports the following connection and authentication paramete
100100
| `imap_username` | Optional. The username to use for the IMAP connection. Defaults to the value of the `from:` mailbox field. |
101101
| `imap_password` | Optional. The password to use for the IMAP connection, in plaintext authentication mode. |
102102
| `imap_host` | Optional. The host or IP address of the IMAP server to connect to. Not required for known connections like Outlook or Gmail. |
103-
| `imap_port` | Optional. The port of the IMAP server to connect to. |
103+
| `imap_port` | Optional. The port of the IMAP server to connect to. Defaults to `993`. |
104104
| `imap_mailbox` | Optional. The mailbox to read mail from. Defaults to `INBOX`, the standard email inbox. |
105105
| `imap_ssl_mode` | Optional. The IMAP SSL mode to use. Defaults to `auto`. Permitted values: `tls`, `starttls`, `disabled`, or `auto`. |
106106

website/docs/components/data-connectors/mongodb.md

Lines changed: 3 additions & 3 deletions
@@ -80,9 +80,9 @@ The MongoDB data connector can be configured by providing the following `params`
8080
| `mongodb_connection_string` | The connection string to use to connect to the MongoDB server. This can be used instead of providing individual connection parameters. |
8181
| `mongodb_user` | The MongoDB username. |
8282
| `mongodb_pass` | The password to connect with. |
83-
| `mongodb_host` | The hostname of the MongoDB server. |
84-
| `mongodb_port` | The port of the MongoDB server. |
85-
| `mongodb_db` | The name of the database to connect to. |
83+
| `mongodb_host` | The hostname of the MongoDB server. Defaults to `localhost`. |
84+
| `mongodb_port` | The port of the MongoDB server. Defaults to `27017`. |
85+
| `mongodb_db` | The name of the database to connect to. Defaults to `default`. |
8686
| `mongodb_sslmode` | Optional. Specifies the SSL/TLS behavior for the connection. Supported values:<br /> <ul><li>`required`: (default) This mode requires an SSL connection. If a secure connection cannot be established, the connector will not connect.</li><li>`preferred`: This mode will try to establish a secure SSL connection if possible, but will connect insecurely if the server does not support SSL.</li><li>`disabled`: This mode will not attempt to use an SSL connection, even if the server supports it.</li></ul> |
8787
| `mongodb_sslrootcert` | Optional parameter specifying the path to a custom PEM certificate that the connector will trust. |
8888
| `mongodb_time_zone` | Optional. Specifies connection time zone. Default is `UTC`. Accepts: <br /><ul><li>Fixed offsets (e.g., `+02:00`).</li><li>IANA time zone names (e.g., `America/Los_Angeles`)</li></ul> |

website/docs/features/data-acceleration/snapshots.md

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -51,7 +51,7 @@ Every accelerated dataset must write to its own file (for example, `/nvme/my_dat
5151

5252
## Configure snapshot storage
5353

54-
Snapshots are controlled with a top-level `snapshots` block in the Spicepod. The location must point to a folder on S3 or the local filesystem. When the location is an S3 bucket, the configuration accepts any S3 dataset parameters under `params`.
54+
Snapshots are controlled with a top-level `snapshots` block in the Spicepod. The location can point to S3, Azure ADLS Gen2, Google Cloud Storage, or the local filesystem.
5555

5656
```yaml
5757
snapshots:
@@ -62,6 +62,17 @@ snapshots:
6262
s3_auth: iam_role # Defaults to iam_role for snapshots
6363
```
6464

65+
### Supported storage backends
66+
67+
| Backend | URL scheme | Environment variables |
68+
| --- | --- | --- |
69+
| Amazon S3 | `s3://` | Standard AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, etc.) |
70+
| Azure ADLS Gen2 | `abfss://`, `abfs://` | `AZURE_STORAGE_ACCOUNT_NAME`, `AZURE_STORAGE_ACCOUNT_KEY`, `AZURE_CLIENT_ID`/`AZURE_TENANT_ID`/`AZURE_FEDERATED_TOKEN_FILE` |
71+
| Google Cloud Storage | `gs://` | `GOOGLE_APPLICATION_CREDENTIALS`, Workload Identity |
72+
| Local filesystem | Absolute or relative path | N/A |
73+
74+
When the location is an S3 bucket, the configuration accepts any [S3 dataset parameters](../../components/data-connectors/s3) under `params`. Azure and GCS locations also accept their respective connector parameters under `params` for explicit credential overrides. When no explicit credentials are supplied, Spice reads standard environment variables for each cloud provider.
75+
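The backend table above can be read as a lookup from URL scheme to storage backend. The following is a minimal Python sketch of that lookup, not Spice's actual resolution logic; the `snapshot_backend` helper and the `SCHEME_TO_BACKEND` mapping are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping that mirrors the "Supported storage backends" table.
SCHEME_TO_BACKEND = {
    "s3": "Amazon S3",
    "abfss": "Azure ADLS Gen2",
    "abfs": "Azure ADLS Gen2",
    "gs": "Google Cloud Storage",
}


def snapshot_backend(location: str) -> str:
    """Classify a snapshot location by its URL scheme (illustrative only)."""
    scheme = urlparse(location).scheme
    if scheme == "":
        # No scheme: treat as an absolute or relative filesystem path.
        return "Local filesystem"
    if scheme not in SCHEME_TO_BACKEND:
        # Note: Windows drive letters such as "C:" parse as a scheme here,
        # so this sketch only handles POSIX-style local paths.
        raise ValueError(f"unsupported snapshot location scheme: {scheme!r}")
    return SCHEME_TO_BACKEND[scheme]


print(snapshot_backend("s3://bucket/prefix/"))
print(snapshot_backend("abfss://container@account.dfs.core.windows.net/path/"))
print(snapshot_backend("/nvme/snapshots"))
```

A lookup like this only identifies the backend; credential handling (explicit `params` versus environment variables) is a separate step that the sketch does not model.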
6576
### Failure behavior
6677

6778
`bootstrap_on_failure_behavior` controls what Spice does when it cannot load the most recent snapshot.

website/docs/features/observability/index.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -239,6 +239,10 @@ runtime:
239239

240240
When `metrics` is empty or omitted, all available metrics are exported.
241241

242+
:::caution Filtering happens after `metric_prefix` is applied
243+
The whitelist is matched against the **final** metric name, after `runtime.telemetry.metric_prefix` has been prepended. If you set `metric_prefix: 'spiceai.'`, the entries under `metrics:` must include the prefix (e.g. `spiceai.query_duration_ms`); otherwise, nothing matches and no metrics are exported.
244+
:::
245+
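The ordering described in the caution (prefix first, filter second) can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the runtime's implementation; the `export_names` function is hypothetical, and exact string matching is an assumption.

```python
def export_names(runtime_metrics, metric_prefix="", allowlist=None):
    """Return the metric names that would be exported (illustrative only).

    The prefix is prepended first; the optional allowlist is then matched
    against the final, prefixed name. Exact matching is an assumption.
    """
    final_names = [f"{metric_prefix}{name}" for name in runtime_metrics]
    if not allowlist:
        # Empty or omitted filter: every metric is exported.
        return final_names
    allowed = set(allowlist)
    return [name for name in final_names if name in allowed]


runtime_metrics = ["query_duration_ms", "dataset_load_state"]

# An unprefixed entry no longer matches once the prefix is applied.
print(export_names(runtime_metrics, "spiceai.", ["query_duration_ms"]))
# → []

# A prefixed entry matches the final metric name.
print(export_names(runtime_metrics, "spiceai.", ["spiceai.query_duration_ms"]))
# → ['spiceai.query_duration_ms']
```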
242246
For full configuration details, see the [runtime.telemetry reference](../reference/spicepod/runtime#runtimetelemetry).
243247

244248
## Available Metrics

website/docs/features/search/vector-search.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -157,6 +157,8 @@ FROM vector_search('sales', 'cutting edge AI', 1500)
157157
ORDER BY score DESC;
158158
```
159159

160+
`WHERE` predicates on base table columns are pushed down as pre-filters, so only matching rows are scored and ranked. See [Search in SQL](../../reference/sql/search#vector-search-vector_search) for details.
161+
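To make the pre-filter semantics concrete, here is a small in-memory Python sketch: rows are filtered by the predicate first, and only the survivors are scored and ranked. It is a conceptual illustration rather than how Spice executes `vector_search`; the `prefiltered_search` helper, the `embedding` and `region` fields, and the use of cosine similarity are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefiltered_search(rows, query_embedding, predicate, limit):
    """Apply the predicate first, then score and rank only matching rows."""
    candidates = [row for row in rows if predicate(row)]  # pre-filter
    scored = [(cosine(row["embedding"], query_embedding), row) for row in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # ORDER BY score DESC
    return scored[:limit]


rows = [
    {"id": 1, "region": "us", "embedding": [1.0, 0.0]},
    {"id": 2, "region": "eu", "embedding": [1.0, 0.0]},
    {"id": 3, "region": "us", "embedding": [0.0, 1.0]},
]

results = prefiltered_search(rows, [1.0, 0.0], lambda r: r["region"] == "us", limit=10)
print([row["id"] for _, row in results])
```

Row 2 is never scored even though its embedding matches the query exactly, because it fails the predicate before ranking begins.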
160162
:::warning[Limitations]
161163

162164
- `vector_search` UDTF does not yet support chunked embedding columns. Chunking support is on the roadmap.

website/docs/monitoring/datadog/index.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,10 @@ runtime:
7575

7676
The runtime metric `query_duration_ms` is then exported as `spiceai.query_duration_ms`.
7777

78+
:::caution Combining `metric_prefix` with metric filtering
79+
If you also set [`runtime.telemetry.otel_exporter.metrics`](/docs/next/reference/spicepod/runtime#runtimetelemetryotel_exporter) to whitelist specific metrics, the entries must include the prefix. The filter runs after the prefix is applied, so, for example, `query_duration_ms` will not match when `metric_prefix: 'spiceai.'` is set. Use `spiceai.query_duration_ms` instead.
80+
:::
81+
7882
### Add Custom Tags via Resource Attributes
7983

8084
Attach custom key/value pairs to every metric using [`runtime.telemetry.properties`](/docs/next/reference/spicepod/runtime#runtimetelemetryproperties). Spice sends these as OpenTelemetry resource attributes:

website/docs/reference/spicepod/index.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -135,7 +135,7 @@ Enable or disable snapshot management globally. Defaults to `true`.
135135

136136
### `snapshots.location`
137137

138-
The folder where snapshots are stored. Supports S3 bucket URIs (`s3://bucket/prefix/`) and absolute or relative filesystem paths. The path must resolve to a single folder; Spice creates per-dataset folders underneath using Hive-style partitions (`month=YYYY-MM/day=YYYY-MM-DD/dataset=<name>`).
138+
The folder where snapshots are stored. Supports S3 bucket URIs (`s3://bucket/prefix/`), Azure ADLS Gen2 URIs (`abfss://container@account.dfs.core.windows.net/path/`), Google Cloud Storage URIs (`gs://bucket/prefix/`), and absolute or relative filesystem paths. The path must resolve to a single folder; Spice creates per-dataset folders underneath using Hive-style partitions (`month=YYYY-MM/day=YYYY-MM-DD/dataset=<name>`).
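As a rough illustration of the Hive-style layout described above, the following Python sketch builds the per-dataset folder for a given day. The `snapshot_partition` helper is hypothetical, and anything Spice writes beneath the `dataset=<name>` folder is outside the scope of this sketch.

```python
from datetime import date


def snapshot_partition(location: str, dataset: str, day: date) -> str:
    """Build month=YYYY-MM/day=YYYY-MM-DD/dataset=<name> under a location."""
    base = location.rstrip("/")
    return f"{base}/month={day:%Y-%m}/day={day:%Y-%m-%d}/dataset={dataset}"


print(snapshot_partition("s3://bucket/prefix/", "my_dataset", date(2025, 1, 15)))
# → s3://bucket/prefix/month=2025-01/day=2025-01-15/dataset=my_dataset
```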
139139

140140
### `snapshots.bootstrap_on_failure_behavior`
141141

@@ -147,7 +147,7 @@ Controls what happens when Spice cannot load the most recent snapshot on startup
147147

148148
### `snapshots.params`
149149

150-
Optional key-value map passed to the snapshot storage layer. When `location` points to S3, the configuration accepts any of the [S3 dataset parameters](../components/data-connectors/s3). Snapshots default to `s3_auth: iam_role`, which differs from the S3 dataset default of `public`.
150+
Optional key-value map passed to the snapshot storage layer. When `location` points to S3, the configuration accepts any of the [S3 dataset parameters](../components/data-connectors/s3). Snapshots default to `s3_auth: iam_role`, which differs from the S3 dataset default of `public`. Azure ADLS and GCS locations also accept their respective connector parameters for explicit credential overrides; when no overrides are supplied, Spice reads standard environment variables for each cloud provider.
151151

152152
## `models`
153153

website/docs/reference/spicepod/runtime.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -435,6 +435,10 @@ runtime:
435435
- dataset_load_state
436436
```
437437

438+
:::caution Filtering happens after `metric_prefix` is applied
439+
The whitelist is matched against the **final** metric name, after [`runtime.telemetry.metric_prefix`](#runtimetelemetrymetric_prefix) has been prepended. If you set `metric_prefix: 'spiceai.'`, the entries under `metrics:` must include the prefix (e.g. `spiceai.query_duration_ms`); otherwise, nothing matches and no metrics are exported.
440+
:::
441+
438442
**Authenticated exporters:**
439443

440444
For collectors that require authentication, set the `headers` map. Load credentials from a [secret store](../../components/secret-stores) via `${secrets:...}` rather than committing them to source.

website/docs/reference/sql/json.md

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -477,6 +477,124 @@ SELECT
477477
FROM products;
478478
```
479479

480+
## JSON Table Functions (UDTFs)
481+
482+
Spice includes table-valued functions for decomposing JSON structures into relational rows. Each function is available as both a UDTF (in the `FROM` clause with literal input) and a scalar UDF returning a list of structs (for per-row use with `UNNEST`).
483+
484+
### `flatten_json`
485+
486+
Walks an arbitrary JSON value and emits one row per reachable leaf.
487+
488+
```sql
489+
flatten_json(input Utf8 [, options...]) -> TABLE(
490+
path Utf8,
491+
parent_path Utf8,
492+
key Utf8,
493+
value Utf8,
494+
type Utf8 -- "object"|"array"|"string"|"number"|"integer"|"boolean"|"null"
495+
)
496+
```
497+
498+
**Options (named arguments):**
499+
500+
| Option | Type | Default | Description |
501+
| --- | --- | --- | --- |
502+
| `max_depth` | UInt | `64` | Maximum recursion depth. |
503+
| `max_rows` | UInt | `1000000` | Per-document row cap. |
504+
| `max_bytes` | UInt | `8388608` | Input size limit (bytes). |
505+
| `path_style` | Utf8 | `"dot"` | `"dot"` or `"json-pointer"`. |
506+
| `include_internal` | Bool | `false` | Also emit interior object/array rows. |
507+
| `array_wildcard` | Bool | `false` | Collapse array indices to `[*]` instead of `[0]`, `[1]`, etc. |
508+
509+
**UDTF example:**
510+
511+
```sql
512+
SELECT path, value, type
513+
FROM flatten_json('{"user": {"name": "Alice", "scores": [95, 87]}}');
514+
```
515+
516+
| path | value | type |
517+
| --- | --- | --- |
518+
| `user.name` | `Alice` | `string` |
519+
| `user.scores[0]` | `95` | `integer` |
520+
| `user.scores[1]` | `87` | `integer` |
521+
522+
**Scalar UDF example (per-row with `UNNEST`):**
523+
524+
```sql
525+
SELECT rows.path, rows.value, rows.type
526+
FROM (SELECT UNNEST(flatten_json(body)) AS rows FROM documents);
527+
```
528+
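For readers who want to see the leaf-walking idea outside SQL, here is a minimal Python sketch that reproduces the `path`, `value`, and `type` columns of the UDTF example using dot-style paths. It is not the Spice implementation: the `flatten` and `json_type` helpers are hypothetical, limits such as `max_depth` are not modeled, and the string rendering of non-string values is an assumption.

```python
import json


def json_type(value):
    """Map a decoded JSON value to a type name like those listed above."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def flatten(value, path=""):
    """Yield (path, value, type) for each leaf, using dot-style paths."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Assumption: non-string leaves are rendered as their JSON text.
        text = value if isinstance(value, str) else json.dumps(value)
        yield path, text, json_type(value)


doc = json.loads('{"user": {"name": "Alice", "scores": [95, 87]}}')
rows = list(flatten(doc))
for row in rows:
    print(row)
```

The three rows printed match the UDTF example table: `user.name`, `user.scores[0]`, and `user.scores[1]`.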
529+
### `flatten_json_properties`
530+
531+
Decomposes a JSON Schema document into one row per field, extracting metadata such as types, descriptions, required status, enums, and format.
532+
533+
```sql
534+
flatten_json_properties(input Utf8 [, options...]) -> TABLE(
535+
path Utf8,
536+
parent_path Utf8,
537+
name Utf8,
538+
description Utf8,
539+
type Utf8,
540+
required Boolean,
541+
format Utf8,
542+
enum_values List<Utf8>,
543+
metadata Utf8
544+
)
545+
```
546+
547+
Handles `properties` recursion, `items.properties` (arrays of objects), `additionalProperties` maps, `allOf`/`oneOf`/`anyOf` merging, and local `$ref` pointers with cycle detection.
548+
549+
**Options (named arguments):**
550+
551+
| Option | Type | Default | Description |
552+
| --- | --- | --- | --- |
553+
| `max_depth` | UInt | `32` | Maximum recursion depth. |
554+
| `max_rows` | UInt | `100000` | Per-document row cap. |
555+
| `max_bytes` | UInt | `8388608` | Input size limit (bytes). |
556+
| `path_style` | Utf8 | `"dot"` | `"dot"` or `"json-pointer"`. |
557+
| `dialect` | Utf8 | `"json-schema"` | `"json-schema"` or `"openapi"` (used for metrics tagging). |
558+
| `include_internal` | Bool | `false` | Also emit container rows (objects, arrays). |
559+
| `expand_maps` | Bool | `false` | Walk into `additionalProperties` and emit child paths with a wildcard segment (e.g., `parent.[*].child`). |
560+
| `map_wildcard` | Utf8 | `"[*]"` | Wildcard segment for map values when `expand_maps` is `true`. |
561+
562+
**Example:**
563+
564+
```sql
565+
SELECT path, type, required, description
566+
FROM flatten_json_properties('{
567+
"type": "object",
568+
"properties": {
569+
"name": {"type": "string", "description": "User name"},
570+
"age": {"type": "integer"}
571+
},
572+
"required": ["name"]
573+
}');
574+
```
575+
576+
| path | type | required | description |
577+
| --- | --- | --- | --- |
578+
| `name` | `string` | `true` | `User name` |
579+
| `age` | `integer` | `false` | |
580+
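The `properties` recursion and the `required` lookup in the example above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the function's actual implementation: the `flatten_properties` helper is hypothetical, container rows are skipped (as with `include_internal` left at `false`), and `$ref`, `allOf`/`oneOf`/`anyOf`, `items`, and `additionalProperties` are not handled.

```python
import json


def flatten_properties(schema, parent=""):
    """Yield one row per leaf field of a JSON Schema object (simplified)."""
    required = set(schema.get("required", []))
    for name, prop in schema.get("properties", {}).items():
        path = f"{parent}.{name}" if parent else name
        if "properties" in prop:
            # Nested object: recurse instead of emitting a container row.
            yield from flatten_properties(prop, path)
        else:
            yield {
                "path": path,
                "type": prop.get("type"),
                "required": name in required,
                "description": prop.get("description"),
            }


schema = json.loads("""
{
  "type": "object",
  "properties": {
    "name": {"type": "string", "description": "User name"},
    "age": {"type": "integer"}
  },
  "required": ["name"]
}
""")

fields = list(flatten_properties(schema))
for field in fields:
    print(field)
```

Note that `required` is resolved against the parent object's `required` array, which is why `age` comes out as not required.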
581+
**Expanding maps:**
582+
583+
When a JSON Schema uses `additionalProperties` to describe map values, enable `expand_maps` to produce JSONPath-style paths:
584+
585+
```sql
586+
SELECT path, type
587+
FROM flatten_json_properties(
588+
'{"type": "object", "additionalProperties": {"type": "object", "properties": {"id": {"type": "string"}, "primary": {"type": "boolean"}}}}',
589+
expand_maps => true
590+
);
591+
```
592+
593+
| path | type |
594+
| --- | --- |
595+
| `[*].id` | `string` |
596+
| `[*].primary` | `boolean` |
597+
480598
## Further Reading
481599

482600
- [datafusion-functions-json](https://github.com/datafusion-contrib/datafusion-functions-json) - The underlying JSON manipulation library
