Commit 72f901d

ewgenius, lukekim, mach-kernel, phillipleblanc, and krinart authored
Docs update for 1.10.0 (#1258)
* Caching acceleration docs (#1250)
* agg pushdown parameter documentation (#1256)
* Add additional hashing algorithms, including updated default (#1259)
* Add hashing algorithms to runtime reference
* DynamoDB Streams (#1252)
* DynamoDB Streams
* Update
* More updates
* Update
* Update
* Update
* Update
* Update
* Update engine
* Update
* Update cookbooks
* Add OTel exporter docs (#1260)
* Add OTel exporter docs
* Update with specific metrics
* Caching acceleration docs update (#1261)
* Add OTel exporter docs
* Update with specific metrics
* Update caching docs
* Rename `aggregate_pushdown_optimization` -> `duckdb_aggregate_pushdown_optimization` (#1263)
* Rename param to `duckdb_aggregate_pushdown`
* fix name prefix
* fix name
* fix typo
* Remove OTEL exporter http mode (#1264)

---------

Co-authored-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Co-authored-by: David Stancu <david@spice.ai>
Co-authored-by: Phillip LeBlanc <phillip@spice.ai>
Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Co-authored-by: Viktor Yershov <viktor@spice.ai>
1 parent 62ff8ed commit 72f901d

8 files changed

Lines changed: 1177 additions & 145 deletions

website/docs/components/data-accelerators/duckdb.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -42,6 +42,7 @@ DuckDB acceleration supports the following optional parameters under `accelerati
4242
- `on_refresh_recompute_statistics` (string, default: `enabled`): Triggers automatic `ANALYZE` execution after data refreshes. This keeps DuckDB optimizer statistics up-to-date for efficient query plans and performance. Set to `disabled` to turn automatic statistics recomputation off. See [DuckDB ANALYZE statement documentation](https://duckdb.org/docs/stable/sql/statements/analyze).
4343
- `partition_mode` (string, default: `files`): Controls how partitioned data is stored. Can only be used with `partition_by`. Set to `tables` to store partitions as separate tables within a single DuckDB database, which improves resource usage by sharing a single connection pool across all partitions. The default `files` mode creates a separate database file per partition, each with its own connection pool, and generally gives faster query performance.
4444
- `duckdb_partitioned_write_flush_threshold` (integer, default: `122880`): The number of rows buffered per partition before flushing data to acceleration storage. Only applicable when using `partition_mode: tables`. Using a larger value can improve write performance but requires more memory.
45+
- `optimizer_duckdb_aggregate_pushdown` (string, default: `disabled`): Enables aggregate pushdown optimization to execute supported aggregate queries directly in DuckDB. Set to `enabled` to push down aggregations for improved query performance on supported functions like `count`, `sum`, `avg`, `min`, and `max`. Requires `query_federation` to be `disabled`.
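
As a rough illustration, enabling aggregate pushdown might look like the sketch below. The dataset source and name are hypothetical, and placing `query_federation` next to the new parameter under `acceleration.params` is an assumption; confirm the exact location in the datasets configuration reference.

```yaml
datasets:
  - from: postgres:public.orders # hypothetical source
    name: orders # hypothetical dataset name
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params:
        # Assumed location for query_federation; verify against the reference.
        query_federation: disabled
        # Push supported aggregates (count, sum, avg, min, max) down to DuckDB.
        optimizer_duckdb_aggregate_pushdown: enabled
```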
4546

4647
Refer to the [datasets configuration reference](/docs/reference/spicepod/datasets.md#acceleration) for additional supported fields.
4748

website/docs/components/data-connectors/dynamodb.md

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ tags:
66
- data-connectors
77
- dynamodb
88
- nosql
9+
- component-metrics
910
---
1011

1112
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
@@ -428,3 +429,107 @@ describe users;
428429
...
429430
+----------------+------------------+-------------+
430431
```
432+
433+
## Streams
434+
435+
The DynamoDB Data Connector integrates with [DynamoDB Streams](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) to enable real-time streaming of table changes. This feature supports both initial table bootstrapping and continuous change data capture (CDC), allowing Spice to automatically detect and stream inserts, updates, and deletes from DynamoDB tables.
436+
437+
:::warning
438+
439+
Using DynamoDB Streams **requires** [acceleration](/docs/components/data-accelerators/index.md) with `refresh_mode: changes` and a defined `on_conflict` configuration.
440+
441+
:::
442+
443+
### Basic Configuration
444+
445+
To enable streaming from DynamoDB, enable acceleration and set the `refresh_mode` to `changes` in your dataset configuration.
446+
447+
You also need to configure the `on_conflict` parameter to specify how the connector should handle updates to existing records. The keys defined in `on_conflict` must match your DynamoDB table's partition key and range key (if your table has one).
448+
```yaml
datasets:
  - from: dynamodb:my_table
    name: orders_stream
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert
```
460+
461+
### Configuration Parameters
462+
463+
#### Dataset Parameters
464+
465+
- **`ready_lag`** - Defines the maximum lag threshold before the dataset is reported as "Ready". Once the stream lag falls below this value, queries can be executed against the dataset. By default, the dataset is reported as ready immediately after bootstrap completes.
466+
467+
- **`scan_interval`** - Controls the polling frequency for checking new records in the DynamoDB stream. Lower values provide more real-time updates but increase API calls. Higher values reduce API usage but may introduce additional latency.
468+
469+
#### Acceleration Parameters
470+
471+
- **`on_conflict`** - Specifies the conflict resolution strategy when streaming changes that match existing records. The keys in the tuple should correspond to your DynamoDB table's partition key and range key (if applicable). The `upsert` action will insert new records or update existing ones based on these key columns.
472+
473+
**Examples:**
474+
- Single partition key: `id: upsert`
475+
- Partition key + range key: `(partition_key, sort_key): upsert`
476+
477+
- **`snapshots_trigger_threshold`** - Determines how frequently snapshots are created during streaming. A value of `5` means a snapshot is created every 5 batch updates. Snapshots enable faster recovery and better query performance but consume additional storage.
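
For a table keyed only by a partition key, the single-key form from the examples above could be used as in this sketch. The table name `my_table`, the dataset name `users_stream`, and the key column `id` are placeholders rather than values from a real deployment.

```yaml
datasets:
  - from: dynamodb:my_table # placeholder table
    name: users_stream # placeholder dataset name
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        id: upsert # single partition key, no range key
```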
478+
479+
### Metrics
480+
481+
The following [Component Metrics](../../features/observability/component_metrics.md) are provided for monitoring streaming performance and health:
482+
483+
| Metric                   | Type    | Description                                                                 |
|--------------------------|---------|-----------------------------------------------------------------------------|
| `shards_active`          | Gauge   | Current number of active shards in the stream                               |
| `records_consumed_total` | Counter | Total number of records consumed from the stream                            |
| `lag_ms`                 | Gauge   | Current lag in milliseconds between stream watermark and the current time   |
| `errors_transient_total` | Counter | Total number of transient errors encountered while polling from the stream  |
489+
490+
These metrics are not enabled by default. Enable them by setting the `metrics` parameter:
491+
```yaml
datasets:
  - from: dynamodb:my_table
    name: orders_stream
    metrics:
      - name: shards_active
      - name: lag_ms
```
499+
500+
You can find an example dashboard for DynamoDB Streams in [monitoring/grafana-dashboard.json](https://github.com/spiceai/spiceai/blob/trunk/monitoring/grafana-dashboard.json).
501+
502+
### Advanced Configuration
503+
504+
For production workloads requiring fine-tuned control over streaming behavior and performance characteristics:
505+
```yaml
datasets:
  - from: dynamodb:my_table
    name: orders_stream
    params:
      ready_lag: 1s # Dataset reports as Ready when lag is below 1 second
      scan_interval: 100ms # Poll for new stream records every 100 milliseconds
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert
      params:
        snapshots_trigger_threshold: 5 # Create snapshot every 5 batch updates
    metrics:
      - name: shards_active
        enabled: true
      - name: records_consumed_total
        enabled: true
      - name: lag_ms
        enabled: true
      - name: errors_transient_total
        enabled: true
```
531+
532+
## Cookbooks
533+
534+
- A cookbook recipe to configure DynamoDB as a data connector in Spice. [DynamoDB Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb#readme)
535+
- A cookbook recipe to configure DynamoDB Streams as a data connector in Spice. [DynamoDB Streams Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb/streams#readme)

website/docs/features/caching/index.md

Lines changed: 26 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -45,22 +45,22 @@ runtime:
4545

4646
Every cache type (`sql_results`, `search_results`, `embeddings`) supports the following parameters:
4747

48-
| Parameter name | Optional | Default | Description |
49-
| ---------------------------- | -------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
50-
| `enabled` | Yes | `true` | Defaults to `true`. |
51-
| `max_size` | Yes | `128MiB` | Maximum cache size. Defaults to `128MiB`. |
52-
| `eviction_policy` | Yes | `lru` | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value. |
53-
| `item_ttl` | Yes | `1s` | Cache entry expiration duration (Time to Live). Defaults to 1 second. |
54-
| `hashing_algorithm` | Yes | `siphash` | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `siphash`. Supports `siphash` or `ahash`. |
48+
| Parameter name | Optional | Default | Description |
49+
| ------------------- | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
50+
| `enabled` | Yes | `true` | Defaults to `true`. |
51+
| `max_size` | Yes | `128MiB` | Maximum cache size. Defaults to `128MiB`. |
52+
| `eviction_policy` | Yes | `lru` | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value. |
53+
| `item_ttl` | Yes | `1s` | Cache entry expiration duration (Time to Live). Defaults to 1 second. |
54+
| `hashing_algorithm` | Yes | `xxh3` | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `xxh3`. Supports `xxh3`, `ahash`, `siphash`, `blake3`, `xxh32`, `xxh64`, or `xxh128`. |
5555

5656
## `caching.sql_results` Parameters
5757

5858
In addition to the common caching parameters, `sql_results` also supports additional parameters:
5959

60-
| Parameter name | Optional | Default | Description |
61-
| ---------------- | -------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
62-
| `cache_key_type` | Yes | `plan` | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string. |
63-
| `encoding` | Yes | `none` | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`. |
60+
| Parameter name | Optional | Default | Description |
61+
| ---------------------------- | -------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
62+
| `cache_key_type` | Yes | `plan` | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string. |
63+
| `encoding` | Yes | `none` | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`. |
6464
| `stale_while_revalidate_ttl` | Yes | `0s` | Duration to serve stale cache entries while revalidating in the background. When set to a non-zero value, expired cache entries continue to be served while a background refresh occurs. Defaults to `0s` (disabled). |
6565

6666
### Choosing a `cache_key_type`
@@ -74,10 +74,13 @@ Use `sql` for the lowest latency with identical queries that do not include dyna
7474

7575
The hashing algorithm determines how cache keys are hashed before being stored, impacting both lookup speed and protection against potential DoS attacks.
7676

77-
- **`siphash` (Default):** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
77+
- **`xxh3` (Default):** Uses the [XXH3](https://cyan4973.github.io/xxHash/) algorithm for hashing the cache keys. XXH3 is a fast, non-cryptographic hash algorithm that provides high performance and good distribution. It is suitable for scenarios where speed is critical and cryptographic security is not required.
78+
- **`siphash`:** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
7879
- **`ahash`:** Uses the [AHash](https://github.com/tkaitchuck/ahash) algorithm for hashing the cache keys. The AHash algorithm is a [high quality](https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md#Quality) hashing algorithm, and has claimed resistance against hashing DoS attacks. AHash has higher performance than SipHash1-3, especially when used with `cache_key_type: plan`.
80+
- **`blake3`:** Uses the [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) cryptographic hash function. BLAKE3 is a fast, parallelizable hash function that provides cryptographic security while maintaining high performance. It is suitable for scenarios requiring both speed and cryptographic guarantees.
81+
- **`xxh32`, `xxh64`, `xxh128`:** Variants of the XXH hashing algorithm with different output sizes. These algorithms offer a balance between speed and collision resistance, with larger hash sizes providing better collision resistance at the cost of performance.
7982

80-
Consider using `ahash` if maximum performance is most important, or where hashing DoS attacks are unlikely or a low risk. More information on the security mechanisms of AHash are available [in the AHash documentation](https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks).
83+
Use `xxh3` (the default) for its superior speed in most scenarios. Use `ahash`, `xxh64` or `xxh128` for reduced collision probability when caching a large number of queries. Use `blake3` when cryptographic security is required. Use `siphash` when protection against hash flooding attacks is a priority.
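
As a minimal sketch of overriding the default, the fragment below selects `blake3` for the SQL results cache. The other values simply restate the defaults from the table above; only `hashing_algorithm` differs.

```yaml
runtime:
  caching:
    sql_results:
      enabled: true
      max_size: 128MiB
      item_ttl: 1s
      hashing_algorithm: blake3 # the default is xxh3
```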
8184

8285
### Choosing an `encoding`
8386

@@ -192,6 +195,16 @@ With this configuration:
192195

193196
This approach is particularly useful for queries that take significant time to execute, providing a better user experience by reducing perceived latency while keeping data reasonably fresh.
194197

198+
:::warning[Conflict with Caching Accelerator SWR]
199+
When using a dataset with `refresh_mode: caching`, you cannot configure both the results cache's `stale_while_revalidate_ttl` and the caching accelerator's `caching_stale_while_revalidate_ttl` for the same dataset. These parameters control similar behavior at different layers.
200+
201+
Choose one approach:
202+
203+
- **Results cache SWR**: Configure `runtime.caching.sql_results.stale_while_revalidate_ttl` for SQL query results caching
204+
- **Caching accelerator SWR**: Configure `acceleration.params.caching_stale_while_revalidate_ttl` for [HTTP-based dataset caching](/docs/features/data-acceleration/refresh-modes/caching)
205+
206+
:::
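
To illustrate the second option, the sketch below sets SWR on the caching accelerator and leaves the results cache's `stale_while_revalidate_ttl` at its default of `0s`. The dataset source, the dataset name, and the `30s` value are hypothetical.

```yaml
runtime:
  caching:
    sql_results:
      enabled: true
      # stale_while_revalidate_ttl is intentionally not set (default 0s, disabled)

datasets:
  - from: https://api.example.com/search # hypothetical HTTP source
    name: search_results # hypothetical dataset name
    acceleration:
      enabled: true
      refresh_mode: caching
      params:
        caching_stale_while_revalidate_ttl: 30s # illustrative value
```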
207+
195208
### HTTP/Flight API
196209

197210
The following endpoints support the standard HTTP [`Cache-Control` header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control):

website/docs/features/data-acceleration/data-refresh.md

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,13 +21,18 @@ Acceleration data can be refreshed (updated) by:
2121

2222
## Refresh Modes
2323

24-
Spice supports three modes to refresh/update local data from a connected data source. `full` is the default mode.
24+
Spice supports four modes to refresh/update local data from a connected data source. `full` is the default mode.
2525

2626
| Mode | Description | Example |
2727
| --------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
2828
| `full` | Replace/overwrite the entire dataset on each refresh | A table of users |
2929
| `append` | Append/add data to the dataset on each refresh | Append-only, immutable datasets, such as time-series or log data |
3030
| `changes` | Apply incremental changes | Customer order lifecycle table |
31+
| `caching` | Read-through caching for SQL queries | API search results or dynamic content endpoints |
32+
33+
Learn more about each mode:
34+
35+
- [Caching Mode](/docs/features/data-acceleration/refresh-modes/caching.md)
3136

3237
Example:
3338

@@ -125,6 +130,12 @@ Appending modified files is only supported for datasets that support setting the
125130

126131
Datasets configured with acceleration `refresh_mode: changes` require a data connector that supports [Change Data Capture (CDC)](/docs/features/cdc/index.md). Initial CDC support in Spice is provided by the [Debezium data connector](/docs/components/data-connectors/debezium.md).
127132

133+
### Caching
134+
135+
The `caching` refresh mode is designed for HTTP-based datasets where request metadata acts as cache keys. This mode is particularly useful for API responses that return multiple rows for a single request, such as search results or dynamic content endpoints.
136+
137+
See [Caching Mode](/docs/features/data-acceleration/refresh-modes/caching.md) for detailed documentation and examples.
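
A minimal sketch of a dataset using this mode might look like the following. The endpoint URL and dataset name are hypothetical, and a working configuration may need further connector or acceleration settings described in the Caching Mode documentation.

```yaml
datasets:
  - from: https://api.example.com/search # hypothetical HTTP endpoint
    name: search_api # hypothetical dataset name
    acceleration:
      enabled: true
      refresh_mode: caching
```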
138+
128139
## Ready State
129140

130141
| | |
@@ -425,7 +436,7 @@ datasets:
425436
acceleration:
426437
enabled: true
427438
refresh_mode: full
428-
refresh_cron: "0 12 * * 1-5"
439+
refresh_cron: '0 12 * * 1-5'
429440
```
430441

431442
This configuration will refresh `taxi_trips` data at midday every weekday. For more information about cron schedules, see the [cron schedule reference](/docs/reference/cron.md).
